This repository has been archived by the owner on Sep 25, 2023. It is now read-only.

[REVIEW] Use CuPy v8 FFT cache plan #254

Merged
merged 4 commits into from Oct 5, 2020


mnicely
Contributor

@mnicely mnicely commented Oct 4, 2020

Closes #253

This PR adds a check for whether the installed CuPy is v7 or v8 and, on v8, uses CuPy's internal FFT plan cache.
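As a rough illustration of the version check described above (a sketch, not the code merged in this PR), the helper below decides from a version string whether to rely on CuPy's built-in plan cache. The function name `use_cupy_plan_cache` is hypothetical; in practice the string would presumably come from `cupy.__version__`.

```python
def use_cupy_plan_cache(cupy_version: str) -> bool:
    """Return True when the given CuPy version is v8 or newer.

    Assumption for this sketch: v8+ means "use CuPy's internal FFT plan
    cache", while v7 means "fall back to a library-side cache".
    """
    # Only the major component matters here, so suffixes such as "rc1"
    # in the patch component do not affect the result.
    major = int(cupy_version.split(".")[0])
    return major >= 8


if __name__ == "__main__":
    for version in ("7.8.0", "8.0.0"):
        print(version, "->", use_cupy_plan_cache(version))
```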

Without cache + CuPy v7.8
-------------------------------------------------------------------------------- benchmark 'FFTConvolve': 3 tests --------------------------------------------------------------------------------
Name (time in ms)                        Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fftconvolve_gpu[same-32768]      0.9265 (1.0)      2.3050 (1.37)     1.0593 (1.0)      0.1013 (1.16)     1.0293 (1.0)      0.0479 (1.0)       103;130  943.9904 (1.0)         967           1
test_fftconvolve_gpu[full-32768]      0.9506 (1.03)     1.9485 (1.16)     1.0928 (1.03)     0.0870 (1.0)      1.0477 (1.02)     0.0912 (1.90)       148;39  915.0630 (0.97)       1039           1
test_fftconvolve_gpu[valid-32768]     0.9488 (1.02)     1.6771 (1.0)      1.1023 (1.04)     0.0895 (1.03)     1.0592 (1.03)     0.1142 (2.38)       182;25  907.2237 (0.96)        963           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With cache + CuPy v7.8
---------------------------------------------------------------------------------------- benchmark 'FFTConvolve': 3 tests ----------------------------------------------------------------------------------------
Name (time in us)                          Min                   Max                Mean             StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fftconvolve_gpu[same-32768]      664.1870 (1.01)     1,199.3530 (1.0)      691.5575 (1.0)      31.8148 (1.0)      688.4840 (1.0)      25.0215 (1.0)         78;65        1.4460 (1.0)        1464           1
test_fftconvolve_gpu[full-32768]      675.1820 (1.02)     1,245.5840 (1.04)     707.1642 (1.02)     33.9103 (1.07)     697.7085 (1.01)     40.2040 (1.61)       113;17        1.4141 (0.98)       1394           1
test_fftconvolve_gpu[valid-32768]     659.9210 (1.0)      1,539.0550 (1.28)     708.1542 (1.02)     52.8239 (1.66)     691.9645 (1.01)     27.5100 (1.10)        55;56        1.4121 (0.98)       1460           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Without cache + CuPy v8.0
----------------------------------------------------------------------------------------- benchmark 'FFTConvolve': 3 tests ----------------------------------------------------------------------------------------
Name (time in us)                          Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fftconvolve_gpu[same-32768]      404.8580 (1.01)     1,034.3980 (1.19)     423.5631 (1.0)      26.2704 (1.0)      419.7810 (1.0)       11.3210 (1.0)       183;262        2.3609 (1.0)        2450           1
test_fftconvolve_gpu[valid-32768]     417.4070 (1.04)       869.7810 (1.0)      436.5062 (1.03)     28.0625 (1.07)     430.5310 (1.03)      13.3815 (1.18)      175;314        2.2909 (0.97)       2308           1
test_fftconvolve_gpu[full-32768]      400.9510 (1.0)      1,124.1380 (1.29)     458.4183 (1.08)     66.2477 (2.52)     423.2620 (1.01)     140.0130 (12.37)       476;1        2.1814 (0.92)       1798           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With cache + CuPy v8.0
----------------------------------------------------------------------------------------- benchmark 'FFTConvolve': 3 tests ----------------------------------------------------------------------------------------
Name (time in us)                          Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fftconvolve_gpu[same-32768]      473.9490 (1.0)      1,200.2100 (1.39)     491.7297 (1.0)      26.6978 (1.24)     481.9450 (1.0)       15.4428 (1.14)      238;245        2.0336 (1.0)        2073           1
test_fftconvolve_gpu[valid-32768]     489.1830 (1.03)       866.0560 (1.0)      507.5161 (1.03)     21.5673 (1.0)      498.9190 (1.04)      13.5545 (1.0)       275;267        1.9704 (0.97)       1968           1
test_fftconvolve_gpu[full-32768]      475.2700 (1.00)     1,209.9800 (1.40)     540.0543 (1.10)     61.9996 (2.87)     511.7405 (1.06)     106.6135 (7.87)        314;3        1.8517 (0.91)       1592           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@mnicely mnicely added the "3 - Ready for Review" label Oct 4, 2020
@mnicely mnicely added this to the 0.16 milestone Oct 4, 2020
@mnicely mnicely requested a review from awthomp October 4, 2020 22:56
@mnicely mnicely requested a review from a team as a code owner October 4, 2020 22:56
@mnicely mnicely self-assigned this Oct 4, 2020
@mnicely mnicely added this to PR-WIP in v0.16 Release via automation Oct 4, 2020
@GPUtester
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@leofang
Member

leofang commented Oct 5, 2020

Hi @mnicely, thanks for the benchmark data. I am a bit confused --- any chance the outcomes for v8 + cache and v8 no cache are swapped? The cached performance seems to be worse if I read it right.

@mnicely
Contributor Author

mnicely commented Oct 5, 2020

Not swapped, just bad wording! "Cache" in those headings means the cuSignal cache I created. When I turn it off and use CuPy's, the FFT is faster!

@leofang
Member

leofang commented Oct 5, 2020

Ahh I see, thanks for clarifying, Matt. The numbers look very good then! I wonder whether all of it can be attributed to CuPy's cache, or whether additional nice changes were made in v8?

@mnicely
Contributor Author

mnicely commented Oct 5, 2020

I believe the speedup between cuSignal's cache + CuPy v7.8 and cuSignal (with no cache) + CuPy v8.0 is due solely to the cache.

And the difference for cuSignal (with no cache) between v7.8 and v8.0 can be attributed to many improvements.
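To put numbers on the comparisons mentioned above, the following sketch computes speedup ratios from the mean times reported for `test_fftconvolve_gpu[same-32768]`. The values are copied from the benchmark tables in this PR; the dictionary keys and the `speedup` helper are only illustrative, and attributing each ratio to a particular cause remains the author's interpretation rather than something the arithmetic proves.

```python
# Mean times in microseconds for test_fftconvolve_gpu[same-32768],
# copied from the four benchmark tables in this PR.
mean_us = {
    "v7.8, no cuSignal cache": 1059.3,    # reported as 1.0593 ms
    "v7.8, cuSignal cache": 691.5575,
    "v8.0, no cuSignal cache": 423.5631,  # per the thread, CuPy's cache is used here
    "v8.0, cuSignal cache": 491.7297,
}


def speedup(baseline: str, candidate: str) -> float:
    """Ratio of mean times; values above 1 mean the candidate is faster."""
    return mean_us[baseline] / mean_us[candidate]


comparisons = [
    ("v7.8, cuSignal cache", "v8.0, no cuSignal cache"),
    ("v7.8, no cuSignal cache", "v8.0, no cuSignal cache"),
    ("v8.0, cuSignal cache", "v8.0, no cuSignal cache"),
]

for baseline, candidate in comparisons:
    print(f"{baseline} -> {candidate}: {speedup(baseline, candidate):.2f}x")
```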

v0.16 Release automation moved this from PR-WIP to PR-Reviewer approved Oct 5, 2020
@awthomp awthomp merged commit 9f6eab0 into rapidsai:branch-0.16 Oct 5, 2020
v0.16 Release automation moved this from PR-Reviewer approved to Done Oct 5, 2020
@leofang
Member

leofang commented Oct 5, 2020

Thanks, @mnicely! I wonder if you could do one additional test for me when you have time: use CuPy v8, but turn off all caches (both cuSignal's and CuPy's). The latter can be turned off this way:

import cupy as cp

cache = cp.fft.config.get_plan_cache()
cache.set_size(0)

Note that the cache object is per thread and per device, so if your tests span multiple threads and/or devices, you need to turn each one off in the proper context. To confirm, call cache.show_info(); you should see a line saying cache enabled? False.
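A minimal sketch of what "the proper context" could look like for multiple devices, assuming CuPy v8 or later: it loops over the visible GPUs and sets the plan cache size to 0 inside each device context. The function name is hypothetical, the loop only affects the thread that calls it (other threads would need to run it themselves), and the guards that make it a no-op without CuPy or a GPU are additions for this example.

```python
try:
    import cupy as cp
except ImportError:
    # CuPy is not installed; the sketch degrades to a no-op.
    cp = None


def disable_cupy_plan_cache_on_all_devices() -> int:
    """Set the FFT plan cache size to 0 on every visible device.

    Applies only to the calling thread, since the cache is per thread
    and per device. Returns the number of devices that were handled.
    """
    if cp is None:
        return 0
    try:
        n_devices = cp.cuda.runtime.getDeviceCount()
    except cp.cuda.runtime.CUDARuntimeError:
        # No usable CUDA device or driver in this environment.
        return 0
    for device_id in range(n_devices):
        # get_plan_cache() returns the cache of the current device,
        # so each call is made inside that device's context.
        with cp.cuda.Device(device_id):
            cp.fft.config.get_plan_cache().set_size(0)
    return n_devices


if __name__ == "__main__":
    print("devices handled:", disable_cupy_plan_cache_on_all_devices())
```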

@leofang leofang mentioned this pull request Oct 5, 2020
8 tasks
@mnicely mnicely deleted the global_fft_cache branch October 5, 2020 18:12
@mnicely
Contributor Author

mnicely commented Oct 5, 2020

Sure, I'll try to have something for you by the end of this week!

Development

Successfully merging this pull request may close these issues.

[BUG] Fix FFT cache w/ CuPy v8
4 participants