This repository has been archived by the owner on Sep 25, 2023. It is now read-only.

[PR-REVIEW] Load fatbin at runtime #148

Merged
merged 46 commits into from Aug 5, 2020

Conversation

mnicely
Contributor

@mnicely mnicely commented Jul 6, 2020

This PR investigates loading kernels via PTX and cubins instead of compiling them at runtime.

The idea is to skip the compilation step from source code to PTX or cubin. This should eliminate the need to precompile desired kernels, making the UI a little friendlier.

There are pros and cons to both PTX and cubin.

PTX

Pro: We can choose a single architecture (default 3.0) and any hardware will JIT based on Compute Capability.
Con: This can leave performance on the table and can be slower than cubins

Cubin

Pro: Optimal performance and (slightly) faster load because it skips the JIT step
Con: Requires a cubin for each supported architecture (i.e. sm30, sm35, sm50, sm52, sm60, sm62, sm70, sm72, sm75, sm80)

Each method requires more space for the compiled files and the .cu file (if we decide to load it).

Currently, we are using the PTX method.
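To make the cubin trade-off concrete: without a fatbin, the loader would need arch-selection logic along these lines (a hypothetical sketch; a fatbin performs this selection inside the driver at load time):

```python
def select_cubin_arch(device_cc, available_archs):
    """Pick the newest compiled architecture the device can run.

    Both arguments use integer encoding, e.g. 75 for sm_75 / CC 7.5.
    Hypothetical helper -- a fatbin does this selection for us at load time.
    """
    candidates = [arch for arch in available_archs if arch <= device_cc]
    if not candidates:
        raise RuntimeError(f"no compatible cubin for compute capability {device_cc}")
    return max(candidates)

# The architectures from the con above
ARCHS = [30, 35, 50, 52, 60, 62, 70, 72, 75, 80]
```

For example, a CC 6.1 device would fall back to the sm_60 cubin, since no sm_61 binary is in the list.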

Anecdotal results: PTX and cubin are ~18x faster on the first pass.

SOURCE CODE
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Range               
-------  --------------  ----------  --------------  --------------  --------------  --------------------
   78.3       259557232         100       2595572.3         2163073         2954536  gpu_lombscargle_4   
   18.9        62547903           1      62547903.0        62547903        62547903  gpu_lombscargle_0 <-- 
    1.0         3339638           1       3339638.0         3339638         3339638  gpu_lombscargle_1   
    0.9         2960195           1       2960195.0         2960195         2960195  gpu_lombscargle_2   
    0.9         2951219           1       2951219.0         2951219         2951219  gpu_lombscargle_3
PTX
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Range               
-------  --------------  ----------  --------------  --------------  --------------  --------------------
   95.2       247349403         100       2473494.0         2121631         2904074  gpu_lombscargle_4   
    1.3         3447313           1       3447313.0         3447313         3447313  gpu_lombscargle_0 <--  
    1.2         3234977           1       3234977.0         3234977         3234977  gpu_lombscargle_1   
    1.1         2904661           1       2904661.0         2904661         2904661  gpu_lombscargle_2   
    1.1         2902191           1       2902191.0         2902191         2902191  gpu_lombscargle_3
CUBIN
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Range               
-------  --------------  ----------  --------------  --------------  --------------  --------------------
   95.2       239095813         100       2390958.1         2065998         2840041  gpu_lombscargle_4   
    1.3         3325468           1       3325468.0         3325468         3325468  gpu_lombscargle_0 <--  
    1.3         3163933           1       3163933.0         3163933         3163933  gpu_lombscargle_1   
    1.1         2828210           1       2828210.0         2828210         2828210  gpu_lombscargle_2   
    1.1         2823180           1       2823180.0         2823180         2823180  gpu_lombscargle_3

@mnicely mnicely added the 2 - In Progress Currently a work in progress label Jul 6, 2020
@mnicely mnicely added this to the 0.15 milestone Jul 6, 2020
@mnicely mnicely requested a review from a team as a code owner July 6, 2020 21:05
@mnicely mnicely self-assigned this Jul 6, 2020
@mnicely mnicely added this to PR-WIP in v0.15 Release via automation Jul 6, 2020
@GPUtester
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mnicely
Contributor Author

mnicely commented Jul 7, 2020

Although not explicitly documented in the CuPy documentation, I have been able to load a fatbin into a RawModule, in accordance with the CUDA Driver API.

Doing so allows for the following:

  1. A fatbin requires less space than multiple cubins and ptx files
  2. One single fatbin is easier to maintain than multiple cubins and ptx files
  3. With a fatbin, no additional logic is required to select appropriate binary or ptx based on compute capability
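A minimal sketch of this loading pattern, assuming a CUDA GPU is present (the path and kernel name below are illustrative, not cuSignal's actual names; `RawModule(path=...)` hands the file to `cuModuleLoad`, which accepts a cubin, PTX, or fatbin):

```python
def load_fatbin_kernel(fatbin_path, kernel_name):
    """Load a fatbin module and fetch a kernel from it.

    Requires a CUDA GPU; the arguments are illustrative. Because the file is a
    fatbin, the driver picks the best embedded cubin (or JITs the embedded PTX)
    for the current device.
    """
    import cupy as cp  # imported lazily so this sketch parses without a GPU

    module = cp.RawModule(path=fatbin_path)
    return module.get_function(kernel_name)

# e.g. kernel = load_fatbin_kernel(
#     "python/cusignal/convolution/_convolution.fatbin", "_cupy_convolve")
```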

Next steps.

  1. Build fatbins manually and perform benchmarks.
  2. Create a CMake build to automatically generate fatbins during the cuSignal build process.

Notes:

  1. Using cuModuleLoad doesn't allow for C++ name mangling, so CUDA kernels will need to be wrapped with extern "C". Example - Allow C++ Templating Functionality cupy/cupy#3185 (comment)
  2. CuPy RawModule code will be moved to cu files, which will reside in a cpp folder. Similar to other RAPIDS APIs
  3. Fatbins will be sister folders under a fatbin folder.
  4. For cuSignal 0.15, binaries for CC3.5 through CC7.5 and a PTX for CC7.5 will reside in the fatbin.
  5. Once CuPy is compatible with CUDA 11.0 (v8) and cuSignal syncs with CuPy (v8), a binary will be added for CC8.0 and the PTX for CC7.5 will be replaced with CC8.0.
  6. With CUDA 11.0, CC35, CC37, and CC50 are deprecated. When they are removed from CUDA, they will be removed from cuSignal.
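Note 1 in practice: a hypothetical kernel source (not one of cuSignal's) showing the extern "C" wrapper that keeps the symbol name unmangled, so cuModuleGetFunction, and hence RawModule.get_function, can find it under exactly that name:

```python
# Hypothetical device code illustrating note 1: extern "C" disables C++ name
# mangling, so the symbol "_cupy_add_one" survives compilation verbatim and is
# retrievable from the loaded module by that exact name.
KERNEL_SOURCE = r"""
extern "C" __global__ void _cupy_add_one(float *x, int n) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        x[tid] += 1.0f;
    }
}
"""
```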

@mnicely mnicely changed the title [PR-WIP] Load PTX/Cubin at runtime [PR-WIP] Load fatbin at runtime Jul 7, 2020
@awthomp
Member

awthomp commented Jul 7, 2020

CC @leofang

Leo -- Wanted to give you a heads up on some of the great work @mnicely has done loading cubins at runtime and avoiding a JIT overhead. Maybe there's something here that we can encapsulate and upstream to cupy?

@mnicely
Contributor Author

mnicely commented Jul 7, 2020

@awthomp With the move to loading fatbins at runtime, I'm not sure there's a need for cusignal.precompile.
Let me know what you think.
Instead of removing it in 0.15, we can deprecate and get feedback.
Remove in 0.16 if there are no issues.

@mnicely mnicely linked an issue Jul 7, 2020 that may be closed by this pull request
@leofang
Member

leofang commented Jul 15, 2020

I myself have not tried this, but I guess it's doable. I've added the support of separate compilation to CuPy, and there the key is to call cuLinkAddFile to add the device runtime before completing the linking. What you need is to call it one more time to add your fatbin -- note there is an option CU_JIT_INPUT_FATBINARY for this purpose.

(Sorry, it's hard to add links on my phone, let me know if you need pointers😅)
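For reference, the driver-level flow described above might be sketched roughly as follows via ctypes (a hedged sketch, not CuPy's implementation: error handling is elided, the enum values are copied from cuda.h, and it assumes libcuda.so is available):

```python
import ctypes

# CUjitInputType values from cuda.h
CU_JIT_INPUT_CUBIN = 0
CU_JIT_INPUT_PTX = 1
CU_JIT_INPUT_FATBINARY = 2

def link_with_fatbin(fatbin_path):
    """Sketch: cuLinkCreate -> cuLinkAddFile(fatbin) -> cuLinkComplete.

    Requires the CUDA driver; all error checking is elided for brevity.
    """
    cuda = ctypes.CDLL("libcuda.so")
    cuda.cuInit(0)
    state = ctypes.c_void_p()
    cuda.cuLinkCreate_v2(0, None, None, ctypes.byref(state))
    # The key call: add the fatbin to the pending link
    cuda.cuLinkAddFile_v2(state, CU_JIT_INPUT_FATBINARY,
                          fatbin_path.encode(), 0, None, None)
    cubin = ctypes.c_void_p()
    size = ctypes.c_size_t()
    cuda.cuLinkComplete(state, ctypes.byref(cubin), ctypes.byref(size))
    return cubin, size.value
```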

@mnicely
Contributor Author

mnicely commented Jul 15, 2020

What you described is very similar to this I think... https://parallel-computing.pro/index.php/9-cuda/40-use-cuda-7-0-nvrtc-with-thrust

But I'm guessing CuPy doesn't have an abstract layer to do this 😄

Contributor

@rlratzel rlratzel left a comment


Once we get this working, maybe someone from the RAPIDS team can help move my build process for fatbins to CMake?

@mnicely this PR LGTM (although I admit I'm not familiar with PTX, cubin, or fatbins) but I can help with the cmake changes when you're ready.

@mnicely
Contributor Author

mnicely commented Jul 15, 2020

Thanks @rlratzel!

Here is the code that builds the fatbins. I'm currently running it in build.sh
Do you think this can be put in a CMake build, and that CMake be called from python/setup.py?

################################################################################
# Build fatbins
SRC="cpp/src"
FAT="python/cusignal"
FLAGS="-std=c++11"

if hasArg -p; then
    FLAGS="${FLAGS} -Xptxas -v -Xptxas -warn-lmem-usage -Xptxas -warn-double-usage"
fi

GPU_ARCH="--generate-code arch=compute_35,code=sm_35 \
--generate-code arch=compute_35,code=sm_37 \
--generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_50,code=sm_52 \
--generate-code arch=compute_53,code=sm_53 \
--generate-code arch=compute_60,code=sm_60 \
--generate-code arch=compute_61,code=sm_61 \
--generate-code arch=compute_62,code=sm_62 \
--generate-code arch=compute_70,code=sm_70 \
--generate-code arch=compute_72,code=sm_72 \
--generate-code arch=compute_75,code=[sm_75,compute_75]"

echo "Building Convolution kernels..."
FOLDER="convolution"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_convolution.cu -odir ${FAT}/${FOLDER}/ &

echo "Building Filtering kernels..."
FOLDER="filtering"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_upfirdn.cu -odir ${FAT}/${FOLDER}/ &
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_sosfilt.cu -odir ${FAT}/${FOLDER}/ &

echo "Building IO kernels..."
FOLDER="io"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_reader.cu -odir ${FAT}/${FOLDER}/ &
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_writer.cu -odir ${FAT}/${FOLDER}/ &

echo "Building Spectral kernels..."
FOLDER="spectral_analysis"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_spectral.cu -odir ${FAT}/${FOLDER}/ &

wait
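If this later moves out of build.sh, the per-file nvcc invocation above could be assembled from Python (e.g. from setup.py) along these lines. A sketch mirroring the flags above; the function and variable names are hypothetical:

```python
# Hypothetical helper mirroring the nvcc calls in build.sh above.
GPU_ARCH_FLAGS = [
    f"--generate-code=arch=compute_{c},code=sm_{s}"
    for c, s in [(35, 35), (35, 37), (50, 50), (50, 52), (53, 53),
                 (60, 60), (61, 61), (62, 62), (70, 70), (72, 72)]
] + ["--generate-code=arch=compute_75,code=[sm_75,compute_75]"]

def nvcc_fatbin_cmd(cu_file, out_dir, extra_flags=()):
    """Assemble the nvcc command line for one .cu -> .fatbin build.

    Run with subprocess.check_call(...) once nvcc is on PATH.
    """
    return ["nvcc", "--fatbin", "-std=c++11", *extra_flags,
            *GPU_ARCH_FLAGS, cu_file, "-odir", out_dir]
```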

@rlratzel
Contributor

Here is the code that builds the fatbins. I'm currently running it in build.sh
Do you think this can be put in a CMake build, and that CMake be called from python/setup.py?

I think that can be moved out of the build script and into (some part of) cmake, where it will generate the Makefile with those calls. From there, python/setup.py should be able to build those extensions via the updated/generated Makefile. Disclaimer: I'm assuming some of these things based on how other repos build, but I'll find out the specifics when we start the updates.

@mnicely are you thinking you want the cmake updates as part of this PR or a separate followup PR?

@mnicely
Contributor Author

mnicely commented Jul 15, 2020

@mnicely are you thinking you want the cmake updates as part of this PR or a separate followup PR?

@rlratzel Let's work it in another PR after this is pushed. 😄

@mnicely
Contributor Author

mnicely commented Jul 20, 2020

After further consideration, we've decided to remove precompile_kernels from 0.15.

This is because all kernel calls will now work as if precompile_kernels had been called at import.

@mnicely
Contributor Author

mnicely commented Jul 20, 2020

Closes #129

v0.15 Release automation moved this from PR-WIP to PR-Reviewer approved Aug 5, 2020
@BradReesWork BradReesWork merged commit 2fa9386 into rapidsai:branch-0.15 Aug 5, 2020
v0.15 Release automation moved this from PR-Reviewer approved to Done Aug 5, 2020
@mnicely mnicely deleted the use_cubins branch August 5, 2020 14:57
Labels
3 - Ready for Review Ready for review by team
Projects
No open projects
v0.15 Release
  
Done
Development

Successfully merging this pull request may close these issues.

[FEA] Restructure code to load cubins.
6 participants