This repository has been archived by the owner on Sep 25, 2023. It is now read-only.

[PR-REVIEW] Load fatbin at runtime #148

Merged
merged 46 commits into from Aug 5, 2020

Conversation

mnicely
Contributor

@mnicely mnicely commented Jul 6, 2020

This PR investigates loading kernels via PTX and cubins instead of compiling them at runtime.

The idea is to skip the compilation step from source code to PTX or cubin. This should eliminate the need to precompile desired kernels, making the UI a little friendlier.

There are pros and cons to both PTX and cubin.

PTX

Pro: We can choose a single architecture (default 3.0) and any hardware will JIT based on Compute Capability.
Con: This can leave performance on the table and can be slower than cubins

Cubin

Pro: Optimal performance and (slightly) faster load because it skips the JIT step
Con: Requires a cubin for each supported architecture (i.e. sm30, sm35, sm50, sm52, sm60, sm62, sm70, sm72, sm75, sm80)

Each method requires more space for the compiled files and the .cu file (if we decide to load it).

Currently, we are using the PTX method.
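To make the cubin trade-off concrete: without a fatbin, the loader would need arch-selection logic along these lines (a hypothetical sketch; a fatbin performs this selection inside the driver at load time):

```python
def select_cubin_arch(device_cc, available_archs):
    """Pick the newest compiled architecture the device can run.

    Both arguments use integer encoding, e.g. 75 for sm_75 / CC 7.5.
    Hypothetical helper -- a fatbin does this selection for us at load time.
    """
    candidates = [arch for arch in available_archs if arch <= device_cc]
    if not candidates:
        raise RuntimeError(f"no compatible cubin for compute capability {device_cc}")
    return max(candidates)

# The architectures from the con above
ARCHS = [30, 35, 50, 52, 60, 62, 70, 72, 75, 80]
```

For example, a CC 6.1 device would fall back to the sm_60 cubin, since no sm_61 binary is in the list.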

Anecdotal results: PTX and cubin are ~18x faster on the first pass.

SOURCE CODE
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Range               
-------  --------------  ----------  --------------  --------------  --------------  --------------------
   78.3       259557232         100       2595572.3         2163073         2954536  gpu_lombscargle_4   
   18.9        62547903           1      62547903.0        62547903        62547903  gpu_lombscargle_0 <-- 
    1.0         3339638           1       3339638.0         3339638         3339638  gpu_lombscargle_1   
    0.9         2960195           1       2960195.0         2960195         2960195  gpu_lombscargle_2   
    0.9         2951219           1       2951219.0         2951219         2951219  gpu_lombscargle_3
PTX
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Range               
-------  --------------  ----------  --------------  --------------  --------------  --------------------
   95.2       247349403         100       2473494.0         2121631         2904074  gpu_lombscargle_4   
    1.3         3447313           1       3447313.0         3447313         3447313  gpu_lombscargle_0 <--  
    1.2         3234977           1       3234977.0         3234977         3234977  gpu_lombscargle_1   
    1.1         2904661           1       2904661.0         2904661         2904661  gpu_lombscargle_2   
    1.1         2902191           1       2902191.0         2902191         2902191  gpu_lombscargle_3
CUBIN
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Range               
-------  --------------  ----------  --------------  --------------  --------------  --------------------
   95.2       239095813         100       2390958.1         2065998         2840041  gpu_lombscargle_4   
    1.3         3325468           1       3325468.0         3325468         3325468  gpu_lombscargle_0 <--  
    1.3         3163933           1       3163933.0         3163933         3163933  gpu_lombscargle_1   
    1.1         2828210           1       2828210.0         2828210         2828210  gpu_lombscargle_2   
    1.1         2823180           1       2823180.0         2823180         2823180  gpu_lombscargle_3

@mnicely mnicely added the 2 - In Progress Currently a work in progress label Jul 6, 2020
@mnicely mnicely added this to the 0.15 milestone Jul 6, 2020
@mnicely mnicely requested a review from a team as a code owner July 6, 2020 21:05
@mnicely mnicely self-assigned this Jul 6, 2020
@mnicely mnicely added this to PR-WIP in v0.15 Release via automation Jul 6, 2020
@GPUtester
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mnicely
Contributor Author

mnicely commented Jul 7, 2020

Although not explicitly documented in the CuPy documentation, I have been able to load a fatbin into a RawModule, in accordance with the CUDA Driver API.

Doing so allows for the following:

  1. A fatbin requires less space than multiple cubins and ptx files
  2. One single fatbin is easier to maintain than multiple cubins and ptx files
  3. With a fatbin, no additional logic is required to select appropriate binary or ptx based on compute capability
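A minimal sketch of this loading pattern, assuming a CUDA GPU is present (the path and kernel name below are illustrative, not cuSignal's actual names; `RawModule(path=...)` hands the file to `cuModuleLoad`, which accepts a cubin, PTX, or fatbin):

```python
def load_fatbin_kernel(fatbin_path, kernel_name):
    """Load a fatbin module and fetch a kernel from it.

    Requires a CUDA GPU; the arguments are illustrative. Because the file is a
    fatbin, the driver picks the best embedded cubin (or JITs the embedded PTX)
    for the current device.
    """
    import cupy as cp  # imported lazily so this sketch parses without a GPU

    module = cp.RawModule(path=fatbin_path)
    return module.get_function(kernel_name)

# e.g. kernel = load_fatbin_kernel(
#     "python/cusignal/convolution/_convolution.fatbin", "_cupy_convolve")
```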

Next steps.

  1. Build fatbins manually and perform benchmarks.
  2. Create a CMake build to automatically generate fatbins during the cuSignal build process.

Notes:

  1. Using cuModuleLoad doesn't allow for C++ name mangling, so CUDA kernels will need to be wrapped with extern "C". Example - Allow C++ Templating Functionality cupy/cupy#3185 (comment)
  2. CuPy RawModule code will be moved to cu files, which will reside in a cpp folder. Similar to other RAPIDS APIs
  3. Fatbins will be sister folders under a fatbin folder.
  4. For cuSignal 0.15, binaries for CC3.5 through CC7.5 and a PTX for CC7.5 will reside in the fatbin.
  5. Once CuPy is compatible with CUDA 11.0 (v8) and cuSignal syncs with CuPy (v8), a binary will be added for CC8.0 and the PTX for CC7.5 will be replaced with CC8.0.
  6. With CUDA 11.0, CC35, CC37, and CC50 are deprecated. When they are removed from CUDA, they will be removed from cuSignal.
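Note 1 in practice: a hypothetical kernel source (not one of cuSignal's) showing the extern "C" wrapper that keeps the symbol name unmangled, so cuModuleGetFunction, and hence RawModule.get_function, can find it under exactly that name:

```python
# Hypothetical device code illustrating note 1: extern "C" disables C++ name
# mangling, so the symbol "_cupy_add_one" survives compilation verbatim and is
# retrievable from the loaded module by that exact name.
KERNEL_SOURCE = r"""
extern "C" __global__ void _cupy_add_one(float *x, int n) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        x[tid] += 1.0f;
    }
}
"""
```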

@mnicely mnicely changed the title [PR-WIP] Load PTX/Cubin at runtime [PR-WIP] Load fatbin at runtime Jul 7, 2020
@awthomp
Member

awthomp commented Jul 7, 2020

CC @leofang

Leo -- Wanted to give you a heads up on some of the great work @mnicely has done loading cubins at runtime and avoiding a JIT overhead. Maybe there's something here that we can encapsulate and upstream to cupy?

@mnicely
Contributor Author

mnicely commented Jul 7, 2020

@awthomp With the move to loading fatbins at runtime, I'm not sure there's a need for cusignal.precompile.
Let me know what you think.
Instead of removing it in 0.15, we can deprecate and get feedback.
Remove in 0.16 if there are no issues.

@mnicely mnicely linked an issue Jul 7, 2020 that may be closed by this pull request
@leofang
Member

leofang commented Jul 15, 2020

I myself have not tried this, but I guess it's doable. I've added the support of separate compilation to CuPy, and there the key is to call cuLinkAddFile to add the device runtime before completing the linking. What you need is to call it one more time to add your fatbin -- note there is an option CU_JIT_INPUT_FATBINARY for this purpose.

(Sorry, it's hard to add links on my phone, let me know if you need pointers😅)
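For reference, the driver-level flow described above might be sketched roughly as follows via ctypes (a hedged sketch, not CuPy's implementation: error handling is elided, the enum values are copied from cuda.h, and it assumes libcuda.so is available):

```python
import ctypes

# CUjitInputType values from cuda.h
CU_JIT_INPUT_CUBIN = 0
CU_JIT_INPUT_PTX = 1
CU_JIT_INPUT_FATBINARY = 2

def link_with_fatbin(fatbin_path):
    """Sketch: cuLinkCreate -> cuLinkAddFile(fatbin) -> cuLinkComplete.

    Requires the CUDA driver; all error checking is elided for brevity.
    """
    cuda = ctypes.CDLL("libcuda.so")
    cuda.cuInit(0)
    state = ctypes.c_void_p()
    cuda.cuLinkCreate_v2(0, None, None, ctypes.byref(state))
    # The key call: add the fatbin to the pending link
    cuda.cuLinkAddFile_v2(state, CU_JIT_INPUT_FATBINARY,
                          fatbin_path.encode(), 0, None, None)
    cubin = ctypes.c_void_p()
    size = ctypes.c_size_t()
    cuda.cuLinkComplete(state, ctypes.byref(cubin), ctypes.byref(size))
    return cubin, size.value
```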

@mnicely
Contributor Author

mnicely commented Jul 15, 2020

What you described is very similar to this I think... https://parallel-computing.pro/index.php/9-cuda/40-use-cuda-7-0-nvrtc-with-thrust

But I'm guessing CuPy doesn't have an abstract layer to do this 😄

Contributor

@rlratzel rlratzel left a comment


Once we get this working, maybe someone from the RAPIDS team can help move my build process for fatbins to CMake?

@mnicely this PR LGTM (although I admit I'm not familiar with PTX, cubin, or fatbins) but I can help with the cmake changes when you're ready.

@mnicely
Contributor Author

mnicely commented Jul 15, 2020

Thanks @rlratzel!

Here is the code that builds the fatbins. I'm currently running it in build.sh
Do you think this can be put in a CMake build, and that CMake be called from python/setup.py?

################################################################################
# Build fatbins
SRC="cpp/src"
FAT="python/cusignal"
FLAGS="-std=c++11"

if hasArg -p; then
    FLAGS="${FLAGS} -Xptxas -v -Xptxas -warn-lmem-usage -Xptxas -warn-double-usage"
fi

GPU_ARCH="--generate-code arch=compute_35,code=sm_35 \
--generate-code arch=compute_35,code=sm_37 \
--generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_50,code=sm_52 \
--generate-code arch=compute_53,code=sm_53 \
--generate-code arch=compute_60,code=sm_60 \
--generate-code arch=compute_61,code=sm_61 \
--generate-code arch=compute_62,code=sm_62 \
--generate-code arch=compute_70,code=sm_70 \
--generate-code arch=compute_72,code=sm_72 \
--generate-code arch=compute_75,code=[sm_75,compute_75]"

echo "Building Convolution kernels..."
FOLDER="convolution"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_convolution.cu -odir ${FAT}/${FOLDER}/ &

echo "Building Filtering kernels..."
FOLDER="filtering"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_upfirdn.cu -odir ${FAT}/${FOLDER}/ &
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_sosfilt.cu -odir ${FAT}/${FOLDER}/ &

echo "Building IO kernels..."
FOLDER="io"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_reader.cu -odir ${FAT}/${FOLDER}/ &
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_writer.cu -odir ${FAT}/${FOLDER}/ &

echo "Building Spectral kernels..."
FOLDER="spectral_analysis"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_spectral.cu -odir ${FAT}/${FOLDER}/ &

wait
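If this later moves out of build.sh, the per-file nvcc invocation above could be assembled from Python (e.g. from setup.py) along these lines. A sketch mirroring the flags above; the function and variable names are hypothetical:

```python
# Hypothetical helper mirroring the nvcc calls in build.sh above.
GPU_ARCH_FLAGS = [
    f"--generate-code=arch=compute_{c},code=sm_{s}"
    for c, s in [(35, 35), (35, 37), (50, 50), (50, 52), (53, 53),
                 (60, 60), (61, 61), (62, 62), (70, 70), (72, 72)]
] + ["--generate-code=arch=compute_75,code=[sm_75,compute_75]"]

def nvcc_fatbin_cmd(cu_file, out_dir, extra_flags=()):
    """Assemble the nvcc command line for one .cu -> .fatbin build.

    Run with subprocess.check_call(...) once nvcc is on PATH.
    """
    return ["nvcc", "--fatbin", "-std=c++11", *extra_flags,
            *GPU_ARCH_FLAGS, cu_file, "-odir", out_dir]
```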

@rlratzel
Contributor

Here is the code that builds the fatbins. I'm currently running it in build.sh
Do you think this can be put in a CMake build, and that CMake be called from python/setup.py?

I think that can be moved out of the build script and into (some part of) cmake, where it will generate the Makefile with those calls. From there, python/setup.py should be able to build those extensions via the updated/generated Makefile. Disclaimer: I'm assuming some of these things based on how other repos build, but I'll find out the specifics when we start the updates.

@mnicely are you thinking you want the cmake updates as part of this PR or a separate followup PR?

@mnicely
Contributor Author

mnicely commented Jul 15, 2020

@mnicely are you thinking you want the cmake updates as part of this PR or a separate followup PR?

@rlratzel Let's work it in another PR after this is pushed. 😄

@mnicely
Contributor Author

mnicely commented Jul 20, 2020

After further consideration, we've decided to remove precompile_kernels from 0.15.

This is because all kernel calls will now work as if precompile_kernels had been called at import.

@mnicely
Contributor Author

mnicely commented Jul 20, 2020

Closes #129

v0.15 Release automation moved this from PR-WIP to PR-Reviewer approved Aug 5, 2020
@BradReesWork BradReesWork merged commit 2fa9386 into rapidsai:branch-0.15 Aug 5, 2020
v0.15 Release automation moved this from PR-Reviewer approved to Done Aug 5, 2020
@mnicely mnicely deleted the use_cubins branch August 5, 2020 14:57
Labels
3 - Ready for Review Ready for review by team
Projects
No open projects
v0.15 Release
  
Done
Development

Successfully merging this pull request may close these issues.

[FEA] Restructure code to load cubins.
6 participants