Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
Although it's not explicitly documented in the CuPy documentation, I have been able to load a fatbin into a RawModule in accordance with the CUDA Driver API. Doing so allows for the following:
Next steps.
Notes:
|
@awthomp With the move to loading fatbins at runtime, I'm not sure there's a need for |
I haven't tried this myself, but I guess it's doable. I've added support for separate compilation to CuPy, and there the key is to call (sorry, it's hard to add links on my phone; let me know if you need pointers 😅) |
What you described is very similar to this I think... https://parallel-computing.pro/index.php/9-cuda/40-use-cuda-7-0-nvrtc-with-thrust But I'm guessing CuPy doesn't have an abstract layer to do this 😄 |
Once we get this working, maybe someone from the RAPIDS team can help move my build process for fatbins to CMake?
@mnicely this PR LGTM (although I admit I'm not familiar with PTX, cubin, or fatbins) but I can help with the cmake changes when you're ready.
Thanks @rlratzel! Here is the code that builds the fatbins. I'm currently running it in
################################################################################
# Build fatbins
SRC="cpp/src"
FAT="python/cusignal"
FLAGS="-std=c++11"
if hasArg -p; then
FLAGS="${FLAGS} -Xptxas -v -Xptxas -warn-lmem-usage -Xptxas -warn-double-usage"
fi
GPU_ARCH="--generate-code arch=compute_35,code=sm_35 \
--generate-code arch=compute_35,code=sm_37 \
--generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_50,code=sm_52 \
--generate-code arch=compute_53,code=sm_53 \
--generate-code arch=compute_60,code=sm_60 \
--generate-code arch=compute_61,code=sm_61 \
--generate-code arch=compute_62,code=sm_62 \
--generate-code arch=compute_70,code=sm_70 \
--generate-code arch=compute_72,code=sm_72 \
--generate-code arch=compute_75,code=[sm_75,compute_75]"
echo "Building Convolution kernels..."
FOLDER="convolution"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_convolution.cu -odir ${FAT}/${FOLDER}/ &
echo "Building Filtering kernels..."
FOLDER="filtering"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_upfirdn.cu -odir ${FAT}/${FOLDER}/ &
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_sosfilt.cu -odir ${FAT}/${FOLDER}/ &
echo "Building IO kernels..."
FOLDER="io"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_reader.cu -odir ${FAT}/${FOLDER}/ &
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_writer.cu -odir ${FAT}/${FOLDER}/ &
echo "Building Spectral kernels..."
FOLDER="spectral_analysis"
mkdir -p ${FAT}/${FOLDER}/
nvcc --fatbin ${FLAGS} ${GPU_ARCH} ${SRC}/${FOLDER}/_spectral.cu -odir ${FAT}/${FOLDER}/ &
wait |
I think that can be moved out of the build script and into (some part of) cmake, where it will generate the Makefile with those calls. From there, @mnicely are you thinking you want the cmake updates as part of this PR or a separate followup PR? |
After further consideration, we've decided to remove This is because all kernel calls will now work as if |
Closes #129 |
This PR investigates loading kernels from PTX and cubins instead of compiling them at runtime.
The idea is to skip the runtime translation from source code to PTX or cubin. This should eliminate the need to precompile desired kernels, making the UI a little friendlier.
There are pros and cons to both PTX and cubin.
PTX
Pro: We can choose a single virtual architecture (default 3.0), and any hardware will JIT the PTX for its Compute Capability.
Con: This can leave performance on the table and can be slower to load than cubins.
Cubin
Pro: Optimal performance and a (slightly) faster load because it skips the JIT step.
Con: Requires a cubin for each supported architecture (i.e. sm30, sm35, sm50, sm52, sm60, sm62, sm70, sm72, sm75, sm80)
Each method requires more space for the generated files, plus the .cu file (if we decide to load it).
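For reference, here is one way the three artifact types discussed above can be produced with nvcc (the source file name is hypothetical, and the commands are skipped gracefully when nvcc is not on the PATH):

```shell
SRC=_convolution.cu
if command -v nvcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
  # PTX: virtual ISA for one compute capability; the driver JITs it at load.
  nvcc -ptx -arch=compute_35 "$SRC" -o _convolution.ptx
  # cubin: native code for exactly one real architecture; no JIT, no forward
  # compatibility with newer GPUs.
  nvcc -cubin -arch=sm_70 "$SRC" -o _convolution.cubin
  # fatbin: several cubins (optionally plus PTX) bundled in one file; the
  # driver picks the best match for the current GPU at load time.
  nvcc --fatbin \
    --generate-code arch=compute_60,code=sm_60 \
    --generate-code "arch=compute_75,code=[sm_75,compute_75]" \
    "$SRC" -o _convolution.fatbin
fi
```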
Currently, we are using the PTX method.
Anecdotal results: PTX and cubin are both ~18x faster on the first pass.