Adding SparseLinear with CUDA #223
Conversation
lib/THCUNN/SparseLinear.cu (Outdated)

      nnz,
      &pBufferSize
    );
    cudaMalloc((void**)&pBuffer, pBufferSize);
Could you avoid cudaMalloc and cudaFree here, and instead preallocate a buffer that is passed in? cudaFree causes a device synchronization, which stops us from doing multi-GPU.
The buffer size is not known ahead of time, so I'm not sure how I would preallocate it. Would using a THCudaStorage work?
I think since this is part of an nn layer, you can initialize the buffer in nn, keep it around, and pass it in. You can call THCudaTensor_resize(), which will only reallocate if it needs a bigger buffer.
Similar lines of code:
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L51
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L109
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L180
and follow the "columns" variable in here:
https://github.com/torch/cunn/blob/master/lib/THCUNN/SpatialConvolutionMM.cu
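The grow-only pattern the comment describes can be sketched roughly as follows. This is an illustrative fragment against the old THC API, not code from this PR; the helper name and the float-sized workspace are assumptions:

```cuda
// Hypothetical sketch: the nn layer owns a persistent workspace tensor
// and grows it with a resize instead of cudaMalloc/cudaFree per call.
// THCudaTensor_resize1d only reallocates when the storage must grow,
// so no cudaFree (and no device synchronization) happens on the hot path.
static void ensureBuffer(THCState *state, THCudaTensor *buffer,
                         size_t bytesNeeded) {
  long elems = (long)((bytesNeeded + sizeof(float) - 1) / sizeof(float));
  if (THCudaTensor_nElement(state, buffer) < elems) {
    THCudaTensor_resize1d(state, buffer, elems);
  }
  // THCudaTensor_data(state, buffer) can then be passed to cusparse
  // in place of the per-call cudaMalloc'd pBuffer.
}
```

The layer-side Lua code would create the buffer tensor once (as SpatialConvolution does with its `columns` buffer in the links above) and pass it into every forward call.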
Force-pushed from 53e46ee to 37e29f6.
Should be fixed; you merged the addition in nn half an hour ago, Soumith. Thanks!
Force-pushed from 3239d64 to 1a09fac.
This is now updated with the batch version of sparse linear given in this commit.
lib/THCUNN/SparseLinear.cu (Outdated)

    csr_int = THCudaIntTensor_newWithSize1d(state, batchnum+1);
    init_cusparse();
    for (h = 0; h < batchnum+1; h++) {
      THCudaIntTensor_set1d(state, csr_int, h, 1 + nnz * h);
Make this for loop a simple, stupid CUDA kernel.
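A minimal sketch of such a kernel, filling the same one-based CSR row pointers as the loop above (the kernel and launcher names are illustrative, not from this PR):

```cuda
// Illustrative kernel: csrRowPtr[h] = 1 + nnz * h for h in [0, batchnum],
// replacing the host-side per-element THCudaIntTensor_set1d loop.
__global__ void fillCsrRowPtr(int *csrRowPtr, int nnz, int batchnum) {
  int h = blockIdx.x * blockDim.x + threadIdx.x;
  if (h <= batchnum) {
    csrRowPtr[h] = 1 + nnz * h;  // one-based indexing for cusparse
  }
}

// Launch over batchnum+1 elements; d_ptr is the tensor's device pointer.
static void fillCsrRowPtrOnDevice(int *d_ptr, int nnz, int batchnum) {
  int n = batchnum + 1;
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  fillCsrRowPtr<<<blocks, threads>>>(d_ptr, nnz, batchnum);
}
```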
This has been updated to work with the PR at torch/nn#698.
lib/THCUNN/SparseLinear.cu (Outdated)

      thrust::copy(ptr, ptr+THCudaTensor_nElement(state, tensor), std::ostream_iterator<float>(std::cout, "\t"));
      printf("\n");
    }
    void printCuda(THCState *state, THCudaIntTensor *tensor, char* str) {
This function seems to be double-declared here
Fixed nits.
Thanks Zeming!
Adding SparseLinear with CUDA. Most of the functions are directly converted from SparseLinear.c. Depending on how well the THCudaBlas operations pipeline, it may be more efficient to write custom kernels for most of them. updateOutput uses cusparse.
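The updateOutput path boils down to a sparse-times-dense multiply. A hedged sketch of the core call using the legacy cuSPARSE csrmm interface (the surrounding handle/descriptor setup is assumed; cusparseScsrmm was later deprecated in favor of the generic SpMM API, and the function and argument choices here are illustrative, not lifted from this PR):

```cuda
// Illustrative only: C(m x n) = alpha * A(m x k, CSR) * B(k x n) + beta * C,
// i.e. output = sparse_input * weight, via legacy cuSPARSE.
// All pointers are device pointers; error checking omitted for brevity.
// The descriptor's index base must match the one-based row pointers
// built above (CUSPARSE_INDEX_BASE_ONE).
void sparseUpdateOutput(cusparseHandle_t handle, cusparseMatDescr_t descr,
                        int batchnum, int outDim, int inDim, int nnz,
                        const float *csrVal, const int *csrRowPtr,
                        const int *csrColInd, const float *weight,
                        float *output) {
  const float alpha = 1.0f, beta = 0.0f;
  cusparseScsrmm(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 batchnum, outDim, inDim, nnz,
                 &alpha, descr, csrVal, csrRowPtr, csrColInd,
                 weight, inDim,      // B with column-major leading dim k
                 &beta, output, batchnum);  // C with leading dim m
}
```

Note that cuSPARSE expects the dense operands in column-major layout, so leading dimensions are the column heights (inDim and batchnum here), not the row widths.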