Conversation


@ebetica ebetica commented Feb 23, 2016

Adding SparseLinear with CUDA. Most of the functions are directly converted from SparseLinear.c. Depending on how well the THCudaBlas operations are pipelined, it may be more efficient to write custom kernels for most of them. updateOutput uses cusparse.
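For reference, the cusparse call in updateOutput performs a CSR-sparse-matrix times dense-matrix product. A minimal host-side sketch of that operation (illustrative only; zero-based indexing here, whereas the buffers discussed later in this thread are one-based for cusparse):

```cpp
#include <vector>

// Illustrative host-side CSR * dense multiply, the operation a cusparse
// csrmm-style call performs on the GPU.
// A: m x k sparse in CSR form (zero-based), B: k x n dense row-major,
// returns C: m x n dense row-major.
std::vector<float> csrMM(int m, int n, int k,
                         const std::vector<int>& rowPtr,
                         const std::vector<int>& colInd,
                         const std::vector<float>& val,
                         const std::vector<float>& B) {
    std::vector<float> C(m * n, 0.0f);
    for (int i = 0; i < m; ++i)                          // each sparse row
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p)  // each nonzero in row i
            for (int j = 0; j < n; ++j)                  // each output column
                C[i * n + j] += val[p] * B[colInd[p] * n + j];
    return C;
}
```

The function name and argument layout are hypothetical; they stand in for the cusparse call, not the PR's actual code.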

nnz,
&pBufferSize
);
cudaMalloc((void**)&pBuffer, pBufferSize);
Member

Could you avoid cudaMalloc and cudaFree here and instead preallocate a buffer that is passed in? cudaFree causes a device synchronization, which prevents us from doing multi-GPU work.

Author

The buffer size is not known ahead of time... I'm not sure how I would preallocate it. Would using a THCudaStorage work?

Member

I think since this is part of an nn layer, you can initialize the buffer in nn, keep it around, and pass it in. You can call THCudaTensor_resize(), which will only reallocate if it needs a bigger buffer.
Similar lines of code:
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L51
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L109
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L180

and follow the "columns" variable in here:
https://github.com/torch/cunn/blob/master/lib/THCUNN/SpatialConvolutionMM.cu
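The pattern being suggested (a scratch buffer owned by the layer, reallocated only when a larger size is requested) can be sketched on the host like this. The class name is illustrative, and std::vector<char> stands in for the THCudaTensor that would hold device memory in the actual layer:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch of a grow-only scratch buffer. In the real layer
// this would be a THCudaTensor resized via THCudaTensor_resize; the
// point is that capacity only grows, so the hot path never frees.
class ScratchBuffer {
public:
    // Return a pointer to at least `bytes` bytes of scratch space,
    // reallocating only when the request exceeds current capacity.
    void* request(std::size_t bytes) {
        if (bytes > buf_.size()) {
            buf_.resize(bytes);  // grow; never shrink between calls
        }
        return buf_.data();
    }
    std::size_t capacity() const { return buf_.size(); }
private:
    std::vector<char> buf_;
};
```

Because the buffer is reused across forward calls, the per-call cudaMalloc/cudaFree pair, and the device synchronization cudaFree implies, drops out of the hot path.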

@ebetica
Author

ebetica commented Feb 23, 2016

Should be fixed; you merged the corresponding addition into nn half an hour ago, Soumith. Thanks!

@ebetica ebetica force-pushed the sparse_linear branch 4 times, most recently from 3239d64 to 1a09fac on February 29, 2016.
@ebetica
Author

ebetica commented Mar 1, 2016

This is now updated with the batch version of sparse linear, given in this commit.

csr_int = THCudaIntTensor_newWithSize1d(state, batchnum+1);
init_cusparse();
for (h = 0; h < batchnum+1; h++) {
  THCudaIntTensor_set1d(state, csr_int, h, 1 + nnz * h);
Member

Make this for loop a simple-stupid CUDA kernel.
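The loop fills the CSR row-pointer array for a batch in which every sample has exactly nnz nonzeros, using one-based indices as cusparse's CUSPARSE_INDEX_BASE_ONE expects. On the GPU the kernel body is a single assignment per thread; the sketch below simulates that kernel on the host (names are illustrative, not the PR's actual code):

```cpp
#include <vector>

// Host simulation of a one-line CUDA kernel that fills CSR row pointers
// for a batch where every row holds exactly `nnz` entries. On the device
// each loop iteration would be one thread, roughly:
//   __global__ void fillCsrPtr(int* csrPtr, int nnz, int n) {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     if (i < n) csrPtr[i] = 1 + nnz * i;  // one-based for cusparse
//   }
std::vector<int> fillCsrPtrHost(int nnz, int batchnum) {
    std::vector<int> csrPtr(batchnum + 1);
    for (int i = 0; i < batchnum + 1; ++i) {  // i plays the thread index
        csrPtr[i] = 1 + nnz * i;
    }
    return csrPtr;
}
```

Since every write is independent, the kernel needs no synchronization, which is what makes the "simple-stupid" version correct.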

@ebetica
Author

ebetica commented Mar 14, 2016

This has been updated to work with the PR at torch/nn#698.

thrust::copy(ptr, ptr+THCudaTensor_nElement(state, tensor), std::ostream_iterator<float>(std::cout, "\t"));
printf("\n");
}
void printCuda(THCState *state, THCudaIntTensor *tensor, char* str) {
Member

This function appears to be declared twice here.

@ebetica
Author

ebetica commented Mar 18, 2016

Fixed nits

soumith added a commit that referenced this pull request Mar 18, 2016
Adding SparseLinear with CUDA
@soumith soumith merged commit 64959b2 into torch:master Mar 18, 2016
@soumith
Member

soumith commented Mar 18, 2016

Thanks Zeming!
