
Sparse Matrix(compressed) * Dense Matrix product #37

Merged
merged 4 commits on Jul 25, 2013

Conversation

albertzaharovits
Contributor

Note: needs review!!!

@karlrupp
Collaborator

Thanks, Albert! The overall structure looks really good :-) In particular, it's great that you have an implementation for all three backends now. As for the kernels, I'll comment on the commit inline. You can then either commit on top of the existing commit, or force-push an amended commit.

unsigned int row_start = sp_mat_row_indices[row];
unsigned int row_end = sp_mat_row_indices[row+1];

// load work rows to shared memory
Collaborator

I think you should limit the maximum size of shared memory used. If I ramp up the number of columns in the dense matrix to e.g. 3000, the GPU is easily out of shared memory.

for ( unsigned int col = get_local_id(0); col < result_col_size; col += get_local_size(0) ) {

float r = 0;

Collaborator

Nice! You can optimize it a little further by interchanging the loops for 'col' and 'k', because then 'x' needs to be loaded only once. Right now it might get loaded multiple times.

@karlrupp
Collaborator

Cool, almost done. Just a few minor optimizations left. :-)

@ptillet
Collaborator

ptillet commented Jul 19, 2013

Hi Albert! Great job, really!

I have some suggestions for potential performance improvements, but I think it would be wiser to first have benchmarks to measure whether performance improvements are necessary.
However, how could we benchmark it? Should we compare against other libraries, or compute FLOPs, keeping in mind that the number of floating point operations is data-dependent? I don't have enough experience to give an order of magnitude for the sparsity threshold at which the operation is no longer bandwidth-limited... In any case, I think this patch would be even better if you could provide a simple benchmark in addition to the tests ;)

@karlrupp
Collaborator

@PhilippeTillet For a general sparse-dense multiplication this will always be memory bandwidth limited. To make it more compute-limited we would need to introduce some block-CSR format or similar.

karlrupp added a commit that referenced this pull request Jul 25, 2013
Sparse Matrix(compressed) * Dense Matrix product
@karlrupp karlrupp merged commit af257ca into viennacl:master Jul 25, 2013
@karlrupp
Collaborator

Thanks, Albert!

3 participants