
Ideas on improving apply perf on non-nvidia hardware? #190

Closed

hughperkins opened this issue Jul 1, 2015 · 1 comment

@hughperkins (Contributor)

Hi,

I'm getting quite non-stellar perf with apply2 and apply3 on non-nvidia hardware, using the cltorch port.

I think this is possibly because these GPUs expect coalesced memory reads, and they prefer float4s?

For apply2, I'm thinking of finding the two dimensions with the smallest strides (ideally: 1), forming 16 by 16 chunks of floats over those two dimensions, and giving each chunk to a single 64x1 workgroup/cuda-block. That block would download its chunk into local/shared memory using two float4 reads per thread, one for each tensor, hopefully somewhat pseudo-coalesced, and then run the apply out of the shared/local memory.
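Something like this rough OpenCL sketch, assuming both tensor extents are multiples of 16, the inner dimension has stride 1 (so float4 loads are legal), and with made-up kernel/parameter names (not actual cltorch code):

```c
// Rough sketch only: one 64x1 workgroup per 16x16 float tile.
// Assumes both tile dimensions divide the tensor extents, the inner
// dimension has stride 1, and row_stride4 is the outer-dim stride
// expressed in float4 units.
#define TILE 16  // 16x16 floats per chunk = 64 float4s

__kernel void apply2_tiled(__global float4 *a,        // tensor 1, viewed as float4s
                           __global const float4 *b,  // tensor 2, viewed as float4s
                           const int row_stride4,     // outer-dim stride, in float4 units
                           const int tiles_per_row)   // number of tiles along the inner dim
{
    __local float4 tile_a[64];
    __local float4 tile_b[64];

    const int lid      = get_local_id(0);   // 0..63 within the workgroup
    const int tile_id  = get_group_id(0);   // which 16x16 tile this workgroup owns
    const int tile_row = tile_id / tiles_per_row;
    const int tile_col = tile_id % tiles_per_row;

    // Each tile row is 16 floats = 4 float4s; work-item lid covers
    // float4 (lid & 3) of tile row (lid >> 2).
    const int row  = tile_row * TILE + (lid >> 2);
    const int col4 = tile_col * (TILE / 4) + (lid & 3);
    const int idx  = row * row_stride4 + col4;

    // Two float4 reads per thread, one per tensor -- the hopefully
    // pseudo-coalesced download into local memory.
    tile_a[lid] = a[idx];
    tile_b[lid] = b[idx];
    // The barrier only matters if the op mixes elements across threads;
    // kept here to show the staging pattern.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Run the apply out of local memory (example op: add).
    tile_a[lid] += tile_b[lid];
    a[idx] = tile_a[lid];
}
```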

thoughts?

@soumith (Member) commented Jul 1, 2015

I'm assuming the non-nvidia hardware does not have a texture cache? It's possible that this kernel works well on nvidia hardware because it relies on auto-caching of the global reads via the texture cache.

You could start using shared memory explicitly; your idea seems like a decent way to do that.
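For what it's worth, one way to test the texture-cache hypothesis on OpenCL hardware is to route the reads through the image path, which on many GPUs is backed by the texture cache. A hypothetical sketch (assuming OpenCL 1.2's image1d_buffer_t; names are illustrative, not cltorch code):

```c
// Hypothetical sketch, not cltorch API: route reads through the OpenCL
// image path, which on many GPUs is served by the texture cache -- one
// way to approximate the auto-caching nvidia kernels may get for free.
// Requires OpenCL 1.2 (image1d_buffer_t) and a CL_R/CL_FLOAT image
// created over each tensor's buffer on the host side.
__kernel void apply2_via_images(__global float *out,
                                __read_only image1d_buffer_t a,
                                __read_only image1d_buffer_t b,
                                const int n)
{
    const int i = get_global_id(0);
    if (i < n) {
        const float va = read_imagef(a, i).x;  // texture-cached read
        const float vb = read_imagef(b, i).x;
        out[i] = va + vb;                      // example op: add
    }
}
```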
