I'm getting quite non-stellar performance with apply2 and apply3 on non-NVIDIA hardware, using the cltorch port.
I think this is possibly because these kernels expect coalesced memory reads, and the hardware prefers float4 loads?
For apply2, I'm thinking of finding the two dimensions with the smallest strides (ideally: 1), forming 16-by-16 chunks of floats over those two dimensions, and giving each chunk to a single 64x1 workgroup/CUDA block. The block would download that chunk into local/shared memory using two float4 reads per thread, one for each tensor, hopefully somewhat pseudo-coalesced. Then each workgroup/CUDA block would run the apply out of shared/local memory.
thoughts?
I'm assuming the non-NVIDIA hardware does not have a texture cache? It's possible the NVIDIA hardware works well for this kernel because it gets auto-caching of the global reads through the texture cache.
You could start explicitly using shared memory; your idea for doing that seems decent.