I'm getting quite non-stellar performance with apply2 and apply3 on non-NVIDIA hardware, using the cltorch port.
I think this is possibly because these kernels expect coalesced memory reads, and the hardware prefers float4 loads?
For apply2, I'm thinking of finding the two dimensions with the smallest strides (ideally: 1), forming 16-by-16 chunks of floats over those two dimensions, and giving each chunk to a single 64x1 workgroup/CUDA block. The block would download that chunk into local/shared memory using two float4 reads per thread, one for each tensor, hopefully somewhat pseudo-coalesced. Then each workgroup/CUDA block would run the apply out of shared/local memory.
thoughts?
I'm assuming the non-NVIDIA hardware does not have a texture cache? It's possible the NVIDIA hardware works well for this kernel because it gets auto-caching of the global reads through the texture cache.
You could start explicitly using shared memory; your idea for doing that seems decent.