
Conversation

@dmclark17 (Contributor) commented Oct 13, 2020

Optimizes the submatrix packing and incrementing kernels. I am marking this as a draft because it is hand-tuned to the Taxol problem size and the V100's L2 cache size.

The optimization has two main parts:

  • Adding the submatrix indices to the submatrix cut vector, which allows cuts from the same task to be processed concurrently. Additionally, some loop unrolling increases the number of load operations in flight, and inline PTX is used for stores to reduce pollution of the L2 cache (see the first sketch after this list).
  • Blocking the large matrix so that it fits in the L2 cache. For the Taxol problem, the large matrix is about 1000 by 1000, which is roughly 8 MB in double precision. I believe this was causing a low L2 hit rate on a V100, which has 6 MB of L2 cache. To solve this, I break the large matrix into four 512 by 512 chunks and use a separate kernel launch to process each one. To simplify the kernel's bookkeeping, I enforce breaks in the cuts at intervals of 512. Each kernel call is passed the starting and stopping points within the submatrix cut vector that it needs to process (see the second sketch after this list).
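As a rough illustration of the first bullet, here is a minimal CUDA sketch (not the PR's actual code; the function names are made up) of a streaming store issued through inline PTX. The `.cs` (cache-streaming) qualifier hints that the written line is unlikely to be reused, so the memory system need not keep it resident in L2; the `__stcs` intrinsic is an equivalent alternative to inline PTX.

```cuda
// Minimal sketch with hypothetical names: a streaming store via inline PTX.
// st.global.cs marks the line as "cache streaming, likely accessed once",
// reducing pollution of L2 by the packed output.
__device__ __forceinline__ void store_streaming( double* ptr, double val ) {
  asm volatile( "st.global.cs.f64 [%0], %1;" :: "l"(ptr), "d"(val) );
}

// Unrolled copy loop: several independent loads are issued before the
// corresponding streaming stores, increasing memory operations in flight.
__global__ void pack_cut_kernel( const double* __restrict__ src,
                                 double*       __restrict__ dst,
                                 int n ) {
  #pragma unroll 4
  for( int i = threadIdx.x; i < n; i += blockDim.x )
    store_streaming( dst + i, src[i] );
}
```

Note that the streaming qualifier only affects caching of the destination; the loads of the big matrix still populate L2, which is exactly what the blocking in the second bullet relies on.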
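For the second bullet, here is a hedged host-side sketch of the per-block launch loop; every name (pack_submat_kernel, block_cut_offsets, and so on) is a placeholder for illustration, not GauXC's actual interface.

```cuda
#include <cstdint>

// Hypothetical packing kernel, declared only so the driver compiles.
__global__ void pack_submat_kernel( const double* A_big, double* A_small,
                                    const int32_t* submat_cut,
                                    int32_t cut_start, int32_t cut_stop,
                                    int LDA );

// One launch per 512 x 512 block of the large matrix. Cuts are forced to
// break at multiples of the block size, so each cut lies entirely inside
// one block, and [cut_start, cut_stop) selects that block's slice of the
// submatrix cut vector.
void launch_packing_blocked( const double* A_big_dev, double* A_small_dev,
                             const int32_t* submat_cut_dev,
                             const int32_t* block_cut_offsets, // host-side
                             int N, int LDA ) {
  const int block_sz = 512;                       // tuned to V100's 6 MB L2
  const int n_blk    = ( N + block_sz - 1 ) / block_sz;

  for( int b = 0; b < n_blk * n_blk; ++b ) {
    const int32_t cut_start = block_cut_offsets[b];
    const int32_t cut_stop  = block_cut_offsets[b + 1];
    if( cut_start == cut_stop ) continue;         // no cuts touch this block

    pack_submat_kernel<<< cut_stop - cut_start, 256 >>>(
      A_big_dev, A_small_dev, submat_cut_dev, cut_start, cut_stop, LDA );
  }
}
```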

TODO:

  • The submatrix cut vector stores two (start, end) pairs for each cut, but this could be reduced to three values since the lengths in the small and large matrices are the same; in fact, the end point of the small matrix is never used (see the sketch after this list).
  • The second part of the optimization needs to be cleaned up: the block size should be derived from the L2 cache size of the device rather than hand-tuned.
  • Right now, the code is hard-coded to break the problem into fourths and piggybacks on the submatrix cut vector to store the starting and stopping points for each kernel launch. This needs to be cleaned up so it can handle any number of blocks and passes the starting and stopping points through a proper input variable.
  • Investigate the performance of multiple kernel launches when there are many small cuts. While I am measuring a speed-up on Taxol, this technique might not scale well when the large matrix breaks into many chunks. If the tasks are sufficiently localized, some thread blocks may do no work during some kernel launches, which might motivate a more intelligent way to launch the work.
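To make the first TODO concrete, here is a hedged sketch of the two layouts; the struct and field names are hypothetical:

```cuda
#include <cstdint>

// Current layout: two (start, end) pairs per cut.
struct CutPairs {
  int32_t big_start,   big_end;    // range within the large matrix
  int32_t small_start, small_end;  // range within the small matrix (end unused)
};

// Possible layout: since both ranges have the same length, a single
// delta replaces the two end points.
struct CutTriplet {
  int32_t big_start;    // start within the large matrix
  int32_t small_start;  // start within the small matrix
  int32_t length;       // big_end - big_start == small_end - small_start
};
```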

My measurements showed that both of these optimizations improved the kernels. Even though the tuning targets the Taxol problem size, each kernel also sped up by about 2x on the ubiquitin problem.

I will work on the TODOs listed above, but I wanted to get your feedback on the approach.

@wavefunction91 (Owner) left a comment

Overall it looks good; I think this is a sound approach. When you say 2x for Ubiquitin, do you mean 2x for the kernels or for the full TTS (time-to-solution)?

@dmclark17 (Contributor, Author)

Unfortunately, 2x for the individual kernels.

@wavefunction91 (Owner)

As these kernels are dominant for the big problems, this still has the potential for a large impact.

@wavefunction91 (Owner)

@dmclark17 I haven't dug into the cause yet, but it looks like your changes to https://github.com/wavefunction91/GauXC/pull/21/files#diff-298977ee40a20ec2b41b3c2c95681d68687ad52b2b4afbc97afba95c1388b4eaR60 cause a hang in the unit tests. My suspicion is that we only test small problems in the UTs, and the blocking factor of 512 yields something incorrect there. I'll look at it more tomorrow.

@dmclark17 (Contributor, Author)

Interesting, I will take a look as well. How would you like to proceed with changes to the code that is common to all the backends? Right now, I am adding an additional pair for each cut, and this could be modified to a single triplet per cut. Do you think the changes will be propagated to the other backends, or should I make a CUDA-specific version with the changes?

@wavefunction91 (Owner) commented Oct 21, 2020 via email

@wavefunction91 (Owner)

@dmclark17 I've merged master locally (with the testing fixes from #27), but that also pulls in the row-major memory access optimizations. I don't see any conflicts, but I wanted to confirm: would you like me to push directly here, or wait until you've finished more of the tasks?

@dmclark17 (Contributor, Author)

I think pushing directly here would work; it keeps the branch up to date. I have made progress on some of the tasks and will try to push updates soon!

@dmclark17 (Contributor, Author)

I generalized the blocking scheme to handle larger matrices and made a few minor optimizations to the packing/increment kernels themselves.

  • To store the mapping from matrix blocks to cut indices, I introduced another array into the device task structure. This structure is generated by the gen_compressed_submat_map function, which now returns a tuple of vectors (see the sketch after this list).
  • I changed the submat_map structure to store deltas instead of end points, which might be causing issues with the host backend.
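A hedged sketch of what the updated interface might look like; only the function name and the tuple-of-vectors return are taken from the description above, while the element types and parameters are placeholders rather than GauXC's actual signature:

```cuda
#include <array>
#include <cstdint>
#include <tuple>
#include <vector>

// Per-cut entries now store deltas: (big_start, small_start, length).
using submat_map_t   = std::vector< std::array<int32_t, 3> >;
// Offsets into the cut vector, one per matrix block, giving the
// starting cut index for each block's kernel launch.
using block_ranges_t = std::vector< int32_t >;

std::tuple< submat_map_t, block_ranges_t >
gen_compressed_submat_map( /* task and basis inputs elided */ );
```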

@dmclark17 dmclark17 marked this pull request as ready for review November 6, 2020 17:54
@wavefunction91 wavefunction91 merged commit 82125c0 into wavefunction91:master Nov 10, 2020