
Conversation

@dmclark17 (Contributor) commented Oct 13, 2020

Optimizes the submatrix packing and incrementing kernels. I am marking this as a draft because it is hand-tuned to the Taxol problem size and the V100's L2 cache size.

The optimization has two main parts:

  • Adding the submatrix indices to the submatrix cut vector, which allows cuts from the same task to be processed concurrently. Additionally, some loop unrolling increases the number of load operations in flight, and inline PTX is used for stores to reduce pollution of the L2 cache (see the first sketch after this list).
  • Blocking the large matrix so that it fits in the L2 cache. For the Taxol problem, the large matrix is about 1000 by 1000, which is roughly 8 MB in double precision. I believe this was causing a low L2 hit rate on a V100, which has 6 MB of L2 cache. To solve this, I break the large matrix into four 512 by 512 chunks and use a separate kernel launch to process each one. To simplify the kernel's bookkeeping, I enforce breaks in the cuts at intervals of 512. Each kernel call is passed the starting and stopping points within the submatrix cut vector that it needs to process (see the second sketch after this list).
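As a rough illustration of the first bullet, here is a minimal CUDA sketch (not the PR's actual code; the function names are made up) of a streaming store issued through inline PTX. The `.cs` (cache-streaming) qualifier hints that the written line is unlikely to be reused, so the memory system need not keep it resident in L2; the `__stcs` intrinsic is an equivalent alternative to inline PTX.

```cuda
// Minimal sketch with hypothetical names: a streaming store via inline PTX.
// st.global.cs marks the line as "cache streaming, likely accessed once",
// reducing pollution of L2 by the packed output.
__device__ __forceinline__ void store_streaming( double* ptr, double val ) {
  asm volatile( "st.global.cs.f64 [%0], %1;" :: "l"(ptr), "d"(val) );
}

// Unrolled copy loop: several independent loads are issued before the
// corresponding streaming stores, increasing memory operations in flight.
__global__ void pack_cut_kernel( const double* __restrict__ src,
                                 double*       __restrict__ dst,
                                 int n ) {
  #pragma unroll 4
  for( int i = threadIdx.x; i < n; i += blockDim.x )
    store_streaming( dst + i, src[i] );
}
```

Note that the streaming qualifier only affects caching of the destination; the loads of the big matrix still populate L2, which is exactly what the blocking in the second bullet relies on.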
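For the second bullet, here is a hedged host-side sketch of the per-block launch loop; every name (pack_submat_kernel, block_cut_offsets, and so on) is a placeholder for illustration, not GauXC's actual interface.

```cuda
#include <cstdint>

// Hypothetical packing kernel, declared only so the driver compiles.
__global__ void pack_submat_kernel( const double* A_big, double* A_small,
                                    const int32_t* submat_cut,
                                    int32_t cut_start, int32_t cut_stop,
                                    int LDA );

// One launch per 512 x 512 block of the large matrix. Cuts are forced to
// break at multiples of the block size, so each cut lies entirely inside
// one block, and [cut_start, cut_stop) selects that block's slice of the
// submatrix cut vector.
void launch_packing_blocked( const double* A_big_dev, double* A_small_dev,
                             const int32_t* submat_cut_dev,
                             const int32_t* block_cut_offsets, // host-side
                             int N, int LDA ) {
  const int block_sz = 512;                       // tuned to V100's 6 MB L2
  const int n_blk    = ( N + block_sz - 1 ) / block_sz;

  for( int b = 0; b < n_blk * n_blk; ++b ) {
    const int32_t cut_start = block_cut_offsets[b];
    const int32_t cut_stop  = block_cut_offsets[b + 1];
    if( cut_start == cut_stop ) continue;         // no cuts touch this block

    pack_submat_kernel<<< cut_stop - cut_start, 256 >>>(
      A_big_dev, A_small_dev, submat_cut_dev, cut_start, cut_stop, LDA );
  }
}
```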

TODO:

  • The submatrix cut vector stores two (start, end) pairs for each cut, but this could be reduced to three values since the lengths in the small and large matrices are the same; in fact, the end point of the small matrix is never used (see the sketch after this list).
  • The second part of the optimization needs to be cleaned up: the block size should be derived from the L2 cache size of the device rather than hand-tuned.
  • Right now, the code is hard-coded to break the problem into fourths and piggybacks on the submatrix cut vector to store the starting and stopping points for each kernel launch. This needs to be cleaned up so it can handle any number of blocks and passes the starting and stopping points through a proper input variable.
  • Investigate the performance of multiple kernel launches when there are many small cuts. While I am measuring a speed-up on Taxol, this technique might not scale well when the large matrix breaks into many chunks. If the tasks are sufficiently localized, some thread blocks may do no work during some kernel launches, which might motivate a more intelligent way to launch the work.
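To make the first TODO concrete, here is a hedged sketch of the two layouts; the struct and field names are hypothetical:

```cuda
#include <cstdint>

// Current layout: two (start, end) pairs per cut.
struct CutPairs {
  int32_t big_start,   big_end;    // range within the large matrix
  int32_t small_start, small_end;  // range within the small matrix (end unused)
};

// Possible layout: since both ranges have the same length, a single
// delta replaces the two end points.
struct CutTriplet {
  int32_t big_start;    // start within the large matrix
  int32_t small_start;  // start within the small matrix
  int32_t length;       // big_end - big_start == small_end - small_start
};
```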

My measurements showed that both of these optimizations improved the kernels. Even though the tuning targets the Taxol problem size, each kernel also sped up by about 2x on the ubiquitin problem.

I will work on the TODOs listed above, but I wanted to get your feedback on the approach.

@wavefunction91 (Owner) left a comment

Overall it looks good; I think this is a sound approach. When you say 2x for Ubiquitin, do you mean 2x for the kernels or for the full TTS (time-to-solution)?

@dmclark17 (Contributor, Author)

Unfortunately, 2x for the individual kernels.

@wavefunction91 (Owner)

As these kernels are dominant for the big problems, this still has the potential for a large impact.

@wavefunction91 (Owner)

@dmclark17 I haven't dug into the cause yet, but it looks like your changes to https://github.com/wavefunction91/GauXC/pull/21/files#diff-298977ee40a20ec2b41b3c2c95681d68687ad52b2b4afbc97afba95c1388b4eaR60 cause a hang in the unit tests. My suspicion is that we only test small problems in the UTs, and the blocking factor of 512 yields something incorrect there. I'll look at it more tomorrow.

@dmclark17 (Contributor, Author)

Interesting, I will take a look as well. How would you like to proceed with changes to the code that is common to all the backends? Right now, I am adding an additional pair for each cut, and this could be modified to a single triplet per cut. Do you think the changes will be propagated to the other backends, or should I make a CUDA-specific version with the changes?

@wavefunction91 (Owner) commented Oct 21, 2020 via email

@wavefunction91 (Owner)

@dmclark17 I've merged master locally (with the testing fixes from #27), but that also pulls in the row-major memory access optimizations. I don't see any conflicts, but I wanted to confirm: would you like me to push directly here, or wait until you've finished more of the tasks?

@dmclark17 (Contributor, Author)

I think pushing directly here would work; it keeps the branch up to date. I have made progress on some of the tasks and will try to push updates soon!

@dmclark17 (Contributor, Author)

I generalized the blocking scheme to handle larger matrices and made a few minor optimizations to the packing/increment kernels themselves.

  • To store the mapping from matrix blocks to cut indices, I introduced another array into the device task structure. This structure is generated by the gen_compressed_submat_map function, which now returns a tuple of vectors (see the sketch after this list).
  • I changed the submat_map structure to store deltas instead of end points, which might be causing issues with the host backend.
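A hedged sketch of what the updated interface might look like; only the function name and the tuple-of-vectors return are taken from the description above, while the element types and parameters are placeholders rather than GauXC's actual signature:

```cuda
#include <array>
#include <cstdint>
#include <tuple>
#include <vector>

// Per-cut entries now store deltas: (big_start, small_start, length).
using submat_map_t   = std::vector< std::array<int32_t, 3> >;
// Offsets into the cut vector, one per matrix block, giving the
// starting cut index for each block's kernel launch.
using block_ranges_t = std::vector< int32_t >;

std::tuple< submat_map_t, block_ranges_t >
gen_compressed_submat_map( /* task and basis inputs elided */ );
```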

@dmclark17 dmclark17 marked this pull request as ready for review November 6, 2020 17:54
@wavefunction91 wavefunction91 merged commit 82125c0 into wavefunction91:master Nov 10, 2020