Adds pack/inc optimization #21
Conversation
Overall it looks good; I think this is a good approach. When you say 2x for Ubiquitin, do you mean 2x for the kernel or for the full time-to-solution (TTS)?
Unfortunately, 2x for the individual kernels.
As these kernels are dominant for the big problems, this still has the potential to have a large impact.
@dmclark17 I haven't dug into the cause of this yet, but it looks like your changes to https://github.com/wavefunction91/GauXC/pull/21/files#diff-298977ee40a20ec2b41b3c2c95681d68687ad52b2b4afbc97afba95c1388b4eaR60 cause a hang in the unit tests. My suspicion is that we only test small problems in the UTs, and the blocking factor of 512 yields something incorrect there. I'll look at it more tomorrow.
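One hedged guess at the failure mode (purely illustrative; the actual GauXC code may well differ): a hand-tuned blocking factor of 512 applied without clamping to the matrix dimension can misbehave on the small matrices used in the unit tests. A defensive sketch of the clamping logic:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative only: clamp a hand-tuned blocking factor to the problem
// size so small unit-test matrices still get a valid (possibly single)
// block. The constant 512 mirrors the value discussed in this thread.
constexpr std::size_t kTunedBlock = 512;

inline std::size_t effective_block(std::size_t n) {
  return std::min(kTunedBlock, n);  // never exceed the matrix dimension
}

inline std::size_t num_blocks(std::size_t n) {
  const std::size_t b = effective_block(n);
  return b ? (n + b - 1) / b : 0;   // ceil-divide; 0 blocks for empty input
}
```

With this, a problem of dimension 100 gets one block of size 100 instead of a single over-sized 512 block.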
Interesting, I will take a look as well. How would you like to proceed with changes to the code that is common to all the backends? Right now I am adding an additional pair for each cut, and this could be modified to be a single triplet per cut. Do you think the changes will be propagated to the other backends, or should I make a CUDA-specific version with the changes?
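For reference, a minimal sketch of the two layouts under discussion. All names and fields here are hypothetical illustrations, not GauXC's actual types:

```cpp
#include <cstddef>
#include <utility>

// Current approach in this PR (as described above): the original index
// pair plus an additional pair per cut, i.e. two pairs per cut.
struct CutPairs {
  std::pair<std::size_t, std::size_t> range;  // [start, end) in the source matrix
  std::pair<std::size_t, std::size_t> extra;  // additional pair added per cut
};

// Possible alternative: fold the same information into one triplet,
// e.g. start index, extent, and packed-destination offset.
struct CutTriplet {
  std::size_t start;   // start index in the source matrix
  std::size_t length;  // extent of the cut
  std::size_t offset;  // destination offset in the packed submatrix
};
```

The triplet form halves the number of tuples each backend has to carry per cut, at the cost of changing the shared interface.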
My guess is that something analogous will get migrated to the other backends at some point (to the extent that it can at all), but I can handle that as the need arises. I'll give it some thought over the next few days. If it turns out that we need to sandbox this, we can do that before merge once the functionality is fleshed out a bit more.
@dmclark17 I've merged
I think pushing directly here would work—just to keep the branch up to date. I have made progress with some of the tasks and will try to push updates soon! |
I added a generalization to the blocking scheme for larger matrices as well as a few minor optimizations for the packing/inc kernels themselves.
Optimizes the submatrix packing and incrementing kernels. I am marking this as a draft because it is hand-tuned to the Taxol problem size and the V100's L2 cache size.
There are 2 main parts of the optimization:
TODO:
My measurements showed that both of these optimizations improved kernel performance. Even though the code is tuned to the Taxol problem size, it sped up the Ubiquitin problem by about 2x for each kernel.
I will work on the TODOs listed above, but I wanted to get your feedback on the approach.
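As a rough illustration of the blocking idea described above (a simplified host-side sketch under assumed data layouts, not the actual CUDA kernels), the packing pass can walk the matrix in column blocks sized so each block stays cache-resident while every cut that touches it is copied out:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cut descriptor: a contiguous row range in the big matrix.
struct Cut { std::size_t start, length; };

// Pack the rows selected by `cuts` into a contiguous row-major buffer,
// processing columns in blocks of `col_block` so a block of A can stay
// resident in cache across all cuts that read it.
void pack_blocked(const double* A, std::size_t lda,
                  const std::vector<Cut>& cuts,
                  std::size_t ncols, std::size_t col_block,
                  double* packed, std::size_t ldp) {
  for (std::size_t c0 = 0; c0 < ncols; c0 += col_block) {
    const std::size_t cend = std::min(c0 + col_block, ncols);
    std::size_t out_row = 0;  // destination row, restarted per column block
    for (const Cut& cut : cuts) {
      for (std::size_t i = 0; i < cut.length; ++i, ++out_row)
        for (std::size_t j = c0; j < cend; ++j)
          packed[out_row * ldp + j] = A[(cut.start + i) * lda + j];
    }
  }
}
```

The incrementing kernel would be the mirror image (scatter-add from the packed buffer back into A); the block size is the hand-tuned parameter this PR currently pins to the V100's L2 capacity.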