Optimizes weights kernel #30
Updates fork
Merge master into fork
Update weights branch with latest version of code. See merge request daclark/gauxc-mirror!3
I updated the PR with some cleanup changes and minor optimizations. Right now it is failing 371 of the assertions in the weight unit test; however, the differences are all relatively small. Removing the reciprocal optimization seems to resolve the issue.
@dmclark17 How small is small? I'm more than happy to update the UT checks if it's a worthwhile optimization.
The maximum absolute difference is 1.4e-5 and the average absolute difference is 4.8e-8. The maximum percent difference is 0.4% and the average percent difference is 0.03%. The reciprocal optimization brings the total runtime of the weights kernel from 18.9s to 14.2s for a ubiquitin simulation on a V100.
Overall looks great!
This looks great, thanks! I'll pull and verify everything works on Summit and merge ASAP.
This changes the level of parallelism to assign a warp to each point instead of a thread.
For the iParent portion, a warp computes the contribution from 32 jCenters concurrently. They then do a warp reduction to compute the final contribution. This changes the early exit scheme since the check effectively only happens every 32 jCenters.
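The warp-per-point pattern for the iParent portion can be sketched roughly as follows. This is a simplified illustration and not the kernel from this PR: it assumes the per-center factors have already been written to a hypothetical `factors` array of shape `[npts x ncenters]`, and the names `parent_weight_kernel`, `warp_product`, `parent_weights`, and `tol` are made up for the example. Each lane takes one jCenter from the current batch of 32, the warp reduces the 32 values to one product, and the early-exit test runs once per batch.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr unsigned kFullMask = 0xffffffffu;

// Butterfly reduction: afterwards every lane holds the product of all 32 inputs.
__device__ double warp_product(double v) {
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2)
    v *= __shfl_xor_sync(kFullMask, v, offset);
  return v;
}

// One warp per point. blockDim.x is assumed to be a multiple of 32.
// `factors` is a hypothetical precomputed [npts x ncenters] array.
__global__ void parent_weight_kernel(int npts, int ncenters, const double* factors,
                                     double tol, double* parent_weight) {
  const int lane = threadIdx.x % kWarpSize;
  const int point = blockIdx.x * (blockDim.x / kWarpSize) + threadIdx.x / kWarpSize;
  if (point >= npts) return;  // warp-uniform, so whole warps leave together

  const double* f = factors + static_cast<size_t>(point) * ncenters;
  double running = 1.0;
  for (int base = 0; base < ncenters; base += kWarpSize) {
    const int j = base + lane;
    // Lanes past the end contribute the multiplicative identity.
    const double v = (j < ncenters) ? f[j] : 1.0;
    running *= warp_product(v);
    // Early exit is only checked once per batch of 32 jCenters.
    if (running < tol) { running = 0.0; break; }
  }
  if (lane == 0) parent_weight[point] = running;
}

// Host helper for the sketch (error checking omitted for brevity; npts > 0 assumed).
std::vector<double> parent_weights(int npts, int ncenters,
                                   const std::vector<double>& factors, double tol) {
  double *d_f = nullptr, *d_w = nullptr;
  cudaMalloc(&d_f, factors.size() * sizeof(double));
  cudaMalloc(&d_w, npts * sizeof(double));
  cudaMemcpy(d_f, factors.data(), factors.size() * sizeof(double), cudaMemcpyHostToDevice);
  const int warps_per_block = 4;
  const int blocks = (npts + warps_per_block - 1) / warps_per_block;
  parent_weight_kernel<<<blocks, warps_per_block * kWarpSize>>>(npts, ncenters, d_f, tol, d_w);
  std::vector<double> w(npts);
  cudaMemcpy(w.data(), d_w, npts * sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(d_f);
  cudaFree(d_w);
  return w;
}
```

Because the butterfly leaves the same value in every lane, the `break` is taken by the whole warp at once, so no extra vote is needed in this sketch.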
For the pairs portion, each thread in the warp computes the contribution from 1 iCenter at a time as the warp synchronously iterates over jCenters. The threads are able to independently exit early and start on a new iCenter.
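One way to picture the pairs portion is the sketch below. It is an assumption about how such a scheme could be arranged, not the code in this PR: the per-pair factors live in a hypothetical `[npts x ncenters x ncenters]` array, the jCenter index is shared by the warp and wraps around so a lane can pick up a new iCenter mid-sweep, and finished lanes claim new iCenters through a ballot. All names (`pair_sum_kernel`, `pair_sums`, `tol`) are illustrative.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr unsigned kFullMask = 0xffffffffu;

__device__ double warp_sum(double v) {
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2)
    v += __shfl_xor_sync(kFullMask, v, offset);
  return v;
}

// One warp per point; each lane owns one iCenter at a time.
// Result per point: sum over i of prod_{j != i} f[i][j], with products
// below `tol` treated as zero. blockDim.x is assumed to be a multiple of 32.
__global__ void pair_sum_kernel(int npts, int ncenters, const double* factors,
                                double tol, double* pair_sum) {
  const int lane = threadIdx.x % kWarpSize;
  const int point = blockIdx.x * (blockDim.x / kWarpSize) + threadIdx.x / kWarpSize;
  if (point >= npts) return;  // warp-uniform
  const double* f = factors + static_cast<size_t>(point) * ncenters * ncenters;

  int i = lane;            // this lane's current iCenter
  int next_i = kWarpSize;  // warp-uniform: next unclaimed iCenter
  int count = 0;           // jCenters visited for the current iCenter
  int j = 0;               // warp-uniform jCenter, wraps around
  double prod = 1.0, local_sum = 0.0;

  while (__any_sync(kFullMask, i < ncenters)) {
    const bool active = i < ncenters;
    if (active) {
      if (j != i) prod *= f[static_cast<size_t>(i) * ncenters + j];
      ++count;
    }
    // A lane finishes on a full sweep or as soon as its product vanishes.
    const bool done = active && (count == ncenters || prod < tol);
    const unsigned done_mask = __ballot_sync(kFullMask, done);
    if (done) {
      if (prod >= tol) local_sum += prod;
      // Rank among finishing lanes selects a distinct new iCenter.
      i = next_i + __popc(done_mask & ((1u << lane) - 1u));
      prod = 1.0;
      count = 0;
    }
    next_i += __popc(done_mask);
    j = (j + 1 == ncenters) ? 0 : j + 1;
  }
  local_sum = warp_sum(local_sum);
  if (lane == 0) pair_sum[point] = local_sum;
}

// Host helper for the sketch (error checking omitted for brevity; npts > 0 assumed).
std::vector<double> pair_sums(int npts, int ncenters,
                              const std::vector<double>& factors, double tol) {
  double *d_f = nullptr, *d_s = nullptr;
  cudaMalloc(&d_f, factors.size() * sizeof(double));
  cudaMalloc(&d_s, npts * sizeof(double));
  cudaMemcpy(d_f, factors.data(), factors.size() * sizeof(double), cudaMemcpyHostToDevice);
  const int warps_per_block = 4;
  const int blocks = (npts + warps_per_block - 1) / warps_per_block;
  pair_sum_kernel<<<blocks, warps_per_block * kWarpSize>>>(npts, ncenters, d_f, tol, d_s);
  std::vector<double> s(npts);
  cudaMemcpy(s.data(), d_s, npts * sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(d_f);
  cudaFree(d_s);
  return s;
}
```

In this arrangement a lane that exits early does not idle until the sweep ends; it restarts its count at the warp's current jCenter and still visits every jCenter once thanks to the wrap-around.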