[NNC] Hook up registerizer to Cuda codegen [2/x] #42878
Conversation
💊 CI failures summary and remediations
As of commit e7690af (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
codecov.io: 1 failed
Minor: how about the case where thread_idx is outside block_idx? It is okay to ignore this for now. But maybe a TODO to remind ourselves of this case down the road.
for i in (0..10): # threadIdx
for j in (0..100): # blockIdx
t1 = alloc(1)
....
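To make the concern concrete, here is a rough CUDA sketch of what that loop nest maps to (illustration only; names, extents, and the buffer placement are assumptions, not this PR's output). Once both axes become launch indices there is no scope "between" the two loops, so the alloc has to end up either as a per-thread local or as a __shared__ buffer regardless of the loop order.

```cpp
// Illustration only, not the PR's codegen output. The outer "i" loop is bound
// to threadIdx and the inner "j" loop to blockIdx, mirroring the snippet above.
__global__ void example(float* out) {
  int i = threadIdx.x;       // outer loop axis, extent 10
  int j = blockIdx.x;        // inner loop axis, extent 100
  float t1[1];               // the alloc(1): assumed per-thread local here
  t1[0] = 0.f;               // stand-in for the elided body ("....")
  out[j * 10 + i] = t1[0];
}
// Launched as example<<<100, 10>>>(out); to match the extents above.
```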
Good point, this is an important case. I'll come back to it.
Minor: Maybe a TODO to handle the case where the size of the shared memory is dynamic. Then the dynamic size needs to be provided at the kernel launch time.
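For reference, a minimal sketch of what that would involve in plain CUDA (kernel and buffer names are made up, not from this PR): a dynamically sized shared buffer is declared extern inside the kernel, and the byte count is supplied as the third launch-configuration argument.

```cpp
__global__ void reduce_like(const float* in, float* out, int n) {
  extern __shared__ float dyn_buf[];   // size not known at compile time
  int tid = threadIdx.x;
  dyn_buf[tid] = tid < n ? in[tid] : 0.f;
  __syncthreads();
  if (tid == 0) {
    float acc = 0.f;
    for (int i = 0; i < blockDim.x; ++i) {
      acc += dyn_buf[i];
    }
    out[blockIdx.x] = acc;
  }
}

// The launcher has to pass the dynamic size (third <<<...>>> argument):
//   reduce_like<<<grid, block, block.x * sizeof(float), stream>>>(in, out, n);
```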
Any reason why "metavars" is not a const ref, similar to "thread_local_bufs"? The most common reason for a const pointer is that it could be nullptr, but it doesn't seem to be the case here.
The only reason was to match the local style of the file: storing as a unique_ptr and passing as a ptr (see CudaAnalysis and its use by CudaPrinter).
No big deal to change this one, but I wouldn't want to fold a general Cuda codegen cleanup into this PR.
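For the record, the two styles being weighed, with hypothetical names (the real declarations live in the CUDA codegen sources):

```cpp
struct MetaVarAnalysis;  // stand-in for the real analysis class

// Const reference: documents that the argument can never be null.
void printBody(const MetaVarAnalysis& metavars);

// Const pointer: matches the file's existing style of storing the analysis in
// a unique_ptr and passing the raw pointer (as CudaPrinter does with CudaAnalysis).
void printBody(const MetaVarAnalysis* metavars);
```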
Minor: style-wise, I mildly prefer "std::vector<const Expr*> block_extents = ..." over "const ... &" here. With return-value optimization both do the same thing, but the first makes the scope of the temporary object easy to see, while the second relies on a special rule in C++ to extend its lifetime. Both are correct; some people might just trip over why the 2nd form is correct.
Either way is fine with me, if you are more comfortable with that style.
Not particularly bothered either way, I can switch to value.
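For anyone tripping over the same question later, a self-contained sketch of the two forms, assuming (as the comment above does) that the getter returns by value:

```cpp
#include <vector>

struct Expr;  // stand-in for the tensorexpr Expr node type

// Hypothetical getter that returns by value, as the comment assumes.
std::vector<const Expr*> gpu_block_extents() {
  return {};
}

void use_extents() {
  // Form 1: a named local; RVO/move makes this as cheap as the reference
  // form, and the object's lifetime is obviously the enclosing scope.
  std::vector<const Expr*> block_extents = gpu_block_extents();

  // Form 2: also correct, but only because binding a const reference to the
  // returned temporary extends the temporary's lifetime to match the reference.
  const std::vector<const Expr*>& block_extents_ref = gpu_block_extents();

  (void)block_extents;
  (void)block_extents_ref;
}
```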
Minor: this seems to be idiomatic in our code base now. Maybe we should add an overloaded "immediateEquals" that does this?
Sure, makes sense - in a follow up though.
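A toy sketch of what such an overload could look like; the types and names here are stand-ins, not the actual tensorexpr classes:

```cpp
#include <cstdint>

struct Expr { virtual ~Expr() = default; };
struct IntImm : Expr {
  explicit IntImm(int64_t v) : value(v) {}
  int64_t value;
};

// Existing-style check against a node already known to be an immediate.
inline bool immediateEquals(const IntImm* imm, int64_t v) {
  return imm != nullptr && imm->value == v;
}

// Proposed-style overload: accept any Expr and fold the "is it an immediate?"
// check in, so call sites stop repeating the cast-then-compare pattern.
inline bool immediateEquals(const Expr* e, int64_t v) {
  return immediateEquals(dynamic_cast<const IntImm*>(e), v);
}
```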
Great feature! This seems so fundamental that it deserves its own directed tests. I would like to see a list of more explicit expressions, showing which will become AtomicAdd and which won't.
You're right, this needs a lot more testing. The whole Cuda Codegen does. We'll have to come back to it.
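Something like the following pair could anchor such directed tests (hand-written illustrative kernels, not NNC output): the first pattern needs atomicAdd because every thread accumulates into the same element; the second does not because each thread owns its output element.

```cpp
// Cross-thread accumulation into one element: a plain "out[0] += x[i]" would
// race across threads, so the generated store must become an atomicAdd.
__global__ void sum_all(const float* x, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    atomicAdd(&out[0], x[i]);
  }
}

// Per-thread reduction: the loop over j runs entirely inside one thread, so
// the accumulator can live in a register and no atomic is needed.
__global__ void sum_rows(const float* x, float* out, int rows, int cols) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= rows) return;
  float acc = 0.f;
  for (int j = 0; j < cols; ++j) {
    acc += x[i * cols + j];
  }
  out[i] = acc;
}
```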
@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Codecov Report

@@            Coverage Diff             @@
##           master   #42878      +/-   ##
==========================================
- Coverage   69.41%   69.40%   -0.01%
==========================================
  Files         378      378
  Lines       46602    46602
==========================================
- Hits        32347    32345       -2
- Misses      14255    14257       +2

Continue to review full report at Codecov.
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.
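Roughly, scalar replacement turns repeated loads and stores of the same buffer element into a register. A hand-written sketch of the before/after shape on a per-thread reduction (the kernels NNC actually emits will differ):

```cpp
// Before registerization: the accumulator lives in global memory and is
// re-read and re-written on every iteration.
__global__ void reduce_before(const float* x, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = 0.f;
  for (int j = 0; j < n; ++j) {
    out[i] = out[i] + x[i * n + j];
  }
}

// After registerization: the repeated accesses collapse to a scalar in a
// register, with a single store at the end.
__global__ void reduce_after(const float* x, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float out_local = 0.f;
  for (int j = 0; j < n; ++j) {
    out_local += x[i * n + j];
  }
  out[i] = out_local;
}
```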
First up the good stuff. Benchmarks before and after show a speedup of about 3-8x depending on the size of the data (increasing with bigger inputs).
The gap between NNC and the simple kernel is mostly or entirely closed; the remaining issue appears to be kernel launch overhead. Next up is getting us closer to the Better kernel.
It required a lot of refactoring and bug fixes along the way: