construct only necessary elements in OffsetCalculator #55107
💊 CI failures summary and remediations
As of commit 646510f (more details on the Dr. CI page):
ci.pytorch.org: 1 failed
This comment was automatically generated by Dr. CI.
Codecov Report
@@            Coverage Diff            @@
##           master   #55107     +/-   ##
=========================================
+ Coverage   77.08%   77.42%   +0.34%
=========================================
  Files        1893     1893
  Lines      186444   186444
=========================================
+ Hits       143715   144362     +647
+ Misses      42729    42082     -647
nice catch!
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
  for (int arg = 0; arg < NARGS; arg++) {
    int64_t element_size = (element_sizes == nullptr ? 1LL : element_sizes[arg]);
-   strides_[i][arg] = i < dims ? strides[arg][i] / element_size : 0;
+   strides_[i][arg] = strides[arg][i] / element_size;
  }
btw, might be worth emitting two loops here depending on whether element_sizes == nullptr, to avoid a bunch of idiv instructions just to divide by 1.
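A rough sketch of what the suggested split might look like, written as a standalone host-side function rather than the actual constructor. The names `fill_strides`, `kSketchMaxDims`, and `kSketchNArgs` are hypothetical and exist only for this illustration; the real class is templated on NARGS and stores the result in `strides_`.

```cpp
#include <cstdint>

// Hypothetical fixed bounds for this illustration only.
constexpr int kSketchMaxDims = 4;
constexpr int kSketchNArgs = 2;

// Hypothetical helper: converts byte strides to element strides for the
// first `dims` dimensions, branching once on element_sizes instead of
// selecting the divisor inside the inner loop.
void fill_strides(int dims,
                  const int64_t* const* strides,
                  const int64_t* element_sizes,
                  int64_t out[kSketchMaxDims][kSketchNArgs]) {
  if (element_sizes == nullptr) {
    // No element sizes given: strides are copied as-is, no division.
    for (int i = 0; i < dims; ++i) {
      for (int arg = 0; arg < kSketchNArgs; ++arg) {
        out[i][arg] = strides[arg][i];
      }
    }
  } else {
    // Element sizes given: divide each byte stride by its element size.
    for (int i = 0; i < dims; ++i) {
      for (int arg = 0; arg < kSketchNArgs; ++arg) {
        out[i][arg] = strides[arg][i] / element_sizes[arg];
      }
    }
  }
}
```

Whether this pays off depends on how often `element_sizes` is actually null and on what the compiler already does with the single-loop form, so it would need to be measured rather than assumed.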
Good point, I'll do that.
Tried it; it actually increases instruction count, and in most cases element_sizes is not nullptr.
It is silly that the TI build multiplies by element sizes and stores strides in bytes, and here we copy element_sizes into an array (true story, element_sizes[i] = iter.element_size(i);)
Per title. Elements beyond dim are never accessed; see pytorch/aten/src/ATen/cuda/detail/OffsetCalculator.cuh, lines 49 to 51 in 646510f.
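To make the argument concrete, here is a simplified host-side sketch, assuming the lookup loop exits as soon as the dimension index reaches `dims`. It is not a copy of the referenced lines: `SimpleOffsetCalc` and `kCalcMaxDims` are hypothetical names, there is a single operand instead of NARGS, and plain `%` and `/` stand in for whatever division helper the real code uses.

```cpp
#include <cstdint>

// Hypothetical upper bound on dimensions for this illustration only.
constexpr int kCalcMaxDims = 8;

// Hypothetical single-operand offset calculator.
struct SimpleOffsetCalc {
  int dims;
  int64_t sizes[kCalcMaxDims];
  int64_t strides[kCalcMaxDims];

  // Only the first `dims` entries are constructed; the tail is left untouched.
  SimpleOffsetCalc(int dims_, const int64_t* sizes_, const int64_t* strides_)
      : dims(dims_) {
    for (int i = 0; i < dims; ++i) {
      sizes[i] = sizes_[i];
      strides[i] = strides_[i];
    }
  }

  // Maps a linear index to a strided offset. The loop is bounded by the
  // compile-time maximum but breaks at `dims`, so entries at or beyond
  // `dims` are never read.
  int64_t get(int64_t linear_idx) const {
    int64_t offset = 0;
    for (int dim = 0; dim < kCalcMaxDims; ++dim) {
      if (dim == dims) {
        break;
      }
      offset += (linear_idx % sizes[dim]) * strides[dim];
      linear_idx /= sizes[dim];
    }
    return offset;
  }
};
```

Under that assumption, zero-filling the tail in the constructor is work whose result is never observed, which is what the change removes.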
Instruction count per 30 repetitions:
addmm: 1467813 -> 1452261
add: 651522 -> 633462
add_: 529331 -> 511271
add benchmarking snippet: