Speed up torch.unique_consecutive() #64835
Conversation
CI failures summary (as of commit 89e6648, via Dr. CI): 3 new failures recognized by patterns; they do not appear to be due to upstream breakages.
Codecov Report

```
@@            Coverage Diff             @@
##           master   #64835      +/-   ##
==========================================
+ Coverage   66.65%   66.68%   +0.03%
==========================================
  Files         714      714
  Lines       92546    92598      +52
==========================================
+ Hits        61685    61752      +67
+ Misses      30861    30846      -15
```
aten/src/ATen/native/Unique.cpp (outdated)

```diff
 std::tuple<Tensor, Tensor, Tensor>
 unique_consecutive_cpu(const Tensor& self, const bool return_inverse, const bool return_counts, c10::optional<int64_t> dim) {
-  if (!dim.has_value()) {
+  if (!dim.has_value() || (dim.value() == 0 && self.sizes().size() == 1)) {
```
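The added condition works because, for a 1-D tensor, `dim=0` is semantically identical to passing no `dim` at all: each "slice" along dimension 0 is a single scalar, so comparing slices degenerates to comparing elements. A minimal pure-Python sketch of that semantics (this is only an illustration of what the fast path computes, not the ATen implementation):

```python
from itertools import groupby

def unique_consecutive_1d(values):
    """Collapse runs of consecutive duplicates, returning (uniques, counts).

    A pure-Python sketch of what torch.unique_consecutive computes for a
    1-D input; the real kernel lives in aten/src/ATen/native/Unique.cpp.
    """
    uniques, counts = [], []
    for key, group in groupby(values):
        uniques.append(key)
        counts.append(sum(1 for _ in group))
    return uniques, counts

# For a 1-D input, dim=0 and dim=None describe the same operation,
# which is why the patch can forward (dim == 0, ndim == 1) to the
# faster no-dim template.
print(unique_consecutive_1d([1, 1, 2, 2, 2, 3, 1]))
# → ([1, 2, 3, 1], [2, 3, 1, 1])
```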
Without this check, the result would be:

```
torch.unique_consecutive(dim=0) time: 16.88133478164673
torch.unique_consecutive() time: 0.024564027786254883
torch.unique_consecutive(dim=0) time: 28.210450410842896
torch.unique_consecutive(dim=1) time: 0.30884695053100586
```

As you can see, this check slightly slows down torch.unique_consecutive(dim=1), but it speeds up torch.unique_consecutive(dim=0) considerably. Hence, I still leave it here.
How can this check affect the dim=1 case? With or without this check, the dim=1 case should go to unique_dim_consecutive_cpu in line 287. A 5x slowdown (from 0.3 to 1.6) seems significant; why is your algorithm slower than the previous one for dim=1?
Yes, I believe it should always go to unique_dim_consecutive_cpu in this case. Therefore, I am also curious about this situation; I didn't expect such a huge difference at first. Maybe it is some CPU branching issue?

If I loop the last case 5 times, the results become:

```
torch.unique_consecutive(dim=0) time: 0.06868529319763184
torch.unique_consecutive() time: 0.06931805610656738
torch.unique_consecutive(dim=0) time: 26.346323251724243
torch.unique_consecutive(dim=1) time: 1.2163631916046143
torch.unique_consecutive(dim=1) time: 0.2443101406097412
torch.unique_consecutive(dim=1) time: 0.24067211151123047
torch.unique_consecutive(dim=1) time: 0.25301051139831543
torch.unique_consecutive(dim=1) time: 0.24738121032714844
```

I will try torch.utils.benchmark or something else to get a more thorough analysis if possible.
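The pattern in those numbers (only the first dim=1 run is slow) suggests controlling for warm-up. A minimal stdlib-only sketch of that methodology, using timeit instead of torch.utils.benchmark (which performs warm-up for you); the function name bench is hypothetical:

```python
import timeit

def bench(fn, warmup=3, repeats=5):
    """Time fn, discarding warm-up runs.

    The first call to a function can pay one-time costs (allocator
    growth, caches and code paths being populated), which skews a
    single timing — the likely cause of the outlier above.
    """
    for _ in range(warmup):
        fn()  # warm-up runs, not timed
    # Report the minimum over several repeats, as the timeit docs suggest.
    return min(timeit.repeat(fn, number=1, repeat=repeats))

elapsed = bench(lambda: sum(range(10_000)))
```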
Ah ok, it seems like a warm-up issue then, but please post torch.utils.benchmark timings here, so that we have a record.
Summary

For the 1-D case, the (dim.value() == 0 && self.sizes().size() == 1) check makes the dim=0 case more than 100x faster.
For the 2-D case, there is no significant impact with or without (dim.value() == 0 && self.sizes().size() == 1).
In general, the new implementation (89e6648) is 3x faster than the original implementation (be09195).

Result

The scripts I used for testing: test_1d.py, test_2d.py.
Thanks for the improvement!
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #62690

Summary

Like the way unique_consecutive_cpu_template is implemented, this PR reimplements _unique_dim_cpu_impl to get better performance. Also, because the overhead of unique_dim_consecutive_cpu is quite large, directly call unique_consecutive_cpu_template when we know the given input is a 1-D array.

Benchmark
Script
Before
After
System Information
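As a recap, the dispatch the PR summary describes can be sketched in Python; the function and the returned kernel names are only labels mirroring the C++ functions discussed above, not a real API:

```python
def dispatch_unique_consecutive(ndim, dim):
    """Hypothetical sketch of the dispatch logic in unique_consecutive_cpu.

    When dim is None, or dim == 0 on a 1-D input, take the fast
    element-wise path; otherwise fall back to the slower per-slice
    implementation that compares whole slices along `dim`.
    """
    if dim is None or (dim == 0 and ndim == 1):
        return "unique_consecutive_cpu_template"   # fast path
    return "unique_dim_consecutive_cpu"            # general dim path

# A 1-D tensor with dim=0 now avoids the heavy dim-based kernel:
print(dispatch_unique_consecutive(1, 0))
# → unique_consecutive_cpu_template
```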