Migrate equal from the TH to Aten (CPU) #33286
Conversation
💊 CI failures summary and remediations: as of commit f72b7f4 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI and has been revised 42 times.)
aten/src/ATen/native/BinaryOps.cpp
Outdated
Should we also support Half here?
No
aten/src/ATen/native/BinaryOps.cpp
Outdated
Please use TensorIterator instead. This is a perfect case for a reduction.
CPU_tensor_apply2 is going away eventually; please use TensorIterator instead.
Thanks for the quick review.
aten/src/ATen/native/BinaryOps.cpp
Outdated
Is it possible to move this part to aten/src/ATen/native/cpu/BinaryOpsKernel.cpp? There you can take advantage of SIMD instructions, and the code structure is cleaner (especially after the CUDA version is also migrated). You can look at the other binary operators as examples.
Thanks for the quick review.
I moved this to BinaryOpsKernel.
You shouldn't specify a return type here (it's causing build errors). The capture can also be more specific: [&equal].
It's odd that the unit test fails when no return type is specified here, but passes with an explicit return type.
bfloat16?
Please add bfloat16 support and add tests to cover it. Everything else looks good.
Opting in @glaringlee for review and help with TensorIterator.
@xcnick The other way is to reuse foreach_reduced_elt(func(subiter)) in TensorIterator.
@xcnick
I commented on this PR above; let me know if you need more info on how to do tensor reduction.
Sorry for the late response. I would like to use the first way, so I made the changes. Am I doing it right?
@xcnick
You will need some tweaks in that function to support more inputs.
Hey @xcnick, if you think changing the reduction code is too much for you, you can do the same thing as this PR. at::eq supports both CPU and CUDA. Then rename your equal() function to cpu_equal (change it in all related places), and you are done :)
Hi, I modified the code as you said. The benchmarking results were updated.
ping @glaringlee

@xcnick @VitalyFedyunin

@xcnick
aten/src/ATen/native/ReduceOps.cpp
Outdated
@xcnick
Based on your benchmark, I found one problem with calling at::native::eq directly.
For the contiguous non-equal test, the performance of at::native::eq is very bad. The reason is that at::native::eq compares every element of the two tensors; even after an unequal pair has been found, it won't stop. This also slows down the non-contiguous non-equal case, though not as badly as the contiguous one.
Let's do something else here: instead of calling eq directly, let's implement our own loop function. You can put the following code here and remove the 'return at::native::eq' line.
```cpp
std::atomic<bool> result{true};
auto iter = TensorIteratorConfig()
    .add_input(self)
    .add_input(other)
    .allow_cpu_scalars(true)
    .promote_inputs_to_common_dtype(true)
    .build();
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kBool, kBFloat16, kHalf, iter.input_dtype(), "equal_cpu", [&] {
  iter.for_each([&](char** data, const int64_t* strides, int64_t dim_size) {
    if (!result) {
      return;
    }
    char* self_data = data[0];
    char* other_data = data[1];
    for (int64_t i = 0; i < dim_size; ++i) {
      if (*((scalar_t*)self_data) != *((scalar_t*)other_data)) {
        result = false;
        return;
      }
      self_data += strides[0];
      other_data += strides[1];
    }
  });
});
return result.load();
```
Let me explain a little. Since we just need to return a bool, there is no need to even have an output tensor, so we initialize a tensor iterator with two inputs only and keep track of a global boolean 'result'.
The tensor iterator has a for_each function which takes a function reference with a signature like loop(char** data, const int64_t* strides, int64_t size).
Internally, the tensor iterator chunks the whole tensor into many 2D planes and processes them row by row; this loop function is used to iterate over each row. The 'data' array contains the starting point of the current row for each input tensor, the 'strides' array contains the stride of the current row (dimension) for each input tensor, and the tensor order in both 'data' and 'strides' is the same as the order in which you initialized the tensor iterator. The 'size' is the row length.
This code dispatches iter.for_each() over all the supported data types, and if any non-equal element is found between the two tensors, it stops immediately. For a contiguous tensor this means quitting the whole iteration, since in that case there is a single continuous data array. If the tensor is not contiguous there is still overhead: the program will jump to the next row and immediately break, then jump to the next row and break, until it has looped over the entire tensors. But it already saves a lot of time compared to calling at::native::eq directly, and in addition there is no tensor write.
Please make this change, update the benchmark, and rebase. I think we are good to go then.
aten/src/ATen/native/ReduceOps.cpp
Outdated
```cpp
if (!self.is_same_size(other)) {
  return false;
}
bool result = true;
```
@xcnick
Should this be atomic? I updated my comments around an hour ago; I think your GitHub page is cached.
@glaringlee Thanks for your help!
@glaringlee has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
LGTM now.
@xcnick Thanks a lot for your contribution!

@glaringlee merged this pull request in 72f2c47.
#24697
@VitalyFedyunin @glaringlee
Test script: TH vs. ATen benchmark