Support int32 indices and offsets in nn.EmbeddingBag #46758
Conversation
This pull request was exported from Phabricator. Differential Revision: D24470808
💊 CI failures summary (as of commit 07607c1, via Dr. CI):
XLA failure: job pytorch_xla_linux_bionic_py3_6_clang9_test is failing.
Force-pushed 5da8558 to 8a9ed33
Thank you for working on this; this is a commonly needed feature.
This generally looks good. Can you please update the documentation to indicate that indices/offsets can now be int, and that they have to be of the same type?
aten/src/ATen/native/Embedding.cpp (Outdated)

```cpp
  }
}
AT_DISPATCH_INDEX_TYPES(indices.scalar_type(), "embedding_dense_backward_cpu", [&] () {
  auto indices_contig = indices.contiguous();
```
It's better to put the parts of the code that don't depend on the dispatch type outside of the dispatch macro, so that they don't get compiled twice (e.g. indices.contiguous(), indices.numel(), grad_.contiguous(), etc.). It also better separates what actually depends on the indices datatype.
aten/src/ATen/native/Embedding.cpp (Outdated)

```cpp
if (numel > 1000) {
  // The strategy is to parallelize over sections of the vocabulary, so that
  // thread 1 handles updates to gradWeight[0..nVocab/nThreads]. Every thread
  // has to traverse the entire input, but the dominating factor is the axpy
```
Looks like this comment was copy-pasted from somewhere else; there's no axpy in the parallel section? I know it's been this way since before this PR.
Good catch. Let me remove it.
aten/src/ATen/native/Embedding.cpp (Outdated)

```cpp
auto num_indices = indices.numel();
auto data_ptr = indices_contig.data_ptr<index_t>();
auto sorted_indices = std::vector<index_t>(data_ptr, data_ptr + num_indices);
std::sort(sorted_indices.begin(), sorted_indices.end(), std::less<index_t>());
```
std::less is the default comparator; no need to specify it explicitly.
```cpp
offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
  auto ones = at::ones_like(offsets, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
  AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
```
Why aren't you using index_add here? You don't want to convert offsets to long? In that case you also don't need the ones tensor; just add literal ones.
> Don't want to convert offsets to long

Yes. Would you suggest keeping the original code with offsets.long(), or the current one with literal ones? The latter seems like it may have better perf.
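For reference, the offset2bag construction being debated can be sketched in plain Python (a hypothetical, list-based sketch of the scatter-ones-then-cumsum strategy; the real code operates on tensors):

```python
def make_offset2bag(offsets, num_indices):
    # Mark each bag start with a 1 (the index_add_/ones step), then a
    # running sum turns the start markers into per-index bag ids.
    # Inline values assume offsets = [0, 2, 4] and num_indices = 5.
    offset2bag = [0] * num_indices
    for off in offsets:
        offset2bag[off] += 1  # [1, 0, 1, 0, 1]
    offset2bag[0] -= 1        # [0, 0, 1, 0, 1]
    total = 0
    for i, v in enumerate(offset2bag):
        total += v            # cumsum
        offset2bag[i] = total
    return offset2bag         # [0, 0, 1, 1, 2]

print(make_offset2bag([0, 2, 4], 5))  # → [0, 0, 1, 1, 2]
```

With literal ones scattered at the offsets, no auxiliary ones tensor is needed, which is the point of the suggestion above.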
```diff
  auto numel = add_indices.numel();
  int64_t ddim = src.size(1);
  auto src_stride0 = src.stride(0);
  auto src_stride1 = src.stride(1);
  auto output_stride0 = output.stride(0);
  auto output_stride1 = output.stride(1);

- auto* scale_data = scale.data_ptr<T>();
+ auto* scale_data = scale.data_ptr<data_t>();
  auto scale_stride = scale.stride(0);

  for (int64_t i = 0; i < numel; i++) {
```
Do you know why THBlas_axpy is not used here? One can specify a scale for it.
Not sure about the history..
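For context, axpy is the BLAS primitive y ← a·x + y, which is why it could absorb the per-sample scale in the loop above; a minimal list-based sketch (purely illustrative, the real routine also takes strides so it can walk rows in place):

```python
def axpy(a, x, y):
    # y <- a * x + y, elementwise
    return [a * xi + yi for xi, yi in zip(x, y)]

# scale 2 applied to x, accumulated into y
print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))  # → [12.0, 24.0]
```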
```diff
  checkContiguous("embedding_bag", offsets_arg);

  Tensor offset2bag_;
  if (indices.numel() != 0 && offset2bag.numel() == 0) {
    offset2bag_ = at::zeros(
-       {indices.sizes()[0] + 1}, indices.options()); // offset2bag = [0 0 0 0 0]
+       {indices.sizes()[0] + 1}, offsets.options()); // offset2bag = [0 0 0 0 0]
```
Is there any reason to replace indices.options() with offsets.options()? They should be of the same type.
I think offset2bag_ is closer to offsets (dims, type).
```diff
- index_grad_weight.select(1, dim).index_add_(
-     0, nonempty_max_indices.select(1, dim), nonempty_grad.select(1, dim));
+ //index_grad_weight.select(1, dim).index_add_(
+ //    0, nonempty_max_indices.select(1, dim), nonempty_grad.select(1, dim));
```
If the plan is to use index_add_ again once it supports int32 indices, can you please add a TODO?
Will add.
test/test_nn.py (Outdated)

```diff
@@ -11593,16 +11596,17 @@ def _embedding_bag_reference_impl(self, input, weight, offsets=None, mode='sum',
         bags.append(embeddings.narrow(0, offset, length).max(0)[0])
     return torch.stack(bags)

+    @dtypes(torch.int, torch.long)
     def test_EmbeddingBag_empty_per_sample_weights_and_offsets(self, device):
```
If you give tuples to the @dtypes decorator, you can specify the index types and weight types via the decorator rather than via loops in the test, so it's slightly cleaner:

```python
@dtypes(*itertools.product((torch.int, torch.long), (torch.float, torch.double)))
def test_Embedding(self, device, dtypes):
    # dtypes here is an iterable of dtypes; access elements via dtypes[0] and dtypes[1]
```
Thanks for the suggestion. Will update.
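The decorator suggestion above boils down to parametrizing each test over the cross product of index and weight dtypes; the enumeration itself is just itertools.product (the dtype names below are illustrative placeholder strings, not torch objects):

```python
import itertools

index_dtypes = ("torch.int", "torch.long")
weight_dtypes = ("torch.float", "torch.double")

# Each pair becomes one parametrized test case; inside the test,
# dtypes[0] would be the index dtype and dtypes[1] the weight dtype.
for dtypes in itertools.product(index_dtypes, weight_dtypes):
    print(dtypes)
```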
test/test_nn.py (Outdated)

```python
test_backward = wdtype is not torch.float

self._test_EmbeddingBag(device, 'sum', True, wdtype, dtype, test_backward=test_backward)
self._test_EmbeddingBag(device, 'mean', True, wdtype, dtype, test_backward=test_backward)

@dtypesIfCUDA(torch.half, torch.float, torch.double)
```
This does not look right: on CUDA, dtype will be half, float, and double instead of integral. Same comment as above about specifying tuples of dtypes.
Thanks for the catch. Will fix.
Force-pushed 8a9ed33 to 040d76b
Force-pushed 040d76b to 5df1338
Force-pushed 5df1338 to 68b1c11
Force-pushed 68b1c11 to 14f1f20
Epic, thanks a lot for working on it!
aten/src/ATen/native/Embedding.cpp (Outdated)

```diff
  // TODO: use tensor.index() after improving perf
  if (indices.dim() == 1) {
-   return weight.index_select(0, indices);
+   return weight.index_select(0, indices.to(kLong));
```
Haha, maybe we should make index_select accept int32 too, but that can wait till another day :)
```diff
      at::empty({0, num_features}, dense_options),
      weight_size);
  }

  auto index = indices.reshape({1, -1});
  auto values = grad.reshape({-1, num_features});
- return at::_sparse_coo_tensor_unsafe(index, values, weight_size);
+ return at::_sparse_coo_tensor_unsafe(index.to(kLong), values, weight_size);
```
Hm, so we're still always doing int64 in the backward? Are there concerns with doing int32? Sparse tensors should support it. Or is the worry that we'd need to update the optimizers too?
It doesn't matter; the indices of a sparse tensor will be converted to Long anyway.
I tried to use int32 initially and found that SparseTensor, or at least some SparseTensor utilities, assume int64 indices: https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/SparseTensorImpl.cpp.html#90
```cpp
offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
  AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
    auto index_data = offsets.contiguous().data_ptr<index_t>();
```
offsets.contiguous() returns a temporary that will be deleted; you should keep a local variable for it.
```cpp
  }
});
offset2bag[0] -= 1;
offset2bag = offset2bag.cumsum(0, offset2bag.scalar_type());
```
Nit: since you're manually writing this code, you could fold the cumsum calculation directly into the loop above; the offsets are always sorted.
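Folding the cumsum in as suggested works because sorted offsets delimit contiguous bags, so each bag id can be written directly; a hypothetical plain-Python sketch of the single-pass variant:

```python
def make_offset2bag_folded(offsets, num_indices):
    # Sorted offsets mean bag b owns the half-open range
    # [offsets[b], offsets[b+1]); write the bag id directly,
    # with no separate scatter + cumsum passes.
    offset2bag = [0] * num_indices
    ends = list(offsets[1:]) + [num_indices]
    for bag, (start, end) in enumerate(zip(offsets, ends)):
        for i in range(start, end):
            offset2bag[i] = bag
    return offset2bag

print(make_offset2bag_folded([0, 2, 4], 5))  # → [0, 0, 1, 1, 2]
```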
```cpp
offset2bag[0] -= 1; // offset2bag = [0 0 1 0 1]
offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
  AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
```
@qizzzh The reason the XLA test failed is that data storage is directly accessed here. From native_functions.yaml, _embedding_bag_backward (which calls make_offset2bag) should be device-generic (it cannot assume storage exists). There are two ways to fix this:
- Use at:: functions instead of directly accessing storage in make_offset2bag
- Move the old _embedding_bag_backward implementation to the Math key and add a dispatch section for it in native_functions.yaml. The key point is that after this change, your kernel no longer works for all backends. More details can be found at https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/native#choosing-the-right-dispatch-keyword

Either solution works for me; the first one might be a little less work, I think.
Thanks for the explanation. I guess I'll just make index_add support int32 indices so we can keep the original code. index_add is pretty self-contained, so I hope it will be fine.
Force-pushed 14f1f20 to 450f385
Force-pushed 450f385 to 5f5704e
```diff
@@ -454,7 +460,7 @@ Tensor & index_select_out_cpu_(Tensor & result, const Tensor & self, int64_t dim
  auto numel = index.numel();
  TORCH_CHECK_INDEX(index.dim() <= 1, "index_select(): Index is supposed to be a vector");
- TORCH_CHECK(index.scalar_type() == ScalarType::Long, "index_select(): Expected dtype int64 for index");
+ TORCH_CHECK(index.scalar_type() == ScalarType::Long || index.scalar_type() == ScalarType::Int, "index_select(): Expected dtype int64 for index");
```
Update the string to include int32 as well?
This looks good. Can you please file an issue to enable int support for index_add and index_select on CUDA?
Force-pushed 5f5704e to 85cbf0a
```cpp
      scalar_t *self_ip = self_ptr + self_i * self_stride;
      *self_ip += *(source_ptr + i * source_stride);
    }
AT_DISPATCH_INDEX_TYPES(index.scalar_type(), "index_add_cpu_", [&] () {
```
I'm hitting some Windows build failures here: https://app.circleci.com/pipelines/github/pytorch/pytorch/233453/workflows/e2a13402-bc72-474f-89a7-6c1f0132cdd0/jobs/8597430/parallel-runs/0/steps/0-104
It seems variables are not captured correctly in the nested lambda. I also found reports of similar issues online: https://developercommunity.visualstudio.com/content/problem/604920/visual-c-1613-visibility-error-in-nested-lambda-in.html
Hi @peterjc123! @ngimel suggested that you might have a solution/workaround for this issue. Would you mind taking a look? Thank you in advance!
Explicitly capturing the variables could work around it.
Force-pushed 85cbf0a to 3df4651
Force-pushed 3df4651 to 8c6554e
Summary:
Pull Request resolved: pytorch#46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.

Test Plan: unit tests

Differential Revision: D24470808

fbshipit-source-id: c468b9678b10ef937c645a14c84e3094b8a1c7ba
Force-pushed 8c6554e to 07607c1
This pull request has been merged in 0ec717c.
Summary:
It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
Tensor indexing utilities all assume indices should be int64. I'd like to address that in a follow-up diff, as this one is already pretty big.
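As a plain-Python illustration of the sum-mode semantics being extended (a hypothetical sketch in the spirit of the test suite's _embedding_bag_reference_impl; int32 vs int64 is moot here, since Python ints are untyped):

```python
def embedding_bag_sum(weight, indices, offsets):
    # weight: rows of the embedding table; indices: flattened lookups;
    # offsets: sorted bag start positions into indices.
    bags = []
    ends = list(offsets[1:]) + [len(indices)]
    for start, end in zip(offsets, ends):
        acc = [0.0] * len(weight[0])
        for idx in indices[start:end]:
            acc = [a + w for a, w in zip(acc, weight[idx])]
        bags.append(acc)
    return bags

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# two bags: indices [0, 2] and [1]
print(embedding_bag_sum(weight, [0, 2, 1], [0, 2]))  # → [[2.0, 1.0], [0.0, 1.0]]
```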
Test Plan: unit tests
Differential Revision: D24470808