Support int32 indices and offsets in nn.EmbeddingBag #46758
Conversation
This pull request was exported from Phabricator. Differential Revision: D24470808
💊 CI failures summary (as of commit 07607c1, via Dr. CI):
XLA failure: job pytorch_xla_linux_bionic_py3_6_clang9_test is failing.
Force-pushed 5da8558 to 8a9ed33
Thank you for working on this; this is a commonly needed feature.
This generally looks good. Can you please update the documentation to indicate that indices/offsets can now be int, and that they have to be of the same type?
aten/src/ATen/native/Embedding.cpp (Outdated)

```cpp
  }
}
AT_DISPATCH_INDEX_TYPES(indices.scalar_type(), "embedding_dense_backward_cpu", [&] () {
  auto indices_contig = indices.contiguous();
```
It's better to put the parts of the code that don't depend on the dispatch type outside of the dispatch macro, so that they don't get compiled twice (e.g. indices.contiguous(), indices.numel(), grad_.contiguous(), etc.). It also better separates what actually depends on the indices datatype.
aten/src/ATen/native/Embedding.cpp (Outdated)

```cpp
if (numel > 1000) {
  // The strategy is to parallelize over sections of the vocabulary, so that
  // thread 1 handles updates to gradWeight[0..nVocab/nThreads]. Every thread
  // has to traverse the entire input, but the dominating factor is the axpy
```
Looks like this comment was copy-pasted from somewhere else; there's no axpy in the parallel section? I know it's been this way since before this PR.
Good catch. Let me remove it.
aten/src/ATen/native/Embedding.cpp (Outdated)

```cpp
auto num_indices = indices.numel();
auto data_ptr = indices_contig.data_ptr<index_t>();
auto sorted_indices = std::vector<index_t>(data_ptr, data_ptr + num_indices);
std::sort(sorted_indices.begin(), sorted_indices.end(), std::less<index_t>());
```
std::less is the default comparator; no need to specify it explicitly.
```cpp
offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
  auto ones = at::ones_like(offsets, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
  AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
```
Why aren't you using index_add here? You don't want to convert offsets to long? In that case you also don't need the ones tensor; just add literal ones.
> Don't want to convert offsets to long

Yes. Would you suggest keeping the original code with offsets.long(), or the current one with literal ones? The latter seems like it may have better perf.
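For reference, the offset2bag construction being debated can be sketched in plain Python (a hypothetical, list-based sketch of the scatter-ones-then-cumsum strategy; the real code operates on tensors):

```python
def make_offset2bag(offsets, num_indices):
    # Mark each bag start with a 1 (the index_add_/ones step), then a
    # running sum turns the start markers into per-index bag ids.
    # Inline values assume offsets = [0, 2, 4] and num_indices = 5.
    offset2bag = [0] * num_indices
    for off in offsets:
        offset2bag[off] += 1  # [1, 0, 1, 0, 1]
    offset2bag[0] -= 1        # [0, 0, 1, 0, 1]
    total = 0
    for i, v in enumerate(offset2bag):
        total += v            # cumsum
        offset2bag[i] = total
    return offset2bag         # [0, 0, 1, 1, 2]

print(make_offset2bag([0, 2, 4], 5))  # → [0, 0, 1, 1, 2]
```

With literal ones scattered at the offsets, no auxiliary ones tensor is needed, which is the point of the suggestion above.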
```diff
  auto numel = add_indices.numel();
  int64_t ddim = src.size(1);
  auto src_stride0 = src.stride(0);
  auto src_stride1 = src.stride(1);
  auto output_stride0 = output.stride(0);
  auto output_stride1 = output.stride(1);

- auto* scale_data = scale.data_ptr<T>();
+ auto* scale_data = scale.data_ptr<data_t>();
  auto scale_stride = scale.stride(0);

  for (int64_t i = 0; i < numel; i++) {
```
Do you know why THBlas_axpy is not used here? One can specify a scale for it.
Not sure about the history..
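For context, axpy is the BLAS primitive y ← a·x + y, which is why it could absorb the per-sample scale in the loop above; a minimal list-based sketch (purely illustrative, the real routine also takes strides so it can walk rows in place):

```python
def axpy(a, x, y):
    # y <- a * x + y, elementwise
    return [a * xi + yi for xi, yi in zip(x, y)]

# scale 2 applied to x, accumulated into y
print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))  # → [12.0, 24.0]
```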
```diff
  checkContiguous("embedding_bag", offsets_arg);

  Tensor offset2bag_;
  if (indices.numel() != 0 && offset2bag.numel() == 0) {
    offset2bag_ = at::zeros(
-       {indices.sizes()[0] + 1}, indices.options()); // offset2bag = [0 0 0 0 0]
+       {indices.sizes()[0] + 1}, offsets.options()); // offset2bag = [0 0 0 0 0]
```
Is there any reason to replace indices.options() with offsets.options()? They should be of the same type.
I think offset2bag_ is closer to offsets (dims, type).
```diff
- index_grad_weight.select(1, dim).index_add_(
-     0, nonempty_max_indices.select(1, dim), nonempty_grad.select(1, dim));
+ //index_grad_weight.select(1, dim).index_add_(
+ //    0, nonempty_max_indices.select(1, dim), nonempty_grad.select(1, dim));
```
If the plan is to use index_add_ again once it supports int32 indices, can you please add a TODO?
Will add.
test/test_nn.py (Outdated)

```diff
@@ -11593,16 +11596,17 @@ def _embedding_bag_reference_impl(self, input, weight, offsets=None, mode='sum',
         bags.append(embeddings.narrow(0, offset, length).max(0)[0])
     return torch.stack(bags)

+    @dtypes(torch.int, torch.long)
     def test_EmbeddingBag_empty_per_sample_weights_and_offsets(self, device):
```
If you give tuples to the @dtypes decorator, you can specify the index types and weight types via the decorator rather than via loops in the test, so it's slightly cleaner:

```python
@dtypes(*itertools.product((torch.int, torch.long), (torch.float, torch.double)))
def test_Embedding(self, device, dtypes):
    # dtypes here is an iterable of dtypes; access elements via dtypes[0] and dtypes[1]
```
Thanks for the suggestion. Will update.
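The decorator suggestion above boils down to parametrizing each test over the cross product of index and weight dtypes; the enumeration itself is just itertools.product (the dtype names below are illustrative placeholder strings, not torch objects):

```python
import itertools

index_dtypes = ("torch.int", "torch.long")
weight_dtypes = ("torch.float", "torch.double")

# Each pair becomes one parametrized test case; inside the test,
# dtypes[0] would be the index dtype and dtypes[1] the weight dtype.
for dtypes in itertools.product(index_dtypes, weight_dtypes):
    print(dtypes)
```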
test/test_nn.py (Outdated)

```python
test_backward = wdtype is not torch.float

self._test_EmbeddingBag(device, 'sum', True, wdtype, dtype, test_backward=test_backward)
self._test_EmbeddingBag(device, 'mean', True, wdtype, dtype, test_backward=test_backward)

@dtypesIfCUDA(torch.half, torch.float, torch.double)
```
This does not look right: on CUDA, dtype will be half, float, and double instead of integral. Same comment as above about specifying tuples of dtypes.
Thanks for the catch. Will fix.
Force-pushed 8a9ed33 to 040d76b
Force-pushed 040d76b to 5df1338
Force-pushed 5df1338 to 68b1c11
Force-pushed 68b1c11 to 14f1f20
Epic, thanks a lot for working on it!
aten/src/ATen/native/Embedding.cpp (Outdated)

```diff
  // TODO: use tensor.index() after improving perf
  if (indices.dim() == 1) {
-   return weight.index_select(0, indices);
+   return weight.index_select(0, indices.to(kLong));
```
Haha, maybe we should make index_select accept int32 too, but that can wait till another day :)
```diff
      at::empty({0, num_features}, dense_options),
      weight_size);
  }

  auto index = indices.reshape({1, -1});
  auto values = grad.reshape({-1, num_features});
- return at::_sparse_coo_tensor_unsafe(index, values, weight_size);
+ return at::_sparse_coo_tensor_unsafe(index.to(kLong), values, weight_size);
```
Hm, so we're still always doing int64 in the backward? Are there concerns with doing int32? Sparse tensors should support it. Or is the worry that we'd need to update the optimizers too?
It doesn't matter; the indices of a sparse tensor will be converted to Long anyway.
I tried to use int32 initially and found that SparseTensor, or at least some SparseTensor utilities, assume int64 indices: https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/SparseTensorImpl.cpp.html#90
```cpp
offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
  AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
    auto index_data = offsets.contiguous().data_ptr<index_t>();
```
offsets.contiguous() returns a temporary that will be deleted; you should keep a local variable for it.
```cpp
  }
});
offset2bag[0] -= 1;
offset2bag = offset2bag.cumsum(0, offset2bag.scalar_type());
```
Nit: since you're manually writing this code, you could fold the cumsum calculation directly into the loop above; the offsets are always sorted.
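Folding the cumsum in as suggested works because sorted offsets delimit contiguous bags, so each bag id can be written directly; a hypothetical plain-Python sketch of the single-pass variant:

```python
def make_offset2bag_folded(offsets, num_indices):
    # Sorted offsets mean bag b owns the half-open range
    # [offsets[b], offsets[b+1]); write the bag id directly,
    # with no separate scatter + cumsum passes.
    offset2bag = [0] * num_indices
    ends = list(offsets[1:]) + [num_indices]
    for bag, (start, end) in enumerate(zip(offsets, ends)):
        for i in range(start, end):
            offset2bag[i] = bag
    return offset2bag

print(make_offset2bag_folded([0, 2, 4], 5))  # → [0, 0, 1, 1, 2]
```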
```cpp
offset2bag[0] -= 1; // offset2bag = [0 0 1 0 1]
offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
  AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
```
@qizzzh The reason the XLA test failed is that data storage is directly accessed here. From native_functions.yaml, _embedding_bag_backward (which calls make_offset2bag) should be device-generic (it cannot assume storage exists). There are two ways to fix this:
- Use at:: functions instead of directly accessing storage in make_offset2bag
- Move the old _embedding_bag_backward implementation to the Math key and add a dispatch section for it in native_functions.yaml. The key point is that after this change, your kernel no longer works for all backends. More details can be found at https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/native#choosing-the-right-dispatch-keyword

Either solution works for me; the first one might be a little less work, I think.
Thanks for the explanation. I guess I'll just make index_add support int32 indices so we can keep the original code. index_add is pretty self-contained, so I hope it will be fine.
Force-pushed 14f1f20 to 450f385
Force-pushed 450f385 to 5f5704e
```diff
@@ -454,7 +460,7 @@ Tensor & index_select_out_cpu_(Tensor & result, const Tensor & self, int64_t dim
  auto numel = index.numel();
  TORCH_CHECK_INDEX(index.dim() <= 1, "index_select(): Index is supposed to be a vector");
- TORCH_CHECK(index.scalar_type() == ScalarType::Long, "index_select(): Expected dtype int64 for index");
+ TORCH_CHECK(index.scalar_type() == ScalarType::Long || index.scalar_type() == ScalarType::Int, "index_select(): Expected dtype int64 for index");
```
Update the string to include int32 as well?
This looks good. Can you please file an issue to enable int support for index_add and index_select on CUDA?
Force-pushed 5f5704e to 85cbf0a
```cpp
      scalar_t *self_ip = self_ptr + self_i * self_stride;
      *self_ip += *(source_ptr + i * source_stride);
    }
AT_DISPATCH_INDEX_TYPES(index.scalar_type(), "index_add_cpu_", [&] () {
```
I'm hitting some Windows build failures here: https://app.circleci.com/pipelines/github/pytorch/pytorch/233453/workflows/e2a13402-bc72-474f-89a7-6c1f0132cdd0/jobs/8597430/parallel-runs/0/steps/0-104
It seems variables are not captured correctly in the nested lambda. I also found reports of similar issues online: https://developercommunity.visualstudio.com/content/problem/604920/visual-c-1613-visibility-error-in-nested-lambda-in.html
Hi @peterjc123! @ngimel suggested that you might have a solution/workaround for this issue. Would you mind taking a look? Thank you in advance!
Explicitly capturing the variables could work around it.
Force-pushed 85cbf0a to 3df4651
Force-pushed 3df4651 to 8c6554e
Summary:
Pull Request resolved: pytorch#46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.

Test Plan: unit tests

Differential Revision: D24470808

fbshipit-source-id: c468b9678b10ef937c645a14c84e3094b8a1c7ba
Force-pushed 8c6554e to 07607c1
This pull request has been merged in 0ec717c.
Summary:
It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
Tensor indexing utilities all assume indices should be int64. I'd like to address that in a follow-up diff, as this one is already pretty big.
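As a plain-Python illustration of the sum-mode semantics being extended (a hypothetical sketch in the spirit of the test suite's _embedding_bag_reference_impl; int32 vs int64 is moot here, since Python ints are untyped):

```python
def embedding_bag_sum(weight, indices, offsets):
    # weight: rows of the embedding table; indices: flattened lookups;
    # offsets: sorted bag start positions into indices.
    bags = []
    ends = list(offsets[1:]) + [len(indices)]
    for start, end in zip(offsets, ends):
        acc = [0.0] * len(weight[0])
        for idx in indices[start:end]:
            acc = [a + w for a, w in zip(acc, weight[idx])]
        bags.append(acc)
    return bags

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# two bags: indices [0, 2] and [1]
print(embedding_bag_sum(weight, [0, 2, 1], [0, 2]))  # → [[2.0, 1.0], [0.0, 1.0]]
```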
Test Plan: unit tests
Differential Revision: D24470808