
Support int32 indices and offsets in nn.EmbeddingBag #46758

Closed
wants to merge 1 commit

Conversation

@qizzzh commented Oct 23, 2020

Summary:
It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.

Tensor indexing utilities all assume that indices are int64. I'd like to address that in a follow-up diff, as this one is already pretty big.

Test Plan: unit tests

Differential Revision: D24470808

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24470808

dr-ci bot commented Oct 23, 2020

💊 CI failures summary and remediations

As of commit 07607c1 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with the title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.




@ngimel (Collaborator) left a comment

Thank you for working on this; it is a commonly needed feature.
This generally looks good. Can you please update the documentation to indicate that indices/offsets can now be int, and that they have to be of the same type?

}
}
AT_DISPATCH_INDEX_TYPES(indices.scalar_type(), "embedding_dense_backward_cpu", [&] () {
auto indices_contig = indices.contiguous();
Collaborator

It's better to put the parts of the code that don't depend on the dispatch type outside of the dispatch macro, so that they don't get compiled twice (e.g. indices.contiguous(), indices.numel(), grad_.contiguous(), etc.). It also better separates what actually depends on the indices dtype.
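
A minimal sketch of the suggested restructuring, with variable names assumed from the surrounding snippet:

auto indices_contig = indices.contiguous();   // type-independent: compiled once
auto num_indices = indices.numel();
auto grad = grad_.contiguous();
AT_DISPATCH_INDEX_TYPES(indices.scalar_type(), "embedding_dense_backward_cpu", [&] () {
  // only code that actually depends on index_t stays inside the macro
  auto* indices_data = indices_contig.data_ptr<index_t>();
  // ...
});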

if (numel > 1000) {
// The strategy is to parallelize over sections of the vocabulary, so that
// thread 1 handles updates to gradWeight[0..nVocab/nThreads]. Every thread
// has to traverse the entire input, but the dominating factor is the axpy
Collaborator

It looks like this comment was copy-pasted from somewhere else; there's no axpy in the parallel section. I know it was this way before this PR.

Author

Good catch. Let me remove it.

auto num_indices = indices.numel();
auto data_ptr = indices_contig.data_ptr<index_t>();
auto sorted_indices = std::vector<index_t>(data_ptr, data_ptr + num_indices);
std::sort(sorted_indices.begin(), sorted_indices.end(), std::less<index_t>());
Collaborator

std::less is the default comparator, so there's no need to specify it explicitly.
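
For example:

std::sort(sorted_indices.begin(), sorted_indices.end());  // std::less<index_t> is implied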

offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
auto ones = at::ones_like(offsets, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
Collaborator

Why aren't you using index_add here? Is it because you don't want to convert offsets to long? In that case you also don't need the ones tensor; just add literal ones.

Author

Yes, I don't want to convert offsets to long. Would you suggest keeping the original code with offsets.long(), or the current one with literal ones? It seems the latter may have better perf.
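
For reference, a sketch of the manual variant that adds literal ones instead of materializing an at::ones tensor (illustrative only; names follow the snippets above):

AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
  auto offsets_contig = offsets.contiguous();
  const auto* offsets_data = offsets_contig.data_ptr<index_t>();
  auto* offset2bag_data = offset2bag.data_ptr<index_t>();
  for (int64_t i = 0; i < offsets.numel(); i++) {
    offset2bag_data[offsets_data[i]] += 1;  // literal one; no ones tensor needed
  }
});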

auto numel = add_indices.numel();
int64_t ddim = src.size(1);
auto src_stride0 = src.stride(0);
auto src_stride1 = src.stride(1);
auto output_stride0 = output.stride(0);
auto output_stride1 = output.stride(1);

auto* scale_data = scale.data_ptr<T>();
auto* scale_data = scale.data_ptr<data_t>();
auto scale_stride = scale.stride(0);

for (int64_t i = 0; i < numel; i++) {
Collaborator

Do you know why THBlas_axpy is not used here? One can specify scale for it.

Author

Not sure about the history..

checkContiguous("embedding_bag", offsets_arg);

Tensor offset2bag_;
if (indices.numel() != 0 && offset2bag.numel() == 0) {
offset2bag_ = at::zeros(
{indices.sizes()[0] + 1}, indices.options()); // offset2bag = [0 0 0 0 0]
{indices.sizes()[0] + 1}, offsets.options()); // offset2bag = [0 0 0 0 0]
Collaborator

is there any reason to replace indices.options() with offsets.options()? They should be of the same type.

Author

I think offset2bag_ is closer to offsets in terms of dims and type.

index_grad_weight.select(1, dim).index_add_(
0, nonempty_max_indices.select(1, dim), nonempty_grad.select(1, dim));
//index_grad_weight.select(1, dim).index_add_(
// 0, nonempty_max_indices.select(1, dim), nonempty_grad.select(1, dim));
Collaborator

If the intent is to switch back to index_add once it supports int32 indices, can you please add a TODO here?

Author

Will add.

test/test_nn.py Outdated
@@ -11593,16 +11596,17 @@ def _embedding_bag_reference_impl(self, input, weight, offsets=None, mode='sum',
bags.append(embeddings.narrow(0, offset, length).max(0)[0])
return torch.stack(bags)

def test_EmbeddingBag_empty_per_sample_weights_and_offsets(self, device):
@dtypes(torch.int, torch.long)
Collaborator

If you give tuples to the @dtypes decorator, you can specify the indices dtypes and weight dtypes via the decorator instead of via loops in the test, which is slightly cleaner:

@dtypes(*itertools.product((torch.int, torch.long), (torch.float, torch.double)))
def test_Embedding(self, device, dtypes):  # dtypes here is a tuple of dtypes; access its elements via dtypes[0] and dtypes[1]

Author

Thanks for the suggestion. Will update.

test/test_nn.py Outdated
test_backward = wdtype is not torch.float

self._test_EmbeddingBag(device, 'sum', True, wdtype, dtype, test_backward=test_backward)
self._test_EmbeddingBag(device, 'mean', True, wdtype, dtype, test_backward=test_backward)

@dtypesIfCUDA(torch.half, torch.float, torch.double)
Collaborator

This does not look right; on CUDA the dtype will be half, float, and double instead of an integral type. Same comment as above about specifying tuples of dtypes.

Author

Thanks for the catch. Will fix.


@dzhulgakov (Collaborator) left a comment

Epic, thanks a lot for working on it!


// TODO: use tensor.index() after improving perf
if (indices.dim() == 1) {
return weight.index_select(0, indices);
return weight.index_select(0, indices.to(kLong));
Collaborator

Haha, maybe we should make index_select accept int32 too, but that can wait till another day :)

at::empty({0, num_features}, dense_options),
weight_size);
}

auto index = indices.reshape({1, -1});
auto values = grad.reshape({-1, num_features});
return at::_sparse_coo_tensor_unsafe(index, values, weight_size);
return at::_sparse_coo_tensor_unsafe(index.to(kLong), values, weight_size);
Collaborator

Hm, so we're still always using int64 in the backward? Are there concerns with doing int32? Sparse tensors should support it.

Or is the worry that we'd need to update the optimizers too?

Collaborator

It doesn't matter; the indices of a sparse tensor will be converted to Long anyway.

Author

I tried to use int32 initially and found that SparseTensor, or at least some of the SparseTensor utilities, assumes int64 indices - https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/SparseTensorImpl.cpp.html#90

offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
auto index_data = offsets.contiguous().data_ptr<index_t>();
Collaborator

offsets.contiguous() returns a temporary that will be destroyed at the end of the expression; you should hold it in a local variable.
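
A minimal sketch of the fix, keeping the contiguous tensor alive while its data pointer is in use:

auto offsets_contig = offsets.contiguous();            // local keeps the storage alive
auto index_data = offsets_contig.data_ptr<index_t>();  // pointer remains valid below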

}
});
offset2bag[0] -= 1;
offset2bag = offset2bag.cumsum(0, offset2bag.scalar_type());
Collaborator

nit: since you're writing this code manually anyway, you could fold the cumsum calculation directly into the loop above; the offsets are always sorted.
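
For illustration, a sketch of what folding the cumsum into the loop could look like (num_bags, num_indices, and offsets_contig are assumed names; edge cases such as include_last_offset are omitted):

// Write the bag id for each index position directly, relying on offsets being
// sorted, instead of marking bag starts and running cumsum afterwards.
const auto* offsets_data = offsets_contig.data_ptr<index_t>();
auto* offset2bag_data = offset2bag.data_ptr<index_t>();
for (int64_t bag = 0; bag < num_bags; bag++) {
  int64_t end = (bag + 1 < num_bags) ? offsets_data[bag + 1] : num_indices;
  for (int64_t i = offsets_data[bag]; i < end; i++) {
    offset2bag_data[i] = static_cast<index_t>(bag);
  }
}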

offset2bag[0] -= 1; // offset2bag = [0 0 1 0 1]
offset2bag = offset2bag.cumsum(0); // offset2bag = [0 0 1 1 2]
static void make_offset2bag(const Tensor &offsets, Tensor& offset2bag) {
AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "make_offset2bag", [&] () {
Contributor

@qizzzh The reason the XLA test failed is that the data storage is accessed directly here. According to native_functions.yaml, _embedding_bag_backward (which calls make_offset2bag) should be device-generic (it cannot assume storage exists). There are two ways to fix this:

  1. Use at:: functions instead of directly accessing storage in make_offset2bag.
  2. Move the old _embedding_bag_backward implementation to the Math key and add a dispatch section for it in native_functions.yaml. The key point is that after this change your kernel no longer works for all backends. More details can be found at https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/native#choosing-the-right-dispatch-keyword

Either solution works for me; the first one is probably a little less work.

Author

Thanks for the explanation. I guess I'll just make index_add support int32 indices so we can keep the original code. index_add is pretty self-contained, so I hope it will be fine.
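
For context, this is roughly the original at::-based make_offset2bag that can stay once index_add_ accepts int32 indices (pieced together from the snippets above; not the exact final code):

static void make_offset2bag(const Tensor& offsets, Tensor& offset2bag) {
  // Mark the start of every bag, then turn the marks into per-index bag ids.
  offset2bag.index_add_(0, offsets, at::ones_like(offsets, LEGACY_CONTIGUOUS_MEMORY_FORMAT));
  offset2bag[0] -= 1;                  // offset2bag = [0 0 1 0 1]
  offset2bag = offset2bag.cumsum(0);   // offset2bag = [0 0 1 1 2]
}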


@@ -454,7 +460,7 @@ Tensor & index_select_out_cpu_(Tensor & result, const Tensor & self, int64_t dim

auto numel = index.numel();
TORCH_CHECK_INDEX(index.dim() <= 1, "index_select(): Index is supposed to be a vector");
TORCH_CHECK(index.scalar_type() == ScalarType::Long, "index_select(): Expected dtype int64 for index");
TORCH_CHECK(index.scalar_type() == ScalarType::Long || index.scalar_type() == ScalarType::Int, "index_select(): Expected dtype int64 for index");
Collaborator

Update the string to include int32 as well?
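
For example, the message could read (suggested wording only):

TORCH_CHECK(index.scalar_type() == ScalarType::Long || index.scalar_type() == ScalarType::Int,
            "index_select(): Expected dtype int32 or int64 for index");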

@ngimel (Collaborator) left a comment

This looks good. Can you please file an issue to enable int support for index_add and index_select on CUDA?

torch/_torch_docs.py (resolved)

scalar_t *self_ip = self_ptr + self_i * self_stride;
*self_ip += *(source_ptr + i * source_stride);
}
AT_DISPATCH_INDEX_TYPES(index.scalar_type(), "index_add_cpu_", [&] () {
@qizzzh (Author) commented Nov 2, 2020

I'm hitting some Windows build failures here - https://app.circleci.com/pipelines/github/pytorch/pytorch/233453/workflows/e2a13402-bc72-474f-89a7-6c1f0132cdd0/jobs/8597430/parallel-runs/0/steps/0-104

It seems variables are not captured correctly in the nested lambda. I also found reports of similar issues online - https://developercommunity.visualstudio.com/content/problem/604920/visual-c-1613-visibility-error-in-nested-lambda-in.html

Hi @peterjc123! @ngimel suggested that you might have a solution/workaround for this issue. Would you mind taking a look? Thank you in advance!

Author

Explicitly capturing variables could work around it.
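
A rough sketch of the explicit-capture workaround (index_contig and the pointer/stride names are assumed from the surrounding snippets; the loop structure is simplified):

AT_DISPATCH_INDEX_TYPES(index.scalar_type(), "index_add_cpu_", [&] () {
  const auto* index_data = index_contig.data_ptr<index_t>();
  // List the captures explicitly instead of relying on [&]; MSVC can fail to make
  // implicitly captured variables visible inside a nested lambda.
  auto add_one = [index_data, self_ptr, source_ptr, self_stride, source_stride](int64_t i) {
    auto self_i = index_data[i];
    *(self_ptr + self_i * self_stride) += *(source_ptr + i * source_stride);
  };
  for (int64_t i = 0; i < numel; i++) {
    add_one(i);
  }
});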


Summary:
Pull Request resolved: pytorch#46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.

Test Plan: unit tests

Differential Revision: D24470808

fbshipit-source-id: c468b9678b10ef937c645a14c84e3094b8a1c7ba

@facebook-github-bot (Contributor)

This pull request has been merged in 0ec717c.
