
Conversation

drisspg
Contributor

@drisspg drisspg commented Aug 3, 2022

Description

To enable NestedTensor views, NestedTensorImpl no longer stores its data in an at::Tensor buffer_; instead, it follows the practice of most TensorImpls and uses a Storage class. This change enables NestedTensor to use the view constructor defined on the base TensorImpl.

Issue

#82671

Testing

The existing nested_tensor tests are used, since this change touches core functionality and would break those tests if incorrect.

Performance

One change with a potentially large performance impact is that most nested_tensor kernels call get_buffer to get the buffer in Tensor form and perform ops on this buffer. Previously this was free, since we stored the data as a Tensor, but now each kernel must construct a Tensor from the storage. The most performance-critical/heavy user of nested tensors is BetterTransformer. I would be curious to see whether this change significantly impacts performance for this and other workloads.

@facebook-github-bot
Contributor

facebook-github-bot commented Aug 3, 2022


✅ No Failures (8 Pending)

As of commit b493a2d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@drisspg drisspg requested review from jbschlosser and albanD August 3, 2022 21:32
@drisspg drisspg force-pushed the replace_nested_tensor_buffer_tensor_with_storage branch 2 times, most recently from 8396495 to 95bccd4 Compare August 4, 2022 03:20
@@ -387,7 +388,7 @@ __host__ std::tuple<Tensor, Tensor, Tensor> transform_bias_rescale_qkv_cuda(
const auto input_dim = sizes.sizes()[1];
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input_dim == 1);
if (aligned &&
((reinterpret_cast<intptr_t>(nt_qkv->get_buffer().data_ptr()) %
((reinterpret_cast<intptr_t>(qkv.data_ptr()) %
Contributor Author


Since we now store a real storage_ on the NestedTensorImpl, we can call data_ptr() directly.

@@ -376,6 +376,7 @@ __host__ std::tuple<Tensor, Tensor, Tensor> transform_bias_rescale_qkv_cuda(
}
if (qkv.is_nested()) {
auto* nt_qkv = get_nested_tensor_impl(qkv);
const at::Tensor& nt_qkv_buffer = nt_qkv->get_buffer();
Contributor Author


@swolchok The goal of this PR is to swap out the buffer_ that is stored on NestedTensor for a Storage. Calling packed_accessor64 should in theory be possible for NestedTensor, but there are some size checks that currently fail. get_buffer now returns a regular at::Tensor instead of a const at::Tensor&; this tensor gets constructed at function execution from the underlying storage of the NestedTensorImpl. The change here seems a little like a compiler hack, since the version I was trying to call gets deleted:

PackedTensorAccessor64<T,N,PtrTraits> packed_accessor64() && = delete;

It seems to work, but I am not sure if I am somehow footgunning here.

Contributor


this is fine, but note that it's the same as if you had written (more clearly IMO) at::Tensor nt_qkv_buffer = nt_qkv->get_buffer(); . It builds because of temporary lifetime extension -- https://abseil.io/tips/107

buffer.dtype(),
buffer.device()),
buffer_(std::move(buffer)),
buffer.dtype()),
Contributor Author

@drisspg drisspg Aug 4, 2022


Another note: wrap_buffer could instead take a Storage and a size; we can then get the device and dtype from the Storage instance. I am open to this change, but it would add some more bloat, since constructor call sites would need to be changed.

Contributor


I think it's a little cleaner to have the NestedTensorImpl take a storage directly. wrap_buffer() could still take a buffer as a Tensor and pull out the parts to pass to the NestedTensorImpl constructor. cc @albanD / @bdhirsh opinion here?

I am open to this change but would add some more bloat since constructor call sites would need to be changed

Are there many call sites? It seems usually helper functions like wrap_buffer() are used instead of the constructor directly

Contributor Author


Looks like there are 11 calls to at::detail::make_tensor<NestedTensorImpl>.
I think a lot of these, though, are from the transformer folks and should probably switch to wrap_buffer.

Collaborator


I think this is a good first step. We can always migrate to an unpacked API later if the current one is confusing people.

@drisspg drisspg added module: nestedtensor NestedTensor tag see issue #25032 release notes: nested tensor Changes that have a direct impact on nested tensors ciflow/trunk Trigger trunk jobs on your pull request labels Aug 4, 2022
Contributor

@jbschlosser jbschlosser left a comment


Can't speak to perf but changes mostly look like I expect :)


result_tensors[i] = buffer.as_strided(sizes[i], strides[i], offsets[i]);
}
return result_tensors;
}

Tensor& NestedTensor_relu_(Tensor& self) {
at::relu_(const_cast<Tensor&>(get_nested_tensor_impl(self)->get_buffer()));
auto buffer = get_nested_tensor_impl(self) -> get_buffer();
Collaborator


Weird spacing around the -> ?
Also, did you split this into two lines to avoid the cast?

Contributor Author

@drisspg drisspg Aug 4, 2022


Sorry, yeah, I can unspace them. Since get_buffer no longer returns a const tensor ref, the const_cast is not necessary anymore. When it is a one-liner I get:

Non-const lvalue reference to type 'at::Tensor' cannot bind to a temporary of type 'at::Tensor'clang(lvalue_reference_bind_to_temporary)

So I made the result of get_buffer() not a temporary by binding it to buffer. I figured that since what relu_ is doing under the hood will be mutating the values in Storage, this is okay.

@drisspg drisspg force-pushed the replace_nested_tensor_buffer_tensor_with_storage branch from b19b837 to cb8eab4 Compare August 4, 2022 17:54
Collaborator

@albanD albanD left a comment


Small nit but SGTM otherwise

Contributor

@jbschlosser jbschlosser left a comment


LGTM also

@drisspg
Contributor Author

drisspg commented Aug 4, 2022

Benchmarking Linear with an nt of nested size [[1,1]] and a weight of size [1,1], in order to profile the new overhead of get_buffer()

The blocked_autorange output from running this for 15 seconds on master, before the change to NestedTensorImpl:

Minimal nt linear - Dtype:torch.float16, device:cuda
  Median: 30.40 us
  IQR:    2.11 us (29.75 to 31.86)
  477590 measurements, 1 runs per measurement, 48 threads

Minimal nt linear - Dtype:torch.float32, Device:cpu
  Median: 9.45 us
  IQR:    0.24 us (9.41 to 9.65)
  157 measurements, 10000 runs per measurement, 48 threads

The blocked_autorange output from running this for 15 seconds after the change adding storage to NestedTensorImpl:

Minimal nt linear - Dtype:torch.float16, device:cuda
  Median: 29.95 us
  IQR:    2.15 us (29.49 to 31.64)
  481336 measurements, 1 runs per measurement, 48 threads

Minimal nt linear - Dtype:torch.float32, Device:cpu
  Median: 9.49 us
  IQR:    0.43 us (9.45 to 9.88)
  157 measurements, 10000 runs per measurement, 48 threads

@drisspg
Contributor Author

drisspg commented Aug 4, 2022

@pytorchbot merge -l

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

github-actions bot commented Aug 4, 2022

Hey @drisspg.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 7, 2022
…#82757)

Summary:
### Description
To enable NestedTensor views, NestedTensorImpl no longer stores its data in an at::Tensor buffer_; instead, it follows the practice of most TensorImpls and uses a Storage class. This change enables NestedTensor to use the view constructor defined on the base TensorImpl.

### Issue
#82671

### Testing
The existing nested_tensor tests are used, since this change touches core functionality and would break those tests if incorrect.

### Performance
One change with a potentially large performance impact is that most nested_tensor kernels call `get_buffer` to get the buffer in Tensor form and perform ops on this buffer. Previously this was free, since we stored the data as a Tensor, but now each kernel must construct a Tensor from the storage. The most performance-critical/heavy user of nested tensors is BetterTransformer. I would be curious to see whether this change significantly impacts performance for this and other workloads.

Pull Request resolved: #82757
Approved by: https://github.com/albanD, https://github.com/jbschlosser

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/2f9d046d6717d9423e012c5249624fbeeb506cbb

Reviewed By: kit1980

Differential Revision: D38446691

Pulled By: drisspg

fbshipit-source-id: 748d9fff07c79e5c42743542662fa0ed3b46b662
Labels
ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged module: nestedtensor NestedTensor tag see issue #25032 release notes: nested tensor Changes that have a direct impact on nested tensors