
Conversation

drisspg
Contributor

@drisspg drisspg commented Aug 3, 2022

Description

To enable NestedTensor views, NestedTensorImpl no longer stores its data in an at::Tensor buffer_; instead, it follows the practice of most TensorImpls and uses a Storage class. This change enables NestedTensor to use the view constructor defined on the base TensorImpl.

Issue

#82671

Testing

The existing nested_tensor tests are used, since this change touches core functionality and would break those tests if incorrect.

Performance

One change with a potentially large performance impact is that most nested_tensor kernels call get_buffer to get the buffer in Tensor form and perform ops on this buffer. Previously this was free, since we stored the data as a Tensor, but now each kernel must construct a Tensor from the storage. The most performance-critical/heavy user of nested tensors is BetterTransformer. I would be curious to see whether this change significantly impacts performance for this and other workloads.

@facebook-github-bot
Contributor

facebook-github-bot commented Aug 3, 2022


✅ No Failures (8 Pending)

As of commit b493a2d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@drisspg drisspg requested review from jbschlosser and albanD August 3, 2022 21:32
@drisspg drisspg force-pushed the replace_nested_tensor_buffer_tensor_with_storage branch 2 times, most recently from 8396495 to 95bccd4 Compare August 4, 2022 03:20
@@ -387,7 +388,7 @@ __host__ std::tuple<Tensor, Tensor, Tensor> transform_bias_rescale_qkv_cuda(
const auto input_dim = sizes.sizes()[1];
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input_dim == 1);
if (aligned &&
((reinterpret_cast<intptr_t>(nt_qkv->get_buffer().data_ptr()) %
((reinterpret_cast<intptr_t>(qkv.data_ptr()) %
Contributor Author


Since we now store a real storage_ on the NestedTensorImpl, we can call data_ptr() directly.

@@ -376,6 +376,7 @@ __host__ std::tuple<Tensor, Tensor, Tensor> transform_bias_rescale_qkv_cuda(
}
if (qkv.is_nested()) {
auto* nt_qkv = get_nested_tensor_impl(qkv);
const at::Tensor& nt_qkv_buffer = nt_qkv->get_buffer();
Contributor Author


@swolchok The goal of this PR is to swap out the buffer_ that is stored on NestedTensor for a Storage. Calling packed_accessor64 should in theory be possible for NestedTensor, but there are some size checks that currently fail. get_buffer now returns a regular at::Tensor instead of a const at::Tensor&; this tensor gets constructed at function execution from the underlying storage of the NestedTensorImpl. The change here seems a little like a compiler hack, since the version I was trying to call gets deleted:

PackedTensorAccessor64<T,N,PtrTraits> packed_accessor64() && = delete;

It seems to work, but I am not sure if I am somehow footgunning here.

Contributor


this is fine, but note that it's the same as if you had written (more clearly IMO) at::Tensor nt_qkv_buffer = nt_qkv->get_buffer(); . It builds because of temporary lifetime extension -- https://abseil.io/tips/107

buffer.dtype(),
buffer.device()),
buffer_(std::move(buffer)),
buffer.dtype()),
Contributor Author

@drisspg drisspg Aug 4, 2022


Another note: wrap_buffer could instead take a Storage and a size; we can then get the device and dtype from the Storage instance. I am open to this change, but it would add some more bloat, since constructor call sites would need to be changed.

Contributor


I think it's a little cleaner to have the NestedTensorImpl take a storage directly. wrap_buffer() could still take a buffer as a Tensor and pull out the parts to pass to the NestedTensorImpl constructor. cc @albanD / @bdhirsh opinion here?

I am open to this change but would add some more bloat since constructor call sites would need to be changed

Are there many call sites? It seems usually helper functions like wrap_buffer() are used instead of the constructor directly

Contributor Author


Looks like there are 11 calls to at::detail::make_tensor<NestedTensorImpl>.
I think a lot of these, though, are from the transformer folks and should probably switch to wrap_buffer.

Collaborator


I think this is a good first step. We can always migrate to an unpacked API later if the current one is confusing people.

@drisspg drisspg added module: nestedtensor NestedTensor tag see issue #25032 release notes: nested tensor Changes that have a direct impact on nested tensors ciflow/trunk Trigger trunk jobs on your pull request labels Aug 4, 2022
Contributor

@jbschlosser jbschlosser left a comment


Can't speak to perf but changes mostly look like I expect :)


result_tensors[i] = buffer.as_strided(sizes[i], strides[i], offsets[i]);
}
return result_tensors;
}

Tensor& NestedTensor_relu_(Tensor& self) {
at::relu_(const_cast<Tensor&>(get_nested_tensor_impl(self)->get_buffer()));
auto buffer = get_nested_tensor_impl(self) -> get_buffer();
Collaborator


Weird spacing around the -> ?
Also, did you split this into two lines to avoid the cast?

Contributor Author

@drisspg drisspg Aug 4, 2022


Sorry, yeah, I can unspace them. Since get_buffer no longer returns a const tensor ref, the const_cast is not necessary anymore. When it is a one-liner I get:

Non-const lvalue reference to type 'at::Tensor' cannot bind to a temporary of type 'at::Tensor'clang(lvalue_reference_bind_to_temporary)

So I made the result of get_buffer() not a temporary by binding it to buffer. I figured that since what relu_ is doing under the hood will be mutating the values in Storage, this is okay.

@drisspg drisspg force-pushed the replace_nested_tensor_buffer_tensor_with_storage branch from b19b837 to cb8eab4 Compare August 4, 2022 17:54
Collaborator

@albanD albanD left a comment


Small nit but SGTM otherwise

Contributor

@jbschlosser jbschlosser left a comment


LGTM also

@drisspg
Contributor Author

drisspg commented Aug 4, 2022

Benchmarking Linear with an nt of nested size [[1,1]] and a weight of size [1,1], in order to profile the new overhead of get_buffer()

The blocked_autorange output from running this for 15 seconds on master, before the change to NestedTensorImpl:

Minimal nt linear - Dtype:torch.float16, device:cuda
  Median: 30.40 us
  IQR:    2.11 us (29.75 to 31.86)
  477590 measurements, 1 runs per measurement, 48 threads

Minimal nt linear - Dtype:torch.float32, Device:cpu
  Median: 9.45 us
  IQR:    0.24 us (9.41 to 9.65)
  157 measurements, 10000 runs per measurement, 48 threads

The blocked_autorange output from running this for 15 seconds after the change adding storage to NestedTensorImpl:

Minimal nt linear - Dtype:torch.float16, device:cuda
  Median: 29.95 us
  IQR:    2.15 us (29.49 to 31.64)
  481336 measurements, 1 runs per measurement, 48 threads

Minimal nt linear - Dtype:torch.float32, Device:cpu
  Median: 9.49 us
  IQR:    0.43 us (9.45 to 9.88)
  157 measurements, 10000 runs per measurement, 48 threads

@drisspg
Contributor Author

drisspg commented Aug 4, 2022

@pytorchbot merge -l

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

github-actions bot commented Aug 4, 2022

Hey @drisspg.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 7, 2022
…#82757)

Summary:
### Description
To enable NestedTensor views, NestedTensorImpl no longer stores its data in an at::Tensor buffer_; instead, it follows the practice of most TensorImpls and uses a Storage class. This change enables NestedTensor to use the view constructor defined on the base TensorImpl.

### Issue
#82671

### Testing
The existing nested_tensor tests are used, since this change touches core functionality and would break those tests if incorrect.

### Performance
One change with a potentially large performance impact is that most nested_tensor kernels call `get_buffer` to get the buffer in Tensor form and perform ops on this buffer. Previously this was free, since we stored the data as a Tensor, but now each kernel must construct a Tensor from the storage. The most performance-critical/heavy user of nested tensors is BetterTransformer. I would be curious to see whether this change significantly impacts performance for this and other workloads.

Pull Request resolved: #82757
Approved by: https://github.com/albanD, https://github.com/jbschlosser

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/2f9d046d6717d9423e012c5249624fbeeb506cbb

Reviewed By: kit1980

Differential Revision: D38446691

Pulled By: drisspg

fbshipit-source-id: 748d9fff07c79e5c42743542662fa0ed3b46b662
Labels
ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged module: nestedtensor NestedTensor tag see issue #25032 release notes: nested tensor Changes that have a direct impact on nested tensors