[NestedTensor] Remove tensor buffer, replace with Storage type #82757
Conversation
Dr. CI: ✅ No failures (8 pending) as of commit b493a2d.
Force-pushed from 8396495 to 95bccd4.
@@ -387,7 +388,7 @@ __host__ std::tuple<Tensor, Tensor, Tensor> transform_bias_rescale_qkv_cuda(
       const auto input_dim = sizes.sizes()[1];
       TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input_dim == 1);
       if (aligned &&
-          ((reinterpret_cast<intptr_t>(nt_qkv->get_buffer().data_ptr()) %
+          ((reinterpret_cast<intptr_t>(qkv.data_ptr()) %
Since we store a real storage_ on the NestedTensorImpl, we should be able to call data_ptr() directly.
@@ -376,6 +376,7 @@ __host__ std::tuple<Tensor, Tensor, Tensor> transform_bias_rescale_qkv_cuda(
   }
   if (qkv.is_nested()) {
     auto* nt_qkv = get_nested_tensor_impl(qkv);
+    const at::Tensor& nt_qkv_buffer = nt_qkv->get_buffer();
@swolchok The goal of this PR is to swap out the buffer_ that is stored on NestedTensor for a Storage. Calling packed_accessor64 should in theory be possible for NestedTensor, but there are some size checks that currently fail. get_buffer now returns a regular at::Tensor instead of a const at::Tensor&. This tensor gets constructed at function execution from the underlying storage of the NestedTensorImpl. The change here seems a little like a compiler hack, since the version I was trying to call gets deleted:

pytorch/aten/src/ATen/core/TensorBase.h, line 562 in 1164c83:
PackedTensorAccessor64<T,N,PtrTraits> packed_accessor64() && = delete;

It seems to work, but I am not sure if I am somehow footgunning here.
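For reference, a minimal sketch of how that deleted rvalue overload bites (hypothetical demo function; mirrors the deleted declaration quoted above):

```cpp
#include <ATen/ATen.h>
#include <ATen/NestedTensorImpl.h>

void demo(const at::native::NestedTensorImpl* nt_qkv) {
  // get_buffer() now returns by value, so the call expression is an
  // rvalue, and the rvalue overload of packed_accessor64 is deleted:
  //   auto acc = nt_qkv->get_buffer().packed_accessor64<float, 1>();  // error
  // Binding the result to a reference (or a named value) makes it an
  // lvalue, so the lvalue overload is selected instead:
  const at::Tensor& buffer = nt_qkv->get_buffer();
  auto acc = buffer.packed_accessor64<float, 1>();
  (void)acc;
}
```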
This is fine, but note that it's the same as if you had written (more clearly, IMO) at::Tensor nt_qkv_buffer = nt_qkv->get_buffer();. It builds because of temporary lifetime extension: https://abseil.io/tips/107
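For illustration, a minimal sketch of the lifetime-extension rule at work; make_buffer() is a hypothetical stand-in for get_buffer():

```cpp
#include <ATen/ATen.h>

// Returns by value, so each call expression yields a temporary.
at::Tensor make_buffer() {
  return at::zeros({4});
}

void demo() {
  // Binding the temporary to a const lvalue reference extends the
  // temporary's lifetime to match the reference's scope.
  const at::Tensor& ref = make_buffer();

  // Equivalent in effect, and arguably clearer: a named value.
  at::Tensor val = make_buffer();

  TORCH_CHECK(ref.numel() == val.numel());
}
```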
-          buffer.dtype(),
-          buffer.device()),
-      buffer_(std::move(buffer)),
+          buffer.dtype()),
Another note: wrap_buffer could instead take a Storage and a size; we could then get the device and dtype from the Storage instance. I am open to this change, but it would add some more bloat since constructor call sites would need to be changed.
I think it's a little cleaner to have the NestedTensorImpl take a storage directly. wrap_buffer() could still take a buffer as a Tensor and pull out the parts to pass to the NestedTensorImpl constructor. cc @albanD / @bdhirsh, opinion here?

> I am open to this change but would add some more bloat since constructor call sites would need to be changed

Are there many call sites? It seems usually helper functions like wrap_buffer() are used instead of the constructor directly.
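For illustration, a rough sketch of that split; the constructor signature here is hypothetical and only approximates the real NestedTensorImpl:

```cpp
#include <ATen/ATen.h>
#include <ATen/NestedTensorImpl.h>

// wrap_buffer() keeps its Tensor-based interface and unpacks the pieces,
// so only the impl's constructor needs to speak Storage.
at::Tensor wrap_buffer(at::Tensor buffer, at::Tensor nested_size_tensor) {
  return at::detail::make_tensor<at::native::NestedTensorImpl>(
      buffer.storage(),    // underlying Storage (device comes with it)
      buffer.key_set(),    // dispatch key set taken from the buffer
      buffer.dtype(),      // dtype metadata
      std::move(nested_size_tensor));
}
```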
Looks like there are 11 calls to at::detail::make_tensor<NestedTensorImpl>. I think a lot of these, though, are by the transformer folks and should probably switch to wrap_buffer.
I think this is a good first step. We can always migrate to an unpacked API later if the current one is confusing people.
Can't speak to perf but changes mostly look like I expect :)
    result_tensors[i] = buffer.as_strided(sizes[i], strides[i], offsets[i]);
  }
  return result_tensors;
}
 Tensor& NestedTensor_relu_(Tensor& self) {
-  at::relu_(const_cast<Tensor&>(get_nested_tensor_impl(self)->get_buffer()));
+  auto buffer = get_nested_tensor_impl(self) -> get_buffer();
Weird spacing around the ->? Also, did you split this into two lines to avoid the cast?
Sorry, yeah, I can unspace them. Since get_buffer no longer returns a const tensor ref, the const_cast is not necessary anymore. When it is a one-liner I get:

Non-const lvalue reference to type 'at::Tensor' cannot bind to a temporary of type 'at::Tensor' clang(lvalue_reference_bind_to_temporary)

So I made the result of get_buffer() not a temporary by binding it to buffer. I figured that since what relu_ is doing under the hood is mutating the values in Storage, this is okay.
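For illustration, a minimal sketch of the failure and the fix; get_buffer_by_value() is a hypothetical stand-in for the new get_buffer():

```cpp
#include <ATen/ATen.h>

// Hypothetical stand-in: like the new get_buffer(), it returns the
// buffer Tensor by value, so each call yields a temporary.
at::Tensor get_buffer_by_value();

void apply_relu_inplace() {
  // The one-liner fails: at::relu_ takes a non-const Tensor&, and a
  // non-const lvalue reference cannot bind to a temporary:
  //   at::relu_(get_buffer_by_value());  // compile error

  // Fix: bind the result to a named variable first. relu_ then mutates
  // the values in the Storage that the buffer Tensor shares.
  auto buffer = get_buffer_by_value();
  at::relu_(buffer);
}
```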
Force-pushed from b19b837 to cb8eab4.
Small nit but SGTM otherwise
LGTM also
Benchmarking Linear with an nt of nested size [[1,1]] and a weight of size [1,1], in order to profile the new overhead of get_buffer().
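Not the author's actual benchmark, but a rough sketch of that measurement (assumes the at::_nested_tensor_from_tensor_list op is available in this build):

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <iostream>

int main() {
  // Tiny inputs so per-call overhead (e.g. the Tensor that get_buffer()
  // now constructs from the Storage) dominates over actual compute.
  auto nt = at::_nested_tensor_from_tensor_list({at::randn({1, 1})});
  auto weight = at::randn({1, 1});

  constexpr int kIters = 10000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    auto out = at::linear(nt, weight);
  }
  auto end = std::chrono::steady_clock::now();

  std::cout << std::chrono::duration<double, std::micro>(end - start).count() /
                   kIters
            << " us per linear call\n";
  return 0;
}
```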
@pytorchbot merge -l
@pytorchbot successfully started a merge job. Check the current status here.
Hey @drisspg. |
…#82757)
Summary: see the Description, Issue, Testing, and Performance sections below.
Pull Request resolved: #82757
Approved by: https://github.com/albanD, https://github.com/jbschlosser
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/2f9d046d6717d9423e012c5249624fbeeb506cbb
Reviewed By: kit1980
Differential Revision: D38446691
Pulled By: drisspg
fbshipit-source-id: 748d9fff07c79e5c42743542662fa0ed3b46b662
Description
In order to enable NestedTensor views, NestedTensorImpl no longer stores its data in an at::Tensor buffer_; instead, it conforms to the practice of most TensorImpls and uses a Storage class. This change will enable NestedTensor to use the view constructor defined on the base TensorImpl.
Issue
#82671
Testing
The existing nested_tensor tests are used, since this is core functionality; an unsuccessful change would break these tests.
Performance
One change with a potentially large performance impact is that most nested_tensor kernels call get_buffer to get the buffer in Tensor form and perform ops on this buffer. Previously this was free, since we stored the data as a Tensor, but now each kernel must construct a Tensor from the storage. The most performance-critical/heaviest user of nested tensors is BetterTransformer. I would be curious to see whether this change significantly impacts performance for this and other workloads.
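As a sketch of where that per-kernel cost comes from, the new get_buffer() looks roughly like this (approximated from the PR; the dispatch key handling is simplified):

```cpp
// Rather than returning a stored member Tensor, every call materializes
// a fresh 1-D contiguous Tensor that views the NestedTensorImpl's Storage.
at::Tensor NestedTensorImpl::get_buffer() const {
  auto buffer_impl = c10::make_intrusive<c10::TensorImpl>(
      c10::TensorImpl::VIEW,
      c10::Storage(storage_),  // shares the Storage (refcount bump, no copy)
      key_set_,                // simplified: the real code derives a
                               // dense-tensor key set for the buffer
      data_type_);
  const auto buffer_numel =
      static_cast<int64_t>(storage_.nbytes() / data_type_.itemsize());
  buffer_impl->set_sizes_contiguous(c10::makeArrayRef(buffer_numel));
  return at::Tensor(std::move(buffer_impl));
}
```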