Normalize DLPack stride to 1 where shape < 2 #83158

Closed · wants to merge 6 commits
Changes from 5 commits
17 changes: 15 additions & 2 deletions aten/src/ATen/DLConvertor.cpp
@@ -209,6 +209,7 @@ struct ATenDLMTensor {
};

void deleter(DLManagedTensor* arg) {
delete [] arg->dl_tensor.strides;
delete static_cast<ATenDLMTensor*>(arg->manager_ctx);
}

@@ -227,12 +228,24 @@ DLManagedTensor* toDLPack(const Tensor& src) {
atDLMTensor->tensor.dl_tensor.device = getDLDevice(src, device_id);
atDLMTensor->tensor.dl_tensor.ndim = src.dim();
atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src);
// Normalize the strides to 1 wherever shape < 2
// gh-83069
auto shape = src.sizes().data();
int64_t *strides = new int64_t[src.dim()];
Reviewer
Do you guys have something like a MAX_DIM concept? Perhaps you could add a slot strides[MAX_DIM] in the ATenDLMTensor structure to store the strides. That way you can implement this whole thing with a single new allocation.

int64_t *strides = atDLMTensor->strides;

PS: This is just a suggestion for an alternative implementation. I'm totally fine with things as they are now.
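
For context, here is a minimal sketch of that alternative, assuming a fixed upper bound on tensor rank; the kMaxDims constant, its value, and the struct name are made up for illustration and are not taken from this PR:

```cpp
#include <ATen/ATen.h>
#include <ATen/dlpack.h>

// Hypothetical variant of the manager struct: reserve a fixed-size strides
// slot so toDLPack() needs no second heap allocation and deleter() does not
// have to free a separate strides array. Only the first src.dim() entries
// would ever be written or read.
constexpr int kMaxDims = 64;  // assumed bound, standing in for MAX_DIMS

struct ATenDLMTensorWithStrides {
  at::Tensor handle;            // keeps the underlying storage alive
  DLManagedTensor tensor{};
  int64_t strides[kMaxDims]{};  // normalized strides stored inline
};
```

With something like this, the normalization loop would write into the inline array and dl_tensor.strides would point at it, at the cost of always carrying kMaxDims * sizeof(int64_t) bytes per exported tensor.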

Collaborator Author
There is a concept of MAX_DIMS, but I am not sure src.sizes().data() is forced to be an int64_t[MAX_DIMS]. If it is shorter, arbitrary memory would be copied into ATenDLMTensor.strides.

Reviewer
> I am not sure the src.sizes.data() is forced to be a int64_t[MAX_DIMS]

It does not have to be. It does not matter.

> If it is shorter, arbitrary memory would be copied into the ATenDLMTensor.strides.

No, simply because the for loop below fills up to src.dim() entries in the stride buffer. Any entries beyond src.dim() will be unused.

The real issue is whether you are willing to pay the extra 1 KiB needed to store the MAX_DIMS strides vs saving one new allocation call.

Am I making myself clear enough?

Collaborator Author
Ahh right, I misunderstood. I will let other reviewers weigh in on whether they want to avoid the new call.

Contributor
I don't really care about perf here; I just want to minimize manual memory management. Probably the slickest way to do this is to create a new Tensor that has the normalized strides and then point the DLPack fields at that tensor's data directly. WDYT?

Contributor
Yeah, as_strided is the ticket here.

Collaborator Author
In order to use as_strided I need to allocate an int vector for the new strides, right? I am not sure what advantage creating a new container gives over using the existing capsule deallocation mechanism.

Contributor
As I said, the benefit is that you don't have to manually memory-manage strides separately from the Tensor object. The Tensor is the owning object that is managed, and everything else hangs off it.
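
For illustration, a rough sketch of that approach, assuming at::Tensor::as_strided with (sizes, strides) arguments; the helper name is hypothetical and the code actually merged may differ in detail:

```cpp
#include <ATen/ATen.h>

// Sketch: build a view whose strides are already normalized, keep that view
// as the owning handle inside ATenDLMTensor, and point dl_tensor.shape /
// dl_tensor.strides at the view's own size/stride buffers. There is no
// separately allocated strides array, so the deleter stays as it was.
static at::Tensor normalize_strides_view(const at::Tensor& src) {
  auto shape = src.sizes();
  auto strides = src.strides().vec();  // mutable copy of the strides
  for (int64_t i = 0; i < src.dim(); i++) {
    if (shape[i] < 2) {
      strides[i] = 1;  // stride is irrelevant for dims with 0 or 1 elements
    }
  }
  return src.as_strided(shape, strides);
}
```

The view keeps the original storage alive, so the Tensor remains the single owning object and everything else hangs off it, as suggested above.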

Collaborator Author
Done (assuming CI passes)

for (int i = 0; i < src.dim(); i++) {
  if (shape[i] < 2) {
    strides[i] = 1;
  } else {
    strides[i] = src.strides()[i];
  }
}
  atDLMTensor->tensor.dl_tensor.shape =
      // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
-     const_cast<int64_t*>(src.sizes().data());
+     const_cast<int64_t*>(shape);
  atDLMTensor->tensor.dl_tensor.strides =
      // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
-     const_cast<int64_t*>(src.strides().data());
+     const_cast<int64_t*>(strides);
atDLMTensor->tensor.dl_tensor.byte_offset = 0;
return &(atDLMTensor->tensor);
}
193 changes: 193 additions & 0 deletions test/test_dlpack.py
@@ -0,0 +1,193 @@
# -*- coding: utf-8 -*-
# Owner(s): ["module: tests"]

import torch
from torch.testing import make_tensor
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import (
instantiate_device_type_tests, onlyCUDA, dtypes, skipMeta,
onlyNativeDeviceTypes)
from torch.testing._internal.common_dtype import all_types_and_complex_and
from torch.utils.dlpack import from_dlpack, to_dlpack


class TestTorchDlPack(TestCase):
exact_dtype = True

@skipMeta
@onlyNativeDeviceTypes
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_dlpack_capsule_conversion(self, device, dtype):
# DLPack does not explicitly support bool (xref dmlc/dlpack#75)
x = make_tensor((5,), dtype=dtype, device=device)
z = from_dlpack(to_dlpack(x))
self.assertEqual(z, x)

@skipMeta
@onlyNativeDeviceTypes
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_dlpack_protocol_conversion(self, device, dtype):
x = make_tensor((5,), dtype=dtype, device=device)
z = from_dlpack(x)
self.assertEqual(z, x)

@skipMeta
@onlyNativeDeviceTypes
def test_dlpack_shared_storage(self, device):
x = make_tensor((5,), dtype=torch.float64, device=device)
z = from_dlpack(to_dlpack(x))
z[0] = z[0] + 20.0
self.assertEqual(z, x)

@skipMeta
@onlyCUDA
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_dlpack_conversion_with_streams(self, device, dtype):
# Create a stream where the tensor will reside
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
# Do an operation in the actual stream
x = make_tensor((5,), dtype=dtype, device=device) + 1
# DLPack protocol helps establish a correct stream order
# (hence data dependency) at the exchange boundary.
# DLPack manages this synchronization for us, so we don't need to
# explicitly wait until x is populated
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
z = from_dlpack(x)
stream.synchronize()
self.assertEqual(z, x)

@skipMeta
@onlyNativeDeviceTypes
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_from_dlpack(self, device, dtype):
x = make_tensor((5,), dtype=dtype, device=device)
y = torch.from_dlpack(x)
self.assertEqual(x, y)

@skipMeta
@onlyNativeDeviceTypes
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_from_dlpack_noncontiguous(self, device, dtype):
x = make_tensor((25,), dtype=dtype, device=device).reshape(5, 5)

y1 = x[0]
y1_dl = torch.from_dlpack(y1)
self.assertEqual(y1, y1_dl)

y2 = x[:, 0]
y2_dl = torch.from_dlpack(y2)
self.assertEqual(y2, y2_dl)

y3 = x[1, :]
y3_dl = torch.from_dlpack(y3)
self.assertEqual(y3, y3_dl)

y4 = x[1]
y4_dl = torch.from_dlpack(y4)
self.assertEqual(y4, y4_dl)

y5 = x.t()
y5_dl = torch.from_dlpack(y5)
self.assertEqual(y5, y5_dl)

@skipMeta
@onlyCUDA
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_dlpack_conversion_with_diff_streams(self, device, dtype):
stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()
# DLPack protocol helps establish a correct stream order
# (hence data dependency) at the exchange boundary.
# the `tensor.__dlpack__` method will insert a synchronization event
# in the current stream to make sure that it was correctly populated.
with torch.cuda.stream(stream_a):
x = make_tensor((5,), dtype=dtype, device=device) + 1
z = torch.from_dlpack(x.__dlpack__(stream_b.cuda_stream))
stream_a.synchronize()
stream_b.synchronize()
self.assertEqual(z, x)

@skipMeta
@onlyNativeDeviceTypes
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_from_dlpack_dtype(self, device, dtype):
x = make_tensor((5,), dtype=dtype, device=device)
y = torch.from_dlpack(x)
assert x.dtype == y.dtype

@skipMeta
@onlyCUDA
def test_dlpack_default_stream(self, device):
class DLPackTensor:
def __init__(self, tensor):
self.tensor = tensor

def __dlpack_device__(self):
return self.tensor.__dlpack_device__()

def __dlpack__(self, stream=None):
if torch.version.hip is None:
assert stream == 1
else:
assert stream == 0
capsule = self.tensor.__dlpack__(stream)
return capsule

# CUDA-based tests run on non-default streams
with torch.cuda.stream(torch.cuda.default_stream()):
x = DLPackTensor(make_tensor((5,), dtype=torch.float32, device=device))
from_dlpack(x)

@skipMeta
@onlyNativeDeviceTypes
@dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16))
def test_dlpack_tensor_invalid_stream(self, device, dtype):
with self.assertRaises(TypeError):
x = make_tensor((5,), dtype=dtype, device=device)
x.__dlpack__(stream=object())

@skipMeta
def test_dlpack_error_on_bool_tensor(self):
x = torch.tensor([True], dtype=torch.bool)
with self.assertRaises(RuntimeError):
to_dlpack(x)

# TODO: add interchange tests once NumPy 1.22 (dlpack support) is required
@skipMeta
def test_dlpack_export_requires_grad(self):
x = torch.zeros(10, dtype=torch.float32, requires_grad=True)
with self.assertRaisesRegex(RuntimeError, r"require gradient"):
x.__dlpack__()

@skipMeta
def test_dlpack_export_is_conj(self):
x = torch.tensor([-1 + 1j, -2 + 2j, 3 - 3j])
y = torch.conj(x)
with self.assertRaisesRegex(RuntimeError, r"conjugate bit"):
y.__dlpack__()

@skipMeta
def test_dlpack_export_non_strided(self):
x = torch.sparse_coo_tensor([[0]], [1], size=(1,))
y = torch.conj(x)
with self.assertRaisesRegex(RuntimeError, r"strided"):
y.__dlpack__()

@skipMeta
def test_dlpack_normalize_strides(self):
Collaborator
For other reviewers: this is the new test (the rest of the tests are just moved code).

x = torch.rand(16)
y = x[::3][:1]
self.assertEqual(y.shape, (1,))
self.assertEqual(y.stride(), (3,))
z = from_dlpack(y)
self.assertEqual(z.shape, (1,))
# gh-83069, make sure __dlpack__ normalizes strides
self.assertEqual(z.stride(), (1,))


instantiate_device_type_tests(TestTorchDlPack, globals())

if __name__ == '__main__':
run_tests()