Port `cat` kernel to structured kernels.
#68640
Conversation
Tracking issue: #55070 [ghstack-poisoned]
⚛️ CI Flow Status
You can add a comment to the PR and tag @pytorchbot with the following commands:

```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

💊 CI failures summary and remediations
As of commit 95b2799 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failure does not appear to be due to upstream breakages:
pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1), Step: "Test"
aten/src/ATen/native/TensorShape.cpp (outdated):

```cpp
bool all_contiguous = true;
bool all_same_dtype = true;
bool all_same_sizes_and_stride = true;
auto memory_format = cat_compute_output_memory_format(tensors);
```
The CPU and CUDA kernels computed the memory format in different ways. Here, I adopted how CUDA used to do it. I'm not sure this is the best way to go about it. Any thoughts?
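For readers following along, here is a rough Python paraphrase of what the CUDA-style rule boils down to (a hedged sketch, not the actual C++ helper; the `suggest` stand-in below is an assumption that glosses over details such as ambiguous strides):

```python
import torch

def cat_output_memory_format(tensors):
    # Rough paraphrase: propagate a memory format only when every input
    # suggests the same one; otherwise fall back to contiguous_format.
    def suggest(t):
        # Stand-in for ATen's suggest_memory_format(); ignores ambiguous cases.
        if t.is_contiguous(memory_format=torch.channels_last):
            return torch.channels_last
        return torch.contiguous_format

    formats = {suggest(t) for t in tensors}
    return formats.pop() if len(formats) == 1 else torch.contiguous_format
```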
cc @ngimel for another opinion. I found the PR that added support for CUDA, and a bit later for CPU. There's no explicit mention of why they differ, so going with the way CUDA does it now seems reasonable to me.
@mruberry helped point me to some discussion about this: #62560 (comment).
It sounds like making CUDA's behavior the general behavior is the right move in this PR. We should also add a test (if one doesn't exist already) that confirms that memory_format behavior is the same across CPU and CUDA.
@mruberry also pointed out another thing we need to fix in this PR: see #64709. Apparently, there are parts of the codebase that call `cat` and expect `cat(out=...)` to resize the output tensor. That will start printing warnings now that `cat` is structured. We should fix all of the places in our codebase that call `cat(out=...)` and expect the resize to happen, so that users don't start seeing warnings they can't fix.
The easiest way to do that is probably to copy the conditions in `resize_output()` that produce the warning, run them directly in the cat meta function, and raise an error instead of a warning. Then fix all the errors that show up in CI.
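A minimal sketch of such a cross-device consistency check might look like the following (hypothetical test code, assuming a CUDA device is available; not the test that actually landed):

```python
import torch

def check_cat_out_memory_format_consistency():
    for device in ("cpu", "cuda"):
        a = torch.randn(2, 3, 4, 5, device=device).contiguous(memory_format=torch.channels_last)
        b = torch.randn(2, 3, 4, 5, device=device).contiguous(memory_format=torch.channels_last)
        out = torch.empty(0, device=device)  # wrong shape: cat must resize it
        res = torch.cat((a, b), out=out)
        # With the unified rule, both devices should propagate the shared
        # channels_last format of the inputs.
        assert res.is_contiguous(memory_format=torch.channels_last), device
```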
The existing cat memory_format test is here:
pytorch/test/test_tensor_creation_ops.py, line 721 (at commit 1d269e8):

```python
def test_cat_out_memory_format(self, device):
```
Thanks @ysiraichi, indeed the CUDA and CPU memory format calculation is inconsistent. I opened issue #63998 (about exactly this) earlier and wanted to work on a fix, but never got around to it because of some new responsibilities until earlier this week. I just came across this PR by chance and see you have already incorporated the change.
So, with this PR, #63998 should also be fixed.
@bdhirsh when I added the existing cat memory format test, I kept the different memory format behaviour in mind. It will now need a slight modification with your update making CPU and CUDA consistent, which should be quite straightforward (Edit: Yukio already covered that!).
```diff
@@ -626,16 +626,22 @@ def test_cat_out(self, device):
         y = torch.randn((4, 6), device=device)

         with self.assertRaisesRegex(
-                RuntimeError, r"unsupported operation:.* input tensor 0"):
+                RuntimeError,
+                r"unsupported operation: some elements of the input tensor and "
```
The error messages changed here because I'm using the `at::assert_no_overlap` function. Should I revert that to show better error messages?
This seems reasonable. The previous error message was also related to overlap:
```python
>>> torch.cat([x, y], out=x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: unsupported operation: the input tensors cannot refer to any of the output memory locations. Found overlap in input tensor 0
```
Yeah. I was a bit worried that not having the number of the input tensor that had an overlap would be bad. :)
```diff
@@ -751,7 +756,7 @@ def test_cat_out_memory_format(self, device):
         res2_cpu = torch.cat((a_cpu, b_cpu), out=out_cpu)

         self.assertTrue(res2_cuda.is_contiguous(memory_format=torch.contiguous_format))
-        self.assertTrue(res2_cpu.is_contiguous(memory_format=torch.channels_last))
+        self.assertTrue(res2_cpu.is_contiguous(memory_format=torch.contiguous_format))
```
Since I'm using only one method for inferring the memory format, CPU and CUDA behave consistently.
Tracking issue: #55070 [ghstack-poisoned]
```yaml
  CPU: _cat_out_cpu
  CUDA: cat_out_cuda
  QuantizedCPU: cat_out_quantized_cpu
```
@zou3519 do you happen to know the history behind why we had both an `at::cat` and an `at::_cat`? I couldn't dig up much of a reason from git blame, although it seems useful to try to kill it as part of this structured kernel port.
I don't know why we have both. I agree that if we can kill `_cat` we should; there doesn't seem to be a need for both `_cat` and `cat` to exist.
```yaml
  dispatch:
    CompositeExplicitAutograd: cat
    SparseCPU, SparseCUDA: cat_sparse
```
cleaning up the sparse logic :)
I guess for consistency we'd like to have an out= variant for sparse kernels, but it looks like one doesn't exist today. It's probably not necessary to try to fix that in this PR.
```diff
@@ -739,8 +745,7 @@ def test_cat_out_memory_format(self, device):
         self.assertTrue(res1_cpu.is_contiguous(memory_format=torch.contiguous_format))

         # Case 2: if out= is not the correct shape then the output it is resized internally
-        # - For the CPU variant the memory format is that of the first tensor
-        # - For the CUDA variant it only propagates memory format if all the tensors have
+        # - For both CPU and CUDA variants, it only propagates memory format if all the tensors have
```
@ngimel, is this change in memory format propagation on CPU an issue? It seems useful to clean this logic up so it's backend-agnostic, but it sounds mildly BC-breaking.
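To make the BC question concrete, here is a hypothetical snippet (based only on the comments in the diff above) showing the one case whose CPU result changes:

```python
import torch

a = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)
b = torch.randn(2, 3, 8, 8)  # default contiguous format
out = torch.empty(0)         # wrong shape, so cat resizes it

res = torch.cat((a, b), out=out)
# Old CPU behavior: the resized output took the memory format of the first
# input, so res would have been channels_last here.
# New (CUDA-style) behavior: a format is propagated only when all inputs share
# it, so res falls back to contiguous_format.
assert res.is_contiguous(memory_format=torch.contiguous_format)
```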
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Tracking issue: #55070 Differential Revision: [D34521686](https://our.internmc.facebook.com/intern/diff/D34521686) [ghstack-poisoned]
Tracking issue: pytorch#55070 ghstack-source-id: cc4bb13cb10c842b31b3ff5e3be2b5360e927d39 Pull Request resolved: pytorch#68640
@ngimel @ezyang @bdhirsh
Benchmark
```python
# Compare wall-clock time (CPU and CUDA) and callgrind instruction counts for torch.cat.
import torch
import torch.utils.benchmark as benchmark

ntensors = 1000
sizes = [
    (1, 2),
    (1, 2, 2),
    (3, 256, 256),
]

results = []
results_callgrind = []

for size in sizes:
    size_str = f"""[{",".join(str(s) for s in size)}]"""

    timer = benchmark.Timer(
        stmt="torch.cat(xs)",
        setup=f"import torch; xs = [torch.rand(*{size}) for _ in range({ntensors})];",
        label="Cat",
        sub_label=size_str,
        description="time (cpu)",
    )
    timer_cuda = benchmark.Timer(
        stmt="torch.cat(xs); torch.cuda.synchronize()",
        setup=f"import torch; xs = [torch.rand(*{size}, device='cuda') for _ in range({ntensors})]; torch.cuda.synchronize()",
        label="Cat",
        sub_label=size_str,
        description="time (cuda)",
    )

    results.append(timer.blocked_autorange(min_run_time=1))
    results.append(timer_cuda.blocked_autorange(min_run_time=1))
    results_callgrind.append(timer.collect_callgrind())

compare = benchmark.Compare(results)
compare.print()

for r in results_callgrind:
    print(r)
```
I would love to start merging this stack, but there are still build failures.

Since you've got the benchmark, I wonder if you can run a little experiment: try NOT materializing in the body of the kernel, and be willing to iterate multiple times instead. It may be that the dynamic allocation is swamping the predictable branches, so we'd rather pump up the instruction count and avoid the dynamic alloc. Would be nice to know one way or another.
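To spell out the experiment in pseudocode terms (a generic Python illustration, not the ATen kernel): "materializing" builds an intermediate container of per-input properties up front, at the cost of a dynamic allocation, while the alternative re-walks the inputs each time a property is needed, trading more instructions for no allocation.

```python
def total_size_materialized(tensors, dim=0):
    # One pass that allocates an intermediate list and then reuses it.
    sizes = [t.size(dim) for t in tensors]  # dynamic allocation happens here
    return sum(sizes), max(sizes)

def total_size_streaming(tensors, dim=0):
    # No intermediate container: iterate over the inputs once per query.
    return (sum(t.size(dim) for t in tensors),
            max(t.size(dim) for t in tensors))
```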
I'm working on the build failures.
Sure! I will do that once CI is green.
@pytorchbot merge this
Hey @ysiraichi. |
This is about to get yanked from the diff train because it breaks internal users. I just need to update the call sites to not use `native::`.
Summary: Tracking issue: #55070 Pull Request resolved: #68640 Approved by: https://github.com/ezyang Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/22a10ce51310e690745accce0910d740b82a1503 Reviewed By: dagitses Differential Revision: D34521686 Pulled By: mehtanirav fbshipit-source-id: 58434529551e9e09939f6e28367316a6a20d7774
Fixed the internal references and a few other linter errors in the internal diff stack.
@ezyang
#68640 broke our build by porting `cat` structured kernels, not sure how CI didn't catch this Differential Revision: [D35780296](https://our.internmc.facebook.com/intern/diff/D35780296/) [ghstack-poisoned]
I'm having a little difficulty interpreting the numbers here. What does the parenthesized number mean? If I go only by the non-parenthesized one, it seems like not materializing is better despite the higher instruction count (what I suspected), and we should use that.
Summary: Pull Request resolved: #76111 #68640 broke our build by porting `cat` structured kernels, not sure how CI didn't catch this ghstack-source-id: 154335722 Test Plan: CI Reviewed By: navahgar, ajyu Differential Revision: D35780296 fbshipit-source-id: 0a262eb06a8d619227e5db10b6a775bf0b2e17c1 (cherry picked from commit aea6fbf)
It's the standard deviation. I left it there just to give a sense of how much the runs varied.
Agreed.
Stack from ghstack (oldest at bottom):
- `Tensor?[]` for structured kernels. #73351
- `Tensor[]` for structured kernels. #73350
- `index.Tensor` to structured kernels. #69607
- `Tensor[]?` for structured kernel codegen. #69606
- Port `cat` kernel to structured kernels. #68640

Tracking issue: #55070
Differential Revision: D34521686