
Conversation


@vfdev-5 vfdev-5 commented Feb 16, 2024

Description:

  • This PR tries to fuse nodes with compatible sizes, for example node1: (s0, s1, s2) and node2: (s0 * s1 * s2,). On main these two nodes cannot be fused because of their different sizes. With this PR we can recompute node2's size, body, etc. using node1's indexing constraints and thus fuse the two nodes (see the sketch below).
  • this should affect only the CPU device
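A minimal sketch of the size-compatibility idea (not the PR's actual implementation; the helper name and the use of sympy here are illustrative assumptions): two pointwise nodes become fusion candidates when the products of their symbolic iteration ranges agree, even though the ranks differ.

```python
import sympy

s0, s1, s2 = sympy.symbols("s0 s1 s2", positive=True, integer=True)

node1_ranges = (s0, s1, s2)     # node iterating over a 3D range
node2_ranges = (s0 * s1 * s2,)  # the same elements viewed as a flat 1D range

def sizes_compatible(ranges_a, ranges_b):
    # Same total element count => node2's body can be re-expressed with
    # node1's (finer-grained) ranges, making the two nodes fusable.
    return sympy.simplify(sympy.prod(ranges_a) - sympy.prod(ranges_b)) == 0

print(sizes_compatible(node1_ranges, node2_ranges))  # True
```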

Example:

from unittest.mock import patch
import torch
from torch._inductor.graph import GraphLowering
from torch._inductor import config


# Force creation of multiple scheduler nodes so there is something to fuse
config.realize_opcount_threshold = 1


@torch.compile(fullgraph=True, dynamic=True)
def fn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    o1 = x * w1.view(1, 1, 1, -1)
    o2 = x * w2.view(1, 1, 1, -1)
    output = o1 + o2
    return output


in_nodes = []
outputs = []
run_node = GraphLowering.run_node

graph_lowering_obj = None

def run_node_alt(self, n):
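    # Capture the GraphLowering instance and record every FX node and its
    # lowered output, so buffers and scheduler nodes can be inspected below.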
    global graph_lowering_obj

    graph_lowering_obj = self
    in_nodes.append(n)
    output = run_node(self, n)
    outputs.append(output)

    return output


x = torch.rand(1, 3, 32, 32)
w1 = torch.randn(32)
w2 = torch.randn(32)

with patch.object(GraphLowering, "run_node", run_node_alt):
    fn(x, w1, w2)

print("graph_lowering_obj.buffers:", graph_lowering_obj.buffers)
print("graph_lowering_obj.scheduler:", graph_lowering_obj.scheduler.nodes)

Output on main:

graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1), SchedulerNode(name='buf2')]

Output on this PR:

graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1_buf2)]

Context:
While working on #120411 (the upsampling bicubic decomposition), I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, it turned out to happen because the buffer's op count goes beyond config.realize_opcount_threshold, so the buffer is realized separately instead of being fused into the main loop.
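One way to spot such an extra loop is to dump the code Inductor generates on CPU (a hedged sketch: `run_and_get_code` is an internal `torch._inductor` test utility, so its location and signature may change between releases, and counting `for(` occurrences is only a rough heuristic for the emitted loop nests):

```python
import torch
from torch._inductor.utils import run_and_get_code

@torch.compile(dynamic=True)
def fn(x, w1, w2):
    return x * w1.view(1, 1, 1, -1) + x * w2.view(1, 1, 1, -1)

x, w1, w2 = torch.rand(1, 3, 32, 32), torch.randn(32), torch.randn(32)

# Returns the eager result plus the generated source files; on CPU the
# C++ kernels are embedded in the generated wrapper code as strings.
result, sources = run_and_get_code(fn, x, w1, w2)
for src in sources:
    print(src.count("for("), "for-loop headers found in this generated file")
```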

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Feb 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120077

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 6a8d136 with merge base bd19d6d:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ref_node = node2 if len(vars1) < len(vars2) else node1
extra_indexing_constraints = get_indexing_ranges_exprs(ref_node)
if extra_indexing_constraints is not None and isinstance(node_to_recomp, SchedulerNode):
    node_to_recomp.recompute_size_and_body(extra_indexing_constraints=extra_indexing_constraints)
A Collaborator commented:

One concern is that retracing the body is fairly expensive, so we'll need to benchmark this to see how it impacts compile time overhead. I don't expect it to be too bad though as scheduler time doesn't usually dominate compilation.
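A rough way to eyeball that overhead (a sketch, not a rigorous benchmark: the cold-start time below covers all of dynamo + inductor, not just the scheduler, and a proper comparison would run it on both branches):

```python
import time
import torch

def fn(x, w1, w2):
    return x * w1.view(1, 1, 1, -1) + x * w2.view(1, 1, 1, -1)

x, w1, w2 = torch.rand(1, 3, 32, 32), torch.randn(32), torch.randn(32)

torch._dynamo.reset()  # ensure we measure a cold compile
compiled = torch.compile(fn, fullgraph=True, dynamic=True)

t0 = time.perf_counter()
compiled(x, w1, w2)  # the first call triggers compilation
print(f"cold compile + first run: {time.perf_counter() - t0:.2f}s")
```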

@vfdev-5 vfdev-5 marked this pull request as ready for review February 22, 2024 08:13
@vfdev-5 vfdev-5 requested a review from peterbell10 February 22, 2024 08:13
@vfdev-5 vfdev-5 changed the title [WIP] Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) Feb 22, 2024
@albanD albanD added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 22, 2024
@vfdev-5 vfdev-5 requested a review from lezcano February 23, 2024 08:50
else:
    self.kernel_group = KernelGroup()

def fuse(self, node1, node2):
A Collaborator commented:

Type the function to make it more readable. Same below.

vfdev-5 (Contributor Author) replied:

well, if you check the can_fuse_* methods here, they are not typed either; I'm just following the trend here :)
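For reference, the annotation being suggested would look roughly like this (a sketch: BaseSchedulerNode and FusedSchedulerNode are real classes in torch/_inductor/scheduler.py, but the exact signature below is an assumption):

```python
from typing import Union

def fuse(
    self,
    node1: "BaseSchedulerNode",
    node2: "BaseSchedulerNode",
) -> Union["FusedSchedulerNode", "BaseSchedulerNode"]:
    ...
```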

@lezcano lezcano requested a review from jgong5 February 23, 2024 15:36
@jgong5 jgong5 left a comment

@vfdev-5 Do you have a performance comparison with and without this PR on the three inductor benchmark suites? I guess my major concern is whether the fusion would always bring a performance benefit. It might change the parallelization scheduling and data access patterns of the node with sizes s0*s1*... cc @chuanqi129 for awareness. We may evaluate its performance impact from our side too.

@vfdev-5 vfdev-5 commented Feb 24, 2024

three inductor benchmark suites

@jgong5 can you please point out which benchmark suites you are thinking of? Thanks!

I'll provide a benchmark of the case I worked on (upsampling bicubic) where this fusion leads to some speed-up.

@jgong5 jgong5 commented Feb 24, 2024

@jgong5 can you please point out which benchmark suites you are thinking of? Thanks!

@vfdev-5 we are tracking the performance with the three benchmark suites: torchbench, huggingface and timm. @chuanqi129 can share the steps for running them to check the performance.

@lezcano lezcano commented Feb 25, 2024

@vfdev-5 run the torchbench suite on the HUD, please.

@chuanqi129 chuanqi129 commented

@jgong5 can you please point out which benchmark suites you are thinking of? Thanks!

@vfdev-5 we are tracking the performance with the three benchmark suites: torchbench, huggingface and timm. @chuanqi129 can share the steps for running them to check the performance.

Hi @vfdev-5 @jgong5, for the Inductor CPU performance dashboard benchmarks, we use this Dockerfile to prepare the test environment with the command `docker build --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} --build-arg PT_REPO=https://github.com/pytorch/pytorch --build-arg PT_COMMIT=xxxxxxx -t pt:test -f Dockerfile --target image .`.
After that, we can use the test script inductor_test.sh to launch the 3 suites (torchbench, huggingface and timm) in multi-thread mode and single-thread mode on CPU together, by running `bash inductor_test.sh` from the pytorch source code root folder. By default it tests float32 static shape default wrapper inference accuracy and performance; we can pass different parameters to this script to test other combinations.
If we want to test a single model, we can directly use inductor_single_run.sh with the proper parameters. Please feel free to let me know if there are any issues.

@lezcano lezcano commented Feb 26, 2024

If CPU benchmarks are not run on the HUD, then it may be better for them to be run by someone from Intel, so that they are run on the relevant hardware, perhaps with core pinning, disabled throttling, with/without hyperthreading, etc. Also, we don't have access to a box with AVX512, for example. @jgong5

@vfdev-5 vfdev-5 commented Feb 26, 2024

[------------------------------------------------------------------------------------------------ Interpolate, cpu ------------------------------------------------------------------------------------------------]
                                                                                                                                                   |  Eager (2.3.0a0)   |  Compiled (2.3.0a0)
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Downsampling to small output image 256x256
- No fusion (8ef4a43a31a)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1.745 (+-0.073)        |         3.598 (+-0.111)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1.749 (+-0.065)        |         3.852 (+-0.140)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2.804 (+-0.131)        |         3.784 (+-0.141)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2.817 (+-0.138)        |         4.134 (+-0.132)

- With fusion (PR 120077, merged as ffd0b4de1d3) -> fusion benefits are invisible
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1.756 (+-0.060)        |         3.574 (+-0.108)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1.753 (+-0.056)        |         3.686 (+-0.123)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2.834 (+-0.135)        |         3.782 (+-0.088)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2.862 (+-0.126)        |         4.130 (+-0.105)

- With extended fusion -> fusion benefits are invisible
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1.768 (+-0.061)        |         3.601 (+-0.108)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1.761 (+-0.075)        |         3.690 (+-0.112)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2.876 (+-0.126)        |         3.719 (+-0.123)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2.859 (+-0.123)        |         4.061 (+-0.131)

Upsampling to large output image 1024x1024
- No fusion (8ef4a43a31a)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)   |        27.097 (+-1.031)       |         56.546 (+-1.339)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)  |        26.972 (+-1.128)       |         66.710 (+-3.071)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)       |        43.651 (+-1.563)       |         61.883 (+-2.002)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)      |        43.777 (+-1.466)       |         80.301 (+-3.199)

- With fusion (PR 120077, merged as ffd0b4de1d3) -> visible only for ac=false and CF
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)   |        26.906 (+-1.212)       |         56.570 (+-1.846)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)  |        26.940 (+-1.122)       |         58.319 (+-1.924)    <---
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)       |        44.541 (+-1.838)       |         61.911 (+-2.280)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)      |        44.513 (+-1.792)       |         80.829 (+-2.849)

- With extended fusion -> visible only for ac=false and CF / CL
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)   |        27.171 (+-1.108)       |         56.638 (+-1.361)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)  |        27.242 (+-1.067)       |         58.309 (+-1.500)    <---
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)       |        44.792 (+-1.700)       |         58.871 (+-1.848)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)      |        44.419 (+-1.432)       |         64.399 (+-1.467)    <---


Times are in milliseconds (ms).

Observations:

  • For CL (channels_last), with both ac (align_corners) = false and ac=true, the last summation buffer is not fused into the main loop for the "with fusion" option
  • For CF (contiguous_format), ac=true, there is no last summation buffer for either the "no fusion" or the "with fusion" option -> there is no speed-up. This is because ac=true has fewer ops than ac=false, so config.realize_opcount_threshold is not reached.

"extended fusion" refers to the following update to this PR to handle channels-last case:

-        c3 = len(vars1) == 1 or len(vars2) == 1
+        c3 = len(vars1) in (1, 2) or len(vars2) in (1, 2)
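For context, a hedged sketch of the kind of check this condition lives in (vars1/vars2 and c3 come from the diff above; everything else is a simplified assumption about `_can_fuse_nodes_with_compatible_ranges`):

```python
import sympy

def ranges_look_compatible(vars1, sizes1, vars2, sizes2):
    # Both nodes must cover the same total number of elements...
    same_numel = sympy.simplify(sympy.prod(sizes1) - sympy.prod(sizes2)) == 0
    # ...and one of them must be (nearly) flat: rank 1 originally, or
    # rank 2 after the channels-last extension shown in the diff.
    c3 = len(vars1) in (1, 2) or len(vars2) in (1, 2)
    return same_numel and c3
```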

@jgong5 jgong5 commented Feb 28, 2024

Observations:

  • For CL (channels_last), with both ac=false and ac=true, the last summation buffer is not fused into the main loop for the "with fusion" option
  • For CF (contiguous_format), ac=true, there is no last summation buffer for either the "no fusion" or the "with fusion" option -> there is no speed-up. This is because ac=true has fewer ops than ac=false, so config.realize_opcount_threshold is not reached.

Thanks for sharing the numbers. @chuanqi129 will check the impact on the inference performance on the three benchmark suites.

@chuanqi129 chuanqi129 commented

Observations:

  • For CL (channels_last), with both ac=false and ac=true, the last summation buffer is not fused into the main loop for the "with fusion" option
  • For CF (contiguous_format), ac=true, there is no last summation buffer for either the "no fusion" or the "with fusion" option -> there is no speed-up. This is because ac=true has fewer ops than ac=false, so config.realize_opcount_threshold is not reached.

Thanks for sharing the numbers. @chuanqi129 will check the impact on the inference performance on the three benchmark suites.

I have verified the FP32 static shape default wrapper inference test with the 3 suites; there is no performance drop. cc: @jgong5 @vfdev-5

@jgong5 jgong5 commented Mar 1, 2024

I have verified the FP32 static shape default wrapper inference test with the 3 suites; there is no performance drop. cc: @jgong5 @vfdev-5

Thanks @chuanqi129 !

@vfdev-5 vfdev-5 commented Mar 1, 2024

@chuanqi129 thanks for the feedback, I wonder whether you have seen any perf improvements on some tests?

@vfdev-5 vfdev-5 requested a review from lezcano March 1, 2024 13:45
@lezcano lezcano left a comment

sgtm. @peterbell10 wanna have a last look?

Also, @vfdev-5 will you implement the more generalised version of this algorithm after this PR is merged?

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased cpp-nodes-fusion onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cpp-nodes-fusion && git pull --rebase)

@vfdev-5 vfdev-5 commented Mar 5, 2024

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased cpp-nodes-fusion onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cpp-nodes-fusion && git pull --rebase)

    return ReasonFusedNodes.COMPATIBLE_REDUCTION
if self._can_fuse_nodes_with_compatible_ranges(node1, node2):
    return ReasonFusedNodes.COMPATIBLE_RANGES_NO_REDUCTION
# TODO(jansel): allow fusion pointwise (vars1, ()) suffix?
A Collaborator commented:

this comment is no longer relevant :)

@vfdev-5 vfdev-5 commented Mar 6, 2024

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@vfdev-5 vfdev-5 deleted the cpp-nodes-fusion branch March 6, 2024 12:27
pytorchmergebot pushed a commit that referenced this pull request Mar 24, 2024
…mpatible ranges (#122420)

Fixes #122283

Description:

PR #120077 introduced cpp nodes fusion with compatible ranges under the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: #122420
Approved by: https://github.com/lezcano
pytorchmergebot pushed a commit that referenced this pull request Mar 29, 2024
Supersedes #104248

Description:
- Fixed support for uint8 for upsample bicubic2d decomposition (on `main` results are wrong, so we can tolerate the slowdown)
- Added missing clamp(0, 1) for xscale and yscale
  - slowdown for f32 on CPU. The PR on nodes fusion on CPU (#120077) can help for upsampling cases with align_corners=true
  - the slowdown is mainly due to the added clamp op, and is partially reduced by using torch.stack in the weights computation on CPU.
- Removed lowering implementation

Benchmarks:
```
[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                   |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |        613.029 (+-1.590)        |         5477.608 (+-9.027)         |           3060.314 (+-12.368)           |     0.559 (+-0.000)      |          608.735 (+-6.336)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |        610.176 (+-1.428)        |        5718.503 (+-11.203)         |           3424.022 (+-12.836)           |     0.599 (+-0.000)      |          604.781 (+-6.229)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        325.001 (+-0.840)        |        6183.029 (+-10.893)         |            3275.032 (+-7.625)           |     0.530 (+-0.000)      |          325.693 (+-1.067)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        325.855 (+-1.108)        |        6391.394 (+-11.552)         |            3533.410 (+-7.666)           |     0.553 (+-0.000)      |          325.838 (+-1.457)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       2521.533 (+-14.857)       |        5025.217 (+-13.415)         |            2814.304 (+-6.742)           |     0.560 (+-0.000)      |         2520.308 (+-10.796)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       2531.204 (+-12.534)       |        5294.925 (+-11.994)         |            3147.590 (+-6.808)           |     0.594 (+-0.000)      |         2521.228 (+-11.732)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |        758.352 (+-10.362)       |        5639.912 (+-14.495)         |            3014.123 (+-8.799)           |     0.534 (+-0.000)      |          756.114 (+-4.792)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |        758.712 (+-5.781)        |         5927.541 (+-9.982)         |            3249.555 (+-7.226)           |     0.548 (+-0.000)      |          757.719 (+-5.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       1524.469 (+-12.860)       |        34321.641 (+-80.310)        |           19373.714 (+-56.351)          |     0.564 (+-0.000)      |         1518.082 (+-49.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       1521.746 (+-13.780)       |        35949.711 (+-81.010)        |           21782.366 (+-68.938)          |     0.606 (+-0.000)      |         1467.911 (+-15.901)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |        712.311 (+-5.361)        |        38826.510 (+-92.267)        |           20762.314 (+-59.303)          |     0.535 (+-0.000)      |          712.669 (+-4.673)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |        715.060 (+-4.757)        |        40269.353 (+-92.543)        |           22402.114 (+-81.574)          |     0.556 (+-0.000)      |          716.001 (+-8.945)

      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |       2331.889 (+-29.159)       |        21541.096 (+-72.346)        |           12181.194 (+-45.288)          |     0.565 (+-0.000)      |         2304.864 (+-21.351)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |       2333.697 (+-10.066)       |        22514.154 (+-57.798)        |           21709.449 (+-98.307)          |     0.964 (+-0.000)      |         2302.141 (+-13.041)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        1198.768 (+-5.364)       |       37652.371 (+-101.644)        |           42740.413 (+-98.571)          |     1.135 (+-0.000)      |          1197.104 (+-7.225)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        1196.851 (+-5.118)       |       39678.341 (+-173.750)        |           46807.738 (+-92.744)          |     1.180 (+-0.000)      |          1189.322 (+-5.681)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       10020.978 (+-54.855)      |        19955.290 (+-71.891)        |           11420.521 (+-53.179)          |     0.572 (+-0.000)      |         9999.583 (+-61.230)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       10066.441 (+-62.700)      |       21058.334 (+-183.414)        |           19986.577 (+-65.304)          |     0.949 (+-0.000)      |         10018.672 (+-59.188)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |       3171.135 (+-14.635)       |        19687.864 (+-54.320)        |           23313.699 (+-57.391)          |     1.184 (+-0.000)      |         3182.191 (+-17.686)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |       3181.314 (+-13.784)       |        20224.468 (+-50.827)        |          30541.963 (+-381.385)          |     1.510 (+-0.000)      |         3183.578 (+-16.203)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       5879.450 (+-31.551)       |       136918.555 (+-480.320)       |          77723.568 (+-331.766)          |     0.568 (+-0.000)      |         5726.061 (+-87.517)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       5882.869 (+-30.325)       |       143378.094 (+-513.842)       |         137244.074 (+-4827.730)         |     0.957 (+-0.000)      |         5727.679 (+-22.164)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |       2674.937 (+-45.003)       |      244829.360 (+-1930.579)       |         271283.073 (+-2243.245)         |     1.108 (+-0.000)      |         2676.054 (+-24.632)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |       2676.217 (+-16.601)       |      248658.668 (+-2904.952)       |         296514.520 (+-2983.281)         |     1.192 (+-0.000)      |         2682.844 (+-19.886)

      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1768.437 (+-6.294)       |        2934.013 (+-28.870)         |            2520.649 (+-6.797)           |     0.859 (+-0.000)      |          1759.292 (+-5.097)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1748.660 (+-5.550)       |         3271.104 (+-7.557)         |            2891.306 (+-7.632)           |     0.884 (+-0.000)      |          1746.341 (+-5.845)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2813.150 (+-6.656)       |         3258.973 (+-7.543)         |            2766.286 (+-6.473)           |     0.849 (+-0.000)      |          2805.077 (+-7.611)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2812.102 (+-8.211)       |         3568.780 (+-9.018)         |            3125.870 (+-7.324)           |     0.876 (+-0.000)      |          2834.178 (+-9.034)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |        1687.975 (+-9.527)       |         2752.085 (+-9.627)         |            2373.274 (+-7.888)           |     0.862 (+-0.000)      |          1698.782 (+-8.098)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |        1696.606 (+-8.678)       |        3056.317 (+-13.303)         |           2699.160 (+-10.638)           |     0.883 (+-0.000)      |         1684.942 (+-10.519)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |        2613.491 (+-9.769)       |        3176.493 (+-13.366)         |            2730.193 (+-9.573)           |     0.859 (+-0.000)      |          2625.085 (+-9.943)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       2614.946 (+-34.129)       |        3465.398 (+-11.165)         |           3044.396 (+-11.447)           |     0.879 (+-0.000)      |          2627.355 (+-9.608)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |       10784.549 (+-58.181)      |        18292.452 (+-59.344)        |           15909.922 (+-49.864)          |     0.870 (+-0.000)      |         10837.656 (+-51.947)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |       10786.513 (+-52.308)      |        20449.038 (+-56.204)        |           18295.997 (+-54.522)          |     0.895 (+-0.000)      |         10843.751 (+-44.781)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |       17532.699 (+-64.807)      |        20425.699 (+-80.271)        |           17517.040 (+-79.705)          |     0.858 (+-0.000)      |         17595.597 (+-61.870)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |       17530.816 (+-55.131)      |        22450.080 (+-92.899)        |           19827.828 (+-77.649)          |     0.883 (+-0.000)      |         17615.934 (+-71.716)

      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |       6875.484 (+-40.543)       |        11569.509 (+-62.462)        |          10053.350 (+-208.136)          |     0.869 (+-0.000)      |         6864.501 (+-46.747)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |       6843.126 (+-44.498)       |        12915.236 (+-60.654)        |          25335.058 (+-382.640)          |     1.962 (+-0.000)      |         6899.002 (+-46.861)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |       11103.418 (+-51.318)      |        28834.389 (+-78.395)        |          37405.463 (+-581.646)          |     1.297 (+-0.000)      |         11223.012 (+-60.709)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |       11092.994 (+-70.835)      |       36597.023 (+-118.988)        |           45761.267 (+-85.051)          |     1.250 (+-0.000)      |         11104.014 (+-61.288)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |       7106.791 (+-63.666)       |        11191.071 (+-45.402)        |           9786.037 (+-75.781)           |     0.874 (+-0.000)      |         7129.419 (+-77.674)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |       7146.519 (+-28.376)       |        12443.571 (+-39.425)        |           20147.067 (+-74.771)          |     1.619 (+-0.000)      |         7179.622 (+-64.847)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |       10533.849 (+-44.227)      |       34814.909 (+-138.127)        |          42803.001 (+-114.326)          |     1.229 (+-0.000)      |         10644.039 (+-59.681)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       10548.910 (+-44.221)      |       42876.940 (+-146.959)        |          49711.443 (+-139.276)          |     1.159 (+-0.000)      |         10652.375 (+-44.174)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |      42814.521 (+-103.198)      |       73100.489 (+-435.262)        |          63587.659 (+-134.266)          |     0.870 (+-0.000)      |        43208.921 (+-195.287)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |      42812.373 (+-103.870)      |       81769.160 (+-373.369)        |         175159.813 (+-2028.558)         |     2.142 (+-0.000)      |         43007.691 (+-96.358)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |      69955.505 (+-373.373)      |      215248.616 (+-2040.775)       |         267511.246 (+-2094.161)         |     1.243 (+-0.000)      |        70382.679 (+-594.941)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |      69852.157 (+-490.076)      |      242841.484 (+-19645.513)      |         317931.678 (+-2016.498)         |     1.309 (+-0.000)      |        70074.819 (+-352.919)

Times are in microseconds (us).

[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                     |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |         97.727 (+-0.018)        |          97.765 (+-0.025)          |             97.773 (+-0.027)            |     1.000 (+-0.000)      |           97.905 (+-0.040)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.615 (+-0.066)        |          97.332 (+-0.032)          |             97.950 (+-0.026)            |     1.006 (+-0.000)      |           97.690 (+-0.062)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        100.635 (+-0.033)        |         125.883 (+-0.020)          |            102.499 (+-0.116)            |     0.814 (+-0.000)      |          101.103 (+-0.027)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        100.898 (+-0.036)        |         109.717 (+-0.336)          |            102.558 (+-0.120)            |     0.935 (+-0.000)      |          101.642 (+-0.105)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |        462.853 (+-0.028)        |         382.475 (+-0.047)          |            382.472 (+-0.033)            |     1.000 (+-0.000)      |          462.188 (+-0.014)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |        462.783 (+-0.021)        |         382.806 (+-0.037)          |            382.563 (+-0.043)            |     0.999 (+-0.000)      |          462.089 (+-0.028)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        466.721 (+-0.022)        |         384.438 (+-0.027)          |            384.886 (+-0.037)            |     1.001 (+-0.000)      |          467.014 (+-0.025)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        466.993 (+-0.032)        |         384.212 (+-0.009)          |            383.946 (+-0.029)            |     0.999 (+-0.000)      |          466.575 (+-0.020)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        190.070 (+-0.082)        |         209.353 (+-1.096)          |            202.870 (+-0.888)            |     0.969 (+-0.000)      |          189.371 (+-0.164)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        190.021 (+-0.018)        |         210.504 (+-0.456)          |            201.814 (+-0.770)            |     0.959 (+-0.000)      |          189.314 (+-0.036)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        188.860 (+-0.207)        |         336.635 (+-0.023)          |            252.026 (+-0.510)            |     0.749 (+-0.000)      |          188.860 (+-0.170)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        188.725 (+-0.214)        |         276.329 (+-0.563)          |            251.439 (+-0.524)            |     0.910 (+-0.000)      |          188.776 (+-0.189)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        781.879 (+-0.086)        |         836.389 (+-7.177)          |            816.483 (+-6.626)            |     0.976 (+-0.000)      |          781.362 (+-0.106)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        781.824 (+-0.099)        |         840.406 (+-7.111)          |            807.530 (+-6.514)            |     0.961 (+-0.000)      |          781.307 (+-0.129)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        769.290 (+-0.309)        |         675.498 (+-1.537)          |            688.171 (+-4.326)            |     1.019 (+-0.000)      |          769.830 (+-0.222)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        769.240 (+-0.179)        |         675.800 (+-1.113)          |            673.176 (+-1.740)            |     0.996 (+-0.000)      |          769.935 (+-0.171)

Times are in microseconds (us).

```

Pull Request resolved: #120411
Approved by: https://github.com/lezcano
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…120411)


Times are in microseconds (us).

[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                     |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |         97.727 (+-0.018)        |          97.765 (+-0.025)          |             97.773 (+-0.027)            |     1.000 (+-0.000)      |           97.905 (+-0.040)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.615 (+-0.066)        |          97.332 (+-0.032)          |             97.950 (+-0.026)            |     1.006 (+-0.000)      |           97.690 (+-0.062)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        100.635 (+-0.033)        |         125.883 (+-0.020)          |            102.499 (+-0.116)            |     0.814 (+-0.000)      |          101.103 (+-0.027)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        100.898 (+-0.036)        |         109.717 (+-0.336)          |            102.558 (+-0.120)            |     0.935 (+-0.000)      |          101.642 (+-0.105)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |        462.853 (+-0.028)        |         382.475 (+-0.047)          |            382.472 (+-0.033)            |     1.000 (+-0.000)      |          462.188 (+-0.014)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |        462.783 (+-0.021)        |         382.806 (+-0.037)          |            382.563 (+-0.043)            |     0.999 (+-0.000)      |          462.089 (+-0.028)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        466.721 (+-0.022)        |         384.438 (+-0.027)          |            384.886 (+-0.037)            |     1.001 (+-0.000)      |          467.014 (+-0.025)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        466.993 (+-0.032)        |         384.212 (+-0.009)          |            383.946 (+-0.029)            |     0.999 (+-0.000)      |          466.575 (+-0.020)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        190.070 (+-0.082)        |         209.353 (+-1.096)          |            202.870 (+-0.888)            |     0.969 (+-0.000)      |          189.371 (+-0.164)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        190.021 (+-0.018)        |         210.504 (+-0.456)          |            201.814 (+-0.770)            |     0.959 (+-0.000)      |          189.314 (+-0.036)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        188.860 (+-0.207)        |         336.635 (+-0.023)          |            252.026 (+-0.510)            |     0.749 (+-0.000)      |          188.860 (+-0.170)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        188.725 (+-0.214)        |         276.329 (+-0.563)          |            251.439 (+-0.524)            |     0.910 (+-0.000)      |          188.776 (+-0.189)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        781.879 (+-0.086)        |         836.389 (+-7.177)          |            816.483 (+-6.626)            |     0.976 (+-0.000)      |          781.362 (+-0.106)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        781.824 (+-0.099)        |         840.406 (+-7.111)          |            807.530 (+-6.514)            |     0.961 (+-0.000)      |          781.307 (+-0.129)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        769.290 (+-0.309)        |         675.498 (+-1.537)          |            688.171 (+-4.326)            |     1.019 (+-0.000)      |          769.830 (+-0.222)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        769.240 (+-0.179)        |         675.800 (+-1.113)          |            673.176 (+-1.740)            |     0.996 (+-0.000)      |          769.935 (+-0.171)

Times are in microseconds (us).

```
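For context, here is a minimal sketch of how rows like those above could be reproduced with `torch.utils.benchmark`. The harness, shapes, and thread settings below are assumptions for illustration, not the exact script that produced the tables:

```python
import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark

# Assumed repro sketch: measure eager vs. torch.compile for one
# configuration from the tables (bicubic, align_corners=True, no antialias).
def interp(x, osize, align_corners):
    return F.interpolate(x, size=osize, mode="bicubic",
                         align_corners=align_corners, antialias=False)

compiled_interp = torch.compile(interp)

x = torch.rand(1, 3, 500, 400, dtype=torch.float32)  # contiguous_format
compiled_interp(x, (256, 256), True)  # warm up so compilation is excluded

for sub_label, fn in [("Eager", interp), ("Compiled", compiled_interp)]:
    timer = benchmark.Timer(
        stmt="fn(x, (256, 256), True)",
        globals={"fn": fn, "x": x},
        num_threads=1,
        label="Interpolate, cpu",
        sub_label=sub_label,
    )
    print(timer.blocked_autorange(min_run_time=1))
```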

Pull Request resolved: pytorch#120411
Approved by: https://github.com/lezcano
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
…mpatible ranges (#122420)

Fixes #122283

Description:

PR #120077 introduced cpp nodes fusion with compatible ranges under the assumption that all scheduler nodes inside a fused node are the same; however, it turned out that scheduler nodes can have different indexing expressions. This PR fixes that incorrect assumption.
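For illustration, a hedged sketch (assumed shapes and ops, not the actual reproducer from #122283) of how two pointwise scheduler nodes can cover the same ranges yet load the same input through different indexing expressions:

```python
import torch

@torch.compile(dynamic=True)
def fn(x: torch.Tensor) -> torch.Tensor:
    a = x * 2.0      # pointwise node: loads x with contiguous indexing
    b = x.t() * 3.0  # pointwise node over the same element count, but it
                     # loads x through a transposed view, so its load uses
                     # a different indexing expression than the first node
    return a + b.t()

out = fn(torch.rand(64, 128))
```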

Pull Request resolved: #122420
Approved by: https://github.com/lezcano