
Conversation


@vfdev-5 vfdev-5 commented Feb 16, 2024

Description:

  • This PR tries to fuse nodes with compatible sizes, for example node1: (s0, s1, s2) and node2: (s0 * s1 * s2,). On main these two nodes cannot be fused because of their different sizes. With this PR we can recompute node2's size, body, etc. using node1's indexing constraints and thus fuse the two nodes (see the sketch below).
  • this should affect only the CPU device
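A minimal sketch of the size-compatibility idea (not the PR's actual implementation; the helper name and the use of sympy here are illustrative assumptions): two pointwise nodes become fusion candidates when the products of their symbolic iteration ranges agree, even though the ranks differ.

```python
import sympy

s0, s1, s2 = sympy.symbols("s0 s1 s2", positive=True, integer=True)

node1_ranges = (s0, s1, s2)     # node iterating over a 3D range
node2_ranges = (s0 * s1 * s2,)  # the same elements viewed as a flat 1D range

def sizes_compatible(ranges_a, ranges_b):
    # Same total element count => node2's body can be re-expressed with
    # node1's (finer-grained) ranges, making the two nodes fusable.
    return sympy.simplify(sympy.prod(ranges_a) - sympy.prod(ranges_b)) == 0

print(sizes_compatible(node1_ranges, node2_ranges))  # True
```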

Example:

from unittest.mock import patch
import torch
from torch._inductor.graph import GraphLowering
from torch._inductor import config


# Force creation of multiple scheduler nodes so there is something to fuse
config.realize_opcount_threshold = 1


@torch.compile(fullgraph=True, dynamic=True)
def fn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    o1 = x * w1.view(1, 1, 1, -1)
    o2 = x * w2.view(1, 1, 1, -1)
    output = o1 + o2
    return output


in_nodes = []
outputs = []
run_node = GraphLowering.run_node

graph_lowering_obj = None

def run_node_alt(self, n):
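    # Capture the GraphLowering instance and record every FX node and its
    # lowered output, so buffers and scheduler nodes can be inspected below.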
    global graph_lowering_obj

    graph_lowering_obj = self
    in_nodes.append(n)
    output = run_node(self, n)
    outputs.append(output)

    return output


x = torch.rand(1, 3, 32, 32)
w1 = torch.randn(32)
w2 = torch.randn(32)

with patch.object(GraphLowering, "run_node", run_node_alt):
    fn(x, w1, w2)

print("graph_lowering_obj.buffers:", graph_lowering_obj.buffers)
print("graph_lowering_obj.scheduler:", graph_lowering_obj.scheduler.nodes)

Output on main:

graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1), SchedulerNode(name='buf2')]

Output on this PR:

graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1_buf2)]

Context:
While working on #120411 (the upsampling bicubic decomposition), I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, it turned out to happen because the buffer's op count goes beyond config.realize_opcount_threshold, so the buffer is realized separately instead of being fused into the main loop.
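One way to spot such an extra loop is to dump the code Inductor generates on CPU (a hedged sketch: `run_and_get_code` is an internal `torch._inductor` test utility, so its location and signature may change between releases, and counting `for(` occurrences is only a rough heuristic for the emitted loop nests):

```python
import torch
from torch._inductor.utils import run_and_get_code

@torch.compile(dynamic=True)
def fn(x, w1, w2):
    return x * w1.view(1, 1, 1, -1) + x * w2.view(1, 1, 1, -1)

x, w1, w2 = torch.rand(1, 3, 32, 32), torch.randn(32), torch.randn(32)

# Returns the eager result plus the generated source files; on CPU the
# C++ kernels are embedded in the generated wrapper code as strings.
result, sources = run_and_get_code(fn, x, w1, w2)
for src in sources:
    print(src.count("for("), "for-loop headers found in this generated file")
```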

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Feb 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120077

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 6a8d136 with merge base bd19d6d:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ref_node = node2 if len(vars1) < len(vars2) else node1
extra_indexing_constraints = get_indexing_ranges_exprs(ref_node)
if extra_indexing_constraints is not None and isinstance(node_to_recomp, SchedulerNode):
    node_to_recomp.recompute_size_and_body(extra_indexing_constraints=extra_indexing_constraints)
A Collaborator commented:

One concern is that retracing the body is fairly expensive, so we'll need to benchmark this to see how it impacts compile time overhead. I don't expect it to be too bad though as scheduler time doesn't usually dominate compilation.
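A rough way to eyeball that overhead (a sketch, not a rigorous benchmark: the cold-start time below covers all of dynamo + inductor, not just the scheduler, and a proper comparison would run it on both branches):

```python
import time
import torch

def fn(x, w1, w2):
    return x * w1.view(1, 1, 1, -1) + x * w2.view(1, 1, 1, -1)

x, w1, w2 = torch.rand(1, 3, 32, 32), torch.randn(32), torch.randn(32)

torch._dynamo.reset()  # ensure we measure a cold compile
compiled = torch.compile(fn, fullgraph=True, dynamic=True)

t0 = time.perf_counter()
compiled(x, w1, w2)  # the first call triggers compilation
print(f"cold compile + first run: {time.perf_counter() - t0:.2f}s")
```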

@vfdev-5 vfdev-5 marked this pull request as ready for review February 22, 2024 08:13
@vfdev-5 vfdev-5 requested a review from peterbell10 February 22, 2024 08:13
@vfdev-5 vfdev-5 changed the title [WIP] Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) Feb 22, 2024
@albanD albanD added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 22, 2024
@vfdev-5 vfdev-5 requested a review from lezcano February 23, 2024 08:50
else:
    self.kernel_group = KernelGroup()

def fuse(self, node1, node2):
A Collaborator commented:

Type the function to make it more readable. Same below.

vfdev-5 (Contributor Author) replied:

well, if you check the can_fuse_* methods here, they are not typed either; I'm just following the trend here :)
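For reference, the annotation being suggested would look roughly like this (a sketch: BaseSchedulerNode and FusedSchedulerNode are real classes in torch/_inductor/scheduler.py, but the exact signature below is an assumption):

```python
from typing import Union

def fuse(
    self,
    node1: "BaseSchedulerNode",
    node2: "BaseSchedulerNode",
) -> Union["FusedSchedulerNode", "BaseSchedulerNode"]:
    ...
```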

@lezcano lezcano requested a review from jgong5 February 23, 2024 15:36
@jgong5 jgong5 left a comment

@vfdev-5 Do you have a performance comparison with and without this PR on the three inductor benchmark suites? I guess my major concern is whether the fusion would always bring a performance benefit. It might change the parallelization scheduling and data access patterns of the node with sizes s0*s1*... cc @chuanqi129 for awareness. We may evaluate its performance impact from our side too.

@vfdev-5 vfdev-5 commented Feb 24, 2024

three inductor benchmark suites

@jgong5 can you please point out which benchmark suites you are thinking of? Thanks!

I'll provide a benchmark of the case I worked on (upsampling bicubic) where this fusion leads to some speed-up.

@jgong5 jgong5 commented Feb 24, 2024

@jgong5 can you please point out which benchmark suites you are thinking of? Thanks!

@vfdev-5 we are tracking the performance with the three benchmark suites: torchbench, huggingface and timm. @chuanqi129 can share the steps for running them to check the performance.

@lezcano lezcano commented Feb 25, 2024

@vfdev-5 run the torchbench suite on the HUD, please.

@chuanqi129 chuanqi129 commented

@jgong5 can you please point out which benchmark suites you are thinking of? Thanks!

@vfdev-5 we are tracking the performance with the three benchmark suites: torchbench, huggingface and timm. @chuanqi129 can share the steps for running them to check the performance.

Hi @vfdev-5 @jgong5, for the Inductor CPU performance dashboard benchmarks, we use this Dockerfile to prepare the test environment with the command `docker build --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} --build-arg PT_REPO=https://github.com/pytorch/pytorch --build-arg PT_COMMIT=xxxxxxx -t pt:test -f Dockerfile --target image .`.
After that, we can use the test script inductor_test.sh to launch the 3 suites (torchbench, huggingface and timm) in multi-thread mode and single-thread mode on CPU together, by running `bash inductor_test.sh` from the pytorch source code root folder. By default it tests float32 static shape default wrapper inference accuracy and performance; we can pass different parameters to this script to test other combinations.
If we want to test a single model, we can directly use inductor_single_run.sh with the proper parameters. Please feel free to let me know if there are any issues.

@lezcano lezcano commented Feb 26, 2024

If CPU benchmarks are not run on the HUD, then it may be better for them to be run by someone from Intel, so that they are run on the relevant hardware, perhaps with core pinning, disabled throttling, with/without hyperthreading, etc. Also, we don't have access to a box with AVX512, for example. @jgong5

@vfdev-5 vfdev-5 commented Feb 26, 2024

[------------------------------------------------------------------------------------------------ Interpolate, cpu ------------------------------------------------------------------------------------------------]
                                                                                                                                                   |  Eager (2.3.0a0)   |  Compiled (2.3.0a0)
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Downsampling to small output image 256x256
- No fusion (8ef4a43a31a)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1.745 (+-0.073)        |         3.598 (+-0.111)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1.749 (+-0.065)        |         3.852 (+-0.140)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2.804 (+-0.131)        |         3.784 (+-0.141)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2.817 (+-0.138)        |         4.134 (+-0.132)

- With fusion (PR 120077, merged as ffd0b4de1d3) -> fusion benefits are invisible
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1.756 (+-0.060)        |         3.574 (+-0.108)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1.753 (+-0.056)        |         3.686 (+-0.123)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2.834 (+-0.135)        |         3.782 (+-0.088)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2.862 (+-0.126)        |         4.130 (+-0.105)

- With extended fusion -> fusion benefits are invisible
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1.768 (+-0.061)        |         3.601 (+-0.108)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1.761 (+-0.075)        |         3.690 (+-0.112)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2.876 (+-0.126)        |         3.719 (+-0.123)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2.859 (+-0.123)        |         4.061 (+-0.131)

Upsampling to large output image 1024x1024
- No fusion (8ef4a43a31a)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)   |        27.097 (+-1.031)       |         56.546 (+-1.339)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)  |        26.972 (+-1.128)       |         66.710 (+-3.071)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)       |        43.651 (+-1.563)       |         61.883 (+-2.002)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)      |        43.777 (+-1.466)       |         80.301 (+-3.199)

- With fusion (PR 120077, merged as ffd0b4de1d3) -> visible only for ac=false and CF
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)   |        26.906 (+-1.212)       |         56.570 (+-1.846)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)  |        26.940 (+-1.122)       |         58.319 (+-1.924)    <---
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)       |        44.541 (+-1.838)       |         61.911 (+-2.280)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)      |        44.513 (+-1.792)       |         80.829 (+-2.849)

- With extended fusion -> visible only for ac=false and CF / CL
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)   |        27.171 (+-1.108)       |         56.638 (+-1.361)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)  |        27.242 (+-1.067)       |         58.309 (+-1.500)    <---
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1024, 1024)       |        44.792 (+-1.700)       |         58.871 (+-1.848)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1024, 1024)      |        44.419 (+-1.432)       |         64.399 (+-1.467)    <---


Times are in milliseconds (ms).

Observations:

  • For CL (channels_last), with both ac (align_corners) = false and ac=true, the last summation buffer is not fused into the main loop for the "with fusion" option
  • For CF (contiguous_format), ac=true, there is no last summation buffer for either the "no fusion" or the "with fusion" option -> there is no speed-up. This is because ac=true has fewer ops than ac=false, so config.realize_opcount_threshold is not reached.

"extended fusion" refers to the following update to this PR to handle channels-last case:

-        c3 = len(vars1) == 1 or len(vars2) == 1
+        c3 = len(vars1) in (1, 2) or len(vars2) in (1, 2)
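For context, a hedged sketch of the kind of check this condition lives in (vars1/vars2 and c3 come from the diff above; everything else is a simplified assumption about `_can_fuse_nodes_with_compatible_ranges`):

```python
import sympy

def ranges_look_compatible(vars1, sizes1, vars2, sizes2):
    # Both nodes must cover the same total number of elements...
    same_numel = sympy.simplify(sympy.prod(sizes1) - sympy.prod(sizes2)) == 0
    # ...and one of them must be (nearly) flat: rank 1 originally, or
    # rank 2 after the channels-last extension shown in the diff.
    c3 = len(vars1) in (1, 2) or len(vars2) in (1, 2)
    return same_numel and c3
```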

@jgong5 jgong5 commented Feb 28, 2024

Observations:

  • For CL (channels_last), with both ac=false and ac=true, the last summation buffer is not fused into the main loop for the "with fusion" option
  • For CF (contiguous_format), ac=true, there is no last summation buffer for either the "no fusion" or the "with fusion" option -> there is no speed-up. This is because ac=true has fewer ops than ac=false, so config.realize_opcount_threshold is not reached.

Thanks for sharing the numbers. @chuanqi129 will check the impact on the inference performance on the three benchmark suites.

@chuanqi129 chuanqi129 commented

Observations:

  • For CL (channels_last), with both ac=false and ac=true, the last summation buffer is not fused into the main loop for the "with fusion" option
  • For CF (contiguous_format), ac=true, there is no last summation buffer for either the "no fusion" or the "with fusion" option -> there is no speed-up. This is because ac=true has fewer ops than ac=false, so config.realize_opcount_threshold is not reached.

Thanks for sharing the numbers. @chuanqi129 will check the impact on the inference performance on the three benchmark suites.

I have verified the FP32 static shape default wrapper inference test with the 3 suites; there is no performance drop. cc: @jgong5 @vfdev-5

@jgong5 jgong5 commented Mar 1, 2024

I have verified the FP32 static shape default wrapper inference test with the 3 suites; there is no performance drop. cc: @jgong5 @vfdev-5

Thanks @chuanqi129 !

@vfdev-5 vfdev-5 commented Mar 1, 2024

@chuanqi129 thanks for the feedback, I wonder whether you have seen any perf improvements on some tests?

@vfdev-5 vfdev-5 requested a review from lezcano March 1, 2024 13:45
@lezcano lezcano left a comment

sgtm. @peterbell10 wanna have a last look?

Also, @vfdev-5 will you implement the more generalised version of this algorithm after this PR is merged?

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased cpp-nodes-fusion onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cpp-nodes-fusion && git pull --rebase)

@vfdev-5 vfdev-5 commented Mar 5, 2024

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased cpp-nodes-fusion onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cpp-nodes-fusion && git pull --rebase)

    return ReasonFusedNodes.COMPATIBLE_REDUCTION
if self._can_fuse_nodes_with_compatible_ranges(node1, node2):
    return ReasonFusedNodes.COMPATIBLE_RANGES_NO_REDUCTION
# TODO(jansel): allow fusion pointwise (vars1, ()) suffix?
A Collaborator commented:

this comment is no longer relevant :)

@vfdev-5 vfdev-5 commented Mar 6, 2024

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@vfdev-5 vfdev-5 deleted the cpp-nodes-fusion branch March 6, 2024 12:27
pytorchmergebot pushed a commit that referenced this pull request Mar 24, 2024
…mpatible ranges (#122420)

Fixes #122283

Description:

PR #120077 introduced cpp nodes fusion with compatible ranges under the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: #122420
Approved by: https://github.com/lezcano
pytorchmergebot pushed a commit that referenced this pull request Mar 29, 2024
Supersedes #104248

Description:
- Fixed support for uint8 for upsample bicubic2d decomposition (on `main` results are wrong, so we can tolerate the slowdown)
- Added missing clamp(0, 1) for xscale and yscale
  - slowdown for f32 on CPU. The PR on nodes fusion on CPU (#120077) can help for upsampling cases with align_corners=true
  - the slowdown is mainly due to the added clamp op, and is partially reduced by using torch.stack in the weights computation on CPU.
- Removed lowering implementation

Benchmarks:
```
[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                   |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |        613.029 (+-1.590)        |         5477.608 (+-9.027)         |           3060.314 (+-12.368)           |     0.559 (+-0.000)      |          608.735 (+-6.336)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |        610.176 (+-1.428)        |        5718.503 (+-11.203)         |           3424.022 (+-12.836)           |     0.599 (+-0.000)      |          604.781 (+-6.229)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        325.001 (+-0.840)        |        6183.029 (+-10.893)         |            3275.032 (+-7.625)           |     0.530 (+-0.000)      |          325.693 (+-1.067)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        325.855 (+-1.108)        |        6391.394 (+-11.552)         |            3533.410 (+-7.666)           |     0.553 (+-0.000)      |          325.838 (+-1.457)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       2521.533 (+-14.857)       |        5025.217 (+-13.415)         |            2814.304 (+-6.742)           |     0.560 (+-0.000)      |         2520.308 (+-10.796)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       2531.204 (+-12.534)       |        5294.925 (+-11.994)         |            3147.590 (+-6.808)           |     0.594 (+-0.000)      |         2521.228 (+-11.732)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |        758.352 (+-10.362)       |        5639.912 (+-14.495)         |            3014.123 (+-8.799)           |     0.534 (+-0.000)      |          756.114 (+-4.792)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |        758.712 (+-5.781)        |         5927.541 (+-9.982)         |            3249.555 (+-7.226)           |     0.548 (+-0.000)      |          757.719 (+-5.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       1524.469 (+-12.860)       |        34321.641 (+-80.310)        |           19373.714 (+-56.351)          |     0.564 (+-0.000)      |         1518.082 (+-49.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       1521.746 (+-13.780)       |        35949.711 (+-81.010)        |           21782.366 (+-68.938)          |     0.606 (+-0.000)      |         1467.911 (+-15.901)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |        712.311 (+-5.361)        |        38826.510 (+-92.267)        |           20762.314 (+-59.303)          |     0.535 (+-0.000)      |          712.669 (+-4.673)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |        715.060 (+-4.757)        |        40269.353 (+-92.543)        |           22402.114 (+-81.574)          |     0.556 (+-0.000)      |          716.001 (+-8.945)

      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |       2331.889 (+-29.159)       |        21541.096 (+-72.346)        |           12181.194 (+-45.288)          |     0.565 (+-0.000)      |         2304.864 (+-21.351)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |       2333.697 (+-10.066)       |        22514.154 (+-57.798)        |           21709.449 (+-98.307)          |     0.964 (+-0.000)      |         2302.141 (+-13.041)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        1198.768 (+-5.364)       |       37652.371 (+-101.644)        |           42740.413 (+-98.571)          |     1.135 (+-0.000)      |          1197.104 (+-7.225)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        1196.851 (+-5.118)       |       39678.341 (+-173.750)        |           46807.738 (+-92.744)          |     1.180 (+-0.000)      |          1189.322 (+-5.681)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       10020.978 (+-54.855)      |        19955.290 (+-71.891)        |           11420.521 (+-53.179)          |     0.572 (+-0.000)      |         9999.583 (+-61.230)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       10066.441 (+-62.700)      |       21058.334 (+-183.414)        |           19986.577 (+-65.304)          |     0.949 (+-0.000)      |         10018.672 (+-59.188)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |       3171.135 (+-14.635)       |        19687.864 (+-54.320)        |           23313.699 (+-57.391)          |     1.184 (+-0.000)      |         3182.191 (+-17.686)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |       3181.314 (+-13.784)       |        20224.468 (+-50.827)        |          30541.963 (+-381.385)          |     1.510 (+-0.000)      |         3183.578 (+-16.203)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       5879.450 (+-31.551)       |       136918.555 (+-480.320)       |          77723.568 (+-331.766)          |     0.568 (+-0.000)      |         5726.061 (+-87.517)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       5882.869 (+-30.325)       |       143378.094 (+-513.842)       |         137244.074 (+-4827.730)         |     0.957 (+-0.000)      |         5727.679 (+-22.164)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |       2674.937 (+-45.003)       |      244829.360 (+-1930.579)       |         271283.073 (+-2243.245)         |     1.108 (+-0.000)      |         2676.054 (+-24.632)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |       2676.217 (+-16.601)       |      248658.668 (+-2904.952)       |         296514.520 (+-2983.281)         |     1.192 (+-0.000)      |         2682.844 (+-19.886)

      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1768.437 (+-6.294)       |        2934.013 (+-28.870)         |            2520.649 (+-6.797)           |     0.859 (+-0.000)      |          1759.292 (+-5.097)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1748.660 (+-5.550)       |         3271.104 (+-7.557)         |            2891.306 (+-7.632)           |     0.884 (+-0.000)      |          1746.341 (+-5.845)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2813.150 (+-6.656)       |         3258.973 (+-7.543)         |            2766.286 (+-6.473)           |     0.849 (+-0.000)      |          2805.077 (+-7.611)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2812.102 (+-8.211)       |         3568.780 (+-9.018)         |            3125.870 (+-7.324)           |     0.876 (+-0.000)      |          2834.178 (+-9.034)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |        1687.975 (+-9.527)       |         2752.085 (+-9.627)         |            2373.274 (+-7.888)           |     0.862 (+-0.000)      |          1698.782 (+-8.098)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |        1696.606 (+-8.678)       |        3056.317 (+-13.303)         |           2699.160 (+-10.638)           |     0.883 (+-0.000)      |         1684.942 (+-10.519)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |        2613.491 (+-9.769)       |        3176.493 (+-13.366)         |            2730.193 (+-9.573)           |     0.859 (+-0.000)      |          2625.085 (+-9.943)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       2614.946 (+-34.129)       |        3465.398 (+-11.165)         |           3044.396 (+-11.447)           |     0.879 (+-0.000)      |          2627.355 (+-9.608)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |       10784.549 (+-58.181)      |        18292.452 (+-59.344)        |           15909.922 (+-49.864)          |     0.870 (+-0.000)      |         10837.656 (+-51.947)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |       10786.513 (+-52.308)      |        20449.038 (+-56.204)        |           18295.997 (+-54.522)          |     0.895 (+-0.000)      |         10843.751 (+-44.781)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |       17532.699 (+-64.807)      |        20425.699 (+-80.271)        |           17517.040 (+-79.705)          |     0.858 (+-0.000)      |         17595.597 (+-61.870)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |       17530.816 (+-55.131)      |        22450.080 (+-92.899)        |           19827.828 (+-77.649)          |     0.883 (+-0.000)      |         17615.934 (+-71.716)

      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |       6875.484 (+-40.543)       |        11569.509 (+-62.462)        |          10053.350 (+-208.136)          |     0.869 (+-0.000)      |         6864.501 (+-46.747)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |       6843.126 (+-44.498)       |        12915.236 (+-60.654)        |          25335.058 (+-382.640)          |     1.962 (+-0.000)      |         6899.002 (+-46.861)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |       11103.418 (+-51.318)      |        28834.389 (+-78.395)        |          37405.463 (+-581.646)          |     1.297 (+-0.000)      |         11223.012 (+-60.709)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |       11092.994 (+-70.835)      |       36597.023 (+-118.988)        |           45761.267 (+-85.051)          |     1.250 (+-0.000)      |         11104.014 (+-61.288)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |       7106.791 (+-63.666)       |        11191.071 (+-45.402)        |           9786.037 (+-75.781)           |     0.874 (+-0.000)      |         7129.419 (+-77.674)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |       7146.519 (+-28.376)       |        12443.571 (+-39.425)        |           20147.067 (+-74.771)          |     1.619 (+-0.000)      |         7179.622 (+-64.847)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |       10533.849 (+-44.227)      |       34814.909 (+-138.127)        |          42803.001 (+-114.326)          |     1.229 (+-0.000)      |         10644.039 (+-59.681)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       10548.910 (+-44.221)      |       42876.940 (+-146.959)        |          49711.443 (+-139.276)          |     1.159 (+-0.000)      |         10652.375 (+-44.174)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |      42814.521 (+-103.198)      |       73100.489 (+-435.262)        |          63587.659 (+-134.266)          |     0.870 (+-0.000)      |        43208.921 (+-195.287)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |      42812.373 (+-103.870)      |       81769.160 (+-373.369)        |         175159.813 (+-2028.558)         |     2.142 (+-0.000)      |         43007.691 (+-96.358)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |      69955.505 (+-373.373)      |      215248.616 (+-2040.775)       |         267511.246 (+-2094.161)         |     1.243 (+-0.000)      |        70382.679 (+-594.941)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |      69852.157 (+-490.076)      |      242841.484 (+-19645.513)      |         317931.678 (+-2016.498)         |     1.309 (+-0.000)      |        70074.819 (+-352.919)

Times are in microseconds (us).

[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                     |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |         97.727 (+-0.018)        |          97.765 (+-0.025)          |             97.773 (+-0.027)            |     1.000 (+-0.000)      |           97.905 (+-0.040)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.615 (+-0.066)        |          97.332 (+-0.032)          |             97.950 (+-0.026)            |     1.006 (+-0.000)      |           97.690 (+-0.062)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        100.635 (+-0.033)        |         125.883 (+-0.020)          |            102.499 (+-0.116)            |     0.814 (+-0.000)      |          101.103 (+-0.027)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        100.898 (+-0.036)        |         109.717 (+-0.336)          |            102.558 (+-0.120)            |     0.935 (+-0.000)      |          101.642 (+-0.105)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |        462.853 (+-0.028)        |         382.475 (+-0.047)          |            382.472 (+-0.033)            |     1.000 (+-0.000)      |          462.188 (+-0.014)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |        462.783 (+-0.021)        |         382.806 (+-0.037)          |            382.563 (+-0.043)            |     0.999 (+-0.000)      |          462.089 (+-0.028)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        466.721 (+-0.022)        |         384.438 (+-0.027)          |            384.886 (+-0.037)            |     1.001 (+-0.000)      |          467.014 (+-0.025)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        466.993 (+-0.032)        |         384.212 (+-0.009)          |            383.946 (+-0.029)            |     0.999 (+-0.000)      |          466.575 (+-0.020)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        190.070 (+-0.082)        |         209.353 (+-1.096)          |            202.870 (+-0.888)            |     0.969 (+-0.000)      |          189.371 (+-0.164)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        190.021 (+-0.018)        |         210.504 (+-0.456)          |            201.814 (+-0.770)            |     0.959 (+-0.000)      |          189.314 (+-0.036)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        188.860 (+-0.207)        |         336.635 (+-0.023)          |            252.026 (+-0.510)            |     0.749 (+-0.000)      |          188.860 (+-0.170)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        188.725 (+-0.214)        |         276.329 (+-0.563)          |            251.439 (+-0.524)            |     0.910 (+-0.000)      |          188.776 (+-0.189)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        781.879 (+-0.086)        |         836.389 (+-7.177)          |            816.483 (+-6.626)            |     0.976 (+-0.000)      |          781.362 (+-0.106)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        781.824 (+-0.099)        |         840.406 (+-7.111)          |            807.530 (+-6.514)            |     0.961 (+-0.000)      |          781.307 (+-0.129)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        769.290 (+-0.309)        |         675.498 (+-1.537)          |            688.171 (+-4.326)            |     1.019 (+-0.000)      |          769.830 (+-0.222)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        769.240 (+-0.179)        |         675.800 (+-1.113)          |            673.176 (+-1.740)            |     0.996 (+-0.000)      |          769.935 (+-0.171)

Times are in microseconds (us).

```

Pull Request resolved: #120411
Approved by: https://github.com/lezcano
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…120411)


Times are in microseconds (us).

[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                     |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |         97.727 (+-0.018)        |          97.765 (+-0.025)          |             97.773 (+-0.027)            |     1.000 (+-0.000)      |           97.905 (+-0.040)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.615 (+-0.066)        |          97.332 (+-0.032)          |             97.950 (+-0.026)            |     1.006 (+-0.000)      |           97.690 (+-0.062)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        100.635 (+-0.033)        |         125.883 (+-0.020)          |            102.499 (+-0.116)            |     0.814 (+-0.000)      |          101.103 (+-0.027)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        100.898 (+-0.036)        |         109.717 (+-0.336)          |            102.558 (+-0.120)            |     0.935 (+-0.000)      |          101.642 (+-0.105)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |        462.853 (+-0.028)        |         382.475 (+-0.047)          |            382.472 (+-0.033)            |     1.000 (+-0.000)      |          462.188 (+-0.014)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |        462.783 (+-0.021)        |         382.806 (+-0.037)          |            382.563 (+-0.043)            |     0.999 (+-0.000)      |          462.089 (+-0.028)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        466.721 (+-0.022)        |         384.438 (+-0.027)          |            384.886 (+-0.037)            |     1.001 (+-0.000)      |          467.014 (+-0.025)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        466.993 (+-0.032)        |         384.212 (+-0.009)          |            383.946 (+-0.029)            |     0.999 (+-0.000)      |          466.575 (+-0.020)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        190.070 (+-0.082)        |         209.353 (+-1.096)          |            202.870 (+-0.888)            |     0.969 (+-0.000)      |          189.371 (+-0.164)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        190.021 (+-0.018)        |         210.504 (+-0.456)          |            201.814 (+-0.770)            |     0.959 (+-0.000)      |          189.314 (+-0.036)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        188.860 (+-0.207)        |         336.635 (+-0.023)          |            252.026 (+-0.510)            |     0.749 (+-0.000)      |          188.860 (+-0.170)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        188.725 (+-0.214)        |         276.329 (+-0.563)          |            251.439 (+-0.524)            |     0.910 (+-0.000)      |          188.776 (+-0.189)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        781.879 (+-0.086)        |         836.389 (+-7.177)          |            816.483 (+-6.626)            |     0.976 (+-0.000)      |          781.362 (+-0.106)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        781.824 (+-0.099)        |         840.406 (+-7.111)          |            807.530 (+-6.514)            |     0.961 (+-0.000)      |          781.307 (+-0.129)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        769.290 (+-0.309)        |         675.498 (+-1.537)          |            688.171 (+-4.326)            |     1.019 (+-0.000)      |          769.830 (+-0.222)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        769.240 (+-0.179)        |         675.800 (+-1.113)          |            673.176 (+-1.740)            |     0.996 (+-0.000)      |          769.935 (+-0.171)

Times are in microseconds (us).

```
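For context, here is a minimal sketch of how rows like those above could be reproduced with `torch.utils.benchmark`. The harness, shapes, and thread settings below are assumptions for illustration, not the exact script that produced the tables:

```python
import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark

# Assumed repro sketch: measure eager vs. torch.compile for one
# configuration from the tables (bicubic, align_corners=True, no antialias).
def interp(x, osize, align_corners):
    return F.interpolate(x, size=osize, mode="bicubic",
                         align_corners=align_corners, antialias=False)

compiled_interp = torch.compile(interp)

x = torch.rand(1, 3, 500, 400, dtype=torch.float32)  # contiguous_format
compiled_interp(x, (256, 256), True)  # warm up so compilation is excluded

for sub_label, fn in [("Eager", interp), ("Compiled", compiled_interp)]:
    timer = benchmark.Timer(
        stmt="fn(x, (256, 256), True)",
        globals={"fn": fn, "x": x},
        num_threads=1,
        label="Interpolate, cpu",
        sub_label=sub_label,
    )
    print(timer.blocked_autorange(min_run_time=1))
```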

Pull Request resolved: pytorch#120411
Approved by: https://github.com/lezcano
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
…mpatible ranges (#122420)

Fixes #122283

Description:

PR #120077 introduced cpp nodes fusion with compatible ranges under the assumption that all scheduler nodes inside a fused node are the same; however, it turned out that scheduler nodes can have different indexing expressions. This PR fixes that incorrect assumption.
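For illustration, a hedged sketch (assumed shapes and ops, not the actual reproducer from #122283) of how two pointwise scheduler nodes can cover the same ranges yet load the same input through different indexing expressions:

```python
import torch

@torch.compile(dynamic=True)
def fn(x: torch.Tensor) -> torch.Tensor:
    a = x * 2.0      # pointwise node: loads x with contiguous indexing
    b = x.t() * 3.0  # pointwise node over the same element count, but it
                     # loads x through a transposed view, so its load uses
                     # a different indexing expression than the first node
    return a + b.t()

out = fn(torch.rand(64, 128))
```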

Pull Request resolved: #122420
Approved by: https://github.com/lezcano