Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) #120077
Conversation
torch/_inductor/scheduler.py
```python
ref_node = node2 if len(vars1) < len(vars2) else node1
extra_indexing_constraints = get_indexing_ranges_exprs(ref_node)
if extra_indexing_constraints is not None and isinstance(node_to_recomp, SchedulerNode):
    node_to_recomp.recompute_size_and_body(extra_indexing_constraints=extra_indexing_constraints)
```
One concern is that retracing the body is fairly expensive, so we'll need to benchmark this to see how it impacts compile time overhead. I don't expect it to be too bad though as scheduler time doesn't usually dominate compilation.
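(For a rough sense of that overhead, a minimal timing sketch; this is an illustration assuming a toy function that produces the (s0, s1, s2) / (s0*s1*s2,) node pair, not part of the PR:)

```python
import time
import torch

def f(x):
    y = x + 1                 # pointwise node iterating over (s0, s1, s2)
    return y.reshape(-1) * 2  # pointwise node iterating over (s0*s1*s2,)

x = torch.randn(64, 128, 128)
start = time.perf_counter()
torch.compile(f, dynamic=True)(x)  # first call includes scheduling + codegen
print(f"cold compile + first run: {time.perf_counter() - start:.3f}s")
```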
```python
else:
    self.kernel_group = KernelGroup()

def fuse(self, node1, node2):
```
Type the function to make it more readable. Same below.
Well, if you check the can_fuse_* methods here, they are not typed either; I'm following the trend here :)
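(For reference, the kind of annotation being requested might look like the sketch below, assuming the BaseSchedulerNode/FusedSchedulerNode names used elsewhere in scheduler.py:)

```python
def fuse(
    self,
    node1: "BaseSchedulerNode",
    node2: "BaseSchedulerNode",
) -> "FusedSchedulerNode":
    ...
```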
@vfdev-5 Do you have a performance comparison with and without this PR on the three inductor benchmark suites? I guess my major concern is whether the fusion would always bring a performance benefit. It might change the parallelization scheduling and data access patterns of the node with sizes s0*s1*...
cc @chuanqi129 for awareness. We may evaluate its performance impact from our side too.
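(One hedged way to check the parallelization/data-access concern on a toy case: dump the generated CPU code and inspect the fused kernel's loop structure. run_and_get_cpp_code is a helper used in inductor's own tests; treat the exact import path as an assumption:)

```python
import torch
from torch._inductor.utils import run_and_get_cpp_code  # test helper; path assumed

def f(x):
    y = x + 1
    return y.reshape(-1) * 2

compiled = torch.compile(f, dynamic=True)
# Returns (result, generated C++ source); grep the source for omp pragmas and
# loop nests to see how the fused node is parallelized and indexed.
_, cpp_code = run_and_get_cpp_code(compiled, torch.randn(8, 16, 32))
print(cpp_code)
```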
@jgong5 can you please point out which benchmark suites you have in mind? Thanks! I'll provide a benchmark of the case I worked on (upsampling bicubic) where this fusion leads to some speed-up.
@vfdev-5 we are tracking the performance with the three benchmark suites: torchbench, huggingface and timm. @chuanqi129 can share the steps for running them to check the performance.
@vfdev-5 please run the torchbench suite on the HUD.
Hi @vfdev-5 @jgong5, for the Inductor CPU performance dashboard benchmarks, we used this
If CPU benchmarks are not run on the HUD, then it may be better for someone from Intel to run them, so that they run on the relevant hardware, perhaps with core pinning, throttling disabled, with/without hyperthreading, etc. Also, we don't have access to a box with AVX512, for example. @jgong5
Observations:
"extended fusion" refers to the following update to this PR to handle the channels-last case:
```diff
- c3 = len(vars1) == 1 or len(vars2) == 1
+ c3 = len(vars1) in (1, 2) or len(vars2) in (1, 2)
```
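(A hypothetical channels-last repro for what the extended condition targets; the claim that such nodes iterate over two vars, a flattened outer extent plus channels, is my reading of the diff rather than something verified here:)

```python
import torch

def g(x):
    y = x + 1
    return y.reshape(-1) * 2

# Channels-last input: the pointwise kernel may keep a separate channel var,
# so the flattened node's loop vars can have len(vars) == 2 rather than 1.
x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)
out = torch.compile(g, dynamic=True)(x)
```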
Thanks for sharing the numbers. @chuanqi129 will check the impact on inference performance on the three benchmark suites.
I have verified the FP32 static shape default wrapper inference test with the 3 suites; there is no performance drop. cc: @jgong5 @vfdev-5
Thanks @chuanqi129 !
@chuanqi129 thanks for the feedback. I wonder whether you have seen any perf improvements on some tests?
sgtm. @peterbell10 wanna have a last look?
Also, @vfdev-5 will you implement the more generalised version of this algorithm after this PR is merged?
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
@pytorchbot rebase
Successfully rebased
```python
    return ReasonFusedNodes.COMPATIBLE_REDUCTION
if self._can_fuse_nodes_with_compatible_ranges(node1, node2):
    return ReasonFusedNodes.COMPATIBLE_RANGES_NO_REDUCTION
# TODO(jansel): allow fusion pointwise (vars1, ()) suffix?
```
this comment is no longer relevant :)
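(As a sketch of what "compatible ranges" means in this check — a hypothetical standalone helper, not the PR's actual _can_fuse_nodes_with_compatible_ranges — one node's single flattened range must equal the product of the other node's ranges:)

```python
import sympy
from functools import reduce
from operator import mul

def ranges_are_compatible(ranges1, ranges2):
    """True if one tuple is (r,) with r equal to the product of the other tuple."""
    if len(ranges1) == len(ranges2):
        return False  # equal ranks are handled by the ordinary fusion path
    flat, full = (ranges1, ranges2) if len(ranges1) < len(ranges2) else (ranges2, ranges1)
    prod = reduce(mul, full, sympy.Integer(1))
    return len(flat) == 1 and sympy.simplify(flat[0] - prod) == 0

s0, s1, s2 = sympy.symbols("s0 s1 s2", positive=True)
assert ranges_are_compatible((s0 * s1 * s2,), (s0, s1, s2))
assert not ranges_are_compatible((s0 * s1,), (s0, s1, s2))
```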
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…mpatible ranges (#122420)

Fixes #122283

Description: PR #120077 introduced cpp node fusion with compatible ranges under the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes that incorrect assumption.

Pull Request resolved: #122420
Approved by: https://github.com/lezcano
Supersedes #104248

Description:
- Fixed support for uint8 in the upsample bicubic2d decomposition (on `main` the results are wrong, so we can tolerate the slowdown)
- Added missing clamp(0, 1) for xscale and yscale
- Slowdown for f32 on CPU: the PR on node fusion on CPU (#120077) can help for upsampling cases with align_corners=True; the slowdown is mainly due to the added clamp op and is also partially reduced when using torch.stack in the weights computation on CPU
- Removed the lowering implementation

Benchmarks:
```
[---------------------------------------------------- Interpolate, cpu ----------------------------------------------------]
                                                                                                                             | Eager (2.4.0a0+git0c61c20) PR | Compiled (2.4.0a0+git0c61c20) PR | Compiled (2.4.0a0+git069270d) Nightly | speed-up PR vs Nightly | Eager (2.4.0a0+git069270d) Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------
Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 613.029 (+-1.590) | 5477.608 (+-9.027) | 3060.314 (+-12.368) | 0.559 (+-0.000) | 608.735 (+-6.336)
Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 610.176 (+-1.428) | 5718.503 (+-11.203) | 3424.022 (+-12.836) | 0.599 (+-0.000) | 604.781 (+-6.229)
Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 325.001 (+-0.840) | 6183.029 (+-10.893) | 3275.032 (+-7.625) | 0.530 (+-0.000) | 325.693 (+-1.067)
Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 325.855 (+-1.108) | 6391.394 (+-11.552) | 3533.410 (+-7.666) | 0.553 (+-0.000) | 325.838 (+-1.457)
Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 2521.533 (+-14.857) | 5025.217 (+-13.415) | 2814.304 (+-6.742) | 0.560 (+-0.000) | 2520.308 (+-10.796)
Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 2531.204 (+-12.534) | 5294.925 (+-11.994) | 3147.590 (+-6.808) | 0.594 (+-0.000) | 2521.228 (+-11.732)
Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 758.352 (+-10.362) | 5639.912 (+-14.495) | 3014.123 (+-8.799) | 0.534 (+-0.000) | 756.114 (+-4.792)
Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 758.712 (+-5.781) | 5927.541 (+-9.982) | 3249.555 (+-7.226) | 0.548 (+-0.000) | 757.719 (+-5.653)
Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 1524.469 (+-12.860) | 34321.641 (+-80.310) | 19373.714 (+-56.351) | 0.564 (+-0.000) | 1518.082 (+-49.653)
Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 1521.746 (+-13.780) | 35949.711 (+-81.010) | 21782.366 (+-68.938) | 0.606 (+-0.000) | 1467.911 (+-15.901)
Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 712.311 (+-5.361) | 38826.510 (+-92.267) | 20762.314 (+-59.303) | 0.535 (+-0.000) | 712.669 (+-4.673)
Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 715.060 (+-4.757) | 40269.353 (+-92.543) | 22402.114 (+-81.574) | 0.556 (+-0.000) | 716.001 (+-8.945)
Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 2331.889 (+-29.159) | 21541.096 (+-72.346) | 12181.194 (+-45.288) | 0.565 (+-0.000) | 2304.864 (+-21.351)
Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 2333.697 (+-10.066) | 22514.154 (+-57.798) | 21709.449 (+-98.307) | 0.964 (+-0.000) | 2302.141 (+-13.041)
Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 1198.768 (+-5.364) | 37652.371 (+-101.644) | 42740.413 (+-98.571) | 1.135 (+-0.000) | 1197.104 (+-7.225)
Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 1196.851 (+-5.118) | 39678.341 (+-173.750) | 46807.738 (+-92.744) | 1.180 (+-0.000) | 1189.322 (+-5.681)
Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 10020.978 (+-54.855) | 19955.290 (+-71.891) | 11420.521 (+-53.179) | 0.572 (+-0.000) | 9999.583 (+-61.230)
Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 10066.441 (+-62.700) | 21058.334 (+-183.414) | 19986.577 (+-65.304) | 0.949 (+-0.000) | 10018.672 (+-59.188)
Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 3171.135 (+-14.635) | 19687.864 (+-54.320) | 23313.699 (+-57.391) | 1.184 (+-0.000) | 3182.191 (+-17.686)
Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 3181.314 (+-13.784) | 20224.468 (+-50.827) | 30541.963 (+-381.385) | 1.510 (+-0.000) | 3183.578 (+-16.203)
Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 5879.450 (+-31.551) | 136918.555 (+-480.320) | 77723.568 (+-331.766) | 0.568 (+-0.000) | 5726.061 (+-87.517)
Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 5882.869 (+-30.325) | 143378.094 (+-513.842) | 137244.074 (+-4827.730) | 0.957 (+-0.000) | 5727.679 (+-22.164)
Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 2674.937 (+-45.003) | 244829.360 (+-1930.579) | 271283.073 (+-2243.245) | 1.108 (+-0.000) | 2676.054 (+-24.632)
Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 2676.217 (+-16.601) | 248658.668 (+-2904.952) | 296514.520 (+-2983.281) | 1.192 (+-0.000) | 2682.844 (+-19.886)
Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 1768.437 (+-6.294) | 2934.013 (+-28.870) | 2520.649 (+-6.797) | 0.859 (+-0.000) | 1759.292 (+-5.097)
Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 1748.660 (+-5.550) | 3271.104 (+-7.557) | 2891.306 (+-7.632) | 0.884 (+-0.000) | 1746.341 (+-5.845)
Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 2813.150 (+-6.656) | 3258.973 (+-7.543) | 2766.286 (+-6.473) | 0.849 (+-0.000) | 2805.077 (+-7.611)
Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 2812.102 (+-8.211) | 3568.780 (+-9.018) | 3125.870 (+-7.324) | 0.876 (+-0.000) | 2834.178 (+-9.034)
Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 1687.975 (+-9.527) | 2752.085 (+-9.627) | 2373.274 (+-7.888) | 0.862 (+-0.000) | 1698.782 (+-8.098)
Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 1696.606 (+-8.678) | 3056.317 (+-13.303) | 2699.160 (+-10.638) | 0.883 (+-0.000) | 1684.942 (+-10.519)
Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 2613.491 (+-9.769) | 3176.493 (+-13.366) | 2730.193 (+-9.573) | 0.859 (+-0.000) | 2625.085 (+-9.943)
Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 2614.946 (+-34.129) | 3465.398 (+-11.165) | 3044.396 (+-11.447) | 0.879 (+-0.000) | 2627.355 (+-9.608)
Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 10784.549 (+-58.181) | 18292.452 (+-59.344) | 15909.922 (+-49.864) | 0.870 (+-0.000) | 10837.656 (+-51.947)
Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 10786.513 (+-52.308) | 20449.038 (+-56.204) | 18295.997 (+-54.522) | 0.895 (+-0.000) | 10843.751 (+-44.781)
Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 17532.699 (+-64.807) | 20425.699 (+-80.271) | 17517.040 (+-79.705) | 0.858 (+-0.000) | 17595.597 (+-61.870)
Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 17530.816 (+-55.131) | 22450.080 (+-92.899) | 19827.828 (+-77.649) | 0.883 (+-0.000) | 17615.934 (+-71.716)
Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 6875.484 (+-40.543) | 11569.509 (+-62.462) | 10053.350 (+-208.136) | 0.869 (+-0.000) | 6864.501 (+-46.747)
Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 6843.126 (+-44.498) | 12915.236 (+-60.654) | 25335.058 (+-382.640) | 1.962 (+-0.000) | 6899.002 (+-46.861)
Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) | 11103.418 (+-51.318) | 28834.389 (+-78.395) | 37405.463 (+-581.646) | 1.297 (+-0.000) | 11223.012 (+-60.709)
Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) | 11092.994 (+-70.835) | 36597.023 (+-118.988) | 45761.267 (+-85.051) | 1.250 (+-0.000) | 11104.014 (+-61.288)
Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 7106.791 (+-63.666) | 11191.071 (+-45.402) | 9786.037 (+-75.781) | 0.874 (+-0.000) | 7129.419 (+-77.674)
Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 7146.519 (+-28.376) | 12443.571 (+-39.425) | 20147.067 (+-74.771) | 1.619 (+-0.000) | 7179.622 (+-64.847)
Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) | 10533.849 (+-44.227) | 34814.909 (+-138.127) | 42803.001 (+-114.326) | 1.229 (+-0.000) | 10644.039 (+-59.681)
Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) | 10548.910 (+-44.221) | 42876.940 (+-146.959) | 49711.443 (+-139.276) | 1.159 (+-0.000) | 10652.375 (+-44.174)
Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 42814.521 (+-103.198) | 73100.489 (+-435.262) | 63587.659 (+-134.266) | 0.870 (+-0.000) | 43208.921 (+-195.287)
Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 42812.373 (+-103.870) | 81769.160 (+-373.369) | 175159.813 (+-2028.558) | 2.142 (+-0.000) | 43007.691 (+-96.358)
Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) | 69955.505 (+-373.373) | 215248.616 (+-2040.775) | 267511.246 (+-2094.161) | 1.243 (+-0.000) | 70382.679 (+-594.941)
Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) | 69852.157 (+-490.076) | 242841.484 (+-19645.513) | 317931.678 (+-2016.498) | 1.309 (+-0.000) | 70074.819 (+-352.919)
Times are in microseconds (us).

[---------------------------------------------------- Interpolate, cuda ---------------------------------------------------]
                                                                                                                             | Eager (2.4.0a0+git0c61c20) PR | Compiled (2.4.0a0+git0c61c20) PR | Compiled (2.4.0a0+git069270d) Nightly | speed-up PR vs Nightly | Eager (2.4.0a0+git069270d) Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------
Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) | 97.727 (+-0.018) | 97.765 (+-0.025) | 97.773 (+-0.027) | 1.000 (+-0.000) | 97.905 (+-0.040)
Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) | 97.615 (+-0.066) | 97.332 (+-0.032) | 97.950 (+-0.026) | 1.006 (+-0.000) | 97.690 (+-0.062)
Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) | 100.635 (+-0.033) | 125.883 (+-0.020) | 102.499 (+-0.116) | 0.814 (+-0.000) | 101.103 (+-0.027)
Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) | 100.898 (+-0.036) | 109.717 (+-0.336) | 102.558 (+-0.120) | 0.935 (+-0.000) | 101.642 (+-0.105)
Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) | 462.853 (+-0.028) | 382.475 (+-0.047) | 382.472 (+-0.033) | 1.000 (+-0.000) | 462.188 (+-0.014)
Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) | 462.783 (+-0.021) | 382.806 (+-0.037) | 382.563 (+-0.043) | 0.999 (+-0.000) | 462.089 (+-0.028)
Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) | 466.721 (+-0.022) | 384.438 (+-0.027) | 384.886 (+-0.037) | 1.001 (+-0.000) | 467.014 (+-0.025)
Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) | 466.993 (+-0.032) | 384.212 (+-0.009) | 383.946 (+-0.029) | 0.999 (+-0.000) | 466.575 (+-0.020)
Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) | 190.070 (+-0.082) | 209.353 (+-1.096) | 202.870 (+-0.888) | 0.969 (+-0.000) | 189.371 (+-0.164)
Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) | 190.021 (+-0.018) | 210.504 (+-0.456) | 201.814 (+-0.770) | 0.959 (+-0.000) | 189.314 (+-0.036)
Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) | 188.860 (+-0.207) | 336.635 (+-0.023) | 252.026 (+-0.510) | 0.749 (+-0.000) | 188.860 (+-0.170)
Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) | 188.725 (+-0.214) | 276.329 (+-0.563) | 251.439 (+-0.524) | 0.910 (+-0.000) | 188.776 (+-0.189)
Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) | 781.879 (+-0.086) | 836.389 (+-7.177) | 816.483 (+-6.626) | 0.976 (+-0.000) | 781.362 (+-0.106)
Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) | 781.824 (+-0.099) | 840.406 (+-7.111) | 807.530 (+-6.514) | 0.961 (+-0.000) | 781.307 (+-0.129)
Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) | 769.290 (+-0.309) | 675.498 (+-1.537) | 688.171 (+-4.326) | 1.019 (+-0.000) | 769.830 (+-0.222)
Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) | 769.240 (+-0.179) | 675.800 (+-1.113) | 673.176 (+-1.740) | 0.996 (+-0.000) | 769.935 (+-0.171)
Times are in microseconds (us).
```
Pull Request resolved: #120411
Approved by: https://github.com/lezcano
Description:
node1 has sizes (s0, s1, s2) and node2 has sizes (s0 * s1 * s2,). On `main` these two nodes cannot be fused due to their different sizes. With this PR we can recompute node2's size, body, etc. using node1's indexing constraints, and thus be able to fuse the two nodes. A repro sketch is shown after this description.

Example: output on `main` vs. output on this PR (listings omitted).

Context:
While working on #120411 (upsampling bicubic decomposition), I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, it happened because the buffer's number of ops goes beyond `config.realize_opcount_threshold`.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang
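(A minimal repro sketch of the scenario described above; this is an illustration, with dynamic=True standing in for the symbolic sizes s0, s1, s2, not the benchmark from the PR:)

```python
import torch

def f(x):
    y = x + 1                 # SchedulerNode iterating over (s0, s1, s2)
    return y.reshape(-1) * 2  # SchedulerNode iterating over (s0*s1*s2,)

# With this PR the second node's size/body can be recomputed against the first
# node's indexing constraints, letting the scheduler fuse them into one kernel.
compiled = torch.compile(f, dynamic=True)
out = compiled(torch.randn(4, 5, 6))
```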