Performance fix for torch.cat operator on ROCm #46097

Closed
ashishfarmer wants to merge 4 commits

Conversation

ashishfarmer

Summary:
This pull request is a partial revert of #44833 for ROCm to fix the performance of the concatenate operator. The changes only affect execution on ROCm and are guarded by the define `__HIP_PLATFORM_HCC__`.
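
For readers unfamiliar with the guard: a minimal sketch of the pattern (illustrative only, not this PR's diff; the two helper names are hypothetical placeholders):

```
// __HIP_PLATFORM_HCC__ is defined by the HIP/ROCm toolchain, so the
// guarded branch compiles only in ROCm builds; CUDA builds keep the
// existing path untouched.
void rocmCatPath();  // hypothetical: the restored pre-#44833 path
void cudaCatPath();  // hypothetical: the CUDA path, unchanged by this PR

void catDispatch() {
#ifdef __HIP_PLATFORM_HCC__
  rocmCatPath();
#else
  cudaCatPath();
#endif
}
```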

Test plan:
Benchmark
`python -m pt.cat_test --tag_filter all --device cuda`

Results on ROCm before the PR:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 10828.314

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11888.028

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11898.945

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 11787.744

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11792.479

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 11769.718

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f989e5c2510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f989e5c2510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 11633.882

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f989e5c2620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f989e5c2620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 11617.768

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f96eee4df28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f96eee4df28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 11625.143

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 13079.204

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f96ef8740d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f96ef8740d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 13095.620

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f96ef874158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f96ef874158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 13403.086

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 118.704

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 263.273

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 463.024

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8741e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8741e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 23818.032

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 234778.296

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8742f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8742f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 470288.132

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 704361.221
```

Results on ROCm after the PR:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 29.292

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 46.320

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 36.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 92.816

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 93.943

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 163.914

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1da3186510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1da3186510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 75.475

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1da3186620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1da3186620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 68.880

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f1bf3c50f28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f1bf3c50f28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 85.268

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 111.543

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f1bf46690d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f1bf46690d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 110.644

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f1bf4669158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f1bf4669158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 116.201

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 117.708

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 264.953

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 480.304

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46691e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46691e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 116.385

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 913.591

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46692f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46692f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 2003.212

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 3004.174
```

@jeffdaily jeffdaily added the module: rocm label (AMD GPU support for Pytorch) Oct 9, 2020
@ashishfarmer ashishfarmer marked this pull request as ready for review October 9, 2020 17:24
@jeffdaily
Collaborator

Requested review from @lly-zero-one as code owner.

@jeffdaily
Collaborator

jeffdaily commented Oct 9, 2020

CC @ezyang @xw285cornell @malfet @sunway513 for awareness. We plan to submit this PR as a cherry-pick against release/1.7 once this passes CI and is merged to master. Refer to the performance comparison as justification.

@lly-zero-one
Contributor

lly-zero-one commented Oct 9, 2020

@jeffdaily Thanks for the fix. I'd like to understand why #44833 causes a regression in the HIP case; could you elaborate? Basically, that PR got rid of the H2D copy by using constant memory instead.

@jeffdaily
Collaborator

@lly-zero-one our compiler does not yet handle this use case efficiently: passing aggregate structs by value as kernel arguments. This PR restores the old behavior of passing a pointer to the struct instead. Our compiler team is working on a fix for this.
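
To make the two argument-passing styles concrete, here is a hedged sketch (the struct, kernel, and launch names are hypothetical, not the code in this PR):

```
#include <cuda_runtime.h>

// Hypothetical metadata aggregate of the kind torch.cat hands to its
// batched-copy kernel (per-input pointers and offsets).
struct CatMetadata {
  const float* input[8];
  int offset[8];
  int n;
};

// Style from #44833: the whole aggregate is a kernel argument, so the
// runtime marshals it into kernel argument space with no explicit
// host-to-device copy. Cheap on CUDA; at the time, the ROCm compiler
// handled large by-value aggregates inefficiently.
__global__ void catByValue(CatMetadata meta, float* out) { /* copy loop */ }

// Style restored on ROCm by this PR: the struct is staged in device
// memory and the kernel receives only a pointer, at the cost of an
// explicit H2D copy before launch.
__global__ void catByPointer(const CatMetadata* meta, float* out) { /* copy loop */ }

void launchCat(const CatMetadata& h, float* out) {
  dim3 grid(1), block(256);
#ifdef __HIP_PLATFORM_HCC__
  CatMetadata* d = nullptr;
  cudaMalloc(&d, sizeof(CatMetadata));                   // hipified to hipMalloc
  cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);  // the H2D copy #44833 removed
  catByPointer<<<grid, block>>>(d, out);
  cudaDeviceSynchronize();  // simplified; PyTorch uses streams and a caching allocator
  cudaFree(d);
#else
  catByValue<<<grid, block>>>(h, out);                   // aggregate passed by value
#endif
}
```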

@malfet
Contributor

malfet commented Oct 9, 2020

Since the implementations seem to have diverged sufficiently now, perhaps a HIP-specific implementation could be checked into shapes.hip?

@codecov

codecov bot commented Oct 9, 2020

Codecov Report

Merging #46097 into master will increase coverage by 0.00%.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #46097   +/-   ##
=======================================
  Coverage   68.17%   68.17%           
=======================================
  Files         410      410           
  Lines       53422    53422           
=======================================
+ Hits        36422    36423    +1     
+ Misses      17000    16999    -1     
Impacted Files Coverage Δ
torch/testing/_internal/expecttest.py 78.57% <0.00%> (+1.02%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 362d9a9...3128c39

@jeffdaily
Collaborator

Certainly not ideal, but there is some precedent established with diverging CUDA/ROCm implementations in the Loops.h file for elementwise operators, or the cublas vs rocblas differences. We make every effort to keep these differences to a minimum. We intend to revert this change once our compiler improves, but we do not have an estimate at this time.

We're adding ~160 lines to this file; it's not a complete duplication, since the file was originally ~400 lines and we're only providing ROCm-specific implementations of CatArrayBatchedCopy and parallel_cat, not the handful of remaining functions. The CUDA path is unchanged and is free to continue changing in the future as needed. The #ifdef is much easier than adding new source files. We're hoping the benefit of undoing the massive ROCm performance regression is worth the price of this short-term divergence.

@agolynski agolynski added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Oct 12, 2020
@jeffdaily
Collaborator

@malfet any response to my comment? Again, I want to stress that we will revert this workaround as soon as it becomes possible.

@malfet malfet requested a review from ngimel October 12, 2020 19:22

@malfet malfet left a comment


@jeffdaily having two different implementations of functions with the same name makes this file quite confusing for a developer to read.
On the other hand, since an uninstantiated template is essentially a no-op, please consider adding a HIP_ prefix to the ROCm-specific template implementations and having just a single small #ifdef in cat_out_cuda that calls hip_parallel_cat when compiling with ROCm (a sketch of this structure follows the inline suggestions below).


template <typename T, typename IndexType, int Dims>
C10_LAUNCH_BOUNDS_1(512)
__global__ void CatArrayBatchedCopy(

Suggested change:
- __global__ void CatArrayBatchedCopy(
+ __global__ void HIP_CatArrayBatchedCopy(

@@ -141,6 +186,124 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second,
}
}

#ifdef __HIP_PLATFORM_HCC__
template <typename scalar_t>
void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,

Suggested change:
- void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,
+ void hip_parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,

}
// Template Declarations for dim = 1, 2, 3, 4
#define HANDLE_CASE(DIMS) \
CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\

Suggested change:
- CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\
+ HIP_CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\
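
Taken together, a sketch of the structure these suggestions imply (simplified and illustrative; the real file uses ATen's dispatch macros and fuller signatures):

```
#include <ATen/ATen.h>
using at::Tensor;
using at::TensorList;

// ROCm variants carry a HIP_ prefix. Because an uninstantiated template
// is compiled out, these definitions cost nothing in CUDA builds and
// need no #ifdef of their own.
template <typename T, typename IndexType, int Dims>
__global__ void HIP_CatArrayBatchedCopy(/* ... */) { /* ROCm kernel */ }

template <typename scalar_t>
void hip_parallel_cat(Tensor& out, const TensorList& inputs, int64_t dim) {
  /* launches HIP_CatArrayBatchedCopy */
}

// CUDA variants keep their existing names.
template <typename T, typename IndexType, int Dims>
__global__ void CatArrayBatchedCopy(/* ... */) { /* CUDA kernel */ }

template <typename scalar_t>
void parallel_cat(Tensor& out, const TensorList& inputs, int64_t dim) {
  /* launches CatArrayBatchedCopy */
}

Tensor& cat_out_cuda(Tensor& out, TensorList inputs, int64_t dim) {
  // The single small #ifdef lives at the call site.
#ifdef __HIP_PLATFORM_HCC__
  hip_parallel_cat<float>(out, inputs, dim);  // float stands in for AT_DISPATCH
#else
  parallel_cat<float>(out, inputs, dim);
#endif
  return out;
}
```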

ashishfarmer (Author)

Updated the branch with the suggested changes.

@jeffdaily
Collaborator

@ashishfarmer please update based on comments from @malfet. Thank you.

@dr-ci

dr-ci bot commented Oct 12, 2020

💊 CI failures summary and remediations

As of commit c29834d (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI. It has been revised 4 times.

dim3 catGrid;
getCatGrid(batchCounter, catGrid);



nit: remove the extra blank line above (the suggested change deletes it).

@facebook-github-bot facebook-github-bot left a comment


@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@malfet merged this pull request in d5ca53c.

ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Oct 14, 2020
Summary: same as the PR description above.

Pull Request resolved: pytorch#46097

Test Plan: same benchmark results as in the PR description above.

Reviewed By: bdhirsh

Differential Revision: D24286324

Pulled By: malfet

fbshipit-source-id: 291f3f3f80f9d2f9ba52a455a942f3fb0406e7d2
malfet pushed a commit that referenced this pull request Oct 14, 2020
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Mar 11, 2022
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 14, 2022
pytorchmergebot pushed a commit that referenced this pull request Mar 21, 2022
Revert d5ca53c (#46097). The changes only affect ROCm. This reverts a workaround for a compiler performance issue that is no longer needed.

`python -m pt.cat_test --tag_filter all --device cuda`

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 48.833
NEW Forward Execution Time (us) : 8.318

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 54.508
NEW Forward Execution Time (us) : 23.824

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.117
NEW Forward Execution Time (us) : 14.942

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.790
NEW Forward Execution Time (us) : 74.334

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 102.063
NEW Forward Execution Time (us) : 76.008

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 167.786
NEW Forward Execution Time (us) : 123.679

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1b1dec7b00>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1b1dec7b00>, 111, 65], N: 5, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.320
NEW Forward Execution Time (us) : 67.436

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1b1dec7a70>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1b1dec7a70>, 64], N: 5, dim: 1, device: cuda
OLD Forward Execution Time (us) : 91.484
NEW Forward Execution Time (us) : 59.230

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f18db09d290>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f18db09d290>], N: 5, dim: 2, device: cuda
OLD Forward Execution Time (us) : 109.569
NEW Forward Execution Time (us) : 76.557

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d560>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d560>, 32, 64], N: 50, dim: 0, device: cuda
OLD Forward Execution Time (us) : 106.603
NEW Forward Execution Time (us) : 87.635

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f18db09d5f0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f18db09d5f0>, 64], N: 50, dim: 1, device: cuda
OLD Forward Execution Time (us) : 106.693
NEW Forward Execution Time (us) : 88.902

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f18db09d680>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f18db09d680>], N: 50, dim: 2, device: cuda
OLD Forward Execution Time (us) : 110.881
NEW Forward Execution Time (us) : 94.361

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 122.925
NEW Forward Execution Time (us) : 123.046

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
OLD Forward Execution Time (us) : 272.442
NEW Forward Execution Time (us) : 271.932

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
OLD Forward Execution Time (us) : 457.329
NEW Forward Execution Time (us) : 456.767

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda
OLD Forward Execution Time (us) : 117.688
NEW Forward Execution Time (us) : 87.133

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 873.764
NEW Forward Execution Time (us) : 865.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 1746.831
NEW Forward Execution Time (us) : 1730.252

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015
```
Pull Request resolved: #74129
Approved by: https://github.com/ngimel
facebook-github-bot pushed a commit that referenced this pull request Mar 22, 2022
Summary: same as the commit message above (revert of d5ca53c, #46097; benchmark results unchanged).

Pull Request resolved: #74129
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/14bf20cd922c3ba33c32343c19fd9ac490d4f7a6

Reviewed By: anjali411

Differential Revision: D34990460

fbshipit-source-id: 2bb09b9f60342f7bd23e856d4861d513dd3d104f
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
revert d5ca53c (pytorch#46097). The changes only affect ROCm. Reverts a work-around for a compiler performance issue that is no longer needed.
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jithunnair-amd pushed a commit to jithunnair-amd/pytorch that referenced this pull request Sep 20, 2022
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Sep 28, 2022