Performance fix for torch.cat operator on ROCm #46097

Closed
ashishfarmer wants to merge 4 commits

Conversation

ashishfarmer

Summary:
This pull request is a partial revert of #44833 for ROCm to fix the performance of the concatenate operator. The changes only affect execution on ROCm and are guarded by the define `__HIP_PLATFORM_HCC__`.
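
For readers unfamiliar with the guard: a minimal sketch of the pattern (illustrative only, not this PR's diff; the two helper names are hypothetical placeholders):

```
// __HIP_PLATFORM_HCC__ is defined by the HIP/ROCm toolchain, so the
// guarded branch compiles only in ROCm builds; CUDA builds keep the
// existing path untouched.
void rocmCatPath();  // hypothetical: the restored pre-#44833 path
void cudaCatPath();  // hypothetical: the CUDA path, unchanged by this PR

void catDispatch() {
#ifdef __HIP_PLATFORM_HCC__
  rocmCatPath();
#else
  cudaCatPath();
#endif
}
```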

Test plan:
Benchmark
`python -m pt.cat_test --tag_filter all --device cuda`

Results on ROCm before the PR:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 10828.314

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11888.028

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11898.945

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 11787.744

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11792.479

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 11769.718

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f989e5c2510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f989e5c2510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 11633.882

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f989e5c2620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f989e5c2620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 11617.768

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f96eee4df28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f96eee4df28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 11625.143

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 13079.204

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f96ef8740d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f96ef8740d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 13095.620

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f96ef874158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f96ef874158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 13403.086

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 118.704

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 263.273

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 463.024

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8741e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8741e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 23818.032

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 234778.296

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8742f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8742f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 470288.132

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 704361.221
```

Results on ROCm after the PR:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 29.292

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 46.320

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 36.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 92.816

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 93.943

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 163.914

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1da3186510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1da3186510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 75.475

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1da3186620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1da3186620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 68.880

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f1bf3c50f28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f1bf3c50f28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 85.268

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 111.543

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f1bf46690d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f1bf46690d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 110.644

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f1bf4669158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f1bf4669158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 116.201

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 117.708

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 264.953

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 480.304

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46691e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46691e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 116.385

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 913.591

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46692f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46692f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 2003.212

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 3004.174
```

@jeffdaily jeffdaily added the module: rocm label (AMD GPU support for Pytorch) Oct 9, 2020
@ashishfarmer ashishfarmer marked this pull request as ready for review October 9, 2020 17:24
@jeffdaily
Collaborator

Requested review from @lly-zero-one as code owner.

@jeffdaily
Collaborator

jeffdaily commented Oct 9, 2020

CC @ezyang @xw285cornell @malfet @sunway513 for awareness. We plan to submit this PR as a cherry-pick against release/1.7 once this passes CI and is merged to master. Refer to the performance comparison as justification.

@lly-zero-one
Contributor

lly-zero-one commented Oct 9, 2020

@jeffdaily Thanks for the fix. I'd like to understand why #44833 causes a regression in the HIP case; could you elaborate? Basically, that PR got rid of the H2D copy by using constant memory instead.

@jeffdaily
Collaborator

@lly-zero-one our compiler does not yet handle this use case efficiently: passing aggregate structs by value as kernel arguments. This PR restores the old behavior of passing a pointer to the struct instead. Our compiler team is working on a fix for this.
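
To make the two argument-passing styles concrete, here is a hedged sketch (the struct, kernel, and launch names are hypothetical, not the code in this PR):

```
#include <cuda_runtime.h>

// Hypothetical metadata aggregate of the kind torch.cat hands to its
// batched-copy kernel (per-input pointers and offsets).
struct CatMetadata {
  const float* input[8];
  int offset[8];
  int n;
};

// Style from #44833: the whole aggregate is a kernel argument, so the
// runtime marshals it into kernel argument space with no explicit
// host-to-device copy. Cheap on CUDA; at the time, the ROCm compiler
// handled large by-value aggregates inefficiently.
__global__ void catByValue(CatMetadata meta, float* out) { /* copy loop */ }

// Style restored on ROCm by this PR: the struct is staged in device
// memory and the kernel receives only a pointer, at the cost of an
// explicit H2D copy before launch.
__global__ void catByPointer(const CatMetadata* meta, float* out) { /* copy loop */ }

void launchCat(const CatMetadata& h, float* out) {
  dim3 grid(1), block(256);
#ifdef __HIP_PLATFORM_HCC__
  CatMetadata* d = nullptr;
  cudaMalloc(&d, sizeof(CatMetadata));                   // hipified to hipMalloc
  cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);  // the H2D copy #44833 removed
  catByPointer<<<grid, block>>>(d, out);
  cudaDeviceSynchronize();  // simplified; PyTorch uses streams and a caching allocator
  cudaFree(d);
#else
  catByValue<<<grid, block>>>(h, out);                   // aggregate passed by value
#endif
}
```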

@malfet
Contributor

malfet commented Oct 9, 2020

Since the implementations seem to have diverged sufficiently now, perhaps a HIP-specific implementation could be checked into shapes.hip?

@codecov

codecov bot commented Oct 9, 2020

Codecov Report

Merging #46097 into master will increase coverage by 0.00%.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #46097   +/-   ##
=======================================
  Coverage   68.17%   68.17%           
=======================================
  Files         410      410           
  Lines       53422    53422           
=======================================
+ Hits        36422    36423    +1     
+ Misses      17000    16999    -1     
Impacted Files Coverage Δ
torch/testing/_internal/expecttest.py 78.57% <0.00%> (+1.02%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 362d9a9...3128c39

@jeffdaily
Collaborator

Certainly not ideal, but there is some precedent established with diverging CUDA/ROCm implementations in the Loops.h file for elementwise operators, or the cublas vs rocblas differences. We make every effort to keep these differences to a minimum. We intend to revert this change once our compiler improves, but we do not have an estimate at this time.

We're adding ~160 lines to this file; it's not a complete duplication, since the file was originally ~400 lines and we're only providing ROCm-specific implementations of CatArrayBatchedCopy and parallel_cat, not the handful of remaining functions. The CUDA path is unchanged and is free to continue changing in the future as needed. The #ifdef is much easier than adding new source files. We're hoping the benefit of undoing the massive ROCm performance regression is worth the price of this short-term divergence.

@agolynski agolynski added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Oct 12, 2020
@jeffdaily
Collaborator

@malfet any response to my comment? Again, I want to stress that we will revert this workaround as soon as it becomes possible.

@malfet malfet requested a review from ngimel October 12, 2020 19:22

@malfet malfet left a comment


@jeffdaily having two different implementations of functions with the same name makes this file quite confusing for a developer to read.
On the other hand, since an uninstantiated template is essentially a no-op, please consider adding a HIP_ prefix to the ROCm-specific template implementations and having just a single small #ifdef in cat_out_cuda that calls hip_parallel_cat when compiling with ROCm (a sketch of this structure follows the inline suggestions below).


template <typename T, typename IndexType, int Dims>
C10_LAUNCH_BOUNDS_1(512)
__global__ void CatArrayBatchedCopy(

Suggested change:
- __global__ void CatArrayBatchedCopy(
+ __global__ void HIP_CatArrayBatchedCopy(

@@ -141,6 +186,124 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second,
}
}

#ifdef __HIP_PLATFORM_HCC__
template <typename scalar_t>
void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,

Suggested change:
- void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,
+ void hip_parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,

}
// Template Declarations for dim = 1, 2, 3, 4
#define HANDLE_CASE(DIMS) \
CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\

Suggested change:
- CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\
+ HIP_CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\
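
Taken together, a sketch of the structure these suggestions imply (simplified and illustrative; the real file uses ATen's dispatch macros and fuller signatures):

```
#include <ATen/ATen.h>
using at::Tensor;
using at::TensorList;

// ROCm variants carry a HIP_ prefix. Because an uninstantiated template
// is compiled out, these definitions cost nothing in CUDA builds and
// need no #ifdef of their own.
template <typename T, typename IndexType, int Dims>
__global__ void HIP_CatArrayBatchedCopy(/* ... */) { /* ROCm kernel */ }

template <typename scalar_t>
void hip_parallel_cat(Tensor& out, const TensorList& inputs, int64_t dim) {
  /* launches HIP_CatArrayBatchedCopy */
}

// CUDA variants keep their existing names.
template <typename T, typename IndexType, int Dims>
__global__ void CatArrayBatchedCopy(/* ... */) { /* CUDA kernel */ }

template <typename scalar_t>
void parallel_cat(Tensor& out, const TensorList& inputs, int64_t dim) {
  /* launches CatArrayBatchedCopy */
}

Tensor& cat_out_cuda(Tensor& out, TensorList inputs, int64_t dim) {
  // The single small #ifdef lives at the call site.
#ifdef __HIP_PLATFORM_HCC__
  hip_parallel_cat<float>(out, inputs, dim);  // float stands in for AT_DISPATCH
#else
  parallel_cat<float>(out, inputs, dim);
#endif
  return out;
}
```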

ashishfarmer (Author)

Updated the branch with the suggested changes.

@jeffdaily
Collaborator

@ashishfarmer please update based on comments from @malfet. Thank you.

@dr-ci

dr-ci bot commented Oct 12, 2020

💊 CI failures summary and remediations

As of commit c29834d (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI. It has been revised 4 times.

dim3 catGrid;
getCatGrid(batchCounter, catGrid);



nit: remove the extra blank line above (the suggested change deletes it).

@facebook-github-bot facebook-github-bot left a comment


@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@malfet merged this pull request in d5ca53c.

ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Oct 14, 2020
Summary: same as the PR description above.

Pull Request resolved: pytorch#46097

Test Plan: same benchmark results as in the PR description above.

Reviewed By: bdhirsh

Differential Revision: D24286324

Pulled By: malfet

fbshipit-source-id: 291f3f3f80f9d2f9ba52a455a942f3fb0406e7d2
malfet pushed a commit that referenced this pull request Oct 14, 2020
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Mar 11, 2022
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 14, 2022
pytorchmergebot pushed a commit that referenced this pull request Mar 21, 2022
Revert d5ca53c (#46097). The changes only affect ROCm. This reverts a workaround for a compiler performance issue that is no longer needed.

`python -m pt.cat_test --tag_filter all --device cuda`

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 48.833
NEW Forward Execution Time (us) : 8.318

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 54.508
NEW Forward Execution Time (us) : 23.824

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.117
NEW Forward Execution Time (us) : 14.942

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.790
NEW Forward Execution Time (us) : 74.334

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 102.063
NEW Forward Execution Time (us) : 76.008

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 167.786
NEW Forward Execution Time (us) : 123.679

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1b1dec7b00>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1b1dec7b00>, 111, 65], N: 5, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.320
NEW Forward Execution Time (us) : 67.436

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1b1dec7a70>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1b1dec7a70>, 64], N: 5, dim: 1, device: cuda
OLD Forward Execution Time (us) : 91.484
NEW Forward Execution Time (us) : 59.230

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f18db09d290>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f18db09d290>], N: 5, dim: 2, device: cuda
OLD Forward Execution Time (us) : 109.569
NEW Forward Execution Time (us) : 76.557

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d560>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d560>, 32, 64], N: 50, dim: 0, device: cuda
OLD Forward Execution Time (us) : 106.603
NEW Forward Execution Time (us) : 87.635

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f18db09d5f0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f18db09d5f0>, 64], N: 50, dim: 1, device: cuda
OLD Forward Execution Time (us) : 106.693
NEW Forward Execution Time (us) : 88.902

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f18db09d680>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f18db09d680>], N: 50, dim: 2, device: cuda
OLD Forward Execution Time (us) : 110.881
NEW Forward Execution Time (us) : 94.361

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 122.925
NEW Forward Execution Time (us) : 123.046

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
OLD Forward Execution Time (us) : 272.442
NEW Forward Execution Time (us) : 271.932

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
OLD Forward Execution Time (us) : 457.329
NEW Forward Execution Time (us) : 456.767

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda
OLD Forward Execution Time (us) : 117.688
NEW Forward Execution Time (us) : 87.133

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 873.764
NEW Forward Execution Time (us) : 865.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 1746.831
NEW Forward Execution Time (us) : 1730.252

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015
```
Pull Request resolved: #74129
Approved by: https://github.com/ngimel
facebook-github-bot pushed a commit that referenced this pull request Mar 22, 2022
Summary: same as the commit message above (revert of d5ca53c, #46097; benchmark results unchanged).

Pull Request resolved: #74129
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/14bf20cd922c3ba33c32343c19fd9ac490d4f7a6

Reviewed By: anjali411

Differential Revision: D34990460

fbshipit-source-id: 2bb09b9f60342f7bd23e856d4861d513dd3d104f
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
revert d5ca53c (pytorch#46097). The changes only affect ROCm. Reverts a work-around for a compiler performance issue that is no longer needed.
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jithunnair-amd pushed a commit to jithunnair-amd/pytorch that referenced this pull request Sep 20, 2022
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Sep 28, 2022