
[ROCm] revert cat operator performance work-around #74129

Closed
wants to merge 1 commit

Conversation

jeffdaily
Collaborator

Reverts d5ca53c (#46097). The changes only affect ROCm. Removes a work-around for a compiler performance issue that is no longer needed.
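For context, a benchmark name like `cat_sizes(512,512,2)_N2_dim1_cuda` encodes: concatenate N=2 inputs, each of shape (512, 512, 2), along dim=1. A hypothetical pure-Python stand-in (not the benchmarked CUDA kernel) illustrating that semantics on nested lists:

```python
# Pure-Python stand-in for tensor concatenation (illustration only,
# not the CUDA kernel this PR touches).
def cat(tensors, dim):
    """Concatenate equal-shaped nested-list 'tensors' along dimension 'dim'."""
    if dim == 0:
        # Along the outermost dimension, concatenation is plain list concatenation.
        return [row for t in tensors for row in t]
    # Otherwise recurse: zip aligns the outer entries, then concatenate
    # one dimension deeper.
    return [cat(rows, dim - 1) for rows in zip(*tensors)]

a = [[1, 2], [3, 4]]       # shape (2, 2)
b = [[5, 6], [7, 8]]       # shape (2, 2)
print(cat([a, b], dim=0))  # shape (4, 2): [[1, 2], [3, 4], [5, 6], [7, 8]]
print(cat([a, b], dim=1))  # shape (2, 4): [[1, 2, 5, 6], [3, 4, 7, 8]]
```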

`python -m pt.cat_test --tag_filter all --device cuda`

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 48.833
NEW Forward Execution Time (us) : 8.318

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 54.508
NEW Forward Execution Time (us) : 23.824

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.117
NEW Forward Execution Time (us) : 14.942

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.790
NEW Forward Execution Time (us) : 74.334

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 102.063
NEW Forward Execution Time (us) : 76.008

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 167.786
NEW Forward Execution Time (us) : 123.679

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1b1dec7b00>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1b1dec7b00>, 111, 65], N: 5, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.320
NEW Forward Execution Time (us) : 67.436

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1b1dec7a70>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1b1dec7a70>, 64], N: 5, dim: 1, device: cuda
OLD Forward Execution Time (us) : 91.484
NEW Forward Execution Time (us) : 59.230

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f18db09d290>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f18db09d290>], N: 5, dim: 2, device: cuda
OLD Forward Execution Time (us) : 109.569
NEW Forward Execution Time (us) : 76.557

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d560>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d560>, 32, 64], N: 50, dim: 0, device: cuda
OLD Forward Execution Time (us) : 106.603
NEW Forward Execution Time (us) : 87.635

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f18db09d5f0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f18db09d5f0>, 64], N: 50, dim: 1, device: cuda
OLD Forward Execution Time (us) : 106.693
NEW Forward Execution Time (us) : 88.902

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f18db09d680>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f18db09d680>], N: 50, dim: 2, device: cuda
OLD Forward Execution Time (us) : 110.881
NEW Forward Execution Time (us) : 94.361

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 122.925
NEW Forward Execution Time (us) : 123.046

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
OLD Forward Execution Time (us) : 272.442
NEW Forward Execution Time (us) : 271.932

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
OLD Forward Execution Time (us) : 457.329
NEW Forward Execution Time (us) : 456.767

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda
OLD Forward Execution Time (us) : 117.688
NEW Forward Execution Time (us) : 87.133

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 873.764
NEW Forward Execution Time (us) : 865.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 1746.831
NEW Forward Execution Time (us) : 1730.252

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015
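
The OLD/NEW rows above come from a micro-benchmark harness that times the forward call and reports a mean per-call latency in microseconds. A rough sketch of that measurement pattern (assumptions: a `timeit`-based timer and a list-concatenation stand-in workload, not the actual `pt.cat_test` harness; a real GPU measurement would additionally need device synchronization, e.g. `torch.cuda.synchronize()`, before reading the clock):

```python
import timeit

def benchmark_us(fn, iters=1000):
    """Mean wall-clock time per call of fn, in microseconds."""
    total_s = timeit.timeit(fn, number=iters)
    return total_s / iters * 1e6

# Stand-in workload: concatenating two sequences along the outer dimension.
a = list(range(512))
b = list(range(512))
t = benchmark_us(lambda: a + b)
print(f"Forward Execution Time (us) : {t:.3f}")
```

Without synchronization, timing an asynchronous CUDA/HIP op this way would measure only kernel launch overhead, which is why harness details matter when comparing OLD vs. NEW numbers.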

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch ciflow/default labels Mar 11, 2022

pytorch-bot bot commented Mar 11, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/ROCmSoftwarePlatform/pytorch/blob/4bbd81b0c7337416574a18c4216537aaf399ed50/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-manywheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
windows-binary-libtorch-debug ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-libtorch-release ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-wheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-rocm4.5-py3.7-distributed ciflow/all, ciflow/linux, ciflow/rocm, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 11, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 4bbd81b (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@jeffdaily jeffdaily marked this pull request as ready for review March 14, 2022 17:23
@samdow samdow requested a review from ngimel March 14, 2022 19:30
@samdow samdow added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 14, 2022
@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Mar 21, 2022

@pytorchbot merge this please

@github-actions

Hey @jeffdaily.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@jeffdaily jeffdaily added the release notes: rocm and topic: performance labels Mar 21, 2022
facebook-github-bot pushed a commit that referenced this pull request Mar 22, 2022
Summary:
revert d5ca53c (#46097).  The changes only affect ROCm.  Reverts a work-around for a compiler performance issue that is no longer needed.

Pull Request resolved: #74129
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/14bf20cd922c3ba33c32343c19fd9ac490d4f7a6

Reviewed By: anjali411

Differential Revision: D34990460

fbshipit-source-id: 2bb09b9f60342f7bd23e856d4861d513dd3d104f
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
revert d5ca53c (#46097).  The changes only affect ROCm.  Reverts a work-around for a compiler performance issue that is no longer needed.

Pull Request resolved: #74129
Approved by: https://github.com/ngimel
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
revert d5ca53c (pytorch#46097).  The changes only affect ROCm.  Reverts a work-around for a compiler performance issue that is no longer needed.

# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
OLD Forward Execution Time (us) : 457.329
NEW Forward Execution Time (us) : 456.767

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda
OLD Forward Execution Time (us) : 117.688
NEW Forward Execution Time (us) : 87.133

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 873.764
NEW Forward Execution Time (us) : 865.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 1746.831
NEW Forward Execution Time (us) : 1730.252

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015
```
Pull Request resolved: pytorch#74129
Approved by: https://github.com/ngimel
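For readers skimming the dump above, each OLD/NEW pair can be reduced to a speedup ratio with a short stdlib-only script. This is a standalone sketch, not part of the benchmark harness; the parsing regex is an assumption based on the output format shown here.

```python
import re

def speedups(benchmark_text):
    """Return OLD/NEW speedup ratios parsed from the benchmark output."""
    # Match lines like "OLD Forward Execution Time (us) : 48.833".
    times = re.findall(
        r"(OLD|NEW) Forward Execution Time \(us\) : ([0-9.]+)", benchmark_text
    )
    ratios = []
    # Timings appear as alternating OLD/NEW lines; pair them up in order.
    for (old_tag, old_us), (new_tag, new_us) in zip(times[::2], times[1::2]):
        assert old_tag == "OLD" and new_tag == "NEW"
        ratios.append(float(old_us) / float(new_us))
    return ratios

sample = """\
OLD Forward Execution Time (us) : 48.833
NEW Forward Execution Time (us) : 8.318
OLD Forward Execution Time (us) : 54.508
NEW Forward Execution Time (us) : 23.824
"""
print([round(s, 2) for s in speedups(sample)])  # → [5.87, 2.29]
```

Run against the full dump, this shows the largest wins on small tensors (roughly 5–6x), where kernel launch overhead dominated, and near-parity on the large 5-D cases.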
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015
```
Pull Request resolved: pytorch#74129
Approved by: https://github.com/ngimel
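The OLD/NEW timings above are easiest to compare as speedup ratios. A minimal stdlib-only sketch (the case names and timing pairs are copied from the benchmark output above; the helper itself is illustrative and not part of this PR):

```python
# Sketch: turn a few of the OLD/NEW timings reported above into speedup ratios.
# The numbers are copied verbatim from the benchmark output; the helper is
# illustrative only.

timings_us = {
    "cat_sizes(1,1,1)_N2_dim0_cuda": (48.833, 8.318),
    "cat_sizes(512,512,2)_N2_dim1_cuda": (54.508, 23.824),
    "cat_sizes(1024,1024,2)_N2_dim0_cuda": (98.790, 74.334),
    "cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda": (52.063, 7.904),
}

def speedup(old_us: float, new_us: float) -> float:
    """Return how many times faster the NEW timing is than the OLD one."""
    return old_us / new_us

for name, (old, new) in timings_us.items():
    print(f"{name}: {speedup(old, new):.2f}x")
```

The small-input cases show roughly 5-6x improvements while the large-copy cases are closer to 1.3x, which suggests the reverted workaround mostly added per-launch overhead rather than slowing the bulk copy itself.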
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jithunnair-amd pushed a commit to jithunnair-amd/pytorch that referenced this pull request Sep 20, 2022
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Sep 28, 2022
Labels
cla signed · module: rocm (AMD GPU support for PyTorch) · open source · release notes: rocm (mandatorylabel) · topic: performance (topic category) · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)