
Conversation

@pytorch-probot

pytorch-probot bot commented Dec 8, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/ce4cc729bd519a0a91b4b0bc25e18647196d424a/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-binary-conda ciflow/binaries, ciflow/binaries/conda 🚫 skipped
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-manywheel ciflow/binaries, ciflow/binaries/wheel 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-py3.6-clang9 ciflow/xla 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

facebook-github-bot commented Dec 8, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 00d8214 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jerryzh168 jerryzh168 requested a review from dzdang January 13, 2022 20:19
@Xia-Weiwen
Collaborator

Hi @jerryzh168, we have a few int8-related PRs targeting 1.11, and this is one of them. Could you prioritize reviewing these PRs? Thanks.

@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mingfeima
Collaborator Author

mingfeima commented Feb 16, 2022

Updates of this PR

The overall ideas of #69600 and #69601 are the same: improve qupsample_nearest2d and qupsample_bilinear2d performance on CPU.

Both the torch.contiguous and torch.channels_last memory formats are improved.
Both single-core and multi-core performance are improved.

The original kernel on the nchw layout is sequential and loops in the order {H, W, NC}, which leads to non-contiguous memory access on {NC} (usually we want the innermost loop to access contiguous memory). The benefit is that each index on the output feature map plane calculates its corresponding input feature map window only once.

If we simply loop in the order {NC, H, W}, the input index calculation is duplicated NC times, but the memory access is contiguous. This PR makes a tradeoff: pre-calculate the input indices on the {W} dimension so that the index calculation does not become overwhelming, and parallelize over {NC, H}; for normal widths the pre-calculated indices should stay resident in the L1 cache.
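The tradeoff above can be sketched roughly as follows. This is a minimal pure-Python stand-in with hypothetical names (the actual kernel is vectorized C++ in ATen, and the nearest-index rounding shown here is a simplification), shown only to illustrate the {NC, H, W} loop order with the {W} index table hoisted out of the loop:

```python
def upsample_nearest2d_nchw(inp, n, c, ih, iw, oh, ow):
    """inp: flat list of length n*c*ih*iw in NCHW order (illustrative only)."""
    scale_h, scale_w = ih / oh, iw / ow
    # Pre-compute the input column for every output column once; for normal
    # widths this small table stays resident in the L1 cache.
    iw_idx = [min(int(ow_i * scale_w), iw - 1) for ow_i in range(ow)]
    out = [0] * (n * c * oh * ow)
    for nc in range(n * c):            # the PR parallelizes over {NC, H}
        for oh_i in range(oh):
            ih_i = min(int(oh_i * scale_h), ih - 1)
            src = nc * ih * iw + ih_i * iw
            dst = nc * oh * ow + oh_i * ow
            for ow_i in range(ow):     # contiguous writes on the inner loop
                out[dst + ow_i] = inp[src + iw_idx[ow_i]]
    return out
```

Without the `iw_idx` table, the `int(ow_i * scale_w)` computation would be repeated for every one of the NC * H rows; with it, the per-element inner loop is just a table lookup plus a contiguous store.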

On the nhwc layout, the original kernel is sequential; this PR parallelizes over {N, H, W} and vectorizes over {C}.
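For nhwc the same idea can be sketched like this (again a hypothetical pure-Python stand-in, not the PR's C++ code): because channels are innermost, each output pixel is a contiguous copy of C elements, which is exactly the shape a vectorized kernel wants.

```python
def upsample_nearest2d_nhwc(inp, n, c, ih, iw, oh, ow):
    """inp: flat list of length n*ih*iw*c in NHWC order (illustrative only)."""
    scale_h, scale_w = ih / oh, iw / ow
    out = [0] * (n * oh * ow * c)
    for n_i in range(n):               # the PR parallelizes over {N, H, W}
        for oh_i in range(oh):
            ih_i = min(int(oh_i * scale_h), ih - 1)
            for ow_i in range(ow):
                iw_i = min(int(ow_i * scale_w), iw - 1)
                src = ((n_i * ih + ih_i) * iw + iw_i) * c
                dst = ((n_i * oh + oh_i) * ow + ow_i) * c
                # the inner copy over {C} is contiguous, so it vectorizes well
                out[dst:dst + c] = inp[src:src + c]
    return out
```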

Benchmarking

  • short tag: python -m pt.qinterpolate_test
  • long tag: python -m pt.qinterpolate_test --tag_filter long

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual sockets, 20 cores per socket.
Unit: us
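From the numbers, the speedup column in the tables below appears to be the before/after ratio expressed as a percentage (so values above 100% mean the new kernel is faster); a quick check against the first and last rows of table 1.a:

```python
def speedup(before_us, after_us):
    """Speedup as reported in the tables: before / after, as a percentage."""
    return before_us / after_us * 100

# rows from table 1.a below
assert f"{speedup(4.299, 4.653):.2f}%" == "92.39%"      # slight regression
assert f"{speedup(235.253, 212.63):.2f}%" == "110.64%"  # improvement
```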

1.a) single core run on short tag (nchw)

The first few cases in the short tag are too small to benefit (at these sizes the non-contiguous access issue is alleviated).

name before after speedup
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale0.5_contigTrue 4.299 4.653 92.39%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 6.678 7.013 95.22%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale2.0_contigTrue 4.107 4.448 92.33%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 6.56 7.168 91.52%
q_interpolate_M3_N720_K1280_dtypetorch.quint8_modebilinear_scale0.83333_contigTrue 235.253 212.63 110.64%

1.b) single socket run on short tag (nchw)

Again, the first few cases in the short tag are too small...

name before after speedup
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale0.5_contigTrue 4.511 4.881 92.42%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 7.871 8.038 97.92%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale2.0_contigTrue 4.511 4.66 96.80%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 7.903 8.246 95.84%
q_interpolate_M3_N720_K1280_dtypetorch.quint8_modebilinear_scale0.83333_contigTrue 235.286 32.511 723.71%

2.a) single core run on long tag (nchw)

The single-core performance improves since the non-contiguous access is fixed.

name before after speedup
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigTrue 13863.09 4964.241 279.26%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigTrue 13634.78 5306.871 256.93%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigTrue 13831.59 4882.204 283.31%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 49545.01 12921.88 383.42%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigTrue 50306.27 13178.12 381.74%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 51555.32 13009.45 396.29%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigTrue 13874.62 4883.29 284.12%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigTrue 13602.07 5162.799 263.46%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigTrue 13787.98 4944.184 278.87%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigTrue 53256.83 14089.31 377.99%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigTrue 50800.25 13363.12 380.15%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigTrue 52862.44 14174.29 372.95%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigTrue 17129.43 6178.339 277.25%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigTrue 17361.16 6207.839 279.67%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigTrue 17096.63 6541.413 261.36%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigTrue 52785 20156.4 261.88%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigTrue 53095.87 20327.97 261.20%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigTrue 52570.46 21168.51 248.34%

2.b) single socket run on long tag (nchw)

name before after speedup
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigTrue 13300.31 343.232 3875.02%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigTrue 13487.85 337.747 3993.48%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigTrue 13305.43 335.961 3960.41%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 47771.67 912.114 5237.47%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigTrue 47581.28 931.016 5110.68%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 48452.82 921.116 5260.23%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigTrue 13286.1 346.403 3835.45%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigTrue 13497.68 336.491 4011.30%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigTrue 13269.88 341.033 3891.08%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigTrue 49827.36 939.765 5302.11%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigTrue 50114.48 952.194 5263.05%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigTrue 49826.7 925.876 5381.57%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigTrue 17416.73 742.857 2344.56%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigTrue 16776.71 894.344 1875.87%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigTrue 17460.01 770.085 2267.28%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigTrue 52246.15 1594.428 3276.80%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigTrue 52580.77 1480.045 3552.65%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigTrue 52396.78 1480.32 3539.56%

3.a) single core run (nchw vs. nhwc)

qupsample favors nhwc over nchw since the kernel can be vectorized over the channel dimension on nhwc.

name nchw nhwc
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigFalse 4964.241 140.476
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigFalse 5306.871 139.838
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigFalse 4882.204 136.656
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigFalse 12921.88 3570.737
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigFalse 13178.12 3576.925
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigFalse 13009.45 3571.357
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigFalse 4883.29 133.408
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigFalse 5162.799 132.818
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigFalse 4944.184 139.25
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigFalse 14089.31 3561.07
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigFalse 13363.12 3576.055
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigFalse 14174.29 3558.409
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigFalse 6178.339 512.442
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigFalse 6207.839 514.772
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigFalse 6541.413 484.328
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigFalse 20156.4 9581.45
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigFalse 20327.97 9357.744
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigFalse 21168.51 9769.816

3.b) single socket run (nchw vs. nhwc)

name nchw nhwc
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigFalse 343.232 11.9
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigFalse 337.747 11.592
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigFalse 335.961 11.786
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigFalse 912.114 246.446
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigFalse 931.016 246.701
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigFalse 921.116 246.774
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigFalse 346.403 11.721
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigFalse 336.491 11.786
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigFalse 341.033 11.684
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigFalse 939.765 247.561
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigFalse 952.194 247.61
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigFalse 925.876 247.533
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigFalse 742.857 22.685
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigFalse 894.344 22.114
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigFalse 770.085 22.411
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigFalse 1594.428 673.123
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigFalse 1480.045 679.605
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigFalse 1480.32 665.11

@mingfeima mingfeima added release notes: quantization release notes category topic: performance topic category labels Feb 21, 2022
@dzdang
Contributor

dzdang commented Apr 7, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang
Contributor

dzdang commented Apr 14, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pytorchbot merge this

(Initiating merge automatically since Phabricator Diff has merged)

facebook-github-bot pushed a commit that referenced this pull request Apr 14, 2022
Summary: Pull Request resolved: #69600

Differential Revision: D33353153

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Pulled By: dzdang

fbshipit-source-id: 15cb72d043b371f251dc3f2e03e6cb0243c6922c
@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/54/head branch April 18, 2022 14:17