
Conversation

@pytorch-probot

pytorch-probot bot commented Dec 8, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/ce4cc729bd519a0a91b4b0bc25e18647196d424a/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-binary-conda ciflow/binaries, ciflow/binaries/conda 🚫 skipped
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-manywheel ciflow/binaries, ciflow/binaries/wheel 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-py3.6-clang9 ciflow/xla 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

facebook-github-bot commented Dec 8, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 00d8214 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jerryzh168 jerryzh168 requested a review from dzdang January 13, 2022 20:19
@Xia-Weiwen
Collaborator

Hi @jerryzh168, we have a few int8-related PRs targeting 1.11, and this is one of them. Could you prioritize reviewing these PRs? Thanks.

@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mingfeima
Collaborator Author

mingfeima commented Feb 16, 2022

Updates of this PR

The overall ideas of #69600 and #69601 are the same: improve qupsample_nearest2d and qupsample_bilinear2d performance on CPU.

Both the torch.contiguous and torch.channels_last memory formats are improved.
Both single-core and multi-core performance are improved.

The original kernel on the nchw layout is sequential and loops in the order {H, W, NC}, which leads to non-contiguous memory access on {NC} (usually we want the innermost loop to access contiguous memory). The benefit is that each index on the output feature map plane calculates its corresponding input feature map window only once.

If we simply loop in the order {NC, H, W}, the input index calculation is duplicated NC times, but the memory access is contiguous. This PR makes a tradeoff: pre-calculate the input indices on the {W} dimension so that the index calculation does not become overwhelming, and parallelize over {NC, H}; for normal widths the pre-calculated indices should stay resident in the L1 cache.
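The tradeoff above can be sketched roughly as follows. This is a minimal pure-Python stand-in with hypothetical names (the actual kernel is vectorized C++ in ATen, and the nearest-index rounding shown here is a simplification), shown only to illustrate the {NC, H, W} loop order with the {W} index table hoisted out of the loop:

```python
def upsample_nearest2d_nchw(inp, n, c, ih, iw, oh, ow):
    """inp: flat list of length n*c*ih*iw in NCHW order (illustrative only)."""
    scale_h, scale_w = ih / oh, iw / ow
    # Pre-compute the input column for every output column once; for normal
    # widths this small table stays resident in the L1 cache.
    iw_idx = [min(int(ow_i * scale_w), iw - 1) for ow_i in range(ow)]
    out = [0] * (n * c * oh * ow)
    for nc in range(n * c):            # the PR parallelizes over {NC, H}
        for oh_i in range(oh):
            ih_i = min(int(oh_i * scale_h), ih - 1)
            src = nc * ih * iw + ih_i * iw
            dst = nc * oh * ow + oh_i * ow
            for ow_i in range(ow):     # contiguous writes on the inner loop
                out[dst + ow_i] = inp[src + iw_idx[ow_i]]
    return out
```

Without the `iw_idx` table, the `int(ow_i * scale_w)` computation would be repeated for every one of the NC * H rows; with it, the per-element inner loop is just a table lookup plus a contiguous store.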

On the nhwc layout, the original kernel is sequential; this PR parallelizes over {N, H, W} and vectorizes over {C}.
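For nhwc the same idea can be sketched like this (again a hypothetical pure-Python stand-in, not the PR's C++ code): because channels are innermost, each output pixel is a contiguous copy of C elements, which is exactly the shape a vectorized kernel wants.

```python
def upsample_nearest2d_nhwc(inp, n, c, ih, iw, oh, ow):
    """inp: flat list of length n*ih*iw*c in NHWC order (illustrative only)."""
    scale_h, scale_w = ih / oh, iw / ow
    out = [0] * (n * oh * ow * c)
    for n_i in range(n):               # the PR parallelizes over {N, H, W}
        for oh_i in range(oh):
            ih_i = min(int(oh_i * scale_h), ih - 1)
            for ow_i in range(ow):
                iw_i = min(int(ow_i * scale_w), iw - 1)
                src = ((n_i * ih + ih_i) * iw + iw_i) * c
                dst = ((n_i * oh + oh_i) * ow + ow_i) * c
                # the inner copy over {C} is contiguous, so it vectorizes well
                out[dst:dst + c] = inp[src:src + c]
    return out
```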

Benchmarking

  • short tag: python -m pt.qinterpolate_test
  • long tag: python -m pt.qinterpolate_test --tag_filter long

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual sockets, 20 cores per socket.
Unit: us
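From the numbers, the speedup column in the tables below appears to be the before/after ratio expressed as a percentage (so values above 100% mean the new kernel is faster); a quick check against the first and last rows of table 1.a:

```python
def speedup(before_us, after_us):
    """Speedup as reported in the tables: before / after, as a percentage."""
    return before_us / after_us * 100

# rows from table 1.a below
assert f"{speedup(4.299, 4.653):.2f}%" == "92.39%"      # slight regression
assert f"{speedup(235.253, 212.63):.2f}%" == "110.64%"  # improvement
```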

1.a) single core run on short tag (nchw)

The first few cases in the short tag are too small to benefit (at these sizes the non-contiguous access issue is alleviated).

name before after speedup
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale0.5_contigTrue 4.299 4.653 92.39%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 6.678 7.013 95.22%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale2.0_contigTrue 4.107 4.448 92.33%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 6.56 7.168 91.52%
q_interpolate_M3_N720_K1280_dtypetorch.quint8_modebilinear_scale0.83333_contigTrue 235.253 212.63 110.64%

1.b) single socket run on short tag (nchw)

Again, the first few cases in the short tag are too small...

name before after speedup
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale0.5_contigTrue 4.511 4.881 92.42%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 7.871 8.038 97.92%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modenearest_scale2.0_contigTrue 4.511 4.66 96.80%
q_interpolate_M32_N32_K32_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 7.903 8.246 95.84%
q_interpolate_M3_N720_K1280_dtypetorch.quint8_modebilinear_scale0.83333_contigTrue 235.286 32.511 723.71%

2.a) single core run on long tag (nchw)

The single-core performance improves since the non-contiguous access is fixed.

name before after speedup
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigTrue 13863.09 4964.241 279.26%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigTrue 13634.78 5306.871 256.93%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigTrue 13831.59 4882.204 283.31%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 49545.01 12921.88 383.42%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigTrue 50306.27 13178.12 381.74%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 51555.32 13009.45 396.29%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigTrue 13874.62 4883.29 284.12%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigTrue 13602.07 5162.799 263.46%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigTrue 13787.98 4944.184 278.87%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigTrue 53256.83 14089.31 377.99%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigTrue 50800.25 13363.12 380.15%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigTrue 52862.44 14174.29 372.95%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigTrue 17129.43 6178.339 277.25%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigTrue 17361.16 6207.839 279.67%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigTrue 17096.63 6541.413 261.36%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigTrue 52785 20156.4 261.88%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigTrue 53095.87 20327.97 261.20%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigTrue 52570.46 21168.51 248.34%

2.b) single socket run on long tag (nchw)

name before after speedup
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigTrue 13300.31 343.232 3875.02%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigTrue 13487.85 337.747 3993.48%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigTrue 13305.43 335.961 3960.41%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigTrue 47771.67 912.114 5237.47%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigTrue 47581.28 931.016 5110.68%
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigTrue 48452.82 921.116 5260.23%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigTrue 13286.1 346.403 3835.45%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigTrue 13497.68 336.491 4011.30%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigTrue 13269.88 341.033 3891.08%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigTrue 49827.36 939.765 5302.11%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigTrue 50114.48 952.194 5263.05%
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigTrue 49826.7 925.876 5381.57%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigTrue 17416.73 742.857 2344.56%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigTrue 16776.71 894.344 1875.87%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigTrue 17460.01 770.085 2267.28%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigTrue 52246.15 1594.428 3276.80%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigTrue 52580.77 1480.045 3552.65%
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigTrue 52396.78 1480.32 3539.56%

3.a) single core run (nchw vs. nhwc)

qupsample favors nhwc over nchw since the kernel can be vectorized over the channel dimension on nhwc.

name nchw nhwc
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigFalse 4964.241 140.476
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigFalse 5306.871 139.838
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigFalse 4882.204 136.656
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigFalse 12921.88 3570.737
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigFalse 13178.12 3576.925
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigFalse 13009.45 3571.357
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigFalse 4883.29 133.408
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigFalse 5162.799 132.818
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigFalse 4944.184 139.25
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigFalse 14089.31 3561.07
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigFalse 13363.12 3576.055
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigFalse 14174.29 3558.409
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigFalse 6178.339 512.442
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigFalse 6207.839 514.772
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigFalse 6541.413 484.328
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigFalse 20156.4 9581.45
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigFalse 20327.97 9357.744
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigFalse 21168.51 9769.816

3.b) single socket run (nchw vs. nhwc)

name nchw nhwc
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale0.5_contigFalse 343.232 11.9
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale1.0_contigFalse 337.747 11.592
q_interpolate_M512_N512_K512_dtypetorch.quint8_modenearest_scale2.0_contigFalse 335.961 11.786
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale0.5_contigFalse 912.114 246.446
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale1.0_contigFalse 931.016 246.701
q_interpolate_M512_N512_K512_dtypetorch.quint8_modebilinear_scale2.0_contigFalse 921.116 246.774
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale0.5_contigFalse 346.403 11.721
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale1.0_contigFalse 336.491 11.786
q_interpolate_M512_N512_K512_dtypetorch.qint8_modenearest_scale2.0_contigFalse 341.033 11.684
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale0.5_contigFalse 939.765 247.561
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale1.0_contigFalse 952.194 247.61
q_interpolate_M512_N512_K512_dtypetorch.qint8_modebilinear_scale2.0_contigFalse 925.876 247.533
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale0.5_contigFalse 742.857 22.685
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale1.0_contigFalse 894.344 22.114
q_interpolate_M512_N512_K512_dtypetorch.qint32_modenearest_scale2.0_contigFalse 770.085 22.411
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale0.5_contigFalse 1594.428 673.123
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale1.0_contigFalse 1480.045 679.605
q_interpolate_M512_N512_K512_dtypetorch.qint32_modebilinear_scale2.0_contigFalse 1480.32 665.11

@mingfeima mingfeima added release notes: quantization release notes category topic: performance topic category labels Feb 21, 2022
@dzdang
Contributor

dzdang commented Apr 7, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang
Contributor

dzdang commented Apr 14, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pytorchbot merge this

(Initiating merge automatically since Phabricator Diff has merged)

facebook-github-bot pushed a commit that referenced this pull request Apr 14, 2022
Summary: Pull Request resolved: #69600

Differential Revision: D33353153

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Pulled By: dzdang

fbshipit-source-id: 15cb72d043b371f251dc3f2e03e6cb0243c6922c
@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/54/head branch April 18, 2022 14:17