add channels last support for ReflectionPad on CPU #99608
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99608. Note: links to docs will display an error until the docs builds have completed. ❌ 2 new failures as of commit 1eac7e6. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` This patch improves padding performance on CPU, which: * original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. * original kernel is scalar logic. This patch did vectorization on dim of **width** on NCHW, did vectorization on **channels** on NHWC. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ; NHWC: 0.356 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; ; NHWC: 86.821 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms; ; NHWC: 0.339 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms; ; NHWC: 82.935 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ; NHWC: 0.328 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ; NHWC: 16.806 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms; ; NHWC: 0.324 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms; ; NHWC: 16.717 ms ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ; NHWC: 0.142 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; ; NHWC: 7.367 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms; ; NHWC: 0.135 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms; ; NHWC: 7.203 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ; NHWC: 0.027 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ; NHWC: 3.181 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms; ; NHWC: 0.029 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms; ; NHWC: 3.174 ms ``` Notes: * when C < vector length: on NCHW, the vectorization is done on **width** on NCHW when the output index is overlapped with the input index; on NHWC, it is scalar logic, so it will be slower than NCHW. * when C >= vector length: NCHW and NHWC perf are similar. cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` This patch improves padding performance on CPU, which: * original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. * original kernel is scalar logic. This patch did vectorization on dim of **width** on NCHW, did vectorization on **channels** on NHWC. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ; NHWC: 0.356 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; ; NHWC: 86.821 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms; ; NHWC: 0.339 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms; ; NHWC: 82.935 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ; NHWC: 0.328 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ; NHWC: 16.806 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms; ; NHWC: 0.324 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms; ; NHWC: 16.717 ms ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ; NHWC: 0.142 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; ; NHWC: 7.367 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms; ; NHWC: 0.135 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms; ; NHWC: 7.203 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ; NHWC: 0.027 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ; NHWC: 3.181 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms; ; NHWC: 0.029 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms; ; NHWC: 3.174 ms ``` Notes: * when C < vector length: on NCHW, the vectorization is done on **width** on NCHW when the output index is overlapped with the input index; on NHWC, it is scalar logic, so it will be slower than NCHW. * when C >= vector length: NCHW and NHWC perf are similar. cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` This patch improves padding performance on CPU, which: * original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. * original kernel is scalar logic. This patch did vectorization on dim of **width** on NCHW, did vectorization on **channels** on NHWC. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ; NHWC: 0.356 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; ; NHWC: 86.821 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms; ; NHWC: 0.339 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms; ; NHWC: 82.935 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ; NHWC: 0.328 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ; NHWC: 16.806 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms; ; NHWC: 0.324 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms; ; NHWC: 16.717 ms ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ; NHWC: 0.142 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; ; NHWC: 7.367 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms; ; NHWC: 0.135 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms; ; NHWC: 7.203 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ; NHWC: 0.027 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ; NHWC: 3.181 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms; ; NHWC: 0.029 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms; ; NHWC: 3.174 ms ``` Notes: * when C < vector length: on NCHW, the vectorization is done on **width** on NCHW when the output index is overlapped with the input index; on NHWC, it is scalar logic, so it will be slower than NCHW. * when C >= vector length: NCHW and NHWC perf are similar. cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` This patch improves padding performance on CPU, which: * original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. * original kernel is scalar logic. This patch did vectorization on dim of **width** on NCHW, did vectorization on **channels** on NHWC. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ; NHWC: 0.356 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; ; NHWC: 86.821 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms; ; NHWC: 0.339 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms; ; NHWC: 82.935 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ; NHWC: 0.328 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ; NHWC: 16.806 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms; ; NHWC: 0.324 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms; ; NHWC: 16.717 ms ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ; NHWC: 0.142 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; ; NHWC: 7.367 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms; ; NHWC: 0.135 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms; ; NHWC: 7.203 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ; NHWC: 0.027 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ; NHWC: 3.181 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms; ; NHWC: 0.029 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms; ; NHWC: 3.174 ms ``` Notes: * when C < vector length: on NCHW, the vectorization is done on **width** on NCHW when the output index is overlapped with the input index; on NHWC, it is scalar logic, so it will be slower than NCHW. * when C >= vector length: NCHW and NHWC perf are similar. cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` This patch improves padding performance on CPU, which: * original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. * original kernel is scalar logic. This patch did vectorization on dim of **width** on NCHW, did vectorization on **channels** on NHWC. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ; NHWC: 0.356 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; ; NHWC: 86.821 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms; ; NHWC: 0.339 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms; ; NHWC: 82.935 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ; NHWC: 0.328 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ; NHWC: 16.806 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms; ; NHWC: 0.324 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms; ; NHWC: 16.717 ms ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ; NHWC: 0.142 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; ; NHWC: 7.367 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms; ; NHWC: 0.135 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms; ; NHWC: 7.203 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ; NHWC: 0.027 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ; NHWC: 3.181 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms; ; NHWC: 0.029 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms; ; NHWC: 3.174 ms ``` Notes: * when C < vector length: on NCHW, the vectorization is done on **width** on NCHW when the output index is overlapped with the input index; on NHWC, it is scalar logic, so it will be slower than NCHW. * when C >= vector length: NCHW and NHWC perf are similar. cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` This patch improves padding performance on CPU, which: * original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. * original kernel is scalar logic. This patch did vectorization on dim of **width** on NCHW, did vectorization on **channels** on NHWC. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ; NHWC: 0.356 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; ; NHWC: 86.821 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms; ; NHWC: 0.339 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms; ; NHWC: 82.935 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ; NHWC: 0.328 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ; NHWC: 16.806 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms; ; NHWC: 0.324 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms; ; NHWC: 16.717 ms ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ; NHWC: 0.142 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; ; NHWC: 7.367 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms; ; NHWC: 0.135 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms; ; NHWC: 7.203 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ; NHWC: 0.027 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ; NHWC: 3.181 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms; ; NHWC: 0.029 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms; ; NHWC: 3.174 ms ``` Notes: * when C < vector length: on NCHW, the vectorization is done on **width** on NCHW when the output index is overlapped with the input index; on NHWC, it is scalar logic, so it will be slower than NCHW. * when C >= vector length: NCHW and NHWC perf are similar. cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
@cpuhrsch could you please help review this one?
@mingfeima - This diff is too large (+1,020 −1,714). Can this be split up? Are there some code moves in here that could be split out into separate PRs in this stack?
Please split this large set of changes (+1,020 −1,714) into a stack of smaller PRs
Sure, just coming back from holiday; I will split this one into smaller PRs.
…nPad on CPU" Fix #96738 This patch add channels last support for ReflectionPad2d/3d and ReplicationPad2d/3d on CPU backend. The following test cases will pass with this patch: ``` python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32 python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32 ``` This patch improves padding performance on CPU, which: * original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. * original kernel is scalar logic. This patch did vectorization on dim of **width** on NCHW, did vectorization on **channels** on NHWC. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ; NHWC: 0.356 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; ; NHWC: 86.821 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms; ; NHWC: 0.339 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms; ; NHWC: 82.935 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ; NHWC: 0.328 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ; NHWC: 16.806 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms; ; NHWC: 0.324 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms; ; NHWC: 16.717 ms ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ; NHWC: 0.142 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; ; NHWC: 7.367 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms; ; NHWC: 0.135 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms; ; NHWC: 7.203 ms (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ; NHWC: 0.027 ms ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ; NHWC: 3.181 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms; ; NHWC: 0.029 ms ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms; ; NHWC: 3.174 ms ``` Notes: * when C < vector length: on NCHW, the vectorization is done on **width** on NCHW when the output index is overlapped with the input index; on NHWC, it is scalar logic, so it will be slower than NCHW. * when C >= vector length: NCHW and NHWC perf are similar. cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
|
@cpuhrsch could you please help review again? I have separated the original PR into two PRs: one for ReflectionPad and one for ReplicationPad.
```cpp
auto pad_r = padding[1];

// allow empty batch size but not other dimensions.
at::native::padding::check_valid_input<1>(input);
```
Thanks for making the updates @mingfeima! I think it looks much better. I do have to ask for another small refactor so I can more easily accept and defend this change.
While I agree that this unifying function is useful, I have to worry about whether we're maintaining the same error messages and error behavior across both code paths.
Can you split the refactor of these input sanitization functions into PRs below this stack?
I want to make sure we separate the task of "refactor existing error checking code for both CPU and CUDA" from "add an optimized kernel for a new feature" and from "add a new feature (channels last)". I think this PR currently does those three things at once.
Also, the error-checking functions should live in this file and not in the cpu subfolder, since they're not CPU-specific and we don't need to compile them multiple times for the various instruction sets.
Thank you!
OK, sure will have it done :)
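As context for the check being discussed: the diff comment above says empty batch sizes are allowed but other empty dimensions are not. A minimal sketch of how that behavior can be exercised from Python follows; the exact error text is an assumption and is precisely what the requested refactor aims to keep consistent between CPU and CUDA.

```python
import torch

pad = torch.nn.ReflectionPad2d((2, 2, 2, 2))

# An empty batch dimension is allowed: the output keeps N == 0.
out = pad(torch.randn(0, 3, 8, 8))
print(out.shape)  # expected: torch.Size([0, 3, 12, 12])

# A zero-sized non-batch dimension should be rejected with a RuntimeError.
try:
    pad(torch.randn(1, 0, 8, 8))
except RuntimeError as e:
    print("rejected:", e)
```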
Fix #96738

This patch adds channels-last support for ReflectionPad2d/3d on the CPU backend. The following test cases pass with this patch:

```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
```

This patch also improves padding performance on CPU:

* The original kernel has nested parallel loops, e.g. first over the **batch** dim and then over the **channels** dim, which is suboptimal when N * C is small. This patch collapses NC and the adjacent spatial dims to maximize the parallelism scope.
* The original kernel is scalar logic. This patch vectorizes along **width** for NCHW and along **channels** for NHWC.

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference

```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NCHW: 0.281 ms; NHWC: 0.356 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NCHW: 55.675 ms; NHWC: 86.821 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NCHW: 0.049 ms; NHWC: 0.328 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NCHW: 17.252 ms; NHWC: 16.806 ms
```

### single socket inference

```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NCHW: 0.118 ms; NHWC: 0.142 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NCHW: 4.023 ms; NHWC: 7.367 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NCHW: 0.010 ms; NHWC: 0.027 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NCHW: 3.149 ms; NHWC: 3.181 ms
```

Notes:

* When C < vector length: on NCHW, vectorization is done along **width** where the output index overlaps the input index; on NHWC, the kernel falls back to scalar logic, so it will be slower than NCHW.
* When C >= vector length: NCHW and NHWC performance are similar.

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
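To exercise the channels-last path described above from the Python side, a small check like the following can be used. It is a sketch using only public PyTorch APIs, confirming that a channels-last input yields a channels-last output with the same values as the contiguous path once the patch is in place.

```python
import torch

pad = torch.nn.ReflectionPad2d((2, 2, 2, 2))
x = torch.randn(128, 64, 56, 56)
x_cl = x.to(memory_format=torch.channels_last)

out = pad(x)        # NCHW (contiguous) path
out_cl = pad(x_cl)  # NHWC (channels-last) path added by this patch

print(out_cl.is_contiguous(memory_format=torch.channels_last))  # expected: True
print(torch.allclose(out, out_cl))                              # expected: True
```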
Replacement of #99608, breaking the old PR into smaller ones. This one handles the common error messages from both the CPU and CUDA devices, to simplify the code. Pull Request resolved: #102253. Approved by: https://github.com/cpuhrsch, https://github.com/albanD
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as `Stale`.
Stack from ghstack (oldest at bottom):
Fix #96738
This patch adds channels-last support for ReflectionPad2d/3d on the CPU backend. The following test cases pass with this patch: `TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32` and `TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32` in `test_modules.py`.
This patch improves padding performance on CPU:
* The original kernel has nested parallel loops, e.g. first over the **batch** dim and then over the **channels** dim, which is suboptimal when N * C is small. This patch collapses NC and the adjacent spatial dims to maximize the parallelism scope.
* The original kernel is scalar logic. This patch vectorizes along **width** for NCHW and along **channels** for NHWC.
The benchmark results shown above (single-core and single-socket inference) were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.
Notes:
* When C < vector length: on NCHW, vectorization is done along **width** where the output index overlaps the input index; on NHWC, the kernel falls back to scalar logic, so it will be slower than NCHW.
* When C >= vector length: NCHW and NHWC performance are similar.
cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10