Conversation

@mingfeima (Collaborator) commented Apr 20, 2023

Stack from ghstack (oldest at bottom):

Fix #96738

This patch adds channels-last support for ReflectionPad2d/3d on the CPU backend. The following test cases pass with this patch:

python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
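
These memory-format tests check, among other things, that the padding op preserves a channels-last input layout. A minimal LibTorch sketch of that behavior (the shape and padding values here are illustrative, not taken from the tests):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // Build the module and a channels-last (NHWC) input.
  torch::nn::ReflectionPad2d pad(torch::nn::ReflectionPad2dOptions({2, 2, 2, 2}));
  auto x = torch::randn({1, 3, 224, 224})
               .contiguous(torch::MemoryFormat::ChannelsLast);
  auto y = pad->forward(x);
  // Should print 1 (true) once the CPU kernel supports channels last.
  std::cout << y.is_contiguous(torch::MemoryFormat::ChannelsLast) << "\n";
}
```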

This patch also improves padding performance on CPU:

  • The original kernel has nested parallel loops, e.g. first over the batch dimension and then over the channels dimension, which is not optimal when N * C is small. This patch collapses NC and the adjacent spatial dimensions to maximize the parallelism scope.
  • The original kernel uses scalar logic. This patch vectorizes along the width dimension for NCHW and along the channels dimension for NHWC (see the sketch after this list).
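
To make these two points concrete, here is a simplified sketch of the NHWC idea, not the actual kernel from this PR (the function name and signature are made up for illustration): the parallel range is the collapsed N * output_H * output_W, and the per-pixel copy is vectorized over the channel dimension, which is innermost and contiguous in NHWC.

```cpp
#include <ATen/Parallel.h>
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Hypothetical sketch: reflection padding for a float NHWC tensor.
void reflection_pad2d_nhwc_sketch(
    const float* input, float* output,
    int64_t nbatch, int64_t channels,
    int64_t input_h, int64_t input_w,
    int64_t output_h, int64_t output_w,
    int64_t pad_t, int64_t pad_l) {
  using Vec = at::vec::Vectorized<float>;
  // One flat parallel range over batch and output spatial dims,
  // instead of nested parallel loops over batch and channels.
  at::parallel_for(0, nbatch * output_h * output_w, at::internal::GRAIN_SIZE,
      [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      int64_t n = i / (output_h * output_w);
      int64_t oh = (i / output_w) % output_h;
      int64_t ow = i % output_w;

      // Reflect the output coordinate back into the input range.
      int64_t ih = oh - pad_t;
      if (ih < 0) ih = -ih;
      if (ih >= input_h) ih = 2 * (input_h - 1) - ih;
      int64_t iw = ow - pad_l;
      if (iw < 0) iw = -iw;
      if (iw >= input_w) iw = 2 * (input_w - 1) - iw;

      const float* src = input + ((n * input_h + ih) * input_w + iw) * channels;
      float* dst = output + ((n * output_h + oh) * output_w + ow) * channels;

      // Vectorized copy over channels, with a scalar tail for C not a
      // multiple of the vector length (and for C < vector length).
      int64_t c = 0;
      for (; c + Vec::size() <= channels; c += Vec::size()) {
        Vec::loadu(src + c).store(dst + c);
      }
      for (; c < channels; ++c) {
        dst[c] = src[c];
      }
    }
  });
}
```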

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz with 20 cores per socket.

single core inference

(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms;  ; NHWC: 0.356 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms;  ; NHWC: 86.821 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms;  ; NHWC: 0.328 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms;  ; NHWC: 16.806 ms

single socket inference

(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms;  ; NHWC: 0.142 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms;  ; NHWC: 7.367 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms;  ; NHWC: 0.027 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms;  ; NHWC: 3.181 ms
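
The benchmark script that produced these numbers is not part of this thread; a minimal LibTorch harness in the same spirit might look like this (the warm-up count, iteration count, and single-thread setting are assumptions):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad;
  at::set_num_threads(1);  // "single core inference"

  torch::nn::ReflectionPad2d pad(torch::nn::ReflectionPad2dOptions({2, 2, 2, 2}));
  auto x = torch::randn({128, 64, 56, 56});

  for (bool channels_last : {false, true}) {
    auto input = channels_last ? x.contiguous(torch::MemoryFormat::ChannelsLast)
                               : x.contiguous();
    for (int i = 0; i < 10; ++i) pad->forward(input);  // warm-up

    constexpr int iters = 100;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) pad->forward(input);
    auto end = std::chrono::steady_clock::now();
    double ms =
        std::chrono::duration<double, std::milli>(end - start).count() / iters;
    std::cout << (channels_last ? "NHWC: " : "NCHW: ") << ms << " ms\n";
  }
}
```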

Notes:

  • When C < the vector length (typical values below): on NCHW, vectorization is still done along the width dimension wherever the output indices overlap the input indices; on NHWC, the kernel falls back to scalar logic, so it is slower than NCHW.
  • When C >= the vector length: NCHW and NHWC performance is similar.
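
The "vector length" here is at::vec::Vectorized<float>::size(), i.e. how many float channels a single vectorized load/store covers: typically 8 with AVX2 and 16 with AVX-512, so a C = 3 input such as [1, 3, 224, 224] takes the scalar NHWC path. A one-liner to inspect it on a given build:

```cpp
#include <ATen/cpu/vec/vec.h>
#include <iostream>

int main() {
  // Number of float lanes per vectorized load/store on this build.
  std::cout << at::vec::Vectorized<float>::size() << "\n";
}
```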

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot bot commented Apr 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99608

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 1eac7e6:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu (CPU specific problem, e.g., perf, algorithm) label Apr 20, 2023
mingfeima added a commit that referenced this pull request Apr 20, 2023
@mingfeima mingfeima added the release notes: memory format (release notes category) and topic: not user facing (topic category) labels Apr 20, 2023
@mingfeima mingfeima marked this pull request as draft April 20, 2023 07:38
…nPad on CPU"


Fix #96738

This patch adds channels-last support for ReflectionPad2d/3d and ReplicationPad2d/3d on the CPU backend. The following test cases pass with this patch:
```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32
```

This patch also improves padding performance on CPU:

* The original kernel has nested parallel loops, e.g. first over the **batch** dimension and then over the **channels** dimension, which is not optimal when N * C is small. This patch collapses NC and the adjacent spatial dimensions to maximize the parallelism scope.
* The original kernel uses scalar logic. This patch vectorizes along the **width** dimension for NCHW and along the **channels** dimension for NHWC.

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz with 20 cores per socket.

### single core inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms;  ; NHWC: 0.356 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms;  ; NHWC: 86.821 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms;  ; NHWC: 0.339 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms;  ; NHWC: 82.935 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms;  ; NHWC: 0.328 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms;  ; NHWC: 16.806 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms;  ; NHWC: 0.324 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms;  ; NHWC: 16.717 ms
```

### single socket inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms;  ; NHWC: 0.142 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms;  ; NHWC: 7.367 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms;  ; NHWC: 0.135 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms;  ; NHWC: 7.203 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms;  ; NHWC: 0.027 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms;  ; NHWC: 3.181 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms;  ; NHWC: 0.029 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms;  ; NHWC: 3.174 ms
```

Notes:
* When C < vector length: on NCHW, vectorization is still done along the **width** dimension wherever the output indices overlap the input indices; on NHWC, the kernel falls back to scalar logic, so it is slower than NCHW.
* When C >= vector length: NCHW and NHWC performance is similar.



cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
mingfeima added a commit that referenced this pull request Apr 21, 2023
mingfeima added a commit that referenced this pull request Apr 23, 2023
@mingfeima mingfeima marked this pull request as ready for review April 23, 2023 05:17
@mingfeima mingfeima requested review from albanD, cpuhrsch and malfet April 23, 2023 07:49
@mingfeima (Collaborator, Author)

@cpuhrsch could you please help review this one?

@cpuhrsch (Contributor)

@mingfeima - This diff is too large (+1,020 −1,714). Can this be split up? Are there some code moves in here that could be split out into separate PRs in this stack?

@cpuhrsch (Contributor) left a comment

Please split this large set of changes (+1,020 −1,714) into a stack of smaller PRs

@mingfeima (Collaborator, Author)

> @mingfeima - This diff is too large (+1,020 −1,714). Can this be split up? Are there some code moves in here that could be split out into separate PRs in this stack?

Sure, just coming back from holiday; will split this one into smaller PRs.

@mingfeima mingfeima added the ciflow/trunk (Trigger trunk jobs on your pull request) label May 6, 2023
@mingfeima mingfeima requested a review from cpuhrsch May 6, 2023 04:06
@mingfeima mingfeima changed the title from "add channels last support for ReflectionPad and ReplicationPad on CPU" to "add channels last support for ReflectionPad on CPU" May 6, 2023
@mingfeima (Collaborator, Author)

@cpuhrsch could you please help review again? I have separated the original PR into two PRs: one for ReflectionPad and one for ReplicationPad.

auto pad_r = padding[1];

// allow empty batch size but not other dimensions.
at::native::padding::check_valid_input<1>(input);
Contributor:

Thanks for making the updates @mingfeima ! I think it looks much better. I do have to ask for another small refactor so I can more easily accept and defend this change.

While I agree that this unifying function is useful, I have to worry about whether we're maintaining the same error messages and error behavior across both code paths.

Can you split the refactor of these input sanitization functions into PRs below this stack?

I want to make sure we separate the task of "refactor existing error checking code for both CPU and CUDA" from "add an optimized kernel for a new feature" and from "add a new feature (channels last)". I think this PR currently does those three things at once.

Also, the error checking functions should be within this file and not in the cpu subfolder, since they're not CPU specific and we don't need to compile them multiple times for various instruction sets.

Thank you!
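
For context on the kind of refactor being requested, a device-agnostic validation helper shared by the CPU and CUDA paths might look roughly like this (a hypothetical sketch with made-up details; the actual check_valid_input in this PR may differ):

```cpp
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>

namespace at::native::padding {

// Hypothetical sketch: allow an empty batch dimension, but reject any other
// empty dimension, with the same message on CPU and CUDA.
template <int pad_dim>
void check_valid_input_sketch(const at::Tensor& input) {
  const int64_t ndim = input.dim();
  TORCH_CHECK(ndim == pad_dim + 1 || ndim == pad_dim + 2,
      "padding: expected ", pad_dim + 1, "D or ", pad_dim + 2,
      "D input, but got ", ndim, "D");

  const bool batch_mode = (ndim == pad_dim + 2);
  bool valid = true;
  // In batch mode, dim 0 (the batch) may be 0; all remaining dims must be non-empty.
  for (int64_t d = batch_mode ? 1 : 0; d < ndim; ++d) {
    valid = valid && (input.size(d) != 0);
  }
  TORCH_CHECK(valid,
      "padding: expected input with non-empty dimensions (except batch), got sizes ",
      input.sizes());
}

} // namespace at::native::padding
```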

Collaborator (Author):

OK, sure will have it done :)

mingfeima added a commit that referenced this pull request May 25, 2023
replacement of #99608, breaking the old pr into smaller ones.

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
mingfeima added a commit that referenced this pull request May 30, 2023
mingfeima added a commit that referenced this pull request May 30, 2023
mingfeima added a commit that referenced this pull request Jun 6, 2023

replacement of #99608, breaking the old pr into smaller ones.

this one handles the common error message from both CPU and CUDA device, to simplify the code.

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
mingfeima added a commit that referenced this pull request Jun 6, 2023
mingfeima added a commit that referenced this pull request Jun 7, 2023
mingfeima added a commit that referenced this pull request Jun 7, 2023
pytorchmergebot pushed a commit that referenced this pull request Jun 7, 2023
Pull Request resolved: #102253
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
@github-actions bot

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jul 10, 2023
@github-actions github-actions bot closed this Aug 9, 2023
@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/115/head branch September 8, 2023 14:23

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
module: cpu (CPU specific problem, e.g., perf, algorithm)
open source
release notes: memory format (release notes category)
Stale
topic: not user facing (topic category)

Projects

Status: Done
