optimize sum kernel when input is Half on CPU #96082
Stack from ghstack (oldest at bottom):

Originally `sum` had a specialized path for `BFloat16`; this patch expands that path to `Half` as well. Without it, `Half` is also accumulated in `Float`, but through a slower, non-vectorized path in `load_reduce_vec`.

Performance results on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz: previously `Half` was much slower than `BFloat16`; now the two are on the same level.

### Performance before

```
### using single socket of 20 cores
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 4.879 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 2.363 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 13.875 ms

### using single core
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 43.573 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 27.743 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 208.055 ms
```
### Performance after

```
### using single socket of 20 cores
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 4.817 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 2.380 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 2.361 ms

### using single core
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 42.986 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 27.766 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 27.753 ms
```
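The benchmark script itself isn't included in the PR; the loop below is a hedged sketch of how numbers like the ones above are typically collected. numpy stands in for torch so the sketch stays self-contained, and the `bench` helper and the smaller tensor shape are inventions for illustration, not the PR's actual script.

```python
import time
import numpy as np

def bench(shape, dtype, iters=20):
    # Hypothetical stand-in for the PR's timing loop: time a
    # dtype-preserving sum over a contiguous tensor, report ms/iter.
    x = np.ones(shape, dtype=dtype)
    x.sum()  # warm-up so allocation/first-touch cost is excluded
    start = time.perf_counter()
    for _ in range(iters):
        x.sum()
    ms = (time.perf_counter() - start) / iters * 1e3
    print(f"Input shape: {shape} ; dtype: {np.dtype(dtype)} ; "
          f"time per iter: {ms:.3f} ms")
    return ms

for dt in (np.float32, np.float16):
    bench((4, 1024, 1024), dt)
```

The warm-up call matters on multi-socket machines like the Xeon above, since first-touch page placement would otherwise be counted in the first iteration.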
cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10
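A side note on why a `Float` accumulator matters for `Half` beyond speed: accumulating float16 values in float16 can overflow, since the largest finite float16 value is 65504. The numpy sketch below (numpy used purely for illustration; the PR's kernel is C++) makes the accumulator dtype explicit:

```python
import numpy as np

# One million float16 values of ~0.1; the true sum is ~1e5,
# well above float16's max finite value of 65504.
x = np.full(1_000_000, 0.1, dtype=np.float16)

# Accumulating in float16: the running sum exceeds 65504 and
# saturates to inf.
half_acc = x.sum(dtype=np.float16)

# Accumulating in float32, as the vectorized reduction path does
# for reduced-precision inputs: stays finite, close to 1e5.
float_acc = x.sum(dtype=np.float32)

print(half_acc, float_acc)
```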