
optimize sum kernel when input is Half on CPU #96082

Closed
wants to merge 12 commits

Conversation

mingfeima
Collaborator

@mingfeima mingfeima commented Mar 6, 2023

Stack from ghstack (oldest at bottom):

Originally, `sum` had a specialized path for `BFloat16`; this patch extends that path to `Half` as well. Without it, `Half` is still accumulated in `Float`, but through the slower, non-vectorized path in `load_reduce_vec`.

Performance results on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz: previously `Half` was much slower than `BFloat16`; now the two are on the same level.
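The numerical reason for accumulating `Half` in `Float` can be illustrated outside of PyTorch. This is a NumPy sketch under assumed random data, not the kernel itself: a running `float16` accumulator stalls once the partial sum outgrows the `float16` spacing, while a `float32` accumulator stays accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(65536).astype(np.float16)   # ~65k values in [0, 1)

# High-precision reference for the same float16 inputs.
ref = x.astype(np.float64).sum()

# Naive float16 running sum: once the partial sum passes ~2048, the
# float16 spacing is >= 2, so every addend below 1 rounds away entirely.
acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)

# Accumulate in float32, as the specialized sum path does for Half/BFloat16.
acc32 = x.sum(dtype=np.float32)

rel_err16 = abs(float(acc16) - ref) / ref
rel_err32 = abs(float(acc32) - ref) / ref
print(f"float16 accumulator rel. error: {rel_err16:.3f}")
print(f"float32 accumulator rel. error: {rel_err32:.2e}")
```

The specialized path gets this accuracy while also keeping the loads and partial sums vectorized, which is where the speedup below comes from.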

### Performance before
```
### using single socket of 20 cores
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float32 ; time per iter: 4.879 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.bfloat16 ; time per iter: 2.363 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float16 ; time per iter: 13.875 ms

### using single core
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float32 ; time per iter: 43.573 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.bfloat16 ; time per iter: 27.743 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float16 ; time per iter: 208.055 ms
```

### Performance after
```
### using single socket of 20 cores
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float32 ; time per iter: 4.817 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.bfloat16 ; time per iter: 2.380 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float16 ; time per iter: 2.361 ms

### using single core
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float32 ; time per iter: 42.986 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.bfloat16 ; time per iter: 27.766 ms
Input shape:  torch.Size([128, 1024, 1024]) ; dtype:  torch.float16 ; time per iter: 27.753 ms
```
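Per-iteration numbers like these come from a warm-up-then-average timing harness. The following is a hypothetical sketch of such a harness, not the script used for this PR; it is written with NumPy so it runs without PyTorch (the PR measured `torch.sum`), and uses a smaller input than the PR's `(128, 1024, 1024)` to stay quick.

```python
import time
import numpy as np

def time_per_iter(fn, warmup=5, iters=20):
    """Average wall-clock time per call, in milliseconds."""
    for _ in range(warmup):      # warm up caches and allocator first
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

shape = (8, 1024, 1024)          # smaller stand-in for the PR's input
for dtype in (np.float32, np.float16):
    a = np.ones(shape, dtype=dtype)
    # Accumulate in float32 regardless of input dtype, as the sum kernel does.
    ms = time_per_iter(lambda: a.sum(dtype=np.float32))
    print(f"Input shape: {shape} ; dtype: {np.dtype(dtype).name} ; "
          f"time per iter: {ms:.3f} ms")
```

Warming up before timing matters here: the first calls include page faults and cache misses that would otherwise inflate the reported per-iteration time.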

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented Mar 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96082

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit 459accf:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mingfeima mingfeima marked this pull request as ready for review March 14, 2023 02:28
@mingfeima mingfeima added release notes: intel release notes category module: half Related to float16 half-precision floats labels Mar 14, 2023
@mingfeima mingfeima changed the title use float32 as acc type in sum kernel when input is Half on CPU optimize sum kernel when input is Half on CPU Mar 14, 2023
@mingfeima mingfeima added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 10, 2023
@github-actions

github-actions bot commented Jun 9, 2023

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 9, 2023
@github-actions github-actions bot closed this Jul 9, 2023
@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/110/head branch August 8, 2023 14:16
Labels
ciflow/trunk Trigger trunk jobs on your pull request module: cpu CPU specific problem (e.g., perf, algorithm) module: half Related to float16 half-precision floats open source release notes: intel release notes category Stale