optimize sum kernel when input is Half on CPU #96082
Stack from ghstack (oldest at bottom):

Originally `sum` had a specialized path for `BFloat16`; this patch expands that path to `Half` as well. Without it, `Half` is also accumulated in `Float`, but through a slower, non-vectorized path in `load_reduce_vec`.

Performance results on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz: previously `Half` was much slower than `BFloat16`; now the two are on the same level.

### Performance before

```
### using single socket of 20 cores
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 4.879 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 2.363 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 13.875 ms

### using single core
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 43.573 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 27.743 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 208.055 ms
```
### Performance after

```
### using single socket of 20 cores
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 4.817 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 2.380 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 2.361 ms

### using single core
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float32 ; time per iter: 42.986 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.bfloat16 ; time per iter: 27.766 ms
Input shape: torch.Size([128, 1024, 1024]) ; dtype: torch.float16 ; time per iter: 27.753 ms
```
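The benchmark script itself isn't included in the PR; the loop below is a hedged sketch of how numbers like the ones above are typically collected. numpy stands in for torch so the sketch stays self-contained, and the `bench` helper and the smaller tensor shape are inventions for illustration, not the PR's actual script.

```python
import time
import numpy as np

def bench(shape, dtype, iters=20):
    # Hypothetical stand-in for the PR's timing loop: time a
    # dtype-preserving sum over a contiguous tensor, report ms/iter.
    x = np.ones(shape, dtype=dtype)
    x.sum()  # warm-up so allocation/first-touch cost is excluded
    start = time.perf_counter()
    for _ in range(iters):
        x.sum()
    ms = (time.perf_counter() - start) / iters * 1e3
    print(f"Input shape: {shape} ; dtype: {np.dtype(dtype)} ; "
          f"time per iter: {ms:.3f} ms")
    return ms

for dt in (np.float32, np.float16):
    bench((4, 1024, 1024), dt)
```

The warm-up call matters on multi-socket machines like the Xeon above, since first-touch page placement would otherwise be counted in the first iteration.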
cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10
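A side note on why a `Float` accumulator matters for `Half` beyond speed: accumulating float16 values in float16 can overflow, since the largest finite float16 value is 65504. The numpy sketch below (numpy used purely for illustration; the PR's kernel is C++) makes the accumulator dtype explicit:

```python
import numpy as np

# One million float16 values of ~0.1; the true sum is ~1e5,
# well above float16's max finite value of 65504.
x = np.full(1_000_000, 0.1, dtype=np.float16)

# Accumulating in float16: the running sum exceeds 65504 and
# saturates to inf.
half_acc = x.sum(dtype=np.float16)

# Accumulating in float32, as the vectorized reduction path does
# for reduced-precision inputs: stays finite, close to 1e5.
float_acc = x.sum(dtype=np.float32)

print(half_acc, float_acc)
```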