# add mixed data type support for GroupNorm backward on CPU #88663

## Conversation
Dr. CI: 🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/88663. Note: links to docs will display an error until the docs builds have been completed. ❌ 4 failures as of commit 6e34c00. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
ghstack-source-id: 96cd113ab39009e646bfef5f440882e7ee7498f5
Pull Request resolved: #88663
[ghstack-poisoned]
cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
Update "Add channels last support for GroupNorm backward" (stacked PR #89485):

### Motivation
1. Add channels last support for GroupNorm backward so that GroupNorm fully supports channels last.
2. Same as #88663, mixed data type support is also needed for the channels last implementation of GroupNorm backward.

### Testing
Single socket (28 cores):

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05
[10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257

* Channels Last:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05
[10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317

Single core:

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04
[10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436

* Channels Last:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459
[10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027

[ghstack-poisoned]
### Motivation
Amp provides convenience methods for mixed precision. If users run bfloat16 models with amp, torch.autocast keeps module parameters in the accumulation dtype, which leaves gamma and beta in float while input/output are in bfloat16. The same goes for backward: parameters are in float, and X, dX and dY are in bfloat16. Mixed data type support for GroupNorm backward is therefore also needed for model training with GroupNorm.

### Testing
Single socket (28 cores):

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05
[10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305
[10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169

Single core:

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04
[10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165
[10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447

[ghstack-poisoned]
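The benchmarking script is not part of the PR; the following is a rough sketch of how timings like those above could be collected (the GroupNorm configuration, warmup, and iteration counts are assumptions):

```python
import timeit
import torch
import torch.nn as nn

# torch.set_num_threads(1)  # uncomment to approximate the "single core" rows

def bench(shape, mixed=False, channels_last=False, iters=1000):
    """Return average forward / backward time in seconds for one GroupNorm call."""
    x = torch.randn(*shape)
    if mixed:
        x = x.to(torch.bfloat16)                  # activations bf16, params stay fp32
    if channels_last:
        x = x.to(memory_format=torch.channels_last)
    x.requires_grad_(True)
    gn = nn.GroupNorm(num_groups=32, num_channels=shape[1])

    gn(x).sum().backward()                        # warmup

    fwd = timeit.timeit(lambda: gn(x), number=iters) / iters

    y = gn(x)
    grad = torch.ones_like(y)
    bwd = timeit.timeit(lambda: y.backward(grad, retain_graph=True),
                        number=iters) / iters
    return fwd, bwd

for shape in [(10, 128, 20, 20), (10, 128, 50, 50)]:
    for mixed in (False, True):
        fwd, bwd = bench(shape, mixed=mixed)
        print(shape, "mixed fp32 bf16" if mixed else "fp32",
              f"fwd {fwd:.3e}s", f"bwd {bwd:.3e}s")
```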
ghstack-source-id: d81889bfa3cfa987ed81a42a1a315f0b598c7644
Pull Request resolved: pytorch#88663
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 [ghstack-poisoned]
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 [ghstack-poisoned]
### Motivation Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in acc dtype which will leave gamma and beta in float while input/output will be in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16. Mixed data type support for GroupNorm backward is also needed for model training with GroupNorm. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05 [10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305 [10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04 [10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165 [10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447 [ghstack-poisoned]
### Motivation Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in acc dtype which will leave gamma and beta in float while input/output will be in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16. Mixed data type support for GroupNorm backward is also needed for model training with GroupNorm. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05 [10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305 [10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04 [10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165 [10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447 cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
// It will keep module parameters in acc dtype, i.e. float,
// while input/output will be in BFloat16.
// Using parameters in BFloat16 will cause high precision loss.
if (mixed_type) {
Same feedback as for #81852: please consider how this could be constrained to just BFloat16; perhaps it could be handled outside of the generic switch statement.
Fixed in #89485
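For readers following along, a Python-level restatement of the dtype rule being discussed (the real check lives in the C++ kernel; the helper name and error text here are hypothetical):

```python
import torch

def is_mixed_type(data: torch.Tensor, *params: torch.Tensor) -> bool:
    """Hypothetical restatement: the only supported "mixed" combination is
    bfloat16 data (X / dY) together with float32 parameters (gamma / beta)."""
    if all(p is None or p.dtype == data.dtype for p in params):
        return False                               # uniform dtypes: not mixed
    if data.dtype == torch.bfloat16 and all(
            p is None or p.dtype == torch.float32 for p in params):
        return True                                # the one allowed mixed case
    raise TypeError("GroupNorm: unsupported mixed dtype combination")
```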
…port for GroupNorm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
### Motivation Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in acc dtype which will leave gamma and beta in float while input/output will be in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16. Mixed data type support for GroupNorm backward is also needed for model training with GroupNorm. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05 [10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305 [10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04 [10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165 [10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447 cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…port for GroupNorm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…port for GroupNorm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
### Motivation Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in acc dtype which will leave gamma and beta in float while input/output will be in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16. Mixed data type support for GroupNorm backward is also needed for model training with GroupNorm. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05 [10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305 [10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04 [10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243 * Channels Last (inputs and outputs will be converted to contiguous): shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165 [10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447 cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge commit: add mixed data type support for GroupNorm backward on CPU (pytorch#88663). Pull Request resolved: pytorch#88663. Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/malfet
…port for GroupNorm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…port for GroupNorm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…port for GroupNorm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…rm backward" ### Motivation 1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last. 2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward. ### Testing Single socket (28cores): * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05 [10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05 [10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317 Single core: * Contiguous: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04 [10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436 * Channels Last: shape | forward / s | forward / s | backward / s | backward / s -- | -- | -- | -- | -- | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16 [10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459 [10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027 cc VitalyFedyunin jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
Merge commit: Add channels last support for GroupNorm backward (#89485). Pull Request resolved: #89485. Approved by: https://github.com/jgong5, https://github.com/malfet
T db_val = std::accumulate(db_arr.cbegin(), db_arr.cend(), T(0));
const PT* ds_ptr = ds + i * D;
const PT* db_ptr = db + i * D;
const PT* gamma_ptr = gamma + g * D;
this is UB if gamma is null -- you can't add a nonzero offset to a null pointer
Thanks for the comments. The CalcDsDb method also checks whether gamma is null. Will modify this in #100234.
Stack from ghstack (oldest at bottom):
### Motivation

Amp provides convenience methods for mixed precision. If users run bfloat16 models with amp, torch.autocast keeps module parameters in the accumulation dtype, which leaves gamma and beta in float while input/output are in bfloat16. The same goes for backward: parameters are in float, and X, dX and dY are in bfloat16.

Mixed data type support for GroupNorm backward is therefore also needed for model training with GroupNorm.
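As a usage-level sketch of the dtype combination described above (not code from the PR; shapes and group count are arbitrary), the backward call below is the path this PR adds on CPU:

```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=32, num_channels=128)  # weight/bias (gamma/beta) are float32

# This is the combination torch.autocast produces on CPU for a bfloat16 run:
# activations in bfloat16, affine parameters left in float32.
x = torch.randn(10, 128, 20, 20, dtype=torch.bfloat16, requires_grad=True)

y = gn(x)               # mixed data type forward
y.sum().backward()      # mixed data type backward: the path added in this PR

print(y.dtype)          # torch.bfloat16
print(gn.weight.dtype)  # torch.float32
print(x.grad.dtype)     # torch.bfloat16
```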
### Testing

Single socket (28 cores):

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05
[10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305
[10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169

Single core:

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04
[10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
 | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165
[10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10