
add mixed data type support for GroupNorm backward on CPU #88663

Closed
12 commits

Conversation

@CaoE (Collaborator) commented Nov 8, 2022

Stack from ghstack (oldest at bottom):

Motivation

Amp provides convenience methods for mixed precision. When amp is used to run bfloat16 models, torch.autocast keeps module parameters in the accumulation dtype, which leaves gamma and beta in float while the input and output are in bfloat16. The same holds for backward: the parameters are in float, while X, dX, and dY are in bfloat16.
Mixed data type support for GroupNorm backward is therefore also needed for training models that use GroupNorm.
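As a concrete illustration of the setup described above, here is a minimal sketch (the group count of 4 and the tensor shape are arbitrary; the dtype comments reflect the behavior described in this PR rather than verified output):

```python
import torch
import torch.nn as nn

m = nn.GroupNorm(4, 128)                   # gamma/beta are float32 module parameters
x = torch.randn(10, 128, 20, 20, dtype=torch.bfloat16, requires_grad=True)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = m(x)                               # input/output in bfloat16, gamma/beta kept in float

y.sum().backward()                         # dY and dX in bfloat16, parameter grads in float

print(y.dtype)                             # expected: torch.bfloat16
print(x.grad.dtype)                        # expected: torch.bfloat16
print(m.weight.grad.dtype)                 # expected: torch.float32
```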

Testing

Single socket (28 cores):

  • Contiguous:

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05
[10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203

  • Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305
[10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169

Single core:

  • Contiguous:

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04
[10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243

  • Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165
[10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447
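For reference, a rough sketch of how timings of this kind could be collected; the benchmark script is not part of this PR, so the group count of 32, the iteration counts, and the use of time.perf_counter are assumptions:

```python
import time
import torch
import torch.nn as nn

def bench(shape, mixed=False, iters=200, warmup=20):
    m = nn.GroupNorm(32, shape[1])         # gamma/beta stay in float32
    x = torch.randn(*shape)
    if mixed:
        x = x.to(torch.bfloat16)           # bf16 activations combined with fp32 parameters
    x.requires_grad_()
    fwd = bwd = 0.0
    for i in range(warmup + iters):
        t0 = time.perf_counter()
        y = m(x)
        t1 = time.perf_counter()
        y.backward(torch.ones_like(y))
        t2 = time.perf_counter()
        x.grad = None                      # reset so input grads do not accumulate across runs
        if i >= warmup:
            fwd += t1 - t0
            bwd += t2 - t1
    return fwd / iters, bwd / iters

for shape in [(10, 128, 20, 20), (10, 128, 50, 50)]:
    print(shape, "fp32:", bench(shape), "mixed:", bench(shape, mixed=True))
```

The single core numbers would additionally require limiting intra-op parallelism, for example with torch.set_num_threads(1).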

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot bot commented Nov 8, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88663

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 Failures

As of commit 6e34c00:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Nov 8, 2022
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Nov 8, 2022
CaoE added a commit that referenced this pull request Nov 8, 2022
@CaoE CaoE marked this pull request as draft November 9, 2022 01:29
@CaoE CaoE requested a review from mingfeima November 9, 2022 01:54
@CaoE CaoE added the intel This tag is for PR from Intel label Nov 9, 2022
CaoE added a commit that referenced this pull request Nov 9, 2022
CaoE added a commit that referenced this pull request Nov 9, 2022
CaoE added a commit that referenced this pull request Nov 17, 2022
CaoE added a commit that referenced this pull request Nov 17, 2022
CaoE added a commit that referenced this pull request Nov 17, 2022
CaoE added a commit that referenced this pull request Nov 17, 2022
@CaoE CaoE requested a review from jgong5 November 17, 2022 11:26
CaoE added a commit to CaoE/pytorch that referenced this pull request Nov 22, 2022
CaoE added a commit to CaoE/pytorch that referenced this pull request Nov 22, 2022
CaoE added a commit that referenced this pull request Nov 22, 2022
CaoE added a commit that referenced this pull request Nov 22, 2022
@CaoE CaoE added the intel priority matters to intel architecture from performance wise label Dec 13, 2022
@CaoE CaoE requested a review from malfet December 13, 2022 05:42
@CaoE CaoE marked this pull request as ready for review December 13, 2022 13:35
@CaoE CaoE removed the large We think that this is a pretty chunky piece of work label Dec 13, 2022
// It will keep module parameters in acc dtype i.e. float
// while input/output will be in BFloat16.
// Using parameters in BFloat16 will cause high precision loss.
if(mixed_type) {
Contributor:
Same feedback as for #81852: please consider how this could be constrained to just BFloat16; perhaps it could be handled outside of the generic switch statement.

Collaborator Author:
Fixed in #89485
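For reference, the mixed_type case this thread discusses corresponds to bfloat16 activations combined with float gamma/beta, which can also be constructed directly rather than through autocast. A minimal sketch, assuming a PyTorch build that includes this PR (the group count of 4 is arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(10, 128, 20, 20, dtype=torch.bfloat16, requires_grad=True)
weight = torch.randn(128, dtype=torch.float32, requires_grad=True)   # gamma kept in float
bias = torch.randn(128, dtype=torch.float32, requires_grad=True)     # beta kept in float

y = F.group_norm(x, 4, weight, bias)       # mixed bf16/fp32 forward on CPU
y.sum().backward()                         # mixed data type backward added by this PR

print(x.grad.dtype, weight.grad.dtype)     # expected: torch.bfloat16 torch.float32
```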

CaoE added a commit that referenced this pull request Dec 20, 2022
CaoE added a commit that referenced this pull request Dec 20, 2022
CaoE added a commit that referenced this pull request Dec 21, 2022
CaoE added a commit that referenced this pull request Dec 21, 2022
CaoE added a commit that referenced this pull request Dec 21, 2022
CaoE added a commit that referenced this pull request Dec 21, 2022
@CaoE (Collaborator Author) commented Dec 22, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


ShisuiUzumaki pushed a commit to ShisuiUzumaki/pytorch that referenced this pull request Dec 23, 2022
Pull Request resolved: pytorch#88663
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/malfet
CaoE added a commit that referenced this pull request Dec 26, 2022
CaoE added a commit that referenced this pull request Dec 26, 2022
CaoE added a commit that referenced this pull request Dec 28, 2022
CaoE added a commit that referenced this pull request Dec 28, 2022
CaoE added a commit that referenced this pull request Dec 28, 2022
CaoE added a commit that referenced this pull request Dec 28, 2022
pytorchmergebot pushed a commit that referenced this pull request Dec 29, 2022

### Motivation
1. Add channels last support for GroupNorm backward so that GroupNorm fully supports channels last.
2. As in #88663, mixed data type support is also needed for the channels last implementation of GroupNorm backward.

### Testing
Single socket (28 cores):

* Contiguous:

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05
[10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257

* Channels Last:

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05
[10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317

Single core:

* Contiguous:

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04
[10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436

* Channels Last:

shape | forward / s (fp32) | forward / s (mixed fp32 bf16) | backward / s (fp32) | backward / s (mixed fp32 bf16)
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459
[10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027

Pull Request resolved: #89485
Approved by: https://github.com/jgong5, https://github.com/malfet
T db_val = std::accumulate(db_arr.cbegin(), db_arr.cend(), T(0));
const PT* ds_ptr = ds + i * D;
const PT* db_ptr = db + i * D;
const PT* gamma_ptr = gamma + g * D;
Contributor:
this is UB if gamma is null -- you can't add a nonzero offset to a null pointer

Collaborator Author:
Thanks for the comments. The CalcDsDb method also checks whether gamma is null. Will modify this in #100234.
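For context, gamma is absent when GroupNorm is constructed with affine=False, which is the null-pointer case raised above. A minimal sketch that exercises that branch of the backward, assuming a PyTorch build that includes this PR:

```python
import torch
import torch.nn as nn

m = nn.GroupNorm(4, 128, affine=False)     # no gamma/beta, so the kernel receives a null gamma pointer
x = torch.randn(10, 128, 20, 20, dtype=torch.bfloat16, requires_grad=True)

y = m(x)
y.sum().backward()                         # backward must not offset the null gamma pointer

print(m.weight, m.bias)                    # None None: no affine parameters
print(x.grad.dtype)                        # expected: torch.bfloat16
```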

@facebook-github-bot facebook-github-bot deleted the gh/CaoE/2/head branch June 8, 2023 14:29
Labels
• ciflow/trunk (Trigger trunk jobs on your pull request)
• intel priority (matters to intel architecture from performance wise)
• intel (This tag is for PR from Intel)
• Merged
• module: cpu (CPU specific problem (e.g., perf, algorithm))
• open source
• release notes: nn (release notes category)