fake_quant: add a more memory efficient backward #50561

Closed
vkuzo wants to merge 9 commits into gh/vkuzo/209/base from gh/vkuzo/209/head

Conversation

vkuzo
Contributor

@vkuzo vkuzo commented Jan 15, 2021

Stack from ghstack:

Summary:

tl;dr: add an alternative implementation of `fake_quantize` which saves
a mask of whether the input was clamped during the forward pass and uses it to calculate the backward. The math:

```
# before - forward (pseudocode)
def fq_forward(x, scale, zp, qmin, qmax):
    q_val = clamp(nearby_int(x / scale) + zp, qmin, qmax)
    fq_val = (q_val - zp) * scale
    return fq_val

# before - backward (pseudocode)
def fq_backward(dy, x, scale, zp, qmin, qmax):
    q_val_unclamped = nearby_int(x / scale) + zp
    mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax
    return dy * mask

# after - forward (pseudocode)
def fq_forward(x, scale, zp, qmin, qmax):
    q_val_unclamped = nearby_int(x / scale) + zp
    mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax
    q_val = clamp(q_val_unclamped, qmin, qmax)
    fq_val = (q_val - zp) * scale
    return fq_val, mask

# after - backward (pseudocode)
def fq_backward(dy, mask):
    return dy * mask
```

This way the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd. Instead of passing `x: FloatTensor`, we pass a `mask: BoolTensor`
with the same number of elements. `BoolTensor` uses 1 byte per element,
so we expect an upper bound of a 75% memory overhead reduction. We observe a 73% memory
overhead reduction on torchvision's MobileNetV2 in real world tests. Packing the bools
into a custom storage format to take 1 bit per element is an optimization left for the future.
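
To make the mechanics concrete, here is a minimal Python sketch of the same idea as a custom `torch.autograd.Function` (the real implementation in this PR is C++/CUDA kernels; the class name and the scalar qparams below are illustrative assumptions):

```
import torch

class _FakeQuantCachemaskSketch(torch.autograd.Function):
    # illustrative Python version of the "after" pseudocode above
    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        q_unclamped = torch.round(x / scale) + zero_point
        # bool mask of which elements land inside [quant_min, quant_max]
        mask = (q_unclamped >= quant_min) & (q_unclamped <= quant_max)
        q = torch.clamp(q_unclamped, quant_min, quant_max)
        fq = (q - zero_point) * scale
        # only the 1-byte-per-element mask is saved for the backward, not x
        ctx.save_for_backward(mask)
        return fq

    @staticmethod
    def backward(ctx, dy):
        (mask,) = ctx.saved_tensors
        # straight-through estimator: pass the gradient only where x was not clamped
        return dy * mask, None, None, None, None

# usage sketch
x = torch.randn(4, 8, requires_grad=True)
y = _FakeQuantCachemaskSketch.apply(x, 0.1, 0, -128, 127)
y.sum().backward()
```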

The performance impact of this seems negligible; I observed a 1% to 5% regression on MobileNetV2, but it's unclear whether it's real.

We add this as a new function (as opposed to replacing the old implementation) for easy testing, but
it might be worth deleting the old fake_quant backward in a future PR. We can adjust the signature
of this function to take `model.training` as an additional parameter and skip the mask computation for eval.
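
A hedged sketch of what that adjusted signature could look like (hypothetical; not part of this PR):

```
import torch

def fq_forward_sketch(x, scale, zp, qmin, qmax, training=True):
    q_unclamped = torch.round(x / scale) + zp
    q = torch.clamp(q_unclamped, qmin, qmax)
    fq = (q - zp) * scale
    if not training:
        # eval never runs a backward, so skip building and saving the mask
        return fq, None
    mask = (q_unclamped >= qmin) & (q_unclamped <= qmax)
    return fq, mask
```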

Test Plan:

QAT on MobileNetV2 on FB infra, with `opt` build flags, batch_size = 32. Results are for the fbgemm settings; qnnpack results are similar.

```
# qat_fp32: model with fake_quants turned off (baseline)
# qat_1: step 2 of qat, with observers disabled and fake_quants enabled (all of the overhead is the fake_quants)

# before: fbgemm - qat_fp32 -> qat_1
max memory usage (mib): 3299 -> 4170 (overhead: 26.4%)
latency (ms):  147 -> 181

# after: fbgemm - qat_fp32 -> qat_1
max memory usage (mib): 3302 -> 3528 (overhead: 7.1%)
latency (ms):  147 -> 183
```
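
The measured reduction tracks the 75% upper bound noted above, which follows directly from element sizes; a quick check (standard torch APIs, nothing specific to this PR):

```
import torch

float_bytes = torch.empty(0, dtype=torch.float32).element_size()  # 4
bool_bytes = torch.empty(0, dtype=torch.bool).element_size()      # 1
# saving a bool mask instead of the float input caps the per-tensor savings at 75%
print(1 - bool_bytes / float_bytes)  # 0.75
```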

Note: similar metrics are observed in an OSS / torchvision / MobileNetV2 setup, with this command:

```
python references/classification/train_quantization.py
  --print-freq 1
  --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
  --output-dir ~/nfs/pytorch_vision_tests/
  --backend qnnpack
  --epochs 5
```

All CI tests here: #50849

PyTorch microbenchmarks (CUDA performance is about the same):

```
cd benchmarks/operator_benchmark
python -m pt.quantization_test
```

Results: https://gist.github.com/vkuzo/11a7bed73fe60e340862d37e7975e9cd

Unit tests:

```
python test/test_quantization.py TestFakeQuantize
```

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D25918519

vkuzo added a commit that referenced this pull request Jan 15, 2021
Summary:

Not for review yet, a bunch of TODOs need finalizing.

tl;dr: add an alternative implementation of `fake_quantize` which saves
a mask during the forward pass and uses it to calculate the backward.

There are two benefits:

1. the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd.  On MobileNetV2, this reduces QAT overhead
by ~15% (TODO: link, and absolute numbers).  We add an additional mask Tensor
to pass around, but its size is 4x smaller than the input tensor. A
future optimization would be to pack the mask bitwise and unpack in the
backward.

2. the computation of `qval` can be done only once in the forward and
reused in the backward. No perf change observed; TODO: verify with better
metrics.

TODO: describe in more detail

Test Plan:

OSS / torchvision / MobileNetV2
```
python references/classification/train_quantization.py
  --print-freq 1
  --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
  --output-dir ~/nfs/pytorch_vision_tests/
  --backend qnnpack
  --epochs 5
TODO paste results here
```

TODO more

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 51df6e811e6568efc3e79098ef69c53662641482
Pull Request resolved: #50561
@vkuzo vkuzo changed the title from "fake_quant: add a more memory efficient version" to "fake_quant: add a more memory efficient backward" on Jan 15, 2021
vkuzo added a commit that referenced this pull request Jan 21, 2021
Summary:

Switches the default fake_quant path to use the new memory efficient backward from
#50561.

Separating for clean testing and review, but ideally we combine
this with #50561.

Test Plan:

```
python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cpu
python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cuda
python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cpu
python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cuda
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
Collaborator

@ngimel ngimel left a comment

The kernels look good, I left minor comments.

```
const Tensor& mask) {
  TORCH_CHECK(dY.scalar_type() == ScalarType::Float);
  TORCH_CHECK(mask.scalar_type() == ScalarType::Bool);
  TORCH_CHECK(mask.numel() == dY.numel(), "`mask` and `dY` are not the same size");
```
Collaborator

You can log the sizes here to make the error message more informative.


```
iter_combined.for_each([&](char** data, const int64_t* strides, int64_t n) {
  for (int64_t i = 0; i < n; i++) {
    float* output_val = (float*)(data[0] + i * strides[0]);
```
Collaborator

It's unusual to see kernels in PyTorch that are hardcoded for float only. Could people want to quantize half or bfloat16 models?

@codecov

codecov bot commented Jan 21, 2021

Codecov Report

Merging #50561 (194be25) into gh/vkuzo/209/base (5ec2e26) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

```
@@                  Coverage Diff                  @@
##           gh/vkuzo/209/base   #50561      +/-   ##
=====================================================
- Coverage              80.88%   80.88%   -0.01%
=====================================================
  Files                   1931     1931
  Lines                 210588   210604      +16
=====================================================
+ Hits                  170339   170343       +4
- Misses                 40249    40261      +12
```

vkuzo added a commit that referenced this pull request Jan 27, 2021
Summary:

This PR is the cleanup after #50561. At a high level, we make the new
definition of fake_quant the definition used by autograd, but keep the old
function around as a thin wrapper so the user-facing API stays the same.
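
A rough Python analogue of that wrapper shape (a sketch only; the real change is in the op registrations, and the exact Python binding name for the cachemask op may differ by PyTorch version):

```
import torch

def fake_quantize_per_tensor_affine_wrapper(x, scale, zero_point, quant_min, quant_max):
    # keep the old user-facing signature, but implement it via the cachemask variant;
    # the mask output is only needed by autograd for the backward, so drop it here
    out, _mask = torch.fake_quantize_per_tensor_affine_cachemask(
        x, scale, zero_point, quant_min, quant_max
    )
    return out
```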

In detail:
1. point `fake_quantize_per_tensor_affine`'s implementation to be `fake_quantize_per_tensor_affine_cachemask`
2. delete the `fake_quantize_per_tensor_affine` backward; autograd will automatically use the cachemask backward
3. delete all the `fake_quantize_per_tensor_affine` kernels, since they are no longer used by anything

Test Plan:

```
python test/test_quantization.py TestFakeQuantize
```

Performance testing was done in the previous PR.

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jan 28, 2021
Summary:
Pull Request resolved: #51159

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26090869

fbshipit-source-id: fda042881f77a993a9d15dafabea7cfaf9dc7c9c
@facebook-github-bot
Contributor

This pull request has been merged in 983b8e6.

vkuzo added a commit that referenced this pull request Jan 28, 2021
…nel backward"


Summary:

This is the same as #50561, but for per-channel fake_quant.  We add an alternative definition
of fake quantize per channel's backward which computes a mask of what is clipped in the
forward, and reuses that mask in the backward (instead of recomputing it):

```
# before - forward (pseudocode)
def fq_forward(x, scale, zp, qmin, qmax):
    q_val = clamp(nearby_int(x / scale) + zp, qmin, qmax)
    fq_val = (q_val - zp) * scale
    return fq_val

# before - backward (pseudocode)
def fq_backward(dy, x, scale, zp, qmin, qmax):
    q_val_unclamped = nearby_int(x / scale) + zp
    mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax
    return dy * mask

# after - forward (pseudocode)
def fq_forward(x, scale, zp, qmin, qmax):
    q_val_unclamped = nearby_int(x / scale) + zp
    mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax
    q_val = clamp(q_val_unclamped, qmin, qmax)
    fq_val = (q_val - zp) * scale
    return fq_val, mask

# after - backward (pseudocode)
def fq_backward(dy, mask):
    return dy * mask
```
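
For intuition, a plain-PyTorch sketch of the per-channel version of the pseudocode above (not this PR's kernels; `axis` selects the channel dimension and `scale` / `zero_point` are assumed to be 1-D per-channel tensors):

```
import torch

def fq_per_channel_forward_sketch(x, scale, zero_point, axis, quant_min, quant_max):
    # reshape the per-channel qparams so they broadcast along `axis`
    shape = [1] * x.dim()
    shape[axis] = -1
    s = scale.reshape(shape)
    zp = zero_point.reshape(shape)
    q_unclamped = torch.round(x / s) + zp
    mask = (q_unclamped >= quant_min) & (q_unclamped <= quant_max)
    q = torch.clamp(q_unclamped, quant_min, quant_max)
    return (q - zp) * s, mask

def fq_per_channel_backward_sketch(dy, mask):
    # the backward needs only the cached mask, not x / scale / zero_point
    return dy * mask
```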

There is a slight memory efficiency win (75% of whatever per-channel fq contributes,
although it does not contribute much).

There is also a nice side effect that fake_quant_per_channel will now support
a module calling it twice in the same forward.  Previously, this was broken because
(1) scale + zp were passed to the backward as arguments, and
(2) scale + zp were updated in-place during the forward.
The combination of (1) and (2) made it illegal to use the same fake_quant twice, since
it would modify in-place the information needed for the backward.  After this PR, (1)
will no longer apply, so this use case can be enabled.

There are two things left for future PRs:
1. kernels for mask and fq value are duplicated, instead of reading once and writing twice.  We will hopefully optimize that in a future PR.  Impact is low in the real world because this is not a bottleneck.
2. we use `BoolTensor` to pass the mask which takes 1 byte per element, in the future we can pack the bits to save more memory
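
On point 2, the bit packing could look roughly like NumPy's packbits (a sketch of the idea only; this PR keeps the plain `BoolTensor`):

```
import numpy as np

mask = np.random.rand(1000) < 0.9     # stand-in for the bool mask
packed = np.packbits(mask)            # 1 bit per element, ~8x smaller than a byte-per-bool
unpacked = np.unpackbits(packed, count=mask.size).astype(bool)
assert (unpacked == mask).all()
```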

Memory and performance impact (MobileNetV2): 

```
# qat_fp32: model with fake_quants turned off (baseline)
# qat_1: step 2 of qat, with observers disabled and fake_quants enabled (all of the overhead is the fake_quants)

# before: fbgemm - qat_fp32 -> qat_1
max memory usage (mib): 3302 -> 3538 (overhead: 7.1%)
latency (ms):  147 -> 187 (overhead: 27%)

# after: fbgemm - qat_fp32 -> qat_1
max memory usage (mib): 3302 -> 3532 (overhead: 7.0%)
latency (ms):  147 -> 167 (overhead: 14%)
```

Performance impact (microbenchmarks): https://gist.github.com/vkuzo/fbe1968d2bbb79b3f6dd776309fbcffc
* forward pass on cpu: 512ms -> 750ms (+46%)
* forward pass on cuda: 99ms -> 128ms (+30%)
* note: the overall performance impact to training jobs should be minimal, because this is used for weights, and relative importance of fq is dominated by fq'ing the activations.  The data collected from real benchmarks (MobileNetV2 QAT) matches this hypothesis, and we actually see a speedup there.
* note: we can optimize the perf in a future PR by changing the kernels to read once and write twice

Test Plan:

```
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda
```

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D26117721](https://our.internmc.facebook.com/intern/diff/D26117721)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jan 29, 2021
Summary:
Pull Request resolved: #51255

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26117721

fbshipit-source-id: 798b59316dff8188a1d0948e69adf9e5509e414c
@facebook-github-bot facebook-github-bot deleted the gh/vkuzo/209/head branch January 31, 2021 15:18