fake_quant: add a more memory efficient backward #50561
Conversation
Summary: Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a mask during the forward pass and uses it to calculate the backward. There are two benefits: 1. the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor. A future optimization would be to pack the mask bitwise and unpack in the backward. 2. the computation of `qval` can be done only once in the forward and reused in the backward. No perf change observed, TODO verify with better metrics. TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a mask during the forward pass and uses it to calculate the backward. There are two benefits: 1. the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor. A future optimization would be to pack the mask bitwise and unpack in the backward. 2. the computation of `qval` can be done only once in the forward and reused in the backward. No perf change observed, TODO verify with better metrics. TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D25918519](https://our.internmc.facebook.com/intern/diff/D25918519) [ghstack-poisoned]
Summary: Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a mask during the forward pass and uses it to calculate the backward. There are two benefits: 1. the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor. A future optimization would be to pack the mask bitwise and unpack in the backward. 2. the computation of `qval` can be done only once in the forward and reused in the backward. No perf change observed, TODO verify with better metrics. TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 51df6e811e6568efc3e79098ef69c53662641482 Pull Request resolved: #50561
Summary: Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a mask during the forward pass and uses it to calculate the backward. This way the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT memory overhead by ~73% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor (so upper bound on overhead savings without packing the bits is 75%, we are pretty close with 73%). TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D25918519](https://our.internmc.facebook.com/intern/diff/D25918519) [ghstack-poisoned]
Summary: Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a mask during the forward pass and uses it to calculate the backward. There are two benefits: 1. the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor. A future optimization would be to pack the mask bitwise and unpack in the backward. 2. the computation of `qval` can be done only once in the forward and reused in the backward. No perf change observed, TODO verify with better metrics. TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 1b80bd5c92df078c3cca2389b4d5c333fa7fc72d Pull Request resolved: #50561
Summary: tl;dr; add an alternative implementation of `fake_quantize` which saves a mask of whether the input was clamped during the forward pass and uses it to calculate the backward. The math: ``` # before - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val = clamp(int(x / scale) + zp, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val # before - backward (pseudocode) def fq_backward(dy, x, scale, zp, qmin, qmax): q_val_unclamped = int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax return dy * mask # after - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val_unclamped = int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax q_val = clamp(q_val_unclamped, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val, mask # after - backward (pseudocode) def fq_backward(dy, mask): return dy * mask ``` This way the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. Instead of passing `x: FloatTensor`, we pass a `mask: BoolTensor` with the same number of elements. `BoolTensor` uses 1 byte per element, so we expect an upper bound of a 75% memory overhead reduction. We observe a 73% memory overhead reduction on torchvision's MobileNetV2 in real world tests. Packing the bools into a custom storage format to take 1 bit per element is an optimization left for the future. Performance impact of this seems negligible, I observed a 1% to 5% regression on MobileNetV2 but it's unclear if it's real. Adding this as a new function (as opposed to replacing the old implementation) for easy testing, but might be worth deleting the old fake_quant backward in a future PR. We can adjust the signature of this function to take `model.training` as an additional parameter, and skip the mask computation for eval. Test Plan: QAT on MobileNetV2 on FB infra, with `opt` build flags, batch_size = 32. Results for fbgemm settings, qnnpack results are similar. ``` # qat_fp32: model with fake_quants turned off (baseline) # qat_1: step 2 of qat, with observers disabled and fake_quants enabled (all of the overhead is the fake_quants) # before: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3299 -> 4170 (overhead: 26.4%) latency (ms): 147 -> 181 # after: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3302 -> 3528 (overhead: 7.1%) latency (ms): 147 -> 183 ``` Note: similar metrics are observed in an OSS / torchvision / MobileNetV2 setup, with this command: ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D25918519](https://our.internmc.facebook.com/intern/diff/D25918519) [ghstack-poisoned]
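For intuition, here is a minimal runnable restatement of the pseudocode above as a custom `torch.autograd.Function`. This is only a sketch of the idea, not the actual ATen implementation; the class name and the scalar handling are illustrative:

```python
import torch

class FakeQuantCacheMask(torch.autograd.Function):
    """Fake-quantize per tensor, saving only a bool mask for the backward."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        q_unclamped = torch.round(x / scale) + zero_point
        # mask of elements whose gradient should pass through (not clamped)
        mask = (q_unclamped >= quant_min) & (q_unclamped <= quant_max)
        ctx.save_for_backward(mask)  # bool mask: 1 byte/elem vs 4 bytes for the float input
        q = torch.clamp(q_unclamped, quant_min, quant_max)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # gradient flows only where the input was not clamped; no need for x, scale, zp
        return grad_output * mask, None, None, None, None

# usage sketch
x = torch.randn(2, 3, requires_grad=True)
y = FakeQuantCacheMask.apply(x, 0.1, 0, 0, 255)
y.sum().backward()
```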
Summary: Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a mask during the forward pass and uses it to calculate the backward. There are two benefits: 1. the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor. A future optimization would be to pack the mask bitwise and unpack in the backward. 2. the computation of `qval` can be done only once in the forward and reused in the backward. No perf change observed, TODO verify with better metrics. TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: f932055ee57b6a4e419d3896fb605c58fc063668 Pull Request resolved: #50561
Summary: Switches the default fake_quant path to use the new memory efficient backward from #50561. Separating for clean testing and review, but ideally we combine this with #50561. Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cuda ``` Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: Switches the default fake_quant path to use the new memory efficient backward from #50561. Separating for clean testing and review, but ideally we combine this with #50561. Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cuda ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: ba2787ee2a6b023bab36733ec32c91fb174f2cc7 Pull Request resolved: #50857
Summary: tl;dr; add an alternative implementation of `fake_quantize` which saves a mask of whether the input was clamped during the forward pass and uses it to calculate the backward. The math: ``` # before - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val = clamp(nearby_int(x / scale) + zp, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val # before - backward (pseudocode) def fq_backward(dy, x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax return dy * mask # after - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax q_val = clamp(q_val_unclamped, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val, mask # after - backward (pseudocode) def fq_backward(dy, mask): return dy * mask ``` This way the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. Instead of passing `x: FloatTensor`, we pass a `mask: BoolTensor` with the same number of elements. `BoolTensor` uses 1 byte per element, so we expect an upper bound of a 75% memory overhead reduction. We observe a 73% memory overhead reduction on torchvision's MobileNetV2 in real world tests. Packing the bools into a custom storage format to take 1 bit per element is an optimization left for the future. Performance impact of this seems negligible, I observed a 1% to 5% regression on MobileNetV2 but it's unclear if it's real. Adding this as a new function (as opposed to replacing the old implementation) for easy testing, but might be worth deleting the old fake_quant backward in a future PR. We can adjust the signature of this function to take `model.training` as an additional parameter, and skip the mask computation for eval. Test Plan: QAT on MobileNetV2 on FB infra, with `opt` build flags, batch_size = 32. Results for fbgemm settings, qnnpack results are similar. ``` # qat_fp32: model with fake_quants turned off (baseline) # qat_1: step 2 of qat, with observers disabled and fake_quants enabled (all of the overhead is the fake_quants) # before: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3299 -> 4170 (overhead: 26.4%) latency (ms): 147 -> 181 # after: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3302 -> 3528 (overhead: 7.1%) latency (ms): 147 -> 183 ``` Note: similar metrics are observed in an OSS / torchvision / MobileNetV2 setup, with this command: ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 ``` All CI tests here: #50849 Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D25918519](https://our.internmc.facebook.com/intern/diff/D25918519) [ghstack-poisoned]
Summary: Switches the default fake_quant path to use the new memory efficient backward from #50561. Separating for clean testing and review, but ideally we combine this with #50561. Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_tensor_cuda ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 7f15da247757c9423374bc32d0a8df8318308f68 Pull Request resolved: #50857
The kernels look good, I left minor comments.
const Tensor& mask) {
  TORCH_CHECK(dY.scalar_type() == ScalarType::Float);
  TORCH_CHECK(mask.scalar_type() == ScalarType::Bool);
  TORCH_CHECK(mask.numel() == dY.numel(), "`mask` and `dY` are not the same size");
You can log the sizes here to make the error message more informative.
iter_combined.for_each([&](char** data, const int64_t* strides, int64_t n) {
  for (int64_t i = 0; i < n; i++) {
    float* output_val = (float*)(data[0] + i * strides[0]);
It's unusual to see kernels in pytorch that are hardcoded for float only. Could people want to quantize half or bfloat16 models?
Codecov Report
@@ Coverage Diff @@
## gh/vkuzo/209/base #50561 +/- ##
=====================================================
- Coverage 80.88% 80.88% -0.01%
=====================================================
Files 1931 1931
Lines 210588 210604 +16
=====================================================
+ Hits 170339 170343 +4
- Misses 40249 40261 +12
Summary: tl;dr; add an alternative implementation of `fake_quantize` which saves a mask of whether the input was clamped during the forward pass and uses it to calculate the backward. The math: ``` # before - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val = clamp(nearby_int(x / scale) + zp, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val # before - backward (pseudocode) def fq_backward(dy, x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax return dy * mask # after - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax q_val = clamp(q_val_unclamped, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val, mask # after - backward (pseudocode) def fq_backward(dy, mask): return dy * mask ``` This way the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. Instead of passing `x: FloatTensor`, we pass a `mask: BoolTensor` with the same number of elements. `BoolTensor` uses 1 byte per element, so we expect an upper bound of a 75% memory overhead reduction. We observe a 73% memory overhead reduction on torchvision's MobileNetV2 in real world tests. Packing the bools into a custom storage format to take 1 bit per element is an optimization left for the future. Performance impact of this seems negligible, I observed a 1% to 5% regression on MobileNetV2 but it's unclear if it's real. Adding this as a new function (as opposed to replacing the old implementation) for easy testing, but might be worth deleting the old fake_quant backward in a future PR. We can adjust the signature of this function to take `model.training` as an additional parameter, and skip the mask computation for eval. Test Plan: QAT on MobileNetV2 on FB infra, with `opt` build flags, batch_size = 32. Results for fbgemm settings, qnnpack results are similar. ``` # qat_fp32: model with fake_quants turned off (baseline) # qat_1: step 2 of qat, with observers disabled and fake_quants enabled (all of the overhead is the fake_quants) # before: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3299 -> 4170 (overhead: 26.4%) latency (ms): 147 -> 181 # after: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3302 -> 3528 (overhead: 7.1%) latency (ms): 147 -> 183 ``` Note: similar metrics are observed in an OSS / torchvision / MobileNetV2 setup, with this command: ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 ``` All CI tests here: #50849 PyTorch microbenchmarks (CUDA performance about the same: https://gist.github.com/vkuzo/11a7bed73fe60e340862d37e7975e9cd) Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D25918519](https://our.internmc.facebook.com/intern/diff/D25918519) [ghstack-poisoned]
Summary: Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a mask during the forward pass and uses it to calculate the backward. There are two benefits: 1. the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor. A future optimization would be to pack the mask bitwise and unpack in the backward. 2. the computation of `qval` can be done only once in the forward and reused in the backward. No perf change observed, TODO verify with better metrics. TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 58c79e8e86a59be43d866412c0dd9f432eb2af87 Pull Request resolved: #50561
Summary: This PR is the cleanup after #50561. High level, we make the new definition of fake_quant be the definition used by autograd, but keep the old function around as a thin wrapper to keep the user facing API the same. In detail: 1. point `fake_quantize_per_tensor_affine`'s implementation to be `fake_quantize_per_tensor_affine_cachemask` 2. delete the `fake_quantize_per_tensor_affine` backward, autograd will automatically use the cachemask backward 3. delete all the `fake_quantize_per_tensor_affine` kernels, since they are no longer used by anything Test Plan: ``` python test/test_quantization.py TestFakeQuantize ``` performance testing was done in the previous PR. Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: This PR is the cleanup after #50561. High level, we make the new definition of fake_quant be the definition used by autograd, but keep the old function around as a thin wrapper to keep the user facing API the same. In detail: 1. point `fake_quantize_per_tensor_affine`'s implementation to be `fake_quantize_per_tensor_affine_cachemask` 2. delete the `fake_quantize_per_tensor_affine` backward, autograd will automatically use the cachemask backward 3. delete all the `fake_quantize_per_tensor_affine` kernels, since they are no longer used by anything Test Plan: ``` python test/test_quantization.py TestFakeQuantize ``` performance testing was done in the previous PR. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 1adb5b962a25dd5f89c035e7855fb8eb28bb1706 Pull Request resolved: #51159
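For illustration, a minimal usage sketch showing that the user-facing API is unchanged by this cleanup (assuming a PyTorch build that includes these PRs; the gradient is `dy * mask`, as described in the PR above):

```python
import torch

x = torch.randn(4, 8, requires_grad=True)
scale, zero_point, quant_min, quant_max = 0.1, 0, 0, 255

# same signature as before; per this PR, the cachemask kernels are used under the hood
y = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, quant_min, quant_max)
y.sum().backward()

# gradient is 1 where round(x / scale) + zero_point stayed inside [quant_min, quant_max], else 0
print(x.grad.unique())
```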
Summary: tl;dr; add an alternative implementation of `fake_quantize` which saves a mask of whether the input was clamped during the forward pass and uses it to calculate the backward. The math: ``` # before - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val = clamp(nearby_int(x / scale) + zp, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val # before - backward (pseudocode) def fq_backward(dy, x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax return dy * mask # after - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax q_val = clamp(q_val_unclamped, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val, mask # after - backward (pseudocode) def fq_backward(dy, mask): return dy * mask ``` This way the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. Instead of passing `x: FloatTensor`, we pass a `mask: BoolTensor` with the same number of elements. `BoolTensor` uses 1 byte per element, so we expect an upper bound of a 75% memory overhead reduction. We observe a 73% memory overhead reduction on torchvision's MobileNetV2 in real world tests. Packing the bools into a custom storage format to take 1 bit per element is an optimization left for the future. Performance impact of this seems negligible, I observed a 1% to 5% regression on MobileNetV2 but it's unclear if it's real. Adding this as a new function (as opposed to replacing the old implementation) for easy testing, but might be worth deleting the old fake_quant backward in a future PR. We can adjust the signature of this function to take `model.training` as an additional parameter, and skip the mask computation for eval. Test Plan: QAT on MobileNetV2 on FB infra, with `opt` build flags, batch_size = 32. Results for fbgemm settings, qnnpack results are similar. ``` # qat_fp32: model with fake_quants turned off (baseline) # qat_1: step 2 of qat, with observers disabled and fake_quants enabled (all of the overhead is the fake_quants) # before: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3299 -> 4170 (overhead: 26.4%) latency (ms): 147 -> 181 # after: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3302 -> 3528 (overhead: 7.1%) latency (ms): 147 -> 183 ``` Note: similar metrics are observed in an OSS / torchvision / MobileNetV2 setup, with this command: ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 ``` All CI tests here: #50849 PyTorch microbenchmarks (CUDA performance about the same: ``` cd benchmarks/operator_benchmark python -m pt.quantization_test ``` results: https://gist.github.com/vkuzo/11a7bed73fe60e340862d37e7975e9cd) Unit tests: ``` python test/test_quantization.py TestFakeQuantize ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D25918519](https://our.internmc.facebook.com/intern/diff/D25918519) [ghstack-poisoned]
…sion" Summary: This PR is the cleanup after #50561. High level, we make the new definition of fake_quant be the definition used by autograd, but keep the old function around as a thin wrapper to keep the user facing API the same. In detail: 1. point `fake_quantize_per_tensor_affine`'s implementation to be `fake_quantize_per_tensor_affine_cachemask` 2. delete the `fake_quantize_per_tensor_affine` backward, autograd will automatically use the cachemask backward 3. delete all the `fake_quantize_per_tensor_affine` kernels, since they are no longer used by anything Test Plan: ``` python test/test_quantization.py TestFakeQuantize ``` performance testing was done in the previous PR. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D26090869](https://our.internmc.facebook.com/intern/diff/D26090869) [ghstack-poisoned]
Summary: This PR is the cleanup after #50561. High level, we make the new definition of fake_quant be the definition used by autograd, but keep the old function around as a thin wrapper to keep the user facing API the same. In detail: 1. point `fake_quantize_per_tensor_affine`'s implementation to be `fake_quantize_per_tensor_affine_cachemask` 2. delete the `fake_quantize_per_tensor_affine` backward, autograd will automatically use the cachemask backward 3. delete all the `fake_quantize_per_tensor_affine` kernels, since they are no longer used by anything Test Plan: ``` python test/test_quantization.py TestFakeQuantize ``` performance testing was done in the previous PR. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: de70258e9950f1e7a401c52b7fa2082390319690 Pull Request resolved: #51159
Summary: This is the same as #50561, but for per-channel fake_quant. TODO before land write up better Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda ``` Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: This is the same as #50561, but for per-channel fake_quant. TODO before land write up better Memory and performance impact (MobileNetV2): TODO Performance impact (microbenchmarks): https://gist.github.com/vkuzo/fbe1968d2bbb79b3f6dd776309fbcffc * forward pass on cpu: 512ms -> 750ms (+46%) * forward pass on cuda: 99ms -> 128ms (+30%) * note: the overall performance impact to training jobs should be minimal, because this is used for weights, and relative importance of fq is dominated by fq'ing the activations * note: we can optimize the perf in a future PR by reading once and writing twice Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda ``` Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: This is the same as #50561, but for per-channel fake_quant. TODO before land write up better Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 7498ee6ff77ae53fe30587cc0efe12f3a3b87428 Pull Request resolved: #51255
Summary: Pull Request resolved: #51159 This PR is the cleanup after #50561. High level, we make the new definition of fake_quant be the definition used by autograd, but keep the old function around as a thin wrapper to keep the user facing API the same. In detail: 1. point `fake_quantize_per_tensor_affine`'s implementation to be `fake_quantize_per_tensor_affine_cachemask` 2. delete the `fake_quantize_per_tensor_affine` backward, autograd will automatically use the cachemask backward 3. delete all the `fake_quantize_per_tensor_affine` kernels, since they are no longer used by anything Test Plan: ``` python test/test_quantization.py TestFakeQuantize ``` performance testing was done in the previous PR. Imported from OSS Reviewed By: jerryzh168 Differential Revision: D26090869 fbshipit-source-id: fda042881f77a993a9d15dafabea7cfaf9dc7c9c
This pull request has been merged in 983b8e6.
…nel backward" Summary: This is the same as #50561, but for per-channel fake_quant. We add an alternative definition of fake quantize per channel's backward which computes a mask of what is clipped in the forward, and reuses that mask in the backward (instead of recomputing it): ``` # before - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val = clamp(nearby_int(x / scale) + zp, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val # before - backward (pseudocode) def fq_backward(dy, x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax return dy * mask # after - forward (pseudocode) def fq_forward(x, scale, zp, qmin, qmax): q_val_unclamped = nearby_int(x / scale) + zp mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax q_val = clamp(q_val_unclamped, qmin, qmax) fq_val = (q_val - zp) * scale return fq_val, mask # after - backward (pseudocode) def fq_backward(dy, mask): return dy * mask ``` There is a slight memory efficiency win (75% of whatever per-channel fq contributes, although it does not contribute much). There is also a nice side effect that fake_quant_per_channel will now support a module calling it twice in the same forward. Previously, this was broken because (1) scale + zp were passed to the backward as arguments, and (2) scale + zp were updated in-place during the forward The combination of (1) and (2) made it illegal to use the same fake_quant twice, since it would modify in-place the information needed for the backward. After this PR, (1) will no longer apply, so this use case can be enabled. There are two things left for future PRs: 1. kernels for mask and fq value are duplicated, instead of reading once and writing twice. We will hopefully optimize that in a future PR. Impact is low in the real world because this is not a bottleneck. 2. we use `BoolTensor` to pass the mask which takes 1 byte per element, in the future we can pack the bits to save more memory Memory and performance impact (MobileNetV2): ``` # qat_fp32: model with fake_quants turned off (baseline) # qat_1: step 2 of qat, with observers disabled and fake_quants enabled (all of the overhead is the fake_quants) # before: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3302 -> 3538 (overhead: 7.1%) latency (ms): 147 -> 187 (overhead: 27%) # after: fbgemm - qat_fp32 -> qat_1 max memory usage (mib): 3302 -> 3532 (overhead: 7.0%) latency (ms): 147 -> 167 (overhead: 14%) ``` Performance impact (microbenchmarks): https://gist.github.com/vkuzo/fbe1968d2bbb79b3f6dd776309fbcffc * forward pass on cpu: 512ms -> 750ms (+46%) * forward pass on cuda: 99ms -> 128ms (+30%) * note: the overall performance impact to training jobs should be minimal, because this is used for weights, and relative importance of fq is dominated by fq'ing the activations. The data collected from real benchmarks (MobileNetV2 QAT) matches this hypothesis, and we actually see a speedup there. 
* note: we can optimize the perf in a future PR by changing the kernels to read once and write twice Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D26117721](https://our.internmc.facebook.com/intern/diff/D26117721) [ghstack-poisoned]
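To make the idea concrete, a small sketch of the per-channel forward-with-mask written with plain PyTorch ops; the function names are made up for illustration, and the real implementation is a fused C++/CUDA kernel:

```python
import torch

def fq_per_channel_forward_with_mask(x, scale, zero_point, axis, quant_min, quant_max):
    # reshape scale / zero_point so they broadcast along the channel axis
    shape = [1] * x.dim()
    shape[axis] = x.size(axis)
    scale = scale.reshape(shape)
    zero_point = zero_point.reshape(shape)

    q_unclamped = torch.round(x / scale) + zero_point
    mask = (q_unclamped >= quant_min) & (q_unclamped <= quant_max)  # cached for backward
    q = torch.clamp(q_unclamped, quant_min, quant_max)
    return (q - zero_point) * scale, mask

def fq_per_channel_backward(dy, mask):
    # no dependence on x, scale, or zero_point, so in-place updates to scale/zp
    # between forward and backward no longer corrupt the gradient
    return dy * mask

# usage sketch: per-output-channel fake quant of a conv weight
w = torch.randn(16, 3, 3, 3)
scale = torch.full((16,), 0.02)
zero_point = torch.zeros(16)
fq_w, mask = fq_per_channel_forward_with_mask(w, scale, zero_point, 0, -128, 127)
```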
Summary: Pull Request resolved: #51255 This is the same as #50561, but for per-channel fake_quant. TODO before land write up better Memory and performance impact (MobileNetV2): TODO Performance impact (microbenchmarks): https://gist.github.com/vkuzo/fbe1968d2bbb79b3f6dd776309fbcffc * forward pass on cpu: 512ms -> 750ms (+46%) * forward pass on cuda: 99ms -> 128ms (+30%) * note: the overall performance impact to training jobs should be minimal, because this is used for weights, and relative importance of fq is dominated by fq'ing the activations * note: we can optimize the perf in a future PR by reading once and writing twice Test Plan: ``` python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D26117721 fbshipit-source-id: 798b59316dff8188a1d0948e69adf9e5509e414c
Stack from ghstack:
Summary:
tl;dr; add an alternative implementation of `fake_quantize` which saves a mask of whether the input was clamped during the forward pass and uses it to calculate the backward. The math:
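```
# before - forward (pseudocode)
def fq_forward(x, scale, zp, qmin, qmax):
  q_val = clamp(nearby_int(x / scale) + zp, qmin, qmax)
  fq_val = (q_val - zp) * scale
  return fq_val

# before - backward (pseudocode)
def fq_backward(dy, x, scale, zp, qmin, qmax):
  q_val_unclamped = nearby_int(x / scale) + zp
  mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax
  return dy * mask

# after - forward (pseudocode)
def fq_forward(x, scale, zp, qmin, qmax):
  q_val_unclamped = nearby_int(x / scale) + zp
  mask = qmin <= q_val_unclamped and q_val_unclamped <= qmax
  q_val = clamp(q_val_unclamped, qmin, qmax)
  fq_val = (q_val - zp) * scale
  return fq_val, mask

# after - backward (pseudocode)
def fq_backward(dy, mask):
  return dy * mask
```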
This way the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. Instead of passing `x: FloatTensor`, we pass a `mask: BoolTensor` with the same number of elements. `BoolTensor` uses 1 byte per element, so we expect an upper bound of a 75% memory overhead reduction. We observe a 73% memory overhead reduction on torchvision's MobileNetV2 in real world tests. Packing the bools into a custom storage format to take 1 bit per element is an optimization left for the future.

Performance impact of this seems negligible; I observed a 1% to 5% regression on MobileNetV2, but it's unclear if it's real.

Adding this as a new function (as opposed to replacing the old implementation) for easy testing, but it might be worth deleting the old fake_quant backward in a future PR. We can adjust the signature of this function to take `model.training` as an additional parameter, and skip the mask computation for eval.

Test Plan:
QAT on MobileNetV2 on FB infra, with `opt` build flags, batch_size = 32. Results for fbgemm settings, qnnpack results are similar.
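```
# qat_fp32: model with fake_quants turned off (baseline)
# qat_1: step 2 of qat, with observers disabled and fake_quants enabled
#        (all of the overhead is the fake_quants)

# before: fbgemm - qat_fp32 -> qat_1
max memory usage (mib): 3299 -> 4170 (overhead: 26.4%)
latency (ms): 147 -> 181

# after: fbgemm - qat_fp32 -> qat_1
max memory usage (mib): 3302 -> 3528 (overhead: 7.1%)
latency (ms): 147 -> 183
```

Note: similar metrics are observed in an OSS / torchvision / MobileNetV2 setup, with this command:

```
python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5
```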
All CI tests here: #50849
PyTorch microbenchmarks (CUDA performance about the same):
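```
cd benchmarks/operator_benchmark
python -m pt.quantization_test
```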
Results: https://gist.github.com/vkuzo/11a7bed73fe60e340862d37e7975e9cd
Unit tests:
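```
python test/test_quantization.py TestFakeQuantize
```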
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D25918519