Conversation

@angelayi (Contributor) commented Jun 14, 2021

Stack from ghstack:

Summary: When converting, before quantizing the nodes, we call `update_obs_for_equalization()` and `convert_eq_obs()`. These functions find pairs of input and weight equalization observers and calculate the equalization scale. Using this equalization scale, we scale the inputs by inserting a `mul` node into the graph that multiplies the inputs by the equalization scale, and we scale the weights by multiplying them by the reciprocal of the equalization scale and manually updating the weight values.

Before:

```
                                                                   weights
                                                                     |
x -> input_quantization_observer -> input_equalization_observer -> linear -> output_quantization_observer
```

After:

```
    equalization_scale                             weights (scaled)
          |                                               |
    x -> mul -> input_quantization_observer (scaled) -> linear -> output_quantization_observer
```

In addition to updating the input and weight values, the input quantization observers will be updated so that the `scale` and `zero_point` qparams reflect the scaled input values. These qparams will be used later to create a `quantize_per_tensor` node which converts floats to quantized tensors. The weight quantization observers will be re-calibrated during the call to `from_float` with the scaled weights as inputs, so their qparams reflect the changes made to the weight values.
These updated quantization observers, along with the scaled inputs and weights, will then be used to construct the final quantized model.
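
For intuition, here is a small standalone sketch (not this PR's code; the scale values are made up) showing why multiplying the input columns by the equalization scale and the weight columns by its reciprocal leaves the linear output unchanged while balancing the per-channel ranges that the observers see:

```python
import torch

torch.manual_seed(0)

x = torch.randn(4, 3) * torch.tensor([0.1, 1.0, 10.0])  # input columns with very different ranges
linear = torch.nn.Linear(3, 2, bias=True)

# Hypothetical per-column equalization scale; the real one is derived from the
# observed input/weight min-max ranges, but any positive vector shows the idea.
equalization_scale = torch.tensor([10.0, 1.0, 0.1])

x_eq = x * equalization_scale                       # what the inserted mul node computes
w_eq = linear.weight * (1.0 / equalization_scale)   # what convert_eq_obs does to the weight

out_ref = torch.nn.functional.linear(x, linear.weight, linear.bias)
out_eq = torch.nn.functional.linear(x_eq, w_eq, linear.bias)
print(torch.allclose(out_ref, out_eq, atol=1e-6))   # True: same output, balanced ranges
```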

What update_obs_for_equalization does:

  1. For each InputEqualizationObserver, we find the corresponding WeightEqualizationObserver.
  2. For nn.Linear layers, we create an instance of the WeightEqualizationObserver and run forward on the observer with the given weights.
  3. Calculate the equalization scale between the InputEqualizationObserver and WeightEqualizationObserver (a rough sketch of this calculation follows the list).
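
As a rough illustration of step 3 (a sketch of one common formulation, not necessarily the exact formula the observers use), a per-column equalization scale can be computed from the min/max ranges recorded by the two observers:

```python
import torch

def calculate_equalization_scale(x_min, x_max, w_min, w_max):
    # All arguments are per-column tensors of shape (in_features,), i.e. the
    # ranges recorded by the input / weight equalization observers.
    # The sqrt of the range ratio balances the input and weight ranges, since
    # the input is multiplied by the scale and the weight by its reciprocal.
    return torch.sqrt((w_max - w_min) / (x_max - x_min))

x_min, x_max = torch.tensor([-0.1, -1.0, -10.0]), torch.tensor([0.1, 1.0, 10.0])
w_min, w_max = torch.tensor([-1.0, -1.0, -1.0]), torch.tensor([1.0, 1.0, 1.0])
print(calculate_equalization_scale(x_min, x_max, w_min, w_max))
# tensor([3.1623, 1.0000, 0.3162])
```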

What convert_eq_obs does:
For every InputEqualizationObserver, we will do the following:

  1. Create a node (e.g. `x0_activation_post_process_scale`) containing the equalization scale constant.
  2. Create another node containing a `mul` operator multiplying the equalization scale and the input.
  3. Remove the current InputEqualizationObserver node, and replace it with the `mul` node (a simplified torch.fx sketch of these three steps follows below).

For every WeightEqualizationObserver, we will do the following:

  1. Get the next equalization scale (we may need this for equalizing connected linear layers).
  2. Scale the weights by multiplying them by the reciprocal of the current equalization scale and by the next equalization scale.
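
The input-side graph surgery (steps 1-3 for the InputEqualizationObserver) can be mimicked with plain torch.fx. The sketch below is a simplified stand-in for the real pass, with an nn.Identity playing the role of the InputEqualizationObserver and a made-up scale value:

```python
import torch
import torch.fx

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.obs = torch.nn.Identity()        # stand-in for InputEqualizationObserver
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(self.obs(x))

gm = torch.fx.symbolic_trace(Toy())
# 1. Attach the equalization scale constant so a get_attr node can reference it.
gm.register_buffer("x_equalization_scale", torch.tensor([2.0, 0.5]))

for node in list(gm.graph.nodes):
    if node.op == "call_module" and node.target == "obs":
        with gm.graph.inserting_before(node):
            scale_node = gm.graph.get_attr("x_equalization_scale")
            # 2. mul node multiplying the input by the equalization scale.
            mul_node = gm.graph.call_function(torch.mul, args=(node.args[0], scale_node))
        # 3. Replace the observer node with the mul node and erase it.
        node.replace_all_uses_with(mul_node)
        gm.graph.erase_node(node)

gm.recompile()
print(gm.graph)  # x -> get_attr(x_equalization_scale) -> mul -> linear
```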

Currently, this supports models with `nn.Linear` layers, but does not yet support connected linear layers.

Test Plan: `python test/test_quantization.py TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:

```
.LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:

```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:

```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%mul,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:

```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
    %linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```
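
In eager-mode terms, the converted graph above computes roughly the following (a sketch with made-up qparams; in the real graph `linear_input_scale_0` / `linear_input_zero_point_0` come from the updated input quantization observer and `linear` is the quantized module):

```python
import torch

x = torch.randn(4, 2)
x_equalization_scale = torch.tensor([2.0, 0.5])   # the get_attr'd equalization scale
input_scale, input_zero_point = 0.05, 64          # made-up qparams for illustration

x_eq = torch.mul(x, x_equalization_scale)         # the inserted mul node
xq = torch.quantize_per_tensor(x_eq, input_scale, input_zero_point, torch.quint8)
# xq would then feed the quantized linear module, whose output is dequantized at the end.
print(xq.dequantize()[0])
```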

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D29135358

@facebook-github-bot commented Jun 14, 2021

💊 CI failures summary and remediations

As of commit 23c92d1 (more details on the Dr. CI page and at hud.pytorch.org/pr/59963):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jun 23 00:59:09 RuntimeError: test_quantization failed!
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.core.test_quantized_op.TestQNNPackOps-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.fx.test_quantize_fx.TestQuantizeFx-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.fx.test_quantize_fx.TestQuantizeFxOps-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.eager.test_quantize_eager_ptq.TestQuantizeONNXExport-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.core.test_quantized_op.TestQuantizedEmbeddingOps-20210623005617.xml
Jun 23 00:59:09 Traceback (most recent call last):
Jun 23 00:59:09   File "test/run_test.py", line 1313, in <module>
Jun 23 00:59:09     main()
Jun 23 00:59:09   File "test/run_test.py", line 1292, in main
Jun 23 00:59:09     raise RuntimeError(err_message)
Jun 23 00:59:09 RuntimeError: test_quantization failed!
Jun 23 00:59:10 + cleanup
Jun 23 00:59:10 + retcode=1
Jun 23 00:59:10 + set +x


Exited with code exit status 1

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 23 01:08:48 AssertionError: False is not tr... was 1.0 (1.0 vs. 0.0), which occurred at index 0.
Jun 23 01:08:48 ----------------------------------------------------------------------
Jun 23 01:08:48 Traceback (most recent call last):
Jun 23 01:08:48   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 397, in instantiated_test
Jun 23 01:08:48     result = test_fn(self, *args)
Jun 23 01:08:48   File "/var/lib/jenkins/workspace/xla/test/../../test/test_view_ops.py", line 458, in test_transpose_inplace_view
Jun 23 01:08:48     self.assertEqual(t[1, 0], v[0, 1])
Jun 23 01:08:48   File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 605, in assertEqual
Jun 23 01:08:48     return DeviceTypeTestBase.assertEqual(self, x, y, *args, **kwargs)
Jun 23 01:08:48   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1407, in assertEqual
Jun 23 01:08:48     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jun 23 01:08:48 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=0.001, found 1 element(s) (out of 1) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.0 (1.0 vs. 0.0), which occurred at index 0.
Jun 23 01:08:48 
Jun 23 01:08:48 ----------------------------------------------------------------------
Jun 23 01:08:48 Ran 138 tests in 3.433s
Jun 23 01:08:48 
Jun 23 01:08:48 FAILED (failures=2, skipped=102)
Jun 23 01:08:48 
Jun 23 01:08:48 Generating XML reports...
Jun 23 01:08:48 Generated XML report: test-reports/python-unittest/test.......test.test_view_ops/TEST-TestViewOpsXLA-20210623010845.xml
Jun 23 01:08:49 + cleanup
Jun 23 01:08:49 + retcode=1

This comment was automatically generated by Dr. CI.

angelayi added 2 commits June 14, 2021 18:06
@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Inline review on this snippet from the new test:

```python
prepared = prepare_fx(m, qconfig_dict, equalization_qconfig_dict=equalization_qconfig_dict)
self.checkGraphModuleNodes(prepared, expected_node_occurrence=node_occurrence)

def test_input_weight_equalization_convert(self):
```
@supriyar (Contributor) commented Jun 15, 2021:
Can we add a new test that verifies the graph structure after equalization is done?

@angelayi (author) replied:
Would this be after the equalization functions or after all of convert?

@supriyar replied Jun 15, 2021:
I think we can do both - check the output of _convert_equalization_ref and convert_fx (with equalization)

A contributor later commented:
This was broken on macOS:

```
Traceback (most recent call last):
  File "/Users/distiller/project/test/quantization/fx/test_equalize_fx.py", line 294, in test_input_weight_equalization_convert
    convert_fx(prepared)  # Check if compile?
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/quantize_fx.py", line 543, in convert_fx
    return _convert_fx(graph_module, is_reference, convert_custom_config_dict, _remove_qconfig=_remove_qconfig)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/quantize_fx.py", line 477, in _convert_fx
    is_standalone_module, _remove_qconfig_flag=_remove_qconfig)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/fx/convert.py", line 446, in convert
    convert_custom_config_dict=convert_custom_config_dict)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/fx/quantization_patterns.py", line 687, in convert
    quantized = qlinear.from_float(self.linear)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 276, in from_float
    dtype=dtype)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 151, in __init__
    self._packed_params = LinearPackedParams(dtype)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 19, in __init__
    self.set_weight_bias(wq, None)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 24, in set_weight_bias
    self._packed_params = torch.ops.quantized.linear_prepack(weight, bias)
Didn't find engine for operation quantized::linear_prepack NoQEngine
```
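
The failure is the build missing a quantized engine (NoQEngine) rather than the equalization logic itself; one way to guard such a test (a sketch, not necessarily what this PR ended up doing) is to skip it when neither fbgemm nor qnnpack is available:

```python
import unittest
import torch

# quantized::linear_prepack needs a real quantized engine (fbgemm or qnnpack);
# some builds (e.g. this macOS CI runner) report NoQEngine.
HAS_QENGINE = any(
    e in torch.backends.quantized.supported_engines for e in ("fbgemm", "qnnpack")
)

class TestEqualizeFx(unittest.TestCase):
    @unittest.skipIf(not HAS_QENGINE, "no quantized engine available")
    def test_input_weight_equalization_convert(self):
        ...  # build the model, prepare_fx, convert_fx, and check the graph
```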

angelayi added 3 commits June 18, 2021 09:55
@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: When converting, before quantizing the nodes, we call `update_obs_for_equalization()` and `convert_eq_obs()`. This will find input and weight equalization observers pairs and calculate the equalization scale. Using this equalization scale, we will scale the inputs by inserting a mul node into the graph to multiply the inputs by the equalization scale, and we will scale the weights by multiplying it by the reciprocal of the equalization scale and manually updating the weight value.

Before: 
```
                                                                  weights
                                                                     |
x -> input_quantization_observer -> input_equalization_observer -> linear -> output_quantization_observer
```

After: 
```
    equalization_scale                             weights (scaled)
          |                                               |
    x -> mul -> input_quantization_observer (scaled) -> linear -> output_quantization_observer
```

In addition to updating the input and weight values, the input quantization observers will be updated so that the `scale` and `zero_point` qparams reflect the scaled input values. These qparams will be used later to create a `quantize_per_tensor` node which converts floats to quantized tensors. The weight quantization observers will be re-calibrated during the call to `from_float` with the scaled weights as inputs, causing their qparams to reflect changes made to the weight values.
These updated quantization observers will then be used to construct the final quantized model based along with the scaled inputs and weights.

What `update_obs_for_equalization` does:
1. For each InputEqualizationObserver, we find the corresponding WeightEqualizationObserver.
2. For nn.Linear layers, we will create an instance of the WeightEqualizationObserver, run forward on the observer with the given weights.
3. Calculate the equalization scale between the InputEqualizationObserver and WeightEqualizationObserver.

What `convert_eq_obs` does:
For every InputEqualizationObserver, we will do the following:
1. Create a node (ex. `x0_activation_post_process_scale`) containing the equalization scale constant.
2. Create another node containing a `mul` operator multiplying the equalization scale and the input.
3. Remove the current InputEqualizationObserver node, and replace it with the `mul` node.

For every WeightEqualizationObserver, we will do the following:
1. Get the next equalization scale (we may need this for equalizing connected linear layers).
2. Scale the weights by multiplying it with the reciprocal of the current equalization scale and the next equalization scale

Currently, this supports models with `nn.Linear` layers, but does not support connecting linear layers.

Test Plan: `python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:
```
.LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%mul,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
    %linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```
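
For context, an end-to-end flow along these lines would look roughly like the following; the keyword used to pass the equalization config to `prepare_fx` and the import path of `default_equalization_qconfig` are assumptions here, not verbatim from this PR.

```python
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx
# Assumed import path for the default equalization qconfig; it may live elsewhere.
from torch.quantization.fx._equalize import default_equalization_qconfig

model = torch.nn.Sequential(torch.nn.Linear(2, 2)).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}
equalization_qconfig_dict = {"": default_equalization_qconfig}

# Assumed keyword argument name for the equalization config.
prepared = prepare_fx(model, qconfig_dict,
                      equalization_qconfig_dict=equalization_qconfig_dict)
prepared(torch.randn(8, 2))          # calibration pass
quantized = convert_fx(prepared)     # runs update_obs_for_equalization / convert_eq_obs
print(quantized.graph)
```

After `convert_fx`, the `mul` stays in the graph while the input observer is replaced by the `quantize_per_tensor` call shown above.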

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D29135358](https://our.internmc.facebook.com/intern/diff/D29135358)

[ghstack-poisoned]

@supriyar supriyar left a comment


looks good to me. @jerryzh168 please review once as well.


@jerryzh168 jerryzh168 left a comment


looks good overall, had a question about the order of equalization observer and quant observer

@angelayi

@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

angelayi added a commit that referenced this pull request Jun 22, 2021
Summary: When converting, before quantizing the nodes, we call
`update_obs_for_equalization()` and `convert_eq_obs()`.

`update_obs_for_equalization`:
1. For each InputEqualizationObserver, we find the corresponding
WeightEqualizationObserver.
2. For nn.Linear layers, we will create an instance of the
WeightEqualizationObserver and run forward on the observer with the
given weights.
3. Calculate the equalization scale between the
InputEqualizationObserver and WeightEqualizationObserver.

`convert_eq_obs`:
For every InputEqualizationObserver, we will do the following:
1. Create a node (e.g. `x0_activation_post_process_scale`) containing
the equalization scale constant.
2. Create another node containing a `mul` operator multiplying the
equalization scale and the input.
3. Remove the current InputEqualizationObserver node, and replace it
with the `mul` node.

For every WeightEqualizationObserver, we will do the following:
1. Get the next equalization scale (we may need this for equalizing
connected linear layers).
2. Scale the weights by multiplying them by the reciprocal of the
current equalization scale and by the next equalization scale.

Currently, this supports models with `nn.Linear` layers, but does not
yet support connected linear layers.

Test Plan: `python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:
```
.LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale_0 : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale_0), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale_0 : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale_0), kwargs = {})
    %linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
    %linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 3885de2
Pull Request resolved: #59963
@angelayi

@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

This pull request has been merged in 3de79b7.


@walterddr walterddr left a comment


@facebook-github-bot

This pull request has been reverted by e60f9cf.

@facebook-github-bot facebook-github-bot deleted the gh/angelayi/13/head branch June 26, 2021 14:17
