Conversation

@angelayi (Contributor) commented Jun 14, 2021

Stack from ghstack:

Summary: When converting, before quantizing the nodes, we call `update_obs_for_equalization()` and `convert_eq_obs()`. These functions find pairs of input and weight equalization observers and calculate the equalization scale. Using this equalization scale, we scale the inputs by inserting a `mul` node into the graph that multiplies the inputs by the equalization scale, and we scale the weights by multiplying them by the reciprocal of the equalization scale and manually updating the weight values.

Before:

```
                                                                   weights
                                                                     |
x -> input_quantization_observer -> input_equalization_observer -> linear -> output_quantization_observer
```

After:

```
    equalization_scale                             weights (scaled)
          |                                               |
    x -> mul -> input_quantization_observer (scaled) -> linear -> output_quantization_observer
```

In addition to updating the input and weight values, the input quantization observers will be updated so that the `scale` and `zero_point` qparams reflect the scaled input values. These qparams will be used later to create a `quantize_per_tensor` node which converts floats to quantized tensors. The weight quantization observers will be re-calibrated during the call to `from_float` with the scaled weights as inputs, so their qparams reflect the changes made to the weight values.
These updated quantization observers, along with the scaled inputs and weights, will then be used to construct the final quantized model.
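
For intuition, here is a small standalone sketch (not this PR's code; the scale values are made up) showing why multiplying the input columns by the equalization scale and the weight columns by its reciprocal leaves the linear output unchanged while balancing the per-channel ranges that the observers see:

```python
import torch

torch.manual_seed(0)

x = torch.randn(4, 3) * torch.tensor([0.1, 1.0, 10.0])  # input columns with very different ranges
linear = torch.nn.Linear(3, 2, bias=True)

# Hypothetical per-column equalization scale; the real one is derived from the
# observed input/weight min-max ranges, but any positive vector shows the idea.
equalization_scale = torch.tensor([10.0, 1.0, 0.1])

x_eq = x * equalization_scale                       # what the inserted mul node computes
w_eq = linear.weight * (1.0 / equalization_scale)   # what convert_eq_obs does to the weight

out_ref = torch.nn.functional.linear(x, linear.weight, linear.bias)
out_eq = torch.nn.functional.linear(x_eq, w_eq, linear.bias)
print(torch.allclose(out_ref, out_eq, atol=1e-6))   # True: same output, balanced ranges
```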

What update_obs_for_equalization does:

  1. For each InputEqualizationObserver, we find the corresponding WeightEqualizationObserver.
  2. For nn.Linear layers, we create an instance of the WeightEqualizationObserver and run forward on the observer with the given weights.
  3. Calculate the equalization scale between the InputEqualizationObserver and WeightEqualizationObserver (a rough sketch of this calculation follows the list).
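
As a rough illustration of step 3 (a sketch of one common formulation, not necessarily the exact formula the observers use), a per-column equalization scale can be computed from the min/max ranges recorded by the two observers:

```python
import torch

def calculate_equalization_scale(x_min, x_max, w_min, w_max):
    # All arguments are per-column tensors of shape (in_features,), i.e. the
    # ranges recorded by the input / weight equalization observers.
    # The sqrt of the range ratio balances the input and weight ranges, since
    # the input is multiplied by the scale and the weight by its reciprocal.
    return torch.sqrt((w_max - w_min) / (x_max - x_min))

x_min, x_max = torch.tensor([-0.1, -1.0, -10.0]), torch.tensor([0.1, 1.0, 10.0])
w_min, w_max = torch.tensor([-1.0, -1.0, -1.0]), torch.tensor([1.0, 1.0, 1.0])
print(calculate_equalization_scale(x_min, x_max, w_min, w_max))
# tensor([3.1623, 1.0000, 0.3162])
```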

What convert_eq_obs does:
For every InputEqualizationObserver, we will do the following:

  1. Create a node (e.g. `x0_activation_post_process_scale`) containing the equalization scale constant.
  2. Create another node containing a `mul` operator multiplying the equalization scale and the input.
  3. Remove the current InputEqualizationObserver node, and replace it with the `mul` node (a simplified torch.fx sketch of these three steps follows below).

For every WeightEqualizationObserver, we will do the following:

  1. Get the next equalization scale (we may need this for equalizing connected linear layers).
  2. Scale the weights by multiplying them by the reciprocal of the current equalization scale and by the next equalization scale.
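
The input-side graph surgery (steps 1-3 for the InputEqualizationObserver) can be mimicked with plain torch.fx. The sketch below is a simplified stand-in for the real pass, with an nn.Identity playing the role of the InputEqualizationObserver and a made-up scale value:

```python
import torch
import torch.fx

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.obs = torch.nn.Identity()        # stand-in for InputEqualizationObserver
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(self.obs(x))

gm = torch.fx.symbolic_trace(Toy())
# 1. Attach the equalization scale constant so a get_attr node can reference it.
gm.register_buffer("x_equalization_scale", torch.tensor([2.0, 0.5]))

for node in list(gm.graph.nodes):
    if node.op == "call_module" and node.target == "obs":
        with gm.graph.inserting_before(node):
            scale_node = gm.graph.get_attr("x_equalization_scale")
            # 2. mul node multiplying the input by the equalization scale.
            mul_node = gm.graph.call_function(torch.mul, args=(node.args[0], scale_node))
        # 3. Replace the observer node with the mul node and erase it.
        node.replace_all_uses_with(mul_node)
        gm.graph.erase_node(node)

gm.recompile()
print(gm.graph)  # x -> get_attr(x_equalization_scale) -> mul -> linear
```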

Currently, this supports models with `nn.Linear` layers, but does not yet support connected linear layers.

Test Plan: `python test/test_quantization.py TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:

```
.LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:

```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:

```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%mul,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:

```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
    %linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```
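
In eager-mode terms, the converted graph above computes roughly the following (a sketch with made-up qparams; in the real graph `linear_input_scale_0` / `linear_input_zero_point_0` come from the updated input quantization observer and `linear` is the quantized module):

```python
import torch

x = torch.randn(4, 2)
x_equalization_scale = torch.tensor([2.0, 0.5])   # the get_attr'd equalization scale
input_scale, input_zero_point = 0.05, 64          # made-up qparams for illustration

x_eq = torch.mul(x, x_equalization_scale)         # the inserted mul node
xq = torch.quantize_per_tensor(x_eq, input_scale, input_zero_point, torch.quint8)
# xq would then feed the quantized linear module, whose output is dequantized at the end.
print(xq.dequantize()[0])
```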

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D29135358

@facebook-github-bot commented Jun 14, 2021

💊 CI failures summary and remediations

As of commit 23c92d1 (more details on the Dr. CI page and at hud.pytorch.org/pr/59963):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jun 23 00:59:09 RuntimeError: test_quantization failed!
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.core.test_quantized_op.TestQNNPackOps-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.fx.test_quantize_fx.TestQuantizeFx-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.fx.test_quantize_fx.TestQuantizeFxOps-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.eager.test_quantize_eager_ptq.TestQuantizeONNXExport-20210623005617.xml
Jun 23 00:59:09 Generated XML report: test-reports/python-unittest/test_quantization/TEST-quantization.core.test_quantized_op.TestQuantizedEmbeddingOps-20210623005617.xml
Jun 23 00:59:09 Traceback (most recent call last):
Jun 23 00:59:09   File "test/run_test.py", line 1313, in <module>
Jun 23 00:59:09     main()
Jun 23 00:59:09   File "test/run_test.py", line 1292, in main
Jun 23 00:59:09     raise RuntimeError(err_message)
Jun 23 00:59:09 RuntimeError: test_quantization failed!
Jun 23 00:59:10 + cleanup
Jun 23 00:59:10 + retcode=1
Jun 23 00:59:10 + set +x


Exited with code exit status 1

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 23 01:08:48 AssertionError: False is not tr... was 1.0 (1.0 vs. 0.0), which occurred at index 0.
Jun 23 01:08:48 ----------------------------------------------------------------------
Jun 23 01:08:48 Traceback (most recent call last):
Jun 23 01:08:48   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 397, in instantiated_test
Jun 23 01:08:48     result = test_fn(self, *args)
Jun 23 01:08:48   File "/var/lib/jenkins/workspace/xla/test/../../test/test_view_ops.py", line 458, in test_transpose_inplace_view
Jun 23 01:08:48     self.assertEqual(t[1, 0], v[0, 1])
Jun 23 01:08:48   File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 605, in assertEqual
Jun 23 01:08:48     return DeviceTypeTestBase.assertEqual(self, x, y, *args, **kwargs)
Jun 23 01:08:48   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1407, in assertEqual
Jun 23 01:08:48     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jun 23 01:08:48 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=0.001, found 1 element(s) (out of 1) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.0 (1.0 vs. 0.0), which occurred at index 0.
Jun 23 01:08:48 
Jun 23 01:08:48 ----------------------------------------------------------------------
Jun 23 01:08:48 Ran 138 tests in 3.433s
Jun 23 01:08:48 
Jun 23 01:08:48 FAILED (failures=2, skipped=102)
Jun 23 01:08:48 
Jun 23 01:08:48 Generating XML reports...
Jun 23 01:08:48 Generated XML report: test-reports/python-unittest/test.......test.test_view_ops/TEST-TestViewOpsXLA-20210623010845.xml
Jun 23 01:08:49 + cleanup
Jun 23 01:08:49 + retcode=1

This comment was automatically generated by Dr. CI.

angelayi added 2 commits June 14, 2021 18:06
@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Inline review on this snippet from the new test:

```python
prepared = prepare_fx(m, qconfig_dict, equalization_qconfig_dict=equalization_qconfig_dict)
self.checkGraphModuleNodes(prepared, expected_node_occurrence=node_occurrence)

def test_input_weight_equalization_convert(self):
```
@supriyar (Contributor) commented Jun 15, 2021:
Can we add a new test that verifies the graph structure after equalization is done?

@angelayi (author) replied:
Would this be after the equalization functions or after all of convert?

@supriyar replied Jun 15, 2021:
I think we can do both - check the output of _convert_equalization_ref and convert_fx (with equalization)

A contributor later commented:
This was broken on macOS:

```
Traceback (most recent call last):
  File "/Users/distiller/project/test/quantization/fx/test_equalize_fx.py", line 294, in test_input_weight_equalization_convert
    convert_fx(prepared)  # Check if compile?
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/quantize_fx.py", line 543, in convert_fx
    return _convert_fx(graph_module, is_reference, convert_custom_config_dict, _remove_qconfig=_remove_qconfig)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/quantize_fx.py", line 477, in _convert_fx
    is_standalone_module, _remove_qconfig_flag=_remove_qconfig)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/fx/convert.py", line 446, in convert
    convert_custom_config_dict=convert_custom_config_dict)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/fx/quantization_patterns.py", line 687, in convert
    quantized = qlinear.from_float(self.linear)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 276, in from_float
    dtype=dtype)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 151, in __init__
    self._packed_params = LinearPackedParams(dtype)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 19, in __init__
    self.set_weight_bias(wq, None)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 24, in set_weight_bias
    self._packed_params = torch.ops.quantized.linear_prepack(weight, bias)
Didn't find engine for operation quantized::linear_prepack NoQEngine
```
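
The failure is the build missing a quantized engine (NoQEngine) rather than the equalization logic itself; one way to guard such a test (a sketch, not necessarily what this PR ended up doing) is to skip it when neither fbgemm nor qnnpack is available:

```python
import unittest
import torch

# quantized::linear_prepack needs a real quantized engine (fbgemm or qnnpack);
# some builds (e.g. this macOS CI runner) report NoQEngine.
HAS_QENGINE = any(
    e in torch.backends.quantized.supported_engines for e in ("fbgemm", "qnnpack")
)

class TestEqualizeFx(unittest.TestCase):
    @unittest.skipIf(not HAS_QENGINE, "no quantized engine available")
    def test_input_weight_equalization_convert(self):
        ...  # build the model, prepare_fx, convert_fx, and check the graph
```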

angelayi added 3 commits June 18, 2021 09:55
@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: When converting, before quantizing the nodes, we call `update_obs_for_equalization()` and `convert_eq_obs()`. This will find input and weight equalization observers pairs and calculate the equalization scale. Using this equalization scale, we will scale the inputs by inserting a mul node into the graph to multiply the inputs by the equalization scale, and we will scale the weights by multiplying it by the reciprocal of the equalization scale and manually updating the weight value.

Before: 
```
                                                                  weights
                                                                     |
x -> input_quantization_observer -> input_equalization_observer -> linear -> output_quantization_observer
```

After: 
```
    equalization_scale                             weights (scaled)
          |                                               |
    x -> mul -> input_quantization_observer (scaled) -> linear -> output_quantization_observer
```

In addition to updating the input and weight values, the input quantization observers will be updated so that the `scale` and `zero_point` qparams reflect the scaled input values. These qparams will be used later to create a `quantize_per_tensor` node which converts floats to quantized tensors. The weight quantization observers will be re-calibrated during the call to `from_float` with the scaled weights as inputs, causing their qparams to reflect changes made to the weight values.
These updated quantization observers will then be used to construct the final quantized model based along with the scaled inputs and weights.

What `update_obs_for_equalization` does:
1. For each InputEqualizationObserver, we find the corresponding WeightEqualizationObserver.
2. For nn.Linear layers, we will create an instance of the WeightEqualizationObserver, run forward on the observer with the given weights.
3. Calculate the equalization scale between the InputEqualizationObserver and WeightEqualizationObserver.

What `convert_eq_obs` does:
For every InputEqualizationObserver, we will do the following:
1. Create a node (ex. `x0_activation_post_process_scale`) containing the equalization scale constant.
2. Create another node containing a `mul` operator multiplying the equalization scale and the input.
3. Remove the current InputEqualizationObserver node, and replace it with the `mul` node.

For every WeightEqualizationObserver, we will do the following:
1. Get the next equalization scale (we may need this for equalizing connected linear layers).
2. Scale the weights by multiplying it with the reciprocal of the current equalization scale and the next equalization scale

Currently, this supports models with `nn.Linear` layers, but does not support connecting linear layers.

Test Plan: `python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:
```
.LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%mul,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
    %linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```
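
For context, an end-to-end flow along these lines would look roughly like the following; the keyword used to pass the equalization config to `prepare_fx` and the import path of `default_equalization_qconfig` are assumptions here, not verbatim from this PR.

```python
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx
# Assumed import path for the default equalization qconfig; it may live elsewhere.
from torch.quantization.fx._equalize import default_equalization_qconfig

model = torch.nn.Sequential(torch.nn.Linear(2, 2)).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}
equalization_qconfig_dict = {"": default_equalization_qconfig}

# Assumed keyword argument name for the equalization config.
prepared = prepare_fx(model, qconfig_dict,
                      equalization_qconfig_dict=equalization_qconfig_dict)
prepared(torch.randn(8, 2))          # calibration pass
quantized = convert_fx(prepared)     # runs update_obs_for_equalization / convert_eq_obs
print(quantized.graph)
```

After `convert_fx`, the `mul` stays in the graph while the input observer is replaced by the `quantize_per_tensor` call shown above.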

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D29135358](https://our.internmc.facebook.com/intern/diff/D29135358)

[ghstack-poisoned]

@supriyar supriyar left a comment


looks good to me. @jerryzh168 please review once as well.


@jerryzh168 jerryzh168 left a comment


looks good overall, had a question about the order of equalization observer and quant observer

@angelayi

@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

angelayi added a commit that referenced this pull request Jun 22, 2021
Summary: When converting, before quantizing the nodes, we call
`update_obs_for_equalization()` and `convert_eq_obs()`.

`update_obs_for_equalization`:
1. For each InputEqualizationObserver, we find the corresponding
WeightEqualizationObserver.
2. For nn.Linear layers, we will create an instance of the
WeightEqualizationObserver and run forward on the observer with the
given weights.
3. Calculate the equalization scale between the
InputEqualizationObserver and WeightEqualizationObserver.

`convert_eq_obs`:
For every InputEqualizationObserver, we will do the following:
1. Create a node (e.g. `x0_activation_post_process_scale`) containing
the equalization scale constant.
2. Create another node containing a `mul` operator multiplying the
equalization scale and the input.
3. Remove the current InputEqualizationObserver node, and replace it
with the `mul` node.

For every WeightEqualizationObserver, we will do the following:
1. Get the next equalization scale (we may need this for equalizing
connected linear layers).
2. Scale the weights by multiplying them by the reciprocal of the
current equalization scale and by the next equalization scale.

Currently, this supports models with `nn.Linear` layers, but does not
yet support connected linear layers.

Test Plan: `python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:
```
.LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale_0 : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale_0), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale_0 : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale_0), kwargs = {})
    %linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
    %linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 3885de2
Pull Request resolved: #59963
@angelayi

@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

This pull request has been merged in 3de79b7.


@walterddr walterddr left a comment


@facebook-github-bot

This pull request has been reverted by e60f9cf.

@facebook-github-bot facebook-github-bot deleted the gh/angelayi/13/head branch June 26, 2021 14:17
