[quant] Input-Weight Equalization - convert modifications #59963
Conversation
Summary: When converting, before quantizing the nodes, we call `update_obs_for_equalization()` and `convert_eq_obs()`. These functions find input/weight equalization observer pairs and calculate the equalization scale. Using this equalization scale, we scale the inputs by inserting a `mul` node into the graph that multiplies the inputs by the equalization scale, and we scale the weights by multiplying them by the reciprocal of the equalization scale and updating the weight values in place.
Before:
```
                                        weights
                                           |
x -> input_quantization_observer -> input_equalization_observer -> linear -> output_quantization_observer
```
After:
```
  equalization_scale                                weights (scaled)
          |                                                |
x -> mul -> input_quantization_observer (scaled) -> linear -> output_quantization_observer
```
In addition to updating the input and weight values, the input quantization observers are updated so that their `scale` and `zero_point` qparams reflect the scaled input values. These qparams are used later to create a `quantize_per_tensor` node, which converts floats to quantized tensors. The weight quantization observers are re-calibrated during the call to `from_float` with the scaled weights as inputs, so their qparams reflect the changes made to the weight values. These updated quantization observers are then used to construct the final quantized model, along with the scaled inputs and weights.
What `update_obs_for_equalization` does:
1. For each InputEqualizationObserver, find the corresponding WeightEqualizationObserver.
2. For nn.Linear layers, create an instance of the WeightEqualizationObserver and run it forward with the given weights.
3. Calculate the equalization scale between the InputEqualizationObserver and the WeightEqualizationObserver (see the sketch below).
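As a rough illustration of step 3, here is a minimal sketch of the scale calculation; the observer attribute names (`min_val`/`max_val`) are assumptions for illustration, not the exact observer API:
```python
import torch

def calculate_equalization_scale(input_obs, weight_obs):
    # Per-channel ranges recorded by the observers (attribute names assumed).
    x_min, x_max = input_obs.min_val, input_obs.max_val    # input channels
    w_min, w_max = weight_obs.min_val, weight_obs.max_val  # weight columns
    # Balance each input channel's range against the range of the
    # corresponding weight column: scale_c = sqrt(range(W[:, c]) / range(x_c))
    return torch.sqrt((w_max - w_min) / (x_max - x_min))
```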
What `convert_eq_obs` does:
For every InputEqualizationObserver, we do the following (see the FX sketch after this list):
1. Create a node (e.g. `x0_activation_post_process_scale`) containing the equalization scale constant.
2. Create another node containing a `mul` operator that multiplies the input by the equalization scale.
3. Remove the current InputEqualizationObserver node and replace it with the `mul` node.
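A minimal FX sketch of those three steps, assuming `obs_node` is the InputEqualizationObserver's `call_module` node; the helper name and attribute naming are hypothetical, and the real pass does more bookkeeping:
```python
import torch

def replace_input_eq_obs_with_mul(model, graph, obs_node, equalization_scale):
    # Step 1: register the equalization scale as a constant attribute and
    # reference it through a get_attr node.
    scale_name = obs_node.name + "_scale"
    setattr(model, scale_name, equalization_scale)
    input_node = obs_node.args[0]
    with graph.inserting_before(obs_node):
        scale_node = graph.create_node("get_attr", scale_name)
        # Step 2: multiply the input by the equalization scale.
        mul_node = graph.create_node(
            "call_function", torch.mul, (input_node, scale_node), {})
    # Step 3: reroute users to the mul node and erase the observer node.
    obs_node.replace_all_uses_with(mul_node)
    graph.erase_node(obs_node)
```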
For every WeightEqualizationObserver, we do the following:
1. Get the next equalization scale (we may need this for equalizing connected linear layers).
2. Scale the weights by multiplying them by the reciprocal of the current equalization scale and by the next equalization scale (see the sketch after this list).
Currently, this supports models with `nn.Linear` layers, but does not support connected linear layers.
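A minimal sketch of the weight scaling in step 2, assuming a 2-D `nn.Linear` weight of shape `(out_features, in_features)`; the function name and broadcasting are illustrative, not the exact library code:
```python
import torch

def scale_linear_weight(weight, equalization_scale, next_equalization_scale=None):
    # Each weight column multiplies an input channel that was already scaled
    # by equalization_scale, so divide the columns (dim 1) by it.
    scaled = weight * torch.reciprocal(equalization_scale).reshape(1, -1)
    # If a following linear layer is also being equalized, pre-scale this
    # layer's output channels (dim 0) by the next layer's scale.
    if next_equalization_scale is not None:
        scaled = scaled * next_equalization_scale.reshape(-1, 1)
    return scaled
```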
Test Plan: `python test/test_quantization.py TestEqualizeFx.test_input_weight_equalization_convert`
Original Model:
```
LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```
Graph after `prepare_fx`:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
%x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
%linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
return linear_activation_post_process_0
```
Graph after equalization functions:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
%mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
%x_equalization_process_0_activation_post_process_0 : [#users=1] = call_module[target=x_equalization_process_0_activation_post_process_0](args = (%mul,), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%x_equalization_process_0_activation_post_process_0,), kwargs = {})
%linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
return linear_activation_post_process_0
```
Graph after `convert_fx`:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
%mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
%linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
%linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
%quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
%dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
return dequantize
```
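For context, a minimal end-to-end sketch that produces graphs like the ones above; the `default_equalization_qconfig` import path is an assumption based on this PR's test code, and backend availability (e.g. fbgemm) may vary:
```python
import torch
import torch.nn as nn
from torch.quantization.quantize_fx import prepare_fx, convert_fx
# Import path assumed for illustration:
from torch.quantization.fx._equalize import default_equalization_qconfig

class LinearModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(x)

m = LinearModule().eval()
qconfig_dict = {"": torch.quantization.get_default_qconfig("fbgemm")}
equalization_qconfig_dict = {"": default_equalization_qconfig}

prepared = prepare_fx(m, qconfig_dict,
                      equalization_qconfig_dict=equalization_qconfig_dict)
prepared(torch.randn(4, 2))       # calibrate the observers
quantized = convert_fx(prepared)  # runs update_obs_for_equalization/convert_eq_obs
```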
💊 CI failures summary and remediations
As of commit 23c92d1 (more details on the Dr. CI page and at hud.pytorch.org/pr/59963):
🕵️ 2 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Review comment on `test_equalize_fx.py`:
```
prepared = prepare_fx(m, qconfig_dict, equalization_qconfig_dict=equalization_qconfig_dict)
self.checkGraphModuleNodes(prepared, expected_node_occurrence=node_occurrence)

def test_input_weight_equalization_convert(self):
```
Can we add a new test that verifies the graph structure after equalization is done?
Would this be after the equalization functions or after all of convert?
I think we can do both - check the output of `_convert_equalization_ref` and `convert_fx` (with equalization).
This was broken on macOS:
```
Traceback (most recent call last):
  File "/Users/distiller/project/test/quantization/fx/test_equalize_fx.py", line 294, in test_input_weight_equalization_convert
    convert_fx(prepared)  # Check if compile?
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/quantize_fx.py", line 543, in convert_fx
    return _convert_fx(graph_module, is_reference, convert_custom_config_dict, _remove_qconfig=_remove_qconfig)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/quantize_fx.py", line 477, in _convert_fx
    is_standalone_module, _remove_qconfig_flag=_remove_qconfig)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/fx/convert.py", line 446, in convert
    convert_custom_config_dict=convert_custom_config_dict)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/quantization/fx/quantization_patterns.py", line 687, in convert
    quantized = qlinear.from_float(self.linear)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 276, in from_float
    dtype=dtype)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 151, in __init__
    self._packed_params = LinearPackedParams(dtype)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 19, in __init__
    self.set_weight_bias(wq, None)
  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/modules/linear.py", line 24, in set_weight_bias
    self._packed_params = torch.ops.quantized.linear_prepack(weight, bias)
Didn't find engine for operation quantized::linear_prepack NoQEngine
```
@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks good to me. @jerryzh168 please review once as well.
Looks good overall; I had a question about the order of the equalization observer and the quant observer.
Summary: When converting, before quantizing the nodes, we call `update_obs_for_equalization()` and `convert_eq_obs()`. This will find input and weight equalization observers pairs and calculate the equalization scale. Using this equalization scale, we will scale the inputs by inserting a mul node into the graph to multiply the inputs by the equalization scale, and we will scale the weights by multiplying it by the reciprocal of the equalization scale and manually updating the weight value.
Before:
```
weights
|
x -> input_quantization_observer -> input_equalization_observer -> linear -> output_quantization_observer
```
After:
```
equalization_scale weights (scaled)
| |
x -> mul -> input_quantization_observer (scaled) -> linear -> output_quantization_observer
```
In addition to updating the input and weight values, the input quantization observers will be updated so that the `scale` and `zero_point` qparams reflect the scaled input values. These qparams will be used later to create a `quantize_per_tensor` node which converts floats to quantized tensors. The weight quantization observers will be re-calibrated during the call to `from_float` with the scaled weights as inputs, causing their qparams to reflect changes made to the weight values.
These updated quantization observers will then be used to construct the final quantized model based along with the scaled inputs and weights.
What `update_obs_for_equalization` does:
1. For each InputEqualizationObserver, we find the corresponding WeightEqualizationObserver.
2. For nn.Linear layers, we will create an instance of the WeightEqualizationObserver, run forward on the observer with the given weights.
3. Calculate the equalization scale between the InputEqualizationObserver and WeightEqualizationObserver.
What `convert_eq_obs` does:
For every InputEqualizationObserver, we will do the following:
1. Create a node (ex. `x0_activation_post_process_scale`) containing the equalization scale constant.
2. Create another node containing a `mul` operator multiplying the equalization scale and the input.
3. Remove the current InputEqualizationObserver node, and replace it with the `mul` node.
For every WeightEqualizationObserver, we will do the following:
1. Get the next equalization scale (we may need this for equalizing connected linear layers).
2. Scale the weights by multiplying it with the reciprocal of the current equalization scale and the next equalization scale
Currently, this supports models with `nn.Linear` layers, but does not support connecting linear layers.
Test Plan: `python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`
Original Model:
```
.LinearModule(
(linear): Linear(in_features=2, out_features=2, bias=True)
)
```
Graph after `prepare_fx`:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
%x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
%linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
return linear_activation_post_process_0
```
Graph after equalization functions:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
%mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
%x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%mul,), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
%linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
return linear_activation_post_process_0
```
Graph after `convert_fx`:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
%mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
%linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
%linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
%quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
%dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
return dequantize
```
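For reference, a hedged end-to-end sketch that would produce graphs like those above. The `equalization_qconfig_dict` keyword and the `default_equalization_qconfig` import path are assumptions about this PR's API surface and may be named differently in other releases:

```python
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx
# Assumed import path for the equalization qconfig introduced by this stack.
from torch.quantization.fx._equalize import default_equalization_qconfig

class LinearModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(x)

model = LinearModule().eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}
prepared = prepare_fx(model, qconfig_dict,
                      equalization_qconfig_dict={"": default_equalization_qconfig})
prepared(torch.randn(4, 2))  # calibrate with sample data
quantized = convert_fx(prepared)
print(quantized.graph)
```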
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D29135358](https://our.internmc.facebook.com/intern/diff/D29135358)
[ghstack-poisoned]
@angelayi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: When converting, before quantizing the nodes, we call
`update_obs_for_equalization()` and `convert_eq_obs()`.
`update_obs_for_equalization`:
1. For each InputEqualizationObserver, we find the corresponding
WeightEqualizationObserver.
2. For nn.Linear layers, we create an instance of the
WeightEqualizationObserver and run it forward on the layer's
weights.
3. Calculate the equalization scale between the
InputEqualizationObserver and WeightEqualizationObserver.
`convert_eq_obs`:
For every InputEqualizationObserver, we will do the following:
1. Create a node (e.g., `x0_activation_post_process_scale`) containing the
equalization scale constant.
2. Create another node containing a `mul` operator multiplying the
equalization scale and the input.
3. Remove the current InputEqualizationObserver node, and replace it
with the `mul` node.
For every WeightEqualizationObserver, we will do the following:
1. Get the next equalization scale (we may need this for equalizing
connected linear layers).
2. Scale the weights by multiplying them by the reciprocal of the
current equalization scale and by the next equalization scale.
Currently, this supports models with `nn.Linear` layers, but does not
support connected linear layers.
Test Plan: `python test/test_quantization.py TestEqualizeFx.test_input_weight_equalization_convert`
Original Model:
```
.LinearModule(
(linear): Linear(in_features=2, out_features=2, bias=True)
)
```
Graph after `prepare_fx`:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
%x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
%linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
return linear_activation_post_process_0
```
Graph after equalization functions:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_scale_0 : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
%mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale_0), kwargs = {})
%x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
%linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
return linear_activation_post_process_0
```
Graph after `convert_fx`:
```
graph():
%x : [#users=1] = placeholder[target=x]
%x_equalization_scale_0 : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
%mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale_0), kwargs = {})
%linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
%linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
%quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
%linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
%dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
return dequantize
```
Reviewers:
Subscribers:
Tasks:
Tags:
ghstack-source-id: 3885de2
Pull Request resolved: #59963
This pull request has been merged in 3de79b7.
Reverting this because it broke the macOS test: https://app.circleci.com/pipelines/github/pytorch/pytorch/340406/workflows/f48c759f-2242-4907-b4d2-a1f35cfbde68/jobs/14316635
This pull request has been reverted by e60f9cf.