<font size=10>**Quantization Aware Training**</font>
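+ Walk through the basic quantization-aware training (QAT) workflow, using LeNet as the example
+ Deploy the network model after quantization-aware training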

In [1]:
import tensorflow as tf
print(f"tf version = {tf.__version__}")

import tensorflow_model_optimization as tfmot
from tensorflow.keras.layers import InputLayer, Reshape, Conv2D, MaxPool2D, Flatten, Dense, Dropout
from tensorflow.keras.models import load_model
import numpy as np

tf version = 2.2.0


## Avoid GPU out-of-memory errors by allocating GPU memory on demand

In [2]:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
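
The session-based setting above comes from the TF1 compatibility layer. As a hedged alternative, TensorFlow 2.x also offers a per-device switch. The sketch below assumes it runs before any operation has touched the GPU, because memory growth cannot be changed once a device has been initialized.

```python
import tensorflow as tf

# List the GPUs visible to TensorFlow; the list is empty on CPU-only machines.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory on demand instead of reserving it all up front.
    tf.config.experimental.set_memory_growth(gpu, True)
print(f"memory growth enabled on {len(gpus)} GPU(s)")
```

Either approach should be enough on its own; using both in the same notebook is not necessary.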

## Load the dataset

In [3]:
# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the input images so that every pixel value lies in [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channel axis so each image has shape [height, width, channels]
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

## Build the LeNet model

In [4]:
model = tf.keras.models.Sequential([
        Conv2D(filters=6, kernel_size=5, strides=(1, 1), padding='same', activation='relu', use_bias=False, input_shape=(28, 28, 1)),
        MaxPool2D(pool_size=(3, 3), strides=2, padding='same'),
        Conv2D(filters=16, kernel_size=5, strides=(1, 1), padding='same', activation='relu', use_bias=False),
        MaxPool2D(pool_size=(3, 3), strides=2, padding='same'),
        Flatten(),
        Dense(120, activation='relu'),
        Dense(84, activation='relu'),
        Dropout(0.2),
        Dense(10, activation='softmax')
    ])
'''
model = tf.keras.models.Sequential([
        InputLayer(input_shape=(28, 28, 1)),
        Conv2D(filters=12, kernel_size=(3, 3),activation='relu'),
        MaxPool2D(pool_size=(2,2)),
        Flatten(),
        Dense(10)
    ])
'''

## Train and evaluate the model (standard training)

In [5]:
print("float32 model:")
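# Note: the last layer already applies softmax, so `from_logits=True` feeds
# probabilities into a loss that expects raw logits. Training still works, but
# the reported loss is inflated; `from_logits=False` would match this model.
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])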
model.summary()
print("==> training")
model.fit(x_train, y_train, epochs=1, validation_split=0.1)
print("==> evaluate")
model.evaluate(x_test, y_test, verbose=2)

float32 model:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 6)         150       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 6)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 16)        2400      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 120)               94200     
_________________________________________________________________
dense_1 (Dense)              (None, 84)  

[1.4923863410949707, 0.97079998254776]
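
A loss of about 1.49 next to 97% accuracy is worth a second look. The model's last layer already applies softmax, yet the loss is built with `from_logits=True`, so in this TF 2.2 run the softmax is effectively applied twice. The sketch below uses plain Python rather than the Keras implementation. It shows that, under this combination, the loss for a 10-class problem cannot fall below roughly 1.46, even for a perfectly confident and correct prediction.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Softmax cross-entropy for one sample, treating `logits` as raw scores."""
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[label]

num_classes = 10
# Best case: the softmax layer outputs probability 1.0 for the true class.
perfect_probs = [1.0] + [0.0] * (num_classes - 1)

# Passing probabilities where logits are expected applies softmax a second time.
loss_floor = cross_entropy_from_logits(perfect_probs, label=0)
print(f"lowest reachable loss: {loss_floor:.4f}")  # equals log(1 + 9/e)
```

The reported 1.4924 sits just above this floor, so accuracy is the more informative metric in these outputs.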

## Save the pretrained model

In [6]:
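import os
# Make sure the output directory exists before saving.
os.makedirs("./model", exist_ok=True)
model.save("./model/lenet_normal.hdf5")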
model_json = model.to_json()
with open('./model/lenet_normal.json', 'w') as file:
    file.write(model_json)

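## Quantization-aware training
See the [Quantization aware training comprehensive guide](https://tensorflow.google.cn/model_optimization/guide/quantization/training_comprehensive_guide).

To improve model accuracy, the following is recommended:
+ Fine-tuning with quantization-aware training generally works better than training from scratch
+ Try "quantize some layers" to skip quantizing the layers that hurt accuracy the most
+ Try quantizing the later layers rather than the earlier ones
+ Avoid quantizing critical layers

*In this experiment, applying QAT to the entire pretrained model gave very poor test accuracy, so QAT was applied only to the fully connected layers.*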
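
To make the idea concrete: during QAT the forward pass rounds values onto the grid that an 8-bit integer could represent and then maps them back to float, so the network learns to tolerate the rounding error. The sketch below is a simplified, pure-Python illustration of that quantize-dequantize step. The function name `fake_quantize` and the min/max scheme are assumptions made for illustration, not the exact tfmot implementation.

```python
def fake_quantize(values, min_val, max_val, num_bits=8):
    """Quantize-dequantize `values` onto a uniform grid spanning [min_val, max_val]."""
    levels = 2 ** num_bits - 1              # 255 steps for 8 bits
    scale = (max_val - min_val) / levels    # width of one quantization step
    result = []
    for v in values:
        clipped = min(max(v, min_val), max_val)    # saturate out-of-range values
        q = round((clipped - min_val) / scale)     # integer code in [0, levels]
        result.append(min_val + q * scale)         # map the code back to float
    return result

weights = [-1.2, -0.503, 0.0, 0.1234, 0.75, 2.0]
approx = fake_quantize(weights, min_val=-1.0, max_val=1.0)
for w, a in zip(weights, approx):
    print(f"{w:+.4f} -> {a:+.4f} (error {abs(w - a):.4f})")
```

Values inside the range move by at most half a step, while values outside it are clipped to the range limits. This is why the learned min/max range matters for accuracy.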

In [7]:
# Quantization-aware training
print("quantized model:")

# Quantize the whole model
#qat_model = tfmot.quantization.keras.quantize_model(model)


# Quantize only some layers
# Helper function uses `quantize_annotate_layer` to annotate that only the 
# Dense layers should be quantized.
def apply_quantization_to_dense(layer):
    if isinstance(layer, tf.keras.layers.Dense):
        return tfmot.quantization.keras.quantize_annotate_layer(layer)
    return layer

# Use `tf.keras.models.clone_model` to apply `apply_quantization_to_dense` 
# to the layers of the model.
annotated_model = tf.keras.models.clone_model(
    model,
    clone_function=apply_quantization_to_dense,
)

# Now that the Dense layers are annotated,
# `quantize_apply` actually makes the model quantization aware.
qat_model = tfmot.quantization.keras.quantize_apply(annotated_model)


qat_model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])
qat_model.summary()

print("==> training")
x_train_subset = x_train[0:1000]
y_train_subset = y_train[0:1000]
qat_model.fit(x_train_subset, y_train_subset,
              batch_size=500,
              epochs=1,
              validation_split=0.1)
print("==> evaluate")
qat_model.evaluate(x_test, y_test, verbose=2)

quantized model:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 6)         150       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 6)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 16)        2400      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
quant_dense (QuantizeWrapper (None, 120)               94205     
_________________________________________________________________
quant_dense_1 (QuantizeWrapp (None, 84)

[1.5283045768737793, 0.9506000280380249]
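
The parameter counts in the two summaries can be checked by hand, which also shows what the QAT wrapper adds. The sketch below recomputes them from the layer definitions; the helper names are illustrative.

```python
def conv_params(kernel, in_ch, out_ch, use_bias):
    """Number of weights in a square-kernel Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if use_bias else 0)

def dense_params(in_units, out_units):
    """Kernel plus bias of a Dense layer."""
    return in_units * out_units + out_units

conv1 = conv_params(5, 1, 6, use_bias=False)     # 150
conv2 = conv_params(5, 6, 16, use_bias=False)    # 2400
flat = 7 * 7 * 16                                # 784 features after two stride-2 pools
dense = dense_params(flat, 120)                  # 94200

# The summary reports 94205 for quant_dense, i.e. five extra scalars
# that the wrapper stores alongside the kernel and bias.
extra = 94205 - dense
print(conv1, conv2, flat, dense, extra)
```

Those five extra values per wrapped layer are what a later cell unpacks as `aa` through `ee`.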

## Save the QAT model

In [15]:
qat_model.save("./model/lenet_qat.hdf5")
qat_model_json = qat_model.to_json()
with open('./model/lenet_qat.json', 'w') as file:
    file.write(qat_model_json)

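## Update the model with the QAT weights
+ Extract the weights of each layer after QAT (weights, bias)
+ Replace the weights of the corresponding layers in the pretrained model with them
+ Evaluate the accuracy of the model with the updated weights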

In [9]:
# Inspect the name of every layer in both models
'''
for i in model.layers:
    print(i.name)
for i in qat_model.layers:
    print(i.name)
'''

In [10]:
# Fetch the weights and biases from the QAT model.
# aa, bb, cc, dd, ee are values that QAT uses for fake quantization;
# they can be inspected with the Netron tool.

weights_0 = qat_model.get_layer('conv2d').get_weights()
weights_1 = qat_model.get_layer('conv2d_1').get_weights()
bias_2, weights_2, aa, bb, cc, dd, ee = qat_model.get_layer('quant_dense').get_weights()
bias_3, weights_3, aa, bb, cc, dd, ee = qat_model.get_layer('quant_dense_1').get_weights()
bias_4, weights_4, aa, bb, cc, dd, ee = qat_model.get_layer('quant_dense_2').get_weights()
# Dense.set_weights expects [kernel, bias], so reorder accordingly.
new_weights_2 = [weights_2, bias_2]
new_weights_3 = [weights_3, bias_3]
new_weights_4 = [weights_4, bias_4]

# Overwrite the corresponding layers with these weights
opt_model = load_model("./model/lenet_normal.hdf5") 
opt_model.get_layer('conv2d').set_weights(weights_0)
opt_model.get_layer('conv2d_1').set_weights(weights_1)
opt_model.get_layer('dense').set_weights(new_weights_2)
opt_model.get_layer('dense_1').set_weights(new_weights_3)
opt_model.get_layer('dense_2').set_weights(new_weights_4)
print("==> evaluate")
opt_model.evaluate(x_test, y_test, verbose=2)

==> evaluate
313/313 - 4s - loss: 1.4977 - accuracy: 0.9641


[1.4977458715438843, 0.9641000032424927]
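
The unpacking above depends on the positional order returned by `get_weights()`, which is easy to get wrong. A more defensive variant, sketched below, pairs each value with its variable name and picks out the kernel and bias explicitly. It assumes the variable names end in `/kernel:0` and `/bias:0`. That naming is an assumption and is worth checking by printing `[w.name for w in layer.weights]`. The helper `kernel_and_bias` and the stub classes are hypothetical and exist only so the example runs without TensorFlow.

```python
def kernel_and_bias(layer):
    """Return [kernel, bias] from a layer, looked up by variable name.

    `layer` needs a `.weights` list of objects with a `.name`, and a
    `.get_weights()` method returning values in the same order, as Keras
    layers provide.
    """
    by_name = {var.name: value
               for var, value in zip(layer.weights, layer.get_weights())}
    kernel = next(v for n, v in by_name.items() if n.endswith("/kernel:0"))
    bias = next(v for n, v in by_name.items() if n.endswith("/bias:0"))
    return [kernel, bias]   # order expected by Dense.set_weights

# Stand-ins for a wrapped Dense layer (hypothetical names, demonstration only).
class StubVar:
    def __init__(self, name):
        self.name = name

class StubLayer:
    def __init__(self, named_values):
        self.weights = [StubVar(n) for n, _ in named_values]
        self._values = [v for _, v in named_values]

    def get_weights(self):
        return list(self._values)

stub = StubLayer([
    ("quant_dense/bias:0", "BIAS"),
    ("quant_dense/kernel:0", "KERNEL"),
    ("quant_dense/kernel_min:0", -1.0),
    ("quant_dense/kernel_max:0", 1.0),
])
print(kernel_and_bias(stub))
```

With the real models, `opt_model.get_layer('dense').set_weights(kernel_and_bias(qat_model.get_layer('quant_dense')))` would then replace the manual list building, provided the naming assumption holds.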

## Save the QAT-optimized model

In [11]:
opt_model.save("./model/lenet_opt.hdf5")
opt_model_json = opt_model.to_json()
with open('./model/lenet_opt.json', 'w') as file:
    file.write(opt_model_json)

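## Convert the Keras models to TFLite and quantize them
Next, post-training quantization is applied to both TFLite models to check whether the QAT-tuned model loses less accuracy after quantization.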

In [12]:
def representative_data_gen():
    # The model has a single input, so each sample is yielded as a one-element list.
    for input_value in tf.data.Dataset.from_tensor_slices(x_train.astype(np.float32)).batch(1).take(100):
        yield [input_value]

# Convert the original Keras model to TFLite and quantize it
# (float fallback quantization; full integer quantization is only supported from TF 2.3 onward)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
quantized_tflite_model = converter.convert()

opt_converter = tf.lite.TFLiteConverter.from_keras_model(opt_model)
opt_converter.optimizations = [tf.lite.Optimize.DEFAULT]
opt_converter.representative_dataset = representative_data_gen
opt_quantized_tflite_model = opt_converter.convert()
print('convert TFLite done')

convert TFLite done


## Save the post-training-quantized TFLite models

In [13]:
with open('./model/lenet_normal.tflite', 'wb') as file:
    file.write(quantized_tflite_model)
with open('./model/lenet_opt.tflite', 'wb') as file:
    file.write(opt_quantized_tflite_model)

## Evaluate the effect of QAT on quantization

In [14]:
def evaluate_model(interpreter):
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    # Run predictions on every image in the "test" dataset.
    prediction_digits = []
    for i, test_image in enumerate(x_test):
        if i % 1000 == 0:
            print('Evaluated on {n} results so far.'.format(n=i))
        # Pre-processing: add batch dimension and convert to float32 to match with
        # the model's input data format.
        test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
        interpreter.set_tensor(input_index, test_image)

        # Run inference.
        interpreter.invoke()

        # Post-processing: remove batch dimension and find the digit with highest
        # probability.
        output = interpreter.tensor(output_index)
        digit = np.argmax(output()[0])
        prediction_digits.append(digit)

    print('\n')
    # Compare prediction results with ground truth labels to calculate accuracy.
    prediction_digits = np.array(prediction_digits)
    accuracy = (prediction_digits == y_test).mean()
    return accuracy

interpreter = tf.lite.Interpreter(model_content=quantized_tflite_model)
interpreter.allocate_tensors()
test_accuracy = evaluate_model(interpreter)

opt_interpreter = tf.lite.Interpreter(model_content=opt_quantized_tflite_model)
opt_interpreter.allocate_tensors()
opt_test_accuracy = evaluate_model(opt_interpreter)

print('Quant TFLite test_accuracy:', test_accuracy)
print('QAT Optimized Quant TFLite test_accuracy:', opt_test_accuracy)


Evaluated on 0 results so far.
Evaluated on 1000 results so far.
Evaluated on 2000 results so far.
Evaluated on 3000 results so far.
Evaluated on 4000 results so far.
Evaluated on 5000 results so far.
Evaluated on 6000 results so far.
Evaluated on 7000 results so far.
Evaluated on 8000 results so far.
Evaluated on 9000 results so far.


Evaluated on 0 results so far.
Evaluated on 1000 results so far.
Evaluated on 2000 results so far.
Evaluated on 3000 results so far.
Evaluated on 4000 results so far.
Evaluated on 5000 results so far.
Evaluated on 6000 results so far.
Evaluated on 7000 results so far.
Evaluated on 8000 results so far.
Evaluated on 9000 results so far.


Quant TFLite test_accuracy: 0.9706
QAT Optimized Quant TFLite test_accuracy: 0.9635
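
A simple way to read these figures is to compare each quantized model with its own float baseline, using the accuracies printed in the earlier cells. The snippet only does arithmetic on those reported numbers, rounded to four decimals.

```python
# Accuracies reported in this notebook: (float Keras model, quantized TFLite model).
results = {
    "baseline": (0.9708, 0.9706),
    "QAT-tuned": (0.9641, 0.9635),
}

drops = {}
for name, (float_acc, quant_acc) in results.items():
    drops[name] = round(float_acc - quant_acc, 4)
    print(f"{name}: float {float_acc:.4f} -> quantized {quant_acc:.4f} "
          f"(drop {drops[name]:.4f})")
```

Both drops are below 0.1 percentage points in this run. Since the QAT fine-tuning used one epoch on 1000 samples, a longer run may be needed before the comparison says much about the benefit of QAT.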
