<font size=10>**Magnitude-based Kernel Pruning**</font>

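+ Build and train a LeNet network
+ Use LeNet as an example to walk through the basic workflow of magnitude-based pruning
+ Train the pruned model
+ Compare inference speed and accuracy before and after pruning
+ Compare inference speed and accuracy of the quantized pretrained model and the quantized pruned, fine-tuned model (not covered in this article)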

In [1]:
import tensorflow as tf
print(f"tf version = {tf.__version__}")

import tensorflow_model_optimization as tfmot
from tensorflow.keras.layers import InputLayer,Reshape,Conv2D,MaxPool2D,Flatten,Dense,Dropout
from tensorflow.keras.models import load_model
import numpy as np
from numpy import linalg as LA
import tempfile

tf version = 2.2.0


*To avoid GPU out-of-memory errors, let TensorFlow allocate GPU memory on demand

In [2]:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

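**Helper: inspect the sparsity of each layer**

Only the two conv layers and the three dense layers have weights; the dense layers also have biases.
The printed results show that:
+ In the pruned dense layers the biases are zeroed together with their kernel columns
+ Pruning is applied layer by layer; each pruned layer ends up about 60% sparse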

In [3]:
def get_sparsity(weights):
    return 1.0 - np.count_nonzero(weights) / float(weights.size)

def list_sparsity(_model):
    for layer in _model.layers:
        for weight in layer.get_weights():
            '''
            print(np.allclose(
                target_sparsity, get_sparsity(tf.keras.backend.get_value(weight)), 
                rtol=1e-6, atol=1e-6)
            )
            '''
            print('%s sparsity:%f' %(layer.name, get_sparsity(tf.keras.backend.get_value(weight))))

# Build and train the LeNet network

## Load the dataset

In [4]:
# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the input images so that every pixel value lies in [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channel dimension: [height, width, channels (depth)]
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

## Build the LeNet model

In [5]:
model = tf.keras.models.Sequential([
        Conv2D(filters=6,kernel_size=5,strides=(1,1),padding='same',activation='relu',use_bias=False,input_shape=(28,28,1)),
        MaxPool2D(pool_size=(3,3),strides=2,padding="same"),
        Conv2D(filters=16,kernel_size=5,strides=(1,1),padding='same',activation='relu',use_bias=False),
        MaxPool2D(pool_size=(3,3),strides=2,padding="same"),
        Flatten(),
        Dense(120, activation='relu'),
        Dense(84, activation='relu'),
        Dropout(0.2),
        Dense(10, activation='softmax')
    ])

print("float32 model:")
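# NOTE: the last Dense layer already applies softmax, so from_logits=False would
# be the consistent setting; with from_logits=True the reported loss stays near
# 1.5 even at ~96% accuracy (see the log below).
model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])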
model.summary()

float32 model:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 6)         150       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 6)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 16)        2400      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 120)               94200     
_________________________________________________________________
dense_1 (Dense)              (None, 84)  

## Train and evaluate the model

In [6]:
print("==> training")
model.fit(x_train, y_train, epochs=1, validation_split=0.1)
print("==> evaluate")
baseline_model_accuracy = model.evaluate(x_test, y_test, verbose=2)

==> training
==> evaluate
313/313 - 2s - loss: 1.4998 - accuracy: 0.9621


## Save the pretrained model

In [7]:
model.save("./model/lenet_normal.hdf5")
model_json = model.to_json()
with open('./model/lenet_normal.json', 'w') as file:
    file.write(model_json)

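# Network pruning (magnitude-based pruning) & fine-tuning
Pruning starts at 10% sparsity and ends at 60% (at the end, 60% of the kernel columns in each pruned dense layer, together with their biases, are 0).

+ The pruning schedule is custom and gradual: the sparsity is not raised too often, so the model has ample training time to recover accuracy after each increase
+ Only the dense layers are pruned; the classification head is not pruned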

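## How pruning proceeds gradually
Choose a starting sparsity and a final sparsity, for example 0.1 to 0.6. The pruning process is split into 11 sparsity levels (configurable): [0.1, 0.15, 0.2, ..., 0.6]. All training batches are split into 16 segments (configurable, but there must be more segments than sparsity levels). Pruning happens only during training. Every train batch within one segment uses the same sparsity, and pruning is applied on every train batch. For each train batch, pruning has to run at batch end; only then is the sparsity after each batch the one we expect. If we pruned at train_batch_begin instead, the training step of that batch would update the weights, and some weights that had just been set to 0 could become non-zero again.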

Example:
batch segment:      [0, 10, 20, 30, ..., 150, 160], 16 segments in total
sparsity segment:   [0.1, 0.15, 0.2, 0.25, ..., 0.6], 11 levels in total

+ Every batch in train batches [10, 20) is pruned at sparsity 0.15
+ Every batch in [90, 100) is pruned at sparsity 0.55
+ Every batch in [100, 160] is pruned at sparsity 0.6
+ For the final sparsity of 0.6 we want to train on as many batches as possible, in the hope of a better result
+ Pruning runs at every train_batch_end: for the last pruning step in particular, no training may follow it if the final sparsity is to be the expected one
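
The segment lookup described above can be sketched in a few lines. This is only an illustration under assumptions, not the callback used later: `sparsity_for_batch`, `boundaries` and `levels` are hypothetical names, and the boundaries follow the toy example (one segment every 10 batches) rather than the real training run.

```python
from bisect import bisect_right

# Toy schedule from the example above (assumed values, not the real run):
# a segment boundary every 10 batches and 11 sparsity levels from 0.1 to 0.6.
boundaries = list(range(0, 161, 10))                    # [0, 10, ..., 160]
levels = [round(0.1 + 0.05 * i, 2) for i in range(11)]  # [0.1, 0.15, ..., 0.6]


def sparsity_for_batch(batch, boundaries, levels):
    """Return the sparsity applied at the end of the given global batch."""
    # Index of the segment whose start boundary is <= batch.
    segment = bisect_right(boundaries, batch) - 1
    # Once the levels are exhausted, keep using the final sparsity.
    segment = min(max(segment, 0), len(levels) - 1)
    return levels[segment]


for b in (0, 15, 95, 100, 159):
    print(b, sparsity_for_batch(b, boundaries, levels))
```

Clamping the segment index to the last level is what lets the final sparsity hold for all remaining batches, which matches the intent of training as long as possible at 0.6.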

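## Which weights are pruned
The pruning tool we build prunes dense layers only. For each neuron, the column vector of kernel weights is ranked by its L2 norm; according to the sparsity, the columns with the smallest norms are set entirely to 0, and the bias values that belong to those columns are set to 0 as well.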
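
Written as formulas, the rule above could read as follows. The notation is an assumption introduced here for illustration: $W \in \mathbb{R}^{m \times N}$ is the kernel of a dense layer with $N$ neurons, $b \in \mathbb{R}^{N}$ its bias, and $s$ the target sparsity.

$$
n_j = \lVert W_{:,j} \rVert_2 = \sqrt{\sum_{i=1}^{m} W_{ij}^2}, \qquad k = \lfloor s \cdot N \rfloor
$$

$$
W_{:,j} \leftarrow 0, \quad b_j \leftarrow 0 \qquad \text{for the } k \text{ indices } j \text{ with the smallest } n_j
$$

Because $k$ is rounded down, the achieved sparsity $k / N$ can be slightly below $s$ when $s \cdot N$ is not an integer.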

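**Load the pretrained network and use it as the starting point for pruning. The pruned model_for_pruning references this network, which would overwrite its weights, so we load a separate temporary network here for pruning; take care when referring to it later.**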

In [8]:
pretrained_model = load_model("./model/lenet_normal.hdf5")

## Build the network for pruning

In [9]:
# Compute end step to finish pruning after 2 epochs.
batch_size = 128
epochs = 2
validation_split = 0.1  # 10% of training set will be used for validation set.

num_images = x_train.shape[0] * (1 - validation_split)
end_step = np.ceil(num_images / batch_size).astype(np.int32) * epochs


def unit_prune_dense_layer(k_weights, b_weights, k_sparsity):
    """
    Takes in matrices of kernel and bias weights (for a dense
      layer) and returns the unit-pruned versions of each
    Args:
      k_weights: 2D matrix of the kernel weights of a dense layer
      b_weights: 1D matrix of the biases of a dense layer
      k_sparsity: fraction of units (kernel columns) to set to 0
    Returns:
      kernel_weights: sparse matrix with same shape as the original
        kernel weight matrix
      bias_weights: sparse array with same shape as the original
        bias array
    """

    # Copy the kernel weights and get the column indices ranked by
    # column-wise L2 norm (smallest first)
    kernel_weights = np.copy(k_weights)
    ind = np.argsort(LA.norm(kernel_weights, axis=0))

    # Number of columns (units) to set to 0
    cutoff = int(len(ind) * k_sparsity)
    # The column indices in the 2D kernel weight matrix to set to 0
    sparse_cutoff_inds = ind[0:cutoff]
    kernel_weights[:, sparse_cutoff_inds] = 0.

    # Copy the bias weights and zero the entries that belong to the
    # pruned columns (same indices as above)
    bias_weights = np.copy(b_weights)
    bias_weights[sparse_cutoff_inds] = 0.

    return kernel_weights, bias_weights


# Define model for pruning.
class Prune_dense_layer(tf.keras.callbacks.Callback):
    def __init__(self, layer_index, start_batch, end_batch, start_sparsity,
                 end_sparsity):
        super(Prune_dense_layer, self).__init__()
        self.layer_index = layer_index
        self.global_batch = 0
        self.idx = 0 
        # The batches of the whole training run are split into 16 groups and
        # the gradual pruning into 11 sparsity levels.
        self.prune_batch = np.arange(start_batch, end_batch,
                                     int((end_batch - start_batch) / 16))
        self.prune_sparsity = np.arange(start_sparsity, end_sparsity,
                                        (end_sparsity - start_sparsity) / 10)
        self.prune_sparsity = np.append(self.prune_sparsity, end_sparsity)
        
    def prune_dense_layer(self, sparsity):
        for idx in self.layer_index:
            layer = self.model.get_layer(index=idx)
            new_weights = []
            k_weights, k_bias= layer.get_weights()
            k_weights_pruned, k_bias_pruned = unit_prune_dense_layer(
                k_weights, k_bias, sparsity)
            new_weights.append(k_weights_pruned)
            new_weights.append(k_bias_pruned)
            layer.set_weights(new_weights)

    def on_train_batch_end(self, batch, logs=None):
        # `batch` restarts at 0 every epoch, so use a counter that spans all epochs.
        batch = self.global_batch
        if batch in self.prune_batch:
            idx = np.where(self.prune_batch == batch)
            self.idx = idx[0].item()
            
        index = self.idx
        if self.idx >= np.size(self.prune_sparsity):
            index = np.size(self.prune_sparsity) - 1
            
        self.prune_dense_layer(sparsity=self.prune_sparsity[index])
        '''
        print("######prunning")
        print("batch %d" %(self.prune_batch[index]))
        print("sparsity %f" %(self.prune_sparsity[index]))
        '''
            
        self.global_batch = self.global_batch + 1


prune_dense_layer_callback = Prune_dense_layer(layer_index=[5, 6],
                                               start_batch=0,
                                               end_batch=end_step,
                                               start_sparsity=0.1,
                                               end_sparsity=0.6)

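# Clone the architecture. Only the layers listed in the callback are pruned.
model_for_pruning = tf.keras.models.clone_model(pretrained_model)
# clone_model creates freshly initialized layers, so copy the pretrained
# weights explicitly to fine-tune from the pretrained network.
model_for_pruning.set_weights(pretrained_model.get_weights())

# The cloned model must be compiled before training.
model_for_pruning.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])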

model_for_pruning.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 6)         150       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 6)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 16)        2400      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 120)               94200     
_________________________________________________________________
dense_1 (Dense)              (None, 84)                1

## Fine-tune the pruned network & evaluate

In [10]:
logdir = tempfile.mkdtemp()
callbacks = [
    prune_dense_layer_callback
]

print("==> training")
model_for_pruning.fit(x_train, y_train,
                  batch_size=batch_size, epochs=epochs, validation_split=validation_split,
                  callbacks=callbacks)
print("==> evaluate")
model_for_pruning_accuracy = model_for_pruning.evaluate(x_test, y_test, verbose=2)

==> training
Epoch 1/2
Epoch 2/2
==> evaluate
313/313 - 2s - loss: 1.4911 - accuracy: 0.9723


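## Inspect the per-layer sparsity of the pruned model

Only the two conv layers and the three dense layers have weights; the dense layers also have biases.
The printed results show that:
+ Pruning is applied layer by layer; each pruned layer (`dense`, `dense_1`) ends up about 60% sparse, for both kernel and bias

The sparsity of `dense_1` is not exactly 0.6 because this layer has 84 neurons and 84 * 0.6 = 50.4 is not an integer, so only 50 neurons can be pruned: 50 / 84 = 0.595238.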
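
The rounding effect can be checked with a few lines of plain Python. This is a small sketch; `achieved_unit_sparsity` is a hypothetical helper that mirrors the `cutoff = int(len(ind) * k_sparsity)` truncation in `unit_prune_dense_layer`.

```python
def achieved_unit_sparsity(num_units, target_sparsity):
    """Return (units pruned, resulting sparsity) when the count is truncated."""
    # int() truncates, just like the cutoff computation in unit_prune_dense_layer.
    pruned = int(num_units * target_sparsity)
    return pruned, pruned / num_units


# Layer widths of the two pruned dense layers in this notebook.
for units in (120, 84):
    pruned, sparsity = achieved_unit_sparsity(units, 0.6)
    print(f"{units} units -> {pruned} pruned, sparsity {sparsity:.6f}")
```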

In [11]:
list_sparsity(model_for_pruning)    

conv2d sparsity:0.000000
conv2d_1 sparsity:0.000000
dense sparsity:0.600000
dense sparsity:0.600000
dense_1 sparsity:0.595238
dense_1 sparsity:0.595238
dense_2 sparsity:0.000000
dense_2 sparsity:0.000000


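## Save the fine-tuned model
+ The model becomes more compressible:
Because 60% of the weights have been pruned (set to 0), the model compresses to a smaller size.
With "bz2" compression: before compression the pretrained model and the pruned model are both 1338456 bytes; after compression they are 1239917 bytes and 969408 bytes respectively, so the compressed pruned model is about 22% smaller than the compressed pretrained one.
+ The network's run time is reduced: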

In [12]:
model_for_pruning.save("./model/lenet_prune.hdf5")
model_for_pruning_json = model_for_pruning.to_json()
with open('./model/lenet_prune.json', 'w') as file:
    file.write(model_for_pruning_json)
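
The text does not show how the compressed sizes were measured. One plausible way, assuming Python's standard-library `bz2` module and the file paths written by the cells above, is sketched here; `bz2_size` is a hypothetical helper name, and the exact byte counts depend on the run.

```python
import bz2
import os


def bz2_size(path):
    """Return (raw size, bz2-compressed size) of a file in bytes."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data), len(bz2.compress(data))


# Paths are the ones saved earlier in this notebook; missing files are skipped.
for path in ("./model/lenet_normal.hdf5", "./model/lenet_prune.hdf5"):
    if os.path.exists(path):
        raw, packed = bz2_size(path)
        print(f"{path}: {raw} bytes raw, {packed} bytes after bz2")
```

Runs of zeros compress very well, which is why the pruned file shrinks more even though both files have the same raw size.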

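# Quantize the fine-tuned model
+ Convert the pretrained model and the fine-tuned model to TFLite format
+ Quantize the models
+ Compare the accuracy of the quantized models before and after pruning
+ Check whether the quantized fine-tuned model is still 60% sparse

## Convert and quantize: pretrained model and fine-tuned model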

In [13]:
def representative_data_gen():
    for input_value in tf.data.Dataset.from_tensor_slices(x_train.astype(np.float32)).batch(1).take(100):
        # Model has only one input so each data point has one element.
        yield [input_value]

# Convert the original Keras models to TFLite and quantize them (float-fallback quantization; full-integer quantization is only supported from TF 2.3 onwards)
base_converter = tf.lite.TFLiteConverter.from_keras_model(model)
base_converter.optimizations = [tf.lite.Optimize.DEFAULT]
base_converter.representative_dataset = representative_data_gen
base_tflite_model = base_converter.convert()

converter = tf.lite.TFLiteConverter.from_keras_model(model_for_pruning)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
pruned_tflite_model = converter.convert()

print('convert TFLite done')

convert TFLite done


## Save the post-training-quantized TFLite models

In [14]:
with open('./model/lenet_normal.tflite', 'wb') as file:
    file.write(base_tflite_model)
with open('./model/lenet_prune.tflite', 'wb') as file:
    file.write(pruned_tflite_model)

## Evaluate the accuracy of the quantized pretrained and fine-tuned models

In [15]:
def evaluate_model(interpreter):
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    # Run predictions on every image in the "test" dataset.
    prediction_digits = []
    for i, test_image in enumerate(x_test):
        if i % 1000 == 0:
            print('Evaluated on {n} results so far.'.format(n=i))
        # Pre-processing: add batch dimension and convert to float32 to match with
        # the model's input data format.
        test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
        interpreter.set_tensor(input_index, test_image)

        # Run inference.
        interpreter.invoke()

        # Post-processing: remove batch dimension and find the digit with highest
        # probability.
        output = interpreter.tensor(output_index)
        digit = np.argmax(output()[0])
        prediction_digits.append(digit)

    print('\n')
    # Compare prediction results with ground truth labels to calculate accuracy.
    prediction_digits = np.array(prediction_digits)
    accuracy = (prediction_digits == y_test).mean()
    return accuracy

interpreter = tf.lite.Interpreter(model_content=base_tflite_model)
interpreter.allocate_tensors()
test_accuracy = evaluate_model(interpreter)

prune_interpreter = tf.lite.Interpreter(model_content=pruned_tflite_model)
prune_interpreter.allocate_tensors()
prune_test_accuracy = evaluate_model(prune_interpreter)

print("Quantized TFLite model accuracy compare:===>")
print('Base(pre-trained) TFLite test_accuracy:', test_accuracy)
print('Pruned TFLite test_accuracy:', prune_test_accuracy)


Evaluated on 0 results so far.
Evaluated on 1000 results so far.
Evaluated on 2000 results so far.
Evaluated on 3000 results so far.
Evaluated on 4000 results so far.
Evaluated on 5000 results so far.
Evaluated on 6000 results so far.
Evaluated on 7000 results so far.
Evaluated on 8000 results so far.
Evaluated on 9000 results so far.


Evaluated on 0 results so far.
Evaluated on 1000 results so far.
Evaluated on 2000 results so far.
Evaluated on 3000 results so far.
Evaluated on 4000 results so far.
Evaluated on 5000 results so far.
Evaluated on 6000 results so far.
Evaluated on 7000 results so far.
Evaluated on 8000 results so far.
Evaluated on 9000 results so far.


Quantized TFLite model accuracy compare:===>
Base(pre-trained) TFLite test_accuracy: 0.9623
Pruned TFLite test_accuracy: 0.9725


**Inspect the tensors of the pruned, fine-tuned, quantized model to find the indices of the weight tensors, so that we can extract the weights**

In [16]:
all_layer_details = prune_interpreter.get_tensor_details()
for layer in all_layer_details:
    print(layer['name'])
    print(layer['shape'])
    print(layer['quantization'])
    print(layer['index'])

conv2d_input_int8
[ 1 28 28  1]
(0.003921568859368563, -128)
0
sequential/dense/BiasAdd/ReadVariableOp
[120]
(8.208009239751846e-05, 0)
1
sequential/dense_1/BiasAdd/ReadVariableOp
[84]
(0.00020712010154966265, 0)
2
sequential/dense_2/BiasAdd/ReadVariableOp
[10]
(0.0003023018944077194, 0)
3
sequential/flatten/Const
[2]
(0.0, 0)
4
sequential/dense/MatMul
[120 784]
(0.0029068407602608204, 0)
5
sequential/dense_1/MatMul
[ 84 120]
(0.002946030581369996, 0)
6
sequential/dense_2/MatMul
[10 84]
(0.0033457628451287746, 0)
7
sequential/conv2d/Conv2D
[6]
(0.0, 0)
8
sequential/conv2d/Conv2D1
[6 5 5 1]
(0.0, 0)
9
sequential/conv2d_1/Conv2D
[16]
(0.0, 0)
10
sequential/conv2d_1/Conv2D1
[16  5  5  6]
(0.0, 0)
11
sequential/conv2d/Relu;sequential/conv2d/Conv2D
[ 1 28 28  6]
(0.011796954087913036, -128)
12
sequential/max_pooling2d/MaxPool
[ 1 14 14  6]
(0.011796954087913036, -128)
13
sequential/conv2d_1/Relu;sequential/conv2d_1/Conv2D
[ 1 14 14 16]
(0.02823687344789505, -128)
14
sequential/max_pooling2d

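## Measure the sparsity of the pruned, quantized model
+ The tensor details extracted above tell us which tensors hold the weights, so we can measure the sparsity of the quantized weights. The results show that quantization did not change the sparsity: about 60% of the values in the pruned dense weight tensors are still 0 after quantization. Sparsity survives quantization only if a float weight of 0 is stored as the integer 0. From the quantization formula $r = s(q - z)$, this requires zero_point $z = 0$, which is what the tensor details above list for the dense weight tensors. For an unsigned asymmetric range, $z = 0$ would additionally imply $\min(r) = 0$, i.e. the smallest weight would have to be 0; the weights here appear to be a special case with the zero point fixed at 0.
+ So far I have not found any material stating that quantization preserves sparsity
+ Pruning should mainly be used to make a model more compressible and therefore easier to deploy on edge devices. The example provided by TensorFlow likewise looks only at compressibility (for instance, a LeNet pruned to 80% sparsity becomes about 3x smaller with zip compression, and converting from Keras to TFLite plus quantization on top of the pruning gives about 10x in total); it does not examine whether inference time improves.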
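
The role of the zero point can be illustrated with the formula $r = s(q - z)$ in plain Python. This is a sketch under assumptions: `quantize` and `dequantize` are hypothetical helpers using round-to-nearest and an int8 range, not the converter's actual implementation, and the zero point of -128 is chosen only for contrast.

```python
def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization q = round(r / s) + z, clamped to the int8 range."""
    q = round(r / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Inverse mapping r = s * (q - z)."""
    return scale * (q - zero_point)


scale = 0.0029068407602608204  # scale listed for sequential/dense/MatMul above

# zero_point = 0: a pruned float weight 0.0 is stored as the integer 0.
print(quantize(0.0, scale, zero_point=0))
# A non-zero zero_point (hypothetical here) stores 0.0 as a non-zero integer,
# so counting non-zeros on the raw tensor would no longer see it as pruned.
print(quantize(0.0, scale, zero_point=-128))
# Weights much smaller than the scale also round to 0.
print(quantize(0.001, scale, zero_point=0))
```

The last line hints at why the measured sparsity of a quantized tensor can come out slightly above the pruning target: small non-zero weights may round to 0 as well.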

In [17]:
# layers: [5]:dense weights, [6]:dense1 weights, [7]:dense2 weights, [9]conv2d weights, [11]:conv2d_1 weights
weight_layer_index = [5, 6 ,7 ,9 ,11]
weight_layer_name = ["dense", "dense1" ,"dense2" ,"conv2d" ,"conv2d_1"]
weight_layers = {"dense" : 5, "dense1" : 6,"dense2" : 7, "conv2d" : 9, "conv2d_1" : 11}

for name, index in weight_layers.items():
    weight = prune_interpreter.get_tensor(index)
    print('%s sparsity:%f' %(name, get_sparsity(weight)))

dense sparsity:0.606420
dense1 sparsity:0.598611
dense2 sparsity:0.002381
conv2d sparsity:0.006667
conv2d_1 sparsity:0.009583
