##### Copyright 2021 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Pruning for on-device inference with XNNPACK

<table class="tfo-notebook-buttons" align="left">
  <td><a target="_blank" href="https://tensorflow.google.cn/model_optimization/guide/pruning/pruning_for_on_device_inference">     <img src="https://tensorflow.google.cn/images/tf_logo_32px.png">     View on TensorFlow.org</a></td>
  <td>     <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/model_optimization/guide/pruning/pruning_for_on_device_inference.ipynb"><img src="https://tensorflow.google.cn/images/colab_logo_32px.png">Run in Google Colab</a>
</td>
  <td>     <a target="_blank" href="https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/model_optimization/guide/pruning/pruning_for_on_device_inference.ipynb"><img src="https://tensorflow.google.cn/images/GitHub-Mark-32px.png">View source on GitHub</a>
</td>
  <td>     <a href="https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/model_optimization/guide/pruning/pruning_for_on_device_inference.ipynb"><img src="https://tensorflow.google.cn/images/download_logo_32px.png">Download notebook</a>   </td>
</table>

Welcome to the guide on Keras weight pruning for improving the latency of on-device inference via [XNNPACK](https://github.com/google/XNNPACK).

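This guide presents the newly introduced `tfmot.sparsity.keras.PruningPolicy` API and demonstrates how to use it to accelerate models that consist mostly of convolutions on modern CPUs using [XNNPACK sparse inference](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/xnnpack/README.md#sparse-inference).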

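The guide covers the following steps of the model creation process:

- Build and train the dense baseline
- Fine-tune the model with pruning
- Convert to TFLite
- Benchmark on device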

This guide does not cover best practices for fine-tuning with pruning. For more detail on that topic, see the [comprehensive guide](https://tensorflow.google.cn/model_optimization/guide/pruning/comprehensive_guide.md).

## Setup

In [None]:
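! pip install -q tensorflow
! pip install -q tensorflow-model-optimization
# tensorflow_datasets is imported below; install it when running outside Colab.
! pip install -q tensorflow-datasets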

In [None]:
import tempfile

import tensorflow as tf
import numpy as np

from tensorflow import keras
import tensorflow_datasets as tfds
import tensorflow_model_optimization as tfmot

%load_ext tensorboard

## Build and train the dense model

We build and train a simple baseline CNN for a classification task on the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset.

In [None]:
# Load CIFAR10 dataset.
(ds_train, ds_val, ds_test), ds_info = tfds.load(
    'cifar10',
    split=['train[:90%]', 'train[90%:]', 'test'],
    as_supervised=True,
    with_info=True,
)

# Normalize the input image so that each pixel value is between 0 and 1.
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.image.convert_image_dtype(image, tf.float32), label

# Load the data in batches of 128 images.
batch_size = 128
def prepare_dataset(ds, buffer_size=None):
  ds = ds.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
  ds = ds.cache()
  if buffer_size:
    ds = ds.shuffle(buffer_size)
  ds = ds.batch(batch_size)
  ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
  return ds

ds_train = prepare_dataset(ds_train,
                           buffer_size=ds_info.splits['train'].num_examples)
ds_val = prepare_dataset(ds_val)
ds_test = prepare_dataset(ds_test)

# Build the dense baseline model.
dense_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(32, 32, 3)),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.Conv2D(
        filters=8,
        kernel_size=(3, 3),
        strides=(2, 2),
        padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.DepthwiseConv2D(kernel_size=(3, 3), padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=16, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.DepthwiseConv2D(
        kernel_size=(3, 3), strides=(2, 2), padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=32, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])

# Compile and train the dense model for 10 epochs.
dense_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

dense_model.fit(
  ds_train,
  epochs=10,
  validation_data=ds_val)

# Evaluate the dense model.
_, dense_model_accuracy = dense_model.evaluate(ds_test, verbose=0)

## Build the sparse model

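Following the instructions in the [comprehensive guide](https://tensorflow.google.cn/model_optimization/guide/pruning/comprehensive_guide.md), we apply the `tfmot.sparsity.keras.prune_low_magnitude` function with parameters that target on-device acceleration via pruning, namely the `tfmot.sparsity.keras.PruneForLatencyOnXNNPack` policy.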
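To make the idea behind magnitude-based pruning concrete, here is a minimal pure-Python sketch. It is not the `tfmot` implementation (which works on layer tensors with masks that are updated during training); the helper name `prune_lowest_magnitude` and the toy weights are assumptions made only for illustration.

```python
def prune_lowest_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to set to zero (0.0 to 1.0).
    """
    num_to_prune = int(round(len(weights) * sparsity))
    # Indices ordered by absolute value, smallest magnitude first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


# Toy weight vector (hypothetical values).
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6]
pruned = prune_lowest_magnitude(weights, sparsity=0.75)
print(pruned)
```

With 8 weights and a target sparsity of 0.75, only the two largest-magnitude weights survive.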

In [None]:
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Compute the end step so that pruning finishes after 5 epochs.
end_epoch = 5

num_iterations_per_epoch = len(ds_train)
end_step =  num_iterations_per_epoch * end_epoch

# Define parameters for pruning.
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.25,
                                                               final_sparsity=0.75,
                                                               begin_step=0,
                                                               end_step=end_step),
      'pruning_policy': tfmot.sparsity.keras.PruneForLatencyOnXNNPack()
}

# Try to apply pruning wrapper with pruning policy parameter.
try:
  model_for_pruning = prune_low_magnitude(dense_model, **pruning_params)
except ValueError as e:
  print(e)

The call to `prune_low_magnitude` raises a `ValueError` with the message `Could not find a GlobalAveragePooling2D layer with keepdims = True in all output branches`. The message indicates that the model is not supported for pruning with the `tfmot.sparsity.keras.PruneForLatencyOnXNNPack` policy; specifically, the `GlobalAveragePooling2D` layer requires the parameter `keepdims = True`. Let's fix that and reapply the `prune_low_magnitude` function.
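As an illustration of what `keepdims` changes, the following pure-Python sketch averages a single feature map over its spatial axes. It is a simplified stand-in, not the Keras layer: the batch dimension is omitted, and the function name `global_average_pool` and the toy input are assumptions.

```python
def global_average_pool(feature_map, keepdims=False):
    """Average a [height][width][channels] nested list over height and width."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    means = [
        sum(feature_map[h][w][c] for h in range(height) for w in range(width))
        / (height * width)
        for c in range(channels)
    ]
    # With keepdims, the spatial axes are retained with size 1: [1][1][channels].
    return [[means]] if keepdims else means


# Toy 2x2 feature map with 3 channels (hypothetical values).
feature_map = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    [[5.0, 6.0, 7.0], [7.0, 8.0, 9.0]],
]
flat = global_average_pool(feature_map)
kept = global_average_pool(feature_map, keepdims=True)
print(flat)
print(kept)
```

The values are identical in both cases; only the shape differs, which is why the following `Flatten` layer produces the same input for `Dense` either way.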

In [None]:
fixed_dense_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(32, 32, 3)),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.Conv2D(
        filters=8,
        kernel_size=(3, 3),
        strides=(2, 2),
        padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.DepthwiseConv2D(kernel_size=(3, 3), padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=16, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.DepthwiseConv2D(
        kernel_size=(3, 3), strides=(2, 2), padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=32, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.GlobalAveragePooling2D(keepdims=True),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])

# Use the pretrained model for pruning instead of training from scratch.
fixed_dense_model.set_weights(dense_model.get_weights())

# Try to reapply pruning wrapper.
model_for_pruning = prune_low_magnitude(fixed_dense_model, **pruning_params)

The call to `prune_low_magnitude` now completes without errors, meaning that the model is fully supported by the `tfmot.sparsity.keras.PruneForLatencyOnXNNPack` policy and can be accelerated using [XNNPACK sparse inference](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/xnnpack/README.md#sparse-inference).

### Fine-tune the sparse model

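Following the [pruning example](https://tensorflow.google.cn/model_optimization/guide/pruning/pruning_with_keras.md), we fine-tune the sparse model starting from the weights of the dense model. Fine-tuning begins at 25% sparsity (25% of the weights are set to zero) and ends at 75% sparsity.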
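The ramp from 25% to 75% sparsity can be sketched in plain Python. This is a simplified model of a polynomial decay schedule, assuming an exponent of 3 and ignoring how often the pruning mask is actually refreshed; the function name `polynomial_decay_sparsity` is hypothetical. The step count assumes the 45,000-image training split (90% of CIFAR10's 50,000 training images) and the batch size of 128 used above.

```python
import math


def polynomial_decay_sparsity(step, initial_sparsity=0.25, final_sparsity=0.75,
                              begin_step=0, end_step=1760, power=3):
    """Target sparsity at `step` for a polynomial ramp (assumed exponent: 3)."""
    clamped = min(max(step, begin_step), end_step)
    progress = (clamped - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


# 45,000 training images in batches of 128, pruning for 5 epochs.
steps_per_epoch = math.ceil(45000 / 128)
end_step = steps_per_epoch * 5

for epoch in range(7):
    step = epoch * steps_per_epoch
    target = polynomial_decay_sparsity(step, end_step=end_step)
    print(f"epoch {epoch}: target sparsity {target:.3f}")
```

Under these assumptions the sparsity rises quickly at first and flattens as it approaches 75%, then stays there for the remaining fine-tuning epochs.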

In [None]:
logdir = tempfile.mkdtemp()

callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep(),
  tfmot.sparsity.keras.PruningSummaries(log_dir=logdir),
]

model_for_pruning.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

model_for_pruning.fit(
  ds_train,
  epochs=15,
  validation_data=ds_val,
  callbacks=callbacks)

# Evaluate the pruned model.
_, pruned_model_accuracy = model_for_pruning.evaluate(ds_test, verbose=0)

print('Dense model test accuracy:', dense_model_accuracy)
print('Pruned model test accuracy:', pruned_model_accuracy)

The logs show the progression of sparsity on a per-layer basis.

In [None]:
#docs_infra: no_execute
%tensorboard --logdir={logdir}

After fine-tuning with pruning, test accuracy shows a modest improvement over the dense model (43% to 44%). Let's compare on-device latency using the [TFLite benchmark](https://tensorflow.google.cn/lite/performance/measurement).

## Model conversion and benchmarking

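To convert the pruned model to TFLite, we need to replace the `PruneLowMagnitude` wrappers with the original layers via the `strip_pruning` function. In addition, because the weights of the pruned model (`model_for_pruning`) are mostly zeros, we can apply the `tf.lite.Optimize.EXPERIMENTAL_SPARSITY` optimization to store the resulting TFLite model efficiently. The dense model does not need this optimization flag.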
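To illustrate why mostly-zero weights can be stored compactly, here is a toy index/value encoding in plain Python. It is only a conceptual sketch and does not reflect the actual on-disk layout that TFLite uses for sparse tensors; the helper names `to_sparse` and `to_dense` and the sample weights are assumptions.

```python
def to_sparse(weights):
    """Encode a weight list as (length, nonzero indices, nonzero values)."""
    indices = [i for i, w in enumerate(weights) if w != 0.0]
    values = [weights[i] for i in indices]
    return len(weights), indices, values


def to_dense(length, indices, values):
    """Rebuild the full weight list from the sparse encoding."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Toy weights at 75% sparsity (hypothetical values).
pruned_weights = [0.9, 0.0, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0]
length, indices, values = to_sparse(pruned_weights)
zero_fraction = 1.0 - len(values) / length

print("stored values:", len(values), "of", length)
print("sparsity:", zero_fraction)
```

Only the nonzero entries and their positions are kept, so the saving grows with the sparsity level; the index overhead means the benefit is small at low sparsity.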

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(dense_model)
dense_tflite_model = converter.convert()

_, dense_tflite_file = tempfile.mkstemp('.tflite')
with open(dense_tflite_file, 'wb') as f:
  f.write(dense_tflite_model)

model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
pruned_tflite_model = converter.convert()

_, pruned_tflite_file = tempfile.mkstemp('.tflite')
with open(pruned_tflite_file, 'wb') as f:
  f.write(pruned_tflite_model)

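Following the instructions for the [TFLite Model Benchmarking Tool](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark), we build the tool, upload it to an Android device together with the dense and pruned TFLite models, and benchmark both models on the device.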
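The upload step could look roughly like the sketch below, which only builds and prints the `adb push` command lines rather than running them. The local paths are placeholders (in this notebook they would be `dense_tflite_file` and `pruned_tflite_file`), the helper name `adb_push_command` is hypothetical, and the remote file names are chosen to match the benchmark cells; the `benchmark_model` binary would need to be pushed in the same way.

```python
import shlex

DEVICE_DIR = "/data/local/tmp"  # device directory used by the benchmark cells


def adb_push_command(local_path, remote_name):
    """Build the argv for `adb push <local> <remote>` without executing it."""
    return ["adb", "push", local_path, f"{DEVICE_DIR}/{remote_name}"]


# Placeholder local paths (assumption); substitute the real temp file paths.
uploads = [
    ("/tmp/dense.tflite", "dense_model.tflite"),
    ("/tmp/pruned.tflite", "pruned_model.tflite"),
]
commands = [adb_push_command(local, remote) for local, remote in uploads]

for argv in commands:
    print(shlex.join(argv))
```

To actually run the uploads with a device attached, each `argv` could be passed to `subprocess.run(argv, check=True)`.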

In [None]:
! adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/dense_model.tflite \
    --use_xnnpack=true \
    --num_runs=100 \
    --num_threads=1

In [None]:
! adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/pruned_model.tflite \
    --use_xnnpack=true \
    --num_runs=100 \
    --num_threads=1

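Benchmarks on a Pixel 4 gave an average inference time of *17us* for the dense model and *12us* for the pruned model. The on-device benchmarks demonstrate a clear **5us**, or roughly **30%**, improvement in latency even for such a small model. In our experience, larger models based on [MobileNetV3](https://tensorflow.google.cn/api_docs/python/tf/keras/applications/mobilenet_v3) or [EfficientNet-lite](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/lite) show similar performance improvements. The speed-up depends on the relative contribution of 1x1 convolutions to the overall model.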
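The arithmetic behind those figures is worked out below, using only the two reported averages; the variable names are illustrative.

```python
dense_us = 17.0   # reported average inference time, dense model
pruned_us = 12.0  # reported average inference time, pruned model

saved_us = dense_us - pruned_us            # absolute latency reduction
relative_reduction = saved_us / dense_us   # fraction of dense latency saved
speedup = dense_us / pruned_us             # how many times faster

print(f"saved: {saved_us:.0f}us")
print(f"latency reduction: {relative_reduction:.1%}")
print(f"speed-up factor: {speedup:.2f}x")
```

The reduction is about 29.4% of the dense latency, which the text rounds to 30%; expressed as a speed-up factor it is about 1.42x.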


## Conclusion

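In this tutorial, we showed how to create sparse models for faster on-device performance using the new functionality introduced by the TF MOT API and XNNPACK. These sparse models are smaller and faster than their dense counterparts while retaining or even surpassing their quality.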

We encourage you to try this new capability, which can be particularly important for deploying models on device.