From 5ef1124324381424b821b306040453d7782340f0 Mon Sep 17 00:00:00 2001
From: Aron Virginas-Tar
Date: Tue, 17 Mar 2020 16:43:14 +0000
Subject: [PATCH 1/6] Add overview document for clustering

---
 .../g3doc/guide/clustering/index.md | 103 ++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 tensorflow_model_optimization/g3doc/guide/clustering/index.md

diff --git a/tensorflow_model_optimization/g3doc/guide/clustering/index.md b/tensorflow_model_optimization/g3doc/guide/clustering/index.md
new file mode 100644
index 000000000..7de7fdf65
--- /dev/null
+++ b/tensorflow_model_optimization/g3doc/guide/clustering/index.md
@@ -0,0 +1,103 @@
+# Weight Clustering
+
+This document provides an overview on weight clustering to help you determine how it fits with your use case. To dive right into the code, see the [Clustering with Keras](clustering_with_keras.ipynb) tutorial and the [API docs](../../api_docs/python). For additional details on how to use the Keras API, a deep dive into weight clustering, and documentation on more advanced usage patterns, see the [API usage guide](train_clustered_models.md).
+
+## Overview
+
+Clustering, or weight sharing, is a technique for reducing the number of unique weight values in a model. It first groups the weights of each layer into *N* clusters, then shares the cluster's centroid value for all the weights belonging to the cluster.
+
+This technique brings improvements in terms of model compression. It reduces the number of bits used to represent the weights, thus saving memory bandwidth. This can be critical for deploying deep learning models on embedded systems with limited resources.
+
+We've seen up to 5x improvements in model compression with minimal loss of accuracy, as demonstrated by the [results](#results) presented below.
+
+### API Compatibility Matrix
+
+Users can apply clustering with the following APIs:
+
+* Model building: `tf.keras` with only Sequential and Functional models
+* TensorFlow versions: TF 1.x for versions 1.14+ and 2.x.
+  * `tf.compat.v1` with a TF 2.X package and `tf.compat.v2` with a TF 1.X
+    package are not supported.
+* TensorFlow execution mode: both graph and eager
+* Distributed training: `tf.distribute` with only graph execution
+
+## Results
+
+### Image Classification
+
+<table>
+  <tr>
+    <th rowspan=2>Model</th>
+    <th colspan=2>Original</th>
+    <th colspan=3>Clustered</th>
+  </tr>
+  <tr>
+    <th>Top-1 accuracy</th>
+    <th>Size of compressed .tflite</th>
+    <th># of clusters</th>
+    <th>Top-1 accuracy</th>
+    <th>Size of compressed .tflite</th>
+  </tr>
+  <tr>
+    <td>MobileNetV2</td>
+    <td>72.29%</td>
+    <td>13.0 MB</td>
+    <td>32</td>
+    <td>69.33%</td>
+    <td>2.6 MB</td>
+  </tr>
+</table>
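As a point of reference for the 32-cluster row above, the following is a minimal sketch of how such a configuration might be applied with the `tfmot.clustering.keras` API, assuming a TF 2.x installation where that API is available; the MobileNetV2 weights, optimizer, and learning-rate settings are illustrative assumptions rather than the exact recipe behind these results.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

# Clustering is applied to an already-trained model.
base_model = tf.keras.applications.MobileNetV2(weights='imagenet')

clustering_params = {
    'number_of_clusters': 32,
    'cluster_centroids_init': CentroidInitialization.LINEAR,
}

# Wrap the model so that each supported layer keeps at most 32 unique weight values.
clustered_model = cluster_weights(base_model, **clustering_params)

# Fine-tune briefly with a low learning rate to recover accuracy, then strip the
# clustering wrappers before converting or serializing the model.
clustered_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])
# clustered_model.fit(train_dataset, epochs=1)

final_model = tfmot.clustering.keras.strip_clustering(clustered_model)
```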
+
+The models were trained and tested on ImageNet.
+
+### Keyword Spotting
+
+<table>
+  <tr>
+    <th rowspan=2>Model</th>
+    <th colspan=2>Original</th>
+    <th colspan=3>Clustered</th>
+  </tr>
+  <tr>
+    <th>Top-1 accuracy</th>
+    <th>Size of compressed .tflite</th>
+    <th># of clusters</th>
+    <th>Top-1 accuracy</th>
+    <th>Size of compressed .tflite</th>
+  </tr>
+  <tr>
+    <td>DS-CNN-L</td>
+    <td>95.03%</td>
+    <td>1.5 MB</td>
+    <td>32</td>
+    <td>94.71%</td>
+    <td>0.3 MB</td>
+  </tr>
+</table>
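The *Size of compressed .tflite* columns in both tables come from zipping a converted model, as the note below describes. A sketch of that measurement, assuming TF 2.x (where `TFLiteConverter.from_keras_model` plays the role of the `from_keras_model_file()` call mentioned in the note) and with illustrative file names:

```python
import os
import zipfile

import tensorflow as tf
import tensorflow_model_optimization as tfmot

def zipped_tflite_size_mb(keras_model, zip_path):
    """Converts a Keras model to .tflite, zips it with DEFLATE, and returns the size in MB."""
    tflite_model = tf.lite.TFLiteConverter.from_keras_model(keras_model).convert()
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('model.tflite', tflite_model)
    return os.path.getsize(zip_path) / 1e6

# The clustering wrappers are stripped before conversion so that only the clustered
# weight values remain in the exported model.
# stripped_model = tfmot.clustering.keras.strip_clustering(clustered_model)
# print(zipped_tflite_size_mb(original_model, 'original.tflite.zip'))
# print(zipped_tflite_size_mb(stripped_model, 'clustered.tflite.zip'))
```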
+
+The models were trained and tested on SpeechCommands v0.02.
+
+NOTE: *Size of compressed .tflite* refers to the size of the zipped .tflite file obtained from the model through the following process:
+1. Serialize the Keras model into .h5 file
+2. Convert the .h5 file into .tflite using `TFLiteConverter.from_keras_model_file()`
+3. Compress the .tflite file into a zip
+
+## Examples
+
+In addition to the [Clustering with Keras](clustering_with_keras.ipynb) tutorial, see the following examples:
+
+* Cluster the weights of a CNN model trained on the MNIST handwritten digit classification databaset:
+[code](https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/examples/clustering/keras/mnist/mnist_cnn.py)
+
+## Tips
+
+1. The current clustering API works only with pre-trained models. Don't forget to train your model before attempting to cluster it.
+2. The centroid initialization technique you opt for has a significant impact on the accuracy of the clustered model. Experiments have shown that linear initialization outperforms density-based and random initialization in most cases (see the sketch following the references below).
+
+## References
+
+The weight clustering implementation is based on the technique described in chapter 3, entitled *Trained Quantization and Weight Sharing*, of the conference paper referenced below.
+
+1. **Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding**
+ Song Han, Huizi Mao, William J. Dally
+ [https://arxiv.org/abs/1510.00149](https://arxiv.org/abs/1510.00149). ICLR, 2016
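The sketch below illustrates the centroid initialization options mentioned in tip 2 above; it assumes the `tfmot.clustering.keras` API, and the 16-cluster value is only an example.

```python
import tensorflow_model_optimization as tfmot

cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

# Linear initialization spreads the initial centroids evenly between the smallest
# and largest weight values; it is the option the tips above recommend trying first.
linear_params = {
    'number_of_clusters': 16,
    'cluster_centroids_init': CentroidInitialization.LINEAR,
}

# Density-based and random initialization are available for comparison.
density_params = {
    'number_of_clusters': 16,
    'cluster_centroids_init': CentroidInitialization.DENSITY_BASED,
}
random_params = {
    'number_of_clusters': 16,
    'cluster_centroids_init': CentroidInitialization.RANDOM,
}

# clustered_model = cluster_weights(pretrained_model, **linear_params)
```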
From 8dd83a82dc681af84682c75eef2b65ce7c45e0d3 Mon Sep 17 00:00:00 2001 From: Aron Virginas-Tar Date: Tue, 14 Apr 2020 11:01:04 +0100 Subject: [PATCH 2/6] Addressed PR review comments --- .../g3doc/guide/clustering/index.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/tensorflow_model_optimization/g3doc/guide/clustering/index.md b/tensorflow_model_optimization/g3doc/guide/clustering/index.md index 7de7fdf65..bdf9f8d8f 100644 --- a/tensorflow_model_optimization/g3doc/guide/clustering/index.md +++ b/tensorflow_model_optimization/g3doc/guide/clustering/index.md @@ -1,16 +1,16 @@ -# Weight Clustering +# Weight clustering -This document provides an overview on weight clustering to help you determine how it fits with your use case. To dive right into the code, see the [Clustering with Keras](clustering_with_keras.ipynb) tutorial and the [API docs](../../api_docs/python). For additional details on how to use the Keras API, a deep dive into weight clustering, and documentation on more advanced usage patterns, see the [API usage guide](train_clustered_models.md). +This document provides an overview on weight clustering to help you determine how it fits with your use case. To dive right into the code, see the [weight clustering end-to-end example](clustering_example.ipynb) and the [API docs](../../api_docs/python). For additional details on how to use the Keras API, a deep dive into weight clustering, and documentation on more advanced usage patterns, see the [weight clustering comprehensive guide](clustering_comprehensive_guide.ipynb). ## Overview -Clustering, or weight sharing, is a technique for reducing the number of unique weight values in a model. It first groups the weights of each layer into *N* clusters, then shares the cluster's centroid value for all the weights belonging to the cluster. +Clustering, or weight sharing, reduces the number of unique weight values in a model, leading to benefits for deployment. It first groups the weights of each layer into *N* clusters, then shares the cluster's centroid value for all the weights belonging to the cluster. -This technique brings improvements in terms of model compression. It reduces the number of bits used to represent the weights, thus saving memory bandwidth. This can be critical for deploying deep learning models on embedded systems with limited resources. +This technique brings improvements in terms of model compression. By reducing the number of unique weight values, weigth clustering renders the weights suitable for compression via Huffman coding and similar techniques. Future framework support will, therefore, be able to provide memory bandwith improvements. This can be critical for deploying deep learning models on embedded systems with limited resources. We've seen up to 5x improvements in model compression with minimal loss of accuracy, as demonstrated by the [results](#results) presented below. -### API Compatibility Matrix +### API compatibility matrix Users can apply clustering with the following APIs: @@ -19,11 +19,10 @@ Users can apply clustering with the following APIs: * `tf.compat.v1` with a TF 2.X package and `tf.compat.v2` with a TF 1.X package are not supported. * TensorFlow execution mode: both graph and eager -* Distributed training: `tf.distribute` with only graph execution ## Results -### Image Classification +### Image classification @@ -50,7 +49,7 @@ Users can apply clustering with the following APIs: The models were trained and tested on ImageNet. 
-### Keyword Spotting +### Keyword spotting
@@ -96,7 +95,7 @@ In addition to the [Clustering with Keras](clustering_with_keras.ipynb) tutorial ## References -The weight clustering implementation is based on the technique described in chapter 3, entitled *Trained Quantization and Weight Sharing*, of the conference paper referenced below. +The weight clustering implementation is based on the technique described in chapter 3, titled *Trained Quantization and Weight Sharing*, of the conference paper referenced below. 1. **Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding**
Song Han, Huizi Mao, William J. Dally
From 19b5ea28962ae156cb10359595fb954e5ffc1613 Mon Sep 17 00:00:00 2001 From: Aron Virginas-Tar Date: Tue, 21 Apr 2020 18:25:45 +0100 Subject: [PATCH 3/6] Add more details about compression improvements and trade-offs --- tensorflow_model_optimization/g3doc/guide/clustering/index.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tensorflow_model_optimization/g3doc/guide/clustering/index.md b/tensorflow_model_optimization/g3doc/guide/clustering/index.md index bdf9f8d8f..aa8ef46a6 100644 --- a/tensorflow_model_optimization/g3doc/guide/clustering/index.md +++ b/tensorflow_model_optimization/g3doc/guide/clustering/index.md @@ -8,7 +8,9 @@ Clustering, or weight sharing, reduces the number of unique weight values in a m This technique brings improvements in terms of model compression. By reducing the number of unique weight values, weigth clustering renders the weights suitable for compression via Huffman coding and similar techniques. Future framework support will, therefore, be able to provide memory bandwith improvements. This can be critical for deploying deep learning models on embedded systems with limited resources. -We've seen up to 5x improvements in model compression with minimal loss of accuracy, as demonstrated by the [results](#results) presented below. +We have seen up to 5x improvements in model compression with minimal loss of accuracy, as demonstrated by the [results](#results) presented below. The compression gains depend on the model and the accuracy targets in each specific use case. For example, for the MobileNetV2 image classification model, one can choose to reduce all non-depthwise convolutional layers to use just 32 unique weigth values and obtain a float32 tflite model that is approximately 4.8 times more compressible using ZIP Deflate algorithm than the original model. However, that will result in about 3% drop of the top-1 classification accuracy. On the other hand, the same model clustered less agressively, using 256 clusters for two internal layers and 32 clusters for the final convolutional layer, maintains virtually the same accuracy as the original model, yet still yields a respectable 1.8x improvement in compression ratio. + +Clustering works well with TFLiteConverter, providing an easy path to produce deployment-ready models that can be easily compressed using either an off-the-shelf compression algorithm, similar to the ZIP Deflate we use for demonstration in this document, or a custom method optimized for a special target hardware. When converting the clustered model with TFLiteConverter, the actual number of unique weight values per tensor may increase. This happens for the models with batch normalization layers that are folded into the preceding convolutional layers during the conversion, and also due to different scale factors in the per-channel weight quantization scheme. Both techniques may alter the same weight value differently, depending on the channel it appears in and the associated batch-normalization and quantization parameters. While this side effect may result in a slightly lower compression ratio, the overall benefits of using clustering and post-training conversion and quantization are still tangible, as demonstrated by the examples in this document. 
### API compatibility matrix From 69dc8ebb42888f542eb878fd971cf117689e3542 Mon Sep 17 00:00:00 2001 From: Aron Virginas-Tar Date: Wed, 6 May 2020 16:03:12 +0100 Subject: [PATCH 4/6] Remove "Tips" section from clustering overview document --- .../g3doc/guide/clustering/index.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/tensorflow_model_optimization/g3doc/guide/clustering/index.md b/tensorflow_model_optimization/g3doc/guide/clustering/index.md index aa8ef46a6..d8b9a00e4 100644 --- a/tensorflow_model_optimization/g3doc/guide/clustering/index.md +++ b/tensorflow_model_optimization/g3doc/guide/clustering/index.md @@ -90,11 +90,6 @@ In addition to the [Clustering with Keras](clustering_with_keras.ipynb) tutorial * Cluster the weights of a CNN model trained on the MNIST handwritten digit classification databaset: [code](https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/examples/clustering/keras/mnist/mnist_cnn.py) -## Tips - -1. The current clustering API works only with pre-trained models. Don't forget to train your model before attempting to cluster it. -2. The centroid initialization technique you opt for has a significant impact on the accuracy of the clustered model. Experiments have shown that linear initialization outperforms density-based and random initialization in most cases. - ## References The weight clustering implementation is based on the technique described in chapter 3, titled *Trained Quantization and Weight Sharing*, of the conference paper referenced below. From 6f8842611338ab9552ae9512d92e8705ee252160 Mon Sep 17 00:00:00 2001 From: Aron Virginas-Tar Date: Thu, 7 May 2020 14:16:56 +0100 Subject: [PATCH 5/6] Update clustering overview document to address reviewers' comments --- .../g3doc/guide/clustering/index.md | 25 ++++++++----------- 1 file changed, 11 insertions(+), 14 deletions(-) diff --git a/tensorflow_model_optimization/g3doc/guide/clustering/index.md b/tensorflow_model_optimization/g3doc/guide/clustering/index.md index d8b9a00e4..bbc475450 100644 --- a/tensorflow_model_optimization/g3doc/guide/clustering/index.md +++ b/tensorflow_model_optimization/g3doc/guide/clustering/index.md @@ -1,16 +1,19 @@ # Weight clustering -This document provides an overview on weight clustering to help you determine how it fits with your use case. To dive right into the code, see the [weight clustering end-to-end example](clustering_example.ipynb) and the [API docs](../../api_docs/python). For additional details on how to use the Keras API, a deep dive into weight clustering, and documentation on more advanced usage patterns, see the [weight clustering comprehensive guide](clustering_comprehensive_guide.ipynb). +This document provides an overview on weight clustering to help you determine how it fits with your use case. + +- To dive right into an end-to-end example, see the [weight clustering example](clustering_example.ipynb). +- To quickly find the APIs you need for your use case, see the [weight clustering comprehensive guide](clustering_comprehensive_guide.ipynb). ## Overview Clustering, or weight sharing, reduces the number of unique weight values in a model, leading to benefits for deployment. It first groups the weights of each layer into *N* clusters, then shares the cluster's centroid value for all the weights belonging to the cluster. -This technique brings improvements in terms of model compression. 
By reducing the number of unique weight values, weigth clustering renders the weights suitable for compression via Huffman coding and similar techniques. Future framework support will, therefore, be able to provide memory bandwith improvements. This can be critical for deploying deep learning models on embedded systems with limited resources. +This technique brings improvements via model compression. Future framework support can unlock memory footprint improvements that can make a crucial difference for deploying deep learning models on embedded systems with limited resources. -We have seen up to 5x improvements in model compression with minimal loss of accuracy, as demonstrated by the [results](#results) presented below. The compression gains depend on the model and the accuracy targets in each specific use case. For example, for the MobileNetV2 image classification model, one can choose to reduce all non-depthwise convolutional layers to use just 32 unique weigth values and obtain a float32 tflite model that is approximately 4.8 times more compressible using ZIP Deflate algorithm than the original model. However, that will result in about 3% drop of the top-1 classification accuracy. On the other hand, the same model clustered less agressively, using 256 clusters for two internal layers and 32 clusters for the final convolutional layer, maintains virtually the same accuracy as the original model, yet still yields a respectable 1.8x improvement in compression ratio. +We have experimented with clustering across vision and speech tasks. We've seen up to 5x improvements in model compression with minimal loss of accuracy, as demonstrated by the [results](#results) presented below. -Clustering works well with TFLiteConverter, providing an easy path to produce deployment-ready models that can be easily compressed using either an off-the-shelf compression algorithm, similar to the ZIP Deflate we use for demonstration in this document, or a custom method optimized for a special target hardware. When converting the clustered model with TFLiteConverter, the actual number of unique weight values per tensor may increase. This happens for the models with batch normalization layers that are folded into the preceding convolutional layers during the conversion, and also due to different scale factors in the per-channel weight quantization scheme. Both techniques may alter the same weight value differently, depending on the channel it appears in and the associated batch-normalization and quantization parameters. While this side effect may result in a slightly lower compression ratio, the overall benefits of using clustering and post-training conversion and quantization are still tangible, as demonstrated by the examples in this document. +Please note that clustering will provide reduced benefits for convolution and dense layers that precede a batch normalization layer, as well as in combination with per-axis post-training quantization. ### API compatibility matrix @@ -78,22 +81,16 @@ The models were trained and tested on ImageNet. The models were trained and tested on SpeechCommands v0.02. -NOTE: *Size of compressed .tflite* refers to the size of the zipped .tflite file obtained from the model through the following process: +NOTE: *Size of compressed .tflite* refers to the size of the zipped .tflite file obtained from the model from the following process: 1. Serialize the Keras model into .h5 file 2. Convert the .h5 file into .tflite using `TFLiteConverter.from_keras_model_file()` 3. 
Compress the .tflite file into a zip
 
 ## Examples
 
-In addition to the [Clustering with Keras](clustering_with_keras.ipynb) tutorial, see the following examples:
+In addition to the [Weight clustering in Keras example](clustering_example.ipynb), see the following examples:
 
-* Cluster the weights of a CNN model trained on the MNIST handwritten digit classification databaset:
+* Cluster the weights of a CNN model trained on the MNIST handwritten digit classification dataset:
 [code](https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/examples/clustering/keras/mnist/mnist_cnn.py)
 
-## References
-
-The weight clustering implementation is based on the technique described in chapter 3, titled *Trained Quantization and Weight Sharing*, of the conference paper referenced below.
-
-1. **Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding**
- Song Han, Huizi Mao, William J. Dally
- [https://arxiv.org/abs/1510.00149](https://arxiv.org/abs/1510.00149). ICLR, 2016
+The weight clustering implementation is based on the *Deep Compression: Compressing Deep Neural Networks With Pruning, Trained Quantization and Huffman Coding* [paper](https://arxiv.org/abs/1510.00149). See chapter 3, titled *Trained Quantization and Weight Sharing*.
\ No newline at end of file

From 874120bf83d8d8f230d2e1e8cc7c71a7e04ca00e Mon Sep 17 00:00:00 2001
From: Aron Virginas-Tar
Date: Tue, 19 May 2020 12:31:02 +0100
Subject: [PATCH 6/6] Add latest experimental results to clustering overview document

---
 .../g3doc/guide/clustering/index.md | 73 +++++++++++++------
 1 file changed, 51 insertions(+), 22 deletions(-)

diff --git a/tensorflow_model_optimization/g3doc/guide/clustering/index.md b/tensorflow_model_optimization/g3doc/guide/clustering/index.md
index bbc475450..d1398ad70 100644
--- a/tensorflow_model_optimization/g3doc/guide/clustering/index.md
+++ b/tensorflow_model_optimization/g3doc/guide/clustering/index.md
@@ -31,24 +31,51 @@ Users can apply clustering with the following APIs:
 <table>
  <tr>
   <th rowspan=2>Model</th>
   <th colspan=2>Original</th>
-  <th colspan=3>Clustered</th>
+  <th colspan=4>Clustered</th>
  </tr>
  <tr>
-  <th>Top-1 accuracy</th>
-  <th>Size of compressed .tflite</th>
+  <th>Top-1 accuracy (%)</th>
+  <th>Size of compressed .tflite (MB)</th>
+  <th>Configuration</th>
   <th># of clusters</th>
-  <th>Top-1 accuracy</th>
-  <th>Size of compressed .tflite</th>
+  <th>Top-1 accuracy (%)</th>
+  <th>Size of compressed .tflite (MB)</th>
  </tr>
  <tr>
-  <td>MobileNetV2</td>
-  <td>72.29%</td>
-  <td>13.0 MB</td>
-  <td>32</td>
-  <td>69.33%</td>
-  <td>2.6 MB</td>
+  <td rowspan=3>MobileNetV1</td>
+  <td rowspan=3>71.02</td>
+  <td rowspan=3>14.96</td>
+ </tr>
+ <tr>
+  <td>Selective (last 3 Conv2D layers)</td>
+  <td>256, 256, 32</td>
+  <td>70.62</td>
+  <td>8.42</td>
+ </tr>
+ <tr>
+  <td>Full (all Conv2D layers)</td>
+  <td>64</td>
+  <td>66.07</td>
+  <td>2.98</td>
+ </tr>
+ <tr>
+  <td rowspan=3>MobileNetV2</td>
+  <td rowspan=3>72.29</td>
+  <td rowspan=3>12.90</td>
+ </tr>
+ <tr>
+  <td>Selective (last 3 Conv2D layers)</td>
+  <td>256, 256, 32</td>
+  <td>72.31</td>
+  <td>7.00</td>
+ </tr>
+ <tr>
+  <td>Full (all Conv2D layers)</td>
+  <td>32</td>
+  <td>69.33</td>
+  <td>2.60</td>
  </tr>
@@ -60,22 +87,24 @@ The models were trained and tested on ImageNet. Model Original - Clustered + Clustered - Top-1 accuracy - Size of compressed .tflite + Top-1 accuracy (%) + Size of compressed .tflite (MB) + Configuration # of clusters - Top-1 accuracy - Size of compressed .tflite + Top-1 accuracy (%) + Size of compressed .tflite (MB) DS-CNN-L - 95.03% - 1.5 MB + 95.03 + 1.5 + Full 32 - 94.71% - 0.3 MB + 94.71 + 0.3