# Pruning (Custom Trainloop) and Quantization on IMDB Sentiment Analysis with Streaming LSTM Model

This notebook demonstrates the application of pruning and quantization techniques on a streaming LSTM model for the task of IMDB sentiment analysis. The key components and steps of the process are outlined below.

## Overview

- **Task:** IMDB Sentiment Analysis
- **Model:** Streaming LSTM Model
- **Layers to Prune:** RNN and Dense layers
- **Pencil Size:** 4

## Tutorial Flow

1. **Pruning Using a Custom Training Loop:** This notebook showcases the process of pruning using a custom training loop. The goal is to sparsify the model while maintaining its performance.

2. **Training and Sparsification in Non-Streaming Mode:** We initially train and sparsify the LSTM model in non-streaming mode. This allows us to obtain a set of sparse weights.

3. **Conversion to Streaming Mode:** For TFLite conversion, we create a "streaming version" of the model with `batch_size=1` and `time_steps=1`. The sparse weights obtained from the non-streaming model are loaded onto this streaming version of the model.

4. **TFLite Conversion:** The streaming version of the model, equipped with the sparse weights, is then converted to TFLite format for deployment on edge devices.

5. **Compilation with femtocrux:** The TFLite model is compiled using [femtocrux](https://femtocrux.femtosense.ai/en/latest/).

6. **Program File Generation:** Program Files are generated from the compiled model using [femtodriverpub](https://github.com/femtosense/femtodriverpub) and loaded onto the SPU.


## Installation


In [None]:
# ! pip install femtoflow --quiet

## Imports

In [None]:
import os
from tensorflow import keras
import math
import numpy as np
import tempfile
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from typing import List

from femtoflow.quantization.quantize_tflite import TFLiteModelWrapper
from femtoflow.sparsity.prune import PruneHelper
from femtoflow.utils.plot import plot_prune_mask
from femtoflow.utils.metrics import calculate_sparsity, get_gzipped_model_size

import femtocrux


In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # "-1" hides all GPUs; set a GPU index (e.g. "0") to use that device
print(os.environ["CUDA_VISIBLE_DEVICES"])
# os.environ['GH_PACKAGE_KEY']

In [None]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    print("Found the following GPUs:")
    for gpu in gpus:
        print(gpu)
else:
    print("No GPUs found")

In [None]:
import warnings 
warnings.filterwarnings('ignore')

## Define Parameters

### IMDB Data Parameters
- `imdb_num_words`: Number of words to keep from the IMDB dataset based on frequency.
- `imdb_skip_top_words`: Number of most frequently occurring words to skip.
- `imdb_num_classes`: Number of classes for sentiment classification (positive/negative).
- `imdb_max_seq_len`: Maximum sequence length for each review. Reviews with more words than this value will be discarded.

### Model Parameters
- `num_layers`: Number of layers in the LSTM model.
- `hidden_dimension`: Dimension of the hidden state in the LSTM layers.
- `embedding_dimension`: Dimension of the word embeddings.
- `regularization`: L2 regularization strength for model weights.
- `prob_output_name`: Name of the output layer for predicted probabilities.
- `mask_input_name`: Name of the input layer for mask values.
- `word_input_name`: Name of the input layer for word sequences.
- `hidden_input_names`: List of input layer names for hidden states in each LSTM layer.
- `hidden_output_names`: List of output layer names for hidden states in each LSTM layer.
- `cell_input_names`: List of input layer names for cell states in each LSTM layer.
- `cell_output_names`: List of output layer names for cell states in each LSTM layer.

### Training Parameters
- `batch_size`: Batch size for training.
- `epochs`: Number of epochs for training.

In [None]:
# IMDB data parameters
imdb_num_words = 2 ** 10 # Only these many words are kept
imdb_skip_top_words = 2 # Skip this many most frequently occurring words
imdb_num_classes = 2
imdb_max_seq_len = 64 # Discard all data with more than this # of words
    
# Model parameters
num_layers = 2
hidden_dimension = 8
embedding_dimension = 8
regularization = 1e-5
prob_output_name = 'prob'
mask_input_name = 'mask'
word_input_name = 'word'
hidden_input_names = ['hidden_input_%d' % idx for idx in range(num_layers)]
hidden_output_names = ['hidden_output_%d' % idx for idx in range(num_layers)]
cell_input_names = ['cell_input_%d' % idx for idx in range(num_layers)]
cell_output_names = ['cell_output_%d' % idx for idx in range(num_layers)]

# Training parameters
batch_size=32
epochs=5


## Define the Model

The `sentiment_model` function defines the architecture of the LSTM model for sentiment analysis. The model consists of the following key components:
- Embedding layer: Transforms one-hot encoded word inputs into dense embeddings.
- LSTM layers: A sequence of LSTM layers for capturing temporal dependencies.
- Final classification layer: A fully connected layer with a sigmoid activation function for binary classification.

The function allows for two modes of operation:
- Training mode: Variable batch size and time steps, along with a mask, are used for training the model.
- Inference mode: A fixed batch size of 1 and time steps of 1 are used for inference, with no mask.

The function returns a Keras model object with the specified architecture.

In [None]:

def sentiment_model(batch_size: int = None, timesteps: int = None, training: bool = True):
    """
        Creates a model: embedding -> LSTMs -> FC -> sigmoid

        For training, we use variable batch size and time steps, along with a mask.

        For inference, we use a batch size = 1 and time steps = 1, with no mask.
    """
    # regularizer = tf.keras.regularizers.L2(0.01)
    
    # Input layers: one-hot word encoding, hidden features (for future timesteps), initial state
    initial_hiddens = [keras.layers.Input(
        batch_size=batch_size,
        shape=(hidden_dimension,),
        name=hidden_input_names[idx]
        ) for idx in range(num_layers)]
    initial_cells = [keras.layers.Input(
        batch_size=batch_size,
        shape=(hidden_dimension,), 
        name=cell_input_names[idx]
        ) for idx in range(num_layers)]
    words_one_hot = keras.layers.Input(
        batch_size=batch_size,
        shape=(timesteps, imdb_num_words), 
        name=word_input_name
    )
    mask = keras.layers.Input(
        batch_size=batch_size,
        shape=(timesteps,),
        name=mask_input_name
    ) if training else None


    # Learned embedding layer to map words to features.
    # We use fully connected rather than keras.layers.Embedding because our model will have int8 inputs
    embedding = keras.layers.Dense(
        units=embedding_dimension,
        use_bias=False,
        activity_regularizer=None,
        kernel_regularizer=None
    )(words_one_hot)

    # Reshape the inputs to (timesteps, inputs)
    embedding = tf.keras.layers.Reshape(
        target_shape=(
            timesteps if timesteps is not None else -1, 
            embedding_dimension
        )
    )(embedding)

    # LSTM layers with initial state
    rnn_output = embedding
    final_hiddens = []
    final_cells = []
    for layer_idx, (initial_hidden, initial_cell) in enumerate(zip(initial_hiddens, initial_cells)):
        rnn_output, final_hidden, final_cell = tf.keras.layers.RNN(
                tf.keras.layers.LSTMCell(
                    units=hidden_dimension,
                    kernel_regularizer=None
                    ),
                return_sequences = True,
                return_state = True,
                unroll = not training
                )(rnn_output, mask=mask, initial_state=[initial_hidden, initial_cell])

        # Pass the hidden and cell states into identity layers, to rename them
        final_hiddens.append(keras.layers.Layer(name=hidden_output_names[layer_idx])(final_hidden))
        final_cells.append(keras.layers.Layer(name=cell_output_names[layer_idx])(final_cell))

    # Final classification layer
    final_rnn_output = rnn_output[:, -1]
    prob = keras.layers.Dense(
            units=imdb_num_classes, 
            activation='sigmoid',
            name=prob_output_name
            )(final_rnn_output)

    # Choose the model inputs/outputs for training and inference mode
    inputs=[words_one_hot] + initial_hiddens + initial_cells
    outputs={
        prob_output_name: prob
    }
    if training:
        inputs += [mask,]
    else:
        for idx, (hidden_output_name, cell_output_name) in enumerate(zip(hidden_output_names, cell_output_names)):
            outputs[hidden_output_name] = final_hiddens[idx]
            outputs[cell_output_name] = final_cells[idx]

    return keras.models.Model(
            inputs=inputs,
            outputs=outputs
    )


## Define Utility Functions

This section defines utility functions used for data processing, sequence prediction, and model conversion:

- `prepare_inputs`: Prepares the input sequences for the model in a dictionary format understood by the TensorFlow model. This function performs padding, one-hot encoding, and initializes hidden states.

- `__process_sequences__`: Runs the model on sequence data and processes the inputs and initial states as separate time slices. This function can be used as a data generator for TensorFlow to TFLite conversion or as an inference engine.

- `__batch_process_sequences__`: A wrapper function for `__process_sequences__` that performs batch processing of sequences.
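
The padding, masking, and one-hot steps described for `prepare_inputs` can be sketched on toy data as follows. This is a minimal illustration in plain NumPy, not the notebook's implementation; the helper name `pad_and_one_hot` and the toy sequences are hypothetical.

```python
import numpy as np

def pad_and_one_hot(sequences, vocab_size):
    """Pad integer sequences to a common length and one-hot encode them."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), max_len), dtype=np.int64)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
        mask[row, :len(seq)] = True  # True marks real words, False marks padding
    # Indexing an identity matrix with word ids yields one-hot rows
    one_hot = np.eye(vocab_size, dtype=np.float32)[padded]
    return one_hot, mask

# Hypothetical toy batch: two sequences of different lengths
toy_sequences = [[3, 1, 2], [4]]
toy_one_hot, toy_mask = pad_and_one_hot(toy_sequences, vocab_size=5)
print(toy_one_hot.shape)
print(toy_mask)
```

Padded positions are encoded as word 0 here, so the mask is what tells the model which time steps to ignore.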


In [None]:
def prepare_inputs(x: List[int], y: List[int] = None):
    """
    Process inputs from dataloader, in a dictionary format
    understood by the Tensorflow Model.
    """

    have_y = y is not None

    # Preprocessing: get a batch and pad to a common length
    max_seq_len = max(len(seq) for seq in x)
    x_words = np.zeros((len(x), max_seq_len))
    x_words.fill(np.nan)
    for x_idx, seq in enumerate(x):
        x_words[x_idx, :len(seq)] = seq
    mask = ~np.isnan(x_words)
    x_words[~mask] = 0

    # Convert words and labels to one-hot
    x_one_hot = keras.utils.to_categorical(x_words, imdb_num_words)
    y_one_hot = keras.utils.to_categorical(y, imdb_num_classes) if have_y else None

    # Initialize the hidden state to zeros
    initial_hiddens = {
            hidden_input_names[idx]: np.zeros((len(x), hidden_dimension), dtype=np.float32) for idx in range(num_layers)
            }
    initial_cells = {
            cell_input_names[idx]: np.zeros_like(initial_hiddens[hidden_input_names[idx]]) for idx in range(num_layers)
            }

    # Collect all the inputs
    inputs = {
        mask_input_name: mask,
        word_input_name: x_one_hot
    }
    inputs.update(initial_hiddens)
    inputs.update(initial_cells)

    return (inputs, y_one_hot) if have_y else inputs


In [None]:
def __process_sequences__(predict_fun, x, yield_inputs: bool = False) -> List[np.array]:
    """
        Runs the model on sequence data, then yields the inputs and initial 
        state as separate time slices.

        This can be used as a data generator for TF -> TFLite conversion, if yield_inputs is True,
        or as an inference engine, if yield_inputs is False.

        Arguments:
            predict_fun: Runs prediction on the inputs. Same signature as predict() method of 
                keras.Model.
            x: Input data
            yield_inputs: If true, acts as a generator yielding model inputs. Used for TFLite quantization.
    """
    # Process each sequence individually
    sequence_inputs = prepare_inputs(x)
    batch_size, num_timesteps, num_words = sequence_inputs[word_input_name].shape
    print("Input stats:")
    print("\tNum sequences: %d" % batch_size)
    print("\tMax sequence length: %d" % num_timesteps)
    print("\tNum words: %d" % num_words)
    print("\tYielding inputs: %d" % yield_inputs)
    for batch_idx in range(batch_size):
        # Process each time step individually
        # For the first iteration, use the sequence input hidden state
        print("Generating sequence %d / %d..." % (batch_idx, batch_size))
        initial_hidden = [sequence_inputs[hidden_input_name][None, batch_idx] for hidden_input_name in hidden_input_names]
        initial_cell = [sequence_inputs[cell_input_name][None, batch_idx] for cell_input_name in cell_input_names]
        for timestep in range(num_timesteps):

            # Quit if the sequence is over
            mask_seq = sequence_inputs[mask_input_name][batch_idx]
            if not mask_seq[timestep]:
                break

            # Extract the inputs for this time slice
            inputs = {
                word_input_name: sequence_inputs[word_input_name][None, None, batch_idx, timestep],
            }
            for hidden_name, hidden_val in zip(hidden_input_names, initial_hidden):
                inputs[hidden_name] = hidden_val
            for cell_name, cell_val in zip(cell_input_names, initial_cell):
                inputs[cell_name] = cell_val

            # In quantization mode, return inputs to the TF -> TFlite converter
            if yield_inputs: 
                yield inputs

            # Process the time step
            outputs = predict_fun(inputs)

            # Record the next hidden state
            initial_hidden = [outputs[hidden_output_name] for hidden_output_name in hidden_output_names]
            initial_cell = [outputs[cell_output_name] for cell_output_name in cell_output_names]

        # Collect the prediction at the end of each sequence
        if not yield_inputs:
            yield outputs[prob_output_name]


In [None]:
def __batch_process_sequences__(predict_fun, x: List) -> List:
    """
        Wrapper for __process_sequences__ that collects all predictions into a list
        rather than returning a generator.
    """
    yield_inputs = False
    return list(__process_sequences__(predict_fun, x, yield_inputs=yield_inputs))

## Download IMDB Sentiment Dataset

The IMDB Sentiment Dataset is downloaded and preprocessed to be used for model training, evaluation, and quantization calibration. The dataset consists of movie reviews labeled as either positive or negative sentiment.

We use the `imdb.load_data` function from the Keras datasets module to load the data. The dataset is split into training and test sets, with the following considerations:
- Only a subset of the most frequent words (`imdb_num_words`) is considered, with the top `imdb_skip_top_words` being skipped.
- Reviews with more than `imdb_max_seq_len` words are discarded.
- The test set is further abbreviated for faster processing during evaluation.
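
The vocabulary and length filtering described in the list above can be illustrated with a simplified sketch. This is not the Keras implementation: the function name `filter_reviews`, the out-of-vocabulary index of 2, and the exact length boundary are assumptions made for illustration.

```python
def filter_reviews(reviews, num_words, skip_top, max_seq_len, oov_index=2):
    """Drop long reviews and replace out-of-range word ids with an OOV id."""
    kept = []
    for review in reviews:
        # Assumption: reviews longer than max_seq_len are dropped entirely
        if len(review) > max_seq_len:
            continue
        # Keep ids in [skip_top, num_words); map everything else to oov_index
        kept.append([w if skip_top <= w < num_words else oov_index for w in review])
    return kept

# Hypothetical toy reviews: the second one is too long and is dropped
toy_reviews = [[1, 5, 9, 1500], [3] * 70, [4, 1]]
toy_filtered = filter_reviews(toy_reviews, num_words=1024, skip_top=2, max_seq_len=64)
print(toy_filtered)
```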


In [None]:
import keras.datasets.imdb as imdb

# Get IMDB train and test splits
# From the training split, we will also take an 'eval' set 
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=imdb_num_words,
    skip_top=imdb_skip_top_words,
    maxlen=imdb_max_seq_len
)

# Abbreviate the test set, for speed 
num_test_samples = int(1e3)
x_test = x_test[:num_test_samples]
y_test = y_test[:num_test_samples]


## Train Non-Streaming Version of the Model

### Training Loop

To train the non-streaming version of the LSTM model, we first create an instance of the model using the `sentiment_model` function with `training=True`. The model is then compiled with the specified optimizer, loss function, and evaluation metrics.

The training loop runs for a specified number of epochs. In each epoch, the training data is divided into batches, and the model is trained on each batch using the `train_on_batch` function. The training process involves preprocessing the batch data and updating the model weights.
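
The loop in this notebook draws each batch with `rng.choice`, which samples with replacement by default, so within one epoch some reviews may appear more than once and others not at all. As a point of comparison, the following sketch shows a shuffled batching scheme that visits every sample exactly once per epoch. The helper name `epoch_batches` and the toy sizes are hypothetical.

```python
import numpy as np

def epoch_batches(num_samples, batch_size, rng):
    """Yield index arrays that together cover every sample exactly once."""
    order = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
toy_batches = list(epoch_batches(num_samples=10, batch_size=4, rng=rng))
print([len(batch) for batch in toy_batches])
```

Either scheme works for stochastic training. The permutation-based variant simply guarantees full coverage of the training set in each epoch.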


In [None]:
# Create a network: embedding -> LSTM -> FC
train_model = sentiment_model(training=True)

# Train the model
optimizer='adam'
loss_fn='categorical_crossentropy'
metrics=['accuracy']
train_model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)
for epoch in range(epochs):
    print("Epoch %d / %d ..." % (epoch, epochs))

    rng = np.random.default_rng()
    num_batches = math.ceil(len(x_train) / batch_size)
    for batch_idx in range(num_batches):
        print("Batch %d / %d ..." % (batch_idx, num_batches))

        # Preprocess batch data
        batch_inds = rng.choice(len(x_train), size=batch_size) 
        x_batch = x_train[batch_inds]
        y_batch = y_train[batch_inds] 
        inputs, y_batch_one_hot = prepare_inputs(x_batch, y_batch)

        # Train on this batch
        train_model.train_on_batch(inputs, y_batch_one_hot)

### Evaluation of Baseline Model

In [None]:
# Evaluate on each split
train_inputs, y_train_one_hot = prepare_inputs(x_train, y_train)
test_inputs, y_test_one_hot = prepare_inputs(x_test, y_test)
tf_train_loss, tf_train_accuracy  = train_model.evaluate(train_inputs, y_train_one_hot, batch_size=batch_size)
tf_test_loss, tf_test_accuracy  = train_model.evaluate(test_inputs, y_test_one_hot, batch_size=batch_size)
print("Tensorflow loss:")
print("\tTrain: %f" % tf_train_loss)
print("\tTest: %f" % tf_test_loss)
print("Tensorflow accuracy:")
print("\tTrain: %f" % tf_train_accuracy)
print("\tTest: %f" % tf_test_accuracy)


## Sparsify Non-Streaming Model Using Custom Training Loop

### Instantiate the `PruneHelper` Class

We start by creating TensorFlow datasets for training and testing, and then instantiate the `PruneHelper` class as `prune_helper`. `PruneHelper` applies pruning to specific layers of the model during training by wrapping the specified Keras layers with `tfmot.sparsity.keras.prune_low_magnitude`, which enables the layers to be sparsified during training.

The `PruneHelper` class is configured with parameters such as `pencil_size`, `pencil_pooling_type`, and `prune_scheduler`.
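
To give an intuition for what `pencil_size` and `pencil_pooling_type` control, here is a minimal NumPy sketch of pencil-structured magnitude pruning. It assumes that a pencil is a group of `pencil_size` consecutive weights along the first axis of a 2-D kernel, and that `'AVG'` pooling scores each pencil by its mean absolute value. The actual grouping axis and scoring used by `PruneHelper` may differ, and the function name `pencil_prune` is hypothetical.

```python
import numpy as np

def pencil_prune(weights, pencil_size, target_sparsity):
    """Zero out the lowest-scoring pencils of a 2-D weight matrix."""
    rows, cols = weights.shape
    assert rows % pencil_size == 0, "rows must be a multiple of pencil_size"
    # Group consecutive rows into pencils: (num_pencil_rows, pencil_size, cols)
    pencils = weights.reshape(rows // pencil_size, pencil_size, cols)
    # 'AVG' pooling (assumed): score each pencil by its mean absolute value
    scores = np.abs(pencils).mean(axis=1)
    num_to_prune = int(round(target_sparsity * scores.size))
    # Mark the lowest-scoring pencils for removal
    keep = np.ones(scores.size, dtype=bool)
    keep[np.argsort(scores, axis=None)[:num_to_prune]] = False
    keep = keep.reshape(scores.shape)
    # Broadcast the pencil-level mask over the weights inside each pencil
    return (pencils * keep[:, None, :]).reshape(rows, cols)

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(8, 4))
toy_pruned = pencil_prune(toy_weights, pencil_size=4, target_sparsity=0.5)
print(np.mean(toy_pruned == 0))
```

The point of the structure is that zeros arrive in whole groups of four, so each pencil is either kept entirely or removed entirely.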


In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_inputs, y_train_one_hot)).batch(batch_size, drop_remainder=True)
test_dataset  = tf.data.Dataset.from_tensor_slices((test_inputs, y_test_one_hot)).batch(batch_size, drop_remainder=True)

In [None]:
pencil_size = 4
pencil_pooling_type = 'AVG'
prune_scheduler = 'poly_decay'  # 'constant' # 'poly_decay'
prune_helper = PruneHelper(pencil_size=pencil_size,
                             pencil_pooling_type=pencil_pooling_type,
                             prune_scheduler=prune_scheduler)

### Create `model_to_prune` with Pruning Wrappers

The `model_to_prune` is created by applying pruning wrappers to the layers we want to sparsify in the original `train_model` using the `PruneHelper` class. Layers to be pruned include dense and RNN layers. Pruning parameters such as `initial_sparsity`, `final_sparsity`, `begin_step`, `end_step`, `prune_frequency`, and `power` are defined.

`model_to_prune` is returned with the specified layers wrapped with `tfmot.sparsity.keras.prune_low_magnitude`, allowing sparsity to be induced during training.
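
The roles of `initial_sparsity`, `final_sparsity`, `begin_step`, `end_step`, and `power` can be seen in a small sketch of a polynomial-decay schedule. The formula below follows the polynomial-decay schedule used by TensorFlow Model Optimization; whether `PruneHelper` maps `'poly_decay'` to exactly this schedule is an assumption, and the `end_step` of 1000 is a hypothetical value.

```python
def poly_decay_sparsity(step, initial_sparsity, final_sparsity,
                        begin_step, end_step, power):
    """Target sparsity at a given training step under polynomial decay."""
    # Fraction of the pruning window that has elapsed, clipped to [0, 1]
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(max(progress, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power

# Hypothetical schedule: ramp from 20% to 60% sparsity over 1000 steps
for step in (0, 250, 500, 1000):
    target = poly_decay_sparsity(step, 0.2, 0.6, begin_step=0, end_step=1000, power=3)
    print(step, round(target, 4))
```

With `power=3`, most of the sparsity is introduced early in the window and the schedule flattens out as it approaches `final_sparsity`. In practice the mask is only recomputed every `prune_frequency` steps.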


In [None]:
"""
Define additional parameters for pruning
"""
layers_to_prune = [tf.keras.layers.Dense, tf.keras.layers.RNN] # Layers we want to prune 
initial_sparsity = 0.2
final_sparsity = 0.6
begin_step = 0
end_step =  len(train_dataset)*epochs
prune_frequency = 100
power = 3

model_to_prune = prune_helper(model=train_model,
                                layers_to_prune=layers_to_prune,
                                initial_sparsity=initial_sparsity,
                                final_sparsity=final_sparsity,
                                begin_step=begin_step,
                                end_step=end_step,
                                prune_frequency=prune_frequency,
                                power=power)

In [None]:
model_to_prune.layers

### Perform Training with Sparsity on `model_to_prune` Using a Custom Training Loop

We define a custom training loop `custom_sparsity_train_loop` to train the model with sparsity. The loop includes calls to `tfmot.sparsity.keras.UpdatePruningStep()` at specific points to induce sparsity during training. The loop runs for a specified number of epochs, and pruning is applied to the model during each epoch.


### Define a Custom Sparsity Training Loop

Here we define a custom training loop and perform pruning within it.
Reference: https://www.tensorflow.org/model_optimization/guide/pruning/comprehensive_guide.md

In [None]:

def custom_sparsity_train_loop(model_to_prune, 
                                optimizer, 
                                metrics, 
                                loss_fn, 
                                train_dataset, 
                                val_dataset=None,
                                num_epochs=2):
    """
    Custom Training Loop, with instances of tfmot.sparsity.keras.UpdatePruningStep() 
    called in specific points of the training loop (marked 1-4) to induce sparsity.
    Reference: https://www.tensorflow.org/model_optimization/guide/pruning/comprehensive_guide.md
    """
    model_to_prune.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)

    """
    1) Define prune_step and attach Model to prune_step callback
    """
    prune_step = tfmot.sparsity.keras.UpdatePruningStep()
    prune_step.set_model(model_to_prune)

    """
    2) prune_step.on_train_begin() call
    """
    prune_step.on_train_begin() # run pruning callback
    for epoch in range(0, num_epochs):
        print(f"Processing epoch {epoch}")
        for batch_id, (x_batch, y_batch) in enumerate(train_dataset):
            
            """
            3) prune_step.on_train_batch_begin() call
            """
            prune_step.on_train_batch_begin(batch=-1) # run pruning callback

            model_to_prune.train_on_batch(x_batch, y_batch)

        """
        4) prune_step.on_epoch_end() call
        """
        prune_step.on_epoch_end(batch=-1) # run pruning callback

        # Report validation accuracy at the end of every epoch
        if val_dataset:
            _, val_acc = model_to_prune.evaluate(val_dataset, verbose=0)
            print(f"Validation Accuracy Epoch {epoch}: {val_acc}")

    return model_to_prune

In [None]:
model_pruned = custom_sparsity_train_loop(model_to_prune=model_to_prune,
                                          optimizer=optimizer,
                                          metrics=metrics, 
                                            loss_fn=loss_fn, 
                                            train_dataset=train_dataset, 
                                            val_dataset=test_dataset,
                                            num_epochs=epochs)

### Apply `tfmot.sparsity.keras.strip_pruning()` to Remove Sparse Layer Wrappers

The pruned model (`model_pruned`) still has `tfmot.sparsity.keras.prune_low_magnitude()` wrappers around the layers that were sparsified during training. To finalize the pruning process, we use the `strip_pruning()` function, which removes the pruning wrappers and returns the model with its sparse weights. The resulting `model_pruned_stripped` has the final pruning mask applied to its weights, which represents the actual sparsity achieved by the pruning process.


In [None]:
model_pruned_stripped = tfmot.sparsity.keras.strip_pruning(model_pruned)

### Evaluate Pruned Non-Streaming Model Accuracy

After the pruning process, it is important to evaluate the performance of the pruned model (`model_pruned`) on both the training and test datasets. We measure the loss and accuracy of the pruned model and report the results. This helps us understand how the pruning process has affected the model's accuracy and generalization.


In [None]:
tf_prune_train_loss, tf_prune_train_accuracy  = model_pruned.evaluate(train_dataset)
tf_prune_test_loss, tf_prune_test_accuracy  = model_pruned.evaluate(test_dataset)
print("Tensorflow-Pruned loss:")
print("\tTrain: %f" % tf_prune_train_loss)
print("\tTest: %f" % tf_prune_test_loss)
print("Tensorflow-Pruned accuracy:")
print("\tTrain: %f" % tf_prune_train_accuracy)
print("\tTest: %f" % tf_prune_test_accuracy)

### Visualize Pruned Weights
Visualizing the pruned weights of the model can provide insights into the sparsity pattern and the distribution of non-zero weights in the layers. We use the `plot_prune_mask` function to create visualizations of the pruned weights for each layer in the model. The visualizations are saved as images for further analysis and inspection.

In [None]:
trainable_weights_dict = {weight.name: weight.numpy() for weight in model_pruned_stripped.trainable_weights}
trainable_weights_dict.keys()

In [None]:

base_path = 'imdb_lstm_dense'
os.makedirs(base_path, exist_ok=True)
max_len_axis = 24
for layer_name, layer in trainable_weights_dict.items():
    title = f"{layer_name}-shape-opxip-{layer.T.shape}"
    save_path = f"{base_path}/{layer_name.replace('/', '-')}-prune-mask.png"
    # print(title)
    plot_prune_mask(data=layer, axis_stride=pencil_size, max_xlen=max_len_axis, max_ylen=max_len_axis, title=title, save_path=save_path)

### Calculate Sparsity Metrics

Sparsity refers to the proportion of zero-valued elements in the model's weights. After the pruning process, it is important to calculate the sparsity for each layer in the pruned model to understand the level of sparsity achieved.

We use the `calculate_sparsity` function to calculate the sparsity for each layer's weights. The results are stored in the `sparsity_dict` dictionary, where the keys represent the layer names and the values represent the calculated sparsity.
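
As a reference point for interpreting these numbers, the definition of sparsity given above (the fraction of zero-valued weights) can be written in a few lines of NumPy. This is a sketch of the definition only; the `calculate_sparsity` helper from femtoflow may compute or report it differently, and `fraction_of_zeros` is a hypothetical name.

```python
import numpy as np

def fraction_of_zeros(weights):
    """Return the fraction of entries in `weights` that are exactly zero."""
    weights = np.asarray(weights)
    return float(np.count_nonzero(weights == 0)) / weights.size

# Toy kernel with 5 zeros out of 8 entries
toy_kernel = np.array([[0.0, 1.5, 0.0, -2.0],
                       [0.0, 0.0, 0.3, 0.0]])
print(fraction_of_zeros(toy_kernel))  # 0.625
```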


In [None]:
sparsity_dict = {}
for layer_name, layer_weight in trainable_weights_dict.items():
    sparsity_dict[layer_name] = calculate_sparsity(layer_weight)
print('sparsity_dict', sparsity_dict)

## Define the Streaming Inference Model and Load Pruned Weights

For real-time or streaming inference, we need to create a version of the model that is capable of processing data in single-batch and single-time-step increments. This model is often referred to as the streaming model.
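
The reason the streaming model can reuse weights trained in non-streaming mode is that an LSTM gives the same result whether it sees a whole sequence at once or one time step at a time, as long as the hidden and cell states are passed back in between calls. The following NumPy sketch illustrates this with a hand-written LSTM step. It is not the Keras layer; the gate ordering, random weights, and sizes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h, c, kernel, recurrent_kernel, bias):
    """One LSTM time step; gates assumed ordered (input, forget, cell, output)."""
    z = x_t @ kernel + h @ recurrent_kernel + bias
    i, f, g, o = np.split(z, 4, axis=-1)
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = sigmoid(o) * np.tanh(c_next)
    return h_next, c_next

def run_sequence(x_seq, h, c, params):
    """Run the LSTM over every time step in x_seq, starting from state (h, c)."""
    for x_t in x_seq:
        h, c = lstm_step(x_t, h, c, *params)
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_steps = 8, 8, 5
params = (rng.normal(size=(input_dim, 4 * hidden_dim)),
          rng.normal(size=(hidden_dim, 4 * hidden_dim)),
          np.zeros(4 * hidden_dim))
x_seq = rng.normal(size=(num_steps, input_dim))
h0, c0 = np.zeros(hidden_dim), np.zeros(hidden_dim)

# Non-streaming: the whole sequence in a single call
h_full, c_full = run_sequence(x_seq, h0, c0, params)

# Streaming: one call per time step, with the caller carrying the state
h_stream, c_stream = h0, c0
for x_t in x_seq:
    h_stream, c_stream = run_sequence(x_t[None, :], h_stream, c_stream, params)

print(np.allclose(h_full, h_stream), np.allclose(c_full, c_stream))
```

This is why the streaming model exposes the hidden and cell states as explicit inputs and outputs: the caller is responsible for feeding each step's output state into the next step.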

### Create Streaming Inference Model

We use the `sentiment_model` function to create an inference version of the model. We set the `batch_size` and `timesteps` to 1 to allow for single-batch and single-time-step processing. We also set `training=False` to indicate that this is an inference model.

### Load Pruned Weights into Streaming Model

To use the sparsity achieved during training, we load the pruned weights from the `model_pruned_stripped` model (the pruned non-streaming model) into the streaming inference model. This allows the streaming model to benefit from the sparsity and compression achieved through the pruning process.


In [None]:
# Create an inference version of the model, for single batch size and time steps (streaming model)
inference_model = sentiment_model(batch_size=1, timesteps=1, training=False)
print("Setting Weights of Streaming Inference Model to the Sparsified Trained Weights of the Non-Streaming Model")
inference_model.set_weights(model_pruned_stripped.get_weights())

## Quantize the Pruned Streaming Model Using TFLite

Quantization is a technique used to reduce the memory footprint and improve the computational efficiency of a model by converting the weights and activations from floating-point representation to fixed-point representation. In this section, we will quantize the pruned streaming model using TensorFlow Lite (TFLite).

### Quantization Modes

We can choose between two quantization modes for the pruned streaming model:

1. **8-bit Quantization (`quantize_mode = '8x8'`)**: This mode quantizes both the weights and activations to 8-bit integer values (Int8). It significantly reduces the model size and offers good computational efficiency while maintaining reasonable accuracy.

2. **Hybrid Quantization (`quantize_mode = '8x16'`)**: This mode quantizes the weights to 8-bit integer values (Int8) and the activations to 16-bit integer values (Int16). It offers higher precision for activations compared to 8-bit quantization, making it suitable for use cases where higher accuracy is required.

The choice of quantization mode depends on the specific requirements of the application, such as the acceptable level of accuracy and the computational resources available on the target deployment platform.
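
To make the precision trade-off concrete, the sketch below applies a simple symmetric quantizer to random toy activations at 8 and 16 bits and compares the worst-case reconstruction error. This is only an illustration of the idea; the TFLite converter uses its own calibration and quantization scheme, and the function name `quantize_symmetric` is hypothetical.

```python
import numpy as np

def quantize_symmetric(values, num_bits):
    """Quantize floats to signed integers using one shared scale factor."""
    q_max = 2 ** (num_bits - 1) - 1          # 127 for int8, 32767 for int16
    scale = np.max(np.abs(values)) / q_max   # float value of one integer step
    quantized = np.clip(np.round(values / scale), -q_max, q_max).astype(np.int32)
    return quantized, scale

rng = np.random.default_rng(0)
toy_activations = rng.normal(size=1000)

max_errors = {}
for num_bits in (8, 16):
    quantized, scale = quantize_symmetric(toy_activations, num_bits)
    reconstructed = quantized * scale
    max_errors[num_bits] = float(np.max(np.abs(reconstructed - toy_activations)))
    print(f"int{num_bits}: scale={scale:.6f}, max error={max_errors[num_bits]:.6f}")
```

The rounding error is bounded by half a quantization step, so moving from 8 to 16 bits shrinks the step, and with it the error, by a factor of roughly 256.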

### Quantize Using `TFLiteModelWrapper()` Class

To perform quantization, we will use the `TFLiteModelWrapper()` class. This class provides a convenient interface for converting a TensorFlow model to a quantized TFLite model. We will configure the class to perform the selected quantization mode and apply it to the pruned streaming model.

We first define a `keras_predict_fun` function to perform prediction using the inference model. Next, we create a `representative_dataset` function that generates the representative dataset for quantization calibration. This dataset is generated using the `__process_sequences__` function and reflects the typical input data distribution.

We then set the quantization mode (`quantize_mode`) to either `'8x8'` or `'8x16'` based on the desired quantization type. We provide the model, representative dataset, and quantization mode to the `TFLiteModelWrapper()` class and specify the save path for the quantized TFLite model (`tflite_save_path`).

The `model_tflite` is the quantized TFLite model that is optimized for deployment on resource-constrained devices. It benefits from both the sparsity achieved during pruning and the reduced memory footprint achieved through quantization. 


In [None]:
# Truncate the datasets, for speed
num_quantization_sequences = 16
x_quant = x_train[:num_quantization_sequences]
y_quant = y_train[:num_quantization_sequences]


In [None]:
# Preprocess inference inputs
quant_inputs = prepare_inputs(x_quant)
test_inputs = prepare_inputs(x_test)

In [None]:
keras_predict_fun = lambda inputs: inference_model.predict(inputs, verbose=False)
representative_dataset = lambda: __process_sequences__(keras_predict_fun, x_quant, yield_inputs=True) 

quantize_mode = '8x8' # or '8x16' 

tflite_save_path = 'tflite_imdb.tflite'
model_tflite = TFLiteModelWrapper(quantize_mode=quantize_mode,
                                  model=inference_model, 
                                  representative_dataset=representative_dataset, 
                                  tflite_save_path=tflite_save_path)

### Evaluate Performance of TFLite Streaming Model (Pruned and Quantized)

In this section, we assess the performance of the pruned and quantized TFLite streaming model generated in the previous steps. To achieve this, we measure the model's classification accuracy on the test dataset.

We define a function `tflite_predict_fun` as a lambda function to perform predictions using the TFLite model `model_tflite`. We process the test dataset using the `__batch_process_sequences__` function, which performs batch processing of sequences and returns the predictions.

The `classification_accuracy` function calculates the accuracy by comparing the predicted class labels (obtained by taking the argmax of the model's per-class output scores) with the ground truth labels. The computed accuracy reflects the effectiveness of the pruned and quantized TFLite streaming model in performing sentiment analysis.

Finally, we print the TFLite model's classification accuracy, which provides insight into the model's performance on the IMDB sentiment analysis task. It is important to note that the model has undergone both pruning and quantization, optimizing it for deployment on resource-constrained devices while maintaining reasonable accuracy.

In [None]:
tflite_predict_fun = lambda x: model_tflite(x) 

# Measure the TFLite model's accuracy
tflite_labels = []
reference_tflite_preds = []
print("Measuring TFLite accuracy...")
tflite_preds = __batch_process_sequences__(
    tflite_predict_fun, 
    x_test
)

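def classification_accuracy(logits: np.array, labels: np.array):
    # Each per-sequence prediction may carry a leading batch dimension of 1,
    # so flatten to shape (num_sequences, num_classes) before taking the argmax
    logits = np.asarray(logits).reshape(len(labels), -1)
    predictions = np.argmax(logits, axis=1)
    return np.mean(np.asarray(labels) == predictions)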

tflite_accuracy = classification_accuracy(tflite_preds, y_test)

In [None]:
print("TFLite reference accuracy: %f" % tflite_accuracy)

### Compare Model Sizes: Baseline, Pruned, and Pruned/Quantized

In this section, we evaluate and compare the sizes of different versions of the model: the baseline model, the pruned model, and the pruned and quantized TFLite model. This comparison helps us understand the impact of pruning and quantization on the model's memory footprint, which is critical for deployment on resource-constrained devices.

We start by saving the baseline model and the pruned model (with pruning wrappers stripped) to temporary files. Next, we retrieve the path of the pruned and quantized TFLite model, which was generated in previous steps.

To calculate and compare the model sizes, we use the `get_gzipped_model_size` function, which computes the size of a gzipped model file. We print the sizes of the gzipped baseline Keras model, the gzipped pruned Keras model, and the gzipped pruned/quantized TFLite model.

The results provide insights into the effectiveness of pruning and quantization in reducing the model's memory footprint while maintaining its performance on the IMDB sentiment analysis task.
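
The reason gzipped size is a useful proxy is that pruned weights contain long runs of zeros, which compress well. The sketch below demonstrates the effect on a random toy matrix using only `gzip` and NumPy. It does not use `get_gzipped_model_size`, whose implementation may differ, and `gzipped_num_bytes` is a hypothetical helper.

```python
import gzip
import numpy as np

def gzipped_num_bytes(array):
    """Size in bytes of the array's raw buffer after gzip compression."""
    return len(gzip.compress(np.ascontiguousarray(array).tobytes()))

rng = np.random.default_rng(0)
dense_weights = rng.normal(size=(256, 256)).astype(np.float32)

# Zero out roughly 60% of the entries to mimic a pruned layer
sparse_weights = dense_weights.copy()
sparse_weights[rng.random(size=dense_weights.shape) < 0.6] = 0.0

print("dense :", gzipped_num_bytes(dense_weights))
print("sparse:", gzipped_num_bytes(sparse_weights))
```

Note that the uncompressed size of the two matrices is identical. The saving only shows up after compression, or on hardware and formats that store sparse weights natively.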


In [None]:

_, keras_file = tempfile.mkstemp('.h5')
tf.keras.models.save_model(train_model, keras_file, include_optimizer=False)
print('Saved baseline model to:', keras_file)

_, pruned_keras_file = tempfile.mkstemp('.h5')
tf.keras.models.save_model(model_pruned_stripped, pruned_keras_file, include_optimizer=False)
print('Saved pruned Keras model to:', pruned_keras_file)

pruned_tflite_file = tflite_save_path
print("TFLite File was generated at:", pruned_tflite_file)

In [None]:
print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file)))
print("Size of gzipped pruned/Quantized TFlite model: %.2f bytes" % (get_gzipped_model_size(pruned_tflite_file)))

## Compiling TFLite model using [femtocrux](https://femtocrux.femtosense.ai/en/latest/) and generating Memory Image BitFile

Next, we compile the generated TFLite model with femtocrux. Compiling the model with femtocrux is a necessary step in deploying the TensorFlow model on Femtosense's SPU.

We need to have Docker installed and instantiate a `CompilerClient`, which allows us to make API calls to Femtosense's compiler. We can then call the client's `compile` method to produce the memory image bitfile.

In [None]:
from femtocrux import CompilerClient, TFLiteModel
client = CompilerClient()

In [None]:
flatbuffer = model_tflite.instance.flatbuffer
signature_name = model_tflite.instance.signature_name

In [None]:
bitstream = client.compile(    
    TFLiteModel(flatbuffer=flatbuffer, signature_name=signature_name)
)
# Write to a file for later use
with open('my_bitfile.zip', 'wb') as f: 
    f.write(bitstream)

## Generating Program Files and Loading Them onto the SPU
The memory image bitfile zip can then be converted to Program Files using [femtodriverpub](https://github.com/femtosense/femtodriverpub).

Once generated, these Program Files can be transferred to an SD card, which can then be inserted into Femtosense's SPU. 


#### Install `femtodriverpub`
##### Step One - Clone femtodriverpub from GitHub and Install It


In [None]:
! git clone https://github.com/femtosense/femtodriverpub.git; cd femtodriverpub; pip install -e .

#### Unzip the `my_bitfile.zip` Archive


In [None]:
! rm -rf 'my_bitfile'
! unzip 'my_bitfile.zip' -d 'my_bitfile'

#### Generate the Program Files to load onto SPU


In [None]:
! python femtodriverpub/femtodriverpub/run/sd_from_femtocrux.py 'my_bitfile'


#### The Program Files Should Be Generated in the `apb_records` Folder


In [None]:
! ls 'apb_records'


The contents of `apb_records` can be copied onto an SD card, which can then be inserted into the SPU. Congratulations, your model is now ready to deploy on the SPU!
