<h1 align="center" style="color:green;font-size: 3em;">Homework 3:
Implementing Quantization Techniques</h1>

# Part 0: Instructions

- Follow the notebook sections to implement various quantization techniques.
- Complete the code cells marked with `TODO`.
- Ensure your code runs correctly by the end of the notebook.

In [None]:
!pip install datasets -q

In [None]:
import torch

from transformers import BertModel, BertTokenizer, DistilBertForSequenceClassification, DistilBertTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report
from tqdm import tqdm
from torch.optim import AdamW
import torch.quantization


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Part 1: Introduction to Quantization


Quantization is a model compression technique that reduces the size and computational requirements of LLMs. The central idea is to represent the model’s weights and activations using lower-precision data types, such as `int8` or `float16`, instead of the standard `float32`. This significantly reduces the memory footprint and allows for faster computation, as lower-precision arithmetic is generally cheaper.

There are various types of quantization techniques, including post-training quantization (PTQ), where the model is quantized after training, and quantization-aware training (QAT), where the model is trained with quantization in mind. While quantization often costs some model accuracy, techniques like QAT mitigate the loss by adjusting weights during training to account for the reduced precision. By balancing computational efficiency against model performance, quantization lets LLMs run effectively in real-world applications without extensive hardware resources.
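To make the idea of "representing values with a lower-precision type" concrete, here is a minimal sketch of 8-bit affine quantization written with NumPy. It is an illustration under stated assumptions, not the method PyTorch uses internally: the helper names `quantize_affine` and `dequantize_affine` are hypothetical, and the scale and zero point are derived from the minimum and maximum of the tensor.

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map a float array to unsigned integers using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is exactly representable
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # all-zero input: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Map the integers back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

q, scale, zero_point = quantize_affine(weights)
recovered = dequantize_affine(q, scale, zero_point)
max_error = float(np.abs(weights - recovered).max())

print(f"float32 bytes: {weights.nbytes}, uint8 bytes: {q.nbytes}")
print(f"scale: {scale:.5f}, zero point: {zero_point}, max error: {max_error:.5f}")
```

The integer tensor takes a quarter of the memory, and the round trip is off by at most about one quantization step, which is the accuracy cost the text refers to.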

## 1.1: Basic Data Type Conversion

First, we will explore the memory usage of different tensor data types in PyTorch. Understanding how the choice of data type affects memory consumption is crucial when working with large datasets or models in deep learning.

<hr> <h3>Task:</h3>

1. Create a PyTorch tensor of type float32 with a size of your choice (e.g., 1000x1000).
2. Create another tensor of type float16 by converting the float32 tensor.
3. Create another tensor of type int8 by converting the float16 tensor.
4. Print out the memory consumption of all the tensors.



In [None]:
## TODO: Create a tensor of type float32
tensor = torch.randn(1000, 1000)  #
print(f"Memory (float32): {tensor.element_size() * tensor.nelement()} bytes")

## TODO: Create a tensor of the same shape of type float 16
tensor_fp16 = tensor.to(dtype=torch.float16)
print(f"Memory (float16): {tensor_fp16.element_size() * tensor_fp16.nelement()} bytes")

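## TODO: Create a tensor of the same shape of type int8 (converted from the float16 tensor, as the task asks)
## Note: a plain cast truncates values toward zero; it does not apply a quantization scale
tensor_int8 = tensor_fp16.to(dtype=torch.int8)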
print(f"Memory (int8): {tensor_int8.element_size() * tensor_int8.nelement()} bytes")

Memory (float32): 4000000 bytes
Memory (float16): 2000000 bytes
Memory (int8): 1000000 bytes


## 1.2: Quantize a Small NN Model

Next, we will explore the impact of data type conversion on the output of a BERT model using PyTorch. Specifically, we will compare the output shapes and memory usage of the BERT model when using different tensor data types: float32 and float16.

<hr> <h3>Task:</h3>

1. Load the BERT model and tokenizer using the `prajjwal1/bert-small` pretrained model.
2. Tokenize a sample input text (e.g., "Quantization is useful!") and prepare it for the model by using `return_tensors='pt'`.
3. Run the model on the input data using `float32` tensors and store the output.
4. Convert the model to use `float16` tensors by calling the `.half()` method.
5. Run the model again on the same input data using the `float16` tensors and store the output.
6. Print the number of bytes used by the outputs for both `float32` and `float16` tensors.

In [None]:
## TODO: Load the model and tokenizer
model = BertModel.from_pretrained('prajjwal1/bert-small')
tokenizer = BertTokenizer.from_pretrained('prajjwal1/bert-small')

## TODO: Tokenize a random sentence and run it through the model
text = "Quantization is useful!"
inputs = tokenizer(text, return_tensors='pt')
outputs_fp32 = model(**inputs)
last_hidden_state_fp32 = outputs_fp32.last_hidden_state

## TODO: Quantize the model and run the sentence through the new model
model_fp16 = model.half()  # Converts the weights to float16 in place; `model` and `model_fp16` are the same object
# input_ids and attention_mask are integer tensors, so the inputs are passed unchanged
outputs_fp16 = model_fp16(**inputs)
last_hidden_state_fp16 = outputs_fp16.last_hidden_state


## TODO: Print the bytes used for both
print(f"Memory (float32): {last_hidden_state_fp32.element_size() * last_hidden_state_fp32.nelement()} bytes")
print(f"Memory (float16): {last_hidden_state_fp16.element_size() * last_hidden_state_fp16.nelement()} bytes")

Memory (float32): 16384 bytes
Memory (float16): 8192 bytes


## 1.3: Conceptual Questions

- What is quantization and why is it important for large language models?

Quantization is a technique in machine learning used to reduce the memory and computational requirements of a model by converting its weights and activations from high-precision floating-point numbers (e.g., 32-bit or 16-bit floats) to lower-precision integer values (e.g., 8-bit integers). This process compresses the model, leading to faster inference speeds and lower memory usage with minimal impact on accuracy if done properly.

- How does reducing precision from float32 to int8 impact memory usage?

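Memory usage drops by about 4x: a float32 value takes 4 bytes and an int8 value takes 1 byte. For the 1000x1000 tensor in 1.1, that is 4,000,000 bytes versus 1,000,000 bytes (plus a small overhead for the scale and zero-point when real quantization is used).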

- Explain the difference between per-layer and per-channel quantization. Why might per-channel quantization be more effective for certain tasks?

In per-layer quantization, a single scale and zero-point (quantization parameters) are used for all weights and activations within a layer.
This means that all weights or activations in a given layer are scaled by the same factor and share the same zero-point, which simplifies the computation and reduces memory usage.
However, it may lead to accuracy loss, especially in convolutional layers, where weights may vary significantly between channels.

In per-channel quantization, each channel (usually the output channel) in a layer has its own scale and zero-point.
This is commonly used in convolutional and depthwise separable convolution layers, where weights within different channels might have distinct distributions.
By assigning different quantization parameters to each channel, per-channel quantization can better represent the weight and activation distributions, leading to higher accuracy compared to per-layer quantization.

Per-channel quantization performs better for some tasks because each channel retains more information about its own value distribution. This helps maintain model accuracy, especially for tasks with complex spatial and feature representations such as image classification.
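The difference can be seen numerically. The sketch below is a NumPy illustration with made-up weights (three output channels whose magnitudes differ by a factor of 10 each) and a hypothetical helper `fake_quantize_symmetric`; it compares the rounding error when one scale is shared by the whole layer against one scale per output channel.

```python
import numpy as np

def fake_quantize_symmetric(w, scale):
    """Round to int8 levels (zero point 0) and map back to floats."""
    return np.clip(np.round(w / scale), -127, 127) * scale

rng = np.random.default_rng(0)
# Hypothetical weight matrix: rows are output channels with very different magnitudes
channel_std = np.array([[0.01], [0.1], [1.0]])
weights = rng.normal(size=(3, 256)) * channel_std

# Per-layer: a single scale shared by every weight in the layer
scale_layer = np.abs(weights).max() / 127
err_layer = np.abs(weights - fake_quantize_symmetric(weights, scale_layer)).mean(axis=1)

# Per-channel: one scale per output channel (row)
scale_channel = np.abs(weights).max(axis=1, keepdims=True) / 127
err_channel = np.abs(weights - fake_quantize_symmetric(weights, scale_channel)).mean(axis=1)

for c in range(weights.shape[0]):
    print(f"channel {c}: per-layer error {err_layer[c]:.6f}, "
          f"per-channel error {err_channel[c]:.6f}")
```

With a shared scale, the step size is set by the largest channel, so most weights in the small-magnitude channels round to zero; with per-channel scales each row gets a step size matched to its own range.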



# Part 2: Post-Training Quantization

## 2.1: Overview

Post-Training Quantization (PTQ) optimizes a pretrained neural network by reducing the precision of its weights and activations, which decreases memory usage and can improve inference speed, usually with only a small loss in accuracy. There are two main types: static and dynamic. Static quantization quantizes the weights ahead of time and computes scaling factors for the activations during a calibration phase on a representative dataset, so the quantization parameters are fixed at inference. Dynamic quantization also quantizes the weights ahead of time, but computes the activation scaling factors on the fly at runtime, which makes it easier to apply because no calibration dataset is needed. Both strategies make models easier to deploy in resource-constrained environments.
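The practical difference between the two modes is where the activation scale comes from. The NumPy sketch below illustrates this under simplifying assumptions (symmetric int8 quantization, random data standing in for activations, hypothetical helper names): the static scale is fixed once from calibration batches, while the dynamic scale is recomputed from each input at inference time.

```python
import numpy as np

def symmetric_scale(x):
    """Scale that maps the largest absolute value in x to the int8 limit 127."""
    return float(np.abs(x).max()) / 127

def quantize_dequantize(x, scale):
    """Round to int8 levels with the given scale, then map back to floats."""
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
# Hypothetical activations: calibration data and a new input with a narrower range
calibration_batches = [rng.normal(size=(8, 16)) for _ in range(4)]
new_input = 0.1 * rng.normal(size=(8, 16))

# Static: the activation scale is computed once, from calibration data
static_scale = max(symmetric_scale(batch) for batch in calibration_batches)

# Dynamic: the activation scale is recomputed from each input at runtime
dynamic_scale = symmetric_scale(new_input)

static_error = float(np.abs(new_input - quantize_dequantize(new_input, static_scale)).mean())
dynamic_error = float(np.abs(new_input - quantize_dequantize(new_input, dynamic_scale)).mean())

print(f"static scale:  {static_scale:.5f}, mean error: {static_error:.5f}")
print(f"dynamic scale: {dynamic_scale:.5f}, mean error: {dynamic_error:.5f}")
```

The input here was deliberately chosen not to match the calibration data, which shows why static quantization depends on a representative calibration set. The price of the dynamic approach is the extra range computation on every input.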

## 2.2: Implementing Dynamic Quantization

For dynamic quantization, first, we will load a pre-trained DistilBERT model and its corresponding tokenizer, which will be used for sequence classification tasks. Note: this is exactly the same as 2.2's first step so you can just re-use your code.

<hr>
<h3>Task:</h3>

1. Load the model and tokenizer (`distilbert-base-uncased`)

In [None]:
## TODO: Load a pre-trained DistilBERT model and tokenizer
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load model for sequence classification (with 2 classes for IMDB sentiment)
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                          num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we will apply dynamic quantization to the pre-trained DistilBERT model to reduce its size and improve inference speed without significant loss in accuracy.

<hr>
<h3>Task:</h3>

1. Set the model to evaluation mode using model.eval().
2. Use torch.quantization.quantize_dynamic() to quantize the model:
    - Specify the model to be quantized.
    - Indicate which layers to quantize (in this case, torch.nn.Linear layers).
    - Set the quantized data type to torch.qint8.
3. Return the quantized model.

In [None]:
## TODO: Finish this method
def apply_dynamic_quantization(model):

    # Dynamic quantization runs on the CPU, so move the model there first
    model = model.to('cpu')

    # Set model to evaluation mode
    model.eval()

    # Apply dynamic quantization
    quantized_model = torch.quantization.quantize_dynamic(
        model,  # the model to quantize
        {torch.nn.Linear},  # specify layers to quantize
        dtype=torch.qint8  # specify quantization data type
    )

    # Return the quantized model (not the original float model)
    return quantized_model


Next, we will evaluate the performance of the quantized DistilBERT model using various metrics to gain a comprehensive understanding of its effectiveness.

<hr>
<h3>Task:</h3>

1. Set the model to evaluation mode with model.eval().
2. Initialize lists to store all predictions and labels.
3. Iterate through the data loader to:
    - Tokenize the input text.
    - Perform a forward pass through the model to obtain predictions.
    - Collect predictions and true labels for evaluation.

4. Calculate evaluation metrics:
    - Accuracy using accuracy_score.
    - F1 Score using f1_score with a weighted average.
    - Classification Report using classification_report.
5. Return the accuracy, F1 score, and classification report.
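For reference, the weighted-average F1 score requested in step 4 is the mean of the per-class F1 scores, weighted by how many true examples each class has. The pure-Python sketch below shows that calculation on a tiny made-up example; `weighted_f1` is a hypothetical helper for illustration, and in the notebook itself `sklearn.metrics.f1_score` with `average='weighted'` does this work.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += (count / len(y_true)) * f1
    return total

# Made-up labels: three negatives and two positives
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}, weighted F1: {weighted_f1(y_true, y_pred):.2f}")
```

Class 0 has an F1 of 2/3 with 3 examples and class 1 has an F1 of 1/2 with 2 examples, so the weighted score is 0.6 * 2/3 + 0.4 * 1/2 = 0.6.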

In [None]:
from sklearn.metrics import accuracy_score, f1_score, classification_report
import torch

In [None]:
## TODO: Finish this method
def evaluate_model(model, data_loader):
    # Set model to evaluation mode
    model.eval()

    # Initialize lists to store predictions and labels
    all_predictions = []
    all_labels = []

    # Disable gradient calculations for inference
    with torch.no_grad():
        for batch in data_loader:
            # Get inputs and labels from batch
            inputs = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            labels = batch['labels'].cpu().numpy()

            # Forward pass
            outputs = model(input_ids=inputs, attention_mask=attention_mask)
            logits = outputs.logits

            # Convert logits to predictions
            predictions = torch.argmax(logits, dim=1).cpu().numpy()

            # Collect predictions and labels
            all_predictions.extend(predictions)
            all_labels.extend(labels)

    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_predictions)
    f1 = f1_score(all_labels, all_predictions, average='weighted')
    report = classification_report(all_labels, all_predictions)

    return accuracy, f1, report

Finally, we want to combine all the steps together. We will load a dataset, apply dynamic quantization to the pre-trained DistilBERT model, and evaluate its performance.

<hr>
<h3>Task:</h3>

1. Load the Dataset:
    - Use the IMDB dataset.
    - Shuffle and select a subset for training (e.g., 2,000 samples) and evaluation (e.g., 500 samples).

2. Create DataLoaders:
    - Set up DataLoader for both training and testing datasets with appropriate batch sizes.

3. Apply Dynamic Quantization:
    - Use the apply_dynamic_quantization function to quantize the pre-trained model.

4. Evaluate the Quantized Model:
    - Use the evaluate_model function to assess the accuracy, F1 score, and generate a classification report for the quantized model on the test dataset.
    - Print the accuracy, F1 score, and classification report.

In [None]:
import datasets

In [None]:
## TODO: Load the IMDB dataset
imdb = load_dataset('imdb')
train_dataset = imdb['train'].shuffle(seed=42).select(range(2000))
test_dataset = imdb['test'].shuffle(seed=42).select(range(500))

## TODO: Create DataLoader for training and testing
class IMDBDataset(torch.utils.data.Dataset):
   def __init__(self, dataset, tokenizer, max_length=512):
       self.dataset = dataset
       self.tokenizer = tokenizer
       self.max_length = max_length

   def __len__(self):
       return len(self.dataset)

   def __getitem__(self, idx):
       text = self.dataset[idx]['text']
       label = self.dataset[idx]['label']

       encoding = self.tokenizer(
           text,
           truncation=True,
           padding='max_length',
           max_length=self.max_length,
           return_tensors='pt'
       )

       return {
           'input_ids': encoding['input_ids'].squeeze(),
           'attention_mask': encoding['attention_mask'].squeeze(),
           'labels': torch.tensor(label)
       }

train_dataset = IMDBDataset(train_dataset, tokenizer)
test_dataset = IMDBDataset(test_dataset, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

## TODO: Apply dynamic quantization
quantized_model = apply_dynamic_quantization(model)
## TODO: Evaluate the quantized model
accuracy, f1, report = evaluate_model(quantized_model, test_loader)

print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Classification Report:")
print(report)

Accuracy: 0.3920
F1 Score: 0.3901
Classification Report:
              precision    recall  f1-score   support

           0       0.41      0.45      0.43       254
           1       0.37      0.33      0.35       246

    accuracy                           0.39       500
   macro avg       0.39      0.39      0.39       500
weighted avg       0.39      0.39      0.39       500



## 2.3: Conceptual Questions


- What are the trade-offs between static and dynamic quantization in terms of model accuracy, inference speed, and implementation complexity? (also explain why this might be the case)

For accuracy: static quantization can preserve accuracy well when the calibration data is representative, because the activation ranges are tuned on that data ahead of time. If the calibration data misses part of the range seen at inference, values get clipped and accuracy drops. Dynamic quantization computes activation ranges from each input at runtime, so it adapts to the actual data.

For inference speed: static quantization typically provides faster inference since both weights and activations use precomputed quantization parameters, enabling the model to leverage efficient integer-only computations throughout the entire network.
Dynamic quantization is often slower than static quantization because only the weights are quantized ahead of time. Activations are quantized during inference, which requires extra computation to find their range and convert them from floating point to integer in real time. This added processing, especially in models with many layers, can slow down inference.

For implementation complexity: static quantization generally requires more implementation effort. It needs a calibration step with representative data to compute the quantization parameters for activations. Extra considerations like layer fusion and the need to choose calibration data that captures the full activation range add to the setup complexity.
Dynamic quantization is simpler to implement because it bypasses the calibration process and doesn’t require layer-wise tuning of activation ranges.

- When might you choose one method over another?

If I have enough representative calibration data and inference speed matters most, I will use static quantization. Otherwise, I will use dynamic quantization.

- Please discuss the accuracy degradation when doing quantization and provide ways you may minimize this.

Why accuracy degradation occurs in quantization:

- Limited dynamic range: Lower-precision formats like int8 or float16 have a much smaller dynamic range than float32. As a result, values that are too large or too small may be rounded or clipped, leading to loss of information.
- Round-off errors: Quantizing floating-point numbers to integers introduces round-off errors, particularly in layers with small-magnitude values or narrow value distributions, where even minor errors can significantly affect the model.
- Sensitivity in certain layers: Some layers, especially embedding layers or layers closer to the output, are more sensitive to precision loss. Quantizing these layers can cause disproportionately large accuracy drops.

Strategies to minimize accuracy degradation:

1. Quantization-Aware Training (QAT): the model is trained with simulated quantization inserted into the forward pass, while gradients still update the full-precision weights. The model learns to adapt to the quantization errors by updating its parameters during training.
2. Mixed-precision quantization: instead of applying quantization uniformly across all layers, keep the most sensitive layers in higher precision (e.g., float16 or float32) and quantize only the less sensitive layers to int8.
3. Per-channel quantization: each output channel in a layer (e.g., a convolutional layer) has its own scale and zero-point rather than sharing a single scale and zero-point across the layer.


# Part 3: Quantization-Aware Training

## 3.1: Overview

Quantization-Aware Training (QAT) is a technique designed to optimize neural networks for deployment in resource-constrained environments. By simulating low-precision arithmetic during training, QAT allows models to learn how to adapt their weights for quantized operations, which typically gives better accuracy than post-training quantization alone. In this section, we will implement QAT with PyTorch's quantization tools on a model and dataset loaded through Hugging Face's Transformers and Datasets libraries, aiming to maintain model performance while reducing memory footprint and inference latency.
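The core mechanism can be sketched on a toy problem. The NumPy example below is an illustration under stated assumptions, not what `prepare_qat` does internally: it fits a small linear model whose forward pass always uses rounded ("fake-quantized") weights, while the gradient is applied to the underlying full-precision weights as if rounding were the identity. This trick is commonly called the straight-through estimator. The data, the fixed `scale`, and the helper names are all hypothetical.

```python
import numpy as np

def fake_quantize(w, scale):
    """Simulate int8 rounding in floating point: quantize, then dequantize."""
    return np.clip(np.round(w / scale), -127, 127) * scale

def mse(w_q, X, y):
    """Mean squared error of the linear model with (quantized) weights w_q."""
    return float(np.mean((X @ w_q - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
w_true = np.array([0.5, -1.0, 0.25, 2.0])  # made-up target weights
y = X @ w_true

scale = 0.05            # fixed quantization step (an assumption for this sketch)
w = np.zeros(4)         # latent full-precision weights that training updates
learning_rate = 0.05

initial_loss = mse(fake_quantize(w, scale), X, y)
for step in range(200):
    w_q = fake_quantize(w, scale)                   # forward pass sees quantized weights
    grad_w_q = 2.0 * X.T @ (X @ w_q - y) / len(X)   # gradient with respect to w_q
    w -= learning_rate * grad_w_q                   # straight-through: apply it to w
final_loss = mse(fake_quantize(w, scale), X, y)

print(f"loss with quantized weights: {initial_loss:.4f} -> {final_loss:.4f}")
print("quantized weights:", fake_quantize(w, scale))
```

Because the loss is always measured with the quantized weights, training steers the latent weights toward values that still work well after rounding.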

## 3.2: Train a Model Normally

First, we will load a pretrained language model called `distilbert-base-uncased`.
<hr>
<h3>Task:</h3>

1. Load the model (set num_labels to 2) and tokenizer
2. Move the model to cuda

In [None]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

## TODO: Finish the rows below
model_name ='distilbert-base-uncased'
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',  num_labels=2).to(device)
tokenizer =  DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we want to load our dataset. The dataset we will be using in this section is MRPC from the GLUE benchmark. Using knowledge from previous homework, finish the cell below.
<hr>
<h3>Task:</h3>

<strong>Note:</strong> Don't set the batch size too large, or the GPU will run out of memory.

1. Create a train and validation loader (only take 500 samples for the training data and 100 for the validation data)


In [None]:
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
import torch

## TODO: Create the dataloaders

# Load MRPC dataset from GLUE
dataset = load_dataset('glue', 'mrpc')

# Create custom dataset class for MRPC
class MRPCDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=128):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # MRPC has sentence1 and sentence2 pairs
        sentence1 = self.dataset[idx]['sentence1']
        sentence2 = self.dataset[idx]['sentence2']
        label = self.dataset[idx]['label']

        # Tokenize sentence pairs
        encoding = self.tokenizer(
            sentence1,
            sentence2,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(label)
        }

# Create train and validation datasets
train_dataset = MRPCDataset(dataset['train'].select(range(500)), tokenizer)
val_dataset = MRPCDataset(dataset['validation'].select(range(100)), tokenizer)

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
eval_dataloader = DataLoader(val_dataset, batch_size=2)


Next, we want to write a method for training the model and a method for evaluating the model.
<hr>
<h3>Task:</h3>

1. Complete the `train_model` method defined below. It should train the model for `num_epochs` epochs and print out the loss for each epoch.
2. Complete the `evaluate_model` method defined below. It should print out the final accuracy.

In [None]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

In [None]:
## TODO: Finish the method
def train_model(model, train_dataloader, num_epochs=2):

   optimizer = AdamW(model.parameters(), lr=2e-5)

   # Calculate total training steps for scheduler
   total_steps = len(train_dataloader) * num_epochs

   # Create scheduler with warmup
   scheduler = get_linear_schedule_with_warmup(
       optimizer,
       num_warmup_steps=0,
       num_training_steps=total_steps
   )

   # Training loop
   for epoch in range(num_epochs):
       model.train()
       total_loss = 0

       # Use tqdm for progress bar
       progress_bar = tqdm(train_dataloader, desc=f'Epoch {epoch + 1}')

       for batch in progress_bar:
           # Move batch to device
           input_ids = batch['input_ids'].to(model.device)
           attention_mask = batch['attention_mask'].to(model.device)
           labels = batch['labels'].to(model.device)

           # Zero gradients
           optimizer.zero_grad()

           # Forward pass
           outputs = model(
               input_ids=input_ids,
               attention_mask=attention_mask,
               labels=labels
           )

           loss = outputs.loss
           total_loss += loss.item()

           # Backward pass
           loss.backward()

           # Clip gradients
           torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

           # Update weights
           optimizer.step()
           scheduler.step()

           # Update progress bar
           progress_bar.set_postfix({'loss': f'{loss.item():.3f}'})

       # Calculate average loss for epoch
       avg_loss = total_loss / len(train_dataloader)
       print(f"\nAverage loss for epoch {epoch + 1}: {avg_loss:.3f}")

   return model

In [None]:
## TODO: Finish the method
def evaluate_model(model, eval_dataloader):
   model.eval()

   # Initialize lists for predictions and true labels
   all_predictions = []
   all_labels = []
   total_eval_loss = 0

   # Evaluate without gradient calculations
   with torch.no_grad():
       for batch in tqdm(eval_dataloader, desc="Evaluating"):
           # Move batch to device
           input_ids = batch['input_ids'].to(model.device)
           attention_mask = batch['attention_mask'].to(model.device)
           labels = batch['labels'].to(model.device)

           # Forward pass
           outputs = model(
               input_ids=input_ids,
               attention_mask=attention_mask,
               labels=labels
           )

           # Get loss and predictions
           loss = outputs.loss
           total_eval_loss += loss.item()

           logits = outputs.logits
           predictions = torch.argmax(logits, dim=1).cpu().numpy()
           true_labels = labels.cpu().numpy()

           # Store predictions and labels
           all_predictions.extend(predictions)
           all_labels.extend(true_labels)

   # Calculate metrics
   avg_eval_loss = total_eval_loss / len(eval_dataloader)
   accuracy = accuracy_score(all_labels, all_predictions)
   f1 = f1_score(all_labels, all_predictions, average='binary')

   # Return results
   return {
       'loss': avg_eval_loss,
       'accuracy': accuracy,
       'f1': f1
   }

In [None]:
## Train the model for 20 epochs
train_model(model, train_dataloader, num_epochs=20)
evaluate_model(model, eval_dataloader)

Epoch 1: 100%|██████████| 250/250 [00:15<00:00, 16.66it/s, loss=0.326]



Average loss for epoch 1: 0.685


Epoch 2: 100%|██████████| 250/250 [00:13<00:00, 18.41it/s, loss=0.011]



Average loss for epoch 2: 0.607


Epoch 3: 100%|██████████| 250/250 [00:13<00:00, 18.44it/s, loss=0.003]



Average loss for epoch 3: 0.378


Epoch 4: 100%|██████████| 250/250 [00:13<00:00, 18.42it/s, loss=0.001]



Average loss for epoch 4: 0.172


Epoch 5: 100%|██████████| 250/250 [00:13<00:00, 18.35it/s, loss=0.001]



Average loss for epoch 5: 0.088


Epoch 6: 100%|██████████| 250/250 [00:13<00:00, 17.87it/s, loss=0.001]



Average loss for epoch 6: 0.053


Epoch 7: 100%|██████████| 250/250 [00:14<00:00, 17.43it/s, loss=0.000]



Average loss for epoch 7: 0.031


Epoch 8: 100%|██████████| 250/250 [00:13<00:00, 18.22it/s, loss=0.000]



Average loss for epoch 8: 0.055


Epoch 9: 100%|██████████| 250/250 [00:13<00:00, 18.24it/s, loss=0.000]



Average loss for epoch 9: 0.013


Epoch 10: 100%|██████████| 250/250 [00:13<00:00, 18.44it/s, loss=0.000]



Average loss for epoch 10: 0.018


Epoch 11: 100%|██████████| 250/250 [00:13<00:00, 18.46it/s, loss=0.000]



Average loss for epoch 11: 0.000


Epoch 12: 100%|██████████| 250/250 [00:13<00:00, 18.44it/s, loss=0.000]



Average loss for epoch 12: 0.000


Epoch 13: 100%|██████████| 250/250 [00:13<00:00, 18.44it/s, loss=0.000]



Average loss for epoch 13: 0.000


Epoch 14: 100%|██████████| 250/250 [00:14<00:00, 17.81it/s, loss=0.000]



Average loss for epoch 14: 0.000


Epoch 15: 100%|██████████| 250/250 [00:13<00:00, 18.19it/s, loss=0.000]



Average loss for epoch 15: 0.000


Epoch 16: 100%|██████████| 250/250 [00:13<00:00, 18.30it/s, loss=0.000]



Average loss for epoch 16: 0.000


Epoch 17: 100%|██████████| 250/250 [00:13<00:00, 18.39it/s, loss=0.000]



Average loss for epoch 17: 0.000


Epoch 18: 100%|██████████| 250/250 [00:13<00:00, 18.52it/s, loss=0.000]



Average loss for epoch 18: 0.000


Epoch 19: 100%|██████████| 250/250 [00:13<00:00, 18.37it/s, loss=0.000]



Average loss for epoch 19: 0.000


Epoch 20: 100%|██████████| 250/250 [00:13<00:00, 18.45it/s, loss=0.000]



Average loss for epoch 20: 0.000


Evaluating: 100%|██████████| 50/50 [00:00<00:00, 80.68it/s]


{'loss': 2.0329192268794576, 'accuracy': 0.78, 'f1': 0.8533333333333334}

## 3.3: Implementing Quantization-Aware Training


Next, we want to use the same model and the same task to perform Quantization Aware Training. Complete the cells below to get a sense of how this works.

In [None]:
## TODO: Finish the rows below to recreate a new model (should be the same as the cell above)
model_name ='distilbert-base-uncased'
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',  num_labels=2).to(device)
tokenizer =  DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<h3>Task:</h3>
Complete the cell below to prepare the model for quantization aware training.

In [None]:
from torch.ao.quantization.qconfig import float_qparams_weight_only_qconfig


In [None]:
# Weight-only quantization config for embedding layers
embedding_qconfig = float_qparams_weight_only_qconfig

# Default QAT config for all other layers
default_qconfig = torch.ao.quantization.get_default_qat_qconfig('fbgemm')

## TODO: Finish the method below to apply embedding_qconfig to embedding layers and default_qconfig for all other layers
def set_qconfig_for_model(model):
    # model.modules() yields the model itself and every submodule
    for module in model.modules():
        if isinstance(module, torch.nn.Embedding):
            module.qconfig = embedding_qconfig
        else:
            module.qconfig = default_qconfig


In [None]:
set_qconfig_for_model(model)


In [None]:
## TODO: Set the model to training mode
model.train()

## TODO: Prepare the model for QAT
model = torch.quantization.prepare_qat(model)

In [None]:
## Train the model again for 20 epochs
train_model(model, train_dataloader, num_epochs=20)

Epoch 1: 100%|██████████| 250/250 [00:20<00:00, 12.48it/s, loss=0.777]



Average loss for epoch 1: 0.638


Epoch 2: 100%|██████████| 250/250 [00:18<00:00, 13.58it/s, loss=0.062]



Average loss for epoch 2: 0.640


Epoch 3: 100%|██████████| 250/250 [00:17<00:00, 13.90it/s, loss=0.032]



Average loss for epoch 3: 0.768


Epoch 4: 100%|██████████| 250/250 [00:17<00:00, 13.90it/s, loss=0.030]



Average loss for epoch 4: 0.738


Epoch 5: 100%|██████████| 250/250 [00:18<00:00, 13.62it/s, loss=0.024]



Average loss for epoch 5: 0.588


Epoch 6: 100%|██████████| 250/250 [00:17<00:00, 13.93it/s, loss=0.009]



Average loss for epoch 6: 0.523


Epoch 7: 100%|██████████| 250/250 [00:18<00:00, 13.65it/s, loss=0.005]



Average loss for epoch 7: 0.355


Epoch 8: 100%|██████████| 250/250 [00:18<00:00, 13.89it/s, loss=0.009]



Average loss for epoch 8: 0.267


Epoch 9: 100%|██████████| 250/250 [00:18<00:00, 13.65it/s, loss=0.002]



Average loss for epoch 9: 0.193


Epoch 10: 100%|██████████| 250/250 [00:18<00:00, 13.77it/s, loss=0.001]



Average loss for epoch 10: 0.095


Epoch 11: 100%|██████████| 250/250 [00:18<00:00, 13.69it/s, loss=0.000]



Average loss for epoch 11: 0.035


Epoch 12: 100%|██████████| 250/250 [00:18<00:00, 13.25it/s, loss=0.000]



Average loss for epoch 12: 0.002


Epoch 13: 100%|██████████| 250/250 [00:17<00:00, 13.92it/s, loss=0.000]



Average loss for epoch 13: 0.000


Epoch 14: 100%|██████████| 250/250 [00:18<00:00, 13.54it/s, loss=0.000]



Average loss for epoch 14: 0.000


Epoch 15: 100%|██████████| 250/250 [00:17<00:00, 13.93it/s, loss=0.000]



Average loss for epoch 15: 0.000


Epoch 16: 100%|██████████| 250/250 [00:18<00:00, 13.63it/s, loss=0.000]



Average loss for epoch 16: 0.000


Epoch 17: 100%|██████████| 250/250 [00:18<00:00, 13.84it/s, loss=0.000]



Average loss for epoch 17: 0.000


Epoch 18: 100%|██████████| 250/250 [00:18<00:00, 13.84it/s, loss=0.000]



Average loss for epoch 18: 0.000


Epoch 19: 100%|██████████| 250/250 [00:18<00:00, 13.65it/s, loss=0.000]



Average loss for epoch 19: 0.000


Epoch 20: 100%|██████████| 250/250 [00:17<00:00, 13.96it/s, loss=0.000]



Average loss for epoch 20: 0.000


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(
        30522, 768, padding_idx=0
        (activation_post_process): PlaceholderObserver(dtype=torch.float32, is_dynamic=False)
      )
      (position_embeddings): Embedding(
        512, 768
        (activation_post_process): PlaceholderObserver(dtype=torch.float32, is_dynamic=False)
      )
      (LayerNorm): LayerNorm(
        (768,), eps=1e-12, elementwise_affine=True
        (activation_post_process): FusedMovingAvgObsFakeQuantize(
          fake_quant_enabled=tensor([1], device='cuda:0'), observer_enabled=tensor([1], device='cuda:0'), scale=tensor([0.0873], device='cuda:0'), zero_point=tensor([97], device='cuda:0', dtype=torch.int32), dtype=torch.quint8, quant_min=0, quant_max=127, qscheme=torch.per_tensor_affine, reduce_range=True
          (activation_post_process): MovingAverageMinMaxObserver(min_val=-8.43149471282959, max_val=2.6527600288391

In [None]:
import torch
import torch.quantization as quant
## TODO: Set the model to evaluation mode
model.eval()

#model.qconfig = qconfig
quant.prepare(model, inplace=True)
# Evaluate
metrics = evaluate_model(model, eval_dataloader)

Evaluating: 100%|██████████| 50/50 [00:01<00:00, 32.06it/s]


In [None]:
metrics

{'loss': 2.129251233383402, 'accuracy': 0.74, 'f1': 0.8289473684210527}

## 3.4: Conceptual Questions

- Discuss the general procedure of QAT. Specifically, how do forward and backward propagation differ from normal deep learning training?

General Procedure of QAT

1. Prepare the Model for QAT:

The model is configured with quantization layers, typically by defining a quantization configuration (qconfig) in frameworks like PyTorch.
The model may also need layer fusions (e.g., combining convolution + batch normalization), which help reduce rounding errors and improve efficiency in quantization.

2. Set Up Fake Quantization:

During QAT, "fake quantization" is applied. Fake quantization simulates quantization effects by adding quantization and dequantization steps at each layer’s weights and activations without actually converting them to lower precision.
These fake quantization modules, inserted in the forward pass, approximate how quantization would affect the values, using the target bit-width (e.g., int8) to simulate the reduced precision.

3. Train the Model:

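Training proceeds with standard backpropagation and gradient descent, but both passes differ from normal training. In the forward pass, weights and activations go through the fake quantization modules, so the loss is computed on values that have been rounded to the quantization grid. In the backward pass, rounding has zero gradient almost everywhere, so gradients are typically passed through the fake quantization step as if it were the identity (the straight-through estimator), and the underlying float32 weights are updated.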
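The forward/backward difference described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not what `prepare_qat` inserts internally: the helper name `fake_quantize`, the scale of 0.1, and the zero point of 128 are assumptions chosen for the example, and unlike PyTorch's built-in fake quantization modules this simplified version also passes gradients for values that were clamped.

```python
import torch

def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    """Quantize then dequantize x, keeping the result in float (simulated 8-bit)."""
    q = torch.clamp(torch.round(x / scale) + zero_point, quant_min, quant_max)
    x_dq = (q - zero_point) * scale
    # Straight-through estimator: the forward value is x_dq,
    # but the backward pass treats the whole op as the identity.
    return x + (x_dq - x).detach()

# Hypothetical weights, chosen so that none sits exactly between two grid points.
w = torch.tensor([0.26, -0.41, 1.03], requires_grad=True)
w_fq = fake_quantize(w, scale=0.1, zero_point=128)

loss = (w_fq ** 2).sum()   # loss is computed on the quantized values
loss.backward()            # gradient reaches the float32 weights unchanged

print(w_fq)     # values snapped to the 0.1 grid
print(w.grad)   # equals 2 * w_fq, as if rounding were not there
```

The optimizer then updates `w` in float32, which is why small gradient steps can still accumulate even though the forward pass only ever sees grid values.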





- What are the potential trade-offs when using quantization aware training, and how can they affect model deployment in resource-constrained environments?

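QAT can cost some accuracy relative to full-precision training and lengthens training time, because every forward pass runs extra fake quantization operations.
In return, models produced by QAT are smaller and can run faster at inference on hardware that supports lower-precision computation, which matters most in resource-constrained environments.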

- Compare the training results of Quantization Aware Training (QAT) and standard training, focusing on differences in training time, model accuracy, and inference speed. (I suggest using the same seed to train these two methods for consistency)

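In this run, QAT trained more slowly than standard training (about 18 s per epoch versus about 13 s) because of the extra fake quantization operations, and it ended with slightly lower evaluation accuracy (0.74 versus 0.78; F1 0.83 versus 0.85). Evaluation of the QAT model was also slower here (about 32 it/s versus 81 it/s), since the model was evaluated with its fake quantization modules still in place rather than converted to int8; the inference speedup is expected only after conversion and on hardware with int8 support.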
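For the same-seed suggestion in the question, a small helper along the following lines could be called before creating each model and dataloader. The name `set_seed` is a local, hypothetical helper for this notebook, and seeding alone does not make every CUDA kernel deterministic, so small run-to-run differences may remain.

```python
import random
import torch

def set_seed(seed: int) -> None:
    """Seed the Python and PyTorch random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

# Re-seeding with the same value reproduces the same random draws.
set_seed(0)
a = torch.rand(3)
set_seed(0)
b = torch.rand(3)
print(torch.equal(a, b))
```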

# Part 4: Advanced Quantization Techniques

## 4.1: Mixed Precision Training

Research and implement mixed precision training, from initializing a model to training and evaluating it using mixed precision. You can choose any model or dataset, or even implement your own custom neural network. The main goal of this exercise is to implement the mixed precision technique, and accuracy is not the primary concern.

In [None]:
## TODO: Finish the rows below to recreate a new model (should be same as the cell above)
## TODO: Finish the rows below
model_name ='distilbert-base-uncased'
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',  num_labels=2).to(device)
tokenizer =  DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.





In [None]:
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset

## TODO: Create the dataloaders

# Load MRPC dataset from GLUE
dataset = load_dataset('glue', 'mrpc')

# Create custom dataset class for MRPC
class MRPCDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=128):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # MRPC has sentence1 and sentence2 pairs
        sentence1 = self.dataset[idx]['sentence1']
        sentence2 = self.dataset[idx]['sentence2']
        label = self.dataset[idx]['label']

        # Tokenize sentence pairs
        encoding = self.tokenizer(
            sentence1,
            sentence2,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(label)
        }

# Create train and validation datasets
train_dataset = MRPCDataset(dataset['train'].select(range(500)), tokenizer)
val_dataset = MRPCDataset(dataset['validation'].select(range(100)), tokenizer)

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
eval_dataloader = DataLoader(val_dataset, batch_size=2)


In [None]:
import torch
from torch.cuda.amp import GradScaler, autocast

criterion = torch.nn.CrossEntropyLoss()

# Initialize model, optimizer, and scaler
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scaler = GradScaler()  # Automatically scales the loss

for epoch in range(5):
  model.train()
  # Training loop
  for batch in tqdm(train_dataloader, desc="Training"):
      # Move batch to device
      input_ids = batch['input_ids'].to(model.device)
      attention_mask = batch['attention_mask'].to(model.device)
      labels = batch['labels'].to(model.device)

      optimizer.zero_grad()  # Zero out gradients

      # Forward pass with mixed precision
      with autocast():
          outputs = model(
              input_ids=input_ids,
              attention_mask=attention_mask,
              labels=labels
          )

          # Get the loss
          loss = outputs.loss

      # Backward pass and optimization step
      scaler.scale(loss).backward()  # Scaled backpropagation
      scaler.step(optimizer)  # Optimizer step
      scaler.update()  # Update the scale for next iteration



Training: 100%|██████████| 250/250 [00:16<00:00, 14.77it/s]
Training: 100%|██████████| 250/250 [00:15<00:00, 16.00it/s]
Training: 100%|██████████| 250/250 [00:11<00:00, 21.56it/s]
Training: 100%|██████████| 250/250 [00:11<00:00, 21.56it/s]
Training: 100%|██████████| 250/250 [00:11<00:00, 21.53it/s]


In [None]:
# Evaluation loop
model.eval()  # switch off dropout so predictions are deterministic
total_eval_loss = 0
all_predictions = []
all_labels = []

for batch in tqdm(eval_dataloader, desc="Evaluating"):
    # Move batch to device
    input_ids = batch['input_ids'].to(model.device)
    attention_mask = batch['attention_mask'].to(model.device)
    labels = batch['labels'].to(model.device)

    # Forward pass with mixed precision
    with torch.no_grad():
        with autocast():
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

        # Get loss and predictions
        loss = outputs.loss
        total_eval_loss += loss.item()

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1).cpu().numpy()
        true_labels = labels.cpu().numpy()

        # Store predictions and labels
        all_predictions.extend(predictions)
        all_labels.extend(true_labels)


Evaluating: 100%|██████████| 50/50 [00:00<00:00, 89.80it/s]


In [None]:
metrics = evaluate_model(model, eval_dataloader)

Evaluating: 100%|██████████| 50/50 [00:01<00:00, 39.99it/s]


In [None]:
metrics

{'loss': 0.620005487203598, 'accuracy': 0.69, 'f1': 0.8165680473372781}

# Part 5: Summary

- How does quantization impact the trade-off between model accuracy and computational efficiency, and how can you mitigate potential accuracy losses during quantization?

Quantization improves computational efficiency by reducing model precision (e.g., from FP32 to INT8), which lowers memory usage, speeds up inference, and reduces power consumption, especially on edge devices. However, it often introduces accuracy loss due to the reduced representation of weights and activations, particularly in models with complex or non-linear operations. This trade-off can be mitigated through techniques like Quantization-Aware Training (QAT), which allows the model to learn to compensate for quantization-induced errors, or mixed-precision quantization, where critical layers retain higher precision. Additionally, robust calibration techniques and activation clipping during Post-Training Quantization (PTQ) can help minimize the impact on accuracy while maintaining efficiency.
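The effect of calibration range and clipping mentioned above can be illustrated with a small pure-Python sketch. The activation values, the single outlier at 50, and the helper names are all made up for the illustration; the point is only that letting one outlier set the range makes the grid much coarser for everything else.

```python
def quantize_dequantize(values, lo, hi, num_levels=256):
    """Affine-quantize values onto num_levels levels over [lo, hi], then map back to floats."""
    scale = (hi - lo) / (num_levels - 1)
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)          # values outside [lo, hi] saturate
        q = round((clipped - lo) / scale)      # integer level
        out.append(lo + q * scale)             # dequantized value
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical activations: 1000 values spread over [-1, 1) plus one outlier at 50.
acts = [((i * 37) % 1000) / 500.0 - 1.0 for i in range(1000)] + [50.0]

full_range = quantize_dequantize(acts, min(acts), max(acts))  # range set by the outlier
clipped = quantize_dequantize(acts, -1.0, 1.0)                # range clipped to the bulk

# Error on the 1000 typical values only (the outlier is excluded).
err_full = mean_abs_error(acts[:-1], full_range[:-1])
err_clip = mean_abs_error(acts[:-1], clipped[:-1])
print(f"full range: {err_full:.4f}, clipped: {err_clip:.4f}")
```

Clipping is itself a trade-off: the outlier saturates at 1.0 and its own error becomes large, so the clipping threshold is usually chosen from calibration data rather than fixed by hand.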

- Explain the differences between post-training quantization and quantization-aware training (QAT). In what scenarios might one be preferred over the other, and why?

PTQ is simpler and faster, applying quantization to a pre-trained model without retraining, making it ideal for time-sensitive or less accuracy-critical tasks. However, it may result in significant accuracy loss, especially for complex models or tasks with non-linear operations. PTQ is best suited for simpler architectures or when only a calibration dataset is available.

In contrast, QAT simulates quantization during training, allowing the model to learn to mitigate precision loss. This approach retains higher accuracy and works well for complex models and critical applications, such as autonomous driving or medical imaging.

When to Use PTQ:

- If you need a quick deployment with moderate accuracy loss.
- For smaller models or less accuracy-sensitive tasks.
- If the training dataset is unavailable or proprietary.

When to Use QAT:

- For critical applications where accuracy cannot be compromised.
- For complex architectures like transformers or object detectors.
- If you have the resources and dataset to retrain the model.


- What are the challenges when quantizing models with layers that involve non-linear operations, like activation functions, and how might these challenges affect real-world applications?

Challenges in Quantizing Non-Linear Operations

Loss of Precision:

- Activation functions often compress a wide range of inputs into a limited output range.
- ReLU outputs are non-negative and unbounded above, so a few large activations can stretch the quantization range and leave coarse resolution for the many small or zero values.
- Sigmoid and tanh outputs are bounded within narrow ranges, so in their saturated regions quantization maps many distinct inputs to the same level, erasing small input variations.

Dynamic Range Variability:

- Non-linear activations can produce outputs whose ranges vary widely with the inputs and the layer.
- Fixed quantization schemes (e.g., INT8) struggle to represent these dynamic ranges, leading to saturation, where values are clipped, or vanishing outputs, where distinct values map to the same quantized level.
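A small sketch of the saturation problem for sigmoid, assuming an unsigned 8-bit output with scale 1/255 and zero point 0 (an assumption for illustration; real backends may pick a different scale). The helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def quantize_uint8(y):
    """Map a value in [0, 1] to one of 256 levels (scale = 1/255, zero point = 0)."""
    return max(0, min(255, round(y * 255)))

# Near zero, nearby inputs still get different levels;
# in the saturated region, different inputs collapse onto one level.
for x in (0.0, 0.1, 8.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.6f}  level={quantize_uint8(sigmoid(x))}")
```

Here the inputs 8.0 and 10.0 land on the same top level, so any downstream layer that depended on the difference between them no longer sees it.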

Impact on Real-World Applications

Accuracy Drop:

- Applications requiring high precision, such as medical imaging, autonomous driving, or financial forecasting, may experience significant performance degradation due to the loss of precision in non-linear layers.

Unpredictable Behavior:

- Quantization can cause models to behave unpredictably under certain inputs, such as producing unstable or extreme outputs for previously unseen data. This is critical in safety-critical systems like drones or industrial robotics.

Deployment Constraints:

- The inability to fully quantize non-linear layers can limit the model's compatibility with low-power edge devices that lack floating-point units, or can require significant additional engineering effort.



- In the context of deploying models on mobile or embedded devices, how does quantization help meet the hardware constraints, and what are the potential limitations or concerns when quantizing for edge devices?

Quantization is a critical technique for deploying machine learning models on mobile or embedded devices, as it reduces the size and computational requirements of the model, helping to meet the constraints of such hardware.
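As a rough back-of-the-envelope check on the size reduction, weight memory scales with bytes per parameter. The sketch below assumes a parameter count of about 66 million, an approximate figure for a DistilBERT-sized model, and ignores activations, quantization parameters, and any layers kept in higher precision.

```python
BYTES_PER_DTYPE = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_mb(num_params, dtype):
    """Approximate memory for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1e6

num_params = 66_000_000  # assumed, approximate parameter count
for dtype in BYTES_PER_DTYPE:
    print(f"{dtype}: {weight_memory_mb(num_params, dtype):.0f} MB")
```

Under these assumptions, int8 weights take a quarter of the float32 footprint, which is often the difference between fitting and not fitting in the memory of a mobile or embedded device.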

Potential Limitations or Concerns

1. Accuracy Degradation: Converting from high to low precision can lead to a loss of information, especially for models sensitive to small changes in weights or activations. This may cause a noticeable drop in model accuracy, particularly in tasks requiring high precision, like fine-grained image recognition.

2. Compatibility Issues: Not all neural network layers or operations are easily quantizable, and some models may have components that do not support quantization well. Certain activation functions or custom layers may need additional adjustments or approximations to be quantized.

3. Quantization-Aware Training (QAT) Overhead: QAT can be computationally expensive and time-consuming.

4. Hardware Support: Not all edge devices support efficient execution of quantized models. Some older devices may lack the necessary hardware accelerators or libraries for INT8 operations.

