<a href="https://colab.research.google.com/github/sreekarvamsi/Model_Finetuning_n_Quantization/blob/main/Quantized_BERT_for_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Quantized BERT for Text Classification

The goal is to demonstrate the benefits of post-training quantization on a fine-tuned BERT model. We'll fine-tune bert-base-uncased for sentiment analysis, quantize it to INT8, and then compare the model size, inference speed, and accuracy before and after quantization.


## Step 1: Setup and Installation
First, let's install the necessary libraries. We'll use Hugging Face's transformers for the model, datasets for the data, evaluate for metrics, and optimum for the quantization process with ONNX Runtime.

In [None]:
!pip install transformers[torch] datasets evaluate optimum[onnxruntime]

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting optimum[onnxruntime]
  Downloading optimum-1.27.0-py3-none-any.whl.metadata (16 kB)
Collecting onnx (from optimum[onnxruntime])
  Downloading onnx-1.19.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (7.0 kB)
Collecting onnxruntime>=1.11.0 (from optimum[onnxruntime])
  Downloading onnxruntime-1.22.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.9 kB)
INFO: pip is looking at multiple versions of optimum[onnxruntime] to determine which version is compatible with other requirements. This could take a while.
Collecting optimum[onnxruntime]
  Downloading optimum-1.26.1-py3-none-any.whl.metadata (16 kB)
  Downloading optimum-1.26.0-py3-none-any.whl.metadata (16 kB)
  Downloading optimum-1.25.3-py3-none-any.whl.metadata (16 kB)
  Downloading optimum-1.25.2-py3-none-any.whl.metadata (16 kB)
  Downloading optimum-1.25.1-py3-none-any.whl.metadata (16 kB)
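
The log above shows pip trying several optimum releases before settling on one, so it can help to record which versions were actually installed. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Print the resolved version of each library used in this notebook
for name in ["transformers", "datasets", "evaluate", "optimum", "onnxruntime"]:
    print(f"{name}: {installed_version(name)}")
```

Saving these versions next to the benchmark numbers makes the results easier to reproduce later.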


## Step 2: Load and Prepare the Dataset
We'll use the IMDB dataset, a classic benchmark for binary text classification (positive/negative movie reviews).

Load the Dataset: We'll load the dataset and create a smaller subset for faster fine-tuning, which is ideal for a one-day project.

Load Tokenizer: Load the bert-base-uncased tokenizer to preprocess our text.

Tokenize Data: Apply the tokenizer to the dataset.
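
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a simplified pure-Python sketch. It assumes token ids are already available and uses 0 as the padding id (the `[PAD]` id in bert-base-uncased). The function `pad_or_truncate` is hypothetical, and unlike the real tokenizer it does not preserve the closing `[SEP]` token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or pad it up to max_length.

    Returns the fixed-length ids and an attention mask that is 1 for
    real tokens and 0 for padding.
    """
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded, a long one is cut
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `padding="max_length"`, every review is padded to the model's maximum length (512 tokens for bert-base-uncased), which keeps tensor shapes uniform at the cost of extra computation on short reviews.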

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load dataset and create a smaller sample for quick training
imdb_dataset = load_dataset("imdb")
small_train_dataset = imdb_dataset["train"].shuffle(seed=42).select(range(1000))
small_test_dataset = imdb_dataset["test"].shuffle(seed=42).select(range(500))

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Preprocessing function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length")

# Apply tokenizer to the datasets
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Step 3: Fine-Tune the BERT Model
Now, we'll fine-tune the standard bert-base-uncased model on our prepared dataset.

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np
import evaluate
import os
import time

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Define directory to save the model
FP32_MODEL_DIR = "./models/bert-fp32"

# Define training arguments
training_args = TrainingArguments(
    output_dir=FP32_MODEL_DIR,
    eval_strategy="epoch",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)

# Define metrics
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Create Trainer and start fine-tuning
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

trainer.train()

# Save the final fine-tuned model
trainer.save_model(FP32_MODEL_DIR)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.478561 | 0.86 |
| 2 | No log | 0.386827 | 0.892 |
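
The step counts behind these results can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and it relies on the `TrainingArguments` default of `logging_steps=500`, which the notebook does not override.

```python
import math

train_samples = 1000   # size of small_train_dataset
batch_size = 8         # per_device_train_batch_size
epochs = 2             # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
default_logging_steps = 500  # assumed default, not set in the notebook

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Training loss logged at least once: {total_steps >= default_logging_steps}")
```

This matches the `checkpoint-125` and `checkpoint-250` directories saved at the end of each epoch, and it likely explains the "No log" entries: training finishes before the first logging step is reached.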


In [None]:
!ls -lR ./models/

./models/:
total 4
drwxr-xr-x 4 root root 4096 Sep  2 19:23 bert-fp32

./models/bert-fp32:
total 427716
drwxr-xr-x 2 root root      4096 Sep  2 19:21 checkpoint-125
drwxr-xr-x 2 root root      4096 Sep  2 19:23 checkpoint-250
-rw-r--r-- 1 root root       687 Sep  2 19:23 config.json
-rw-r--r-- 1 root root 437958648 Sep  2 19:23 model.safetensors
-rw-r--r-- 1 root root      5777 Sep  2 19:23 training_args.bin

./models/bert-fp32/checkpoint-125:
total 1283252
-rw-r--r-- 1 root root       687 Sep  2 19:21 config.json
-rw-r--r-- 1 root root 437958648 Sep  2 19:21 model.safetensors
-rw-r--r-- 1 root root 876041611 Sep  2 19:21 optimizer.pt
-rw-r--r-- 1 root root     14645 Sep  2 19:21 rng_state.pth
-rw-r--r-- 1 root root      1465 Sep  2 19:21 scheduler.pt
-rw-r--r-- 1 root root      1041 Sep  2 19:21 trainer_state.json
-rw-r--r-- 1 root root      5777 Sep  2 19:21 training_args.bin

./models/bert-fp32/checkpoint-250:
total 1283248
-rw-r--r-- 1 root root       687 Sep  2 19:23 config.json
-

## Step 4: Baseline Evaluation (FP32 Model)
Before quantizing, we need to measure the performance of our original, full-precision (FP32) model.

Model Size: Check the size of the saved model.safetensors file.

Inference Latency: Time how long it takes to run predictions on the test set.

Accuracy: Evaluate the model's accuracy on the test set.
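
One way to keep the latency and accuracy measurements consistent between models is to wrap them in a single helper. The sketch below is an illustration only: `benchmark` and `toy_predict` are hypothetical names, and the toy predictor stands in for a real model call.

```python
import time

def benchmark(predict_fn, samples, labels):
    """Time predict_fn on each sample and compare its output to the label."""
    total_seconds = 0.0
    correct = 0
    for sample, label in zip(samples, labels):
        start = time.perf_counter()
        prediction = predict_fn(sample)
        total_seconds += time.perf_counter() - start
        correct += int(prediction == label)
    n = len(samples)
    return {"latency_ms": total_seconds / n * 1000, "accuracy": correct / n}

# Toy stand-in for a model: predicts 1 when the text contains "good"
def toy_predict(text):
    return 1 if "good" in text else 0

result = benchmark(toy_predict, ["good film", "bad film", "good plot"], [1, 0, 0])
print(result)
```

Only the model call sits inside the timed region, so input preparation does not inflate the latency figure. `time.perf_counter()` is used here because it is intended for measuring short intervals.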

In [None]:
# 1. Measure Model Size
fp32_model_size = os.path.getsize(os.path.join(FP32_MODEL_DIR, "model.safetensors")) / (1024 * 1024)
print(f"FP32 Model Size: {fp32_model_size:.2f} MB")

# 2. Measure Inference Latency and Accuracy
# Run on CPU so the FP32 and INT8 models are timed on the same hardware
device = "cpu"
model.to(device)
model.eval()

total_time = 0
correct_predictions = 0
num_samples = len(tokenized_test)

with torch.no_grad():
    for i in range(num_samples):
        # Convert each tokenized field to a tensor and add a batch dimension
        inputs = {k: torch.tensor(v).to(device).unsqueeze(0) for k, v in tokenized_test[i].items() if k in tokenizer.model_input_names}
        start_time = time.time()
        outputs = model(**inputs)
        total_time += time.time() - start_time

        prediction = torch.argmax(outputs.logits, dim=-1).item()
        if prediction == tokenized_test[i]["label"]:
            correct_predictions += 1

fp32_latency = (total_time / num_samples) * 1000 # Average latency in ms
fp32_accuracy = correct_predictions / num_samples
print(f"FP32 Average Latency: {fp32_latency:.2f} ms")
print(f"FP32 Accuracy: {fp32_accuracy:.4f}")

FP32 Model Size: 417.67 MB
FP32 Average Latency: 1568.41 ms
FP32 Accuracy: 0.8920
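
The reported size is consistent with 32-bit storage. The sketch below takes the byte count of model.safetensors from the directory listing above and assumes 4 bytes per parameter, ignoring the small file header. Note that dividing by 1024 * 1024 gives mebibytes, which the notebook labels as MB.

```python
size_bytes = 437_958_648               # model.safetensors size from the ls output
size_mib = size_bytes / (1024 * 1024)

# Assumption: every parameter is stored as a 4-byte float
approx_params = size_bytes // 4

# Rough lower bound if every parameter were stored in a single byte instead
int8_estimate_mib = approx_params / (1024 * 1024)

print(f"FP32 size: {size_mib:.2f} MiB")
print(f"Approximate parameter count: {approx_params:,}")
print(f"One byte per parameter would need about {int8_estimate_mib:.2f} MiB")
```

This gives roughly 109 million parameters, in line with the size of BERT-base, and suggests that an INT8 model could be about a quarter of the FP32 size at best.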


In [11]:
import torch

# Define the human-readable labels in the correct order (0: NEGATIVE, 1: POSITIVE)
labels = ["NEGATIVE", "POSITIVE"]

def predict(text, model, tokenizer):
    """
    Takes a text sentence and a model, and returns the predicted sentiment.
    """
    # 1. Tokenize the input text and convert to tensors
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # 2. Run inference
    # The .no_grad() is important as we are not training
    with torch.no_grad():
        outputs = model(**inputs)

    # 3. Get the prediction
    # The raw output is logits; torch.argmax finds the index of the highest score
    prediction_index = torch.argmax(outputs.logits, dim=-1).item()

    # 4. Decode the prediction
    return labels[prediction_index]

In [12]:
# --- Define some test sentences ---
positive_sentence = "I absolutely loved this movie, the acting was brilliant and the story was gripping!"
negative_sentence = "This was a complete waste of time. The plot was predictable and the characters were boring."

# --- Test the Original FP32 Model ---
# Ensure your FP32 'model' is loaded and on the CPU for a fair comparison if needed
# model.to("cpu")
fp32_prediction_pos = predict(positive_sentence, model, tokenizer)
fp32_prediction_neg = predict(negative_sentence, model, tokenizer)

print("--- Testing FP32 PyTorch Model ---")
print(f"Sentence: '{positive_sentence}'")
print(f"Prediction: {fp32_prediction_pos}") # Expected: POSITIVE
print("-" * 20)
print(f"Sentence: '{negative_sentence}'")
print(f"Prediction: {fp32_prediction_neg}") # Expected: NEGATIVE
print("\n" + "="*40 + "\n")

--- Testing FP32 PyTorch Model ---
Sentence: 'I absolutely loved this movie, the acting was brilliant and the story was gripping!'
Prediction: POSITIVE
--------------------
Sentence: 'This was a complete waste of time. The plot was predictable and the characters were boring.'
Prediction: NEGATIVE




## Step 5: Apply Post-Training Quantization (INT8)
Now for the core step. We'll use the optimum library to convert our PyTorch model to a quantized ONNX model with post-training dynamic quantization: the weights are converted to INT8 ahead of time, while activations are quantized on the fly at inference, so no calibration dataset is needed.

The process has two stages.

First, an export step uses main_export from optimum.exporters.onnx to convert the fine-tuned PyTorch model in FP32_MODEL_DIR to ONNX. We save this ONNX version to a separate directory, ONNX_FP32_MODEL_DIR.

Second, the ORTQuantizer is loaded from that ONNX directory, which contains the .onnx file it needs, rather than from the PyTorch checkpoint directory.
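
To give an intuition for what converting weights to INT8 means, here is a toy pure-Python sketch of symmetric per-tensor quantization. It is an assumption-laden simplification and not what ONNX Runtime does internally; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # small integers that fit in one byte each
print(restored)   # close to, but not exactly, the original weights
print(max_error)  # bounded by half the scale
```

Each weight now needs one byte instead of four, and the rounding error per weight is at most half the scale. Very small weights such as 0.003 can round to zero, which is one reason quantization can cost some accuracy.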



In [None]:
from optimum.exporters.onnx import main_export
from pathlib import Path

# --- 1. EXPORT the fine-tuned PyTorch model to ONNX format ---

# Define where to save the ONNX model
ONNX_FP32_MODEL_DIR = Path("./models/bert-onnx-fp32")

# Use the main_export function for more control
main_export(
    model_name_or_path=FP32_MODEL_DIR,
    output=ONNX_FP32_MODEL_DIR,
    task="text-classification",  # We explicitly define the task here
    opset=14,                    # And we explicitly set the opset version
)

# Note: The tokenizer is not automatically saved with main_export, so we save it manually.
tokenizer.save_pretrained(ONNX_FP32_MODEL_DIR)

print(f"PyTorch model exported to ONNX format at: {ONNX_FP32_MODEL_DIR}")


# --- 2. QUANTIZE the exported ONNX model ---

from optimum.onnxruntime import ORTQuantizer, AutoQuantizationConfig

INT8_MODEL_DIR = "./models/bert-int8"

# Create the quantizer from the ONNX export directory, not the PyTorch checkpoint directory
quantizer = ORTQuantizer.from_pretrained(ONNX_FP32_MODEL_DIR)

# Define the quantization configuration
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Apply quantization
quantizer.quantize(
    save_dir=INT8_MODEL_DIR,
    quantization_config=dqconfig,
)

print(f"Quantized INT8 model saved to: {INT8_MODEL_DIR}")

Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.8.0+cu126
Overriding 1 configuration item(s)
	- use_cache -> False
  inverted_mask = torch.tensor(1.0, dtype=dtype) - expanded_mask
Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating ONNX model models/bert-onnx-fp32/model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] (2, 2) matches (2, 2)
		-[✓] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: models/bert-onnx-fp32
Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)


PyTorch model exported to ONNX format at: models/bert-onnx-fp32


Quantizing model...
Saving quantized model at: models/bert-int8 (external data format: False)
Configuration saved in models/bert-int8/ort_config.json


Quantized INT8 model saved to: ./models/bert-int8
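
The configuration above sets `per_channel=False`, so each weight tensor shares a single scale. The toy sketch below illustrates the trade-off with a hypothetical two-row matrix whose rows differ in magnitude. It is a simplified symmetric scheme written for illustration, not ONNX Runtime's implementation, and the helper names are assumptions.

```python
def scale_for(values):
    """Symmetric INT8 scale covering the largest absolute value."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0

def roundtrip_error(values, scale):
    """Largest error after quantizing to integers and converting back."""
    return max(abs(v - round(v / scale) * scale) for v in values)

# Row 0 holds small weights, row 1 holds weights 100 times larger
matrix = [[0.01, -0.02, 0.015], [1.0, -2.0, 1.5]]

# Per-tensor: one scale for the whole matrix, dominated by row 1
per_tensor_scale = scale_for([v for row in matrix for v in row])
per_tensor_err_row0 = roundtrip_error(matrix[0], per_tensor_scale)

# Per-channel: row 0 gets its own, much finer scale
per_channel_err_row0 = roundtrip_error(matrix[0], scale_for(matrix[0]))

print(f"Row 0 error with per-tensor scale:  {per_tensor_err_row0:.6f}")
print(f"Row 0 error with per-channel scale: {per_channel_err_row0:.6f}")
```

Per-channel scales preserve small-magnitude rows better but store more scale values. Whether that matters for this model would need to be checked by rerunning the evaluation with `per_channel=True`.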


## Step 6: Evaluate the Quantized Model (INT8)
Finally, we evaluate the quantized INT8 model and compare its performance to the FP32 baseline.

In [None]:
import torch
import time
from optimum.onnxruntime import ORTModelForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset

# 1. Measure Model Size
int8_model_size = os.path.getsize(os.path.join(INT8_MODEL_DIR, "model_quantized.onnx")) / (1024 * 1024)
print(f"INT8 Quantized Model Size: {int8_model_size:.2f} MB")

# 2. Measure Inference Latency and Accuracy using BATCHES
quantized_model = ORTModelForSequenceClassification.from_pretrained(INT8_MODEL_DIR)

# Build a batched DataLoader. BERT takes token_type_ids in addition to
# input_ids and attention_mask, so all three are included.
batch_size = 32
input_ids = torch.tensor([item['input_ids'] for item in tokenized_test])
attention_mask = torch.tensor([item['attention_mask'] for item in tokenized_test])
token_type_ids = torch.tensor([item['token_type_ids'] for item in tokenized_test])
# Named eval_labels so the `labels` list used by predict() is not overwritten
eval_labels = torch.tensor([item['label'] for item in tokenized_test])
eval_dataset = TensorDataset(input_ids, attention_mask, token_type_ids, eval_labels)
eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size)

total_time_quantized = 0
correct_predictions_quantized = 0

for batch in eval_dataloader:
    batch_input_ids, batch_attention_mask, batch_token_type_ids, batch_labels = batch

    inputs = {
        "input_ids": batch_input_ids,
        "attention_mask": batch_attention_mask,
        "token_type_ids": batch_token_type_ids,
    }

    start_time = time.time()
    outputs = quantized_model(**inputs)
    total_time_quantized += time.time() - start_time

    predictions = torch.argmax(outputs.logits, dim=-1)
    correct_predictions_quantized += torch.sum(predictions == batch_labels).item()

# --- Calculate final metrics ---
num_samples = len(tokenized_test)
int8_latency = (total_time_quantized / num_samples) * 1000
int8_accuracy = correct_predictions_quantized / num_samples

print(f"INT8 Average Latency: {int8_latency:.2f} ms")
print(f"INT8 Accuracy: {int8_accuracy:.4f}")

INT8 Quantized Model Size: 105.24 MB
INT8 Average Latency: 1173.75 ms
INT8 Accuracy: 0.8920


In [13]:
# --- Define some test sentences ---
positive_sentence = "I absolutely loved this movie, the acting was brilliant and the story was gripping!"
negative_sentence = "This was a complete waste of time. The plot was predictable and the characters were boring."


# --- Test the Quantized INT8 Model ---
# Ensure your 'quantized_model' is loaded
int8_prediction_pos = predict(positive_sentence, quantized_model, tokenizer)
int8_prediction_neg = predict(negative_sentence, quantized_model, tokenizer)

print("--- Testing INT8 ONNX Model ---")
print(f"Sentence: '{positive_sentence}'")
print(f"Prediction: {int8_prediction_pos}") # Expected: POSITIVE
print("-" * 20)
print(f"Sentence: '{negative_sentence}'")
print(f"Prediction: {int8_prediction_neg}") # Expected: NEGATIVE

--- Testing INT8 ONNX Model ---
Sentence: 'I absolutely loved this movie, the acting was brilliant and the story was gripping!'
Prediction: POSITIVE
--------------------
Sentence: 'This was a complete waste of time. The plot was predictable and the characters were boring.'
Prediction: NEGATIVE


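## Step 7: Final Results
Finally, we compile the measurements into a single comparison. Keep in mind that the FP32 latency was measured one sample at a time in PyTorch, while the INT8 latency was measured in batches of 32 in ONNX Runtime and averaged per sample, so the latency ratio reflects the runtime and batching as well as quantization.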

In [None]:
# Calculate the performance changes
size_reduction = fp32_model_size / int8_model_size
latency_reduction = fp32_latency / int8_latency
accuracy_drop = (fp32_accuracy - int8_accuracy) * 100

print("---" * 10)
print("✅ Project Results Summary ✅")
print("---" * 10)
print(f"Reduced model size by {size_reduction:.2f}x (from {fp32_model_size:.2f} MB to {int8_model_size:.2f} MB).")
print(f"Reduced inference latency by {latency_reduction:.2f}x (from {fp32_latency:.2f} ms to {int8_latency:.2f} ms).")
print(f"Accuracy drop: {accuracy_drop:.2f}%.")
print("---" * 10)

------------------------------
✅ Project Results Summary ✅
------------------------------
Reduced model size by 3.97x (from 417.67 MB to 105.24 MB).
Reduced inference latency by 1.34x (from 1568.41 ms to 1173.75 ms).
Accuracy drop: 0.00%.
------------------------------
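
The summary reports ratios such as 3.97x. The same numbers can also be expressed as percentage reductions, which some readers find easier to compare. The sketch below uses the values printed above; `percent_reduction` is a hypothetical helper name.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100

size_cut = percent_reduction(417.67, 105.24)       # model size in MB
latency_cut = percent_reduction(1568.41, 1173.75)  # average latency in ms

print(f"Model size reduced by {size_cut:.1f}%")
print(f"Inference latency reduced by {latency_cut:.1f}%")
```

A 3.97x size ratio corresponds to roughly a 75% reduction, while a 1.34x latency ratio corresponds to roughly a 25% reduction.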
