# Exercise: Quantization Trade-offs - Memory, Speed, and Quality

**Welcome to the Exercise!**

In the demo, we saw how easy it is to reduce a model's memory footprint using Post-Training Quantization (PTQ). But memory is only one part of the story. In this exercise, we will perform a comprehensive analysis of the real-world trade-offs.

**Our Goal:**
To systematically measure and analyze the impact of different quantization levels on a `GPT-2` model across three key dimensions:
1.  **Memory Footprint:** How much VRAM/RAM does the model use?
2.  **Inference Speed (Latency):** How quickly does the model generate text?
3.  **Output Quality:** Does the generated text become less coherent or accurate?

By the end, you'll have a clear, data-driven understanding of the pros and cons of each precision level.

## 1. Environment Setup

First, let's install the necessary libraries. We need `transformers`, `torch`, `bitsandbytes` for quantization, `accelerate` to help manage device placement, and `pandas` to tabulate the results.

In [1]:
!pip install transformers torch bitsandbytes accelerate pandas



## 2. Imports and High-Level Configuration

Next, we'll import our libraries and define the parameters for our experiment, such as the model we'll use and the prompts for testing quality.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time
import pandas as pd
import gc

MODEL_NAME = "gpt2" # Using the standard GPT-2 model for this exercise

# We'll use a few diverse prompts to test the model's quality
PROMPTS = [
    "The capital of France is",
    "Once upon a time, in a land far, far away,",
    "To be or not to be, that is the"
]

MAX_NEW_TOKENS = 50
NUM_TIMING_RUNS = 3 # Run generation 3 times and average the latency

## 3. Helper Functions

We'll create two helper functions to keep our main loop clean:
1.  `get_model_memory_footprint`: Calculates the model's size in megabytes (MB).
2.  `run_generation_test`: Generates text for a given prompt, measures the latency, and returns both.

In [8]:
def get_model_memory_footprint(model):
    """Calculates and returns the model's memory footprint in MB."""
    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
    mem_bufs = sum(buf.nelement() * buf.element_size() for buf in model.buffers())
    total_mem_bytes = mem_params + mem_bufs
    return total_mem_bytes / (1024 ** 2) # Convert bytes to MB (here 1 MB = 1024**2 bytes)

def run_generation_test(model, tokenizer, prompt, max_new_tokens):
    """Generates text and returns the generated text and the latency in seconds."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Wait for pending GPU work so the timer covers only this generation call
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start_time = time.perf_counter()  # Monotonic, high-resolution clock
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    end_time = time.perf_counter()

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    latency = end_time - start_time
    return generated_text, latency
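
A single timed call can be distorted by one-time costs such as CUDA initialization, so it can help to discard a few warm-up calls and report the median instead of the mean. The helper below is a minimal sketch of that idea, not part of the original exercise: `time_callable`, `warmup_runs`, `timed_runs`, and `sync` are hypothetical names, and the workload is a stand-in so the sketch runs without a model or GPU.

```python
import statistics
import time


def time_callable(fn, warmup_runs=1, timed_runs=5, sync=None):
    """Return the median wall-clock latency of fn() in seconds.

    The first warmup_runs calls are executed and discarded.
    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize) invoked before each clock read.
    """
    for _ in range(warmup_runs):
        fn()

    latencies = []
    for _ in range(timed_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)


# Stand-in workload (an assumption for illustration, not a real model call)
calls = []
median_latency = time_callable(
    lambda: calls.append(sum(range(10_000))),
    warmup_runs=2,
    timed_runs=5,
)
print(f"median latency: {median_latency:.6f} s ({len(calls)} calls in total)")
```

In the notebook, the same helper could wrap `run_generation_test(model, tokenizer, PROMPTS[0], MAX_NEW_TOKENS)` in a `lambda` and pass `sync=torch.cuda.synchronize` when running on a GPU.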

## 4. Define the Experiment Configurations

Here, we'll define all the different precision levels we want to test. We create a list of dictionaries, where each dictionary represents one experiment run. This makes our code clean and easy to modify.

We will use the modern `BitsAndBytesConfig` for quantization, which is the recommended approach.

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

experiment_configs = [
    {
        "name": "FP16 (Baseline)",
        "load_kwargs": {"torch_dtype": torch.float16}
    },
    {
        "name": "INT8",
        # 8-bit loading does not use the bnb_4bit_* options, so they are omitted here
        "load_kwargs": {"quantization_config": BitsAndBytesConfig(load_in_8bit=True)}
    },
    {
        "name": "NF4",
        "load_kwargs": {"quantization_config": BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")}
    },
    {
        "name": "FP4",
        "load_kwargs": {"quantization_config": BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="fp4")}
    },
]

# If on CPU, we can only run the baseline model without quantization.
# We also change the baseline to FP32, as FP16 is not well-supported on CPU.
if device.type == "cpu":
    print("\nRunning on CPU. Only the FP32 baseline configuration will be tested.")
    experiment_configs = [
        {
            "name": "FP32 (Baseline)",
            "load_kwargs": {"torch_dtype": torch.float32}
        }
    ]
else:
    # On GPU, we always want to use device_map="auto" for efficient placement
    for config in experiment_configs:
        config["load_kwargs"]["device_map"] = "auto"

Using device: cuda
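
To build intuition for what these configurations trade away, here is a deliberately simplified sketch of absmax INT8 quantization: every value is divided by a shared scale, rounded to an integer in [-127, 127], and multiplied back when needed. This is an illustration only. It is not the algorithm `bitsandbytes` implements (its 8-bit and 4-bit schemes are more elaborate), and the function names and weight values are made up for the example.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.02, -0.51, 0.33, 1.27, -0.08]  # made-up example values
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

# Rounding moves each value by at most half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is why a tensor with a few very large values (a large scale) loses more detail in its small values.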


## 5. Run the Experiment

This is the main loop of our exercise. We will iterate through each configuration, load the model, and perform our measurements.

For each configuration, we will:
1.  Load the `GPT-2` model with the specified precision.
2.  Measure and record its memory footprint.
3.  Run text generation multiple times on the first prompt to get a stable average latency.
4.  Generate text for all prompts to assess output quality.
5.  Store all the results and clean up memory before the next run.

In [11]:
results_log = []

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"\n--- Starting Experiment for Model: {MODEL_NAME} ---")

for config in experiment_configs:
    config_name = config['name']
    print(f"\n{'='*50}\nRunning Configuration: {config_name}\n{'='*50}")
    
    try:
        # Load the model with the specified arguments
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, **config['load_kwargs'])
        if device.type == 'cpu': model.to(device) # Manually move to CPU if not using device_map

        # 1. Measure Memory
        memory_mb = get_model_memory_footprint(model)
        print(f"Memory Footprint: {memory_mb:.2f} MB")

        # 2. Measure Latency
        latencies = []
        for _ in range(NUM_TIMING_RUNS):
            _, latency = run_generation_test(model, tokenizer, PROMPTS[0], MAX_NEW_TOKENS)
            latencies.append(latency)
        avg_latency = sum(latencies) / len(latencies)
        print(f"Avg. Latency: {avg_latency:.4f} s (over {NUM_TIMING_RUNS} runs)")

        # 3. Assess Quality
        generated_outputs = {}
        print("\n--- Generated Outputs for Quality Assessment ---")
        for prompt in PROMPTS:
            generated_text, _ = run_generation_test(model, tokenizer, prompt, MAX_NEW_TOKENS)
            generated_outputs[prompt] = generated_text
            print(f"Prompt: {prompt}\nGenerated: {generated_text}\n")

        # Log results
        results_log.append({
            "Precision": config_name,
            "Memory (MB)": memory_mb,
            "Avg Latency (s)": avg_latency,
            "Outputs": generated_outputs
        })

        # Clean up to free memory for the next run
        del model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    except Exception as e:
        print(f"ERROR: Could not run configuration {config_name}. Error: {e}")


--- Starting Experiment for Model: gpt2 ---

Running Configuration: FP16 (Baseline)
Memory Footprint: 249.35 MB
Avg. Latency: 3.3988 s (over 3 runs)

--- Generated Outputs for Quality Assessment ---
Prompt: The capital of France is
Generated: The capital of France is the capital of the French Republic, and the capital of the French Republic is the capital of the French Republic.

The French Republic is the capital of the French Republic.

The French Republic is the capital of the French Republic.



Prompt: Once upon a time, in a land far, far away,
Generated: Once upon a time, in a land far, far away, the world was a land of the dead, and the dead were the living.

The dead were the living, and the living were the living.

The dead were the living, and the living were the living.

The dead

Prompt: To be or not to be, that is the
Generated: To be or not to be, that is the question.

The question is, what is the difference between a "good" and a "bad" person?

The answer is, that the 

## 6. Consolidate and Analyze Results

Now that the experiment is complete, let's organize our findings into a table and analyze the trade-offs.

In [12]:
# Create a DataFrame for the main metrics (Memory and Latency)
df_results = pd.DataFrame(results_log)[["Precision", "Memory (MB)", "Avg Latency (s)"]]

print("--- Experiment Results Summary ---")
print(df_results.to_string())

--- Experiment Results Summary ---
         Precision  Memory (MB)  Avg Latency (s)
0  FP16 (Baseline)   249.350121         3.398847
1             INT8   168.350121         1.229003
2              NF4   127.850121         0.553997
3              FP4   127.850121         0.555299
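
Absolute numbers are easier to interpret relative to the baseline. The sketch below copies the values from the summary table into a plain dictionary (the key names `memory_mb` and `latency_s` are chosen for illustration) and derives each configuration's memory as a fraction of FP16 and its speedup over FP16.

```python
# Values copied from the results summary table above
results = {
    "FP16 (Baseline)": {"memory_mb": 249.350121, "latency_s": 3.398847},
    "INT8": {"memory_mb": 168.350121, "latency_s": 1.229003},
    "NF4": {"memory_mb": 127.850121, "latency_s": 0.553997},
    "FP4": {"memory_mb": 127.850121, "latency_s": 0.555299},
}

baseline = results["FP16 (Baseline)"]
relative = {}
for name, row in results.items():
    relative[name] = {
        # Fraction of the baseline memory footprint
        "memory_ratio": row["memory_mb"] / baseline["memory_mb"],
        # How many times faster than the baseline
        "speedup": baseline["latency_s"] / row["latency_s"],
    }
    print(
        f"{name:<16} memory = {relative[name]['memory_ratio']:.1%} of FP16, "
        f"speedup = {relative[name]['speedup']:.2f}x"
    )
```

These ratios are only as reliable as the underlying measurements, so the speedup column in particular inherits any noise in the 3-run latency averages.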


## 7. Analysis and Discussion

Based on the results table and the generated text, let's analyze our findings.

### Guiding Questions

1.  **Memory Scaling**: How did the memory footprint scale as you reduced precision? Were the reductions what you expected (e.g., INT8 being ~50% of FP16)?
2.  **Latency Changes**: Did latency always decrease with lower precision? Why do you think this happens?
3.  **Quality Degradation**: At what precision level (if any) did you start to notice a significant drop in the quality of the generated text (e.g., repetition, incoherence)? Which prompts were most affected?
4.  **Key Trade-offs**: Summarize the key trade-offs. If you had to deploy this GPT-2 model on a server where both speed and quality mattered, which precision would you choose and why? What if you were deploying it on a mobile device with very limited memory?
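
Question 3 relies on judging repetition by eye. One rough way to put a number on it is the fraction of word n-grams that are unique in a generated text. The sketch below is a simple heuristic under that assumption, not a standard part of this exercise; `distinct_ngram_ratio` is a hypothetical helper, and the sample strings are illustrative.

```python
def distinct_ngram_ratio(text, n=2):
    """Fraction of word n-grams that are unique (1.0 means no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # Too short to contain any n-gram, so nothing repeats
    return len(set(ngrams)) / len(ngrams)


# A looping sample in the style of the baseline output, and a non-repeating one
looping = "The dead were the living, and the living were the living. " * 3
varied = "Once upon a time, a small model wrote one sentence without repeating itself."

print(f"looping text: {distinct_ngram_ratio(looping):.2f}")
print(f"varied text:  {distinct_ngram_ratio(varied):.2f}")
```

Applied to the `Outputs` stored in `results_log`, a lower ratio for one precision level than for the baseline on the same prompt would suggest more repetition, though it says nothing about factual accuracy.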

### Sample Analysis

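1.  **Memory Scaling**: The memory footprint decreased with lower precision, but by less than the bit-widths alone would suggest. FP16 used 249.35 MB. INT8 used 168.35 MB (about 68% of FP16, not 50%), and the 4-bit formats (NF4, FP4) both used 127.85 MB (about 51% of FP16, not 25%). The gap arises because only part of the model is stored at reduced precision: the savings are consistent with just the weight matrices inside the transformer blocks being quantized, while the token and position embeddings, layer norms, biases, and buffers stay at their original precision. For a model as small as GPT-2, the embeddings are a large share of the total, so the overall reduction falls well short of the theoretical 16-bit -> 8-bit -> 4-bit scaling.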

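2.  **Latency Changes**: In this run, measured latency decreased with each lower precision level: 3.40 s for FP16, 1.23 s for INT8, and about 0.55 s for the 4-bit formats. Treat these numbers with caution, though. Each average covers only 3 runs with no warm-up, and FP16 was the first configuration to run, so its average likely absorbs one-time costs such as CUDA initialization and kernel loading. Lower precision does not automatically mean faster inference. Moving smaller weights from memory to the compute units is cheaper, but `bitsandbytes` also has to convert quantized weights back to the compute dtype (or handle outlier values separately in the INT8 path), which adds overhead. For a model this small, quantized inference can be no faster than FP16, or even slower, so rerun the measurement with a warm-up pass and more timing runs before drawing conclusions.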

3.  **Quality Degradation**: For GPT-2, the FP16 and INT8 models produced very similar outputs, so quality was well maintained at 8 bits. A noticeable degradation started with the 4-bit formats (NF4 and FP4). The text became more repetitive and less logical, especially for the more open-ended prompt, "Once upon a time...". The factual prompt, "The capital of France is...", held up better even at lower precisions, but still showed signs of quality loss. Keep in mind that the FP16 baseline already falls into repetitive loops under greedy decoding (see its outputs above), so judge each quantized model relative to the baseline rather than in absolute terms.

4.  **Key Trade-offs & Conclusion**:
    *   **FP16**: Best quality, but the highest memory usage and, in this run, the slowest measured speed. Best for offline tasks where quality is paramount.
    *   **INT8**: The "sweet spot" for this model. In this run it was about 2.8x faster than FP16 and used about 32% less memory, with almost no perceptible loss in quality. This would be an excellent choice for a server deployment balancing performance and accuracy.
    *   **NF4/FP4**: The largest memory savings (about 49% less than FP16) and the lowest measured latency. However, this comes at a clear cost to output quality. This would be the choice for a severely resource-constrained environment like an edge or mobile device, where just being able to *run* the model is the primary goal, and some quality degradation is an acceptable trade-off.
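
As a cross-check on the memory figures in the results table, the sketch below recomputes the footprint from GPT-2 small's architecture (12 layers, hidden size 768, vocabulary 50257, context length 1024). It rests on two assumptions that are inferred from the numbers rather than stated by the libraries: only the four weight matrices in each transformer block are stored at reduced precision, and each layer carries one 1024 x 1024 boolean causal-mask buffer (which may depend on the `transformers` version). The function name `predicted_footprint_mb` is hypothetical.

```python
# GPT-2 small architecture constants
N_LAYER, D_MODEL, VOCAB, N_CTX = 12, 768, 50257, 1024
BYTES_PER_MB = 1024 ** 2  # same convention as get_model_memory_footprint

# Weight matrices inside the transformer blocks (assumed to be quantized)
block_matrix_params = N_LAYER * (
    D_MODEL * 3 * D_MODEL    # attn.c_attn
    + D_MODEL * D_MODEL      # attn.c_proj
    + D_MODEL * 4 * D_MODEL  # mlp.c_fc
    + 4 * D_MODEL * D_MODEL  # mlp.c_proj
)

# Everything else (assumed to stay at 16 bits)
other_params = (
    VOCAB * D_MODEL            # token embeddings (shared with the output head)
    + N_CTX * D_MODEL          # position embeddings
    + N_LAYER * 9 * D_MODEL    # biases of the four block matrices (3D + D + 4D + D)
    + N_LAYER * 4 * D_MODEL    # two layer norms per block, weight and bias each
    + 2 * D_MODEL              # final layer norm
)

# Assumption: one boolean causal mask per layer, 1 byte per element
mask_buffer_bytes = N_LAYER * N_CTX * N_CTX


def predicted_footprint_mb(bits_per_quantized_weight):
    """Predicted footprint in MB when block matrices use the given bit-width."""
    quantized_bytes = block_matrix_params * bits_per_quantized_weight / 8
    other_bytes = other_params * 2  # 16 bits = 2 bytes
    return (quantized_bytes + other_bytes + mask_buffer_bytes) / BYTES_PER_MB


total_params = block_matrix_params + other_params
print(f"total parameters: {total_params:,}")
for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {predicted_footprint_mb(bits):.2f} MB")
```

Under these assumptions the predictions land within a small fraction of a megabyte of the measured values, which supports the reading that the embeddings and other non-matrix tensors are what keep the quantized footprints above the ideal 50% and 25%.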