# Lesson 3, Exercise 1: Post-Training Quantization - Beyond Memory: Speed and Quality Trade-offs

**Goal:**
The goal of this exercise is to look past the memory savings of quantization and evaluate its full impact. You will measure and analyze the trade-offs among model memory footprint, inference speed (latency), and the subjective quality of generated text when applying different levels of Post-Training Quantization (PTQ) to a GPT-2 model.
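
To make "reducing precision" concrete before running the experiment, here is a minimal sketch of symmetric absmax quantization to 8-bit integers in plain Python. It is an illustration only: the function names are hypothetical, the weights are made-up toy values, and bitsandbytes uses more elaborate schemes (finer-grained scales, and different 4-bit data types) than this single-scale version.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (toy example)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # toy values, not real model weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"int8 values: {q}")
print(f"scale: {scale:.6f}, max round-trip error: {max_error:.6f}")
```

Each weight now needs one byte instead of two or four, at the cost of a small rounding error per weight. The exercise measures how that error shows up in generated text.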

## 1. Installation & Setup

In [None]:
# !pip install transformers torch bitsandbytes pandas accelerate
# !accelerate config default # Run this if you haven't before

## 2. Imports and Configuration

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import pandas as pd

MODEL_NAME = "gpt2" # Standard GPT-2
PROMPTS = [
    "The capital of France is",
    "Once upon a time, in a land far, far away,",
    "To be or not to be, that is the"
]
MAX_NEW_TOKENS = 50
NUM_TIMING_RUNS = 3 # Number of times to run generation for averaging latency

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 3. Helper Functions

In [None]:
def get_model_memory_footprint(model):
    """Gets model memory footprint in MB."""
    mem_params = sum([___ for param in model.parameters()]) # TODO: extract memory size for each parameter. Ref: https://discuss.pytorch.org/t/finding-model-size/130275
    mem_bufs = sum([___ for buf in model.buffers()]) # TODO: extract memory size for each buffer. Ref: https://discuss.pytorch.org/t/finding-model-size/130275
    mem = mem_params + mem_bufs # in bytes
    return mem / 1024**2 # convert to MB

def generate_text_and_time(model, tokenizer, prompt, max_new_tokens):
    """Generates text and returns the generated text and latency."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    if model.device.type == 'cuda':
        torch.cuda.synchronize() # Ensure previous CUDA ops are done before the timer starts
    start_time = time.perf_counter() # Use perf_counter for more precise timing
        
    outputs = None # TODO: write generation logic
    
    if model.device.type == 'cuda':
        torch.cuda.synchronize() # Ensure generation is done
    end_time = time.perf_counter()
    
    generated_text = None # TODO: extract generated text from output. HINT: use the tokenizer 
    latency = end_time - start_time
    return generated_text, latency
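
The averaging that `NUM_TIMING_RUNS` is meant for can be sketched in isolation. The helper below is hypothetical (neither `time_callable` nor its `warmup` parameter is part of the exercise) and it times a plain Python function rather than a model. On a GPU you would still need the `torch.cuda.synchronize()` calls shown above around each measured call.

```python
import statistics
import time


def time_callable(fn, runs=3, warmup=1):
    """Call fn a few times untimed, then return (mean latency, all latencies)."""
    for _ in range(warmup):
        fn()  # warm-up calls absorb one-off costs such as caching or lazy init
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), latencies


# Stand-in workload; in the exercise this would be one generation call.
mean_latency, latencies = time_callable(lambda: sum(range(100_000)), runs=3, warmup=1)
print(f"mean: {mean_latency:.6f} s over {len(latencies)} runs")
```

A warm-up call is a design choice, not a requirement of the exercise: the first generation after loading a model is often slower than later ones, so leaving it in the average can bias comparisons between configurations.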

## 4. Main Experiment Logic

In [None]:
results_list = [] # Use a different name to avoid conflict if re-running cells

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Define precision configurations to test
configurations = [
    ### TODO: Define the baseline configuration (FP16 for GPU or FP32 for CPU)
    # Example: {"name": "FP16 (Baseline GPU)" if device.type == "cuda" else "FP32 (Baseline CPU)", "load_args": {"torch_dtype": torch.float16 if device.type == "cuda" else torch.float32}},
    
    ### TODO: Define the INT8 quantization configuration using bitsandbytes
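    # Example: {"name": "INT8 (bitsandbytes)", "load_args": {"load_in_8bit": True, "device_map": "auto" if device.type == "cuda" else None}},
    # Note: recent transformers releases warn that passing load_in_8bit / load_in_4bit directly is deprecated.
    # If you see that warning, pass "quantization_config": BitsAndBytesConfig(load_in_8bit=True) instead
    # (requires `from transformers import BitsAndBytesConfig`).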
    
    ### TODO: Define the NF4 (4-bit) quantization configuration using bitsandbytes
    
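    ### TODO: Define the FP4 (4-bit) quantization configuration using bitsandbytes
    # Keep "bitsandbytes" in the "name" of every quantized configuration: the CPU filter below matches on it.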
    ]

# Adjust configurations if running on CPU (bitsandbytes quantization typically requires CUDA)
if device.type == "cpu":
    print("Bitsandbytes quantization (INT8, NF4, FP4) usually requires CUDA. Filtering configurations.")
    configurations = [config for config in configurations if "bitsandbytes" not in config["name"]]

print(f"\n--- Starting Experiment for Model: {MODEL_NAME} ---")

for config in configurations:
    print(f"\nLoading model with configuration: {config['name']}")
    model = None # Ensure model is reset
    try:
        ### TODO: Load the model using AutoModelForCausalLM.from_pretrained()
        # Use the load_args from the current 'config'.
        # Handle device placement correctly (call model.to(device) when not using device_map, e.g., in the CPU use case).
        # Example for handling device_map on CPU (though bitsandbytes won't quantize):
        # current_load_args = config['load_args']
        # if "device_map" in current_load_args and device.type == "cpu":
        #    current_load_args = {k: v for k, v in current_load_args.items() if k != "device_map"}
        #    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, **current_load_args)
        #    model.to(device)
        pass # Replace with model loading logic on GPU using bitsandbytes. Hint: for this we can directly pass 
             # the config['load_args'] to the from_pretrained method.

        if model is None: # Check if model loading was skipped/failed in TODO
            print(f"Skipping {config['name']} due to model loading not implemented in TODO.")
            continue

        ### TODO: Get the model memory footprint using get_model_memory_footprint
        memory_mb = None # Placeholder
        print(f"Memory Footprint: {memory_mb:.2f} MB")

        avg_latencies_for_config = []
        generated_outputs_for_prompts = {}

        for i, prompt_text in enumerate(PROMPTS):
            print(f"  Processing prompt: '{prompt_text[:30]}...' ")
            prompt_specific_latencies = []
            current_generated_text = "N/A"
            
            ### TODO: Implement the timing loop (NUM_TIMING_RUNS)
            # Only perform multiple timing runs for the first prompt to establish 'Avg Latency (s)'
            # For other prompts, generate text once for quality assessment.
            # Store the first generation's text in 'current_generated_text'.
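            # Accumulate latencies for the first prompt in 'prompt_specific_latencies'. Use generate_text_and_time function.
            # Append the mean of 'prompt_specific_latencies' to 'avg_latencies_for_config'; otherwise the average below is NaN.
            # Collect the generated text for each prompt in generated_outputs_for_prompts,
            # keyed as f"Prompt {i+1} Output" so the columns match the error rows below.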

        overall_avg_latency_for_config = sum(avg_latencies_for_config) / len(avg_latencies_for_config) if avg_latencies_for_config else float('nan')

        results_list.append({
            "Float Precision": config["name"],
            "Memory (MB)": memory_mb,
            "Avg Latency (s)": overall_avg_latency_for_config, # Based on first prompt's timing
            **generated_outputs_for_prompts
        })
        
        del model # Free up memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    except Exception as e:
        print(f"Could not run configuration {config['name']}. Error: {e}")
        results_list.append({
            "Float Precision": config["name"],
            "Memory (MB)": "Error",
            "Avg Latency (s)": "Error",
            **{f"Prompt {i+1} Output": "Error" for i in range(len(PROMPTS))}
        })

# --- Display Results ---
df_results = pd.DataFrame(results_list)
print("\n\n--- Experiment Results Summary ---")
print(df_results.to_string())
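
The "Memory (MB)" column can be sanity-checked against a back-of-envelope estimate of weight storage alone. The sketch below assumes roughly 124 million parameters for the `gpt2` checkpoint (an approximation, not a measured value) and assumes every parameter is stored at the nominal bit width. Measured footprints for the quantized configurations will typically be higher, since some layers (such as embeddings and layer norms) usually stay in higher precision and quantized formats store extra scaling constants.

```python
APPROX_PARAM_COUNT = 124_000_000  # assumption: rough size of the "gpt2" checkpoint
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "4-bit (NF4/FP4)": 4}


def ideal_weight_memory_mb(num_params, bits_per_param):
    """Idealized weight storage in MB if every parameter used bits_per_param bits."""
    return num_params * bits_per_param / 8 / 1024**2


estimates = {
    name: ideal_weight_memory_mb(APPROX_PARAM_COUNT, bits)
    for name, bits in BITS_PER_PARAM.items()
}

baseline_mb = estimates["FP16"]
for name, mb in estimates.items():
    print(f"{name:>16}: {mb:7.1f} MB  ({mb / baseline_mb:.2f}x the FP16 estimate)")
```

Comparing these idealized numbers with your measurements shows how much of the model was actually quantized.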

## 5. Analysis and Discussion

Based on the 'Experiment Results Summary' table printed above, analyze your findings:

1.  **Memory Scaling:** 
    *   **TODO**: Describe how the model's memory footprint scaled as you reduced precision. Quantify the reductions.

2.  **Latency Changes:** 
    *   **TODO**: Analyze the changes in generation latency. Did latency always decrease with lower precision, or were there other factors at play? Explain potential reasons.

3.  **Output Quality Degradation:** 
    *   **TODO**: Based on your subjective review of the outputs for each prompt and precision, at what point (if any) did you start to observe significant degradation in the output quality (e.g., coherence, relevance, repetitiveness)? Provide examples.

4.  **Key Trade-offs:** 
    *   **TODO**: Conclude with a summary of the key trade-offs you observed between memory savings, inference speed, and text quality when applying different quantization levels to GPT-2. Which configuration seemed to offer the best balance for which scenario?
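
The quality review above is subjective, but repetitiveness in particular can be given a rough numeric companion. The sketch below computes the fraction of distinct word bigrams in a text. The function name is hypothetical, whitespace splitting is a simplification of real tokenization, and the two sample strings are hand-written illustrations rather than model outputs. Treat it as a crude proxy that supplements reading the outputs, not one that replaces it.

```python
def distinct_ngram_ratio(text, n=2):
    """Fraction of word n-grams that are unique; lower values suggest more repetition."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram, so nothing repeats
    return len(set(ngrams)) / len(ngrams)


varied = "the quick brown fox jumps over the lazy dog"  # hand-written sample
looping = "the cat the cat the cat the cat"              # hand-written sample

print(f"varied:  {distinct_ngram_ratio(varied):.2f}")
print(f"looping: {distinct_ngram_ratio(looping):.2f}")
```

Applied to each "Prompt N Output" column, a sharp drop in this ratio for one configuration is a hint to reread those outputs closely.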