# 03 - LLM Evaluation (Zero-Shot)

**Previous:** [02_Data_Processing_and_Tokenization.ipynb](02_Data_Processing_and_Tokenization.ipynb)  
**Next:** [04_SLM_Training_LoRA.ipynb](04_SLM_Training_LoRA.ipynb)

---

## What This Notebook Covers

In this notebook, we evaluate **Large Language Models (LLMs)** on our medical diagnosis task **without any task-specific training**.

**Key Questions:**
1. What is zero-shot evaluation?
2. How do we load large models efficiently (7-8B parameters)?
3. What is model quantization and why do we need it?
4. How do we run inference and extract predictions?
5. How do we measure performance (accuracy, F1, etc.)?

**Models We'll Evaluate:**
- **Llama 3.1 8B** (Meta's 8B instruction-tuned model, the smallest in the Llama 3.1 family)
- **Mistral 7B** (Popular open-source alternative)

**Why This Matters:**
- Establishes baseline performance
- Tests if large models can solve medical tasks out-of-the-box
- Provides comparison point for our finetuned small models

---

## Setup

In [None]:
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Critical for GPU memory management
os.environ['PYTORCH_ALLOC_CONF'] = 'expandable_segments:True'

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

print(f"✅ Project Root: {project_root}")

In [None]:
# Import libraries
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict
from tqdm.auto import tqdm
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
import gc

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ All libraries imported")

In [None]:
# Check GPU
if torch.cuda.is_available():
    print(f"✅ CUDA Available: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    print(f"   CUDA Version: {torch.version.cuda}")
    device = "cuda"
else:
    print("⚠️  CUDA not available - using CPU (very slow!)")
    device = "cpu"

---

## 1. Understanding Zero-Shot Evaluation 🎯

### What is Zero-Shot?

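**Zero-shot** means the model gets **no task-specific training and no worked examples** in the prompt: it has only the instructions to go on. (Its pre-training data may still contain related text, such as medical articles.)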

```
Traditional ML:
  1. Collect labeled data for task X
  2. Train model on task X
  3. Test model on task X
  → Model learned from task X examples

Zero-Shot Learning:
  1. Model pre-trained on general text
  2. NO training on task X
  3. Test model on task X using instructions only
  → Model relies on general knowledge
```
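
To make the distinction concrete, here is a minimal sketch of how a zero-shot prompt differs from a few-shot one. The helper `build_messages` and the demonstration pair are hypothetical and only for illustration; the notebook itself uses the zero-shot form throughout.

```python
from typing import Dict, List, Optional, Tuple


def build_messages(
    instruction: str,
    query: str,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Build chat messages; with no examples this is a zero-shot prompt."""
    messages = [{"role": "system", "content": instruction}]
    # Few-shot only: each worked example becomes a user/assistant turn pair.
    for example_input, example_output in examples or []:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


instruction = "Predict the ICD-10 code for the conversation. Reply with the code only."
query = "Doctor: What brings you here?\nPatient: I have a fever."

zero_shot = build_messages(instruction, query)
# Hypothetical demonstration pair, shown purely for contrast with zero-shot.
few_shot = build_messages(
    instruction, query, examples=[("Patient: I have a sore throat.", "J02.9")]
)

print(len(zero_shot), "messages in the zero-shot prompt")
print(len(few_shot), "messages in the few-shot prompt")
```

The zero-shot prompt contains only the instruction and the query, so any correct answer has to come from what the model already knows.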

### Why Zero-Shot?

**Advantages:**
- ✅ No training needed (saves time and compute)
- ✅ Works on new tasks immediately
- ✅ Generalizes across domains

**Disadvantages:**
- ❌ Often lower performance than finetuned models
- ❌ May not understand task-specific nuances
- ❌ Sensitive to prompt wording

### Our Experimental Setup

We'll evaluate LLMs in zero-shot mode to answer:
> **Can large models (7-8B) with no medical training compete with small finetuned models (3B)?**

**Prediction:**
- LLMs have more parameters → more knowledge
- But no medical specialization
- Will they be good enough?

---

## 2. Model Quantization 🗜️

### The Memory Problem

**Problem:** Large models are HUGE!

```
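Llama 3.1 8B in full precision (float32):
  8 billion parameters × 4 bytes = 32 GB for the weights alone
  + activations and KV cache at inference time = more than 32 GB in practice
  (optimizer states add even more, but only during training)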
```

Most consumer GPUs have 16-24 GB of VRAM → **Won't fit!**
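
The arithmetic above generalizes to other precisions. The sketch below uses a hypothetical helper, `weight_memory_gb`, that counts weight storage only; it ignores activations, the KV cache, and the small overhead that quantization formats add for their scale factors.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 8e9  # nominal parameter count for an "8B" model
for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>17}: {weight_memory_gb(num_params, bits):5.1f} GB")
```

Real footprints are somewhat higher than these idealized figures, which is why a 4-bit 8B model is usually quoted at 4-6 GB rather than exactly 4 GB.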

### Solution: Quantization

**Quantization** reduces the precision of model weights:

```
Float32 (32-bit):  3.1415927 (about 7 significant digits)
  ↓ Quantize to 8-bit
Int8 (8-bit):      one of 256 possible values (less precise, 4x smaller!)
  ↓ Quantize to 4-bit
Int4 (4-bit):      one of 16 possible values (least precise, 8x smaller!)
```
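
As an illustration of the idea, here is a minimal sketch of symmetric "absmax" int8 quantization in plain Python. It is one simple scheme chosen for clarity, not the method bitsandbytes uses internally, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_absmax_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.27, 0.004]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print(f"max round-trip error: {max_error:.5f} (at most scale / 2 = {scale / 2:.5f})")
```

Each weight is stored as a small integer plus one shared scale, so the rounding error per weight is bounded by half the scale. Small weights such as `0.004` collapse to zero, which is the precision being traded for memory.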

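**Trade-offs** (rough, illustrative figures for an 8B model; real speed and accuracy depend on hardware, kernels, and task):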

| Precision | Memory | Speed | Accuracy |
|-----------|--------|-------|----------|
| float32   | 32 GB  | 1.0x  | 100%     |
| float16   | 16 GB  | 1.5x  | 99.9%    |
| int8      | 8 GB   | 2.0x  | 99.5%    |
| int4      | 4 GB   | 3.0x  | 98-99%   |

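**We'll use 4-bit quantization (NF4):**
- Llama 8B: 32 GB → **4-6 GB** (fits on most GPUs!)
- Small accuracy loss (typically ~1-2%)
- Speed varies: bitsandbytes dequantizes weights on the fly, so generation is not guaranteed to be faster than float16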

### BitsAndBytes NF4

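**NF4 (4-bit NormalFloat)** is a 4-bit data type designed for neural network weights:
- Its 16 levels are spaced for weights that are roughly normally distributed
- Typically preserves more accuracy than uniformly spaced int4
- Introduced in the QLoRA paper (Dettmers et al., 2023) and implemented in the bitsandbytes library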
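
The intuition behind a normal-distribution-aware codebook can be sketched with the standard library. The levels below are placed at evenly spaced quantiles of a standard normal and rescaled to [-1, 1]. This is an illustrative construction under that assumption only: the actual NF4 table in bitsandbytes is built differently (for example, it includes an exact zero), and the function names here are hypothetical.

```python
from statistics import NormalDist
from typing import List, Tuple


def normal_quantile_levels(num_levels: int = 16) -> List[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    dist = NormalDist()
    # Mid-point probabilities avoid evaluating the quantile at 0 or 1.
    quantiles = [dist.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize_block(weights: List[float], levels: List[float]) -> Tuple[List[int], float]:
    """Store each weight as the index of its nearest level, plus one scale per block."""
    scale = max(abs(w) for w in weights) or 1.0
    indices = [
        min(range(len(levels)), key=lambda i: abs(w / scale - levels[i]))
        for w in weights
    ]
    return indices, scale


levels = normal_quantile_levels()
block = [0.02, -0.15, 0.31, -0.07, 0.11, -0.29]
indices, scale = quantize_block(block, levels)
restored = [levels[i] * scale for i in indices]

print("4-bit indices:", indices)
print("restored:     ", [round(r, 3) for r in restored])
```

Because the levels are denser near zero, where most weights of a roughly normal distribution fall, small weights are represented more finely than they would be on a uniform 16-step grid.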

---

## 3. Loading a Large Language Model

Let's load Llama 3.1 8B with 4-bit quantization:

In [None]:
# Configuration for 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",            # Use NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16
    bnb_4bit_use_double_quant=True        # Double quantization for even more compression
)

print("Quantization Config:")
print(f"  Quantization Type: NF4 (4-bit)")
print(f"  Compute Dtype: bfloat16")
print(f"  Double Quantization: Enabled")
print(f"\n  Expected Memory: ~4-6 GB (vs ~32 GB in float32)")
print(f"  Speed: hardware-dependent (4-bit is mainly a memory saving)")

In [None]:
# Model to load
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

print(f"Loading model: {model_name}")
print("This may take 1-2 minutes...\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print(f"✅ Tokenizer loaded")

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",                    # Automatically place layers on available devices
    torch_dtype=torch.bfloat16            # dtype for the non-quantized modules
    # trust_remote_code is not needed: Llama is natively supported by transformers
)
model.eval()  # Set to evaluation mode

print(f"✅ Model loaded and quantized")

# Check memory usage
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"\nGPU Memory:")
    print(f"  Allocated: {allocated:.2f} GB")
    print(f"  Reserved:  {reserved:.2f} GB")

### Model Information

In [None]:
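# Count parameters
# Note: bitsandbytes packs two 4-bit weights into each stored byte, so numel()
# on a quantized model reports noticeably fewer than the nominal 8B parameters.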
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model Statistics:")
print(f"  Total Parameters:     {total_params:,}")
print(f"  Trainable Parameters: {trainable_params:,}")
print(f"  Model Size:           ~{total_params / 1e9:.1f}B parameters")
print(f"\n  Architecture: {model.config.model_type}")
print(f"  Hidden Size:  {model.config.hidden_size}")
print(f"  Num Layers:   {model.config.num_hidden_layers}")
print(f"  Num Heads:    {model.config.num_attention_heads}")
print(f"  Vocab Size:   {model.config.vocab_size:,}")

---

## 4. Loading Test Data

Let's load the test set we'll evaluate on:

In [None]:
# Load dataset
print("Loading MedSynth dataset...")
dataset = load_dataset("samhog/medsynth-diagnosis-icd10-10k", split="train")

# Split into train/val/test (70/15/15)
train_test_split = dataset.train_test_split(test_size=0.3, seed=42)
val_test_split = train_test_split['test'].train_test_split(test_size=0.5, seed=42)

train_dataset = train_test_split['train']
val_dataset = val_test_split['train']
test_dataset = val_test_split['test']

print(f"\n✅ Dataset Split:")
print(f"   Train: {len(train_dataset):,} examples")
print(f"   Val:   {len(val_dataset):,} examples")
print(f"   Test:  {len(test_dataset):,} examples")

# For this demo, use a subset of test set
test_subset = test_dataset.select(range(min(100, len(test_dataset))))
print(f"\n   Using {len(test_subset)} test examples for demo")

---

## 5. Running Zero-Shot Inference

### The Inference Process

```
1. Format conversation with chat template
   "Doctor: What brings you here?\nPatient: I have a fever."
   
2. Tokenize input
   [128000, 128006, ...] (token IDs)
   
3. Generate prediction
   Model outputs: [9805, 2705, 13, 24] (new token IDs)
   
4. Decode prediction
   "J06.9"
   
5. Extract ICD-10 code
   "J06.9" → cleaned and validated
```

### Inference Function

In [None]:
def format_conversation_for_inference(example: Dict) -> str:
    """
    Format a medical conversation for zero-shot inference.
    """
    system_prompt = (
        "You are a medical diagnosis assistant. "
        "Based on the doctor-patient conversation below, predict ONLY the ICD-10 diagnosis code. "
        "Respond with just the code (e.g., 'J06.9'), nothing else."
    )
    
    # Format conversation
    conversation_text = "\n".join([
        f"{msg['role'].capitalize()}: {msg['content']}"
        for msg in example['messages']
    ])
    
    # Build chat messages (no assistant response - we want model to generate it)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": conversation_text}
    ]
    
    # Apply chat template
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True  # Add assistant prompt for generation
    )
    
    return formatted

# Test formatting
test_example = test_subset[0]
formatted_input = format_conversation_for_inference(test_example)

print("Example Formatted Input:")
print("="*70)
print(formatted_input[:500] + "...")
print("="*70)
print(f"\nExpected Output: {test_example['diagnosis']}")

### Prediction Function

In [None]:
@torch.no_grad()  # Disable gradient computation for inference
def predict_icd10(example: Dict, model, tokenizer, max_new_tokens: int = 10) -> str:
    """
    Predict ICD-10 code for a conversation.
    
    Args:
        example: Dataset example with 'messages' field
        model: Language model
        tokenizer: Tokenizer
        max_new_tokens: Maximum tokens to generate
    
    Returns:
        Predicted ICD-10 code (string)
    """
    # Format conversation
    formatted = format_conversation_for_inference(example)
    
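    # Tokenize. The chat template already inserted the BOS token, so skip
    # add_special_tokens to avoid a duplicated <|begin_of_text|>.
    inputs = tokenizer(formatted, return_tensors="pt", add_special_tokens=False).to(model.device)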
    
    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,           # Deterministic (greedy decoding)
        temperature=1.0,
        top_p=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    
    # Decode only the new tokens (not the input)
    generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
    prediction = tokenizer.decode(generated_ids, skip_special_tokens=True)
    
    # Clean prediction (extract just the code)
    prediction = prediction.strip()
    
    # Sometimes models add extra text - keep the first word and strip
    # surrounding quotes/punctuation (e.g. "J06.9." -> "J06.9")
    words = prediction.split()
    prediction = words[0].strip("'\".,;:") if words else ""
    
    return prediction

# Test prediction
print("Testing prediction on first example...\n")
prediction = predict_icd10(test_example, model, tokenizer)

print(f"Ground Truth: {test_example['diagnosis']}")
print(f"Prediction:   {prediction}")
print(f"Match:        {'✅ CORRECT' if prediction == test_example['diagnosis'] else '❌ INCORRECT'}")
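
Taking the first word works when the model follows the instruction exactly, but it fails on outputs like "The code is J06.9". A more tolerant option is to search for the first token that has the shape of an ICD-10 code. The pattern below is a simplified assumption about that shape (a letter, a digit, one more alphanumeric, then an optional dot and up to four alphanumerics); it does not check that the code actually exists, and `extract_icd10` is a hypothetical helper, not part of the project code.

```python
import re

# Simplified ICD-10 shape; this validates format only, not whether the code exists.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_icd10(text: str) -> str:
    """Return the first ICD-10-shaped token in the text, or "" if none is found."""
    match = ICD10_PATTERN.search(text.upper())
    return match.group(0) if match else ""


samples = ["J06.9", "The code is j06.9.", "Diagnosis: E11 (type 2 diabetes)", "I am not sure"]
for raw in samples:
    print(f"{raw!r:40} -> {extract_icd10(raw)!r}")
```

Whichever extraction rule is used, it should stay fixed across all models being compared, since a more forgiving parser raises the measured accuracy on its own.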

### Batch Evaluation

Now let's evaluate the entire test subset:

In [None]:
def evaluate_model(dataset, model, tokenizer) -> Dict:
    """
    Evaluate model on entire dataset.
    
    Returns:
        Dictionary with predictions and ground truth
    """
    predictions = []
    ground_truth = []
    
    print(f"Evaluating on {len(dataset)} examples...\n")
    
    for example in tqdm(dataset, desc="Predicting"):
        try:
            pred = predict_icd10(example, model, tokenizer)
            predictions.append(pred)
            ground_truth.append(example['diagnosis'])
        except Exception as e:
            print(f"Error on example: {e}")
            predictions.append("")  # Empty prediction on error
            ground_truth.append(example['diagnosis'])
    
    return {
        'predictions': predictions,
        'ground_truth': ground_truth
    }

# Run evaluation
results = evaluate_model(test_subset, model, tokenizer)

print(f"\n✅ Evaluation complete!")
print(f"   Predictions: {len(results['predictions'])}")
print(f"   Ground Truth: {len(results['ground_truth'])}")

---

## 6. Calculating Metrics 📊

### Metrics We'll Use

**Exact Match Accuracy:**
```
Accuracy = (Correct Predictions) / (Total Predictions)
```
Example: Predicted "J06.9", Ground Truth "J06.9" → ✅ Correct

**Prefix Match Accuracy:**
```
Category Accuracy = (Predictions in the Correct Category) / (Total Predictions)
```
The category is the part of the code before the dot.

Example: Predicted "J06.8", Ground Truth "J06.9" → ✅ Same category (J06)

**Precision, Recall, F1:**
- **Precision**: Of predictions for code X, how many were correct?
- **Recall**: Of all actual code X cases, how many did we find?
- **F1**: Harmonic mean of precision and recall
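
A small worked example may help here. The sketch below computes macro-averaged precision, recall, and F1 by hand in plain Python on four made-up predictions. The function `macro_scores` is hypothetical and is meant to mirror what scikit-learn's `average='macro'` with `zero_division=0` computes, where every label that appears in either list counts as one class.

```python
from typing import Dict, List


def macro_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Macro-averaged precision, recall and F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n, "f1": sum(f1s) / n}


# Made-up toy data: three distinct codes, two of four predictions correct.
y_true = ["J06.9", "J06.9", "E11.9", "I10"]
y_pred = ["J06.9", "E11.9", "E11.9", "J06.9"]
scores = macro_scores(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

Note that the code `I10` is never predicted, so it contributes zeros to all three averages. With hundreds of possible codes and only 100 test examples, many classes behave this way, which pulls macro scores well below exact-match accuracy.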

### Calculate Metrics

In [None]:
def calculate_metrics(predictions: List[str], ground_truth: List[str]) -> Dict:
    """
    Calculate comprehensive evaluation metrics.
    """
    # Exact match accuracy
    exact_matches = sum(p == g for p, g in zip(predictions, ground_truth))
    exact_match_acc = exact_matches / len(predictions)
    
    # Prefix match (category level, e.g., J06.9 → J06)
    def get_category(code: str) -> str:
        """Extract category from ICD-10 code."""
        if not code:
            return ""
        # Extract letter + first digits (e.g., J06.9 → J06)
        parts = code.split('.')
        return parts[0] if parts else ""
    
    pred_categories = [get_category(p) for p in predictions]
    true_categories = [get_category(g) for g in ground_truth]
    
    category_matches = sum(p == g for p, g in zip(pred_categories, true_categories))
    category_acc = category_matches / len(predictions)
    
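    # Precision, Recall, F1 (macro-averaged across codes)
    # Every label seen in either list counts as a class, so malformed
    # predictions add zero-score classes and lower the macro averages.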
    precision = precision_score(ground_truth, predictions, average='macro', zero_division=0)
    recall = recall_score(ground_truth, predictions, average='macro', zero_division=0)
    f1 = f1_score(ground_truth, predictions, average='macro', zero_division=0)
    
    return {
        'exact_match_accuracy': exact_match_acc,
        'category_accuracy': category_acc,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'num_examples': len(predictions)
    }

# Calculate metrics
metrics = calculate_metrics(results['predictions'], results['ground_truth'])

print("\n" + "="*70)
print("EVALUATION METRICS")
print("="*70)
print(f"\nExact Match Accuracy:  {metrics['exact_match_accuracy']:.1%}")
print(f"Category Accuracy:     {metrics['category_accuracy']:.1%}")
print(f"\nPrecision (macro):     {metrics['precision']:.1%}")
print(f"Recall (macro):        {metrics['recall']:.1%}")
print(f"F1 Score (macro):      {metrics['f1_score']:.1%}")
print(f"\nExamples Evaluated:    {metrics['num_examples']}")
print("="*70)
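
With only 100 test examples, the measured accuracy carries real sampling uncertainty. One way to gauge it is a Wilson score interval, sketched below in plain Python. The helper `wilson_interval` and the 20-out-of-100 figure are hypothetical and used only for illustration; substitute the counts from your own run.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total == 0:
        return (0.0, 1.0)
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return (center - half_width, center + half_width)


low, high = wilson_interval(correct=20, total=100)  # hypothetical: 20 of 100 correct
print(f"accuracy 20.0%, approximate 95% interval: {low:.1%} to {high:.1%}")
```

An interval this wide means that small differences between two models on the 100-example subset may not be meaningful; evaluating on the full test set narrows it.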

### Error Analysis

Let's look at some incorrect predictions to understand model behavior:

In [None]:
# Find errors
errors = []
for i, (pred, true) in enumerate(zip(results['predictions'], results['ground_truth'])):
    if pred != true:
        errors.append({
            'index': i,
            'predicted': pred,
            'ground_truth': true,
            'conversation': test_subset[i]['messages']
        })

print(f"\nErrors: {len(errors)} / {len(results['predictions'])} ({len(errors)/len(results['predictions'])*100:.1f}%)\n")

# Show first 5 errors
print("First 5 Errors:")
print("="*70)
for i, error in enumerate(errors[:5]):
    print(f"\n[Error {i+1}]")
    print(f"  Predicted:    {error['predicted']}")
    print(f"  Ground Truth: {error['ground_truth']}")
    print(f"  Conversation snippet:")
    for msg in error['conversation'][:2]:  # First 2 messages
        print(f"    {msg['role']:8s}: {msg['content'][:60]}...")
    print("-"*70)

### Confusion Analysis

Which codes does the model confuse most often?

In [None]:
# Find most common confusions
from collections import Counter

confusions = Counter()
for pred, true in zip(results['predictions'], results['ground_truth']):
    if pred != true:
        confusions[(true, pred)] += 1

print("\nTop 10 Confusions (True → Predicted):")
print("="*70)
for (true, pred), count in confusions.most_common(10):
    print(f"  {true:10s} → {pred:10s}  ({count} times)")
print("="*70)

---

## 7. Visualizing Results

### Performance Summary

In [None]:
# Create bar plot of metrics
fig, ax = plt.subplots(figsize=(10, 6))

metric_names = ['Exact Match', 'Category', 'Precision', 'Recall', 'F1']
metric_values = [
    metrics['exact_match_accuracy'],
    metrics['category_accuracy'],
    metrics['precision'],
    metrics['recall'],
    metrics['f1_score']
]

bars = ax.bar(metric_names, metric_values, color=['#2ecc71', '#3498db', '#9b59b6', '#e74c3c', '#f39c12'])
ax.set_ylabel('Score')
ax.set_title(f'Zero-Shot Performance: {model_name.split("/")[1]}')
ax.set_ylim(0, 1.0)
ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='50% reference line')

# Add value labels on bars
for bar, value in zip(bars, metric_values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.02,
            f'{value:.1%}', ha='center', va='bottom', fontsize=11, fontweight='bold')

ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

### Prediction Distribution

In [None]:
# Compare distribution of predicted vs ground truth codes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Ground truth distribution
true_counts = Counter(results['ground_truth'])
top_true = dict(true_counts.most_common(10))
ax1.barh(list(top_true.keys()), list(top_true.values()), color='#3498db')
ax1.set_xlabel('Frequency')
ax1.set_title('Ground Truth: Top 10 ICD-10 Codes')
ax1.invert_yaxis()

# Predicted distribution  
pred_counts = Counter(results['predictions'])
top_pred = dict(pred_counts.most_common(10))
ax2.barh(list(top_pred.keys()), list(top_pred.values()), color='#e74c3c')
ax2.set_xlabel('Frequency')
ax2.set_title('Predicted: Top 10 ICD-10 Codes')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

print("\n💡 Insight: Compare the distributions - does the model over/under-predict certain codes?")

---

## 8. Model Cleanup

Free GPU memory before loading another model:

In [None]:
# Clean up model from memory
del model
del tokenizer

# Aggressive cleanup
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    torch.cuda.ipc_collect()

print("✅ Model unloaded and memory freed")

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**3
    print(f"   GPU Memory Allocated: {allocated:.2f} GB")

---

## 9. Key Takeaways 💡

### What We Learned

1. **Zero-Shot Evaluation**
   - Tests model's general knowledge without task-specific training
   - Establishes baseline performance
   - Fast to run (no training needed)

2. **4-bit Quantization**
   - Reduces weight memory by roughly 8x versus float32 (32 GB → ~4 GB)
   - Small accuracy loss (typically ~1-2%)
   - Essential for running large models on consumer GPUs

3. **Inference Process**
   - Format conversation → Tokenize → Generate → Decode → Extract
   - Careful prompt engineering matters
   - Need to clean/validate outputs

4. **Evaluation Metrics**
   - Exact match: Most strict (full code must match)
   - Category match: More lenient (J06.9 vs J06.8)
   - Macro F1/Precision/Recall: Weight every code equally, so rare codes count as much as common ones

### Expected Results

**Rough Expectations for Zero-Shot Performance** (not measured results; your numbers depend on the model, prompt, and test subset):
- Exact Match: 10-30% (depends on model)
- Category Match: 30-50%
- F1 Score: 15-35%

**Why Not Higher?**
- No medical specialization
- Limited context about ICD-10 codes
- May confuse similar conditions

**This is why we finetune!** 🎯

### Common Issues

❌ **CUDA OOM**: Free other models from GPU memory, shorten inputs, or use a smaller model  
❌ **Slow inference**: Confirm the model is on the GPU and that `device_map` did not offload layers to CPU  
❌ **Wrong format**: Verify chat template is correct  
❌ **Empty predictions**: Model may not understand task - adjust prompt  

---

## 10. What's Next? 👉

Now we have baseline LLM performance! Next steps:

1. **Train Small Models** - Finetune 3B models with LoRA
   - Can specialization beat size?
   - How much improvement from finetuning?

2. **Evaluate Finetuned Models** - Compare with baseline
   - Same metrics as zero-shot
   - Direct comparison

3. **Analyze Trade-offs** - Size vs Specialization
   - Performance
   - Speed
   - Memory

**Next Notebook:** [04_SLM_Training_LoRA.ipynb](04_SLM_Training_LoRA.ipynb)

---

## Summary

In this notebook, we:

- ✅ Understood zero-shot evaluation
- ✅ Learned about 4-bit quantization (NF4)
- ✅ Loaded large models efficiently (8B parameters in ~4-6 GB)
- ✅ Ran inference on medical conversations
- ✅ Calculated comprehensive metrics
- ✅ Analyzed errors and confusions
- ✅ Visualized results

**Key Files in Project:**
- `src/models/llm_model.py` - LLM wrapper with quantization
- `src/evaluation/metrics.py` - Metric calculation functions
- `src/config/base_config.py` - Model configs and quantization settings

---

**Continue to:** [04_SLM_Training_LoRA.ipynb](04_SLM_Training_LoRA.ipynb) 🚀