# 02 - Data Processing and Tokenization

**Previous:** [01_Data_Loading_and_Exploration.ipynb](01_Data_Loading_and_Exploration.ipynb)  
**Next:** [03_LLM_Evaluation_ZeroShot.ipynb](03_LLM_Evaluation_ZeroShot.ipynb)

---

## What This Notebook Covers

In this notebook, we'll explore one of the most critical steps in NLP: **tokenization**.

**Key Questions We'll Answer:**
1. How does text become numbers that models can understand?
2. What are tokens, and how do tokenizers work?
3. How do we format doctor-patient conversations for instruction-tuned models?
4. What are padding and truncation, and why do we need them?
5. How do we create efficient batches for training?

**Why This Matters:**
- Models don't understand text - they work with numbers
- Poor tokenization = poor model performance
- Chat templates ensure models interpret conversations correctly
- Proper batching can make training many times faster on a GPU (the exact speedup depends on hardware, model, and sequence length)

---

## Setup

In [1]:
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Critical for GPU memory management
os.environ['PYTORCH_ALLOC_CONF'] = 'expandable_segments:True'

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

print(f"✅ Project Root: {project_root}")

✅ Project Root: /home/bmw/src/simon/finetuning


In [2]:
# Import libraries
import torch
from transformers import AutoTokenizer
from datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict
from collections import Counter

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ All libraries imported")

✅ All libraries imported


---

## 1. What is Tokenization? 🔤 → 🔢

### The Problem

Language models are neural networks that perform mathematical operations. They need **numbers**, not **text**.

**Tokenization** is the process of converting text into numerical tokens that models can process.

### The Process

```
Text Input:
"I have a sore throat."

        ↓ Tokenization

Tokens (pieces):
["I", " have", " a", " sore", " throat", "."]

        ↓ Convert to IDs

Token IDs (numbers):
[40, 617, 264, 36366, 28691, 13]

        ↓ Model processes

Embeddings (vectors):
[[0.23, -0.45, ...], [0.67, 0.12, ...], ...]
```
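
The same flow can be traced in a few lines of plain Python. This is only a toy sketch: `TOY_VOCAB`, `toy_tokenize`, and `toy_embed` are hypothetical names, the greedy longest-match split is a simplification of real subword algorithms, and the embedding values are made up (a real model learns them). The token IDs are the ones the Llama tokenizer prints for this sentence in Example 1 of this notebook.

```python
# Toy illustration of text -> tokens -> IDs -> embeddings.
# Vocabulary and embedding values are hypothetical stand-ins.
from typing import Dict, List

TOY_VOCAB: Dict[str, int] = {
    "I": 40, " have": 617, " a": 264, " sore": 36366, " throat": 28691, ".": 13,
}

def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return tokens

def toy_embed(token_ids: List[int], dim: int = 3) -> List[List[float]]:
    """Stand-in for an embedding table: one made-up vector per token ID."""
    return [[round(((tid * (k + 1)) % 7) / 7, 2) for k in range(dim)] for tid in token_ids]

tokens = toy_tokenize("I have a sore throat.", TOY_VOCAB)
token_ids = [TOY_VOCAB[t] for t in tokens]
embeddings = toy_embed(token_ids)

print(tokens)
print(token_ids)
print(len(embeddings), "vectors of size", len(embeddings[0]))
```

Each token maps to exactly one ID, and each ID selects exactly one vector; the model only ever sees the vectors.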

### Key Concepts

**Token**: A piece of text (can be a word, subword, or character)
- Modern tokenizers use **subword tokenization** (BPE, WordPiece, SentencePiece)
- Common words: 1 token (`"hello"` → `["hello"]`)
- Rare words: Multiple tokens (`"pneumonoultramicroscopicsilicovolcanoconiosis"` → 15+ tokens!)
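
To see how a rare word ends up as several tokens, here is a simplified sketch of how BPE applies learned merge rules. The `merges` list is hypothetical, hand-picked so that "pharyngitis" splits the same way the Llama tokenizer splits it in Example 2 of this notebook; real tokenizers learn many thousands of merges from data and work on bytes rather than characters.

```python
from typing import List, Tuple

def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Start from single characters and apply merge rules in priority order."""
    pieces = list(word)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(pieces):
            # Merge every adjacent (left, right) pair found in this pass.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces

# Hypothetical merge table (earlier entries have higher priority).
merges = [
    ("p", "h"), ("a", "r"), ("ar", "y"), ("n", "g"),
    ("i", "t"), ("i", "s"), ("it", "is"),
    ("h", "a"), ("v", "e"), ("ha", "ve"),
]

print(apply_bpe("have", merges))         # a word fully covered by merges
print(apply_bpe("pharyngitis", merges))  # a word only partially covered
```

A word whose pieces can all be merged collapses into one token; a word with no full merge path stays split into whatever subwords the table can build.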

**Vocabulary**: Set of all possible tokens the model knows
- Llama 3: ~128,000 tokens
- Qwen 2.5: ~151,000 tokens
- Trade-off: Larger vocab = more memory, smaller vocab = longer sequences
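
The memory side of that trade-off is simple arithmetic, since the input embedding table has one row per vocabulary entry. The sketch below assumes a hidden size of 3072 for Llama 3.2 3B and 2 bytes per weight (bf16/fp16); both are assumptions rather than values taken from this notebook, and `embedding_table_bytes` is a hypothetical helper.

```python
def embedding_table_bytes(vocab_size: int, hidden_size: int, bytes_per_weight: int = 2) -> int:
    """Size of a (vocab_size x hidden_size) embedding matrix in bytes."""
    return vocab_size * hidden_size * bytes_per_weight

llama_vocab = 128_256   # vocabulary size the tokenizer reports later in this notebook
hidden_size = 3072      # assumption: hidden size of Llama 3.2 3B

size_mib = embedding_table_bytes(llama_vocab, hidden_size) / 2**20
print(f"Embedding table: {size_mib:.1f} MiB")
```

Doubling the vocabulary doubles this table, while halving it would make the same text take more tokens per sequence.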

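**Special Tokens**: Tokens with special meaning (exact names vary by model family)
- `<|begin_of_text|>` (BOS): Start of text
- `<|end_of_text|>` / `<|eot_id|>` (EOS): End of text, or end of a chat turn
- `<|pad|>` (PAD): Padding for batching (some tokenizers, including Llama's, ship without one)
- `<|assistant|>`, `<|user|>`: Role indicators used by some chat models; Llama 3 instead wraps the role name in `<|start_header_id|>` ... `<|end_header_id|>`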

---

## 2. Loading a Tokenizer

Each model family has its own tokenizer trained on specific data. We must use the **exact tokenizer** that matches the model.

Let's load the tokenizer for Llama 3.2 3B (one of our SLMs):

In [3]:
# Load tokenizer for Llama 3.2 3B
model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"✅ Loaded tokenizer for: {model_name}")
print(f"\nVocabulary Size: {len(tokenizer):,} tokens")
print(f"Model Max Length: {tokenizer.model_max_length:,} tokens")
print(f"\nSpecial Tokens:")
print(f"  BOS (Begin): {tokenizer.bos_token} → ID {tokenizer.bos_token_id}")
print(f"  EOS (End):   {tokenizer.eos_token} → ID {tokenizer.eos_token_id}")
print(f"  PAD (Pad):   {tokenizer.pad_token} → ID {tokenizer.pad_token_id}")

✅ Loaded tokenizer for: meta-llama/Llama-3.2-3B-Instruct

Vocabulary Size: 128,256 tokens
Model Max Length: 131,072 tokens

Special Tokens:
  BOS (Begin): <|begin_of_text|> → ID 128000
  EOS (End):   <|eot_id|> → ID 128009
  PAD (Pad):   None → ID None


### Setting Padding Token

Some tokenizers (like Llama) don't have a default padding token. We need to set one:

In [4]:
# Set padding token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"✅ Set pad_token to eos_token: {tokenizer.pad_token}")
else:
    print(f"✅ Pad token already set: {tokenizer.pad_token}")

✅ Set pad_token to eos_token: <|eot_id|>


---

## 3. Tokenization Examples

Let's see tokenization in action with medical examples:

In [5]:
def visualize_tokenization(text: str, tokenizer):
    """
    Shows how text gets tokenized step-by-step.
    """
    print(f"\n{'='*70}")
    print(f"Original Text:")
    print(f"  '{text}'")
    print(f"\nLength: {len(text)} characters")
    
    # Tokenize
    encoded = tokenizer(text, return_tensors="pt")
    token_ids = encoded['input_ids'][0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    
    print(f"\n{'='*70}")
    print(f"Tokens ({len(tokens)} total):")
    for i, (token, token_id) in enumerate(zip(tokens, token_ids)):
        # Clean up token display (the Llama tokenizer marks a leading space with Ġ)
        display_token = token.replace('Ġ', '▁')  # Use ▁ to show spaces
        print(f"  [{i:2d}] {display_token:20s} → ID: {token_id:6d}")
    
    print(f"\n{'='*70}")
    print(f"Token IDs: {token_ids}")
    print(f"\nCompression Ratio: {len(text)} chars → {len(tokens)} tokens ({len(text)/len(tokens):.1f}x)")
    print(f"{'='*70}")

### Example 1: Simple Medical Sentence

In [6]:
visualize_tokenization("I have a sore throat.", tokenizer)


Original Text:
  'I have a sore throat.'

Length: 21 characters

Tokens (7 total):
  [ 0] <|begin_of_text|>    → ID: 128000
  [ 1] I                    → ID:     40
  [ 2] ▁have                → ID:    617
  [ 3] ▁a                   → ID:    264
  [ 4] ▁sore                → ID:  36366
  [ 5] ▁throat              → ID:  28691
  [ 6] .                    → ID:     13

Token IDs: [128000, 40, 617, 264, 36366, 28691, 13]

Compression Ratio: 21 chars → 7 tokens (3.0x)


### Example 2: Medical Terminology

Watch how complex medical terms get split into multiple tokens:

In [7]:
visualize_tokenization("Patient presents with acute pharyngitis.", tokenizer)


Original Text:
  'Patient presents with acute pharyngitis.'

Length: 40 characters

Tokens (10 total):
  [ 0] <|begin_of_text|>    → ID: 128000
  [ 1] Patient              → ID:  37692
  [ 2] ▁presents            → ID:  18911
  [ 3] ▁with                → ID:    449
  [ 4] ▁acute               → ID:  30883
  [ 5] ▁ph                  → ID:   1343
  [ 6] ary                  → ID:    661
  [ 7] ng                   → ID:    983
  [ 8] itis                 → ID:  20000
  [ 9] .                    → ID:     13

Token IDs: [128000, 37692, 18911, 449, 30883, 1343, 661, 983, 20000, 13]

Compression Ratio: 40 chars → 10 tokens (4.0x)


### Example 3: ICD-10 Codes

ICD codes are crucial for our task - let's see how they tokenize:

In [8]:
visualize_tokenization("Diagnosis: J06.9 (Acute upper respiratory infection)", tokenizer)


Original Text:
  'Diagnosis: J06.9 (Acute upper respiratory infection)'

Length: 52 characters

Tokens (15 total):
  [ 0] <|begin_of_text|>    → ID: 128000
  [ 1] Di                   → ID:  22427
  [ 2] agnosis              → ID:  50915
  [ 3] :                    → ID:     25
  [ 4] ▁J                   → ID:    622
  [ 5] 06                   → ID:   2705
  [ 6] .                    → ID:     13
  [ 7] 9                    → ID:     24
  [ 8] ▁(                   → ID:    320
  [ 9] Ac                   → ID:  11916
  [10] ute                  → ID:   1088
  [11] ▁upper               → ID:   8582
  [12] ▁respiratory         → ID:  42631
  [13] ▁infection           → ID:  19405
  [14] )                    → ID:      8

Token IDs: [128000, 22427, 50915, 25, 622, 2705, 13, 24, 320, 11916, 1088, 8582, 42631, 19405, 8]

Compression Ratio: 52 chars → 15 tokens (3.5x)


### Observations

Notice:
1. **Common words** ("I", "have", "a") are usually single tokens
2. **Medical terms** ("pharyngitis") may split into subwords ("ph" + "ary" + "ng" + "itis")
3. **ICD codes** ("J06.9") typically split into multiple tokens ("J", "06", ".", "9")
4. **Spaces** are part of tokens (shown as ▁)

This is one reason specialized medical tokenizers can help: a vocabulary built from medical text is more likely to keep such terms as one or two tokens.

---

## 4. Chat Templates 💬

### The Problem

Our data is **multi-turn conversations** between doctor and patient. How do we represent this?

```python
# Raw data structure:
{
    "messages": [
        {"role": "doctor", "content": "What brings you here?"},
        {"role": "patient", "content": "I have a fever."}
    ],
    "diagnosis": "J06.9"
}
```

We need to convert this into a format the model understands!

### Chat Templates

Modern instruction-tuned models use **chat templates** - special formatting that indicates roles.

**Llama 3 Chat Template:**
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a medical diagnosis assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Doctor: What brings you here?
Patient: I have a fever.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

J06.9<|eot_id|>
```

The template tells the model:
- Where each message starts/ends
- Who is speaking (system/user/assistant)
- Where to generate the response
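
As a rough illustration of what a chat template does, the hypothetical helper below builds the Llama 3 layout by plain string concatenation. `render_llama3_chat` is a hand-written approximation for reading purposes, not the template shipped with the tokenizer (the real one can add extra content to the system turn); in practice, use `tokenizer.apply_chat_template`.

```python
from typing import Dict, List

def render_llama3_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Approximate the Llama 3 chat layout with string concatenation (illustrative only)."""
    text = "<|begin_of_text|>"
    for msg in messages:
        # Header names the speaker; <|eot_id|> closes the turn.
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        text += f"{msg['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

demo = [
    {"role": "system", "content": "You are a medical diagnosis assistant."},
    {"role": "user", "content": "Doctor: What brings you here?\nPatient: I have a fever."},
]
print(render_llama3_chat(demo, add_generation_prompt=True))
```

For training, the assistant message is included and no generation prompt is added; for inference, the assistant message is left out and the generation prompt marks where the model should start writing.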

### Applying Chat Templates

In [9]:
# Example conversation
conversation = [
    {
        "role": "system",
        "content": "You are a medical diagnosis assistant. Predict the ICD-10 code based on the doctor-patient conversation."
    },
    {
        "role": "user",
        "content": "Doctor: What brings you here today?\nPatient: I have a sore throat and fever for 3 days."
    },
    {
        "role": "assistant",
        "content": "J06.9"
    }
]

# Apply chat template
formatted_text = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,  # Get text first (not token IDs)
    add_generation_prompt=False
)

print("Raw Conversation:")
for msg in conversation:
    print(f"  [{msg['role']:9s}] {msg['content'][:50]}...")

print(f"\n{'='*70}")
print("Formatted with Chat Template:")
print(f"{'='*70}")
print(formatted_text)
print(f"{'='*70}")

Raw Conversation:
  [system   ] You are a medical diagnosis assistant. Predict the...
  [user     ] Doctor: What brings you here today?
Patient: I hav...
  [assistant] J06.9...

Formatted with Chat Template:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Feb 2026

You are a medical diagnosis assistant. Predict the ICD-10 code based on the doctor-patient conversation.<|eot_id|><|start_header_id|>user<|end_header_id|>

Doctor: What brings you here today?
Patient: I have a sore throat and fever for 3 days.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

J06.9<|eot_id|>


### Chat Template + Tokenization

In [10]:
# Now tokenize the formatted text
encoded = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    return_tensors="pt",
    # Ask for a dict (input_ids, attention_mask) explicitly instead of relying on the library default
    return_dict=True
)

# Extract the tensor from the dictionary
input_ids = encoded['input_ids']

print(f"Token IDs Shape: {input_ids.shape}")
print(f"Total Tokens: {input_ids.shape[1]}")
print(f"\nFirst 20 Token IDs: {input_ids[0][:20].tolist()}")

# Decode to verify (pass the tensor, not the dict)
decoded = tokenizer.decode(input_ids[0])
print(f"\nDecoded (verify it matches):")
print(decoded[:200] + "...")

Token IDs Shape: torch.Size([1, 84])
Total Tokens: 84

First 20 Token IDs: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1721]

Decoded (verify it matches):
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Feb 2026

You are a medical diagnosis assistant. Predict the ICD-10 code based on the ...


---

## 5. Processing Our Dataset

Now let's apply this to our actual medical conversation dataset:

In [15]:
# Load dataset (same as notebook 01)
print("Loading MedSynth dataset...")
dataset = load_dataset(
    "Ahmad0067/MedSynth",
    split="train"
)

print(f"✅ Loaded {len(dataset)} examples")

print(f"Available columns: {dataset.column_names}")

# Look at first example.
# MedSynth exposes 'Dialogue' and 'ICD10' columns (not 'messages' / 'diagnosis').
example = dataset[0]
print(f"\nFirst Example:")
print(f"  Diagnosis: {example['ICD10']} ({example['ICD10_desc']})")
print(f"  Dialogue length: {len(example['Dialogue'])} characters")
print(f"\nStart of dialogue:")
print(f"  {example['Dialogue'][:200]}...")

Loading MedSynth dataset...
✅ Loaded 10240 examples
Available columns: [' Note', 'Dialogue', 'ICD10', 'ICD10_desc']

### Creating the Formatting Function

This function converts our dataset format to the chat template format:

In [None]:
def format_conversation(example: Dict, tokenizer) -> Dict:
    """
    Formats a conversation example for instruction tuning.
    
    Input format (MedSynth columns, see the column listing above):
    {
        "Dialogue": "...",   # assumed: the full doctor-patient dialogue as one string
        "ICD10": "J06.9"
    }
    
    Output format:
    [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "<dialogue text>"},
        {"role": "assistant", "content": "J06.9"}
    ]
    """
    # System prompt
    system_prompt = (
        "You are a medical diagnosis assistant. "
        "Based on the doctor-patient conversation, predict the ICD-10 diagnosis code."
    )
    
    # The dataset has no 'messages' / 'diagnosis' keys; use its real columns.
    # Assumption: 'Dialogue' is a single string holding the whole conversation.
    conversation_text = str(example['Dialogue']).strip()
    diagnosis = str(example['ICD10']).strip()
    
    # Build chat format
    chat = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": conversation_text},
        {"role": "assistant", "content": diagnosis}
    ]
    
    # Apply chat template and tokenize.
    # return_dict=True makes the return type explicit and also yields the attention mask.
    formatted = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        padding="max_length",
        max_length=512,
        truncation=True,
        return_tensors="pt",
        return_dict=True
    )
    input_ids = formatted["input_ids"][0]
    attention_mask = formatted["attention_mask"][0]
    
    # Labels start as a copy of input_ids; padded positions are set to -100 so the
    # loss ignores them. The attention mask is used because pad_token == eos_token,
    # so comparing against pad_token_id would also hide real <|eot_id|> tokens.
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100
    
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "diagnosis": diagnosis
    }

# Test on first example
formatted_example = format_conversation(example, tokenizer)
print(f"✅ Formatted Example:")
print(f"  Input IDs Shape: {formatted_example['input_ids'].shape}")
print(f"  Labels Shape: {formatted_example['labels'].shape}")
print(f"  Diagnosis: {formatted_example['diagnosis']}")

### Verify Formatting

Let's decode the tokens to verify the formatting is correct:

In [None]:
decoded_text = tokenizer.decode(formatted_example['input_ids'])
print("Decoded Formatted Text:")
print("="*70)
print(decoded_text)
print("="*70)

---

## 6. Padding and Truncation

### The Problem

To process several conversations in one batch, they must be stacked into a single rectangular tensor, so every sequence in the batch needs the **same length**. But our conversations have different lengths!

```
Conversation 1: "I have a cold."              → 15 tokens
Conversation 2: "I have chest pain..." (long) → 247 tokens
Conversation 3: "Fever."                      → 8 tokens
```

We can't stack these into one batch as they are - they must all be the same length.

### Solution: Padding

**Padding** adds padding tokens (here `<|eot_id|>`, since we set `pad_token` to `eos_token` above) to make all sequences the same length:

```
Target length: 512 tokens

Conversation 1: [tokens...] + [PAD] * 497  → 512 tokens
Conversation 2: [tokens...] + [PAD] * 265  → 512 tokens
Conversation 3: [tokens...] + [PAD] * 504  → 512 tokens
```
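
The padding arithmetic above can be sketched in plain Python. `pad_to_length` is a hypothetical helper that pads on the right and also returns the attention mask (1 = real token, 0 = padding); the tokenizer does the equivalent internally when `padding="max_length"` is set.

```python
from typing import List, Tuple

def pad_to_length(token_ids: List[int], max_length: int, pad_id: int) -> Tuple[List[int], List[int]]:
    """Right-pad `token_ids` to `max_length`; return (padded_ids, attention_mask)."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return padded, attention_mask

PAD_ID = 128009  # <|eot_id|>, reused as the pad token in this notebook
short_conversation = list(range(15))  # stand-in for 15 real token IDs

padded, mask = pad_to_length(short_conversation, max_length=512, pad_id=PAD_ID)
print(len(padded), "tokens total,", mask.count(0), "of them padding")
```

Keeping the mask alongside the IDs matters here because the pad ID is not unique to padding.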

### Solution: Truncation

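**Truncation** cuts off tokens that exceed the maximum length. By default, tokenizers truncate on the right, so the end of the sequence is what gets lost; in our format that is where the assistant's diagnosis sits: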

```
Very long conversation: 1024 tokens
Max length: 512 tokens

Truncated: [first 512 tokens]  # Last 512 tokens discarded
```
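
A minimal sketch of truncation from either side, using a hypothetical `truncate` helper on plain lists. Hugging Face tokenizers expose a comparable choice through their `truncation_side` attribute; which side is appropriate depends on where the information you need sits in the sequence.

```python
from typing import List

def truncate(token_ids: List[int], max_length: int, side: str = "right") -> List[int]:
    """Keep at most `max_length` tokens, dropping from the given side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "right":
        return list(token_ids[:max_length])    # keep the beginning
    if side == "left":
        return list(token_ids[-max_length:])   # keep the end
    raise ValueError("side must be 'right' or 'left'")

long_conversation = list(range(1024))  # stand-in for 1024 token IDs

kept_start = truncate(long_conversation, 512, side="right")
kept_end = truncate(long_conversation, 512, side="left")
print(kept_start[0], kept_start[-1])  # first and last surviving position
print(kept_end[0], kept_end[-1])
```

Truncating on the left keeps the final tokens (including the answer) but drops the system prompt, so neither side is free; checking how many examples exceed `max_length` is the safer first step.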

### Analyzing Sequence Lengths

In [None]:
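# Process all examples to get token lengths
print("Analyzing token lengths across dataset...\n")

# Measure untruncated, unpadded lengths: padded output would hide real <|eot_id|> tokens
# (pad and EOS share an ID), and truncated sequences can never exceed max_length.
# Assumes the MedSynth columns 'Dialogue' (one string) and 'ICD10'.
system_prompt = "You are a medical diagnosis assistant. Based on the doctor-patient conversation, predict the ICD-10 diagnosis code."
token_lengths = []
for example in dataset:
    chat = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": str(example['Dialogue'])},
        {"role": "assistant", "content": str(example['ICD10'])},
    ]
    text = tokenizer.apply_chat_template(chat, tokenize=False)
    token_lengths.append(len(tokenizer(text, add_special_tokens=False)['input_ids']))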

# Statistics
token_lengths = np.array(token_lengths)
print(f"Token Length Statistics:")
print(f"  Min:     {token_lengths.min():4d} tokens")
print(f"  Max:     {token_lengths.max():4d} tokens")
print(f"  Mean:    {token_lengths.mean():6.1f} tokens")
print(f"  Median:  {np.median(token_lengths):6.1f} tokens")
print(f"  Std Dev: {token_lengths.std():6.1f} tokens")
print(f"\nPercentiles:")
for p in [50, 75, 90, 95, 99]:
    print(f"  {p:2d}%: {np.percentile(token_lengths, p):6.1f} tokens")

### Visualizing Length Distribution

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1.hist(token_lengths, bins=30, edgecolor='black', alpha=0.7)
ax1.axvline(token_lengths.mean(), color='red', linestyle='--', label=f'Mean: {token_lengths.mean():.0f}')
ax1.axvline(np.median(token_lengths), color='green', linestyle='--', label=f'Median: {np.median(token_lengths):.0f}')
ax1.set_xlabel('Number of Tokens')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Tokenized Sequence Lengths')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Box plot
ax2.boxplot(token_lengths, vert=True)
ax2.set_ylabel('Number of Tokens')
ax2.set_title('Token Length Box Plot')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n💡 Insight: Most conversations are between {np.percentile(token_lengths, 25):.0f}-{np.percentile(token_lengths, 75):.0f} tokens")
print(f"           A max_length of 512 tokens captures {(token_lengths <= 512).sum() / len(token_lengths) * 100:.1f}% of conversations")

### Choosing Max Length

Based on the distribution, we choose `max_length=512` because:
1. Captures most conversations without truncation
2. Balances between memory efficiency and information retention
3. Standard size for many LLMs (powers of 2 are efficient)

**Trade-offs:**
- **Smaller** (256): Faster training, but truncates more data
- **Larger** (1024): Preserves more data, but needs roughly 2x the activation memory per example (more where attention memory grows quadratically) and trains slower
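
One way to make this choice concrete is to compute, for each candidate `max_length`, the share of sequences that fit without truncation. The sketch below uses a hypothetical `coverage_by_max_length` helper and made-up lengths purely to exercise it; with real data, pass in the measured token lengths instead.

```python
from typing import Dict, List

def coverage_by_max_length(lengths: List[int], candidates: List[int]) -> Dict[int, float]:
    """Fraction of sequences with length <= max_length, for each candidate."""
    total = len(lengths)
    return {m: sum(1 for n in lengths if n <= m) / total for m in candidates}

# Hypothetical token lengths, NOT measurements from the dataset.
sample_lengths = [120, 180, 240, 310, 350, 420, 480, 530, 700, 1100]

coverage = coverage_by_max_length(sample_lengths, [256, 512, 1024])
for max_len, frac in coverage.items():
    print(f"max_length={max_len:4d} keeps {frac:.0%} of sequences untruncated")
```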

---

## 7. Batching for Efficiency

### Why Batching?

Processing one example at a time leaves most of the GPU idle and is **much slower** overall. With illustrative (not measured) timings:

```
Sequential processing (batch_size=1):
  Example 1: 50ms
  Example 2: 50ms
  ...
  Total for 1000 examples: 50,000ms = 50 seconds

Batched processing (batch_size=32):
  Batch 1 (32 examples): 100ms
  Batch 2 (32 examples): 100ms
  ...
  Total for 1000 examples: ~3,200ms = 3.2 seconds
```

**~15x speedup** in this illustration! 🚀
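
The arithmetic behind that comparison, as a small sketch. The per-batch timings are the illustrative figures from the example above, not measurements, and `total_time_ms` is a hypothetical helper.

```python
import math

def total_time_ms(n_examples: int, batch_size: int, ms_per_batch: float) -> float:
    """Total time when every batch (including a final partial one) costs ms_per_batch."""
    n_batches = math.ceil(n_examples / batch_size)
    return n_batches * ms_per_batch

sequential = total_time_ms(1000, batch_size=1, ms_per_batch=50)
batched = total_time_ms(1000, batch_size=32, ms_per_batch=100)
speedup = sequential / batched

print(f"sequential: {sequential:,.0f} ms, batched: {batched:,.0f} ms, speedup: {speedup:.1f}x")
```

Note that 1000 examples at batch size 32 need 32 batches, because the last batch holds only 8 examples but still costs a full step in this simple model.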

### How Batching Works

A **batch** is a group of examples processed together:

```python
# Single example
input_ids: [512]          # Shape: (sequence_length,)

# Batch of 32 examples  
input_ids: [32, 512]      # Shape: (batch_size, sequence_length)
```

GPUs are designed for **parallel processing** - they excel at matrix operations on batches!
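
Stacking only works once every sequence in the batch has the same length. This notebook pads everything to a fixed `max_length`; a common alternative, sketched below with a hypothetical `collate_dynamic` helper, is to pad each batch only up to its own longest sequence (Hugging Face's `DataCollatorWithPadding` follows the same idea). This is shown as a design option, not as what the cells below do.

```python
from typing import Dict, List

def collate_dynamic(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad each sequence to the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = collate_dynamic([[11, 12, 13], [21], [31, 32]], pad_id=0)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

The batch is still rectangular, with shape (batch_size, longest), but short batches no longer carry hundreds of padding positions.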

### Creating Batches

In [None]:
# Process first 8 examples into a batch
batch_size = 8
batch_examples = dataset.select(range(batch_size))

# Format each example
formatted_batch = [format_conversation(ex, tokenizer) for ex in batch_examples]

# Stack into tensors
batch_input_ids = torch.stack([ex['input_ids'] for ex in formatted_batch])
batch_labels = torch.stack([ex['labels'] for ex in formatted_batch])
batch_diagnoses = [ex['diagnosis'] for ex in formatted_batch]

print(f"Batch Created:")
print(f"  Input IDs Shape: {batch_input_ids.shape}")
print(f"  Labels Shape:    {batch_labels.shape}")
print(f"  Batch Size:      {batch_input_ids.shape[0]}")
print(f"  Sequence Length: {batch_input_ids.shape[1]}")
print(f"\nDiagnoses in batch: {batch_diagnoses}")

### Visualizing Padding in a Batch

Let's see where the padding tokens are:

In [None]:
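# Create mask: True where token is NOT padding.
# Caveat: pad_token was set to eos_token (<|eot_id|>), so this comparison also flags
# the real <|eot_id|> tokens as padding; the tokenizer's own attention_mask is exact.
attention_mask = (batch_input_ids != tokenizer.pad_token_id).int()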

# Count real tokens per example
real_tokens = attention_mask.sum(dim=1)

print("Tokens per example in batch:")
for i, (n_tokens, diagnosis) in enumerate(zip(real_tokens, batch_diagnoses)):
    padding = batch_input_ids.shape[1] - n_tokens.item()
    print(f"  Example {i}: {n_tokens:3d} real tokens + {padding:3d} padding = 512 total | Diagnosis: {diagnosis}")

# Visualize as heatmap
plt.figure(figsize=(12, 6))
plt.imshow(attention_mask.numpy(), aspect='auto', cmap='RdYlGn', interpolation='nearest')
plt.colorbar(label='Token Type', ticks=[0, 1])
plt.clim(0, 1)
plt.xlabel('Token Position')
plt.ylabel('Example in Batch')
plt.title('Batch Attention Mask (Green=Real Token, Red=Padding)')
plt.tight_layout()
plt.show()

print(f"\n💡 Notice: Different examples have different amounts of padding!")
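
One subtlety when the pad token shares an ID with `<|eot_id|>`, as it does in this notebook: a mask rebuilt from `input_ids != pad_token_id` undercounts real tokens. The toy example below uses the token IDs from Example 1 plus a closing `<|eot_id|>`; the four padding positions are an arbitrary choice for illustration.

```python
EOT_ID = 128009   # <|eot_id|>
PAD_ID = EOT_ID   # pad_token was set to eos_token, so both share one ID

# 8 real tokens: BOS, "I have a sore throat.", and a closing <|eot_id|>
real_tokens = [128000, 40, 617, 264, 36366, 28691, 13, EOT_ID]
padded = real_tokens + [PAD_ID] * 4

# Mask rebuilt by comparing against the pad ID (miscounts the real <|eot_id|>)
mask_from_ids = [int(tok != PAD_ID) for tok in padded]
# Mask recorded at padding time (what a tokenizer returns as attention_mask)
true_mask = [1] * len(real_tokens) + [0] * 4

print("rebuilt mask counts", sum(mask_from_ids), "real tokens")
print("true mask counts   ", sum(true_mask), "real tokens")
```

Using the `attention_mask` returned by the tokenizer avoids the discrepancy.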

---

## 8. The Complete Processing Pipeline

Let's put everything together into a reusable data processing class:

In [None]:
class MedicalDataProcessor:
    """
    Complete data processing pipeline for medical diagnosis.
    
    Handles:
    1. Loading tokenizer
    2. Formatting conversations with chat templates
    3. Tokenization
    4. Padding/truncation
    5. Batching
    """
    
    def __init__(self, model_name: str, max_length: int = 512):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.max_length = max_length
        
        print(f"✅ Initialized MedicalDataProcessor")
        print(f"   Model: {model_name}")
        print(f"   Max Length: {max_length}")
        print(f"   Vocab Size: {len(self.tokenizer):,}")
    
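    def format_example(self, example: Dict) -> Dict:
        """Format a single example (expects the MedSynth 'Dialogue' and 'ICD10' columns)."""
        system_prompt = (
            "You are a medical diagnosis assistant. "
            "Based on the doctor-patient conversation, predict the ICD-10 diagnosis code."
        )
        
        # Assumption: 'Dialogue' holds the whole conversation as one string.
        conversation_text = str(example['Dialogue']).strip()
        diagnosis = str(example['ICD10']).strip()
        
        chat = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": conversation_text},
            {"role": "assistant", "content": diagnosis}
        ]
        
        # return_dict=True makes the return type explicit and yields the attention mask
        formatted = self.tokenizer.apply_chat_template(
            chat,
            tokenize=True,
            padding="max_length",
            max_length=self.max_length,
            truncation=True,
            return_tensors="pt",
            return_dict=True
        )
        input_ids = formatted["input_ids"][0]
        
        # Ignore padded positions in the loss. The attention mask is used because
        # pad_token == eos_token, so real <|eot_id|> tokens must stay visible.
        labels = input_ids.clone()
        labels[formatted["attention_mask"][0] == 0] = -100
        
        return {
            "input_ids": input_ids,
            "labels": labels,
            "diagnosis": diagnosis
        }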
    
    def process_dataset(self, dataset, batch_size: int = 32):
        """Process entire dataset into batches."""
        print(f"\nProcessing {len(dataset)} examples...")
        
        # Format all examples
        formatted = [self.format_example(ex) for ex in dataset]
        
        # Create batches
        batches = []
        for i in range(0, len(formatted), batch_size):
            batch_data = formatted[i:i+batch_size]
            batch = {
                'input_ids': torch.stack([ex['input_ids'] for ex in batch_data]),
                'labels': torch.stack([ex['labels'] for ex in batch_data]),
                'diagnoses': [ex['diagnosis'] for ex in batch_data]
            }
            batches.append(batch)
        
        print(f"✅ Created {len(batches)} batches of size {batch_size}")
        return batches

# Test the processor
processor = MedicalDataProcessor(model_name="meta-llama/Llama-3.2-3B-Instruct")
batches = processor.process_dataset(dataset, batch_size=16)

print(f"\nFirst Batch:")
print(f"  Input IDs: {batches[0]['input_ids'].shape}")
print(f"  Labels:    {batches[0]['labels'].shape}")
print(f"  Diagnoses: {batches[0]['diagnoses'][:3]}...")
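
The slicing loop inside `process_dataset` can also be written as a generator, which yields one batch at a time instead of building the full list first. `iter_batches` is a hypothetical helper shown on plain integers; the last batch is smaller whenever the dataset size is not a multiple of `batch_size`.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

batch_sizes = [len(batch) for batch in iter_batches(list(range(10)), batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```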

---

## 9. Key Takeaways 💡

### What We Learned

1. **Tokenization Basics**
   - Text → Tokens → Token IDs
   - Subword tokenization handles rare words
   - Each model has its own vocabulary

2. **Chat Templates**
   - Format multi-turn conversations for instruction models
   - Include special tokens for roles (system/user/assistant)
   - Critical for model to understand conversation structure

3. **Padding & Truncation**
   - Sequences in a batch must share one length
   - Padding fills short sequences
   - Truncation cuts long sequences
   - Choose `max_length` based on data distribution

4. **Batching**
   - Process multiple examples in parallel
   - Often an order of magnitude or more faster on GPUs (~15x in the illustration above)
   - Larger batches = faster, but more memory

### Why This Matters for Our Project

**For Training:**
- Proper tokenization ensures model learns meaningful patterns
- Chat templates teach model to generate diagnoses after conversations
- Batching makes training feasible (hours instead of days)

**For Evaluation:**
- Consistent formatting between training and inference
- Same tokenizer must be used for same model
- Efficient batching speeds up evaluation

### Common Issues

❌ **Wrong tokenizer**: Model won't understand inputs  
❌ **No chat template**: Model confused about conversation structure  
❌ **Wrong max_length**: Either truncate too much or waste memory  
❌ **No padding token**: Batching fails  
❌ **Batch size too large**: CUDA out of memory  

---

## 10. What's Next? 👉

Now that we understand how to process our data, we're ready to:

1. **Evaluate Large Language Models (LLMs)** - Test 7-8B models zero-shot
   - Load models with quantization
   - Run inference on test set
   - Measure baseline performance

2. **Train Small Language Models (SLMs)** - Finetune 3B models with LoRA
   - Set up LoRA adapters
   - Run training loop
   - Monitor progress

3. **Compare Results** - Does specialization beat size?
   - Accuracy metrics
   - Speed comparison
   - Memory usage

**Next Notebook:** [03_LLM_Evaluation_ZeroShot.ipynb](03_LLM_Evaluation_ZeroShot.ipynb)

---

## Summary

In this notebook, we covered:

- ✅ What tokenization is and why it's needed
- ✅ How tokenizers work (vocabulary, subwords, special tokens)
- ✅ Chat templates for multi-turn conversations
- ✅ Padding and truncation strategies
- ✅ Batching for GPU efficiency
- ✅ Complete data processing pipeline

**Key Files in Project:**
- `src/data_processing/processor.py` - Similar to our `MedicalDataProcessor`
- `src/config/base_config.py` - Contains `max_length`, `batch_size` settings

---

**Continue to:** [03_LLM_Evaluation_ZeroShot.ipynb](03_LLM_Evaluation_ZeroShot.ipynb) 🚀