# Advanced PaliGemma Fine-tuning: Custom Metrics & Systematic Hyperparameter Optimization

*Going beyond "loss goes down = model works better"*

Most fine-tuning tutorials stop at the basics: load model, train, watch loss decrease, call it done. But here's what I've learned from taking models from prototype to production—the real work starts where most tutorials end.

This notebook explores the evaluation and experimentation approaches that bridge the gap between research demos and systems you'd actually deploy. We'll dive deeper into the systematic analysis that helps you understand not just *if* your model is improving, but *how* and *why*.

**The Challenge**: Standard training metrics tell you surprisingly little about real-world performance. I've seen models with beautiful loss curves fail catastrophically on edge cases, generate inconsistent outputs, or perform poorly on the specific criteria that actually matter for the use case.

**Our Approach**: Instead of relying on default metrics and hoping for the best, we'll implement custom evaluation frameworks and run ablation studies to better understand how different metrics correlate with downstream performance.

## What We'll Explore

**📊 Custom Metrics for Real-World Performance**
- Build task-specific evaluation metrics that actually correlate with production performance
- Implement memory-efficient computation strategies to track complex metrics during training
- Learn to identify quality issues early rather than after expensive training runs

**🔬 Systematic Hyperparameter Analysis**
- Design principled ablation studies that isolate the impact of individual configuration choices
- Quantify the trade-offs between model size, capacity, training efficiency, and final performance
- Understand when to adjust LoRA rank vs. alpha scaling vs. training duration for optimal results


By the end, you'll have a comprehensive framework for evaluating multimodal models during training and the experimental methodology to make data-driven decisions about hyperparameter optimization—moving beyond trial-and-error to systematic improvement.

> **📚 Prerequisites**: This is an advanced tutorial that builds on fundamental concepts. If you're new to PaliGemma fine-tuning or PEFT methods, I recommend starting with the simpler tutorial first: [**PaliGemma Fine-tuning with QLORA and PEFT**](peft_paligemma_im2json_qlora_SFT.ipynb)
>
> The basic tutorial covers essential concepts like model loading, data preparation, and standard training workflows that we'll extend here with custom metrics and systematic experimentation.

## Image-to-JSON PaliGemma Fine-tuning with QLoRA using Advanced Metrics Tracking

Fine-tuning PaliGemma-3B on receipt extraction with QLoRA, enhanced by custom evaluation metrics that go beyond standard loss functions.

### Setup and Dependencies


In [None]:
import json
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import torch
from datasets import load_dataset
from dotenv import load_dotenv
from huggingface_hub import login as hf_login
from peft import get_peft_model, LoraConfig
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
)
from transformers.trainer_callback import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer
import wandb

from metrics import JSONMetrics
from helper import extract_json_from_llm_output


# Clear CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()

### Authentication Setup

Since we will be running the code on a remote instance, we want to minimize console interaction and log in automatically.

In [None]:
def setup_authentication(env_path=None):
    """
    Set up authentication for Hugging Face Hub and Weights & Biases.
    
    Args:
        env_path (str, optional): Path to .env file. If None, uses the current working directory.
    
    Returns:
        dict: Status of authentication attempts for each service
    """
    # Default to a .env file in the current working directory
    # (__file__ is not defined when this runs inside a notebook)
    if env_path is None:
        env_path = os.path.join(os.getcwd(), '.env')
    
    # Load environment variables
    load_dotenv(env_path)
    
    auth_status = {}
    
    # Authenticate with Hugging Face
    hf_token = os.getenv('HF_TOKEN')
    if hf_token:
        try:
            hf_login(token=hf_token, add_to_git_credential=True)
            auth_status['huggingface'] = 'success'
            print("Successfully logged in to Hugging Face Hub")
        except Exception as e:
            auth_status['huggingface'] = f'error: {str(e)}'
            print(f"Error logging in to Hugging Face Hub: {e}")
    else:
        auth_status['huggingface'] = 'no_token'
        print("HF_TOKEN not found in environment variables. You may need to log in manually.")
    
    # Authenticate with Weights & Biases
    wandb_api_key = os.getenv('WANDB_API_KEY')
    if wandb_api_key:
        try:
            wandb.login(key=wandb_api_key, relogin=True)
            auth_status['wandb'] = 'success'
            print("Successfully logged in to Weights & Biases")
        except Exception as e:
            auth_status['wandb'] = f'error: {str(e)}'
            print(f"Error logging in to Weights & Biases: {e}")
    else:
        auth_status['wandb'] = 'no_token'
        print("WANDB_API_KEY not found in environment variables. You'll be prompted to log in if needed.")
    
    return auth_status

In [None]:
# Set up authentication for services
auth_status = setup_authentication()

### Setting some variables

In [None]:
MAX_LENGTH = 512
device = "cuda"
model_id = "google/paligemma-3b-pt-224"

### Loading CORD dataset

In [None]:
# Load the CORD v2 receipt dataset and select the train and validation splits
cord_ds = load_dataset("naver-clova-ix/cord-v2")
cord_trains_ds = cord_ds["train"]
cord_validation_ds = cord_ds["validation"]

### Custom Collation Function

In [None]:

processor = PaliGemmaProcessor.from_pretrained(model_id)
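def collate_fn(examples):
    # Same instruction prompt for every example
    texts = ["<image> <bos> Extract JSON " for _ in examples]
    images = [example["image"].convert("RGB") for example in examples]
    # Serialize targets with json.dumps so they are valid JSON;
    # str(dict) yields single-quoted Python reprs that json.loads cannot parse
    labels = [
        json.dumps(json.loads(example["ground_truth"])["gt_parse"], ensure_ascii=False)
        for example in examples
    ]

    tokens = processor(
        text=texts,
        images=images,
        suffix=labels,
        return_tensors="pt",
        padding="max_length",  # Pad all sequences to max_length
        truncation=True,
        max_length=MAX_LENGTH,
    )
    # Cast floating-point tensors (pixel values) to bfloat16 and move the batch to the GPU
    tokens = tokens.to(torch.bfloat16).to(device)
    return tokens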

## Memory-Efficient Training Strategies
### Quantization and LoRA Configuration

In [None]:
# Configure BitsAndBytes for 4-bit quantization (CUDA only)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Enable double quantization
)
    
# LoRA configuration
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
    lora_alpha=32,  # Alpha parameter for LoRA scaling
    lora_dropout=0.05  # Add dropout for better regularization
)

## Loading the Model with Quantization and LoRA
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # trainable params: 11,298,816 || all params: 2,934,765,296 || trainable%: 0.3850


## Configuring Supervised Fine-Tuning

In [None]:
args = SFTConfig(
    # Model output configuration
    output_dir="paligemma-imgtojson-0002", # Directory to save checkpoints and repository ID for HuggingFace Hub
    
    # Training duration parameters
    num_train_epochs=3,                  # 3 epochs balances training time and performance for this task
                                         # Alternative: Fewer epochs (1-2) might underfit; more epochs (4+) risk overfitting
                                         # especially with small datasets or when using powerful models like PaliGemma
    
    # Batch size configuration
    per_device_train_batch_size=4,       # Optimal for specific GPU used during training
    per_device_eval_batch_size=4,        # Matching train batch size ensures consistent evaluation
    
    # Gradient optimization
    gradient_accumulation_steps=8,       # Simulates larger batch size (effective batch = 4*8 = 32) without extra memory
    gradient_checkpointing=True,         # Trades computation for memory by not storing all activations
    
    # Optimizer settings
    optim="adamw_torch_fused",           # Fused implementation is faster on CUDA devices
    learning_rate=2e-4,                  # 2e-4 follows the learning rate used in the QLoRA paper for smaller models
                                         # Alternative: Lower (5e-5) for more stability but slower convergence
    
    # Precision settings
    bf16=True,                           # BFloat16 offers better numerical stability than FP16 while saving memory
    tf32=False,                          # TF32 precision disabled as it's not supported on P100
    
    # Learning rate schedule
    lr_scheduler_type="cosine",          # Cosine decay provides smooth learning rate reduction
    max_grad_norm=0.3,                   # 0.3 prevents gradient explosion based on QLoRA paper
    warmup_ratio=0.03,                   # 3% warmup helps stabilize early training
    
    # Advanced evaluation settings
    evaluation_strategy="steps",         # Evaluate at regular step intervals rather than epochs
    eval_steps=5,                        # Evaluate every 5 steps for quick feedback on model performance
    save_strategy="steps",               # Save at regular step intervals
    save_steps=10,                       # Save every 10 steps to balance checkpoint frequency and storage
    logging_steps=5,                     # Log metrics every 5 steps for detailed training progress

    # Early stopping (currently disabled)
    # load_best_model_at_end=True,       # When enabled, loads the best model according to metric_for_best_model
    metric_for_best_model="eval_loss",   # Evaluation loss is a good general metric for model quality
    greater_is_better=False,             # Loss function is decreasing, lower values are better

    # Memory optimization
    dataloader_pin_memory=False,         # Disabled to reduce memory pressure
                                         # Alternative: True could speed up data transfer to GPU but uses more memory
    dataloader_num_workers=0,            # Single-process data loading to avoid memory duplication
    remove_unused_columns=True,          # Removes unused columns

    # Experiment tracking
    push_to_hub=True,                    # Automatically push model to Hugging Face Hub for sharing/deployment
    report_to="wandb" if os.getenv('WANDB_API_KEY') else "none", # Use W&B for experiment tracking if API key exists
    
    # SFT specific parameters
    dataset_text_field="", # need a dummy field for collator
    dataset_kwargs={"skip_prepare_dataset": True}, # important for collator
    gradient_checkpointing_kwargs={"use_reentrant": False}, # use non-reentrant checkpointing
)


## Introducing Custom Metrics

### Custom Metrics: Beyond Loss Functions

Standard loss functions optimize for token-level accuracy, but for structured outputs like JSON, we need metrics that capture what actually matters in production:

**The Production Reality Check**:
- **JSON Validity**: A model with great loss might generate unparseable JSON 
- **Structural Consistency**: Fields might be missing, nested incorrectly, or have wrong data types
- **Value Accuracy**: Even valid JSON can have incorrect extracted values
- **Field Coverage**: Critical information might be systematically missed

#### Our Multi-Dimensional Evaluation Framework

We implement **seven complementary metrics** through a custom `JSONMetrics` class (defined in `metrics.py`):

**📊 Core Quality Metrics**:
- **Structure Similarity**: Jaccard similarity of JSON keys—measures if the model captures the expected schema
- **Value Accuracy**: Exact match percentage for shared keys—the gold standard for extraction tasks
- **Field F1 Score**: Balances precision/recall of field detection—critical for incomplete extractions

**🔍 Robustness Metrics**:
- **Value Similarity**: Fuzzy matching using BLEU scores and string similarity—handles minor OCR variations and semantic differences
- **Edit Distance**: Normalized Levenshtein distance between JSON strings—quantifies how many edits are needed (lower is better)

**📈 Coverage Metrics**:
- **Field Precision**: Measures how many predicted fields are actually correct
- **Field Recall**: Measures how many ground truth fields were successfully captured

**🎯 Composite Score**: Weighted combination of all metrics provides a single quality indicator.

By tracking these metrics during training, we can understand not just *if* our model is improving, but *how* different aspects of extraction quality evolve—enabling targeted hyperparameter adjustments.
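
To make the key-level metrics above more concrete, here is a minimal sketch of how structure similarity and field precision/recall/F1 could be computed from top-level keys. The function names `key_jaccard` and `field_prf` are hypothetical and are not the `JSONMetrics` API from `metrics.py`, which may flatten nested keys or weight fields differently.

```python
def key_jaccard(pred: dict, truth: dict) -> float:
    """Jaccard similarity between the top-level key sets of two JSON objects."""
    pred_keys, truth_keys = set(pred), set(truth)
    union = pred_keys | truth_keys
    if not union:
        return 1.0  # two empty objects have identical structure
    return len(pred_keys & truth_keys) / len(union)


def field_prf(pred: dict, truth: dict) -> tuple:
    """Precision, recall and F1 of predicted keys against ground-truth keys."""
    pred_keys, truth_keys = set(pred), set(truth)
    hits = len(pred_keys & truth_keys)
    precision = hits / len(pred_keys) if pred_keys else 0.0
    recall = hits / len(truth_keys) if truth_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Illustrative receipt-like objects (made up for this example)
truth = {"menu": [], "sub_total": {}, "total": {}}
pred = {"menu": [], "total": {}, "tax": {}}

print("structure similarity:", key_jaccard(pred, truth))
print("precision, recall, f1:", field_prf(pred, truth))
```

Here the prediction misses `sub_total` and adds a spurious `tax` field, so both precision and recall drop even though the output is perfectly valid JSON.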

## Implementing Custom Metrics in SFTTrainer

The key to production-ready fine-tuning is tracking metrics that actually matter for your use case. Here's how we integrate our custom JSON evaluation metrics directly into the training loop.

### Step 1: Understanding the Trainer Integration

The `SFTTrainer` accepts two crucial arguments for custom evaluation:
- **`compute_metrics`**: Function called during evaluation to calculate custom metrics
- **`preprocess_logits_for_metrics`**: Memory optimization function to prevent OOM errors

Let's implement both step by step.

In [None]:
def extract_json_from_string(text: str):
    """
    Helper function to extract JSON from a string.
    This matches the pattern used in extract_json_from_llm_output.
    """
    try:
        # Try to parse the entire string as JSON first
        return json.loads(text)
    except json.JSONDecodeError:
        # If that fails, try to find JSON within the string
        json_pattern = r'\{.*\}'
        match = re.search(json_pattern, text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
        return None

def compute_metrics(eval_preds) -> dict:
    """
    Calculate custom JSON metrics during evaluation.
    
    This function is called by SFTTrainer during evaluation steps to compute
    task-specific metrics beyond standard loss functions.
    
    Args:
        eval_preds: Object containing predictions and labels
            - predictions[0]: Predicted token indices (after preprocess_logits_for_metrics)
            - label_ids: Ground truth token indices
            
    Returns:
        dict: Aggregated metrics for logging to experiment tracker
    """
    labels = eval_preds.label_ids
    predictions = eval_preds.predictions[0]

    # Initialize metrics tracking
    json_metrics = JSONMetrics()
    tokenizer = processor.tokenizer
    
    # Sample subset to prevent evaluation slowdown
    batch_size = len(predictions)
    sample_size = min(30, batch_size)  # Process max 30 examples per evaluation
    sample_indices = np.random.choice(batch_size, sample_size, replace=False)

    # Collect metrics for each example
    metrics_lists = {
        'structure_similarities': [],
        'value_accuracies': [],
        'field_f1_scores': [],
        'field_recalls': [],
        'field_precisions': [],    
        'value_similarities': [],
        'edit_distances': [],
        'overall_scores': []
    }
    
    valid_json_count = 0
    
    # Process each sampled example
    for idx in sample_indices:
        try:
            # Handle potential shape mismatches
            if idx >= len(labels):
                continue
                
            # Get tokens and remove padding
            pred_tokens = predictions[idx][predictions[idx] != -100]
            label_tokens = labels[idx][labels[idx] != -100]
            
            # Decode to text
            pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)
            label_text = tokenizer.decode(label_tokens, skip_special_tokens=True)
            
            # Extract JSON from both prediction and ground truth
            pred_json = extract_json_from_llm_output(pred_text)
            label_json = extract_json_from_string(label_text)
            
            # Only calculate metrics if both are valid JSON
            if pred_json is not None and label_json is not None:
                valid_json_count += 1
                
                # Calculate all metrics using our custom JSONMetrics class
                result = json_metrics.calculate_overall_score(pred_json, label_json)
                
                # Collect each metric (using correct field names from metrics.py)
                metrics_lists['structure_similarities'].append(result['structure_similarity'])
                metrics_lists['value_accuracies'].append(result['value_accuracy'])
                metrics_lists['field_f1_scores'].append(result['field_f1'])
                metrics_lists['field_recalls'].append(result['field_recall'])
                metrics_lists['field_precisions'].append(result['field_precision'])
                metrics_lists['value_similarities'].append(result['value_similarity'])
                metrics_lists['edit_distances'].append(result['edit_distance'])
                metrics_lists['overall_scores'].append(result['overall_score'])
                
        except Exception as e:
            print(f"Error processing example {idx}: {e}")
            continue
    
    # Calculate final aggregated metrics
    final_metrics = {'json_validity': valid_json_count / sample_size if sample_size > 0 else 0.0}
    
    # Average all collected metrics
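    for metric_name, values in metrics_lists.items():
        # Singularize the key ("...ies" -> "...y", otherwise drop the final "s");
        # str.rstrip('s') would leave names such as "value_accuracie"
        if metric_name.endswith('ies'):
            key = metric_name[:-3] + 'y'
        else:
            key = metric_name[:-1]
        final_metrics[key] = float(np.mean(values)) if values else 0.0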
    
    return final_metrics

### Step 2: Memory Optimization for Large Models

When working with large models, storing full logits in memory can cause OOM errors. The `preprocess_logits_for_metrics` function solves this by converting logits to predictions immediately.

In [None]:
def preprocess_logits_for_metrics(logits, labels):
    """
    🚨 CRITICAL MEMORY OPTIMIZATION 🚨
    
    Convert raw logits to predicted token indices before storing in memory.
    This prevents OOM errors when evaluating large models.
    
    Without this function, the trainer would store the full logits tensor
    (shape: [batch_size, sequence_length, vocab_size]) which can be massive.
    
    Args:
        logits (tuple): Model output logits - typically (logits_tensor,)
        labels (torch.Tensor): Ground truth labels
        
    Returns:
        tuple: (predicted_token_indices, labels)
    """
    # Convert logits to predicted token indices (much smaller memory footprint)
    predictions = torch.argmax(logits[0], dim=-1)
    return predictions, labels
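
To get a feel for the savings, the back-of-the-envelope sketch below compares the memory needed to keep full logits versus argmax token IDs across an evaluation set. All numbers are illustrative assumptions rather than measurements: 100 evaluation examples, 512 positions per example, a vocabulary of 257,152 tokens, float32 logits, and int64 token IDs.

```python
def tensor_bytes(num_examples, seq_len, last_dim, bytes_per_value):
    """Size in bytes of a dense [num_examples, seq_len, last_dim] tensor."""
    return num_examples * seq_len * last_dim * bytes_per_value


# Assumed sizes, chosen for illustration only
NUM_EXAMPLES = 100
SEQ_LEN = 512
VOCAB_SIZE = 257_152

logits_bytes = tensor_bytes(NUM_EXAMPLES, SEQ_LEN, VOCAB_SIZE, 4)  # float32 logits
ids_bytes = tensor_bytes(NUM_EXAMPLES, SEQ_LEN, 1, 8)              # int64 token IDs

print(f"full logits : {logits_bytes / 1e9:.1f} GB")
print(f"argmax IDs  : {ids_bytes / 1e6:.2f} MB")
print(f"ratio       : {logits_bytes // ids_bytes}x")
```

Under these assumptions the accumulated logits would need tens of gigabytes, while the token IDs fit in well under a megabyte.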

### Step 3: Integrating Everything into SFTTrainer

Now we combine everything by passing our custom functions to the trainer. This enables automatic metric calculation and logging during training.

In [None]:
# Create the SFTTrainer with custom metrics integration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=cord_trains_ds,
    eval_dataset=cord_validation_ds,
    data_collator=collate_fn,
    dataset_text_field="",
    peft_config=lora_config,
    tokenizer=processor.tokenizer,
    
    # 🎯 KEY ADDITIONS: Custom metrics integration
    compute_metrics=compute_metrics,                              # Our custom evaluation function
    preprocess_logits_for_metrics=preprocess_logits_for_metrics, # Memory optimization
)

print("✅ SFTTrainer configured with custom JSON metrics!")
print("📊 Metrics tracked: JSON validity, structure similarity, value accuracy, field F1/precision/recall, value similarity, edit distance, overall score")
print("🚀 Ready for training with production-grade evaluation!")

## Training the Model

In [None]:
trainer.train()

## Training Results: Custom Metrics in Action

Here are the results from fine-tuning PaliGemma-3B on the image-to-JSON task using QLoRA, tracked through our custom evaluation framework in Weights & Biases.

### What to Look For in These Charts

**Training Progress Indicators**:
- **Loss Convergence**: Standard training loss shows optimization progress
- **Custom Metrics Trends**: Our JSON-specific metrics reveal quality improvements that loss alone can't capture
- **Stability Patterns**: Look for smooth improvements vs. erratic fluctuations

**Key Insights from Custom Metrics**:
- **JSON Validity**: Track how often the model generates parseable JSON
- **Structure Similarity**: Monitor if the model learns the expected schema structure
- **Value Accuracy**: Watch exact match rates improve as training progresses
- **Edit Distance**: Lower values indicate predictions closer to ground truth

These metrics provide early signals about model quality that you won't get from loss curves alone—exactly what you need for production deployment decisions.

### Training Metrics Evolution

![Train Charts from Weights & Biases](./wandb_charts/custom_metric_train.png)

### Evaluation Metrics Deep Dive

![Eval Charts from Weights & Biases](./wandb_charts/custom_metric_eval_part1.png)
![Eval Charts from Weights & Biases](./wandb_charts/custom_metric_eval_part2.png)

**Analysis Highlights**:
- **Multi-dimensional tracking** reveals different aspects of model improvement
- **Real-time feedback** enables early detection of training issues
- **Production-relevant metrics** target the failure modes that matter at deployment time, such as invalid JSON or missing fields

This comprehensive view demonstrates why custom metrics are essential for structured output tasks—they provide actionable insights that guide both training decisions and hyperparameter optimization in our upcoming ablation studies.

## Ablation Study Review


### Theoretical Foundations of QLoRA: Mathematical Formulation

#### 1. **Low-Rank Decomposition Mathematics**

The core insight of LoRA is that the weight updates during fine-tuning have a low "intrinsic rank". For a pre-trained weight matrix W₀ ∈ ℝ^(d×k), the adapted weight becomes:

```
W = W₀ + ΔW = W₀ + BA
```

Where:
- **B ∈ ℝ^(d×r)** and **A ∈ ℝ^(r×k)** are low-rank matrices
- **r << min(d,k)** is the rank bottleneck
- **ΔW = BA** represents the adaptation with only **r(d+k)** parameters instead of **dk**
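
The parameter savings are easy to check numerically. The sketch below compares `d*k` with `r*(d+k)` for a single weight matrix; the 2048×2048 shape is a hypothetical example, not a specific PaliGemma layer.

```python
def full_update_params(d, k):
    """Parameters in a dense update of a d x k weight matrix."""
    return d * k


def lora_update_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


d = k = 2048  # hypothetical square projection
full = full_update_params(d, k)
for r in (4, 8, 16, 32):
    lora = lora_update_params(d, k, r)
    print(f"r={r:>2}: {lora:>7,} LoRA params vs {full:,} dense ({100 * lora / full:.2f}%)")
```

Because the count is `r*(d+k)`, doubling the rank doubles the number of adapter parameters for every adapted matrix.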

#### 2. **Quantization Impact Analysis**

QLoRA introduces 4-bit quantization with the NF4 (NormalFloat4) data type:

**Memory Reduction**: 
- Base model: ~2.9B × 16 bits = ~5.8GB
- Quantized: ~2.9B × 4 bits = ~1.45GB  
- **Reduction**: 75% memory savings
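
The arithmetic above can be reproduced with a small helper. This is a rough estimate for the weights only: it ignores quantization constants, activations, optimizer state, and the LoRA adapters themselves, so real memory usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 2.9e9  # approximate parameter count used in the text above

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)
saving = 1 - nf4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights : {nf4_gb:.2f} GB")
print(f"saving        : {saving:.0%}")
```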

#### 3. **Scaling Law α/r Relationship**

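The scaling factor α controls adaptation strength. In practice the low-rank product BA from Section 1 is multiplied by α/r before it is added to W₀, so the effective update is:

```
ΔW = (α/r) × BA
```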

**Key Insight**: As r increases, α should scale proportionally to maintain consistent adaptation strength. Our experiments will validate different α/r ratios to find optimal values for vision-language tasks.
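
A quick way to sanity-check a configuration is to compute its `α/r` ratio directly. The sketch below does this for the `LoraConfig` shown earlier in this notebook (`r=8`, `lora_alpha=32`) and for the two settings compared in the alpha ablation; `lora_scale` is a hypothetical helper name, and it assumes the standard `α/r` scaling rather than a rank-stabilized variant.

```python
def lora_scale(alpha, r):
    """Multiplier applied to the low-rank update BA under standard LoRA scaling."""
    return alpha / r


configs = {
    "notebook LoraConfig (r=8, alpha=32)": (32, 8),
    "alpha = r  (r=32, alpha=32)": (32, 32),
    "alpha = 2r (r=32, alpha=64)": (64, 32),
}

for name, (alpha, r) in configs.items():
    print(f"{name}: scale = {lora_scale(alpha, r):.1f}")
```

Note that the `LoraConfig` in the setup cell has a ratio of 4, while the ablation runs use ratios of 1 and 2, so the two are not directly comparable.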

#### 4. **Multimodal Attention Mechanism**

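PaliGemma does not use dedicated cross-attention layers. A SigLIP vision encoder turns the image into patch embeddings, a linear projection maps them into the Gemma decoder's embedding space, and the decoder attends over the concatenated image and text tokens with ordinary self-attention.
LoRA adapts the attention and MLP projections listed in `target_modules` while the quantized base weights stay frozen, which helps preserve the pre-trained vision-language alignment.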

## Systematic Ablation Studies: Data-Driven Hyperparameter Optimization

An ablation study systematically varies one hyperparameter at a time while keeping others constant—this isolates the true impact of each configuration choice rather than relying on intuition or random search.
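
One way to keep such a study honest is to generate the run configurations programmatically, so every run differs from the baseline in exactly one setting. The sketch below assumes the `r=32`, `α=64` baseline used later in this notebook; the helper name `one_at_a_time` and the alternative learning rate are hypothetical.

```python
BASELINE = {"r": 32, "lora_alpha": 64, "lora_dropout": 0.05, "learning_rate": 2e-4}

# Candidate values per hyperparameter; the learning-rate alternative is a placeholder
SWEEPS = {
    "lora_alpha": [32, 64],
    "lora_dropout": [0.0, 0.05],
    "learning_rate": [1e-4, 2e-4],
}


def one_at_a_time(baseline, sweeps):
    """Yield (name, config) pairs that each differ from the baseline in one key."""
    for key, values in sweeps.items():
        for value in values:
            if value == baseline[key]:
                continue  # the baseline itself is run once, separately
            config = dict(baseline)
            config[key] = value
            yield f"{key}={value}", config


runs = list(one_at_a_time(BASELINE, SWEEPS))
for name, config in runs:
    changed = [k for k in config if config[k] != BASELINE[k]]
    print(name, "-> changed:", changed)
```

When two settings are deliberately tied together, as with `lora_alpha = 2*r` in the rank experiment, it is cleaner to treat the pair as a single factor than to sweep them independently.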

### Experimental Design & Metrics Selection

**Tracking Strategy**: While our custom framework tracks seven comprehensive metrics, I'll focus on three key indicators that capture different dimensions of model quality:

- **Edit Distance** (lower is better): Measures how many character-level edits are needed to transform predictions into ground truth—captures overall output quality
- **Field F1 Score**: Balances precision and recall of field detection—critical for incomplete extractions in production
- **Value Accuracy**: Percentage of exact matches for shared fields—the gold standard for extraction tasks

**Why These Three**: Together, they provide a holistic view covering structural accuracy (F1), content precision (Value Accuracy), and overall similarity (Edit Distance). This combination reveals trade-offs that single metrics might miss.
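
For readers who want to see what the edit-distance metric measures, here is a minimal pure-Python sketch of a normalized Levenshtein distance. Dividing by the length of the longer string is an assumption made for illustration; the `JSONMetrics` implementation in `metrics.py` may normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical."""
    longest = max(len(pred), len(truth))
    return levenshtein(pred, truth) / longest if longest else 0.0


truth_text = '{"total": "12.50"}'
pred_text = '{"total": "12.80"}'
print(normalized_edit_distance(pred_text, truth_text))
```

A single wrong digit barely moves the edit distance, which is exactly why it is paired with Value Accuracy: the latter counts that field as a complete miss.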

### Experimental Results

Each table row represents a controlled experiment with Weights & Biases tracking for complete reproducibility.

### Experiment 1: LoRA Rank (r) Analysis
The LoRA rank `r` controls the number of trainable parameters in the adaptation matrices. 

Impact of changing `r`:
- Lower `r` (e.g. 4–16): fewer trainable parameters and lower memory usage, so training is faster; often sufficient for small tasks.
- Higher `r` (e.g. 64–256): more capacity to capture complex patterns, but slower training and a higher risk of overfitting.

Effect of each rank on trainable parameters.

| Rank (r) | All parameters | Trainable parameters | Trainable % |
| --- | --- | --- | --- |
| 4 | 2,929,115,888 | 5,649,408 | 0.1929 |
| 8 | 2,934,765,296 | 11,298,816 | 0.3850 |
| 16 | 2,946,064,112 | 22,597,632 | 0.7670 |
| 32 | 2,968,661,744 | 45,195,264 | 1.5224 |
| 64 | 3,013,857,008 | 90,390,528 | 2.9992 |
| 128 | 3,104,247,536 | 180,781,056 | 5.8237 |

As the table illustrates, the number of trainable parameters scales linearly with the rank `r`. This has direct implications for the trade-off between model capacity and computational cost:
- **Low Ranks (4-16)**: Keep the model lightweight and fast to train, making them ideal for initial experiments or tasks where minimal adaptation is needed. However, they might lack the capacity to capture complex details.
- **Medium Ranks (32-64)**: Offer a balance between performance and computational cost. In our experiments, `r=32` gave the best results among the ranks we ran (8, 16, and 32), with clear gains over the lower ranks at a moderate cost.
- **High Ranks (128+)**: Provide maximum capacity but come with a higher risk of overfitting, longer training times, and diminishing returns on performance for many tasks.
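
The linear relationship can be reproduced from the `r=8` figures printed by `print_trainable_parameters()` in the model loading cell. The sketch below assumes that the reported total is the frozen base model plus the adapter parameters, and that every adapted matrix contributes the same number of parameters per unit of rank.

```python
# Figures printed for r=8 in the model loading cell
TRAINABLE_AT_R8 = 11_298_816
ALL_AT_R8 = 2_934_765_296

BASE_PARAMS = ALL_AT_R8 - TRAINABLE_AT_R8  # frozen base model
PARAMS_PER_RANK = TRAINABLE_AT_R8 // 8     # adapter parameters added per unit of rank


def trainable_params(r):
    """Adapter parameters at rank r, assuming linear scaling in r."""
    return PARAMS_PER_RANK * r


def trainable_percent(r):
    """Share of trainable parameters in the total (base + adapters), in percent."""
    trainable = trainable_params(r)
    return 100 * trainable / (BASE_PARAMS + trainable)


for r in (4, 8, 16, 32, 64, 128):
    print(f"r={r:>3}: {trainable_params(r):>11,} trainable ({trainable_percent(r):.4f}%)")
```

The percentages grow slightly slower than the rank itself because the adapters also enlarge the total parameter count in the denominator.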

For the first experiment, we run a set of runs with different rank (`r`) values, fixing `lora_alpha = 2*r` to keep the adaptation strength (`α/r`) consistent across runs.
The goal is to find the lowest rank that achieves satisfactory performance.

Here is a table with the results:

| Config | Experiment name | R | α | Dropout | LR | Edit Distance | Field F1 Score | Value Accuracy (%) | Value Similarity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R ablation | paligemma-img2json-0000 | 8 | 16 | 0.05 | 2e-4 | 0.1407 | 0.8244 | 75.82 | 88.39 |
| R ablation | paligemma-img2json-0013 | 16 | 32 | 0.05 | 2e-4 | 0.1116 | 0.8646 | 80.33 | 90.52 |
| Baseline | paligemma-img2json-0009 | 32 | 64 | 0.05 | 2e-4 | 0.085 | 0.883 | 81.9 | 93.65 |


The charts below visualize the performance of each run during training and evaluation.

#### Training Performance Comparison

![Rank Ablation - Training Metrics](./wandb_charts/r_ablation_train.png)

#### Evaluation Performance Comparison
_(For brevity, some metrics are omitted from this report)_
![Rank Ablation - Evaluation Metrics](./wandb_charts/r_ablation_eval.png)

### Analysis & Key Findings

- **Clear Correlation Between Rank and Performance**: The evaluation charts demonstrate a strong, positive correlation between LoRA rank and model performance. The `r=32` run (grey line) consistently achieves the best scores across all key metrics, including the lowest (best) `eval/edit_distance` and the highest `eval/value_accuracy` and `eval/field_f1_score`.

- **Loss Curves Corroborate Metric Trends**: This observation is supported by the loss curves. Both `train/loss` and `eval/loss` are lowest for the `r=32` run, indicating that the model with greater capacity learned the task more effectively and generalized better to the validation set.

- **Performance vs. Cost Trade-off**: While `r=32` has ~4x the trainable parameters of `r=8`, the performance gains are substantial. The improvement in **Value Accuracy** from 75.8% to 81.9% and the nearly 40% reduction in **Edit Distance** (from 0.1407 to 0.085) justify the increased computational cost for a production-oriented task where accuracy is critical.

- **Conclusion**: For this image-to-JSON task, a rank of at least 32 is beneficial. The conventional wisdom that performance is "largely independent of R" may not hold for complex, structured data extraction tasks, or perhaps applies at ranks beyond 32 where we might see diminishing returns. This experiment underscores the necessity of empirical validation over relying on general heuristics.
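
When reading these numbers it helps to separate relative changes from percentage-point changes. The short sketch below recomputes both from the `r=8` and `r=32` rows of the results table; `relative_change` is a hypothetical helper name.

```python
def relative_change(before, after):
    """Signed change relative to the starting value (-0.25 means a 25% reduction)."""
    return (after - before) / before


# Values taken from the r=8 and r=32 rows of the results table
edit_change = relative_change(0.1407, 0.085)
accuracy_points = 81.9 - 75.82
accuracy_change = relative_change(75.82, 81.9)

print(f"edit distance : {edit_change:+.1%} relative")
print(f"value accuracy: {accuracy_points:+.2f} points ({accuracy_change:+.1%} relative)")
```

Value Accuracy gains about six percentage points, which corresponds to a relative improvement of roughly 8%, while Edit Distance falls by close to 40% in relative terms.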

### Experiment 2: LoRA Alpha (α) Analysis
The LoRA alpha (`α`) parameter acts as a scaling factor for the learned weight updates. The final update `ΔW` is scaled by `α/r`. This means `α` controls the magnitude of the adaptation.

**Impact of changing α**:
- A higher `α` gives more weight to the LoRA adaptation, allowing for more significant changes to the base model's behavior.
- A lower `α` results in a more subtle adaptation.
- A common heuristic is to set `α` to be twice the rank (`α = 2r`) or equal the rank (`α = r`), but the optimal ratio depends on the task and how much the base model needs to be adapted.

For this experiment, we fix the rank `r=32` and compare `α=32` (i.e., `α=r`) against `α=64` (i.e., `α=2r`) to test this heuristic.

Here are the results comparing an alpha value equal to the rank versus double the rank:

| Config | Experiment name | R | α | Dropout | LR | Edit Distance | Field F1 Score | Value Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| α = r | paligemma-img2json-0008 | 32 | 32 | 0.05 | 2e-4 | 0.1027 | 0.8672 | 77.49 |
| α = 2r | paligemma-img2json-0009 | 32 | 64 | 0.05 | 2e-4 | 0.085 | 0.883 | 81.9 |

#### Training Performance Comparison

![Alpha Ablation - Training Metrics](./wandb_charts/alpha_ablation2_train.png)

#### Evaluation Performance Comparison

![Alpha Ablation - Evaluation Metrics](./wandb_charts/alpha_ablation2_eval.png)

### Analysis & Key Findings

- **Higher Alpha Drives Performance**: The evaluation charts clearly show that `α=64` (grey line, `paligemma-img2json-0009`) significantly outperforms `α=32` (maroon line, `paligemma-img2json-0008`) across all key metrics. The model with stronger adaptation achieves lower (better) `eval/edit_distance` and higher scores for `eval/value_accuracy`, `eval/field_f1_score`, and `eval/structure_similarity`.

- **Loss Curves Confirm Better Learning**: This trend is mirrored in the loss curves. The `α=64` run achieves a lower loss on both the training and evaluation sets, indicating that it learned more effectively and generalized better.

- **Validating the `α = 2r` Heuristic**: The results strongly support the `α = 2r` heuristic for this task. Doubling alpha from 32 to 64 led to a substantial performance boost: **Value Accuracy** increased from 77.5% to 81.9%, and **Edit Distance** dropped by over 17% (from 0.1027 to 0.085). This indicates the base model required a more significant adaptation than an `α=r` setting could provide.

- **Conclusion**: For this image-to-JSON task, a larger alpha relative to the rank is crucial. The base model needs considerable fine-tuning to handle structured data extraction, and a higher alpha provides the necessary scaling for the LoRA updates to be effective. The `α = 2r` rule of thumb proves to be a very effective starting point.

### Experiment 3: Dropout Analysis

Dropout is a regularization technique used to prevent overfitting by randomly setting a fraction of neuron activations to zero during training. This forces the model to learn more robust features that are not dependent on any single neuron.

**Impact of changing Dropout**:
- **Higher Dropout**: Increases regularization, which can help prevent overfitting on larger models or with longer training, but may lead to underfitting if set too high.
- **Lower/Zero Dropout**: Reduces or removes regularization. The original QLoRA paper noted that dropout was often unnecessary for very large models but could be beneficial for smaller ones (e.g., 7B parameter models).
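
As a rough illustration of the mechanism, the sketch below implements "inverted" dropout in plain Python: each value is zeroed with probability `p`, and the survivors are rescaled by `1 / (1 - p)` so the expected activation stays the same. This is a toy version for intuition only, not the code path used in training. As far as I understand, PEFT's LoRA layers apply `lora_dropout` to the input of the adapter branch, leaving the frozen base weights untouched. The function name `inverted_dropout` and the sample size are my own choices.

```python
import random


def inverted_dropout(values, p, rng, training=True):
    """Zero each value with probability p; rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Dropout is a no-op at inference time or when disabled.
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the demo is repeatable
activations = [1.0] * 10_000

for p in (0.0, 0.05):
    out = inverted_dropout(activations, p, rng)
    dropped = sum(1 for v in out if v == 0.0) / len(out)
    mean = sum(out) / len(out)
    print(f"p={p}: dropped {dropped:.1%}, mean activation {mean:.3f}")
```

With `p=0.05`, only about one value in twenty is dropped, which gives some intuition for why the two settings compared below can behave so similarly.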

**Experimental Setup & Findings**:
To assess its impact, I conducted experiments with `lora_dropout` set to `0.05` and `0.0`, keeping the previously determined optimal `r=32` and `α=64`. The results showed no significant difference in performance across our key metrics between the two configurations.

**Conclusion**:
Given that there were no clear signs of overfitting with a dropout of `0.05`, and no performance degradation compared to `0.0`, I chose to retain `lora_dropout=0.05` for all other experiments. This serves as a minor, low-cost safeguard against potential overfitting without negatively impacting the model's learning capacity on this dataset.

### Experiment 4: Learning Rate Analysis

The learning rate controls the step size the optimizer takes during weight updates. A learning rate scheduler, like the cosine scheduler used here (visible in the `train/learning_rate` chart below), dynamically adjusts this rate during training to balance fast initial progress with stable convergence later on.
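
For intuition, a cosine schedule can be sketched in a few lines of plain Python. This is a simplified stand-in for the scheduler used in training, not its actual implementation. The function name `cosine_lr`, the optional linear warmup, and the `total_steps` value are assumptions made for illustration.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step exceeds total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical run length, not taken from the experiments

for peak in (2e-4, 2e-5):
    samples = [cosine_lr(s, total_steps, peak) for s in (0, 250, 500, 750, 1000)]
    print(f"peak={peak:.0e}:", [f"{lr:.2e}" for lr in samples])
```

Both schedules have the same shape. The lower peak simply scales every step size down by a factor of ten, so that run takes much smaller updates for the entire training budget.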

**Impact of changing Learning Rate**:
- **Higher Learning Rate**: Allows for faster training and can help escape local minima, but risks overshooting the optimal weights and becoming unstable.
- **Lower Learning Rate**: Leads to more stable, predictable convergence but can be extremely slow and has a higher risk of getting stuck in suboptimal local minima.

**Experimental Setup & Findings**:
This experiment compares the recommended QLoRA learning rate of `2e-4` against a much smaller rate of `2e-5`. The goal is to demonstrate the dramatic impact of learning rate on training speed and final performance.

#### Training Performance Comparison
The `train/learning_rate` chart clearly shows the two different schedules. The `train/loss` chart shows the higher learning rate leads to much faster convergence.

![Train Charts from Weights & Biases](./wandb_charts/learning_rate_train.png)

#### Evaluation Performance Comparison
The evaluation metrics confirm the training trends, with the higher learning rate achieving significantly better results.

![Validation Charts from Weights & Biases](./wandb_charts/learning_rate_eval.png)

### Analysis & Key Findings

- **Learning Rate Dominates Performance**: The difference is stark. The higher learning rate (`2e-4`, purple line) vastly outperforms the lower rate across every single metric. Its `eval/loss` plummets, while the `eval/loss` for the lower rate barely decreases.

- **Validation of QLoRA's Recommendation**: This experiment strongly validates the `2e-4` learning rate recommended in the QLoRA paper as an effective starting point. While further tuning might yield marginal gains, a significantly lower rate is clearly detrimental for this task.

## Conclusions

This notebook demonstrated a comprehensive, production-oriented approach to fine-tuning a multimodal model for a structured data extraction task. By moving beyond simple loss monitoring and implementing a framework of custom, task-relevant metrics, we gained deep insights into model behavior and were able to make data-driven decisions.

**Key Achievements**:
1.  **Custom Metrics Framework**: We successfully integrated seven distinct metrics to evaluate JSON quality, structure, and value accuracy directly within the training loop. This provided a multi-dimensional view of performance that loss alone could not capture.
2.  **Systematic Hyperparameter Optimization**: Through a series of controlled ablation studies, we systematically isolated the impact of key QLoRA hyperparameters. This data-driven process allowed us to test common heuristics against our own data rather than taking them on faith, and to identify the best configuration among those we tried for our specific task.
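
The "change one thing at a time" idea behind these ablations can be sketched as a small config generator. The hyperparameter values are the ones compared in Experiments 2 to 4. The helper name `one_factor_variants`, the dictionary layout, and the labels are hypothetical, and the actual experiments may have been organized differently.

```python
# Baseline and alternative values taken from the experiments in this notebook.
baseline = {"r": 32, "lora_alpha": 64, "lora_dropout": 0.05, "learning_rate": 2e-4}
alternatives = {"lora_alpha": 32, "lora_dropout": 0.0, "learning_rate": 2e-5}


def one_factor_variants(baseline, alternatives):
    """Yield (label, config) pairs that differ from the baseline in exactly one key."""
    for key, value in alternatives.items():
        if key not in baseline:
            raise KeyError(f"unknown hyperparameter: {key}")
        yield f"{key}={value}", {**baseline, key: value}


variants = dict(one_factor_variants(baseline, alternatives))

for label, config in variants.items():
    changed = [k for k in baseline if config[k] != baseline[k]]
    print(f"{label}: differs from baseline in {changed}")
```

Because each variant differs from the baseline in a single key, any metric difference between a variant and the baseline can be attributed to that one hyperparameter, at least up to run-to-run noise.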

**Optimal Configuration Found**:
Our experiments converged on the following configuration as the most effective for this image-to-JSON task:
- **LoRA Rank (`r`)**: 32
- **LoRA Alpha (`α`)**: 64 (following the `α = 2r` ratio)
- **Learning Rate**: `2e-4`
- **Dropout**: `0.05`

**Why This Matters**:
The methodology presented here—combining custom evaluation with systematic experimentation—provides a robust and reproducible blueprint for fine-tuning models for specialized, real-world applications. It bridges the gap between academic research and production deployment by establishing a clear, evidence-based path to achieving optimal performance, ensuring that the final model is not just trained, but truly effective at its intended task.

The next logical step is to take our optimized model and deploy it to a production-ready environment. Stay tuned for a future guide on serving this model efficiently at scale!