# LoRA Fine-Tuning EmbeddingGemma on PDF QA Pairs

This notebook fine-tunes EmbeddingGemma using LoRA on query-passage pairs generated from PDF documents. The model learns to:
- **Map queries to relevant passages**: Make embeddings of queries and their correct passages similar
- **Improve retrieval**: Enable better semantic search over PDF content
- **Domain adaptation**: Adapt the general EmbeddingGemma model to your specific PDF domain

This workflow uses the training pairs generated in notebook 10.


In [1]:
# Import all necessary functions from src modules
from src.data.loaders import load_query_passage_pairs, validate_pairs
from src.models.embedding_pipeline import load_embeddinggemma_model
from src.models.lora_setup import setup_lora_model, print_trainable_parameters
from src.training.trainer import train_model
from src.utils.paths import timestamped_path, find_latest_timestamped_file
import torch
import pandas as pd




## Step 1: Load Training Dataset

Load the query-passage pairs generated in notebook 10. The `load_query_passage_pairs()` function automatically maps 'query' → 'anchor' and 'passage' → 'positive' for compatibility with the training framework.
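
The column mapping described above can be pictured with a small stand-alone sketch. This is not the project's `load_query_passage_pairs()`; the helper name `load_pairs_sketch` and the in-memory CSV are hypothetical, and the only assumption taken from the notebook is that the file has `query` and `passage` columns.

```python
import csv
import io


def load_pairs_sketch(csv_file):
    """Read rows with 'query' and 'passage' columns and rename the keys."""
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        # 'query' becomes 'anchor', 'passage' becomes 'positive'
        pairs.append({"anchor": row["query"], "positive": row["passage"]})
    return pairs


# Tiny in-memory CSV standing in for the real file (made-up content).
demo_csv = io.StringIO("query,passage\nWho narrates the story?,Call me Ishmael.\n")
demo_pairs = load_pairs_sketch(demo_csv)
print(demo_pairs[0]["anchor"])
print(demo_pairs[0]["positive"])
```

The resulting list of dictionaries has the same shape the cells below index into (`train_data[0]['anchor']`, `train_data[0]['positive']`).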


In [2]:
# Load training dataset from CSV file generated in notebook 10
# Automatically find the latest train file (or specify path manually)
train_csv_path = find_latest_timestamped_file(
    "data/processed",
    "pdf_query_passage_pairs_train",
    "csv"
)

if train_csv_path is None:
    raise FileNotFoundError(
        "No training CSV file found. Please run notebook 10 first to generate the dataset."
    )

print(f"Loading training data from: {train_csv_path}")

# Load pairs (automatically maps query→anchor, passage→positive)
train_data = load_query_passage_pairs(str(train_csv_path))

print(f"Loaded {len(train_data)} training pairs")
print("\nFirst pair:")
print(f"  Query (anchor):   '{train_data[0]['anchor']}'")
print(f"  Passage (positive): '{train_data[0]['positive'][:100]}...'")

# Validate dataset
stats = validate_pairs(train_data)
print(f"\nDataset statistics:")
print(f"  Average query length: {stats['avg_anchor_length']:.1f} characters")
print(f"  Average passage length: {stats['avg_positive_length']:.1f} characters")
print(f"  Has empty strings: {stats['has_empty']}")


Loading training data from: data/processed/pdf_query_passage_pairs_train_20260114_024812.csv
Loaded 895 training pairs

First pair:
  Query (anchor):   '```json
[
  {
    "question": "What is the primary subject matter of the book, as indicated by its title?'
  Passage (positive): '*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***...'

Dataset statistics:
  Average query length: 212.4 characters
  Average passage length: 1909.7 characters
  Has empty strings: False
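
The first pair printed above begins with a Markdown code fence and a JSON bracket, which suggests that at least some queries may contain raw generator output rather than a clean question. A minimal sketch for counting such rows follows; the helper names and the demo rows are hypothetical, and the heuristic is only a starting point.

```python
FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def looks_like_raw_generator_output(query):
    """Heuristic: the query starts with a code fence or a JSON bracket."""
    stripped = query.lstrip()
    return stripped.startswith(FENCE) or stripped.startswith(("[", "{"))


def count_suspect_queries(pairs):
    """Count pairs whose 'anchor' text trips the heuristic above."""
    return sum(looks_like_raw_generator_output(p["anchor"]) for p in pairs)


# Made-up rows for illustration; in the notebook this would be `train_data`.
demo_pairs = [
    {"anchor": FENCE + 'json\n[\n  {\n    "question": "What is the book about?', "positive": "..."},
    {"anchor": "Who is the captain of the Pequod?", "positive": "..."},
]
print(count_suspect_queries(demo_pairs))
```

If the count on the real data is high, cleaning the queries in notebook 10 before training is probably worth more than any hyperparameter change.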


## Step 2: Load Base Model and Apply LoRA

We'll load the base EmbeddingGemma model and configure it with LoRA adapters. LoRA freezes the base weights and trains small low-rank adapter matrices added to selected layers, which makes training efficient and reduces the risk of catastrophic forgetting.

### LoRA Configuration

- `r`: Rank of LoRA adapters (default: 16, controls capacity)
- `lora_alpha`: Scaling factor (default: 32, typically 2x rank)
- `lora_dropout`: Dropout rate for LoRA layers (default: 0.1)
- `target_modules`: Which transformer layers to apply LoRA to (typically attention projections)
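
To see how `r` and `target_modules` determine the number of trainable parameters, here is a small arithmetic sketch. Each adapted linear layer gains two matrices, `A` (r × in) and `B` (out × r), so it adds `r * (in + out)` parameters. The layer shapes below (width 768, 24 layers, 768-wide `q_proj`, 256-wide `k_proj` and `v_proj`) are assumptions about EmbeddingGemma's architecture that this notebook does not state; under them the result matches the trainable count printed by the next cell.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed attention shapes (not stated in this notebook):
HIDDEN = 768      # model width
Q_OUT = 768       # q_proj output width
KV_OUT = 256      # k_proj and v_proj output width
NUM_LAYERS = 24
R = 16            # LoRA rank used in this notebook

per_layer = (
    lora_param_count(HIDDEN, Q_OUT, R)       # q_proj
    + lora_param_count(HIDDEN, KV_OUT, R)    # k_proj
    + lora_param_count(HIDDEN, KV_OUT, R)    # v_proj
)
trainable = per_layer * NUM_LAYERS
total = 304_239_360  # total reported by print_trainable_parameters in this run

print(f"trainable: {trainable:,} ({100 * trainable / total:.2f}% of {total:,})")
```

Doubling `r` doubles the adapter size, and adding `o_proj` or the MLP projections to `target_modules` would raise it further.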


In [3]:
# Load base EmbeddingGemma model
tokenizer, base_model = load_embeddinggemma_model()

# Check device
device = next(base_model.parameters()).device
print(f"Model loaded on device: {device}")

# Apply LoRA configuration
model = setup_lora_model(
    base_model,
    r=16,  # LoRA rank (controls adapter capacity)
    lora_alpha=32,  # Scaling factor (typically 2x rank)
    lora_dropout=0.1,  # Dropout for regularization
    target_modules=["q_proj", "k_proj", "v_proj"]  # Apply to attention projections
)

# Verify only LoRA parameters are trainable
print_trainable_parameters(model)


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


Model loaded on device: cuda:0
Trainable params: 1,376,256 (0.45% of 304,239,360)


{'trainable': 1376256, 'total': 304239360, 'percentage': 0.452359615797246}

## Step 3: Load Validation Dataset (Optional)

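If you have a validation set, load it here. Note that the `train_model` call in Step 4 does not receive `val_data`, so this run reports training loss only; the validation pairs remain available for a separate retrieval evaluation.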


In [4]:
# Load validation dataset (optional)
# Automatically find the latest val file (or specify path manually)
val_csv_path = find_latest_timestamped_file(
    "data/processed",
    "pdf_query_passage_pairs_val",
    "csv"
)

if val_csv_path is None:
    print("Validation file not found, skipping validation")
    val_data = None
else:
    print(f"Loading validation data from: {val_csv_path}")
    try:
        val_data = load_query_passage_pairs(str(val_csv_path))
        print(f"Loaded {len(val_data)} validation pairs")
    except FileNotFoundError:
        print("Error loading validation file, skipping validation")
        val_data = None


Loading validation data from: data/processed/pdf_query_passage_pairs_val_20260114_024812.csv
Loaded 221 validation pairs


## Step 4: Check GPU and Train the Model

First, let's check GPU availability and memory to optimize batch size for DGX Spark.

Then we'll train the model using contrastive learning. The training process:
1. **Forward pass**: Compute embeddings for queries and passages
2. **Contrastive loss**: Use Multiple Negatives Ranking Loss to bring positive pairs closer
3. **Backward pass**: Update only LoRA parameters
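
The contrastive step can be sketched in plain Python. This is a minimal illustration of Multiple Negatives Ranking Loss, assuming cosine similarity divided by a temperature and softmax cross-entropy over the in-batch passages; the project's trainer is not shown here, so its exact formulation may differ. The embeddings are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(query_embs, passage_embs, temperature=1.0):
    """Mean cross-entropy where passage i is the correct class for query i."""
    losses = []
    for i, q in enumerate(query_embs):
        # Every passage in the batch is a candidate; the others act as negatives.
        logits = [cosine(q, p) / temperature for p in passage_embs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Made-up 2-D embeddings: each query points roughly at its own passage.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.1, 0.9]]
print(mnr_loss(queries, passages, temperature=1.0))
print(mnr_loss(queries, passages, temperature=0.05))
```

With the same embeddings, the lower temperature gives a much smaller loss because it sharpens the softmax over candidates.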

### Training Parameters (Optimized for ~2 hours on DGX Spark)

- `epochs`: 150 (increased from 3)
- `batch_size`: 64 (increased from 8 for better GPU utilization)
- `learning_rate`: $1.6 \times 10^{-3}$ (scaled with batch size using linear scaling rule: $lr_{new} = lr_{base} \times \frac{batch\_size_{new}}{batch\_size_{base}}$)
- `temperature`: 1.0 (contrastive loss temperature)
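
The temperature has a direct effect on which loss values are reachable. The sketch below assumes the trainer scores pairs with cosine similarity (bounded in [-1, 1]) divided by the temperature and applies softmax cross-entropy over the 64 in-batch candidates; that is an assumption, since the trainer source is not shown here. Under it, a batch of 64 has a chance-level loss of ln(64), and the loss cannot fall below the value reached when the positive scores +1 and every negative scores -1.

```python
import math


def chance_loss(num_candidates):
    """Loss when every candidate gets the same score (uniform softmax)."""
    return math.log(num_candidates)


def loss_floor(num_candidates, temperature):
    """Lowest possible loss: positive at +1, all negatives at -1 (cosine bounds)."""
    gap = 2.0 / temperature
    return math.log1p((num_candidates - 1) * math.exp(-gap))


print(f"chance, 64 candidates:     {chance_loss(64):.3f}")
print(f"floor at temperature 1.0:  {loss_floor(64, 1.0):.3f}")
print(f"floor at temperature 0.05: {loss_floor(64, 0.05):.3g}")
```

If the assumption holds, losses at temperature 1.0 with batch size 64 are confined to roughly 2.25 to 4.16, which is worth keeping in mind when reading the training log. A much lower temperature (0.05 is used here purely as an illustrative value) removes that floor almost entirely.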


In [5]:
# Check GPU availability and memory for batch size optimization
if torch.cuda.is_available():
    print("GPU Information:")
    print("-" * 60)
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        allocated = torch.cuda.memory_allocated(i) / 1e9
        reserved = torch.cuda.memory_reserved(i) / 1e9
        total = props.total_memory / 1e9
        free = total - reserved
        
        print(f"GPU {i}: {props.name}")
        print(f"  Total Memory: {total:.2f} GB")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved: {reserved:.2f} GB")
        print(f"  Free: {free:.2f} GB")
        print(f"  Compute Capability: {props.major}.{props.minor}")
    print("-" * 60)
    print(f"\nUsing device: {device}")
    print(f"Recommended batch size: 64-128 for DGX Spark (will use 64)")
else:
    print("CUDA not available - using CPU")
    print("Note: Training will be much slower on CPU")


GPU Information:
------------------------------------------------------------
GPU 0: NVIDIA GB10
  Total Memory: 128.53 GB
  Allocated: 1.22 GB
  Reserved: 1.28 GB
  Free: 127.25 GB
  Compute Capability: 12.1
------------------------------------------------------------

Using device: cuda:0
Recommended batch size: 64-128 for DGX Spark (will use 64)


In [6]:
# Train the model optimized for DGX Spark (~2 hours training time)
# The trainer uses Multiple Negatives Ranking Loss for contrastive learning
# Note: train_model modifies the model in-place and returns a list of losses per epoch

# Calculate optimal parameters for ~2 hours training
# Current baseline: 3 epochs @ batch_size=8 took ~97 seconds (32.3 sec/epoch)
# Target: ~7200 seconds (2 hours)
# Strategy: Increase batch size for better GPU utilization, scale epochs accordingly

import time
start_time = time.time()

# Optimized hyperparameters for DGX Spark
# Larger batch size = better GPU utilization + more in-batch negatives
BATCH_SIZE = 64  # Increased from 8 to better utilize GPU memory (8x larger)
BASE_BATCH_SIZE = 8
BASE_LR = 2e-4

# Estimate epochs for ~2 hours:
# With batch_size=64: ~14 batches/epoch (vs 112 with batch_size=8)
# Larger batches are more GPU-efficient, estimate ~40-50 sec/epoch
# For 2 hours (7200 sec): 7200/45 ≈ 160 epochs
# Using 150 epochs for safety (can adjust based on actual timing)
EPOCHS = 150  # Optimized to reach ~2 hours total time

# Scale learning rate with batch size (linear scaling rule)
# This maintains similar gradient magnitudes as batch size increases
# Formula: lr_new = lr_base * (batch_size_new / batch_size_base)
LEARNING_RATE = BASE_LR * (BATCH_SIZE / BASE_BATCH_SIZE)  # 2e-4 * 8 = 1.6e-3

batches_per_epoch = (len(train_data) + BATCH_SIZE - 1) // BATCH_SIZE
estimated_epoch_time = 40  # Low end of the 40-50 sec/epoch estimate above, in seconds
estimated_total_time = EPOCHS * estimated_epoch_time

print(f"DGX Spark Training Configuration:")
print(f"  Training pairs: {len(train_data)}")
print(f"  Batch size: {BATCH_SIZE} ({batches_per_epoch} batches per epoch)")
print(f"  Epochs: {EPOCHS}")
print(f"  Learning rate: {LEARNING_RATE:.2e} (scaled from {BASE_LR:.2e} for batch size {BATCH_SIZE})")
print(f"  Temperature: 1.0")
print(f"  Estimated time: ~{estimated_total_time/3600:.2f} hours ({estimated_total_time/60:.1f} minutes)")
print(f"\nStarting training...")

losses = train_model(
    model=model,
    tokenizer=tokenizer,
    train_data=train_data,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    temperature=1.0,
    device=device
)

elapsed_time = time.time() - start_time
hours = elapsed_time / 3600
minutes = (elapsed_time % 3600) / 60

print("\n" + "="*60)
print("Training completed!")
print(f"Total training time: {hours:.2f} hours ({elapsed_time:.1f} seconds)")
print(f"Average time per epoch: {elapsed_time/EPOCHS:.2f} seconds")
print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")
print(f"Loss improvement: {losses[0] - losses[-1]:.4f}")
print("="*60)


DGX Spark Training Configuration:
  Training pairs: 895
  Batch size: 64 (14 batches per epoch)
  Epochs: 150
  Learning rate: 1.60e-03 (scaled from 2.00e-04 for batch size 64)
  Temperature: 1.0
  Estimated time: ~1.67 hours (100.0 minutes)

Starting training...
Epoch 1/150: training loss = 3.8875
Epoch 2/150: training loss = 3.7557
Epoch 3/150: training loss = 3.6663
Epoch 4/150: training loss = 3.6331
Epoch 5/150: training loss = 3.6253
Epoch 6/150: training loss = 3.5951
Epoch 7/150: training loss = 3.6114
Epoch 8/150: training loss = 3.6219
Epoch 9/150: training loss = 3.6445
Epoch 10/150: training loss = 3.6213
Epoch 11/150: training loss = 3.5988
Epoch 12/150: training loss = 3.5595
Epoch 13/150: training loss = 3.5363
Epoch 14/150: training loss = 3.5500
Epoch 15/150: training loss = 3.5703
Epoch 16/150: training loss = 3.5593
Epoch 17/150: training loss = 3.5552
Epoch 18/150: training loss = 3.5327
Epoch 19/150: training loss = 3.5156
Epoch 20/150: training loss = 3.4979
Epoch

## Step 5: Save Fine-Tuned Model

Save the LoRA adapters so you can load them later for inference or further training.


In [7]:
# Save LoRA adapters
from peft import PeftModel

# Save adapters to timestamped directory
adapter_path = timestamped_path("outputs/models", "pdf_qa_lora_adapter", "")
adapter_path.mkdir(parents=True, exist_ok=True)

# Note: model was modified in-place during training, so we use 'model' not 'trained_model'
model.save_pretrained(str(adapter_path))
tokenizer.save_pretrained(str(adapter_path))

print(f"Saved LoRA adapters to: {adapter_path}")
print("\nTo load the model later:")
print(f"  from peft import PeftModel")
print(f"  from src.models.embedding_pipeline import load_embeddinggemma_model")
print(f"  tokenizer, base_model = load_embeddinggemma_model()")
print(f"  model = PeftModel.from_pretrained(base_model, '{adapter_path}')")


Saved LoRA adapters to: outputs/models/pdf_qa_lora_adapter_20260114_151605.

To load the model later:
  from peft import PeftModel
  from src.models.embedding_pipeline import load_embeddinggemma_model
  tokenizer, base_model = load_embeddinggemma_model()
  model = PeftModel.from_pretrained(base_model, 'outputs/models/pdf_qa_lora_adapter_20260114_151605.')


## Step 6: Test Fine-Tuned Model

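Let's check the fine-tuned model on a sample pair from the training set by computing the cosine similarity between the query and its correct passage. Because the pair was seen during training and no competing passages are scored, this is a sanity check rather than a retrieval test.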


In [8]:
# Test the fine-tuned model
from src.models.embedding_pipeline import embed_texts
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example query from training data
test_query = train_data[0]['anchor']
correct_passage = train_data[0]['positive']

# Embed query and passage using fine-tuned model
# Note: model was modified in-place during training, so we use 'model' not 'trained_model'
query_emb = embed_texts(test_query, model, tokenizer, device=device, max_length=512)
passage_emb = embed_texts(correct_passage, model, tokenizer, device=device, max_length=512)

# Compute similarity
similarity = cosine_similarity(query_emb.numpy(), passage_emb.numpy())[0][0]

print(f"Query: '{test_query}'")
print(f"\nCorrect Passage: '{correct_passage[:150]}...'")
print(f"\nSimilarity score: {similarity:.4f}")
print(f"(Higher is better, should be close to 1.0 for correct pairs)")


Query: '```json
[
  {
    "question": "What is the primary subject matter of the book, as indicated by its title?'

Correct Passage: '*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***...'

Similarity score: 0.9810
(Higher is better, should be close to 1.0 for correct pairs)
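
A single similarity score does not show whether the correct passage would win against other passages. A minimal sketch of a retrieval-style check follows: rank every candidate passage for each query and report recall@k. The embeddings are made up, and the helper names are hypothetical; in the notebook the vectors would come from `embed_texts` applied to the validation pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_positive(query_emb, passage_embs, positive_index):
    """1-based rank of the correct passage when sorted by cosine similarity."""
    scores = [cosine(query_emb, p) for p in passage_embs]
    target = scores[positive_index]
    return 1 + sum(score > target for score in scores)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct passage is ranked in the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


# Made-up embeddings: passage i is the correct answer for query i.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
passages = [[0.9, 0.2], [0.1, 1.0], [0.7, 0.7]]

ranks = [rank_of_positive(q, passages, i) for i, q in enumerate(queries)]
print("ranks:", ranks)
print("recall@1:", recall_at_k(ranks, 1))
print("recall@2:", recall_at_k(ranks, 2))
```

Running the same check with the base model and the fine-tuned model on the held-out pairs gives a like-for-like comparison that training loss cannot provide.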


## Summary

This notebook demonstrated:
1. ✅ Loading query-passage pairs from PDF-generated dataset
2. ✅ Setting up LoRA for efficient fine-tuning
3. ✅ Training EmbeddingGemma on PDF QA pairs
4. ✅ Saving fine-tuned LoRA adapters
5. ✅ Testing the fine-tuned model

**Next Steps:**
- Use the fine-tuned model for semantic search over your PDF documents
- Evaluate retrieval performance using the evaluation modules
- Consider using hard negatives (from notebook 10) for advanced training
- The fine-tuned model can be loaded and used for inference in other notebooks


Here's how to read what happened and **how much better the model actually got**, in practical terms.

---

## 1. What the loss numbers mean (plain English)

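* **Training loss** here is the Multiple Negatives Ranking Loss: it measures how surprised the model is that the correct passage, rather than another passage in the batch, belongs to each query.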
* Lower = better.
* Loss is on a **log scale**, so small numeric improvements can still be meaningful.

You started at:

* **Initial loss:** **3.8875**

You ended at:

* **Final loss:** **3.2902**

That's a drop of:

* **Δ loss = −0.5973**

---

## 2. How big is that improvement really?

### A) Relative improvement

$$
\frac{0.5973}{3.8875} \approx 15.4\%
$$

➡️ **About a 15% reduction in training loss.**

That's **solid but not dramatic**, especially for:

* A **small dataset (895 pairs)**
* **150 epochs**
* A relatively high learning rate

---

### B) Interpreting loss exponentially (important!)

Because the loss is a negative log-probability, a drop in loss converts to a ratio of probabilities assigned to the correct passage:

$$
\text{Improvement factor} = e^{0.5973} \approx 1.82
$$

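➡️ **The model assigns ~1.8× more probability to the correct passage** (geometric mean over the training pairs, among in-batch candidates) than at the start. This is a statement about confidence on training data, not about retrieval accuracy.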

This is a much better intuition than raw loss numbers.
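
The two figures above can be reproduced in a few lines. The loss values are the ones quoted in this section; nothing else is assumed.

```python
import math

initial_loss = 3.8875
final_loss = 3.2902

delta = initial_loss - final_loss        # absolute drop in loss
relative_drop = delta / initial_loss     # fraction of the starting loss
likelihood_ratio = math.exp(delta)       # ratio of probabilities for the correct passage

print(f"absolute drop:    {delta:.4f}")
print(f"relative drop:    {relative_drop:.1%}")
print(f"likelihood ratio: {likelihood_ratio:.2f}x")
```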

---

## 3. What the loss curve tells us about learning quality

### Phase breakdown

#### 🔹 Epochs 1–30: Fast learning

* Loss drops from **3.89 → ~3.40**
* Model quickly learns to separate correct passages from in-batch negatives

#### 🔹 Epochs 30–60: Slower gains

* Gradual improvement down to **~3.28**
* Still learning, but diminishing returns

#### 🔹 Epochs 60–90: Convergence

* Loss flattens around **3.25–3.27**
* This is likely **near the capacity limit** given data size

#### 🔹 Epochs 120–150: Instability

* Training loss **rises back up** to **3.29**
* A rising *training* loss points to an optimization problem rather than overfitting:

  * Learning rate likely too high late in training
  * Overfitting would instead show up as validation loss rising while training loss keeps falling, and this run tracked no validation loss

📌 **Your lowest training loss was probably around epochs 90–115**, not at epoch 150.
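
Picking the epoch to keep can be done from the per-epoch loss list rather than by eye. The sketch below smooths the curve with a trailing moving average and returns the epoch with the lowest smoothed value. The loss list is made up for illustration, and the helper names are hypothetical; a validation loss or retrieval metric would be a better selection criterion than training loss if one is available.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def best_epoch(losses, window=5):
    """1-based epoch whose smoothed loss is lowest."""
    smoothed = moving_average(losses, window)
    return 1 + min(range(len(smoothed)), key=smoothed.__getitem__)


# Made-up loss curve that falls, flattens, then drifts upward.
demo_losses = [3.9, 3.7, 3.5, 3.4, 3.3, 3.25, 3.26, 3.27, 3.29, 3.30]
print(best_epoch(demo_losses, window=3))
print(best_epoch(demo_losses, window=1))
```

Note that this only identifies the epoch; keeping that model requires saving adapters during training, which the training call in this notebook does not do.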

---

## 4. Did training "work"?

**Yes, clearly.**

Evidence:

* Strong early loss reduction
* Clear convergence
* Final training loss well below its starting value
* No catastrophic divergence

But also:

* You are **data-limited**, not compute-limited
* Extra epochs past ~100 gave little benefit

---

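## 5. How much better is the model *functionally*?

Without validation metrics we can't be exact. Training loss alone only suggests that:

* The correct passage should rank **higher among competing passages** than with the base model, at least on data resembling the training set
* Likely improvements:

  * Matching domain-specific vocabulary between queries and passages
  * Separating passages from the same document that cover different topics
* Not a "new capability" jump

Think:

> **Better tuned to this corpus, not a fundamentally different retriever**

Retrieval metrics such as recall@k or MRR on the 221 held-out validation pairs would confirm or refute this.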

---

## 6. Key takeaways (TL;DR)

* ✅ Training loss improved by **15.4%**
* ✅ Probability assigned to the correct passage rose by **~1.8×** (on training data)
* ⚠️ Training past ~100 epochs likely unnecessary
* ⚠️ Training loss rose late, pointing to learning-rate instability rather than overfitting
* 📈 Biggest gains came early; diminishing returns dominate
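
One common response to late-training instability is to decay the learning rate over the run. The sketch below shows linear warmup followed by cosine decay as a plain function of the step index. Whether `train_model` accepts a scheduler is not known from this notebook, so this only illustrates the shape of the schedule; `WARMUP_STEPS` is a hypothetical value.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 150 * 14   # epochs x batches per epoch from this run
PEAK_LR = 1.6e-3         # the learning rate used in this run
WARMUP_STEPS = 100       # hypothetical value

for step in (0, 99, 100, 1100, 2099):
    print(step, f"{lr_at_step(step, TOTAL_STEPS, PEAK_LR, WARMUP_STEPS):.2e}")
```

With this shape the learning rate is at half its peak midway through the decay and close to zero in the final epochs, which is where the loss drifted upward in this run.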

---

## 7. Possible follow-ups

* Measure **retrieval quality** on the held-out validation pairs
* Decide the **best checkpoint** to keep
* Try a **learning-rate schedule** with decay for the next run
* Evaluate whether **more data or a lower LR** would help more
