# 🚀 Step 3: Training the LoRA Adapter

**Goal**: Train Nemotron-3-Nano with LoRA on the MedMCQA dataset.

In this notebook, we'll:
1. Load the model with LoRA configuration from notebook 02
2. Set up the SFTTrainer (Supervised Fine-Tuning Trainer) from TRL
3. Configure training hyperparameters
4. Train and monitor the loss
5. Save the trained adapter
6. Test the fine-tuned model

**Stack used**: Transformers + PEFT + TRL (standard HuggingFace stack)

## 3.1 Setup and Authentication

In [1]:
import os
import json
from pathlib import Path

from huggingface_hub import login

# Authenticate with HuggingFace using token from environment
if os.environ.get("HF_TOKEN"):
    login(token=os.environ["HF_TOKEN"])
    print("✅ Logged in to HuggingFace Hub")
else:
    print("⚠️ HF_TOKEN not found in environment. Set it to avoid rate limits.")

⚠️ HF_TOKEN not found in environment. Set it to avoid rate limits.




## 3.2 Load Training Configuration

We'll use the configuration saved from notebook 02.

In [2]:
# Load the config we saved in notebook 02
config_path = Path("../outputs/training_config.json")

with open(config_path) as f:
    config = json.load(f)

print("Loaded configuration:")
print(json.dumps(config, indent=2))

MODEL_NAME = config["model_name"]
MAX_SEQ_LENGTH = config["training_config"]["max_seq_length"]

Loaded configuration:
{
  "model_name": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16",
  "lora_config": {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
      "o_proj",
      "down_proj",
      "v_proj",
      "up_proj",
      "q_proj",
      "k_proj"
    ],
    "task_type": "TaskType.CAUSAL_LM",
    "bias": "none"
  },
  "training_config": {
    "max_seq_length": 1024
  }
}
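
Because the later cells index straight into this dictionary, a missing key would only surface as a `KeyError` partway through setup. The following is a minimal sketch, assuming the JSON layout printed above, of checking the keys up front. The helper `validate_config` and the trimmed example config are hypothetical and not part of notebook 02.

```python
import json

# Keys this notebook reads later; the names follow the JSON printed above.
REQUIRED_LORA_KEYS = {"r", "lora_alpha", "lora_dropout", "target_modules", "bias"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in the config (empty if none)."""
    problems = []
    if "model_name" not in config:
        problems.append("missing model_name")
    missing = REQUIRED_LORA_KEYS - set(config.get("lora_config", {}))
    if missing:
        problems.append(f"lora_config missing: {sorted(missing)}")
    if "max_seq_length" not in config.get("training_config", {}):
        problems.append("training_config missing max_seq_length")
    return problems


# Hypothetical example mirroring the structure shown above.
example = json.loads("""
{
  "model_name": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16",
  "lora_config": {"r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
                  "target_modules": ["q_proj", "v_proj"], "bias": "none"},
  "training_config": {"max_seq_length": 1024}
}
""")
print(validate_config(example))
```

Note that `task_type` is stored as the string `"TaskType.CAUSAL_LM"`; the notebook does not read it back and passes the enum directly when rebuilding the LoRA config.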


## 3.3 Check GPU Resources

In [3]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        total_memory_gb = props.total_memory / (1024**3)
        print(f"\nGPU {i}: {props.name}")
        print(f"  Total Memory: {total_memory_gb:.1f} GB")
else:
    raise RuntimeError("No GPU available. Training requires a GPU.")

PyTorch version: 2.10.0+cu128
CUDA available: True
CUDA version: 12.8
GPU count: 8

GPU 0: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB

GPU 1: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB

GPU 2: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB

GPU 3: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB

GPU 4: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB

GPU 5: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB

GPU 6: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB

GPU 7: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB


## 3.4 Load the Dataset

In [4]:
from datasets import load_from_disk

# Load the formatted dataset from notebook 01
formatted_dataset = load_from_disk("../data/medmcqa_formatted")

print(f"Dataset loaded:")
print(f"  Train: {len(formatted_dataset['train']):,} examples")
print(f"  Validation: {len(formatted_dataset['validation']):,} examples")

print(f"\nSample text (first 300 chars):")
print(formatted_dataset["train"][0]["text"][:300] + "...")

Dataset loaded:
  Train: 182,822 examples
  Validation: 4,183 examples

Sample text (first 300 chars):
<|im_start|>system
You are a medical expert. Answer the multiple choice question by selecting the correct option and providing a brief explanation.<|im_end|>
<|im_start|>user
Question: Chronic urethral obstruction due to benign prismatic hyperplasia can lead to the following change in kidney parench...


## 3.5 Load the Model and Tokenizer

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f"Loading tokenizer from: {MODEL_NAME}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Configure padding (required for batch training)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

tokenizer.padding_side = "right"  # Right padding for training (left for generation)

print(f"✅ Tokenizer loaded")
print(f"   Vocabulary size: {len(tokenizer):,}")
print(f"   Pad token: {tokenizer.pad_token!r}")
print(f"   Padding side: {tokenizer.padding_side}")

Loading tokenizer from: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
✅ Tokenizer loaded
   Vocabulary size: 131,072
   Pad token: '<|im_end|>'
   Padding side: right
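
The padding side matters because generation appends new tokens after the last position of each row, so padding on the right would sit between the prompt and the generated text. The sketch below illustrates the two layouts in plain Python. The function `pad_batch` and the token IDs are made up for illustration and do not reproduce the tokenizer's own implementation.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-ID lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_real, mask_pad = [1] * len(seq), [0] * len(padding)
        if side == "right":
            input_ids.append(seq + padding)
            attention_mask.append(mask_real + mask_pad)
        else:  # left padding keeps the last real token at the end of the row
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_real)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="right")
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]

left_ids, left_mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(left_ids)  # [[5, 6, 7], [0, 0, 8]]
```

The test cells later in this notebook generate from a single prompt at a time, so no padding is involved there.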


In [6]:
print(f"\nLoading model: {MODEL_NAME}")
print("This may take a few minutes...")

# Clear any cached memory
torch.cuda.empty_cache()

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

print(f"\n✅ Model loaded!")
print(f"   Model type: {type(model).__name__}")
print(f"   Model dtype: {model.dtype}")

`torch_dtype` is deprecated! Use `dtype` instead!



Loading model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
This may take a few minutes...


Loading weights: 100%|██████████| 6243/6243 [00:10<00:00, 585.11it/s, Materializing param=lm_head.weight]



✅ Model loaded!
   Model type: NemotronHForCausalLM
   Model dtype: torch.bfloat16


## 3.6 Apply LoRA Configuration

In [7]:
from peft import LoraConfig, TaskType, get_peft_model

# Recreate the LoRA config from our saved configuration
lora_config = LoraConfig(
    r=config["lora_config"]["r"],
    lora_alpha=config["lora_config"]["lora_alpha"],
    lora_dropout=config["lora_config"]["lora_dropout"],
    target_modules=config["lora_config"]["target_modules"],
    task_type=TaskType.CAUSAL_LM,
    bias=config["lora_config"]["bias"],
)

print("LoRA Configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Scaling: {lora_config.lora_alpha / lora_config.r}")
print(f"  Dropout: {lora_config.lora_dropout}")
print(f"  Target modules: {lora_config.target_modules}")

LoRA Configuration:
  Rank (r): 16
  Alpha: 32
  Scaling: 2.0
  Dropout: 0.05
  Target modules: {'v_proj', 'q_proj', 'up_proj', 'k_proj', 'o_proj', 'down_proj'}
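
To make the rank and scaling values concrete, here is a small sketch of the arithmetic behind a single LoRA-wrapped linear layer. It assumes standard LoRA, where the update is `(lora_alpha / r) * B @ A`. The 4096 x 4096 layer shape is hypothetical, since the real projection shapes of this model are not shown in this notebook.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A (standard LoRA)."""
    return lora_alpha / r


# Hypothetical 4096 x 4096 projection; the real layer shapes are not shown here.
full = 4096 * 4096
extra = lora_extra_params(4096, 4096, r=16)
print(f"full: {full:,}  lora: {extra:,}  ratio: {extra / full:.4%}")
print(f"scaling: {lora_scaling(32, 16)}")
```

Doubling `r` doubles the added parameters per layer, and with `lora_alpha` fixed it also halves the scaling factor.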


In [8]:
# Apply LoRA to the model
print("\nApplying LoRA adapters...")
model = get_peft_model(model, lora_config)

# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()

# Print trainable parameters
model.print_trainable_parameters()


Applying LoRA adapters...
trainable params: 434,659,328 || all params: 32,012,596,672 || trainable%: 1.3578
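
The percentage above can be reproduced from the two counts, and the same numbers give a rough idea of storage size. This is a back-of-the-envelope sketch only: the 2 bytes per parameter figure assumes BF16 storage, which may not match the dtype actually used when the adapter is saved.

```python
trainable_params = 434_659_328      # from print_trainable_parameters() above
all_params = 32_012_596_672

trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")

# Rough storage estimate, assuming 2 bytes per parameter (BF16).
# The dtype actually used for saved adapter weights is an assumption here.
BYTES_PER_PARAM = 2
GIB = 1024**3
adapter_gib = trainable_params * BYTES_PER_PARAM / GIB
full_gib = all_params * BYTES_PER_PARAM / GIB
print(f"adapter: ~{adapter_gib:.2f} GiB, full weights: ~{full_gib:.1f} GiB")
```

If the adapter weights were stored in FP32 instead, the adapter estimate would roughly double.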


In [9]:
# Check memory after LoRA setup
# (called without a device argument, these report the current device only)
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / (1024**3)
    reserved = torch.cuda.memory_reserved() / (1024**3)
    print(f"\nGPU Memory:")
    print(f"  Allocated: {allocated:.1f} GB")
    print(f"  Reserved: {reserved:.1f} GB")


GPU Memory:
  Allocated: 3.3 GB
  Reserved: 3.3 GB


## 3.7 Configure Training Arguments

These hyperparameters control the training process. Think of them as knobs to tune.

**Key parameters explained**:
- `learning_rate`: Step size for weight updates. Too high = unstable, too low = slow.
- `per_device_train_batch_size`: Examples per GPU per step. Limited by VRAM.
- `gradient_accumulation_steps`: Simulate larger batches without more memory.
- `num_train_epochs`: Full passes through the dataset.
- `warmup_ratio`: Gradually increase LR at start (avoids early instability).
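
To show how `learning_rate`, `warmup_ratio`, and a cosine schedule interact, the sketch below computes the learning rate at a given step: linear warmup to the peak, then cosine decay toward zero. It illustrates the general shape under the stated assumptions and is not the Trainer's own scheduler code. The function `lr_at_step` and the step count of 1000 are hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical step count, for illustration only
for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, total):.2e}")
```

The rate starts at zero, reaches the peak once warmup ends, and falls back to roughly zero by the final step.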

In [10]:
from transformers import TrainingArguments

# Output directory for checkpoints and logs
OUTPUT_DIR = "../outputs/lora_adapter"

training_args = TrainingArguments(
    # Output
    output_dir=OUTPUT_DIR,
    
    # Training duration
    num_train_epochs=1,  # Start with 1 epoch for quick iteration
    # max_steps=1000,  # Alternative: train for fixed number of steps
    
    # Batch size configuration
    # Effective batch size = per_device_batch_size * gradient_accumulation_steps * num_gpus
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # Effective batch size = 2 * 8 = 16 per GPU
    
    # Learning rate schedule
    learning_rate=2e-4,  # Higher than full fine-tuning since only LoRA weights update
    warmup_ratio=0.03,   # 3% of training steps for warmup
    lr_scheduler_type="cosine",  # Gradually decrease LR
    
    # Optimization
    optim="adamw_torch",  # Standard AdamW optimizer
    weight_decay=0.01,    # Decoupled weight decay (AdamW); acts as a regularizer
    max_grad_norm=1.0,    # Gradient clipping for stability
    
    # Precision
    bf16=True,  # Use BF16 on A100/H100
    
    # Logging
    logging_dir=f"{OUTPUT_DIR}/logs",
    logging_steps=10,
    logging_first_step=True,
    report_to=["tensorboard"],  # Log to TensorBoard
    
    # Evaluation
    eval_strategy="steps",
    eval_steps=500,  # Evaluate every 500 steps
    
    # Checkpointing
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,  # Keep only last 3 checkpoints to save space
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    
    # Other settings
    gradient_checkpointing=True,  # Trade compute for memory
    dataloader_pin_memory=True,
    remove_unused_columns=True,
    seed=42,
)

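# Calculate and display effective batch size.
# Note: multiplying by the GPU count assumes data parallelism (one model copy per GPU).
# With device_map="auto" in a single process, the model is sharded across GPUs instead,
# in which case the effective batch size is likely per-device batch * accumulation steps.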
effective_batch_size = (
    training_args.per_device_train_batch_size
    * training_args.gradient_accumulation_steps
    * torch.cuda.device_count()
)

print(f"Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Per-device batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Number of GPUs: {torch.cuda.device_count()}")
print(f"  Effective batch size: {effective_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Warmup ratio: {training_args.warmup_ratio}")

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
`logging_dir` is deprecated and will be removed in v5.2. Please set `TENSORBOARD_LOGGING_DIR` instead.


Training configuration:
  Epochs: 1
  Per-device batch size: 2
  Gradient accumulation: 8
  Number of GPUs: 8
  Effective batch size: 128
  Learning rate: 0.0002
  Warmup ratio: 0.03
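
The effective batch size depends on how many copies of the model process data in parallel, which is not necessarily the number of visible GPUs. The sketch below, using a hypothetical helper `effective_batch_size`, contrasts the two cases. Which case applies to a given run depends on how training is launched and is worth confirming against the step count the trainer reports.

```python
def effective_batch_size(per_device: int, grad_accum: int, replicas: int) -> int:
    """Examples contributing to one optimizer step.

    `replicas` is the number of data-parallel model copies, which is not
    always the same as the number of visible GPUs.
    """
    return per_device * grad_accum * replicas


# Data-parallel across 8 GPUs (one model copy per GPU).
print(effective_batch_size(2, 8, replicas=8))  # 128
# One model copy sharded across GPUs (e.g. a device_map that splits layers).
print(effective_batch_size(2, 8, replicas=1))  # 16
```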


## 3.8 Set Up the SFTTrainer

The **SFTTrainer** (Supervised Fine-Tuning Trainer) from TRL handles:
- Tokenizing the dataset's `text` column
- Packing multiple examples into sequences (optional, for efficiency)
- Managing the training loop
- Logging and checkpointing

In [14]:
from trl import SFTTrainer, SFTConfig

# Create the SFT config (extends TrainingArguments)
sft_config = SFTConfig(
    # Inherit all training arguments
    **training_args.to_dict(),
    
    # SFT-specific settings
    # Note: newer TRL releases use max_length (formerly max_seq_length)
    max_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",  # Column name in our dataset
    packing=False,  # Don't pack multiple examples (cleaner loss signal)
)

print(f"SFT Configuration:")
print(f"  Max sequence length: {sft_config.max_length}")
print(f"  Dataset text field: {sft_config.dataset_text_field}")
print(f"  Packing: {sft_config.packing}")

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
`logging_dir` is deprecated and will be removed in v5.2. Please set `TENSORBOARD_LOGGING_DIR` instead.


SFT Configuration:
  Max sequence length: 1024
  Dataset text field: text
  Packing: False


In [16]:
# Create the trainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["validation"],
    processing_class=tokenizer,
)

print(f"\n✅ Trainer created!")
print(f"   Training examples: {len(trainer.train_dataset):,}")
print(f"   Evaluation examples: {len(trainer.eval_dataset):,}")


✅ Trainer created!
   Training examples: 182,822
   Evaluation examples: 4,183


In [17]:
# Calculate total training steps
num_training_examples = len(trainer.train_dataset)
steps_per_epoch = num_training_examples // effective_batch_size
total_steps = steps_per_epoch * training_args.num_train_epochs

print(f"\nTraining plan:")
print(f"  Training examples: {num_training_examples:,}")
print(f"  Effective batch size: {effective_batch_size}")
print(f"  Steps per epoch: {steps_per_epoch:,}")
print(f"  Total epochs: {training_args.num_train_epochs}")
print(f"  Total steps: {total_steps:,}")
print(f"  Evaluation every: {training_args.eval_steps} steps")
print(f"  Checkpoints every: {training_args.save_steps} steps")


Training plan:
  Training examples: 182,822
  Effective batch size: 128
  Steps per epoch: 1,428
  Total epochs: 1
  Total steps: 1,428
  Evaluation every: 500 steps
  Checkpoints every: 500 steps
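
The plan above uses floor division, which ignores the final partial batch. The short sketch below works through the same arithmetic with the numbers printed above and lists which steps trigger evaluation and checkpointing at an interval of 500. It is plain arithmetic and does not query the trainer.

```python
import math

num_examples = 182_822
batch = 128  # effective batch size printed above

floor_steps = num_examples // batch           # drops the final partial batch
ceil_steps = math.ceil(num_examples / batch)  # counts it as one more step
leftover = num_examples - floor_steps * batch

print(f"floor: {floor_steps:,}  ceil: {ceil_steps:,}  leftover examples: {leftover}")

# Evaluation/checkpoint steps that fall inside one epoch at an interval of 500.
eval_points = list(range(500, floor_steps + 1, 500))
print(f"eval/checkpoint steps: {eval_points}")
```

With these numbers only steps 500 and 1000 trigger an evaluation, so the last few hundred steps of the epoch are not covered by a step-based evaluation.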


## 3.9 Train the Model

Now we train! Watch the loss metrics:
- **train_loss**: Should decrease steadily
- **eval_loss**: Should decrease but may plateau. If it increases while train_loss decreases, that's overfitting.

**Pro tip**: You can monitor training in real time with TensorBoard (run from the project root, since the path is relative):
```bash
tensorboard --logdir=outputs/lora_adapter/logs
```
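
The overfitting pattern described above (eval loss rising while train loss keeps falling) can also be checked programmatically from logged values. The following is a minimal sketch; the helper `overfitting_started_at`, the `patience` parameter, and the loss values are all made up for illustration and are not results from this run.

```python
def overfitting_started_at(train_losses, eval_losses, patience=2):
    """Return the index where eval loss began rising for `patience`
    consecutive evaluations while train loss kept falling, else None."""
    rising = 0
    for i in range(1, len(eval_losses)):
        eval_up = eval_losses[i] > eval_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        rising = rising + 1 if (eval_up and train_down) else 0
        if rising >= patience:
            return i - patience + 1
    return None


# Made-up loss values, one pair per evaluation point.
train = [2.1, 1.6, 1.3, 1.1, 0.9]
evals = [2.0, 1.7, 1.6, 1.65, 1.72]
print(overfitting_started_at(train, evals))  # 3
```

Requiring more than one consecutive rise avoids flagging a single noisy evaluation.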

In [20]:
print("🚀 Starting training...")
print("="*60)

# Train!
train_result = trainer.train()

print("="*60)
print("✅ Training complete!")

🚀 Starting training...


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

In [19]:
# Print training summary
print("\nTraining Summary:")
print(f"  Total steps: {train_result.global_step}")
print(f"  Training loss: {train_result.training_loss:.4f}")

if hasattr(train_result, 'metrics'):
    print(f"\nFinal metrics:")
    for key, value in train_result.metrics.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
        else:
            print(f"  {key}: {value}")


Training Summary:


NameError: name 'train_result' is not defined

## 3.10 Save the Trained Adapter

In [None]:
# Save the final model
final_adapter_path = Path(OUTPUT_DIR) / "final_adapter"

print(f"Saving adapter to: {final_adapter_path}")
trainer.save_model(str(final_adapter_path))

# Also save the tokenizer with the adapter
tokenizer.save_pretrained(str(final_adapter_path))

print(f"\n✅ Model and tokenizer saved!")

# List saved files
print(f"\nSaved files:")
for f in sorted(final_adapter_path.iterdir()):
    size = f.stat().st_size / (1024**2)  # MB
    print(f"  {f.name}: {size:.1f} MB")
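
After saving, it can be useful to confirm that the adapter directory contains what a later load will need. The sketch below checks for the two file names PEFT typically writes. Treat those names as an assumption to verify against your installed version; the helper `missing_adapter_files` is hypothetical. The demonstration uses a temporary directory rather than the real output path.

```python
from pathlib import Path
import tempfile

# File names PEFT typically writes; treat them as an assumption to verify
# against your installed version.
EXPECTED = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_dir: Path) -> list[str]:
    """Return expected adapter files that are absent from adapter_dir."""
    return [name for name in EXPECTED if not (adapter_dir / name).is_file()]


# Demonstration on a throwaway directory rather than the real output path.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "adapter_config.json").write_text("{}")
    demo_missing = missing_adapter_files(demo_dir)
    print(demo_missing)  # ['adapter_model.safetensors']
```

In the notebook, the same check could be run as `missing_adapter_files(final_adapter_path)` once the save cell has completed.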

## 3.11 Test the Fine-Tuned Model

Let's compare the fine-tuned model to the baseline from notebook 02.

**Reminder**: The baseline (untrained LoRA) output was gibberish like:
```
, and,,,,,,,, in, and,, in,,,, in and,, in,,, and and in in,...
```

In [None]:
# Prepare the same test prompt from notebook 02
test_prompt = """<|im_start|>system
You are a medical expert. Answer the multiple choice question by selecting the correct option and providing a brief explanation.<|im_end|>
<|im_start|>user
Question: Which vitamin is essential for blood clotting?

A) Vitamin A
B) Vitamin C
C) Vitamin K
D) Vitamin D<|im_end|>
<|im_start|>assistant
"""

# Tokenize
inputs = tokenizer(test_prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

print(f"Test prompt length: {inputs['input_ids'].shape[1]} tokens")

In [None]:
# Generate response
model.eval()

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

print("Generated response (FINE-TUNED MODEL):")
print("="*60)
print(generated_text)
print("="*60)

In [None]:
# Test with another question
test_prompt_2 = """<|im_start|>system
You are a medical expert. Answer the multiple choice question by selecting the correct option and providing a brief explanation.<|im_end|>
<|im_start|>user
Question: What is the most common cause of peptic ulcer disease?

A) Stress
B) Spicy food
C) Helicobacter pylori infection
D) Alcohol consumption<|im_end|>
<|im_start|>assistant
"""

inputs_2 = tokenizer(test_prompt_2, return_tensors="pt")
inputs_2 = {k: v.to(model.device) for k, v in inputs_2.items()}

with torch.no_grad():
    outputs_2 = model.generate(
        **inputs_2,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated_text_2 = tokenizer.decode(outputs_2[0], skip_special_tokens=False)

print("Generated response (Question 2):")
print("="*60)
print(generated_text_2)
print("="*60)
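
Because the cells above decode with `skip_special_tokens=False`, the printed text contains the full prompt and the chat markers as well as the answer. One way to isolate the answer is sketched below with a hypothetical helper `extract_reply`, assuming the `<|im_start|>` / `<|im_end|>` format used in the prompts. The sample string is made up and is not model output.

```python
ASSISTANT_TAG = "<|im_start|>assistant\n"
END_TAG = "<|im_end|>"


def extract_reply(decoded: str) -> str:
    """Return the text after the last assistant tag, up to the end tag."""
    _, _, tail = decoded.rpartition(ASSISTANT_TAG)
    reply, _, _ = tail.partition(END_TAG)
    return reply.strip()


# Made-up decoded string in the same chat format as the prompts above.
sample = (
    "<|im_start|>user\nQuestion: ...<|im_end|>\n"
    "<|im_start|>assistant\nC) Vitamin K<|im_end|>"
)
print(extract_reply(sample))  # C) Vitamin K
```

An alternative is to slice off the prompt tokens before decoding, using the prompt length already printed in the tokenization cell.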

## 3.12 Final Evaluation on Validation Set

In [None]:
# Run final evaluation
print("Running final evaluation on validation set...")

eval_results = trainer.evaluate()

print("\n" + "="*60)
print("Final Evaluation Results:")
print("="*60)
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")
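
An evaluation loss is easier to interpret when converted to perplexity, which is the exponential of the mean per-token negative log-likelihood. The sketch below shows the conversion, along with the loss a uniform guess over this tokenizer's 131,072-token vocabulary would give. The loss values in the loop are illustrative and are not results from this run.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Convert a mean per-token negative log-likelihood (nats) to perplexity."""
    return math.exp(mean_nll)


# Illustrative loss values only, not results from this run.
for loss in (0.0, 1.0, 2.0):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")

# A uniform guess over a 131,072-token vocabulary has loss ln(131072).
print(f"uniform-guess loss: {math.log(131_072):.2f}")
```

Assuming `eval_results` contains an `eval_loss` entry, it could be passed as `perplexity(eval_results["eval_loss"])`.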

---

## ✅ Summary

In this notebook, we:

1. **Loaded** the model with LoRA configuration
2. **Configured** training hyperparameters (LR, batch size, etc.)
3. **Set up** the SFTTrainer from TRL
4. **Trained** the LoRA adapter on MedMCQA
5. **Saved** the trained adapter (~2GB vs 60GB for full model)
6. **Tested** the fine-tuned model on sample questions

### Key Observations

| Metric | Before Training | After Training (expected) |
|--------|-----------------|----------------|
| Output quality | Gibberish | Coherent medical answers |
| Loss | ~12 | ~1-2 (depends on training) |
| Trainable params | 435M (1.36%) | Same |

### Saved Artifacts

- `outputs/lora_adapter/final_adapter/`: The trained LoRA weights
- `outputs/lora_adapter/logs/`: TensorBoard training logs
- `outputs/lora_adapter/checkpoint-*/`: Intermediate checkpoints

## ⏭️ Next Steps

In the next notebook (`04_unsloth_comparison.ipynb`), we'll:
- Compare training with Unsloth optimization
- Measure speed and memory improvements
- Discuss when to use each approach

In [None]:
# Cleanup: drop references, collect garbage, then release cached GPU memory
import gc

del model
del trainer
gc.collect()
torch.cuda.empty_cache()

print("✅ GPU memory cleared. Ready for next notebook!")