# 🔧 Step 2: Loading the Model & Configuring LoRA

**Goal**: Load Nemotron-3-Nano and set up LoRA adapters using the standard HuggingFace stack.

In this notebook, we'll:
1. Load the base model and tokenizer
2. Understand the model architecture
3. Configure LoRA parameters
4. Verify the setup before training

**Stack used**: Transformers + PEFT (no Unsloth yet; that comes in notebook 04)

## 2.1 Install Dependencies

Dependencies are installed outside the notebook (via `uv sync`). This cell only authenticates with the HuggingFace Hub.

In [1]:
# Dependencies are managed in pyproject.toml
# Run `uv sync` in the project root to install all required packages

import os
from huggingface_hub import login

# Authenticate with HuggingFace using token from environment
if os.environ.get("HF_TOKEN"):
    login(token=os.environ["HF_TOKEN"])
    print("✅ Logged in to HuggingFace Hub")
else:
    print("⚠️ HF_TOKEN not found in environment. Set it to avoid rate limits.")

⚠️ HF_TOKEN not found in environment. Set it to avoid rate limits.




## 2.2 Check GPU Availability

Before loading a 30B model, let's verify we have sufficient GPU resources.

**Requirements**:
- A100 80GB (recommended) or H100
- CUDA available with BF16 support
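
A rough sizing calculation shows why an 80 GB card is recommended. The sketch below multiplies a parameter count by bytes per parameter; the helper name `weight_memory_gib` is illustrative, and the estimate covers weights only (activations, LoRA gradients, and optimizer state come on top).

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Nominal 30B parameters, as stated above; BF16 uses 2 bytes, FP32 uses 4.
for name, nbytes in [("BF16", 2), ("FP32", 4)]:
    print(f"{name}: {weight_memory_gib(30e9, nbytes):.1f} GiB")
```

In BF16 the weights alone come to roughly 56 GiB, which leaves limited headroom on a single 80 GB GPU.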

In [2]:
import torch

# Check CUDA availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        total_memory_gb = props.total_memory / (1024**3)
        print(f"\nGPU {i}: {props.name}")
        print(f"  Total Memory: {total_memory_gb:.1f} GB")
        print(f"  Compute Capability: {props.major}.{props.minor}")
        print(f"  BF16 Supported: {props.major >= 8}")
else:
    print("⚠️ No GPU available. This notebook requires a GPU with at least 40GB VRAM.")

PyTorch version: 2.10.0+cu128
CUDA available: True
CUDA version: 12.8
GPU count: 8

GPU 0: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True

GPU 1: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True

GPU 2: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True

GPU 3: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True

GPU 4: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True

GPU 5: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True

GPU 6: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True

GPU 7: NVIDIA A100-SXM4-80GB
  Total Memory: 79.3 GB
  Compute Capability: 8.0
  BF16 Supported: True


## 2.3 Load the Tokenizer

The tokenizer converts text to token IDs (and back). Think of it as a serialization format.

Key concepts:
- **Vocabulary**: The set of all tokens the model knows (131,072 for this tokenizer)
- **Special tokens**: Control tokens such as `<|im_start|>` and `<|im_end|>`. This tokenizer ships without a padding token, so one is assigned a few cells later.
- **Token IDs**: Integer indices into the vocabulary

In [3]:
from transformers import AutoTokenizer

MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

print(f"Tokenizer type: {type(tokenizer).__name__}")
print(f"Vocabulary size: {len(tokenizer):,}")
print(f"Model max length: {tokenizer.model_max_length:,}")



Tokenizer type: TokenizersBackend
Vocabulary size: 131,072
Model max length: 262,144


In [4]:
# Check special tokens
print("Special tokens:")
print(f"  BOS (beginning of sequence): {tokenizer.bos_token!r} -> {tokenizer.bos_token_id}")
print(f"  EOS (end of sequence): {tokenizer.eos_token!r} -> {tokenizer.eos_token_id}")
print(f"  PAD (padding): {tokenizer.pad_token!r} -> {tokenizer.pad_token_id}")
print(f"  UNK (unknown): {tokenizer.unk_token!r} -> {tokenizer.unk_token_id}")

Special tokens:
  BOS (beginning of sequence): '<s>' -> 1
  EOS (end of sequence): '<|im_end|>' -> 11
  PAD (padding): None -> None
  UNK (unknown): '<unk>' -> 0


In [5]:
# Set padding token if not set (required for batch training)
# Common practice: use EOS token as padding token for decoder-only models
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    print(f"Set pad_token to: {tokenizer.pad_token!r}")

# Left padding is preferred for generation: every prompt ends at the same position, so new tokens follow real text rather than padding
tokenizer.padding_side = "left"
print(f"Padding side: {tokenizer.padding_side}")

Set pad_token to: '<|im_end|>'
Padding side: left
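
The effect of the padding side can be illustrated without a tokenizer. The following is a minimal sketch, not the tokenizer's own implementation; `pad_batch` is a hypothetical helper, and the pad ID 11 is taken from the EOS token ID printed above.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token IDs to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padding = [pad_id] * n_pad
        real = [1] * len(seq)   # 1 marks a real token
        ignored = [0] * n_pad   # 0 marks padding
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(ignored + real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + ignored)
    return input_ids, attention_mask


batch = [[10, 25708, 1010], [10, 3263]]
for side in ("left", "right"):
    ids, mask = pad_batch(batch, pad_id=11, side=side)
    print(side, ids, mask)
```

Because the pad token here is the same as EOS, the attention mask is what tells padding apart from a genuine end-of-sequence token.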


## 2.4 Test Tokenization

Let's see how our formatted medical QA text gets tokenized.

In [6]:
# Load our formatted dataset to test
from datasets import load_from_disk

formatted_dataset = load_from_disk("../data/medmcqa_formatted")
sample_text = formatted_dataset["train"][0]["text"]

print("Sample text (first 500 chars):")
print(sample_text[:500])
print("...")

Sample text (first 500 chars):
<|im_start|>system
You are a medical expert. Answer the multiple choice question by selecting the correct option and providing a brief explanation.<|im_end|>
<|im_start|>user
Question: Chronic urethral obstruction due to benign prismatic hyperplasia can lead to the following change in kidney parenchyma

A) Hyperplasia
B) Hyperophy
C) Atrophy
D) Dyplasia<|im_end|>
<|im_start|>assistant
The correct answer is C) Atrophy.

Explanation: Chronic urethral obstruction because of urinary calculi, prostat
...


In [7]:
# Tokenize the sample
tokens = tokenizer(sample_text, return_tensors="pt")

print(f"\nTokenization result:")
print(f"  Input IDs shape: {tokens['input_ids'].shape}")
print(f"  Attention mask shape: {tokens['attention_mask'].shape}")
print(f"  Number of tokens: {tokens['input_ids'].shape[1]}")
print(f"\nFirst 20 token IDs: {tokens['input_ids'][0][:20].tolist()}")


Tokenization result:
  Input IDs shape: torch.Size([1, 173])
  Attention mask shape: torch.Size([1, 173])
  Number of tokens: 173

First 20 token IDs: [10, 25708, 1010, 4568, 1584, 1261, 9371, 16967, 1046, 3450, 1278, 6245, 10284, 4098, 1536, 31277, 1278, 6298, 7091, 1321]


In [8]:
# Decode back to text to verify
decoded = tokenizer.decode(tokens['input_ids'][0][:50])
print(f"Decoded first 50 tokens:")
print(decoded)

Decoded first 50 tokens:
<|im_start|>system
You are a medical expert. Answer the multiple choice question by selecting the correct option and providing a brief explanation.<|im_end|>
<|im_start|>user
Question: Chronic urethral obstruction due to benign prismatic hyperplasia can lead to the following change in kidney


In [9]:
# Check how ChatML special tokens are tokenized
special_tokens = ["<|im_start|>", "<|im_end|>", "system", "user", "assistant"]

print("ChatML token analysis:")
for token in special_tokens:
    ids = tokenizer.encode(token, add_special_tokens=False)
    print(f"  '{token}' -> {ids}")

ChatML token analysis:
  '<|im_start|>' -> [10]
  '<|im_end|>' -> [11]
  'system' -> [25708]
  'user' -> [3263]
  'assistant' -> [1503, 19464]


## 2.5 Analyze Token Lengths

We need to choose an appropriate `max_seq_length` for training. Too short = truncation, too long = wasted memory.

In [10]:
import numpy as np
from tqdm import tqdm

# Sample a subset for analysis (full dataset takes too long)
sample_size = min(5000, len(formatted_dataset["train"]))
sample_indices = np.random.choice(len(formatted_dataset["train"]), sample_size, replace=False)

token_lengths = []
for idx in tqdm(sample_indices, desc="Tokenizing samples"):
    text = formatted_dataset["train"][int(idx)]["text"]
    tokens = tokenizer(text, return_tensors="pt")
    token_lengths.append(tokens['input_ids'].shape[1])

print(f"\nToken length statistics (n={sample_size}):")
print(f"  Min: {min(token_lengths):,}")
print(f"  Max: {max(token_lengths):,}")
print(f"  Mean: {np.mean(token_lengths):,.0f}")
print(f"  Median: {np.median(token_lengths):,.0f}")
print(f"  95th percentile: {np.percentile(token_lengths, 95):,.0f}")
print(f"  99th percentile: {np.percentile(token_lengths, 99):,.0f}")

Tokenizing samples: 100%|██████████| 5000/5000 [00:02<00:00, 1715.83it/s]


Token length statistics (n=5000):
  Min: 69
  Max: 2,111
  Mean: 206
  Median: 171
  95th percentile: 441
  99th percentile: 751





In [11]:
# Recommend max_seq_length based on analysis
p99 = np.percentile(token_lengths, 99)

# Round up to the next power of 2 for efficiency
recommended_max_len = 2 ** int(np.ceil(np.log2(p99)))

print(f"\n📊 Recommendation:")
print(f"   99th percentile: {p99:.0f} tokens")
print(f"   Recommended max_seq_length: {recommended_max_len}")
print(f"")
print(f"   This truncates fewer than 1% of examples while keeping memory use bounded.")
print(f"   For training, we'll use max_seq_length = 1024 (covers over 99% without truncation)")


📊 Recommendation:
   99th percentile: 751 tokens
   Recommended max_seq_length: 1024

   This truncates fewer than 1% of examples while keeping memory use bounded.
   For training, we'll use max_seq_length = 1024 (covers over 99% without truncation)
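
To compare candidate cutoffs directly, one option is to measure what share of samples each one would truncate. This is a minimal sketch; `truncated_fraction` is a hypothetical helper, and the lengths are a small made-up list rather than the notebook's `token_lengths`.

```python
def truncated_fraction(lengths, max_len):
    """Fraction of sequences longer than max_len (these would be truncated)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > max_len) / len(lengths)


example_lengths = [69, 171, 206, 441, 751, 2111]  # illustrative values only
for cutoff in (256, 512, 1024, 2048):
    share = truncated_fraction(example_lengths, cutoff)
    print(f"max_seq_length={cutoff}: {share:.1%} truncated")
```

Passing the real `token_lengths` list instead of `example_lengths` gives the measured truncation rate for each cutoff.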


## 2.6 Load the Base Model

Now let's load Nemotron-3-Nano. This is a 30B parameter model, but only ~3B parameters activate per token (MoE architecture).

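**Key settings**:
- `torch_dtype=torch.bfloat16`: Use BF16 precision (halves memory vs FP32). Recent Transformers releases deprecate this argument in favor of `dtype`, as the warning in the load output shows.
- `device_map="auto"`: Automatically distribute the weights across available GPUs
- `attn_implementation="flash_attention_2"`: Use FlashAttention-2 for memory efficiency. It is commented out in the load cell; uncomment it if the `flash-attn` package is installed.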

In [12]:
from transformers import AutoModelForCausalLM
import torch

print(f"Loading model: {MODEL_NAME}")
print("This may take a few minutes...")

# Estimate free memory on GPU 0 (total minus what this process has allocated)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    free_memory = torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated(0)
    print(f"\nAvailable GPU memory: {free_memory / (1024**3):.1f} GB")

Loading model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
This may take a few minutes...

Available GPU memory: 79.3 GB


In [13]:
# Load the model with optimal settings
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    # Use FlashAttention-2 if available (significant memory savings)
    # attn_implementation="flash_attention_2" if torch.cuda.is_available() else None,
)

print(f"\n✅ Model loaded successfully!")

`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|██████████| 6243/6243 [00:10<00:00, 590.23it/s, Materializing param=lm_head.weight]



✅ Model loaded successfully!


In [14]:
# Check memory usage after loading (GPU 0 only; device_map="auto" may place weights on other GPUs too)
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / (1024**3)
    reserved = torch.cuda.memory_reserved(0) / (1024**3)
    print(f"GPU memory allocated: {allocated:.1f} GB")
    print(f"GPU memory reserved: {reserved:.1f} GB")

GPU memory allocated: 3.2 GB
GPU memory reserved: 3.2 GB
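
The figures above cover device 0 only. With `device_map="auto"` the weights may be spread over several GPUs, so summing across devices gives a fuller picture. The sketch below is one way to do that; `summarize_allocations` is a hypothetical helper, and the `torch` import is guarded so the block still runs on a machine without PyTorch or CUDA.

```python
def summarize_allocations(per_device_bytes):
    """Convert per-device byte counts to GiB and return (per-device list, total)."""
    per_device_gib = [b / 1024**3 for b in per_device_bytes]
    return per_device_gib, sum(per_device_gib)


try:
    import torch
    cuda_ok = torch.cuda.is_available()
except ImportError:
    cuda_ok = False

if cuda_ok:
    # memory_allocated(i) reports bytes held by tensors on device i
    counts = [torch.cuda.memory_allocated(i) for i in range(torch.cuda.device_count())]
    per_gpu, total = summarize_allocations(counts)
    for i, gib in enumerate(per_gpu):
        print(f"GPU {i}: {gib:.1f} GiB allocated")
    print(f"Total across GPUs: {total:.1f} GiB")
```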


## 2.7 Explore the Model Architecture

Understanding the architecture helps us know:
1. Which layers can be targeted with LoRA
2. How the MoE (Mixture of Experts) routing works
3. The hybrid Mamba-Transformer structure

In [15]:
# Print model summary
print(f"Model type: {type(model).__name__}")
print(f"Model dtype: {model.dtype}")
print(f"\nModel configuration:")
print(f"  Hidden size: {model.config.hidden_size}")
print(f"  Num layers: {model.config.num_hidden_layers}")
print(f"  Num attention heads: {model.config.num_attention_heads}")
print(f"  Vocabulary size: {model.config.vocab_size}")

Model type: NemotronHForCausalLM
Model dtype: torch.bfloat16

Model configuration:
  Hidden size: 2688
  Num layers: 52
  Num attention heads: 32
  Vocabulary size: 131072


In [16]:
# Count parameters
def count_parameters(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total_params, trainable_params = count_parameters(model)
print(f"\nParameter count:")
print(f"  Total: {total_params / 1e9:.2f}B")
print(f"  Trainable: {trainable_params / 1e9:.2f}B")
print(f"  Trainable %: {100 * trainable_params / total_params:.2f}%")


Parameter count:
  Total: 31.58B
  Trainable: 31.58B
  Trainable %: 100.00%


In [17]:
# Find all linear layer names (potential LoRA targets)
def find_linear_layers(model):
    """Find all linear layer names in the model."""
    linear_layers = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Get the layer name (last part)
            layer_name = name.split('.')[-1]
            linear_layers.add(layer_name)
    return sorted(linear_layers)

linear_layers = find_linear_layers(model)
print(f"\nFound {len(linear_layers)} unique linear layer types:")
for layer in linear_layers:
    print(f"  - {layer}")


Found 9 unique linear layer types:
  - down_proj
  - in_proj
  - k_proj
  - lm_head
  - o_proj
  - out_proj
  - q_proj
  - up_proj
  - v_proj
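
The set above lists layer types but not how many modules of each type exist, which affects how many LoRA adapters get created. A minimal sketch of counting by name suffix follows; `count_by_suffix` is a hypothetical helper and the module paths are invented for illustration. In the notebook, the names would come from `model.named_modules()` filtered to `torch.nn.Linear`.

```python
from collections import Counter


def count_by_suffix(module_names):
    """Count module names by their final dotted component."""
    return Counter(name.rsplit(".", 1)[-1] for name in module_names)


# Hypothetical module paths, for illustration only
example_names = [
    "layers.0.mixer.in_proj",
    "layers.0.mixer.out_proj",
    "layers.1.mixer.q_proj",
    "layers.1.mixer.k_proj",
    "layers.2.mixer.in_proj",
]
for suffix, n in sorted(count_by_suffix(example_names).items()):
    print(f"{suffix}: {n}")
```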


## 2.8 Configure LoRA

Now we'll set up LoRA adapters. These are small trainable matrices injected into specific layers.

**Key parameters** (see AGENTS.md for detailed explanation):
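- `r` (rank): Controls adapter capacity. A higher rank adds trainable parameters and can fit the data more closely, at the cost of memory and a greater risk of overfitting. Start with 16-32.
- `lora_alpha`: Scaling factor; the low-rank update is multiplied by `lora_alpha / r`. Rule of thumb: 2x the rank.
- `target_modules`: Which layers get LoRA adapters, matched by module name.
- `lora_dropout`: Dropout applied on the LoRA path as regularization against overfitting.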
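
To make the rank trade-off concrete, the sketch below counts the parameters one adapter adds to a single linear layer. It assumes the standard LoRA formulation (matrices `A` of shape `r x d_in` and `B` of shape `d_out x r`, scaled by `lora_alpha / r`). The square 2688 x 2688 layer is an assumption chosen for illustration, not a statement about any particular projection in this model, and both helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


d = 2688  # assumed layer width, for illustration only
for r in (8, 16, 32):
    added = lora_param_count(d, d, r)
    print(f"r={r}: {added:,} adapter params vs {d * d:,} frozen, "
          f"scaling={lora_scaling(2 * r, r)}")
```

Doubling the rank doubles the adapter size, while keeping `lora_alpha` at twice the rank holds the scaling factor at 2.0.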

In [18]:
from peft import LoraConfig, TaskType, get_peft_model

# Configure LoRA
lora_config = LoraConfig(
    # Core LoRA parameters
    r=16,                      # Rank: capacity vs memory trade-off
    lora_alpha=32,             # Scaling factor (2x rank is a good default)
    lora_dropout=0.05,         # Regularization
    
    # Which layers to target
    # For Nemotron: attention projections + MLP layers
    # Note: This model doesn't have gate_proj (no SwiGLU gating)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "up_proj", "down_proj",                   # MLP
    ],
    
    # Task type for causal language modeling
    task_type=TaskType.CAUSAL_LM,
    
    # Other settings
    bias="none",               # Don't train biases (reduces params)
    inference_mode=False,      # We're training, not running inference
)

print("LoRA Configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Scaling factor: {lora_config.lora_alpha / lora_config.r}")
print(f"  Dropout: {lora_config.lora_dropout}")
print(f"  Target modules: {lora_config.target_modules}")

LoRA Configuration:
  Rank (r): 16
  Alpha: 32
  Scaling factor: 2.0
  Dropout: 0.05
  Target modules: {'o_proj', 'down_proj', 'v_proj', 'up_proj', 'q_proj', 'k_proj'}


In [19]:
# Apply LoRA to the model
print("Applying LoRA adapters to model...")
model = get_peft_model(model, lora_config)
print("✅ LoRA adapters added!")

Applying LoRA adapters to model...
✅ LoRA adapters added!


In [20]:
# Check the new parameter counts
model.print_trainable_parameters()

trainable params: 434,659,328 || all params: 32,012,596,672 || trainable%: 1.3578


In [21]:
# Detailed breakdown
total_params, trainable_params = count_parameters(model)

print(f"\nDetailed parameter breakdown:")
print(f"  Total parameters: {total_params / 1e9:.2f}B")
print(f"  Trainable (LoRA): {trainable_params / 1e6:.2f}M")
print(f"  Frozen (base): {(total_params - trainable_params) / 1e9:.2f}B")
print(f"  Trainable %: {100 * trainable_params / total_params:.4f}%")
print(f"")
print(f"Memory savings:")
print(f"  Without LoRA, we'd need to store gradients for {total_params/1e9:.1f}B params")
print(f"  With LoRA, we only store gradients for {trainable_params/1e6:.1f}M params")
print(f"  That's {total_params / trainable_params:.0f}x fewer gradient values!")


Detailed parameter breakdown:
  Total parameters: 32.01B
  Trainable (LoRA): 434.66M
  Frozen (base): 31.58B
  Trainable %: 1.3578%

Memory savings:
  Without LoRA, we'd need to store gradients for 32.0B params
  With LoRA, we only store gradients for 434.7M params
  That's 74x fewer gradient values!


## 2.9 Verify the Setup with a Forward Pass

Let's make sure everything works by running a forward pass on a sample.

In [22]:
# Prepare a sample input
sample_text = formatted_dataset["train"][0]["text"]
inputs = tokenizer(sample_text, return_tensors="pt", truncation=True, max_length=512)

# Move to GPU
inputs = {k: v.to(model.device) for k, v in inputs.items()}

print(f"Input shape: {inputs['input_ids'].shape}")
print(f"Input device: {inputs['input_ids'].device}")

Input shape: torch.Size([1, 173])
Input device: cuda:0


In [23]:
# Forward pass
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"\n✅ Forward pass successful!")
print(f"  Output logits shape: {outputs.logits.shape}")
print(f"  Loss: {outputs.loss.item():.4f}")

NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was provided, so no cache will be returned.



✅ Forward pass successful!
  Output logits shape: torch.Size([1, 173, 131072])
  Loss: 12.0128
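
One rough reference point for reading this loss is the cross-entropy of a model that guesses uniformly over the vocabulary, which equals the natural log of the vocabulary size. The sketch below computes it; `uniform_baseline_loss` is a hypothetical helper, and the vocabulary size is the one reported earlier.

```python
import math


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over vocab_size tokens."""
    return math.log(vocab_size)


vocab_size = 131_072  # from the tokenizer output above
print(f"Uniform-guess loss: {uniform_baseline_loss(vocab_size):.2f}")
```

The baseline is about 11.78, so the recorded loss of 12.01 sits slightly above a uniform guess. That may be worth a closer look before training; the notebook does not state a cause.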


## 2.10 Test Text Generation (Optional)

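Let's see what the base model with untrained LoRA adapters produces. By default PEFT initializes each adapter's `B` matrix to zero, so the adapters contribute nothing yet and this is effectively the base model's output.
This gives us a baseline to compare against after training.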

In [24]:
# Create a test prompt
test_prompt = """<|im_start|>system
You are a medical expert. Answer the multiple choice question by selecting the correct option and providing a brief explanation.<|im_end|>
<|im_start|>user
Question: Which vitamin is essential for blood clotting?

A) Vitamin A
B) Vitamin C
C) Vitamin K
D) Vitamin D<|im_end|>
<|im_start|>assistant
"""

# Tokenize
inputs = tokenizer(test_prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

print(f"Prompt length: {inputs['input_ids'].shape[1]} tokens")

Prompt length: 66 tokens


In [25]:
# Generate response
model.eval()

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
print("Generated response (BASE MODEL + UNTRAINED LoRA):")
print("="*60)
print(generated_text)
print("="*60)

Generated response (BASE MODEL + UNTRAINED LoRA):
<|im_start|>system
You are a medical expert. Answer the multiple choice question by selecting the correct option and providing a brief explanation.<|im_end|>
<|im_start|>user
Question: Which vitamin is essential for blood clotting?

A) Vitamin A
B) Vitamin C
C) Vitamin K
D) Vitamin D<|im_end|>
<|im_start|>assistant
, and,,,,,,,, in, and,, in,,,, in and,, in,,, and and in in,,, and and, in in,,,, in and, and and and and, in,, and and, and and and and over in and, and, and, and, and,, in, and, in,, and and and in and, and in and in and in,, and, and,, and in and in and in,,,, in,, in, and, and, in in and and, and in,,, and, and,,,,,,,, in, in, in in,, and
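
To compare this baseline with post-training output more systematically, one option is to extract the chosen option letter from the generated text. The sketch below assumes responses follow the "The correct answer is X)" phrasing seen in the formatted dataset; `extract_answer` and `ANSWER_PATTERN` are hypothetical names, not part of the notebook.

```python
import re

# Matches the answer phrasing used in the formatted training examples
ANSWER_PATTERN = re.compile(r"The correct answer is ([A-D])\)")


def extract_answer(text: str):
    """Return the option letter from a response, or None if no answer is found."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_answer("The correct answer is C) Atrophy."))
print(extract_answer(", and,,,,,,,, in, and,, in"))
```

For output like the sample above, the helper returns `None`, which makes unparseable responses easy to count.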


## 2.11 Save the Configuration (Optional)

We can save the LoRA configuration for reproducibility.

In [27]:
import json
from pathlib import Path

# Create outputs directory if it doesn't exist
outputs_dir = Path("../outputs")
outputs_dir.mkdir(exist_ok=True)

# Save config as JSON for reference
config_dict = {
    "model_name": MODEL_NAME,
    "lora_config": {
        "r": lora_config.r,
        "lora_alpha": lora_config.lora_alpha,
        "lora_dropout": lora_config.lora_dropout,
        "target_modules": list(lora_config.target_modules),  # Convert set to list for JSON
        "task_type": str(lora_config.task_type),
        "bias": lora_config.bias,
    },
    "training_config": {
        "max_seq_length": 1024,  # Our recommendation from analysis
    }
}

with open(outputs_dir / "training_config.json", "w") as f:
    json.dump(config_dict, f, indent=2)

print(f"✅ Configuration saved to {outputs_dir / 'training_config.json'}")
print("\nConfig contents:")
print(json.dumps(config_dict, indent=2))

✅ Configuration saved to ../outputs/training_config.json

Config contents:
{
  "model_name": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16",
  "lora_config": {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
      "o_proj",
      "down_proj",
      "v_proj",
      "up_proj",
      "q_proj",
      "k_proj"
    ],
    "task_type": "TaskType.CAUSAL_LM",
    "bias": "none"
  },
  "training_config": {
    "max_seq_length": 1024
  }
}


---

## ✅ Summary

In this notebook, we:

1. **Verified GPU resources**: Checked CUDA availability and memory
2. **Loaded the tokenizer**: Configured special tokens and padding
3. **Analyzed token lengths**: Chose `max_seq_length` = 1024, above the 99th percentile (751 tokens)
4. **Loaded the base model**: Nemotron-3-Nano with BF16 precision
5. **Explored architecture**: Found target layers for LoRA
6. **Configured LoRA**: Set up adapters with rank=16, alpha=32
7. **Verified the setup**: Successful forward pass with ~1.4% trainable params
8. **Tested generation**: Baseline output from the untrained model

### Key Numbers to Remember

| Metric | Value |
|--------|-------|
| Total parameters | ~31.6B base (~32.0B with LoRA adapters) |
| Trainable (LoRA) | ~435M |
| Trainable % | ~1.36% |
| Recommended max_seq_length | 1024 |

## ⏭️ Next Step

In the next notebook (`03_training.ipynb`), we'll:
- Set up the SFTTrainer from TRL
- Configure training hyperparameters
- Train the model on our formatted MedMCQA dataset
- Monitor loss and save checkpoints

In [None]:
# Cleanup (free GPU memory for next notebook)
del model
torch.cuda.empty_cache()
print("✅ GPU memory cleared. Ready for training notebook!")

✅ GPU memory cleared. Ready for training notebook!
