# Part 3: LoRA Fine-tuning with NeMo

This notebook demonstrates how to fine-tune Llama 3.1 8B Instruct using LoRA (Low-Rank Adaptation) with the NVIDIA NeMo framework.

## What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- Adds trainable low-rank matrices to frozen model weights
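- Reduces the memory needed for gradients and optimizer state, because only the adapter weights are trained (about 10.5M of 8.0B parameters in this notebook's run)
- Enables fine-tuning large models on consumer GPUs
- Produces small adapter files (about 21MB for the adapter trained in this notebook)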
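To make the "trainable low-rank matrices" idea concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It is an illustration only, not NeMo's implementation: the dimensions, the `alpha` scaling factor, and the helper names `matvec` and `lora_forward` are assumptions chosen for the example.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, rank = 64, 64, 4  # hypothetical sizes for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable, rank x d_in
B = [[0.0] * rank for _ in range(d_out)]                                  # trainable, d_out x rank, starts at zero
x = [1.0] * d_in

# With B initialised to zero, the adapted layer reproduces the frozen layer.
assert lora_forward(W, A, B, x, alpha=rank, rank=rank) == matvec(W, x)

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(f"full matrix: {full_params} params, LoRA adapter: {lora_params} params")
```

The saving grows with layer width: the adapter scales with `rank * (d_in + d_out)` while the full matrix scales with `d_in * d_out`.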

The focus of this workshop is not the specifics of LoRA, but to give everyone a practical guide to carrying out the process of tuning a model.

## IMPORTANT: NeMo Framework Setup

This notebook requires the NVIDIA NeMo framework for LoRA training. We'll clone the NeMo repository to access the necessary training scripts.

**NeMo Compatibility**: 
- The downloaded model uses standard NeMo format (.nemo file)
- The training scripts work directly without any modifications

**Training Experience**: In this workshop, you'll train your own LoRA adapter from scratch! This gives you hands-on experience with:
- Setting up training data
- Configuring LoRA parameters
- Running the actual training
- Testing your custom adapter

The training process takes a few minutes for our small example dataset (about two minutes in the run recorded in this notebook, on an A100).

## Clone NeMo Repo

This cell downloads the NVIDIA NeMo framework (if not already present) and verifies that the required training scripts are available.


In [1]:
# Clone NeMo repository if not already present
import os

# Use relative path for NeMo
nemo_path = './NeMo'

if not os.path.exists(nemo_path):
    print("Cloning NeMo repository...")
    !git clone https://github.com/NVIDIA/NeMo.git {nemo_path}
    print("NeMo repository cloned successfully!")
else:
    print("NeMo repository already exists.")
    
# Verify the training scripts exist
nemo_scripts = [
    f'{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py',
    f'{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py',
    f'{nemo_path}/scripts/nlp_language_modeling/merge_lora_weights/merge.py'
]

print("\nChecking for required NeMo scripts:")
for script in nemo_scripts:
    if os.path.exists(script):
        print(f"✓ Found: {os.path.basename(script)}")
    else:
        print(f"✗ Missing: {script}")

Cloning NeMo repository...
Cloning into './NeMo'...
remote: Enumerating objects: 253911, done.
remote: Counting objects: 100% (612/612), done.
remote: Compressing objects: 100% (277/277), done.
remote: Total 253911 (delta 466), reused 337 (delta 335), pack-reused 253299 (from 4)
Receiving objects: 100% (253911/253911), 461.26 MiB | 47.20 MiB/s, done.
Resolving deltas: 100% (192134/192134), done.
NeMo repository cloned successfully!

Checking for required NeMo scripts:
✓ Found: megatron_gpt_finetuning.py
✓ Found: megatron_gpt_generate.py
✓ Found: merge.py


We will use only the fine-tuning script (`megatron_gpt_finetuning.py`) in this notebook.

## 1. Setup Environment

In [2]:
# Install required packages
!pip install jsonlines transformers omegaconf pytorch-lightning

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


This cell imports the required Python libraries for LoRA training and verifies that PyTorch can access the GPU, displaying the GPU name and available memory.

For documentation on supported models and their required compute, see https://docs.nvidia.com/nim/large-language-models/latest/supported-llm-specific-models.html

In [3]:
import os
import json
import jsonlines
from omegaconf import OmegaConf
import torch

# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

PyTorch version: 2.3.0a0+ebedce2
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
GPU memory: 42.29 GB
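As a rough sanity check on the GPU memory printed above, the sketch below estimates how much of it the frozen base weights occupy in bf16. The parameter count of about 8.03 billion for Llama 3.1 8B is an assumption that does not come from this cell, and the estimate ignores activations, adapter weights, optimizer state, and framework overhead.

```python
def weight_footprint_gb(num_params, bytes_per_param):
    """Return the size of the raw weights in decimal gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumption: Llama 3.1 8B has roughly 8.03 billion parameters.
LLAMA_3_1_8B_PARAMS = 8_030_261_248
GPU_TOTAL_GB = 42.29  # value printed by the GPU check above

bf16_gb = weight_footprint_gb(LLAMA_3_1_8B_PARAMS, 2)  # bf16 uses 2 bytes per parameter
headroom_gb = GPU_TOTAL_GB - bf16_gb

print(f"bf16 base weights: ~{bf16_gb:.1f} GB")
print(f"left for everything else: ~{headroom_gb:.1f} GB")
```

On a GPU with less memory, the same arithmetic gives a quick first indication of whether the base model fits before any training starts.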


Create the directories needed for the data, models, and configs.

In [4]:
# Create directories
os.makedirs("lora_tutorial/data", exist_ok=True)
os.makedirs("lora_tutorial/models", exist_ok=True)
os.makedirs("lora_tutorial/configs", exist_ok=True)

## 2. Prepare Dataset

Time to create our training data! 

We're creating a customer support AI, but here's the key insight: we're not teaching it WHAT customer support is - Llama already knows that. We're teaching it HOW to do customer support in YOUR specific style.

This is a minimal set intended for demonstration purposes — just enough to walk through the end-to-end fine-tuning pipeline.
In practice, you'd want a much larger dataset to achieve meaningful performance improvements.

In [5]:
# Create a sample dataset for customer support fine-tuning
# Each "input" ends with a blank line ("\n\n") followed by "Assistant:" so the
# model learns to continue with the assistant's reply.
training_data = [
    {
        "input": "User: My order hasn't arrived yet. Order number is 12345.\n\nAssistant:",
        "output": "I apologize for the delay with your order #12345. Let me check the status for you right away. I'll need to verify some details first to ensure your privacy and security."
    },
    {
        "input": "User: How do I reset my password?\n\nAssistant:",
        "output": "I'd be happy to help you reset your password. For security, please click on 'Forgot Password' on the login page, enter your email address, and follow the instructions sent to your inbox."
    },
    {
        "input": "User: What is your return policy?\n\nAssistant:",
        "output": "Our return policy allows returns within 30 days of purchase with original receipt. Items must be in original condition with tags attached. Refunds are processed within 5-7 business days."
    },
    {
        "input": "User: I received a damaged product. What should I do?\n\nAssistant:",
        "output": "I'm sorry to hear you received a damaged product. Please take photos of the damage and packaging, then contact us with your order number. We'll arrange a replacement or refund immediately."
    },
    {
        "input": "User: Do you offer international shipping?\n\nAssistant:",
        "output": "Yes, we offer international shipping to over 50 countries. Shipping rates and delivery times vary by destination. You can check availability and costs at checkout."
    }
]

# Save training data
with jsonlines.open('lora_tutorial/data/train.jsonl', 'w') as writer:
    writer.write_all(training_data)

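# Create validation data from the first two training examples. Because it overlaps
# the training set, validation loss here only confirms that the pipeline runs.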
val_data = training_data[:2]
with jsonlines.open('lora_tutorial/data/val.jsonl', 'w') as writer:
    writer.write_all(val_data)

print(f"Created {len(training_data)} training examples")
print(f"Created {len(val_data)} validation examples")

Created 5 training examples
Created 2 validation examples
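Before training, it can help to read the files back and confirm that every line parses and carries both fields. The following is a minimal sketch using only the standard library; the helper name `validate_jsonl` and the `REQUIRED_KEYS` set are hypothetical, chosen to match the `input`/`output` records written above.

```python
import json
import os

REQUIRED_KEYS = {"input", "output"}

def validate_jsonl(path):
    """Return the number of records, raising ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises a ValueError subclass on invalid JSON
            missing = REQUIRED_KEYS - set(record)
            if missing:
                raise ValueError(f"{path}:{line_no} is missing keys: {sorted(missing)}")
            count += 1
    return count

# Check the files written by the cell above, if they exist.
for name in ("train", "val"):
    path = f"lora_tutorial/data/{name}.jsonl"
    if os.path.exists(path):
        print(f"{path}: {validate_jsonl(path)} valid records")
```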


### Verify Prerequisites Before Training

Check for:
- The NeMo repo
- The fine-tuning script
- The Llama 3.1 8B `.nemo` model file
- The training data

In [6]:
# Verify prerequisites before training
import os
import glob

print("🔍 Checking prerequisites for training...\n")

# Check if NeMo is cloned - use relative path
nemo_path = "./NeMo"
if os.path.exists(nemo_path):
    print("✅ NeMo repository found")
else:
    print("❌ NeMo repository not found! Please run the 'Clone NeMo Repo' cell to clone NeMo.")

# Check if training scripts exist
training_script = f"{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py"
if os.path.exists(training_script):
    print("✅ Training script found")
else:
    print("❌ Training script not found!")

# Check if model is downloaded - look in subdirectories since NGC creates them
model_files = glob.glob("lora_tutorial/models/llama-3_1-8b-instruct/**/*.nemo", recursive=True)
if model_files:
    model_path = model_files[0]  # Use the first .nemo file found
    # Check if it's a complete model (>10GB)
    size_gb = os.path.getsize(model_path) / (1024**3)
    if size_gb > 10:
        print("✅ Llama 3.1 8B model found")
        print(f"\n📁 Model Information:")
        print(f"   Path: {model_path}")
        print(f"   Size: {size_gb:.2f} GB")
        print(f"   Format: Standard NeMo checkpoint (.nemo)")
    else:
        print(f"⚠️  Incomplete model found ({size_gb:.1f} GB)")
        print("   Please re-run the download in 00_Workshop_Setup.ipynb")
else:
    print("❌ Model not found! Please run notebook 00_Workshop_Setup.ipynb first")

# Check if training data exists
if os.path.exists("lora_tutorial/data/train.jsonl"):
    print("✅ Training data found")
else:
    print("❌ Training data not found! Please run the data preparation cells")

print("\n🎯 Ready to train!" if all([
    os.path.exists(nemo_path),
    os.path.exists(training_script),
    len(model_files) > 0 and os.path.getsize(model_files[0]) / (1024**3) > 10,  # Check complete model
    os.path.exists("lora_tutorial/data/train.jsonl")
]) else "\n⚠️ Please fix the issues above before training!")

🔍 Checking prerequisites for training...

✅ NeMo repository found
✅ Training script found
✅ Llama 3.1 8B model found

📁 Model Information:
   Path: lora_tutorial/models/llama-3_1-8b-instruct/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo
   Size: 14.96 GB
   Format: Standard NeMo checkpoint (.nemo)
✅ Training data found

🎯 Ready to train!


## 3. Run LoRA Training

### Actually Run the Training! 🚀

This is the exciting part - you'll train your own LoRA adapter! 

**What will happen:**
1. The model will load from the .nemo checkpoint (about a minute in the recorded run below)
2. Training will run for 50 steps (about 20 seconds in the recorded run on an A100; allow a few minutes for the whole cell)
3. Checkpoints will be saved every 25 steps
4. A final LoRA adapter will be exported as a .nemo file

**Note about warnings**: You'll see many warnings about missing configuration fields - these are normal and can be ignored. They appear because NeMo supports many optional features that aren't used in this training.

Let's train your custom model:

This cell runs the LoRA fine-tuning process, training Llama 3.1 8B on the customer support dataset for 50 steps using a single GPU with LoRA adapters.

To improve how well your model learns:
- Increase `trainer.max_steps`: let the model see the data more times. For small datasets, 200–500 steps is a better starting point than 50.
- Adjust the learning rate (`model.optim.lr`): too high a value can prevent convergence; too low a value can make learning too slow. Try values like 1e-4 or 5e-5.
- Increase the batch size: if memory allows, raise `model.micro_batch_size` and `model.global_batch_size` to make updates more stable.
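To see what a given `max_steps` means for a dataset of a given size, the short sketch below computes roughly how many times each example is seen. The helper name `passes_over_data` is hypothetical; the numbers are the ones used in this notebook (5 training examples, global batch size 2).

```python
def passes_over_data(max_steps, global_batch_size, num_examples):
    """Approximate number of times each training example is seen."""
    return max_steps * global_batch_size / num_examples

for steps in (50, 200, 500):
    print(f"max_steps={steps}: ~{passes_over_data(steps, 2, 5):.0f} passes over 5 examples")
```

With only five examples, 50 steps already amounts to about 20 passes, so extra steps are likely to matter more once the dataset grows.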

In [7]:
%%bash

# Actually run the LoRA training!

# Find the model file dynamically (NGC creates subdirectories)
MODEL_DIR="lora_tutorial/models/llama-3_1-8b-instruct"
MODEL=$(find "$MODEL_DIR" -name "*.nemo" -type f | head -1)

if [ -z "$MODEL" ]; then
    echo "ERROR: No .nemo model file found in $MODEL_DIR"
    echo "Please run notebook 00_Workshop_Setup.ipynb first to download the model"
    exit 1
fi

TRAIN_DS="[lora_tutorial/data/train.jsonl]"
VALID_DS="[lora_tutorial/data/val.jsonl]"

# Use relative path to NeMo
NEMO_PATH="./NeMo"

echo "✅ Found Llama 3.1 8B model at $MODEL"

# Run training with NeMo
torchrun --nproc_per_node=1 \
"${NEMO_PATH}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py" \
    exp_manager.exp_dir=lora_tutorial/experiments \
    exp_manager.name=customer_support_lora \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.5 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.micro_batch_size=1 \
    model.global_batch_size=2 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=lora \
    model.peft.lora_tuning.target_modules=[attention_qkv] \
    model.peft.lora_tuning.adapter_dim=32 \
    model.peft.lora_tuning.adapter_dropout=0.1 \
    model.optim.lr=5e-4

✅ Found Llama 3.1 8B model at lora_tutorial/models/llama-3_1-8b-instruct/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo


    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2025-07-16 19:17:30 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2025-07-16 19:17:30 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 0.5
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: lora_tutorial/experiments
      name: customer_support_lora
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
        sav

[NeMo W 2025-07-16 19:17:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True


[NeMo I 2025-07-16 19:17:30 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.


TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2025-07-16 19:17:30 exp_manager:773] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2025-07-16 19:17:30 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :lora_tutorial/experiments/customer_support_lora/checkpoints. Training from scratch.


[NeMo I 2025-07-16 19:17:30 exp_manager:396] Experiments will be logged at lora_tutorial/experiments/customer_support_lora
[NeMo I 2025-07-16 19:17:31 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2025-07-16 19:17:31 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2025-07-16 19:17:50 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-16 19:17:50 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-16 19:17:50 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-16 19:17:50 megatron_base_model:1158] The model: MegatronGPTSFTModel() does

[NeMo I 2025-07-16 19:17:50 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2025-07-16 19:17:50 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-07-16 19:17:50 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-07-16 19:17:50 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2025-07-16 19:17:50 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2025-07-16 19:17:50 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2025-07-16 19:17:50 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2025-07-16 19:17:50 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2025-07-16 19:17:50 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2025-07-16 19:17:50 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-07-16 19:17:50 megatron_init:310] All tensor model parallel group ranks: 

    
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2025-07-16 19:17:50 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2025-07-16 19:17:50 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-16 19:17:50 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_t

[NeMo I 2025-07-16 19:18:11 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
[NeMo I 2025-07-16 19:18:57 nlp_overrides:1180] Model MegatronGPTSFTModel was successfully restored from /root/verb-workspace/NIM-build-tune-deploy-participant/lora_tutorial/models/llama-3_1-8b-instruct/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo.
[NeMo I 2025-07-16 19:18:57 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2025-07-16 19:18:57 nlp_adapter_mixins:203] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 8.0 B  | train
    ------------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2025-07-16 19:19

[NeMo W 2025-07-16 19:19:00 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2025-07-16 19:19:00 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2025-07-16 19:19:00 megatron_gpt_sft_model:811] Building GPT SFT validation datasets.
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:116] Building data files
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.067944
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.053882
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:158] Loading data files
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:249] Loading lora_tutorial/data/val.jsonl
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001322
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:165] Computing global indices
[NeMo I 2025-07-16 19:19:00 megatron_gpt_sft_model:815] Length of val dataset: 2
[NeMo I 2025-07-16 19:19:00 megatron_gpt_sft_model:822] Building GPT SFT traing datasets.
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:116] Building data files
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.055920
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.055045
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:158] Loading data files
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:249] Loading lora_tutorial/data/train.jsonl
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000963
[NeMo I 2025-07-16 19:19:00 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2025-07-16 19:19:01 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.09 (sec)
[NeMo I 2025-07-16 19:19:01 megatron_gpt_sft_model:824] Length of train dataset: 101
[NeMo I 2025-07-16 19:19:01 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0
[NeMo I 2025-07-16 19:19:01 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2025-07-16 19:19:01 megatron_base_model:1199] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 50.


[NeMo I 2025-07-16 19:19:01 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-16 19:19:01 adapter_mixins:435] Unfrozen adapter : lora_kqv_


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 8.0 B  | train
------------------------------------------------
10.5 M    Trainable params
8.0 B     Non-trainable params
8.0 B     Total params
32,162.988Total estimated model params size (MB)
[NeMo W 2025-07-16 19:19:01 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    
[NeMo W 2025-07-16 19:19:01 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
    
[NeMo W 2025-07-16 19:19:02 nemo_logging:

Epoch 0: :  50%|█████     | 25/50 [00:09<00:09, v_num=0, reduced_train_loss=0.00124, global_step=24.00, consumed_samples=50.00, train_step_timing in s=0.348]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  4.92it/s]


Metric val_loss improved. New best score: 0.052
Epoch 0, global step 25: 'validation_loss' reached 0.05194 (best 0.05194), saving model to '/root/verb-workspace/NIM-build-tune-deploy-participant/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.052-step=25-consumed_samples=50.0.ckpt' as top 1
[NeMo W 2025-07-16 19:19:11 nlp_overrides:480] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: : 100%|██████████| 50/50 [00:19<00:00, v_num=0, reduced_train_loss=3.75e-5, global_step=49.00, consumed_samples=100.0, train_step_timing in s=0.345, val_loss=0.0519] 
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  4.66it/s]


Metric val_loss improved by 0.052 >= min_delta = 0.001. New best score: 0.000
Epoch 0, global step 50: 'validation_loss' reached 0.00010 (best 0.00010), saving model to '/root/verb-workspace/NIM-build-tune-deploy-participant/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt' as top 1


Epoch 0: : 100%|██████████| 50/50 [00:19<00:00, v_num=0, reduced_train_loss=3.75e-5, global_step=49.00, consumed_samples=100.0, train_step_timing in s=0.345, val_loss=0.000104][NeMo I 2025-07-16 19:19:22 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/NIM-build-tune-deploy-participant/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.052-step=25-consumed_samples=50.0.ckpt
[NeMo I 2025-07-16 19:19:22 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/NIM-build-tune-deploy-participant/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.052-step=25-consumed_samples=50.0-last.ckpt


`Trainer.fit` stopped: `max_steps=50` reached.


Epoch 0: : 100%|██████████| 50/50 [00:19<00:00, v_num=0, reduced_train_loss=3.75e-5, global_step=49.00, consumed_samples=100.0, train_step_timing in s=0.345, val_loss=0.000104]


Restoring states from the checkpoint path at /root/verb-workspace/NIM-build-tune-deploy-participant/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt
Restored all states from the checkpoint at /root/verb-workspace/NIM-build-tune-deploy-participant/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt


<details>
<summary>📌 Parameter Explanations for NeMo LoRA Fine-tuning</summary>

- **`exp_manager.exp_dir=lora_tutorial/experiments`**  
  Sets the root directory for saving logs, checkpoints, and training artifacts.

- **`exp_manager.name=customer_support_lora`**  
  Defines the subfolder name under `exp_dir` for this specific run.

- **`trainer.devices=1`**  
  Specifies that we’re training on one GPU.

- **`trainer.num_nodes=1`**  
  Specifies that we’re only using one machine (no multi-node training).

- **`trainer.precision=bf16-mixed`**  
  Enables bfloat16 mixed precision training — this reduces memory usage and speeds up training while maintaining numerical stability.

- **`trainer.val_check_interval=0.5`**  
  Runs validation every half epoch. In this run that works out to every 25 steps, as the log shows at global steps 25 and 50.

- **`trainer.max_steps=50`**  
  Stops training after 50 optimization steps — short for demo purposes.

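- **`model.megatron_amp_O2=True`**  
  Enables Megatron's O2-style mixed precision: the model weights are kept in bf16 (the log shows the model wrapped in `Float16Module`), while the optimizer keeps FP32 master copies of the trainable parameters.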

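- **`++model.mcore_gpt=True`**  
  Uses the newer modular GPT implementation (Megatron Core, or MCore) within NeMo, required for Llama models. The `++` prefix is Hydra syntax that adds the key if the config lacks it and overrides it otherwise.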

- **`model.tensor_model_parallel_size=1`**  
  No tensor parallelism — model layers are not split across GPUs.

- **`model.pipeline_model_parallel_size=1`**  
  No pipeline parallelism — forward/backward passes are not split across stages.

- **`model.micro_batch_size=1`**  
  Each GPU processes 1 example at a time.

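- **`model.global_batch_size=2`**  
  The total batch size per optimization step is 2. With `micro_batch_size=1` on a single GPU, each step accumulates gradients over 2 micro-batches (the log shows `consumed_samples=50.00` at step 25).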

- **`model.restore_from_path=${MODEL}`**  
  Loads the pretrained `.nemo` checkpoint from the specified file.

- **`model.data.train_ds.file_names=${TRAIN_DS}`**  
  Path to the training dataset file (JSONL format).

- **`model.data.train_ds.concat_sampling_probabilities=[1.0]`**  
  Since we’re using only one training file, we give it 100% sampling probability.

- **`model.data.validation_ds.file_names=${VALID_DS}`**  
  Path to the validation dataset.

- **`model.peft.peft_scheme=lora`**  
  Enables PEFT and specifies LoRA as the tuning method.

- **`model.peft.lora_tuning.target_modules=[attention_qkv]`**  
  LoRA will only be applied to the query/key/value attention weights — this is common for transformer-based models.

- **`model.peft.lora_tuning.adapter_dim=32`**  
  Sets the rank of the LoRA adapter matrices — a tradeoff between model capacity and training efficiency.

- **`model.peft.lora_tuning.adapter_dropout=0.1`**  
  Applies dropout during training of the LoRA adapters to regularize them.

- **`model.optim.lr=5e-4`**  
  Sets the learning rate for optimization — relatively high for a small batch/short run.

</details>
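The batch-size and parallelism settings are linked by the usual Megatron-style relation: global batch size = micro batch size × data-parallel size × gradient-accumulation steps. The sketch below works through that arithmetic for this run. The helper names are hypothetical and the code is not taken from NeMo.

```python
def data_parallel_size(devices, num_nodes, tensor_parallel, pipeline_parallel):
    """GPUs left for data parallelism after model parallelism is accounted for."""
    world_size = devices * num_nodes
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor x pipeline parallel size")
    return world_size // model_parallel

def accumulation_steps(global_batch_size, micro_batch_size, dp_size):
    """Micro-batches accumulated per optimizer step on each data-parallel rank."""
    per_pass = micro_batch_size * dp_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size must be a multiple of micro batch size x data-parallel size")
    return global_batch_size // per_pass

# Values from the training command in this notebook.
dp = data_parallel_size(devices=1, num_nodes=1, tensor_parallel=1, pipeline_parallel=1)
steps = accumulation_steps(global_batch_size=2, micro_batch_size=1, dp_size=dp)
print(f"data-parallel size: {dp}, gradient-accumulation steps: {steps}")
```

Raising `micro_batch_size` to 2 while keeping `global_batch_size=2` would bring the accumulation steps down to 1, at the cost of more activation memory per forward pass.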


## 4. Verify Training Results

Look at what we've created! Let me break down these files:

1. **customer_support_lora.nemo** (21MB) - This is your golden ticket! Your custom AI adapter
2. **The .ckpt files** (147MB each) - Training checkpoints holding the adapter weights and optimizer states (at 147MB they cannot contain the 8B base model)
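These file sizes can be roughly reproduced from the LoRA settings. The sketch below is a back-of-the-envelope estimate, not a description of NeMo's file format. It assumes the publicly documented Llama 3.1 8B shape (hidden size 4096, 32 layers, 32 query heads and 8 key/value heads of dimension 128), a fused QKV projection, bf16 storage for the exported adapter, and, for the checkpoints, one bf16 copy plus an FP32 master copy and two FP32 Adam moments per trainable parameter. None of these assumptions is stated in this notebook.

```python
# Assumed Llama 3.1 8B architecture (not stated in this notebook).
HIDDEN = 4096
LAYERS = 32
HEAD_DIM = 128
Q_HEADS, KV_HEADS = 32, 8
RANK = 32  # model.peft.lora_tuning.adapter_dim

qkv_out = HEAD_DIM * (Q_HEADS + 2 * KV_HEADS)   # width of the fused Q, K, V projection
params_per_layer = RANK * (HIDDEN + qkv_out)    # A is rank x in, B is out x rank
adapter_params = LAYERS * params_per_layer

adapter_bytes = adapter_params * 2              # assumed bf16: 2 bytes per parameter
ckpt_bytes = adapter_params * (2 + 4 + 4 + 4)   # assumed bf16 + FP32 master + two Adam moments

print(f"adapter parameters:        {adapter_params:,}")
print(f"estimated adapter size:    {adapter_bytes / 1e6:.1f} MB")
print(f"estimated checkpoint size: {ckpt_bytes / 1e6:.1f} MB")
```

The parameter count matches the "10.5 M Trainable params" reported in the training log, and both size estimates land close to the 21MB and 147MB files, which suggests the assumptions are at least plausible.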


In [8]:
# Check if training created the LoRA adapter
!ls -la ./lora_tutorial/experiments/customer_support_lora*/checkpoints/

total 307504
drwxr-xr-x 2 root root      4096 Jul 16 19:19  .
drwxr-xr-x 4 root root      4096 Jul 16 19:19  ..
-rw-r--r-- 1 root root  21012480 Jul 16 19:19  customer_support_lora.nemo
-rw-r--r-- 1 root root 146930030 Jul 16 19:19 'megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0-last.ckpt'
-rw-r--r-- 1 root root 146930030 Jul 16 19:19 'megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt'
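A `.nemo` file is a tar archive, so its contents can be listed with the standard library. The sketch below assumes the adapter was written to the checkpoint directory shown above; the helper name `list_nemo_contents` is hypothetical.

```python
import glob
import tarfile

def list_nemo_contents(path):
    """Return the member names of a .nemo archive (a tar file)."""
    # "r:*" lets tarfile detect whether the archive is compressed.
    with tarfile.open(path, "r:*") as archive:
        return archive.getnames()

pattern = "lora_tutorial/experiments/customer_support_lora*/checkpoints/*.nemo"
for adapter in glob.glob(pattern):
    print(adapter)
    for name in list_nemo_contents(adapter):
        print("   ", name)
```

If the loop prints nothing, no adapter file matched the pattern, which usually means training did not finish.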


## Summary

Congratulations! You've successfully:
- ✅ Set up the NeMo training environment
- ✅ Created training data for your custom task
- ✅ Configured LoRA parameters for efficient training
- ✅ Trained your own LoRA adapter on Llama 3.1 8B

Your LoRA adapter is now ready to be deployed with NVIDIA NIM in the next notebook!

**Next Steps**: 
- Open `04_Deploy_LoRA_with_NIM.ipynb` to deploy your custom model