<a href="https://colab.research.google.com/github/zhuhadar/ai-summer-2025/blob/main/PEFT_LoRA_llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning LLaMa 3.2

This notebook demonstrates how to fine-tune the LLaMa 3.2 (1B model), Meta's latest small-scale LLM. We use the 1B parameter version to ensure this can run effectively on Google Colab.

Key features of LLaMa 3.2 1B:
- 1.23B parameters
- Multilingual support (8 officially supported languages)
- 128k context length
- Optimized for dialogue use cases

As we'll see, this same notebook can be used to train other, larger versions of LlaMa 3.2, simply by swapping the HuggingFace repo to one of the larger versions.



# Initial Setup


## Package Installation + Imports

First, let's install the required packages. We'll need the latest version of transformers (>= 4.43.0) to work with LLaMa 3.2.

In [None]:
# Install required packages
!pip install -q --upgrade transformers datasets
!pip install -q torch accelerate bitsandbytes
# Install PDF processing libraries. pdfplumber handles academic texts better
!pip install -q PyPDF2
!pip install -q pdfplumber

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m39.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m19.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.5/207.5 MB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.1/21.1 MB[0m [31m94.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.1/76.1 MB[0m [31m31.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
# Standard libraries
import re
import os
import numpy as np
from typing import List

# AI/ML Libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import Trainer, TrainingArguments
from transformers import pipeline
from accelerate.state import AcceleratorState
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
import wandb # this will be optional, for training monitoring

# PDF reading
import pdfplumber
from pdfminer.high_level import extract_text
from PyPDF2 import PdfReader

ModuleNotFoundError: No module named 'pdfplumber'

### Package Overview
- `transformers`: Hugging Face's main library for working with transformer models
- `torch`: PyTorch deep learning framework
- `accelerate`: Library for easy mixed precision training and device placement
- `pdfminer`: A PDF-reading package that's particularly good for non-straightforward PDF's (like academic papers)


## Environment Check
Let's verify our setup and check available compute resources:

Make sure that `CUDA available`: prints **True**

In [2]:
# Check PyTorch version and CUDA availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

NameError: name 'torch' is not defined

## Hugging Face Authentication

LLaMa 3.2 requires authentication with Hugging Face to access the model. You'll need to:
1. Have a Hugging Face account
2. Accept the LLaMa 3.2 model terms of use on the Hugging Face model page
3. Create an access token on Hugging Face (https://huggingface.co/settings/tokens)

After you have your access token and have accepted the terms, the code below will help you log in:

In [None]:
from huggingface_hub import login
import getpass

token = getpass.getpass("Enter your Hugging Face token: ")
login(token=token)

# Verify login
print("Login status: Authenticated with Hugging Face")

Enter your Hugging Face token: ··········
Login status: Authenticated with Hugging Face


## Tokenizer Setup and Exploration

LLaMa 3.2 uses a sophisticated tokenizer that supports multiple languages. Understanding how the tokenizer works is crucial for:
- Preparing training data effectively
- Managing sequence lengths
- Understanding model behavior across languages

Let's load the tokenizer and explore its basic properties:

In [None]:
# Initialize tokenizer
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Make sure padding token is set
tokenizer.pad_token = tokenizer.eos_token

# Basic tokenizer information
print(f"Vocabulary size: {len(tokenizer)}")
print(f"Model max length: {tokenizer.model_max_length}")
print(f"Padding token: {tokenizer.pad_token}")
print(f"End of sequence token: {tokenizer.eos_token}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Vocabulary size: 128256
Model max length: 131072
Padding token: <|end_of_text|>
End of sequence token: <|end_of_text|>


### Understanding Tokenization

Let's see how the tokenizer processes text in different languages. This will help us understand:
- How words are broken into tokens
- How special characters are handled
- Token counts for different languages

In [None]:
# Example texts in different languages
texts = {
    "English": "Hello, how are you today?",
    "Spanish": "¡Hola! ¿Cómo estás hoy?",
    "French": "Bonjour! Comment allez-vous aujourd'hui?",
    "German": "Hallo! Wie geht es dir heute?"
}

# Analyze tokenization for each language
for lang, text in texts.items():
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)

    print(f"\n{lang}:")
    print(f"Original text: {text}")
    print(f"Tokens: {tokens}")
    print(f"Number of tokens: {len(tokens)}")
    print(f"Token IDs: {token_ids}")


English:
Original text: Hello, how are you today?
Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', 'Ġtoday', '?']
Number of tokens: 7
Token IDs: [128000, 9906, 11, 1268, 527, 499, 3432, 30]

Spanish:
Original text: ¡Hola! ¿Cómo estás hoy?
Tokens: ['Â¡', 'Hola', '!', 'ĠÂ¿', 'CÃ³mo', 'Ġest', 'Ã¡s', 'Ġhoy', '?']
Number of tokens: 9
Token IDs: [128000, 40932, 69112, 0, 29386, 96997, 1826, 7206, 49841, 30]

French:
Original text: Bonjour! Comment allez-vous aujourd'hui?
Tokens: ['Bonjour', '!', 'ĠComment', 'Ġalle', 'z', '-vous', 'Ġaujourd', "'hui", '?']
Number of tokens: 9
Token IDs: [128000, 82681, 0, 12535, 12584, 89, 45325, 75804, 88253, 30]

German:
Original text: Hallo! Wie geht es dir heute?
Tokens: ['Hallo', '!', 'ĠWie', 'Ġgeht', 'Ġes', 'Ġdir', 'Ġheute', '?']
Number of tokens: 8
Token IDs: [128000, 79178, 0, 43716, 40364, 1560, 5534, 49714, 30]



Notice how:
1. Punctuations are their own tokens
2. The 'Ġ' symbol represents a space before the token
3. Some words are single tokens (like 'today') while others might be split

# Data Preparation for Fine-tuning

This notebook demonstrates how to update LLaMa 3.2's knowledge base with domain-specific information. This type of fine-tuning can help the model:
- Learn new domain-specific facts and concepts
- Update its knowledge about specific topics
- Improve its ability to discuss specialized subjects

For domain knowledge fine-tuning, we'll use the simplest appraoch of direct text chunks: the model will learn directly from the source material.

## Structuring the Data

Fine-tuning LLaMa 3.2 requires carefully formatted training data. The model expects:
- Input text in a specific format
- Response text that follows the input
- Proper formatting of system prompts and chat turns

We'll set up our dataset as such in this section.

Distinct pieces of data for the model are relatively small text segments. So here, we'll start by setting up a function to create chunks of text from longer, full documents for the model to train on.

### Important Parameters:
- `chunk_size`: Default 512 tokens. Can be adjusted based on your GPU memory and needs
- `tokenizer`: Uses LLaMa's tokenizer to ensure proper text splitting


In [None]:
def create_text_chunks(document: str, tokenizer, chunk_size: int = 512) -> List[str]:
    """
    Create chunks of text that:
    - Maintain sentence boundaries where possible
    - Have a reasonable minimum size
    - Don't include special tokens
    """
    # Clean the text first
    document = document.strip().replace('\n', ' ')

    # Split into sentences
    sentences = [s.strip() + '.' for s in document.split('.') if s.strip()]

    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        # Get token count for this sentence
        tokens = tokenizer.encode(sentence, add_special_tokens=False)
        sentence_length = len(tokens)

        if current_length + sentence_length > chunk_size:
            if current_chunk:
                # Join the current chunk and add it to chunks
                chunk_text = ' '.join(current_chunk)
                chunks.append(chunk_text)
                # Start new chunk with current sentence
                current_chunk = [sentence]
                current_length = sentence_length
        else:
            current_chunk.append(sentence)
            current_length += sentence_length

    # Don't forget the last chunk
    if current_chunk:
        chunk_text = ' '.join(current_chunk)
        chunks.append(chunk_text)

    # Print some diagnostics
    total_tokens = sum(len(tokenizer.encode(chunk, add_special_tokens=False))
                      for chunk in chunks)

    print(f"Created {len(chunks)} chunks with total {total_tokens} tokens")
    return chunks

Let's use an example chunk of text to see how this works before proceeding.

In [None]:
# Example document - we'll replace this with actual domain content later
sample_document = """
LLMs process text using attention mechanisms. These mechanisms allow the model
to weigh different parts of the input differently. The transformer architecture
revolutionized natural language processing. It introduced self-attention as a
core component. Modern language models build upon this foundation. As such, LLM's
are able to process sentences similar to the way that humans do, by understanding
words in the sentence relative to those around them.
"""


print("\nTesting with chunk_size=20:")
chunks = create_text_chunks(sample_document, tokenizer, chunk_size=20)
for i, chunk in enumerate(chunks[0:2]):
    tokens = tokenizer.tokenize(chunk)
    print(f"\nChunk {i+1}:")
    print(f"Text: {chunk}")
    print(f"Token count: {len(tokens)}")
    print(f"Tokens: {tokens}")


Testing with chunk_size=20:
Created 5 chunks with total 80 tokens

Chunk 1:
Text: LLMs process text using attention mechanisms.
Token count: 8
Tokens: ['LL', 'Ms', 'Ġprocess', 'Ġtext', 'Ġusing', 'Ġattention', 'Ġmechanisms', '.']

Chunk 2:
Text: These mechanisms allow the model to weigh different parts of the input differently.
Token count: 14
Tokens: ['These', 'Ġmechanisms', 'Ġallow', 'Ġthe', 'Ġmodel', 'Ġto', 'Ġweigh', 'Ġdifferent', 'Ġparts', 'Ġof', 'Ġthe', 'Ġinput', 'Ġdifferently', '.']


## Loading and Processing PDF Documents

We'll be creating these chunks from text documents.

In reality, we'll want to accommodate giving some PDF's as a document set to fine-tune on. We'll use pdfminer to read PDFs and extract their text content. Then we'll process this text into appropriate chunks for training.

In [None]:
def extract_text_from_pdfs(pdf_paths: List[str]) -> List[str]:
    """
    Extract text from multiple PDF files.

    Args:
        pdf_paths: List of paths to PDF files

    Returns:
        List of extracted text documents
    """
    all_texts = []

    for path in pdf_paths:
        try:
            text = extract_text(path)
            all_texts.append(text)
            print(f"Successfully processed: {path}")

        except Exception as e:
            print(f"Error processing {path}: {str(e)}")

    return all_texts

Academic papers often contain a lot of irregular, non-text components. Between images, captions, references, etc, there's a lot that can confuse the model if we aren't careful to remove it. So here, we'll write a function that cleans our text after we read it, attempting to remove as many of these artifacts as possible.

In [None]:
def clean_academic_text(text: str) -> str:
    """
    Clean academic text with improved word separation.
    """
    if not text or not isinstance(text, str):
        return ""

    original_length = len(text)

    # First clean pass - basic normalization
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace

    # Fix incorrectly joined words
    # Look for patterns like "wordWord" or "word-Word"
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)  # Split "wordWord"
    text = re.sub(r'([a-z])-\s*([a-z])', r'\1\2', text)  # Join "wo rd"
    text = re.sub(r'\s*-\s*', '-', text)  # Clean up hyphens

    # Fix common academic text patterns
    patterns = {
        r'(?<=\w)\.(?=\w)': '. ',  # Add space after period between words
        r'(?<=\w)\s+\(': ' (',      # Fix spacing around parentheses
        r'\)\s+(?=\w)': ') ',
        r'(?<=\d),(?=\d)': ', ',    # Add space after comma between numbers
        r'(?<=[a-z])(?=\d)': ' ',   # Add space between letters and numbers
    }

    for pattern, replacement in patterns.items():
        text = re.sub(pattern, replacement, text)

    # Remove common PDF artifacts
    artifacts = [
        r'Fig\.\s*\d+',
        r'Figure\s*\d+:?.*?\n',
        r'Table\s*\d+:?.*?\n',
        r'\[\d+(?:,\s*\d+)*\]',     # Citations [1] or [1,2,3]
        r'\(\w+\s+et\s+al\.,\s+\d{4}\)',  # Citations (Author et al., 2020)
        r'References.*$',            # Remove references section
        r'Bibliography.*$',
    ]

    for pattern in artifacts:
        text = re.sub(pattern, ' ', text)

    # Final cleanup
    text = re.sub(r'\s+', ' ', text)  # Normalize spaces again
    text = text.strip()

    # Print sample before/after for verification
    print("\nSample text cleaning comparison:")
    print("Original first 100 chars:", text[:100])
    sample_cleaned = text[:100]
    print("Cleaned first 100 chars:", sample_cleaned)

    final_length = len(text)
    removed = original_length - final_length
    if removed > 0:
        print(f"\nRemoved {removed} characters ({removed/original_length*100:.1f}% of original text)")

    return text

## Formatting Text Chunks for LLaMa

Finally, we need to properly format our text chunks into a format that the model expects and understands.

Each chunk will be formatted as:

`<|system|>Learn the following information: </s><|user|>{cleaned_text}</s><|assistant|>I understand this information.</s>`

This format follows LLaMa's chat template structure, which is crucial because:

1. LLaMa 3.2 was trained using a specific chat format with different roles:
   - `<|system|>`: Provides context or instructions to the model
   - `<|user|>`: Represents input content
   - `</s>`: Special token marking the end of each turn

2. The model expects input in the same format it was originally trained on, so using a different format might confuse the model or reduce learning effectiveness

In [None]:
def format_training_example(text: str) -> str:
    """Format text for training"""
    cleaned_text = text.strip()
    if not cleaned_text:  # Skip empty chunks
        return None
    return f"<|system|>Learn the following information: </s><|user|>{cleaned_text}</s><|assistant|>I understand this information.</s>"

## PDF's to Training Data Pipeline

We'll use these above functions to now build a pipeline for converting PDF documents into a format suitable for training LLaMa 3.2. The process involves:
1. Loading PDFs and extracting text
2. Cleaning and preprocessing the text
3. Chunking the text into appropriate sizes
4. Creating a dataset for training

In [None]:
from datasets import Dataset

def create_training_dataset(pdf_directory, tokenizer, chunk_size=512):
    """Create a training dataset from PDF documents"""

    # First get all PDFs and their text
    pdf_paths = [f"{pdf_directory}/{f}" for f in os.listdir(pdf_directory)
                if f.endswith('.pdf')]

    print(f"Found {len(pdf_paths)} PDFs")

    # Extract text from PDFs
    documents = extract_text_from_pdfs(pdf_paths)
    print(f"Extracted text from {len(documents)} documents")

    all_chunks = []

    for doc in documents:
        # Clean text first
        text = clean_academic_text(doc)
        # Create chunks from the document
        chunks = create_text_chunks(text, tokenizer, chunk_size)

        # Format each chunk
        formatted_chunks = [format_training_example(chunk) for chunk in chunks]
        all_chunks.extend(formatted_chunks)
    print("\n")
    print(f"\033[1m\033[91mCreated {len(all_chunks)} total chunks\033[0m")

    # Create dataset dictionary
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
        "labels": []
    }

    # Process each chunk
    for chunk in all_chunks:
        encodings = tokenizer(
            chunk,
            truncation=True,
            max_length=chunk_size,
            padding="max_length",
            return_tensors="pt"
        )

        dataset_dict["input_ids"].append(encodings["input_ids"].squeeze().numpy())  # Convert to numpy
        dataset_dict["attention_mask"].append(encodings["attention_mask"].squeeze().numpy())
        dataset_dict["labels"].append(encodings["input_ids"].squeeze().numpy())

    # Convert lists to numpy arrays
    for key in dataset_dict:
        if dataset_dict[key]:  # Check if the list is not empty
            dataset_dict[key] = np.array(dataset_dict[key])

    return Dataset.from_dict(dataset_dict)

## Uploading our Docs + Making the Dataset

Now, all that's left is to create a directory of pdf's that we can point the dataset generation code to.

Below, you can upload a set of PDF's that will become the training data.

This will be automatically saved to a path called training_pdfs, which we'll use for setting up our dataset in the following cells.

In [None]:
from google.colab import files

# Create a directory for PDFs
!mkdir -p training_pdfs

print("Please upload your PDF files using the file uploader that appears below:")
uploaded = files.upload()

# Move uploaded files to our PDF directory
for filename in uploaded.keys():
    os.rename(filename, f"training_pdfs/{filename}")
    print(f"Moved {filename} to training_pdfs/")

Please upload your PDF files using the file uploader that appears below:


Saving LENA-Foundations-of-Literacy_Webinar-Slides.pdf to LENA-Foundations-of-Literacy_Webinar-Slides.pdf
Saving Lenhart & Lingel (2023) ECRQ.pdf to Lenhart & Lingel (2023) ECRQ.pdf
Saving Wexler on conversations.pdf to Wexler on conversations.pdf
Moved LENA-Foundations-of-Literacy_Webinar-Slides.pdf to training_pdfs/
Moved Lenhart & Lingel (2023) ECRQ.pdf to training_pdfs/
Moved Wexler on conversations.pdf to training_pdfs/


And putting it togehter:

In [None]:
# Create dataset
training_dataset = create_training_dataset("training_pdfs", tokenizer)

Found 3 PDFs




Successfully processed: training_pdfs/Lenhart & Lingel (2023) ECRQ.pdf




Successfully processed: training_pdfs/LENA-Foundations-of-Literacy_Webinar-Slides.pdf
Successfully processed: training_pdfs/Wexler on conversations.pdf
Extracted text from 3 documents

Sample text cleaning comparison:
Original first 100 chars: Early Childhood Research Quarterly 64 (2023) 119–128 Contents lists available at Science Direct Earl
Cleaned first 100 chars: Early Childhood Research Quarterly 64 (2023) 119–128 Contents lists available at Science Direct Earl

Removed 16396 characters (22.3% of original text)
Created 30 chunks with total 15642 tokens

Sample text cleaning comparison:
Original first 100 chars: Foundations of Literacy: The Science of Reading November 2, 2023 Our Mission LENA is a national nonp
Cleaned first 100 chars: Foundations of Literacy: The Science of Reading November 2, 2023 Our Mission LENA is a national nonp

Removed 599 characters (10.5% of original text)
Created 3 chunks with total 1150 tokens

Sample text cleaning comparison:
Original first 100 chars: 

## Training Dataset Size

The code above will tell you how many total chunks were generated from the PDF's that you uploaded. When fine-tuning LLaMa for domain knowledge, the number of training chunks is crucial:

### Recommended Chunk Numbers:
- **Minimum**: 200-300 chunks
 - Below this, the model may not learn effectively
 - Risk of overfitting to limited examples

- **Target**: 500-1000+ chunks
 - Provides enough examples for robust learning
 - Allows for diverse phrasings of similar concepts

- **Creating More Chunks**:
 - Add more domain documents
 - Use overlapping chunks (e.g., 50-token overlap)
 - Include related papers/documents
 - Consider smaller chunk sizes (but not below 256 tokens)

# Training LLaMa 3.2 on Domain Knowledge



## Model Configuration

For fine-tuning LLaMa 3.2, we need to carefully configure several parameters:
- Training precision (using bfloat16 for efficiency)
- Memory optimization settings
- Model configuration parameters

We'll start with a basic configuration that works well on Google Colab, but these parameters can be adjusted based on your specific needs and hardware capabilities.

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments

# Reload model without 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

print("Model configuration:")
print(f"Number of parameters: {model.num_parameters():,}")
print(f"Training device: {model.device}")

Model configuration:
Number of parameters: 1,235,814,400
Training device: cpu


## Training Setup and Configuration

We'll configure training with:
- A relatively small number of epochs since we're fine-tuning
- Gradient accumulation to handle larger effective batch sizes
- Learning rate with warmup

#### Key Training Parameters:
- `learning_rate=3e-4`: Higher than typical fine-tuning to allow new knowledge absorption
- `per_device_train_batch_size=4`: Adjust based on your GPU memory
- `gradient_accumulation_steps=8`: Effectively creates batch_size of 32
- `num_train_epochs=3`: Adjust based on dataset size and convergence

#### Using Different Model Sizes:

This setup can be adapted for different LLaMa 3.2 sizes (1B, 3B, etc.) and different hardware:

To use a larger LLaMa model, simply change the model ID:
```
# For 1B model (current)
model_id = "meta-llama/Llama-3.2-1B"

# For 3B model
# model_id = "meta-llama/Llama-3.2-3B"
```

Hardware Considerations:
- 1B model: Runs on most GPUs with 16GB+ memory
- 3B model: Recommended 24GB+ GPU memory
- For smaller GPUs: Reduce batch size and increase gradient accumulation
- For larger GPUs: Increase batch size for faster training

We'll also set up. regular checkpoints to save progress.

We can set up progress monitoring using wandb (Weights & Biases), which is a popular tool for tracking machine learning experiments. It creates nice visualizations of your training metrics (like loss over time), GPU usage, etc. When you run ML training, it sends the data to their website where you can view it in nice dashboards. If you want to use it, you'll need a wandb API key.

Iif you don't want to use it, you can set the `use_wandb` parameter at the top of this next cell to = `False`

In [None]:
use_wandb = False

# Clear any existing accelerator state
AcceleratorState._reset_state()

# Initialize accelerator
accelerator = Accelerator()

report_to = "none"
if use_wandb:
  # Initialize wandb
  wandb.init(
      project="llama-domain-training",
      name="domain-knowledge-run",
      config={
          "model": "LLaMa-3.2-1B",
          "dataset_size": len(training_dataset),
          "chunk_size": 512
      }
  )
  report_to = 'wandb'

# Training arguments with accelerator config

training_args = TrainingArguments(
    output_dir="./domain_trained_model",
    learning_rate=1e-4,              # Small learning rate
    per_device_train_batch_size=4,   # Start smaller, we can adjust
    gradient_accumulation_steps=8,   # Add gradient accumulation
    num_train_epochs=50,
    bf16=True,                      # Enable mixed precision
    logging_steps=1,                # Log every step so we can monitor
    save_strategy="epoch",
    optim="adamw_8bit",            # Use 8-bit optimizer
    weight_decay=0.01,             # Add weight decay
    warmup_steps=10                # Add warmup steps
)

## (Optional) PEFT Configuration

Parameter Efficient Fine-Tuning (PEFT) lets us fine-tune LLaMA using much less memory. We'll use LoRA (Low-Rank Adaptation), which is particularly effective for LLMs.

In [None]:
# Toggle for full fine-tuning
use_peft = True

if use_peft:
    # LoRA configuration
    peft_config = LoraConfig(
        r=16,                     # Rank of update matrices
        lora_alpha=32,           # Alpha parameter for LoRA scaling
        lora_dropout=0.05,       # Dropout probability for LoRA layers
        target_modules=[         # Which modules to apply LoRA to
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
        task_type="CAUSAL_LM"    # For causal language modeling
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # Shows % of parameters being trained



trainable params: 11,272,192 || all params: 1,247,086,592 || trainable%: 0.9039


## Training Loop

Now we'll set up the trainer and start training. We'll include:
- A simple progress callback to monitor training
- Basic error handling
- Checkpoint saving

In [None]:
from transformers import TrainerCallback

class ProgressCallback(TrainerCallback):
    """Simple callback to print progress during training"""
    def on_epoch_begin(self, args, state, control, **kwargs):
        print(f"\nStarting epoch {state.epoch + 1}/{args.num_train_epochs}")

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            # Handle the loss value more carefully
            loss = logs.get('loss', 'N/A')
            if isinstance(loss, (float, int)):
                print(f"Step {state.global_step}: Loss = {loss:.4f}")
            else:
                print(f"Step {state.global_step}: Loss = {loss}")

# Set up trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_dataset,
    callbacks=[ProgressCallback()]
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


## Monitoring Training

You can monitor the training in several ways:
1. Direct console output showing loss every 10 steps
2. Weights & Biases dashboard (wandb.ai) showing:
   - Training loss over time
   - Learning rate schedule
   - GPU memory usage
   - Training speed
3. Model checkpoints saved after each epoch

The training may take some time depending on your GPU. You'll see regular updates on:
- Current epoch
- Current step
- Loss value
- Any potential issues

In [None]:
# Training loop with detailed error tracking
try:
    print("Starting training...")

    # Track where we are in training
    current_step = 0
    try:
        trainer_output = trainer.train()
    except Exception as train_error:
        print("\nError during trainer.train():")
        print(f"Step when error occurred: {current_step}")
        print(f"Error type: {type(train_error)}")
        print(f"Error message: {str(train_error)}")
        # Print the full error traceback
        import traceback
        print("\nFull error traceback:")
        print(traceback.format_exc())
        raise  # Re-raise the error to see full stack trace

    print("\nTraining completed!")

    # Save the final model
    trainer.save_model("./final_model")
    print("Model saved to ./final_model")

except Exception as e:
    print("Final error catch:", str(e))

Starting training...


[34m[1mwandb[0m: Currently logged in as: [33mabigail-petulante[0m ([33mabigail-petulante-vanderbilt-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin



Starting epoch 1/50


  ctx_manager = torch.cpu.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype)


# Checking our Fine-Tuning Results!

If the above cell ran, then congrats! You've fine-tuned a LLaMa 3.2 1B model on some new domain knowledge! Let's test that it actually learned what we wanted it to.

We'll test the model's knowledge by:
1. Asking domain-specific questions
2. Comparing responses between original and fine-tuned models
3. Looking for improvements in accuracy and detail

First, let's just confirm that we've *actually* changed the model.

In [None]:
from peft import PeftModel, PeftConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto"
)

fine_tuned_path ="./final_model"

if use_peft:
    fine_tuned_path = "./final_model"
    config = PeftConfig.from_pretrained(fine_tuned_path)
    fine_tuned_model = PeftModel.from_pretrained(
        base_model,
        fine_tuned_path,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # To compare PEFT models, we need to look at the LoRA weights
    print("\nLoRA Adapter Information:")
    for name, param in fine_tuned_model.named_parameters():
        if 'lora' in name:  # Only look at LoRA parameters
            print(f"Found LoRA weights in {name}")
            print(f"Non-zero parameters: {torch.sum(param != 0).item()}")

    # The base parameters should be identical (that's the point of PEFT!)
    print("\nBase parameters are identical?:", torch.allclose(
        next(base_model.parameters()),
        next(fine_tuned_model.base_model.parameters())
    ))
else:
    fine_tuned_model = AutoModelForCausalLM.from_pretrained(
        fine_tuned_path,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Compare some weights to see if they're actually different
    base_params = next(base_model.parameters())
    fine_tuned_params = next(fine_tuned_model.parameters())

    print("Are the models identical?", torch.allclose(base_params, fine_tuned_params))

We can also double-check that the training generated the outputs that we expect.

In [None]:
# Look at training output directory
print("Training output contents:")
if os.path.exists("./domain_trained_model"):
    print("\ndomain_trained_model directory contains:")
    for item in os.listdir("./domain_trained_model"):
        print(f"- {item}")
        if os.path.isdir(f"./domain_trained_model/{item}"):
            print(f"  Contains: {os.listdir(f'./domain_trained_model/{item}')}")

# If we have a trainer_state.json, let's look at it
import json
if os.path.exists("./domain_trained_model/checkpoint-3/trainer_state.json"):
    with open("./domain_trained_model/checkpoint-3/trainer_state.json", 'r') as f:
        state = json.load(f)
    print("\nTraining history:")
    print(state.get('log_history', []))

Below, we'll set up a function to query our model with a question.

## Comparing Original vs Fine-Tuned Models

Now, we'll compare the performance of our fine-tuned model vs. the original model on some questions that are meant to test the model's domain knowledge.

### Test Questions

Let's define some questions to test our models' knowledge. These should be specific to your PDF content. These questions are for a dataset of PDF's that discuss "dialogic reading"

In [None]:
# Create test questions based on PDF content
test_questions = [
    "Who developed the concept of dialogic reading",
    "What ages is dialogic reading appropriate for?",
    "What are dialogic reading prompt types?",
]

### Original Model

First, let's look at how LLaMa 3.2 performs on these questions out of the box, before training to gain any additional knowledge.

In [None]:
# Create pipeline with LLaMA 3.2 original
pipe_original = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B",
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto"
)
# Test each question
print("Testing original model responses:")
print("-" * 50)
for question in test_questions:
    print(f"\nQ: {question}")
    result = pipe_original(question, max_length=100)
    print(f"A: {result[0]['generated_text']}")
    print("-" * 50)

### Our Fine-Tuned Model

Now, we give those same questions to our model, which we've fine tuned for additional domain knowledge. We'll look for signs that the domain knowledge has been absorbed by the model by how it answers the question.

In [None]:
# Create pipeline with our fine-tuned model

if use_peft:
  peft_model = PeftModel.from_pretrained(
      base_model,
      fine_tuned_path,
      torch_dtype=torch.float16,
      device_map="auto"
  )

  # Get the merged model
  fine_tuned_model = peft_model.merge_and_unload()  # This combines PEFT and base weights

  pipe_finetune = pipeline(
      "text-generation",
      model=fine_tuned_model,  # base model + peft weights
      tokenizer=tokenizer,
      torch_dtype=torch.float16,
      device_map="auto"
  )

else:
  pipe_finetune = pipeline(
      "text-generation",
      model=fine_tuned_path,  # Path to our saved fine-tuned model
      tokenizer=tokenizer,
      torch_dtype=torch.float16,
      device_map="auto"
  )
# Test each question
print("Testing fine-tuned model responses:")
print("-" * 50)
for question in test_questions:
    print(f"\nQ: {question}")
    result = pipe_finetune(question, max_length=100)
    print(f"A: {result[0]['generated_text']}")
    print("-" * 50)

## Checking General Knowledge Retention

Often times for very small models, training them to gain specific domain knowledge leads to forgetting existing knowledge.

Here, we'll check a few basic knowledge questions to see if the model has retained understanding.

In [None]:
# Create test questions based on PDF content
test_questions = [
    "What does a zebra look like?",
    "What's the difference between a lake and a pond?",
    "2 + 2 = ?",
]

In [None]:
# Test each question
print("Testing original model responses:")
print("-" * 50)
for question in test_questions:
    print(f"\nQ: {question}")
    result = pipe_original(question, max_length=100)
    print(f"A: {result[0]['generated_text']}")
    print("-" * 50)

In [None]:
# Test each question
print("Testing fine-tuned model responses:")
print("-" * 50)
for question in test_questions:
    print(f"\nQ: {question}")
    result = pipe_finetune(question, max_length=100)
    print(f"A: {result[0]['generated_text']}")
    print("-" * 50)