# Supervised Fine-Tuning on Limited GPU Memory

This notebook demonstrates how to fine-tune a language model on a GPU with limited memory. We'll use parameter-efficient fine-tuning (PEFT) together with memory optimization strategies such as 4-bit quantization.

## Cell 1: Install Required Dependencies

We install the packages needed for model training: Hugging Face transformers and datasets, PEFT for parameter-efficient fine-tuning, bitsandbytes for quantization, accelerate for device placement, and python-dotenv, which Cell 2 uses to read settings from a `.env` file.

In [12]:
#%pip install transformers datasets peft bitsandbytes accelerate python-dotenv
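
Before running the rest of the notebook, it can help to confirm which of these packages are actually installed. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, and `python-dotenv` is included on the assumption that Cell 2's `from dotenv import load_dotenv` relies on it.

```python
from importlib import metadata

# Distribution names as published on PyPI. python-dotenv is listed because
# Cell 2 imports `dotenv` (an assumption about how the notebook is set up).
REQUIRED = [
    "transformers",
    "datasets",
    "peft",
    "bitsandbytes",
    "accelerate",
    "python-dotenv",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before the import cell will run.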

## Cell 2: Import Libraries

Import all required libraries including transformers for model loading, datasets for data handling, and PEFT for efficient fine-tuning.

In [None]:
import json
import torch
import os
from dotenv import load_dotenv
load_dotenv()
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling
)

from peft import get_peft_model, LoraConfig, TaskType
import bitsandbytes as bnb

## Cell 3: Load and Prepare Dataset

Load the JSONL file and convert it into a Hugging Face Dataset. We assume each line is a JSON object with 'instruction' and 'output' fields.

In [None]:
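def load_jsonl_dataset(file_path):
    """Load a JSONL file (one JSON object per line) into a Dataset"""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # Skip blank lines, which json.loads would reject
                data.append(json.loads(line))
    return Dataset.from_list(data)

# Set DATASET_PATH in the environment or in a .env file
dataset_path = os.getenv("DATASET_PATH")
if dataset_path is None:
    raise ValueError("DATASET_PATH is not set")
dataset = load_jsonl_dataset(dataset_path)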
print(f"Dataset size: {len(dataset)}")
print(dataset[0])

Dataset size: 112
{'instruction': 'According to the Life Insurance Code of Practice, what is the effective date of the document?', 'output': 'March 2025'}
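
With a dataset this small, a few malformed records can matter, so it may be worth validating the file before training. The sketch below is one possible check, not part of the original notebook: `find_invalid_records` and `REQUIRED_FIELDS` are hypothetical names, and the field names are taken from the sample record printed above.

```python
import json

# Field names taken from the sample record printed above.
REQUIRED_FIELDS = ("instruction", "output")


def find_invalid_records(lines, required=REQUIRED_FIELDS):
    """Return (line_number, reason) pairs for lines that are not usable records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Blank lines are ignored rather than reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # A field counts as missing if it is absent, not a string, or blank.
        missing = [
            field
            for field in required
            if not isinstance(record.get(field), str) or not record[field].strip()
        ]
        if missing:
            problems.append((number, f"missing or empty: {', '.join(missing)}"))
    return problems


# Toy input: one valid record, one without an answer, one that is not JSON.
sample_lines = [
    '{"instruction": "What is the effective date?", "output": "March 2025"}',
    '{"instruction": "No answer here"}',
    "not json",
]
for number, reason in find_invalid_records(sample_lines):
    print(f"line {number}: {reason}")
```

To check the real file, pass the open file object instead of `sample_lines`, since iterating over a file yields its lines.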


## Cell 4: Initialize Model and Tokenizer

Load a lightweight model suitable for about 4 GB of GPU memory. By default we load TinyLlama (roughly 1.1B parameters) and apply 4-bit NF4 quantization with double quantization to reduce the memory taken by the weights.

In [None]:
from transformers import BitsAndBytesConfig

# Configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer (small model for limited memory)
model_name =  os.getenv("MODEL_NAME", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}  # Load entirely on GPU 0
)
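
To see why 4-bit loading matters on a small GPU, a back-of-the-envelope estimate of weight memory is useful. This is a rough sketch only: it assumes about 1.1 billion parameters and counts weights alone. Quantization constants, layers that bitsandbytes typically leaves in higher precision, activations, optimizer state, and the CUDA context all add to the real footprint. The helper name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 1.1e9  # Approximate size of the default TinyLlama model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the fp16 footprint, which leaves headroom for activations and the LoRA optimizer state.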

## Cell 5: Configure Parameter-Efficient Fine-Tuning (LoRA)

Set up the LoRA configuration. Only small low-rank adapter matrices are trained (about 0.1% of all parameters in this run), while the quantized base weights stay frozen.

In [16]:
# LoRA configuration for parameter-efficient training
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,  # Low rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]  # Apply to the attention query and value projections
)

# Wrap model with LoRA adapters
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023
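
The trainable parameter count can be reproduced by hand. LoRA adds two matrices per adapted linear layer: `A` with shape `(r, in_features)` and `B` with shape `(out_features, r)`. The sketch below assumes the TinyLlama-1.1B dimensions (hidden size 2048, 22 layers, 4 key/value heads of dimension 64), which are not stated in this notebook; the helper `lora_params` is an illustrative name.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


# Assumed TinyLlama-1.1B dimensions (not taken from this notebook).
HIDDEN = 2048
NUM_LAYERS = 22
KV_DIM = 4 * 64  # 4 key/value heads of dimension 64 (grouped-query attention)
RANK = 8         # Matches r=8 in the LoraConfig above

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 1,126,400
```

Under these assumptions the result matches the `trainable params` figure printed above, and dividing by the reported total of 1,101,174,784 gives the 0.1023% shown.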


## Cell 6: Preprocess Dataset

Tokenize the dataset and format it for causal language modeling. Each training text is the instruction followed by the output and an end-of-sequence token.

In [17]:
def preprocess_function(examples):
    """Tokenize text and prepare for causal language modeling"""
    # Combine prompt and completion
    PROMPT_COLUMN = "instruction"
    COMPLETION_COLUMN = "output"
    texts = [
        f"{prompt} {completion}{tokenizer.eos_token}"
        for prompt, completion in zip(examples[PROMPT_COLUMN], examples[COMPLETION_COLUMN])
    ]
    
    # Tokenize with truncation
    model_inputs = tokenizer(
        texts,
        max_length=512,
        truncation=True,
        padding=False
    )
    
    # No "labels" column is created here: the data collator in Cell 8
    # (mlm=False) copies input_ids into labels, and the model shifts them
    # by one position internally when computing the loss.
    
    return model_inputs

# Apply preprocessing to dataset
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names
)

Map: 100%|██████████| 112/112 [00:00<00:00, 1274.94 examples/s]
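
How labels line up with inputs in causal language modeling is easy to get wrong, so here is a small illustration in plain Python. Causal LM models in transformers shift the labels by one position internally and ignore any label equal to `-100`. The toy token ids and the helper `shifted_pairs` are made up for this sketch.

```python
IGNORE_INDEX = -100  # Label value that the loss skips


def shifted_pairs(input_ids, labels):
    """Return (input token, target token) pairs that contribute to the loss.

    The prediction made at position i is scored against labels[i + 1];
    targets equal to IGNORE_INDEX are dropped.
    """
    return [
        (token, target)
        for token, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]


input_ids = [101, 7, 8, 9, 2]          # Toy ids; 2 stands in for the EOS token
full_labels = list(input_ids)          # Labels are a plain copy of the inputs
last_only = [IGNORE_INDEX] * 4 + [2]   # Everything masked except the last token

print(len(shifted_pairs(input_ids, full_labels)))  # 4
print(len(shifted_pairs(input_ids, last_only)))    # 1
```

A label list that is `-100` everywhere except the final token supervises only one prediction per example, whereas a copy of `input_ids` supervises every next-token prediction.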


## Cell 7: Configure Training Arguments

Set up training parameters optimized for limited memory, including gradient accumulation, small batch sizes, and memory-saving options.

In [18]:
# Memory-efficient training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,        # Minimum batch size
    gradient_accumulation_steps=8,        # Effective batch size of 8
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,                            # Mixed precision training
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    report_to="none",                     # Disable external tracking
    dataloader_pin_memory=False,          # Reduce memory pressure
    remove_unused_columns=False,          # Keep all columns
    optim="paged_adamw_8bit"              # Memory-efficient optimizer
)
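
The interaction between batch size, gradient accumulation, and epochs determines how many optimizer updates the run performs. The arithmetic can be sketched as follows. The helper `optimizer_steps` is illustrative and only approximates the Trainer's own calculation, since different transformers versions count a trailing partial accumulation group differently. The 112 examples come from the dataset size printed in Cell 3.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


effective_batch = 1 * 8  # per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)                                   # 8
print(optimizer_steps(112, per_device_batch=1, grad_accum=8, epochs=3))  # 42
```

With 112 examples and an effective batch of 8, each epoch has 14 updates, so 3 epochs give 42 updates. With `logging_steps=10` that yields four logged loss values, and `save_steps=100` is never reached during the run.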

## Cell 8: Initialize Trainer and Start Training

Create the trainer with our model, dataset, and training configuration, then start the fine-tuning process.

In [19]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal language modeling
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

# Start training
trainer.train()

Step,Training Loss
10,2.2824
20,2.0789
30,2.0294
40,1.9781


TrainOutput(global_step=42, training_loss=2.0830026808239164, metrics={'train_runtime': 674.8586, 'train_samples_per_second': 0.498, 'train_steps_per_second': 0.062, 'total_flos': 291142917513216.0, 'train_loss': 2.0830026808239164, 'epoch': 3.0})
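
The reported metrics can be sanity-checked with a few lines of arithmetic. The sketch below recomputes throughput from the runtime and converts the mean training loss into a perplexity. It assumes the loss is the mean per-token cross-entropy in nats. Note that this is a training-set figure, since the notebook does not use a held-out evaluation set.

```python
import math

# Values copied from the TrainOutput above and the dataset size in Cell 3.
train_loss = 2.0830026808239164
runtime_s = 674.8586
num_examples, epochs, global_step = 112, 3, 42

samples_per_second = num_examples * epochs / runtime_s
steps_per_second = global_step / runtime_s
perplexity = math.exp(train_loss)  # exp of mean cross-entropy

print(f"samples/s: {samples_per_second:.3f}")  # 0.498
print(f"steps/s:   {steps_per_second:.3f}")    # 0.062
print(f"training perplexity: {perplexity:.2f}")
```

The recomputed throughput agrees with the logged `train_samples_per_second` and `train_steps_per_second`, and the training perplexity comes out at roughly 8.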

## Cell 9: Save the Fine-Tuned Model

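Save the trained LoRA adapters and tokenizer for later use. The base model weights are not written to disk, which keeps the checkpoint small; they are loaded again from the original model when the adapters are reused.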

In [20]:
# Save only the LoRA adapters (not the full model)
model.save_pretrained("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")
print("Model saved successfully!")

Model saved successfully!
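
Because only the adapters are saved, using them later means loading the base model first and attaching the adapters on top. The following is a minimal sketch of that step, assuming the same quantization settings as Cell 4 and a single GPU. The helper names `has_lora_adapter` and `load_finetuned_model` are hypothetical, while `PeftModel.from_pretrained` is the PEFT call that attaches saved adapters to a base model.

```python
import os


def has_lora_adapter(adapter_dir):
    """True if the directory contains a PEFT adapter config file."""
    return os.path.isfile(os.path.join(adapter_dir, "adapter_config.json"))


def load_finetuned_model(base_model_name, adapter_dir="./fine-tuned-model"):
    """Reload the quantized base model and attach the saved LoRA adapters."""
    if not has_lora_adapter(adapter_dir):
        raise FileNotFoundError(f"No adapter_config.json found in {adapter_dir!r}")

    # Heavy libraries are imported here so the helper can be defined cheaply.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Same 4-bit settings as Cell 4 (an assumption; adjust if yours differ).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map={"": 0},  # Load entirely on GPU 0
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    return model.eval(), tokenizer
```

A typical call would be `model, tokenizer = load_finetuned_model("TinyLlama/TinyLlama-1.1B-Chat-v1.0")`, using the same base model name that was fine-tuned.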


## Cell 10: Test the Fine-Tuned Model

Test the model with a sample prompt to verify the fine-tuning results.

In [None]:
# Test the model
model.eval()
prompt = os.getenv("TEST_PROMPT", "Tell me a joke about cats.")
print(f"Using TEST_PROMPT: {prompt!r}")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True
    )
    
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What is life insurance? How does it work, and what types of policies are available? Life insurance is designed to provide financial support to beneficiaries in case of the policyholder's death. Policies have life, term, and whole life insurance options, which can be tailored to fit specific needs and situations. Life insurance is designed to give peace of mind to beneficiaries and reduce the financial burden on loved ones. Term Life Insurance: This type of policy is designed to
