<a href="https://colab.research.google.com/github/sarahajbane/notebooks/blob/main/4_2_Fine_Tuning_Large_Language_Models_(LLMs)_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLaMA 2 with IMDB Dataset (Optimized for Speed and Memory Efficiency)


This notebook demonstrates how to fine-tune the LLaMA 2 large language model on the IMDB movie reviews dataset. The goal is to adapt the pre-trained model so that it better understands and generates text in the style of movie reviews, including the sentiment they express. The model is trained with a causal language-modeling objective on the review text rather than as a sentiment classifier.

Optimizations for faster computation and reduced memory usage:
- **Reduced Training Epochs:** Trains for a single epoch to shorten training time.
- **Smaller Batch Size:** Uses a per-device batch size of 1 to reduce memory usage.
- **Gradient Accumulation:** Accumulates gradients over several small batches to reach a larger effective batch size without the memory cost of a large batch.
- **Mixed Precision Training:** Uses FP16 for faster computation and lower memory use when a GPU is available.
- **Falling Back to CPU if GPU Loading Fails:** Loads the model on the CPU if loading it on the GPU raises an error such as OutOfMemory.

Actions:
1. Install and import necessary libraries.
2. Load the IMDB dataset.
3. Configure the LLaMA 2 model.
4. Apply parameter-efficient fine-tuning using LoRA.
5. Train the model efficiently.
6. Evaluate the fine-tuned model.

## Step 1: Install Required Libraries


In this step, we install the necessary libraries to support model training, fine-tuning, and dataset handling.


In [None]:
!pip install accelerate peft transformers trl datasets


## Step 2: Import Libraries


Import essential libraries for data processing, model handling, and fine-tuning.


In [None]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,  # Loads the pre-trained LLaMA model
    AutoTokenizer,         # Tokenizes the input data
    TrainingArguments,     # Defines model training configurations
    pipeline,              # Simplified interface for text generation
    logging,               # Handles logging outputs during training and inference
)
from peft import LoraConfig  # Enables parameter-efficient fine-tuning (PEFT) using LoRA
from trl import SFTTrainer   # Trainer designed for supervised fine-tuning tasks


## Step 3: Load Dataset

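Here, we load the first 25% of the IMDB training split (6,250 of 25,000 reviews) to reduce computational costs while keeping enough data for the model to learn from. The slice is positional, not random, so the class balance of the subset depends on how the rows are ordered.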
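The split string `train[:25%]` takes the leading quarter of the rows. The sketch below uses a small synthetic label list (an assumption for illustration, not the real dataset) to show why this can matter when rows are grouped by label: a leading slice may contain only one class, while shuffling before slicing keeps both.

```python
import random

# Synthetic stand-in for a label column grouped by class (assumption for
# illustration only): 0 = negative, 1 = positive.
labels = [0] * 50 + [1] * 50


def first_fraction(rows, fraction):
    """Return the leading `fraction` of rows, mirroring a "[:25%]" style slice."""
    return rows[: int(len(rows) * fraction)]


# Slice the rows in their stored order.
head_slice = first_fraction(labels, 0.25)

# Shuffle a copy with a fixed seed, then take the same fraction.
shuffled = labels[:]
random.Random(42).shuffle(shuffled)
shuffled_slice = first_fraction(shuffled, 0.25)

print("classes in head slice:    ", sorted(set(head_slice)))
print("classes in shuffled slice:", sorted(set(shuffled_slice)))
```

With the `datasets` library, one way to draw a shuffled subset of the same size would be `load_dataset("imdb", split="train").shuffle(seed=42).select(range(6250))`. Whether this is needed depends on how the real split is ordered, which is worth checking before training.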


In [None]:
dataset = load_dataset("imdb", split="train[:25%]")


## Step 4: Model Configuration


Define the base model for fine-tuning and set the name for the new fine-tuned model.


In [None]:
base_model = "NousResearch/Llama-2-7b-chat-hf"
new_model = "llama-2-7b-chat-imdb-optimized"

- We choose Llama-2-7b-chat-hf, the smallest of the LLaMA 2 chat models, because it provides a good balance between model size and performance.

## Step 5: Device Configuration


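This section loads the model and configures the device, using the GPU if one is available and falling back to the CPU otherwise. A 7B-parameter model needs considerable memory, so running this notebook on a local machine without a suitable GPU could lead to issues; it's better to use a free cloud service such as Google Colab.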

In [None]:
# Load the model on the GPU in half precision when available, otherwise on the CPU
try:
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        device_map={"": 0} if torch.cuda.is_available() else {"": "cpu"},
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32  # FP16 weights on GPU
    )
except RuntimeError as e:
    # CUDA out-of-memory errors are raised as a RuntimeError subclass
    print(f"Model loading failed ({e}). Switching to CPU...")
    model = AutoModelForCausalLM.from_pretrained(base_model, device_map={"": "cpu"})

# Optimize for memory efficiency
model.config.use_cache = False  # The generation cache is not needed during training
model.gradient_checkpointing_enable()  # Reduces memory usage during backpropagation



- Gradient checkpointing saves memory by recomputing intermediate activations during the backward pass instead of storing them, which increases computation time.
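To see why the data type chosen at load time matters, here is a rough estimate of the memory needed just to hold the weights. The parameter count is an approximation for a 7B-class model and is an assumption of this sketch, not a value read from the checkpoint.

```python
# Approximate parameter count for a 7B-class model (assumption; the exact
# figure depends on the checkpoint).
NUM_PARAMS = 6.74e9

# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params, dtype):
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.2f} GB")
```

These figures cover the weights only. Gradients, optimizer state, and activations add to the total, which is why the notebook combines half-precision loading with LoRA and gradient checkpointing.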

## Step 6: Load Tokenizer


In [None]:
# Load the tokenizer compatible with LLaMA 2 to convert text data into token IDs.
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Uses the end-of-sequence token for padding
tokenizer.padding_side = "right"           # Applies padding to the right side of the sequences



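- Passing trust_remote_code=True allows the tokenizer to run custom code from the model repository. The LLaMA tokenizer is implemented in the transformers library itself, so the flag should not be required here and can be omitted.
- Setting the pad token to eos_token gives the tokenizer a padding token without adding a new entry to the vocabulary, so the model's embedding matrix does not need to be resized.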

## Step 7: Preprocess Dataset


In [None]:
# Reducing max_length to 32 for lower memory consumption
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=32)

dataset = dataset.map(preprocess_function, batched=True)



- We set max_length to 32 to reduce memory usage. Most reviews are much longer than 32 tokens, so this truncates them to their opening words. In practice, you may need to raise this value based on your dataset and model requirements.
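The combined effect of `truncation=True`, `padding="max_length"`, and right-side padding can be sketched in plain Python. The token IDs and the pad ID below are made up for illustration, and the helper `truncate_and_pad` is hypothetical; it mimics the shape of the tokenizer's output rather than reproducing its implementation.

```python
def truncate_and_pad(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, then right-pad it to exactly max_length."""
    kept = token_ids[:max_length]                   # truncation
    padding = [pad_id] * (max_length - len(kept))   # right-side padding
    input_ids = kept + padding
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask


PAD_ID = 2  # hypothetical pad token ID

short_review = [101, 102, 103]        # shorter than max_length: gets padded
long_review = list(range(200, 240))   # 40 tokens: gets truncated

for name, ids in [("short", short_review), ("long", long_review)]:
    input_ids, attention_mask = truncate_and_pad(ids, max_length=8, pad_id=PAD_ID)
    print(name, input_ids, attention_mask)
```

Every example ends up with the same length, which keeps memory use predictable, but any tokens beyond the limit are discarded.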

## Step 8: PEFT Parameters


Configure LoRA settings to optimize the fine-tuning process with minimal resource consumption.


In [None]:
peft_params = LoraConfig(
    lora_alpha=4,         # Scaling factor; the LoRA update is multiplied by lora_alpha / r
    lora_dropout=0.1,     # Adds dropout regularization to prevent overfitting
    r=16,                 # Defines the rank of the adaptation matrices, balancing performance and efficiency
    bias="none",          # Bias terms are left frozen (not trained)
    task_type="CAUSAL_LM" # Specifies the task type as causal language modeling
)
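To make the roles of `r` and `lora_alpha` concrete: LoRA keeps a weight matrix W frozen and learns a low-rank update B·A, scaled by `lora_alpha / r`. The sketch below counts the parameters involved for a single matrix. The 4096 x 4096 shape is an assumed projection size for illustration, and `lora_summary` is a hypothetical helper, not part of the `peft` API.

```python
def lora_summary(d_out, d_in, r, lora_alpha):
    """Parameter counts and update scale for one LoRA-adapted weight matrix."""
    full_params = d_out * d_in           # frozen weight W (d_out x d_in)
    lora_params = r * d_in + d_out * r   # A is r x d_in, B is d_out x r
    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "trainable_fraction": lora_params / full_params,
        "scale": lora_alpha / r,         # multiplier applied to B @ A
    }


# Assumed layer shape for illustration; r and lora_alpha match the config above.
summary = lora_summary(d_out=4096, d_in=4096, r=16, lora_alpha=4)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With these settings the adapter trains well under 1% of the parameters of the matrix it adapts, and the update is scaled down to a quarter of its raw magnitude.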

## Step 9: Training Parameters


Define hyperparameters to guide the model's learning process during fine-tuning.


In [None]:
# Adjust hyperparameters for reduced memory usage
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,            # Keep batch size small
    gradient_accumulation_steps=2,            # Effective batch size = 1 x 2 = 2
    optim="adamw_torch",
    save_steps=200,                           # Save less frequently to reduce overhead
    logging_steps=50,
    learning_rate=3e-5,                       # Slightly lower learning rate
    weight_decay=0.001,
    fp16=torch.cuda.is_available(),           # Mixed precision for GPUs
    max_grad_norm=0.5,                        # Gradient clipping
    warmup_steps=5,
    lr_scheduler_type="linear",
    report_to="none"
)
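The batch size and accumulation settings determine how many optimizer steps one epoch takes. The sketch below works this out for the 6,250-example subset, assuming a single device. The helper `optimizer_steps` is hypothetical, and the Trainer's own rounding may differ when the division is not exact.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    epochs=1, num_devices=1):
    """Return (effective batch size, total optimizer steps) for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# 6,250 examples = 25% of the 25,000-review training split.
effective_batch, total_steps = optimizer_steps(
    num_examples=6250, per_device_batch_size=1, grad_accum_steps=2
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
```

The step count agrees with the `global_step` value reported after training in Step 10.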


## Step 10: Fine-Tuning


Use the SFTTrainer to fine-tune the LLaMA model with the IMDB dataset.


In [None]:
trainer = SFTTrainer(
    model=model,                 # Base LLaMA model
    train_dataset=dataset,       # Preprocessed IMDB dataset
    peft_config=peft_params,     # LoRA fine-tuning configuration
    args=training_params         # Training hyperparameters
)

trainer.train()  # Initiates the fine-tuning process




| Step | Training Loss |
|------|---------------|
| 50   | 3.5341 |
| 100  | 3.2316 |
| 150  | 3.2337 |
| 200  | 3.0687 |
| 250  | 2.9202 |
| 300  | 2.8215 |
| 350  | 2.8627 |
| 400  | 2.6846 |
| 450  | 2.7185 |
| 500  | 2.6999 |


TrainOutput(global_step=3125, training_loss=2.7181155029296873, metrics={'train_runtime': 1563.551, 'train_samples_per_second': 3.997, 'train_steps_per_second': 1.999, 'total_flos': 7938878668800000.0, 'train_loss': 2.7181155029296873, 'epoch': 1.0})

- SFTTrainer handles the fine-tuning process, integrating the LoRA configuration with our base LLaMA model.
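The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below recomputes the throughput figures and converts the mean training loss into a perplexity, assuming the loss is an average per-token cross-entropy in nats. The example count of 6,250 is inferred from 3,125 steps at an effective batch size of 2.

```python
import math

# Values copied from the TrainOutput shown above.
train_loss = 2.7181155029296873
train_runtime_s = 1563.551
global_step = 3125

# Inferred: 3,125 steps x effective batch size of 2.
num_examples = global_step * 2

# Perplexity is exp(mean cross-entropy), assuming the loss is measured in nats.
perplexity = math.exp(train_loss)
samples_per_second = num_examples / train_runtime_s
steps_per_second = global_step / train_runtime_s

print(f"perplexity:         {perplexity:.2f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

A training-set perplexity is only a rough indicator. Measuring loss on held-out reviews would give a better picture of how well the model generalizes.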

## Step 11: Save Model


Save the fine-tuned LoRA adapter weights and the tokenizer for future use.

In [None]:
trainer.model.save_pretrained(new_model)  # Saves the LoRA adapter weights
tokenizer.save_pretrained(new_model)      # Uses the tokenizer loaded in Step 6; Trainer.tokenizer is deprecated


('llama-2-7b-chat-imdb-optimized/tokenizer_config.json',
 'llama-2-7b-chat-imdb-optimized/special_tokens_map.json',
 'llama-2-7b-chat-imdb-optimized/tokenizer.model',
 'llama-2-7b-chat-imdb-optimized/added_tokens.json',
 'llama-2-7b-chat-imdb-optimized/tokenizer.json')

## Step 12: Test Model

Evaluate the performance of the fine-tuned model using a sample prompt.


In [None]:
logging.set_verbosity(logging.CRITICAL)  # Suppresses detailed logs for clean output

prompt = "What do you think about the movie Inception?"  # Test prompt to evaluate the model
# max_length=50 caps the prompt and the generated tokens combined, so the reply is cut short
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=50)
result = pipe(f"<s>[INST] {prompt} [/INST]")  # Generates a response to the prompt
print(result[0]['generated_text'])  # Displays the model's generated response




<s>[INST] What do you think about the movie Inception? [/INST]  Inception is a thought-provoking and visually stunning science fiction film directed by Christopher Nolan. It is a complex and intricate movie
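The prompt above wraps the question in the `[INST] ... [/INST]` markers that the LLaMA 2 chat models were trained on. A small helper can keep that formatting in one place. `build_llama2_prompt` is a hypothetical function written for this sketch, and the optional system-message block follows the commonly documented `<<SYS>>` convention; it is not something the notebook itself uses.

```python
def build_llama2_prompt(user_message, system_message=None):
    """Wrap a single-turn message in LLaMA 2 chat markers."""
    user_message = user_message.strip()
    if system_message:
        # Optional system block, placed inside the first instruction.
        content = f"<<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n{user_message}"
    else:
        content = user_message
    return f"<s>[INST] {content} [/INST]"


prompt_text = build_llama2_prompt("What do you think about the movie Inception?")
print(prompt_text)
```

The resulting string could be passed to `pipe(...)` in place of the inline f-string.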
