# This project involves fine-tuning the Meta Llama 3.1-8B model on the SAMSum dataset to perform dialogue summarization.

To use the fine-tuned model directly, skip to block #10, load the model and tokenizer from Google Drive, and run inference on new prompts.

# #1 Installation of Necessary Packages

Before running the code, install the necessary Python packages. This project requires unsloth, xformers, trl, peft, accelerate, bitsandbytes, triton, and other dependencies.

In [None]:
%%capture
# Install necessary packages (the %%capture cell magic must be the first line of the cell)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install py7zr "llama_recipes"
from torch import __version__
from packaging.version import Version as V


# Install xformers and other dependencies
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton


# #2 Model Loading and Configuration


In this step, the Meta Llama 3.1-8B model is loaded using the FastLanguageModel class from the unsloth library. The model is loaded with 4-bit quantization to reduce GPU memory use during fine-tuning and inference.

In [None]:
from unsloth import FastLanguageModel
import torch

# Define model parameters
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load the Meta Llama 3.1-8B model with the tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.43.4.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# #3 Low-Rank Adaptation

To fine-tune the model efficiently, Low-Rank Adaptation (LoRA) is applied to the attention and MLP projection layers listed in target_modules. Only the small adapter matrices are trained, which sharply reduces the number of trainable parameters while the base weights stay frozen.

In [None]:
# Set LoRA (Low-Rank Adaptation) parameters for efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Low-rank parameter
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.8 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
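
As a rough check on how much LoRA shrinks the trainable parameter count, the sketch below counts adapter weights for r=16 over the seven target modules. The layer dimensions (hidden size 4096, 1024-wide key/value projections, MLP width 14336, 32 decoder layers) are assumptions taken from the published Llama 3.1 8B configuration rather than from this notebook, and `lora_param_count` is an illustrative helper, not a library function. Each adapted projection with `in` inputs and `out` outputs contributes `r * (in + out)` weights.

```python
# Assumed Llama 3.1 8B dimensions (not stated in this notebook).
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
RANK = 16  # matches r=16 in get_peft_model

# (in_features, out_features) of each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Count LoRA weights: an (in x rank) and a (rank x out) matrix per projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{trainable:,}")  # 41,943,040
```

Under these assumptions the total equals the 41,943,040 trainable parameters that the Unsloth banner reports when training starts in step #7.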


# #4 Inference with the model "before fine-tuning"

Before fine-tuning the model, a sample inference is performed using a predefined dialogue prompt to check the model's initial performance.

In [None]:
FastLanguageModel.for_inference(model)

eval_prompt = """
Summarize this dialog:
Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check\nHannah: <file_gif>\nAmanda: Sorry, can't find it.\nAmanda: Ask Larry\nAmanda: He called her last time we were at the park together\nHannah: I don't know him well\nHannah: <file_gif>\nAmanda: Don't be shy, he's very nice\nHannah: If you say so..\nHannah: I'd rather you texted him\nAmanda: Just text him 🙂\nHannah: Urgh.. Alright\nHannah: Bye\nAmanda: Bye bye
---
Summary:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


Summarize this dialog:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
---
Summary:
Hannah is looking for Betty's number. Amanda says she'll check, but she can't find it. She suggests that Hannah ask Larry. Hannah says she doesn't know him well, and Amanda says he's very nice. Hannah says she'd rather Amanda texted him. Amanda says just to text him. Hannah says alright, and Amanda says bye.


# #5 Dataset Preparation

The SAMSum dataset is prepared for training, validation, and testing. The dataset is tokenized and preprocessed using llama_recipes utilities, and DataLoaders are configured for efficient data handling.

In [None]:
from llama_recipes.configs import train_config as TRAIN_CONFIG
from llama_recipes.configs.datasets import samsum_dataset
from llama_recipes.utils.config_utils import get_dataloader_kwargs
from llama_recipes.utils.dataset_utils import get_preprocessed_dataset
from transformers import DataCollatorForSeq2Seq

# Set pad_token to eos_token for consistency in padding
tokenizer.pad_token = tokenizer.eos_token

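# Load and preprocess the SAMSum dataset for training, validation, and testing.
# Caution: both non-train calls return 818 examples in this cell's output, although the
# SAMSum test split has 819, so 'test' appears to resolve to the validation split here.
train_dataset = get_preprocessed_dataset(tokenizer, samsum_dataset, 'train').shuffle(seed=42).select(range(8000))
val_dataset = get_preprocessed_dataset(tokenizer, samsum_dataset, 'validation')
test_dataset = get_preprocessed_dataset(tokenizer, samsum_dataset, 'test')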

# Get DataLoader configurations from llama_recipes' train_config
train_dl_kwargs = get_dataloader_kwargs(TRAIN_CONFIG, train_dataset, tokenizer, "train")
val_dl_kwargs = get_dataloader_kwargs(TRAIN_CONFIG, val_dataset, tokenizer, "validation")
test_dl_kwargs = get_dataloader_kwargs(TRAIN_CONFIG, test_dataset, tokenizer, "test")

# Create DataLoaders for training, validation, and testing
train_dataloader = torch.utils.data.DataLoader(train_dataset, num_workers=TRAIN_CONFIG.num_workers_dataloader, pin_memory=True, **train_dl_kwargs)
val_dataloader = torch.utils.data.DataLoader(val_dataset, num_workers=TRAIN_CONFIG.num_workers_dataloader, pin_memory=True, **val_dl_kwargs)
test_dataloader = torch.utils.data.DataLoader(test_dataset, num_workers=TRAIN_CONFIG.num_workers_dataloader, pin_memory=True, **test_dl_kwargs)


# Define a simple formatting function for the dataset
def formatting_func(example):
    return {"input_ids": example["input_ids"], "attention_mask": example["attention_mask"], "labels": example["labels"]}

# Define the data collator to pad input sequences to the same length
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)


print(len(train_dataset))
print(len(val_dataset))
print(len(test_dataset))


The repository for samsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/samsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

8000
818
818
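
The collator's padding behaviour can be illustrated without any libraries. The sketch below is a simplified stand-in, not the DataCollatorForSeq2Seq implementation: it right-pads every sequence in a batch to the longest length rounded up to a multiple of 8, and fills the labels with -100 so that padded positions are ignored by the loss. The helper names and the pad id `0` are hypothetical.

```python
def round_up(n, multiple):
    """Round n up to the nearest multiple (e.g. 5 -> 8 for multiple=8)."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_batch(examples, pad_id=0, label_pad_id=-100, multiple=8):
    """Right-pad input_ids, attention_mask and labels to one shared length."""
    target = round_up(max(len(ex["input_ids"]) for ex in examples), multiple)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        gap = target - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * gap)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * gap)
        # -100 marks positions that the loss should skip
        batch["labels"].append(ex["labels"] + [label_pad_id] * gap)
    return batch


examples = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [-100, 6, 7]},
    {"input_ids": [5, 6, 7, 8, 9], "attention_mask": [1] * 5, "labels": [-100, -100, 7, 8, 9]},
]
batch = pad_batch(examples)
print([len(row) for row in batch["input_ids"]])  # [8, 8]
```

Padding to a multiple of 8 wastes a few positions per batch but keeps tensor shapes friendly to GPU kernels, which is the usual reason for setting pad_to_multiple_of.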


# #6 Training Configuration and Setup


Training is configured using the SFTTrainer class from the trl library. The model is trained for three epochs with gradient accumulation, learning rate scheduling, and mixed-precision training.

In [None]:
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=None,
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    formatting_func=formatting_func,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # max_steps=1,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        output_dir="outputs",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
    ),
    data_collator=data_collator,
)
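
The batch size, gradient accumulation, warmup, and scheduler settings together determine how many optimizer steps run and what learning rate each step uses. A minimal sketch of that arithmetic follows, assuming a single GPU and the usual linear warmup-then-decay shape. The function names are illustrative and not part of transformers or trl, and the step count is an approximation of what the trainer computes.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


effective_batch, total_steps = total_optimizer_steps(8000, 2, 4, 3)
print(effective_batch, total_steps)  # 8 3000

for step in (0, 5, 1500, 3000):
    print(step, linear_lr(step, 2e-4, 5, total_steps))
```

With the values used in this notebook the sketch gives an effective batch of 8 and 3,000 steps, the same figures the Unsloth banner prints when training starts.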


# #7 Model Training and Evaluation

The model is trained on the prepared dataset, with the training loss logged every 10 steps and the overall results printed when training finishes. After training, the model is evaluated on both the validation and test datasets.


In [None]:

# Start training the model
train_results = trainer.train()
print("Training Results:", train_results)


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 8,000 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 10 | 1.0735 |
| 20 | 1.1295 |
| 30 | 1.0896 |
| 40 | 1.0595 |
| 50 | 1.1811 |
| 60 | 1.1128 |
| 70 | 1.0734 |
| 80 | 1.0261 |
| 90 | 1.0944 |
| 100 | 1.0703 |


Training Results: TrainOutput(global_step=3000, training_loss=0.7869794605573018, metrics={'train_runtime': 6551.8039, 'train_samples_per_second': 3.663, 'train_steps_per_second': 0.458, 'total_flos': 2.607958396498084e+17, 'train_loss': 0.7869794605573018, 'epoch': 3.0})


In [None]:
# Evaluate the model on the validation dataset
val_results = trainer.evaluate(eval_dataset=val_dataset)
print("Validation Results:", val_results)

Validation Results: {'eval_loss': 1.3123339414596558, 'eval_runtime': 31.3557, 'eval_samples_per_second': 26.088, 'eval_steps_per_second': 3.285, 'epoch': 3.0}


In [None]:
# Evaluate the model on the test dataset
test_results = trainer.evaluate(eval_dataset=test_dataset)
print("Test Results:", test_results)

Test Results: {'eval_loss': 1.3123339414596558, 'eval_runtime': 31.401, 'eval_samples_per_second': 26.05, 'eval_steps_per_second': 3.28, 'epoch': 3.0}



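To evaluate the model's performance further, the BLEU score is calculated on the test dataset. This step generates predictions, extracts the text after "Summary:", and compares it with the reference summaries. Because the batches come from the preprocessed dataset, whose input_ids appear to contain the reference summary as well as the prompt, the resulting score is likely inflated and should be read with caution.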

In [None]:
from datasets import load_metric
import torch

# Load BLEU metric
bleu = load_metric("bleu")

# Function to extract summary from generated text
def extract_summary(text):
    return text.split("Summary:")[-1].strip()

# Generate predictions for the test dataset
predictions = []
references = []

model.eval()
with torch.no_grad():  # gradients are not needed for generation
    for batch in test_dataloader:
        inputs = {k: v.to("cuda") for k, v in batch.items() if k != "labels"}
        outputs = model.generate(**inputs, max_new_tokens=128)
        decoded_preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Extract only the text after "Summary:"
        summaries = [extract_summary(pred) for pred in decoded_preds]

        # Replace the -100 loss-mask values with the pad token so the labels can be decoded
        labels = batch["labels"].masked_fill(batch["labels"] == -100, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        predictions.extend(summaries)
        references.extend(decoded_labels)



# Compute BLEU score
bleu_score = bleu.compute(predictions=[pred.split() for pred in predictions],
                          references=[[ref.split()] for ref in references])
print("BLEU Score on Test Dataset:", bleu_score)


The repository for bleu contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/bleu.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
BLEU Score on Test Dataset: {'bleu': 0.8368428865767211, 'precisions': [0.8482252141982864, 0.8409836977688755, 0.833343048673856, 0.8249984697312848], 'brevity_penalty': 1.0, 'length_ratio': 1.1326020131396541, 'translation_length': 18791, 'reference_length': 16591}
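
For intuition about the fields in the BLEU output above, here is a simplified corpus-level BLEU in plain Python. It is a sketch that assumes whitespace tokenization, one reference per prediction, and no smoothing, and it is not the implementation behind `load_metric("bleu")`. It computes clipped n-gram precisions for n = 1..4, combines them with a geometric mean, and multiplies by a brevity penalty that only penalizes predictions shorter than their references.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_n=4):
    """Return (bleu, precisions) for parallel lists of prediction/reference strings."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.split(), ref.split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            p_counts, r_counts = ngrams(p, n), ngrams(r, n)
            # Counter intersection clips each n-gram count by its reference count
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0, precisions
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity * math.exp(log_mean), precisions


score, precisions = corpus_bleu(["a b c d e f g"], ["a b c d e"])
print(score, precisions)
```

A prediction longer than its reference keeps a brevity penalty of 1.0, which is the case reported above (length ratio of about 1.13), so extra generated text lowers the score only through the precisions.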


# #8 Inference with the model "after fine-tuning"

The same prompt is run through the fine-tuned model to check how its output has changed.

In [None]:
eval_prompt = """
Summarize this dialog:
Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check\nHannah: <file_gif>\nAmanda: Sorry, can't find it.\nAmanda: Ask Larry\nAmanda: He called her last time we were at the park together\nHannah: I don't know him well\nHannah: <file_gif>\nAmanda: Don't be shy, he's very nice\nHannah: If you say so..\nHannah: I'd rather you texted him\nAmanda: Just text him 🙂\nHannah: Urgh.. Alright\nHannah: Bye\nAmanda: Bye bye
---
Summary:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


Summarize this dialog:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
---
Summary:
Hannah doesn't have Betty's number. Amanda suggests she asks Larry for it.


# #9 Saving the model

After training, the model and tokenizer are saved to Google Drive for later use.



In [None]:
from google.colab import drive

# Mount Google Drive to save the model
drive.mount('/content/drive')
output_dir = "/content/drive/My Drive/final_fine_tuned_model"

# Save the model and tokenizer to Google Drive
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")

Mounted at /content/drive
Model and tokenizer saved to /content/drive/My Drive/final_fine_tuned_model


# #10 Loading the model and inference

To use the fine-tuned model later, load it from Google Drive and run inference on new prompts.

In [1]:
%%capture
# Install necessary packages (the %%capture cell magic must be the first line of the cell)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install py7zr "llama_recipes"
from torch import __version__
from packaging.version import Version as V


# Install xformers and other dependencies
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton


from unsloth import FastLanguageModel
from google.colab import drive



# Mount Google Drive to load the model
drive.mount('/content/drive')
output_dir = "/content/drive/My Drive/final_fine_tuned_model"


# Load the model and tokenizer from Google Drive
model, tokenizer = FastLanguageModel.from_pretrained(output_dir)
print("Model and tokenizer loaded successfully")


In [None]:
import torch
FastLanguageModel.for_inference(model)

eval_prompt = """
Summarize this dialog:
Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check\nHannah: <file_gif>\nAmanda: Sorry, can't find it.\nAmanda: Ask Larry\nAmanda: He called her last time we were at the park together\nHannah: I don't know him well\nHannah: <file_gif>\nAmanda: Don't be shy, he's very nice\nHannah: If you say so..\nHannah: I'd rather you texted him\nAmanda: Just text him 🙂\nHannah: Urgh.. Alright\nHannah: Bye\nAmanda: Bye bye
---
Summary:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))
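
To try other dialogues, the prompt layout used in this notebook can be wrapped in a small helper. This is a sketch: `build_prompt` and `extract_summary_text` are hypothetical names, the template simply mirrors the eval_prompt string above, and the second helper mirrors the extract_summary function from the BLEU step.

```python
def build_prompt(turns):
    """Build a summarization prompt from a list of (speaker, utterance) pairs."""
    dialog = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return f"\nSummarize this dialog:\n{dialog}\n---\nSummary:\n"


def extract_summary_text(generated):
    """Keep only the text that follows the last 'Summary:' marker."""
    return generated.split("Summary:")[-1].strip()


prompt = build_prompt([
    ("Hannah", "Hey, do you have Betty's number?"),
    ("Amanda", "Lemme check"),
])
print(prompt)
```

The returned string can be passed to the tokenizer in place of eval_prompt, and `extract_summary_text` can be applied to the decoded output to drop the echoed dialogue.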