<a href="https://colab.research.google.com/github/shiftkey-labs/GenAI-Course/blob/main/week_3/Lab3_GenAI_ShiftKey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ShiftKey GenAI Certification Lab3 - Fine Tuning Flan T5

## Overview
In this lab, we fine-tune a pre-trained text generation model with the Hugging Face `transformers` library. The model is Google's FLAN-T5, an instruction-tuned variant of T5 suited to sequence-to-sequence tasks such as text summarization. We walk through each step, from loading the model and dataset to fine-tuning the model and using it to generate summaries. Finally, we will push the fine-tuned model to the Hugging Face Model Hub.

## Table of Contents:
1. Installing Necessary Libraries
2. Loading the Model and Tokenizer
3. Setting up the Device
4. Testing the Model for Summarization Before Fine-Tuning
5. Loading the Dataset
6. Splitting the Dataset
7. Preprocessing the Dataset
8. Tokenizing the Datasets
9. Setting Training Arguments
10. Creating the Trainer
11. Training the Model
12. Evaluating the Model
13. Summarization Function
14. Testing the Summarizer


## Installing Necessary Libraries
We need to install the Hugging Face `transformers` and `datasets` libraries. These libraries provide pre-trained models and a variety of datasets that we can easily use for fine-tuning tasks.

In [None]:
!pip install transformers datasets

## Loading the Model and Tokenizer
In this section, we load the FLAN-T5 small model, a sequence-to-sequence model suited to text summarization, translation, and other text generation tasks.

* `AutoTokenizer` loads the tokenizer corresponding to the model. The tokenizer will convert raw text to input tokens the model can understand.

* `AutoModelForSeq2SeqLM` loads a pre-trained model that is specifically designed for sequence-to-sequence tasks like summarization.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Define the model name and load the tokenizer and model
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
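
To make the tokenizer's role concrete, here is a toy sketch of the text-to-IDs round trip. It is not FLAN-T5's tokenizer, which is a subword model loaded by `AutoTokenizer`; the vocabulary `TOY_VOCAB` and the helpers `toy_encode` and `toy_decode` are hypothetical names invented for this illustration.

```python
# Toy illustration only: a hypothetical whitespace tokenizer with a tiny vocabulary.
# The special-token IDs (pad=0, end-of-sequence=1, unknown=2) mirror T5's convention.
TOY_VOCAB = {"<pad>": 0, "</s>": 1, "<unk>": 2, "summarize": 3, "this": 4, "text": 5}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def toy_encode(text):
    """Map each lowercased word to its ID, then append the end-of-sequence ID."""
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]
    return ids + [TOY_VOCAB["</s>"]]


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to words, optionally dropping padding and end-of-sequence."""
    special = {"<pad>", "</s>"}
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special]
    return " ".join(tokens)


encoded = toy_encode("Summarize this text")
decoded = toy_decode(encoded)
print(encoded)
print(decoded)
```

The real tokenizer follows the same pattern: calling it produces `input_ids`, and `tokenizer.decode(..., skip_special_tokens=True)` turns generated IDs back into text.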

## Setting up the Device
Before we can use the model for training, we need to check whether a GPU is available. GPUs (Graphics Processing Units) are much faster than CPUs for model training tasks, so we should use one if possible.

* We check whether a CUDA-enabled GPU is available. If so, we use the GPU (`cuda`); otherwise, we fall back to the CPU (`cpu`).

* The model is then transferred to the appropriate device using `.to(device)`.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Print out which device we're using (GPU or CPU)
print(device)

## Testing the Model for Summarization Before Fine-Tuning
Before we fine-tune the model, let’s test how well the pre-trained model performs on a sample text. This will give us a baseline performance to compare with the fine-tuned model later.

In [None]:
def summarize(text):
  inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)
  return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [None]:
# Define a sample text for summarization
sample_text = """
Person A: Hey, did you hear about the new project management software our company is planning to implement?

Person B: Yeah, I heard a bit about it. What’s the deal with it?

Person A: It’s called "TaskFlow." The management thinks it’s going to streamline our workflow, especially with remote teams. It’s supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform.

Person B: That sounds interesting. But I’m a bit concerned about the learning curve. Is it user-friendly?

Person A: From what I’ve seen, it looks pretty intuitive. They’re also planning to run a couple of training sessions to get everyone up to speed. The first one is next Monday.

Person B: Okay, that helps. I guess I’ll have to attend that session. How does it compare to what we’re using now?

Person A: It’s supposed to be much more efficient. We’ll be able to track project progress more easily and get real-time updates. Plus, it has built-in analytics to help us with performance tracking.

Person B: That sounds promising. I just hope it doesn’t come with too many bugs at launch.

Person A: Yeah, that’s always a concern with new software. But they’ve been testing it for a while now, so fingers crossed it goes smoothly.

Person B: Let’s hope for the best. Thanks for the info!

Person A: No problem. See you at the training!
"""

# Summarize the sample text using the pre-trained model (without fine-tuning)
pre_finetuned_summary = summarize(sample_text)
print("Summary before fine-tuning:", pre_finetuned_summary)

## Loading the Dataset
Here, we load `cnn_dailymail`, a dataset frequently used for summarization tasks. It contains news articles paired with human-written summaries (the `highlights` field), which makes it well suited to training a model to summarize text.

The `load_dataset` function from the `datasets` library is used to load the `cnn_dailymail` dataset, specifying version `3.0.0`.

In [None]:
from datasets import load_dataset

# Load the CNN/DailyMail dataset, which contains articles and summaries
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

## Splitting the Dataset
In this step, we split the dataset into training and evaluation sets. Training is done on one subset, while the other is used for evaluation (testing) to see how well the model generalizes to unseen data.

* `train_test_split` divides the dataset into training and testing subsets.
* We then keep only 1% of the training subset (`test_size=0.99` sets the rest aside) so that training finishes quickly.

In [None]:
# Split the dataset into training and testing subsets
dataset_split = dataset.train_test_split(test_size=0.1)

# Further reduce the training set size for faster testing during development
small_train_dataset = dataset_split['train'].train_test_split(test_size=0.99)['train']
eval_dataset = dataset_split['test']
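
The arithmetic behind the two splits can be sketched as follows. This is an estimate under stated assumptions: the train split is taken to hold 287,113 articles, and the test count is rounded up in the style of scikit-learn. The helper `split_sizes` is hypothetical, and the counts produced by `datasets` may differ by one.

```python
import math


def split_sizes(n_examples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up; the library's own rounding
    may differ by one example.
    """
    n_test = math.ceil(n_examples * test_size)
    return n_examples - n_test, n_test


# Assumed size of the CNN/DailyMail 3.0.0 train split (about 287k articles).
n_total = 287_113

n_train, n_eval = split_sizes(n_total, test_size=0.1)
n_small_train, n_set_aside = split_sizes(n_train, test_size=0.99)

print(f"train after first split: {n_train}")
print(f"evaluation set:          {n_eval}")
print(f"small training set:      {n_small_train}")
```

Under these assumptions the evaluation set is roughly ten times larger than the reduced training set, so evaluation can take longer than training. Subsampling `eval_dataset` the same way is one option if that becomes a bottleneck.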

## Preprocessing the Dataset
Before feeding the data into the model, we need to tokenize it. The `preprocess_function` defined in this section tokenizes both the input text (the news articles) and the target text (the summaries).

* The `inputs` are tokenized and truncated to a maximum length of 512 tokens, which ensures they fit within the model’s input constraints.

* The target (summary) is tokenized separately. In this case, we use a maximum length of 128 tokens for the target sequence.

* Padding positions in the labels are replaced with `-100` so that the loss ignores them. The tokenized data is returned as plain Python lists; the trainer builds the tensors and moves each batch to the appropriate device (GPU or CPU) during training.

In [None]:
def preprocess_function(examples):
  # Extract the articles from the dataset
  inputs = examples['article']

  # Tokenize the articles (inputs) with padding and truncation to a max length of 512
  model_inputs = tokenizer(inputs, max_length=512, padding="max_length", truncation=True)

  # Tokenize the summaries (labels); text_target replaces the deprecated as_target_tokenizer() context
  labels = tokenizer(text_target=examples['highlights'], max_length=128, padding="max_length", truncation=True)

  # Replace padding token IDs in the labels with -100 so the loss ignores them
  model_inputs["labels"] = [
    [(token if token != tokenizer.pad_token_id else -100) for token in label]
    for label in labels["input_ids"]
  ]

  # No .to(device) here: map() stores plain lists, and the trainer moves batches to the GPU/CPU
  return model_inputs
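
To show what truncation and `padding="max_length"` do to a sequence, here is a simplified pure-Python sketch. The helpers `pad_or_truncate` and `mask_label_padding` and the token IDs are hypothetical; the real tokenizer also handles special tokens during truncation. Setting padded label positions to `-100` is a common convention, because PyTorch's cross-entropy loss skips that value by default.

```python
PAD_ID = 0            # T5-style tokenizers use 0 as the padding ID
IGNORE_INDEX = -100   # label value ignored by PyTorch's cross-entropy loss by default


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut ids to max_length, right-pad the rest, and build the attention mask."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


def mask_label_padding(label_ids, pad_id=PAD_ID):
    """Replace padding IDs in the labels so they do not contribute to the loss."""
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]


# Made-up token IDs, for illustration only
input_ids, attention_mask = pad_or_truncate([37, 1712, 19, 1], max_length=6)
padded_labels, _ = pad_or_truncate([363, 1], max_length=4)
labels = mask_label_padding(padded_labels)

print(input_ids)
print(attention_mask)
print(labels)
```

Padding every example to the full 512 and 128 tokens keeps shapes uniform but wastes computation on short texts; padding per batch with a data collator is a common alternative.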

## Tokenizing the Datasets
We apply the preprocessing function to both the training and evaluation datasets. The `.map()` method applies the function to every element in the dataset.

Tokenization is done in batches to speed up the process.

In [None]:
# Tokenize the small training dataset
tokenized_train_dataset = small_train_dataset.map(preprocess_function, batched=True)

# Tokenize the evaluation dataset
tokenized_eval_dataset = eval_dataset.map(preprocess_function, batched=True)
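
With `batched=True`, the mapped function receives a dictionary of column names to lists of values rather than a single example, and returns new columns in the same shape. The following toy sketch mimics that contract in plain Python. `toy_map` and `add_lengths` are hypothetical helpers, not part of `datasets`, which by default uses batches of 1,000 examples.

```python
def toy_map(rows, function, batch_size=2):
    """Hypothetical stand-in for Dataset.map(function, batched=True)."""
    output_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched mode passes a dict of column name -> list of values
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        # New columns are merged with the existing ones
        merged = {**batch, **result}
        for i in range(len(chunk)):
            output_rows.append({key: values[i] for key, values in merged.items()})
    return output_rows


def add_lengths(batch):
    """Return one new column with the word count of each article in the batch."""
    return {"n_words": [len(text.split()) for text in batch["article"]]}


rows = [{"article": "a b c"}, {"article": "d e"}, {"article": "f"}]
mapped = toy_map(rows, add_lengths)
print(mapped)
```

This is why `preprocess_function` can pass `examples['article']` directly to the tokenizer: it is already a list of strings.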

## Setting Training Arguments
The next step is to define the training parameters. We use the `Seq2SeqTrainingArguments` class from the `transformers` library to specify key settings such as learning rate, batch size, and the number of epochs.

* `evaluation_strategy="epoch"` ensures the model is evaluated after every epoch. (Recent releases of `transformers` rename this argument to `eval_strategy`.)

* `num_train_epochs=3` sets the number of times the model will go through the entire training dataset.

* `predict_with_generate=True` tells the trainer to use `generate()` when producing predictions during evaluation, so that metrics can be computed on generated summaries.

In [None]:
from transformers import Seq2SeqTrainingArguments

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',              # Directory to save the model checkpoints
    evaluation_strategy="epoch",         # Evaluate the model at the end of every epoch
    learning_rate=2e-5,                  # Learning rate for the optimizer
    per_device_train_batch_size=8,       # Batch size for training
    per_device_eval_batch_size=8,        # Batch size for evaluation
    weight_decay=0.01,                   # Regularization to prevent overfitting
    save_total_limit=3,                  # Only keep the last 3 checkpoints
    num_train_epochs=3,                  # Number of training epochs
    predict_with_generate=True,          # Enable text generation during evaluation
    logging_dir="./logs"                 # Directory for storing training logs
)
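
As a rough guide to what these settings imply, the number of optimizer steps can be estimated from the dataset size, the batch size, and the epoch count. The sketch below assumes a single device, no gradient accumulation, and about 2,584 training examples (1% of the training split, an estimate). `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(n_examples, batch_size, epochs):
    """Estimate steps per epoch and in total (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Assumed training set size; the batch size and epochs match the arguments above
steps_per_epoch, total_steps = total_optimizer_steps(n_examples=2584, batch_size=8, epochs=3)
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
```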

## Creating the Trainer
The `Seq2SeqTrainer` class handles the entire training and evaluation loop. We provide it with the model, the training arguments, and the datasets.

This class simplifies the training process and abstracts away the complexities of manually writing the training loop.

In [None]:
from transformers import Seq2SeqTrainer

# Create the trainer object
trainer = Seq2SeqTrainer(
    model=model,                         # The model to be trained
    args=training_args,                  # The training arguments defined earlier
    train_dataset=tokenized_train_dataset,  # The tokenized training dataset
    eval_dataset=tokenized_eval_dataset,    # The tokenized evaluation dataset
    tokenizer=tokenizer                  # The tokenizer to handle input and output
)

## Training the Model
Now, we start the training process. The trainer will iterate over the training data for the specified number of epochs.

This is the step where the model learns from the data and fine-tunes itself based on the task.

In [None]:
# Let's train
trainer.train()

## Evaluating the Model
After the training is complete, we evaluate the model on the evaluation dataset. This step measures how well the model generalizes to unseen data.

Because no `compute_metrics` function was passed to the trainer, the returned metrics are the evaluation loss plus runtime statistics. A summarization metric such as ROUGE has to be computed separately.

In [None]:
# Evaluate the model on the evaluation dataset
metrics = trainer.evaluate()

# Print the evaluation metrics
print(metrics)
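
Summary quality is usually reported with an overlap metric such as ROUGE rather than with the loss alone. As a minimal sketch of the idea, the function below computes a simplified unigram-overlap F1 score in the spirit of ROUGE-1. It is a hypothetical helper that splits on whitespace and applies no stemming, so its scores will not match an official ROUGE implementation exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up prediction and reference, for illustration only
score = rouge1_f1(
    "the company adopts new software",
    "the company will adopt new software for projects",
)
print(round(score, 3))
```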

## Summarization Function
To use the fine-tuned model, we define a function that takes in text and generates a summary. This function tokenizes the input text and uses the `model.generate()` function to create the summary.

`num_beams=4` specifies the number of beams used for beam search, which helps in generating better summaries by exploring multiple possibilities.

In [None]:
def summarize(text):
  # Tokenize the input text and move it to the correct device
  inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)

  # Generate the summary using the fine-tuned model
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)

  # Decode the generated summary back into text and return it
  return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
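
To illustrate why beam search can beat greedy decoding, here is a toy sketch over a hand-written table of next-token probabilities. `TOY_MODEL` and `beam_search` are hypothetical; the real `model.generate()` scores tokens with the neural network and applies further rules such as length penalties and early stopping.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"team": 0.55, "tool": 0.45},
    "a": {"tool": 0.9, "team": 0.1},
    "team": {"</s>": 1.0},
    "tool": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=4):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (sum of log-probabilities, tokens so far)
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished beams carry over
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]


greedy = beam_search(num_beams=1)
wider = beam_search(num_beams=2)
print("num_beams=1:", " ".join(greedy))
print("num_beams=2:", " ".join(wider))
```

With one beam the search commits to the locally best first token and ends with a sequence of probability 0.33. With two beams it keeps the second-best first token alive and finds a sequence of probability 0.36.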

## Testing the Summarizer
Finally, we run the summarizer on the same sample dialogue used for the baseline, so the output of the fine-tuned model can be compared directly with the summary produced before fine-tuning.

In [None]:
print(summarize(
    """
Person A: Hey, did you hear about the new project management software our company is planning to implement?

Person B: Yeah, I heard a bit about it. What’s the deal with it?

Person A: It’s called "TaskFlow." The management thinks it’s going to streamline our workflow, especially with remote teams. It’s supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform.

Person B: That sounds interesting. But I’m a bit concerned about the learning curve. Is it user-friendly?

Person A: From what I’ve seen, it looks pretty intuitive. They’re also planning to run a couple of training sessions to get everyone up to speed. The first one is next Monday.

Person B: Okay, that helps. I guess I’ll have to attend that session. How does it compare to what we’re using now?

Person A: It’s supposed to be much more efficient. We’ll be able to track project progress more easily and get real-time updates. Plus, it has built-in analytics to help us with performance tracking.

Person B: That sounds promising. I just hope it doesn’t come with too many bugs at launch.

Person A: Yeah, that’s always a concern with new software. But they’ve been testing it for a while now, so fingers crossed it goes smoothly.

Person B: Let’s hope for the best. Thanks for the info!

Person A: No problem. See you at the training!
"""
))