<a href="https://colab.research.google.com/github/zrghassabi/LLM/blob/main/Chapter5_Solution2_FlanT5_SQuAD_v2_Question_Answering%5B1%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Answering with Flan-T5 and SQuAD v2 Dataset
This notebook demonstrates question answering by fine-tuning a pre-trained Flan-T5 model (`google/flan-t5-base`) on a subset of the SQuAD v2 dataset from Hugging Face.

In [None]:

# Install necessary packages
!pip install transformers datasets tensorflow rouge-score nltk


Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests (from transformers)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)


## Import Libraries
We begin by importing the necessary libraries.

In [None]:

import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq
from datasets import load_dataset


## Load the SQuAD v2 Dataset
Next, we load the SQuAD v2 dataset from Hugging Face.

In [None]:

# Load the SQuAD v2 dataset
dataset = load_dataset("squad_v2")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/8.92k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

## Load the Flan-T5 Tokenizer
We load the pre-trained Flan-T5 tokenizer to process the input text.

In [None]:

# Load the Flan-T5 tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

## Tokenize the Dataset
We define a function to tokenize the dataset and apply it to the SQuAD v2 dataset.

In [None]:

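# Build "context + question" inputs and answer targets, then tokenize both
def preprocess_function(examples):
    inputs = [context + " question: " + question for question, context in zip(examples["question"], examples["context"])]
    # Unanswerable SQuAD v2 questions have no answer text, so their target is the empty string
    targets = [answer['text'][0] if len(answer['text']) > 0 else "" for answer in examples["answers"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding='max_length')

    # Tokenize the targets; `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=targets, max_length=512, truncation=True, padding='max_length')

    model_inputs["labels"] = labels["input_ids"]
    # Decoder inputs are the labels shifted right by one position, starting with
    # the decoder start token (for T5 this is the pad token). Feeding the unshifted
    # labels would let the decoder see the token it is asked to predict.
    start_id = tokenizer.pad_token_id
    model_inputs["decoder_input_ids"] = [[start_id] + ids[:-1] for ids in labels["input_ids"]]
    return model_inputs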

train_dataset = dataset["train"].select(range(25000)).map(preprocess_function, batched=True)
validation_dataset = dataset["validation"].select(range(2000)).map(preprocess_function, batched=True)


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]



Map:   0%|          | 0/2000 [00:00<?, ? examples/s]
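
To make the input and target format concrete, here is a small sketch that applies the same string formatting as `preprocess_function` to a toy batch. The two records are invented for illustration; only the column names (`question`, `context`, `answers`) follow the SQuAD v2 layout. Unanswerable questions have an empty `answers["text"]` list, so their target is the empty string.

```python
# Toy batch in the column-oriented layout that `map(..., batched=True)` passes in.
# The records are hypothetical, not taken from SQuAD v2.
toy_batch = {
    "question": ["Where is the Eiffel Tower?", "Who built the tower in 1700?"],
    "context": ["The Eiffel Tower is in Paris.", "The Eiffel Tower is in Paris."],
    "answers": [
        {"text": ["Paris"], "answer_start": [23]},
        {"text": [], "answer_start": []},  # unanswerable question
    ],
}

def build_pairs(examples):
    """Return (input strings, target strings) using the notebook's format."""
    inputs = [c + " question: " + q for q, c in zip(examples["question"], examples["context"])]
    targets = [a["text"][0] if len(a["text"]) > 0 else "" for a in examples["answers"]]
    return inputs, targets

inputs, targets = build_pairs(toy_batch)
for source, target in zip(inputs, targets):
    print(repr(source), "->", repr(target))
```

Because the question is appended after the context, a context long enough to hit the 512-token limit would have its question truncated; placing the question first is one possible way to avoid that.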

## Convert to TensorFlow Dataset
We convert the tokenized dataset to a format that can be used with TensorFlow.

In [None]:

# Data collator; return NumPy arrays, since the default ("pt") produces PyTorch tensors
data_collator = DataCollatorForSeq2Seq(tokenizer, model=None, return_tensors="np")

# Convert the tokenized dataset to a TensorFlow dataset
train_dataset = train_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "decoder_input_ids"],
    label_cols="labels",
    shuffle=True,
    batch_size=64,
    collate_fn=data_collator
)

validation_dataset = validation_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "decoder_input_ids"],
    label_cols="labels",
    shuffle=False,
    batch_size=64,
    collate_fn=data_collator
)
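
The `decoder_input_ids` column carries what the decoder sees during training. Under teacher forcing, these are normally the labels shifted one position to the right, starting with the decoder start token, so that the decoder predicts each token only from the tokens before it. The sketch below shows that shift on plain Python lists. It assumes a start token id of 0, which is T5's pad token id; the example ids are made up.

```python
def shift_right(label_ids, start_id=0):
    """Prepend the start token and drop the last label (teacher forcing)."""
    return [start_id] + label_ids[:-1]

# Hypothetical label ids: one answer token, the end-of-sequence id (1), then padding (0)
labels = [1410, 1, 0, 0]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)  # [0, 1410, 1, 0]
```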




## Load and Configure the Flan-T5 Model
We load the pre-trained Flan-T5 model and configure it for sequence-to-sequence learning.

In [None]:

# Load the pre-trained Flan-T5 model
model = TFAutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")


All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Freeze the First 3 Layers
We freeze the first three layers (the shared embedding, the encoder, and the decoder) so that training updates only the language-modeling head (`lm_head`).

In [None]:

# Freeze the first 3 layers (embedding, encoder, decoder)
for layer in model.layers[:3]:
    layer.trainable = False


In [None]:
model.summary()

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  109628544 
                                                                 
 decoder (TFT5MainLayer)     multiple                  137949312 
                                                                 
 lm_head (Dense)             multiple                  24674304  
                                                                 
Total params: 247577856 (944.43 MB)
Trainable params: 24674304 (94.12 MB)
Non-trainable params: 222903552 (850.31 MB)
_________________________________________________________________
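
The totals in the summary can be checked by hand. The per-layer counts do not simply add up to the total; the numbers are consistent with the shared embedding matrix being counted inside both the encoder and decoder rows. The sketch below redoes the arithmetic using only the figures printed above, plus the assumption that `flan-t5-base` has a vocabulary of 32,128 entries and a hidden size of 768.

```python
# Parameter counts copied from the model summary above.
shared = 24_674_304
encoder_row = 109_628_544
decoder_row = 137_949_312
lm_head = 24_674_304

# Assumed configuration values for flan-t5-base (not printed in the notebook).
vocab_size, d_model = 32_128, 768
assert vocab_size * d_model == shared

# Remove the shared embedding from the encoder and decoder rows to avoid double counting.
encoder_own = encoder_row - shared
decoder_own = decoder_row - shared

total = shared + encoder_own + decoder_own + lm_head
frozen = shared + encoder_own + decoder_own
trainable_fraction = lm_head / total

print(total)                        # 247577856
print(frozen)                       # 222903552
print(f"{trainable_fraction:.2%}")  # 9.97%
```

Under this reading, roughly one tenth of the parameters remain trainable after freezing.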


## Compile the Model
We compile the model with an Adam optimizer and a sparse categorical cross-entropy loss computed on the logits.

In [None]:

# Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-2),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
)
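
With `from_logits=True`, the loss applies a softmax to the raw model outputs and takes the negative log-probability of the correct token, averaged over positions. A minimal pure-Python sketch of that calculation for a single short sequence follows; the logits and labels are made-up numbers over a four-token vocabulary, not model outputs.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Negative log-softmax of `logits` at index `label` (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

# Two decoding steps over a hypothetical 4-token vocabulary
step_logits = [
    [2.0, 0.5, 0.1, -1.0],  # confident and correct: low loss
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction: loss equals log(4)
]
step_labels = [0, 3]

losses = [sparse_ce_from_logits(l, y) for l, y in zip(step_logits, step_labels)]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```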


## Train the Model
We train the model using the training dataset and validate it using the validation dataset.

In [None]:

# Train the model
model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=3
)


Epoch 1/3


Cause: for/else statement not yet supported


Cause: for/else statement not yet supported
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7ec282827f40>

## Save the Model
We save the trained model for future use.

In [None]:

# Save the model
model.save_pretrained("./flan-t5-squad-v2")


## Evaluate the Model
Finally, we check the fine-tuned model on the validation dataset by generating an answer for one example and scoring it against the reference with ROUGE-1, ROUGE-2, and ROUGE-L.

In [None]:
from rouge_score import rouge_scorer
import nltk
nltk.download('punkt')

def generate_answer(batch):
    # `batch` is a (features, labels) pair; generate for the first example only
    features = batch[0]
    outputs = model.generate(
        features["input_ids"][:1],
        attention_mask=features["attention_mask"][:1],
        max_length=128, num_beams=4, early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Function to calculate ROUGE scores
def calculate_scores(reference, hypothesis):
    # Initialize the scorer
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    # Calculate ROUGE scores (the scorer takes the reference first, then the prediction)
    rouge_scores = rouge.score(reference, hypothesis)

    return rouge_scores

# Generate an answer for the first validation example and score it
batch = next(iter(validation_dataset))
predicted_answer = generate_answer(batch)
reference_text = tokenizer.decode(batch[1][0], skip_special_tokens=True)
rouge_scores = calculate_scores(reference_text, predicted_answer)
print(f"Reference: {reference_text}")
print(f"Prediction: {predicted_answer}")
print(f"ROUGE Scores: {rouge_scores}")
print()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Reference: France
Prediction: France
ROUGE Scores: {'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
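
The ROUGE-2 score of 0.0 in the output above is expected for a one-word answer: a single token contains no bigrams, so there is nothing to overlap. The simplified sketch below computes ROUGE-N with whitespace tokenization and lowercasing. It is an illustration only; the `rouge_score` package applies its own tokenizer and optional stemming, and the zero-when-empty convention here is chosen to match the output shown above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, prediction, n):
    """Return (precision, recall, F1) of n-gram overlap between two strings."""
    ref = ngrams(reference.lower().split(), n)
    pred = ngrams(prediction.lower().split(), n)
    overlap = sum((ref & pred).values())  # clipped n-gram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

print(rouge_n("France", "France", 1))  # (1.0, 1.0, 1.0)
print(rouge_n("France", "France", 2))  # (0.0, 0.0, 0.0)
```

A single example says little about overall quality; averaging a score over many validation examples would give a more representative picture.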

