# AI Makerspace Session 2
_Hosted by:_ Enrique Noriega-Atala
_Date:_ 10/11/24

This notebook is an implementation of the following tutorial: https://www.datacamp.com/tutorial/flan-t5-tutorial

In [1]:
%pip install nltk datasets transformers tokenizers evaluate rouge_score sentencepiece huggingface_hub accelerate

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/opt/ohpc/pub/apps/python/3.8.12/bin/python3.8 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import nltk
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [3]:
MODEL_NAME = "google/flan-t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer, model)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
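
The `DataCollatorForSeq2Seq` created in cell 3 pads each batch to the length of its longest example at training time. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are arbitrary examples, and the sketch assumes a pad token id of 0 for inputs and -100 for labels (the value the loss function ignores).

```python
# Illustrative sketch only: `pad_batch` is a hypothetical helper, not the
# transformers implementation of DataCollatorForSeq2Seq.
PAD_TOKEN_ID = 0      # assumed padding id for the inputs
LABEL_PAD_ID = -100   # assumed padding value for labels, ignored by the loss


def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Arbitrary example token IDs for a batch of two examples
input_ids = [[863, 1525, 1], [363, 19, 8, 9753, 1]]
labels = [[27, 857, 1], [94, 1]]

padded_inputs = pad_batch(input_ids, PAD_TOKEN_ID)
padded_labels = pad_batch(labels, LABEL_PAD_ID)

# The attention mask marks real tokens with 1 and padding with 0
longest_input = max(len(seq) for seq in input_ids)
attention_mask = [
    [1] * len(seq) + [0] * (longest_input - len(seq)) for seq in input_ids
]

print(padded_inputs)
print(padded_labels)
print(attention_mask)
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why the tokenizer calls later in the notebook only truncate and do not pad.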


In [4]:
DATA_NAME = "yahoo_answers_qa"
yahoo_answers_qa = load_dataset(DATA_NAME)

In [5]:
yahoo_answers_qa

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'answer', 'nbestanswers', 'main_category'],
        num_rows: 87362
    })
})

In [6]:
yahoo_answers_qa = yahoo_answers_qa['train'].train_test_split(test_size=0.3)
yahoo_answers_qa

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'answer', 'nbestanswers', 'main_category'],
        num_rows: 61153
    })
    test: Dataset({
        features: ['id', 'question', 'answer', 'nbestanswers', 'main_category'],
        num_rows: 26209
    })
})
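
The row counts above follow directly from `test_size=0.3`. As a quick sanity check, the arithmetic can be sketched as below, assuming the test share is rounded up (an assumption that is consistent with the counts shown).

```python
import math

total_rows = 87362     # rows in the original train split
test_fraction = 0.3    # value passed as test_size

# Assumption: the test share is rounded up, which matches the counts above
test_rows = math.ceil(total_rows * test_fraction)
train_rows = total_rows - test_rows

print(train_rows, test_rows)  # 61153 26209
```

Note that the split is random, so the example at a given index (such as row 1256 of the test split) can change between runs. `train_test_split` accepts a `seed` argument if a reproducible split is needed.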

In [7]:
d = yahoo_answers_qa['test'].select([1256])
d['question'], d['answer']

(['What is the measurement of the yeast in a yeast packet?'],
 ['I believe it is slightly less than 1 table spoon but I exchange 1 tablespoon for the packet.  The only possible difference is that the rise might occur slightly faster but not that noticable.'])

In [8]:
# Prefix every question with an instruction so the model knows the task
prefix = "Please answer this question: "

def preprocess_function(examples):
    """Add the prefix to each question, tokenize the text, and set the labels."""
    # The "inputs" are the prefixed questions, tokenized:
    inputs = [prefix + doc for doc in examples["question"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # The "labels" are the tokenized answers:
    labels = tokenizer(text_target=examples["answer"],
                       max_length=512,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

x = preprocess_function(yahoo_answers_qa['test'].select([1256]))
x['input_ids']

[[863,
  1525,
  48,
  822,
  10,
  363,
  19,
  8,
  9753,
  13,
  8,
  17937,
  16,
  3,
  9,
  17937,
  13531,
  58,
  1]]

In [9]:
# Map the preprocessing function across our dataset
tokenized_dataset = yahoo_answers_qa.map(preprocess_function, batched=True)


In [10]:
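nltk.download("punkt", quiet=True)
# Newer NLTK releases load sentence-tokenizer data from "punkt_tab", so fetch it as well
nltk.download("punkt_tab", quiet=True)
metric = evaluate.load("rouge")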

In [11]:
def compute_metrics(eval_preds):
    generations, groundtruth = eval_preds

    # Replace -100 (ignored label positions) with the pad token id, then decode predictions and labels
    groundtruth = np.where(groundtruth != -100, groundtruth, tokenizer.pad_token_id)
    decoded_generations = tokenizer.batch_decode(generations, skip_special_tokens=True)
    decoded_groundtruth = tokenizer.batch_decode(groundtruth, skip_special_tokens=True)

    # rougeLsum expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_generations]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_groundtruth]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
  
    return result
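
To give a feel for what the ROUGE-L score returned above measures, here is a simplified sketch based on the longest common subsequence (LCS) of tokens. It is not the `rouge_score` implementation: it assumes lowercased whitespace tokenization and no stemming, and `lcs_length` and `rouge_l_f1` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1: whitespace tokens, lowercased, no stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)   # share of predicted tokens in the LCS
    recall = lcs / len(ref_tokens)       # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("about one tablespoon of yeast",
                   "slightly less than one tablespoon")
print(round(score, 3))  # 0.4
```

Here the shared subsequence is "one tablespoon" (2 of 5 tokens on each side), so precision, recall, and F1 are all 0.4. Scores from `metric.compute` will differ because of stemming and its own tokenizer.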

In [12]:
# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 3


# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="./results",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False
)
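
Before launching training it can help to estimate how many optimizer steps these settings imply. The sketch below is a back-of-the-envelope calculation that assumes a single device, no gradient accumulation, and that the last partial batch is kept. The actual count reported by the trainer may differ on multi-GPU setups.

```python
import math

train_rows = 61153   # training rows after the split
batch_size = 8       # BATCH_SIZE
num_epochs = 3       # NUM_EPOCHS

# Assumptions: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 7645 22935
```

Because `predict_with_generate=True` runs generation over all 26209 test rows at the end of every epoch, evaluation can take a noticeable share of the total run time.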




In [14]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset["train"],
   eval_dataset=tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
