<a href="https://colab.research.google.com/github/sriksmachi/sriksml/blob/main/language-models/llm_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine tuning FLAN T5

This notebook contains code artifacts for fine tuning Flan-T5-small model. The notebook does the following tasks.

1. Use a pre-trained google/flan-t5-small as the model.
2. Verify if the summarization task works.
3. Verify if the Q&A task works.
4. Verify if English to French translation task works.
5. Programmatically print the names of all the model layers and their dimensions.
6. Programmatically print the total number of parameters/weights in this model.
7. Set the tensor in final layer (decoder.final_layer_norm.weight) to all zeros.
8. Verify if the Q&A task works aWer reseXng the weights of the above layer.
9. Replace the decoder.final_layer_norm.weight with a layer of smaller dimensions and adjust all the dependent layers to match the dimension
10. Reload the original google/flan-t5-small model.
11. Train the model for a Q&A task that takes a context as additional input along with the queson. You can use SQuAD dataset. Choose an
appropriate task prefix/trigger word and justify the choice.
12. Evaluate the quality of the model

Paper: https://arxiv.org/abs/2210.11416 </br>
Official repo: https://github.com/google-research/t5x

In [1]:
%%bash
pip install -q transformers[torch]
pip install -q sentencepiece
pip install -q datasets
pip install -q tokenizers
pip install -q evaluate
pip install -q rouge_score
pip install -q nltk

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 261.4/261.4 kB 4.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 15.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 521.2/521.2 kB 8.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.3/115.3 kB 16.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 17.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 2.2 MB/s eta 0:00:00


## Import Libraries

In [2]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import pandas as pd
import warnings
import torch
from torch import nn
import nltk
import evaluate
import numpy as np
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset

warnings.filterwarnings('ignore')


In [3]:
from transformers.models.blip_2.modeling_blip_2 import AutoModelForSeq2SeqLM
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

## Metrics

In [4]:
import nltk
import evaluate
nltk.download('punkt', quiet=True)

# loading Rouge
rogue_metric = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

## Text Summarization

In [5]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
dataset

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
sentence = "How does summarization work ?"
sentence_encoded = tokenizer(sentence, return_tensors='pt')
print(sentence_encoded)
sentence_decoded = tokenizer.decode(sentence_encoded["input_ids"][0])
print(sentence_decoded)

{'input_ids': tensor([[ 571,  405, 4505, 1635, 1707,  161,    3,   58,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
How does summarization work?</s>


In [7]:
import random
random_idx = random.randint(0, len(dataset["train"]))
print("Dialogue")
sentence = dataset["train"][random_idx]["dialogue"]
print(sentence)
print("====="*10)
print("Baseline Summary")
baseline_summary = dataset["train"][random_idx]["summary"]
print(baseline_summary)
print("====="*10)
sentence_encoded = tokenizer(sentence, return_tensors='pt')
pred_summary = model.generate(sentence_encoded['input_ids'], max_new_tokens=50)
sentence_decoded = tokenizer.decode(pred_summary[0], skip_special_tokens=True)
print("Generated Summary without prompt")
print(sentence_decoded)
print("====="*10)
rouge_score = rogue_metric.compute(predictions=[sentence_decoded], references=[baseline_summary])
print(rouge_score)

Dialogue
#Person1#: Well, what did you think about the last candidate? Do you think we should hire her?
#Person2#: She had a very impressive resume, but she seemed to lack the confidence that I think a good manager needs.
#Person1#: What made you think that she wasn't very confident?
#Person2#: Did you notice the way that she avoided making eye contact with us while she talked?
#Person1#: She was a bit nervous, I guess. What else?
#Person2#: When she first walked into the room to greet us, she didn't shake our hands or introduce herself at all. I thought that was a bit unprofessional.
#Person1#: You're right. If she walked into meeting with our clients like that, it would make our company look bad, wouldn't it?
#Person2#: It sure would. Did you also notice the way she slouched in her chair during most of the interview? She had horrible posture!
#Person1#: I agree. I guess I was paying more attention to her answers than her body language.
#Person2#: On top of that, she didn't seem to ha

In [8]:
prompt = f"You are a sentence summarization bot. Please summarize the conversation.Dialogue:\n{sentence}.\n"
print(prompt)
sentence_encoded = tokenizer(prompt, return_tensors='pt')
pred_summary = model.generate(sentence_encoded['input_ids'],max_new_tokens=50)
sentence_decoded = tokenizer.decode(pred_summary[0], skip_special_tokens=True)
print("Generated Summary with prompt")
print(sentence_decoded)
rouge_score = rogue_metric.compute(predictions=[sentence_decoded], references=[baseline_summary])
print(rouge_score)

You are a sentence summarization bot. Please summarize the conversation.Dialogue:
#Person1#: Well, what did you think about the last candidate? Do you think we should hire her?
#Person2#: She had a very impressive resume, but she seemed to lack the confidence that I think a good manager needs.
#Person1#: What made you think that she wasn't very confident?
#Person2#: Did you notice the way that she avoided making eye contact with us while she talked?
#Person1#: She was a bit nervous, I guess. What else?
#Person2#: When she first walked into the room to greet us, she didn't shake our hands or introduce herself at all. I thought that was a bit unprofessional.
#Person1#: You're right. If she walked into meeting with our clients like that, it would make our company look bad, wouldn't it?
#Person2#: It sure would. Did you also notice the way she slouched in her chair during most of the interview? She had horrible posture!
#Person1#: I agree. I guess I was paying more attention to her answers

## Q & A Task

In [9]:
squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)
squad

Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1000
    })
})

In [10]:
squad["train"][0]

{'id': '56cebf6aaab44d1400b889a4',
 'title': '2008_Sichuan_earthquake',
 'context': "General Secretary and President Hu Jintao announced that the disaster response would be rapid. Just 90 minutes after the earthquake, Premier Wen Jiabao, who has an academic background in geomechanics, flew to the earthquake area to oversee the rescue work. Soon afterward, the Ministry of Health said that it had sent ten emergency medical teams to Wenchuan County. On the same day, the Chengdu Military Region Command dispatched 50,000 troops and armed police to help with disaster relief work in Wenchuan County. However, due to the rough terrain and close proximity of the quake's epicenter, the soldiers found it very difficult to get help to the rural regions of the province.",
 'question': 'How many troops were dispatched by the Chengdu military?',
 'answers': {'text': ['50,000'], 'answer_start': [430]}}

In [11]:
random_idx = random.randint(0, len(squad["train"]))
question = squad["train"][random_idx]["question"]
context = squad["train"][random_idx]["context"]
title = squad["train"][random_idx]["title"]
baseline_answer = squad["train"][random_idx]["answers"]
print("Context:")
print(context)
print("====="*10)
prompt = f"Please answer a question about the following article about {title}:\n\n{context}\n\n{question}"
print(prompt + question)
print("====="*10)
print("Baseline answer:")
print(baseline_answer['text'])
print("====="*10)
sentence_encoded = tokenizer(prompt + question, return_tensors='pt', max_length=256, truncation=True)
pred_answer = model.generate(sentence_encoded['input_ids'], max_new_tokens=100)
sentence_decoded = tokenizer.decode(pred_answer[0], skip_special_tokens=True)
print("Generated Answer")
print(sentence_decoded)
print("====="*10)
rouge_score = rogue_metric.compute(predictions=[sentence_decoded], references=[baseline_answer['text']])
print(rouge_score)

Context:
As of 2012[update] research continued in many fields. The university president, John Jenkins, described his hope that Notre Dame would become "one of the pre–eminent research institutions in the world" in his inaugural address. The university has many multi-disciplinary institutes devoted to research in varying fields, including the Medieval Institute, the Kellogg Institute for International Studies, the Kroc Institute for International Peace studies, and the Center for Social Concerns. Recent research includes work on family conflict and child development, genome mapping, the increasing trade deficit of the United States with China, studies in fluid mechanics, computational science and engineering, and marketing trends on the Internet. As of 2013, the university is home to the Notre Dame Global Adaptation Index which ranks countries annually based on how vulnerable they are to climate change and how prepared they are to adapt.
Please answer a question about the following arti

## Language Translation

In [12]:
prompt = "Translate the following sentence to french \n It is a wonderful day"
inputs = tokenizer(prompt, return_tensors='pt')
print(prompt)
translated = model.generate(inputs["input_ids"], max_new_tokens=50)
pred = tokenizer.decode(translated[0], skip_special_tokens=True)
print(f"Predicted sentence: {pred}")
rogue_metric.compute(predictions=[pred], references=["C'est une journée merveilleuse"])

Translate the following sentence to french 
 It is a wonderful day
Predicted sentence: Il est un jour merveilleuse


{'rouge1': 0.3636363636363636,
 'rouge2': 0.0,
 'rougeL': 0.3636363636363636,
 'rougeLsum': 0.3636363636363636}

## Describe the model

In [13]:
# print(model)

In [14]:
def format_number_to_millions(number):
  return f'{number / 1_000_000:.2f}M'

def print_parameters_summary():
  total_params = 0
  trainable_params = 0
  params = model.named_parameters()
  for name, param in params:
      if param.requires_grad:
          trainable_params += param.numel()
      total_params += param.numel()
  return total_params, trainable_params

total_params, trainable_params = print_parameters_summary()
print(f'Total params: {format_number_to_millions(total_params)}')
print(f'Trainable params: {format_number_to_millions(total_params)}')

Total params: 76.96M
Trainable params: 76.96M


In [15]:
dict = {}
for name, param in model.named_parameters():
  dict[name] = param.shape
df = pd.DataFrame.from_dict(dict, orient='index')
df

Unnamed: 0,0,1
shared.weight,32128,512.0
encoder.block.0.layer.0.SelfAttention.q.weight,384,512.0
encoder.block.0.layer.0.SelfAttention.k.weight,384,512.0
encoder.block.0.layer.0.SelfAttention.v.weight,384,512.0
encoder.block.0.layer.0.SelfAttention.o.weight,512,384.0
...,...,...
decoder.block.7.layer.2.DenseReluDense.wi_1.weight,1024,512.0
decoder.block.7.layer.2.DenseReluDense.wo.weight,512,1024.0
decoder.block.7.layer.2.layer_norm.weight,512,
decoder.final_layer_norm.weight,512,


## Setting normalization weights to Zero

In [16]:
model.decoder.final_layer_norm.weight = nn.Parameter(torch.zeros(512))
model.decoder.final_layer_norm.weight[:5]

tensor([0., 0., 0., 0., 0.], grad_fn=<SliceBackward0>)

In [17]:
random_idx = random.randint(0, len(squad["train"]))
question = squad["train"][random_idx]["question"]
context = squad["train"][random_idx]["context"]
baseline_answer = squad["train"][random_idx]["answers"]
print("Context:")
print(context)
print("====="*10)
prompt = f"given the context, answer the question in few sentences .\n context\n {context} Question\n {question}"
print(prompt + question)
print("====="*10)
print("Baseline answer:")
print(baseline_answer)
print("====="*10)
sentence_encoded = tokenizer(prompt + question, return_tensors='pt', max_length=256, truncation=True)
pred_answer = model.generate(sentence_encoded['input_ids'], max_new_tokens=20)
sentence_decoded = tokenizer.decode(pred_answer[0], skip_special_tokens=False)
print("Generated Answer")
print(sentence_decoded)
print("====="*10)
rouge_score = rogue_metric.compute(predictions=[sentence_decoded], references=[baseline_answer['text']])
print(rouge_score)

Context:
Despite being an original story, Spectre draws on Ian Fleming's source material, most notably in the character of Franz Oberhauser, played by Christoph Waltz. Oberhauser shares his name with Hannes Oberhauser, a background character in the short story "Octopussy" from the Octopussy and The Living Daylights collection, and who is named in the film as having been a temporary legal guardian of a young Bond in 1983. Similarly, Charmian Bond is shown to have been his full-time guardian, observing the back story established by Fleming. With the acquisition of the rights to Spectre and its associated characters, screenwriters Neal Purvis and Robert Wade revealed that the film would provide a minor retcon to the continuity of the previous films, with the Quantum organisation alluded to in Casino Royale and introduced in Quantum of Solace reimagined as a division within Spectre rather than an independent organisation.
given the context, answer the question in few sentences .
 context
 

## Change layer dimensions

In [18]:
# model.decoder.final_layer_norm.weight = nn.Parameter(torch.zeros(256))
# model.lm_head.weight = nn.Parameter(torch.zeros(32128, 256))
# for param in model.decoder.final_layer_norm.parameters():
#   param.requires_grad = True

In [19]:
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, d_model=256, ignore_mismatched_sizes=True)
dict = {}
for name, param in model.named_parameters():
  dict[name] = param.shape
df = pd.DataFrame.from_dict(dict, orient='index')
df

Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at google/flan-t5-small and are newly initialized because the shapes did not match:
- decoder.block.0.layer.0.SelfAttention.k.weight: found shape torch.Size([384, 512]) in the checkpoint and torch.Size([384, 256]) in the model instantiated
- decoder.block.0.layer.0.SelfAttention.o.weight: found shape torch.Size([512, 384]) in the checkpoint and torch.Size([256, 384]) in the model instantiated
- decoder.block.0.layer.0.SelfAttention.q.weight: found shape torch.Size([384, 512]) in the checkpoint and torch.Size([384, 256]) in the model instantiated
- decoder.block.0.layer.0.SelfAttention.v.weight: found shape torch.Size([384, 512]) in the checkpoint and torch.Size([384, 256]) in the model instantiated
- decoder.block.0.layer.0.layer_norm.weight: found shape torch.Size([512]) in the checkpoint and torch.Size([256]) in the model instantiated
- decoder.block.0.layer.1.EncDecAttention.k.weight: found sha

Unnamed: 0,0,1
shared.weight,32128,256.0
encoder.block.0.layer.0.SelfAttention.q.weight,384,256.0
encoder.block.0.layer.0.SelfAttention.k.weight,384,256.0
encoder.block.0.layer.0.SelfAttention.v.weight,384,256.0
encoder.block.0.layer.0.SelfAttention.o.weight,256,384.0
...,...,...
decoder.block.7.layer.2.DenseReluDense.wi_1.weight,1024,256.0
decoder.block.7.layer.2.DenseReluDense.wo.weight,256,1024.0
decoder.block.7.layer.2.layer_norm.weight,256,
decoder.final_layer_norm.weight,256,


In [20]:
random_idx = random.randint(0, len(squad["train"]))
question = squad["train"][random_idx]["question"]
context = squad["train"][random_idx]["context"]
baseline_answer = squad["train"][random_idx]["answers"]
print("Context:")
print(context)
print("====="*10)
prompt = f"given the context, answer the question in few sentences .\n context\n {context} Question\n {question}"
print(prompt + question)
print("====="*10)
print("Baseline answer:")
print(baseline_answer)
print("====="*10)
sentence_encoded = tokenizer(prompt + question, return_tensors='pt', max_length=512, truncation=True)
pred_answer = model.generate(sentence_encoded['input_ids'], max_new_tokens=20)
sentence_decoded = tokenizer.decode(pred_answer[0], skip_special_tokens=False)
print("Generated Answer")
print(sentence_decoded)
print("====="*10)
rouge_score = rogue_metric.compute(predictions=[sentence_decoded], references=[baseline_answer['text']])
print(rouge_score)

Context:
Not all reviewers were enthusiastic. Some lamented the use of poor white Southerners, and one-dimensional black victims, and Granville Hicks labeled the book "melodramatic and contrived". When the book was first released, Southern writer Flannery O'Connor commented, "I think for a child's book it does all right. It's interesting that all the folks that are buying it don't know they're reading a child's book. Somebody ought to say what it is." Carson McCullers apparently agreed with the Time magazine review, writing to a cousin: "Well, honey, one thing we know is that she's been poaching on my literary preserves."
given the context, answer the question in few sentences .
 context
 Not all reviewers were enthusiastic. Some lamented the use of poor white Southerners, and one-dimensional black victims, and Granville Hicks labeled the book "melodramatic and contrived". When the book was first released, Southern writer Flannery O'Connor commented, "I think for a child's book it does

## Training

In [21]:
model_name = "google/flan-t5-small"
flant5 = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [22]:
squad = squad.flatten()
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
        num_rows: 1000
    })
})

In [23]:
squad['train'][1]

{'id': '5733c0064776f41900661198',
 'title': 'University_of_Notre_Dame',
 'context': 'The television station, NDtv, grew from one show in 2002 to a full 24-hour channel with original programming by September 2006. WSND-FM serves the student body and larger South Bend community at 88.9 FM, offering students a chance to become involved in bringing classical music, fine arts and educational programming, and alternative rock to the airwaves. Another radio station, WVFI, began as a partner of WSND-FM. More recently, however, WVFI has been airing independently and is streamed on the Internet.',
 'question': 'Which television station finds its home at Notre Dame?',
 'answers.text': ['NDtv'],
 'answers.answer_start': [24]}

#### Justification for prompt

The prompts are collected from Flan 2022 Collection (Chung et al, arXiv:2210.11416).

Ref: https://arxiv.org/abs/2301.13688
Prompts are selected from this github https://github.com/google-research/FLAN/blob/main/flan/v2/flan_templates_branched.py



In [24]:
def preprocess_data(examples):
  """Adds prefix, tokenizes and sets the labels"""
  questions = examples["question"]
  contexts = examples["context"]
  titles = examples["title"]
  answers = []
  for answer in examples["answers.text"]:
    answers.append(answer[0])
  prefix = f"""Answer a question about this article:\n{context}\nQ:{question}A:"""
  inputs = [prefix.format(context=context.strip(), question=question.strip()) for question, context in zip(contexts, questions)]
  model_inputs = tokenizer(inputs,
                           truncation=True,
                           padding="max_length",
                           return_tensors='tf',
                           max_length=512)
  labels = tokenizer(text_target=answers, max_length=512, truncation=True)
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

tensored_data = squad.map(preprocess_data, remove_columns=squad["train"].column_names, batched=True)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [25]:
tensored_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [26]:
tokenizer.decode(tensored_data["train"][0]["input_ids"][0], skip_special_tokens=True)

'Answer'

In [27]:
tokenizer.decode(tensored_data["train"][0]["labels"], skip_special_tokens=True)

'50,000'

In [28]:
tokenizer.decode(tensored_data["train"][4]["input_ids"][0], skip_special_tokens=True)

'Answer'

In [29]:
tokenizer.decode(tensored_data["train"][4]["labels"], skip_special_tokens=True)

'WMA'

In [30]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)

In [34]:
# Global Parameters
L_RATE = 5e-5
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 8
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 5

def compute_metrics(eval_preds):
   preds, labels = eval_preds
   # decode preds and labels
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
   # rougeLSum expects newline after each sentence
   decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
   decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
   result = rogue_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
   return result

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="./results",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False
)

In [35]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tensored_data["train"],
   eval_dataset=tensored_data["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

In [36]:
# with 5k rows in data, training time on T4 -> 10 min
# Complete dataset, 10 epochs -> 1-2 hours
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,31.0132,26.353035,0.007478,0.0,0.007381,0.007459
2,29.4854,25.090464,0.007478,0.0,0.007381,0.007459
3,28.0739,24.379848,0.007478,0.0,0.007381,0.007459
4,27.2083,22.702253,0.005048,0.0,0.005017,0.005051
5,26.5801,22.607971,0.004114,0.0,0.004093,0.004132


TrainOutput(global_step=2500, training_loss=28.472177734375, metrics={'train_runtime': 674.8628, 'train_samples_per_second': 29.636, 'train_steps_per_second': 3.704, 'total_flos': 1858905047040000.0, 'train_loss': 28.472177734375, 'epoch': 5.0})

## Evaluate

In [37]:
trainer.evaluate()

{'eval_loss': 22.60797119140625,
 'eval_rouge1': 0.0041138485052958735,
 'eval_rouge2': 0.0,
 'eval_rougeL': 0.004092684070315649,
 'eval_rougeLsum': 0.0041324068475384265,
 'eval_runtime': 36.7731,
 'eval_samples_per_second': 27.194,
 'eval_steps_per_second': 3.399,
 'epoch': 5.0}

In [38]:
last_checkpoint = "./results/checkpoint-2500"
finetuned_model = T5ForConditionalGeneration.from_pretrained(last_checkpoint)
tokenizer = T5Tokenizer.from_pretrained(last_checkpoint)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [39]:
my_question = "What do you think about the benefit of Artificial Intelligence?"
inputs = "Please answer to this question: " + my_question
inputs

'Please answer to this question: What do you think about the benefit of Artificial Intelligence?'

In [40]:
inputs = tokenizer(inputs, return_tensors="pt")
outputs = finetuned_model.generate(**inputs)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)

acialrechnung


In [44]:
for i in range(3):
  random_idx = random.randint(0, len(squad["train"]))
  question = squad["train"][random_idx]["question"]
  context = squad["train"][random_idx]["context"]
  title = squad["train"][random_idx]["title"]
  baseline_answer = squad["train"][random_idx]["answers.text"]
  prompt = f"""Please answer a question about the following article about {title}\n{context}\n\nQ: {question}"""
  print(prompt)
  print("====="*10)
  print("Baseline answer:")
  print(baseline_answer)
  print("====="*10)
  sentence_encoded = tokenizer(prompt, return_tensors='pt', max_length=256, truncation=True)
  pred_answer = finetuned_model.generate(sentence_encoded['input_ids'], max_new_tokens=100)
  sentence_decoded = tokenizer.decode(pred_answer[0], skip_special_tokens=True)
  print("Generated Answer")
  print(sentence_decoded)
  print("====="*10)
  rouge_score = rogue_metric.compute(predictions=[sentence_decoded], references=[baseline_answer])
  print(rouge_score)
  print("#####"*20)

Please answer a question about the following article about Beyoncé
At the 57th Annual Grammy Awards in February 2015, Beyoncé was nominated for six awards, ultimately winning three: Best R&B Performance and Best R&B Song for "Drunk in Love", and Best Surround Sound Album for Beyoncé. She was nominated for Album of the Year but the award was won by Beck for his Morning Phase album. In August, the cover of the September issue of Vogue magazine was unveiled online, Beyoncé as the cover star, becoming the first African-American artist and third African-American woman in general to cover the September issue. She headlined the 2015 Made in America festival in early September and also the Global Citizen Festival later that month. Beyoncé made an uncredited featured appearance on the track "Hymn for the Weekend" by British rock band Coldplay, on their seventh studio album A Head Full of Dreams (2015), which saw release in December. On January 7, 2016, Pepsi announced Beyoncé would perform alon

## Appendix

## Apply PEFT

[TBD]

## Fine tuning using Human feedback

[TBD]