# Fine-tune FLAN-T5 for chat & dialogue summarization

In this blog, you will learn how to fine-tune [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl) for chat & dialogue summarization using Hugging Face Transformers. If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks covering also more languages.

In this example we will use the [samsum](https://huggingface.co/datasets/samsum) dataset a collection of about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English.

You will learn how to:

1. [Setup Development Environment](#1-setup-development-environment)
2. [Load and prepare samsum dataset](#2-load-and-prepare-samsum-dataset)
3. [Fine-tune and evaluate FLAN-T5](#3-fine-tune-and-evaluate-flan-t5)
4. [Run Inference and summarize ChatGPT dialogues](#4-run-inference-and-summarize-chatgpt-dialogues)

Before we can start, make sure you have a [Hugging Face Account](https://huggingface.co/join) to save artifacts and experiments.

## Quick intro: FLAN-T5, just a better T5

FLAN-T5 released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper is an enhanced version of T5 that has been finetuned in a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. The paper discovers that overall instruction finetuning is a general method for improving the performance and usability of pretrained language models.

![flan-t5](../assets/flan-t5.png)

* Paper: https://arxiv.org/abs/2210.11416
* Official repo: https://github.com/google-research/t5x

---

Now we know what FLAN-T5 is, let's get started. 🚀

_Note: This tutorial was created and run on a g4dn.xlarge AWS EC2 Instance including a NVIDIA T4._

# **1. Setup Development Environment**

Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages.

In [1]:
# python
!pip install -q pytesseract transformers datasets rouge-score nltk tensorboard py7zr --upgrade
!pip install -q evaluate
!pip install -q accelerate -U

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m51.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.6/519.6 kB[0m [31m45.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.5/5.5 MB[0m [31m111.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m66.7/66.7 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m30.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m77.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m79.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m14.4 MB/s[0m et

In [2]:
# install git-fls for pushing model and logs to the hugging face hub
!sudo apt-get install git-lfs --yes

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


In [3]:
import re
import os
import torch
import pandas as pd

In [4]:
#Training datasets-

#Dialogue summarization- samsum
#Table Q&A- DongfuTingle/FeTaQA

dataset_id = "DongfuTingle/FeTaQA" #'samsum', 'DongfuTingle/FeTaQA'

# prompt_colname = "" #'dialogue', to be created

if(dataset_id=="samsum"):
  cols_to_keep = ["dialogue", "summary"]
  source_key = "dialogue"
  RESPONSE_COLNAME = "summary"
if(dataset_id=="DongfuTingle/FeTaQA"):
  cols_to_keep = ["table_page_title", "table_section_title", "table_array", "question", "answer"]
  source_key = "table_array"
  RESPONSE_COLNAME = "answer"


#Prompt prep.
# context-
# - in dialogue summarization- 'dialogue'
# - in table Q&A- 'table'
context_format_required_for_prompt = "linearization"
# LOV-
# 'original'- no change
# 'markdown'- (for table) data frame to markdown
# 'linearization'- (for table) data frame to text


In [5]:
MODELNAME="google/flan-t5-base"


modelname_for_save = re.sub("[^\w]", "_", MODELNAME)
print(os.getcwd())
print(os.listdir())
output_dir = f"/content/output/finetuned/{modelname_for_save}"
os.makedirs(output_dir, exist_ok=True)

/content
['.config', 'sample_data']


In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [7]:
epoch_count = 20

This example will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To be able to push our model to the Hub, you need to register on the [Hugging Face](https://huggingface.co/join).
If you already have an account, you can skip this step.
After you have an account, we will use the `notebook_login` util from the `huggingface_hub` package to log into our account and store our token (access key) on the disk.

In [8]:
# from huggingface_hub import notebook_login

# notebook_login()

# **2. Load and prepare dataset**

we will use the [samsum](https://huggingface.co/datasets/samsum) dataset a collection of about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English.

```json
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}
```

To load the `samsum` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.


In [9]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
#samsum-
# Train dataset size: 14732
# Test dataset size: 819

print(dataset)

Downloading readme:   0%|          | 0.00/663 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/14.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.77M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.61M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Train dataset size: 7326
Test dataset size: 2003
DatasetDict({
    train: Dataset({
        features: ['feta_id', 'table_source_json', 'page_wikipedia_url', 'table_page_title', 'table_section_title', 'table_array', 'highlighted_cell_ids', 'question', 'answer'],
        num_rows: 7326
    })
    validation: Dataset({
        features: ['feta_id', 'table_source_json', 'page_wikipedia_url', 'table_page_title', 'table_section_title', 'table_array', 'highlighted_cell_ids', 'question', 'answer'],
        num_rows: 1001
    })
    test: Dataset({
        features: ['feta_id', 'table_source_json', 'page_wikipedia_url', 'table_page_title', 'table_section_title', 'table_array', 'highlighted_cell_ids', 'question', 'answer'],
        num_rows: 2003
    })
})


Lets checkout an example of the dataset.

In [10]:
# from random import randrange

sample = dataset['train'][0] #[randrange(len(dataset["train"]))]

#samsum-
# print(f"dialogue: \n{sample['dialogue']}\n---------------")
# print(f"summary: \n{sample['summary']}\n---------------")

for colname in cols_to_keep:
  print(f"{colname}: \n{sample[colname]}\n---------------")

table_page_title: 
1982 Illinois gubernatorial election
---------------
table_section_title: 
Results
---------------
table_array: 
[['Party', 'Party', 'Candidate', 'Votes', '%', '±'], ['-', 'Republican', 'James R. Thompson (incumbent)', '1,816,101', '49.44', '-'], ['-', 'Democratic', 'Adlai Stevenson III', '1,811,027', '49.30', '-'], ['-', 'Libertarian', 'Bea Armstrong', '24,417', '0.66', '-'], ['-', 'Taxpayers', 'John E. Roche', '22,001', '0.60', '-'], ['-', 'N/A', 'write-ins', '161', '0.00', 'n-a'], ['Majority', 'Majority', 'Majority', '5,074', '0.14', '-'], ['Turnout', 'Turnout', 'Turnout', '3,673,707', '-', '-'], ['-', 'Republican hold', 'Republican hold', 'Swing', '-', '-']]
---------------
question: 
Who won the 1982 Illinois gubernatorial election, and how many votes was the margin?
---------------
answer: 
Thompson prevailed in the 1982 Illinois gubernatorial election by a 5,074 vote margin.
---------------


## **Data preprocessing**

In [11]:
def get_names_of_cols_to_delete_and_dataset(d, cols_to_keep):
  all_cols = d["train"].column_names
  print(f"all_cols- {all_cols}")
  print(f"cols_to_keep- {cols_to_keep}")
  # cols_to_keep = ["table_page_title", "table_section_title", "table_array", "question", "answer"]
  cols_to_delete = set(all_cols)-set(cols_to_keep)
  print(f"cols_to_delete- {cols_to_delete}")
  d = d.remove_columns(cols_to_delete)
  return(cols_to_delete, d)

In [12]:
cols_to_delete, dataset2 = get_names_of_cols_to_delete_and_dataset(dataset, cols_to_keep)

print(dataset2)

all_cols- ['feta_id', 'table_source_json', 'page_wikipedia_url', 'table_page_title', 'table_section_title', 'table_array', 'highlighted_cell_ids', 'question', 'answer']
cols_to_keep- ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer']
cols_to_delete- {'highlighted_cell_ids', 'feta_id', 'page_wikipedia_url', 'table_source_json'}
DatasetDict({
    train: Dataset({
        features: ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer'],
        num_rows: 7326
    })
    validation: Dataset({
        features: ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer'],
        num_rows: 1001
    })
    test: Dataset({
        features: ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer'],
        num_rows: 2003
    })
})


### **Dec. data size**

In [13]:
#training_dataset = dataset2["train"].select(range(examples_count))
print(type(dataset2["train"])) #dictionary whose keys are 'Dataset''s features

train_examples_count = 70
valid_examples_count = 20
test_examples_count = 10


train_dataset = dataset2["train"][:train_examples_count] #dictionary whose keys are 'Dataset''s features
valid_dataset = dataset2["validation"][:valid_examples_count]
test_dataset = dataset2["test"][:test_examples_count]

<class 'datasets.arrow_dataset.Dataset'>


### **Prompt prep.**

In [14]:
#context_format_required_for_prompt

def get_row_wise_sentence(x):
  # print(x)
  sentence = ""
  for index, value in zip(x.index, x.values):
    # print("#"*25)
    # print(type(index))
    # print(index)
    # print(value)
    sentence += f"{index} is {value}. "
  sentence = sentence.strip()
  return(sentence)



def get_content_in_proper_format_for_prompt(d, source_key, context_format):
  print(context_format)

  if(context_format=="original"):
    #if no processing required for creating context (& source_key sufficient to be used as context)
    d["context"] = d[source_key]


  else:
    #if context is required to be prepared in a particular format (eg- wrt Table Q&A)

    context_for_prompt = []

    for i, table_list in enumerate(d[source_key]): #source_key- eg ('table_array' in FeTaQA)
      # if(i==1): #for table1
      print(str(i)+" "+"*"*50)
      #print(table_list)
      col_list = table_list[0] #header row of this table
      print(col_list)
      del table_list[0] #deleting header row
      df_table = pd.DataFrame(table_list, columns=col_list)
      #print(df_table)

      if(context_format=="markdown"):
        context_for_prompt.append(df_table.to_markdown())
      elif(context_format=="linearization"):
        df_table["row_wise_sentence"] = df_table.apply(lambda x: get_row_wise_sentence(x), axis=1)
        #print(df_table["row_wise_sentence"])
        context_for_prompt.append(" ".join(df_table["row_wise_sentence"].to_list()))


    d["context"] = context_for_prompt

  return(d)

train_dataset = get_content_in_proper_format_for_prompt(train_dataset, source_key, context_format_required_for_prompt)
print("#"*100)
valid_dataset = get_content_in_proper_format_for_prompt(valid_dataset, source_key, context_format_required_for_prompt)
print("#"*100)
test_dataset = get_content_in_proper_format_for_prompt(test_dataset, source_key, context_format_required_for_prompt)

linearization
0 **************************************************
['Party', 'Party', 'Candidate', 'Votes', '%', '±']
1 **************************************************
['Finish', 'Start', 'No', 'Name', 'Qual', 'Laps', 'Status']
2 **************************************************
['No.', 'Album', 'Artist', 'Released', 'Chart', 'Sales']
3 **************************************************
['Aircraft', 'In Service', 'Orders', 'Passengers', 'Notes']
4 **************************************************
['Year', 'Production', 'Role', 'Venue', 'Notes']
5 **************************************************
['User equipment Category', 'Max. L1 data rate Downlink (Mbit/s)', 'Max. number of DL MIMO layers', 'Max. L1 data rate Uplink (Mbit/s)', '3GPP Release']
6 **************************************************
['Year', 'Title', 'Role', 'Notes']
7 **************************************************
['-', '-', '-', 'Regular season', 'Regular season', 'Regular season', 'Regular season', 'Regular 

In [15]:
print(train_dataset["context"][0])

Party is -. Party is Republican. Candidate is James R. Thompson (incumbent). Votes is 1,816,101. % is 49.44. ± is -. Party is -. Party is Democratic. Candidate is Adlai Stevenson III. Votes is 1,811,027. % is 49.30. ± is -. Party is -. Party is Libertarian. Candidate is Bea Armstrong. Votes is 24,417. % is 0.66. ± is -. Party is -. Party is Taxpayers. Candidate is John E. Roche. Votes is 22,001. % is 0.60. ± is -. Party is -. Party is N/A. Candidate is write-ins. Votes is 161. % is 0.00. ± is n-a. Party is Majority. Party is Majority. Candidate is Majority. Votes is 5,074. % is 0.14. ± is -. Party is Turnout. Party is Turnout. Candidate is Turnout. Votes is 3,673,707. % is -. ± is -. Party is -. Party is Republican hold. Candidate is Republican hold. Votes is Swing. % is -. ± is -.


In [16]:
train_dataset.keys()

dict_keys(['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context'])

In [17]:
#prompt creation
def create_prompt(d, records_count):
  print("Calling create_prompt()-")

  print(d.keys())

  # for key in d: #d- dict
  #   d_size = len(d[key])
  #   print(d_size)
  #   break
  d_size = records_count
  print(d_size)

  prompt_list = []

   #'samsum'------------------------------------------------------------------------------------------------------------
  if(dataset_id == "samsum"): #'samsum'
    for i in range(d_size):
      prompt_list.append(f"""Question:
Summarize the following dialogue.
Dialogue:
{d["dialogue"][i]}
""")

  elif(dataset_id== "DongfuTingle/FeTaQA"): #'DongfuTingle/FeTaQA'------------------------------------------------------------------------------------------------------------
    for i in range(d_size):
      prompt_list.append(f"""Question:
{d["question"][i]}
Table:
{d["table_page_title"][i]}: {d["table_section_title"][i]}
{d["context"][i]}
  """)

  d["prompt"] = prompt_list
  print(d.keys())
  return(d)

train_dataset = create_prompt(train_dataset, train_examples_count)
print("#"*100)
valid_dataset = create_prompt(valid_dataset, valid_examples_count)
print("#"*100)
test_dataset = create_prompt(test_dataset, test_examples_count)

Calling create_prompt()-
dict_keys(['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context'])
70
dict_keys(['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt'])
####################################################################################################
Calling create_prompt()-
dict_keys(['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context'])
20
dict_keys(['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt'])
####################################################################################################
Calling create_prompt()-
dict_keys(['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context'])
10
dict_keys(['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt'])


In [18]:
print(type(train_dataset[source_key])) #O/p- <class 'list'>
print(len(train_dataset[source_key])) #O/p- 70

print(type(train_dataset["context"])) #O/p- <class 'list'>
print(len(train_dataset["context"])) #O/p- 70

print(type(train_dataset["prompt"])) #O/p- <class 'list'>
print(len(train_dataset["prompt"])) #O/p- 70

<class 'list'>
70
<class 'list'>
70
<class 'list'>
70


In [19]:
print(train_dataset["prompt"][0])

Question:
Who won the 1982 Illinois gubernatorial election, and how many votes was the margin?
Table:
1982 Illinois gubernatorial election: Results
Party is -. Party is Republican. Candidate is James R. Thompson (incumbent). Votes is 1,816,101. % is 49.44. ± is -. Party is -. Party is Democratic. Candidate is Adlai Stevenson III. Votes is 1,811,027. % is 49.30. ± is -. Party is -. Party is Libertarian. Candidate is Bea Armstrong. Votes is 24,417. % is 0.66. ± is -. Party is -. Party is Taxpayers. Candidate is John E. Roche. Votes is 22,001. % is 0.60. ± is -. Party is -. Party is N/A. Candidate is write-ins. Votes is 161. % is 0.00. ± is n-a. Party is Majority. Party is Majority. Candidate is Majority. Votes is 5,074. % is 0.14. ± is -. Party is Turnout. Party is Turnout. Candidate is Turnout. Votes is 3,673,707. % is -. ± is -. Party is -. Party is Republican hold. Candidate is Republican hold. Votes is Swing. % is -. ± is -.
  


In [20]:
#Consolidating all datasets into 1-
from datasets import Dataset
train_dataset2 = Dataset.from_dict(train_dataset)
valid_dataset2 = Dataset.from_dict(valid_dataset)
test_dataset2 = Dataset.from_dict(test_dataset)


from datasets import DatasetDict
dataset3 = DatasetDict({
    "train": train_dataset2,
    "valid": valid_dataset2,
    "test": test_dataset2
    })
dataset3

DatasetDict({
    train: Dataset({
        features: ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt'],
        num_rows: 70
    })
    valid: Dataset({
        features: ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt'],
        num_rows: 20
    })
    test: Dataset({
        features: ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt'],
        num_rows: 10
    })
})

## **Tokenization**

To train our model we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=tf) of the Hugging Face Course.

In [21]:
from transformers import AutoTokenizer


# Load tokenizer of FLAN-t5-small
tokenizer = AutoTokenizer.from_pretrained(MODELNAME)

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

before we can start training we need to preprocess our data. Abstractive Summarization is a text2text-generation task. This means our model will take a text as input and generate a summary as output. For this we want to understand how long our input and output will be to be able to efficiently batch our data.

In [22]:
from datasets import concatenate_datasets

cols_to_keep2 = ["prompt", RESPONSE_COLNAME]

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([train_dataset2, valid_dataset2, test_dataset2])\
                   .map(lambda x: tokenizer(x["prompt"], truncation=True), batched=True, remove_columns=get_names_of_cols_to_delete_and_dataset(dataset3, cols_to_keep2)[0])

max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")



# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded."
tokenized_targets = concatenate_datasets([train_dataset2, valid_dataset2, test_dataset2])\
                    .map(lambda x: tokenizer(x[RESPONSE_COLNAME], truncation=True), batched=True, remove_columns=get_names_of_cols_to_delete_and_dataset(dataset3, cols_to_keep2)[0])

max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

all_cols- ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt']
cols_to_keep- ['prompt', 'answer']
cols_to_delete- {'table_section_title', 'table_page_title', 'table_array', 'question', 'context'}


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Max source length: 512
all_cols- ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt']
cols_to_keep- ['prompt', 'answer']
cols_to_delete- {'table_section_title', 'table_page_title', 'table_array', 'question', 'context'}


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Max target length: 69


In [23]:
cols_to_keep2

['prompt', 'answer']

In [24]:
get_names_of_cols_to_delete_and_dataset(dataset3, cols_to_keep2)[0]

all_cols- ['table_page_title', 'table_section_title', 'table_array', 'question', 'answer', 'context', 'prompt']
cols_to_keep- ['prompt', 'answer']
cols_to_delete- {'table_section_title', 'table_page_title', 'table_array', 'question', 'context'}


{'context',
 'question',
 'table_array',
 'table_page_title',
 'table_section_title'}

In [25]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    #inputs = ["summarize: " + item for item in sample["dialogue"]]
    inputs = sample["prompt"]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    #labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)
    labels = tokenizer(text_target=sample[RESPONSE_COLNAME], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



tokenized_dataset = dataset3.map(preprocess_function, batched=True, remove_columns=dataset3["train"].column_names) #["dialogue", "summary", "id"])
print(tokenized_dataset)
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Map:   0%|          | 0/70 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 70
    })
    valid: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 20
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
})
Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


## 3. Fine-tune and evaluate FLAN-T5

After we have processed our dataset, we can start training our model. Therefore we first need to load our [FLAN-T5](https://huggingface.co/models?search=flan-t5) from the Hugging Face Hub. In the example we are using a instance with a NVIDIA V100 meaning that we will fine-tune the `base` version of the model.
_I plan to do a follow-up post on how to fine-tune the `xxl` version of the model using Deepspeed._


In [26]:
from transformers import AutoModelForSeq2SeqLM

# # huggingface hub model id
# MODELNAME="google/flan-t5-small"

# # load model from the hub

model = AutoModelForSeq2SeqLM.from_pretrained(MODELNAME).to(device)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

We want to evaluate our model during training. The `Trainer` supports evaluation during training by providing a `compute_metrics`.  
The most commonly used metrics to evaluate summarization task is [rogue_score](https://en.wikipedia.org/wiki/ROUGE_(metric)) short for Recall-Oriented Understudy for Gisting Evaluation). This metric does not behave like the standard accuracy: it will compare a generated summary against a set of reference summaries

We are going to use `evaluate` library to evaluate the `rogue` score.

In [27]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Before we can start training is to create a `DataCollator` that will take care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [28]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training. We are leveraging the [Hugging Face Hub](https://huggingface.co/models) integration of the `Trainer` to automatically push our checkpoints, logs and metrics during training into a repository.

In [29]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Hugging Face repository id
# repository_id = f"{MODELNAME.split('/')[1]}-{dataset_id}"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir, #repository_id,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=epoch_count,
    # logging & evaluation strategies
    logging_dir=f"{output_dir}/logs", #{repository_id}/logs"
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    # push_to_hub=False,
    # hub_strategy="every_save",
    # hub_model_id=repository_id,
    # hub_token=HfFolder.get_token(),
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    compute_metrics=compute_metrics,
)

We can start our training by using the `train` method of the `Trainer`.

In [30]:
# Start training
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,1.567747,24.7154,12.8866,21.3052,22.4976,14.9
2,No log,1.470194,46.2632,23.846,39.2459,39.3804,16.6
3,No log,1.442961,51.1859,29.0608,45.2421,45.2022,17.25
4,No log,1.428791,50.1386,27.1946,43.4078,43.3846,17.25
5,No log,1.422435,48.7138,25.3103,41.6024,41.761,17.35
6,No log,1.418585,49.5364,26.9706,42.8548,42.8634,17.75
7,No log,1.423429,50.4234,27.295,43.1929,43.6969,17.85
8,No log,1.431807,50.6056,27.5849,43.0881,43.6187,17.9
9,No log,1.44318,51.117,28.419,43.9933,44.3871,17.9
10,No log,1.457182,48.9974,26.7685,42.3144,42.7819,17.9


TrainOutput(global_step=180, training_loss=1.2165564643012152, metrics={'train_runtime': 724.2146, 'train_samples_per_second': 1.933, 'train_steps_per_second': 0.249, 'total_flos': 958660293427200.0, 'train_loss': 1.2165564643012152, 'epoch': 20.0})


![flan-t5-tensorboard](../assets/flan-t5-tensorboard.png)

Nice, we have trained our model. 🎉 Lets run evaluate the best model again on the test set.


In [31]:
trainer.evaluate()

{'eval_loss': 1.4185845851898193,
 'eval_rouge1': 49.5364,
 'eval_rouge2': 26.9706,
 'eval_rougeL': 42.8548,
 'eval_rougeLsum': 42.8634,
 'eval_gen_len': 17.75,
 'eval_runtime': 3.1322,
 'eval_samples_per_second': 6.385,
 'eval_steps_per_second': 0.958,
 'epoch': 20.0}

The best score we achieved is an `rouge1` score of `47.23`.

Lets save our results and tokenizer to the Hugging Face Hub and create a model card.

In [32]:
# Push the results to the hub
trainer.save_model(output_dir) #trainer.push_to_hub()
trainer.create_model_card() #saving model card

# Saving our tokenizer
tokenizer.save_pretrained(output_dir) #repository_id

('/content/output/finetuned/google_flan_t5_base/tokenizer_config.json',
 '/content/output/finetuned/google_flan_t5_base/special_tokens_map.json',
 '/content/output/finetuned/google_flan_t5_base/tokenizer.json')

## 4. Run Inference

Now we have a trained model, we can use it to run inference. We will use the `pipeline` API from transformers and a `test` example from our dataset.

In [33]:
#Ref.- https://opendelta.readthedocs.io/en/latest/notes/autodelta.html
# t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
# t5_tokenizer = AutoTokenizer.from_pretrained("t5-large")
# # A running example
# inputs_ids = t5_tokenizer.encode("Is Harry Poter wrtten by JKrowling", return_tensors="pt")
# t5_tokenizer.decode(t5.generate(inputs_ids)[0])


# from transformers import pipeline
from transformers import GenerationConfig

# load model and tokenizer from huggingface hub with pipeline
#summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum", device=0)

summarizer = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
generation_config = GenerationConfig(
    #max_new_tokens=200, do_sample=True, top_k=2, eos_token_id=summarizer.config.eos_token_id, pad_token_id=tokenizer.pad_token_id
    max_new_tokens=200, do_sample=False, eos_token_id=summarizer.config.eos_token_id, pad_token_id=tokenizer.pad_token_id
)

tokenizer = AutoTokenizer.from_pretrained(output_dir)

def get_text(model, tokenizer, input_prompt):
  inputs_ids = tokenizer.encode(input_prompt, max_length=max_source_length, padding="max_length", truncation=True, return_tensors="pt") #tokenized txt
  # print(inputs_ids)
  # prompt_length = len(tokenizer.decode(inputs_ids[0]))
  # print(prompt_length)

  generated_txt = tokenizer.decode(model.generate(inputs_ids, generation_config=generation_config)[0], skip_special_tokens=True)

  #print(f"flan-t5-small summary:\n{generated_txt}")
  return(generated_txt)


In [51]:
selected_index = 3

#1) dataset- samsum
# sample = dataset['test'][selected_index = 3] #[randrange(len(dataset["test"]))]
# print(f"dialogue: \n{sample['dialogue']}\n---------------")
# input_prompt = sample["dialogue"]

# input_prompt = """Ishant: What are you doing?
# Hemant: I am watching movie.
# Ishant: Which movie are you watching?
# Hemant: Avengers.
# Ishant: How much have you watched?
# Hemant: Just started.
# Ishant: How long is it?
# Hemant: 2 hr long."""

# print(input_prompt)



#2) dataset- FeTaQA
# print(dataset3['train'].column_names)

# print("*"*50)
# sample = dataset3['train'][selected_index] #[randrange(len(dataset["train"]))]
# input_prompt = sample["prompt"]
# print(input_prompt)
# print("\n")

# print("*"*50)
# print("original response:\n")
# print(sample[RESPONSE_COLNAME])
# print("\n")

input_prompt = f"""Question:
How many patients had injury complications?
Table:
Number of patients evaluable for TEAEs is	328.
Number of patients with at least 1 TEAE	is 72 (22.0)
Number of TEAEs	is 89
General disorders and administration site conditions	is	31 (9.5)
Drug ineffective	is	29 (8.8)
Infections and infestations	is	19 (5.8)
Upper respiratory tract infection	is	5 (1.5)
Injury, poisoning and procedural complications	is	17 (5.2)
Infusion related reaction	is	17 (5.2)
Investigations	is	4 (1.2)
Skin and subcutaneous tissue disorders	is	4 (1.2)
"""
print(input_prompt)



#3)
print("generated text:")
print("*"*50)
get_text(summarizer, tokenizer, input_prompt) #'Hemant is watching Avengers movie. He has just started.'

Question:
How many patients had injury complications?
Table:
Number of patients evaluable for TEAEs is	328.
Number of patients with at least 1 TEAE	is 72 (22.0)
Number of TEAEs	is 89
General disorders and administration site conditions	is	31 (9.5)
Drug ineffective	is	29 (8.8)
Infections and infestations	is	19 (5.8)
Upper respiratory tract infection	is	5 (1.5)
Injury, poisoning and procedural complications	is	17 (5.2)
Infusion related reaction	is	17 (5.2)
Investigations	is	4 (1.2)
Skin and subcutaneous tissue disorders	is	4 (1.2)

generated text:
**************************************************


'Injuries and complications are 17 (5.2), while infusion related reaction is 4 (1.2).'

# **Table Q&A**

In [35]:
#Ref.- https://huggingface.co/RUCAIBox/mvp

from transformers import MvpTokenizerFast, MvpForConditionalGeneration

tokenizer = MvpTokenizerFast.from_pretrained("RUCAIBox/mvp")
model = MvpForConditionalGeneration.from_pretrained("RUCAIBox/mvp")

inputs = tokenizer(
    "Describe the following data: Iron Man | instance of | Superhero [SEP] Stan Lee | creator | Iron Man",
    return_tensors="pt",
)
generated_ids = model.generate(**inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)


SyntaxError: ignored

In [None]:
user_input = f"""Describe the following data:
SOC | Placebo (N = 89) n (% pts) | 0.3mg (N = 91) n (% pts) | 1.0mg (N = 94) n (% pts) | 2.0mg (N = 91) n (% pts) | Total (N=365) n (% pts)
injection site erythema | 0 | 20 (22.0%) | 24 (25.5%) | 27 (29.7%) | 71 (19.5%)"""


inputs = tokenizer(
    f"{user_input}",
    return_tensors="pt",
)
generated_ids = model.generate(**inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)


In [None]:
#Ref.- https://huggingface.co/tasks/table-question-answering

from transformers import pipeline
import pandas as pd

# prepare table + question
data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
table = pd.DataFrame.from_dict(data)
question = "how many movies does Leonardo Di Caprio have?"

# pipeline model
# Note: you must to install torch-scatter first.
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")

# result

print(tqa(table=table, query=question)['cells'][0])
#53


# **Training dataset prep.**

In [None]:
import pandas as pd

In [None]:
df_tables = pd.read_csv("training_dataset.csv")

col_list = df_tables.columns.to_list()
print(col_list)
col_list_updated = []
for colname in col_list:
  print(colname)
  col_list_updated.append(colname.replace("\n", " "))
print(col_list_updated)
df_tables.columns = col_list_updated
df_tables

In [None]:
df_markdown = df_tables.to_markdown()
df_markdown

In [None]:
from io import StringIO

pd.read_csv(StringIO(df_markdown), sep="|", header=0, index_col=1, skipinitialspace=True)\
  .dropna(axis=1, how='all')\
  .iloc[1:]

# Read a markdown file, getting the header from the first row and inex from the second column
# Drop the leftmost and rightmost null columns
# Drop the header underline row