# Long-Form Question Answering Using the ELI5 (Explain Like I'm Five) Dataset

In this project, we build a natural language processing model that generates long-form answers, using the ELI5 dataset.

The first step is to install and import the required libraries and load the dataset.

For simplicity, we use the ELI5 dataset hosted on the Hugging Face Hub, since building the dataset from scratch requires running a script for multiple days.

In this notebook, we include the category in the input text given to T5 but do not use the scores of the individual answers; we simply join all answers to a question into one long answer string.

### install libraries

In [2]:
!pip install transformers datasets torch

!pip install accelerate==0.27.0

import accelerate
print(accelerate.__version__)

0.27.0


### import libraries

In [3]:
from datasets import load_dataset

from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments

### load ELI5 dataset

In [4]:
dataset = load_dataset("eli5_category")
dataset

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 91772
    })
    validation1: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 5446
    })
    validation2: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 2375
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 5411
    })
})

### preprocess data

Now we have to preprocess the data so that a pre-trained model, e.g. T5, can be fine-tuned on the **"ELI5-category"** dataset.

T5 is a text-to-text model, so each training example is a pair of strings. In this notebook we use the following format:

> input as "category: {category} explain: {question}"
>
> outputs as "{answer}"
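
The input format can be kept in one place so that training and inference build their prompts the same way. The following is a minimal sketch, not part of the original notebook: `build_prompt` is a hypothetical helper, and it uses the `explain:` prefix that the preprocessing cell in this notebook uses.

```python
def build_prompt(category: str, question: str) -> str:
    """Return the model input string for one question (hypothetical helper)."""
    return f"category: {category} explain: {question}"


# Toy example; the category and question are made up for illustration.
prompt = build_prompt("Biology", "Why do we yawn?")
print(prompt)
```

Calling the same helper when testing the model avoids a mismatch between the prefix seen during fine-tuning and the prefix used at generation time.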





In [5]:
def preprocess_data_with_category(examples):
    # Build the model input in the format "category: {category} explain: {question}"
    inputs = [f"category: {cat} explain: {q}" for cat, q in zip(examples['category'], examples['title'])]

    # Build the target as a single string per question.
    # In this first attempt, we simply concatenate all answers to a question
    # (with no separator between them).
    # Later, try to include the scores as initial weights.
    all_answers_text = []
    for item in examples['answers']:
        all_answers_text.append(''.join(item['text']))

    return {'input_text': inputs, 'target_text': all_answers_text}

# Apply the preprocessing function to each split
train_dataset = dataset['train'].map(preprocess_data_with_category, batched=True)
validation_dataset = dataset['validation1'].map(preprocess_data_with_category, batched=True)
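
As one possible way to use the answer scores later, the target could be the single highest-scored answer instead of all answers joined together. This is only a sketch of that alternative: `pick_top_answer` is a hypothetical helper, and the `score` key is an assumption about the `answers` field, since the outputs in this notebook only show `a_id` and a truncated second key.

```python
def pick_top_answer(answers: dict) -> str:
    """Return the text of the highest-scored answer, or '' if there are none."""
    texts = answers["text"]
    scores = answers["score"]  # assumed field name, not confirmed by the notebook output
    if not texts:
        return ""
    best_index = max(range(len(texts)), key=lambda i: scores[i])
    return texts[best_index]


# Toy record shaped like one entry of examples['answers'] (values are made up).
toy_answers = {
    "a_id": ["a1", "a2", "a3"],
    "text": ["Short answer.", "Detailed answer.", "Joke answer."],
    "score": [3, 12, 1],
}
print(pick_top_answer(toy_answers))
```

A shorter, single-answer target would also lose less text to the truncation applied during tokenization.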


In [6]:
train_dataset['input_text'][0]

train_dataset['target_text'][0]

"the rotation of the earth is not a constant. in fact the rotation of the earth is slowing down, which means that a full day is getting slightly longer. without leap seconds our clocks would slowly drift ever so slightly out of sync with the actual day. we could deal with this by redefining how how long 1 second is, making it slightly longer so that one day is still exactly 24*60*60 seconds. but in practice that is really inconvenient for a lot of our technology which relies on very precise timing. its easier to just move us ahead one second every couple of years or so.The Earth's rotation is not regular. It varies a bit, so sometimes we add a second. We do this to ensure that noon is always going to be sometime around mid-day. If we did not add leap seconds, over a very long period of time where the Earth's rotation slowly changed, noon could end up being at dusk. We want to keep 7am in the morning, noon at mid-day, 7pm around evening, etc. Though we have never had one, it's also poss

Keep only the first 10 percent of the data

In [7]:
type(train_dataset)

In [8]:
import pandas as pd

In [9]:
train_set=pd.DataFrame(train_dataset)
train_set.head()

Unnamed: 0,q_id,title,selftext,category,subreddit,answers,title_urls,selftext_urls,input_text,target_text
0,5lchat,Why there was a 'leap second' added to the end...,,Other,explainlikeimfive,"{'a_id': ['dbuoyxl', 'dbur7gi', 'dbuotht'], 't...",[url],[url],category: Other explain: Why there was a 'leap...,the rotation of the earth is not a constant. i...
1,5lcjq6,How do you claim undiscovered land?,"If your on a boat, sailing through lets say th...",Other,explainlikeimfive,"{'a_id': ['dbuplm8', 'dbuocvb', 'dbux9vf'], 't...",[url],[url],category: Other explain: How do you claim undi...,Imagine you are out walking in the woods near ...
2,5lcl43,Why do we fail to do realistic human CGI (like...,"Title pretty much, thanks for answers in advance!",Technology,explainlikeimfive,"{'a_id': ['dbuns7l', 'dbunw2c', 'dbup34d', 'db...",[url],[url],category: Technology explain: Why do we fail t...,It's more that we're really good at picking up...
3,5lcr1h,Why is it that we calm down when we take a dee...,,Biology,explainlikeimfive,"{'a_id': ['dbuusst'], 'text': ['Anxiety/stress...",[url],[url],category: Biology explain: Why is it that we c...,Anxiety/stress are the result of your sympathe...
4,5lcsyf,Why does 1080p on a 4k TV look better than 108...,,Technology,explainlikeimfive,"{'a_id': ['dbuq0qt', 'dbuqstj'], 'text': ['In ...",[url],[url],category: Technology explain: Why does 1080p o...,In a 1080p screen each pixel is represented by...


In [10]:
valid_set = pd.DataFrame(validation_dataset)
valid_set.head()

Unnamed: 0,q_id,title,selftext,category,subreddit,answers,title_urls,selftext_urls,input_text,target_text
0,5lcw7q,why is paedophilia so much more common in men ...,The percentages of people convicted for child ...,Culture,explainlikeimfive,"{'a_id': ['dbuqun9', 'dbuqji6', 'dbusmj0', 'db...",[url],[url],category: Culture explain: why is paedophilia ...,Whilst I'm no expert I listened to a very good...
1,5le9jl,Why is it okay to make fun of people who are f...,,Culture,explainlikeimfive,"{'a_id': ['dbv1hdv'], 'text': ['Honestly I don...",[url],[url],category: Culture explain: Why is it okay to m...,Honestly I don't know but as a southerner I fi...
2,5leb73,"Why do we, as humans, crave social interaction...",Just curious but why the hell does every perso...,Culture,explainlikeimfive,"{'a_id': ['dbv1r7v'], 'text': ['Lots of people...",[url],[url],"category: Culture explain: Why do we, as human...",Lots of people prefer to do things on their ow...
3,5lf0p0,"What was Nietzche's philosophy, exactly?",I am at a loss as to what they mean. I'm havin...,Culture,explainlikeimfive,"{'a_id': ['dbv6m77', 'dbv9ldo', 'dbv9hm6', 'db...",[url],[url],category: Culture explain: What was Nietzche's...,"Nietzsche's ""Ubermensch"" is the goal that soci..."
4,5lf5ir,The Political Spectrum,,Culture,explainlikeimfive,"{'a_id': ['dbv7nls'], 'text': ['The political ...",[url],[url],category: Culture explain: The Political Spectrum,The political spectrum varies widely from coun...


In [11]:
valid_set.shape
int(valid_set.shape[0]*0.1)

544

In [12]:
train_set.shape
int(train_set.shape[0]*0.1)

9177

In [13]:
train_rows = int(train_set.shape[0]*0.1)
train_set = train_set.head(train_rows)

valid_rows = int(valid_set.shape[0]*0.1)
valid_set = valid_set.head(valid_rows)
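
Taking the first rows with `head` keeps the subset in the dataset's original order. If a random subset is preferred, one option is to draw row indices with a fixed seed so the selection is reproducible. The sketch below is an alternative, not what this notebook does; `sample_indices` and the seed value are hypothetical.

```python
import random


def sample_indices(num_rows: int, fraction: float, seed: int = 42) -> list:
    """Return a sorted, reproducible random sample of row indices."""
    subset_size = int(num_rows * fraction)
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return sorted(rng.sample(range(num_rows), subset_size))


# Row counts taken from the DatasetDict printed earlier in this notebook.
train_indices = sample_indices(91772, 0.1)
valid_indices = sample_indices(5446, 0.1)
print(len(train_indices), len(valid_indices))
```

With pandas, such a list could then be passed to `train_set.iloc[train_indices]` to select the rows.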

In [14]:
print(train_rows, valid_rows)

9177 544


In [15]:
train_set.head()

Unnamed: 0,q_id,title,selftext,category,subreddit,answers,title_urls,selftext_urls,input_text,target_text
0,5lchat,Why there was a 'leap second' added to the end...,,Other,explainlikeimfive,"{'a_id': ['dbuoyxl', 'dbur7gi', 'dbuotht'], 't...",[url],[url],category: Other explain: Why there was a 'leap...,the rotation of the earth is not a constant. i...
1,5lcjq6,How do you claim undiscovered land?,"If your on a boat, sailing through lets say th...",Other,explainlikeimfive,"{'a_id': ['dbuplm8', 'dbuocvb', 'dbux9vf'], 't...",[url],[url],category: Other explain: How do you claim undi...,Imagine you are out walking in the woods near ...
2,5lcl43,Why do we fail to do realistic human CGI (like...,"Title pretty much, thanks for answers in advance!",Technology,explainlikeimfive,"{'a_id': ['dbuns7l', 'dbunw2c', 'dbup34d', 'db...",[url],[url],category: Technology explain: Why do we fail t...,It's more that we're really good at picking up...
3,5lcr1h,Why is it that we calm down when we take a dee...,,Biology,explainlikeimfive,"{'a_id': ['dbuusst'], 'text': ['Anxiety/stress...",[url],[url],category: Biology explain: Why is it that we c...,Anxiety/stress are the result of your sympathe...
4,5lcsyf,Why does 1080p on a 4k TV look better than 108...,,Technology,explainlikeimfive,"{'a_id': ['dbuq0qt', 'dbuqstj'], 'text': ['In ...",[url],[url],category: Technology explain: Why does 1080p o...,In a 1080p screen each pixel is represented by...


In [16]:
valid_set.head()

Unnamed: 0,q_id,title,selftext,category,subreddit,answers,title_urls,selftext_urls,input_text,target_text
0,5lcw7q,why is paedophilia so much more common in men ...,The percentages of people convicted for child ...,Culture,explainlikeimfive,"{'a_id': ['dbuqun9', 'dbuqji6', 'dbusmj0', 'db...",[url],[url],category: Culture explain: why is paedophilia ...,Whilst I'm no expert I listened to a very good...
1,5le9jl,Why is it okay to make fun of people who are f...,,Culture,explainlikeimfive,"{'a_id': ['dbv1hdv'], 'text': ['Honestly I don...",[url],[url],category: Culture explain: Why is it okay to m...,Honestly I don't know but as a southerner I fi...
2,5leb73,"Why do we, as humans, crave social interaction...",Just curious but why the hell does every perso...,Culture,explainlikeimfive,"{'a_id': ['dbv1r7v'], 'text': ['Lots of people...",[url],[url],"category: Culture explain: Why do we, as human...",Lots of people prefer to do things on their ow...
3,5lf0p0,"What was Nietzche's philosophy, exactly?",I am at a loss as to what they mean. I'm havin...,Culture,explainlikeimfive,"{'a_id': ['dbv6m77', 'dbv9ldo', 'dbv9hm6', 'db...",[url],[url],category: Culture explain: What was Nietzche's...,"Nietzsche's ""Ubermensch"" is the goal that soci..."
4,5lf5ir,The Political Spectrum,,Culture,explainlikeimfive,"{'a_id': ['dbv7nls'], 'text': ['The political ...",[url],[url],category: Culture explain: The Political Spectrum,The political spectrum varies widely from coun...


### Tokenization

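Next, we need to tokenize the inputs and targets. We also have to make sure the token sequences do not exceed the model's maximum sequence length, which is 512 tokens for T5; in the cells below we truncate and pad both inputs and targets to 256 tokens.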

In [17]:
# Load the tokenizer
model_name = 't5-base'
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Convert the pandas DataFrames back to Hugging Face `Dataset` objects (Arrow-backed) for training.

In [18]:
from datasets import Dataset

# Convert the pandas DataFrames to Hugging Face datasets
train_dataset1 = Dataset.from_pandas(train_set)
validation_dataset1 = Dataset.from_pandas(valid_set)

In [19]:
def tokenize_function(examples):
    # Tokenize the inputs and targets
    model_inputs = tokenizer(examples['input_text'], max_length=256, truncation=True, padding="max_length")
    labels = tokenizer(examples['target_text'], max_length=256, truncation=True, padding="max_length").input_ids
    # Replace tokenizer.pad_token_id with -100 for the labels
    labels = [[(label if label != tokenizer.pad_token_id else -100) for label in label_example] for label_example in labels]
    model_inputs['labels'] = labels
    return model_inputs

# Apply tokenization to each dataset split
train_dataset1 = train_dataset1.map(tokenize_function, batched=True)
validation_dataset1 = validation_dataset1.map(tokenize_function, batched=True)
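
To make the label masking concrete, here is a small sketch on made-up token ids that needs no tokenizer. It assumes T5's pad token id of 0 and uses -100, the default index ignored by PyTorch's cross-entropy loss. The `pad_and_mask` helper is hypothetical, and its truncation is simplified: it just cuts the list, whereas a real tokenizer handles the end-of-sequence token itself.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
IGNORE_INDEX = -100   # label value that the loss skips


def pad_and_mask(token_ids, max_length):
    """Truncate/pad label ids to max_length, then hide padding from the loss."""
    truncated = token_ids[:max_length]
    padded = truncated + [PAD_TOKEN_ID] * (max_length - len(truncated))
    return [tok if tok != PAD_TOKEN_ID else IGNORE_INDEX for tok in padded]


# Made-up ids; 1 stands in for T5's end-of-sequence token.
toy_label_ids = [71, 1712, 1]
print(pad_and_mask(toy_label_ids, 6))
# → [71, 1712, 1, -100, -100, -100]
```

Without this replacement, the model would also be trained to predict the padding positions, which would dominate the loss for short answers.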



### fine-tuning the T5 model on the preprocessed and tokenized data


#### Set Training Arguments

In [20]:
# The T5-base model was already loaded in the tokenization section

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=0.5,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)
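
It can help to work out how many optimizer steps these settings produce and how the warmup relates to them. The sketch below is a rough calculation under stated assumptions: a single device, no gradient accumulation, and the default learning rate (5e-5) with the default linear schedule, none of which are set explicitly in this notebook.

```python
import math

num_examples = 9177   # size of the 10 percent training subset
batch_size = 8        # per_device_train_batch_size
num_epochs = 0.5      # num_train_epochs
warmup_steps = 500    # warmup_steps
base_lr = 5e-5        # assumed: default learning rate

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = math.ceil(num_epochs * steps_per_epoch)


def linear_schedule_lr(step):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)
print(linear_schedule_lr(250), linear_schedule_lr(500), linear_schedule_lr(total_steps))
```

Under these assumptions, half an epoch is 574 steps, so 500 warmup steps cover most of the run and the learning rate is at its peak only briefly. A smaller `warmup_steps` or more epochs may be worth trying.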


#### Initialize the Trainer and Start Training

In [21]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset1,
    eval_dataset=validation_dataset1
)

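If you hit a **"CUDA out of memory"** error, uncomment and run the following cell to free memory. Note that it deletes `dataset`, which the test cell further down still needs, so reload the dataset before testing.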

In [22]:
# import torch

# torch.cuda.empty_cache()

# import gc
# del dataset
# gc.collect()

# torch.cuda.memory_summary(device=None, abbreviated=False)

In [None]:
# Start training
trainer.train()

### test the model on a sample question

In [None]:
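# Select a sample question
sample_data = dataset['test'][0]  # Get the first item in the test split
question = sample_data['title']

# Format the question the same way as the training inputs
input_text = f"category: {sample_data['category']} explain: {question}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=256, truncation=True)
# Move the input tensors to the device the model is on (e.g. the GPU after training)
inputs = inputs.to(model.device)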

In [None]:
inputs

In [None]:
question

In [None]:
# Generate the answer
output_sequences = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=512,
    num_beams=5,
    early_stopping=True
)

# Decode and print the answer
answer = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
print(f"Question: {question}")
print(f"Answer: {answer}")

### Evaluate the model using the ***ROUGE F1*** score

To do so, we have to look at the model code in this [repository](https://github.com/facebookresearch/ELI5/tree/main/model_code).
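
To give an idea of what a ROUGE F1 score measures, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It is a simplification for illustration only: it lowercases and splits on whitespace, with no stemming, and it is not the evaluation code from the repository. `rouge1_f1` is a hypothetical helper and the two sentences are made up.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the earth spins slower", "the earth spins more slowly each year")
print(round(score, 4))
```

Here three of the four predicted words appear in the seven-word reference, so precision is 3/4, recall is 3/7, and the F1 score is 6/11.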

### Save Model in Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Define the path where you want to save the model.
# On Colab, the contents of your Drive are under /content/drive/MyDrive.
model_path = '/content/drive/MyDrive/my_finetuned_t5_attempt_1'

# Save the model and the tokenizer
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

### Load Model from Google Drive

In [None]:
# Define the path from which to load the model
model_path = '/content/drive/MyDrive/my_finetuned_t5_attempt_1'

# Load the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(model_path)