# Long-Form Question Answering Using the ELI5 (Explain Like I'm Five) Dataset

In this project we are going to create a natural language processing model for generating long-form answers using the ELI5 dataset.

The first step is to install and import the required libraries and load the dataset.

For simplicity, we use the `eli5_category` version of the ELI5 dataset hosted on the Hugging Face Hub, since building the dataset from scratch requires running a script for multiple days.

In this notebook, we include each question's category in the input text, which T5 can consume as a plain-text prefix. We do not use the scores of the individual answers; we simply join all answers to a question into one long answer string.

### install libraries

In [1]:
!pip install transformers datasets torch

!pip install accelerate==0.27.0

import accelerate
print(accelerate.__version__)

0.27.0


### import libraries

In [2]:
from datasets import load_dataset

from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments

### load ELI5 dataset

In [32]:
dataset = load_dataset("eli5_category")
dataset

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 91772
    })
    validation1: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 5446
    })
    validation2: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 2375
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 5411
    })
})

### preprocess data

Now we need to preprocess the data so that a pre-trained model such as T5 can be fine-tuned on the **"ELI5-category"** dataset.

T5 is a text-to-text model, so each training example is a pair of strings. We use the following format, which matches the preprocessing function in the next cell:



> input as "category: {category} explain: {question}"
>
> outputs as "{answer}"
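
As a concrete illustration of this format, the sketch below builds one input string. The helper name `build_t5_input` and the sample category and question are made up for the example; the `category:` and `explain:` prefixes follow the preprocessing function used in this notebook.

```python
def build_t5_input(category: str, question: str) -> str:
    """Format one question as a T5 input string (hypothetical helper)."""
    return f"category: {category} explain: {question}"

# Made-up sample values, not taken from the dataset
example_input = build_t5_input("Physics", "Why is the sky blue?")
print(example_input)
```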





In [33]:
def preprocess_data_with_category(examples):

    # create the input text in the format "category: {category} explain: {question}"
    inputs = [f"category: {cat} explain: {q}" for cat, q in zip(examples['category'], examples['title'])]

    # create the output answer as a single string.
    # in this first attempt, we only join all the answers to a question together
    # (with no separator between them)...
    # later, try to include the scores as initial weights
    all_answers_text = []
    for item in examples['answers']:
        all_answers_text.append(''.join(item['text']))

    return {'input_text': inputs, 'target_text': all_answers_text}

# Apply the preprocessing function to each split
train_dataset = dataset['train'].map(preprocess_data_with_category, batched=True)
validation_dataset = dataset['validation1'].map(preprocess_data_with_category, batched=True)


In [34]:
train_dataset['input_text'][0]

train_dataset['target_text'][0]

"the rotation of the earth is not a constant. in fact the rotation of the earth is slowing down, which means that a full day is getting slightly longer. without leap seconds our clocks would slowly drift ever so slightly out of sync with the actual day. we could deal with this by redefining how how long 1 second is, making it slightly longer so that one day is still exactly 24*60*60 seconds. but in practice that is really inconvenient for a lot of our technology which relies on very precise timing. its easier to just move us ahead one second every couple of years or so.The Earth's rotation is not regular. It varies a bit, so sometimes we add a second. We do this to ensure that noon is always going to be sometime around mid-day. If we did not add leap seconds, over a very long period of time where the Earth's rotation slowly changed, noon could end up being at dusk. We want to keep 7am in the morning, noon at mid-day, 7pm around evening, etc. Though we have never had one, it's also poss

### Tokenization

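Next, we need to tokenize the inputs and targets. We also need to make sure the token sequences do not exceed the model's maximum sequence length (512 tokens for the pre-trained T5 checkpoints); the code below truncates both inputs and targets to 256 tokens.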

In [35]:
# Load the model and the tokenizer
model_name = 't5-base'
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [36]:
def tokenize_function(examples):
    # Tokenize the inputs and targets
    model_inputs = tokenizer(examples['input_text'], max_length=256, truncation=True, padding="max_length")
    labels = tokenizer(examples['target_text'], max_length=256, truncation=True, padding="max_length").input_ids
    # Replace tokenizer.pad_token_id with -100 for the labels
    labels = [[(label if label != tokenizer.pad_token_id else -100) for label in label_example] for label_example in labels]
    model_inputs['labels'] = labels
    return model_inputs

# Apply tokenization to each dataset split
train_dataset = train_dataset.map(tokenize_function, batched=True)
validation_dataset = validation_dataset.map(tokenize_function, batched=True)
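
The label-masking step in `tokenize_function` can be seen in isolation on toy token IDs. In the sketch below, the ID values are made up, and the pad ID of 0 matches T5's `pad_token_id`; `-100` is the index that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the training loss.

```python
PAD_TOKEN_ID = 0      # T5's pad token ID
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(batch_labels, pad_id=PAD_TOKEN_ID):
    """Replace pad IDs with IGNORE_INDEX in a batch of label sequences."""
    return [
        [tok if tok != pad_id else IGNORE_INDEX for tok in seq]
        for seq in batch_labels
    ]

# Two toy label sequences padded to length 5 (made-up IDs; 1 is T5's </s>)
toy_labels = [
    [37, 1102, 1, 0, 0],
    [8, 1, 0, 0, 0],
]
masked = mask_padding(toy_labels)
print(masked)
```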



### fine-tuning T5 model with the preprocessed and tokenized dataset


#### Set Training Arguments

In [37]:
# The T5-base model was already loaded in the tokenization step above

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)
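
To get a feel for what these settings mean, the arithmetic below estimates the number of optimizer steps. It assumes a single device and no gradient accumulation (neither is stated in the notebook) and uses the 91,772 training rows reported when the dataset was loaded.

```python
import math

num_train_rows = 91772        # size of the train split shown earlier
batch_size = 8                # per_device_train_batch_size
num_epochs = 3                # num_train_epochs
warmup_steps = 500

# Assumption: one device, gradient_accumulation_steps=1, last partial batch kept
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")
```

Under these assumptions one epoch is 11,472 steps and the full run is 34,416 steps, so the 500 warmup steps cover roughly 1.5% of training.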


#### Initialize the Trainer and Start Training

In [38]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset
)

Run this cell if you hit a **"CUDA out of memory"** error. The lines are commented out, so uncomment them before running.

In [29]:
# import torch

# torch.cuda.empty_cache()

# import gc
# del dataset
# gc.collect()

# torch.cuda.memory_summary(device=None, abbreviated=False)
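
Clearing the cache does not reduce the memory that training itself needs. If the out-of-memory error persists, a common alternative is to shrink the per-device batch and compensate with gradient accumulation. The sketch below lists `TrainingArguments` keyword names as a plain dictionary; the specific values are illustrative assumptions, not settings tested in this notebook.

```python
# Illustrative values only; tune for the GPU actually in use
memory_saving_kwargs = {
    'per_device_train_batch_size': 2,   # smaller batch held in memory at once
    'gradient_accumulation_steps': 4,   # accumulate gradients over 4 batches
    'gradient_checkpointing': True,     # trade extra compute for less activation memory
}

# The effective batch size per optimizer step stays at 8, as in the original settings
effective_batch_size = (
    memory_saving_kwargs['per_device_train_batch_size']
    * memory_saving_kwargs['gradient_accumulation_steps']
)
print(effective_batch_size)
```

These could be passed along with the other settings, for example `TrainingArguments(output_dir='./results', **memory_saving_kwargs)`.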

In [39]:
# Start training
trainer.train()

| Step | Training Loss |
| --- | --- |
| 10 | 12.4272 |
| 20 | 11.9906 |
| 30 | 12.0066 |
| 40 | 10.8418 |
| 50 | 11.1631 |
| 60 | 10.6088 |
| 70 | 9.2651 |
| 80 | 7.3578 |
| 90 | 5.9798 |
| 100 | 4.8982 |


KeyboardInterrupt: 

### test the model on a sample question

In [None]:
# Select a sample question from the test split
# (the DatasetDict must be indexed by split name; the question is in 'title')
sample_data = dataset['test'][0]
question = sample_data['title']
category = sample_data['category']

# Reuse the fine-tuned `model` and `tokenizer` from the cells above;
# reloading a pre-trained checkpoint here would discard the fine-tuning.
model.eval()

# Format the question the same way as the training inputs
input_text = f"category: {category} explain: {question}"
# Calling the tokenizer returns both input_ids and attention_mask;
# move them to the device the model is on
inputs = tokenizer(input_text, return_tensors="pt", max_length=256, truncation=True).to(model.device)

# Generate the answer with beam search
output_sequences = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=512,
    num_beams=5,
    early_stopping=True
)

# Decode and print the answer
answer = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
print(f"Question: {question}")
print(f"Answer: {answer}")

### Save Model in Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Define the path where you want to save the model
# (files in your Drive live under the 'MyDrive' folder of the mount point)
model_path = '/content/drive/MyDrive/my_finetuned_t5_attempt_1'

# Save the model and the tokenizer
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

### Load Model from Google Drive

In [None]:
# Define the path where the model was saved
model_path = '/content/drive/MyDrive/my_finetuned_t5_attempt_1'

# Load the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(model_path)