# Fine Tuning With Flan T5 Large

Author : Sivaprasad Puthumadathil Rameshan Nair

Flan T5 Large Model

- 783M params
- Model type: Encoder-decoder (sequence-to-sequence) language model
- Flan-T5 Large is an instruction-tuned version of T5 (Text-To-Text Transfer Transformer), the architecture developed by Google.
- Like T5, it casts every task as text-to-text: the encoder reads an input string and the decoder generates the output string.
- It is suited to tasks such as translation, summarization, question answering, and text generation.
- The Flan models have been reported to improve on benchmark results over T5 checkpoints that were not fine-tuned on instruction-based data.
- It has demonstrated strong performance on tasks that require understanding and generating human-like text.

## 1. Story generation before fine tuning

In [44]:
import time
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Using device: {device}")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(device)  # Ensure the model is on the same device

# Function to generate text
def generate_story_with_keywords(keywords, emotion, userpref, max_length=150):
    # Create the prompt for the story generation
    prompt = (
        f"Generate a story that evokes a {emotion} emotion. The story should feature a "
        f"{keywords[0]}, a {keywords[1]}, and a {keywords[2]}. "
        f"Additionally, incorporate elements of {userpref} to enhance the narrative. "
        f"Ensure the {userpref} aspects are seamlessly integrated and contribute to the overall {emotion} tone of the story."
    )

    # Encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors='pt').to(device)  # Ensure the inputs are on the same device

    # Generate the story
    start_time = time.time()
    outputs = model.generate(
        inputs,
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
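        early_stopping=True,  # Only relevant for beam search (num_beams > 1); no effect here
        temperature=0.7,  # Controls randomness, but only when do_sample=True; ignored by the greedy decoding used here
        top_k=50  # Limits the sampling pool to top_k tokens, again only when do_sample=True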
    )
    end_time = time.time()

    # Decode the generated text
    story = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"Time taken to generate the story: {end_time - start_time:.2f} seconds")
    return story

# Define the keywords, emotion, and user preferences
keywords = ["dog", "sun", "beach"]
emotion = "happy"
userpref = "history"

# Generate the story
story = generate_story_with_keywords(keywords, emotion, userpref, max_length=150)

# Function to save the story to a file
def save_story_to_file(story, filename='/content/sample_data/generated_story_flanT5_large.txt'):
    with open(filename, 'w') as file:
        file.write(story)

# Save the generated story to a file
save_story_to_file(story)

# Print the generated story
print(story)


Using device: cuda




Time taken to generate the story: 1.33 seconds
The dog was happy to be on the beach. He was laying in the sun and enjoying the warm weather. The sun was shining brightly and the dog loved the warmth.
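
In the Hugging Face `generate` API, `temperature` and `top_k` only influence the output when sampling is enabled with `do_sample=True`; the call above does not set it, so decoding is deterministic. To illustrate what those two settings would do under sampling, here is a minimal pure-Python sketch. The function name `top_k_temperature_probs` and the toy logits are hypothetical, and the real library works on tensors rather than lists.

```python
import math

def top_k_temperature_probs(logits, temperature=0.7, top_k=50):
    """Return sampling probabilities after top-k filtering and temperature scaling."""
    # Keep only the indices of the top_k highest-scoring tokens.
    k = min(top_k, len(logits))
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by the temperature: values below 1 sharpen the distribution.
    scaled = {i: logits[i] / temperature for i in keep}
    # Numerically stable softmax over the surviving tokens.
    max_logit = max(scaled.values())
    exps = {i: math.exp(v - max_logit) for i, v in scaled.items()}
    total = sum(exps.values())
    probs = [0.0] * len(logits)
    for i, e in exps.items():
        probs[i] = e / total
    return probs

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = top_k_temperature_probs(toy_logits, temperature=0.7, top_k=2)
print([round(p, 3) for p in probs])
```

With `top_k=2`, only the two highest-scoring tokens keep a non-zero probability, and the temperature of 0.7 shifts more of the mass onto the best one.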


## 2. Dataset Preprocessing

In [None]:
import pandas as pd

# Read the CSV file while skipping problematic lines
processed_data = pd.read_csv('/content/sample_data/stories_with_features_with_genre.csv', on_bad_lines='warn')
processed_data.shape


Skipping line 397: expected 9 fields, saw 10



(1230, 9)

In [None]:
processed_data.head()

| | id | story | genre | characters | objects | locations | vehicles | professions | emotions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 457580 | In the year 2250, Earth had made significant s... | Science Fiction | scientist, star, Shadowbeast, Reynolds, UEG, e... | ship, game | spacecraft, fortress, field, moon Europa | NaN | inventor | despair, hope, excitement |
| 1 | 297904 | In a land far away, where the sun shone bright... | Fantasy | the Shadow Beast's, Thorn, Eldoria, sorcerer, ... | Sword, puzzle, scroll, sword | the Sword of Eldoria, The Sword of Eldoria, br... | NaN | adventurer | determination, excitement |
| 2 | 620436 | Once upon a time, in a small, tranquil town ca... | Mystery | detective, Thomas, Johnathan, Whispering Shado... | NaN | valley, town, warehouse, city, square | NaN | NaN | shock, love, hope, determination, gratitude, g... |
| 3 | 634687 | Once upon a time in the 16th century, a small ... | Historical Adventure | William, Elias, the Emerald Amulet, Blackwood,... | key | temple, town, trail, village | NaN | NaN | hope, determination, gratitude |
| 4 | 513427 | In the sun-drenched coastal city of St. August... | Thriller | Alex, Florida, Katie, Sarah, Thomas, artist, P... | computer, map, game | bar, city, ocean, Laboratory | NaN | lawyer | despair, hope, determination |


In [None]:
processed_data.columns

Index(['id', 'story', 'genre', 'characters', 'objects', 'locations',
       'vehicles', 'professions', 'emotions'],
      dtype='object')

In [None]:
processed_data.isna().sum()

id                0
story             0
genre             3
characters       11
objects         456
locations        83
vehicles       1202
professions     911
emotions        209
dtype: int64
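
The notebook does not say why `vehicles` is the column that gets dropped, but the counts above suggest a likely reason: it is empty in almost every row. The sketch below turns the counts into shares of the 1230 rows. The 0.9 cut-off is a hypothetical threshold chosen for illustration, not a value from the notebook.

```python
# Missing-value counts copied from the isna() output; 1230 is the row count from .shape.
total_rows = 1230
missing = {
    "id": 0, "story": 0, "genre": 3, "characters": 11, "objects": 456,
    "locations": 83, "vehicles": 1202, "professions": 911, "emotions": 209,
}

threshold = 0.9  # hypothetical cut-off for "mostly empty"
missing_share = {col: count / total_rows for col, count in missing.items()}
mostly_empty = [col for col, share in missing_share.items() if share > threshold]

for col, share in missing_share.items():
    print(f"{col:<12} {share:6.1%}")
print("Mostly empty:", mostly_empty)
```

Under this threshold only `vehicles` qualifies; `professions` is also sparse (about 74% missing) but is kept and filled with a placeholder in the following cells.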

In [33]:
df = processed_data.drop(columns=['vehicles'])
df.columns

Index(['id', 'story', 'genre', 'characters', 'objects', 'locations',
       'professions', 'emotions'],
      dtype='object')

In [None]:
# Handle NaN values: Fill NaN values with a default string
df.fillna('Unknown', inplace=True)
df.isna().sum()

id             0
story          0
genre          0
characters     0
objects        0
locations      0
professions    0
emotions       0
dtype: int64

In [None]:
df.head()

| | id | story | genre | characters | objects | locations | professions | emotions |
|---|---|---|---|---|---|---|---|---|
| 0 | 457580 | In the year 2250, Earth had made significant s... | Science Fiction | scientist, star, Shadowbeast, Reynolds, UEG, e... | ship, game | spacecraft, fortress, field, moon Europa | inventor | despair, hope, excitement |
| 1 | 297904 | In a land far away, where the sun shone bright... | Fantasy | the Shadow Beast's, Thorn, Eldoria, sorcerer, ... | Sword, puzzle, scroll, sword | the Sword of Eldoria, The Sword of Eldoria, br... | adventurer | determination, excitement |
| 2 | 620436 | Once upon a time, in a small, tranquil town ca... | Mystery | detective, Thomas, Johnathan, Whispering Shado... | Unknown | valley, town, warehouse, city, square | Unknown | shock, love, hope, determination, gratitude, g... |
| 3 | 634687 | Once upon a time in the 16th century, a small ... | Historical Adventure | William, Elias, the Emerald Amulet, Blackwood,... | key | temple, town, trail, village | Unknown | hope, determination, gratitude |
| 4 | 513427 | In the sun-drenched coastal city of St. August... | Thriller | Alex, Florida, Katie, Sarah, Thomas, artist, P... | computer, map, game | bar, city, ocean, Laboratory | lawyer | despair, hope, determination |


## 3. Loading the data for fine tuning

Let's concatenate several columns of the dataset into a single text string that serves as the model input during fine-tuning. The idea is to build a rich, informative prompt that covers the genre, characters, objects, locations, professions, and emotions of each story. The story text itself is not part of the input; it becomes the target the model learns to generate.

How It Works:

- df['genre']: Contains the genre of the story.
- df['characters']: Contains characters mentioned in the story.
- df['objects']: Contains objects referenced in the story.
- df['locations']: Contains locations where the story takes place.
- df['professions']: Contains professions of characters in the story.
- df['emotions']: Contains emotions depicted in the story.
- df['story']: Contains the main story text, used as the target (`target_text`) rather than as part of the input.

In [None]:
from datasets import Dataset
from transformers import AutoTokenizer

# Concatenate input features into a single input string, without the vehicle column
df['input_text'] = df.apply(lambda row: f"Genre: {row['genre']} Characters: {row['characters']} Objects: {row['objects']} Locations: {row['locations']} Professions: {row['professions']} Emotions: {row['emotions']}", axis=1)
df['target_text'] = df['story']

# Convert the DataFrame to a Hugging Face dataset
dataset = Dataset.from_pandas(df[['input_text', 'target_text']])
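
To make the shape of `input_text` concrete, the following sketch rebuilds the same "Label: value" string for a single row using a plain dictionary. The helper `build_input_text`, the `FIELDS` list, and the shortened example row are hypothetical names and values introduced only for illustration.

```python
# Column order used in the concatenated prompt.
FIELDS = ["genre", "characters", "objects", "locations", "professions", "emotions"]

def build_input_text(row):
    """Join 'Label: value' pairs in a fixed order, mirroring the lambda above."""
    return " ".join(f"{name.capitalize()}: {row[name]}" for name in FIELDS)

example_row = {  # hypothetical, shortened row
    "genre": "Science Fiction",
    "characters": "scientist, Reynolds",
    "objects": "ship, game",
    "locations": "spacecraft, fortress",
    "professions": "inventor",
    "emotions": "despair, hope, excitement",
}

input_text = build_input_text(example_row)
print(input_text)
```

Note that this structured prompt differs from the free-form instruction ("Generate a story that evokes...") used at generation time elsewhere in the notebook, which may affect how well the fine-tuned model responds to that instruction.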


## 4. Tokenizing the data

Tokenizing the text is a crucial step in preparing it for model training, especially for transformer models like T5.

Why Tokenize the Data?

- Converting Text to Numerical Format: Machine learning models, particularly neural networks, require numerical input. Tokenization converts text into a sequence of numbers (token IDs) that the model can process.
- Handling Vocabulary: Tokenization breaks down text into smaller units (tokens), such as words or subwords, and maps each token to a unique ID in the model's vocabulary. This helps the model understand and generate text.
- Managing Input Length: Tokenization ensures that text inputs are appropriately truncated or padded to a fixed length. This uniformity is essential for batch processing in model training.
- Preserving Meaning: Advanced tokenizers (like the one used for T5) often use subword units, which helps in handling out-of-vocabulary words and preserving the semantic meaning of the text.

What Does Tokenization Involve?

- Splitting Text into Tokens: The text is split into smaller units (tokens), which can be words, subwords, or characters.
- Mapping Tokens to IDs: Each token is mapped to a unique ID in the model’s vocabulary.
- Truncating or Padding Sequences: The tokenized sequences are truncated to a maximum length if they are too long, or padded with special tokens if they are too short. This ensures all sequences in a batch have the same length.
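
The three steps listed above can be sketched with a toy word-level tokenizer. Everything here is hypothetical: the vocabulary, the special-token IDs, and the `encode` helper are invented for illustration, and the real T5 tokenizer splits text into SentencePiece subwords rather than whitespace-separated words.

```python
PAD_ID, UNK_ID = 0, 2  # hypothetical special-token IDs
toy_vocab = {"the": 3, "dog": 4, "loved": 5, "beach": 6, "sun": 7}

def encode(text, max_length=6):
    """Split text into tokens, map them to IDs, then truncate or pad to max_length."""
    tokens = text.lower().split()                        # 1. split into tokens
    ids = [toy_vocab.get(tok, UNK_ID) for tok in tokens]  # 2. map tokens to IDs
    ids = ids[:max_length]                               # 3a. truncate long inputs
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad        # 1 = real token, 0 = padding
    ids = ids + [PAD_ID] * n_pad                         # 3b. pad short inputs
    return ids, attention_mask

short_ids, short_mask = encode("The dog loved the beach")
long_ids, long_mask = encode("The dog loved the sun the beach the dog")
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both sequences come out with the same length, which is what allows them to be stacked into one batch.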

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-large')

# Tokenize the dataset with padding
def tokenize_function(examples):
    model_inputs = tokenizer(examples['input_text'], max_length=512, padding="max_length", truncation=True)
    labels = tokenizer(examples['target_text'], max_length=512, padding="max_length", truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(tokenize_function, batched=True)
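
One common refinement, which the cell above does not apply, is to replace padding positions in the labels with -100 so that the loss ignores them; otherwise the model is also trained to predict the pad token at every padded position. The sketch below shows the idea on a plain list. The helper name `mask_padding` is hypothetical, and 0 is assumed to be the pad ID, as it is for the T5 tokenizer.

```python
PAD_TOKEN_ID = 0     # assumed pad ID (the T5 tokenizer pads with 0)
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad tokens in one label sequence with the ignore index."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

padded_labels = [25802, 6, 897, 1, 0, 0, 0]  # toy label sequence ending in padding
masked = mask_padding(padded_labels)
print(masked)
```

In a batched `map` function the same replacement would be applied to each sequence in `labels["input_ids"]`. Losses reported with and without this masking are not directly comparable.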




In [None]:
# Visualize a few samples of the tokenized data
for i in range(3):  # Change the range to see more or fewer samples
    print(f"Sample {i+1}:")
    print("Tokenized Input IDs:", tokenized_dataset[i]['input_ids'])
    print("Tokenized Labels IDs:", tokenized_dataset[i]['labels'])
    print("")

Sample 1:
Tokenized Input IDs: [5945, 60, 10, 2854, 24525, 20087, 7, 10, 17901, 6, 2213, 6, 18136, 115, 11535, 6, 27815, 6, 412, 8579, 6, 9739, 6, 30059, 6, 10498, 6, 11856, 6, 8, 638, 7, 3113, 391, 99, 17, 19746, 2661, 7038, 6, 12202, 27815, 6, 13622, 6, 24308, 6, 19553, 6, 3, 4256, 6, 8114, 6, 4030, 6, 30486, 6, 1079, 96, 683, 4365, 121, 27815, 6, 17687, 6, 13962, 6, 736, 13240, 10498, 6, 37, 907, 4030, 3141, 6, 11566, 120, 29, 96, 427, 162, 1686, 12202, 6, 205, 13223, 6, 160, 32, 3, 17057, 7, 10, 4383, 6, 467, 10450, 7, 10, 628, 6696, 6, 21, 9746, 6, 1057, 6, 8114, 5578, 749, 17585, 7, 10, 21244, 262, 7259, 7, 10, 25802, 6, 897, 6, 10147, 1, 0, 0, 0, ... (padding zeros follow; output truncated)

## 5. Fine tuning the model


In [34]:
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the pre-trained model
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-large')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=3,
    save_steps=10_000,
    logging_dir='./logs',
    gradient_accumulation_steps=4,
)

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,  # You might want to create a separate validation dataset
)

# Fine-tune the model
trainer.train()




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | No log | 2.991496 |
| 1 | No log | 1.614128 |
| 2 | No log | 1.582997 |
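
Because `eval_dataset` is the same data as `train_dataset`, the validation loss above is measured on examples the model has already trained on and cannot reveal overfitting. A minimal sketch of a held-out split follows. The helper `split_indices`, the 10% evaluation share, and the seed are hypothetical choices, not settings from the notebook.

```python
import random

def split_indices(n_examples, eval_fraction=0.1, seed=42):
    """Shuffle example indices reproducibly and split them into train and eval parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_eval = max(1, round(n_examples * eval_fraction))
    return indices[n_eval:], indices[:n_eval]

train_idx, eval_idx = split_indices(1230)  # 1230 rows, as in the dataset above
print(len(train_idx), len(eval_idx))
```

With the Hugging Face `datasets` library, such index lists could be passed to `Dataset.select`, or the built-in `Dataset.train_test_split` method could be used to achieve the same effect in one call.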


TrainOutput(global_step=459, training_loss=2.7955197269880174, metrics={'train_runtime': 1091.9496, 'train_samples_per_second': 3.379, 'train_steps_per_second': 0.42, 'total_flos': 8463107915513856.0, 'train_loss': 2.7955197269880174, 'epoch': 2.99})
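
The reported `global_step=459` can be reproduced from the training arguments with a little arithmetic. This is a sketch under two assumptions that the notebook does not state: training ran on a single GPU, and the number of optimizer updates per epoch is the number of batches divided by `gradient_accumulation_steps`, rounded down.

```python
import math

num_examples = 1230          # rows in the tokenized dataset
per_device_batch_size = 2    # per_device_train_batch_size
grad_accum_steps = 4         # gradient_accumulation_steps
num_epochs = 3               # num_train_epochs
num_devices = 1              # assumption: a single GPU

batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
updates_per_epoch = batches_per_epoch // grad_accum_steps
total_updates = updates_per_epoch * num_epochs

print(batches_per_epoch, updates_per_epoch, total_updates)
```

The total matches the 459 steps in the output. It would also explain the "No log" entries in the Training Loss column if the logging interval was left at a value larger than 459 steps (the library default is believed to be 500), in which case setting a smaller `logging_steps` would make the training loss visible per epoch.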

In [35]:
model.save_pretrained('/content/sample_data/fine-tuned-t5')
tokenizer.save_pretrained('/content/sample_data/fine-tuned-t5')

('/content/sample_data/fine-tuned-t5/tokenizer_config.json',
 '/content/sample_data/fine-tuned-t5/special_tokens_map.json',
 '/content/sample_data/fine-tuned-t5/spiece.model',
 '/content/sample_data/fine-tuned-t5/added_tokens.json',
 '/content/sample_data/fine-tuned-t5/tokenizer.json')

## 6. Generate Stories with the Fine-Tuned Model

In [41]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load the fine-tuned model and tokenizer
model_path = '/content/sample_data/fine-tuned-t5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Create a pipeline for text generation
text2text_generator = pipeline('text2text-generation', model=model, tokenizer=tokenizer)

# Define the prompt
keywords = ["dog", "sun", "beach"]
emotion = "happy"
userpref = "history"

prompt = (
    f"Generate a story that evokes a {emotion} emotion. The story should feature a "
    f"{keywords[0]}, a {keywords[1]}, and a {keywords[2]}. "
    f"Additionally, incorporate elements of {userpref} to enhance the narrative. "
    f"Ensure the {userpref} aspects are seamlessly integrated and contribute to the overall {emotion} tone of the story."
)

# Generate the story
generated_story = text2text_generator(prompt, max_length=512, do_sample=True, top_p=0.95, num_return_sequences=1)

# Print the generated story
print(generated_story[0]['generated_text'])


After school i started reading books that had a fun plot. i bought a book called dog sagas for my dog. the title was "Disconnect with the Dog". this book was a great place to read stories with the dogs that were a few years before. when my dog went to the book he had never heard of, he thought it was awesome. and when he went there he noticed the book, and thought it was the best. so i went there and bought the book. the book is called the beach house. this was a small island with lots of trees, and very little people. there was a dog named jacky who was a giant husky, with a beautiful hair cut and very bright colors. jacky was a beautiful dog. when jacky went to the beach he was very excited and looked very cute. and that is when it all started. jacky's owner had a dog named george who he named Jacky. the two kids were very happy and friendly. one day, jacky saw the beach house and immediately gave his dog the dog name, oscar, which was a dog named jackosa. they named this dog after t
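
The call above samples with `top_p=0.95` (nucleus sampling): at each step, only the most probable tokens whose cumulative probability reaches 0.95 remain eligible. The pure-Python sketch below illustrates that filtering step on a toy distribution. The function name `top_p_filter` and the probabilities are hypothetical, and the library applies the same idea to tensors of logits.

```python
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalise the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

toy_probs = [0.6, 0.25, 0.12, 0.03]  # hypothetical next-token distribution
filtered = top_p_filter(toy_probs, top_p=0.95)
print([round(p, 3) for p in filtered])
```

Here the least likely token is excluded, while the other three stay in the sampling pool. A high `top_p` keeps many candidates, which may contribute to the wandering narrative in the generated story.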

In [43]:
# Extract the generated text
story_text = generated_story[0]['generated_text']

# Save the generated story to a text file
output_file_path = '/content/sample_data/generated_story_flan_t5_large_finetuned.txt'
with open(output_file_path, 'w') as file:
    file.write(story_text)

print(f"Generated story saved to {output_file_path}")

Generated story saved to /content/sample_data/generated_story_flan_t5_large_finetuned.txt


## 7. Performance

### 7.1 Qualitative Evaluation

#### **Human Evaluation**

- Content Relevance: The story should feature a dog, the sun, and the beach.

    - Check: The story prominently features a dog named Jacky and mentions a beach and a beach house, but it never mentions the sun.

- Emotion Elicitation: The story should evoke a happy emotion.

    - Check: The story tries to convey happiness through the characters' experiences and interactions.

- Incorporation of User Preference (History): The story should include historical elements seamlessly.

    - Check: The story references past events and books but doesn't strongly incorporate historical elements.

- Coherence and Fluency: The story should be coherent and fluent.

    - Check: The story has some coherence issues and repetitions, which can affect fluency.
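
The content-relevance check can be partly automated by testing whether each requested keyword occurs in the story. The sketch below uses a short excerpt of the fine-tuned model's output. The helper `keyword_coverage` is a hypothetical name, and whole-word matching is a simplification (it would miss inflected forms such as "sunny").

```python
import re

def keyword_coverage(story, keywords):
    """Report, for each keyword, whether it appears as a whole word in the story."""
    words = set(re.findall(r"[a-z']+", story.lower()))
    return {kw: kw.lower() in words for kw in keywords}

# Two sentences taken from the generated story.
excerpt = (
    "jacky was a beautiful dog. "
    "when jacky went to the beach he was very excited and looked very cute."
)
coverage = keyword_coverage(excerpt, ["dog", "sun", "beach"])
print(coverage)
```

A check like this only confirms that a keyword is present; judging emotion, historical content, and coherence still requires a human reader.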


### 7.2 Quantitative Evaluation

#### **BLEU and ROUGE Scores**

In [55]:
from datasets import load_metric

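# Load the BLEU and ROUGE metrics
# Note: load_metric was deprecated and later removed from datasets; the separate evaluate library provides the replacement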
bleu = load_metric('bleu')
rouge = load_metric('rouge')

# Load the generated stories from the files
with open('/content/sample_data/generated_story_flanT5_.txt', 'r', encoding='utf-8') as file:
    story_before_fine_tuning = file.read()

with open('/content/sample_data/generated_story_flan_t5_large_finetuned.txt', 'r', encoding='utf-8') as file:
    story_after_fine_tuning = file.read()

# Load the reference story from the file
with open('/content/sample_data/reference_story.txt', 'r', encoding='utf-8') as file:
    reference_story = file.read()

# Tokenize the generated and reference stories
generated_tokens_before = [story_before_fine_tuning.split()]
generated_tokens_after = [story_after_fine_tuning.split()]
reference_tokens = [[reference_story.split()]]  # Reference story tokenized

# Compute BLEU scores
bleu_result_before = bleu.compute(predictions=generated_tokens_before, references=reference_tokens)
bleu_result_after = bleu.compute(predictions=generated_tokens_after, references=reference_tokens)

print("BLEU Score Before Fine-Tuning:", bleu_result_before['bleu'])
print("BLEU Score After Fine-Tuning:", bleu_result_after['bleu'])

# Compute ROUGE scores
rouge_result_before = rouge.compute(predictions=[story_before_fine_tuning], references=[reference_story])
rouge_result_after = rouge.compute(predictions=[story_after_fine_tuning], references=[reference_story])

print("ROUGE Scores Before Fine-Tuning:", rouge_result_before)
print("ROUGE Scores After Fine-Tuning:", rouge_result_after)


BLEU Score Before Fine-Tuning: 0.0
BLEU Score After Fine-Tuning: 0.0
ROUGE Scores Before Fine-Tuning: {'rouge1': AggregateScore(low=Score(precision=0.3448275862068966, recall=0.034722222222222224, fmeasure=0.06309148264984228), mid=Score(precision=0.3448275862068966, recall=0.034722222222222224, fmeasure=0.06309148264984228), high=Score(precision=0.3448275862068966, recall=0.034722222222222224, fmeasure=0.06309148264984228)), 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rougeL': AggregateScore(low=Score(precision=0.3103448275862069, recall=0.03125, fmeasure=0.056782334384858045), mid=Score(precision=0.3103448275862069, recall=0.03125, fmeasure=0.056782334384858045), high=Score(precision=0.3103448275862069, recall=0.03125, fmeasure=0.056782334384858045)), 'rougeLsum': AggregateScore(low=Score(precision=0.3103448275862069, recall=0.03125, fmeasure=0.0
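
A BLEU score of exactly 0.0 does not necessarily mean there is no word overlap. BLEU combines the clipped n-gram precisions for n = 1 to 4 with a geometric mean, so a single precision of zero (for example, no matching 4-grams) zeroes the whole score when no smoothing is used. The sketch below shows this on two toy sentences. The function `ngram_precision` and the sentences are hypothetical, and the brevity penalty is omitted for simplicity.

```python
import math
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a tokenized candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

candidate = "the dog loved the warm beach".split()
reference = "a happy dog ran along the beach at dawn".split()

precisions = [ngram_precision(candidate, reference, n) for n in range(1, 5)]
# Geometric mean of the precisions; it is zero as soon as any precision is zero.
if min(precisions) > 0:
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
else:
    geo_mean = 0.0
print(precisions, geo_mean)
```

This is why free-form story generation, where long exact matches with a single reference are unlikely, often yields a BLEU of 0.0 even when ROUGE-1 shows some unigram overlap.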

#### **Perplexity**

In [56]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and model
model_path = '/content/sample_data/fine-tuned-t5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Load the generated stories from the files
with open('/content/sample_data/generated_story_flanT5_.txt', 'r', encoding='utf-8') as file:
    story_before_fine_tuning = file.read()

with open('/content/sample_data/generated_story_flan_t5_large_finetuned.txt', 'r', encoding='utf-8') as file:
    story_after_fine_tuning = file.read()

# Tokenize the input texts
inputs_before = tokenizer(story_before_fine_tuning, return_tensors='pt')
inputs_after = tokenizer(story_after_fine_tuning, return_tensors='pt')

# Move tensors to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
inputs_before = {key: value.to(device) for key, value in inputs_before.items()}
inputs_after = {key: value.to(device) for key, value in inputs_after.items()}

# Compute the fine-tuned model's loss on the story generated before fine-tuning
with torch.no_grad():
    outputs_before = model(**inputs_before, labels=inputs_before["input_ids"])
    loss_before = outputs_before.loss
    perplexity_before = torch.exp(loss_before)

# Compute the fine-tuned model's loss on the story generated after fine-tuning
with torch.no_grad():
    outputs_after = model(**inputs_after, labels=inputs_after["input_ids"])
    loss_after = outputs_after.loss
    perplexity_after = torch.exp(loss_after)

print("Perplexity Before Fine-Tuning:", perplexity_before.item())
print("Perplexity After Fine-Tuning:", perplexity_after.item())


Perplexity Before Fine-Tuning: 1.529300332069397
Perplexity After Fine-Tuning: 1.1176501512527466


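- Both values are computed with the fine-tuned model, using each story as both the encoder input and the label. They therefore measure how easily the fine-tuned model reproduces a story it has just been shown, not how well the base and fine-tuned models generate text from a prompt.

- Within that setup, the story generated after fine-tuning scores a lower perplexity (1.118) than the story generated before fine-tuning (1.529), meaning the fine-tuned model assigns a higher probability to its own output. This is expected and, taken alone, is weak evidence of improved story quality; scoring both the base and the fine-tuned model on held-out reference stories would be a more direct comparison.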
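
For reference, perplexity is the exponential of the average negative log-likelihood per token, which is why the cell above applies `torch.exp` to the loss. The sketch below computes it from a list of per-token probabilities. The function name `perplexity` and the toy probabilities are hypothetical.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability assigned to each token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = [0.9, 0.8, 0.95]  # model assigns high probability to each correct token
uncertain = [0.3, 0.2, 0.25]  # model is unsure about each token

print(perplexity(confident), perplexity(uncertain))
```

A perplexity close to 1 means the model assigns nearly all probability to the observed tokens, while a model that spreads probability evenly over k choices at every step has a perplexity of k.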