# Fine-Tuning T5 on a Custom Dataset

## 1. Data Preprocessing

In [2]:
import pandas as pd

processed_data = pd.read_csv('stories_with_features_with_genre.csv')
processed_data.shape

(1000, 9)

In [3]:
processed_data.head()

| | id | story | genre | characters | objects | locations | vehicles | professions | emotions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 457580 | In the year 2250, Earth had made significant s... | Science Fiction | scientist, star, Shadowbeast, Reynolds, UEG, e... | ship, game | spacecraft, fortress, field, moon Europa | | inventor | despair, hope, excitement |
| 1 | 297904 | In a land far away, where the sun shone bright... | Fantasy | the Shadow Beast's, Thorn, Eldoria, sorcerer, ... | Sword, puzzle, scroll, sword | the Sword of Eldoria, The Sword of Eldoria, br... | | adventurer | determination, excitement |
| 2 | 620436 | Once upon a time, in a small, tranquil town ca... | Mystery | detective, Thomas, Johnathan, Whispering Shado... | | valley, town, warehouse, city, square | | | shock, love, hope, determination, gratitude, g... |
| 3 | 634687 | Once upon a time in the 16th century, a small ... | Historical Adventure | William, Elias, the Emerald Amulet, Blackwood,... | key | temple, town, trail, village | | | hope, determination, gratitude |
| 4 | 513427 | In the sun-drenched coastal city of St. August... | Thriller | Alex, Florida, Katie, Sarah, Thomas, artist, P... | computer, map, game | bar, city, ocean, Laboratory | | lawyer | despair, hope, determination |


In [4]:
processed_data.columns

Index(['id', 'story', 'genre', 'characters', 'objects', 'locations',
       'vehicles', 'professions', 'emotions'],
      dtype='object')

In [5]:
processed_data.isna().sum()

id               0
story            0
genre            0
characters       1
objects        344
locations       47
vehicles       976
professions    728
emotions       133
dtype: int64

Since most rows in the `vehicles` column are empty (976 of 1,000), we will remove the column.

In [6]:
data = processed_data.drop(columns=['vehicles'])
data.columns

Index(['id', 'story', 'genre', 'characters', 'objects', 'locations',
       'professions', 'emotions'],
      dtype='object')
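
The choice of which column to drop can also be expressed as a threshold on the fraction of missing values. The sketch below is a minimal illustration in plain Python that reuses the counts reported by `isna().sum()` above and the 1,000-row total. The 0.9 cutoff and the names `missing_counts` and `columns_to_drop` are assumptions for illustration, not part of the original notebook.

```python
# Missing-value counts as reported by processed_data.isna().sum() above
missing_counts = {
    "id": 0, "story": 0, "genre": 0, "characters": 1, "objects": 344,
    "locations": 47, "vehicles": 976, "professions": 728, "emotions": 133,
}
n_rows = 1000  # first element of processed_data.shape

def columns_to_drop(counts, total, threshold=0.9):
    """Return columns whose missing fraction exceeds `threshold` (assumed cutoff)."""
    return [col for col, n in counts.items() if n / total > threshold]

to_drop = columns_to_drop(missing_counts, n_rows)
print(to_drop)
```

With the assumed cutoff of 0.9, only `vehicles` (97.6% missing) is selected. A stricter cutoff such as 0.7 would also select `professions` (72.8% missing).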

In [7]:
# Handle NaN values: Fill NaN values with a default string
data.fillna('Unknown', inplace=True)
data.isna().sum()

id             0
story          0
genre          0
characters     0
objects        0
locations      0
professions    0
emotions       0
dtype: int64

## 2. Loading the Data for Fine-Tuning

Let's concatenate several columns of the dataset into a single text string, which will serve as the input for fine-tuning the model. The idea is to create a rich, informative prompt that covers multiple aspects of the story: characters, objects, locations, professions, and emotions.

The columns used:

- data['story']: Contains the main story text.
- data['characters']: Contains characters mentioned in the story.
- data['objects']: Contains objects referenced in the story.
- data['locations']: Contains locations where the story takes place.
- data['professions']: Contains professions of characters in the story.
- data['emotions']: Contains emotions depicted in the story.
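
For a single record, the concatenation described above amounts to joining labelled fields in a fixed order. The following is a minimal sketch in plain Python; `build_prompt` and the sample `row` are hypothetical and only illustrate the resulting format.

```python
# Field labels and column names, in the order used for the prompt
FIELDS = [
    ("Story", "story"), ("Characters", "characters"), ("Objects", "objects"),
    ("Locations", "locations"), ("Professions", "professions"), ("Emotions", "emotions"),
]

def build_prompt(row):
    """Join 'Label: value' pairs with single spaces (hypothetical helper)."""
    return " ".join(f"{label}: {row[col]}" for label, col in FIELDS)

# Made-up record; 'Unknown' stands in for a value that was missing
row = {
    "story": "The sun had set.", "characters": "scientist", "objects": "window",
    "locations": "city", "professions": "Unknown", "emotions": "fear",
}
prompt = build_prompt(row)
print(prompt)
```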

In [8]:
from datasets import Dataset

# Combine relevant columns to create a rich input prompt
data['text'] = 'Story: ' + data['story'] + ' Characters: ' + data['characters'] + \
               ' Objects: ' + data['objects'] + ' Locations: ' + data['locations'] + \
               ' Professions: ' + data['professions'] + \
               ' Emotions: ' + data['emotions']

# Convert to Hugging Face Dataset
dataset = Dataset.from_pandas(data[['text']])

In [14]:
# Print one sample to verify
print("Sample concatenated text:")
print(data.loc[0, 'text'])

Sample concatenated text:
Story: In the year 2250, Earth had made significant strides in space exploration and interstellar travel. The United Earth Government (UEG) had established colonies on Mars, Jupiter's moon Europa, and Saturn's moon Titan. The advancements in technology and science had led to the creation of the Cosmic Rift Exploration Agency (CREA), a government-funded organization tasked with exploring the unknown regions of space and discovering new worlds and resources.

    Dr. Amelia Hart, a brilliant astrophysicist, was the lead scientist at CREA's headquarters on Luna. She had devoted her entire life to understanding the mysteries of the universe and had become a pioneer in her field. She was determined to uncover the secrets of the cosmic rifts, a series of mysterious and seemingly unconnected energy anomalies that had started appearing throughout the galaxy.

    Dr. Hart assembled a diverse team of experts for her next mission, including her trusted second-in-command
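
Section 4 evaluates on the same examples it trains on, so it may be useful to set aside a validation subset at this point. The sketch below is one possible approach in plain Python: it shuffles row indices with a fixed seed and holds out 10% of them. The function name, the 10% fraction, and the seed are assumptions. The resulting index lists could then be passed to `dataset.select(...)`, or the `datasets` library's own `Dataset.train_test_split` could be used instead.

```python
import random

def split_indices(n_rows, eval_fraction=0.1, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction for evaluation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_eval = int(n_rows * eval_fraction)
    # The first n_eval shuffled indices form the evaluation set
    return indices[n_eval:], indices[:n_eval]

train_idx, eval_idx = split_indices(1000)
print(len(train_idx), len(eval_idx))
```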

## 3. Tokenizing the Data

Tokenizing the text is a crucial step in preparing it for model training, especially for transformer models like T5.

Why Tokenize the Data?

- Converting Text to Numerical Format: Machine learning models, particularly neural networks, require numerical input. Tokenization converts text into a sequence of numbers (token IDs) that the model can process.
- Handling Vocabulary: Tokenization breaks down text into smaller units (tokens), such as words or subwords, and maps each token to a unique ID in the model's vocabulary. This helps the model understand and generate text.
- Managing Input Length: Tokenization ensures that text inputs are appropriately truncated or padded to a fixed length. This uniformity is essential for batch processing in model training.
- Preserving Meaning: Advanced tokenizers (like the one used for T5) often use subword units, which helps in handling out-of-vocabulary words and preserving the semantic meaning of the text.

What Does Tokenization Involve?

- Splitting Text into Tokens: The text is split into smaller units (tokens), which can be words, subwords, or characters.
- Mapping Tokens to IDs: Each token is mapped to a unique ID in the model's vocabulary.
- Truncating or Padding Sequences: The tokenized sequences are truncated to a maximum length if they are too long, or padded with special tokens if they are too short. This ensures all sequences in a batch have the same length.
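
The three steps above can be sketched with a toy whitespace tokenizer. This is only an illustration: T5's actual tokenizer uses a learned subword vocabulary, and the vocabulary, the `<pad>` and `<unk>` IDs, and the `encode` helper below are all made up for the example.

```python
# Toy vocabulary (hypothetical); real tokenizers learn theirs from data
VOCAB = {"<pad>": 0, "<unk>": 1, "the": 2, "sun": 3, "had": 4, "set": 5}

def encode(text, max_length):
    """Split on whitespace, map tokens to IDs, then truncate or pad."""
    tokens = text.lower().split()                             # 1. split into tokens
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]  # 2. map tokens to IDs
    ids = ids[:max_length]                                    # 3a. truncate if too long
    ids += [VOCAB["<pad>"]] * (max_length - len(ids))         # 3b. pad if too short
    return ids

short = encode("The sun had set", max_length=6)
long_ = encode("The sun had set over the city", max_length=6)
print(short)
print(long_)
```

Both calls return lists of length 6: the short sentence is padded with the `<pad>` ID, while the long one is cut off and its unknown words map to the `<unk>` ID.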

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')

def preprocess_function(examples):
    # Tokenize once, truncating or padding every example to 512 tokens
    model_inputs = tokenizer(examples['text'], max_length=512, truncation=True, padding='max_length')
    # The same text serves as the target, so the labels are the input IDs
    model_inputs["labels"] = model_inputs["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)




Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
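
Because the labels above are padded to 512 tokens, the padding positions contribute to the training loss. A common practice with the Hugging Face `Trainer` is to replace padding IDs in the labels with -100, which the loss function ignores. The sketch below shows the idea on plain lists. `mask_padding` is a hypothetical helper, and the pad ID of 0 is an assumption that matches the `pad_token_id` shown in the model configuration printed in Section 5.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id=0):
    """Replace padding IDs with IGNORE_INDEX so they do not count toward the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Made-up, already padded label sequences
labels = [[8483, 10, 86, 1, 0, 0], [37, 1, 0, 0, 0, 0]]
masked = mask_padding(labels)
print(masked)
```

Inside `preprocess_function`, a step like this could be applied to the label IDs before `model_inputs` is returned. The loss values would then no longer be comparable to the ones logged in this notebook.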

In [20]:
# Visualize a few samples of the tokenized data
for i in range(3):  # Change the range to see more or fewer samples
    print(f"Sample {i+1}:")
    print("Tokenized Input IDs:", tokenized_datasets[i]['input_ids'])
    print("Tokenized Labels IDs:", tokenized_datasets[i]['labels'])
    print("")

Sample 1:
Tokenized Input IDs: [8483, 10, 86, 8, 215, 204, 11434, 6, 4030, 141, 263, 1516, 5765, 9361, 16, 628, 9740, 11, 1413, 7, 6714, 291, 1111, 5, 37, 907, 4030, 3141, 41, 5078, 517, 61, 141, 2127, 27200, 30, 11856, 6, 24308, 31, 7, 8114, 5578, 6, 11, 24037, 31, 7, 8114, 13622, 5, 37, 14500, 7, 16, 748, 11, 2056, 141, 2237, 12, 8, 3409, 13, 8, 638, 7, 3113, 391, 99, 17, 19746, 2661, 7038, 41, 254, 13223, 201, 3, 9, 789, 18, 18532, 1470, 3, 17, 23552, 28, 6990, 8, 7752, 6266, 13, 628, 11, 17452, 126, 296, 7, 11, 1438, 5, 707, 5, 736, 13240, 10498, 6, 3, 9, 6077, 38, 17, 29006, 7, 447, 343, 6, 47, 8, 991, 17901, 44, 205, 13223, 31, 7, 13767, 30, 17687, 5, 451, 141, 3, 12895, 160, 1297, 280, 12, 1705, 8, 29063, 13, 8, 8084, 11, 141, 582, 3, 9, 11200, 16, 160, 1057, 5, 451, 47, 4187, 12, 19019, 8, 13951, 13, 8, 28332, 3, 22722, 7, 6, 3, 9, 939, 13, 15124, 11, 13045, 73, 19386, 827, 23236, 725, 24, 141, 708, 16069, 1019, 8, 24856, 5, 707, 5, 10498, 17583, 3, 9, 2399, 372, 13, 2273, 21, 

## 4. Fine-Tune the Model

In [23]:
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the pre-trained T5 model
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,  # Usually, you would have a separate validation set
)
# Fine-tune the model
trainer.train()


The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: text. If text are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 60506624

The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: text. If text are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4


  0%|          | 0/250 [00:00<?, ?it/s]

{'eval_loss': 0.14797264337539673, 'eval_runtime': 373.5232, 'eval_samples_per_second': 2.677, 'eval_steps_per_second': 0.669, 'epoch': 1.0}
{'loss': 0.861, 'learning_rate': 6.666666666666667e-06, 'epoch': 2.0}



{'eval_loss': 0.08196964859962463, 'eval_runtime': 325.6256, 'eval_samples_per_second': 3.071, 'eval_steps_per_second': 0.768, 'epoch': 2.0}



{'eval_loss': 0.06718628108501434, 'eval_runtime': 325.9198, 'eval_samples_per_second': 3.068, 'eval_steps_per_second': 0.767, 'epoch': 3.0}




Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 6186.7876, 'train_samples_per_second': 0.485, 'train_steps_per_second': 0.121, 'train_loss': 0.6508353678385417, 'epoch': 3.0}


TrainOutput(global_step=750, training_loss=0.6508353678385417, metrics={'train_runtime': 6186.7876, 'train_samples_per_second': 0.485, 'train_steps_per_second': 0.121, 'train_loss': 0.6508353678385417, 'epoch': 3.0})
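
The step counts and the learning rate in the log can be reproduced by hand. The sketch below assumes the `Trainer` defaults of a linear learning-rate decay with no warmup and a training-loss log every 500 steps; neither is set explicitly in the arguments above.

```python
import math

num_examples, batch_size, epochs = 1000, 4, 3   # from the training log
base_lr = 2e-5                                  # learning_rate in TrainingArguments

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr(step):
    """Learning rate after `step` updates under linear decay to zero (assumed schedule)."""
    return base_lr * (1 - step / total_steps)

print(steps_per_epoch, total_steps)  # 250 batches per epoch, 750 optimization steps
print(linear_lr(500))                # close to the 6.67e-06 logged at epoch 2.0
```

Under these assumptions, the single training-loss entry in the log corresponds to step 500, which is the end of the second epoch.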

Let's save the fine-tuned model.

In [24]:
model.save_pretrained('../model/fine-tuned-t5')
tokenizer.save_pretrained('../model/fine-tuned-t5')


Configuration saved in ../model/fine-tuned-t5\config.json
Configuration saved in ../model/fine-tuned-t5\generation_config.json
Model weights saved in ../model/fine-tuned-t5\pytorch_model.bin
tokenizer config file saved in ../model/fine-tuned-t5\tokenizer_config.json
Special tokens file saved in ../model/fine-tuned-t5\special_tokens_map.json


('../model/fine-tuned-t5\\tokenizer_config.json',
 '../model/fine-tuned-t5\\special_tokens_map.json',
 '../model/fine-tuned-t5\\tokenizer.json')

## 5. Generate Stories with the Fine-Tuned Model

In [29]:
from transformers import pipeline

# Load the fine-tuned model and tokenizer
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained('../model/fine-tuned-t5')
fine_tuned_tokenizer = AutoTokenizer.from_pretrained('../model/fine-tuned-t5')

# Create a text generation pipeline
generator = pipeline('text2text-generation', model=fine_tuned_model, tokenizer=fine_tuned_tokenizer)

# Generate a new story
prompt = "Story: The sun had set. Characters: scientist. Objects: window. Locations: city. Professions: unknown. Emotions: fear."
generated_text = generator(prompt, max_length=512, num_return_sequences=1)

print(generated_text[0]['generated_text'])


loading configuration file ../model/fine-tuned-t5\config.json
Model config T5Config {
  "_name_or_path": "../model/fine-tuned-t5",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {


Story: The sun had set. Characters: scientist. Objects: window. Locations: city. Professions: unknown. Emotions: fear.
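
The generated text above repeats the prompt, which is consistent with the training setup, where the labels are identical to the inputs. One possible alternative, sketched here under that reading, is to train on pairs whose source holds only the extracted features and whose target is the story, so that the model learns to produce a story from the features. `make_pair` and the sample record are hypothetical and not part of the original notebook.

```python
def make_pair(row):
    """Build a (source, target) pair: features in, story out (hypothetical helper)."""
    source = (
        f"Characters: {row['characters']} Objects: {row['objects']} "
        f"Locations: {row['locations']} Professions: {row['professions']} "
        f"Emotions: {row['emotions']}"
    )
    return source, row["story"]

# Made-up record for illustration
row = {
    "story": "The sun had set.", "characters": "scientist", "objects": "window",
    "locations": "city", "professions": "Unknown", "emotions": "fear",
}
source, target = make_pair(row)
print(source)
print(target)
```

In such a setup, `source` would be tokenized as the model input and `target` as the labels.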
