# ADVANCED MODEL DEVELOPMENT (SEQ2SEQ WITH T5)
This notebook trains and evaluates two T5 models for spell correction.

### Technical Rationale: Choosing T5 over BERT for Spell Correction
While the assignment brief mentions exploring "BERT-based approaches," I deliberately chose the T5 (Text-to-Text Transfer Transformer) model. This decision is based on fundamental architectural differences that make T5 more suitable for the spell correction task.

#### Understanding the Architectures
BERT (Bidirectional Encoder Representations from Transformers) is an Encoder-Only model that excels at understanding and classifying text. Its strength lies in tasks like sentiment analysis, Named Entity Recognition (NER), and question answering.

T5 (Text-to-Text Transfer Transformer) is an Encoder-Decoder model designed to transform an input text sequence into a new output sequence. This makes it a natural fit for sequence-to-sequence (Seq2Seq) tasks such as translation, summarization, and text correction.

#### Why T5 Is the Superior Choice for This Task
Spell correction is fundamentally a sequence-to-sequence task: a misspelled sentence goes in and a corrected sentence comes out, and the two can differ in length (for example, "health care" becoming "healthcare"). T5's encoder-decoder architecture is built for this paradigm, so it can be fine-tuned on sentence pairs directly, whereas an encoder-only model like BERT would need an additional decoding scheme to emit corrected text. This makes T5 the more direct fit for the job.

#### Conclusion: The Right Tool for the Right Job
Our choice of T5 is a strategic one, selecting the best tool for this seq2seq challenge. We refine this strategy by comparing a general-purpose T5 model (`t5-base`) with a domain-specific one (`razent/SciFive-base-Pubmed_PMC`). This approach allows us to measure the performance gain from leveraging specialized medical vocabulary, which was the core challenge identified in our initial EDA.

In [3]:
# Import libraries
import pandas as pd
import torch
import spacy
import numpy as np
import ast
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from datasets import Dataset
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)

# Check for GPU availability (torch is already imported above)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")


Using device: cuda


### Data Loading and Preparation

In this cell, we load the pre-split datasets created in the first notebook, preprocess the text by adding the task prefix that tells T5 which task to perform, and convert the data into the Hugging Face `Dataset` format.

In [4]:
# Load the preprocessed datasets
try:
    train_df = pd.read_csv('../dataset/train.csv')
    val_df = pd.read_csv('../dataset/validation.csv')
except FileNotFoundError:
    print("Error: Make sure 'train.csv' and 'validation.csv' are in the correct directory.")
    # Create empty dataframes to avoid further errors
    train_df = pd.DataFrame()
    val_df = pd.DataFrame()

# Safely convert the 'target_nouns' column from its string representation back to a list
if 'target_nouns' in val_df.columns:
    val_df['target_nouns'] = val_df['target_nouns'].fillna('[]').apply(ast.literal_eval)
else:
    print("Warning: 'target_nouns' column not found. Noun accuracy cannot be calculated.")

# Define the prefix for our spell correction task
prefix = "correct the spelling: "

def preprocess_data(df):
    """Prepares the DataFrame for T5 by adding the prefix and selecting columns."""
    if df.empty: return df
    df['incorrect_cleaned_advanced'] = df['incorrect_cleaned_advanced'].astype(str)
    df['correct_cleaned_advanced'] = df['correct_cleaned_advanced'].astype(str)
    df['input_text'] = prefix + df['incorrect_cleaned_advanced']
    df['target_text'] = df['correct_cleaned_advanced']
    return df

train_df = preprocess_data(train_df)
val_df = preprocess_data(val_df)

# Convert pandas DataFrames to Hugging Face Dataset objects
if not train_df.empty:
    train_dataset = Dataset.from_pandas(train_df[['input_text', 'target_text']])
    val_dataset = Dataset.from_pandas(val_df[['input_text', 'target_text']])
    print("Data prepared and converted to Hugging Face Dataset objects.")
    print("\nSample from training data:")
    print(train_dataset[0])

Data prepared and converted to Hugging Face Dataset objects.

Sample from training data:
{'input_text': 'correct the spelling: make sure to consult your health care provider before starting a new medication like aerifix tab', 'target_text': 'make sure to consult your healthcare provider before starting a new medication like arifix tab'}
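
The `target_nouns` conversion in the cell above relies on `ast.literal_eval` because a list written to CSV is read back as its string representation. A minimal sketch of that round trip follows; the cell value is made up for illustration and is not taken from the dataset.

```python
import ast

# Hypothetical cell value, as it would look after pd.read_csv.
raw_cell = "['arifix', 'tab']"

# literal_eval only parses Python literals (lists, strings, numbers, ...),
# so unlike eval() it does not execute arbitrary code found in the file.
nouns = ast.literal_eval(raw_cell)

# Missing cells are filled with the string '[]' first, which parses to an empty list.
empty = ast.literal_eval("[]")

print(type(nouns).__name__, nouns)
print(empty)
```

Filling missing values with `'[]'` before parsing, as the notebook does, matters because `literal_eval` raises an error on a float `NaN`.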


### Evaluation Metrics Setup

This cell defines the three functions used to evaluate our models:
1. Word Accuracy: Percentage of correctly predicted words.
2. Noun Accuracy: Percentage of target medical nouns found in the prediction.
3. BLEU Score: A measure of similarity between the predicted and true sentences.

In [5]:
nlp = spacy.load("en_core_web_sm")

def calculate_word_accuracy(df, true_col, pred_col):
    """Calculates the percentage of correctly predicted words."""
    total_words, correct_words = 0, 0
    for _, row in df.iterrows():
        true_words = str(row[true_col]).split()
        pred_words = str(row[pred_col]).split()
        for i in range(min(len(true_words), len(pred_words))):
            if true_words[i] == pred_words[i]:
                correct_words += 1
        total_words += len(true_words)
    return (correct_words / total_words) * 100 if total_words > 0 else 0

def calculate_noun_accuracy(df, true_nouns_col, pred_sent_col):
    """Calculates the percentage of target nouns found in the predicted sentence."""
    if true_nouns_col not in df.columns: return 0.0
    total_true_nouns, correctly_predicted_nouns = 0, 0
    for _, row in df.iterrows():
        true_nouns = set(row[true_nouns_col])
        pred_doc = nlp(str(row[pred_sent_col]))
        pred_nouns = set([token.text for token in pred_doc if token.pos_ in ('NOUN', 'PROPN')])
        correctly_predicted_nouns += len(true_nouns.intersection(pred_nouns))
        total_true_nouns += len(true_nouns)
    return (correctly_predicted_nouns / total_true_nouns) * 100 if total_true_nouns > 0 else 0

def calculate_bleu_score(df, true_col, pred_col):
    """Calculates the average BLEU score for the predictions."""
    bleu_scores = []
    chencherry = SmoothingFunction()
    for _, row in df.iterrows():
        reference = [str(row[true_col]).split()]
        candidate = str(row[pred_col]).split()
        score = sentence_bleu(reference, candidate, smoothing_function=chencherry.method1) if candidate else 0
        bleu_scores.append(score)
    return np.mean(bleu_scores) * 100 if bleu_scores else 0

print("Evaluation functions defined.")

Evaluation functions defined.
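
One property of the word accuracy metric is worth seeing on a concrete pair: because it compares words position by position, a single inserted or merged word shifts everything after it. The sketch below applies the same positional logic to the training sample shown earlier, pretending the model returned the input unchanged, and contrasts it with an alignment-based count. `aligned_word_accuracy` is a hypothetical variant built on `difflib.SequenceMatcher`; it is not the metric used in this notebook.

```python
from difflib import SequenceMatcher

def positional_word_accuracy(true_sentence, pred_sentence):
    """Same idea as calculate_word_accuracy, for a single sentence pair."""
    true_words, pred_words = true_sentence.split(), pred_sentence.split()
    correct = sum(t == p for t, p in zip(true_words, pred_words))
    return 100 * correct / len(true_words) if true_words else 0.0

def aligned_word_accuracy(true_sentence, pred_sentence):
    """Hypothetical variant: align the two word lists before counting matches."""
    true_words, pred_words = true_sentence.split(), pred_sentence.split()
    matcher = SequenceMatcher(a=true_words, b=pred_words, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return 100 * correct / len(true_words) if true_words else 0.0

target = ("make sure to consult your healthcare provider before "
          "starting a new medication like arifix tab")
# Assumption for illustration: the model copies its input without correcting it.
unchanged = ("make sure to consult your health care provider before "
             "starting a new medication like aerifix tab")

print(f"{positional_word_accuracy(target, unchanged):.2f}")  # 33.33
print(f"{aligned_word_accuracy(target, unchanged):.2f}")     # 86.67
```

Only two of the fifteen target words are actually wrong here, yet the positional score drops to a third because "health care" occupies two slots. This suggests reading the reported word accuracy as a conservative figure.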


#### Model Training Function

- This function encapsulates the entire training process for a T5 model.
- It handles tokenization, setting up the trainer, and running the fine-tuning process.
- After training, it saves the final model to a specified directory.

In [6]:
def train_model(model_checkpoint, output_dir, train_ds, val_ds):
    """Handles the tokenization and training of a T5 model."""
    print(f"\n{'='*30}\nStarting Training for: {model_checkpoint}\n{'='*30}")
    
    tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)
    
    def tokenize_function(examples):
        """Tokenizes inputs and targets, padding both to 128 tokens."""
        model_inputs = tokenizer(examples['input_text'], max_length=128, truncation=True, padding="max_length")
        # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
        labels = tokenizer(text_target=examples['target_text'], max_length=128, truncation=True, padding="max_length")
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
        
    tokenized_train_ds = train_ds.map(tokenize_function, batched=True)
    tokenized_val_ds = val_ds.map(tokenize_function, batched=True)
    print(f"Tokenization complete for {model_checkpoint}.")

    model = T5ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir, evaluation_strategy="epoch", learning_rate=2e-5,
        per_device_train_batch_size=16, per_device_eval_batch_size=16,
        weight_decay=0.01, save_total_limit=3, num_train_epochs=5,
        predict_with_generate=True, fp16=(device == "cuda"), push_to_hub=False,
    )
    
    trainer = Seq2SeqTrainer(
        model=model, args=training_args, train_dataset=tokenized_train_ds,
        eval_dataset=tokenized_val_ds, tokenizer=tokenizer, data_collator=data_collator,
    )
    
    print(f"Fine-tuning {model_checkpoint}...")
    trainer.train()
    print("Training complete.")

    final_model_path = f"{output_dir}/final_model"
    trainer.save_model(final_model_path)
    print(f"Final model saved to '{final_model_path}'")

print("Model training function defined.")

Model training function defined.
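
A detail of the tokenization step: the labels are padded to `max_length` by the tokenizer, so padding positions carry the pad token id rather than the `-100` value that PyTorch's cross-entropy loss ignores by default. A common convention is to replace pad ids in the labels with `-100` so that padding does not contribute to the loss. The sketch below shows that replacement on plain lists; `mask_padding` is a hypothetical helper, the token ids are made up, and the runs logged in this notebook did not use it, so applying it would change the reported loss values.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default
T5_PAD_ID = 0        # pad token id of the T5 tokenizer

def mask_padding(label_ids, pad_id=T5_PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids in each label sequence with the ignore index."""
    return [
        [ignore_index if token == pad_id else token for token in sequence]
        for sequence in label_ids
    ]

# Made-up label ids: real tokens, the end-of-sequence id (1), then padding (0).
padded_labels = [[363, 19, 8, 1, 0, 0], [27, 1, 0, 0, 0, 0]]

print(mask_padding(padded_labels))
# [[363, 19, 8, 1, -100, -100], [27, 1, -100, -100, -100, -100]]
```

An alternative with the same effect is to tokenize without `padding="max_length"` and let `DataCollatorForSeq2Seq` pad each batch, since it pads labels with `-100` by default.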


### Train the General-Purpose `t5-base` Model

- We now call our training function to fine-tune the standard `t5-base` model.
- This will serve as our second baseline, against which we can compare the domain-specific model.

In [24]:
if not train_df.empty:
    train_model(
        model_checkpoint="t5-base",
        output_dir="../models/t5_base_results",
        train_ds=train_dataset,
        val_ds=val_dataset
    )
else:
    print("Skipping training as data is not loaded.")


Starting Training for: t5-base


Map:   0%|          | 0/7000 [00:00<?, ? examples/s]



Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Tokenization complete for t5-base.



Fine-tuning t5-base...


  0%|          | 0/2190 [00:00<?, ?it/s]

  0%|          | 0/94 [00:00<?, ?it/s]

{'eval_loss': 0.11156893521547318, 'eval_runtime': 27.7351, 'eval_samples_per_second': 54.083, 'eval_steps_per_second': 3.389, 'epoch': 1.0}
{'loss': 0.8416, 'grad_norm': 0.24801267683506012, 'learning_rate': 1.5470319634703196e-05, 'epoch': 1.14}


{'eval_loss': 0.10265593975782394, 'eval_runtime': 28.0595, 'eval_samples_per_second': 53.458, 'eval_steps_per_second': 3.35, 'epoch': 2.0}
{'loss': 0.1158, 'grad_norm': 0.25450220704078674, 'learning_rate': 1.091324200913242e-05, 'epoch': 2.28}


{'eval_loss': 0.099215567111969, 'eval_runtime': 27.8035, 'eval_samples_per_second': 53.95, 'eval_steps_per_second': 3.381, 'epoch': 3.0}
{'loss': 0.1089, 'grad_norm': 0.21131180226802826, 'learning_rate': 6.3470319634703205e-06, 'epoch': 3.42}


{'eval_loss': 0.09739189594984055, 'eval_runtime': 21.2333, 'eval_samples_per_second': 70.644, 'eval_steps_per_second': 4.427, 'epoch': 4.0}
{'loss': 0.1052, 'grad_norm': 0.22745051980018616, 'learning_rate': 1.7808219178082193e-06, 'epoch': 4.57}


{'eval_loss': 0.09657690674066544, 'eval_runtime': 21.2268, 'eval_samples_per_second': 70.665, 'eval_steps_per_second': 4.428, 'epoch': 5.0}
{'train_runtime': 2762.5045, 'train_samples_per_second': 12.67, 'train_steps_per_second': 0.793, 'train_loss': 0.2765340021211807, 'epoch': 5.0}
Training complete.
Final model saved to '../models/t5_base_results/final_model'
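
The step counts in the training output follow directly from the dataset sizes and batch size. The short calculation below reproduces them, assuming a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500 (none of these are set explicitly in the training arguments above).

```python
import math

train_examples, val_examples = 7000, 1500  # sizes reported by the Map progress bars
batch_size, epochs = 16, 5                 # from Seq2SeqTrainingArguments

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(val_examples / batch_size)

# Training loss is logged every 500 optimizer steps under the assumed default.
first_log_epoch = 500 / steps_per_epoch

print(steps_per_epoch, total_steps, eval_batches, round(first_log_epoch, 2))
# 438 2190 94 1.14
```

This matches the 2190 training steps and 94 evaluation batches in the output, and explains why the training loss entries appear at fractional epochs such as 1.14 and 2.28 rather than at epoch boundaries.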


### Train the Domain-Specific `SciFive-base-Pubmed_PMC` Model

- Next, we train `razent/SciFive-base-Pubmed_PMC`, a T5 model pre-trained on biomedical text (PubMed and PMC, as its name indicates).
- We expect this model to perform better because of its exposure to specialized medical vocabulary.

In [11]:
from huggingface_hub import login
login()

In [7]:
if not train_df.empty:
    train_model(
        model_checkpoint="razent/SciFive-base-Pubmed_PMC",
        output_dir="../models/SciFive-base-Pubmed_PMC",
        train_ds=train_dataset,
        val_ds=val_dataset
    )
else:
    print("Skipping training as data is not loaded.")


Starting Training for: razent/SciFive-base-Pubmed_PMC


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Map:   0%|          | 0/7000 [00:00<?, ? examples/s]



Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Tokenization complete for razent/SciFive-base-Pubmed_PMC.



Fine-tuning razent/SciFive-base-Pubmed_PMC...


  0%|          | 0/2190 [00:00<?, ?it/s]

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


  0%|          | 0/94 [00:00<?, ?it/s]

{'eval_loss': 0.10813253372907639, 'eval_runtime': 14.1481, 'eval_samples_per_second': 106.021, 'eval_steps_per_second': 6.644, 'epoch': 1.0}
{'loss': 1.4578, 'grad_norm': 0.2157183140516281, 'learning_rate': 1.5461187214611872e-05, 'epoch': 1.14}


{'eval_loss': 0.09798427671194077, 'eval_runtime': 9.9714, 'eval_samples_per_second': 150.43, 'eval_steps_per_second': 9.427, 'epoch': 2.0}
{'loss': 0.1136, 'grad_norm': 0.2832963764667511, 'learning_rate': 1.0894977168949771e-05, 'epoch': 2.28}


{'eval_loss': 0.09384094178676605, 'eval_runtime': 21.5322, 'eval_samples_per_second': 69.663, 'eval_steps_per_second': 4.366, 'epoch': 3.0}
{'loss': 0.106, 'grad_norm': 0.17183993756771088, 'learning_rate': 6.328767123287672e-06, 'epoch': 3.42}


{'eval_loss': 0.09217946231365204, 'eval_runtime': 23.4385, 'eval_samples_per_second': 63.997, 'eval_steps_per_second': 4.011, 'epoch': 4.0}
{'loss': 0.1026, 'grad_norm': 0.19093696773052216, 'learning_rate': 1.762557077625571e-06, 'epoch': 4.57}


{'eval_loss': 0.0914832353591919, 'eval_runtime': 21.3391, 'eval_samples_per_second': 70.293, 'eval_steps_per_second': 4.405, 'epoch': 5.0}
{'train_runtime': 4196.5073, 'train_samples_per_second': 8.34, 'train_steps_per_second': 0.522, 'train_loss': 0.41514865927500266, 'epoch': 5.0}
Training complete.
Final model saved to '../models/SciFive-base-Pubmed_PMC/final_model'


#### Inference Function
- This function loads a fine-tuned model from a given path and uses it to generate predictions (inference) for every sentence in the validation set.
- We then run it for both the `t5-base` model and the domain-specific SciFive model (labelled `BioMed-T5` in the results) and store the predictions in new columns of the validation DataFrame.

In [10]:
from tqdm.notebook import tqdm
def run_inference(model_path, validation_df):
    """Loads a saved model and generates predictions."""
    print(f"\n{'='*30}\nRunning Inference for model at: {model_path}\n{'='*30}")
    tokenizer = T5Tokenizer.from_pretrained(model_path)
    model = T5ForConditionalGeneration.from_pretrained(model_path).to(device)
    
    predictions = []
    for text in tqdm(validation_df['input_text']):
        inputs = tokenizer(text, return_tensors="pt").input_ids.to(device)
        outputs = model.generate(inputs, max_length=128, num_beams=8, early_stopping=True)
        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
        predictions.append(prediction)
        
    print(f"Inference complete for {model_path}.")
    return predictions

if not val_df.empty:
    # Get predictions from the general-purpose model
    t5_base_predictions = run_inference(
        model_path="../models/t5_base_results/final_model",
        validation_df=val_df
    )
    val_df['t5_base_predicted'] = t5_base_predictions

    # Get predictions from the domain-specific model
    biomed_t5_predictions = run_inference(
        model_path="../models/SciFive-base-Pubmed_PMC/final_model",
        validation_df=val_df
    )
    val_df['biomed_t5_predicted'] = biomed_t5_predictions
    
    print("\nPredictions from both models have been added to the validation DataFrame.")
else:
    print("Skipping inference as validation data is not loaded.")



Running Inference for model at: ../models/t5_base_results/final_model


  0%|          | 0/1500 [00:00<?, ?it/s]

Inference complete for ../models/t5_base_results/final_model.

Running Inference for model at: ../models/SciFive-base-Pubmed_PMC/final_model


  0%|          | 0/1500 [00:00<?, ?it/s]

Inference complete for ../models/SciFive-base-Pubmed_PMC/final_model.

Predictions from both models have been added to the validation DataFrame.
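
The inference loop above generates one sentence at a time. If runtime becomes a concern, the inputs can be grouped into batches so that each `generate` call handles several sentences. The following is a minimal sketch under that assumption: `chunked` and `run_batched_inference` are hypothetical helpers, the batch size of 16 is arbitrary, and the function expects an already loaded model and tokenizer. Only the chunking helper is exercised here; the batched generation path has not been run against the models in this notebook.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def run_batched_inference(model, tokenizer, texts, device, batch_size=16):
    """Hypothetical batched alternative to the per-sentence loop."""
    predictions = []
    for batch in chunked(list(texts), batch_size):
        # Pad to the longest sentence in the batch and truncate very long inputs.
        encoded = tokenizer(batch, return_tensors="pt", padding=True,
                            truncation=True, max_length=128).to(device)
        # The attention mask in `encoded` tells the model which positions are padding.
        outputs = model.generate(**encoded, max_length=128,
                                 num_beams=8, early_stopping=True)
        predictions.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return predictions

# The chunking helper can be checked without loading a model.
print(list(chunked(["a", "b", "c", "d", "e"], 2)))
# [['a', 'b'], ['c', 'd'], ['e']]
```

Predictions are returned in input order, so the result can be assigned to a DataFrame column exactly as in the cell above.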


#### Calculate and Display Final Metrics

This is the final step. We calculate all evaluation metrics for both models
and display them in a clear summary table for easy comparison. The final
DataFrame with all predictions is then saved to a CSV file.

In [11]:
if 't5_base_predicted' in val_df.columns:
    models_to_evaluate = {
        "T5-Base": "t5_base_predicted",
        "BioMed-T5": "biomed_t5_predicted"
    }
    
    results = []
    for model_name, pred_col in models_to_evaluate.items():
        word_acc = calculate_word_accuracy(val_df, 'correct_cleaned_advanced', pred_col)
        noun_acc = calculate_noun_accuracy(val_df, 'target_nouns', pred_col)
        bleu = calculate_bleu_score(val_df, 'correct_cleaned_advanced', pred_col)
        results.append({
            "Model": model_name,
            "Word Accuracy (%)": f"{word_acc:.2f}",
            "Noun Accuracy (%)": f"{noun_acc:.2f}",
            "BLEU Score": f"{bleu:.2f}"
        })

    results_df = pd.DataFrame(results)
    print(f"\n\n{'='*40}\n           PERFORMANCE SUMMARY\n{'='*40}")
    print(results_df.to_string(index=False))
    
    output_csv_path = 'validation_with_all_predictions.csv'
    val_df.to_csv(output_csv_path, index=False)
    print(f"\nFinal predictions saved to '{output_csv_path}'")
    
    print("\nDisplaying a few sample predictions from both models:")
    display(val_df[['correct_cleaned_advanced', 't5_base_predicted', 'biomed_t5_predicted']].head())
else:
    print("Skipping evaluation as predictions were not generated.")



           PERFORMANCE SUMMARY
    Model Word Accuracy (%) Noun Accuracy (%) BLEU Score
  T5-Base             77.16             73.32      81.62
BioMed-T5             77.83             73.63      81.86

Final predictions saved to 'validation_with_all_predictions.csv'

Displaying a few sample predictions from both models:


| | correct_cleaned_advanced | t5_base_predicted | biomed_t5_predicted |
|---|---|---|---|
| 0 | patients are recommended to consult their heal... | patients are recommended to consult their heal... | patients are recommended to consult their heal... |
| 1 | amydio forte is a powerful medication often us... | imidio forte is a powerful medication often us... | imidio forte is a powerful medication often us... |
| 2 | provera 40 contains medroxyprogesterone acetat... | provera-40 contains medrexiprogesterone acetat... | provera-40 contains medroxyprogesterone acetat... |
| 3 | ceemi-o is a popular over-the-counter medicati... | simi-oh is a popular over-the-counter medicati... | simi-oh is a popular over-the-counter medicati... |
| 4 | l-arginine sr is a popular supplement known fo... | lrg9-sr is a popular supplement known for its ... | lrg9-sr is a popular supplement known for its ... |
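
The gaps between the two models in the performance summary are all under one point, so it is reasonable to ask whether they exceed sampling noise on a validation set of this size. One option, sketched below, is a paired bootstrap over per-sentence scores. This is not part of the original notebook: `paired_bootstrap_ci` is a hypothetical helper, and the example scores are invented placeholders (for instance, 1 if a sentence was corrected exactly and 0 otherwise), not results from these models.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_b) - mean(scores_a).

    Both lists must score the same sentences in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample sentences with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Invented per-sentence scores for two systems on the same eight sentences.
baseline_scores = [1, 0, 1, 1, 0, 1, 0, 1]
domain_scores = [1, 1, 1, 1, 0, 1, 0, 1]
print(paired_bootstrap_ci(baseline_scores, domain_scores))
```

If the resulting interval contains zero, the observed difference is compatible with noise at the chosen level; an interval entirely above zero would support the claim that the domain-specific model is better.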
