T5-Small Baseline: Validation Test without Fine-Tuning


This section runs the validation dataset through the pre-trained T5-Small model without any fine-tuning to measure its out-of-the-box performance. The resulting scores serve as a baseline against which fine-tuned models can be compared when generating SMART goals.
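
As a rough illustration of how such a baseline might be used later, the sketch below computes per-metric differences between a baseline run and a fine-tuned run. The function name `compare_to_baseline`, the dictionary layout, and all numeric values are hypothetical placeholders chosen for illustration; they are not results from this notebook. For perplexity a lower value is better, so a negative difference would indicate an improvement.

```python
def compare_to_baseline(baseline, candidate):
    """Return candidate minus baseline for every metric present in both dicts."""
    return {name: candidate[name] - baseline[name]
            for name in baseline if name in candidate}

# Placeholder averages for illustration only (not measured values).
baseline_avg = {"BERTScore": 0.80, "Perplexity": 2.0, "Faithfulness": 0.50}
finetuned_avg = {"BERTScore": 0.90, "Perplexity": 1.5, "Faithfulness": 0.75}

deltas = compare_to_baseline(baseline_avg, finetuned_avg)
for name, delta in deltas.items():
    # Signed output makes the direction of change explicit.
    print(f"{name}: {delta:+.3f}")
```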

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Install required libraries
!pip install -q transformers peft sentence-transformers wandb
!pip install -q bert_score


In [None]:
import warnings
import pandas as pd
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score
import numpy as np



# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")


# Load the pre-trained T5-Small model and tokenizer (no fine-tuning)
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small")
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Move model to device
device = "cuda" if torch.cuda.is_available() else "cpu"
t5_model = t5_model.to(device)

# Load Sentence-BERT model for faithfulness calculation
sbert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Validation dataset
val_dataset = pd.read_csv("/content/drive/My Drive/validation_data_cleaned.csv")


# Create a DataLoader for the validation dataset.
# Note: it is not used below; the evaluation loop iterates over the DataFrame rows directly.
from torch.utils.data import DataLoader
val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False)

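# Helper function to calculate perplexity as exp(mean cross-entropy loss).
# Note: the input IDs double as labels, so this scores the prompt itself.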
def calculate_perplexity(input_ids, model):
    with torch.no_grad():
        outputs = model(input_ids=input_ids, labels=input_ids)
        return torch.exp(outputs.loss).item()


# Helper function for Faithfulness calculation (cosine similarity of Sentence-BERT embeddings)
def calculate_faithfulness(input_text, generated_text):
    input_embedding = sbert_model.encode(input_text, convert_to_tensor=True)
    output_embedding = sbert_model.encode(generated_text, convert_to_tensor=True)
    return util.pytorch_cos_sim(input_embedding, output_embedding).item()


# DataFrame to store results
results_df = pd.DataFrame(columns=["Input", "Reference", "Output", "BERTScore", "Perplexity", "Faithfulness"])


# Iterate through validation dataset rows
for _, row in val_dataset.iterrows():
    input_text = row["Augmented Vague Goal"]
    reference_text = row["SMART Goal"]

    # Generate output using T5
    input_ids = t5_tokenizer.encode(input_text, return_tensors="pt", truncation=True).to(device)
    output_ids = t5_model.generate(input_ids, max_length=100, num_beams=1)
    output_text = t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Calculate BERTScore
    P, R, F1 = bert_score([output_text], [reference_text], lang="en")
    bert_score_avg = F1.mean().item()

    # Calculate Perplexity
    perplexity = calculate_perplexity(input_ids, t5_model)

    # Calculate Faithfulness using Sentence-BERT
    embeddings_ref = sbert_model.encode(reference_text, convert_to_tensor=True)
    embeddings_out = sbert_model.encode(output_text, convert_to_tensor=True)
    faithfulness = util.cos_sim(embeddings_out, embeddings_ref).item()

    # Add results to DataFrame
    results_df = pd.concat([results_df, pd.DataFrame([{
        "Input": input_text,
        "Reference": reference_text,
        "Output": output_text,
        "BERTScore": bert_score_avg,
        "Perplexity": perplexity,
        "Faithfulness": faithfulness
    }])], ignore_index=True)

# Display results and average scores
print("Sample Results for Human Evaluation:")
print(results_df.head())  # Display a few results for verification
print("\nAverage Scores:")
print(results_df[["BERTScore", "Perplexity", "Faithfulness"]].mean())

# Save results to a CSV file
results_df.to_csv("/content/drive/My Drive/model_evaluation_T5_Base_new.csv", index=False)
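
For readers unfamiliar with the two helper metrics in the cell above, here is a minimal, library-free sketch of what they compute. Perplexity is the exponential of the mean negative log-likelihood per token, which is what `torch.exp(outputs.loss)` evaluates when `outputs.loss` is a mean cross-entropy; the faithfulness score is a cosine similarity between two embedding vectors. The function names and toy inputs are illustrative assumptions, not part of the notebook. Note also that the notebook's helper passes `input_ids` as both input and labels, so its value appears to reflect how well T5 reconstructs the prompt rather than the likelihood of the generated text.

```python
import math

def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the target tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy inputs: a model that assigns probability 0.25 to every target token,
# and small hand-made vectors standing in for sentence embeddings.
ppl = perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25])
sim_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(ppl, sim_parallel, sim_orthogonal)
```

A perplexity close to 1 means the model assigns nearly all probability mass to the target tokens, which is worth keeping in mind when reading the perplexity column in the results.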

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.

Sample Results for Human Evaluation:
                                               Input  \
0  i’m looking to explore ways to bounce back and...   
1  i’m looking to take some time to reflect on ho...   
2  i’m thinking it might be worthwhile to explore...   
3  ive been thinking it might be nice to connect ...   
4  i’m looking to explore ways to bring more vari...   

                                           Reference  \
0  to navigate the challenges of my work environm...   
1  by the end of the next quarter, i will dedicat...   
2  by the end of the next quarter, i will enhance...   
3  in the spirit of fostering stronger connection...   
4  by the end of the next quarter, i will impleme...   

                                              Output  BERTScore  Perplexity  \
0  i’m looking to explore ways to bounce back and...   0.861804    1.203286   
1  at work. i’m looking to take some time to refl...   0.872158    1.246733   
2  i’m thinking it might be worthwhile to explore...