# Code Explanations

In [None]:
# ==================== INSTALLATION ====================
!pip install sentence-transformers scikit-learn pandas

# ==================== IMPORTS ====================
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split, ParameterGrid, ParameterSampler
import pandas as pd
import numpy as np
import time
import os
import shutil
import random

At the beginning of my Google Colab project, I set up all the tools I know I'll need for my work by listing a bunch of imports. This feels to me like grabbing my books and supplies before starting a big assignment. I know that the first group, which mentions things like SentenceTransformer and InputExample, is for dealing with sentences and making it so I can compare how similar different sentences are, almost like giving each sentence its own fingerprint. The next few imports, especially from a library called sklearn, offer me ways to measure how well my models perform, letting me check things like how often my answers are right or wrong, and even help me split my data fairly so I don't end up grading myself on questions I already practiced.

The rest of the imports might not sound as fancy, but they really pull their weight. For example, pandas and numpy let me handle big sets of data and do math with them, DataLoader helps me avoid crashing my computer by breaking data into small chunks, and random, time, os, and shutil make it easier to shuffle things up, keep track of time, and work with files and folders. By gathering all these imports up front, I make sure that once I get going, I have everything laid out and I won't have to stop in the middle of my work to hunt for another tool. That way, I can focus completely on my actual analysis instead of worrying about missing pieces.

In [None]:
# ==================== DISABLE WANDB ====================
os.environ['WANDB_DISABLED'] = 'true'

# ==================== MOUNT GOOGLE DRIVE ====================
from google.colab import drive
drive.mount('/content/drive')

# ==================== LOAD DATASET ====================
print("\n" + "="*60)
print("LOADING DATASET")
print("="*60)

file_path = '/content/drive/MyDrive/TAGALOG-ESSAYS.csv'
df = pd.read_csv(file_path)

print("Dataset loaded successfully.")
print(f"Total samples: {len(df)}")
print(f"Columns: {list(df.columns)}")
print(f"\nLabel distribution:\n{df['LABEL'].value_counts()}")

# Clean data
df = df.dropna(subset=['TITLE', 'ESSAY', 'LABEL'])
print(f"After cleaning: {len(df)} samples")

I set an environment variable to turn off something called "wandb" (Weights & Biases). In plain terms, this prevents an experiment-tracking tool from popping up, since I don't need it for this project.

Next, I make sure I can reach my Google Drive files directly from Colab. I do this by running a command that connects my notebook to my Drive. It gives me a link to grant permission, and after doing that, I see my Drive files show up as if they're part of my computer's own folders. Once I can access my Drive, I focus on loading my data. I find my CSV file in MyDrive, then I use pandas to open it up. This lets me see how many samples I have, what columns are in the dataset, and a count of each label, which helps me understand the data better. Because messy data can cause issues, I use another line to drop any rows that are missing key information: the title, the essay, or the label.

For me, these steps feel a lot like organizing my workspace before starting the actual project. Once all of this is done, I'm set to start working with my essays in Colab, knowing that my data is both available and cleaned up, and that any files I use or save will stay safe in my Google Drive.

In [None]:
# ==================== SPLIT DATA ====================
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['LABEL'])
print(f"\nTraining samples: {len(train_df)}")
print(f"Testing samples: {len(test_df)}")

# Prepare training examples
train_examples = [
    InputExample(texts=[str(row['TITLE']), str(row['ESSAY'])], label=float(row['LABEL']))
    for index, row in train_df.iterrows()
]

# Get true labels for evaluation
true_labels = test_df['LABEL'].tolist()

In this part of my code, I focus on splitting my data into two main groups: one for training the model and the other for testing how well the model works. I use a popular function called train_test_split from scikit-learn, which helps with this kind of task by randomly mixing up the data and then splitting it based on the percentage I choose. In my example, I make 80 percent of the essays go into the training group and 20 percent go into the testing group, which is a common split. I also add an option called stratify, which makes sure each label keeps roughly the same proportion in both groups, so neither side over- or under-represents a category.

After the split, I print out how many essays ended up in each group. Then, I start preparing the training examples by making a list where each item pairs a title with its essay, along with the label (converted to a float) that says whether the two match. This is an important step because later on, the model needs to see examples that combine the actual input text with the right answer. For the testing group, I also make a list of the correct labels, which I will use later to check whether my model predicts correctly. This whole section is about organizing my data the right way so my model can learn from part of it and be graded fairly using the rest, making sure any results I get can be trusted.
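
To make the idea of stratifying concrete, here is a small hand-rolled sketch in plain Python. It is not how scikit-learn implements `train_test_split`; the helper name `stratified_split` and the toy 70/30 label mix are assumptions made up for illustration. It only shows that splitting each label group separately keeps the label proportions the same on both sides.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size from each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # random order within one label
        n_test = round(len(indices) * test_size)  # same fraction for every label
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy data (hypothetical): 70 matching pairs (1) and 30 non-matching pairs (0).
labels = [1] * 70 + [0] * 30
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both groups end up with the same 70/30 mix as the full dataset, which is the property the stratify option is meant to give.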

In [None]:
# ==================== HYPERPARAMETERS ====================
batch_size = 16 # Reduced batch size
num_epochs = 4 # Reduced number of epochs
model_name = 'all-mpnet-base-v2'  # SBERT model
threshold = 0.4

print("\n" + "="*60)
print("CONFIGURATION")
print("="*60)
print(f"Model: {model_name}")
print(f"Batch Size: {batch_size}")
print(f"Epochs: {num_epochs}")
print(f"Threshold: {threshold}")
print("="*60)

For this section of my code, I'm basically setting up what are called "hyperparameters": the main settings I need to choose before I train my model. I set the batch size to 16, which just means that every time the model updates, it looks at 16 title-essay pairs at once. Next, I pick a number of epochs, which is just another way of saying how many times the model will see the full training dataset; I decided on 4 here, so my model will go through all the essays four times. The model I choose is called 'all-mpnet-base-v2', which is a specific pre-trained Sentence-BERT model that's good at turning sentences into numbers that a computer can understand for tasks like comparing meanings.

I also set a threshold of 0.4, which is used later as the cutoff on the similarity score: a title-essay pair that scores at least 0.4 is predicted as a match, and anything lower as not a match. I print out these settings in the notebook so it's easy for me to know exactly how my experiment is set up.

By doing this, I make each run easier to reproduce, and if I change something to try for better results, I can keep track of it easily.

In [None]:
# ==================== BASELINE EVALUATION ====================
print("\n" + "="*60)
print("BASELINE EVALUATION (Pre-trained Model)")
print("="*60)

# Load pre-trained model (not fine-tuned)
baseline_model = SentenceTransformer(model_name)
print(f"✓ Loaded baseline model: {model_name}")

# Generate embeddings using baseline model
print("\nGenerating baseline embeddings...")
baseline_title_emb = baseline_model.encode(test_df['TITLE'].tolist(), convert_to_tensor=False, show_progress_bar=True)
baseline_essay_emb = baseline_model.encode(test_df['ESSAY'].tolist(), convert_to_tensor=False, show_progress_bar=True)

# Calculate similarities
print("Calculating baseline similarities...")
baseline_similarities = np.diag(cosine_similarity(baseline_title_emb, baseline_essay_emb))
baseline_predictions = [1 if sim >= threshold else 0 for sim in baseline_similarities]

# Calculate baseline metrics
baseline_accuracy = accuracy_score(true_labels, baseline_predictions)
baseline_precision = precision_score(true_labels, baseline_predictions, zero_division=0)
baseline_recall = recall_score(true_labels, baseline_predictions, zero_division=0)
baseline_f1 = f1_score(true_labels, baseline_predictions, zero_division=0)

print("\n" + "="*60)
print("BASELINE RESULTS")
print("="*60)
print(f"Accuracy:  {baseline_accuracy:.4f} ({baseline_accuracy*100:.2f}%)")
print(f"Precision: {baseline_precision:.4f} ({baseline_precision*100:.2f}%)")
print(f"Recall:    {baseline_recall:.4f} ({baseline_recall*100:.2f}%)")
print(f"F1-Score:  {baseline_f1:.4f} ({baseline_f1*100:.2f}%)")
print("="*60)

print("\nBaseline Confusion Matrix:")
print(confusion_matrix(true_labels, baseline_predictions))

I evaluate how well the pre-trained sentence transformer model works on my dataset before I try any improvements. First, I print some lines to make it clear we're doing a baseline evaluation, which just means it's my starting point using the model as it is, without extra fine-tuning. I load in the model called 'all-mpnet-base-v2' and confirm it in the output so I know which version is being used. Next, I use the model to turn every essay title and the essay itself from my test set into embeddings, which are just long lists of numbers that represent the meaning of each text in a way the computer can work with.

After I get the embeddings, I use a calculation called cosine similarity to figure out how close each title is to its matching essay. Cosine similarity gives a score between -1 and 1, where numbers closer to 1 mean the texts are more alike (for sentence embeddings, the scores usually come out positive). To convert these scores into predictions for my task, I use a threshold of 0.4: if the score is at least 0.4, I let the computer classify the pair as a match (label 1), otherwise as not a match (label 0). Now I compare these predictions to the true answers from my test set and calculate metrics like accuracy, precision, recall, and F1-score. These help me see not just how many predictions were right, but also give me more details about how well the model is doing at telling matching pairs from non-matching ones.

Finally, I print these results, including the confusion matrix, which gives a breakdown of how many essays the model got right and wrong for each class. This whole step gives me a reference point, so I can see if any changes or training I do later actually make the model better.
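
The similarity step can also be written row by row. The notebook builds the full title-by-essay similarity matrix and keeps only its diagonal; the sketch below computes just those paired values with NumPy. The helper name `paired_cosine` and the tiny 2-dimensional "embeddings" are made up for illustration (real embeddings from this model are much longer), so treat it as a sketch rather than a drop-in replacement.

```python
import numpy as np

def paired_cosine(a, b):
    """Cosine similarity between row i of `a` and row i of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Hypothetical toy vectors standing in for title and essay embeddings.
title_emb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
essay_emb = np.array([[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])

sims = paired_cosine(title_emb, essay_emb)
threshold = 0.4
preds = [1 if s >= threshold else 0 for s in sims]

print(sims)   # same direction, 45 degrees apart, opposite direction
print(preds)
```

The third pair points in opposite directions, which shows that a cosine score can be negative; with a cutoff of 0.4 it simply counts as "not a match".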

In [None]:
# ==================== AUTOMATED HYPERPARAMETER SEARCH ====================
def train_evaluate(train_examples, true_labels, test_df, batch_size, num_epochs, threshold, model_name):
    # DataLoader for train set
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
    # Load and set up model
    model = SentenceTransformer(model_name)
    train_loss = losses.CosineSimilarityLoss(model)
    warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
    # Train model
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=num_epochs,
        warmup_steps=warmup_steps,
        output_path=None, # Don't save in search
        show_progress_bar=False,
        use_amp=False
    )
    # Generate embeddings for test set
    title_emb = model.encode(test_df['TITLE'].tolist(), convert_to_tensor=False)
    essay_emb = model.encode(test_df['ESSAY'].tolist(), convert_to_tensor=False)
    similarities = np.diag(cosine_similarity(title_emb, essay_emb))
    predictions = [1 if sim >= threshold else 0 for sim in similarities]
    # Compute metrics
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions, zero_division=0)
    precision = precision_score(true_labels, predictions, zero_division=0)
    recall = recall_score(true_labels, predictions, zero_division=0)
    return {
        'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall,
        'batch_size': batch_size, 'num_epochs': num_epochs, 'threshold': threshold
    }

# Define hyperparameter grids
param_grid = {
    'batch_size': [8, 16],
    'num_epochs': [3, 4],
    'threshold': [0.4, 0.5, 0.6, 0.65]
}
n_iter_random_search = 4 # Choose how many random configs to try

# ---- GRID SEARCH ----
print("\n" + "="*60)
print("GRID SEARCH: SBERT Hyperparameters")
print("="*60)

best_grid_result = None
grid_results = []

for param_set in ParameterGrid(param_grid):
    print(f"\nTrying Grid Parameters: {param_set}")
    metrics = train_evaluate(train_examples, true_labels, test_df,
                             batch_size=param_set['batch_size'],
                             num_epochs=param_set['num_epochs'],
                             threshold=param_set['threshold'],
                             model_name=model_name)
    grid_results.append(metrics)
    if (best_grid_result is None) or (metrics['f1'] > best_grid_result['f1']):
        best_grid_result = metrics
    print(f"F1 = {metrics['f1']:.4f}, Accuracy = {metrics['accuracy']:.4f}")

print("\nBest Grid Search Result:")
print(best_grid_result)

# ---- RANDOM SEARCH ----
print("\n" + "="*60)
print("RANDOM SEARCH: SBERT Hyperparameters")
print("="*60)

param_list = list(ParameterSampler(param_grid, n_iter=n_iter_random_search, random_state=42))
best_random_result = None
random_results = []

for param_set in param_list:
    print(f"\nTrying Random Parameters: {param_set}")
    metrics = train_evaluate(train_examples, true_labels, test_df,
                             batch_size=param_set['batch_size'],
                             num_epochs=param_set['num_epochs'],
                             threshold=param_set['threshold'],
                             model_name=model_name)
    random_results.append(metrics)
    if (best_random_result is None) or (metrics['f1'] > best_random_result['f1']):
        best_random_result = metrics
    print(f"F1 = {metrics['f1']:.4f}, Accuracy = {metrics['accuracy']:.4f}")

print("\nBest Random Search Result:")
print(best_random_result)

# === After search, use best params for final training ===
batch_size = best_grid_result['batch_size'] # Or use best_random_result['batch_size']
num_epochs = best_grid_result['num_epochs']
threshold = best_grid_result['threshold']

print("\nUsing Best Grid/Random Parameters for Final Model:")
print(f"Batch Size: {batch_size}, Epochs: {num_epochs}, Threshold: {threshold}")

I set up an automated way to test out different settings for training a sentence transformer model. I started by writing a function that takes the data I want to work with, such as my labeled title-essay examples, and a bunch of settings like batch size, number of times I run through the data (the epochs), and a threshold I'll use later to judge similarity. In the function, I first make sure my data is organized so the model can practice with small groups at a time, which helps the computer not get overwhelmed. Then, I load up a model that tries to understand sentences and connect them to their meanings, and I set up a system so that it learns by comparing how close or far apart the sentences should be. Once everything is ready, I let the model learn with the settings I gave it, and when it's done, I ask it to measure how similar each title is to its essay. Based on those similarities, the model decides whether it thinks the match is good enough, using my threshold as the cutoff. After that, I check how well those predictions did by computing accuracy along with precision, recall, and F1, which show whether the model is missing real matches or raising false ones.

The cool part is how I use this function: instead of guessing the best settings, I tell my notebook to try out lots of possibilities using what's called a grid search and a random search. Grid search goes through every combination I list, pairing each batch size with every number of epochs and every threshold (2 × 2 × 4 = 16 combinations here), kind of like testing every flavor combination in a sundae bar. Random search just picks a few combinations to try out (4 in this case), which saves time when I don't want to test everything. Every time the notebook runs a new combo, it scores it, and at the end I keep the one with the best F1 score, which balances precision (how many predicted matches are real) against recall (how many real matches are found). Once I know which combo worked best, I use those settings to train my final model; as written, the code takes the grid search winner. One caveat is that every combo is scored on the test set, so the final test scores are probably a bit optimistic. This whole process makes me feel like I'm running experiments in a lab, testing different formulas until I find the one that gives me the strongest results.
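
One thing worth seeing in code is how the search space breaks down. The threshold only affects the yes/no decision made after the embeddings are computed, so it does not change training itself. The sketch below enumerates the same combinations that ParameterGrid would produce, counts how many distinct training setups they contain, and then scores several thresholds against one fixed list of similarity scores. The helper `accuracy_at` and the six similarity values are hypothetical, chosen only to keep the example small.

```python
from itertools import product

param_grid = {
    'batch_size': [8, 16],
    'num_epochs': [3, 4],
    'threshold': [0.4, 0.5, 0.6, 0.65],
}

# Every combination of the listed values (what a grid search walks through).
keys = sorted(param_grid)
all_configs = [dict(zip(keys, values))
               for values in product(*(param_grid[k] for k in keys))]
print("grid combinations:", len(all_configs))

# Only batch size and epochs change how the model is trained.
training_configs = {(c['batch_size'], c['num_epochs']) for c in all_configs}
print("distinct training setups:", len(training_configs))

def accuracy_at(similarities, labels, threshold):
    """Accuracy when pairs scoring >= threshold are predicted as matches."""
    preds = [1 if s >= threshold else 0 for s in similarities]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical cached similarities from ONE trained model, with true labels.
cached_sims = [0.82, 0.55, 0.45, 0.30, 0.62, 0.10]
labels = [1, 1, 0, 0, 1, 0]
for t in param_grid['threshold']:
    print(f"threshold={t}: accuracy={accuracy_at(cached_sims, labels, t):.3f}")
```

Under these assumptions, the 16 grid entries correspond to only 4 different training runs, so the thresholds could in principle be compared on similarities computed once per trained model instead of retraining for each one.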

In [None]:
# ==================== FINE-TUNING ====================
print("\n" + "="*60)
print("FINE-TUNING SBERT MODEL")
print("="*60)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

# Load model for fine-tuning
model = SentenceTransformer(model_name)
train_loss = losses.CosineSimilarityLoss(model)
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)

print(f"Warmup Steps: {warmup_steps}")
print(f"Total Training Steps: {len(train_dataloader) * num_epochs}")

print(f"\nStarting fine-tuning {model_name}...")
start_time = time.time()

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./fine_tuned_sbert',
    show_progress_bar=True,
    use_amp=False
)

training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time:.2f} seconds ({training_time/60:.2f} minutes)")

This is where I actually work on improving my sentence transformer model, instead of just using it as is. First, I print some lines to explain that I'm now doing fine-tuning, which means taking the base model and training it more, but this time using my own essay data. I put the training examples into a DataLoader, which just helps organize them so the model looks at them in small groups (batches) each step of the way, and shuffles them so the model doesn't see the examples in the same order every epoch.

Next, I load the pre-trained model that I want to improve, then pick a loss function called CosineSimilarityLoss. This loss compares the cosine similarity between each title embedding and its essay embedding with the pair's label and penalizes the gap, so matching pairs are pushed toward high similarity and non-matching pairs toward low similarity. Before training, I calculate the number of 'warmup steps', which is a short ramp-up period at the beginning (here, the first 10 percent of all training steps) where the learning rate climbs from a small value, to help things start smoothly. I also print out how many steps the total training will take.
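
To show what this loss is measuring, here is a plain-Python illustration. As far as I understand the library's default, CosineSimilarityLoss scores a pair by the squared difference between the cosine similarity of the two embeddings and the float label, averaged over the batch. The code below is my own sketch of that idea, not the library's implementation, and the 2-dimensional vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between each pair's cosine similarity and its label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Hypothetical (title, essay) embeddings.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction: cosine = 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal: cosine = 0
]

loss_agree = cosine_similarity_loss(pairs, [1.0, 0.0])     # labels fit the geometry
loss_disagree = cosine_similarity_loss(pairs, [0.0, 1.0])  # labels contradict it
print(loss_agree, loss_disagree)
```

The loss is zero when the similarities already equal the labels and grows as they drift apart, which is the signal the model uses to adjust its embeddings during fine-tuning.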

Once I have all the settings ready, I start the fine-tuning by calling the fit function. This is where the real training happens: the model goes through the training data several times, the number of epochs I picked earlier, and saves the improved version to a folder. I time how long this training process takes, and when it's done, I print out the total training time in both seconds and minutes, so I know roughly how long each run costs me. Fine-tuning like this is supposed to make the model perform better for my specific use case because it gets to "see" more examples that look like my actual data and adjust itself for those kinds of sentences.
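
The step counts printed before training can be worked out by hand. The sketch below repeats the arithmetic from the notebook (steps per epoch is the number of batches, and warmup is 10 percent of all steps) and adds a simple linear ramp for the learning rate. The 800 training examples and the base learning rate of 2e-5 are assumed values for illustration only, and the ramp ignores whatever the real schedule does after warmup.

```python
import math

def training_steps(num_examples, batch_size, num_epochs, warmup_fraction=0.1):
    """Mirror the notebook's step arithmetic for a DataLoader that keeps the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # len(train_dataloader)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(total_steps * warmup_fraction)
    return steps_per_epoch, total_steps, warmup_steps

def warmup_lr(step, warmup_steps, base_lr):
    """Linear ramp from 0 to base_lr over the warmup steps (sketch only)."""
    if warmup_steps == 0 or step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps

# Assumed sizes: 800 training pairs, batch size 16, 4 epochs.
steps_per_epoch, total_steps, warmup_steps = training_steps(800, 16, 4)
print(steps_per_epoch, total_steps, warmup_steps)

for step in (0, 5, 10, 20):
    print(step, warmup_lr(step, warmup_steps, base_lr=2e-5))
```

With these assumed numbers there are 50 steps per epoch, 200 steps in total, and the first 20 of them are warmup.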

In [None]:
# ==================== FINE-TUNED EVALUATION ====================
print("\n" + "="*60)
print("FINE-TUNED SBERT EVALUATION")
print("="*60)

print("Generating fine-tuned embeddings...")
finetuned_title_emb = model.encode(test_df['TITLE'].tolist(), convert_to_tensor=False, show_progress_bar=True)
finetuned_essay_emb = model.encode(test_df['ESSAY'].tolist(), convert_to_tensor=False, show_progress_bar=True)

print("Calculating fine-tuned similarities...")
finetuned_similarities = np.diag(cosine_similarity(finetuned_title_emb, finetuned_essay_emb))
finetuned_predictions = [1 if sim >= threshold else 0 for sim in finetuned_similarities]

# Calculate fine-tuned metrics
finetuned_accuracy = accuracy_score(true_labels, finetuned_predictions)
finetuned_precision = precision_score(true_labels, finetuned_predictions, zero_division=0)
finetuned_recall = recall_score(true_labels, finetuned_predictions, zero_division=0)
finetuned_f1 = f1_score(true_labels, finetuned_predictions, zero_division=0)

print("\n" + "="*60)
print("FINE-TUNED RESULTS")
print("="*60)
print(f"Accuracy:  {finetuned_accuracy:.4f} ({finetuned_accuracy*100:.2f}%)")
print(f"Precision: {finetuned_precision:.4f} ({finetuned_precision*100:.2f}%)")
print(f"Recall:    {finetuned_recall:.4f} ({finetuned_recall*100:.2f}%)")
print(f"F1-Score:  {finetuned_f1:.4f} ({finetuned_f1*100:.2f}%)")
print(f"Training Time: {training_time:.2f}s")
print("="*60)

print("\nFine-tuned Confusion Matrix:")
print(confusion_matrix(true_labels, finetuned_predictions))

print("\nClassification Report:")
print(classification_report(true_labels, finetuned_predictions, target_names=['Not Match', 'Match']))


Here I check how well my sentence transformer model performs after it has been fine-tuned on my own essay data. I begin by showing, with print statements, that I'm running the evaluation specifically for the updated (fine-tuned) model. For each essay in the test set, I send the titles and essays through my trained model to get new embeddings. These are like special fingerprints for each piece of text, where similar texts end up with more similar fingerprints. I use these to compare how similar the title and essay are for each pair, relying on a metric called cosine similarity. Cosine similarity is a common way to figure out how close two sets of numbers are by looking at the angle between them, not their length, making it a solid choice for measuring meanings in text. If the cosine similarity score for a pair is at least as high as my chosen threshold (by this point, the value picked by the hyperparameter search, which is not necessarily the original 0.4), I count it as a "match"; otherwise, it's not.

Once I have these match/no-match predictions, I check how well they line up with the real answers in the test set. I compute results like accuracy, precision, recall, and F1-score, which all tell me in slightly different ways how reliable my model's predictions are. I also print out the confusion matrix, which lets me see where the model is making mistakes, and a classification report that breaks down how it does for each category. By the end, I not only see the performance of my fine-tuned model but can directly compare it to the baseline results from before, which helps me know if training on my own data actually made the model better for my particular essay-related task.
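
To see where these four numbers come from, the sketch below computes them by hand from the counts in a confusion matrix. The notebook itself uses scikit-learn's functions; this plain-Python version, the helper name `binary_metrics`, and the eight toy labels are only for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels 0/1, treating 1 as 'Match'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # like zero_division=0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        'confusion': [[tn, fp], [fn, tp]],  # rows: true 0/1, columns: predicted 0/1
        'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1,
    }

# Hypothetical true labels and predictions for eight test pairs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here one real match was missed (a false negative) and one non-match was flagged (a false positive), so precision and recall both come out at 0.75.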

In [None]:
# ==================== BASELINE vs FINE-TUNED COMPARISON ====================
print("\n" + "="*60)
print("BASELINE vs FINE-TUNED COMPARISON")
print("="*60)
print(f"{'Metric':<12} {'Baseline':<12} {'Fine-tuned':<12} {'Improvement':<15}")
print("-"*60)

acc_improvement = finetuned_accuracy - baseline_accuracy
acc_improvement_pct = (acc_improvement / baseline_accuracy * 100) if baseline_accuracy > 0 else 0
print(f"{'Accuracy':<12} {baseline_accuracy:.4f}       {finetuned_accuracy:.4f}       {acc_improvement:+.4f} ({acc_improvement_pct:+.1f}%)")

prec_improvement = finetuned_precision - baseline_precision
prec_improvement_pct = (prec_improvement / baseline_precision * 100) if baseline_precision > 0 else 0
print(f"{'Precision':<12} {baseline_precision:.4f}       {finetuned_precision:.4f}       {prec_improvement:+.4f} ({prec_improvement_pct:+.1f}%)")

rec_improvement = finetuned_recall - baseline_recall
rec_improvement_pct = (rec_improvement / baseline_recall * 100) if baseline_recall > 0 else 0
print(f"{'Recall':<12} {baseline_recall:.4f}       {finetuned_recall:.4f}       {rec_improvement:+.4f} ({rec_improvement_pct:+.1f}%)")

f1_improvement = finetuned_f1 - baseline_f1
f1_improvement_pct = (f1_improvement / baseline_f1 * 100) if baseline_f1 > 0 else 0
print(f"{'F1-Score':<12} {baseline_f1:.4f}       {finetuned_f1:.4f}       {f1_improvement:+.4f} ({f1_improvement_pct:+.1f}%)")

print("="*60)

I compare how the basic model and the fine-tuned model perform using the main metrics: accuracy, precision, recall, and F1-score. I start by printing a header to show that this is the comparison section, and then I set up a table to display each metric for both models along with how much the score improved.

For each metric, I find the difference between the value from the fine-tuned model and the baseline, and I also calculate the percentage improvement. For example, if the accuracy of the fine-tuned model is higher than the baseline, I show both the straight improvement and by what percentage it increased. I do the same for precision, recall, and F1-score. This makes it easy to see where the model improved the most after fine-tuning.

This approach is important because it gives me a clear, side-by-side view of the progress my model made through extra training on the specific type of data I care about. By looking at these results, I get a more complete picture of whether fine-tuning really helped, or if any metric did not improve as much as expected. This kind of detailed comparison is a useful step in any machine learning workflow, helping to justify the extra effort put into training, and it's a common way to present results when discussing model improvements in research or real-world projects.
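
The two improvement numbers in each row mean different things, and a tiny example makes the difference clear. The first is the absolute change in the score; the second is that change relative to the baseline. The scores 0.80 and 0.88 below are invented for illustration and are not results from this project.

```python
def improvement(baseline, finetuned):
    """Return (absolute change, change as a percentage of the baseline)."""
    absolute = finetuned - baseline
    relative_pct = absolute / baseline * 100 if baseline > 0 else 0.0
    return absolute, relative_pct

# Hypothetical scores: baseline accuracy 0.80, fine-tuned accuracy 0.88.
abs_gain, rel_gain = improvement(0.80, 0.88)
print(f"{abs_gain:+.4f} ({rel_gain:+.1f}%)")  # prints "+0.0800 (+10.0%)"
```

So a jump from 0.80 to 0.88 is 8 percentage points in absolute terms but a 10 percent relative improvement, and the guard for a zero baseline mirrors the check used in the notebook.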

In [None]:
# ==================== SAMPLE PREDICTIONS COMPARISON ====================
print("\n" + "="*60)
print("SAMPLE PREDICTIONS: BASELINE vs FINE-TUNED")
print("="*60)

for i in range(min(5, len(test_df))):
    title = str(test_df.iloc[i]['TITLE'])[:50]  # str() guards against non-string titles
    true_label = true_labels[i]

    baseline_pred = baseline_predictions[i]
    baseline_sim = baseline_similarities[i]

    finetuned_pred = finetuned_predictions[i]
    finetuned_sim = finetuned_similarities[i]

    print(f"\nSample {i+1}: {title}...")
    print(f"True Label: {true_label}")
    print(f"Baseline   - Pred: {baseline_pred} | Sim: {baseline_sim:.4f} | {'✓' if baseline_pred == true_label else '✗'}")
    print(f"Fine-tuned - Pred: {finetuned_pred} | Sim: {finetuned_sim:.4f} | {'✓' if finetuned_pred == true_label else '✗'}")

I take a closer look at a few real examples from my dataset to compare how the baseline and fine-tuned models perform on actual predictions. For each sample, I grab the essay title, just the first 50 characters so it's easy to read, the correct label, and then pull both the baseline and fine-tuned model's predictions and similarity scores for that pair. I print all this out so I can see at a glance whether each model made the right call by matching its prediction to the true label, marking it with a check or a cross. This helps me see not just big-picture improvements in scores from earlier, but also what changes for individual essays when the model gets a chance to learn from my data. If I notice that the fine-tuned model consistently gets more checks, or has higher similarity scores where it should, it reassures me that the extra training really made a difference. Looking at sample predictions like this is a good way to spot strengths and weaknesses of both models on real data, and helps me understand what kind of essays or titles each version might still struggle with.

In [None]:
# ==================== SAVE MODEL ====================
drive_model_path = '/content/drive/MyDrive/fine_tuned_sbert'
print(f"\nSaving fine-tuned SBERT model to Google Drive...")
shutil.copytree('./fine_tuned_sbert', drive_model_path, dirs_exist_ok=True)
print(f"✓ Model saved at: {drive_model_path}")

# ==================== SAVE RESULTS ====================
import json
from datetime import datetime

results = {
    'timestamp': datetime.now().isoformat(),
    'model_name': model_name,
    'batch_size': batch_size,
    'num_epochs': num_epochs,
    'threshold': threshold,
    'training_time_seconds': training_time,
    'baseline': {
        'accuracy': float(baseline_accuracy),
        'precision': float(baseline_precision),
        'recall': float(baseline_recall),
        'f1': float(baseline_f1)
    },
    'finetuned': {
        'accuracy': float(finetuned_accuracy),
        'precision': float(finetuned_precision),
        'recall': float(finetuned_recall),
        'f1': float(finetuned_f1)
    },
    'improvements': {
        'accuracy': float(acc_improvement),
        'precision': float(prec_improvement),
        'recall': float(rec_improvement),
        'f1': float(f1_improvement)
    }
}

results_path = '/content/drive/MyDrive/sbert_experiment_results.json'
# Append one compact JSON object per line so each run can be parsed back later
with open(results_path, 'a') as f:
    f.write(json.dumps(results) + '\n')

print(f"✓ Results saved to: {results_path}")

At this stage, I am making sure my work does not get lost by saving my trained model and key results to my Google Drive. First, I choose a folder in my Drive where I want to keep the fine-tuned SBERT model and use shutil.copytree to copy the whole model directory into that location. This way, the model stays safe even after the Colab session ends or shuts down, since anything not saved to Drive can disappear when the session resets. Once the model is copied, I get a message confirming it is in my Drive.

Next, I turn to saving my experiment's results. I import the json and datetime libraries so I can write my results in a structured format and tag them with the exact time they were saved. I collect all the important details like the name of the model used, batch size, number of training rounds (epochs), threshold for classification, and how long training took. I also put in both the baseline and fine-tuned scores for accuracy, precision, recall, and F1, then show how much each improved overall. I save this collection as a JSON object, appending it to a results file in my Drive. By using JSON, I keep the information well organized and easy to analyze later, whether for making graphs, reports, or tracking my progress between tests. This whole process keeps my work safe, easy to find, and makes sure I have a clear record of my experiments whenever I return to them in the future.
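
When a results file is appended to across runs, writing one compact JSON object per line (often called JSON Lines) keeps every run readable with a simple loop; a file holding several pretty-printed objects back to back cannot be parsed with a single `json.load` call. The sketch below shows the round trip. The helper names, the temporary file, and the two records are all hypothetical.

```python
import json
import os
import tempfile

def append_result(path, record):
    """Append one experiment record as a single compact JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record) + '\n')

def load_results(path):
    """Read every record back, one JSON object per non-empty line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical records written to a temporary file instead of Google Drive.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_results.jsonl')
append_result(demo_path, {'run': 1, 'f1': 0.71})
append_result(demo_path, {'run': 2, 'f1': 0.78})

runs = load_results(demo_path)
best = max(runs, key=lambda r: r['f1'])
print(len(runs), "runs loaded; best run:", best['run'])
```

Reading the file back this way makes it straightforward to compare runs or pick the best one later.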

In [None]:
# ==================== SUMMARY ====================
print("\n" + "="*60)
print("SBERT FINE-TUNING COMPLETE!")
print("="*60)
print(f"✓ Baseline evaluated")
print(f"✓ SBERT model fine-tuned")
print(f"✓ Performance comparison completed")
print(f"✓ Model saved at: {drive_model_path}")
print(f"✓ Results saved at: {results_path}")
print("\nTo use fine-tuned SBERT model:")
print(f"model = SentenceTransformer('{drive_model_path}')")
print("="*60)

Summing up the work I did in this project, I started by evaluating the original pre-trained SBERT model on my data to see how it performed right out of the box. I then fine-tuned the SBERT model using my specific dataset of Tagalog essays and their labels, which allowed the model to adapt more to the patterns in my own data. After that, I compared the scores from the baseline and the fine-tuned model, checking metrics like accuracy, precision, recall, and F1-score to see how much the model improved after training on my data.

Once I was satisfied with the results, I saved the newly fine-tuned model to my Google Drive. This means I do not have to repeat all the training steps if I need to use the model again. Along with the model, I also saved all of the results and important experiment settings in a JSON file to my Drive. This keeps a record of what I did and makes it easier to track improvements or share the results with others. If I want to load the model later for new predictions or further work, I can just create a new model in my code by pointing to the file path where it is saved, like this: model = SentenceTransformer('/content/drive/MyDrive/fine_tuned_sbert').

The fine-tuning process let me adapt SBERT to my essay-matching task, and I wrapped up with both the fine-tuned model and a summary of results safely stored. This step-by-step process follows a common workflow for adapting SBERT to a new task, and the baseline comparison is what tells me whether fine-tuning on my own dataset actually improved performance.

# Actual Generated Code

In [1]:
# ==================== INSTALLATION ====================
!pip install sentence-transformers scikit-learn pandas

# ==================== IMPORTS ====================
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import time
import os
import shutil



In [6]:
# ==================== DISABLE WANDB ====================
os.environ['WANDB_DISABLED'] = 'true'

# ==================== MOUNT GOOGLE DRIVE ====================
from google.colab import drive
drive.mount('/content/drive')

# ==================== LOAD DATASET ====================
print("\n" + "="*60)
print("LOADING DATASET")
print("="*60)

file_path = '/content/drive/MyDrive/TAGALOG-ESSAYS.csv'
df = pd.read_csv(file_path)

print("Dataset loaded successfully.")
print(f"Total samples: {len(df)}")
print(f"Columns: {list(df.columns)}")
print(f"\nLabel distribution:\n{df['LABEL'].value_counts()}")

# Clean data
df = df.dropna(subset=['TITLE', 'ESSAY', 'LABEL'])
print(f"After cleaning: {len(df)} samples")

# ==================== SPLIT DATA ====================
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['LABEL'])
print(f"\nTraining samples: {len(train_df)}")
print(f"Testing samples: {len(test_df)}")

# Prepare training examples
train_examples = [
    InputExample(texts=[str(row['TITLE']), str(row['ESSAY'])], label=float(row['LABEL']))
    for index, row in train_df.iterrows()
]

# Get true labels for evaluation
true_labels = test_df['LABEL'].tolist()

# ==================== HYPERPARAMETERS ====================
batch_size = 16 # Reduced batch size
num_epochs = 3 # Reduced number of epochs
model_name = 'all-mpnet-base-v2'  # SBERT model
threshold = 0.4

print("\n" + "="*60)
print("CONFIGURATION")
print("="*60)
print(f"Model: {model_name}")
print(f"Batch Size: {batch_size}")
print(f"Epochs: {num_epochs}")
print(f"Threshold: {threshold}")
print("="*60)

# ==================== BASELINE EVALUATION ====================
print("\n" + "="*60)
print("BASELINE EVALUATION (Pre-trained Model)")
print("="*60)

# Load pre-trained model (not fine-tuned)
baseline_model = SentenceTransformer(model_name)
print(f"✓ Loaded baseline model: {model_name}")

# Generate embeddings using baseline model
print("\nGenerating baseline embeddings...")
baseline_title_emb = baseline_model.encode(test_df['TITLE'].tolist(), convert_to_tensor=False, show_progress_bar=True)
baseline_essay_emb = baseline_model.encode(test_df['ESSAY'].tolist(), convert_to_tensor=False, show_progress_bar=True)

# Calculate similarities
print("Calculating baseline similarities...")
baseline_similarities = np.diag(cosine_similarity(baseline_title_emb, baseline_essay_emb))
baseline_predictions = [1 if sim >= threshold else 0 for sim in baseline_similarities]

# Calculate baseline metrics
baseline_accuracy = accuracy_score(true_labels, baseline_predictions)
baseline_precision = precision_score(true_labels, baseline_predictions, zero_division=0)
baseline_recall = recall_score(true_labels, baseline_predictions, zero_division=0)
baseline_f1 = f1_score(true_labels, baseline_predictions, zero_division=0)

print("\n" + "="*60)
print("BASELINE RESULTS")
print("="*60)
print(f"Accuracy:  {baseline_accuracy:.4f} ({baseline_accuracy*100:.2f}%)")
print(f"Precision: {baseline_precision:.4f} ({baseline_precision*100:.2f}%)")
print(f"Recall:    {baseline_recall:.4f} ({baseline_recall*100:.2f}%)")
print(f"F1-Score:  {baseline_f1:.4f} ({baseline_f1*100:.2f}%)")
print("="*60)

print("\nBaseline Confusion Matrix:")
print(confusion_matrix(true_labels, baseline_predictions))

# ==================== FINE-TUNING ====================
print("\n" + "="*60)
print("FINE-TUNING SBERT MODEL")
print("="*60)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

# Load model for fine-tuning
model = SentenceTransformer(model_name)
train_loss = losses.CosineSimilarityLoss(model)
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)

print(f"Warmup Steps: {warmup_steps}")
print(f"Total Training Steps: {len(train_dataloader) * num_epochs}")

print(f"\nStarting fine-tuning {model_name}...")
start_time = time.time()

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./fine_tuned_sbert',
    show_progress_bar=True,
    use_amp=False
)

training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time:.2f} seconds ({training_time/60:.2f} minutes)")

# ==================== FINE-TUNED EVALUATION ====================
print("\n" + "="*60)
print("FINE-TUNED SBERT EVALUATION")
print("="*60)

print("Generating fine-tuned embeddings...")
finetuned_title_emb = model.encode(test_df['TITLE'].tolist(), convert_to_tensor=False, show_progress_bar=True)
finetuned_essay_emb = model.encode(test_df['ESSAY'].tolist(), convert_to_tensor=False, show_progress_bar=True)

print("Calculating fine-tuned similarities...")
finetuned_similarities = np.diag(cosine_similarity(finetuned_title_emb, finetuned_essay_emb))
finetuned_predictions = [1 if sim >= threshold else 0 for sim in finetuned_similarities]

# Calculate fine-tuned metrics
finetuned_accuracy = accuracy_score(true_labels, finetuned_predictions)
finetuned_precision = precision_score(true_labels, finetuned_predictions, zero_division=0)
finetuned_recall = recall_score(true_labels, finetuned_predictions, zero_division=0)
finetuned_f1 = f1_score(true_labels, finetuned_predictions, zero_division=0)

print("\n" + "="*60)
print("FINE-TUNED RESULTS")
print("="*60)
print(f"Accuracy:  {finetuned_accuracy:.4f} ({finetuned_accuracy*100:.2f}%)")
print(f"Precision: {finetuned_precision:.4f} ({finetuned_precision*100:.2f}%)")
print(f"Recall:    {finetuned_recall:.4f} ({finetuned_recall*100:.2f}%)")
print(f"F1-Score:  {finetuned_f1:.4f} ({finetuned_f1*100:.2f}%)")
print(f"Training Time: {training_time:.2f}s")
print("="*60)

print("\nFine-tuned Confusion Matrix:")
print(confusion_matrix(true_labels, finetuned_predictions))

print("\nClassification Report:")
print(classification_report(true_labels, finetuned_predictions, target_names=['Not Match', 'Match']))

# ==================== BASELINE vs FINE-TUNED COMPARISON ====================
print("\n" + "="*60)
print("BASELINE vs FINE-TUNED COMPARISON")
print("="*60)
print(f"{'Metric':<12} {'Baseline':<12} {'Fine-tuned':<12} {'Improvement':<15}")
print("-"*60)

acc_improvement = finetuned_accuracy - baseline_accuracy
acc_improvement_pct = (acc_improvement / baseline_accuracy * 100) if baseline_accuracy > 0 else 0
print(f"{'Accuracy':<12} {baseline_accuracy:.4f}       {finetuned_accuracy:.4f}       {acc_improvement:+.4f} ({acc_improvement_pct:+.1f}%)")

prec_improvement = finetuned_precision - baseline_precision
prec_improvement_pct = (prec_improvement / baseline_precision * 100) if baseline_precision > 0 else 0
print(f"{'Precision':<12} {baseline_precision:.4f}       {finetuned_precision:.4f}       {prec_improvement:+.4f} ({prec_improvement_pct:+.1f}%)")

rec_improvement = finetuned_recall - baseline_recall
rec_improvement_pct = (rec_improvement / baseline_recall * 100) if baseline_recall > 0 else 0
print(f"{'Recall':<12} {baseline_recall:.4f}       {finetuned_recall:.4f}       {rec_improvement:+.4f} ({rec_improvement_pct:+.1f}%)")

f1_improvement = finetuned_f1 - baseline_f1
f1_improvement_pct = (f1_improvement / baseline_f1 * 100) if baseline_f1 > 0 else 0
print(f"{'F1-Score':<12} {baseline_f1:.4f}       {finetuned_f1:.4f}       {f1_improvement:+.4f} ({f1_improvement_pct:+.1f}%)")

print("="*60)

# ==================== SAMPLE PREDICTIONS COMPARISON ====================
print("\n" + "="*60)
print("SAMPLE PREDICTIONS: BASELINE vs FINE-TUNED")
print("="*60)

for i in range(min(5, len(test_df))):
    title = str(test_df.iloc[i]['TITLE'])[:50]  # str() guards against non-string titles
    true_label = true_labels[i]

    baseline_pred = baseline_predictions[i]
    baseline_sim = baseline_similarities[i]

    finetuned_pred = finetuned_predictions[i]
    finetuned_sim = finetuned_similarities[i]

    print(f"\nSample {i+1}: {title}...")
    print(f"True Label: {true_label}")
    print(f"Baseline   - Pred: {baseline_pred} | Sim: {baseline_sim:.4f} | {'✓' if baseline_pred == true_label else '✗'}")
    print(f"Fine-tuned - Pred: {finetuned_pred} | Sim: {finetuned_sim:.4f} | {'✓' if finetuned_pred == true_label else '✗'}")

# ==================== SAVE MODEL ====================
drive_model_path = '/content/drive/MyDrive/fine_tuned_sbert'
print(f"\nSaving fine-tuned SBERT model to Google Drive...")
shutil.copytree('./fine_tuned_sbert', drive_model_path, dirs_exist_ok=True)
print(f"✓ Model saved at: {drive_model_path}")

# ==================== SAVE RESULTS ====================
import json
from datetime import datetime

results = {
    'timestamp': datetime.now().isoformat(),
    'model_name': model_name,
    'batch_size': batch_size,
    'num_epochs': num_epochs,
    'threshold': threshold,
    'training_time_seconds': training_time,
    'baseline': {
        'accuracy': float(baseline_accuracy),
        'precision': float(baseline_precision),
        'recall': float(baseline_recall),
        'f1': float(baseline_f1)
    },
    'finetuned': {
        'accuracy': float(finetuned_accuracy),
        'precision': float(finetuned_precision),
        'recall': float(finetuned_recall),
        'f1': float(finetuned_f1)
    },
    'improvements': {
        'accuracy': float(acc_improvement),
        'precision': float(prec_improvement),
        'recall': float(rec_improvement),
        'f1': float(f1_improvement)
    }
}

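results_path = '/content/drive/MyDrive/sbert_experiment_results.json'
# Append mode keeps earlier runs: each run adds one more JSON object to the file,
# so after several runs the file holds a sequence of objects, not a single JSON document.
with open(results_path, 'a') as f:
    f.write(json.dumps(results, indent=2) + '\n')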

print(f"✓ Results saved to: {results_path}")

# ==================== SUMMARY ====================
print("\n" + "="*60)
print("SBERT FINE-TUNING COMPLETE!")
print("="*60)
print("✓ Baseline evaluated")
print("✓ SBERT model fine-tuned")
print("✓ Performance comparison completed")
print(f"✓ Model saved at: {drive_model_path}")
print(f"✓ Results saved at: {results_path}")
print("\nTo use fine-tuned SBERT model:")
print(f"model = SentenceTransformer('{drive_model_path}')")
print("="*60)

# ==================== SAVE MATCH RESULTS TO CSV ====================
print("\nSaving CSV with match results...")

# Create a copy of test_df to avoid modifying the original
results_df = test_df.copy()

# Add new column containing match (1) / mismatch (0) results
results_df['MATCH_RESULT'] = finetuned_predictions

# Define output path for new CSV
output_csv_path = '/content/drive/MyDrive/sbert_match_results.csv'

# Save to CSV
results_df.to_csv(output_csv_path, index=False)

print(f"✓ CSV file saved successfully at: {output_csv_path}")
print("✓ New column 'MATCH_RESULT' added with predictions.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

LOADING DATASET
Dataset loaded successfully.
Total samples: 886
Columns: ['TITLE', 'ESSAY', 'LABEL']

Label distribution:
LABEL
1    477
0    409
Name: count, dtype: int64
After cleaning: 885 samples

Training samples: 708
Testing samples: 177

CONFIGURATION
Model: all-mpnet-base-v2
Batch Size: 16
Epochs: 3
Threshold: 0.4

BASELINE EVALUATION (Pre-trained Model)
✓ Loaded baseline model: all-mpnet-base-v2

Generating baseline embeddings...


Calculating baseline similarities...

BASELINE RESULTS
Accuracy:  0.5311 (53.11%)
Precision: 0.5345 (53.45%)
Recall:    0.9789 (97.89%)
F1-Score:  0.6914 (69.14%)

Baseline Confusion Matrix:
[[ 1 81]
 [ 2 93]]

FINE-TUNING SBERT MODEL


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Warmup Steps: 13
Total Training Steps: 135

Starting fine-tuning all-mpnet-base-v2...



✓ Training completed in 177.79 seconds (2.96 minutes)

FINE-TUNED SBERT EVALUATION
Generating fine-tuned embeddings...


Calculating fine-tuned similarities...

FINE-TUNED RESULTS
Accuracy:  0.8079 (80.79%)
Precision: 0.7748 (77.48%)
Recall:    0.9053 (90.53%)
F1-Score:  0.8350 (83.50%)
Training Time: 177.79s

Fine-tuned Confusion Matrix:
[[57 25]
 [ 9 86]]

Classification Report:
              precision    recall  f1-score   support

   Not Match       0.86      0.70      0.77        82
       Match       0.77      0.91      0.83        95

    accuracy                           0.81       177
   macro avg       0.82      0.80      0.80       177
weighted avg       0.82      0.81      0.80       177


BASELINE vs FINE-TUNED COMPARISON
Metric       Baseline     Fine-tuned   Improvement    
------------------------------------------------------------
Accuracy     0.5311       0.8079       +0.2768 (+52.1%)
Precision    0.5345       0.7748       +0.2403 (+45.0%)
Recall       0.9789       0.9053       -0.0737 (-7.5%)
F1-Score     0.6914       0.8350       +0.1435 (+20.8%)

SAMPLE PREDICTIONS: BASELINE vs FIN