# **INSTALLATION OF LIBRARIES**

This section installs the libraries the project needs: **`sentence-transformers`** (sentence embedding models), **`scikit-learn`** (machine learning utilities), and **`pandas`** (data manipulation). It then imports the modules used throughout the notebook: `SentenceTransformer`, `InputExample`, and `losses` for loading and fine-tuning the model; `DataLoader` for batching the training data; scikit-learn functions for accuracy, precision, recall, F1-score, the confusion matrix, the classification report, cosine similarity, and the train/test split; `numpy` for numerical operations; `time`, `os`, and `shutil` for timing and file handling; and `re` for regular-expression text preprocessing. Disabling Weights & Biases logging and loading the dataset from `TAGALOG-ESSAYS.csv` happen in the next cell.

In [2]:
# ==================== INSTALLATION ====================
!pip install sentence-transformers scikit-learn pandas

# ==================== IMPORTS ====================
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import time
import os
import shutil
import re




# **LOADING DATASET**

The code in this cell first disables Weights & Biases logging to prevent tracking for this experiment. It then mounts your Google Drive to the Colab environment, which is necessary to access the dataset file stored there. After successfully mounting the drive, it proceeds to load the dataset from the specified file path, /content/drive/MyDrive/TAGALOG-ESSAYS.csv, into a pandas DataFrame named df. To provide an initial understanding of the data, the code prints the total number of samples loaded, lists the column names present in the DataFrame, displays a sample of the content from the 'TITLE' and 'ESSAY' columns before any preprocessing, and shows the distribution of labels within the dataset.

In [4]:
# ==================== DISABLE WANDB ====================
os.environ['WANDB_DISABLED'] = 'true'

# ==================== MOUNT GOOGLE DRIVE ====================
from google.colab import drive
drive.mount('/content/drive')

# ==================== LOAD DATASET ====================
print("\n" + "="*60)
print("LOADING DATASET")
print("="*60)

file_path = '/content/drive/MyDrive/TAGALOG-ESSAYS.csv'
df = pd.read_csv(file_path)

print("Dataset loaded successfully.")
print(f"Total samples: {len(df)}")
print(f"Columns: {list(df.columns)}")

# Show sample before preprocessing
print("\n" + "="*60)
print("SAMPLE DATA (Before Preprocessing)")
print("="*60)
print(f"Title sample: {df['TITLE'].iloc[0][:100]}")
print(f"Essay sample: {df['ESSAY'].iloc[0][:100]}...")
print(f"\nLabel distribution:\n{df['LABEL'].value_counts()}")

Mounted at /content/drive

LOADING DATASET
Dataset loaded successfully.
Total samples: 886
Columns: ['TITLE', 'ESSAY', 'LABEL']

SAMPLE DATA (Before Preprocessing)
Title sample: Edukasyon, Bulok na, Bakit Mahal
Maikling
Essay sample: Hindi ako isang mangmang sa katotohanan na kinahaharap ko bilang isang estudyante, at hindi isang bu...

Label distribution:
LABEL
1    477
0    409
Name: count, dtype: int64


# **DATA PREPROCESSING**

This cell cleans and normalizes the raw text before it is passed to the model. The `# DATA PREPROCESSING` comment marks the start of the section, and each step of the cleaning function is explained below.

**`def preprocess_text(text)`**: This line defines a Python function named preprocess_text that takes one argument, text. This function contains the logic for cleaning the input text.

**`if pd.isna(text) or text is None`**: This checks if the input **`text`** is a missing value (like NaN in pandas) or is **`None`**. If it is, the function returns an empty string "" to handle potential errors.

**`text = str(text)`**: This ensures that the input text is converted into a string data type, which is necessary for applying string operations and regular expressions.

**# 1. Remove extra whitespace:**

**`text = re.sub(r'\s+', ' ', text)`**: This line uses a regular expression (re.sub) to replace one or more whitespace characters (\s+) with a single space ( ). This cleans up multiple spaces, tabs, and newlines within the text.

**# 2. Remove leading/trailing whitespace:**

**`text = text.strip()`**: This line removes any whitespace characters from the beginning and end of the string.

**# 3. Normalize quotes:**

**`text = text.replace('\u201c', '"').replace('\u201d', '"')`**: This line replaces curly double quotation marks (U+201C and U+201D) with standard straight double quotes. The line after it does the same for curly single quotes (U+2018 and U+2019), replacing them with a straight apostrophe.

**# 4. Remove special characters but keep Tagalog letters and basic punctuation:**

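**`text = re.sub(r'[^\w\s.,!?;:\-áéíóúàèìòùäëïöüñÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ]', '', text)`**: This regular expression removes any character that is NOT (`^`) a word character (`\w`, which includes letters, digits, and underscore), whitespace (`\s`), one of the listed punctuation marks (`.`, `,`, `!`, `?`, `;`, `:`, `-`), or one of the listed accented letters (áéíóúàèìòùäëïöüñÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ). In Python 3, `\w` already matches Unicode letters such as `á` and `ñ`, so the explicit accented-letter list is redundant but harmless. Quotation marks are not in the kept set, so the quotes normalized in step 3 are removed here as well.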

**# 5. Remove multiple punctuation:**

**`text = re.sub(r'([.!?])\1+', r'\1', text)`**: This regular expression finds a punctuation mark (`.`, `!`, or `?`) followed by one or more repeats of the same mark (`\1+`), that is, a run of two or more identical marks, and replaces the run with a single instance (`\1`). Mixed runs such as `?!` are left unchanged.

**# 6. Remove URLs:**

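**`text = re.sub(r'http\S+|www.\S+', '', text)`**: This regular expression removes URL-like tokens that start with `http` or `www`. The dot after `www` is not escaped, so it matches any character rather than only a literal period. Because step 4 has already removed `/` from the text, a URL reaches this step in a stripped form (for example `https:example.com`), which the `http\S+` branch still matches.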

**# 7. Remove email addresses:**

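**`text = re.sub(r'\S+@\S+', '', text)`**: This regular expression removes patterns that look like email addresses (non-space characters, followed by `@`, followed by more non-space characters). Note that step 4 has already removed every `@`, so in the current order this pattern has nothing left to match; it would only take effect if it ran before step 4.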

**# 8. Remove numbers-only sequences:**

**`text = re.sub(r'\b\d+\b', '', text)`**: This regular expression removes sequences that consist only of digits (\d+) and are surrounded by word boundaries (\b), preventing it from removing numbers that are part of words.

**# 9. Final cleanup:**

**`text = re.sub(r'\s+', ' ', text).strip()`**: This performs another pass to replace any sequence of whitespace with a single space and then removes leading/trailing whitespace, just in case previous steps introduced new whitespace issues.
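
The interaction between steps 4, 6, and 7 is easy to check on a toy string. The sketch below is not part of the notebook: it applies the same three patterns first in the notebook's order and then with URLs and emails removed first, which is only an assumption about what the step comments intend. The sample sentence and both helper names are made up for illustration.

```python
import re

# Same patterns as in preprocess_text; only the order differs between the helpers.
SPECIAL = r'[^\w\s.,!?;:\-áéíóúàèìòùäëïöüñÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ]'
URL = r'http\S+|www.\S+'
EMAIL = r'\S+@\S+'

def notebook_order(text):
    """Steps 4, 6, 7 in the order used by the notebook."""
    text = re.sub(SPECIAL, '', text)  # strips '@' and '/' before steps 6 and 7 run
    text = re.sub(URL, '', text)
    text = re.sub(EMAIL, '', text)    # finds nothing: '@' is already gone
    return re.sub(r'\s+', ' ', text).strip()

def links_first(text):
    """Hypothetical alternative: remove URLs and emails before special characters."""
    text = re.sub(URL, '', text)
    text = re.sub(EMAIL, '', text)
    text = re.sub(SPECIAL, '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Sulat kay juan@example.com tungkol sa https://example.com/aralin ngayon!"
print(notebook_order(sample))  # → Sulat kay juanexample.com tungkol sa ngayon!
print(links_first(sample))     # → Sulat kay tungkol sa ngayon!
```

In the notebook's order the email address survives as `juanexample.com`, while the URL is still removed. Whether that matters depends on how often the essays contain such tokens, which the notebook does not report.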

**Apply preprocessing:**

*   **`print("Preprocessing TITLE column...")`**: Prints a message indicating the start of preprocessing for the 'TITLE' column.
*   **`df['TITLE_CLEAN'] = df['TITLE'].apply(preprocess_text)`**: Applies the `preprocess_text` function to each value in the original 'TITLE' column and saves the cleaned results into a new column named 'TITLE_CLEAN'.
*   **`print("Preprocessing ESSAY column...")`**: Prints a message indicating the start of preprocessing for the 'ESSAY' column.
*   **`df['ESSAY_CLEAN'] = df['ESSAY'].apply(preprocess_text)`**: Applies the `preprocess_text` function to each value in the original 'ESSAY' column and saves the cleaned results into a new column named 'ESSAY_CLEAN'.

**Show sample after preprocessing:**

The following print statements show a comparison between the original text in 'TITLE' and 'ESSAY' and their cleaned versions in 'TITLE_CLEAN' and 'ESSAY_CLEAN' from the first row of the DataFrame.

In [5]:
# ==================== DATA PREPROCESSING ====================
print("\n" + "="*60)
print("DATA PREPROCESSING")
print("="*60)

def preprocess_text(text):
    """
    Clean and normalize text for better model performance
    """
    if pd.isna(text) or text is None:
        return ""

    # Convert to string
    text = str(text)

    # 1. Remove extra whitespace (multiple spaces, tabs, newlines)
    text = re.sub(r'\s+', ' ', text)

    # 2. Remove leading/trailing whitespace
    text = text.strip()

    # 3. Normalize quotes
    text = text.replace('\u201c', '"').replace('\u201d', '"')  # curly double quotes -> "
    text = text.replace('\u2018', "'").replace('\u2019', "'")  # curly single quotes -> '

    # 4. Remove special characters but keep Tagalog letters and basic punctuation
    # Keep: letters, numbers, spaces, periods, commas, question marks, exclamation points
    # Remove: excessive punctuation, symbols, emojis
    text = re.sub(r'[^\w\s.,!?;:\-áéíóúàèìòùäëïöüñÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ]', '', text)

    # 5. Remove multiple punctuation (e.g., "!!!" -> "!")
    text = re.sub(r'([.!?])\1+', r'\1', text)

    # 6. Remove URLs
    text = re.sub(r'http\S+|www.\S+', '', text)

    # 7. Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # 8. Remove numbers-only sequences (e.g., "12345")
    text = re.sub(r'\b\d+\b', '', text)

    # 9. Final cleanup: remove extra spaces again
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply preprocessing
print("Preprocessing TITLE column...")
df['TITLE_CLEAN'] = df['TITLE'].apply(preprocess_text)

print("Preprocessing ESSAY column...")
df['ESSAY_CLEAN'] = df['ESSAY'].apply(preprocess_text)

# Show sample after preprocessing
print("\n" + "="*60)
print("SAMPLE DATA (After Preprocessing)")
print("="*60)
print(f"Title before: {df['TITLE'].iloc[0][:100]}")
print(f"Title after:  {df['TITLE_CLEAN'].iloc[0][:100]}")
print(f"\nEssay before: {df['ESSAY'].iloc[0][:100]}...")
print(f"Essay after:  {df['ESSAY_CLEAN'].iloc[0][:100]}...")



DATA PREPROCESSING
Preprocessing TITLE column...
Preprocessing ESSAY column...

SAMPLE DATA (After Preprocessing)
Title before: Edukasyon, Bulok na, Bakit Mahal
Maikling
Title after:  Edukasyon, Bulok na, Bakit Mahal Maikling

Essay before: Hindi ako isang mangmang sa katotohanan na kinahaharap ko bilang isang estudyante, at hindi isang bu...
Essay after:  Hindi ako isang mangmang sa katotohanan na kinahaharap ko bilang isang estudyante, at hindi isang bu...


# **DATA QUALITY CHECKS**

This section of code performs essential data quality checks on the text data after it has been preprocessed. It first examines the cleaned 'TITLE' and 'ESSAY' columns to identify if any entries became empty during the cleaning process, and then reports the counts of such empty titles and essays. Following this, the code calculates and presents descriptive statistics about the length of the text in the cleaned columns. It prints the average, minimum, and maximum number of characters for both the cleaned titles and essays. These checks are important to understand the state of the data after preprocessing and to identify any potential issues before proceeding with further analysis or model training.

In [6]:
# ==================== DATA QUALITY CHECKS ====================
print("\n" + "="*60)
print("DATA QUALITY CHECKS")
print("="*60)

# Check for empty strings after preprocessing
empty_titles = df[df['TITLE_CLEAN'].str.len() == 0]
empty_essays = df[df['ESSAY_CLEAN'].str.len() == 0]

print(f"Empty titles after preprocessing: {len(empty_titles)}")
print(f"Empty essays after preprocessing: {len(empty_essays)}")

# Check length statistics
print("\nLength Statistics:")
print(f"Title length (avg): {df['TITLE_CLEAN'].str.len().mean():.1f} characters")
print(f"Title length (min): {df['TITLE_CLEAN'].str.len().min()} characters")
print(f"Title length (max): {df['TITLE_CLEAN'].str.len().max()} characters")
print(f"\nEssay length (avg): {df['ESSAY_CLEAN'].str.len().mean():.1f} characters")
print(f"Essay length (min): {df['ESSAY_CLEAN'].str.len().min()} characters")
print(f"Essay length (max): {df['ESSAY_CLEAN'].str.len().max()} characters")



DATA QUALITY CHECKS
Empty titles after preprocessing: 1
Empty essays after preprocessing: 1

Length Statistics:
Title length (avg): 43.6 characters
Title length (min): 0 characters
Title length (max): 123 characters

Essay length (avg): 1441.9 characters
Essay length (min): 0 characters
Essay length (max): 18285 characters


# **DATA CLEANING**

This part of the code removes unwanted rows based on the quality checks and other criteria.
*   **`# Remove rows with missing values in original columns`**: This comment explains that the following code will remove rows where the original 'TITLE', 'ESSAY', or 'LABEL' columns had missing values (NaN).
    *   `df_before = len(df)`: Stores the current number of rows in `df` before removing missing values.
    *   `df = df.dropna(subset=['TITLE', 'ESSAY', 'LABEL'])`: This line removes rows from the DataFrame `df` if there are any missing values (`NaN`) in the specified columns: 'TITLE', 'ESSAY', or 'LABEL'.
    *   `print(f"Removed {df_before - len(df)} rows with missing original data")`: Prints how many rows were removed in the previous step by subtracting the new length from the old length.
*   **`# Remove rows with empty strings after preprocessing`**: This comment explains that the following code will remove rows where the cleaned text in 'TITLE_CLEAN' or 'ESSAY_CLEAN' is empty.
    *   `df_before = len(df)`: Stores the current number of rows in `df` before removing empty preprocessed data.
    *   `df = df[df['TITLE_CLEAN'].str.len() > 0]`: This line filters the DataFrame to keep only the rows where the length of the text in 'TITLE_CLEAN' is greater than 0 (i.e., not empty).
    *   `df = df[df['ESSAY_CLEAN'].str.len() > 0]`: This line further filters the DataFrame to keep only the rows where the length of the text in 'ESSAY_CLEAN' is greater than 0.
    *   `print(f"Removed {df_before - len(df)} rows with empty preprocessed data")`: Prints how many rows were removed due to having empty preprocessed text.
*   **`# Remove duplicates based on preprocessed text`**: This comment explains that duplicate rows will be removed based on the cleaned text and the label.
    *   `df_before = len(df)`: Stores the current number of rows in `df` before removing duplicates.
    *   `df = df.drop_duplicates(subset=['TITLE_CLEAN', 'ESSAY_CLEAN', 'LABEL'])`: This line removes duplicate rows from the DataFrame, considering rows as duplicates if they have the same values in 'TITLE_CLEAN', 'ESSAY_CLEAN', and 'LABEL'.
    *   `print(f"Removed {df_before - len(df)} duplicate rows")`: Prints how many duplicate rows were removed in the previous step by subtracting the new length from the old length.
*   **`# Remove very short titles (less than 3 characters)`**: This comment explains that rows with titles shorter than 3 characters after cleaning will be removed.
    *   `df_before = len(df)`: Stores the current number of rows in `df` before removing short titles.
    *   `df = df[df['TITLE_CLEAN'].str.len() >= 3]`: This line filters the DataFrame to keep only rows where the length of the text in 'TITLE_CLEAN' is 3 or more characters.
    *   `print(f"Removed {df_before - len(df)} rows with very short titles (<3 chars)")`: Prints how many rows with very short titles were removed.
*   **`# Remove very short essays (less than 50 characters)`**: This comment explains that rows with essays shorter than 50 characters after cleaning will be removed.
    *   `df_before = len(df)`: Stores the current number of rows in `df` before removing short essays.
    *   `df = df[df['ESSAY_CLEAN'].str.len() >= 50]`: This line filters the DataFrame to keep only rows where the length of the text in 'ESSAY_CLEAN' is 50 or more characters.
    *   `print(f"Removed {df_before - len(df)} rows with very short essays (<50 chars)")`: Prints how many rows with very short essays were removed.
*   **`print(f"\nFinal dataset size: {len(df)} samples")`**: Prints the total number of rows remaining in the DataFrame after all the cleaning steps.
*   **`# Check final label distribution`**: This comment indicates that the final distribution of labels in the cleaned dataset will be printed.
    *   `print(f"\nFinal label distribution:\n{df['LABEL'].value_counts()}")`: Prints the count of each unique value in the 'LABEL' column of the cleaned DataFrame.
    *   `print(f"Label balance: {df['LABEL'].value_counts(normalize=True).round(3)}")`: Prints the proportion of each label (a fraction between 0 and 1, not a percentage), rounded to 3 decimal places, to show the balance of the dataset's classes.
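
The five filters above share one pattern: record the row count, filter, then print the difference. As a minimal sketch (not the notebook's code), the helper below wraps that pattern. `drop_and_report` and the toy DataFrame are hypothetical and exist only for illustration.

```python
import pandas as pd

def drop_and_report(df, mask, reason):
    """Keep rows where mask is True and print how many rows were dropped."""
    kept = df[mask]
    print(f"Removed {len(df) - len(kept)} rows: {reason}")
    return kept

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "TITLE_CLEAN": ["Edukasyon", "Edukasyon", "", "Ok"],
    "ESSAY_CLEAN": ["x" * 60, "x" * 60, "y" * 60, "z" * 60],
    "LABEL": [1, 1, 0, 0],
})

toy = drop_and_report(toy, toy["TITLE_CLEAN"].str.len() > 0, "empty title")
# ~duplicated() keeps the first occurrence, like drop_duplicates() does by default.
toy = drop_and_report(
    toy,
    ~toy.duplicated(subset=["TITLE_CLEAN", "ESSAY_CLEAN", "LABEL"]),
    "duplicate",
)
toy = drop_and_report(toy, toy["TITLE_CLEAN"].str.len() >= 3, "title shorter than 3 chars")
print(f"Final size: {len(toy)}")
```

Each call removes one row from the toy data, leaving a single row. The helper only changes how the steps are written, not what they do.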

In [7]:
# ==================== DATA CLEANING ====================
print("\n" + "="*60)
print("DATA CLEANING")
print("="*60)

# Remove rows with missing values in original columns
df_before = len(df)
df = df.dropna(subset=['TITLE', 'ESSAY', 'LABEL'])
print(f"Removed {df_before - len(df)} rows with missing original data")

# Remove rows with empty strings after preprocessing
df_before = len(df)
df = df[df['TITLE_CLEAN'].str.len() > 0]
df = df[df['ESSAY_CLEAN'].str.len() > 0]
print(f"Removed {df_before - len(df)} rows with empty preprocessed data")

# Remove duplicates based on preprocessed text
df_before = len(df)
df = df.drop_duplicates(subset=['TITLE_CLEAN', 'ESSAY_CLEAN', 'LABEL'])
print(f"Removed {df_before - len(df)} duplicate rows")

# Remove very short titles (less than 3 characters)
df_before = len(df)
df = df[df['TITLE_CLEAN'].str.len() >= 3]
print(f"Removed {df_before - len(df)} rows with very short titles (<3 chars)")

# Remove very short essays (less than 50 characters)
df_before = len(df)
df = df[df['ESSAY_CLEAN'].str.len() >= 50]
print(f"Removed {df_before - len(df)} rows with very short essays (<50 chars)")

print(f"\nFinal dataset size: {len(df)} samples")

# Check final label distribution
print(f"\nFinal label distribution:\n{df['LABEL'].value_counts()}")
print(f"Label balance: {df['LABEL'].value_counts(normalize=True).round(3)}")


DATA CLEANING
Removed 1 rows with missing original data
Removed 0 rows with empty preprocessed data
Removed 11 duplicate rows
Removed 0 rows with very short titles (<3 chars)
Removed 0 rows with very short essays (<50 chars)

Final dataset size: 874 samples

Final label distribution:
LABEL
1    468
0    406
Name: count, dtype: int64
Label balance: LABEL
1    0.535
0    0.465
Name: proportion, dtype: float64


# **SPLIT DATA**

This part of the code prepares the data for training and testing the model.
*   **`print("\n" + "="*60)`**: Prints a separator line.
*   **`print("SPLITTING DATA")`**: Prints the header for this section.
*   **`print("="*60)`**: Prints another separator line.
*   **`train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['LABEL'])`**: This is the core line for splitting the data.
    *   `train_test_split(df, ...)`: Uses the `train_test_split` function from scikit-learn to divide the DataFrame `df`.
    *   `test_size=0.2`: Specifies that 20% of the data should be allocated to the testing set.
    *   `random_state=42`: Sets a seed for the random number generator. This ensures that the split is the same every time the code is run, making results reproducible.
    *   `stratify=df['LABEL']`: This is important for maintaining the same proportion of labels in both the training and testing sets as in the original dataset. This is especially useful for imbalanced datasets.
    *   `train_df, test_df = ...`: The function returns two DataFrames, one for the training data (`train_df`) and one for the testing data (`test_df`).
*   **`print(f"Training samples: {len(train_df)}")`**: Prints the number of samples (rows) in the training set.
*   **`print(f"Testing samples: {len(test_df)}")`**: Prints the number of samples (rows) in the testing set.
*   **`print(f"\nTrain label distribution:\n{train_df['LABEL'].value_counts()}")`**: Prints the distribution of labels within the training set.
*   **`print(f"Test label distribution:\n{test_df['LABEL'].value_counts()}")`**: Prints the distribution of labels within the testing set.
*   **`# Prepare training examples using CLEAN data`**: This comment indicates that the following code is preparing the data for training the SentenceTransformer model.
*   **`train_examples = [`**: This starts a list comprehension to create the training examples in a format required by the `sentence-transformers` library.
    *   `InputExample(...)`: For each row in the `train_df`, an `InputExample` object is created.
    *   `texts=[str(row['TITLE_CLEAN']), str(row['ESSAY_CLEAN'])]`: The `texts` attribute of the `InputExample` is a list containing the cleaned title and cleaned essay from the current row. They are converted to strings just to be safe.
    *   `label=float(row['LABEL'])`: The `label` attribute is the label from the current row, converted to a float.
    *   `for index, row in train_df.iterrows()`: This iterates through each row of the `train_df`.
*   **`# Get true labels for evaluation`**: This comment indicates that the true labels for the testing set are being prepared.
*   **`true_labels = test_df['LABEL'].tolist()`**: This line extracts the 'LABEL' column from the `test_df` and converts it into a Python list. This list will be used later to compare against the model's predictions to evaluate its performance.

In [8]:
# ==================== SPLIT DATA ====================
print("\n" + "="*60)
print("SPLITTING DATA")
print("="*60)

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['LABEL'])
print(f"Training samples: {len(train_df)}")
print(f"Testing samples: {len(test_df)}")
print(f"\nTrain label distribution:\n{train_df['LABEL'].value_counts()}")
print(f"Test label distribution:\n{test_df['LABEL'].value_counts()}")

# Prepare training examples using CLEAN data
train_examples = [
    InputExample(texts=[str(row['TITLE_CLEAN']), str(row['ESSAY_CLEAN'])], label=float(row['LABEL']))
    for index, row in train_df.iterrows()
]

# Get true labels for evaluation
true_labels = test_df['LABEL'].tolist()


SPLITTING DATA
Training samples: 699
Testing samples: 175

Train label distribution:
LABEL
1    374
0    325
Name: count, dtype: int64
Test label distribution:
LABEL
1    94
0    81
Name: count, dtype: int64
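
The split sizes and the effect of `stratify` can be checked by hand from the counts printed above. The sketch below uses only those printed numbers. The rounding rule (a fractional test size is rounded up) is how scikit-learn's `train_test_split` behaves as far as I know, and the helper name `positive_share` is made up for this example.

```python
import math

# Counts copied from the notebook output above.
total = {1: 468, 0: 406}
train = {1: 374, 0: 325}
test = {1: 94, 0: 81}

n_total = sum(total.values())      # 874 samples after cleaning
n_test = math.ceil(0.2 * n_total)  # 174.8 rounded up
n_train = n_total - n_test

def positive_share(counts):
    """Fraction of samples with LABEL == 1."""
    return counts[1] / sum(counts.values())

print(n_train, n_test)  # → 699 175
for name, counts in [("all", total), ("train", train), ("test", test)]:
    print(f"{name:5s} positive share: {positive_share(counts):.3f}")
```

All three shares come out within about 0.2 percentage points of each other (roughly 53.5% to 53.7%), which is what stratifying on `LABEL` is meant to achieve.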


# **HYPERPARAMETERS**

This code block sets up the hyperparameters and configuration for your model training and evaluation.

1.  **Hyperparameters**: It defines key values that control the training process:
    *   `batch_size`: The number of training examples processed together in one go during training.
    *   `num_epochs`: The number of times the entire training dataset will be passed through the model.
    *   `model_name`: Specifies the pre-trained Sentence Transformer model to be used (`all-MiniLM-L6-v2`).
    *   `threshold`: A value used later to determine if the similarity between a title and essay embedding indicates a "match" (similarity >= threshold) or "not match" (similarity < threshold).
2.  **Configuration Output**: It then prints these configuration values to the console, providing a clear summary of the settings being used for the current experiment, including a note that the cleaned data columns ('TITLE_CLEAN', 'ESSAY_CLEAN') will be used.
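
To make the role of `threshold` concrete, the sketch below applies the rule "similarity >= threshold means match" to a handful of made-up similarity scores and labels. None of these numbers come from the dataset, and `predict` and `accuracy` are illustrative helpers, not functions from the notebook.

```python
# Made-up similarity scores and labels, for illustration only.
similarities = [0.15, 0.35, 0.42, 0.55, 0.80]
labels = [0, 0, 1, 0, 1]

def predict(sims, threshold):
    """Return 1 ("match") when similarity >= threshold, else 0 ("not match")."""
    return [1 if s >= threshold else 0 for s in sims]

def accuracy(preds, truth):
    """Fraction of predictions that equal the true label."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

for threshold in (0.2, 0.4, 0.6):
    preds = predict(similarities, threshold)
    print(threshold, preds, accuracy(preds, labels))
```

A lower threshold labels more pairs as matches, and a higher one labels fewer. If the threshold were tuned rather than fixed at 0.4, it would be safer to tune it on a validation split than on the test set.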

In [9]:
# ==================== HYPERPARAMETERS ====================
batch_size = 16
num_epochs = 3
model_name = 'all-MiniLM-L6-v2'
threshold = 0.4

print("\n" + "="*60)
print("CONFIGURATION")
print("="*60)
print(f"Model: {model_name}")
print(f"Batch Size: {batch_size}")
print(f"Epochs: {num_epochs}")
print(f"Threshold: {threshold}")
print(f"Using preprocessed data: TITLE_CLEAN, ESSAY_CLEAN")
print("="*60)


CONFIGURATION
Model: all-MiniLM-L6-v2
Batch Size: 16
Epochs: 3
Threshold: 0.4
Using preprocessed data: TITLE_CLEAN, ESSAY_CLEAN


# **BASELINE EVALUATION**

This code block evaluates the performance of a pre-trained Sentence Transformer model *before* any fine-tuning is applied. This is called the "baseline" evaluation.

1.  **Load Baseline Model**: It loads the specified pre-trained Sentence Transformer model (`all-MiniLM-L6-v2` in this case) using `SentenceTransformer(model_name)`.
2.  **Generate Embeddings**: It then uses this baseline model to generate numerical representations (embeddings) for the cleaned titles and cleaned essays in your *test* dataset. Embeddings are dense vectors that capture the semantic meaning of the text.
3.  **Calculate Similarities**: It calculates the cosine similarity between the embedding of each title and its corresponding essay in the test set. Cosine similarity measures how similar two vectors are in direction, indicating the semantic similarity between the title and essay.
4.  **Make Predictions**: Based on a predefined `threshold`, it makes predictions. If the cosine similarity between a title and essay pair is greater than or equal to the `threshold`, it predicts a "match" (represented by 1); otherwise, it predicts "not match" (represented by 0).
5.  **Calculate Baseline Metrics**: It compares these predictions to the actual true labels from the test set (`true_labels`) to calculate several evaluation metrics: accuracy, precision, recall, and F1-score. These metrics quantify how well the baseline model performs.
6.  **Print Results**: Finally, it prints a summary of the baseline evaluation results, including the calculated metrics and the confusion matrix, which shows the counts of true positive, true negative, false positive, and false negative predictions. This provides a benchmark to compare against the fine-tuned model's performance later.
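
Step 3 needs only the similarity of each title with its own essay. The notebook gets this by taking the diagonal of the full title-by-essay similarity matrix. The sketch below shows an equivalent row-wise computation on toy 2-D vectors. `paired_cosine` is a hypothetical helper and the vectors are made up; real embeddings have many more dimensions.

```python
import numpy as np

def paired_cosine(a, b):
    """Cosine similarity between row i of a and row i of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)

# Toy "embeddings", made up for illustration.
titles = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
essays = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

paired = paired_cosine(titles, essays)

# Full N x N matrix (every title against every essay), then its diagonal.
unit_t = titles / np.linalg.norm(titles, axis=1, keepdims=True)
unit_e = essays / np.linalg.norm(essays, axis=1, keepdims=True)
full = unit_t @ unit_e.T

print(paired.round(4))
print(np.allclose(paired, np.diag(full)))
```

Both routes give the same values. The row-wise version avoids building the N x N matrix, which makes little difference for 175 test pairs but would for much larger sets. The helper assumes no embedding is the zero vector.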

In [10]:
# ==================== BASELINE EVALUATION ====================
print("\n" + "="*60)
print("BASELINE EVALUATION (Pre-trained Model)")
print("="*60)

# Load pre-trained model (not fine-tuned)
baseline_model = SentenceTransformer(model_name)
print(f"✓ Loaded baseline model: {model_name}")

# Generate embeddings using baseline model (with CLEAN data)
print("\nGenerating baseline embeddings...")
baseline_title_emb = baseline_model.encode(test_df['TITLE_CLEAN'].tolist(), convert_to_tensor=False, show_progress_bar=True)
baseline_essay_emb = baseline_model.encode(test_df['ESSAY_CLEAN'].tolist(), convert_to_tensor=False, show_progress_bar=True)

# Calculate similarities
print("Calculating baseline similarities...")
baseline_similarities = np.diag(cosine_similarity(baseline_title_emb, baseline_essay_emb))
baseline_predictions = [1 if sim >= threshold else 0 for sim in baseline_similarities]

# Calculate baseline metrics
baseline_accuracy = accuracy_score(true_labels, baseline_predictions)
baseline_precision = precision_score(true_labels, baseline_predictions, zero_division=0)
baseline_recall = recall_score(true_labels, baseline_predictions, zero_division=0)
baseline_f1 = f1_score(true_labels, baseline_predictions, zero_division=0)

print("\n" + "="*60)
print("BASELINE RESULTS")
print("="*60)
print(f"Accuracy:  {baseline_accuracy:.4f} ({baseline_accuracy*100:.2f}%)")
print(f"Precision: {baseline_precision:.4f} ({baseline_precision*100:.2f}%)")
print(f"Recall:    {baseline_recall:.4f} ({baseline_recall*100:.2f}%)")
print(f"F1-Score:  {baseline_f1:.4f} ({baseline_f1*100:.2f}%)")
print("="*60)

print("\nBaseline Confusion Matrix:")
print(confusion_matrix(true_labels, baseline_predictions))



BASELINE EVALUATION (Pre-trained Model)



✓ Loaded baseline model: all-MiniLM-L6-v2

Generating baseline embeddings...



Calculating baseline similarities...

BASELINE RESULTS
Accuracy:  0.5029 (50.29%)
Precision: 0.5215 (52.15%)
Recall:    0.9043 (90.43%)
F1-Score:  0.6615 (66.15%)

Baseline Confusion Matrix:
[[ 3 78]
 [ 9 85]]
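
The reported metrics can be reproduced from the confusion matrix alone. The sketch below assumes scikit-learn's layout (rows are true labels 0 and 1, columns are predicted labels 0 and 1) and uses only the four counts printed above.

```python
# Confusion matrix from the baseline output:
# rows = true labels (0, 1), columns = predicted labels (0, 1).
tn, fp = 3, 78
fn, tp = 9, 85

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# → 0.5029 0.5215 0.9043 0.6615
```

For reference, 94 of the 175 test pairs have label 1, so always predicting a match would score 94/175, about 53.7% accuracy. The baseline predicts a match for 163 of the 175 pairs, which explains the combination of high recall and near-chance accuracy at this threshold.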


# **FINE-TUNING**

This code block performs the fine-tuning of the Sentence Transformer model.

1.  **Create DataLoader**: It creates a `DataLoader` from the `train_examples` list. This helps to efficiently load the training data in batches during the fine-tuning process. `shuffle=True` ensures that the training data is randomly ordered for each epoch.
2.  **Load Model for Fine-tuning**: It loads the pre-trained Sentence Transformer model specified by `model_name` again, but this time, this model object (`model`) will be fine-tuned.
3.  **Define Loss Function**: It defines `train_loss` using `losses.CosineSimilarityLoss`. This loss computes the cosine similarity between the two embeddings of each pair and, by default, penalizes its squared difference from the float label, so matching title/essay pairs (label 1.0) are pushed toward a similarity of 1 and non-matching pairs (label 0.0) toward 0.
4.  **Calculate Warmup Steps**: It sets `warmup_steps` to 10% of the total number of training steps (batches per epoch × epochs). During warmup the learning rate increases gradually from a small value, which tends to stabilize the start of training.
5.  **Print Training Information**: It prints the calculated `Warmup Steps` and the `Total Training Steps`.
6.  **Start Fine-tuning**: It starts the fine-tuning process using the `model.fit()` method:
    *   `train_objectives`: Specifies the training data (`train_dataloader`) and the loss function (`train_loss`) to use.
    *   `epochs`: Sets the number of times the model will iterate over the entire training dataset.
    *   `warmup_steps`: Applies the calculated warmup steps.
    *   `output_path`: Specifies the directory where the fine-tuned model will be saved.
    *   `show_progress_bar`: Shows a progress bar during training.
    *   `use_amp=False`: Disables automatic mixed precision training (using lower precision for potentially faster training, but disabled here).
7.  **Measure Training Time**: It records the start and end time of the training process to calculate the total `training_time`.
8.  **Print Completion Message**: Finally, it prints a confirmation message that training is complete and shows the total training time in seconds and minutes.
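
The step counts printed by this cell follow directly from the training set size and the hyperparameters. The sketch below reproduces the arithmetic, assuming the `DataLoader` keeps the final partial batch (its default behaviour).

```python
import math

train_samples = 699  # from the split output
batch_size = 16
num_epochs = 3

# 699 / 16 = 43.69, so the last batch is partial and still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.1)  # truncates 13.2 to 13

print(steps_per_epoch, total_steps, warmup_steps)
# → 44 132 13
```

These values match the `Warmup Steps: 13` and `Total Training Steps: 132` lines in the cell output.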

In [11]:

# ==================== FINE-TUNING ====================
print("\n" + "="*60)
print("FINE-TUNING MODEL")
print("="*60)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

# Load model for fine-tuning
model = SentenceTransformer(model_name)
train_loss = losses.CosineSimilarityLoss(model)
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)

print(f"Warmup Steps: {warmup_steps}")
print(f"Total Training Steps: {len(train_dataloader) * num_epochs}")

print("\nStarting fine-tuning...")
start_time = time.time()

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./fine_tuned_minilm',
    show_progress_bar=True,
    use_amp=False
)

training_time = time.time() - start_time
print(f"\n✓ Training completed in {training_time:.2f} seconds ({training_time/60:.2f} minutes)")



FINE-TUNING MODEL


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Warmup Steps: 13
Total Training Steps: 132

Starting fine-tuning...





✓ Training completed in 21.88 seconds (0.36 minutes)


# **FINE-TUNING EVALUATION**

This code block evaluates the performance of the Sentence Transformer model *after* it has been fine-tuned on your dataset.

1.  **Generate Fine-tuned Embeddings**: It uses the `model` (which has now been fine-tuned) to generate new embeddings for the cleaned titles and cleaned essays in your *test* dataset.
2.  **Calculate Fine-tuned Similarities**: It calculates the cosine similarity between the title and essay embeddings for each pair in the test set, similar to the baseline evaluation.
3.  **Make Fine-tuned Predictions**: Using the same `threshold` as the baseline evaluation, it makes predictions (match or not match) based on these fine-tuned similarities.
4.  **Calculate Fine-tuned Metrics**: It compares these fine-tuned predictions to the `true_labels` from the test set and calculates the accuracy, precision, recall, and F1-score for the fine-tuned model.
5.  **Print Fine-tuned Results**: Finally, it prints a summary of the fine-tuned model's performance, including the metrics, the training time from the previous step, the confusion matrix, and a classification report. These results can then be compared to the baseline results to see how fine-tuning changed the model's performance on your specific task.
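
Steps 2 and 3 can be sketched in isolation. The notebook cell builds the full title-by-essay similarity matrix and keeps only its diagonal; the version below computes the same paired similarities row by row, which avoids the N×N matrix. The helper names `paired_cosine_similarity` and `predict_matches`, the toy vectors, and the threshold value `0.5` are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np


def paired_cosine_similarity(a, b):
    """Cosine similarity between row i of `a` and row i of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Note: this sketch does not guard against all-zero rows.
    return dot / norms


def predict_matches(similarities, threshold):
    """Label a pair 1 (match) when its similarity reaches the threshold."""
    return (np.asarray(similarities) >= threshold).astype(int)


# Toy 2-dimensional "embeddings" standing in for title and essay vectors.
title_emb = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
essay_emb = np.array([[2.0, 0.0], [1.0, -1.0], [1.0, 1.0]])

sims = paired_cosine_similarity(title_emb, essay_emb)
preds = predict_matches(sims, threshold=0.5)  # hypothetical threshold
print(sims)
print(preds)
```

Because the same threshold is reused from the baseline, any change in the predictions comes from the embeddings moving, not from a different decision rule.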

In [12]:
# ==================== FINE-TUNED EVALUATION ====================
print("\n" + "="*60)
print("FINE-TUNED MODEL EVALUATION")
print("="*60)

print("Generating fine-tuned embeddings...")
finetuned_title_emb = model.encode(test_df['TITLE_CLEAN'].tolist(), convert_to_tensor=False, show_progress_bar=True)
finetuned_essay_emb = model.encode(test_df['ESSAY_CLEAN'].tolist(), convert_to_tensor=False, show_progress_bar=True)

print("Calculating fine-tuned similarities...")
finetuned_similarities = np.diag(cosine_similarity(finetuned_title_emb, finetuned_essay_emb))
finetuned_predictions = [1 if sim >= threshold else 0 for sim in finetuned_similarities]

# Calculate fine-tuned metrics
finetuned_accuracy = accuracy_score(true_labels, finetuned_predictions)
finetuned_precision = precision_score(true_labels, finetuned_predictions, zero_division=0)
finetuned_recall = recall_score(true_labels, finetuned_predictions, zero_division=0)
finetuned_f1 = f1_score(true_labels, finetuned_predictions, zero_division=0)

print("\n" + "="*60)
print("FINE-TUNED RESULTS")
print("="*60)
print(f"Accuracy:  {finetuned_accuracy:.4f} ({finetuned_accuracy*100:.2f}%)")
print(f"Precision: {finetuned_precision:.4f} ({finetuned_precision*100:.2f}%)")
print(f"Recall:    {finetuned_recall:.4f} ({finetuned_recall*100:.2f}%)")
print(f"F1-Score:  {finetuned_f1:.4f} ({finetuned_f1*100:.2f}%)")
print(f"Training Time: {training_time:.2f}s")
print("="*60)

print("\nFine-tuned Confusion Matrix:")
print(confusion_matrix(true_labels, finetuned_predictions))

print("\nClassification Report:")
print(classification_report(true_labels, finetuned_predictions, target_names=['Not Match', 'Match']))




FINE-TUNED MODEL EVALUATION
Generating fine-tuned embeddings...



Calculating fine-tuned similarities...

FINE-TUNED RESULTS
Accuracy:  0.7486 (74.86%)
Precision: 0.7119 (71.19%)
Recall:    0.8936 (89.36%)
F1-Score:  0.7925 (79.25%)
Training Time: 21.88s

Fine-tuned Confusion Matrix:
[[47 34]
 [10 84]]

Classification Report:
              precision    recall  f1-score   support

   Not Match       0.82      0.58      0.68        81
       Match       0.71      0.89      0.79        94

    accuracy                           0.75       175
   macro avg       0.77      0.74      0.74       175
weighted avg       0.76      0.75      0.74       175
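
The headline metrics can be checked by hand from the confusion matrix above. scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`, with rows as true labels and columns as predictions. The sketch below recomputes the four scores from those counts; the helper name `metrics_from_confusion` is hypothetical and not part of the notebook.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts read from the fine-tuned confusion matrix [[47 34] [10 84]].
m = metrics_from_confusion(tn=47, fp=34, fn=10, tp=84)
for name, value in m.items():
    print(f"{name:<10} {value:.4f}")
```

The 34 false positives against only 10 false negatives explain why recall for the Match class is high while its precision is lower.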



# **BASELINE VS FINE-TUNED COMPARISON**

This section compares the initial pre-trained model (the baseline) with the model after fine-tuning on your dataset. The first cell computes, for Accuracy, Precision, Recall, and F1-Score, the absolute difference between the fine-tuned and baseline scores and the relative change as a percentage of the baseline score, then prints the results as a formatted table. The second cell gives a sample-by-sample comparison: for the first few test examples it shows the true label alongside the prediction and similarity score from each model. This qualitative view shows how fine-tuning changed the model's behavior on individual titles and essays, including cases where a prediction became correct and cases where it became wrong.
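
The two numbers reported per metric are easy to confuse, so here is a small sketch of the arithmetic. The absolute difference is in score units (an accuracy change of +0.2457 is 24.57 percentage points), while the percentage is relative to the baseline score. The helper name `improvement` is an assumption; the inputs are the rounded accuracy values printed in the table, so the last digit can differ slightly from values computed on unrounded scores.

```python
def improvement(baseline: float, finetuned: float) -> tuple:
    """Return (absolute difference, relative change in % of the baseline)."""
    delta = finetuned - baseline
    # Guard against division by zero, as the notebook cell does.
    pct = delta / baseline * 100 if baseline > 0 else 0.0
    return delta, pct


# Rounded accuracy values as printed in the comparison table.
delta, pct = improvement(baseline=0.5029, finetuned=0.7486)
print(f"{'Accuracy':<12} {delta:+.4f} ({pct:+.1f}%)")
```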

In [13]:
# ==================== BASELINE vs FINE-TUNED COMPARISON ====================
print("\n" + "="*60)
print("BASELINE vs FINE-TUNED COMPARISON")
print("="*60)
print(f"{'Metric':<12} {'Baseline':<12} {'Fine-tuned':<12} {'Improvement':<15}")
print("-"*60)

acc_improvement = finetuned_accuracy - baseline_accuracy
acc_improvement_pct = (acc_improvement / baseline_accuracy * 100) if baseline_accuracy > 0 else 0
print(f"{'Accuracy':<12} {baseline_accuracy:.4f}       {finetuned_accuracy:.4f}       {acc_improvement:+.4f} ({acc_improvement_pct:+.1f}%)")

prec_improvement = finetuned_precision - baseline_precision
prec_improvement_pct = (prec_improvement / baseline_precision * 100) if baseline_precision > 0 else 0
print(f"{'Precision':<12} {baseline_precision:.4f}       {finetuned_precision:.4f}       {prec_improvement:+.4f} ({prec_improvement_pct:+.1f}%)")

rec_improvement = finetuned_recall - baseline_recall
rec_improvement_pct = (rec_improvement / baseline_recall * 100) if baseline_recall > 0 else 0
print(f"{'Recall':<12} {baseline_recall:.4f}       {finetuned_recall:.4f}       {rec_improvement:+.4f} ({rec_improvement_pct:+.1f}%)")

f1_improvement = finetuned_f1 - baseline_f1
f1_improvement_pct = (f1_improvement / baseline_f1 * 100) if baseline_f1 > 0 else 0
print(f"{'F1-Score':<12} {baseline_f1:.4f}       {finetuned_f1:.4f}       {f1_improvement:+.4f} ({f1_improvement_pct:+.1f}%)")

print("="*60)




BASELINE vs FINE-TUNED COMPARISON
Metric       Baseline     Fine-tuned   Improvement    
------------------------------------------------------------
Accuracy     0.5029       0.7486       +0.2457 (+48.9%)
Precision    0.5215       0.7119       +0.1904 (+36.5%)
Recall       0.9043       0.8936       -0.0106 (-1.2%)
F1-Score     0.6615       0.7925       +0.1310 (+19.8%)


This code block provides a sample-by-sample comparison of predictions made by the baseline (pre-trained) and fine-tuned models.

1.  **Loop through Samples**: It iterates through a small number of examples (up to the first 5) in your test dataset.
2.  **Get Information**: For each sample, it retrieves the cleaned title, the true label, and the predictions and similarity scores from both the baseline and fine-tuned models.
3.  **Print Comparison**: It then prints the sample number, a snippet of the title, the true label, and the baseline and fine-tuned results, including their predictions, similarity scores, and an indicator (✓ or ✗) of whether the prediction was correct. This allows for a direct, visual comparison of how each model performed on specific examples.
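
Beyond printing the first few rows, it can help to list exactly where the two models disagree about correctness. The sketch below is one way to do that, assuming three equal-length sequences of labels and predictions; the function name `compare_predictions` is hypothetical. The example lists are the five samples shown in this section's output.

```python
def compare_predictions(true_labels, baseline_preds, finetuned_preds):
    """Return indices fixed by fine-tuning and indices it regressed."""
    fixed, regressed = [], []
    rows = zip(true_labels, baseline_preds, finetuned_preds)
    for i, (truth, base, tuned) in enumerate(rows):
        if base != truth and tuned == truth:
            fixed.append(i)        # baseline wrong, fine-tuned right
        elif base == truth and tuned != truth:
            regressed.append(i)    # baseline right, fine-tuned wrong
    return fixed, regressed


# The five samples printed by the comparison cell (0-based indices).
true_labels = [1, 0, 1, 0, 1]
baseline_preds = [1, 1, 1, 1, 1]
finetuned_preds = [1, 1, 1, 0, 0]

fixed, regressed = compare_predictions(true_labels, baseline_preds, finetuned_preds)
print("Fixed by fine-tuning:", fixed)
print("Regressed:           ", regressed)
```

Samples where both models are wrong (index 1 here) land in neither list, which keeps the focus on what fine-tuning actually changed.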

In [14]:
# ==================== SAMPLE PREDICTIONS COMPARISON ====================
print("\n" + "="*60)
print("SAMPLE PREDICTIONS: BASELINE vs FINE-TUNED")
print("="*60)

for i in range(min(5, len(test_df))):
    title = test_df.iloc[i]['TITLE_CLEAN'][:50]
    true_label = true_labels[i]

    baseline_pred = baseline_predictions[i]
    baseline_sim = baseline_similarities[i]

    finetuned_pred = finetuned_predictions[i]
    finetuned_sim = finetuned_similarities[i]

    print(f"\nSample {i+1}: {title}...")
    print(f"True Label: {true_label}")
    print(f"Baseline   - Pred: {baseline_pred} | Sim: {baseline_sim:.4f} | {'✓' if baseline_pred == true_label else '✗'}")
    print(f"Fine-tuned - Pred: {finetuned_pred} | Sim: {finetuned_sim:.4f} | {'✓' if finetuned_pred == true_label else '✗'}")




SAMPLE PREDICTIONS: BASELINE vs FINE-TUNED

Sample 1: Epekto ng Volcanic Eruptions sa Mga Komunidad sa A...
True Label: 1
Baseline   - Pred: 1 | Sim: 0.6712 | ✓
Fine-tuned - Pred: 1 | Sim: 0.7308 | ✓

Sample 2: Epekto ng Special Program in Sports sa Akademikong...
True Label: 0
Baseline   - Pred: 1 | Sim: 0.8025 | ✗
Fine-tuned - Pred: 1 | Sim: 0.8205 | ✗

Sample 3: Kahalagahan ng Wikang Filipino sa Makabagong Panah...
True Label: 1
Baseline   - Pred: 1 | Sim: 0.7412 | ✓
Fine-tuned - Pred: 1 | Sim: 0.5874 | ✓

Sample 4: Epekto ng Kakulangan sa Tulog sa Kabataan...
True Label: 0
Baseline   - Pred: 1 | Sim: 0.5929 | ✗
Fine-tuned - Pred: 0 | Sim: 0.2343 | ✓

Sample 5: Pagkasira ng Kalikasan: Isang Historikal na Pagtin...
True Label: 1
Baseline   - Pred: 1 | Sim: 0.4926 | ✓
Fine-tuned - Pred: 0 | Sim: 0.2574 | ✗


This code block is responsible for saving the fine-tuned model, experiment results, and the preprocessed data, and then providing a summary of the completed experiment.

1.  **Save Model**: It saves the fine-tuned model to a specified directory in your Google Drive.
2.  **Save Results**: It creates a dictionary containing the key details of the experiment: a timestamp, hyperparameters, dataset statistics, baseline metrics, fine-tuned metrics, and the calculated improvements. The dictionary is serialized as indented JSON and appended to a file in your Google Drive (the file is opened in `'a'` mode), so repeated runs accumulate one JSON object after another in the same file.
3.  **Save Preprocessed Data**: It saves the preprocessed DataFrame (`df`) to a new CSV file in your Google Drive, allowing you to easily access the cleaned data later.
4.  **Summary**: Finally, it prints a summary message indicating that the experiment is complete and lists where the model, results, and preprocessed data have been saved. It also provides instructions on how to load and use the fine-tuned model and how to preprocess new input data for prediction.
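
Because the results file is opened in append mode, running the notebook more than once leaves several indented JSON objects back to back, which `json.load` cannot read as a single document. The sketch below shows one way to read such a file back using `json.JSONDecoder.raw_decode` from the standard library. The function name `read_appended_json` and the toy records are assumptions for illustration.

```python
import json


def read_appended_json(text: str) -> list:
    """Parse a string holding several JSON objects written one after another."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


# Simulate two runs appending indented JSON, as the save cell does.
run_1 = json.dumps({"num_epochs": 1, "f1": 0.5}, indent=2) + "\n"
run_2 = json.dumps({"num_epochs": 2, "f1": 0.6}, indent=2) + "\n"
runs = read_appended_json(run_1 + run_2)
print(len(runs), "runs found")
```

To use it on the real file, read the file's contents into a string first and pass that string to the function.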

In [16]:
# ==================== SAVE MODEL ====================
drive_model_path = '/content/drive/MyDrive/fine_tuned_minilm'
print(f"\nSaving fine-tuned model to Google Drive...")
shutil.copytree('./fine_tuned_minilm', drive_model_path, dirs_exist_ok=True)
print(f"✓ Model saved at: {drive_model_path}")

# ==================== SAVE RESULTS ====================
import json
from datetime import datetime

results = {
    'timestamp': datetime.now().isoformat(),
    'model_name': model_name,
    'batch_size': batch_size,
    'num_epochs': num_epochs,
    'threshold': threshold,
    'preprocessing': 'enabled',
    'training_time_seconds': training_time,
    'dataset_stats': {
        'total_samples': len(df),
        'train_samples': len(train_df),
        'test_samples': len(test_df),
        'avg_title_length': float(df['TITLE_CLEAN'].str.len().mean()),
        'avg_essay_length': float(df['ESSAY_CLEAN'].str.len().mean())
    },
    'baseline': {
        'accuracy': float(baseline_accuracy),
        'precision': float(baseline_precision),
        'recall': float(baseline_recall),
        'f1': float(baseline_f1)
    },
    'finetuned': {
        'accuracy': float(finetuned_accuracy),
        'precision': float(finetuned_precision),
        'recall': float(finetuned_recall),
        'f1': float(finetuned_f1)
    },
    'improvements': {
        'accuracy': float(acc_improvement),
        'precision': float(prec_improvement),
        'recall': float(rec_improvement),
        'f1': float(f1_improvement)
    }
}

results_path = '/content/drive/MyDrive/experiment_results.json'
with open(results_path, 'a') as f:
    f.write(json.dumps(results, indent=2) + '\n')

print(f"✓ Results saved to: {results_path}")

# ==================== SAVE PREPROCESSED DATA ====================
preprocessed_data_path = '/content/drive/MyDrive/TAGALOG-ESSAYS-PREPROCESSED.csv'
df.to_csv(preprocessed_data_path, index=False)
print(f"✓ Preprocessed data saved to: {preprocessed_data_path}")

# ==================== SUMMARY ====================
print("\n" + "="*60)
print("EXPERIMENT COMPLETE!")
print("="*60)
print(f"✓ Data preprocessed and cleaned")
print(f"✓ Baseline evaluated")
print(f"✓ Model fine-tuned")
print(f"✓ Performance comparison completed")
print(f"✓ Model saved at: {drive_model_path}")
print(f"✓ Results saved at: {results_path}")
print(f"✓ Preprocessed data saved at: {preprocessed_data_path}")
print("\nTo use fine-tuned model:")
print(f"model = SentenceTransformer('{drive_model_path}')")
print("\nTo use for prediction, remember to preprocess your input:")
print("title_clean = preprocess_text(title)")
print("essay_clean = preprocess_text(essay)")
print("="*60)


Saving fine-tuned model to Google Drive...
✓ Model saved at: /content/drive/MyDrive/fine_tuned_minilm
✓ Results saved to: /content/drive/MyDrive/experiment_results.json
✓ Preprocessed data saved to: /content/drive/MyDrive/TAGALOG-ESSAYS-PREPROCESSED.csv

EXPERIMENT COMPLETE!
✓ Data preprocessed and cleaned
✓ Baseline evaluated
✓ Model fine-tuned
✓ Performance comparison completed
✓ Model saved at: /content/drive/MyDrive/fine_tuned_minilm
✓ Results saved at: /content/drive/MyDrive/experiment_results.json
✓ Preprocessed data saved at: /content/drive/MyDrive/TAGALOG-ESSAYS-PREPROCESSED.csv

To use fine-tuned model:
model = SentenceTransformer('/content/drive/MyDrive/fine_tuned_minilm')

To use for prediction, remember to preprocess your input:
title_clean = preprocess_text(title)
essay_clean = preprocess_text(essay)
