# Model Evaluation Demonstration

This notebook evaluates our trained models on the validation dataset. Each model is loaded from the weights saved after its training run.

### Models Demonstrated:
1. Logistic Regression (LogReg)
2. SVM
3. Multi-Layer Perceptron (MLP)
4. ELECTRA
5. BERT
6. RoBERTa
7. Ensemble


## Data Preprocessing

In this section, we:
1. Load the validation dataset.
2. Extract features with the `TfidfVectorizer` that was fitted and saved during training.
3. Standardize the extracted features with the `StandardScaler` that was fitted and saved during training.

These steps ensure that the validation dataset is processed consistently with the training data.
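
To illustrate why the vectorizer and scaler are loaded rather than refit, here is a minimal sketch of the fit-once, reuse-later pattern. The toy texts, the temporary directory, and the file names are illustrative assumptions; the real pipeline also goes through `apply_features`, which this sketch does not reproduce.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy corpora standing in for the real training and validation splits.
train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
val_texts = ["the cat barked", "an unseen word"]

# Training time: fit both transformers on training data only, then persist them.
tfidf = TfidfVectorizer()
train_matrix = tfidf.fit_transform(train_texts).toarray()
scaler = StandardScaler().fit(train_matrix)

with tempfile.TemporaryDirectory() as tmp_dir:
    tfidf_file = os.path.join(tmp_dir, "tfidf_vectorizer.joblib")
    scaler_file = os.path.join(tmp_dir, "standard_scaler.joblib")
    joblib.dump(tfidf, tfidf_file)
    joblib.dump(scaler, scaler_file)

    # Evaluation time: load the fitted objects and call transform(), never fit().
    loaded_tfidf = joblib.load(tfidf_file)
    loaded_scaler = joblib.load(scaler_file)

val_matrix = loaded_tfidf.transform(val_texts).toarray()
val_features = loaded_scaler.transform(val_matrix)

# The validation matrix has the same columns as the training matrix.
print(val_features.shape[1] == train_matrix.shape[1])
```

Because only `transform()` is called at evaluation time, the validation matrix keeps the column layout and scaling learned from the training data, and words that were not seen during training are ignored.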


In [2]:
import joblib
import sys
import os
import numpy as np
from datasets import Dataset

# Add src folder to the Python path
sys.path.append('../src')

# Import custom functions
from data_processing.preprocessing import get_dataset
from data_processing.feature_extraction import apply_features

# Paths to the TF-IDF vectorizer and scaler
tfidf_path = '../src/models/weights/tfidf_vectorizer.joblib'
scaler_path = '../src/models/weights/standard_scaler.joblib'

# Load the validation dataset
validation_dataset = get_dataset('val')

# Load the TF-IDF vectorizer
if os.path.exists(tfidf_path):
    print("TF-IDF Vectorizer file exists.")
    tfidf = joblib.load(tfidf_path)
else:
    print("TF-IDF Vectorizer file does not exist.")
    sys.exit(1)

# Apply TF-IDF and other features to the validation dataset
validation_dataset = validation_dataset.map(lambda batch: apply_features(batch, tf_idf=tfidf), batched=True)

# Convert features to a matrix format and extract labels
validation_features = np.vstack(validation_dataset['features'])
validation_labels = validation_dataset['label']  # Ground-truth class labels

# Load and apply the Standard Scaler
scaler = joblib.load(scaler_path)
validation_features = scaler.transform(validation_features)


TF-IDF Vectorizer file exists.



## Logistic Regression (LogReg) Evaluation

In this section, we:
1. Load the trained Logistic Regression model.
2. Use the model to predict on the validation dataset.
3. Evaluate the model's performance using:
    - Accuracy
    - Macro F1 Score
    - Classification Report
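
As a reminder of how the two headline metrics differ on imbalanced data, the sketch below computes accuracy and macro F1 by hand for made-up labels. The labels and the helper `per_class_f1` are hypothetical and only illustrate the definitions; the notebook itself uses scikit-learn's `accuracy_score` and `f1_score`.

```python
def per_class_f1(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # No correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy imbalanced labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Macro F1 is the unweighted mean of the per-class F1 scores.
f1_per_class = [per_class_f1(y_true, y_pred, c) for c in (0, 1)]
macro_f1 = sum(f1_per_class) / len(f1_per_class)

print(f"Accuracy: {accuracy:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
```

Here accuracy is 0.8 while macro F1 is 0.6875, because the minority class (F1 of 0.5) counts as much as the majority class (F1 of 0.875) in the unweighted mean.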


In [4]:
# Logistic Regression Evaluation

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression

# Path to the Logistic Regression model
model_path = '../src/models/weights/logreg/logreg_model_weights.joblib'

# Load the Logistic Regression model
logreg = joblib.load(model_path)

# Predict on the validation data
predictions = logreg.predict(validation_features)

# Calculate and print accuracy
accuracy = accuracy_score(validation_labels, predictions)
print("LogReg - Accuracy:", accuracy)

# Calculate and print macro F1 score
macro_f1 = f1_score(validation_labels, predictions, average='macro')
print("LogReg - Macro F1 Score:", macro_f1)

# Print detailed classification report
print("LogReg - Classification Report:\n", classification_report(validation_labels, predictions))

# Record probabilities for ensemble 
logreg_probs = logreg.predict_proba(validation_features)  


LogReg - Accuracy: 0.7708304353873974
LogReg - Macro F1 Score: 0.689905917584563
LogReg - Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.90      0.85     10257
           1       0.64      0.45      0.53      4121

    accuracy                           0.77     14378
   macro avg       0.72      0.68      0.69     14378
weighted avg       0.76      0.77      0.76     14378



## Support Vector Machine (SVM) Evaluation

In this section, we:
1. Load the trained SVM model.
2. Use the model to predict on the validation dataset.
3. Evaluate the model's performance using:
    - Accuracy
    - Macro F1 Score
    - Classification Report


In [5]:
# Decompress the SVM weights (the decompressed file is too large for the git repository)
import gzip
import shutil

# Paths to the compressed and decompressed files
compressed_file_path = '../src/models/weights/svm/svm_model_weights.joblib.gz'
decompressed_file_path = '../src/models/weights/svm/svm_model_weights.joblib'

# Open and decompress the .gz file
with gzip.open(compressed_file_path, 'rb') as f_in:
    with open(decompressed_file_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

print("Decompression complete! The file has been saved to:", decompressed_file_path)


Decompression complete! The file has been saved to: ../src/models/weights/svm/svm_model_weights.joblib


In [6]:
# Support Vector Machine (SVM) Evaluation

from sklearn.metrics import accuracy_score, f1_score, classification_report

# Path to the SVM model
svm_model_path = '../src/models/weights/svm/svm_model_weights.joblib'

# Load the SVM model
svm_model = joblib.load(svm_model_path)

# Predict on the validation data
svm_predictions = svm_model.predict(validation_features)

# Calculate and print accuracy
svm_accuracy = accuracy_score(validation_labels, svm_predictions)
print("SVM - Accuracy:", svm_accuracy)

# Calculate and print macro F1 score
svm_macro_f1 = f1_score(validation_labels, svm_predictions, average='macro')
print("SVM - Macro F1 Score:", svm_macro_f1)

# Print detailed classification report
print("SVM - Classification Report:\n", classification_report(validation_labels, svm_predictions))


SVM - Accuracy: 0.7143552649881764
SVM - Macro F1 Score: 0.6304512572863838
SVM - Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.83      0.81     10257
           1       0.50      0.41      0.45      4121

    accuracy                           0.71     14378
   macro avg       0.64      0.62      0.63     14378
weighted avg       0.70      0.71      0.71     14378
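
One way to put the accuracy figures in context is to compare them with a classifier that always predicts the majority class. The sketch below derives that baseline from the class supports printed in the reports above (10257 and 4121); it is an illustrative calculation, not part of the original evaluation.

```python
# Class supports copied from the classification reports above.
support = {0: 10257, 1: 4121}
total = sum(support.values())

# A classifier that always predicts the majority class (0).
baseline_accuracy = support[0] / total

# Per-class F1 for that baseline: class 0 has recall 1 and precision equal to
# its prevalence; class 1 is never predicted, so its F1 is 0.
precision_0 = support[0] / total
recall_0 = 1.0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 0.0
baseline_macro_f1 = (f1_0 + f1_1) / 2

print(f"Majority baseline accuracy: {baseline_accuracy:.4f}")
print(f"Majority baseline macro F1: {baseline_macro_f1:.4f}")
```

The baseline reaches roughly 0.71 accuracy but only about 0.42 macro F1. The SVM's accuracy of 0.714 is therefore close to this baseline, while its macro F1 of 0.63 is well above it, which suggests that macro F1 is the more informative of the two metrics on this validation set.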



## Multi-Layer Perceptron (MLP) Evaluation

In this section, we:
1. Load the trained MLP model.
2. Use the model to predict on the validation dataset.
3. Evaluate the model's performance using:
    - Accuracy
    - Macro F1 Score
    - Classification Report


In [7]:
# Multi-Layer Perceptron (MLP) Evaluation

from sklearn.metrics import accuracy_score, f1_score, classification_report

# Path to the MLP model
mlp_model_path = '../src/models/weights/mlp/mlp_model_weights.joblib'

# Load the MLP model
mlp_model = joblib.load(mlp_model_path)

# Predict on the validation data
mlp_predictions = mlp_model.predict(validation_features)

# Calculate and print accuracy
mlp_accuracy = accuracy_score(validation_labels, mlp_predictions)
print("MLP - Accuracy:", mlp_accuracy)

# Calculate and print macro F1 score
mlp_macro_f1 = f1_score(validation_labels, mlp_predictions, average='macro')
print("MLP - Macro F1 Score:", mlp_macro_f1)

# Print detailed classification report
print("MLP - Classification Report:\n", classification_report(validation_labels, mlp_predictions))


MLP - Accuracy: 0.7292391153150647
MLP - Macro F1 Score: 0.6545075126574034
MLP - Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.84      0.82     10257
           1       0.53      0.46      0.49      4121

    accuracy                           0.73     14378
   macro avg       0.66      0.65      0.65     14378
weighted avg       0.72      0.73      0.72     14378


## Transformer-Based Model Evaluation (ELECTRA, BERT, RoBERTa)

In this section, for each transformer model, we:
1. Tokenize the validation dataset for that model.
2. Load the model and its saved weights.
3. Run batched inference and record the class probabilities for the ensemble.
4. Evaluate the model's performance using:
    - Accuracy
    - Macro F1 Score


In [8]:
import torch
import torch.nn as nn
import sys 
sys.path.append('../src')

from models.transformer_based_models import load_and_prepare_model, train_and_evaluate
from data_processing.feature_extraction import prepare_single_dataset

tokenized_dataset = prepare_single_dataset(validation_dataset, model_type='electra')


In [9]:
import sys
import numpy as np
import torch
import torch.nn.functional as F  # Used below for softmax
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score

# Add custom module path
sys.path.append('../src')

# Import custom modules
from models.transformer_based_models import load_and_prepare_model
from data_processing.feature_extraction import prepare_single_dataset
# -------------------------------
# 1. Prepare Dataset
# -------------------------------
# Tokenize the validation dataset
tokenized_dataset = prepare_single_dataset(validation_dataset, model_type='electra')

# Set format for PyTorch compatibility
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Create DataLoader
batch_size = 32  # Define batch size
data_loader = DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=False)

# -------------------------------
# 2. Load the Electra Model
# -------------------------------
# Define the number of labels (for classification tasks)
num_labels = 2

# Load the pre-trained Electra model
electra_model = load_and_prepare_model('electra', num_labels=num_labels)

# Select the device first so the saved weights can be mapped onto it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the saved model weights (map_location allows GPU-saved weights to load on CPU)
model_weights_path = '../src/models/weights/electra_model.pth'
electra_model.load_state_dict(torch.load(model_weights_path, map_location=device))

# Set model to evaluation mode and move it to the selected device
electra_model.eval()
electra_model.to(device)

# -------------------------------
# 3. Perform Inference
# -------------------------------
all_predictions = []
all_labels = []
electra_probs = []
# Disable gradient computation during inference
with torch.no_grad():
    for batch in data_loader:
        # Move batch to the appropriate device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)  # True labels

        # Forward pass through the model
        outputs = electra_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probabilities = F.softmax(logits, dim=-1)  # Convert logits to probabilities
        electra_probs.append(probabilities.cpu().numpy())

        # Get predicted labels (highest logit value)
        predictions = torch.argmax(logits, dim=-1)

        # Collect predictions and labels
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# -------------------------------
# 4. Evaluate Model Performance
# -------------------------------
# Calculate Accuracy
accuracy = accuracy_score(all_labels, all_predictions)
print(f"Accuracy: {accuracy:.4f}")

# Calculate F1-Score
f1 = f1_score(all_labels, all_predictions, average='macro')  # Macro-averaged, as for the other models
print(f"F1-Score: {f1:.4f}")

# Record probabilities for ensemble
electra_probs_combined = np.concatenate(electra_probs, axis=0)  # Concatenate along the batch axis




Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.7818
F1-Score: 0.7185
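
The inference loop above turns logits into probabilities with softmax and takes the argmax of the logits as the predicted class. A minimal NumPy sketch with made-up logits shows that the two views agree on the predicted class, so the probabilities are only needed for the ensemble step. The `softmax` helper here is an illustrative stand-in for `F.softmax`, not the code the notebook runs.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over the last axis."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for a batch of three examples and two classes.
logits = np.array([[2.0, -1.0], [0.3, 0.5], [-4.0, 1.5]])
probabilities = softmax(logits)

# Each row sums to 1, and softmax does not change which class is largest.
rows_sum_to_one = np.allclose(probabilities.sum(axis=-1), 1.0)
same_prediction = np.array_equal(
    probabilities.argmax(axis=-1), logits.argmax(axis=-1)
)
print(rows_sum_to_one, same_prediction)
```

Softmax is monotonic within each row, which is why predicting from logits or from probabilities gives the same labels.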


In [10]:
import sys
import numpy as np
import torch
import torch.nn.functional as F  # Used below for softmax
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score

# Add custom module path
sys.path.append('../src')

# Import custom modules
from models.transformer_based_models import load_and_prepare_model
from data_processing.feature_extraction import prepare_single_dataset

# -------------------------------
# 1. Prepare Dataset
# -------------------------------
# Tokenize the validation dataset
tokenized_dataset = prepare_single_dataset(validation_dataset, model_type='bert')

# Set format for PyTorch compatibility
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Create DataLoader
batch_size = 32  # Define batch size
data_loader = DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=False)

# -------------------------------
# 2. Load the BERT Model
# -------------------------------
# Define the number of labels (for classification tasks)
num_labels = 2

# Load the pre-trained BERT model
bert_model = load_and_prepare_model('bert', num_labels=num_labels)

# Select the device first so the saved weights can be mapped onto it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the saved model weights (map_location allows GPU-saved weights to load on CPU)
model_weights_path = '../src/models/weights/bert_model.pth'
bert_model.load_state_dict(torch.load(model_weights_path, map_location=device))

# Set model to evaluation mode and move it to the selected device
bert_model.eval()
bert_model.to(device)

# -------------------------------
# 3. Perform Inference
# -------------------------------
all_predictions = []
all_labels = []
bert_probs = []

# Disable gradient computation during inference
with torch.no_grad():
    for batch in data_loader:
        # Move batch to the appropriate device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)  # True labels

        # Forward pass through the model
        outputs = bert_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probabilities = F.softmax(logits, dim=-1)  # Convert logits to probabilities
        bert_probs.append(probabilities.cpu().numpy())


        # Get predicted labels (highest logit value)
        predictions = torch.argmax(logits, dim=-1)

        # Collect predictions and labels
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# -------------------------------
# 4. Evaluate Model Performance
# -------------------------------
# Calculate Accuracy
accuracy = accuracy_score(all_labels, all_predictions)
print(f"Accuracy: {accuracy:.4f}")

# Calculate F1-Score
f1 = f1_score(all_labels, all_predictions, average='macro')  # Macro-averaged, as for the other models
print(f"F1-Score: {f1:.4f}")

# Record probabilities for ensemble
bert_probs_combined = np.concatenate(bert_probs, axis=0)  # Concatenate along the batch axis



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.7708
F1-Score: 0.7102


In [11]:
import sys
import numpy as np
import torch
import torch.nn.functional as F  # Used below for softmax
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score

# Add custom module path
sys.path.append('../src')

# Import custom modules
from models.transformer_based_models import load_and_prepare_model
from data_processing.feature_extraction import prepare_single_dataset

# -------------------------------
# 1. Prepare Dataset
# -------------------------------
# Tokenize the validation dataset
tokenized_dataset = prepare_single_dataset(validation_dataset, model_type='roberta')

# Set format for PyTorch compatibility
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Create DataLoader
batch_size = 32  # Define batch size
data_loader = DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=False)

# -------------------------------
# 2. Load the RoBERTa Model
# -------------------------------
# Define the number of labels (for classification tasks)
num_labels = 2

# Load the pre-trained RoBERTa model
roberta_model = load_and_prepare_model('roberta', num_labels=num_labels)

# Select the device first so the saved weights can be mapped onto it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the saved model weights (map_location allows GPU-saved weights to load on CPU)
model_weights_path = '../src/models/weights/roberta_model.pth'
roberta_model.load_state_dict(torch.load(model_weights_path, map_location=device))

# Set model to evaluation mode and move it to the selected device
roberta_model.eval()
roberta_model.to(device)

# -------------------------------
# 3. Perform Inference
# -------------------------------
all_predictions = []
all_labels = []
roberta_probs = []

# Disable gradient computation during inference
with torch.no_grad():
    for batch in data_loader:
        # Move batch to the appropriate device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)  # True labels

        # Forward pass through the model
        outputs = roberta_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probabilities = F.softmax(logits, dim=-1)  # Convert logits to probabilities
        roberta_probs.append(probabilities.cpu().numpy())


        # Get predicted labels (highest logit value)
        predictions = torch.argmax(logits, dim=-1)

        # Collect predictions and labels
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# -------------------------------
# 4. Evaluate Model Performance
# -------------------------------
# Calculate Accuracy
accuracy = accuracy_score(all_labels, all_predictions)
print(f"Accuracy: {accuracy:.4f}")

# Calculate F1-Score
f1 = f1_score(all_labels, all_predictions, average='macro')  # Macro-averaged, as for the other models
print(f"F1-Score: {f1:.4f}")

# Record probabilities for ensemble
roberta_probs_combined = np.concatenate(roberta_probs, axis=0)  # Concatenate along the batch axis



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.7756
F1-Score: 0.7160


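## Soft Voting Ensemble

In this section, we average the class probabilities recorded for the Logistic Regression, BERT, ELECTRA, and RoBERTa models and predict the class with the highest mean probability.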

In [12]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


# Step 1: Stack probabilities and calculate the mean (soft voting)
all_probs = np.stack([logreg_probs, bert_probs_combined, electra_probs_combined, roberta_probs_combined], axis=0)  # Shape: (num_models, num_samples, num_classes)
ensemble_probs = np.mean(all_probs, axis=0)  # Shape: (num_samples, num_classes)

# Step 2: Predict the final class
ensemble_predictions = np.argmax(ensemble_probs, axis=1)  # Shape: (num_samples,)

# Step 3: Score against the ground-truth labels. The probability arrays follow the
# dataset order because every DataLoader above uses shuffle=False.
accuracy = accuracy_score(validation_labels, ensemble_predictions)
f1 = f1_score(validation_labels, ensemble_predictions, average='macro')

print(f"Soft Voting Ensemble Accuracy: {accuracy:.4f}")
print(f"Soft Voting Ensemble F1-Score: {f1:.4f}")


Soft Voting Ensemble Accuracy: 0.7837
Soft Voting Ensemble F1-Score: 0.7194
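
To make the averaging step concrete, the sketch below applies soft voting to made-up probabilities from three hypothetical models and contrasts it with hard (majority) voting. All names and numbers are illustrative assumptions; the notebook's own ensemble averages the LogReg, BERT, ELECTRA, and RoBERTa probabilities.

```python
import numpy as np

# Made-up class probabilities from three hypothetical models,
# each of shape (num_samples, num_classes).
model_a = np.array([[0.90, 0.10], [0.45, 0.55], [0.40, 0.60]])
model_b = np.array([[0.60, 0.40], [0.45, 0.55], [0.30, 0.70]])
model_c = np.array([[0.20, 0.80], [0.95, 0.05], [0.35, 0.65]])
prob_arrays = [model_a, model_b, model_c]

# Soft voting is only meaningful if every array describes the same samples
# in the same order and uses the same class-column order.
assert all(p.shape == prob_arrays[0].shape for p in prob_arrays)

# Soft voting: average the probabilities, then take the most probable class.
stacked = np.stack(prob_arrays, axis=0)  # (num_models, num_samples, num_classes)
soft_predictions = stacked.mean(axis=0).argmax(axis=1)

# Hard voting: each model casts one vote for its own argmax class.
votes = stacked.argmax(axis=2)  # (num_models, num_samples)
hard_predictions = np.array(
    [np.bincount(sample_votes, minlength=2).argmax() for sample_votes in votes.T]
)

print("Soft voting:", soft_predictions.tolist())
print("Hard voting:", hard_predictions.tolist())
```

On the second sample the two schemes disagree: two models lean weakly toward class 1, but the third is very confident in class 0, so the averaged probability favors class 0. When mixing scikit-learn and transformer outputs, it is worth confirming that the columns of `predict_proba` (ordered as in the estimator's `classes_` attribute) match the label indices used by the transformer models.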
