# Assignment 3 Part 2 - Wiki Question Answering

**Submission deadline:** Friday 30 May 2025, 11:55 pm

**Marks:** 20 marks (20% of the total unit assessment)

Unless a Special Consideration request has been submitted and approved, a 5% penalty (of the total possible mark of the task) will be applied for each day a written report or presentation assessment is not submitted, up until the 7th day (including weekends). After the 7th day, a grade of ‘0’ will be awarded even if the assessment is submitted. For example, if the assignment is worth 8 marks (of the entire unit) and your submission is late by 19 hours (or 23 hours 59 minutes 59 seconds), 0.4 marks (5% of 8 marks) will be deducted. If your submission is late by 24 hours (or 47 hours 59 minutes 59 seconds), 0.8 marks (10% of 8 marks) will be deducted, and so on. The submission time for all uploaded assessments is 11:55 pm. A 1-hour grace period will be provided to students who experience a technical concern. For any late submission of time-sensitive tasks, such as scheduled tests/exams, performance assessments/presentations, and/or scheduled practical assessments/labs, please apply for Special Consideration.


## A Note on the Use of AI Generators

In this assignment, we view AI code generators such as Copilot, CodeGPT, etc. as tools that can help you write code quickly. You are allowed to use these tools, but with some conditions. To understand what you can and cannot do, please visit these information pages provided by Macquarie University:

Artificial Intelligence Tools and Academic Integrity in FSE - https://bit.ly/3uxgQP4

If you choose to use these tools, make the following explicit in your submitted file as comments starting with "Use of AI generators in this assignment" explaining:

-   What part of your code is based on the output of such tools,
-   What tools you used,
-   What prompts you used to generate the code or text, and
-   What modifications you made on the generated code or text.

This will help us assess your work fairly. If we observe that you have used an AI generator and you do not give the above information, you may face disciplinary action.

## Objectives of This Assignment

<!-- In Assignment 3 you will work on a general answer selection task. Given a question and a list of candidate sentences, the goal is to predict which sentences can be used as part of the answer. Assignment 3 Part 2 requires you to implement deep neural networks. -->

In this assignment, you will work on the answer selection task using the WikiQA corpus. Given a question and a list of candidate sentences, the goal is to predict which sentences can be used to form a correct answer.  This assignment requires you to implement and evaluate a traditional text classification method (Naive Bayes) as well as deep neural networks (Siamese Network and Transformer models).



The dataset is the **Wiki Question Answering corpus from Microsoft**. The provided files (`training.csv`, `dev_test.csv`, `test.csv` in `data.zip`) contain the following columns:

-   `question_id`: ID for a question
-   `question`: Text of the question
-   `document_title`: Topic of the question
-   `answer`: Sentence candidate for the answer
-   `label`: 1 if the sentence is part of the answer, 0 otherwise

The following code shows how to load and preview the data:

In [19]:
import pandas as pd

train_data = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\training.csv')
dev_data = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\dev_test.csv')
test_data = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\test.csv')
train_data.head()


Unnamed: 0,question_id,question,document_title,answer,label
0,Q1,how are glacier caves formed?,Glacier cave,A partly submerged glacier cave on Perito More...,0
1,Q1,how are glacier caves formed?,Glacier cave,The ice facade is approximately 60 m high,0
2,Q1,how are glacier caves formed?,Glacier cave,Ice formations in the Titlis glacier cave,0
3,Q1,how are glacier caves formed?,Glacier cave,A glacier cave is a cave formed within the ice...,1
4,Q1,how are glacier caves formed?,Glacier cave,"Glacier caves are often called ice caves , but...",0


## Instructions

* Complete the three tasks below.

* Write your code inside this notebook.

* Your notebook must include the running outputs of your final code.

* **Submit this `.ipynb` file, containing your code and outputs, to iLearn.**

## Assessment

1.  Marks are based on the correctness of your code, outputs, and coding style.
<!-- 2.  A total of **1.5 marks** (0.5 per task) are awarded globally across the assignment for good coding style: clean, modular code, meaningful variable names, and good comments. -->
3.  Marks for each task focus only on the main implementation, **not on the data loading step**.
4.  If outputs are missing or incorrect, up to **25% of the marks for that task** can be deducted.
5.  See each task below for the detailed mark breakdown.

## Task 1 (4 marks): Query-Focused Text Classification Using Naive Bayes

* Preprocess the text data. Feel free to explore and use suitable preprocessing.

* Extract features using **CountVectorizer** and **TF-IDF**.

* Train and evaluate a **Naive Bayes classifier** on both feature sets.

* Report and compare accuracy, precision, recall, and F1-score.

**Mark breakdown:**


* (2 marks) Correct implementation: preprocessing, feature extraction, training Naive Bayes models.

* (1.5 marks) Proper evaluation: accuracy, precision, recall, F1-score + discussion of results.

* (0.5 mark) Good coding style: clean, modular, clear variables, comments.

<!-- * (0.5 mark) Preprocessing and feature extraction.

* (1 mark) Training Naive Bayes on CountVectorizer and TF-IDF features.

* (1 mark) Evaluation on the test set with proper metrics.

* (1 mark) Brief discussion on which feature set performed better and why.

* (0.5 mark) For good coding style: clean, modular code, meaningful variable names, and good comments. -->

In [21]:
#   Write your code and answers here. You can add more code and markdown cells if needed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Load the training data
df = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\training.csv')

# Combining relevant columns into a single text field
df['text'] = df['question'] + " " + df['document_title'] + " " + df['answer']

# Basic text preprocessing
class TextCleaner(TransformerMixin):
    def transform(self, X, **transform_params):
        return [self.clean_text(text) for text in X]
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def clean_text(self, text):
        stop_words = set(stopwords.words('english'))
        # converts text into lowercase
        text = text.lower()
        # remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # removes stopwords
        return " ".join([word for word in text.split() if word not in stop_words])

# Splitting data into x and y labels
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# code for modelling with countVectoriser
count_pipe = Pipeline([
    ('cleaner', TextCleaner()),
    ('vect', CountVectorizer()),
    ('nb', MultinomialNB())
])

count_pipe.fit(X_train, y_train)
y_pred_count = count_pipe.predict(X_test)

# Printing CountVectorizer Metrics
print("=== CountVectorizer Metrics ===")
print("Accuracy:", accuracy_score(y_test, y_pred_count))
print("Precision:", precision_score(y_test, y_pred_count))
print("Recall:", recall_score(y_test, y_pred_count))
print("F1 Score:", f1_score(y_test, y_pred_count))

# code for extracting features with tfidf 
tfidf_pipe = Pipeline([
    ('cleaner', TextCleaner()),
    ('vect', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

tfidf_pipe.fit(X_train, y_train)
y_pred_tfidf = tfidf_pipe.predict(X_test)

# Printing TF-IDF Metrics
print("=== TF-IDF Metrics ===")
print("Accuracy:", accuracy_score(y_test, y_pred_tfidf))
print("Precision:", precision_score(y_test, y_pred_tfidf))
print("Recall:", recall_score(y_test, y_pred_tfidf))
print("F1 Score:", f1_score(y_test, y_pred_tfidf))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


=== CountVectorizer Metrics ===
Accuracy: 0.9334479371316307
Precision: 0.16
Recall: 0.05454545454545454
F1 Score: 0.08135593220338982
=== TF-IDF Metrics ===
Accuracy: 0.9459724950884086
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [5]:
#   Write your code and answers here. You can add more code and markdown cells if needed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Load the training data
df = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\dev_test.csv')

# Combine relevant columns into a single text field
df['text'] = df['question'] + " " + df['document_title'] + " " + df['answer']

# Basic text preprocessing
class TextCleaner(TransformerMixin):
    def transform(self, X, **transform_params):
        return [self.clean_text(text) for text in X]
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def clean_text(self, text):
        stop_words = set(stopwords.words('english'))
        # converts text into lowercase
        text = text.lower()
        # remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # removes stopwords
        return " ".join([word for word in text.split() if word not in stop_words])
        
# Splitting data into x and y labels
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# code for modelling with countVectoriser
count_pipe = Pipeline([
    ('cleaner', TextCleaner()),
    ('vect', CountVectorizer()),
    ('nb', MultinomialNB())
])

count_pipe.fit(X_train, y_train)
y_pred_count = count_pipe.predict(X_test)

# Printing CountVectorizer Metrics
print("=== CountVectorizer Metrics ===")
print("Accuracy:", accuracy_score(y_test, y_pred_count))
print("Precision:", precision_score(y_test, y_pred_count))
print("Recall:", recall_score(y_test, y_pred_count))
print("F1 Score:", f1_score(y_test, y_pred_count))

# code for extracting features with tfidf 
tfidf_pipe = Pipeline([
    ('cleaner', TextCleaner()),
    ('vect', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

tfidf_pipe.fit(X_train, y_train)
y_pred_tfidf = tfidf_pipe.predict(X_test)

# Printing TF-IDF Metrics
print("=== TF-IDF Metrics ===")
print("Accuracy:", accuracy_score(y_test, y_pred_tfidf))
print("Precision:", precision_score(y_test, y_pred_tfidf))
print("Recall:", recall_score(y_test, y_pred_tfidf))
print("F1 Score:", f1_score(y_test, y_pred_tfidf))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


=== CountVectorizer Metrics ===
Accuracy: 0.9396709323583181
Precision: 0.1
Recall: 0.04
F1 Score: 0.05714285714285714
=== TF-IDF Metrics ===
Accuracy: 0.9542961608775137
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [23]:
#   Write your code and answers here. You can add more code and markdown cells if needed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Load the training data
df = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\test.csv')

# Combine relevant columns into a single text field
df['text'] = df['question'] + " " + df['document_title'] + " " + df['answer']

# Basic text preprocessing
class TextCleaner(TransformerMixin):
    def transform(self, X, **transform_params):
        return [self.clean_text(text) for text in X]
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def clean_text(self, text):
        stop_words = set(stopwords.words('english'))
        # convert text into lowercase
        text = text.lower()
        # remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # removes stopwords
        return " ".join([word for word in text.split() if word not in stop_words])
        
# Splitting data into x and y labels        
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# code for extracting features with CountVectoriser
count_pipe = Pipeline([
    ('cleaner', TextCleaner()),
    ('vect', CountVectorizer()),
    ('nb', MultinomialNB())
])

count_pipe.fit(X_train, y_train)
y_pred_count = count_pipe.predict(X_test)

# Printing CountVectoriser Metrics
print("=== CountVectorizer Metrics ===")
print("Accuracy:", accuracy_score(y_test, y_pred_count))
print("Precision:", precision_score(y_test, y_pred_count))
print("Recall:", recall_score(y_test, y_pred_count))
print("F1 Score:", f1_score(y_test, y_pred_count))

# code for extracting features with tfidf 
tfidf_pipe = Pipeline([
    ('cleaner', TextCleaner()),
    ('vect', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

tfidf_pipe.fit(X_train, y_train)
y_pred_tfidf = tfidf_pipe.predict(X_test)

# Printing TF-IDF Metrics
print("=== TF-IDF Metrics ===")
print("Accuracy:", accuracy_score(y_test, y_pred_tfidf))
print("Precision:", precision_score(y_test, y_pred_tfidf))
print("Recall:", recall_score(y_test, y_pred_tfidf))
print("F1 Score:", f1_score(y_test, y_pred_tfidf))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


=== CountVectorizer Metrics ===
Accuracy: 0.9383617193836172
Precision: 0.037037037037037035
Recall: 0.0196078431372549
F1 Score: 0.02564102564102564
=== TF-IDF Metrics ===
Accuracy: 0.9586374695863747
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The countVectorizer has been representing a higher value in precision, recal and f1 score as compared to the TF-IDF technique which shows that across training, devtest and test datasets that when countVectorizer is implemented with a naive bayes classifier it's more likely to detect the positive class despite having a slightly lower accuracy than TF-IDF technique. Therefore, the countVectorizer is effective in detecting positives especially when given raw word count.

## Task 2 (6 marks): Siamese Neural Network with Contrastive Loss (PyTorch)

This task involves two stages: first learning sentence embeddings using contrastive loss, and then using these embeddings for classification.

### Task 2a: Learning Embeddings with Contrastive Loss

* Preprocess question-answer pairs (e.g., TF-IDF or embeddings).

* Implement a Siamese Network in PyTorch:
    * The network should take the preprocessed question and answer representations as input.
  
    * Each branch of the Siamese network should contain two hidden layers with ReLU activation. (hidden layer size chosen from {64, 128, 256})
  
    * Use Euclidean-distance-based contrastive loss, use a margin value of m=1.
  
    * The network should output an embedding vector (the output of the second hidden layer) for the question and the answer.

* Train the model and evaluate on the test set.

*Note: Save the best performing model to be reused in Task 2b*

### Task 2b: Classification using Learned Embeddings

* Load the weights of the best performing Siamese network model saved from Task 2a. Freeze the weights of the shared Siamese branches (i.e., the hidden layers) so they are not updated during this stage.

* Build Classifier Head in PyTorch:
    * Pass the question and answer representations through their respective frozen branches to obtain their learned embeddings from Task 2a.

    * Calculate the Euclidean distance between the question embedding and the answer embedding.

    * Add a final classification output layer: Pass the calculated distance through a simple trainable layer (e.g., a Dense layer with 1 unit) followed by a Sigmoid activation function. This will output a value between 0 and 1, representing the predicted probability of the pair being related.

* Train the model and evaluate on the test set with Binary Cross-Entropy (BCE) loss.

* Report the accuracy and provide at least one failure case analysis, with supporting code output.

**Mark breakdown:**

* (3 marks) Correct implementation: Siamese NN architecture, contrastive loss, classification head setup.

* (2.5 marks) Proper evaluation: training/evaluation correctness, metric reporting, failure case analysis.

* (0.5 mark) Good coding style: : clean, modular code, meaningful variable names, and good comments.

<!-- * (1 mark) Correct Siamese NN architecture and contrastive loss.

* (1 mark) SNN training setup and data feeding.

* (1 mark) Correctly loading the pre-trained model, freezing the appropriate layers, and constructing the classification architecture.

* (1 mark) Correct training/evaluation setup using Binary Cross-Entropy loss.

* (0.5 mark) Proper evaluation and accuracy reporting.

* (1 mark) Example of a failure case, possible reason, and suggested improvement.

* (0.5 mark) For good coding style: clean, modular code, meaningful variable names, and good comments. -->

In [7]:
#   Write your code and answers here. You can add more code and markdown cells if needed.
# 2a
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.feature_extraction.text import TfidfVectorizer
import torch.nn.functional as F
import os
# Load your CSVs
train_df = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\training.csv')
devtest_df = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\dev_test.csv')
test_df = pd.read_csv(r'C:\Users\Alka\Downloads\Data for Assignment 3 (Part 2)\Data\data\test.csv')

# Keep relevant columns
def preprocess_dataframe(df):
    return df[['question', 'answer', 'label']]

train_df = preprocess_dataframe(train_df)
devtest_df = preprocess_dataframe(devtest_df)
test_df = preprocess_dataframe(test_df)

# Fit TF-IDF vectorizer on all questions/answers labels
all_text = pd.concat([train_df['question'], train_df['answer'],
                      devtest_df['question'], devtest_df['answer'],
                      test_df['question'], test_df['answer']])

vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(all_text)

# Creating a class for the Siamese Model
class SiameseDataset(Dataset):
    def __init__(self, df, vectorizer):
        self.q1 = df['question'].values
        self.q2 = df['answer'].values
        self.labels = df['label'].values.astype(np.float32)
        self.vectorizer = vectorizer

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        q1_vec = self.vectorizer.transform([self.q1[idx]]).toarray().squeeze()
        q2_vec = self.vectorizer.transform([self.q2[idx]]).toarray().squeeze()
        label = self.labels[idx]
        return torch.tensor(q1_vec, dtype=torch.float32), \
               torch.tensor(q2_vec, dtype=torch.float32), \
               torch.tensor(label, dtype=torch.float32)

# Create datasets and dataloaders
train_dataset = SiameseDataset(train_df, vectorizer)
devtest_dataset = SiameseDataset(devtest_df, vectorizer)
test_dataset = SiameseDataset(test_df, vectorizer)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
devtest_loader = DataLoader(devtest_dataset, batch_size=64)
test_loader = DataLoader(test_dataset, batch_size=64)

# Created a class for the siamese network
class SiameseNetwork(nn.Module):
    def __init__(self, input_size, hidden_size=128):
        super(SiameseNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)

    def forward_once(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return x

    def forward(self, x1, x2):
        out1 = self.forward_once(x1)
        out2 = self.forward_once(x2)
        return out1, out2
# Created a class for contrastive Loss
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, out1, out2, label):
        distance = F.pairwise_distance(out1, out2)
        loss = torch.mean(label * distance**2 + (1 - label) * F.relu(self.margin - distance)**2)
        return loss
# Function for training the siamese model
def train_siamese(model, train_loader, dev_loader, criterion, optimizer, num_epochs=10):
    best_loss = float('inf')
    best_model_path = 'best_siamese_model.pth'

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0.0
        for q1, q2, label in train_loader:
            optimizer.zero_grad()
            out1, out2 = model(q1, q2)
            loss = criterion(out1, out2, label)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        model.eval()
        dev_loss = 0.0
        with torch.no_grad():
            for q1, q2, label in dev_loader:
                out1, out2 = model(q1, q2)
                loss = criterion(out1, out2, label)
                dev_loss += loss.item()

        print(f"Epoch {epoch+1}/{num_epochs} - Train Loss: {train_loss:.4f} - Dev Loss: {dev_loss:.4f}")

        if dev_loss < best_loss:
            best_loss = dev_loss
            torch.save(model.state_dict(), best_model_path)
            print(f"✔️ Model saved at epoch {epoch+1} with dev loss {best_loss:.4f}")

    return best_model_path
input_size = 5000  # TF-IDF vector size
model = SiameseNetwork(input_size, hidden_size=128)
criterion = ContrastiveLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

best_model_path = train_siamese(model, train_loader, devtest_loader, criterion, optimizer, num_epochs=10)
# Function for evaluating the model
def evaluate_model(model, test_loader, threshold=0.5):
    model.eval()
    correct = 0
    total = 0
    distances = []
    labels = []

    with torch.no_grad():
        for q1, q2, label in test_loader:
            out1, out2 = model(q1, q2)
            dist = F.pairwise_distance(out1, out2)
            pred = (dist < threshold).float()
            correct += (pred == label).sum().item()
            total += label.size(0)
            distances.extend(dist.cpu().numpy())
            labels.extend(label.cpu().numpy())

    accuracy = correct / total
    print(f"✅ Test Accuracy: {accuracy:.4f}")
    return distances, labels

# Load the best saved model and evaluate
model.load_state_dict(torch.load(best_model_path))
evaluate_model(model, test_loader)

Epoch 1/10 - Train Loss: 42.3804 - Dev Loss: 4.9007
✔️ Model saved at epoch 1 with dev loss 4.9007
Epoch 2/10 - Train Loss: 13.9991 - Dev Loss: 4.6771
✔️ Model saved at epoch 2 with dev loss 4.6771
Epoch 3/10 - Train Loss: 8.0120 - Dev Loss: 4.3233
✔️ Model saved at epoch 3 with dev loss 4.3233
Epoch 4/10 - Train Loss: 4.3496 - Dev Loss: 4.4355
Epoch 5/10 - Train Loss: 2.5264 - Dev Loss: 4.6385
Epoch 6/10 - Train Loss: 1.6426 - Dev Loss: 4.6558
Epoch 7/10 - Train Loss: 1.3011 - Dev Loss: 4.5187
Epoch 8/10 - Train Loss: 1.1297 - Dev Loss: 4.5299
Epoch 9/10 - Train Loss: 1.0226 - Dev Loss: 4.5946
Epoch 10/10 - Train Loss: 0.9193 - Dev Loss: 4.5270
✅ Test Accuracy: 0.8832


([0.8721303,
  1.8370169,
  1.0232195,
  1.0929205,
  1.7740557,
  0.8013859,
  0.63399905,
  0.5632703,
  1.5086821,
  0.93604434,
  0.46677473,
  0.75443465,
  0.88136494,
  0.7445635,
  0.8853184,
  0.9268358,
  0.79673517,
  1.2407627,
  1.2445532,
  1.0080708,
  1.1223879,
  1.1871717,
  0.9041545,
  0.97441536,
  1.1001577,
  1.3971908,
  1.308397,
  1.1964772,
  1.0734302,
  1.1509583,
  1.5234449,
  1.5354161,
  1.2595487,
  0.94493204,
  1.2896973,
  0.8868288,
  1.7182921,
  1.2218429,
  0.8420409,
  1.4572475,
  1.4435974,
  1.1652778,
  1.279891,
  1.1233634,
  1.4541634,
  1.2053885,
  1.6121633,
  1.4431397,
  1.2266185,
  1.2776884,
  0.6629238,
  1.1585437,
  1.3935473,
  1.18967,
  1.392029,
  1.3279825,
  0.8828656,
  1.6180205,
  1.5109768,
  0.8625507,
  0.55000806,
  1.8148216,
  1.2103851,
  1.0339999,
  1.2988428,
  1.5433885,
  1.2094846,
  1.4908707,
  1.1887617,
  0.4369374,
  0.691636,
  1.0748836,
  0.36320662,
  0.6509259,
  0.78187644,
  0.8337801,
  1.181

Part 2a shows that the model saved was at epoch 3 with a test accuracy of 88.32% and it has a dev loss of 4.3233

In [27]:
# Part 2b
# ---- Distance-based Classifier Model ----
class DistanceClassifier(nn.Module):
    def __init__(self, base_model):
        super(DistanceClassifier, self).__init__()
        self.encoder = base_model
        for param in self.encoder.parameters():
            param.requires_grad = False  # Freeze Siamese branches
        self.output = nn.Sequential(
            nn.Linear(1, 1),
            nn.Sigmoid()
        )

    def forward(self, x1, x2):
        with torch.no_grad():
            out1 = self.encoder.forward_once(x1)
            out2 = self.encoder.forward_once(x2)
        distance = F.pairwise_distance(out1, out2).unsqueeze(1)
        return self.output(distance)

# ---- Train Function ----
def train_classifier(model, train_loader, dev_loader, epochs=10):
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.output.parameters(), lr=0.001)

    # code for the gradient descent method
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for q1, q2, label in train_loader:
            optimizer.zero_grad()
            pred = model(q1, q2).squeeze()
            loss = criterion(pred, label)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1} - Train Loss: {total_loss:.4f}")

# ---- Evaluate + Show Failure Case ----
def evaluate_classifier(model, test_loader):
    model.eval()
    correct = 0
    total = 0
    failure_cases = []

    with torch.no_grad():
        for q1, q2, label in test_loader:
            pred = model(q1, q2).squeeze()
            predicted = (pred >= 0.5).float()
            correct += (predicted == label).sum().item()
            total += label.size(0)

            for i in range(len(label)):
                if predicted[i] != label[i]:
                    failure_cases.append((q1[i], q2[i], label[i].item(), pred[i].item()))

    accuracy = correct / total
    print(f"✅ Classification Accuracy: {accuracy:.4f}")
    
    if failure_cases:
        f1, f2, true_label, pred_score = failure_cases[0]
        # Print failure case alongside label metrics
        print("\n🚨 Failure Case:")
        print("True Label:", true_label)
        print("Predicted Score:", pred_score)
        print("Question Tokens:", vectorizer.inverse_transform(f1.unsqueeze(0).numpy())[0][:10])
        print("Answer Tokens:", vectorizer.inverse_transform(f2.unsqueeze(0).numpy())[0][:10])

# ---- Load Trained Siamese Model + Run Task 2b ----
input_size = 5000
hidden_size = 128

base_model = SiameseNetwork(input_size=input_size, hidden_size=hidden_size)
base_model.load_state_dict(torch.load("best_siamese_model.pth"))

classifier = DistanceClassifier(base_model)

train_classifier(classifier, train_loader, devtest_loader, epochs=10)
evaluate_classifier(classifier, test_loader)

Epoch 1 - Train Loss: 90.4360
Epoch 2 - Train Loss: 68.1077
Epoch 3 - Train Loss: 58.1823
Epoch 4 - Train Loss: 53.5597
Epoch 5 - Train Loss: 50.8765
Epoch 6 - Train Loss: 49.2103
Epoch 7 - Train Loss: 47.6764
Epoch 8 - Train Loss: 46.5172
Epoch 9 - Train Loss: 45.3946
Epoch 10 - Train Loss: 44.1980
✅ Classification Accuracy: 0.9525

🚨 Failure Case:
True Label: 1.0
Predicted Score: 0.04442455992102623
Question Tokens: ['african' 'americans' 'how' 'the' 'to' 'us' 'were']
Answer Tokens: ['african' 'africans' 'american' 'and' 'are' 'as' 'atlantic' 'be'
 'brought' 'by']


Part 2b shows that the classsification accuracy is 95.25%. The failure case has a true label of 100% where the predicted score is 0.04% which highlights that the model failed to recognise an accurate answer thus making it a false negative.

## Task 3 (10 marks): Transformer-Based Sentence Classification (PyTorch)

* Preprocess input as: question [SEP] answer, pad to a fixed length (justify your choice of length).

* Use a suitable tokenizer (justify your choice).

* Build a Transformer model in PyTorch:

    * Embedding layer (size 128) + positional embeddings.

    * One Transformer encoder layer (hidden dim in {64, 128, 256}, 4 attention heads).

    * One hidden layer (256 units, ReLU).

    * Use suitable final layer for classification
    
  
* Apply Global Average Pooling to the output sequence of the Transformer encoder layer.
  
* Use an appropriate loss function (e.g., CrossEntropyLoss).

* Train and evaluate on the test split.

* Report best accuracy, precision, recall, F1-score, and discuss a failure case, with supporting code output.

**Mark breakdown:**

* (5 marks) Correct implementation: input preparation, tokenizer, transformer model, training setup.

* (4.5 marks) Proper evaluation: metric reporting, failure case analysis with discussion.

* (0.5 mark) Good coding style: : clean, modular code, meaningful variable names, and good comments.

<!-- * (1.5 marks) Correct input preparation and tokenizer choice (with justification).

* (2 marks) Transformer architecture implementation.

* (2 marks) Training setup, loss function, and optimizer.

* (2 marks) Evaluation and correct metric reporting.

* (2 marks) Failure case analysis and suggestions.

* (0.5 mark) For good coding style: clean, modular code, meaningful variable names, and good comments. -->


In [32]:
#   Write your code and answers here. You can add more code and markdown cells if needed.
# Import transformers and sklearn libraries
from transformers import BertTokenizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Parameters
MAX_LEN = 64
BATCH_SIZE = 32
HIDDEN_DIM = 128
EMBED_DIM = 128

# Initialising the Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_pair(q, a):
    return tokenizer.encode_plus(
        q, a,
        add_special_tokens=True,
        max_length=MAX_LEN,
        truncation=True,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt'
    )
class QADataset(Dataset):
    def __init__(self, df):
        self.qs = df['question'].tolist()
        self.as_ = df['answer'].tolist()
        self.labels = df['label'].tolist()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        encoded = encode_pair(self.qs[idx], self.as_[idx])
        return {
            'input_ids': encoded['input_ids'].squeeze(0),
            'attention_mask': encoded['attention_mask'].squeeze(0),
            'label': torch.tensor(self.labels[idx], dtype=torch.float32)
        }

train_dataset = QADataset(train_df)
dev_dataset = QADataset(devtest_df)
test_dataset = QADataset(test_df)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
dev_loader = DataLoader(dev_dataset, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [34]:
# Code for transformer classifier
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_heads, max_len):
        super(TransformerClassifier, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc1 = nn.Linear(embed_dim, 256)
        self.fc2 = nn.Linear(256, 1)
    
    def forward(self, input_ids, attention_mask):
        positions = torch.arange(0, input_ids.size(1)).unsqueeze(0).to(input_ids.device)
        x = self.embed(input_ids) + self.pos_embed(positions)
        x = x.permute(1, 0, 2)  # (seq_len, batch, embed)
        x = self.encoder(x, src_key_padding_mask=~attention_mask.bool())
        x = x.permute(1, 2, 0)  # (batch, embed, seq_len)
        x = self.pool(x).squeeze(2)  # (batch, embed)
        x = F.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x)).squeeze(1)

In [36]:
# Code for training the model based on train loader and dev loader
def train_model(model, train_loader, dev_loader, epochs=5):
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    best_model = None
    best_acc = 0.0

    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for batch in train_loader:
            optimizer.zero_grad()
            input_ids = batch['input_ids']
            mask = batch['attention_mask']
            labels = batch['label']
            preds = model(input_ids, mask)
            loss = criterion(preds, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1} - Train Loss: {total_loss:.4f}")

        # Validation
        model.eval()
        all_preds, all_labels = [], []
        with torch.no_grad():
            for batch in dev_loader:
                input_ids = batch['input_ids']
                mask = batch['attention_mask']
                labels = batch['label']
                preds = model(input_ids, mask)
                all_preds.extend((preds > 0.5).cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
        acc = accuracy_score(all_labels, all_preds)
        print(f"Validation Accuracy: {acc:.4f}")
        if acc > best_acc:
            best_acc = acc
            best_model = model.state_dict()

    model.load_state_dict(best_model)
    return model

In [38]:
# Code for evaluating model based on test loader
def evaluate_model(model, test_loader):
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids']
            mask = batch['attention_mask']
            labels = batch['label']
            preds = model(input_ids, mask)
            y_pred.extend((preds > 0.5).cpu().numpy())
            y_true.extend(labels.cpu().numpy())
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"✅ Test Accuracy: {acc:.4f} | Precision: {prec:.4f} | Recall: {rec:.4f} | F1: {f1:.4f}")

    # Print one failure case
    for i in range(len(y_true)):
        if y_true[i] != y_pred[i]:
            print("\n🚨 Failure Case:")
            print("True Label:", y_true[i])
            print("Predicted:", y_pred[i])
            break
VOCAB_SIZE = tokenizer.vocab_size
model = TransformerClassifier(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, num_heads=4, max_len=MAX_LEN)
model = train_model(model, train_loader, dev_loader, epochs=5)
evaluate_model(model, test_loader)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Epoch 1 - Train Loss: 132.2685


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Validation Accuracy: 0.9488


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Epoch 2 - Train Loss: 118.2301


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Validation Accuracy: 0.9488


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Epoch 3 - Train Loss: 109.5613


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Validation Accuracy: 0.9473


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Epoch 4 - Train Loss: 100.0932


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Validation Accuracy: 0.9491


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Epoch 5 - Train Loss: 87.7795


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Validation Accuracy: 0.9480


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

✅ Test Accuracy: 0.9494 | Precision: 0.2683 | Recall: 0.0375 | F1: 0.0659

🚨 Failure Case:
True Label: 1.0
Predicted: False


The transformer based sentence classification's model has a test accuracy of 94.94%. The precision is 26.83% which means that represents the total percentage of positive predictions the model made and the recall is 3.75% which shows that the total number of actual positives the model caught. The low f-1 score represents the model's imbalance between precision and recall as it fails to identify the positive class.

# Submission

Your submission should consist of this Jupyter notebook with all your code and explanations inserted into the notebook as text cells. **The notebook should contain the output of the runs. All code should run. Code with syntax errors or code without output will not be assessed.**

**Do not submit multiple files.**

Examine the text cells of this notebook so that you can have an idea of how to format text for good visual impact. You can also read this useful [guide to the MarkDown notation](https://daringfireball.net/projects/markdown/syntax),  which explains the format of the text cells.

### Marking Rubric

| Criteria                          | Unsatisfactory | Pass           | Credit         | Distinction     |
|----------------------------------|----------------|----------------|----------------|-----------------|
| **Task 1 – Correctness**         | 0 points       | 1 point        | 1.5 points     | 2 points        |
| **Task 1 – Evaluation & Discussion** | 0 points   | 0.75 points    | 1 point        | 1.5 points      |
| **Task 1 – Code Readability**    | 0 points       | 0.25 points    | 0.4 points     | 0.5 points      |
| **Task 2 – Correctness**         | 0 points       | 1.5 points     | 2.5 points     | 3 points        |
| **Task 2 – Evaluation & Analysis** | 0 points     | 1.25 points    | 2 points       | 2.5 points      |
| **Task 2 – Code Readability**    | 0 points       | 0.25 points    | 0.4 points     | 0.5 points      |
| **Task 3 – Correctness**         | 0 points       | 2.5 points     | 4 points       | 5 points        |
| **Task 3 – Evaluation & Analysis** | 0 points     | 2.25 points    | 3.5 points     | 4.5 points      |
| **Task 3 – Code Readability**    | 0 points       | 0.25 points    | 0.4 points     | 0.5 points      |


### Assessment Criteria Description

The following aspects will be considered when marking each task. The total score is based on the level of achievement across these dimensions.

#### Correctness
How well the main functionality and requirements of the task are implemented.

- **Unsatisfactory** – Major components are missing or incorrect.
- **Pass** – Some core components are correctly implemented.
- **Credit** – Most components are correctly implemented with minor issues.
- **Distinction** – All required components are correctly and completely implemented.

#### Evaluation & Analysis (where applicable)
The quality of evaluation metrics, observations, and insights into the model’s performance.

- **Unsatisfactory** – Minimal or no evaluation and discussion.
- **Pass** – Basic evaluation is provided, but analysis is shallow.
- **Credit** – Good evaluation with meaningful discussion.
- **Distinction** – In-depth, insightful analysis and thoughtful observations.

#### Code Readability
Clarity, structure, and quality of code writing style.

- **Unsatisfactory** – Code is difficult to read, poorly structured, and lacks clarity (e.g., meaningless variable names, no comments).
- **Pass** – Code is generally readable with some good practices.
- **Credit** – Code is clearly readable and mostly well-structured.
- **Distinction** – Code is clean, well-organized, and easy to follow; shows excellent style and best practices.
