In [1]:
import sys
import os

# Add src/ to Python path
sys.path.append(os.path.abspath("../src"))

## Task 1

For Task 1, I set up a sentence encoding pipeline using the bert-base-uncased model from Hugging Face. The goal was to encode input sentences into fixed-length embeddings that could be used for downstream tasks such as classification or similarity.

Initially, I used the raw BERT model and implemented mean pooling over the last hidden states (excluding padding tokens). This approach is often more effective than using the [CLS] token for general-purpose sentence embeddings.
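
The two pooling strategies can be sketched as follows. This is a minimal illustration on a hand-built tensor rather than the notebook's actual encode_sentences() code; the function names mean_pool and cls_pool and the toy values are assumptions made for the example.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp guards against division by zero if a row were all padding
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


def cls_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Take the vector at position 0, where BERT places the [CLS] token."""
    return last_hidden_state[:, 0]


# Toy batch (hypothetical values): 2 sentences, 3 token positions, hidden size 2
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(hidden, mask)
cls_vecs = cls_pool(hidden)
print(pooled)
print(cls_vecs)
```

The padding vector in the first sentence is deliberately large: if the mask were ignored, it would dominate the average, which is why padding has to be excluded before dividing.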

As the project evolved, I refactored this logic into a reusable utility function called encode_sentences(), which supports both mean and CLS pooling. To align with the multi-task design, I also reused the MultiTaskModel backbone to extract embeddings. This made the architecture consistent and reusable across all tasks.

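To evaluate the quality of the embeddings, I computed cosine similarity between related and unrelated sentence pairs. The related pairs scored 0.6150 (the two NLP sentences) and 0.7452 (the two Paris sentences), while the unrelated pair (0, 2) scored 0.5189, which suggests the embeddings capture some semantic signal. The margins are modest, though: pair (1, 3), which is only loosely related, scored 0.6954, higher than the NLP pair, so mean-pooled embeddings from a model that has not been tuned for sentence similarity separate topics only coarsely.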

In [None]:
from transformers import AutoTokenizer
import torch.nn.functional as F

from data import encode_sentences
from model import MultiTaskModel

In [3]:
sentences = [
    "I love natural language processing.",
    "NLP is a fascinating field.",
    "The Eiffel Tower is in Paris.",
    "Transformers are used in deep learning.",
    "Paris is the capital of France."
]

# Pairs to compare: (0, 1) and (2, 4) are related and should have high similarity
pairs = [(0, 1), (0, 2), (2, 4), (1, 3)]

model = MultiTaskModel("bert-base-uncased", 11, 3)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

embeddings = encode_sentences(sentences, model, tokenizer, pooling="mean")
print(embeddings.shape)

print("🔍 Cosine Similarities Between Sentence Pairs:")
for i, j in pairs:
    sim = F.cosine_similarity(embeddings[i], embeddings[j], dim=0).item()
    print(f"({i}, {j}) '{sentences[i]}' <-> '{sentences[j]}' -> similarity = {sim:.4f}")

torch.Size([5, 768])
🔍 Cosine Similarities Between Sentence Pairs:
(0, 1) 'I love natural language processing.' <-> 'NLP is a fascinating field.' -> similarity = 0.6150
(0, 2) 'I love natural language processing.' <-> 'The Eiffel Tower is in Paris.' -> similarity = 0.5189
(2, 4) 'The Eiffel Tower is in Paris.' <-> 'Paris is the capital of France.' -> similarity = 0.7452
(1, 3) 'NLP is a fascinating field.' <-> 'Transformers are used in deep learning.' -> similarity = 0.6954


## Task 2

For Task 2, I extended the sentence transformer setup to support multi-task learning. I chose sentence classification (emotion) and sentiment analysis as my two tasks and used the TweetEval dataset, which provides emotion and sentiment labels as separate subsets. This made it a practical, real-world source of inputs for both tasks.

To support this setup, I implemented a shared BERT backbone with two task-specific classification heads:

- classifier_a for emotion classification
- classifier_b for sentiment classification

This design allows both tasks to share one sentence representation while each head learns its own task-specific nuances. During training, the loss is computed separately for each task and summed (loss = loss_a + loss_b). New tasks can be added by plugging in further heads on the same backbone, which makes the setup straightforward to scale.
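
A minimal sketch of the shared-backbone idea, assuming a toy embedding layer in place of BERT so it runs without downloading weights. TinyMultiTaskModel, the vocabulary size, and the class counts (4 and 3) are illustrative choices, not the notebook's MultiTaskModel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMultiTaskModel(nn.Module):
    """Shared encoder feeding two task heads (toy stand-in for a BERT backbone)."""

    def __init__(self, vocab_size: int, hidden: int, num_classes_a: int, num_classes_b: int):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.classifier_a = nn.Linear(hidden, num_classes_a)
        self.classifier_b = nn.Linear(hidden, num_classes_b)

    def forward(self, input_ids, attention_mask):
        token_vecs = self.encoder(input_ids)
        # Masked mean pooling gives one vector per sentence
        mask = attention_mask.unsqueeze(-1).to(token_vecs.dtype)
        sentence_vec = (token_vecs * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # Both heads read the same shared representation
        return self.classifier_a(sentence_vec), self.classifier_b(sentence_vec)


torch.manual_seed(0)
model = TinyMultiTaskModel(vocab_size=50, hidden=16, num_classes_a=4, num_classes_b=3)

# Hypothetical token ids; 0 is treated as padding
input_ids = torch.tensor([[5, 8, 2, 0], [7, 3, 9, 4]])
attention_mask = (input_ids != 0).long()
labels_a = torch.tensor([3, 0])
labels_b = torch.tensor([1, 2])

logits_a, logits_b = model(input_ids, attention_mask)
loss_a = F.cross_entropy(logits_a, labels_a)
loss_b = F.cross_entropy(logits_b, labels_b)
loss = loss_a + loss_b  # unweighted sum, as described above
loss.backward()         # gradients from both tasks reach the shared encoder
```

An unweighted sum treats both tasks equally. If one loss dominated, per-task weights would be a natural extension, though that is not something this notebook does.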



## Task 3

For Task 3, I explored how freezing different parts of the model affects training and learning dynamics.

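Freezing the entire model turns it into a fixed feature extractor: no weights are updated, so its outputs can be cached or passed to a separate lightweight classifier. This approach is fast and useful for quick baselines or extremely small datasets, but the model itself cannot adapt to the task.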

Freezing the transformer backbone allows only the classification heads to learn. This is effective when using a strong pretrained model, and it reduces the risk of overfitting to limited task-specific data.

Freezing one of the task-specific heads is useful when you want to retain the performance of a stable task while improving another. For example, if the sentiment task performs well and doesn’t require further tuning, freezing its head allows the emotion head to learn without interference. If the shared backbone is still being updated, however, the frozen head's inputs shift as well, so its performance should be monitored.

In terms of transfer learning, I started with the bert-base-uncased model as the backbone. This model captures general-purpose linguistic patterns. I chose to freeze the lower layers, which tend to capture surface-level and syntactic features, and fine-tune the upper layers and task heads, which are more semantic and task-specific. This approach offers a good balance between stability and adaptability.
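
Partial freezing comes down to switching off requires_grad on the chosen parameters and passing only the remaining ones to the optimizer. The sketch below uses a small stack of linear layers as a stand-in for the transformer; ToyBackbone, freeze_lower_layers, and the layer counts are assumptions for illustration, not the notebook's code.

```python
import torch
import torch.nn as nn


class ToyBackbone(nn.Module):
    """Stack of linear layers standing in for a transformer's encoder layers."""

    def __init__(self, hidden: int = 8, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x


def freeze_lower_layers(backbone: ToyBackbone, num_frozen: int) -> None:
    """Disable gradients for the first num_frozen layers."""
    for layer in backbone.layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False


torch.manual_seed(0)
backbone = ToyBackbone()
head_a = nn.Linear(8, 4)
head_b = nn.Linear(8, 3)

freeze_lower_layers(backbone, num_frozen=2)

# Only parameters that still require gradients go to the optimizer
trainable = [p for m in (backbone, head_a, head_b) for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

frozen_count = sum(p.numel() for p in backbone.parameters() if not p.requires_grad)
trainable_count = sum(p.numel() for p in trainable)

# One backward pass: frozen layers receive no gradient, upper layers and heads do
features = backbone(torch.randn(2, 8))
loss = head_a(features).sum() + head_b(features).sum()
loss.backward()
print(f"frozen parameters: {frozen_count}, trainable parameters: {trainable_count}")
```

With a Hugging Face BertModel, the corresponding transformer blocks are exposed as the encoder.layer module list; how that backbone is reached inside MultiTaskModel depends on the class's own attribute names, which are not shown here.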



## Task 4

For Task 4, I created a reusable MultiTaskTrainer class that encapsulates the training logic in a modular way. Inspired by PyTorch Lightning, I split the setup into:

- prepare_data() for downloading and preparing data
- setup() for building datasets and dataloaders
- train() for the training loop and loss tracking

This structure keeps each stage small, extensible, and testable on its own.

To validate correctness, I implemented an overfit mode that trains the model on a single batch. With a higher learning rate and more epochs, the model should memorize that batch perfectly, which is a key sanity check for multi-task setups: if the loss does not approach zero, something in the data pipeline, the heads, or the loss is miswired.
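
The single-batch check can be illustrated independently of MultiTaskTrainer. The sketch below is a toy version under stated assumptions: random features replace tokenized tweets, a small linear layer replaces BERT, and the sizes, learning rate, and step count are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# One fixed batch of 8 examples with labels for both tasks (random placeholders)
features = torch.randn(8, 16)
labels_a = torch.randint(0, 4, (8,))
labels_b = torch.randint(0, 3, (8,))

backbone = nn.Linear(16, 32)
head_a = nn.Linear(32, 4)
head_b = nn.Linear(32, 3)
params = list(backbone.parameters()) + list(head_a.parameters()) + list(head_b.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)

losses = []
for step in range(200):
    optimizer.zero_grad()
    shared = torch.relu(backbone(features))
    # Same summed objective as the multi-task setup: loss_a + loss_b
    loss = F.cross_entropy(head_a(shared), labels_a) + F.cross_entropy(head_b(shared), labels_b)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

A loss that keeps falling on the repeated batch indicates that gradients flow through both heads and the shared layers; a loss that plateaus early usually points to a wiring problem rather than a modeling one.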

In [7]:
import torch
from train import MultiTaskTrainer

In [None]:
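trainer = MultiTaskTrainer(
    batch_size=8,
    epochs=30,      # repeated passes over the same small sample
    sample_size=8,  # with batch_size=8 this yields a single batch (overfit mode)
    lr=1e-4,
)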
trainer.prepare_data()
trainer.setup()
trainer.train()

Generating train split: 100%|██████████| 45615/45615 [00:00<00:00, 309556.38 examples/s]
Generating test split: 100%|██████████| 12284/12284 [00:00<00:00, 655040.05 examples/s]
Generating validation split: 100%|██████████| 2000/2000 [00:00<00:00, 252114.57 examples/s]
Generating train split: 100%|██████████| 3257/3257 [00:00<00:00, 536497.98 examples/s]
Generating test split: 100%|██████████| 1421/1421 [00:00<00:00, 154848.17 examples/s]
Generating validation split: 100%|██████████| 374/374 [00:00<00:00, 83444.32 examples/s]


Epoch 1/30, Avg Loss: 3.5471
Epoch 2/30, Avg Loss: 2.2816
Epoch 3/30, Avg Loss: 1.2555
Epoch 4/30, Avg Loss: 0.8170
Epoch 5/30, Avg Loss: 0.5313
Epoch 6/30, Avg Loss: 0.3137
Epoch 7/30, Avg Loss: 0.2171
Epoch 8/30, Avg Loss: 0.1606
Epoch 9/30, Avg Loss: 0.1331
Epoch 10/30, Avg Loss: 0.0935
Epoch 11/30, Avg Loss: 0.0762
Epoch 12/30, Avg Loss: 0.0614
Epoch 13/30, Avg Loss: 0.0480
Epoch 14/30, Avg Loss: 0.0435
Epoch 15/30, Avg Loss: 0.0360
Epoch 16/30, Avg Loss: 0.0333
Epoch 17/30, Avg Loss: 0.0290
Epoch 18/30, Avg Loss: 0.0271
Epoch 19/30, Avg Loss: 0.0259
Epoch 20/30, Avg Loss: 0.0247
Epoch 21/30, Avg Loss: 0.0229
Epoch 22/30, Avg Loss: 0.0213
Epoch 23/30, Avg Loss: 0.0178
Epoch 24/30, Avg Loss: 0.0190
Epoch 25/30, Avg Loss: 0.0173
Epoch 26/30, Avg Loss: 0.0169
Epoch 27/30, Avg Loss: 0.0153
Epoch 28/30, Avg Loss: 0.0138
Epoch 29/30, Avg Loss: 0.0141
Epoch 30/30, Avg Loss: 0.0133


In [9]:
trainer.model.eval()  # disable dropout so predictions are deterministic
with torch.no_grad():
    for batch in trainer.dataloader:
        input_ids = batch["input_ids"].to(trainer.device)
        attention_mask = batch["attention_mask"].to(trainer.device)
        labels_a = batch["label_a"].to(trainer.device)
        labels_b = batch["label_b"].to(trainer.device)

        out_a, out_b = trainer.model(input_ids, attention_mask)
        pred_a = out_a.argmax(dim=1)
        pred_b = out_b.argmax(dim=1)

        print("Sentence Classification Predictions:", pred_a.cpu().tolist())
        print("Labels:", labels_a.cpu().tolist())

        print("Sentiment Analysis Predictions:", pred_b.cpu().tolist())
        print("Labels:", labels_b.cpu().tolist())

Sentence Classification Predictions: [3, 0, 0, 1, 1, 0, 0, 0]
Labels: [3, 0, 0, 1, 1, 0, 0, 0]
Sentiment Analysis Predictions: [1, 2, 1, 2, 2, 0, 0, 2]
Labels: [1, 2, 1, 2, 2, 0, 0, 2]
