# Fetch Take Home Exercise

In [1]:
# !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# !pip3 install transformers
# !pip3 install pandas

### Task 1

In [2]:
from sentence_transformer import SentenceTransformer 

# Test sentences
sentences = [
    "Well that's a bit far-fetched",
    "Go fetch boy",
    "Rewards, rewards, rewards"
]

# Model initialization
model = SentenceTransformer(tokenizer_name='bert-base-cased')

# Model inference
out = model(sentences)
out


tensor([[[-0.2514, -0.3482,  0.2149,  ...,  0.2314,  0.0810, -0.3170],
         [-0.5276,  0.1111,  0.4386,  ...,  0.5205, -0.0781,  0.0565],
         [-0.0779, -0.0373, -0.4438,  ...,  0.0753, -0.3515, -0.2617],
         ...,
         [-0.3342, -0.0064,  0.2034,  ...,  0.2632, -0.9232, -0.1479],
         [-0.6239,  0.4463,  0.7132,  ..., -0.5824, -0.6383,  0.3792],
         [-0.7860, -0.0121, -0.1461,  ..., -0.0792,  0.0956, -0.3999]],

        [[ 0.4845, -0.1764, -0.2109,  ...,  0.1437,  0.7978, -0.5256],
         [ 0.5374,  0.4205, -0.3329,  ...,  0.0727,  0.7001, -0.7020],
         [ 0.3948,  0.2103,  0.0070,  ...,  0.1159,  0.1814, -0.6382],
         ...,
         [ 0.5440, -0.0874, -0.0264,  ..., -0.0862,  0.4257, -0.6423],
         [ 0.9366, -0.2117,  0.1287,  ..., -0.1637,  0.4449, -0.4650],
         [ 0.8690, -0.2463,  0.0888,  ..., -0.3149,  0.6313, -0.5963]],

        [[ 0.5890, -0.0768, -0.0999,  ..., -0.0489,  0.1756, -0.4978],
         [ 0.2048,  0.3711,  0.0297,  ...,  0

Besides the transformer backbone, I needed a way to convert the sentence strings into a format the transformer can consume. There are many options, including more traditional ones such as one-hot encoding or Word2Vec, but those representations can become very large (one-hot vectors grow with the vocabulary size) and expensive to compute with. I went with a more contemporary option, a pretrained BERT tokenizer (`bert-base-cased`), which splits each sentence into subword tokens and maps them to integer IDs. BERT learns contextual information, so its embeddings carry that context as well. I set padding and truncation to True because the sentences vary in length and have to be batched into a single rectangular tensor.
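To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what those two options do to a batch of variable-length token ID lists. The function name `pad_and_truncate`, the token IDs, and `pad_id=0` are hypothetical choices for illustration. This is not the Hugging Face implementation, which also keeps special tokens such as the trailing separator intact when truncating.

```python
from typing import List, Tuple


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Cut each sequence to max_length, then right-pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Made-up token IDs for three sentences of different lengths
batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 4, 5, 6, 7, 8, 9, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=6)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Every row ends up with the same length, which is what lets the batch be stacked into one tensor, and the attention mask records which positions are padding.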

### Task 2

To support multi-task learning, I added one additional fully connected layer (a task head) for each task. The transformer-encoded sentences are fed to these heads, and each head outputs as many values as its task requires. I gave the classification task 3 possible labels (food, sports, reading), so its head has 3 outputs. The sentiment labels are binary (positive, negative), so its head has only one output. A softmax over the classification outputs produces a probability for each label. A softmax over a single value is always 1, so the lone sentiment output needs a sigmoid to become a usable probability.
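The layout described above can be sketched as a shared backbone with two linear heads. This is only an illustration under assumptions: the class name `TwoHeadModel`, the small dimensions, the use of `nn.TransformerEncoder` as a stand-in backbone, and mean pooling over tokens are all hypothetical, and the actual `MTLTransformer` may differ.

```python
import torch
import torch.nn as nn


class TwoHeadModel(nn.Module):
    """Shared encoder followed by one linear head per task (illustrative only)."""

    def __init__(self, embed_dim: int = 32, num_classes: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.class_head = nn.Linear(embed_dim, num_classes)  # food / sports / reading
        self.sentiment_head = nn.Linear(embed_dim, 1)        # single logit: positive vs negative

    def forward(self, x: torch.Tensor):
        encoded = self.backbone(x)    # (batch, seq_len, embed_dim)
        pooled = encoded.mean(dim=1)  # assumed pooling: average over tokens
        return self.class_head(pooled), self.sentiment_head(pooled)


model = TwoHeadModel()
model.eval()
with torch.no_grad():
    # Random stand-in for 2 embedded sentences of 5 tokens each
    class_logits, sentiment_logit = model(torch.randn(2, 5, 32))

class_probs = class_logits.softmax(dim=-1)   # one probability per label, rows sum to 1
sentiment_prob = sentiment_logit.sigmoid()   # probability of the positive label
print(class_probs.shape, sentiment_prob.shape)
```

Returning raw logits from the heads and applying softmax or sigmoid only when probabilities are needed keeps the outputs compatible with PyTorch loss functions that expect logits.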


### Task 3

1. If the entire network is frozen, no weights are updated, so the model cannot be trained any further. This is useful when we want to ship a model whose weights must not change, i.e. use it purely as a pretrained model for inference.

2. If only the transformer backbone is frozen, only the fully connected task heads are trained. This usually means we already trust the embeddings coming out of the transformer and want to keep them fixed. It is beneficial when we want to train the multi-task heads in isolation on top of good embeddings, and it is cheaper because far fewer parameters are updated.

3. If the Task A head is frozen, we train the transformer and the Task B head, so training changes the transformer weights only to minimize the Task B loss. This is beneficial when performance is lacking only on Task B. The trade-off is that the frozen Task A head then receives shifted embeddings, so Task A performance should be monitored. Freezing the Task B head instead produces the mirror-image effect.
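In PyTorch, the three freezing scenarios above come down to toggling `requires_grad` on the relevant parameters. The sketch below uses a toy `nn.ModuleDict` with hypothetical module names, and it assumes for illustration that Task A is the classification head; the helper names `set_trainable` and `trainable_parts` are also made up.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def trainable_parts(model: nn.Module) -> list:
    """Names of the top-level submodules that still have trainable parameters."""
    return sorted(
        {name.split(".")[0] for name, p in model.named_parameters() if p.requires_grad}
    )


# Toy stand-in for a backbone with two task heads
model = nn.ModuleDict(
    {
        "backbone": nn.Linear(8, 8),
        "class_head": nn.Linear(8, 3),      # assumed to be Task A
        "sentiment_head": nn.Linear(8, 1),  # assumed to be Task B
    }
)

# Scenario 1: the entire network is frozen
set_trainable(model, False)
scenario_1 = trainable_parts(model)

# Scenario 2: only the backbone is frozen
set_trainable(model, True)
set_trainable(model["backbone"], False)
scenario_2 = trainable_parts(model)

# Scenario 3: only the Task A head is frozen
set_trainable(model, True)
set_trainable(model["class_head"], False)
scenario_3 = trainable_parts(model)

print(scenario_1, scenario_2, scenario_3)
```

When building the optimizer, it is common to pass only the parameters that still require gradients, for example `[p for p in model.parameters() if p.requires_grad]`.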

On transfer learning:

1. The most important thing to look at when selecting a pretrained model is the data it was trained on. If that data is similar to the inputs it will see at inference time, its outputs are more likely to be accurate. After that, its reported performance in papers or on model hubs helps narrow down the candidates. If several options look viable, I would evaluate them on a reliable dataset of my own and pick the best performer.

2. I would keep the whole model frozen if I'm completely satisfied with its performance on the task at hand. If not, I would generally unfreeze a couple of the later layers and train them on my own dataset. Earlier layers in a network generally capture general-purpose, "big picture" features, while later layers become increasingly specific. A carefully selected pretrained model should already perform decently on the task, so it already understands the "big picture", and only the finer details may need adjusting.
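Unfreezing only the later layers can be sketched as follows. The helper `unfreeze_last_layers` and the stack of four `nn.Linear` layers are hypothetical stand-ins for a pretrained model's encoder layers; how the layers are exposed depends on the model being used.

```python
import torch.nn as nn


def unfreeze_last_layers(layers: nn.ModuleList, num_trainable: int) -> None:
    """Freeze every layer except the last `num_trainable` ones."""
    cutoff = len(layers) - num_trainable
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index >= cutoff


# Toy stand-in for a stack of pretrained encoder layers
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
unfreeze_last_layers(layers, num_trainable=2)

# True where a layer will be updated during fine-tuning
status = [all(p.requires_grad for p in layer.parameters()) for layer in layers]
print(status)
```

Starting with a small `num_trainable` and increasing it only if validation performance stalls keeps the general-purpose early layers intact for as long as possible.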


### Task 4

In [3]:
import torch
import torch.nn as nn
from sentence_transformer import MTLTransformer, SentencesDataset
from torch.utils.data import DataLoader
from tqdm import tqdm

# Model initialization
model = MTLTransformer()

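# Loss functions for training
classification_loss_func = nn.CrossEntropyLoss()
# Binary cross entropy for the single-output sentiment head. CrossEntropyLoss
# over one output is always 0, because a softmax over a single value is 1.
# Assumes the head returns a raw logit; use nn.BCELoss if it returns a probability.
sentiment_loss_func = nn.BCEWithLogitsLoss()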

# Optimizer initialization
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)  

# Number of epochs 
num_epochs = 100

# Dataloader
train_dataset = SentencesDataset("sentences.csv")
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)


# Training loop
for epoch in tqdm(range(num_epochs)):
    loss_per_epoch = 0 
    for inputs, class_labels, sentiment_labels in train_loader:

        # Zero out optimizer before every pass        
        optimizer.zero_grad()

        # Forward pass
        output_class, output_sentiment = model(inputs)

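        # Calculate loss based on both tasks
        # Sentiment targets: float, shape (batch, 1), matching the single-output head
        sentiment_targets = torch.as_tensor(sentiment_labels, dtype=torch.float32).unsqueeze(-1)
        # Class targets: integer class indices, shape (batch,)
        class_targets = torch.as_tensor(class_labels, dtype=torch.long)
        sentiment_loss = sentiment_loss_func(output_sentiment, sentiment_targets)
        class_loss = classification_loss_func(output_class, class_targets)
        total_loss = sentiment_loss + class_loss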

        # Backward pass
        total_loss.backward()
        optimizer.step()

        loss_per_epoch += total_loss.item()

100%|██████████| 100/100 [00:33<00:00,  2.96it/s]


To train two tasks at once, I set up two separate losses, one per task. I chose cross-entropy for both because both are classification problems. I went with the Adam optimizer because it adapts the step size for each parameter using running estimates of the gradient's first and second moments, which usually makes training faster and more stable than a single fixed learning rate. From there, I built a typical training loop: each batch arrives in the format (sentence, class_label, sentiment_label), the model returns one output per task, and each output goes to its own loss function. Since both tasks were equally important to me, I simply added the two losses together with no weighting.
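If the tasks ever stop being equally important, the unweighted sum generalizes to a weighted one. The sketch below is a hedged illustration: the function `combined_loss` and its weight parameters are hypothetical, and the sentiment term uses `nn.BCEWithLogitsLoss`, the binary form of cross-entropy for a single raw logit, which is an assumption about the head's output.

```python
import math

import torch
import torch.nn as nn

class_loss_fn = nn.CrossEntropyLoss()       # multi-class: logits (batch, 3), integer targets
sentiment_loss_fn = nn.BCEWithLogitsLoss()  # binary: one logit per sample, float targets


def combined_loss(
    class_logits: torch.Tensor,
    sentiment_logits: torch.Tensor,
    class_labels: torch.Tensor,
    sentiment_labels: torch.Tensor,
    class_weight: float = 1.0,
    sentiment_weight: float = 1.0,
) -> torch.Tensor:
    """Weighted sum of the two task losses; equal weights reproduce a plain sum."""
    class_loss = class_loss_fn(class_logits, class_labels)
    sentiment_loss = sentiment_loss_fn(sentiment_logits, sentiment_labels)
    return class_weight * class_loss + sentiment_weight * sentiment_loss


# All-zero logits mean "no preference", so each loss takes a known value:
# ln(3) for the 3-way classification and ln(2) for the binary sentiment task.
class_logits = torch.zeros(2, 3)
sentiment_logits = torch.zeros(2, 1)
class_labels = torch.tensor([0, 2])
sentiment_labels = torch.tensor([[1.0], [0.0]])

loss = combined_loss(class_logits, sentiment_logits, class_labels, sentiment_labels)
print(loss.item(), math.log(3) + math.log(2))
```

Raising one weight above the other shifts how strongly the shared backbone is pulled toward that task during training.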

