# Task 1: Sentence Transformer Implementation

Implement a sentence transformer model using any deep learning framework of your choice. 
This model should be able to encode input sentences into fixed-length embeddings. Test your 
implementation with a few sample sentences and showcase the obtained embeddings. 
Describe any choices you had to make regarding the model architecture outside of the 
transformer backbone. 

In [2]:
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizer



### Implement a sentence transformer model

In [3]:
# Initialize the model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
transformer_model = DistilBertModel.from_pretrained("distilbert-base-uncased")

In [4]:
# Sample sentences
sentence = "This is the assessement for sentence transformer implementation. Let's embed this sentence and check the length"
# Tokenize and get embeddings
inputs = tokenizer(sentence, return_tensors="pt")

In [4]:
inputs["input_ids"].shape

torch.Size([1, 24])

For the given sentence, the tokenizer produces 24 token IDs (including the special [CLS] and [SEP] tokens), which are passed to the transformer backbone.

In [5]:
with torch.no_grad():
    # forward the token ids to the transformer
    transformer_output = transformer_model(**inputs)
    # We select the first token (CLS token) as sentence embedding 
    cls_embedding = transformer_output.last_hidden_state[:, 0, :]  # CLS token
print(f"Sentence: {sentence}")
print(cls_embedding.shape)
print(transformer_output.last_hidden_state)

Sentence: This is the assessement for sentence transformer implementation. Let's embed this sentence and check the length
torch.Size([1, 768])
tensor([[[-0.2685, -0.3769,  0.0627,  ..., -0.0342,  0.0350,  0.6689],
         [-0.5525, -0.4315, -0.1180,  ..., -0.0320,  0.1906,  0.4065],
         [-0.6068, -0.1323,  0.1431,  ...,  0.0917, -0.2161,  0.8538],
         ...,
         [-0.4668, -0.3988,  0.2348,  ...,  0.0157, -0.3578,  0.3923],
         [ 0.3333, -0.3877, -0.0371,  ...,  0.1150, -0.0046,  0.2536],
         [ 0.2240, -0.1058,  0.3084,  ...,  0.3252, -0.8726,  0.1843]]])


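The model therefore encodes each input sentence into a fixed-length embedding of size 768, which is the hidden size of DistilBERT-base. The printed `last_hidden_state` holds one such vector per token, and the sentence embedding is the vector at the first ([CLS]) position.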

### Testing the implementation with a few sample sentences

In [6]:
# Sample sentences
sentences = ["I like doing assesements.", 
             "I love doing NLP works!! Is there more??", 
             "I wish to be a AI engineer and make AI products."]

# Tokenize and get embeddings
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # forward the token ids to the transformer
        transformer_output = transformer_model(**inputs)
        # We select the first token (CLS token) as sentence embedding 
        cls_embedding = transformer_output.last_hidden_state[:, 0, :]  # CLS token
    print(f"Sentence: {sentence}")
    print(cls_embedding.shape)


Sentence: I like doing assesements.
torch.Size([1, 768])
Sentence: I love doing NLP works!! Is there more??
torch.Size([1, 768])
Sentence: I wish to be a AI engineer and make AI products.
torch.Size([1, 768])


### Describe any choices you had to make regarding the model architecture outside of the transformer backbone. 

The architecture around the backbone depends on how the sentence embedding will be used. The main choice is the pooling strategy, that is, how the per-token outputs are reduced to a single vector. For sentence classification or sentiment analysis, the [CLS] token of BERT/DistilBERT is a common choice, since it attends to every token in the input and a task head trained on top of it can learn to use it as a summary. For general-purpose sentence embeddings, the [CLS] vector alone is usually not enough, because without further training it does not form a semantic space in which distances reflect sentence similarity.

For this work, we only need a fixed-length embedding that summarizes the sentence, so the [CLS] token is used directly, with no additional pooling or projection layer. Applications that require sentence similarity would need further fine-tuning of the model.
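
The pooling choice discussed above can be illustrated with a small sketch. It compares [CLS] pooling with masked mean pooling, a common alternative that is not used in this notebook. The helper names `cls_pooling` and `mean_pooling` are hypothetical, and the random tensor is only a stand-in for `transformer_output.last_hidden_state`.

```python
import torch

def cls_pooling(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Use the vector at the first ([CLS]) position as the sentence embedding."""
    return last_hidden_state[:, 0, :]

def mean_pooling(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Stand-in for the backbone output: 2 sentences, 5 tokens each, hidden size 768
torch.manual_seed(0)
hidden = torch.randn(2, 5, 768)
# The second sentence has two padded positions at the end
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

cls_emb = cls_pooling(hidden)
mean_emb = mean_pooling(hidden, attention_mask)
print(cls_emb.shape, mean_emb.shape)
```

Both strategies give one 768-dimensional vector per sentence, so either can feed the same downstream heads. Mean pooling needs the attention mask so that padding does not dilute the average.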

# Task 2: Multi-Task Learning Expansion

Expand the sentence transformer to handle a multi-task learning setting. 
1. Task A: Sentence Classification – Classify sentences into predefined classes (you can 
make these up). 
2. Task B: [Choose another relevant NLP task such as Named Entity Recognition, 
Sentiment Analysis, etc.] (you can make the labels up) 
Describe the changes made to the architecture to support multi-task learning.

### Expand the sentence transformer to handle a multi-task learning setting

In [None]:
class MultiTaskModel(nn.Module):
    def __init__(self, transformer_model, tokenizer, num_classes, num_sentiments=3):
        super().__init__()
        self.transformer = transformer_model
        self.tokenizer = tokenizer
        # Classification and Sentiment heads
        self.classification_head = nn.Linear(self.transformer.config.hidden_size, num_classes)
        self.sentiment_head = nn.Linear(self.transformer.config.hidden_size, num_sentiments)

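    def forward(self, sequence):
        # Tokenize the raw text; the tokenizer returns a dict-like object, not a tuple
        inputs = self.tokenizer(sequence, return_tensors="pt", padding=True, truncation=True)
        # Move the token tensors to the same device as the model weights
        device = self.classification_head.weight.device
        input_ids = inputs["input_ids"].to(device)
        attention_mask = inputs["attention_mask"].to(device)
        # Pass through transformer
        transformer_output = self.transformer(input_ids=input_ids, attention_mask=attention_mask)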
        cls_embedding = transformer_output.last_hidden_state[:, 0, :]  # CLS token
        # Outputs for each task
        classification_output = self.classification_head(cls_embedding)
        sentiment_output = self.sentiment_head(cls_embedding)
        # return the task scores
        return classification_output, sentiment_output

### Describe the changes made to the architecture to support multi-task learning.

For the multi-task learning setting, we chose to train the model on a sentence classification task and a sentiment analysis task. For each task, we add a linear layer that takes the sentence embedding produced by the sentence transformer from Task 1. The transformer is shared by both linear layers, which makes sense because the DistilBERT [CLS] embedding provides an overall summary of the sequence that both heads need.
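
As a rough illustration of what sharing the backbone means during training, the sketch below uses a single linear layer as a hypothetical stand-in for the transformer (`shared`, `head_a`, and `head_b` are illustrative names, and the data is random). It shows that each head only receives gradients from its own loss, while the shared layer accumulates gradients from both.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
shared = nn.Linear(8, 16)   # hypothetical stand-in for the shared transformer
head_a = nn.Linear(16, 4)   # Task A: 4 made-up classes
head_b = nn.Linear(16, 3)   # Task B: 3 sentiment labels

x = torch.randn(5, 8)                    # stand-in for a batch of 5 inputs
labels_a = torch.randint(0, 4, (5,))
labels_b = torch.randint(0, 3, (5,))

embedding = shared(x)                    # computed once, consumed by both heads
loss_a = F.cross_entropy(head_a(embedding), labels_a)
loss_b = F.cross_entropy(head_b(embedding), labels_b)

# Backpropagate Task A only: head_b receives no gradient from it
loss_a.backward(retain_graph=True)
grad_from_a = shared.weight.grad.clone()
head_b_untouched = head_b.weight.grad is None

# Backpropagate Task B: its gradient is accumulated into the shared layer
loss_b.backward()
grad_from_both = shared.weight.grad.clone()

print(head_b_untouched, torch.allclose(grad_from_a, grad_from_both))
```

Calling `backward()` once on `loss_a + loss_b` yields the same accumulated gradient on the shared layer as the two separate calls here.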

# Task 3: Training Considerations

### Discuss the implications and advantages of each scenario and explain your rationale as to how the model should be trained given the following: 

1. If the entire network should be frozen. <br>
__Implications :__ If we freeze the entire network, the weights of both the transformer and the task heads remain unchanged, so no training takes place.  <br>
__Advantages :__ This preserves the pretrained backbone's knowledge and avoids any training cost. It only makes sense when the task heads have already been trained, since freshly initialized heads would produce arbitrary predictions. This approach is recommended when we do not have enough data to train the model and the network already performs reasonably well on both tasks.<br>

2. If only the transformer backbone should be frozen. <br>
__Implications :__ Freezing the transformer keeps its pretrained knowledge intact, so only the task-specific heads are trained. <br>
__Advantages :__ If the pretrained model is good, its embedding is already a strong representation of the input sequence. Training the linear heads requires far less data and compute than training the whole transformer backbone, so the network can be adapted with little data. This approach is recommended when the backbone already provides a general representation that fits the purpose and when we do not have much data.

3. If only one of the task-specific heads (either for Task A or Task B) should be frozen. <br>
__Implications :__ Freezing one task-specific head means the weights of that linear layer stay fixed while the transformer and the other head continue to train. <br>
__Advantages :__ Suppose the network already performs well on Task A and we want to keep training for Task B. Without freezing the Task A head, we cannot be sure that Task A performance stays the same, since the weights for both tasks are being updated. This is related to catastrophic forgetting. Freezing the Task A head preserves its learned weights while the rest of the network trains. Note that updates to the shared backbone can still shift the embeddings the frozen head receives, so Task A performance should still be monitored.
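
The three scenarios above can be sketched by toggling `requires_grad`. `TinyMultiTask` is a hypothetical stand-in that mirrors the attribute names of `MultiTaskModel` but uses a small linear "backbone" so it runs without downloading weights. `set_trainable` and `count_trainable` are illustrative helpers, not part of any library.

```python
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: a small 'backbone' plus two linear task heads."""
    def __init__(self, hidden_size=16, num_classes=4, num_sentiments=3):
        super().__init__()
        self.transformer = nn.Sequential(nn.Linear(8, hidden_size), nn.ReLU())
        self.classification_head = nn.Linear(hidden_size, num_classes)
        self.sentiment_head = nn.Linear(hidden_size, num_sentiments)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def count_trainable(module: nn.Module) -> int:
    """Number of parameters that would receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

model = TinyMultiTask()
total = count_trainable(model)

# Scenario 1: freeze the entire network
set_trainable(model, False)
frozen_all = count_trainable(model)

# Scenario 2: freeze only the backbone, train both heads
set_trainable(model, True)
set_trainable(model.transformer, False)
heads_only = count_trainable(model)

# Scenario 3: freeze only the Task A head, train the backbone and the Task B head
set_trainable(model, True)
set_trainable(model.classification_head, False)
one_head_frozen = count_trainable(model)

# Give the optimizer only the parameters that are still trainable
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
print(total, frozen_all, heads_only, one_head_frozen)
```

In scenario 1 there is nothing left to optimize, so no optimizer is built for that case.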

### Consider a scenario where transfer learning can be beneficial. Explain how you would approach the transfer learning process, including:

1. The choice of a pre-trained model. <br>
The choice of pretrained model depends on the required efficiency, performance, and available compute resources. DistilBERT offers a good balance of performance and efficiency for most applications of this kind, with lower latency and fewer parameters to train than BERT. If performance is the main priority, a larger model is recommended.

2. The layers you would freeze/unfreeze. <br> 
The higher layers are usually the ones that are fine-tuned, while the lower layers are kept frozen.

3. The rationale behind these choices.  <br>
The lower layers mainly capture general language features, whereas the higher layers are more specific to the final prediction. Fine-tuning the higher layers therefore lets the model learn new task-specific patterns while the general knowledge in the lower layers is preserved.
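
A minimal sketch of this freeze/unfreeze strategy follows, assuming a hypothetical `TinyBackbone` with an embedding layer and a stack of six layers in place of the real transformer. The split point `num_frozen` and both learning rates are illustrative values, not tuned recommendations.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a transformer: embeddings plus a stack of layers."""
    def __init__(self, num_layers=6, dim=16):
        super().__init__()
        self.embeddings = nn.Embedding(100, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

backbone = TinyBackbone()
head = nn.Linear(16, 3)   # task-specific head, always trained

num_frozen = 4            # assumption: freeze the embeddings and the 4 lowest layers

# Freeze the embeddings and the lower layers (general language features)
for param in backbone.embeddings.parameters():
    param.requires_grad = False
for layer in backbone.layers[:num_frozen]:
    for param in layer.parameters():
        param.requires_grad = False

# Fine-tune the top layers gently and the new head with a larger learning rate
top_params = [p for layer in backbone.layers[num_frozen:] for p in layer.parameters()]
optimizer = torch.optim.AdamW([
    {"params": top_params, "lr": 2e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

trainable_backbone = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(trainable_backbone, [group["lr"] for group in optimizer.param_groups])
```

Using a smaller learning rate for the pretrained layers than for the freshly initialized head is a common way to limit how far the pretrained weights drift during fine-tuning.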

# Task 4: Training Loop Implementation (BONUS)

If not already done, code the training loop for the Multi-Task Learning Expansion in Task 2. 
Explain any assumptions or decisions made paying special attention to how training within a 
MTL framework operates. Please note you need not actually train the model.  <br>
Things to focus on: <br>
- Handling of hypothetical data 
- Forward pass 
- Metrics 

In [None]:
def train_epoch(model, dataloader, optimizer, criterion_a, criterion_b, device):
    """
    Train the multi-task model for one epoch.

    Args:
        model (MultiTaskModel): multi-task learning model
        dataloader (DataLoader): yields (raw text sequences, labels_a, labels_b)
        optimizer: optimizer over the trainable (non-frozen) parameters
        criterion_a: loss function for Task A (classification)
        criterion_b: loss function for Task B (sentiment)
        device: cpu or gpu
    """
    model.train()
    # Running counters for the accuracy metrics
    correct_category_predictions = 0
    correct_sentiment_predictions = 0
    total_samples = 0

    for iteration, batch in enumerate(dataloader):
        # obtain the data and labels; the raw text is not moved to the device
        # because the model tokenizes it inside forward()
        input_sequence, labels_a, labels_b = batch
        labels_a, labels_b = labels_a.to(device), labels_b.to(device)

        optimizer.zero_grad()

        # forward pass: input the training sentences and obtain the task scores
        classification_output, sentiment_output = model(list(input_sequence))

        # calculate the per-task losses and sum them with equal weight
        loss_a = criterion_a(classification_output, labels_a)
        loss_b = criterion_b(sentiment_output, labels_b)
        loss = loss_a + loss_b

        # backward pass and parameter update
        loss.backward()
        optimizer.step()

        # Calculate running accuracy up to the current batch
        predicted_category = classification_output.argmax(dim=1)
        predicted_sentiment = sentiment_output.argmax(dim=1)

        correct_category_predictions += (predicted_category == labels_a).sum().item()
        correct_sentiment_predictions += (predicted_sentiment == labels_b).sum().item()

        total_samples += labels_a.size(0)

        # Metrics
        classification_accuracy = (correct_category_predictions / total_samples) * 100
        sentiment_accuracy = (correct_sentiment_predictions / total_samples) * 100

        print(f"Iteration {iteration + 1}, Loss: {loss.item():.4f}, Category Accuracy: {classification_accuracy:.2f}%, Sentiment Accuracy: {sentiment_accuracy:.2f}%")

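Given hypothetical training data consisting of raw text sequences with labels label_a and label_b, the model produces the classification scores and the sentiment scores in a single forward pass. The two task losses are summed with equal weight, and one backward pass updates the shared backbone and both heads, except for any frozen parameters. We assume the optimizer is already configured with the learning rates for the backbone layers and the task heads. The per-task accuracies are tracked as running metrics to monitor training, while the quantity actually optimized is the summed loss.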
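
One way to handle such hypothetical data is sketched below. The class `HypotheticalMultiTaskDataset`, the function `collate_batch`, and all sentences and labels are made up for illustration. The collate function keeps the text as a list of strings and converts only the labels to tensors, because tokenization is assumed to happen inside the model.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class HypotheticalMultiTaskDataset(Dataset):
    """Each item is (raw text, Task A label, Task B label); all values are made up."""
    def __init__(self, texts, labels_a, labels_b):
        assert len(texts) == len(labels_a) == len(labels_b)
        self.texts = texts
        self.labels_a = labels_a
        self.labels_b = labels_b

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels_a[idx], self.labels_b[idx]

def collate_batch(samples):
    """Keep the text as a list of strings; stack the labels into long tensors."""
    texts, labels_a, labels_b = zip(*samples)
    return (list(texts),
            torch.tensor(labels_a, dtype=torch.long),
            torch.tensor(labels_b, dtype=torch.long))

dataset = HypotheticalMultiTaskDataset(
    texts=["The match ended in a draw.",
           "This phone is fantastic!",
           "The new policy was announced today.",
           "I did not enjoy the movie."],
    labels_a=[0, 1, 2, 1],   # made-up topic classes
    labels_b=[1, 2, 1, 0],   # made-up sentiment: 0 = negative, 1 = neutral, 2 = positive
)
dataloader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate_batch)

first_texts, first_a, first_b = next(iter(dataloader))
print(first_texts, first_a, first_b)
```

Each batch from this loader has the `(input_sequence, labels_a, labels_b)` structure that `train_epoch` unpacks.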