# Layer-wise Learning Rate Implementation (BONUS)

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import BertModel, BertTokenizer

In [2]:
# SentenceTransformer_V2 subclasses torch.nn.Module. It tokenizes the input sentences, runs them through BERT,
# and pools the token embeddings into one fixed-size shared embedding per sentence.
# The shared embedding is then passed to a classification head and a sentiment head for multi-task learning.

class SentenceTransformer_V2(nn.Module):
    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name) # Importing BertTokenizer to tokenize sentences
        self.bert = BertModel.from_pretrained(model_name) # Importing BertModel to extract contextualized embeddings from sentences

        # Defining the sentence classification head
        self.text_classification = nn.Sequential(
            nn.Linear(768,128),  # First layer of the classification head
            nn.ReLU(), # Assigning an activation function between two layers
            nn.Linear(128,5) # Final layer of classification head with output of shape 5 to classify 5 classes, for example - happy, sad, angry, fear, disgust
        )

        #Defining Sentiment Classification Head
        self.sentiment = nn.Sequential(
            nn.Linear(768,128), # First layer of the sentiment head
            nn.ReLU(), # Assigning an activation function between two layers
            nn.Linear(128,3) # Final layer of the sentiment head with 3 outputs for positive, negative and neutral sentiment classification
        )

    def forward(self, sentences):
        tokens = self.tokenizer(sentences, padding = True, truncation = True, return_tensors = 'pt') # Tokenizing all the sentences
        outputs = self.bert(input_ids=tokens['input_ids'], token_type_ids=tokens['token_type_ids'], attention_mask=tokens['attention_mask']) # Using BertModel to extract shared embeddings
        embeddings = outputs.last_hidden_state.mean(dim = 1) # Mean pooling over the sequence length to get fixed-size embeddings (note: padding positions are included in this average)

        classification_logits = self.text_classification(embeddings) # Passing the shared embeddings to the classification head; the output is raw logits, not probabilities
        sentiment_logits = self.sentiment(embeddings) # Passing the shared embeddings to the sentiment head; the output is raw logits, not probabilities

        return classification_logits, sentiment_logits # Return the logits for both tasks
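
The pooling step in `forward` averages over every position, including padding. A minimal sketch of one common alternative, a mean weighted by the attention mask, is shown below. The helper name `masked_mean_pool` and the toy tensors are assumptions made for illustration; they are not part of the notebook.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real tokens only, ignoring padding."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1.0)          # number of real tokens, at least 1
    return summed / counts

# Toy batch: 2 sentences, 3 positions, hidden size 2.
# The last position of the first sentence is padding (mask = 0).
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)   # padding ignored
plain = hidden.mean(dim=1)                # padding included, as in forward()
print(pooled)
print(plain)
```

Inside `forward`, the same helper could be called with `outputs.last_hidden_state` and `tokens['attention_mask']`. Whether it improves results on a given task is something to check empirically.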

In [3]:
multi_task_model = SentenceTransformer_V2() # Instantiating the model so that we can access its layers and assign specific learning rates.

In [4]:
optimizer = optim.Adam([
    {'params': multi_task_model.bert.parameters(), 'lr': 1e-5},  # Small learning rate for the pre-trained BERT model
    {'params': multi_task_model.text_classification[0].parameters(), 'lr': 1e-3},  # Larger learning rate for the initial linear layer
    {'params': multi_task_model.text_classification[2].parameters(), 'lr': 1e-4},  # Smaller learning rate for the later linear layer
    {'params': multi_task_model.sentiment[0].parameters(), 'lr': 1e-3},  # Larger learning rate for the initial linear layer
    {'params': multi_task_model.sentiment[2].parameters(), 'lr': 1e-4}  # Smaller learning rate for the later linear layer
])
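
To confirm that each group received the intended rate, one can inspect `optimizer.param_groups`. The sketch below uses a small stand-in model, `TinyMultiTask` (a hypothetical class, with a single `nn.Linear` in place of BERT), so that it runs without downloading any weights. The group layout mirrors the cell above.

```python
import torch.nn as nn
import torch.optim as optim

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: `encoder` plays the role of `self.bert`."""
    def __init__(self, hidden=16):
        super().__init__()
        self.encoder = nn.Linear(8, hidden)
        self.text_classification = nn.Sequential(
            nn.Linear(hidden, 4), nn.ReLU(), nn.Linear(4, 5))
        self.sentiment = nn.Sequential(
            nn.Linear(hidden, 4), nn.ReLU(), nn.Linear(4, 3))

model = TinyMultiTask()

# (label, module, learning rate): same layout as the notebook's optimizer
groups = [
    ("encoder", model.encoder, 1e-5),
    ("text_classification[0]", model.text_classification[0], 1e-3),
    ("text_classification[2]", model.text_classification[2], 1e-4),
    ("sentiment[0]", model.sentiment[0], 1e-3),
    ("sentiment[2]", model.sentiment[2], 1e-4),
]
optimizer = optim.Adam(
    [{"params": module.parameters(), "lr": lr} for _, module, lr in groups])

# Print the learning rate and parameter count of every group
for (label, _, _), group in zip(groups, optimizer.param_groups):
    n_params = sum(p.numel() for p in group["params"])
    print(f"{label:24s} lr={group['lr']:.0e}  params={n_params}")

# Sanity check: parameters left out of every group are never updated,
# so the grouped tensor count should match the model's tensor count.
grouped = sum(len(group["params"]) for group in optimizer.param_groups)
total = sum(1 for _ in model.parameters())
print(grouped, total)
```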

## Explain the rationale for the specific learning rates you've set for each layer.

1. BERT is a large model pre-trained on a large corpus, so I have assigned it a small learning rate (1e-5). This lets it adapt gradually to the new tasks
   while retaining its pre-trained knowledge. A high learning rate here, 1e-3 for example, would push the weights aggressively toward the new data and overwrite
   that pre-trained knowledge in the process. If we were fine-tuning BERT for a use case significantly different from its pre-training data, I
   would have assigned a higher learning rate.

2. In both task heads, I have assigned a higher learning rate (1e-3) to the first Linear layer, because it does the heavier job of projecting the high-dimensional
   embeddings (length: 768) down to lower-dimensional ones (length: 128). The final Linear layer is assigned a lower learning rate (1e-4) for more stable convergence
   when computing the logits.
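
The effect of the different rates can be seen in a single optimizer step. The sketch below is an illustration under assumptions: `backbone` and `head` are small hypothetical stand-ins for the pre-trained encoder and a task head, and the data is random. On the first step of default Adam, each weight moves by roughly its group's learning rate, so the backbone weights change far less than the head weights.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

backbone = nn.Linear(8, 8)   # hypothetical stand-in for the pre-trained encoder
head = nn.Linear(8, 3)       # hypothetical stand-in for a freshly initialised head

optimizer = optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},  # small rate: preserve existing weights
    {"params": head.parameters(), "lr": 1e-3},      # larger rate: learn the new task quickly
])

# Keep copies of the weights so the size of the update can be measured
before_backbone = backbone.weight.detach().clone()
before_head = head.weight.detach().clone()

# One training step on random data
x = torch.randn(4, 8)
target = torch.randint(0, 3, (4,))
loss = nn.functional.cross_entropy(head(backbone(x)), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Largest absolute change of any single weight in each module
backbone_change = (backbone.weight.detach() - before_backbone).abs().max().item()
head_change = (head.weight.detach() - before_head).abs().max().item()
print(f"backbone max weight change: {backbone_change:.2e}")
print(f"head max weight change:     {head_change:.2e}")
```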

## Describe the potential benefits of using layer-wise learning rates for training deep neural networks. Does the multi-task setting play into that?

Benefits:

1. Generally, earlier layers tend to capture more general features of the data, as they sit at an early stage of the feature hierarchy. A model's quality depends on how
   well it generalizes, so for layers trained from scratch we assign a higher learning rate to the initial layers to lay a strong foundation. The later layers capture
   more task-specific features; a high learning rate there makes the learned features overly specific to the training data and leads to overfitting. For pre-trained layers
   such as BERT, the general features are already learned, so a small rate protects them instead. Layer-wise learning rates let us treat each of these cases appropriately.

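2. Assigning layer-specific learning rates can also help mitigate the vanishing or exploding gradients problem: a layer whose gradients are very small can be given
   a larger rate, and a layer whose gradients are large can be given a smaller one.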
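
One way to judge whether gradient magnitudes really differ across layers is to log the gradient norm of each parameter after a backward pass. The sketch below uses a small hypothetical MLP and random data, so the numbers it prints only illustrate the mechanics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical three-layer network, used only to illustrate the measurement
model = nn.Sequential(
    nn.Linear(16, 16), nn.Tanh(),
    nn.Linear(16, 16), nn.Tanh(),
    nn.Linear(16, 2),
)

x = torch.randn(32, 16)
y = torch.randint(0, 2, (32,))

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# L2 norm of the gradient of each parameter tensor, keyed by parameter name
grad_norms = {name: p.grad.norm().item() for name, p in model.named_parameters()}
for name, norm in grad_norms.items():
    print(f"{name:10s} grad norm = {norm:.4e}")
```

Layers whose norms stay very small over many steps are candidates for a larger rate, and layers with very large norms for a smaller one. This is a heuristic to be validated on the actual model.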

Regarding Multi-Tasking:

Layer-wise learning rates are beneficial in multi-task settings too. For example, in our case, sentence classification might be a relatively easier task than sentiment classification,
because sentiment depends on nuances of natural human language such as tone and sarcasm. Assigning a higher learning rate to the sentiment head could then improve the overall quality of the model.
In short, layer-wise learning rates give more flexibility in building robust models and improve the model's adaptability to the data and the task.
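
Giving the two heads different rates could be sketched as follows. The rates 5e-4 and 1e-3 are illustrative assumptions (the optimizer cell above uses the same rates for both heads), `make_head` is a hypothetical helper, and random tensors stand in for the pooled BERT embeddings and the labels.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

def make_head(num_classes, hidden=768):
    """Hypothetical helper: same head shape as in the notebook (768 -> 128 -> classes)."""
    return nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, num_classes))

text_classification = make_head(5)
sentiment = make_head(3)

optimizer = optim.Adam([
    {"params": text_classification.parameters(), "lr": 5e-4},  # assumed rate for the easier task
    {"params": sentiment.parameters(), "lr": 1e-3},            # assumed higher rate for the harder task
])

# Random stand-ins for the shared embeddings and the labels of each task
embeddings = torch.randn(4, 768)
class_labels = torch.randint(0, 5, (4,))
sentiment_labels = torch.randint(0, 3, (4,))

# nn.CrossEntropyLoss expects raw logits, which is what the heads return
criterion = nn.CrossEntropyLoss()
class_loss = criterion(text_classification(embeddings), class_labels)
sentiment_loss = criterion(sentiment(embeddings), sentiment_labels)
total_loss = class_loss + sentiment_loss  # unweighted sum of the two task losses

optimizer.zero_grad()
total_loss.backward()
optimizer.step()
print(f"classification loss={class_loss.item():.4f}  sentiment loss={sentiment_loss.item():.4f}")
```

Weighting the two loss terms is another knob for balancing the tasks. The unweighted sum here is only the simplest choice.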