# Task 2: Multi-Task Learning Expansion for Sentence Transformer

Expanding the sentence transformer model to accommodate multi-task learning involves significant adjustments to the architecture. This setup allows the model to handle multiple NLP tasks simultaneously, leveraging shared representations to improve overall efficiency and performance. Here's how we have adapted the transformer to manage two distinct tasks:

## Expanded Transformer Model Architecture
The updated model architecture includes task-specific classification layers on top of a shared transformer backbone. This design allows the model to optimize for multiple objectives, enhancing its utility and flexibility.

### Explanation of Architectural Choices and Advantages

1. **Shared Transformer Backbone:**
   - **Rationale:** The shared layers (embedding, positional encoding, and transformer blocks) process input data in a way that captures universal linguistic features, which are beneficial for any NLP task. This setup reduces redundancy and conserves computational resources.
   - **Advantages:** Sharing lower layers across tasks allows the model to learn a more robust representation of the language, which can improve generalization across tasks due to shared learning signals.

2. **Task-Specific Classifiers:**
   - **Rationale:** After processing through shared layers, task-specific classifiers (sentiment and engagement classifiers) tailor the learned embeddings to particular objectives. Each classifier focuses on optimizing for its respective task, allowing for specialization where necessary.
   - **Advantages:** This approach enables the model to be flexible and adaptable, capable of addressing the nuances of different tasks while maintaining the efficiency of a unified model structure. The use of separate classifiers ensures that task-specific features can be learned without interference, potentially enhancing accuracy on individual tasks.

3. **Mean Pooling Strategy:**
   - **Rationale:** Before passing the output to the classifiers, applying a mean pooling reduces the sequence of vectors to a single vector that captures the essence of the input across all positions. This is particularly useful for classification tasks, as it distills the entire input into a format suitable for making a single prediction per input.
   - **Advantages:** Mean pooling simplifies the output while retaining critical information, making it easier for the classifiers to perform effectively. It ensures that all parts of the input contribute to the final decision, enhancing the model's ability to understand and utilize the full context of the input.

# Data Loading

In [113]:
import pandas as pd

df = pd.read_csv('./archive/netflix_reviews.csv')

# Data Cleaning

In [114]:
df = df[['content','score','thumbsUpCount','at']]

df = df[~df['content'].isnull()]

df = df.astype({
    'score':'int16',
    'thumbsUpCount':'int16'
})

df['at'] = pd.to_datetime(df['at']) 
df['at'] = df['at'].dt.strftime('%Y')

In [115]:
df['thumbsUpCount'].describe()

count    112730.000000
mean         10.514681
std         101.402747
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max        8032.000000
Name: thumbsUpCount, dtype: float64

# Labelling

generating labels out of ratings

In [116]:
def sentiment_labeller(x):
    if x > 3:
        # 'Positive'
        return 2
    elif x <= 3 and x > 2:
        # 'Neutral'
        return 1
    else:
        # 'Negative'
        return 0


def user_engagement_labeller(x):
    if x > 12:
        # 'High'
        return 2
    elif x <= 12 and x >= 7:
        # "Moderate"
        return 1
    else:
        # "Low"
        return 0


df['sentiment'] = df['score'].apply(lambda x: sentiment_labeller(x))
df['user_engagement'] = df['thumbsUpCount'].apply(lambda x: user_engagement_labeller(x))

# Tokenization

In [117]:
df['content'].to_csv('content_text.txt', index=False, header=False)

In [118]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["content_text.txt"], trainer=trainer)






In [119]:
from tqdm import tqdm

def tokenize(text):
    return tokenizer.encode(text).ids

tqdm.pandas(desc="Processing dataframe: ")
df['tokenized_content'] = df['content'].progress_apply(lambda x: tokenize(str(x)))
df['tokenized_content'].head()

Processing dataframe: 100%|██████████| 112730/112730 [00:06<00:00, 16947.34it/s]


0    [56, 12780, 1677, 20724, 6867, 5390, 1522, 70,...
1    [46, 7805, 2049, 17, 1530, 1607, 46, 12, 82, 2...
2    [2215, 2171, 1514, 2019, 1547, 1642, 1736, 192...
3    [1725, 1502, 1480, 1692, 1584, 3381, 1514, 147...
4    [1984, 1560, 1596, 4417, 1483, 1632, 10029, 23...
Name: tokenized_content, dtype: object

In [120]:
df = df[['content','tokenized_content','sentiment','user_engagement','at']]

# Multi Head Attention 

In [121]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(MultiHeadAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert self.head_dim * heads == embed_size, "Embed size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        attention = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        if mask is not None:
            attention = attention.masked_fill(mask == 0, float("-1e20"))

        attention = torch.softmax(attention / (self.embed_size ** (1 / 2)), dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out


# Positional Encoding

In [122]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len=500):
        super(PositionalEncoding, self).__init__()
        self.encoding = torch.zeros(max_len, embed_size)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        _2i = torch.arange(0, embed_size, step=2).float()

        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i/embed_size)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i/embed_size)))
        self.encoding = self.encoding.unsqueeze(0)

        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.encoding = self.encoding.to(self.device)  

    def forward(self, x):
        return self.encoding[:, :x.size(1), :].to(x.device)

# Transformer block

In [123]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out

# Transformer

In [124]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, max_length, num_classes_sentiment, num_classes_engagement):
        super(Transformer, self).__init__()
        self.embed_size = embed_size
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, max_length)

        self.layers = nn.ModuleList([
            TransformerBlock(embed_size, heads, dropout, forward_expansion) for _ in range(num_layers)
        ])

        self.dropout = nn.Dropout(dropout)

        self.sentiment_classifier = nn.Linear(embed_size, num_classes_sentiment)
        
        self.engagement_classifier = nn.Linear(embed_size, num_classes_engagement)

    def forward(self, x):
        out = self.dropout(self.word_embedding(x) + self.positional_encoding(x))
        for layer in self.layers:
            out = layer(out, out, out, None)

        out = out.mean(dim=1)  

        sentiment_output = self.sentiment_classifier(out)
        engagement_output = self.engagement_classifier(out)

        return sentiment_output, engagement_output

In [125]:
max_length = 128
batch_size = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vocab_size = tokenizer.get_vocab_size()

# Data Preparetion

In [126]:
def pad_sequences(vector, pad_idx, max_len):
    padded = vector + [pad_idx] * (max_len - len(vector))
    return padded[:max_len]

pad_idx = tokenizer.token_to_id("[PAD]")
df['padded_tokens'] = df['tokenized_content'].progress_apply(lambda x: pad_sequences(x, pad_idx, max_length))

Processing dataframe: 100%|██████████| 112730/112730 [00:01<00:00, 70043.40it/s]


In [127]:
padded_tokens_tensor = torch.tensor(df['padded_tokens'].tolist()).to(device)

sentiment_tensor = torch.tensor(df['sentiment'].values).to(device)
user_engagement_tensor = torch.tensor(df['user_engagement'].values).to(device)

In [128]:
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

train_data, test_data, train_labels_sentiment, test_labels_sentiment, train_labels_engagement, test_labels_engagement = train_test_split(
    padded_tokens_tensor, sentiment_tensor, user_engagement_tensor, test_size=0.2, random_state=42)

train_dataset = TensorDataset(train_data, train_labels_sentiment, train_labels_engagement)
test_dataset = TensorDataset(test_data, test_labels_sentiment, test_labels_engagement)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Training And Evaluation

In [129]:
def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

In [130]:
def export_model(model,model_name):
    torch.save(model.state_dict(), f'{model_name}.pt')

In [131]:
def train_model(model, train_loader, test_loader, num_epochs, optimizer):

    train_losses = []
    test_losses = []
    test_accuracies_sentiment = []
    test_accuracies_engagement = []

    loss_fn_sentiment = nn.CrossEntropyLoss()
    loss_fn_engagement = nn.CrossEntropyLoss()


    for epoch in range(num_epochs):
        model.train()

        train_loss = 0

        for inputs, sentiment_labels, engagement_labels in tqdm(train_loader):
            inputs = inputs.to(device)
            sentiment_labels = sentiment_labels.to(device)
            engagement_labels = engagement_labels.to(device)

            optimizer.zero_grad()

            # Forward pass
            sentiment_preds, engagement_preds = model(inputs)

            # Calculate loss
            loss_sentiment = loss_fn_sentiment(sentiment_preds, sentiment_labels)
            loss_engagement = loss_fn_engagement(
                engagement_preds, engagement_labels)
            loss = loss_sentiment + loss_engagement

            # Backward pass
            loss.backward()

            optimizer.step()


            train_loss += loss.item()

        train_losses.append(train_loss / len(train_data))

        model.eval()
        total_accuracy_sentiment = 0
        total_accuracy_engagement = 0
        test_loss = 0

        with torch.no_grad():
            for inputs, sentiment_labels, engagement_labels in tqdm(test_loader):
                inputs = inputs.to(device)
                sentiment_labels = sentiment_labels.to(device)
                engagement_labels = engagement_labels.to(device)

                sentiment_preds, engagement_preds = model(inputs)

                loss_sentiment = loss_fn_sentiment(sentiment_preds, sentiment_labels)
                loss_engagement = loss_fn_engagement(
                    engagement_preds, engagement_labels)
                loss = loss_sentiment + loss_engagement
                test_loss += loss.item()

                total_accuracy_sentiment += accuracy(sentiment_preds, sentiment_labels)
                total_accuracy_engagement += accuracy(
                    engagement_preds, engagement_labels)

        avg_test_loss = test_loss / len(test_loader)
        avg_accuracy_sentiment = total_accuracy_sentiment / len(test_loader)
        avg_accuracy_engagement = total_accuracy_engagement / len(test_loader)

        test_losses.append(avg_test_loss)
        test_accuracies_sentiment.append(avg_accuracy_sentiment)
        test_accuracies_engagement.append(avg_accuracy_engagement)

        print(
            f'Epoch {epoch+1}, Train Loss: {train_losses[-1]}, Test Loss: {avg_test_loss}, Test Accuracy Sentiment: {avg_accuracy_sentiment}, Test Accuracy Engagement: {avg_accuracy_engagement}')
        
    return model

# Fixed Learning Rate Multi Task Learning

In [133]:
lr = 3e-5
num_epochs = 10

model = Transformer(src_vocab_size=vocab_size, 
                    embed_size=128,
                    num_layers=4,
                    heads=8,
                    forward_expansion=4,
                    dropout=0.1,
                    max_length=max_length,
                    num_classes_sentiment=3,
                    num_classes_engagement=3
                )

model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [134]:
model = train_model(model, train_loader, test_loader, num_epochs, optimizer)
export_model(model,'transformer_model_with_fixed_learning_rate')

100%|██████████| 2819/2819 [00:55<00:00, 50.62it/s]
100%|██████████| 705/705 [00:04<00:00, 146.42it/s]


Epoch 1, Train Loss: 0.03592141743514422, Test Loss: 1.0685446591241985, Test Accuracy Sentiment: 0.6386820673942566, Test Accuracy Engagement: 0.9207988977432251


100%|██████████| 2819/2819 [00:55<00:00, 50.84it/s]
100%|██████████| 705/705 [00:04<00:00, 147.43it/s]


Epoch 2, Train Loss: 0.031948474652461784, Test Loss: 0.9563935298446222, Test Accuracy Sentiment: 0.7190209031105042, Test Accuracy Engagement: 0.9244336485862732


100%|██████████| 2819/2819 [00:56<00:00, 50.16it/s]
100%|██████████| 705/705 [00:04<00:00, 141.75it/s]


Epoch 3, Train Loss: 0.029664756139168816, Test Loss: 0.9200128873612018, Test Accuracy Sentiment: 0.7487834692001343, Test Accuracy Engagement: 0.9254727959632874


100%|██████████| 2819/2819 [00:56<00:00, 50.24it/s]
100%|██████████| 705/705 [00:04<00:00, 146.78it/s]


Epoch 4, Train Loss: 0.02818497419416783, Test Loss: 0.902097405736328, Test Accuracy Sentiment: 0.752058744430542, Test Accuracy Engagement: 0.9253989458084106


100%|██████████| 2819/2819 [00:56<00:00, 49.70it/s]
100%|██████████| 705/705 [00:04<00:00, 146.70it/s]


Epoch 5, Train Loss: 0.027474813145105116, Test Loss: 0.8676464684888826, Test Accuracy Sentiment: 0.7630712985992432, Test Accuracy Engagement: 0.9251871705055237


100%|██████████| 2819/2819 [00:58<00:00, 47.99it/s]
100%|██████████| 705/705 [00:05<00:00, 136.24it/s]


Epoch 6, Train Loss: 0.02700981883611208, Test Loss: 0.8810238599354494, Test Accuracy Sentiment: 0.7582692503929138, Test Accuracy Engagement: 0.9284130930900574


100%|██████████| 2819/2819 [00:57<00:00, 48.85it/s]
100%|██████████| 705/705 [00:05<00:00, 140.47it/s]


Epoch 7, Train Loss: 0.02667090955817328, Test Loss: 0.8448654973337836, Test Accuracy Sentiment: 0.7703063488006592, Test Accuracy Engagement: 0.9281570315361023


100%|██████████| 2819/2819 [01:01<00:00, 46.12it/s]
100%|██████████| 705/705 [00:05<00:00, 121.94it/s]


Epoch 8, Train Loss: 0.026295042641782763, Test Loss: 0.8530412191617573, Test Accuracy Sentiment: 0.7711091637611389, Test Accuracy Engagement: 0.9292996525764465


100%|██████████| 2819/2819 [01:02<00:00, 45.30it/s]
100%|██████████| 705/705 [00:05<00:00, 135.34it/s]


Epoch 9, Train Loss: 0.026012080003216878, Test Loss: 0.8398456337604117, Test Accuracy Sentiment: 0.77036052942276, Test Accuracy Engagement: 0.9297626614570618


100%|██████████| 2819/2819 [00:59<00:00, 47.01it/s]
100%|██████████| 705/705 [00:04<00:00, 145.04it/s]


Epoch 10, Train Loss: 0.025752227959278198, Test Loss: 0.8264187521968328, Test Accuracy Sentiment: 0.7827373743057251, Test Accuracy Engagement: 0.9242316484451294


# Task 3: Training Considerations and Transfer Learning Strategy

When training a multi-task learning model like the Sentence Transformer adapted for tasks such as sentiment analysis and engagement prediction, several training strategies can be employed. Each has implications on the model's learning dynamics and performance:

## Scenario 1: Freezing the Entire Network
- **Implications:** Freezing the entire network means that all the weights are kept constant, and no learning occurs during training. This scenario is typically used when you apply a pre-trained model directly to a new task without any fine-tuning. It assumes the pre-trained weights are optimal for the new tasks without any adjustments.
- **Advantages:** The main advantage is computational efficiency; no backpropagation is needed, and the model serves purely as a feature extractor. This can be useful in highly resource-constrained environments or when the pre-trained model is exceptionally well-aligned with the new tasks.
- **Rationale:** Freezing the entire network is generally not recommended unless the new tasks are very similar to the tasks on which the model was originally trained. The lack of adaptability can lead to suboptimal performance if the tasks differ significantly.

## Scenario 2: Freezing Only the Transformer Backbone
- **Implications:** In this scenario, the shared transformer layers are frozen, and only the task-specific heads are trainable. This approach assumes that the shared layers already capture universal language features effectively and that only the final task-specific adaptations need learning.
- **Advantages:** This method balances the benefits of transfer learning with the flexibility of task-specific tuning. It can lead to faster training and lower risk of overfitting the shared layers while allowing the model to adapt to the specifics of each task through the trainable heads.
- **Rationale:** Freezing the backbone while training the heads is suitable when the pre-trained model's general features are relevant to the new tasks, but some adaptation is still required to optimize performance on specific task metrics.

## Scenario 3: Freezing Only One of the Task-Specific Heads
- **Implications:** Freezing one task-specific head while training the other allows for asymmetric learning where one task is considered stable or less important to optimize than the other. This might be used when one task is already performing at acceptable levels with pre-trained settings.
- **Advantages:** This selective training focuses computational resources and model capacity on improving where it is most needed, potentially enhancing performance on a more challenging or impactful task without disturbing a satisfactory performance on another.
- **Rationale:** Such a strategy would be adopted in a situation where improving performance on one task can lead to significant business or operational gains, while changes in the other are less beneficial or might even risk destabilizing established functionalities.

## Transfer Learning Strategy
When implementing a transfer learning strategy with a pre-trained model, consider the following steps:

1. **Choice of a Pre-trained Model:**
   - Select a model that has been trained on a large, comprehensive dataset similar to the tasks at hand, such as BERT or RoBERTa, which are trained on vast amounts of general text and are capable of understanding complex language patterns.

2. **Layers to Freeze/Unfreeze:**
   - **Freeze Early Layers:** Typically, earlier layers in transformer models capture more general linguistic features (e.g., syntax and common semantics), which are usually beneficial across different tasks and domains.
   - **Unfreeze Later Layers:** Later layers, especially those closer to the output, tend to capture more task-specific features. Unfreezing these allows the model to adapt these layers to the specifics of the new tasks.

3. **Rationale Behind Choices:**
   - **Preserve General Features:** By freezing the early layers, the model preserves the robust features learned from large-scale data, reducing the risk of forgetting essential language understanding capabilities.
   - **Adapt to Specific Tasks:** Unfreezing the later layers and the task-specific heads allows the model to adapt to the nuances of the specific tasks it is being fine-tuned for, improving its relevance and effectiveness on these tasks.

Implementing these considerations and strategies ensures that the model benefits from the strengths of the pre-trained weights while still adapting sufficiently to excel at the new tasks. This approach optimizes the use of computational resources, enhances model performance, and mitigates the risks associated with overfitting and catastrophic forgetting.

# Task 4: Layer-wise Learning Rate Implementation

Implementing layer-wise learning rates in training deep neural networks is a sophisticated technique that tailors the learning process to the specifics of each layer's role within the model. This approach can optimize training dynamics, leading to more effective and efficient learning. Here's a deeper explanation of why different learning rates were set for each layer in the context of a multi-task sentence transformer model:

## Rationale for Layer-wise Learning Rates

### 1. **Base Learning Rate for Embeddings and Transformer Blocks:**
   - **Rate:** `base_lr = 3e-6`
   - **Reason:** The embedding layer and the transformer blocks form the foundation of the model, capturing general linguistic and contextual information from the input text. These layers are typically pre-trained on large datasets and are highly sensitive. A lower learning rate is used here to make fine, cautious adjustments, preserving the rich, pre-trained features while preventing drastic changes that might lead to forgetting useful information.

### 2. **Learning Rate for Sentiment Classifier:**
   - **Rate:** `sentiment_classifier_lr = 3e-5`
   - **Reason:** The sentiment classifier tailors the output of the shared transformer architecture to a specific task — sentiment analysis. A higher learning rate compared to the base layers allows this classifier to quickly adapt to the nuances of sentiment classification. However, it was noted that the sentiment classifier's accuracy was initially low, suggesting that the task might be more complex or that the initial parameters were not optimal. A slightly conservative rate (relative to the engagement classifier) was therefore chosen to facilitate more stable and gradual learning, enhancing its ability to refine its parameters without drastic oscillations.


### 3. **Learning Rate for Engagement Classifier:**
   - **Rate:** `engagement_classifier_lr = 3e-4`
   - **Reason:** Engagement prediction might be somewhat less complex or differently characterized compared to sentiment analysis, or it might benefit from more aggressive updates due to different initial performance baselines. Therefore, a higher learning rate is employed to enable faster learning adjustments, allowing the model to quickly optimize its predictions based on engagement-specific feedback.

## Benefits of Layer-wise Learning Rates in Multi-task Settings

### Enhanced Task-specific Adaptation:
   - **Multi-task Efficiency:** By using different learning rates, each part of the model can learn at a pace suitable for its specific task and complexity level. This is particularly beneficial in a multi-task setting where different tasks may have varying degrees of difficulty and data characteristics.
   - **Prevents Overfitting:** Lower rates in foundational layers help prevent overfitting by ensuring that these layers, which are responsible for capturing universal features, do not change too rapidly. This stability is crucial when the model is applied across multiple tasks that might pull the foundational layers in different directions.
   - **Encourages Task-specific Fine-tuning:** Higher rates in the task-specific layers encourage these layers to fine-tune aggressively to their respective tasks, making the model more responsive to task-specific signals without affecting the shared layers.


# Layer wise Learning Rates

In [111]:
base_lr = 3e-6  # Lower learning rate for embeddings and transformer blocks
sentiment_classifier_lr = 3e-5  # Higher learning rate for classifiers
engagement_classifier_lr = 3e-4  # Higher learning rate for classifiers

num_epochs = 10

optimizer = torch.optim.Adam([
    {'params': model.word_embedding.parameters(), 'lr': base_lr},
    {'params': model.positional_encoding.parameters(), 'lr': base_lr},
    {'params': [p for layer in model.layers for p in layer.parameters()], 'lr': base_lr},
    {'params': model.sentiment_classifier.parameters(), 'lr': sentiment_classifier_lr},
    {'params': model.engagement_classifier.parameters(), 'lr': engagement_classifier_lr}
])


model = Transformer(src_vocab_size=vocab_size, 
                    embed_size=128,
                    num_layers=4,
                    heads=8,
                    forward_expansion=4,
                    dropout=0.1,
                    max_length=max_length,
                    num_classes_sentiment=3,
                    num_classes_engagement=3
                )

model.to(device)

Transformer(
  (word_embedding): Embedding(30000, 128)
  (positional_encoding): PositionalEncoding()
  (layers): ModuleList(
    (0-3): 4 x TransformerBlock(
      (attention): MultiHeadAttention(
        (values): Linear(in_features=16, out_features=16, bias=False)
        (keys): Linear(in_features=16, out_features=16, bias=False)
        (queries): Linear(in_features=16, out_features=16, bias=False)
        (fc_out): Linear(in_features=128, out_features=128, bias=True)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (feed_forward): Sequential(
        (0): Linear(in_features=128, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=128, bias=True)
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (sentiment_classifier): Linear(in_features=128, out_features=3, bias=True)
  (eng

In [135]:
model = train_model(model, train_loader, test_loader, num_epochs, optimizer)  
export_model(model,'transformer_model_with_layer_wise_learning_rate')

100%|██████████| 2819/2819 [01:02<00:00, 45.05it/s]
100%|██████████| 705/705 [00:05<00:00, 120.68it/s]


Epoch 1, Train Loss: 0.025468018559272547, Test Loss: 0.804600294874915, Test Accuracy Sentiment: 0.7830624580383301, Test Accuracy Engagement: 0.9308067560195923


100%|██████████| 2819/2819 [01:05<00:00, 43.14it/s]
100%|██████████| 705/705 [00:05<00:00, 131.69it/s]


Epoch 2, Train Loss: 0.025266446947131448, Test Loss: 0.8190241541845579, Test Accuracy Sentiment: 0.7803486585617065, Test Accuracy Engagement: 0.9311367869377136


100%|██████████| 2819/2819 [01:03<00:00, 44.36it/s]
100%|██████████| 705/705 [00:05<00:00, 127.61it/s]


Epoch 3, Train Loss: 0.025009157959757455, Test Loss: 0.8058740529819584, Test Accuracy Sentiment: 0.784727156162262, Test Accuracy Engagement: 0.9278663992881775


100%|██████████| 2819/2819 [01:03<00:00, 44.22it/s]
100%|██████████| 705/705 [00:05<00:00, 131.51it/s]


Epoch 4, Train Loss: 0.024882146644341664, Test Loss: 0.8122115374245542, Test Accuracy Sentiment: 0.7832298874855042, Test Accuracy Engagement: 0.9313485026359558


100%|██████████| 2819/2819 [01:04<00:00, 43.71it/s]
100%|██████████| 705/705 [00:05<00:00, 139.81it/s]


Epoch 5, Train Loss: 0.024684242282588535, Test Loss: 0.7910871576332877, Test Accuracy Sentiment: 0.7903122901916504, Test Accuracy Engagement: 0.9303634762763977


100%|██████████| 2819/2819 [00:59<00:00, 47.22it/s]
100%|██████████| 705/705 [00:04<00:00, 152.19it/s]


Epoch 6, Train Loss: 0.024463480261291533, Test Loss: 0.8043776453809536, Test Accuracy Sentiment: 0.7868301868438721, Test Accuracy Engagement: 0.9278122782707214


100%|██████████| 2819/2819 [01:00<00:00, 46.64it/s]
100%|██████████| 705/705 [00:05<00:00, 120.97it/s]


Epoch 7, Train Loss: 0.02433282113751777, Test Loss: 0.7815935524642891, Test Accuracy Sentiment: 0.7929028868675232, Test Accuracy Engagement: 0.9290090799331665


100%|██████████| 2819/2819 [01:01<00:00, 45.57it/s]
100%|██████████| 705/705 [00:05<00:00, 136.52it/s]


Epoch 8, Train Loss: 0.02423191926319238, Test Loss: 0.7823284150438106, Test Accuracy Sentiment: 0.7922379970550537, Test Accuracy Engagement: 0.9280585050582886


100%|██████████| 2819/2819 [01:00<00:00, 46.66it/s]
100%|██████████| 705/705 [00:06<00:00, 112.52it/s]


Epoch 9, Train Loss: 0.024080614274081674, Test Loss: 0.7824346129564529, Test Accuracy Sentiment: 0.7927156686782837, Test Accuracy Engagement: 0.9303634762763977


100%|██████████| 2819/2819 [01:26<00:00, 32.76it/s]
100%|██████████| 705/705 [00:04<00:00, 150.12it/s]


Epoch 10, Train Loss: 0.023964314545471453, Test Loss: 0.7993287307573549, Test Accuracy Sentiment: 0.7912726402282715, Test Accuracy Engagement: 0.9278811812400818
