In [None]:
!pip install torch transformers

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import BertModel, BertTokenizer


# Model

In [3]:
class TransformerModel(nn.Module):
    def __init__(self, model_name='bert-base-uncased', num_labels_task_a=3, num_labels_task_b=3):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)  # for Task 1 / sentence encoding
        self.dropout = nn.Dropout(0.1)  # for Task 2
        self.task_a = nn.Linear(self.bert.config.hidden_size, num_labels_task_a)  # for Task 2
        self.task_b = nn.Linear(self.bert.config.hidden_size, num_labels_task_b)  # for Task 2

    def forward(self, input_ids, attention_mask, task='a'):  # for Task 2
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        output = self.dropout(outputs.pooler_output)
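        if task == 'a':
            return self.task_a(output)
        elif task == 'b':
            return self.task_b(output)
        # Fail loudly instead of silently returning None for an unknown task.
        raise ValueError(f"Unknown task {task!r}; expected 'a' or 'b'.")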

# Task 1


In [12]:
sentences = ["My favorite animal is the dog.",
            "I am a Georgia Tech graduate.",
            "I don't like missing out on savings."]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

model = TransformerModel()

with torch.no_grad():
  embeddings = model.bert(**inputs).pooler_output

In [13]:
print(embeddings)
print(embeddings.shape)

tensor([[-0.8644, -0.3422, -0.8039,  ..., -0.5739, -0.6053,  0.8960],
        [-0.8271, -0.5030, -0.8600,  ..., -0.8143, -0.6632,  0.8940],
        [-0.7990, -0.4106, -0.8943,  ..., -0.7665, -0.6276,  0.9231]])
torch.Size([3, 768])


I chose the pre-trained bert-base-uncased model as my transformer backbone because it is an effective and well-documented option for simpler NLP use cases. To obtain the sentence embeddings, I first tokenized the example sentences with the bert-base-uncased tokenizer, which splits the text into subwords recognized by the model and maps each one to its ID in the model's vocabulary. I then passed the token IDs and attention mask to the model and took the pooler output as the embedding. The pooler output is derived from the hidden state of the first ([CLS]) token, so it yields one fixed-size vector per sentence (768 dimensions here) rather than one vector per token, which is why it is commonly used for sentence-level tasks like sentence classification.
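To make the relationship between the pooler output and the [CLS] token concrete, here is a minimal sketch. It assumes a tiny, randomly initialized BERT (the config sizes and token IDs are made up for illustration) so that nothing has to be downloaded; the weights are therefore meaningless, and only the shapes and the computation path are of interest.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny config: random weights, no download required.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
tiny_bert = BertModel(config).eval()  # eval() disables dropout

# Two made-up "sentences", padded with ID 0 and masked accordingly.
input_ids = torch.tensor([[1, 5, 6, 2, 0],
                          [1, 7, 2, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0],
                               [1, 1, 1, 0, 0]])

with torch.no_grad():
    out = tiny_bert(input_ids=input_ids, attention_mask=attention_mask)
    # Recompute the pooled vector by hand: first-token hidden state,
    # then the pooler's dense layer, then tanh.
    first_token = out.last_hidden_state[:, 0]
    manual_pooled = torch.tanh(tiny_bert.pooler.dense(first_token))

print(out.last_hidden_state.shape)  # one vector per token
print(out.pooler_output.shape)      # one vector per sentence
print(torch.allclose(out.pooler_output, manual_pooled, atol=1e-6))
```

With bert-base-uncased the same check should hold, with a hidden size of 768 instead of 32.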

# Task 2

For Task 1, I only needed the pre-trained BERT model. For Task 2, I first added a dropout layer, which can reduce overfitting by preventing the network from becoming overly reliant on specific nodes and encouraging it to use others. I then added one linear layer for Task A (sentence classification) and another for Task B (sentiment analysis). The labels for Task A would be animals, school, and savings, and the labels for Task B would be positive, neutral, and negative.

# Task 3

Training: <br>
1. If the whole model is frozen, none of its weights can be updated, so it cannot be trained. The model will likely be suboptimal for more specific use cases, but it can be used immediately without training (assuming the task heads already hold useful weights) and cannot overfit our current dataset. These properties make a fully frozen model useful for building a quick prototype when compute or data is very limited.
2. If the transformer backbone is frozen, the linear layers for Tasks A and B can still adapt to the training dataset, but the main BERT model's weights remain fixed. This setup has similar but less extreme pros and cons compared to freezing the whole model. Because the backbone is not updated, training takes less time and computing power, and the risk of overfitting to a limited dataset is lower. The key downside is that the backbone's weights cannot adapt to the specific tasks the model performs. This setup suits simple tasks with limited datasets; if the task is complex or we have a large dataset, freezing the backbone could limit the potential of our model.
3. If only one task-specific head is frozen (say, Task B's), that head stays unchanged during training while Task A's performance can still improve. This scenario is therefore best when we are content with one task's performance but want to improve the other's. A possible downside is that if the transformer backbone's weights shift, Task B's head may no longer be as compatible with the backbone as it was. Thus, when training with this setup, the loss function should still account for both Task A and Task B so that the backbone does not neglect Task B performance.
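The three freezing scenarios above can be sketched with `requires_grad`. This is a minimal illustration, not the notebook's training code: `TinyMultiTask` is a hypothetical stand-in that reuses the attribute names of `TransformerModel` (`bert`, `task_a`, `task_b`) but replaces the backbone with a single linear layer so the example runs instantly, and `set_trainable` and `trainable_names` are helper names I made up.

```python
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in sharing TransformerModel's attribute names."""
    def __init__(self, hidden=8, num_labels=3):
        super().__init__()
        self.bert = nn.Linear(16, hidden)  # placeholder for the BERT backbone
        self.task_a = nn.Linear(hidden, num_labels)
        self.task_b = nn.Linear(hidden, num_labels)

def set_trainable(module, trainable):
    """Freeze (False) or unfreeze (True) every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def trainable_names(model):
    """List the parameters the optimizer would still update."""
    return [name for name, p in model.named_parameters() if p.requires_grad]

# Scenario 1: the whole model is frozen.
frozen_all = TinyMultiTask()
set_trainable(frozen_all, False)

# Scenario 2: only the backbone is frozen; both heads stay trainable.
frozen_backbone = TinyMultiTask()
set_trainable(frozen_backbone.bert, False)

# Scenario 3: only the Task B head is frozen.
frozen_head_b = TinyMultiTask()
set_trainable(frozen_head_b.task_b, False)

for label, m in [("all frozen", frozen_all),
                 ("backbone frozen", frozen_backbone),
                 ("head B frozen", frozen_head_b)]:
    print(label, trainable_names(m))
```

The same `set_trainable` calls should apply unchanged to `TransformerModel`, since it exposes the same three submodules.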

Transfer Learning: <br>
Transfer learning can be useful in scenarios like our current one, where we apply an existing pre-trained model like bert-base-uncased to new tasks like sentence classification and sentiment analysis. An example of another scenario could be animal classification based on images. We can use a pre-trained model like ResNet50 and fine-tune it with a dataset of animal images. ResNet50 is a good choice because it is trained on the vast ImageNet dataset and can recognize different objects and colors. Of the model's 50 layers, I would freeze the first 10 throughout all of training; these layers primarily detect basic features common to all images such as edges and colors. I would keep the last 5 layers and the classification head unfrozen through the whole training process because these layers primarily focus on more specific features of images. If we are only working with animals and want to focus on differentiating them, it would be beneficial for our model's last few layers to learn the nuances between different images of animals. Finally, I would initially freeze the middle layers and experiment with unfreezing some of them after a few rounds of training; these layers can contain some knowledge that applies to all images and some that may apply to images specifically in the dataset. Thus, unfreezing them could help the model adapt to the task of animal classification, but it may also lead to overfitting our dataset.
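The staged freezing plan described above could look roughly like the sketch below. To keep it self-contained, `ToyResNet` is a toy stand-in: its stage names mirror the attributes of torchvision's ResNet (`conv1`, `layer1` to `layer4`, `fc`), but each stage is a single small convolution rather than real residual blocks. Mapping "the first 10 layers" to `conv1` plus `layer1`, and "the last few layers" to `layer4` plus `fc`, is my own approximation, and `apply_freeze_plan` is a hypothetical helper.

```python
import torch
import torch.nn as nn

class ToyResNet(nn.Module):
    """Toy stand-in; stage names follow torchvision's ResNet attributes."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.layer1 = nn.Conv2d(4, 4, kernel_size=3, padding=1)
        self.layer2 = nn.Conv2d(4, 4, kernel_size=3, padding=1)
        self.layer3 = nn.Conv2d(4, 4, kernel_size=3, padding=1)
        self.layer4 = nn.Conv2d(4, 4, kernel_size=3, padding=1)
        self.fc = nn.Linear(4, num_classes)

    def forward(self, x):
        for stage in (self.conv1, self.layer1, self.layer2,
                      self.layer3, self.layer4):
            x = torch.relu(stage(x))
        return self.fc(x.mean(dim=(2, 3)))  # global average pool, then classify

ALWAYS_FROZEN = ("conv1", "layer1")      # generic edge/colour features
INITIALLY_FROZEN = ("layer2", "layer3")  # candidates for later unfreezing

def apply_freeze_plan(model, unfreeze_middle):
    """Early stages stay frozen, late stages stay trainable, middle is optional."""
    for name, param in model.named_parameters():
        stage = name.split(".")[0]
        if stage in ALWAYS_FROZEN:
            param.requires_grad = False
        elif stage in INITIALLY_FROZEN:
            param.requires_grad = unfreeze_middle
        else:
            param.requires_grad = True

def trainable_stages(model):
    return sorted({n.split(".")[0] for n, p in model.named_parameters()
                   if p.requires_grad})

model = ToyResNet()
apply_freeze_plan(model, unfreeze_middle=False)  # first rounds of training
phase_one = trainable_stages(model)
apply_freeze_plan(model, unfreeze_middle=True)   # later experiment
phase_two = trainable_stages(model)
print(phase_one, phase_two)
```

Whether unfreezing the middle stages helps or causes overfitting would have to be checked against a validation set.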

# Task 4

In [5]:
class PlaceholderDataset(Dataset):
    def __len__(self):
        return 0

    def __getitem__(self, idx):  # should return (input_ids, attention_mask, labels)
        raise IndexError("This dataset is empty.")

def compute_accuracy(logits, labels):
    preds = torch.argmax(logits, dim=1)
    return (preds == labels).float().mean().item()

In [6]:
loader_a = DataLoader(PlaceholderDataset(), batch_size=1)
loader_b = DataLoader(PlaceholderDataset(), batch_size=1)

model = TransformerModel()
combined_batches = list(zip(loader_a, loader_b))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small learning rate, typical for fine-tuning BERT
epochs = 1

In [14]:
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    total_acc_a = 0.0
    total_acc_b = 0.0
    for batch_a, batch_b in combined_batches:
        input_ids_a, attn_mask_a, labels_a = batch_a
        input_ids_b, attn_mask_b, labels_b = batch_b

        logits_a = model(input_ids_a, attn_mask_a, task='a')
        logits_b = model(input_ids_b, attn_mask_b, task='b')
        loss_a = loss_fn(logits_a, labels_a)
        loss_b = loss_fn(logits_b, labels_b)
        # Sum the loss tensors (not .item() floats) so backward() has a graph to follow.
        loss = loss_a + loss_b
        acc_a = compute_accuracy(logits_a, labels_a)
        acc_b = compute_accuracy(logits_b, labels_b)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()  # accumulate a plain float for logging
        total_acc_a += acc_a
        total_acc_b += acc_b

    print(f"Epoch {epoch}:")
    num_steps = len(combined_batches)  # zip() stops at the shorter loader
    if num_steps == 0:
        print("Not enough data")
    else:
        print(f"Loss: {total_loss}, Task A Accuracy: {total_acc_a / num_steps}, "
              f"Task B Accuracy: {total_acc_b / num_steps}")

Epoch 0:
Not enough data


I chose cross-entropy loss and the AdamW optimizer because they are commonly used and effective for fine-tuning BERT-based models. To handle the multi-task learning (MTL) setup, each training step pairs one Task A batch with one Task B batch rather than feeding the model all Task A samples and then all Task B samples. The sequential approach could bias the shared layers toward Task B, the task trained last, at the expense of Task A if those layers are not frozen. The forward pass for each task takes the input IDs and attention mask for a batch of sentences; the attention mask distinguishes real tokens (representing words or subwords) from padding tokens. I represent the loss as the sum of the Task A and Task B losses, so a single backward pass and optimizer step update the shared backbone and both heads. compute_accuracy() computes the fraction of correct predictions within each batch for each task.
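Since the placeholder dataset is empty, the loop above never executes a training step. The sketch below runs one joint step on made-up random data to show that summing the two loss tensors sends gradients to the shared encoder and both heads. `TinyMultiTask` is a hypothetical stand-in (an embedding averaged over tokens instead of BERT, with no attention mask), and the batch sizes, vocabulary size, and learning rate are arbitrary; `compute_accuracy` is copied from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: shared encoder plus one head per task."""
    def __init__(self, vocab=50, hidden=8, num_labels=3):
        super().__init__()
        self.encoder = nn.Embedding(vocab, hidden)  # placeholder for BERT
        self.task_a = nn.Linear(hidden, num_labels)
        self.task_b = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, task):
        pooled = self.encoder(input_ids).mean(dim=1)  # average over tokens
        head = self.task_a if task == 'a' else self.task_b
        return head(pooled)

def compute_accuracy(logits, labels):
    preds = torch.argmax(logits, dim=1)
    return (preds == labels).float().mean().item()

model = TinyMultiTask()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Made-up batches: 4 sequences of 6 token IDs each, 3 classes per task.
ids_a, labels_a = torch.randint(0, 50, (4, 6)), torch.randint(0, 3, (4,))
ids_b, labels_b = torch.randint(0, 50, (4, 6)), torch.randint(0, 3, (4,))

logits_a = model(ids_a, task='a')
logits_b = model(ids_b, task='b')
loss = loss_fn(logits_a, labels_a) + loss_fn(logits_b, labels_b)  # still a tensor

optimizer.zero_grad()
loss.backward()
has_grad = {name: p.grad is not None for name, p in model.named_parameters()}
optimizer.step()
print(has_grad)

# Worked accuracy example: predictions are [0, 1, 2, 0], so 3 of 4 match.
demo_logits = torch.tensor([[2.0, 1.0, 0.0],
                            [0.0, 3.0, 1.0],
                            [1.0, 0.0, 5.0],
                            [4.0, 0.0, 0.0]])
demo_labels = torch.tensor([0, 1, 1, 0])
demo_acc = compute_accuracy(demo_logits, demo_labels)
print(demo_acc)  # 0.75
```

An unweighted sum treats both tasks equally; weighting the two terms differently is a possible variation if one task dominates training.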