## Task 1 Sentence Transformer Implementation

In [2]:
import torch
from transformers import AutoModel, AutoTokenizer
import numpy as np



In [12]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)



In [13]:
def get_sentence_embedding(sentence, tokenizer, model):
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        # Take the output of the last layer and calculate the average of all tokens as the sentence embedding
        embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.numpy()
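
The plain `mean(dim=1)` above averages over every token position. That is fine for a single sentence, but if several sentences of different lengths were encoded in one batch, padding positions would be averaged in as well. Below is a minimal sketch of attention-mask-weighted mean pooling, assuming batched inputs; `masked_mean_pool` is a hypothetical helper name and is not part of the original notebook.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings while ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 1s (real tokens) and 0s (padding)
    """
    # Broadcast the mask over the hidden dimension: (batch, seq_len, 1)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Toy check with made-up numbers: the third position is padding and must be ignored
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 0 + 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

In `get_sentence_embedding`, this would take `outputs.last_hidden_state` and `inputs['attention_mask']` in place of the unweighted mean.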


In [14]:
# test 
sentences = ["Fetch Rewards is a good company!", "I need this job.", "Deep learning models are powerful."]

for sentence in sentences:
    embedding = get_sentence_embedding(sentence, tokenizer, model)
    print(f"Sentence: {sentence}")
    print("Embedding:", embedding[0][:10], "...")  # display only the first 10 values


Sentence: Fetch Rewards is a good company!
Embedding: [ 0.03928689 -0.00143429  0.2790085  -0.02717233 -0.14779724 -0.31053677
  0.1645635   0.37024096 -0.18737513 -0.44514358] ...
Sentence: I need this job.
Embedding: [ 0.31491905  0.1626341  -0.14195605 -0.28827474  0.25789863 -0.37616846
  0.3095152   0.5361839   0.06199948 -0.6030423 ] ...
Sentence: Deep learning models are powerful.
Embedding: [-0.08263097 -0.17430131 -0.03122409  0.19004294  0.04576752 -0.34798932
 -0.03459389  0.29230028  0.28511918 -0.5801908 ] ...
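
One common way to use such embeddings is to compare sentences by cosine similarity. The sketch below assumes each embedding is a NumPy array of shape `(1, hidden_size)`, as returned by `get_sentence_embedding`. The helper name `cosine_similarity` and the toy vectors are illustrative, not taken from the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (flattened to 1-D)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real sentence embeddings
u = np.array([[1.0, 0.0, 1.0]])
v = np.array([[1.0, 0.0, 1.0]])
w = np.array([[0.0, 1.0, 0.0]])

same = cosine_similarity(u, v)        # identical direction
orthogonal = cosine_similarity(u, w)  # no overlap
print(same, orthogonal)
```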


## Task 2 Multi-Task Learning Expansion

In [15]:
import torch
from transformers import AutoModel, AutoTokenizer, BertForTokenClassification, BertForSequenceClassification
from torch import nn

In [18]:
# Multi-task model
class MultiTaskBert(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        # Use the appropriate num_labels for each task
        self.base_model = BertForSequenceClassification.from_pretrained(base_model_name, num_labels=3)
        self.ner_head = BertForTokenClassification.from_pretrained(base_model_name, num_labels=5)

    def forward(self, inputs, task=None):
        if task == 'classification':
            # For classification, use the sequence classification head
            return self.base_model(**inputs)
        elif task == 'ner':
            # For NER, use the token classification head
            return self.ner_head(**inputs)
        # Fail loudly instead of silently returning None for an unknown task
        raise ValueError(f"Unknown task: {task!r}")

# Load the model
model = MultiTaskBert(model_name)

# Run one sentence through the requested task head and return its logits
def get_predictions(sentence, tokenizer, model, task):
    # tokenize the input sentence appropriately
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True, max_length=128)
    # we do not pass token_type_ids if not necessary
    inputs = {key: val for key, val in inputs.items() if key in ['input_ids', 'attention_mask']}
    if task not in ("classification", "ner"):
        raise ValueError(f"Unknown task: {task!r}")
    outputs = model(inputs, task=task)
    return outputs.logits


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# Example usage
sentence_classification = "The new iPhone model will release next month."
sentence_ner = "John and Lisa went to Microsoft to work."

classification_logits = get_predictions(sentence_classification, tokenizer, model, 'classification')
ner_logits = get_predictions(sentence_ner, tokenizer, model, 'ner')

print("Classification Logits:", classification_logits)
print("NER Logits:", ner_logits)

Classification Logits: tensor([[ 0.1053, -0.1972, -0.1163]], grad_fn=<AddmmBackward0>)
NER Logits: tensor([[[-6.0896e-01,  3.4098e-02,  2.9981e-01,  1.4429e-02, -3.8531e-02],
         [ 2.7411e-02,  2.6915e-01,  1.9779e-01, -3.6209e-01, -2.1058e-01],
         [ 2.1264e-01, -6.8850e-02,  1.5153e-01,  3.7764e-02, -6.8783e-01],
         [-1.0392e-04,  3.1284e-01, -5.6625e-02, -4.8807e-01, -2.6233e-01],
         [ 4.4822e-01,  2.7880e-01,  1.9938e-01, -2.0946e-01, -1.0301e-01],
         [ 3.3488e-01,  6.5618e-01, -1.4251e-01, -4.8331e-01, -7.3636e-02],
         [ 9.8330e-02, -3.4392e-01,  3.1805e-01, -6.3378e-01, -2.9194e-01],
         [ 4.4972e-01,  5.7168e-01, -4.0045e-01, -4.0394e-01,  2.2617e-01],
         [ 4.2549e-01,  2.8090e-01, -4.2572e-03,  7.6852e-03, -3.1718e-02],
         [ 2.3452e-01,  3.8745e-01, -3.8035e-02, -2.3010e-01,  6.1665e-02],
         [ 4.0184e-02,  1.3816e-01, -7.4163e-03, -3.7852e-01,  1.3261e-01]]],
       grad_fn=<ViewBackward0>)
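
The raw logits printed above can be turned into class predictions with a softmax followed by an argmax over the last dimension. This is a minimal sketch: `logits_to_predictions` is a hypothetical helper, the classification values are copied from the output above, and the two NER rows are rounded from its first two rows. Because both heads are newly initialized, the resulting predictions carry no meaning until the model is trained.

```python
import torch

def logits_to_predictions(logits):
    """Return (predicted class ids, class probabilities) from raw logits."""
    probs = torch.softmax(logits, dim=-1)  # normalize over the label dimension
    return probs.argmax(dim=-1), probs

# Sentence-level logits: shape (batch, num_labels)
cls_logits = torch.tensor([[0.1053, -0.1972, -0.1163]])
cls_ids, cls_probs = logits_to_predictions(cls_logits)

# Token-level logits: shape (batch, seq_len, num_labels)
ner_logits = torch.tensor([[[-0.61, 0.03, 0.30, 0.01, -0.04],
                            [0.03, 0.27, 0.20, -0.36, -0.21]]])
ner_ids, ner_probs = logits_to_predictions(ner_logits)

print("Predicted class id:", cls_ids.tolist())
print("Predicted label id per token:", ner_ids.tolist())
```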


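To support multi-task learning, the architecture pairs a BERT base model with two task-specific heads. The BERT encoder provides a general representation of the input sentence that both tasks can use. In the code above, each head wraps its own copy of the encoder loaded from the same checkpoint; for the encoder weights to be truly shared, both heads would need to sit on a single encoder instance. On top of the encoder, we added: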

**A Classification Head:** This is a sequence classification head that takes the output of the BERT model (the pooled representation of the [CLS] token) and predicts a category label. In `BertForSequenceClassification` the head is dropout followed by a single linear layer; here it produces 3 logits, one per class.

**A Named Entity Recognition (NER) Head:** This head operates on every token output from the BERT model. In `BertForTokenClassification` it is dropout followed by a linear token classification layer that predicts a label for each token, indicating whether the token belongs to a named entity and which type of entity it is; here it produces 5 logits per token.

These modifications enable the model to perform two distinct tasks, sentence classification and named entity recognition, on top of the same pre-trained language model. Sharing knowledge in this way can improve learning efficiency and performance on related tasks.
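
For comparison, here is a minimal sketch of a single encoder instance feeding both heads, so that the encoder weights really are shared. The class name `SharedEncoderMultiTask`, the dropout rate, and the tiny randomly initialized `BertConfig` are all assumptions chosen so the example runs without downloading weights. In practice the encoder would more likely come from `BertModel.from_pretrained('bert-base-uncased')`.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel

class SharedEncoderMultiTask(nn.Module):
    """One BERT encoder with two linear heads (hypothetical sketch)."""

    def __init__(self, encoder, num_classes=3, num_ner_labels=5, dropout=0.1):
        super().__init__()
        self.encoder = encoder  # the only encoder; both heads read from it
        hidden = encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden, num_classes)         # sentence-level head
        self.ner_classifier = nn.Linear(hidden, num_ner_labels)  # token-level head

    def forward(self, input_ids, attention_mask=None, task="classification"):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        if task == "classification":
            # Pooled [CLS] representation -> (batch, num_classes)
            return self.classifier(self.dropout(out.pooler_output))
        if task == "ner":
            # Every token representation -> (batch, seq_len, num_ner_labels)
            return self.ner_classifier(self.dropout(out.last_hidden_state))
        raise ValueError(f"Unknown task: {task!r}")

# Tiny random configuration (assumption) so no checkpoint download is needed
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
shared_model = SharedEncoderMultiTask(BertModel(config))
shared_model.eval()

ids = torch.randint(0, 100, (1, 6))  # one fake sentence of 6 token ids
mask = torch.ones_like(ids)
with torch.no_grad():
    cls_out = shared_model(ids, mask, task="classification")
    ner_out = shared_model(ids, mask, task="ner")
print(cls_out.shape, ner_out.shape)
```

With this layout a gradient step on either task updates the same encoder parameters, which is what lets the two tasks share knowledge.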

## Task 3 Training Consideration

### **Scenario 1: Freezing the Entire Network**
*Advantages*: Low computational cost; avoids overfitting when data is scarce.\
*Approach*: Use the frozen model purely as a feature extractor. Because no existing parameter is updated, adaptation comes only from a new output layer trained on top of the extracted features.
### **Scenario 2: Freezing Only the Transformer Backbone**
*Advantages*: Leverages pre-trained features while allowing task-specific adaptation through trainable heads.\
*Approach*: Focus training on the task-specific heads to customize the model to new tasks efficiently.
### **Scenario 3: Freezing Only One of the Task-Specific Heads**
*Advantages*: Allows for focused improvements on one task without affecting performance on another task where the head is frozen.\
*Approach*: Adapt the trainable head to new or more complex tasks while keeping the other task stable.
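
The three scenarios above all come down to toggling `requires_grad` on groups of parameters. The sketch below uses a toy stand-in model so it runs instantly. `ToyMultiTask`, `set_trainable`, and `count_trainable` are hypothetical names, and the linear `backbone` merely stands in for the BERT encoder.

```python
from torch import nn

def set_trainable(module, trainable):
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def count_trainable(module):
    """Number of scalar parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

class ToyMultiTask(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)   # stand-in for the transformer encoder
        self.cls_head = nn.Linear(8, 3)   # sentence classification head
        self.ner_head = nn.Linear(8, 5)   # token classification head

toy = ToyMultiTask()

# Scenario 1: freeze the entire network
set_trainable(toy, False)
scenario_1 = count_trainable(toy)

# Scenario 2: freeze only the backbone, train both heads
set_trainable(toy, True)
set_trainable(toy.backbone, False)
scenario_2 = count_trainable(toy)

# Scenario 3: freeze only one head (here the NER head)
set_trainable(toy, True)
set_trainable(toy.ner_head, False)
scenario_3 = count_trainable(toy)

print(scenario_1, scenario_2, scenario_3)
```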
### **Transfer Learning Approach:**
*Choice of Pre-trained Model:* Select between models like bert-base-uncased or bert-large-uncased based on the needed depth and computational budget.\
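*Layers to Freeze/Unfreeze:* Freeze more layers when the target task is close to the pre-training data and objective, so that learned features are retained. Unfreeze more layers, typically starting from the top of the encoder, when the task differs more, so that the representations can adapt.\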
*Rationale:* Balancing the retention of useful pre-trained features and adaptation to new tasks optimizes both learning efficiency and model performance.
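
As an illustration of partial freezing, the sketch below freezes the embeddings and the lowest encoder layers of a BERT model and leaves the upper layers trainable. It assumes the Hugging Face `BertModel` layout (`embeddings` and `encoder.layer`). The helper `freeze_lower_layers` and the tiny random configuration are hypothetical, chosen so that nothing needs to be downloaded.

```python
from transformers import BertConfig, BertModel

def freeze_lower_layers(bert, num_frozen_layers):
    """Freeze the embeddings and the first `num_frozen_layers` encoder layers."""
    for p in bert.embeddings.parameters():
        p.requires_grad = False
    for layer in bert.encoder.layer[:num_frozen_layers]:
        for p in layer.parameters():
            p.requires_grad = False

# Tiny randomly initialized model (assumption); a real run would load a checkpoint
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=4,
                    num_attention_heads=2, intermediate_size=64)
tiny_bert = BertModel(config)
freeze_lower_layers(tiny_bert, num_frozen_layers=2)

# One flag per encoder layer: True if every parameter in that layer is frozen
frozen_flags = [all(not p.requires_grad for p in layer.parameters())
                for layer in tiny_bert.encoder.layer]
print(frozen_flags)
```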

## Task 4 Layer-wise Learning Rate Implementation

In [20]:
# torch.optim.AdamW is used here because transformers.AdamW is deprecated
from torch.optim import AdamW

In [21]:
def get_optimizer(model):
    # Set specific learning rates
    lr_base_model = 2e-5  # Lower learning rate for the pre-trained BERT encoders
    lr_task_heads = 3e-4  # Higher learning rate for the task-specific heads

    # Group parameters separately. Each wrapper in MultiTaskBert exposes its
    # encoder as `.bert` and its linear classification layer as `.classifier`.
    backbone_params = (list(model.base_model.bert.parameters())
                       + list(model.ner_head.bert.parameters()))
    optimizer_grouped_parameters = [
        {'params': backbone_params, 'lr': lr_base_model},
        {'params': model.base_model.classifier.parameters(), 'lr': lr_task_heads},
        {'params': model.ner_head.classifier.parameters(), 'lr': lr_task_heads}
    ]

    # Initialize the optimizer with these grouped parameters
    optimizer = AdamW(optimizer_grouped_parameters)
    return optimizer

optimizer = get_optimizer(model)


**Rationale for Specific Learning Rates:**\
\
*Shared Backbone:* A lower learning rate is used for the shared BERT backbone because it has been pre-trained with vast amounts of data, capturing general language features that are broadly useful. Changing these features too drastically could harm the model's ability to generalize.\
\
*Task-specific Heads:* Higher learning rates for the task-specific heads allow these layers to quickly adapt to the specifics of the tasks. Since these heads start from a randomly initialized state (unless pre-trained on a similar task), they require faster updates to effectively learn their specific tasks.\
\
**Benefits of Using Layer-wise Learning Rates:**\
\
*Customized Learning Dynamics:* Lets the pre-trained backbone layers keep their general features while the newly added task heads adapt quickly to their tasks.\
\
*Improved Efficiency:* Small backbone updates reduce the risk of overwriting pre-trained features, and larger head updates reduce the risk of underfitting the randomly initialized task layers.\
\
*Flexibility in Multi-task Settings:* In a multi-task framework, such differentiation in learning rates is crucial. It ensures that while the shared backbone remains stable and general, each head can evolve according to the unique demands of its respective task.
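
The optimizer above uses two rate levels, one for the backbone and one for the heads. A finer-grained variant gives each encoder layer its own rate, decaying from the top layer downwards. The sketch below is one possible way to build such groups, not the notebook's method: `layerwise_lr_groups`, the decay factor of 0.9, and the toy linear layers standing in for encoder layers are all assumptions.

```python
from torch import nn
from torch.optim import AdamW

def layerwise_lr_groups(layers, head, top_lr=2e-5, head_lr=3e-4, decay=0.9):
    """Build optimizer groups: the top layer gets top_lr, each layer below it
    gets the rate of the layer above multiplied by `decay`, and the head gets head_lr."""
    groups = []
    num_layers = len(layers)
    for i, layer in enumerate(layers):
        depth_from_top = num_layers - 1 - i  # 0 for the top (last) layer
        groups.append({"params": list(layer.parameters()),
                       "lr": top_lr * decay ** depth_from_top})
    groups.append({"params": list(head.parameters()), "lr": head_lr})
    return groups

# Toy stand-ins: three "encoder layers" and one task head
toy_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])
toy_head = nn.Linear(4, 2)

toy_optimizer = AdamW(layerwise_lr_groups(toy_layers, toy_head))
lrs = [group["lr"] for group in toy_optimizer.param_groups]
print(lrs)  # rates increase from the bottom layer to the head
```

For a Hugging Face BERT encoder, `layers` would be `encoder.layer`, with the embeddings added as a further group at the lowest rate.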