# Task 3: Training Considerations

Discuss the implications and advantages of each scenario and explain your rationale as to how
the model should be trained given the following:

1. If the entire network should be frozen.
2. If only the transformer backbone should be frozen.
3. If only one of the task-specific heads (either for Task A or Task B) should be frozen.
   
Consider a scenario where transfer learning can be beneficial. Explain how you would approach
the transfer learning process, including:

1. The choice of a pre-trained model.
2. The layers you would freeze/unfreeze.
3. The rationale behind these choices.

## A. Freezing Scenarios

### 1. Entire Network Frozen  

#### What it Means

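* All model parameters, including the backbone (e.g., BERT) and the task-specific heads, are frozen.
* The model is used purely as a feature extractor: no gradients are computed, and no parameters are updated.
* Heads that were never trained stay at their random initialization, so only the encoder's hidden states carry useful information unless the heads were trained beforehand.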

#### Implications

* No training cost: Without backpropagation or optimizer state, only forward passes are run, which keeps compute and memory low.
* No task adaptation: The model cannot learn domain-specific features or improve performance on the target task.
* Useful for feature extraction only, not for optimization.

#### When to Use It

* Rapid prototyping or experiments on a tight compute budget.
* Working with extremely small datasets (risk of overfitting with fine-tuning).
* Deploying models on edge devices where training is infeasible.


In [4]:
import torch
import torch.nn as nn
from transformers import BertModel

# Step 1: Define a Multi-Task model with BERT + two heads
class MultiTaskTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)  # Task A: Sentiment
        self.ner = nn.Linear(self.bert.config.hidden_size, 4)         # Task B: NER tags

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.dropout(outputs.last_hidden_state)  # shape: [B, L, H]
        logits_a = self.classifier(x[:, 0])          # sentence-level logits (CLS token)
        logits_b = self.ner(x)                       # token-level logits
        return logits_a, logits_b

############################################################################
# Step 2: Instantiate model and freeze the ENTIRE network (BERT + heads)
model = MultiTaskTransformer()
for param in model.parameters():  # <- freeze all parameters
    param.requires_grad = False

# Step 3: Print which layers will be updated (none)
print("\nTrainable Parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  ✅ {name}")
print("\nFrozen Parameters:")
for name, param in model.named_parameters():
    if not param.requires_grad:
        print(f"  ❌ {name}")



Trainable Parameters:

Frozen Parameters:
  ❌ bert.embeddings.word_embeddings.weight
  ❌ bert.embeddings.position_embeddings.weight
  ❌ bert.embeddings.token_type_embeddings.weight
  ❌ bert.embeddings.LayerNorm.weight
  ❌ bert.embeddings.LayerNorm.bias
  ❌ bert.encoder.layer.0.attention.self.query.weight
  ❌ bert.encoder.layer.0.attention.self.query.bias
  ❌ bert.encoder.layer.0.attention.self.key.weight
  ❌ bert.encoder.layer.0.attention.self.key.bias
  ❌ bert.encoder.layer.0.attention.self.value.weight
  ❌ bert.encoder.layer.0.attention.self.value.bias
  ❌ bert.encoder.layer.0.attention.output.dense.weight
  ❌ bert.encoder.layer.0.attention.output.dense.bias
  ❌ bert.encoder.layer.0.attention.output.LayerNorm.weight
  ❌ bert.encoder.layer.0.attention.output.LayerNorm.bias
  ❌ bert.encoder.layer.0.intermediate.dense.weight
  ❌ bert.encoder.layer.0.intermediate.dense.bias
  ❌ bert.encoder.layer.0.output.dense.weight
  ❌ bert.encoder.layer.0.output.dense.bias
  ... (output truncated)
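
To make the "feature extractor" use concrete, here is a minimal sketch of running a fully frozen encoder under `torch.no_grad()`. The small `encoder` below is a hypothetical stand-in for the BERT backbone (it is not pretrained and is used only so the snippet runs without downloading weights); with the real model, the same pattern would apply to `model.bert`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the BERT backbone: embedding + one mixing layer.
encoder = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16), nn.Tanh())

# Freeze every parameter and switch to eval mode (disables dropout in real models).
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()

input_ids = torch.randint(0, 100, (3, 5))  # dummy batch: 3 sequences of 5 token ids

# no_grad() skips building the autograd graph, saving memory and time.
with torch.no_grad():
    hidden = encoder(input_ids)   # shape: [3, 5, 16]
    features = hidden[:, 0]       # first-token vector per sequence, analogous to [CLS]

n_trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
print(features.shape, features.requires_grad, n_trainable)
```

The extracted `features` could then be cached and passed to a separate lightweight classifier, which is one common way to use a frozen model on a tight compute budget.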

### 2. Freeze Transformer Backbone Only  

#### What it Means

* The transformer encoder (e.g., `BertModel`) is frozen — its weights remain unchanged.
* Only the task-specific heads (e.g., `classifier`, `ner`) are trainable and updated during backpropagation.

#### Implications

* Cheap updates: Only the two linear heads are trained (4,614 parameters here, versus roughly 110M in `bert-base-uncased`), so each step is fast and needs little optimizer memory.
* Regularization: Helps retain general language knowledge from pretraining and reduces overfitting risk.
* Task specialization: The heads learn how to map frozen embeddings to task-specific labels.

#### When to Use It

* When working with moderate-sized datasets that benefit from some adaptation, but where full fine-tuning would be unstable or slow.
* When maintaining a consistent, shared encoder across tasks is more important than optimizing for maximum performance on one task.




In [5]:
import torch
import torch.nn as nn
from transformers import BertModel

# Step 1: Define a Multi-Task model with BERT + two heads
class MultiTaskTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)  # Task A: Sentiment
        self.ner = nn.Linear(self.bert.config.hidden_size, 4)         # Task B: NER tags

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.dropout(outputs.last_hidden_state)  # shape: [B, L, H]
        logits_a = self.classifier(x[:, 0])          # sentence-level logits (CLS token)
        logits_b = self.ner(x)                       # token-level logits
        return logits_a, logits_b

############################################################################
# Step 2: Instantiate model and freeze BERT backbone
model = MultiTaskTransformer()
for param in model.bert.parameters():
    param.requires_grad = False  # freeze transformer

# Step 3: Print which layers will be updated
print("\nTrainable Parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  ✅ {name}")
print("\nFrozen Parameters:")
for name, param in model.named_parameters():
    if not param.requires_grad:
        print(f"  ❌ {name}")
############################################################################



Trainable Parameters:
  ✅ classifier.weight
  ✅ classifier.bias
  ✅ ner.weight
  ✅ ner.bias

Frozen Parameters:
  ❌ bert.embeddings.word_embeddings.weight
  ❌ bert.embeddings.position_embeddings.weight
  ❌ bert.embeddings.token_type_embeddings.weight
  ❌ bert.embeddings.LayerNorm.weight
  ❌ bert.embeddings.LayerNorm.bias
  ❌ bert.encoder.layer.0.attention.self.query.weight
  ❌ bert.encoder.layer.0.attention.self.query.bias
  ❌ bert.encoder.layer.0.attention.self.key.weight
  ❌ bert.encoder.layer.0.attention.self.key.bias
  ❌ bert.encoder.layer.0.attention.self.value.weight
  ❌ bert.encoder.layer.0.attention.self.value.bias
  ❌ bert.encoder.layer.0.attention.output.dense.weight
  ❌ bert.encoder.layer.0.attention.output.dense.bias
  ❌ bert.encoder.layer.0.attention.output.LayerNorm.weight
  ❌ bert.encoder.layer.0.attention.output.LayerNorm.bias
  ❌ bert.encoder.layer.0.intermediate.dense.weight
  ❌ bert.encoder.layer.0.intermediate.dense.bias
  ❌ bert.encoder.layer.0.output.dense.weight
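
Freezing alone does not train anything; the optimizer also has to be given the parameters that remain trainable. The sketch below shows one way this might look, using a hypothetical `TinyMultiTask` stand-in (not the pretrained model) and random dummy labels, and then checks that one optimizer step changes a head but leaves the frozen encoder untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyMultiTask(nn.Module):
    """Hypothetical small stand-in for MultiTaskTransformer (no pretrained weights)."""
    def __init__(self, vocab=100, hidden=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Embedding(vocab, hidden), nn.Linear(hidden, hidden), nn.Tanh()
        )
        self.classifier = nn.Linear(hidden, 2)  # Task A: sentence-level
        self.ner = nn.Linear(hidden, 4)         # Task B: token-level

    def forward(self, input_ids):
        x = self.encoder(input_ids)             # [B, L, H]
        return self.classifier(x[:, 0]), self.ner(x)

model = TinyMultiTask()
for param in model.encoder.parameters():        # freeze the backbone only
    param.requires_grad = False

# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# Dummy batch with random labels, for illustration only.
input_ids = torch.randint(0, 100, (8, 6))
labels_a = torch.randint(0, 2, (8,))
labels_b = torch.randint(0, 4, (8, 6))

encoder_before = [p.clone() for p in model.encoder.parameters()]
head_before = model.classifier.weight.clone()

logits_a, logits_b = model(input_ids)
loss_a = nn.functional.cross_entropy(logits_a, labels_a)
loss_b = nn.functional.cross_entropy(logits_b.reshape(-1, 4), labels_b.reshape(-1))
optimizer.zero_grad()
(loss_a + loss_b).backward()
optimizer.step()

encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(encoder_before, model.encoder.parameters())
)
head_changed = not torch.equal(head_before, model.classifier.weight)
print(f"trainable: {n_trainable} / {n_total}")
print("encoder unchanged:", encoder_unchanged, "| head changed:", head_changed)
```

Summing the two task losses with equal weight is an assumption made here for brevity; in practice the weighting between tasks is a tunable choice.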

### 3. Freeze One Task Head Only  

#### What it Means

* Either the classification head or the NER head is frozen; the backbone and the other head remain trainable.

#### Implications

* Selective stability: The frozen head's weights stay fixed while the other head and the backbone adapt. Because the backbone still changes, the features fed to the frozen head can shift, so its performance is not guaranteed to be preserved and should be re-checked after training.
* Imbalance handling: If one task has abundant, high-quality labels and the other is low-resource, locking the well-trained head focuses the updates on the weaker task.

#### When to Use It

* Imbalanced data volumes: One task's data is noisy or scarce compared with the other's.
* Staged fine-tuning: After jointly training both heads, freeze one and fine-tune the backbone plus the other head on new data, using a small learning rate and monitoring the frozen task, since backbone updates can still affect it.


In [6]:
import torch
import torch.nn as nn
from transformers import BertModel

# Define multi-task model
class MultiTaskTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)  # Task A: Sentiment
        self.ner = nn.Linear(self.bert.config.hidden_size, 4)         # Task B: NER tags

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.dropout(outputs.last_hidden_state)
        logits_a = self.classifier(x[:, 0])  # CLS token
        logits_b = self.ner(x)               # token-level logits
        return logits_a, logits_b

# Step 1: Instantiate model
model = MultiTaskTransformer()

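# Step 2: Freeze one head (NER head)
# Note: freezing is only meaningful once this head has been trained; here it is
# freshly initialized and frozen purely to illustrate the mechanics.
for name, param in model.named_parameters():
    # Match the "ner." prefix rather than the substring "ner" anywhere in the name
    param.requires_grad = not name.startswith("ner.")  # BERT + classifier stay trainable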

# Step 3: Print status
print("\nTrainable Parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  ✅ {name}")
print("\nFrozen Parameters:")
for name, param in model.named_parameters():
    if not param.requires_grad:
        print(f"  ❌ {name}")



Trainable Parameters:
  ✅ bert.embeddings.word_embeddings.weight
  ✅ bert.embeddings.position_embeddings.weight
  ✅ bert.embeddings.token_type_embeddings.weight
  ✅ bert.embeddings.LayerNorm.weight
  ✅ bert.embeddings.LayerNorm.bias
  ✅ bert.encoder.layer.0.attention.self.query.weight
  ✅ bert.encoder.layer.0.attention.self.query.bias
  ✅ bert.encoder.layer.0.attention.self.key.weight
  ✅ bert.encoder.layer.0.attention.self.key.bias
  ✅ bert.encoder.layer.0.attention.self.value.weight
  ✅ bert.encoder.layer.0.attention.self.value.bias
  ✅ bert.encoder.layer.0.attention.output.dense.weight
  ✅ bert.encoder.layer.0.attention.output.dense.bias
  ✅ bert.encoder.layer.0.attention.output.LayerNorm.weight
  ✅ bert.encoder.layer.0.attention.output.LayerNorm.bias
  ✅ bert.encoder.layer.0.intermediate.dense.weight
  ✅ bert.encoder.layer.0.intermediate.dense.bias
  ✅ bert.encoder.layer.0.output.dense.weight
  ✅ bert.encoder.layer.0.output.dense.bias
  ... (output truncated)
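
One caveat with this setup may be worth checking empirically: a frozen head keeps its weights, but its outputs can still move if the shared backbone is updated. The sketch below illustrates this with a hypothetical `TinyMultiTask` stand-in (no pretrained weights, random dummy labels); it trains only on the Task A loss and then measures how far the frozen NER head's logits drifted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyMultiTask(nn.Module):
    """Hypothetical small stand-in for MultiTaskTransformer (no pretrained weights)."""
    def __init__(self, vocab=100, hidden=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Embedding(vocab, hidden), nn.Linear(hidden, hidden), nn.Tanh()
        )
        self.classifier = nn.Linear(hidden, 2)  # Task A head (trainable)
        self.ner = nn.Linear(hidden, 4)         # Task B head (frozen below)

    def forward(self, input_ids):
        x = self.encoder(input_ids)
        return self.classifier(x[:, 0]), self.ner(x)

model = TinyMultiTask()
for param in model.ner.parameters():            # freeze the Task B head only
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-2)

input_ids = torch.randint(0, 100, (8, 6))       # dummy batch
labels_a = torch.randint(0, 2, (8,))            # random Task A labels

# Snapshot the frozen head's weights and outputs before training.
ner_weight_before = model.ner.weight.clone()
with torch.no_grad():
    _, ner_logits_before = model(input_ids)

# Train the backbone + classifier on the Task A loss only.
for _ in range(20):
    logits_a, _ = model(input_ids)
    loss = nn.functional.cross_entropy(logits_a, labels_a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    _, ner_logits_after = model(input_ids)

weights_unchanged = torch.equal(ner_weight_before, model.ner.weight)
drift = (ner_logits_after - ner_logits_before).abs().max().item()
print("NER head weights unchanged:", weights_unchanged)
print(f"max change in NER logits: {drift:.4f}")
```

If the measured drift hurts the frozen task on a validation set, possible mitigations include a smaller backbone learning rate or freezing the lower encoder layers as well.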

## B. Transfer-Learning Workflow

When moving from general pretrained weights into our multi-task setting, a **gradual unfreeze** strategy helps retain broad language knowledge while still allowing task-specific specialization.


### 1. Choice of Pretrained Model  
- **General English:** `bert-base-uncased` or `roberta-base` for balanced compute vs. performance.  
- **Domain-Specific:** e.g. `BioBERT` for biomedical text or `LegalBERT` for legal documents, so that fine-tuning starts from weights already adapted to the domain's vocabulary and style.


### 2. Layers to Freeze/Unfreeze  
- **Head-Only:** Freeze encoder entirely, train only linear heads.  
- **Partial:** Unfreeze top ~1–3 transformer blocks (and heads), keep lower blocks frozen.  
- **Full:** Unfreeze all layers once sufficient task-specific data is available.
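
The three modes above can be expressed as a single switch. The following is a sketch under stated assumptions: `TinyMultiTask` and `set_trainable` are hypothetical names, and the tiny model only holds parameters so the example runs without pretrained weights. With the real model, the list of transformer blocks would be `model.bert.encoder.layer` instead of `model.layers`.

```python
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in for MultiTaskTransformer; it only holds parameters."""
    def __init__(self, hidden=16, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(100, hidden)
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.classifier = nn.Linear(hidden, 2)   # Task A head
        self.ner = nn.Linear(hidden, 4)          # Task B head

def set_trainable(model, mode, top_k=2):
    """Apply one freezing mode: 'head_only', 'partial' (top_k blocks), or 'full'."""
    if mode not in ("head_only", "partial", "full"):
        raise ValueError(f"unknown mode: {mode}")
    # Everything is trainable in 'full' mode and frozen otherwise.
    for param in model.parameters():
        param.requires_grad = (mode == "full")
    # The heads are trained in every mode.
    for head in (model.classifier, model.ner):
        for param in head.parameters():
            param.requires_grad = True
    # 'partial' additionally unfreezes the top_k blocks closest to the heads.
    if mode == "partial":
        first_unfrozen = max(0, len(model.layers) - top_k)
        for layer in model.layers[first_unfrozen:]:
            for param in layer.parameters():
                param.requires_grad = True

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = TinyMultiTask()
counts = {}
for mode in ("head_only", "partial", "full"):
    set_trainable(model, mode)
    counts[mode] = count_trainable(model)
print(counts)
```

Comparing the trainable-parameter counts across modes gives a rough sense of how much capacity each option exposes to the task data.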


### 3. Rationale  

The goal is to **adapt a pretrained model** (like BERT) to a downstream task (e.g., classification, NER) **without destroying** the general language knowledge it learned during pretraining.

#### The Three Stages Explained

| **Stage** | **Action** | **Explanation** |
| --- | --- | --- |
| **1. Protect** | Freeze lower (early) encoder layers | Lower layers capture **universal linguistic features** (e.g., morphology, syntax, part-of-speech). Keeping them frozen preserves this general knowledge. |
| **2. Specialize** | Fine-tune higher (later) layers | Higher layers are more **task-adaptive**. They are where task-specific signals (e.g., sentiment, named entities) are most concentrated. |
| **3. Prevent Forgetting** | Gradually unfreeze earlier layers | This **staged unfreezing** avoids sudden parameter shifts, reducing the risk of **catastrophic forgetting**, where useful general features get overwritten. |

---

#### Why This Matters

Pretrained transformers are deep (e.g., BERT-base has 12 encoder layers, BERT-large has 24). Each layer builds on the previous ones:

* **Lower layers** = reusable across tasks
* **Higher layers** = specialized for the current task
* Updating all layers at once can **destabilize training**, especially with small datasets.


#### Catastrophic Forgetting

> When a model “forgets” its pretrained knowledge because all layers are updated too aggressively during fine-tuning.

This often happens when:

* You fine-tune all layers at once.
* You use a high learning rate on small or noisy data.
* There's **domain shift** (e.g., Wikipedia → tweets).

#### How to Apply This in Practice

1. **Start with all encoder layers frozen**.
2. **Train the task heads** (classification and/or NER) only.
3. **Unfreeze the last layer** of the transformer, train for a few epochs.
4. **Unfreeze one more layer at a time**, moving downward.
5. Optionally use **discriminative learning rates** (smaller for early layers, larger for top layers).
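
The steps above can be sketched as a staged loop with per-layer learning rates. This is a minimal illustration, not a tuned recipe: `TinyMultiTask`, `layerwise_param_groups`, and `unfreeze_top` are hypothetical names, the decay factor of 0.5 is an arbitrary assumption, a single optimizer step stands in for "a few epochs", and the embeddings are kept frozen throughout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in for MultiTaskTransformer (no pretrained weights)."""
    def __init__(self, hidden=16, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(100, hidden)
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.classifier = nn.Linear(hidden, 2)
        self.ner = nn.Linear(hidden, 4)

    def forward(self, input_ids):
        x = self.embed(input_ids)
        for layer in self.layers:
            x = torch.tanh(layer(x))
        return self.classifier(x[:, 0]), self.ner(x)

def layerwise_param_groups(model, base_lr=1e-3, decay=0.5):
    """Heads use base_lr; each encoder layer further down uses a smaller rate."""
    heads = list(model.classifier.parameters()) + list(model.ner.parameters())
    groups = [{"params": heads, "lr": base_lr}]
    lr = base_lr
    for layer in reversed(model.layers):           # top layer first
        lr *= decay
        groups.append({"params": list(layer.parameters()), "lr": lr})
    return groups

def unfreeze_top(model, n_unfrozen):
    """Keep embeddings frozen; make only the top n_unfrozen layers trainable."""
    for param in model.embed.parameters():
        param.requires_grad = False
    for depth, layer in enumerate(reversed(model.layers)):
        for param in layer.parameters():
            param.requires_grad = depth < n_unfrozen

model = TinyMultiTask()
optimizer = torch.optim.AdamW(layerwise_param_groups(model))
lrs = [group["lr"] for group in optimizer.param_groups]

input_ids = torch.randint(0, 100, (8, 6))          # dummy batch
labels_a = torch.randint(0, 2, (8,))               # random Task A labels

schedule = []
for stage in range(len(model.layers) + 1):         # stage 0 trains the heads only
    unfreeze_top(model, stage)
    optimizer.zero_grad()
    logits_a, _ = model(input_ids)                 # one step stands in for a few epochs
    nn.functional.cross_entropy(logits_a, labels_a).backward()
    optimizer.step()                               # parameters without gradients are skipped
    schedule.append(sum(p.numel() for p in model.parameters() if p.requires_grad))

print("learning rates (heads, then top to bottom):", lrs)
print("trainable parameters per stage:", schedule)
```

Building all parameter groups up front and toggling `requires_grad` per stage keeps a single optimizer for the whole run; an alternative design would create a fresh optimizer at each stage.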

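Gradual unfreezing and discriminative learning rates were popularized by ULMFiT (Universal Language Model Fine-tuning for Text Classification), a seminal 2018 paper by Howard & Ruder. It introduced a three-stage transfer learning approach for NLP tasks: general-domain language model pretraining, language model fine-tuning on target-task data, and classifier fine-tuning.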

In [8]:
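# Example: unfreeze only the top 2 encoder layers of BERT (layers 10 and 11)
# Assumes `model` is the MultiTaskTransformer instance defined above.
top_layers = ("encoder.layer.10.", "encoder.layer.11.")
for name, param in model.bert.named_parameters():
    # The trailing dot in each prefix avoids accidental partial matches
    param.requires_grad = name.startswith(top_layers)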