# Task 3: Training Considerations

Discuss the implications and advantages of each scenario and explain your rationale as to how
the model should be trained given the following:

1. If the entire network should be frozen.
2. If only the transformer backbone should be frozen.
3. If only one of the task-specific heads (either for Task A or Task B) should be frozen.
   
Consider a scenario where transfer learning can be beneficial. Explain how you would approach
the transfer learning process, including:

1. The choice of a pre-trained model.
2. The layers you would freeze/unfreeze.
3. The rationale behind these choices.

## A. Freezing Scenarios

### 1. Entire Network Frozen  
- **What**  
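  - All parameters (backbone + both heads) stay at their pretrained or initialized values; the model is used for inference only.
- **Implications**
  - **Zero training cost:** No gradients are computed and no optimizer state is needed.
  - **No task adaptation:** Neither the embeddings nor the heads can adjust to task-specific patterns. If the heads are freshly (randomly) initialized, their predictions are essentially arbitrary, so this setup only makes sense when the heads have already been trained.
- **When to choose**
  - Evaluating or deploying a model whose backbone and heads are already trained.
  - Quick prototyping or low-compute environments, e.g. establishing a baseline before any fine-tuning.
  - Extremely small datasets where any fine-tuning would overfit, provided the existing heads already match the label space.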
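
A minimal sketch of the fully frozen setup, assuming PyTorch. `TinyMultiTask` is a hypothetical stand-in (a small `encoder` in place of `BertModel`, plus the `classifier` and `ner` heads), so nothing needs to be downloaded; the dimensions are arbitrary.

```python
import torch
from torch import nn


class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: small encoder plus two linear task heads."""

    def __init__(self, in_dim=16, hidden=32, n_classes=3, n_tags=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.classifier = nn.Linear(hidden, n_classes)  # sentence-level head
        self.ner = nn.Linear(hidden, n_tags)            # token-level head

    def forward(self, x):
        h = self.encoder(x)                      # (batch, seq, hidden)
        return self.classifier(h[:, 0]), self.ner(h)


torch.manual_seed(0)
model = TinyMultiTask()
model.requires_grad_(False)  # freeze every parameter in the network
model.eval()                 # inference mode (matters for dropout, etc.)

with torch.no_grad():        # no autograd graph is built
    cls_logits, ner_logits = model(torch.randn(2, 7, 16))

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_trainable)  # 0
```

With nothing trainable there is also no optimizer to construct; the model is purely a predictor.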

### 2. Freeze Transformer Backbone Only  
- **What**  
  - Encoder (`BertModel`) weights are held fixed; only the task heads (`classifier`, `ner`) are updated.  
- **Implications**  
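  - **Cheap, fast training:** Only the heads train. A linear head has `hidden_size × num_labels + num_labels` parameters, i.e. a few thousand for `bert-base` (hidden size 768), versus roughly 110M in the encoder, and no gradients flow through the encoder.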
  - **Regularization:** Pretrained language features are preserved, reducing overfitting risk.  
  - **Task specialization:** Heads learn to map general embeddings to each task’s label space.  
- **When to choose**  
  - Moderate-sized datasets where you need some task adaptation but cannot reliably fine-tune a large encoder.  
  - When consistency of the shared encoder across tasks is more important than specialized language features.  
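
The backbone-frozen setup can be sketched as follows, assuming PyTorch. `TinyMultiTask` is a hypothetical stand-in for the real model (its `encoder` plays the role of `BertModel`), and the random tensors are placeholder data.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: small encoder plus two linear task heads."""

    def __init__(self, in_dim=16, hidden=32, n_classes=3, n_tags=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.classifier = nn.Linear(hidden, n_classes)
        self.ner = nn.Linear(hidden, n_tags)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h[:, 0]), self.ner(h)


torch.manual_seed(0)
model = TinyMultiTask()

# Freeze the backbone; the heads keep requires_grad=True.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# One joint training step on placeholder data.
x = torch.randn(2, 7, 16)
cls_target = torch.tensor([0, 2])
ner_target = torch.randint(0, 5, (2, 7))

encoder_before = model.encoder[0].weight.clone()
cls_logits, ner_logits = model(x)
loss = F.cross_entropy(cls_logits, cls_target) + F.cross_entropy(
    ner_logits.reshape(-1, 5), ner_target.reshape(-1)
)
loss.backward()
optimizer.step()

print(trainable)  # only the classifier.* and ner.* parameters
```

Passing only the trainable parameters to the optimizer keeps its state small; the frozen encoder receives no gradients at all.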


### 3. Freeze One Task Head Only  
- **What**  
  - Either the classification head or the NER head is frozen; the backbone and the other head remain trainable.  
- **Implications**  
  - **Selective stability:** The frozen head's weights stay fixed while the backbone and the other head adapt. Its performance is only approximately preserved: because the backbone still changes, the features fed to the frozen head can drift. Freezing the backbone as well, or keeping its learning rate small, limits this drift.
  - **Imbalance handling:** If one task has abundant, high-quality labels and the other is low-resource, you can lock the well-trained head and focus training capacity on the weaker task.
- **When to choose**
  - **Imbalanced data volumes:** One task's head is already well trained, while the other task's data is noisy or scarce and still needs work.
  - **Staged fine-tuning:** After jointly training both heads, freeze one and fine-tune backbone + the other head on new data, monitoring the frozen task's validation metric to catch regressions.
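
A small sketch of freezing one head, assuming PyTorch and the same hypothetical `TinyMultiTask` stand-in. It also checks a caveat of this scenario: the frozen head's weights do not move, but its outputs can still shift because the shared encoder is being updated.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: small encoder plus two linear task heads."""

    def __init__(self, in_dim=16, hidden=32, n_classes=3, n_tags=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.classifier = nn.Linear(hidden, n_classes)
        self.ner = nn.Linear(hidden, n_tags)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h[:, 0]), self.ner(h)


torch.manual_seed(0)
model = TinyMultiTask()
model.classifier.requires_grad_(False)  # freeze the Task A head only

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1
)

x = torch.randn(2, 7, 16)                 # placeholder inputs
ner_target = torch.randint(0, 5, (2, 7))  # placeholder NER labels

head_before = model.classifier.weight.clone()
with torch.no_grad():
    cls_before, _ = model(x)

# Train encoder + NER head on the NER loss only.
_, ner_logits = model(x)
F.cross_entropy(ner_logits.reshape(-1, 5), ner_target.reshape(-1)).backward()
optimizer.step()

with torch.no_grad():
    cls_after, _ = model(x)

head_unchanged = torch.equal(head_before, model.classifier.weight)
outputs_shifted = not torch.allclose(cls_before, cls_after)
print(head_unchanged, outputs_shifted)
```

If both flags come out true, the frozen head is intact but its predictions have moved with the encoder, which is why the frozen task's validation metric is still worth tracking.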


## B. Transfer-Learning Workflow

When moving from general pretrained weights into our multi-task setting, a **gradual unfreeze** strategy helps retain broad language knowledge while still allowing task-specific specialization.

| Stage                      | Frozen Layers                       | Trainable Layers                    | Purpose                                 |
|----------------------------|-------------------------------------|-------------------------------------|-----------------------------------------|
| **1. Head-Only Tuning**    | All encoder layers                  | Both task heads                     | Quickly learn to map general features to your tasks with minimal risk of overfitting. |
| **2. Partial Unfreeze**    | Lower encoder layers (1–8)          | Higher encoder layers (9–12) + heads | Allow top contextual layers to adapt to task idiosyncrasies while keeping base language features stable. |
| **3. Full Fine-tuning**    | None (optionally freeze very bottom)| Entire model                        | If you have large, clean datasets and validation metrics continue improving, let all layers adjust. |


### 1. Choice of Pretrained Model  
- **General English:** `bert-base-uncased` or `roberta-base` for balanced compute vs. performance.  
- **Domain-Specific:** e.g. `BioBERT` for biomedical text or `LEGAL-BERT` for legal documents, to start from representations already adapted to the domain's terminology and style.


### 2. Layers to Freeze/Unfreeze  
- **Head-Only:** Freeze encoder entirely, train only linear heads.  
- **Partial:** Unfreeze the top few transformer blocks (e.g. blocks 9–12 of a 12-layer encoder) along with the heads; keep the lower blocks and embeddings frozen.  
- **Full:** Unfreeze all layers once sufficient task-specific data is available.
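
The three stages can be expressed as a rule over parameter names. This is a sketch under stated assumptions: the names follow the Hugging Face style with a hypothetical `bert.` prefix for the encoder (`bert.encoder.layer.N.…`, zero-indexed), anything outside `bert.` is a task head, and `top_k=4` corresponds to layers 9–12 in the stage table. Treating the pooler as trainable in stage 2 is a design choice, not something the source specifies.

```python
import re

LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(name: str, stage: int, num_layers: int = 12, top_k: int = 4) -> bool:
    """Return True if parameter `name` should be trained in stage 1, 2, or 3."""
    if stage not in (1, 2, 3):
        raise ValueError(f"unknown stage: {stage}")
    if not name.startswith("bert."):
        return True              # task heads train in every stage
    if stage == 1:
        return False             # head-only tuning: whole encoder frozen
    if stage == 3:
        return True              # full fine-tuning
    # Stage 2: only the top `top_k` blocks (zero-indexed) are unfrozen.
    match = LAYER_RE.search(name)
    if match:
        return int(match.group(1)) >= num_layers - top_k
    # Embeddings stay frozen; the pooler sits above the last block (assumption).
    return "pooler" in name


names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.7.output.dense.weight",
    "bert.encoder.layer.8.output.dense.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
    "ner.bias",
]
plan = {stage: [n for n in names if is_trainable(n, stage)] for stage in (1, 2, 3)}
for stage, trainable in plan.items():
    print(stage, trainable)
```

With a PyTorch model this would typically be applied as `for name, p in model.named_parameters(): p.requires_grad = is_trainable(name, stage)`, rebuilding the optimizer afterwards so that newly unfrozen parameters are included.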


### 3. Rationale  
- **Staged Unfreezing (ULMFiT-style):**  
  1. **Protect** general linguistic knowledge in lower layers.  
  2. **Specialize** higher layers where task-relevant patterns reside.  
  3. **Prevent** catastrophic forgetting by gradually exposing more parameters to task gradients.  
- **Differential Learning Rates:**  
  - Use a smaller LR (e.g., 1e-5) for encoder layers, slightly higher (e.g., 5e-5) for task heads.  
- **Validation-Guided:**  
  - Monitor each task's validation metric independently; if one task plateaus or degrades, consider re-freezing the most recently unfrozen layers or lowering the learning rate for that task's head.
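
Differential learning rates map onto optimizer parameter groups. The sketch below assumes PyTorch and reuses the hypothetical `TinyMultiTask` stand-in; the learning rates are the example values from the bullet above, and `weight_decay=0.01` is an illustrative choice.

```python
import torch
from torch import nn


class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: small encoder plus two linear task heads."""

    def __init__(self, in_dim=16, hidden=32, n_classes=3, n_tags=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.classifier = nn.Linear(hidden, n_classes)
        self.ner = nn.Linear(hidden, n_tags)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h[:, 0]), self.ner(h)


model = TinyMultiTask()
head_params = list(model.classifier.parameters()) + list(model.ner.parameters())

# One parameter group per learning rate; weight_decay applies to both groups.
optimizer = torch.optim.AdamW(
    [
        {"params": list(model.encoder.parameters()), "lr": 1e-5},  # encoder
        {"params": head_params, "lr": 5e-5},                       # task heads
    ],
    weight_decay=0.01,
)

lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)  # [1e-05, 5e-05]
```

Lowering one group's rate later is a matter of editing `optimizer.param_groups[i]["lr"]`, which fits the validation-guided adjustments described above.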



**In Summary:**  
- **Freezing choices** trade off speed, regularization, and adaptability; choose based on dataset size and task balance.  
- A **gradual unfreeze** transfer-learning pipeline keeps pretrained features stable early on and introduces task specialization only as far as the data and validation metrics support it.