# Transfer Learning with Transformers

**Module 02 | Notebook 1 of 3**

Transfer learning is the foundation of modern NLP. Instead of training from scratch, we start with a pre-trained model and adapt it to our specific task.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand when and why to use transfer learning
2. Use the Hugging Face Trainer API
3. Prepare datasets for fine-tuning
4. Monitor training progress

---

## 🤔 When Should I Fine-Tune?

**Before investing time and GPU resources, ask yourself:**

| **Fine-tune if...** | **Just use prompting if...** |
|---------------------|------------------------------|
| ✅ You have 1,000+ labeled examples | ❌ You have <100 examples |
| ✅ You need consistent output format | ❌ A good prompt gets 80%+ of what you need |
| ✅ You're deploying to production | ❌ You're prototyping or experimenting |
| ✅ You need domain-specific vocabulary | ❌ General language understanding is enough |
| ✅ Latency/cost matters (smaller fine-tuned model) | ❌ You can afford larger model API calls |

**💰 Cost Reality Check:**
- **Time**: 1-4 hours for small models, 8-24 hours for large models
- **Compute**: ~$5-50 on cloud GPUs (or free on Colab with limits)
- **Data prep**: Often the biggest hidden cost!

> 💡 **Rule of thumb**: If you can solve it with a well-crafted prompt plus a few examples, try that first. Fine-tuning is for when prompting isn't enough.

In [1]:
%%capture
!pip install transformers datasets accelerate evaluate scikit-learn

In [2]:
import torch
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
import evaluate
import numpy as np
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


---

## What is Transfer Learning?

### The Main Idea
Imagine you want to learn how to play **squash**.
*   **Option A (From Scratch):** You learn how to hold a racket, how to move your feet, how to hit a ball, and the rules of squash all at once. This takes a long time.
*   **Option B (Transfer Learning):** You already know how to play **tennis**. You transfer your knowledge of running, swinging, and hand-eye coordination to squash. You only need to learn the specific differences (smaller court, different ball bounces).

In NLP, we do Option B. We take a model that already knows "English" (grammar, vocabulary, syntax) and teach it a specific task (sentiment analysis, spam detection).

### 1. Pre-training (The "Tennis" Phase)
A model reads billions of sentences (Wikipedia, Books) to learn language structure. This is computationally expensive (weeks of training, hundreds of GPUs).
```
[ Mass of Unlabeled Text ] ---> ( Pre-training ) ---> [ General Language Model ]
```

### 2. Fine-tuning (The "Squash" Phase)
We take that general model and train it slightly on our specific dataset. This is cheap (minutes/hours, single GPU).
```
[ General Language Model ] + [ Labeled Dataset ] ---> ( Fine-tuning ) ---> [ Task-Specific Model ]
```

### When to Use Transfer Learning

| Scenario | Recommendation |
|----------|----------------|
| Limited labeled data (<10k examples) | ✅ Transfer Learning |
| Standard NLP task (classification, NER, QA) | ✅ Transfer Learning |
| Limited compute budget | ✅ Transfer Learning |
| Very domain-specific data (legal, medical) | ✅ Transfer Learning + Domain Pre-training |
| Massive dataset (>1M examples) | Consider training from scratch |

---

## Dataset Preparation

We'll use the Rotten Tomatoes movie review dataset for sentiment classification.

In [3]:
# Load a small dataset for quick training
dataset = load_dataset("rotten_tomatoes")

print("Dataset structure:")
print(dataset)
print(f"\nTrain examples: {len(dataset['train'])}")
print(f"Test examples: {len(dataset['test'])}")

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

Train examples: 8530
Test examples: 1066


In [4]:
# Explore the data
print("Sample examples:")
print("-" * 60)
for i in range(3):
    example = dataset['train'][i]
    label = "Positive" if example['label'] == 1 else "Negative"
    print(f"Label: {label}")
    print(f"Text: {example['text'][:100]}...")
    print()

Sample examples:
------------------------------------------------------------
Label: Positive
Text: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash eve...

Label: Positive
Text: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column ...

Label: Positive
Text: effective but too-tepid biopic...



In [5]:
# Load tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function
def tokenize_function(examples):
    # No padding here: DataCollatorWithPadding pads each batch at training time
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=256
    )

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

print("Tokenized dataset columns:", tokenized_dataset['train'].column_names)

Tokenized dataset columns: ['text', 'label', 'input_ids', 'attention_mask']
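
Padding can be applied once to the whole dataset or separately to each batch; the `DataCollatorWithPadding` imported earlier takes the per-batch approach. The sketch below illustrates the idea in plain Python. The helper `pad_batch` and the token ids are hypothetical and are not part of the Hugging Face API; the pad id of 0 is an assumption that matches the `padding_idx=0` reported for DistilBERT's embeddings.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for illustration, not real tokenizer output
short_batch = [[101, 2023, 102], [101, 2307, 3185, 102]]
long_batch = [[101] + [2000] * 40 + [102], [101, 2204, 102]]

padded_short = pad_batch(short_batch)
padded_long = pad_batch(long_batch)

print("Short batch width:", len(padded_short["input_ids"][0]))
print("Long batch width:", len(padded_long["input_ids"][0]))
```

Each batch is only as wide as its own longest example, so a batch of short reviews does not pay for one long review elsewhere in the dataset.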


In [6]:
# Create a smaller subset for quick training (optional - use full dataset for better results)
small_train = tokenized_dataset['train'].shuffle(seed=42).select(range(1000))
small_val = tokenized_dataset['validation'].shuffle(seed=42).select(range(200))

print(f"Training samples: {len(small_train)}")
print(f"Validation samples: {len(small_val)}")

Training samples: 1000
Validation samples: 200


---

## Model Setup

We frame our task as **sequence classification**: input text → one label.
We use `distilbert-base-uncased`, a distilled version of BERT that is about 40% smaller and faster to run while retaining roughly 97% of BERT's performance on language-understanding benchmarks.

In [7]:
# Load pre-trained model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # Binary classification
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"\nModel architecture:")
print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Total parameters: 66,955,010
Trainable parameters: 66,955,010

Model architecture:
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affin

### Understanding the Model Structure

```
┌─────────────────────────────────────────────┐
│           DistilBERT Base Model             │  ← Pre-trained weights
│     (learned language understanding)        │    (66M parameters)
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│         Classification Head                 │  ← Randomly initialized
│     (Linear: 768 → 2 classes)               │    (learns during fine-tuning)
└─────────────────────────────────────────────┘
                      │
                      ▼
              [NEGATIVE, POSITIVE]
```
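
To see how small the new head is relative to the pre-trained body, here is a quick back-of-the-envelope count. It assumes the two newly initialized layers named in the loading warning are plain linear layers, `pre_classifier` mapping 768 → 768 and `classifier` mapping 768 → 2; those shapes are an assumption, since the printed architecture above is cut off before the head. The helper `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


hidden_size = 768  # embedding width from the printed architecture
num_labels = 2     # NEGATIVE / POSITIVE

# Assumed shapes of the two newly initialized layers
pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
head_params = pre_classifier + classifier

total_params = 66_955_010  # "Total parameters" printed by the cell above
base_params = total_params - head_params

print(f"pre_classifier: {pre_classifier:,}")
print(f"classifier:     {classifier:,}")
print(f"head total:     {head_params:,} ({head_params / total_params:.2%} of the model)")
print(f"pre-trained:    {base_params:,}")
```

Under these assumptions, under 1% of the weights start from random values; everything else arrives pre-trained, which is why a few epochs on a small dataset can be enough.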

---

## Training with the Trainer API

The Hugging Face `Trainer` handles:
- Training loop
- Gradient accumulation
- Mixed precision training
- Logging and checkpointing
- Evaluation

> ⚠️ **Before Training**: This configuration uses ~4-6 GB GPU memory. If you get an **OOM (Out of Memory) Error**:
> - Reduce `per_device_train_batch_size` from 16 → 8 → 4
> - Reduce `max_length` in the tokenization step (memory scales with batch size and sequence length, not with the number of training examples)
> - Use `fp16=True` (already enabled if a GPU is available)
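
Shrinking the batch size changes the optimization as well as the memory footprint. Gradient accumulation, one of the features listed above, can compensate by summing gradients over several small batches before each weight update. The sketch below shows the arithmetic only; `accumulation_steps` is a hypothetical helper, and its result would be passed to the `gradient_accumulation_steps` argument of `TrainingArguments`.

```python
import math


def accumulation_steps(target_batch_size, per_device_batch_size, num_devices=1):
    """Number of small batches to accumulate to reach the target effective batch size."""
    return math.ceil(target_batch_size / (per_device_batch_size * num_devices))


# Keep the effective batch size at 16 while lowering the per-device batch size
for per_device in (16, 8, 4):
    steps = accumulation_steps(16, per_device)
    print(f"per_device_train_batch_size={per_device} -> gradient_accumulation_steps={steps}")
```

The trade-off is speed: each optimizer step now needs several forward and backward passes, but peak memory follows the smaller per-device batch.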

In [8]:
# Define evaluation metrics
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)
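
For intuition, the same calculation can be written without any libraries. This is a sketch with made-up logits, not the implementation used by `evaluate`; the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews: one [negative, positive] score pair each
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 1.5]]
toy_labels = [0, 1, 1, 1]

result = accuracy(toy_logits, toy_labels)
print(result)
```

The third example is the only miss: its negative score is slightly higher, but its label is positive, so three of four predictions are correct.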

In [9]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    
    # Hyperparameters
    num_train_epochs=3,              # 📚 How many times to see all training data (2-4 is typical)
    per_device_train_batch_size=16,  # 💾 Process 16 examples at once (lower = less memory, slower training)
    per_device_eval_batch_size=16,
    learning_rate=2e-5,              # 🎯 CRITICAL! How fast to update weights (2e-5 to 5e-5 for fine-tuning)
    weight_decay=0.01,               # 🛡️ Regularization to prevent overfitting (like L2 penalty)
    warmup_ratio=0.1,                # 🔥 Gradually increase LR for first 10% (prevents early shock)
    
    # Evaluation
    eval_strategy="epoch",           # Evaluate at end of every epoch
    save_strategy="epoch",           # Save model checkpoint at end of every epoch
    load_best_model_at_end=True,     # Always end with the best model found
    metric_for_best_model="accuracy",
    
    # Logging
    logging_dir="./logs",
    logging_steps=50,
    
    # Performance
    fp16=torch.cuda.is_available(),  # Mixed precision (faster, less memory) if GPU available
    
    # Misc
    report_to="none",  # Disable wandb/tensorboard for this demo
    push_to_hub=False
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Mixed precision: {training_args.fp16}")

Training configuration:
  Epochs: 3
  Batch size: 16
  Learning rate: 2e-05
  Mixed precision: True
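
To make `warmup_ratio` concrete, the sketch below works out the step counts for this run and evaluates a linear warmup followed by linear decay. It assumes the 1,000-example training subset, a single device, no gradient accumulation, and the `Trainer` default linear scheduler; `linear_schedule_lr` is a hypothetical helper that mirrors that shape, not the library's own code, and the exact step accounting inside `Trainer` may differ slightly.

```python
import math


def linear_schedule_lr(step, base_lr, total_steps, warmup_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


train_examples = 1000  # size of the training subset
batch_size = 16
epochs = 3
base_lr = 2e-5

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(0.1 * total_steps)

print(f"Steps per epoch: {steps_per_epoch}, total: {total_steps}, warmup: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, base_lr, total_steps, warmup_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The learning rate peaks at `base_lr` when warmup ends and then falls steadily, so the randomly initialized head receives small updates during its noisiest early steps.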


In [10]:
# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_val,
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("Trainer created successfully!")

Trainer created successfully!


In [11]:
# Train the model
print("Starting training...")
print("=" * 50)

train_result = trainer.train()

print("\n" + "=" * 50)
print("Training complete!")
print(f"Training time: {train_result.metrics['train_runtime']:.1f}s")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.1f}")

Starting training...


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6407 | 0.469355 | 0.775 |
| 2 | 0.4106 | 0.427143 | 0.8 |
| 3 | 0.2686 | 0.428355 | 0.81 |



Training complete!
Training time: 29.7s
Samples per second: 101.1


In [12]:
# Evaluate on validation set
eval_results = trainer.evaluate()

print("\nEvaluation Results:")
print(f"  Loss: {eval_results['eval_loss']:.4f}")
print(f"  Accuracy: {eval_results['eval_accuracy']:.2%}")


Evaluation Results:
  Loss: 0.4284
  Accuracy: 81.00%


> 📊 **Is 81.0% accuracy good?**
> - Random guessing = 50% (it's binary classification)
> - Simple bag-of-words baseline = ~70%
> - Our fine-tuned model = 81.0% ✓
> 
> With only 1,000 training samples and 3 epochs, this is solid! Using the full 8,530-example training set typically reaches 86-88%.
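
One caveat when reading this number: it comes from only 200 validation examples, so it carries real sampling noise. The sketch below estimates a rough 95% interval using the normal approximation to the binomial distribution. That approximation is an assumption that is reasonable at this sample size, and `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate confidence interval for an accuracy measured on n examples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


# 81% accuracy on the 200-example validation subset
low, high = accuracy_interval(0.81, 200)
print(f"95% interval: {low:.1%} to {high:.1%}")

# The same accuracy measured on the full 1,066-example validation split
low_full, high_full = accuracy_interval(0.81, 1066)
print(f"With 1,066 examples: {low_full:.1%} to {high_full:.1%}")
```

With 200 examples the interval spans roughly ten percentage points, so differences of a point or two between runs are within noise. Evaluating on the full validation split narrows it considerably.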

---

## Testing the Fine-tuned Model

In [13]:
from transformers import pipeline

# Create a pipeline with our fine-tuned model
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test on new examples
test_texts = [
    "This movie was absolutely fantastic! A masterpiece.",
    "Terrible film. Complete waste of time.",
    "It was okay, nothing special but watchable.",
    "The acting was superb and the plot kept me engaged.",
    "I fell asleep halfway through. So boring."
]

print("Predictions:")
print("=" * 60)
for text in test_texts:
    result = sentiment_pipeline(text)[0]
    print(f"Text: {text[:50]}...")
    print(f"  → {result['label']} ({result['score']:.2%})\n")

Device set to use cuda:0


Predictions:
Text: This movie was absolutely fantastic! A masterpiece...
  → POSITIVE (94.69%)

Text: Terrible film. Complete waste of time....
  → NEGATIVE (94.08%)

Text: It was okay, nothing special but watchable....
  → POSITIVE (76.79%)

Text: The acting was superb and the plot kept me engaged...
  → POSITIVE (89.62%)

Text: I fell asleep halfway through. So boring....
  → NEGATIVE (93.46%)
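
The percentages above are class probabilities. For a single-label classifier with two or more labels, the pipeline is expected to apply a softmax to the model's raw logits and report the top class. The sketch below reproduces that step in plain Python with made-up logits; it illustrates the idea rather than the pipeline's actual code.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for one review, in [negative, positive] order
toy_logits = [-1.2, 1.7]

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{id2label[best]} ({probs[best]:.2%})")
```

A score near 50% therefore means the two logits were close. That fits the lukewarm "It was okay" review, which received the least confident prediction of the five.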



---

## Saving and Loading the Model

In [14]:
# Save the model
save_path = "./fine_tuned_sentiment_model"
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model saved to {save_path}")

Model saved to ./fine_tuned_sentiment_model


In [15]:
# Load the saved model
loaded_model = AutoModelForSequenceClassification.from_pretrained(save_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)

# Verify it works
loaded_pipeline = pipeline(
    "sentiment-analysis",
    model=loaded_model,
    tokenizer=loaded_tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

result = loaded_pipeline("This is a great course!")[0]
print(f"Loaded model prediction: {result['label']} ({result['score']:.2%})")

Device set to use cuda:0


Loaded model prediction: POSITIVE (87.21%)


---

## 🎯 Student Challenge

### Challenge: Fine-tune on AG News

Fine-tune a model for **4-class text classification** using the AG News dataset.

In [16]:
# TODO: Your code here
# 1. Load the AG News dataset
# dataset_ag = load_dataset("ag_news")

# 2. Prepare the model for 4 classes
# labels = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
# id2label = {k: v for k, v in labels.items()}
# label2id = {v: k for k, v in labels.items()}

# 3. Initialize Model with num_labels=4
# model_ag = AutoModelForSequenceClassification.from_pretrained(
#    "distilbert-base-uncased",
#    num_labels=4,
#    id2label=id2label,
#    label2id=label2id
# )

# 4. Tokenize, Split, and Train (similar to above)

# Your solution below:


---

## Key Takeaways

1. **Transfer learning** leverages pre-trained models to reduce training time and data requirements
2. **The Trainer API** simplifies training with built-in best practices
3. **Classification heads** are added on top of pre-trained models for specific tasks
4. **Hyperparameters** like learning rate and batch size significantly impact results
5. **Save and load** models for deployment using `save_pretrained`/`from_pretrained`

---

## Next Steps

Continue to `02_sentiment_analysis.ipynb` for a deeper dive into classification!