# Session 4, Part 2 – Large-Scale Pretraining and Fine-Tuning

Previously, in **Part 1**, we built a small Transformer-based text generator (GPT-like model) and discussed training routines on a **small scale**. Now, we will **scale up** to discuss:

1. **Large-Scale Pretraining** strategies (distributed training, pipeline parallelism, model sharding).  
2. **Mixed Precision** training to reduce memory usage and speed up computation.  
3. **Fine-Tuning** on downstream tasks (using domain data, custom heads).  
4. **Leveraging third-party APIs** like **Hugging Face Transformers**, **Hugging Face Accelerate**, and **PyTorch Lightning** to handle complexities of large-scale training.

**Note**: Due to resource constraints, we will not run a real large-scale pretraining, but you’ll learn the **conceptual framework** and see **example code** that can be adapted to real HPC environments.

---

## Table of Contents

1. [Overview of Large-Scale Transformer Training](#overview)
2. [Distributed Training Approaches](#distributed)
   - [Data Parallelism](#data-parallel)
   - [Model/Pipeline Parallelism](#model-parallel)
   - [Sharding (e.g., ZeRO, Megatron-LM)](#sharding)
3. [Mixed Precision Training](#mixed-precision)
   - [FP16, BF16, and FP8 Overview](#fp16-bf16)
   - [Practical Implementation in PyTorch/Hugging Face](#mp-implementation)
4. [Fine-Tuning on Downstream Tasks](#finetuning)
   - [Task-Adaptive Pretraining vs. Fine-Tuning](#task-adapt)
   - [Using Hugging Face Transformers](#hf-transformers)
   - [Example: Fine-Tuning GPT-2 with PyTorch Lightning](#lightning-example)
5. [Putting It All Together – A Scalable Pretraining/Fine-Tuning Pipeline](#pipeline)
6. [Practical Exercises – Part 2](#exercises)
7. [Conclusion](#conclusion)


## <a id="overview"></a>1. Overview of Large-Scale Transformer Training

Modern large language models (LLMs) like **GPT-3**, **PaLM**, or **LLaMA** are trained on **massive** corpora (hundreds of billions to trillions of tokens). The **key** to training such models is parallelizing computation **efficiently** and handling very large parameter counts.

**Main Challenges**:
1. **Memory**: A single GPU cannot store billions of parameters plus optimizer states and intermediate activations.  
2. **Speed**: Large-scale training can take weeks or months if not parallelized.  
3. **Cost**: HPC resources must be used effectively; inefficient code can skyrocket costs.

**High-Level Solutions**:
- **Distributed Training**: multiple GPUs, multiple nodes.  
- **Parallelism**: data parallel, model parallel, pipeline parallel.  
- **Mixed Precision**: FP16 or BF16 to reduce memory usage.  
- **Advanced Libraries**: **PyTorch Lightning**, **Hugging Face Transformers/Accelerate**, **DeepSpeed** or **Megatron-LM** for parallelism, memory optimization, etc.

## <a id="distributed"></a>2. Distributed Training Approaches

### <a id="data-parallel"></a>2.1 Data Parallelism

- **Idea**: Replicate the model on each GPU, split the dataset (or mini-batch) across GPUs.  
- **Parameter sync**: after each GPU computes gradients on its own slice of the batch, an all-reduce operation averages them so every replica applies the same update and the weights stay consistent.
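
To make the synchronization step concrete, here is a minimal single-process sketch that simulates two data-parallel workers on a toy one-parameter model. The helper names `local_gradient` and `all_reduce_mean` are illustrative and not part of any library; a real job would rely on `torch.distributed` rather than plain Python lists.

```python
# Toy simulation of data parallelism for the 1-D model y_hat = w * x
# with mean-squared-error loss. No GPUs or torch.distributed involved.

def local_gradient(w, shard):
    """Gradient of the mean squared error over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages a value across workers."""
    return sum(values) / len(values)

# A mini-batch of (x, y) pairs and the current weight, replicated on every worker.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
world_size = 2

# Each worker takes a strided, equally sized slice of the mini-batch.
shards = [batch[rank::world_size] for rank in range(world_size)]

# Workers compute gradients independently, then average them.
synced_grad = all_reduce_mean([local_gradient(w, s) for s in shards])

# With equally sized shards, this matches the gradient on the full mini-batch.
full_grad = local_gradient(w, batch)
print(synced_grad, full_grad)
```

Because every worker ends up with the same averaged gradient, every replica takes the same optimizer step, which is why the weights never drift apart.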

**Pros**:
- Easiest approach to scale if the model fits in a single GPU.  
- Well-supported by frameworks (`torch.nn.parallel.DistributedDataParallel`, the older single-process `torch.nn.DataParallel`, or Lightning's built-in strategies).

**Cons**:
- Model size is limited by single-GPU memory.  
- Increasing the number of GPUs mostly helps handle bigger batch sizes or speed up the training, but doesn’t solve “model too big for one GPU” problems.



### <a id="model-parallel"></a>2.2 Model Parallelism / Pipeline Parallelism

When the model **exceeds** a single GPU’s memory:

1. **Model Parallelism**: split model layers or parameters across multiple GPUs.  
   - *Tensor Parallelism*: e.g., Megatron-LM splits large matrix multiplications across GPUs.  
   - *Layer Parallelism*: each GPU holds a subset of layers.

2. **Pipeline Parallelism**: chain layers in a “pipeline” across GPUs:
   - GPU0: processes micro-batch 1 on first stage (layers 1–N).
   - GPU1: processes micro-batch 1 on next stage, while GPU0 processes micro-batch 2.
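
As a rough illustration of the tensor-parallel idea, the sketch below splits a weight matrix column-wise across two simulated devices and checks that concatenating the partial results reproduces the full product. It uses plain Python lists instead of tensors, and the helper names are hypothetical; it is not how Megatron-LM is implemented, only the underlying arithmetic.

```python
# Toy column-parallel matrix multiply: Y = X @ W, with W's columns split
# across two simulated devices. Pure Python lists stand in for tensors.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for a_row in a]

def split_columns(w, parts):
    """Split w into `parts` column blocks of equal width."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

def concat_columns(blocks):
    """Concatenate per-device outputs along the column dimension (an all-gather)."""
    return [sum((block[i] for block in blocks), []) for i in range(len(blocks[0]))]

x = [[1.0, 2.0], [3.0, 4.0]]                        # activations, replicated
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]    # full weight (2 x 4)

w_shards = split_columns(w, parts=2)                # each device stores half of W
partial_outputs = [matmul(x, shard) for shard in w_shards]
y_parallel = concat_columns(partial_outputs)

print(y_parallel == matmul(x, w))
```

Each simulated device only ever holds half of `w`, which is the memory saving; the cost is the communication needed to gather the partial outputs.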

**Pros**:
- Overcomes single-GPU memory constraints.  
- Pipeline approach can keep multiple GPUs busy simultaneously.

**Cons**:
- More complicated to implement than data parallel.  
- Potential for “bubble” inefficiencies: GPUs sit idle while the pipeline fills at the start of a batch and drains at the end.
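
The bubble cost can be estimated with a small simulation. The sketch below assumes a simplified forward-only schedule in which every stage takes one time step per micro-batch; real schedules also interleave backward passes, so treat the numbers as illustrative. The function names are hypothetical.

```python
from fractions import Fraction

def pipeline_schedule(num_stages, num_micro_batches):
    """Return grid[stage][t] = micro-batch index processed at time t, or None if idle.

    Assumes a simple forward-only schedule where every stage takes one time
    step per micro-batch and stage s can start micro-batch m at time s + m.
    """
    total_steps = num_stages + num_micro_batches - 1
    grid = [[None] * total_steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_micro_batches):
            grid[stage][stage + mb] = mb
    return grid

def bubble_fraction(grid):
    """Fraction of (stage, time) slots in which a device sits idle."""
    slots = [cell for row in grid for cell in row]
    return Fraction(slots.count(None), len(slots))

# More micro-batches per batch shrink the idle fraction for a fixed number of stages.
for m in (1, 4, 16):
    grid = pipeline_schedule(num_stages=4, num_micro_batches=m)
    print(m, bubble_fraction(grid))
```

Under these assumptions the idle fraction works out to (p - 1) / (m + p - 1) for p stages and m micro-batches, which is why pipeline-parallel setups split each batch into many micro-batches.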


### <a id="sharding"></a>2.3 Sharding (ZeRO, Megatron, etc.)

**Sharding** further splits not just layers but also **optimizer states, gradients, parameters** across GPUs. Popular frameworks:

- **DeepSpeed ZeRO**: partitions optimizer states, gradients, and parameters across multiple processes.  
- **Megatron-LM**: focuses on **tensor parallel** approaches for large matrix multiplications.  
- **Fully Sharded Data Parallel (FSDP)** in PyTorch: shards parameters, gradients, and optimizer states across ranks, and gathers the full parameters of a layer only while that layer is being computed.

**Benefit**: Efficient usage of memory, allowing extremely large models to train.
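
A back-of-the-envelope estimate shows why sharding matters. The sketch below assumes mixed-precision Adam, which is commonly accounted as 16 bytes of model state per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 master weights, momentum, and variance), and ignores activations entirely. The function name and the 1.5B-parameter model are hypothetical.

```python
def zero_bytes_per_param(stage, n_gpus):
    """Approximate model-state bytes per parameter on each GPU.

    Assumes mixed-precision Adam: 2 bytes (FP16 weights) + 2 bytes (FP16
    gradients) + 12 bytes (FP32 master weights, momentum, variance).
    Activations, buffers, and fragmentation are ignored.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:          # ZeRO-1: shard optimizer states
        optim /= n_gpus
    if stage >= 2:          # ZeRO-2: also shard gradients
        grads /= n_gpus
    if stage >= 3:          # ZeRO-3: also shard the weights themselves
        weights /= n_gpus
    return weights + grads + optim

n_params = 1.5e9            # hypothetical 1.5B-parameter model
for stage in range(4):
    gb = zero_bytes_per_param(stage, n_gpus=8) * n_params / 1e9
    print(f"ZeRO stage {stage}: ~{gb:.1f} GB of model states per GPU")
```

Stage 0 here corresponds to plain data parallelism, where every GPU holds a full copy of all model states.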

## <a id="mixed-precision"></a>3. Mixed Precision Training

Large models can benefit greatly from **FP16** or **BF16**:

- **Faster** computations using GPU tensor cores.
- **Less** memory usage for activations and gradients.

### <a id="fp16-bf16"></a>3.1 FP16, BF16, and FP8 Overview

1. **FP16 (Half Precision)**:
   - 16-bit floating point.  
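   - Narrow dynamic range: small gradients can underflow to zero and large values can overflow, so loss (gradient) scaling is usually required.
2. **BF16 (bfloat16, "Brain Floating Point")**:
   - 16-bit, but with the same 8-bit exponent as FP32 and fewer mantissa bits, so it covers a far wider range than FP16 at lower precision.  
   - Often doesn’t require explicit gradient scaling.  
   - Widely used on Google TPUs and on newer GPUs (e.g., NVIDIA Ampere and later).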
3. **FP8**:
   - 8-bit float.  
   - Experimental, used in cutting-edge research for even more memory savings, with certain hardware (e.g., Hopper GPUs).
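
The range difference between FP16 and BF16 can be checked without a GPU. The sketch below uses Python's `struct` module for FP16 and approximates BF16 by truncating the FP32 encoding to its top 16 bits (hardware typically rounds instead, so the BF16 values here are only approximate). The helper names `to_fp16` and `to_bf16` are made up for this example.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):   # value exceeds the FP16 maximum (65504)
        return float("inf")

def to_bf16(x):
    """Approximate BF16 by keeping the top 16 bits of the FP32 encoding.

    Real hardware usually rounds to nearest; truncation is used here for brevity.
    """
    top_bits = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_bits + b"\x00\x00")[0]

tiny_grad = 1e-8      # a small gradient value
big_value = 70000.0   # a large activation or loss value

print(to_fp16(tiny_grad), to_bf16(tiny_grad))   # FP16 flushes to zero, BF16 does not
print(to_fp16(big_value), to_bf16(big_value))   # FP16 overflows, BF16 stays finite

# Loss scaling: multiply before the FP16 cast, divide afterwards in higher precision.
scale = 65536.0
recovered = to_fp16(tiny_grad * scale) / scale
print(recovered)
```

The last lines show the idea behind gradient scaling: a value that FP16 would otherwise lose survives the cast once it has been multiplied by a large constant.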


### <a id="mp-implementation"></a>3.2 Practical Implementation in PyTorch/Hugging Face

**PyTorch**:

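```python
import torch

# Assumes `model`, `optimizer`, `criterion`, and `dataloader` are already defined
# and that the model and batches live on a CUDA device.
# Newer PyTorch releases also expose the scaler as torch.amp.GradScaler("cuda").
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for FP16

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    # Run the forward pass in mixed precision
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x_batch)
        loss = criterion(logits, y_batch)
    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow
    scaler.step(optimizer)         # unscale gradients; skip the step if they contain inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration
```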
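
To show what dynamic loss scaling does conceptually, here is a deliberately simplified pure-Python sketch. `ToyLossScaler` is a hypothetical class, not PyTorch's implementation; its default values mirror the documented `GradScaler` defaults (initial scale 65536, growth factor 2, backoff factor 0.5, growth interval 2000), but the bookkeeping is reduced to the bare minimum.

```python
import math

class ToyLossScaler:
    """Simplified dynamic loss scaling; not PyTorch's actual implementation."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: skip the optimizer step and shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        self._good_steps += 1
        unscaled = [g / self.scale for g in scaled_grads]
        if self._good_steps == self.growth_interval:
            # A long run without overflow: try a larger scale.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return unscaled

scaler = ToyLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.step([float("inf"), 1.0]), scaler.scale)  # skipped, scale halves
print(scaler.step([4.0, 8.0]), scaler.scale)           # unscaled with scale 4.0
print(scaler.step([4.0, 8.0]), scaler.scale)           # second clean step, scale doubles
```

The small `init_scale` and `growth_interval` in the usage lines are chosen only so the behavior is visible within three steps.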

**Hugging Face Transformers**:
- Use `Trainer` with `fp16=True` or `bf16=True` in the training arguments.

**PyTorch Lightning**:

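```python
import lightning.pytorch as pl

trainer = pl.Trainer(
    precision="16-mixed",  # FP16 mixed precision (use "bf16-mixed" for BF16)
    accelerator="gpu",
    devices=2,             # data parallel on 2 GPUs
    # ... other Trainer arguments
)
```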


## <a id="finetuning"></a>4. Fine-Tuning on Downstream Tasks

### <a id="task-adapt"></a>4.1 Task-Adaptive Pretraining vs. Fine-Tuning

1. **Task-Adaptive Pretraining**:  
   - You continue large-scale language modeling on domain-specific data (e.g., biomedical text) to adapt your general LLM.  
2. **Fine-Tuning**:  
   - On a specific supervised dataset (classification, QA, summarization), you update weights while focusing on the new objective.



### <a id="hf-transformers"></a>4.2 Using Hugging Face Transformers

Hugging Face provides:

- **Pretrained models**: GPT, BERT, T5, etc.
- **Trainer API** for easy fine-tuning with built-in features (mixed precision, distributed, etc.).

**Example** (fine-tuning a GPT-2 on domain text):

```python
# %pip install -q transformers accelerate

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Suppose we have a custom dataset (your domain text)...
# This dummy dataset only exercises the training loop. For real data, note that
# Hugging Face causal-LM models shift the labels internally, so labels are
# normally a copy of input_ids rather than a pre-shifted sequence.
class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 100
    def __getitem__(self, idx):
        return {
            "input_ids": torch.arange(idx, idx+42),
            "label_ids": torch.arange(idx+1, idx+43),  # the default collator maps this key to `labels`
        }
train_dataset = DummyDataset()

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    fp16=True,                        # Mixed precision (requires a CUDA GPU)
    num_train_epochs=3,
    logging_steps=50,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your custom dataset
    # eval_dataset=valid_dataset,   # optional
)

trainer.train()
```


Output:

```
TrainOutput(global_step=36, training_loss=2.8256507449679904, metrics={'train_runtime': 3.7269, 'train_samples_per_second': 80.496, 'train_steps_per_second': 9.66, 'total_flos': 6001551360000.0, 'train_loss': 2.8256507449679904, 'epoch': 2.8})
```

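**Highlights**:
- Setting `fp16=True` enables FP16 mixed-precision training (it requires a CUDA GPU); use `bf16=True` on hardware that supports BF16.  
- `gradient_accumulation_steps` increases the effective batch size when memory is limited: here each optimizer update sees 2 × 4 = 8 samples per device.  
- Under the hood, `Trainer` relies on **Accelerate** to handle multi-GPU or multi-node setups.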
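
The batch-size arithmetic behind gradient accumulation can be written down directly. The helpers below are hypothetical and assume that an incomplete accumulation window at the end of an epoch is dropped; how a given `Trainer` version treats such leftovers may differ, so use this only as a rough estimate.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def optimizer_updates(num_samples, per_device_batch, accumulation_steps,
                      epochs, num_devices=1):
    """Rough count of optimizer updates over a whole run.

    Assumption: leftover micro-batches that do not fill an accumulation
    window are dropped at the end of each epoch.
    """
    micro_batches = math.ceil(num_samples / (per_device_batch * num_devices))
    updates_per_epoch = max(micro_batches // accumulation_steps, 1)
    return updates_per_epoch * epochs

# Settings from the example above: 100 samples, batch size 2, 4 accumulation steps, 3 epochs.
print(effective_batch_size(2, 4))
print(optimizer_updates(100, 2, 4, epochs=3))
```

With the example's settings this estimate gives 36 updates, which is consistent with the `global_step=36` reported in the training output.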



### <a id="lightning-example"></a>4.3 Example: Fine-Tuning GPT-2 with PyTorch Lightning

**Benefits** of Lightning:
- Built-in support for multi-GPU, checkpointing, logging, etc.
- `precision="16-mixed"` (or the older `precision=16`) enables automatic mixed precision on GPUs.

**PyTorch Lightning** abstracts away boilerplate training code:

```python
# %pip install lightning transformers

import torch
import lightning.pytorch as pl
from transformers import AutoModelForCausalLM

class GPT2FineTuner(pl.LightningModule):
    def __init__(self, model_name="gpt2", lr=1e-4):
        super().__init__()
        self.save_hyperparameters()
        # from_pretrained returns the model in eval mode (dropout disabled);
        # call self.model.train() here if dropout should be active during fine-tuning.
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.lr = lr

    def forward(self, input_ids, label_ids=None, attention_mask=None):
        if label_ids is None:
            label_ids = input_ids
        return self.model(input_ids, attention_mask=attention_mask, labels=label_ids)

    def training_step(self, batch, batch_idx):
        output = self(**batch)  # the model returns a cross-entropy loss when labels are provided
        self.log("train_loss", output.loss)
        return output.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Then define a DataModule or DataLoader for your domain dataset
# (DummyDataset is the toy dataset defined in Section 4.2)
class DummyDataModule(pl.LightningDataModule):
    # https://lightning.ai/docs/pytorch/stable/data/datamodule.html#what-is-a-datamodule
    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size
    def train_dataloader(self):
        return torch.utils.data.DataLoader(DummyDataset(), batch_size=self.batch_size)

domain_data_module = DummyDataModule()

trainer = pl.Trainer(
    max_epochs=3,
    precision="16-mixed",  # FP16 mixed precision
    devices=2 if torch.cuda.device_count() >= 2 else 1,  # multi-GPU if available
    gradient_clip_val=1.0,
)
model_module = GPT2FineTuner()
trainer.fit(model_module, domain_data_module)
```

Output:

```
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type            | Params | Mode
-------------------------------------------------
0 | model | GPT2LMHeadModel | 124 M  | eval
-------------------------------------------------
124 M     Trainable params
0         Non-trainable params
124 M     Total params
497.759   Total estimated model params size (MB)
0         Modules in train mode
164       Modules in eval mode

`Trainer.fit` stopped: `max_epochs=3` reached.
```


## <a id="pipeline"></a>5. Putting It All Together – A Scalable Pretraining/Fine-Tuning Pipeline

### Step-by-Step (Conceptual)

1. **Gather Large Corpus** (billions of tokens) => Shard it into multiple files for streaming.  
2. **Tokenizer**: Possibly BPE/WordPiece to build a subword vocabulary, or use a known pretrained tokenizer.  
3. **Initialize** your Transformer model from scratch or load a pretrained checkpoint (like GPT-2 medium).  
4. **Distributed Setup**:
   - `torchrun` (the successor to `torch.distributed.launch`) or the **Accelerate** CLI to spawn multiple workers.  
   - Decide if you need **data parallel** or **model/pipeline parallel** (for extremely large models).  
5. **Mixed Precision** to reduce memory usage: FP16 with gradient scaling, or BF16, which usually does not need it.  
6. **Train**: 
   - Use *gradient accumulation* if you can’t fit your desired batch size per device.  
   - Monitor *loss*, *perplexity*, etc.  
7. **Checkpointing**: Save model weights periodically, especially crucial in HPC environments.  
8. **Fine-Tuning**:
   - Switch to your domain or task data. Possibly freeze some layers or do full fine-tuning.  
   - Evaluate with appropriate metrics (e.g., perplexity, BLEU, ROUGE, etc.).  
9. **Inference**: 
   - For generation tasks, apply sampling strategies (temperature, top-k).  
   - Use high-level libraries (Transformers `pipeline` or custom code) for final deployment.
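
For the sampling strategies mentioned in the inference step, a minimal pure-Python sketch of temperature plus top-k sampling might look like this. The function `sample_next_token` and the four-element logit vector are hypothetical; production code would normally use the generation utilities of the library that serves the model.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits using temperature and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring candidates (all of them if top_k is None).
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k] if top_k else ranked
    # Temperature-scaled softmax weights over the surviving candidates.
    scaled = [logits[i] / temperature for i in candidates]
    max_scaled = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - max_scaled) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]        # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
print([sample_next_token(logits, temperature=0.7, top_k=2, rng=rng) for _ in range(10)])
print(sample_next_token(logits, top_k=1))   # top_k=1 reduces to greedy decoding
```

Lower temperatures concentrate probability on the highest-scoring tokens, while a smaller `top_k` removes unlikely tokens from consideration altogether.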

**Libraries** that ease the pipeline:
- **DeepSpeed** / **Megatron-LM**: advanced distributed/memory optimization.  
- **PyTorch Lightning**: structure your training loops with minimal boilerplate.  
- **Hugging Face Accelerate**: easily convert your single-GPU script to multi-GPU/multi-node.  


## <a id="exercises"></a>6. Practical Exercises – Part 2

1. **Scalable Pretraining Design (Conceptual)**  
   - Outline a training plan for a 1–2 billion parameter model on 8 GPUs.  
   - Specify how you’d shard the dataset, which parallelism approach you’d use, and how to incorporate mixed precision.

2. **Fine-Tuning GPT-2**  
   - Take a small domain dataset (could be a subset of domain-specific text).  
   - Fine-tune GPT-2 using either the **Hugging Face Trainer** or **PyTorch Lightning**.  
   - Evaluate the generated text vs. the original GPT-2 output.

3. **Experiment with Parallelism**  
   - (If you have multiple GPUs) attempt a minimal data parallel run using PyTorch Lightning or Accelerate.  
   - Observe the speed-up and memory usage changes.  
   - Discuss how you’d handle a bigger model that doesn’t fit on one GPU.

4. **Mixed Precision Pitfalls**  
   - Write a short note on potential issues with FP16 (e.g., overflow, underflow, gradient scaling).  
   - How might BF16 mitigate some of these problems?


## <a id="conclusion"></a>7. Conclusion

**Recap** of Session 4, Part 2:
- **Large-scale Transformer training** requires advanced **parallelism** (data, model, pipeline) and memory optimization.  
- **Mixed precision** (FP16, BF16) is a standard approach to speed up training and reduce memory usage.  
- **Fine-tuning** large pretrained models is often more efficient than training from scratch; *domain adaptation* and *task-specific heads* can drastically improve performance.
- **Third-party APIs**—like **Hugging Face Transformers/Accelerate** or **PyTorch Lightning**—provide robust solutions for distributed training, checkpointing, logging, and more, saving developers from writing low-level distributed code.

**Key Takeaways**:
- Deciding on a parallel strategy (data parallel vs. model parallel vs. pipeline parallel) depends on model size and hardware resources.  
- Tools such as **Accelerate** and **Lightning** simplify multi-GPU setups and mixed-precision usage.  
- Fine-tuning modern LLMs can yield strong results on domain tasks without incurring massive training costs from scratch.

**Next Steps**:
- Explore advanced optimization libraries like **DeepSpeed** or **FSDP** in PyTorch for extreme model scales.  
- Try actual HPC or cloud environments to test your pipeline.  
- Investigate advanced topics like **sparsity**, **quantization**, or **RLHF** for large language models.
