# Exercise: Implementing and Tuning Pipeline Parallelism

Welcome to the exercise! In the demo, we saw how DeepSpeed partitions a simple `nn.Sequential` model. Now it's your turn to apply the same ideas to a more realistic model and analyze the performance trade-offs.

### Your Goal
Your mission is to:
1.  **Adapt** a non-trivial, realistic PyTorch model to make it compatible with DeepSpeed's automatic pipeline partitioning.
2.  **Implement** the baseline measurement and the DeepSpeed initialization call.
3.  **Design and run a series of experiments** to analyze the performance impact of `micro_batch_size`.
4.  **Analyze the results** to understand the concept of the "pipeline bubble" and its effect on throughput.

## 1. Environment Setup

First, let's install the necessary libraries and check our environment. This exercise requires at least two GPUs to properly evaluate pipeline parallelism.

In [None]:
!pip install deepspeed transformers

In [None]:
import torch
import torch.nn as nn
import deepspeed
import json

print(f"PyTorch version: {torch.__version__}")
print(f"DeepSpeed version: {deepspeed.__version__}")
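if torch.cuda.is_available():
    print(f"Number of GPUs available: {torch.cuda.device_count()}")
    if torch.cuda.device_count() < 2:
        print("\n!! WARNING: This exercise requires at least 2 GPUs. Performance comparison will not be meaningful otherwise. !!")
else:
    # Without CUDA the check above would print nothing, so warn explicitly.
    print("\n!! WARNING: CUDA is not available. This exercise requires at least 2 GPUs. !!")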

## 2. The "Realistic" Model for Our Exercise

Below is the model we will be working with. Notice that the `transformer_blocks` are stored in an `nn.ModuleList`. This is a common PyTorch pattern, but an `nn.ModuleList` is only a container: it defines no `forward` method, so it cannot be called as a layer. That poses a challenge for DeepSpeed's `"uniform"` partitioner, which often treats the entire `ModuleList` as a single, indivisible block.

In [None]:
# This is a mock transformer block for demonstration purposes.
class MockTransformerBlock(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.layer1 = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size)
    
    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))

class RealisticModel(nn.Module):
    def __init__(self, hidden_size=2048, num_layers=30, vocab_size=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.transformer_blocks = nn.ModuleList(
            [MockTransformerBlock(hidden_size) for _ in range(num_layers)]
        )
        self.output_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        for block in self.transformer_blocks:
            x = block(x)
        x = self.output_head(x)
        return x

print("Model structure defined.")
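
The partitioning concern described above can be illustrated with a simplified, pure-Python sketch. This is not DeepSpeed's actual partitioner: `uniform_partition` and the layer names are hypothetical, and the only assumption is that a partitioner can split between the items it is handed, never inside one.

```python
def uniform_partition(items, num_stages):
    """Split `items` into `num_stages` contiguous chunks of near-equal length."""
    base, extra = divmod(len(items), num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional item.
        size = base + (1 if stage < extra else 0)
        stages.append(items[start:start + size])
        start += size
    return stages

num_layers = 30
# View 1: the block container is a single opaque item.
opaque_view = ["embedding", "transformer_blocks (30 blocks)", "output_head"]
# View 2: every block is exposed as its own item.
flat_view = ["embedding"] + [f"block_{i}" for i in range(num_layers)] + ["output_head"]

opaque_stages = uniform_partition(opaque_view, 2)
flat_stages = uniform_partition(flat_view, 2)

print([len(s) for s in opaque_stages])  # [2, 1]
print([len(s) for s in flat_stages])    # [16, 16]
```

In the opaque view, the first stage ends up holding all 30 blocks while the second holds only the output head, so the split is balanced by item count but not by work.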

## 3. Your Task: Create the Analysis Script

Now, you will create the Python script that DeepSpeed will launch. This involves three key implementation tasks, marked with `TODO` comments inside the script.

In [None]:
%%writefile exercise_starter.py

import torch
import torch.nn as nn
import deepspeed
import argparse
import time

# --- Model Definition ---
class MockTransformerBlock(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.layer1 = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size)
    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))

class RealisticModel(nn.Module):
    def __init__(self, hidden_size=2048, num_layers=30, vocab_size=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        
        # ## TODO: STUDENT TASK 1: Fix the Model Architecture ##
        # DeepSpeed's "uniform" partitioner treats `nn.ModuleList` as a single block.
        # To allow DeepSpeed to split the transformer blocks, you must use a different container.
        # HINT: What container presents a sequence of layers that can be called like a single function?
        # Replace the line below with the correct implementation.
        self.transformer_blocks = nn.ModuleList(
            [MockTransformerBlock(hidden_size) for _ in range(num_layers)]
        )
        
        self.output_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        # HINT: If you changed the container for `transformer_blocks`,
        # you may need to change how you call it here.
        for block in self.transformer_blocks:
            x = block(x)
        x = self.output_head(x)
        return x

# --- Helper Function for Performance Measurement ---
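def measure_throughput(model, dummy_input, iterations):
    """Return forward-pass throughput in samples/sec for `model` on `dummy_input`."""
    # Inference only: disable autograd for both the warm-up and the timed runs.
    with torch.no_grad():
        # Warm-up so one-time initialization costs are excluded from the timing.
        for _ in range(5):
            _ = model(dummy_input)
        torch.cuda.synchronize()

        start_time = time.perf_counter()
        for _ in range(iterations):
            _ = model(dummy_input)
        # CUDA kernels run asynchronously; wait for them before stopping the clock.
        torch.cuda.synchronize()
        end_time = time.perf_counter()

    total_samples = dummy_input.size(0) * iterations
    duration = end_time - start_time
    throughput = total_samples / duration
    return throughput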

# --- Main Execution Logic ---
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    # --- Setup ---
    global_batch_size = 64
    hidden_size = 2048
    vocab_size = 1000
    iterations = 20
    is_rank_0 = args.local_rank <= 0

    # --- Baseline Measurement (Single GPU) ---
    if is_rank_0:
        print("\n--- Measuring Baseline Performance (Single GPU) ---", flush=True)
        # ## TODO: STUDENT TASK 2: Implement the Baseline Measurement ##
        # 1. Instantiate the `RealisticModel`.
        # 2. Move the model to the first GPU ('cuda:0').
        # 3. Create a dummy input tensor on the same device.
        # 4. Call `measure_throughput` to get the baseline performance.
        # 5. Print the result.
        
        # baseline_model = ...
        # dummy_input = ...
        # baseline_throughput = ...
        # print(f"Baseline Throughput: {baseline_throughput:.2f} samples/sec", flush=True)
        print("Baseline measurement not yet implemented.") # Remove this line
        
        print("--------------------------------------------------\n", flush=True)

    if torch.distributed.is_initialized():
        torch.distributed.barrier()

    # --- DeepSpeed Pipeline Parallelism ---
    print(f"\n--- Rank {args.local_rank}: Setting up DeepSpeed Pipeline ---", flush=True)

    ds_model = RealisticModel()
    
    # ## TODO: STUDENT TASK 3: Initialize the DeepSpeed Engine ##
    # Call `deepspeed.initialize` with the correct arguments to create the `model_engine`.
    # The engine will wrap your model and handle the pipeline parallelism.
    
    # model_engine, _, _, _ = deepspeed.initialize(...)
    print("DeepSpeed engine not yet initialized.") # Remove this line
    model_engine = ds_model # Replace this line with the initializer call
    model_engine.device = 'cpu' # Replace this line
    model_engine.is_last_stage = lambda: True # Replace this line
    
    # This part will not give meaningful results until the model_engine is correctly initialized.
    dummy_input_ds = torch.randint(0, vocab_size, (global_batch_size, 128), device=model_engine.device)

    pipelined_throughput = measure_throughput(model_engine, dummy_input_ds, iterations)

    if model_engine.is_last_stage():
        print(f"\n--- Results on Last Stage (Rank {args.local_rank}) ---", flush=True)
        print(f"Pipelined Throughput: {pipelined_throughput:.2f} samples/sec", flush=True)
        print("------------------------------------------\n", flush=True)

if __name__ == "__main__":
    main()

## 4. Your Task: Set Up the Experiment Configurations

A key part of performance tuning is designing the experiments. Here, you will define the different `micro_batch_size` configurations you want to test. This will require you to think about the relationship between the global batch size, the number of GPUs (pipeline stages), and the micro-batch size.
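
That relationship can be checked with a little arithmetic before you pick values. The sketch below assumes a single data-parallel replica and a GPipe-style schedule, for which the idle ("bubble") fraction is commonly estimated as (p - 1) / (m + p - 1) with p stages and m micro-batches. The helper name `pipeline_stats` is hypothetical, and the estimate ignores communication cost and how efficiently each GPU processes a given micro-batch size.

```python
def pipeline_stats(global_batch_size, micro_batch_size, num_stages):
    """Return (number of micro-batches, estimated bubble fraction)."""
    if global_batch_size % micro_batch_size != 0:
        raise ValueError("micro_batch_size must divide global_batch_size evenly")
    num_micro_batches = global_batch_size // micro_batch_size
    # GPipe-style estimate: (p - 1) idle slots out of (m + p - 1) total slots.
    bubble = (num_stages - 1) / (num_micro_batches + num_stages - 1)
    return num_micro_batches, bubble

for mbs in (4, 8, 16, 32):
    m, bubble = pipeline_stats(64, mbs, 2)
    print(f"mbs={mbs:>2}  micro-batches={m:>2}  estimated bubble={bubble:.1%}")
```

The bubble estimate is only one side of the trade-off: smaller micro-batches shrink the bubble, but they also give each GPU less work per step, which is why the measured throughput has to be checked experimentally.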

In [None]:
import json

# The global batch size is 64, and we will use 2 GPUs (2 stages).
GLOBAL_BATCH_SIZE = 64
NUM_STAGES = 2

# ## TODO: STUDENT TASK 4: Define the Experiment Configurations ##
# Populate the `configs` dictionary below.
# Create 4 configurations to test with different micro_batch_size values:
# 1. A very small size (e.g., 4)
# 2. A medium size (e.g., 8)
# 3. A larger size (e.g., 16)
# 4. A size equal to the per-stage batch size (GLOBAL_BATCH_SIZE / NUM_STAGES = 32)
configs = {
    # e.g., "ds_config_mbs_4.json": { "micro_batch_size": 4, "stages": NUM_STAGES },
}

# --- This part is complete --- #
# Base config template
base_config = {
    "train_batch_size": GLOBAL_BATCH_SIZE,
    "optimizer": { "type": "Adam", "params": { "lr": 0.001 } },
    "fp16": { "enabled": True }
}

for filename, pipeline_config in configs.items():
    full_config = base_config.copy()
    full_config["pipeline"] = pipeline_config
    with open(filename, 'w') as f:
        json.dump(full_config, f, indent=2)
    print(f"Created config file: {filename}")

## 5. Run the Experiments

Once you have completed all the `TODO`s above, run the cell below to execute your experiments. It loops through the configuration files you created and launches a DeepSpeed job for each one.

In [None]:
if not configs:
    print("Please complete 'STUDENT TASK 4' in the cell above before running the experiments.")
else:
    for config_file in configs.keys():
        print(f"\n\n{'='*60}")
        print(f"RUNNING EXPERIMENT WITH CONFIG: {config_file}")
        print(f"{'='*60}\n")
        # We reduce log verbosity for a cleaner output
        !DS_LOG_LEVEL=ERROR PYTHONWARNINGS=ignore deepspeed --num_gpus {NUM_STAGES} exercise_starter.py --deepspeed_config {config_file}
        print("\n\n")
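
Outside a notebook, where the `!` shell syntax is unavailable, the same launch can be assembled as an argument list. This is a minimal sketch: `build_launch_command` is a hypothetical helper, it only reuses the flags that appear in the cell above, and it builds the command without executing it.

```python
import shlex

def build_launch_command(config_file, num_gpus=2, script="exercise_starter.py"):
    """Return the argument list for one DeepSpeed launch (nothing is executed here)."""
    return [
        "deepspeed",
        "--num_gpus", str(num_gpus),
        script,
        "--deepspeed_config", config_file,
    ]

cmd = build_launch_command("ds_config_mbs_4.json")
print(shlex.join(cmd))
# deepspeed --num_gpus 2 exercise_starter.py --deepspeed_config ds_config_mbs_4.json
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `deepspeed` launcher is installed.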

## 6. Analyze the Results

Your final task is to analyze the results of your experiments. Fill in the table below with the throughput numbers reported in the output of the previous cell.

| Micro Batch Size (`mbs`) | Pipelined Throughput (samples/sec) | Your Explanation of the Performance Trend |
|:---:|:---:|:---|
| 4   | **Fill this in** | *Explain why the throughput might be lower here.* |
| 8   | **Fill this in** | *Explain why the throughput is likely increasing.* |
| 16  | **Fill this in** | *Explain why this might be near the optimal performance.* |
| 32  | **Fill this in** | *Explain why the throughput might decrease again here.* |
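
If you would rather collect the numbers programmatically than copy them by hand, a small parser can pull them out of captured output. This is a sketch under assumptions: `parse_throughput` is a hypothetical helper, it relies on the log lines keeping the exact format printed by `exercise_starter.py`, and the sample text uses a placeholder value, not a measurement.

```python
import re

# Matches the "Baseline Throughput" and "Pipelined Throughput" lines printed by the script.
THROUGHPUT_RE = re.compile(
    r"(Baseline|Pipelined) Throughput: ([0-9]+(?:\.[0-9]+)?) samples/sec"
)

def parse_throughput(log_text):
    """Return a dict mapping 'Baseline' / 'Pipelined' to the throughput found in log_text."""
    return {kind: float(value) for kind, value in THROUGHPUT_RE.findall(log_text)}

# Placeholder line for illustration only; replace with your captured output.
sample_log = "Pipelined Throughput: 123.45 samples/sec\n"
result = parse_throughput(sample_log)
print(result)
```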