# InstructLab Multi-Phase Training Hardware Configurations

This notebook contains hardware-specific parameter configurations for use with the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb).

**Model**: These configurations are tuned for `granite-3.1-8b-starter-v2.1`.

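Each configuration below lists tested values for a specific GPU setup:
- `max_tokens_per_gpu`: Maximum number of tokens placed on each GPU per batch; lower it to prevent OOM errors
- `max_seq_len`: Maximum length, in tokens, of a single training sample; keep it below `max_tokens_per_gpu`
- `nproc_per_node`: Number of GPUs (one process per GPU) per node for distributed training
- `cpu_offload_params`: Whether FSDP offloads model parameters to CPU memory to reduce GPU memory usage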

**Usage**: Copy the appropriate configuration parameters from the sections below into your training script based on your available hardware.

## FSDP Configuration Reference

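The `cpu_offload_params` option controls whether FSDP offloads model parameters to CPU memory. Offloading reduces GPU memory usage at the cost of slower training, so the configurations below enable it mostly on the smaller setups: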

```python
from instructlab.training import FSDPOptions

fsdp_options = FSDPOptions(
    cpu_offload_params=True  # Boolean: True to enable CPU offloading, False to disable
)
```
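To give a feel for why setups with fewer GPUs tend to need offloading, here is a rough back-of-the-envelope sketch of the per-GPU parameter footprint when weights are sharded evenly across GPUs. It assumes roughly 8 billion parameters stored in 2-byte bf16 and ignores gradients, optimizer state, and activations, so the numbers are illustrative only. `sharded_param_gib` is a hypothetical helper, not part of any library.

```python
def sharded_param_gib(num_params: float, num_gpus: int, bytes_per_param: int = 2) -> float:
    """Approximate GiB of parameter memory held by each GPU under even sharding."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 2**30


# Assumption: ~8e9 parameters in bf16 (2 bytes each).
for n in (1, 2, 4, 8):
    print(f"{n} GPU(s): ~{sharded_param_gib(8e9, n):.1f} GiB of parameters per GPU")
```

Parameters are only one part of the memory budget, so treat this as a lower bound rather than a sizing tool.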

## Usage Example

Here's how to use these configurations in the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb):

1. **Choose your hardware configuration** from the sections below based on your available GPUs
2. **Copy the configuration values** into your training script
3. **Apply them to your training configuration**

```python
# Example: Using H100 4x GPU configuration

# Step 1: Set hardware-specific parameters from this notebook
max_tokens_per_gpu = 30000
max_seq_len = 27000
nproc_per_node = 4
cpu_offload_params = False

# Step 2: Use with training_hub's sft function
from training_hub import sft
from instructlab.training import FSDPOptions

# Configure FSDP options
fsdp_options = FSDPOptions(
    cpu_offload_params=cpu_offload_params
)

# Step 3: Apply to your LAB training phases
# Phase 1: Knowledge Tuning (Phase07)
phase07_result = sft(
    model_path="/path/to/base/model",
    data_path="/path/to/knowledge_data.jsonl",
    ckpt_output_dir="/path/to/phase07_checkpoints",
    max_tokens_per_gpu=max_tokens_per_gpu,
    max_seq_len=max_seq_len,
    nproc_per_node=nproc_per_node,  # one training process per GPU on this node
    fsdp_options=fsdp_options,
    num_epochs=7,
)

# Phase 2: Skills + Replay Training (Phase10)
# Assumes the Phase07 result exposes `checkpoint_path`; if it does not,
# pass the checkpoint directory written under the Phase07 `ckpt_output_dir`.
phase10_result = sft(
    model_path=phase07_result.checkpoint_path,  # start from the Phase07 output
    data_path="/path/to/skills_plus_replay_data.jsonl",
    ckpt_output_dir="/path/to/phase10_checkpoints",
    max_tokens_per_gpu=max_tokens_per_gpu,
    max_seq_len=max_seq_len,
    nproc_per_node=nproc_per_node,  # one training process per GPU on this node
    fsdp_options=fsdp_options,
    num_epochs=7,
)
```

**Note**: These values have been tested and optimized specifically for Granite starter models and LAB multiphase datasets. You may need to adjust these parameters for different student models and datasets.
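When adjusting these values, it helps to keep the constraint from the configuration comments in view: `max_seq_len` should stay below `max_tokens_per_gpu`. The sketch below bundles the four settings into a small dataclass that checks this on construction. `HardwareConfig` is a hypothetical helper introduced here for illustration; it is not part of `training_hub` or `instructlab.training`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareConfig:
    """Hypothetical container for the four hardware-specific settings."""

    max_tokens_per_gpu: int
    max_seq_len: int
    nproc_per_node: int
    cpu_offload_params: bool

    def __post_init__(self) -> None:
        if self.nproc_per_node < 1:
            raise ValueError("nproc_per_node must be at least 1")
        if self.max_tokens_per_gpu <= 0:
            raise ValueError("max_tokens_per_gpu must be positive")
        # A single sample has to fit within the per-GPU token budget.
        if not 0 < self.max_seq_len < self.max_tokens_per_gpu:
            raise ValueError("max_seq_len must be positive and below max_tokens_per_gpu")


# Values taken from the H100 4x GPU configuration in this notebook.
h100_4x = HardwareConfig(
    max_tokens_per_gpu=30000,
    max_seq_len=27000,
    nproc_per_node=4,
    cpu_offload_params=False,
)
```

Constructing a config whose `max_seq_len` exceeds `max_tokens_per_gpu` raises a `ValueError` before any training job is launched.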

## H200 Configurations

### H200 8x GPU Configuration

```python
# H200 8x GPU Configuration
max_tokens_per_gpu = 85000
max_seq_len = 80000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 8

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### H200 4x GPU Configuration

```python
# H200 4x GPU Configuration
max_tokens_per_gpu = 75000
max_seq_len = 70000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 4

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### H200 2x GPU Configuration

```python
# H200 2x GPU Configuration
max_tokens_per_gpu = 45000
max_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 2

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### H200 1x GPU Configuration

```python
# H200 1x GPU Configuration
max_tokens_per_gpu = 45000
max_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 1

# FSDP CPU offloading configuration
cpu_offload_params = True
```

## H100 Configurations

### H100 8x GPU Configuration

```python
# H100 8x GPU Configuration
max_tokens_per_gpu = 45000
max_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 8

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### H100 4x GPU Configuration

```python
# H100 4x GPU Configuration
max_tokens_per_gpu = 30000
max_seq_len = 27000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 4

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### H100 2x GPU Configuration

```python
# H100 2x GPU Configuration
max_tokens_per_gpu = 25000
max_seq_len = 22000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 2

# FSDP CPU offloading configuration
cpu_offload_params = True
```

### H100 1x GPU Configuration

```python
# H100 1x GPU Configuration
max_tokens_per_gpu = 0  # TO BE FILLED
nproc_per_node = 1

# FSDP CPU offloading configuration
cpu_offload_params = False  # TO BE FILLED - True to enable CPU offloading, False to disable
```

## A100 80GB Configurations

### A100 80GB 8x GPU Configuration

```python
# A100 80GB 8x GPU Configuration
max_tokens_per_gpu = 45000
max_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 8

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### A100 80GB 4x GPU Configuration

```python
# A100 80GB 4x GPU Configuration
max_tokens_per_gpu = 30000
max_seq_len = 27000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 4

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### A100 80GB 2x GPU Configuration

```python
# A100 80GB 2x GPU Configuration
max_tokens_per_gpu = 25000
max_seq_len = 22000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 2

# FSDP CPU offloading configuration
cpu_offload_params = True
```

## A100 40GB Configurations

### A100 40GB 8x GPU Configuration

```python
# A100 40GB 8x GPU Configuration
max_tokens_per_gpu = 15000
max_seq_len = 13000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 8

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### A100 40GB 4x GPU Configuration

```python
# A100 40GB 4x GPU Configuration
max_tokens_per_gpu = 30000
max_seq_len = 27000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 4

# FSDP CPU offloading configuration
cpu_offload_params = True
```

### A100 40GB 2x GPU Configuration

```python
# A100 40GB 2x GPU Configuration
max_tokens_per_gpu = 25000
max_seq_len = 22000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 2

# FSDP CPU offloading configuration
cpu_offload_params = True
```

## L40S Configurations

### L40S 8x GPU Configuration

```python
# L40S 8x GPU Configuration
max_tokens_per_gpu = 10000
max_seq_len = 8000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 8

# FSDP CPU offloading configuration
cpu_offload_params = False
```

### L40S 4x GPU Configuration

```python
# L40S 4x GPU Configuration
max_tokens_per_gpu = 8000
max_seq_len = 6000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 4

# FSDP CPU offloading configuration
cpu_offload_params = False
```

## L4 Configurations

### L4 8x GPU Configuration

```python
# L4 8x GPU Configuration
max_tokens_per_gpu = 8000
max_seq_len = 6000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu
nproc_per_node = 8

# FSDP CPU offloading configuration
cpu_offload_params = True
```
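If you switch between machines often, one option is to collect the values above into a single lookup table instead of copying them by hand. The following is a minimal sketch under that assumption: `CONFIGS` and `lookup` are hypothetical names, and only a subset of the configurations from this notebook is included for brevity.

```python
# Keys are (GPU model, GPU count); values mirror the cells in this notebook.
CONFIGS = {
    ("H200", 8): {"max_tokens_per_gpu": 85000, "max_seq_len": 80000, "cpu_offload_params": False},
    ("H100", 8): {"max_tokens_per_gpu": 45000, "max_seq_len": 40000, "cpu_offload_params": False},
    ("H100", 4): {"max_tokens_per_gpu": 30000, "max_seq_len": 27000, "cpu_offload_params": False},
    ("H100", 2): {"max_tokens_per_gpu": 25000, "max_seq_len": 22000, "cpu_offload_params": True},
    ("L40S", 4): {"max_tokens_per_gpu": 8000, "max_seq_len": 6000, "cpu_offload_params": False},
    ("L4", 8): {"max_tokens_per_gpu": 8000, "max_seq_len": 6000, "cpu_offload_params": True},
}


def lookup(gpu: str, count: int) -> dict:
    """Return a copy of the settings for a GPU model and count, plus nproc_per_node."""
    key = (gpu.upper(), count)
    if key not in CONFIGS:
        raise KeyError(f"No tested configuration for {count}x {gpu}")
    settings = dict(CONFIGS[key])  # copy so callers cannot mutate the table
    settings["nproc_per_node"] = count
    return settings


settings = lookup("h100", 4)
print(settings)
```

Raising on unknown combinations, rather than falling back to a default, avoids silently launching a job with values that were never tested on that hardware.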