# Tutorial 1: Pretraining with Megatron-LM

This tutorial walks you through the setup and configuration required to pretrain large-scale language models such as **Llama2** and **Llama3** using AMD’s ROCm Megatron-LM framework.

## 1. Overview

The **ROCm Megatron-LM framework** is a specialized fork of Megatron-LM designed to train large-scale language models efficiently on AMD GPUs. By leveraging AMD **Instinct™ MI300X accelerators**, this framework offers:
- Enhanced scalability and performance for AI workloads.
- Support for state-of-the-art models such as **Llama2**, **Llama3**, and **Llama3.1**.

Key features include:
- **Transformer Engine (TE):** Optimized transformer layer implementations.
- **Flash Attention 2.0:** Faster and memory-efficient attention mechanisms.
- **3D Parallelism (TP + SP + CP):** Tensor, sequence, and context parallelism.
- **Fused Kernels:** For optimized training operations.
- **GEMM Tuning:** Automatically selects optimal matrix multiplication kernels.

**Pre-Optimized Models:**
- **Llama2:** 7B and 70B
- **Llama3 / Llama3.1:** 8B and 70B

> **See the [GitHub repository](https://github.com/ROCm/Megatron-LM) for more details.**

## 2. System Prerequisites

### 2.1 Disable NUMA Auto-Balancing
Disabling NUMA auto-balancing can improve application performance.
1. Check your current NUMA setting:
```bash
cat /proc/sys/kernel/numa_balancing
```
- `0`: Disabled
- `1`: Enabled
2. Disable NUMA auto-balancing if necessary:
```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```
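
The same check can be scripted from Python. The following is a minimal sketch, not part of the ROCm tooling: the helper name `numa_balancing_status` is hypothetical, and it interprets the value only as described above (`0` is disabled, any other number is treated as enabled).

```python
from pathlib import Path


def numa_balancing_status(path="/proc/sys/kernel/numa_balancing"):
    """Return 'disabled', 'enabled', or 'unknown' for the NUMA auto-balancing setting."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable (for example, NUMA balancing not available)
        return "unknown"
    if not value.isdigit():
        return "unknown"
    return "disabled" if int(value) == 0 else "enabled"


print("NUMA auto-balancing:", numa_balancing_status())
```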

### 2.2 Hardware Verification with ROCm
To ensure stable performance:
1. Set GPU clocks to a stable maximum of **1900 MHz**:
```bash
rocm-smi --setperfdeterminism 1900
```
2. To reset this setting to default:
```bash
rocm-smi -r
```

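### 2.3 Test Communication (Optional RCCL Bandwidth Test)
This step is optional. Before multi-node training, you can run an RCCL bandwidth test to confirm that GPU-to-GPU communication works as expected.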

## 3. Environment Setup

### 3.1 Download and Launch Docker
The Docker image provides all necessary dependencies, including **PyTorch**, **PyTorch Lightning**, **ROCm libraries**, and **Megatron-LM utilities**.

1. **Pull the Docker image**:
```bash
docker pull rocm/megatron-lm:24.12-dev
```
2. **Launch the container** (set `CACHE_DIR` to a host directory that will be mounted as `/root/.cache` inside the container):
```bash
docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host \
    --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
    --privileged -v $CACHE_DIR:/root/.cache --name megatron-dev-env rocm/megatron-lm:24.12-dev /bin/bash
```

### 3.2 Install Jupyter and Start the Server
After launching the Docker container, run the following commands inside the container to install Jupyter and start the notebook server:
```bash
pip install jupyter
jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
```
**Note:** Jupyter installation inside Docker is necessary to execute notebook cells. Save the token or URL provided in the terminal output to access the notebook from your host machine.

### 3.3 Clone and Configure Megatron-LM
Run the following command inside the Docker container to clone the Megatron-LM repository and check out the validated commit:

In [None]:
# Clone the Megatron-LM repository and check out the validated commit
!git clone https://github.com/ROCm/Megatron-LM && cd Megatron-LM && git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92

## 4. Preparing Training Dataset
The next tutorial uses mock data to keep things simple. This section shows how to preprocess your own data for training in case you don't already have a preprocessed dataset. We use the [BookCorpus dataset](https://huggingface.co/datasets/bookcorpus/bookcorpus), a collection of books that has been used for training language models. It contains diverse, continuous text passages, which makes it suitable for pretraining tasks.

In this tutorial, we will:

- Download and inspect the BookCorpus dataset.
- Convert it to a JSONL format.
- Download necessary tokenizer files (vocab.json and merges.txt).
- Preprocess the data for training a large language model with Megatron-LM.


### 4.1 Download and Inspect The BookCorpus Dataset

We are using the Hugging Face datasets library to download the BookCorpus dataset. This step ensures we have access to the raw data needed for preprocessing.

In [None]:
from datasets import load_dataset

# Load BookCorpus dataset
dataset = load_dataset("bookcorpus/bookcorpus", trust_remote_code=True, split="train")

# Inspect the dataset
print("Dataset Structure:", dataset)
print("Sample Data:", dataset[0])  # Access the first record

### 4.2 Convert to JSONL
Megatron-LM's preprocessing script requires the input to be in JSONL format, where each line represents a document as a JSON object. This step converts the dataset into the required format.

In [None]:
import json
from tqdm import tqdm  # Import tqdm for progress bar

output_file = "bookcorpus.jsonl"

# Open the output file
with open(output_file, "w") as f:
    # Use tqdm to display progress
    for record in tqdm(dataset, desc="Saving dataset to JSONL", unit="record"):
        json.dump({"text": record["text"]}, f)
        f.write("\n")

print(f"Dataset saved to {output_file}")

Before moving on, we ensure that the dataset is correctly converted to JSONL format and inspect the first few lines to confirm the structure.

In [None]:
# Inspect the first few lines of the JSONL file
with open(output_file, "r") as f:
    for i in range(5):  # Print the first 5 lines
        print(json.loads(f.readline()))
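
For a more systematic check than printing a few lines, the sketch below verifies that each line parses as a JSON object with a `text` field. The helper name `validate_jsonl` and the `max_lines` limit are illustrative choices, not part of Megatron-LM.

```python
import json


def validate_jsonl(path, key="text", max_lines=1000):
    """Check that the first `max_lines` lines are JSON objects containing `key`.

    Returns the number of lines checked; raises ValueError on the first bad line.
    """
    checked = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if checked >= max_lines:
                break
            # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON
            record = json.loads(line)
            if not isinstance(record, dict) or key not in record:
                raise ValueError(f"Line {line_number} has no '{key}' field")
            checked += 1
    return checked
```

For example, `validate_jsonl("bookcorpus.jsonl")` returns the number of lines it checked if the file is well formed.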

### 4.3 Download Tokenizer Files
Next, we choose a tokenizer for training. This example uses the GPT-2 Byte Pair Encoding (BPE) tokenizer. The preprocessing tool provided by Megatron-LM requires additional files for each tokenizer. In this case, we need the following two files, which define the tokenizer rules:
- `vocab.json`: Maps tokens to unique IDs.
- `merges.txt`: Specifies how subword units are combined into tokens.

We will download these files to tokenize the dataset correctly.
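
To make the roles of the two files concrete, here is a toy sketch of byte pair encoding. The merge rules and vocabulary below are made up for illustration and are far smaller than the real GPT-2 files. The real tokenizer also performs byte-level encoding and pre-tokenization, which this sketch omits.

```python
def bpe_encode(word, merges, vocab):
    """Apply merge rules in priority order, then map the resulting tokens to IDs."""
    symbols = list(word)
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule left
        # Merge every occurrence of the best pair, scanning left to right
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, [vocab[s] for s in symbols]


# Hypothetical stand-ins for merges.txt (ordered rules) and vocab.json (token -> ID)
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
toy_vocab = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}

tokens, ids = bpe_encode("lower", toy_merges, toy_vocab)
print(tokens, ids)  # ['low', 'er'] [6, 7]
```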

In [None]:
# Download vocab.json and merges.txt for GPT-2 tokenizer
!wget https://huggingface.co/gpt2/resolve/main/vocab.json -O vocab.json
!wget https://huggingface.co/gpt2/resolve/main/merges.txt -O merges.txt


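### 4.4 Preprocess the Data
This step tokenizes the dataset and converts it into binary and index files suitable for training a large language model with Megatron-LM. We'll use the Megatron-LM preprocessing script with the converted JSONL dataset and tokenizer files. Run it from the root of the cloned Megatron-LM repository, where `tools/preprocess_data.py` lives, and adjust the paths to `bookcorpus.jsonl`, `vocab.json`, and `merges.txt` if they are stored elsewhere.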

In [None]:
!mkdir -p output
!python tools/preprocess_data.py \
    --input bookcorpus.jsonl \
    --json-keys text \
    --output-prefix output/bookcorpus \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file vocab.json \
    --merge-file merges.txt \
    --workers 4 \
    --append-eod \
    --partitions 2 \
    --split-sentences


Note that you may need to modify these parameters depending on the tokenizer and dataset you use.

Now, let's check the output files generated during preprocessing to ensure everything is processed correctly and ready for training.

In [None]:
# List output files
!ls output/
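
As a rough guide to what the listing should contain, the sketch below derives the expected file names. It assumes the naming pattern `<output-prefix>_<json-key>_<level>`, where the level is `sentence` when `--split-sentences` is set and `document` otherwise. Verify this against the version of `tools/preprocess_data.py` you checked out. The helper name is hypothetical.

```python
def expected_output_files(output_prefix, json_key="text", split_sentences=True):
    """Return the .bin/.idx file names the preprocessing step is expected to produce."""
    level = "sentence" if split_sentences else "document"
    stem = f"{output_prefix}_{json_key}_{level}"
    return [f"{stem}.bin", f"{stem}.idx"]


print(expected_output_files("output/bookcorpus"))
```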


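Let's wrap up this section with a final inspection of the binary file. The `.bin` file holds the raw token IDs, and the accompanying `.idx` file records the sequence boundaries.

In [None]:
import numpy as np

# With --split-sentences, the file name ends in "_text_sentence"; use the name listed above
bin_file = "output/bookcorpus_text_sentence.bin"

# Assumption: token IDs are stored as uint16, which Megatron-LM uses for small
# vocabularies such as GPT-2's; switch to np.int32 for larger vocabularies
tokens = np.memmap(bin_file, dtype=np.uint16, mode="r")

# Display the size and the first few token IDs of the preprocessed data
print("Total tokens:", len(tokens))
print("First 20 token IDs:", tokens[:20])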


## 5. Configure Network Interfaces (Only Applicable for Multi-Node Training)

We need to set the `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` environment variables. Although this is easily done with the `export` command, in this notebook we can detect the network interface and set both variables automatically.

First we need to install the `iproute2` package by running the following command:

In [None]:
!apt install -y iproute2

Then run the following commands to automatically detect the active network interfaces and set the environment variables based on the first interface available:

In [None]:
import os
import subprocess

# Detect the active network interface
try:
    result = subprocess.run(
        "ip -o link show | awk '{print $2, $9}' | grep 'UP' | awk '{print $1}' | sed 's/://g' | head -n 1",
        shell=True,
        check=True,
        capture_output=True,
        text=True
    )
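    active_interface = result.stdout.strip()
    if not active_interface:
        # The pipeline can succeed without finding any interface in the UP state
        raise RuntimeError("No active network interface found")

    # Set environment variables
    os.environ['NCCL_SOCKET_IFNAME'] = active_interface
    os.environ['GLOO_SOCKET_IFNAME'] = active_interface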

    # Verify the variables
    print(f"NCCL_SOCKET_IFNAME is set to: {os.environ['NCCL_SOCKET_IFNAME']}")
    print(f"GLOO_SOCKET_IFNAME is set to: {os.environ['GLOO_SOCKET_IFNAME']}")

except subprocess.CalledProcessError as e:
    print(f"Error detecting network interface: {e.stderr}")

After running the commands, verify the active network interface by running the following command:

In [None]:
!ip a

Ensure that the detected interface matches your system's active network interface. If it does not, modify the script above or set the `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` variables manually:

```bash
export NCCL_SOCKET_IFNAME=<network_interface>
export GLOO_SOCKET_IFNAME=<network_interface>
```
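
If you prefer to avoid the shell pipeline, similar selection logic can be written in Python. This sketch picks the first interface whose `state` is `UP`. The function name `first_up_interface` and the sample text are illustrative, and the parsing assumes the usual `ip -o link show` layout.

```python
def first_up_interface(ip_link_output):
    """Return the first interface reported with 'state UP', or None if there is none."""
    for line in ip_link_output.splitlines():
        fields = line.split()
        # Typical layout: "2: eth0: <FLAGS> mtu 1500 qdisc mq state UP ..."
        if len(fields) < 3 or "state" not in fields:
            continue
        state_index = fields.index("state") + 1
        if state_index < len(fields) and fields[state_index] == "UP":
            # Unlike the shell pipeline, also drop any "@parent" suffix (e.g. "eth0@if5:")
            return fields[1].rstrip(":").split("@")[0]
    return None


sample = (
    "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT\n"
    "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT\n"
)
print(first_up_interface(sample))  # eth0
```

To use it on a real system, pass it the output of `subprocess.run(["ip", "-o", "link", "show"], capture_output=True, text=True).stdout`.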

## 6. Training Configuration

Before launching a training task, it's crucial to configure the training environment properly. This section covers the essential configurations for running pretraining, including:
- Training Mode (Single vs Multi-Node Training)
- Dataset Options
- Tokenizer Selection

### 6.1 Training Modes

#### Single-Node Training
In single-node training, all computations occur on a single server or machine with multiple GPUs. This is simpler to set up and sufficient for smaller-scale experiments.

To launch training on a single node, use the following command:

```bash
TEE_OUTPUT=1 MBS=2 BS=64 TP=8 TE_FP8=0 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
```
Here:

- `TEE_OUTPUT=1`: Prints the training log to the terminal in addition to saving it to a log file.
- `MBS=2`: Sets the micro-batch size.
- `BS=64`: Sets the global batch size.
- `TP=8`: Sets 8-way tensor parallelism.
- `TE_FP8=0`: Disables FP8 optimizations (set to 1 to enable).
- `SEQ_LENGTH=4096`: Specifies the maximum sequence length for training.
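
These values are related. Megatron-LM splits each global batch into micro-batches, and the number of micro-batches processed per optimizer step is the global batch size divided by (micro-batch size × data-parallel size). The sketch below works through that arithmetic. It assumes a single node with 8 GPUs and no pipeline or context parallelism, and the helper name is illustrative.

```python
def microbatches_per_step(global_batch, micro_batch, num_gpus, tp, pp=1, cp=1):
    """Return (data_parallel_size, number of micro-batches per optimizer step)."""
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP * PP * CP")
    dp = num_gpus // model_parallel
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global batch must be divisible by micro batch * DP")
    return dp, global_batch // (micro_batch * dp)


# Values from the command above: BS=64, MBS=2, TP=8, on an assumed 8-GPU node
dp, steps = microbatches_per_step(global_batch=64, micro_batch=2, num_gpus=8, tp=8)
print(f"data-parallel size = {dp}, micro-batches per step = {steps}")
# data-parallel size = 1, micro-batches per step = 32
```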

#### Multi-Node Training
For larger-scale training, you can distribute the workload across multiple nodes. Multi-node training requires additional configuration to enable communication between nodes.

Before running the training script, update the following environment variables:

**Master Node Address**: Specify the hostname or IP address of the master node.
```bash
MASTER_ADDR="${MASTER_ADDR:-localhost}"
```
**Number of Nodes**: Define the total number of nodes:
```bash
NNODES="${NNODES:-1}"
```
**Node Rank**: Assign a unique rank to each node:
- 0 for the master node.
- 1, 2, etc., for worker nodes.
```bash
NODE_RANK="${NODE_RANK:-0}"
```

### Run the Training Script

Execute the training script on all nodes using the following command:

```bash
TEE_OUTPUT=1 MBS=2 BS=64 TP=8 TE_FP8=0 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
```
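
The snippets above suggest that the script reads `MASTER_ADDR`, `NNODES`, and `NODE_RANK` from the environment and falls back to the defaults shown. Under that assumption, one way to set them is to prefix the launch command on each node. The sketch below only prints such commands. The address `10.0.0.1` and the two-node setup are placeholders.

```python
def node_commands(master_addr, nnodes,
                  script="examples/llama/train_llama2.sh",
                  settings="TEE_OUTPUT=1 MBS=2 BS=64 TP=8 TE_FP8=0 SEQ_LENGTH=4096"):
    """Build one launch command per node; rank 0 is the master node."""
    return [
        f"MASTER_ADDR={master_addr} NNODES={nnodes} NODE_RANK={rank} {settings} bash {script}"
        for rank in range(nnodes)
    ]


# Placeholder master address and node count; replace with your cluster's values
for command in node_commands("10.0.0.1", nnodes=2):
    print(command)
```

Run the first printed command on the master node and the second on the worker node.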

**Tip**: Test multi-node communication with a mock training task before launching full-scale training. This helps debug any node communication or dataset path issues.



### 6.2 Dataset Options
The dataset is a critical component of pretraining. You can use either real data or mock data based on your requirements.

#### Using Real Data

To use a real dataset:

1. Update the `DATA_PATH` variable to point to your dataset. Use the file prefix without the `.bin` or `.idx` extension:
```bash
DATA_DIR="/root/.cache/data"  # Directory where your dataset is stored
DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence
```
2. Pass the data path to the training script:
```bash
--data-path $DATA_PATH
```
***Note***: Ensure the dataset files are accessible inside the Docker container.
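
Because `DATA_PATH` is a prefix rather than a file name, a typo is easy to miss. The sketch below checks that both the `.bin` and `.idx` files exist for a given prefix. The helper name is hypothetical, and the path is the example value from above.

```python
from pathlib import Path


def missing_dataset_files(data_path):
    """Return the .bin/.idx files that do not exist for the given dataset prefix."""
    candidates = [Path(f"{data_path}.bin"), Path(f"{data_path}.idx")]
    return [str(p) for p in candidates if not p.is_file()]


# Example prefix from this section; replace it with your own DATA_PATH
missing = missing_dataset_files("/root/.cache/data/bookcorpus_text_sentence")
print("Missing files:", missing or "none")
```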

#### Using Mock Data
To train on mock data instead, pass the `--mock-data` flag to the training script:
```bash
--mock-data
```

### 6.3 Tokenizer Selection

Tokenization is the process of converting raw text into tokens that the model can process. Different Llama models require different tokenizers:

#### For Llama2 Models:
Use the `Llama2Tokenizer`.

#### For Llama3 and Llama3.1 Models:
Use the `HuggingFaceTokenizer`. Set the `TOKENIZER_MODEL` variable to the Hugging Face model ID. For example:
```bash
TOKENIZER_MODEL=meta-llama/Llama-3.1-8B
```
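
The selection rule above can be captured in a small lookup. This is an illustrative sketch only: the function name is hypothetical, and it encodes just the two cases described in this section.

```python
def tokenizer_type_for(model_name):
    """Map a model family name to the tokenizer type named in this section."""
    name = model_name.lower().replace(" ", "").replace("-", "")
    if name.startswith("llama2"):
        return "Llama2Tokenizer"
    if name.startswith("llama3"):  # also covers Llama3.1
        return "HuggingFaceTokenizer"
    raise ValueError(f"No tokenizer rule for model: {model_name}")


print(tokenizer_type_for("Llama2"))     # Llama2Tokenizer
print(tokenizer_type_for("Llama-3.1"))  # HuggingFaceTokenizer
```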

## 7. Next Steps

Proceed to **Tutorial 2**, where we will:
- Use the environment and configurations set up in this tutorial.
- Run practical pretraining examples using **mock data**.