# Tutorial 2: Training Llama 3.1 with Megatron-LM 

This tutorial demonstrates how to train the **Llama 3.1 model** using **mock data**. The **Llama 3.1 model** is a popular open-source large language model designed to handle a wide range of natural language processing tasks efficiently. You can learn more about Llama models at [Llama's official website](https://www.llama.com/).

We use **mock data** in this tutorial to provide a quick and lightweight demonstration of the training workflow, enabling you to verify that your environment is correctly configured and functional. Mock data is a useful way to validate the training pipeline without requiring large datasets.

The training process leverages the **Megatron-LM framework**, a specialized framework for pretraining and fine-tuning large-scale language models. For more information about Megatron-LM, see the [official GitHub repository](https://github.com/NVIDIA/Megatron-LM). All steps will be executed within a **Docker container**, which provides a ready-to-use environment with all necessary dependencies.

This tutorial builds on the setup completed in **Tutorial 1**.

## Prerequisites

Before proceeding with this tutorial, ensure that you have the following:

### 1. Hardware
- **AMD Instinct™ GPUs** (e.g., MI300X) or compatible hardware with ROCm support.

### 2. Software
- **ROCm installed** and verified on your system. Follow the setup instructions in **Tutorial 1** if you haven’t done so.
- **Docker installed** on your system. Refer to the [Docker installation guide](https://docs.docker.com/get-docker/) if needed.

### 3. System Configuration
Ensure your system meets the configuration requirements from **Tutorial 1**, including:
- **NUMA auto-balancing disabled** for optimal performance.
- **ROCm environment validated** using `rocm-smi`.
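
As a quick check of the NUMA setting above, the following minimal sketch reads the kernel's NUMA auto-balancing switch. It assumes a Linux host that exposes the standard procfs entry `/proc/sys/kernel/numa_balancing`, where `0` means disabled. The helper name `numa_balancing_disabled` is hypothetical and not part of any ROCm tooling.

```python
from pathlib import Path

# Standard Linux procfs location for the NUMA auto-balancing switch.
NUMA_BALANCING_PATH = Path("/proc/sys/kernel/numa_balancing")


def numa_balancing_disabled(path: Path = NUMA_BALANCING_PATH) -> bool:
    """Return True if the file at `path` reports NUMA auto-balancing as off (0)."""
    return path.read_text().strip() == "0"


if __name__ == "__main__":
    if not NUMA_BALANCING_PATH.exists():
        print(f"{NUMA_BALANCING_PATH} not found; this kernel may not expose the setting.")
    elif numa_balancing_disabled():
        print("NUMA auto-balancing is disabled.")
    else:
        print("NUMA auto-balancing is enabled; see Tutorial 1 to disable it.")
```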

### 4. Hugging Face API Token
Ensure you have a Hugging Face API token with the necessary permissions and approval to access [Meta's Llama checkpoints](https://huggingface.co/meta-llama).

## Prepare Training Environment

Once your system meets the prerequisites, follow these steps to set up the training environment.

### Step 1: Pull the Docker Image
Run the following command in your terminal to pull the prebuilt Docker image containing all necessary dependencies:
```bash
docker pull rocm/megatron-lm:24.12-dev
```

### Step 2: Launch the Docker Container
Run the following command in your terminal to launch the Docker container with the appropriate configuration:
```bash
docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host \
    --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
    --privileged -v /path/to/notebooks:/workspace/notebooks \
    --name megatron-dev-env rocm/megatron-lm:24.12-dev /bin/bash
```
**Important:** Replace `/path/to/notebooks` with the full path to the directory on your host machine where your notebooks are stored. Ensure this directory is accessible to Docker and contains the necessary files for this tutorial.
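
If you prefer to generate the launch command rather than edit it by hand, the sketch below assembles the same arguments as the Step 2 command and fails early when the host directory does not exist. The helper `build_docker_run_cmd` is a hypothetical convenience, not part of Docker or ROCm. It only builds the argument list and does not start a container.

```python
import shlex
from pathlib import Path

IMAGE = "rocm/megatron-lm:24.12-dev"


def build_docker_run_cmd(notebooks_dir: str) -> list[str]:
    """Assemble the Step 2 `docker run` arguments for a given host directory."""
    host_dir = Path(notebooks_dir).expanduser().resolve()
    if not host_dir.is_dir():
        raise FileNotFoundError(f"Host directory does not exist: {host_dir}")
    return [
        "docker", "run", "-it",
        "--device", "/dev/dri", "--device", "/dev/kfd",
        "--network", "host", "--ipc", "host",
        "--group-add", "video", "--cap-add", "SYS_PTRACE",
        "--security-opt", "seccomp=unconfined", "--privileged",
        "-v", f"{host_dir}:/workspace/notebooks",
        "--name", "megatron-dev-env", IMAGE, "/bin/bash",
    ]


if __name__ == "__main__":
    # The current directory stands in for your notebooks directory.
    print(shlex.join(build_docker_run_cmd(".")))
```

Copy the printed command into your terminal to launch the container.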

### Step 3: Install Jupyter and Start the Server
After launching the Docker container, run the following commands inside the container to install Jupyter and start the notebook server:
```bash
pip install jupyter
jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
```
**Note:** Jupyter installation inside Docker is necessary to execute notebook cells. Save the token or URL provided in the terminal output to access the notebook from your host machine.

### Step 4: Clone the Megatron-LM Repository
Run the following commands inside the Docker container to clone the Megatron-LM repository and navigate to the validated commit:

```python
# Clone the Megatron-LM repository and check out the validated commit
!git clone https://github.com/ROCm/Megatron-LM && cd Megatron-LM && git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92
```

### Step 5: Provide Your Hugging Face Token
You need a Hugging Face API token to access Llama-3.1-8B. Generate one by signing in to your account at **[Hugging Face Tokens](https://huggingface.co/settings/tokens)**, then request access to Llama-3.1-8B. Tokens typically start with "hf_".

Run the following interactive block in your Jupyter notebook to set up the token:

**Note:** Uncheck the "Add token as Git credential" option.

```python
from huggingface_hub import notebook_login

# Prompt the user to log in
notebook_login()
```

Verify that your token was captured correctly:

```python
from huggingface_hub import HfApi

try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")
```

## Run Training Script

### Single-Node Training Overview
The training process involves running a pre-configured script that initializes and executes the training of the **Llama 3.1 model**. The script leverages the **Megatron-LM framework** and mock data to simulate a full training pipeline. A successful run lets you confirm that your environment is correctly configured before you move on to real datasets.

Before running the script, ensure that the repository from **Step 4** is cloned, your Hugging Face token is validated, and all environment variables are set correctly.

### Key Parameters for Training
- **Batch Size (BS):** Global batch size across all GPUs, set to `64`.
- **Sequence Length (SEQ_LENGTH):** Input sequence length, set to `4096`.
- **Tensor Parallelism (TP):** Set to `8`, which shards the model across all 8 GPUs of the node.
- **Precision (TE_FP8):** Set to `0` to disable FP8 and train in BF16 precision.
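
To see how these values interact, the sketch below works through the batch arithmetic. It relies on the usual Megatron-LM relationship in which the global batch size must be divisible by the micro-batch size times the data-parallel size. The sketch assumes pipeline and context parallelism of 1, which this tutorial does not state. The 8 GPUs and the micro-batch size of 2 come from the training command and notes elsewhere in this tutorial, and `batch_layout` is a hypothetical helper.

```python
def batch_layout(global_batch: int, micro_batch: int, num_gpus: int,
                 tensor_parallel: int, seq_length: int) -> dict:
    """Derive data-parallel size, gradient-accumulation steps, and tokens per step.

    Assumes pipeline and context parallelism are both 1 (not stated in the tutorial).
    """
    if num_gpus % tensor_parallel != 0:
        raise ValueError("num_gpus must be divisible by tensor_parallel")
    data_parallel = num_gpus // tensor_parallel
    samples_per_micro_step = micro_batch * data_parallel
    if global_batch % samples_per_micro_step != 0:
        raise ValueError("global_batch must be divisible by micro_batch * data_parallel")
    return {
        "data_parallel": data_parallel,
        "grad_accum_steps": global_batch // samples_per_micro_step,
        "tokens_per_iteration": global_batch * seq_length,
    }


layout = batch_layout(global_batch=64, micro_batch=2, num_gpus=8,
                      tensor_parallel=8, seq_length=4096)
print(layout)
# {'data_parallel': 1, 'grad_accum_steps': 32, 'tokens_per_iteration': 262144}
```

Under these assumptions, `TP=8` on 8 GPUs leaves a single data-parallel replica, so each iteration accumulates 32 micro-batches of 2 sequences.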

### Run the Training Script
Use the following command to train the model on a single node:


```python
!cd Megatron-LM && TEE_OUTPUT=1 MBS=2 BS=64 TP=8 TE_FP8=0 SEQ_LENGTH=4096 \
TOKENIZER_MODEL='meta-llama/Llama-3.1-8B' MODEL_SIZE='8' \
bash examples/llama/train_llama3.sh
```

### What This Command Does
This command configures the training process with the following parameters:
- **`TEE_OUTPUT=1`**: Enables logging output to the console.
- **`MBS=2`**: Micro-batch size per GPU.
- **`BS=64`**: Total batch size across all GPUs.
- **`TP=8`**: Tensor parallelism for distributing the model across GPUs.
- **`TE_FP8=0`**: Sets precision to BF16 for training.
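- **`SEQ_LENGTH=4096`**: Maximum input sequence length.
- **`TOKENIZER_MODEL='meta-llama/Llama-3.1-8B'`**: Hugging Face model ID of the tokenizer to use.
- **`MODEL_SIZE='8'`**: Selects the 8B-parameter model configuration.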

The training script will:
- Use mock data as input.
- Train the **Llama 3.1 model** with the specified configurations.

You can customize these parameters based on your hardware and desired configurations by modifying the command.
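
One way to experiment with different settings is to build the command string programmatically. The sketch below starts from the values used in this tutorial and applies overrides. The helper `training_command` is hypothetical, and it only knows the variables that appear in the command above. Whether a given combination is valid still depends on the training script and your hardware.

```python
import shlex

# Defaults copied from the training command in this tutorial.
DEFAULTS = {
    "TEE_OUTPUT": "1", "MBS": "2", "BS": "64", "TP": "8", "TE_FP8": "0",
    "SEQ_LENGTH": "4096",
    "TOKENIZER_MODEL": "meta-llama/Llama-3.1-8B", "MODEL_SIZE": "8",
}
SCRIPT = "examples/llama/train_llama3.sh"


def training_command(**overrides) -> str:
    """Return the shell command with any overridden variables applied."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown variables: {sorted(unknown)}")
    env = {**DEFAULTS, **{k: str(v) for k, v in overrides.items()}}
    assignments = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"{assignments} bash {SCRIPT}"


cmd = training_command(MBS=1, BS=32)
print(cmd)
```

In a notebook cell, the result can then be run with `!cd Megatron-LM && {cmd}`.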

## Monitor Training Progress

Monitor the output logs during the training process for the following:
- **Iteration Progress**: The number of completed iterations.
- **Loss Values**: Indicate the model's learning progress. The loss should generally decrease as training proceeds.
- **GPU Utilization**: Shows whether your hardware resources are being used effectively.

Logs are printed to the console and saved to a log file within the directory specified by the script.
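
To pull these values out of a saved log, a small parser can help. The sketch below assumes progress lines shaped like the ones Megatron-LM typically prints, with `iteration`, `elapsed time per iteration (ms)`, and `lm loss` fields. `SAMPLE_LINE` is a made-up illustration with invented numbers, not output from a real run, so adjust the patterns to match your own logs. `parse_progress` is a hypothetical helper.

```python
import re

# Assumed shape of a Megatron-LM progress line; the values are invented for illustration.
SAMPLE_LINE = (
    " iteration       10/      50 | consumed samples:          640 | "
    "elapsed time per iteration (ms): 8000.0 | learning rate: 1.0E-05 | "
    "global batch size:    64 | lm loss: 1.15E+01 |"
)

ITERATION_RE = re.compile(r"iteration\s+(\d+)/\s*(\d+)")
TIME_RE = re.compile(r"elapsed time per iteration \(ms\):\s*([\d.]+)")
LOSS_RE = re.compile(r"lm loss:\s*([\d.]+E[+-]\d+)", re.IGNORECASE)


def parse_progress(line: str, global_batch: int = 64, seq_length: int = 4096):
    """Extract iteration, loss, and tokens/second from one log line, or None."""
    it, ms, loss = ITERATION_RE.search(line), TIME_RE.search(line), LOSS_RE.search(line)
    if not (it and ms and loss):
        return None
    seconds = float(ms.group(1)) / 1000.0
    return {
        "iteration": int(it.group(1)),
        "total_iterations": int(it.group(2)),
        "lm_loss": float(loss.group(1)),
        "tokens_per_second": global_batch * seq_length / seconds,
    }


print(parse_progress(SAMPLE_LINE))
```

Throughput here is estimated as global batch size times sequence length divided by the iteration time, using this tutorial's `BS=64` and `SEQ_LENGTH=4096` as defaults.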

## Key Notes

- Mock data is for validation only. To use a different dataset, refer to Tutorial 1.
- Tune hyperparameters based on your hardware. The hyperparameters in this tutorial are based on one node of 8x MI300X GPUs.
- This example illustrates how to run a training task on a single node. For multi-node training instructions, refer to Tutorial 1.
- Check the logs to confirm that iterations advance and loss values are reported without errors.