# Llama model pre-training on HPU

In this Jupyter notebook, we will pre-train a [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b) model on HPUs.

We will use PyTorch for model training and Ray for distributed training. We will use the pre-processed [OSCAR](https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar) dataset from bigscience-workshop. The dataset preparation steps can be found [here](https://github.com/HabanaAI/Megatron-DeepSpeed/tree/main?tab=readme-ov-file#dataset-preparation).

[Habana Gaudi AI Processors (HPUs)](https://habana.ai) are AI hardware accelerators designed by Habana Labs. For more information, see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/index.html) and [Gaudi Developer Docs](https://developer.habana.ai/).

The main features of this pre-training example are:
- [OSCAR](https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar) data for training. See the detailed [dataset preparation](https://github.com/HabanaAI/Megatron-DeepSpeed/tree/main?tab=readme-ov-file#dataset-preparation-examples) steps.
- Runs on HPUs and supports three execution modes: ["lazy", "eager", "eager.compile"](https://docs.habana.ai/en/latest/PyTorch/Reference/PyTorch_Gaudi_Theory_of_Operations.html).
- Pre-trains a Llama model using the [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b) configuration.
- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/tree/main)-based data processing.
- [`GaudiTrainer`](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/trainer.py)-based training.
- Ray-based resource scheduling and management.

## Prepare environment
This example runs on a single node with 4 HPUs.

We recommend using a prebuilt container to run these examples. To run a container, you need Docker. See [Install Docker Engine](https://docs.docker.com/engine/install/) for installation instructions.

Next, follow [Run Using Containers](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html?highlight=installer#run-using-containers) to install the Habana drivers and container runtime.

### Get docker image
``` bash
docker pull vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
```
### Run docker image
``` bash
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
# You may also need to mount your workspace volumes into the container.
```
### Install dependencies
``` bash
# "optimum-habana>1.11.1" is required if the execution mode is "eager" or "eager.compile"
# "ray>=2.20.0" is required
pip install ray[train] notebook transformers datasets evaluate peft accelerate scikit-learn optimum-habana

# install deepspeed
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.0

# install megatron_core
pip install git+https://github.com/microsoft/Megatron-DeepSpeed.git#egg=megatron-core

# This notebook was verified with the following package versions:
# transformers==4.38.2
# datasets==2.19.1
# evaluate==0.4.2
# peft==0.4.0
# accelerate==0.27.2
# scikit-learn==1.4.2
# optimum-habana==1.11.1

# deepspeed==0.12.4+hpu.synapse.v1.15.0

# megatron_core==0.2.0
```

## Import necessary libraries

In [None]:
#!/usr/bin/env python

import os
from typing import Any, Dict

from torch.utils.data import DataLoader

import transformers
from transformers import HfArgumentParser, default_data_collator

from megatron import get_args, print_rank_0
from megatron.core import mpu
from megatron.data import gpt_dataset
from megatron.initialize import initialize_megatron
from megatron.data.data_samplers import build_pretraining_data_loader
from megatron.training import build_train_valid_test_datasets, update_train_iters

from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

## Build datasets

Build train, valid, and test datasets.
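
The datasets are carved out of a single corpus according to a split string, such as the `"949,50,1"` value used in the configuration later in this notebook. The sketch below illustrates how such a string can be read as relative weights for the train, validation, and test sets. `split_fractions` is a hypothetical helper written for illustration, not a Megatron function, and the real implementation additionally maps these fractions onto document index ranges.

```python
from typing import List


def split_fractions(splits_string: str) -> List[float]:
    """Turn a comma-separated weight string into train/valid/test fractions."""
    weights = [float(part) for part in splits_string.split(",")]
    # Pad with zeros so there are always three entries (train, valid, test).
    weights = (weights + [0.0, 0.0, 0.0])[:3]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("split weights must sum to a positive number")
    # Normalize so the three fractions sum to 1.
    return [weight / total for weight in weights]


fractions = split_fractions("949,50,1")
print(fractions)
```

Under this reading, almost all documents go to training, about 5% to validation, and a small remainder to testing.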

The training data requires preprocessing. First, place your training data in a loose JSON format, with one JSON object containing a text sample per line (the `jsonl` format).
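
As a minimal illustration of that layout, the following sketch writes two samples to a `jsonl` file and reads them back using only the standard library. The file name `train.jsonl`, the sample sentences, and the `"text"` key are assumptions made for this example; check the preprocessing tool you use for the key it expects.

```python
import json
import os
import tempfile

# Two illustrative samples; each becomes one JSON object on its own line.
samples = [
    {"text": "The first training document."},
    {"text": "The second training document."},
]

# Write to a temporary directory so the example has no side effects.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Read the file back line by line to confirm the format round-trips.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "samples written to", path)
```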

This notebook focuses on how to pre-train a Llama model on HPUs, so it does not cover data preprocessing in detail.
For the steps to preprocess data for Megatron-DeepSpeed pre-training, see [here](https://github.com/microsoft/Megatron-DeepSpeed/tree/main?tab=readme-ov-file#data-preprocessing).

In [None]:
class MegatronDataset:
    def __call__(self, config):
        def _train_valid_test_datasets_provider(train_val_test_num_samples):
            """Build train, valid, and test datasets."""
            args = get_args()
            print_rank_0("> building train, validation, and test datasets " "for GPT ...")
            train_ds, valid_ds, test_ds = gpt_dataset.build_train_valid_test_datasets(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                seq_length=args.seq_length,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup),
                train_data_prefix=args.train_data_path,
                valid_data_prefix=args.valid_data_path,
                test_data_prefix=args.test_data_path,
                data_cache_path=args.data_cache_path,
            )
            print_rank_0("> finished creating GPT datasets ...")

            return train_ds, valid_ds, test_ds

        args = get_args()
        update_train_iters(args)
        datasets = build_train_valid_test_datasets(_train_valid_test_datasets_provider)
        print_rank_0(datasets)
        return datasets


def load_datasets(config):
    dataset = MegatronDataset()
    return dataset(config)

## Process dataset to dataloader
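
When training resumes from a given step, the processor in the next cell derives how many samples were already consumed so that the dataloaders can skip them. The following simplified sketch shows that arithmetic in isolation. `consumed_samples` is a hypothetical helper, the `eval_iters=10` value is illustrative, and the sketch assumes a single completed-step counter drives both the training and validation counts.

```python
from typing import Tuple


def consumed_samples(
    start_step: int, global_batch_size: int, eval_interval: int, eval_iters: int
) -> Tuple[int, int]:
    """Return (train, valid) sample counts already consumed before start_step."""
    if start_step <= 0:
        # Fresh run: nothing has been consumed yet.
        return 0, 0
    # The value passed in is the step to start at, so completed steps are one fewer.
    completed = start_step - 1
    train = completed * global_batch_size
    # One evaluation of eval_iters batches runs every eval_interval steps.
    valid = (completed // eval_interval) * eval_iters * global_batch_size
    return train, valid


train_consumed, valid_consumed = consumed_samples(
    start_step=1001, global_batch_size=4, eval_interval=1000, eval_iters=10
)
print(train_consumed, valid_consumed)
```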

In [None]:
class MegatronProcesser:
    def prepare(self, tokenizer, dataset, **kwargs):
        args = get_args()

        (train_dataloader, valid_dataloader, test_dataloader) = (None, None, None)

        print_rank_0("> building train, validation, and test datasets ...")
        iteration = kwargs.get("step", 0)
        if iteration:
            # passed value is starting step
            iteration -= 1
            args.consumed_train_samples = iteration * args.global_batch_size
            args.consumed_valid_samples = (
                (args.iteration // args.eval_interval) * args.eval_iters * args.global_batch_size
            )

        # Data loader only on rank 0 of each model parallel group.
        if args.use_dataset_only or mpu.get_tensor_model_parallel_rank() == 0:
            # Build datasets.
            train_ds, valid_ds, test_ds = dataset

            # Build dataloaders.
            train_dataloader = build_pretraining_data_loader(train_ds, args.consumed_train_samples)
            valid_dataloader = build_pretraining_data_loader(valid_ds, args.consumed_valid_samples)
            test_dataloader = build_pretraining_data_loader(test_ds, 0)

        return train_dataloader, valid_dataloader, test_dataloader

## Load tokenizer

Download the vocabulary from huggingface.co and cache it.

In [None]:
def load_tokenizer(config):
    name = config["name"]
    load_config = config["config"]
    return transformers.AutoTokenizer.from_pretrained(name, **load_config)

## Load Llama model

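Download the model configuration from huggingface.co and cache it. The model is built from this configuration only, so its weights are freshly initialized rather than loaded from a pre-trained checkpoint.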

In [None]:
class HuggingFaceModelFromConfig:
    def __call__(self, config):
        name = config["name"]
        self.model_config = config.get("config", {})
        self.auto_config = None
        if name is not None:
            self.auto_config = transformers.AutoConfig.from_pretrained(
                pretrained_model_name_or_path=name, **self.model_config
            )
        else:
            self.auto_config = transformers.AutoConfig.for_model(**self.model_config)
        self.model = transformers.AutoModelForCausalLM.from_config(self.auto_config)

        return self.model


def load_model(config):
    model = HuggingFaceModelFromConfig()
    return model(config)

## Prepare trainer

- Subclass `GaudiTrainer` and override the train dataloader preparation with a custom function.
- Instantiate the trainer with `model`, `gaudi_config`, `training_args`, and `tokenizer`.

In [None]:
class HFCustomerSamplerTrainer(GaudiTrainer):  # type: ignore
    def set_sampler(self, sampler):
        self.customer_sampler = sampler

    def get_train_dataloader(self) -> DataLoader:
        if self.train_dataset is None:
            raise ValueError("Trainer: training requires a train_dataset.")

        train_dataloader, _, _ = self.customer_sampler.prepare(
            None, (self.train_dataset, None, None)
        )
        return train_dataloader


def get_trainer(config, training_args, datasets, tokenizer, model):
    gaudi_config = GaudiConfig.from_pretrained(
        training_args.gaudi_config_name,
        cache_dir=config.get("cache_dir", None),
        revision=config.get("model_revision", None),
        use_auth_token=True if config.get("use_auth_token") else None,
    )

    train_dataset, eval_dataset, test_dataset = datasets

    trainer = HFCustomerSamplerTrainer(
        model=model,
        gaudi_config=gaudi_config,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=None,
        tokenizer=tokenizer,
        # Data collator will default to DataCollatorWithPadding, so we change it.
        data_collator=default_data_collator,
        compute_metrics=None,
        preprocess_logits_for_metrics=None
    )
    return trainer

## Training Function

This function is executed by each worker during training and performs the following steps:
- Initialize Megatron.
- Load the datasets from the prepared local binary and index files.
- Prepare the dataset processor.
- Load the tokenizer configuration from huggingface.co.
- Create a `GaudiTrainingArguments` instance.
- Load the model configuration from huggingface.co.
- Create a `GaudiTrainer` instance with `training_args`, datasets, tokenizer, and model.
- Call `train` on the trainer.
- Save the model.

In [None]:
def pretrain_llama(config: Dict[str, Any]):

    initialize_megatron(ignore_unknown_args=True, external_args=config["megatron_config"], allow_no_cuda=True)

    datasets = load_datasets(config["datasets"])

    dataprocessor = MegatronProcesser()

    tokenizer = load_tokenizer(config["tokenizer"])

    training_args = GaudiTrainingArguments(**config["training_args"])

    model = load_model(config["model"])

    trainer = get_trainer(config, training_args, datasets, tokenizer, model)
    trainer.set_sampler(dataprocessor)

    result = trainer.train()
    trainer.save_model()
    print(result)

## Main Training Function

The `main` function sets up the distributed training environment using Ray and starts the training process. To enable training on HPUs, we only need to make the following changes:
- Set the execution mode for training. The supported execution modes are:

    - "lazy": Deferred execution of graphs, composed of ops delivered from the script op by op, similar to Eager mode. It gives the Eager mode experience with performance on Gaudi. Unlike Eager mode with `torch.compile`, the graph is analyzed in each iteration, leading to higher CPU usage.
    - "eager": Op-by-op execution as defined in standard PyTorch Eager mode scripts.
    - "eager.compile": Eager mode extended with `torch.compile`. Similar to Eager mode, but all or part of the model (such as a function) is wrapped into a graph. Parts that are not wrapped are executed eagerly.

    More details on the theory of operations can be found [here](https://docs.habana.ai/en/latest/PyTorch/Reference/PyTorch_Gaudi_Theory_of_Operations.html), and detailed performance results can be found [here](https://developer.habana.ai/get-started/habana-models-performance/).
- Request an HPU for each worker in `ScalingConfig`.
- Set the backend to `hccl` in `TorchConfig`.
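
To make the effect of the execution mode concrete, the sketch below collects the mode-dependent settings this notebook uses into one place: the `PT_HPU_LAZY_MODE` environment variable and the `torch_compile_backend` training argument. `SUPPORTED_MODES` and `settings_for_mode` are hypothetical names introduced for illustration; the values themselves are the ones that appear in the notebook's own cells.

```python
from typing import Any, Dict

SUPPORTED_MODES = ("lazy", "eager", "eager.compile")


def settings_for_mode(execution_mode: str) -> Dict[str, Dict[str, Any]]:
    """Return the environment variables and training-arg overrides for a mode."""
    if execution_mode not in SUPPORTED_MODES:
        raise ValueError(
            f"unsupported execution mode {execution_mode!r}; "
            f"expected one of {SUPPORTED_MODES}"
        )
    # Lazy mode is switched on through the environment; both eager variants turn it off.
    env = {"PT_HPU_LAZY_MODE": "1" if execution_mode == "lazy" else "0"}
    training_args: Dict[str, Any] = {}
    if execution_mode == "eager.compile":
        # torch.compile needs an explicit backend when targeting HPUs.
        training_args["torch_compile_backend"] = "hpu_backend"
    return {"env": env, "training_args": training_args}


for mode in SUPPORTED_MODES:
    print(mode, settings_for_mode(mode))
```

Validating the mode up front turns a misspelled mode name into an immediate error instead of a silent fallback to eager execution.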

In [None]:
def main(num_workers, execution_mode):
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer, TorchConfig

    # configs for pre-training
    pretrain_config = {
        "megatron_config": {
            "data_path": ["/root/workspace/bigscience/data/oscar/zh/tokenized_text_document"],
            "data_impl": "mmap",
            "micro_batch_size": 1,
            "global_batch_size": 4,
            "seq_length": 2048,
            "use_dataset_only": True,
            # "vocab_file": "/home/user/workspace/data/gpt2-vocab.json",
            "tokenizer_type": "HFTokenizer",
            "tokenizer_model": "huggyllama/llama-7b",
            # "merge_file": "/home/user/workspace/data/gpt2-merges.txt",
            "eval_interval": 1000,
            "train_samples": 300_000_000,
            "split": "949,50,1",
        },
        "datasets": {
        },
        "tokenizer": {
            "name": "huggyllama/llama-7b",
            "config": {}
        },
        "model": {
            "name": "huggyllama/llama-7b",
            "config": {
                "torch_dtype": "bfloat16",
            },
        },
        "training_args": {
            "per_device_train_batch_size": 1,
            "per_device_eval_batch_size": 1,
            "do_train": True,
            "do_eval": False,
            "save_strategy": "epoch",
            "save_steps": 1000,
            "output_dir": "/tmp/pretrain-llama",
            "gaudi_config_name": "Habana/gpt2",
            "use_habana": True,
            "max_steps": 100000,
            "throughput_warmup_steps": 3,
            "use_lazy_mode": True,
            "overwrite_output_dir": True,
            "seed": 42,
            "bf16": True,
            "report_to":'tensorboard',
            "deepspeed": {
                "steps_per_print": 64,
                "train_batch_size": "auto",
                "train_micro_batch_size_per_gpu": "auto",
                "gradient_accumulation_steps": "auto",
                "gradient_checkpoint": True,
                "memory_efficient_linear": False,
                "bf16": {
                    "enabled": True
                },
                "gradient_clipping": 1.0,
                "zero_optimization": {
                    "stage": 3,
                    "overlap_comm": False,
                    "reduce_scatter": False,
                    "contiguous_gradients": False,
                    "stage3_gather_16bit_weights_on_model_save": True
                }
            },
        },
    }

    # If the execution mode is eager with compile, a compile backend must be specified.
    if execution_mode == "eager.compile":
        pretrain_config["training_args"].update({"torch_compile_backend": "hpu_backend"})

    scaling_config = ScalingConfig(num_workers=num_workers,
                                   use_gpu=False,
                                   resources_per_worker={"CPU": 1, "HPU": 1})

    # Set backend to hccl in TorchConfig
    torch_config = TorchConfig(backend="hccl")
    runtime_env = {
        "env_vars": {
        }
    }

    ray.init(runtime_env=runtime_env)

    # Initialize a Ray TorchTrainer
    trainer = TorchTrainer(
        train_loop_per_worker=pretrain_llama,
        train_loop_config=pretrain_config,
        torch_config=torch_config,
        scaling_config=scaling_config
    )

    result = trainer.fit()
    print(result)

## Start Training

Finally, we call the `main` function to start the pre-training process.

Before calling the `main` function, you must set some environment variables.

1. The visible devices. The environment variables `HABANA_VISIBLE_DEVICES` and `HABANA_VISIBLE_MODULES` control which HPU devices are visible to the application, so you must set both of them properly. For more details on their usage, see [here](https://docs.habana.ai/en/latest/PyTorch/Reference/PT_Multiple_Tenants_on_HPU/Multiple_Dockers_each_with_Single_Workload.html).

2. The execution mode. Different execution modes have different runtime performance. The default execution mode is lazy mode.
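
As a small sketch of the first point, the module list can be derived from the number of workers instead of being typed by hand. `visible_modules` is a hypothetical helper, and it assumes the modules to expose are numbered contiguously from 0, which may not match every system; verify the module IDs on your machine before relying on it.

```python
import os


def visible_modules(num_workers: int) -> str:
    """Build a comma-separated module ID list: 0, 1, ..., num_workers - 1."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return ",".join(str(index) for index in range(num_workers))


# Assumption: one module per worker, with IDs starting at 0.
modules = visible_modules(4)
os.environ["HABANA_VISIBLE_MODULES"] = modules
print(os.environ["HABANA_VISIBLE_MODULES"])
```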

In [None]:
# Set the required environment variables.
os.environ["RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES"] = "0"
# If you use the RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES environment variable,
# you must also set HABANA_VISIBLE_MODULES yourself, for example:
# os.environ["HABANA_VISIBLE_MODULES"] = "0,1,2,3"

# Supported execution modes: "lazy", "eager", "eager.compile"
execution_mode = "lazy"
os.environ["PT_HPU_LAZY_MODE"] = "1" if execution_mode == "lazy" else "0"

main(num_workers=4, execution_mode=execution_mode)

## Possible outputs

``` bash

...
(RayTrainWorker pid=359077) ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
(RayTrainWorker pid=359077)  PT_HPU_LAZY_MODE = 1
(RayTrainWorker pid=359077)  PT_RECIPE_CACHE_PATH = 
(RayTrainWorker pid=359077)  PT_CACHE_FOLDER_DELETE = 0
(RayTrainWorker pid=359077)  PT_HPU_RECIPE_CACHE_CONFIG = 
(RayTrainWorker pid=359077)  PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
(RayTrainWorker pid=359077)  PT_HPU_LAZY_ACC_PAR_MODE = 1
(RayTrainWorker pid=359077)  PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
(RayTrainWorker pid=359077) ---------------------------: System Configuration :---------------------------
(RayTrainWorker pid=359077) Num CPU Cores : 152
(RayTrainWorker pid=359077) CPU RAM       : 1056440348 KB
(RayTrainWorker pid=359077) ------------------------------------------------------------------------------

...

(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.975e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.39, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.4, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.9250000000000004e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.42, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.9e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.42, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.875e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.4, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.85e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.4, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.825e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.45, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.8e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.41, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.775e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.43, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.75e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.4, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.7249999999999997e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.39, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.7249999999999997e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.39, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}

...
```