# Fine-tuning Llama-2 Model with HPU

In this Jupyter notebook, we fine-tune a [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model on HPUs, using `accelerate` in DDP mode. We use PyTorch for model training, Ray for distributed training, and the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset.

[Habana Gaudi AI Processors (HPUs)](https://habana.ai) are AI hardware accelerators designed by Habana Labs. For more information, see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/index.html) and [Gaudi Developer Docs](https://developer.habana.ai/).

The main features of this fine-tuning example are:
- Running on HPUs, with support for three execution modes: "lazy", "eager", and "eager.compile".
- LoRA training.
- `accelerate` based training.
- Llama-2-7b model.
- Ray based scheduling and management.

## Prepare environment
A node with Gaudi/Gaudi2 installed is required to run this example. This example will use 4 workers to train the model, each using 1 HPU.

We recommend using a prebuilt container to run these examples. To run a container, you need Docker. See [Install Docker Engine](https://docs.docker.com/engine/install/) for installation instructions.

Next, follow [Run Using Containers](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html?highlight=installer#run-using-containers) to install the Habana drivers and container runtime.

### Get docker image
``` bash
docker pull vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
```
### Run docker image
``` bash
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
# You may also want to mount your workspace into the container with -v <host_path>:<container_path>
```
### Install dependency
``` bash
pip install ray[train] notebook transformers datasets evaluate peft accelerate optimum-habana
```

## Import necessary libraries

In [1]:
import os
import copy
import time
from typing import Dict

import torch
from torch import nn
from torch.utils.data import DataLoader

import datasets
from datasets import load_dataset
import transformers
from transformers import (
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    AutoModelForCausalLM,
)

import peft

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.torch import TorchConfig

import habana_frameworks.torch.core as htcore

from optimum.habana.accelerate import GaudiAccelerator



## Prepare Dataset Function

Convert each record of the raw dataset into a prompt (source) and a response (target) using a fixed template.

In [2]:

def preprocess_dataset(raw_datasets):

    PROMPT_DICT = {
        "prompt_with_input": (
            "Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
        ),
        "prompt_without_input": (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:"
        ),
    }

    def create_prompts(examples):
        prompts = {}
        prompts["source"] = []
        prompts["target"] = []
        for example in examples:
            prompt_template = (
                PROMPT_DICT["prompt_with_input"] if example["input"] != "" else PROMPT_DICT["prompt_without_input"]
            )
            source = prompt_template.format_map(example)
            prompts["source"].append(source)
            prompts["target"].append(example["output"])
        return prompts

    # Preprocessing the datasets.
    for key in raw_datasets:
        prompts = create_prompts(raw_datasets[key])
        columns_to_be_removed = list(raw_datasets[key].features.keys())
        raw_datasets[key] = raw_datasets[key].add_column("prompt_sources", prompts["source"])
        raw_datasets[key] = raw_datasets[key].add_column("prompt_targets", prompts["target"])
        raw_datasets[key] = raw_datasets[key].remove_columns(columns_to_be_removed)
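
To make the template selection concrete, the following minimal sketch applies the same two templates to a single record. The helper name `build_prompt` and the sample record are made up for illustration; only the template text comes from the function above.

```python
# Templates copied from PROMPT_DICT above.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_WITHOUT_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(example):
    """Pick the template based on whether the record has a non-empty input."""
    template = PROMPT_WITH_INPUT if example["input"] != "" else PROMPT_WITHOUT_INPUT
    # format_map ignores keys the template does not use (such as "output").
    return template.format_map(example)


# Hypothetical records in the Alpaca column layout (instruction, input, output).
with_input = {"instruction": "Add the two numbers.", "input": "2 and 3", "output": "5"}
without_input = {"instruction": "Name a primary color.", "input": "", "output": "Red"}

print(build_prompt(with_input))
print("-----")
print(build_prompt(without_input))
```

The response text is kept in a separate column and is only appended to the prompt later, during tokenization.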

## Dataset to Tokenizer Function

Tokenize each record of the dataset with the model's tokenizer.

In this example, we concatenate the tokenized records and split them into fixed-length sequences to speed up training.

In [3]:

def preprocess_dataset_to_tokenizer(raw_datasets, tokenizer):
    max_seq_length = 512
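    # Llama tokenizer special token IDs: 0 = <unk> (used here for padding), 1 = <s> (BOS), 2 = </s> (EOS)
    tokenizer.pad_token_id = 0
    tokenizer.bos_token_id = 1
    tokenizer.eos_token_id = 2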

    def tokenize(prompt, add_eos_token=True):
        results = tokenizer(
            prompt,
            truncation=True,
            max_length=max_seq_length,
            padding=False,
            return_tensors=None,
        )
        for i in range(len(results["input_ids"])):
            if (
                results["input_ids"][i][-1] != tokenizer.eos_token_id
                and len(results["input_ids"][i]) < max_seq_length
                and add_eos_token
            ):
                results["input_ids"][i].append(tokenizer.eos_token_id)
                results["attention_mask"][i].append(1)

        results["labels"] = copy.deepcopy(results["input_ids"])
        results["input_id_len"] = [len(result) for result in results["input_ids"]]
        return results

    def preprocess_function(examples):
        keys = list(examples.data.keys())
        if len(keys) != 2:
            raise ValueError("Unsupported dataset format")

        st = [s + t for s, t in zip(examples[keys[0]], examples[keys[1]])]

        examples_tokenized = tokenize(st)
        input_ids = examples_tokenized["input_ids"]
        labels = examples_tokenized["labels"]
        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": examples_tokenized["attention_mask"],
        }

    tokenized_datasets = raw_datasets.map(
        preprocess_function,
        batched=True,
        load_from_cache_file=True,
    )

    def concatenate_data(dataset, max_seq_length):
        concatenated_dataset = {}
        for column in dataset.features:
            concatenated_data = [item for sample in dataset[column] for item in sample]
            reshaped_data = [
                concatenated_data[i * max_seq_length : (i + 1) * max_seq_length]
                for i in range(len(concatenated_data) // max_seq_length)
            ]
            concatenated_dataset[column] = reshaped_data
        return datasets.Dataset.from_dict(concatenated_dataset)

    tokenized_datasets_ = tokenized_datasets["train"].remove_columns(["prompt_sources", "prompt_targets"])
    tokenized_datasets["train"] = concatenate_data(tokenized_datasets_, max_seq_length)

    return tokenized_datasets
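
The packing step in `concatenate_data` can be illustrated on toy token lists. This is a minimal sketch of the same flatten-and-chunk logic; the function name `pack_sequences` and the sample data are made up for illustration.

```python
def pack_sequences(samples, block_size):
    """Flatten all samples into one token stream and cut it into equal blocks."""
    flat = [token for sample in samples for token in sample]
    num_blocks = len(flat) // block_size  # the incomplete tail is dropped
    return [flat[i * block_size : (i + 1) * block_size] for i in range(num_blocks)]


# Three toy "tokenized records" of different lengths (9 tokens in total).
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(samples, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every block has exactly `block_size` tokens, so no padding is needed at batch time. The trade-off is that record boundaries can fall inside a block and that the trailing tokens that do not fill a block (token `9` here) are discarded.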

## Prepare Dataloader Function

Wrap the tokenized dataset in a dataloader, using `DataCollatorForLanguageModeling` from transformers to build batches.

No evaluation dataset is needed, because this example does not run evaluation after each epoch.

In [4]:

def prepare_dataloader(datasets, tokenizer):

    data_collator = DataCollatorForLanguageModeling(tokenizer, pad_to_multiple_of=8, return_tensors="pt", mlm=False)
    print(f"Using data collator of type {data_collator.__class__.__name__}")

    train_dataloader_params = {
        "shuffle": False,
        "collate_fn": data_collator,
        "batch_size": 8,
        "pin_memory": True,
    }
    train_dataset = datasets["train"]
    train_dataloader = torch.utils.data.DataLoader(train_dataset, **train_dataloader_params)
    return train_dataloader
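
As a rough illustration of what collation for causal language modeling involves, the sketch below pads a batch to a multiple of 8 and masks the padded label positions with `-100` so that they are ignored by the loss. It is a simplified pure-Python stand-in, not the transformers implementation; the function name `collate_causal_lm` and the right-padding choice are assumptions made for this example.

```python
def collate_causal_lm(sequences, pad_id=0, multiple=8, ignore_index=-100):
    """Pad token lists to a common length and build labels for causal LM training."""
    longest = max(len(seq) for seq in sequences)
    # Round the target length up to the next multiple (ceil division).
    target_len = -(-longest // multiple) * multiple
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        num_pad = target_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * num_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * num_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        batch["labels"].append(seq + [ignore_index] * num_pad)
    return batch


batch = collate_causal_lm([[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]])
print(len(batch["input_ids"][0]))  # 16
print(batch["labels"][0])
```

In this notebook the packed sequences all have length 512, which is already a multiple of 8, so the collator does not need to add any padding.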

## Training Function

This function is executed by each worker during training and performs the following steps:

- Load and preprocess the dataset.
- Load the tokenizer of the pretrained model and tokenize the dataset.
- Load the pretrained model, convert it to a LoRA model, and move it to the HPU device.
- Create the optimizer.
- Create a `GaudiAccelerator` instance.
- Run the training loop.
- Save the fine-tuned model.

Where the transformers `Trainer` relies on `Accelerator` to train models, this example uses `GaudiAccelerator` to make distributed training simpler, more efficient, and more adaptable.

Compared to a training function for GPU, no changes are needed to port to HPU. Internally, Ray Train does these things:

- Detect HPU and set the device.
- Initialize the Habana PyTorch backend.
- Initialize the Habana distributed backend.

In [5]:

def train_func_per_worker(config: Dict):
    # prepare datasets
    raw_datasets = load_dataset("tatsu-lab/alpaca")
    preprocess_dataset(raw_datasets)

    # prepare tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(config["model"])
    tokenized_datasets = preprocess_dataset_to_tokenizer(raw_datasets, tokenizer)

    # prepare dataloader
    train_dataloader = prepare_dataloader(tokenized_datasets, tokenizer)

    # prepare model
    model = transformers.AutoModelForCausalLM.from_pretrained(config["model"], **config["model_config"])
    peft_config = peft.LoraConfig(**config["lora_config"])
    model = peft.get_peft_model(model, peft_config)
    device = ray.train.torch.get_device()
    model.to(dtype=config["model_config"]["torch_dtype"], device=device)

    # prepare optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])

    print(f"device = {device}, config = {config}")

    # create accelerator
    accelerator = GaudiAccelerator()
    accelerator.wait_for_everyone()
    steps_per_epoch = len(train_dataloader)
    num_train_epoch = config["epochs"]
    max_train_steps = num_train_epoch * steps_per_epoch
    print(f"num_train_epoch = {num_train_epoch}, max_train_steps = {max_train_steps}")
    lr_scheduler = transformers.get_scheduler(name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=max_train_steps)
    model.train()
    if config["execution_mode"] == "eager.compile":
        model = torch.compile(model, backend="hpu_backend")
    model = accelerator.prepare(model)
    optimizer, train_dataloader, lr_scheduler = accelerator.prepare(optimizer, train_dataloader, lr_scheduler)

    # training
    logging_steps = 1
    for epoch in range(num_train_epoch):
        # train one epoch here
        start = time.time()
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(model):
                model.train()
                batch = batch.to(device=device)
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)
                htcore.mark_step()
                optimizer.step()
                htcore.mark_step()
                lr_scheduler.step()
                htcore.mark_step()
                optimizer.zero_grad()
                if step % logging_steps == 0:
                    loss = loss.item()
                    epochs = epoch + step / steps_per_epoch
                    elapsed_time = time.time() - start
                    print(f"train epoch: {epochs:.6f}\tloss:{loss:.6f}\ttime:{elapsed_time:.6f}")
                    start = time.time()
        # evaluate here
        # model.eval()

        # save checkpoint here
        # torch.save(...)

        accelerator.wait_for_everyone()

    # save model
    output = config["output"]
    print(f"start save model to {output}")
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output, is_main_process=accelerator.is_main_process, save_function=accelerator.save)
    print(f"finish save model to {output}")
    accelerator.wait_for_everyone()
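
The training function requests a `"linear"` scheduler with no warmup. As we understand that schedule, the learning rate decays in a straight line from its initial value to zero over `num_training_steps`. The sketch below reproduces this in plain Python; the helper `linear_lr` is made up for illustration and is not part of transformers.

```python
def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 1e-4      # "lr" in the training config
total_steps = 3224  # max_train_steps printed in the training log of this notebook

for step in (0, 806, 1612, 2418, 3224):
    print(f"step {step:>4}: lr = {linear_lr(step, base_lr, total_steps):.2e}")
```

Halfway through training (step 1612) the learning rate is half of its initial value, and it reaches zero at the last step.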

## Main Training Function
The `train_llama` function sets up the distributed training environment using Ray and starts the training process. To enable training using HPU, we only need to make the following changes:
- Set the execution mode for training. The supported execution modes are:

    - "lazy": Deferred execution of graphs, comprising ops delivered from the script op by op, similar to Eager mode. It gives the Eager mode experience with performance on Gaudi. Unlike Eager mode with `torch.compile`, the graph is analyzed in each iteration, leading to higher CPU usage.
    - "eager": Op-by-op execution as defined in standard PyTorch Eager mode scripts.
    - "eager.compile": Eager mode extended with `torch.compile`. Similar to Eager mode, but all or part of the model (such as a function) is wrapped into a graph. Parts that are not wrapped are executed eagerly.

    For more details, see the [PyTorch Gaudi Theory of Operations](https://docs.habana.ai/en/latest/PyTorch/Reference/PyTorch_Gaudi_Theory_of_Operations.html).
- Require an HPU for each worker in ScalingConfig
- Set backend to `hccl` in TorchConfig
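
The mapping from execution mode to the `PT_HPU_LAZY_MODE` environment variable can be sketched as a small helper. The function `lazy_mode_flag` and the validation step are additions made for illustration; the training function in this notebook treats any mode other than "lazy" as eager without checking it.

```python
SUPPORTED_MODES = ("lazy", "eager", "eager.compile")


def lazy_mode_flag(execution_mode):
    """Return the PT_HPU_LAZY_MODE value for a given execution mode."""
    if execution_mode not in SUPPORTED_MODES:
        raise ValueError(
            f"Unsupported execution mode {execution_mode!r}; expected one of {SUPPORTED_MODES}"
        )
    # Only lazy mode enables the lazy backend; both eager variants disable it.
    return "1" if execution_mode == "lazy" else "0"


for mode in SUPPORTED_MODES:
    print(f"{mode}: PT_HPU_LAZY_MODE={lazy_mode_flag(mode)}")
```

Rejecting unknown values early avoids silently running in eager mode because of a misspelled mode name.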

In [6]:

def train_llama(num_workers=2, execution_mode="lazy"):
    # Setting environment variables
    os.environ["RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES"] = "true"
    if execution_mode == "lazy":
        os.environ["PT_HPU_LAZY_MODE"] = "1"
    else:
        os.environ["PT_HPU_LAZY_MODE"] = "0"

    # Preparing train configurations
    train_config = {
        "execution_mode": execution_mode,
        "model": "/root/models/llama-7b",
        "model_config": {"torch_dtype": torch.bfloat16, "trust_remote_code": False, "use_auth_token": None},
        "lora_config": {"task_type": "CAUSAL_LM", "r": 8, "lora_alpha": 32, "lora_dropout": 0.1, "target_modules": ["q_proj", "v_proj"]},
        "lr": 1e-4,
        "epochs": 2,
        "batch_size_per_worker": 8,
        "output": "/tmp/ray/",
    }

    # Configure computation resources
    # In ScalingConfig, require an HPU for each worker
    scaling_config = ScalingConfig(num_workers=num_workers, resources_per_worker={"CPU": 1, "HPU": 1})
    # Set backend to hccl in TorchConfig
    torch_config = TorchConfig(backend="hccl")

    # start your ray cluster
    ray.init()

    # Initialize a Ray TorchTrainer
    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config=train_config,
        torch_config=torch_config,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")
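
To get a feel for how small the LoRA update is with the `lora_config` above, the following back-of-the-envelope sketch counts the trainable adapter parameters. It assumes the published Llama-2-7b dimensions (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` projections); the helper `lora_param_count` is made up for illustration.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


hidden_size = 4096  # assumed Llama-2-7b hidden size
num_layers = 32     # assumed number of decoder layers
rank = 8            # "r" in lora_config
target_modules = ["q_proj", "v_proj"]

per_module = lora_param_count(hidden_size, hidden_size, rank)
total = per_module * len(target_modules) * num_layers
print(f"per adapted module: {per_module:,}")
print(f"total trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters add about 4.2 million trainable parameters, a small fraction of the roughly 7 billion parameters in the base model, which stay frozen.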

## Start Training

Finally, we call the `train_llama` function to start the training process. You can adjust the number of workers to use, and the execution mode for HPU.

In [7]:
# Supported execution modes: "lazy", "eager", "eager.compile"
train_llama(num_workers=4, execution_mode="lazy")

- Current time: 2024-04-11 08:18:16
- Running for: 00:09:23.44
- Memory: 97.3/1007.4 GiB

| Trial name | Status | Location |
|---|---|---|
| TorchTrainer_bca23_00000 | TERMINATED | 10.7.4.144:152049 |


[36m(pid=152049)[0m   _torch_pytree._register_pytree_node(
[36m(TrainTrainable pid=152049)[0m   _torch_pytree._register_pytree_node(
[36m(RayTrainWorker pid=152619)[0m   _torch_pytree._register_pytree_node(
[36m(RayTrainWorker pid=152616)[0m   _torch_pytree._register_pytree_node(
[36m(RayTrainWorker pid=152616)[0m Setting up process group for: env:// [rank=0, world_size=4]
[36m(TorchTrainer pid=152049)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=152049)[0m - (ip=10.7.4.144, pid=152616) world_rank=0, local_rank=0, node_rank=0
[36m(TorchTrainer pid=152049)[0m - (ip=10.7.4.144, pid=152617) world_rank=1, local_rank=1, node_rank=0
[36m(TorchTrainer pid=152049)[0m - (ip=10.7.4.144, pid=152618) world_rank=2, local_rank=2, node_rank=0
[36m(TorchTrainer pid=152049)[0m - (ip=10.7.4.144, pid=152619) world_rank=3, local_rank=3, node_rank=0
Map:   0%|          | 0/52002 [00:00<?, ? examples/s]
[36m(RayTrainWorker pid=152618)[0m   _torch_pytree._register_pyt

[36m(RayTrainWorker pid=152618)[0m Using data collator of type DataCollatorForLanguageModeling


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Map:  98%|█████████▊| 51000/52002 [00:10<00:00, 5425.65 examples/s][32m [repeated 15x across cluster][0m
Map: 100%|██████████| 52002/52002 [00:10<00:00, 4870.22 examples/s][32m [repeated 3x across cluster][0m
Map:  90%|█████████ | 47000/52002 [00:09<00:00, 5162.04 examples/s][32m [repeated 4x across cluster][0m
Loading checkpoint shards:  50%|█████     | 1/2 [00:10<00:10, 10.81s/it]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s][32m [repeated 3x across cluster][0m
Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.32s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [00:10<00:10, 10.67s/it][32m [repeated 3x across cluster][0m


[36m(RayTrainWorker pid=152618)[0m device = hpu, config = {'execution_mode': 'lazy', 'model': '/root/models/llama-7b', 'model_config': {'torch_dtype': torch.bfloat16, 'trust_remote_code': False, 'use_auth_token': None}, 'lora_config': {'task_type': 'CAUSAL_LM', 'r': 8, 'lora_alpha': 32, 'lora_dropout': 0.1, 'target_modules': ['q_proj', 'v_proj']}, 'lr': 0.0001, 'epochs': 2, 'batch_size_per_worker': 8, 'output': '/tmp/ray/'}
[36m(RayTrainWorker pid=152618)[0m num_train_epoch = 2, max_train_steps = 3224
[36m(RayTrainWorker pid=152616)[0m Using data collator of type DataCollatorForLanguageModeling[32m [repeated 3x across cluster][0m


[36m(RayTrainWorker pid=152616)[0m  PT_HPU_LAZY_MODE = 1
[36m(RayTrainWorker pid=152616)[0m  PT_RECIPE_CACHE_PATH = 
[36m(RayTrainWorker pid=152616)[0m  PT_CACHE_FOLDER_DELETE = 0
[36m(RayTrainWorker pid=152616)[0m  PT_HPU_RECIPE_CACHE_CONFIG = 
[36m(RayTrainWorker pid=152616)[0m  PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
[36m(RayTrainWorker pid=152616)[0m  PT_HPU_LAZY_ACC_PAR_MODE = 1
[36m(RayTrainWorker pid=152616)[0m  PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
[36m(RayTrainWorker pid=152616)[0m ---------------------------: System Configuration :---------------------------
[36m(RayTrainWorker pid=152616)[0m Num CPU Cores : 160
[36m(RayTrainWorker pid=152616)[0m CPU RAM       : 1056375244 KB
[36m(RayTrainWorker pid=152616)[0m ------------------------------------------------------------------------------
Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.22s/it][32m [repeated 3x across cluster][0m


[36m(RayTrainWorker pid=152618)[0m train epoch: 0.000000	loss:2.170027	time:54.988902
[36m(RayTrainWorker pid=152616)[0m device = hpu, config = {'execution_mode': 'lazy', 'model': '/root/models/llama-7b', 'model_config': {'torch_dtype': torch.bfloat16, 'trust_remote_code': False, 'use_auth_token': None}, 'lora_config': {'task_type': 'CAUSAL_LM', 'r': 8, 'lora_alpha': 32, 'lora_dropout': 0.1, 'target_modules': ['q_proj', 'v_proj']}, 'lr': 0.0001, 'epochs': 2, 'batch_size_per_worker': 8, 'output': '/tmp/ray/'}[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=152616)[0m num_train_epoch = 2, max_train_steps = 3224[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=152617)[0m train epoch: 0.000620	loss:2.236194	time:32.765280[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=152616)[0m train epoch: 0.005583	loss:1.603664	time:0.488202[32m [repeated 32x across cluster][0m
[36m(RayTrainWorker pid=152616)[0m train epoch: 0.012407	loss:1.465473	

2024-04-11 08:18:16,938	INFO tune.py:1021 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/TorchTrainer_2024-04-11_08-08-53' in 0.0064s.


Trial TorchTrainer_bca23_00000 completed. Last result: 


2024-04-11 08:18:16,950	INFO tune.py:1053 -- Total run time: 563.48 seconds (563.44 seconds for the tuning loop).


Training result: Result(
  metrics={},
  path='/root/ray_results/TorchTrainer_2024-04-11_08-08-53/TorchTrainer_bca23_00000_0_2024-04-11_08-08-53',
  filesystem='local',
  checkpoint=None
)
