# Fine-tuning LLaMA 2 models - from tutorial

[Tutorial](https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/)

# Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks

*This tutorial walks through the process of fine-tuning [LLaMA 2](https://ai.meta.com/llama/) models, providing step-by-step instructions.*

*All the code related to this article is available in our dedicated [GitHub repository](https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/llm/miniconda/llama2-fine-tuning/llama_2_finetuning.ipynb).*

## Introduction
On July 18, 2023, [Meta](https://about.meta.com/) released LLaMA 2, the latest version of their **Large Language Model** (LLM).

Trained between January 2023 and July 2023 on 2 trillion tokens, these new models are reported by Meta to outperform other open-source LLMs on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. This release comes in different flavors, with parameter sizes of **[7B](https://huggingface.co/meta-llama/Llama-2-7b-hf)**, **[13B](https://huggingface.co/meta-llama/Llama-2-13b-hf)** and a mind-blowing **[70B](https://huggingface.co/meta-llama/Llama-2-70b-hf)**. The models are free for both commercial and research use in English.

To adapt these models to a specific text generation need, we will use [QLoRA](https://arxiv.org/abs/2305.14314) (Efficient Finetuning of Quantized LLMs), a highly efficient fine-tuning technique that involves quantizing a pretrained LLM to just 4 bits and adding small “Low-Rank Adapters”. Only the adapters are trained, while the quantized base weights stay frozen. This approach allows for fine-tuning LLMs **using just a single GPU**! This technique is supported by the [PEFT](https://huggingface.co/docs/peft/) library.

## Set up Python environment
The following libraries are used for this method (`requirements.txt` file):

```
torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy
```

Then install the requirements and import the libraries:

In [1]:
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, DataCollatorForLanguageModeling, Trainer, TrainingArguments
import networkx as nx

## Download LLaMA 2 model
As mentioned before, LLaMA 2 models come in different flavors which are 7B, 13B, and 70B. Your choice can be influenced by your computational resources. Indeed, larger models require more resources, memory, processing power, and training time.

To download the model you have been granted access to, **make sure you are logged in to the Hugging Face model hub**. You can do this with the `huggingface-cli login` command.

The following functions load the tokenizer and the model. `load_model` requires a bitsandbytes configuration that we will define later.

In [2]:
import os

base_dir = '/home/manish/thesis-implementations/quest_generation/llama2/'
model_name = 'meta-llama/llama-2-13b-hf'
model_dir = 'models'
model_path = os.path.join(base_dir, model_dir, model_name)

In [None]:
print(model_path, os.path.exists(model_path))

/home/manish/thesis-implementations/quest_generation/llama2/models/meta-llama/llama-2-13b-hf True


In [3]:
def load_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token
    
    return tokenizer

In [5]:
def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{12288}MB'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        use_cache=False,
        device_map="auto",
        max_memory = {i: max_memory for i in range(n_gpus)},
    )

    return model, load_tokenizer(model_name)

## Quest Dataset

Load the quest dataset for training and create prompts accordingly.

In [25]:
import json
import numpy as np
import os
import random

dataset_path = os.path.join(base_dir, 'data')
train_file = 'train.jsonl'
val_file = 'val.jsonl'
data_files = {
	"train": train_file, 
	"val": val_file
}

## Create a bitsandbytes configuration and load the model and tokenizer
This will allow us to load our LLM in 4 bits. Compared with 16-bit weights, this divides the memory used by the weights by roughly 4 and lets us load the model on smaller devices. We also choose the bfloat16 compute data type, and enable nested quantization to save additional memory.

In [5]:
def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config
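
As a rough illustration of the memory argument, the sketch below estimates the size of the weights alone at 16 bits and at 4 bits per parameter. The round parameter counts and the helper name `weight_memory_gib` are assumptions made for this example. The estimate ignores quantization constants, activations, the LoRA adapters and the optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumed round parameter counts for the three LLaMA 2 sizes.
model_sizes = [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]

for name, n_params in model_sizes:
    fp16 = weight_memory_gib(n_params, 16)  # 16-bit weights
    nf4 = weight_memory_gib(n_params, 4)    # 4-bit quantized weights
    print(f"{name}: 16-bit ~{fp16:.1f} GiB, 4-bit ~{nf4:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which is what makes a single-GPU setup plausible for the smaller models.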

To leverage the LoRA method, we need to wrap the model as a PeftModel.

To do this, we need to implement a [LoRA configuration](https://huggingface.co/docs/peft/conceptual_guides/lora):

In [6]:
def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        lora_alpha=16,  # parameter for scaling
        lora_dropout=0.1,  # dropout probability for layers
        r=64,  # dimension of the updated matrices
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=modules,
    )

    return config
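
To make the roles of `r` and `lora_alpha` more concrete, here is a library-free sketch of what a LoRA-adapted linear layer computes: the frozen weight `W` plus a low-rank update `B A` scaled by `lora_alpha / r`. The toy matrices and the helper names `matvec` and `lora_forward` are made up for illustration; this is not PEFT's implementation, only the arithmetic behind it.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scaling):
    """Return W x + scaling * B (A x), the output of a LoRA-adapted layer."""
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank trainable path
    return [b + scaling * u for b, u in zip(base, update)]


# Scaling implied by the configuration above: lora_alpha / r = 16 / 64.
scaling = 16 / 64

# Toy layer with d_in=3, d_out=2 and rank 1 (hypothetical values).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen weight, shape (2, 3)
A = [[1.0, 1.0, 1.0]]                   # adapter A, shape (1, 3)
B_zero = [[0.0], [0.0]]                 # adapter B at initialisation, shape (2, 1)
B_trained = [[1.0], [2.0]]              # adapter B after some hypothetical training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, scaling))
print(lora_forward(W, A, B_trained, x, scaling))
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained behaviour and only gradually departs from it.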

In [9]:
# Load the model and tokenizer from the local model path with the bitsandbytes config
bnb_config = create_bnb_config()
model, tokenizer = load_model(model_path, bnb_config)
# tokenizer = load_tokenizer(model_path)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [8]:
# TRAIN_TYPE = 'no_kg'
TRAIN_TYPE = 'text_kg'
# TRAIN_TYPE = 'tree_kg'
KG_DEPTH = 2

In [11]:
EOS_TOKEN = tokenizer.eos_token
PAD_TOKEN = tokenizer.pad_token
BOS_TOKEN = tokenizer.bos_token

In [12]:
map_game = {
    'TESO': 'TESOblivion_KG.gml',
    'TESS': 'TESSkyrim_KG.gml',
    'TL2': 'Torchlight2_KG.gml',
    'MC': 'Minecraft_KG.gml',
    'BG1': 'BaldursGate1_KG.gml',
    'BG2': 'BaldursGate2_KG.gml'
}

kg_map = {}

In [13]:
kg_base_dir = '/home/manish/thesis-implementations/data/VartinenFormatted/KGs'

for gid, gname in map_game.items():
    kg_path = os.path.join(kg_base_dir, map_game[gid])
    kg = nx.read_gml(kg_path)
    kg_map[gid] = kg

In [33]:
def create_training_prompt_formats(input):
    """
    Format various fields of the input quest data ('plots', 'kb', 'quest')
    Then concatenate them using two newline characters
    :param input: input dictionary
    """

    BACKGROUND = "### Background:"
    PLOTS_KEY = "### Plots:"
    INTRO_BLURB = "The quest related to the above information is as follows:"
    QUEST = "### Quest:"
    END_KEY = "### End"
    

    blurb = f"{INTRO_BLURB}"  # add intro blurb - model system instruction

    # add background - only if knowledge graph as text
    background = ''
    
    # add plots - key plot points
    plots_str = '\n'.join(input['plots'])
    plots = f"{PLOTS_KEY}\n{plots_str}"
    
    if TRAIN_TYPE == 'text_kg':
        completed_rels = []
        completed_nodes = []
        
        for kb in input['kbs']:
            entity = kb['name']
            e_desc = kb['description']
            e_type = kb['type']
            e_relations = kb['relations']
            
            background += f'{entity} is a {e_type}. '
            
            if entity != e_desc:
                background+= f'{entity} is {e_desc}. '
                
            for rel in e_relations:
                background += f'{entity} is {rel[0]} {rel[1]}. '
                completed_rels.append((entity, rel[1]))
            completed_nodes.append(entity)                     
            background += '\n'
        
        kg = kg_map[input['game']]
        all_nodes = kg.nodes(data=True)
        for node in all_nodes:
            entity = node[0]
            if entity.lower() in plots.lower():
                edges = list(nx.dfs_edges(kg, source=entity))
                for ent1, ent2 in edges:
                    if (ent1, ent2) in completed_rels or (ent2, ent1) in completed_rels:
                        continue
                    e1_type = all_nodes[ent1]['type']
                    e1_desc = all_nodes[ent1]['description']
                    e2_type = all_nodes[ent2]['type']
                    e2_desc = all_nodes[ent2]['description']
                    
                    if ent1 not in completed_nodes:
                        background += f'{ent1} is a {e1_type}. '
                        if e1_desc != ent1:
                            background += f'{ent1} is {e1_desc}. '
                        completed_nodes.append(ent1)
                        background += '\n'
                    
                    if ent2 not in completed_nodes:
                        background += f'{ent2} is a {e2_type}. '
                        if e2_desc != ent2:
                            background += f'{ent2} is {e2_desc}. '
                        completed_nodes.append(ent2)
                        background += '\n'
                    
                    rel = kg[ent1][ent2]['label']
                    if rel == 'connected to':
                        background += f'{ent1} is {rel} {ent2}. '
                    if rel == 'present in':
                        if e1_type == 'location':
                            background += f'{ent2} is {rel} {ent1}. '
                        else:
                            background += f'{ent1} is {rel} {ent2}. '
                    if rel == 'held by':
                        if e1_type == 'character':
                            background += f'{ent2} is {rel} {ent1}. '
                        else:
                            background += f'{ent1} is {rel} {ent2}. '
                    background += '\n'
                    completed_rels.append((ent1, ent2))
                        
        background = f"{BACKGROUND}\n{background}"
    
    # add concatenated quest text
    quest_str = ''
    for k,v in input['quest'].items():
        if k == 'description':
            continue
        if k == 'tasks':
            value = '\n ' + '\n '.join(np.char.capitalize(v[:-1]))
        else:
            value = v.capitalize()
        quest_str += f'{k.capitalize()}: {value}\n' 
    quest = f"{QUEST}\n{quest_str}"  # add quest output
    
    end = f"{END_KEY}"  # add end key
    
    if TRAIN_TYPE in ['no_kg', 'tree_kg']:
        parts = [part for part in [plots, blurb, quest, end] if part]
    else:
        parts = [part for part in [background, plots, blurb, quest, end] if part]

    formatted_prompt = "\n\n".join(parts)
    input['text'] = formatted_prompt

    return input
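
The overall layout of the resulting prompt is easier to see on a small example. The sketch below reproduces only the final step, joining the non-empty sections with two newlines. The helper name `join_prompt_sections` and the toy sample are hypothetical and are not taken from the quest dataset.

```python
def join_prompt_sections(plots, quest_fields, background=""):
    """Assemble the prompt sections in order, skipping any empty section."""
    quest_lines = "".join(f"{key.capitalize()}: {value}\n" for key, value in quest_fields.items())
    sections = [
        f"### Background:\n{background}" if background else "",
        "### Plots:\n" + "\n".join(plots),
        "The quest related to the above information is as follows:",
        f"### Quest:\n{quest_lines}",
        "### End",
    ]
    return "\n\n".join(section for section in sections if section)


# Hypothetical toy sample, only to show the layout.
toy_plots = ["A merchant lost a ring.", "The ring is in a cave."]
toy_quest = {"title": "The lost ring", "objective": "Return the ring"}

prompt_without_kg = join_prompt_sections(toy_plots, toy_quest)
prompt_with_kg = join_prompt_sections(toy_plots, toy_quest, background="The cave is a location. ")
print(prompt_with_kg)
```

The `### End` marker gives the model an explicit signal for where a quest stops, which can also be used to cut off generations at inference time.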

In [18]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, dataset: "Dataset", include_kg: bool = True):
    """Format & tokenize the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param dataset (Dataset): Hugging Face dataset split to preprocess
    :param include_kg (bool): Whether to include knowledge graph in the prompt
        (currently unused; the global TRAIN_TYPE controls this)
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_training_prompt_formats)#, batched=True)
    
    # Tokenize each batch of the dataset and remove the raw 'id', 'game', 'kbs', 'plots', 'quest' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["id", "game", "kbs", "plots", "quest"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    return dataset

Now, we will use the **model tokenizer to process these prompts into tokenized ones**.

The goal is to create input sequences of uniform length, which suits fine-tuning because it maximizes efficiency and minimizes computational overhead, and which must not exceed the model’s maximum token limit.
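
One detail of the preprocessing above is worth spelling out: the tokenizer truncates to `max_length`, and the filter then keeps only sequences strictly shorter than `max_length`. The sketch below mimics that interaction with a whitespace splitter standing in for the real tokenizer; `toy_tokenize` and `keep_sample` are hypothetical helpers, not part of transformers.

```python
def toy_tokenize(text, max_length):
    """Whitespace 'tokenizer' that cuts the sequence at max_length, like truncation=True."""
    return text.split()[:max_length]


def keep_sample(token_ids, max_length):
    """Same test as the filter step: keep sequences strictly shorter than max_length."""
    return len(token_ids) < max_length


toy_max_length = 5
toy_texts = ["a b c", "a b c d e", "a b c d e f g"]

tokenized = [toy_tokenize(text, toy_max_length) for text in toy_texts]
kept = [ids for ids in tokenized if keep_sample(ids, toy_max_length)]

print([len(ids) for ids in tokenized])
print(len(kept))
```

In this toy setting, any sample that reaches the limit, whether it was truncated or not, is dropped, so the model never trains on a prompt whose `### End` marker may have been cut off.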

The `create_peft_config` function defined earlier needs the names of the target modules whose weight matrices will receive LoRA adapters. The following function collects them from our model:

In [17]:
# SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
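
The name handling in this function can be tried without loading a model. The sketch below applies the same idea, keeping the last component of each dotted module path and dropping `lm_head`, to a short list of illustrative paths. The paths and the helper name `leaf_names` are assumptions for this example; the real names come from `model.named_modules()`.

```python
def leaf_names(module_paths, exclude=("lm_head",)):
    """Collect the unique last components of dotted module paths, minus excluded names."""
    names = {path.split(".")[-1] for path in module_paths}
    return sorted(names - set(exclude))


# Illustrative module paths (assumed, not read from a real model).
example_paths = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

print(leaf_names(example_paths))
```

Because only the leaf name is kept, one entry such as `q_proj` matches the corresponding projection in every layer when it is passed as `target_modules`.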

Once everything is set up and the base model is prepared, we can use the `print_trainable_parameters()` helper function to see how many trainable parameters are in the model. We expect the LoRA model to have far fewer trainable parameters than the original one, since only the adapter weights are updated during fine-tuning.

In [18]:
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        # Integer division keeps the count an int so the ',d' format below still works
        trainable_params //= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )
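
To get a feel for why the trainable share is small, the sketch below compares the number of weights in one full linear layer with the number of weights in its LoRA adapter, which is `r * (d_in + d_out)` for the two matrices of shapes `(r, d_in)` and `(d_out, r)`. The layer size of 4096 is an assumed example value, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Number of weights in one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Assumed example: a square 4096 x 4096 projection with the rank used above.
d_in, d_out, rank = 4096, 4096, 64

full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, rank)
share = 100 * adapter_params / (full_params + adapter_params)

print(f"full layer: {full_params:,d} || adapter: {adapter_params:,d} || trainable%: {share:.2f}")
```

The share for a whole model depends on which modules receive adapters and on the layers left untouched, so the figure printed by `print_trainable_parameters()` will differ from this single-layer estimate.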

## Train
Now that everything is ready, we can load and pre-process our dataset, then fine-tune the model we loaded earlier using the configurations defined above:

In [26]:
dataset = load_dataset(dataset_path, data_files=data_files)
train_dataset = dataset['train']
val_dataset = dataset['val']

Found cached dataset json (/home/manish/.cache/huggingface/datasets/json/data-6359e290ba54d2fa/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)


  0%|          | 0/2 [00:00<?, ?it/s]

In [19]:
## Preprocess dataset

max_length = get_max_length(model)
train_dataset = preprocess_dataset(tokenizer, max_length, train_dataset)

Found max lenth: 4096
Preprocessing dataset...


Map:   0%|          | 0/77 [00:00<?, ? examples/s]

Map:   0%|          | 0/77 [00:00<?, ? examples/s]

Filter:   0%|          | 0/77 [00:00<?, ? examples/s]

In [None]:
kg_map = None
import gc
gc.collect()

In [None]:
def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            output_dir=output_dir,
            overwrite_output_dir=True,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            optim="paged_adamw_32bit",
            logging_steps=20,
            learning_rate=2e-4,
            fp16=True,
            warmup_steps=10,
            max_steps=200,
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
    
    # Launch training
    print("Training...")
    
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    print(metrics)    
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = os.path.join(base_dir, model_dir, 'results', model_name, f'{TRAIN_TYPE}')
train(model, tokenizer, train_dataset, output_dir)

*If you prefer to set a number of epochs (full passes of the training dataset through the model) instead of a number of training steps (forward and backward passes through the model with one batch of data), you can replace the `max_steps` argument with `num_train_epochs`.*
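
To relate the two settings, the sketch below converts the step-based configuration used above (`max_steps=200`, `per_device_train_batch_size=1`, `gradient_accumulation_steps=4`) into an approximate number of passes over the data. It assumes a single GPU and the 77 samples shown in the preprocessing output, before any filtering; the helper names are hypothetical, and the exact step accounting inside `Trainer` may round differently.

```python
def samples_seen(max_steps, per_device_batch_size, grad_accum_steps, n_devices=1):
    """Samples consumed after max_steps optimizer steps."""
    return max_steps * per_device_batch_size * grad_accum_steps * n_devices


def approx_epochs(max_steps, n_samples, per_device_batch_size=1, grad_accum_steps=4, n_devices=1):
    """Approximate number of passes over a dataset of n_samples."""
    seen = samples_seen(max_steps, per_device_batch_size, grad_accum_steps, n_devices)
    return seen / n_samples


total_seen = samples_seen(max_steps=200, per_device_batch_size=1, grad_accum_steps=4)
passes = approx_epochs(max_steps=200, n_samples=77)

print(f"samples seen: {total_seen}, approx. passes over the data: {passes:.1f}")
```

On a dataset this small, 200 steps already correspond to many passes over the same examples, which is one reason to monitor a validation set.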

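To later load and use the model for inference, we call `trainer.model.save_pretrained(output_dir)`. Because the model is wrapped as a PEFT model, this saves the LoRA adapter weights and their configuration rather than a full copy of the base model, and it does not save the tokenizer files. The adapters can later be reloaded on top of the base model, for example with the `AutoPeftModelForCausalLM` class imported above.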

Unfortunately, it is possible that the latest weights are not the best. To address this, you can add an `EarlyStoppingCallback`, from transformers, to your fine-tuning. Together with an evaluation dataset, periodic evaluation and the `load_best_model_at_end` training argument, this lets you regularly evaluate your model on the validation set, if you have one, and keep only the best weights.
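
The idea behind early stopping can be sketched without any library: track the best validation loss seen so far and stop once it has failed to improve for a given number of consecutive evaluations. The function below is a simplified stand-in written for this note, not the transformers implementation, and the loss values are invented purely to exercise the logic.

```python
def early_stopping(eval_losses, patience):
    """Return (stop_index, best_index).

    stop_index is the evaluation at which training would stop, or None if
    the patience is never exhausted; best_index is the evaluation with the
    lowest loss seen up to that point.
    """
    best_loss = float("inf")
    best_index = None
    evaluations_without_improvement = 0

    for index, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_index = loss, index
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                return index, best_index
    return None, best_index


# Invented validation losses, one per evaluation.
toy_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9]
stop_at, best_at = early_stopping(toy_losses, patience=2)
print(stop_at, best_at)
```

In this toy run the weights worth keeping are those from the best evaluation, not the ones present when training stops, which is the point of restoring the best checkpoint at the end.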