# Custom LoRA: Finetuning LLaMA3-8B-Instruct
---

**Overview:** 

In this notebook, we will explore the process of fine-tuning the LLaMA 3 model using Low-Rank Adaptation (LoRA). The topics covered include:

1. **Data Preprocessing**: Preparing and cleaning the data to ensure optimal input for the model training.
2. **Training Setup**:
   - **Loading the Tokenizer**: Initializing the tokenizer suited for LLaMA 3 to process the text data.
   - **Setting Train Hyperparameters**: Defining the key hyperparameters for effective model training.
   - **PEFT (Parameter-Efficient Fine-Tuning)**: Techniques to fine-tune the model with fewer resources.
   - **Quantization**: Reducing the model size while maintaining performance, allowing faster inference and reduced memory usage.
   - **Loading the Pre-trained Model**: Bringing in the pre-trained LLaMA 3 model to further fine-tune it with custom data.
   - **Setting the Trainer Hyperparameters**: Configuring parameters specific to the training loop for optimal performance.
   - **Run Inference**: Evaluating the fine-tuned model on new data to test its performance and accuracy.

This notebook will guide you through the steps of effectively fine-tuning and optimizing the LLaMA 3 model for your specific use case.

## Data Preprocessing
We will use the OpenAssistant Guanaco dataset for training.

In [None]:
import warnings
warnings.filterwarnings("ignore")
from datasets import load_dataset

In [None]:
ds = load_dataset("timdettmers/openassistant-guanaco")

Let us view five samples from the dataset.

In [None]:
for sample in ds['train'].select(range(5)):
    print(f"\n {'*' * 64}\n{sample}\n{'*' * 64}")

Each sample is a dict with a single `text` key, and the samples span multiple languages. For this lab, we limit ourselves to the English samples.

In [None]:
from langdetect import detect

def remove_nonEnglish_rows(ds):
    # Initialize an empty list to store rows detected as English
    new_ds = []
    
    # Initialize a list to store indices of rows that cause issues (corner cases)
    corner_case = []
    
    # Iterate through each row in the dataset's 'text' column
    for i, row in enumerate(ds['text']):
        try:
            # Detect the language of the text
            if detect(str(row)) == 'en':
                # If the language is English, add the row to new_ds
                new_ds.append(row)
        except Exception:
            # langdetect raises on text it cannot classify (e.g. empty or symbol-only rows); record the index
            corner_case.append(i)
    
    # Return the list of English rows and the indices of corner cases
    return new_ds, corner_case


In [None]:
filter_train_samples,cc_train = remove_nonEnglish_rows(ds['train'])

print("Count of training samples: ",len(filter_train_samples))

In [None]:
filter_test_samples,cc_test = remove_nonEnglish_rows(ds['test'])
print("Count of testing samples: ",len(filter_test_samples))


In [None]:
# save English text samples
import json
def save_jsonl(ds, filename):
    # Write one JSON object per line (JSON Lines), keyed by "text"
    with open(f"data/{filename}.jsonl", "w") as write_file:
        for row in ds:
            write_file.write(json.dumps({"text": row}) + "\n")
    print("dataset saved in jsonl format ....")
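
As a point of reference, the JSON Lines layout stores one JSON object per line. The sketch below uses only the standard library to write and read back two records in that layout; the helper names, the temporary file, and the sample rows are hypothetical and are not part of the notebook's pipeline.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical rows in the same "### Human / ### Assistant" style as the dataset
rows = [
    {"text": "### Human: Hi### Assistant: Hello!"},
    {"text": "### Human: What is LoRA?### Assistant: A low-rank adapter."},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(rows, path)
restored = read_jsonl(path)
print(restored == rows)
```

Keeping each record on its own line means the file can be streamed or appended to without parsing it as a whole.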

The LLaMA 3 models expect training data to follow a specific prompt format, as described [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).
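
To make the format concrete, here is a minimal sketch that renders a list of chat messages into a single prompt string using the special tokens from the linked model card: each turn is a role header, a blank line, the content, and `<|eot_id|>`. The function name `render_llama3_prompt` and the sample messages are hypothetical, chosen only for illustration.

```python
def render_llama3_prompt(messages, add_generation_prompt=False):
    """Render [{'role': ..., 'content': ...}, ...] into a Llama 3 style prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

# Hypothetical conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is LoRA?"},
]
prompt = render_llama3_prompt(messages, add_generation_prompt=True)
print(prompt)
```

For training samples, the assistant's reply would be included as a full turn; `add_generation_prompt=True` is only meant for inference prompts.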

In [None]:
def transform_to_template(example):
    conversation_text = example['text']
    # Split into turns on "###"; the piece before the first "###" is dropped
    segments = conversation_text.split("###")[1:]

    # Turns alternate: even indices are Human turns, odd indices are Assistant turns
    for idx, segment in enumerate(segments):
        if idx % 2 == 0:
            segments[idx] = segment.replace('Human:', "<|start_header_id|>user<|end_header_id|>") + "<|eot_id|>"
        else:
            segments[idx] = segment.replace('Assistant:', "<|start_header_id|>assistant<|end_header_id|>") + "<|eot_id|>"

    # Prepend the begin-of-text token and a fixed system prompt
    segments = ["<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful assistant<|eot_id|>"] + segments

    return {'text': ''.join(segments)}

We will make a separate folder for storing these datasets.

In [None]:
! mkdir -p data
! mkdir -p data/filtered

In [None]:
# set file names  
save_train_filename = 'filtered/train'
save_test_filename = 'filtered/test'

# save file
save_jsonl(filter_train_samples, save_train_filename)
save_jsonl(filter_test_samples, save_test_filename)

In [None]:
dataset = load_dataset('data/filtered/')

In [None]:
template_dataset = dataset.map(transform_to_template)

In [None]:
!mkdir -p data/ds_preprocess
template_dataset.save_to_disk('data/ds_preprocess/')

We have now saved the transformed dataset, so it can be reloaded directly after a kernel failure without repeating the preprocessing.

## Training Setup

The following libraries are used to set up the fine-tuning process.

In [None]:
# If GPU memory is limited and you hit out-of-memory errors, uncomment the os.environ["PYTORCH_CUDA_ALLOC_CONF"] line at the end of this cell.

import os
import torch
import json
import re

from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from datasets import load_dataset, load_from_disk
from langdetect import detect
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)

# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

Next, we set up the paths for loading and saving the main artifacts.

The Llama 3 family of models is openly available, but access requires an approved request. For the bootcamp environment, the weights have already been converted to a Hugging Face compatible format and stored at a shared location for quicker access.

If you are running this material in your own environment, request access to the Llama models [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and generate your Hugging Face user access token from this [link](https://huggingface.co/settings/tokens).

In [None]:
# initialize path to the base model 
# base_model = "meta-llama/Meta-Llama-3-8B-Instruct" # Use this while running the material in your own standalone environment.

base_model = "/local/llama3_8b_weights/" # shared model weight location

# set the path to the transformed training split
data_path = "data/ds_preprocess/train"
# set the path to the transformed evaluation split
eval_path = "data/ds_preprocess/test"

# load the transformed dataset
dataset = load_from_disk(data_path)
eval_dataset = load_from_disk(eval_path)

In [None]:
# Needed for standalone run
token= '<INSERT_HUGGINGFACE_TOKEN_HERE>'

### Loading the Tokenizer

In [None]:
# Load the tokenizer for the Llama-3-8B-Instruct model; in a standalone environment this downloads it once access has been granted.
tokenizer = AutoTokenizer.from_pretrained(base_model,
                                          token=token,
                                          trust_remote_code=True
                                         )

In [None]:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
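
To illustrate what right-side padding means for a batch, the sketch below pads token-id lists to a common length and builds the matching attention mask. It is a plain-Python illustration, not the tokenizer's implementation; `pad_right`, the token ids, and `pad_id=0` are hypothetical placeholders.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Two hypothetical tokenized sequences of different lengths
batch = [[11, 12, 13, 14], [21, 22]]
ids, mask = pad_right(batch, pad_id=0)
print(ids)
print(mask)
```

With right padding, the real tokens stay at the start of each row and the padding trails at the end.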

We also need a directory to save the results.

In [None]:
! mkdir -p model

### Setting Train Hyperparameters

In [None]:
training_params = TrainingArguments(
    output_dir="model/results",             # Directory to save the model results
    num_train_epochs=2,                     # Number of training epochs
    per_device_train_batch_size=5,          # Batch size per device during training
    gradient_accumulation_steps=4,          # Number of steps to accumulate gradients
    group_by_length=True,                   # Group sequences of similar lengths together
    save_steps=100,                         # Save model checkpoint every 100 steps
    logging_steps=25,                       # Log metrics every 25 steps
    learning_rate=2e-4,                     # Initial learning rate
    weight_decay=0.001,                     # Weight decay to apply (L2 regularization)
    fp16=False,                             # Enable fp16 mixed-precision training (disabled here)
    bf16=False,                             # Enable bfloat16 mixed-precision training (disabled here)
    max_grad_norm=0.3,                      # Maximum gradient norm (for gradient clipping)
    max_steps=-1,                           # Total number of training steps (-1 defers to num_train_epochs)
    warmup_ratio=0.03,                      # Fraction of steps for learning rate warmup (not applied by the plain "constant" scheduler)
    optim="paged_adamw_32bit",              # Optimizer to use (32-bit AdamW with paged memory)
    lr_scheduler_type="constant",           # Learning rate scheduler type (constant)
    report_to="tensorboard"                 # Reporting tool (TensorBoard in this case)
)
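
The batch-related settings above determine how many optimizer steps a run takes. The arithmetic below is a rough sketch under two assumptions that are not fixed by the arguments themselves: a single GPU, and the 1,000-sample training subset used later in this notebook. Exact step counts can differ slightly between library versions when the dataset size is not a multiple of the effective batch size.

```python
import math

# Values taken from the TrainingArguments above
per_device_train_batch_size = 5
gradient_accumulation_steps = 4
num_train_epochs = 2
save_steps = 100
logging_steps = 25

# Assumptions for this sketch
num_samples = 1000   # training subset selected later in the notebook
num_devices = 1      # single GPU

# Samples consumed per optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

# How often checkpoints and log lines are produced over the run
checkpoints = total_steps // save_steps
log_events = total_steps // logging_steps

print(effective_batch, steps_per_epoch, total_steps, checkpoints, log_events)
```

Under these assumptions the run is short enough that `save_steps=100` yields a single checkpoint, at the final step.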


### PEFT
LoRA is applied through `LoraConfig`, which provides the PEFT parameters that control how the method is applied to the base model. The parameters used in the cell below are:

- **lora_alpha**: the LoRA scaling factor.
- **lora_dropout**: the dropout probability for LoRA layers.
- **r**: the rank of the update matrices, given as an integer. A lower rank results in smaller update matrices with fewer trainable parameters.
- **bias**: specifies whether the bias parameters should be trained. It can be `'none'`, `'all'`, or `'lora_only'`.
- **task_type**: the task type; possible values include `CAUSAL_LM`, `FEATURE_EXTRACTION`, `QUESTION_ANS`, `SEQ_2_SEQ_LM`, `SEQ_CLS`, and `TOKEN_CLS`.

Because the task we want to perform is text generation, we set `task_type` to causal language modeling (`CAUSAL_LM`), which is commonly used for text generation tasks. Please run the cell below to set up the LoRA configuration.

In [None]:
peft_params = LoraConfig(
    lora_alpha=16,                # Alpha parameter for Lora scaling
    lora_dropout=0.1,             # Dropout rate for Lora layers
    r=64,                         # Rank of the Lora matrices
    bias="none",                  # Which bias parameters to train (none in this case)
    task_type="CAUSAL_LM",        # Type of task (Causal Language Modeling in this case)
)
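
To get a feel for what `r` and `lora_alpha` imply, the sketch below estimates the number of trainable LoRA parameters and the scaling applied to the update. For a weight of shape `(d_out, d_in)`, LoRA adds two matrices with `r * (d_in + d_out)` parameters in total. The figures rest on assumptions not stated in this notebook: that adapters are attached only to the `q_proj` and `v_proj` attention projections, and that the model uses the published Llama 3 8B dimensions (hidden size 4096, key/value projection size 1024, 32 layers).

```python
# Values from the LoraConfig above
r = 64
lora_alpha = 16

# Assumed Llama 3 8B dimensions (not stated in this notebook)
hidden_size = 4096   # input size of q_proj and v_proj, output size of q_proj
kv_size = 1024       # output size of v_proj (grouped-query attention)
num_layers = 32

def lora_params(d_in, d_out, rank):
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed target modules: q_proj and v_proj in every layer
per_layer = lora_params(hidden_size, hidden_size, r) + lora_params(hidden_size, kv_size, r)
total_trainable = per_layer * num_layers

# The low-rank update is multiplied by lora_alpha / r before being added to the frozen weight
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"scaling factor: {scaling}")
```

Under these assumptions the adapters add roughly 27 million trainable parameters, a small fraction of the base model's roughly 8 billion.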


### Quantization
**4-bit quantization configuration**

Model quantization is a popular deep-learning optimization method in which model data—network parameters and activations—are converted from floating-point to lower-precision representation, typically using 8-bit integers. Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially in large language models (LLMs). It can be combined with PEFT methods to make it easier to train and load LLMs for inference.
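
As a rough illustration of the idea, the sketch below applies symmetric absmax quantization to a handful of floating-point values, mapping them to 8-bit integers and back. This is a generic textbook scheme shown under simplifying assumptions (one scale for the whole list, hypothetical weights); it is not the NF4 scheme configured later in this notebook.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers using a single scale derived from the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers."""
    return [q * scale for q in quantized]

# Hypothetical weights
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the error by half a quantization step
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The integers take one byte each instead of four, at the cost of a small rounding error per value.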

<center><img src="imgs/quantization.png" height="400px" width="700px" /></center>
<center> <a href="https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/" > source: Using Quantization Aware Training with NVIDIA TensorRT</a></center>

Several methods and algorithms for quantizing a model are described [here](https://huggingface.co/docs/peft/main/en/developer_guides/quantization). The `bitsandbytes` library makes it easy to apply quantization and integrates with `transformers`. Through the `BitsAndBytesConfig` class, it provides config parameters to quantize a model to 8 or 4 bits. The 4-bit parameters used in the cell below are:

- **load_in_4bit**: set `True` to quantize the model to 4 bits when you load it
- **bnb_4bit_quant_type**: set to `"nf4"` to use a special 4-bit data type for weights initialized from a normal distribution
- **bnb_4bit_use_double_quant**: set `True` to use nested quantization, which also quantizes the quantization constants (set to `False` in the cell below)
- **bnb_4bit_compute_dtype**: set to `torch.float16` or `torch.bfloat16` to run computation in that 16-bit type, which is faster than float32
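
A back-of-the-envelope calculation shows why 4-bit loading matters for an 8B-parameter model. The sketch below counts weight storage only, assuming roughly 8.03 billion parameters; it ignores quantization constants, activations, optimizer state, and the layers that stay in higher precision, so real memory usage will be higher.

```python
# Approximate parameter count for an 8B model (assumption for this sketch)
num_params = 8.03e9

# Storage cost per parameter, in bytes
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "4-bit": 0.5}

# Weight-only footprint in GB (1 GB = 1e9 bytes)
footprint_gb = {name: num_params * size / 1e9 for name, size in bytes_per_param.items()}

for name, gb in footprint_gb.items():
    print(f"{name:>10}: {gb:.1f} GB")
```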

Run the cell below to set the 4-bit quantization for our model.

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

### Loading the Pre-trained Model

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0},
    token=token
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [None]:
!nvidia-smi

The `nvidia-smi` output shows that the model is loaded on the GPU.

### Set the Trainer Hyperparameters

To initiate our model trainer, we create a trainer object from the [Supervised Fine-Tuning (SFT)](https://huggingface.co/docs/trl/en/sft_trainer) trainer. SFT is part of [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl/en/index), a library of tools for training transformer language models with reinforcement learning. Other steps it supports include [Reward Modeling (RM)](https://huggingface.co/docs/trl/en/reward_trainer) and [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347). In our SFT trainer object, we set the model, training dataset, PEFT config object, tokenizer, and training arguments. We also specify the dataset field (`text`) to train on.

**Note:** *If running on a single DGX A100 GPU, set `max_seq_length` to 1024 or leave it as `None` (the default).*

In [None]:
len(dataset)

In [None]:
# In the interest of time, the notebook trains on 1,000 randomly selected samples.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset.shuffle(seed=42).select(range(1000)),
    eval_dataset = eval_dataset.select(range(len(eval_dataset))),
    dataset_text_field="text",
    peft_config=peft_params,
    args=training_params,
    max_seq_length=1024,
    packing=False,
)

Run the cells below to train the model with the SFT trainer object.

In [None]:
import jinja2
print(jinja2.__version__)


In [None]:
trainer.train()

In [None]:
# save model
new_model = "model/Llama-3-8b-instruct-hf-finetune"
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

In [None]:
trainer.model.save_pretrained(new_model,safe_serialization=False)

### Run Inference

In [None]:
from transformers import pipeline

In [None]:
pp= pipeline(model=model, tokenizer=tokenizer, max_length=200, task="text-generation")

In [None]:
print(pp("what has research identified in potential monopsonies"))

In [None]:
print(pp("what are some artists similar to Dvorak?"))

In [None]:
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

In [None]:
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
pp= pipeline(model=model, tokenizer=tokenizer, max_length=200, task="text-generation")

In [None]:
!nvidia-smi

The command below kills the first compute process reported by `nvidia-smi` (here, the current kernel) to free up the GPU for running NIMs.

In [None]:
!kill -9 $(nvidia-smi --query-compute-apps=pid --format=csv,noheader | awk 'NR==1')


To use a custom LoRA adapter, you can head over to the <a href="nim_lora_adapter.ipynb"> nim_lora_adapter</a> notebook.

---
## Licensing

Copyright © 2024 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.