# Fine-Tuning Llama 3 Using Azure Machine Learning Notebooks and Hugging Face Tools

## Introduction

In this lab we will fine-tune a Llama 3 model using tools, models and datasets from Hugging Face on an Azure Machine Learning (AML) GPU instance. In particular, we will perform Low-Rank Adaptation (LoRA) training with the [PyTorch](https://pytorch.org/) deep-learning framework, which can run even on a smaller GPU such as the T4.

We will fine-tune the model using a dataset consisting of medical questions and answers.  In essence, we will be creating a 'Llama 3 Doctor' model that has specific knowledge of the medical domain.  
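
To give a rough sense of why LoRA fits on a small GPU, the sketch below counts trainable parameters for a single weight matrix. The 4096 x 4096 shape is an assumption chosen to resemble one attention projection in an 8B-parameter model, and the rank of 4 matches the value used later in this lab; neither number is read from the actual model, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    # Full fine-tuning updates every entry of the d_out x d_in matrix
    full = d_out * d_in
    # LoRA trains two thin matrices instead: B (d_out x rank) and A (rank x d_in)
    lora = rank * (d_out + d_in)
    return full, lora


# Assumed shape of one projection matrix; rank as configured later in the lab
full, lora = lora_param_counts(4096, 4096, rank=4)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r=4):       {lora:,} parameters ({lora / full:.4%} of full)")
```

The saving grows with the matrix size, since the LoRA count scales with the sum of the dimensions rather than their product.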

## Prerequisites
To run this lab you need to have the following:
* An AML GPU-based compute instance
* A conda environment named `llama3_ft`, created using the enclosed `1_setup_conda.sh` file
* The [Medical Dialogs](https://huggingface.co/datasets/ruslanmv/ai-medical-chatbot) dataset, loaded into AML's Data Repository, accessible through the Assets->Data link on the left-hand side of the screen, with the name `MedicalDialogs`.

## Tools Used
The Python tools used in this lab are the following open-source Hugging Face tools:

* [Transformers](https://huggingface.co/docs/transformers/v4.17.0/en/index) - Implementation of a number of deep-learning models using the Transformer architecture
* [PEFT](https://huggingface.co/docs/peft/index) - Implementation of Parameter-Efficient Fine-Tuning, which fine-tunes a pretrained model by training only a small number of parameters. We will be using the Quantized LoRA (QLoRA) algorithm, which fine-tunes a model whose weights were quantized to 4 bits each instead of 16 bits.
* [TRL](https://huggingface.co/docs/trl/index) - This library contains a number of algorithms for training Transformer-based language models, including Reinforcement Learning methods. As our dataset contains medical questions and answers, we will be using its Supervised Fine-Tuning (SFT) algorithm.
* [Accelerate](https://huggingface.co/docs/accelerate/index) - This library makes it easy to run multi-GPU training and is integrated into the other libraries we will use.
* [BitsAndBytes](https://huggingface.co/docs/bitsandbytes/index) - This library provides tools for loading models in a quantized form for PyTorch. 
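
As a back-of-the-envelope illustration of the 4-bit versus 16-bit point above, the sketch below estimates the memory taken by the model weights alone. The figure of 8 billion parameters is a rounded assumption, and the estimate ignores activations, optimizer state, quantization constants and the LoRA adapters, so treat it as a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights only, in GiB."""
    # bits -> bytes -> GiB
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 8.0e9  # assumption: roughly 8 billion weights

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone nearly fill a 16 GB T4, while the 4-bit weights leave room for the training workload.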



In [1]:
# Working directory on the AML compute instance
!pwd 

/mnt/batch/tasks/shared/LS_root/mounts/clusters/t4-instance/code/Users/yuvalmazor/FT Llama 3


## Imports & Definitions
In this section we will import the classes we need and set up some definitions to be used later on.

In [2]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    AutoPeftModelForCausalLM,
    prepare_model_for_kbit_training,
    get_peft_model,
)

import os
import torch

from datasets import load_dataset, Dataset, ReadInstruction
from trl import SFTTrainer, setup_chat_format, SFTConfig


In [3]:
# Name of base model to use for fine-tuning, retrieved from HuggingFace
base_model = "NousResearch/Meta-Llama-3-8B"

# Name and version of dataset, to be retrieved from AML's Data Repository
finetune_dataset = "MedicalDialogs"
finetune_dataset_version = "1"

# How many samples to use from the dataset
finetune_dataset_samples = 1000

# Size of test (evaluation) set, of the total number of samples
test_size = 0.1

# Directory name in which to save the LoRA adapters for the fine-tuned model
finetuned_model = "llama3-8b-chat-doctor"

# Directory name in which to save the configuration settings for the fine-tuned model
finetuned_model_config = f"{finetuned_model}_config"

# Directory name in which to save the full fine-tuned model
finetuned_model_full = f"{finetuned_model}_full"

# Directory in which to save model files downloaded from HuggingFace
cache_dir = "/mnt/tmp/hf_cache"

# Sample question to ask the final fine-tuned model
messages = [
    {
        "role": "user",
        "content": "Hello doctor, I get red blotches on my skin whenever I'm next to a cat.  What can I do?"
    }
]


## Load the Model
In this section we load the model and tokenizer, performing the 4-bit quantization and moving the model to the GPU. In addition, we set up the model and tokenizer to automatically add the chat template to the data. The _chat template_ adds custom markup tokens so that the model is able to distinguish between roles, content and other types of information.
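
To make the idea of a chat template concrete, here is a minimal pure-Python sketch of ChatML-style markup. It assumes that `setup_chat_format` applies its default ChatML format with `<|im_start|>` and `<|im_end|>` markers; the `to_chatml` helper and the example messages are hypothetical, and in the lab itself the formatting is done by `tokenizer.apply_chat_template`.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render a list of {"role", "content"} messages as ChatML-style text."""
    text = ""
    for message in messages:
        # Each turn is wrapped in start/end markers, with the role on the first line
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        text += "<|im_start|>assistant\n"
    return text


# Made-up messages, for illustration only
example = [
    {"role": "user", "content": "Hello doctor, I have a headache."},
    {"role": "assistant", "content": "Hello. How long has it lasted?"},
]
print(to_chatml(example))
```

During training, both turns are rendered so the model sees the expected answer; at inference time only the user turn is rendered and `add_generation_prompt=True` opens the assistant turn.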

In [4]:

# Setup 4-bit quantization 
qlora_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Download the tokenizer from HuggingFace
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Download the model from HuggingFace
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # Apply quantization
    quantization_config=qlora_config,
    # Automatically load the model into the GPU
    device_map="auto",
    attn_implementation="eager",
    cache_dir=cache_dir
)

# Ensure the model and tokenizer are set up to use the prefixes and suffixes required for
# marking the role and content of each message
model, tokenizer = setup_chat_format(model, tokenizer)


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


## Setup LoRA 
In this section we set up the LoRA configuration, and prepare the model for parameter-efficient fine-tuning.

In [5]:
peft_config = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM", 
)
model = get_peft_model(model, peft_config)
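
The role of `r` and `lora_alpha` can be illustrated with a toy forward pass. The sketch below assumes the standard LoRA formulation, in which the adapted layer computes `W x + (lora_alpha / r) * B (A x)` with `W` frozen and `B` starting at zero. The tiny matrices and helper functions are made up for illustration; the toy uses rank 1 with `lora_alpha=4`, which gives the same scaling factor of 4 as `lora_alpha=16` with `r=4`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha):
    """Frozen base output plus the scaled low-rank update."""
    r = len(A)                        # rank = number of rows in A
    scaling = lora_alpha / r
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # low-rank path: B (A x)
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # frozen weights (3 x 2)
A = [[0.5, -0.5]]                          # trainable (1 x 2)
B_init = [[0.0], [0.0], [0.0]]             # trainable (3 x 1), zero at start
B_trained = [[1.0], [0.0], [-1.0]]         # hypothetical values after training
x = [2.0, 1.0]

before = lora_forward(W, A, B_init, x, lora_alpha=4)
after = lora_forward(W, A, B_trained, x, lora_alpha=4)
print(before)
print(after)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model, and only `A` and `B` change during training.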

## Retrieve and Prepare the Fine-Tuning Data
In this section we:
* Retrieve the medical dialogs dataset from AML's Data Repository
* Sample from the dataset
* Split the data into training and test (evaluation) sets
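
The sampling and splitting steps can be sketched in plain Python as follows. This is only an illustration of the idea, not how pandas or `datasets` implement it: `sample_and_split`, the stand-in row list and the `seed` parameter are all hypothetical, and the seed is added here so that the sketch is reproducible.

```python
import random


def sample_and_split(rows, num_samples, test_fraction, seed=0):
    """Draw num_samples rows at random, then split them into train and test."""
    rng = random.Random(seed)
    # Sample without replacement, as a random subset of the full dataset
    sampled = rng.sample(rows, num_samples)
    num_test = round(num_samples * test_fraction)
    # The first num_test sampled rows form the test set, the rest the training set
    return sampled[num_test:], sampled[:num_test]


rows = list(range(5000))  # stand-in for the rows of the dataset
train, test = sample_and_split(rows, num_samples=1000, test_fraction=0.1)
print(len(train), len(test))
```

With 1000 samples and a test fraction of 0.1, this yields 900 training rows and 100 test rows, with no row appearing in both sets.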


In [6]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import pandas as pd

credential = DefaultAzureCredential()
SUBSCRIPTION="eba489bb-ec02-466c-9806-a269d915d943"
RESOURCE_GROUP="yuvmaz-aml"
WS_NAME="yuvmaz-aml"

ml_client = MLClient(
    credential=credential,
    subscription_id=SUBSCRIPTION,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=WS_NAME
)

# Retrieve the dataset from AML where it is stored in Parquet format
data_asset = ml_client.data.get(name=finetune_dataset, version=finetune_dataset_version)
df = pd.read_parquet(data_asset.path)

# Display a sample of the dataset 
df.head()

Resolving access token for scope "https://storage.azure.com/.default" using identity of type "MANAGED".
Getting data access token with Assigned Identity (client_id=clientid) and endpoint type based on configuration


| Unnamed: 0 | Description | Patient | Doctor |
|---|---|---|---|
| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... |
| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... |
| 2 | Q. I have started to get lots of acne on my fa... | Hi doctor! I used to have clear skin but since... | Hi there Acne has multifactorial etiology. Onl... |
| 3 | Q. Why do I have uncomfortable feeling between... | Hello doctor,I am having an uncomfortable feel... | Hello. The popping and discomfort what you fel... |
| 4 | Q. My symptoms after intercourse threatns me e... | Hello doctor,Before two years had sex with a c... | Hello. The HIV test uses a finger prick blood ... |


In [7]:
# Sample the required number of data points - retrieve the 'Doctor' and 'Patient' fields from the dataset
dataset = Dataset.from_pandas(df[['Patient', 'Doctor']].sample(finetune_dataset_samples))

# Format the dataset - the Patient field is passed as the user query, while the Doctor field is the answer
# we expect from the model
def format_chat_template(row):
    row_json = [{"role": "user", "content": row["Patient"]},
               {"role": "assistant", "content": row["Doctor"]}]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row


dataset = dataset.map(
    format_chat_template,
    num_proc=4,
)

# Split the dataset into training and test sets
dataset = dataset.train_test_split(test_size=test_size)

# Show the resulting dataset and split sizes
dataset

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Patient', 'Doctor', '__index_level_0__', 'text'],
        num_rows: 900
    })
    test: Dataset({
        features: ['Patient', 'Doctor', '__index_level_0__', 'text'],
        num_rows: 100
    })
})

In [8]:
# Show a sample data instance and the resulting markup to be fed to the model while fine-tuning
from pprint import pprint
pprint(dataset['train'][0])

{'Doctor': 'Hi, As per your query you have fever and cough in a child with the '
           'positive Mantoux test. I would suggest you visit pediatrician once '
           'and get a chest X-ray done. You should go for pulmonary function '
           'tests along with blood and sputum test first and get it examined. '
           'Start treatment after proper diagnosis. Take steam inhalation as '
           'well. You should take cough expectorants. You should take an '
           'antihistamine such as Benadryl or Chlorpheniramine along with '
           'Amoxicillin. Use vaporizers rub. Use room humidifiers as well to '
           'prevent any allergic reaction to the child. Sip on the warm water. '
           'Avoid cold carbonated beverages. Hope I have answered your query. '
           'Let me know if I can assist you further. Regards, Dr. Harry '
           'Maheshwari, Dentist',
 'Patient': 'my son is 2 yrs old. has cold cough frequently. Now has fever for '
            'last 10

## Setup and Run the Training
In this section we:

* Perform the training, using Hugging Face TRL's Supervised Fine-Tuning (SFT) algorithm
* Send a sample query to the fine-tuned model and observe the output
* Save the LoRA adapter, the configuration and the full (merged) model

In [9]:
training_arguments = TrainingArguments(
    output_dir="/tmp",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.2,
    logging_steps=50,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=True,
    bf16=False,
    group_by_length=True,
    report_to="azure_ml",
)
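
To see how these settings interact, the sketch below works out the number of optimizer steps and the evaluation interval. It assumes a single GPU, the 900 training rows produced by the split above, and that a fractional `eval_steps` is interpreted as a share of the total training steps; `training_schedule` is a hypothetical helper, and the exact rounding may differ between library versions.

```python
import math


def training_schedule(num_train_samples, per_device_batch_size, grad_accum_steps,
                      epochs, eval_steps, num_devices=1):
    """Return (effective batch size, total optimizer steps, steps between evaluations)."""
    # Gradients from several small batches are accumulated before each optimizer step
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    total_steps = math.ceil(num_train_samples / effective_batch) * epochs
    if eval_steps < 1:
        # Fractional value: evaluate after this share of the total steps
        eval_every = math.ceil(total_steps * eval_steps)
    else:
        eval_every = int(eval_steps)
    return effective_batch, total_steps, eval_every


result = training_schedule(
    num_train_samples=900,
    per_device_batch_size=1,
    grad_accum_steps=2,
    epochs=1,
    eval_steps=0.2,
)
print(result)
```

Under these assumptions one epoch takes 450 optimizer steps with an evaluation every 90 steps, which is consistent with the first evaluation being reported at step 90 in the training output.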

In [10]:
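trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    # Truncate each formatted example to at most 512 tokens
    max_seq_length=512,
    # Column holding the chat-formatted text produced by format_chat_template
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    # Keep one example per sequence instead of packing several together
    packing=False,
)

# Run the fine-tuning and keep the returned training statistics
stats = trainer.train()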




Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Step,Training Loss,Validation Loss
90,2.6505,2.608631


Attempted to log scalar metric loss:
2.6505
Attempted to log scalar metric grad_norm:
4.316047191619873
Attempted to log scalar metric learning_rate:
0.00018272727272727275
Attempted to log scalar metric epoch:
0.1111111111111111
Attempted to log scalar metric eval_loss:
2.6086313724517822
Attempted to log scalar metric eval_runtime:
66.1998
Attempted to log scalar metric eval_samples_per_second:
1.511
Attempted to log scalar metric eval_steps_per_second:
1.511
Attempted to log scalar metric epoch:
0.2


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


KeyboardInterrupt: 

In [11]:
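# Render the sample question with the chat template, ending with an open assistant turn for the model to complete
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, max_length=512, truncation=True).to('cuda')

# max_length limits the total length: prompt tokens plus generated tokens
outputs = model.generate(**inputs, max_length=150, num_return_sequences=1)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Print the text that follows the first "assistant" role marker (up to any later occurrence of that word)
print(text.split("assistant")[1])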


Hi, I am Dr. Sushil Kumar, I am a dermatologist, I have gone through your query and I understand your concern. It is a common phenomenon and can be controlled by taking antihistamines like cetirizine or levocetirizine and also by taking an anti-inflammatory drug like diclofenac. I hope I have answered your query. If you have any further query, I will be happy to help you. Wish you good health. Regards. Dr. Sushil Kumar, Dermatologist, Gurgaon. Hope
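
One caveat with the `split("assistant")` call in the cell above is that it cuts the reply short if the answer itself contains the word "assistant". The sketch below shows the effect on a made-up decoded string and one possible alternative. It assumes the decoded text has the shape `user\n...\nassistant\n...` once special tokens are stripped; `extract_reply` is a hypothetical helper, not part of any library.

```python
def extract_reply(decoded: str, marker: str = "assistant\n") -> str:
    """Return everything after the first role marker, or the whole text if absent."""
    # partition splits on the first occurrence only, so later uses of the
    # word "assistant" inside the reply are preserved
    _, found, reply = decoded.partition(marker)
    return reply.strip() if found else decoded.strip()


# Made-up decoded output, for illustration only
decoded = "user\nWhat should I ask?\nassistant\nAsk your medical assistant about allergy tests."

naive = decoded.split("assistant")[1]
robust = extract_reply(decoded)
print(repr(naive))
print(repr(robust))
```

Another common approach is to decode only the newly generated token ids, skipping the first `inputs["input_ids"].shape[1]` positions of the output, which avoids string matching altogether.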


In [None]:
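# Save the LoRA adapter weights
trainer.save_model(finetuned_model)
# Save the LoRA configuration settings
peft_config.save_pretrained(finetuned_model_config)
# Merge the adapter into the base model and save the full model
model.merge_and_unload().save_pretrained(finetuned_model_full)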