<a href="https://colab.research.google.com/github/quazirab/fine-tuning-llama-3.1-on-medical-questionnaires/blob/Llama_3_1_fine_tuning_with_LoRA/notebooks/Llama_3_1_fine_tuning_with_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this experiment, we fine-tune the Llama 3.1 model on a medical exam questionnaire dataset. The data preparation for that dataset was done in this [notebook](https://colab.research.google.com/drive/1wUU-V1vEEdtPAomUjm7JyKDBSZgbHL1d?usp=sharing), and the result was saved to Google Drive, from which it is retrieved here for fine-tuning Llama 3.1.

## Setup
Before starting, set the runtime GPU accelerator to T4. This exercise uses Unsloth to accelerate fine-tuning of the Llama 3.1 model.

We will use the 4-bit quantized Meta-Llama-3.1-8B model (Meta-Llama-3.1-8B-bnb-4bit), loaded below with `load_in_4bit = True`. Unsloth is a library for fine-tuning and training LLMs faster, and it is well optimized for Google Colab.


In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

Now download the pretrained model and its tokenizer.

In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!

model_pretrained, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    load_in_4bit = True,
    max_seq_length = max_seq_length
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.43.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# Data Prep

Now let's load the prepared dataset and format it into prompts for fine-tuning. The path below assumes Google Drive is already mounted at `/content/drive`.

In [3]:
from datasets import load_dataset

ds = load_dataset("parquet", data_files="/content/drive/MyDrive/colab_drive/medical-question-w-answer-and-explanation-training-dataset.parquet")


In [4]:
promt = """Below is a medical question, paired with answer that contains explanation. Write a answer with explanation that is appropriate for the question.

### Question:
{}

### Answer:
{}. {}
"""

EOS_TOKEN = tokenizer.eos_token


def formatting_prompts_func(examples):
    questions = examples["question"]
    answers = examples["answer"]
    explanations = examples["exp"]

    texts = []
    for question, answer, explanation in zip(questions, answers, explanations):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = promt.format(question, answer, explanation) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }


ds = ds.map(formatting_prompts_func, batched = True,)


In [5]:
df = ds["train"].to_pandas()
df.head(10)

| | question | answer | exp | text |
|---|---|---|---|---|
| 0 | Chronic urethral obstruction due to benign pri... | Atrophy | urethral obstruction because of urinary calcul... | Below is a medical question, paired with answe... |
| 1 | Which vitamin is supplied from only animal sou... | Vitamin B12 | Vitamin B12 Ref: Harrison's 19th ed. P 640* Vi... | Below is a medical question, paired with answe... |
| 2 | All of the following are surgical options for ... | Roux en Y Duodenal By pass | en Y Duodenal Bypass Bariatric surgical proced... | Below is a medical question, paired with answe... |
| 3 | Following endaerectomy on the right common car... | Central aery of the retina | central aery of the retina is a branch of the ... | Below is a medical question, paired with answe... |
| 4 | Growth hormone has its effect on growth through? | IG1-1 | has two major functions :-i) Growth of skeleta... | Below is a medical question, paired with answe... |
| 5 | Scrub typhus is transmitted by: September 2004 | Mite | Mite | Below is a medical question, paired with answe... |
| 6 | Abnormal vascular patterns seen with colposcop... | Satellite lesions | vascular pattern include punctation, mosaicism... | Below is a medical question, paired with answe... |
| 7 | Per rectum examination is not a useful test fo... | Pilonidal sinus | SINUS/DISEASE (Jeep Bottom; Driver's Bottom) P... | Below is a medical question, paired with answe... |
| 8 | Characteristics of Remifentanyl – a) Metabolis... | abc | is the shortest acting opioid due to its metab... | Below is a medical question, paired with answe... |
| 9 | Hypomimia is ? | Deficit of expression by gesture | Deficit of expression by gestureHypomimiaHypom... | Below is a medical question, paired with answe... |
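
The `text` column is truncated in the preview above, so here is a minimal sketch of what one fully formatted training string looks like. It reuses the notebook's template verbatim and one row from the table; `PROMPT_TEMPLATE` and `format_batch` are illustrative names, and the hard-coded `EOS_TOKEN` is an assumption standing in for `tokenizer.eos_token` so the sketch runs without loading a tokenizer.

```python
# Same template as the notebook's `promt` variable, copied verbatim.
PROMPT_TEMPLATE = """Below is a medical question, paired with answer that contains explanation. Write a answer with explanation that is appropriate for the question.

### Question:
{}

### Answer:
{}. {}
"""

# Assumption: stand-in for tokenizer.eos_token so no tokenizer is needed here.
EOS_TOKEN = "<|end_of_text|>"


def format_batch(examples):
    """Turn a column-oriented batch into one training string per row."""
    texts = []
    for question, answer, explanation in zip(
        examples["question"], examples["answer"], examples["exp"]
    ):
        # The EOS token marks where generation should stop.
        texts.append(PROMPT_TEMPLATE.format(question, answer, explanation) + EOS_TOKEN)
    return {"text": texts}


# One-row toy batch built from a row shown in the preview table.
toy_batch = {
    "question": ["Scrub typhus is transmitted by: September 2004"],
    "answer": ["Mite"],
    "exp": ["Mite"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Because `ds.map(..., batched = True)` passes each column as a list, the function receives and returns lists rather than single rows.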


# Train

Next, a Low-Rank Adaptation (LoRA) adapter needs to be added to the model. Since Llama is a huge model, retraining all of the pre-trained weights is a huge task. LoRA freezes the pre-trained weights and adds a small set of trainable low-rank weights on top of them, which significantly reduces the number of trainable parameters while maintaining model quality.
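
To make the idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It is an illustration of the general technique, not Unsloth's or PEFT's implementation: the sizes are toy values, and `matvec` and `lora_forward` are hypothetical helper names. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would be trained.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x); only A and B are trainable."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 64, 64, 4, 4  # toy sizes, not the notebook's

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero the adapter contributes nothing yet,
# so the adapted layer starts out identical to the frozen layer.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_out * d_in        # weights in the frozen matrix
lora_params = r * (d_in + d_out)  # weights in A and B combined
print(full_params, lora_params)   # 4096 512
```

The saving grows with layer size: the adapter cost scales with `r * (d_in + d_out)` instead of `d_in * d_out`.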

In [6]:
model_training = FastLanguageModel.get_peft_model(
    model_pretrained,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
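
As a sanity check, the number of trainable parameters that the training banner reports further down (41,943,040) can be reproduced by hand. The sketch below assumes the layer shapes from the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these shapes are not stated in this notebook, so treat them as assumptions.

```python
# Assumed Llama 3.1 8B shapes (from the published model config, not this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 8 * 128       # 8 key/value heads x 128 head_dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 16              # the `r` passed to get_peft_model

# (input features, output features) for each module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(d_in, d_out, rank):
    """A is (rank x d_in) and B is (d_out x rank), so rank * (d_in + d_out) weights."""
    return rank * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, RANK) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# 1,310,720 per layer, 41,943,040 in total
```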


Training is done with Hugging Face TRL's `SFTTrainer`.

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model_training,
    tokenizer = tokenizer,
    train_dataset = ds["train"],
    dataset_text_field = "text",
    max_seq_length=max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, # Ignored here because max_steps is set below.
        max_steps = 60, # Short demo run; remove to train for num_train_epochs instead.
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)


max_steps is given, it will override any value given in num_train_epochs
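
The warning matters because 60 steps cover only a small slice of the dataset. The arithmetic below is a rough sketch using the values from the trainer configuration and the dataset size shown earlier; the variable names are illustrative, and the steps-per-epoch figure is approximate because the trainer may round the last partial batch differently.

```python
import math

num_examples = 182_822   # rows in the training split
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

effective_batch = per_device_batch * grad_accum * num_gpus   # examples per optimizer step
examples_seen = effective_batch * max_steps                  # examples used in this run
fraction = examples_seen / num_examples                      # share of one epoch
steps_per_epoch = math.ceil(num_examples / effective_batch)  # approximate full-epoch length

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen} ({fraction:.2%} of one epoch)")
print(f"approximate steps for one full epoch: {steps_per_epoch:,}")
```

So this run is a quick smoke test rather than a full fine-tune; a complete epoch would need on the order of 22,850 optimizer steps.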


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 182,822 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|---|---|
| 1 | 3.0598 |
| 2 | 2.7791 |
| 3 | 2.5766 |
| 4 | 2.6218 |
| 5 | 2.6525 |
| 6 | 2.0942 |
| 7 | 1.9862 |
| 8 | 1.8847 |
| 9 | 1.8054 |
| 10 | 1.7403 |
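
The logged loss drops most sharply after step 5, which is also where the warmup configured by `warmup_steps = 5` ends; this may be a coincidence, but it is a reminder that the learning rate is not constant during the run. The sketch below shows the shape of a linear schedule with warmup for the values used here (`learning_rate = 2e-4`, 60 total steps). `linear_schedule_with_warmup` is a hypothetical helper written for illustration, and the 0-based step indexing is an assumption.

```python
def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at optimizer step `step` (0-based, an assumption of this sketch)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 2e-4
WARMUP_STEPS = 5
TOTAL_STEPS = 60

for step in (0, 1, 4, 5, 30, 59, 60):
    lr = linear_schedule_with_warmup(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:2d}: lr = {lr:.2e}")
```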


Save the LoRA adapter and tokenizer locally.

In [9]:
model_training.save_pretrained("llama-3.1-model-medical")
tokenizer.save_pretrained("llama-3.1-model-medical")

('llama-3.1-model-medical/tokenizer_config.json',
 'llama-3.1-model-medical/special_tokens_map.json',
 'llama-3.1-model-medical/tokenizer.json')

Time to test the fine-tuned model with an inference run.

In [12]:
FastLanguageModel.for_inference(model_training)

inputs = tokenizer(
    [
        promt.format(
            "Which vitamin is supplied from animal source",
            "",  # this is for the answer
            "", # this is for the explanation
        )
    ], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model_training.generate(**inputs, streamer = text_streamer, max_new_tokens = 1248)

<|begin_of_text|>Below is a medical question, paired with answer that contains explanation. Write a answer with explanation that is appropriate for the question.

### Question:
Which vitamin is supplied from animal source

### Answer:
. 
Vitamin D. None
<|end_of_text|>
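
The stray `. ` under `### Answer:` in the output above comes from the prompt itself: formatting the training template with two empty strings still leaves the literal `. ` separator in place, so the model is asked to continue after it. One possible alternative, sketched below, is an inference-only template that stops right after the answer header. `INFERENCE_TEMPLATE` and `build_inference_prompt` are hypothetical names, and whether this improves the generated answer after only 60 training steps is untested here.

```python
# Same training template as the notebook's `promt` variable, copied verbatim.
TRAIN_TEMPLATE = """Below is a medical question, paired with answer that contains explanation. Write a answer with explanation that is appropriate for the question.

### Question:
{}

### Answer:
{}. {}
"""

# Hypothetical inference template: same preamble, but it ends at the answer header
# so the model generates both the answer and the explanation itself.
INFERENCE_TEMPLATE = TRAIN_TEMPLATE.split("### Answer:")[0] + "### Answer:\n"


def build_inference_prompt(question):
    """Format a question for generation without the leftover '. ' separator."""
    return INFERENCE_TEMPLATE.format(question)


question = "Which vitamin is supplied from animal source"
old_prompt = TRAIN_TEMPLATE.format(question, "", "")  # what the cell above sends
new_prompt = build_inference_prompt(question)

print(repr(old_prompt[-20:]))  # ends with '### Answer:\n. \n'
print(repr(new_prompt[-20:]))  # ends with '### Answer:\n'
```

The result of `build_inference_prompt` could then be passed to `tokenizer(...)` in place of the `promt.format(...)` call.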


# Push model to Hugging Face Hub

Push the LoRA adapter and tokenizer to the Hugging Face Hub.

In [16]:
if True:
  model_training.push_to_hub("quazirab/Llama-3.1-fine-tuning-with-LoRA-medical-qa-datasets", variant="test")
  tokenizer.push_to_hub("quazirab/Llama-3.1-fine-tuning-with-LoRA-medical-qa-datasets", variant="test")


Saved model to https://huggingface.co/quazirab/Llama-3.1-fine-tuning-with-LoRA-medical-qa-datasets
