<a href="https://colab.research.google.com/github/quazirab/fine-tuning-llama-3.1-on-medical-questionnaires/blob/main/notebooks/Llama_3_1_fine_tuning_with_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

For this experiment, we will fine-tune the Llama 3.1 model on a medical exam questionnaire dataset. The data preparation for that dataset was done in this [notebook](https://colab.research.google.com/drive/1wUU-V1vEEdtPAomUjm7JyKDBSZgbHL1d?usp=sharing), and the result was saved to Google Drive, from which it is retrieved here for fine-tuning Llama 3.1.

## Setup
Before starting, set the runtime's GPU accelerator to T4. This exercise uses Unsloth to fine-tune the Llama 3.1 model.

We will use the Meta-Llama-3.1-8B-bnb-4bit model, the 4-bit quantized build of Llama 3.1 8B. Unsloth speeds up fine-tuning and training of LLMs, and it works well on Google Colab.


In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

Now download the pretrained model

In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096 # Choose any! unsloth supports RoPE Scaling internally automatically!

model_pretrained, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    load_in_4bit = True,
    max_seq_length = max_seq_length
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# Data Prep

Now let's load the prepared dataset and format it into prompts for fine-tuning. The path below assumes Google Drive is already mounted at `/content/drive`.

In [3]:
from datasets import load_dataset

ds = load_dataset("parquet", data_files="/content/drive/MyDrive/colab_drive/medical-question-w-answer-and-explanation-training-dataset.parquet")


In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'exp', '__index_level_0__'],
        num_rows: 10000
    })
})

The dataset has 10,000 rows, which is more than we need for this short run (training below is capped at 60 steps).

In [15]:
promt = """Give an answer for the following medical question.

### Question:
{}

### Answer:
{}
"""

EOS_TOKEN = tokenizer.eos_token


def formatting_prompts_func(examples):
    questions = examples["question"]
    answers = examples["answer"]
    explanations = examples["exp"]

    texts = []
    for question, answer, explanation in zip(questions, answers, explanations):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = promt.format(question, f"{answer}. {explanation}") + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
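To make the effect of the formatting function concrete, here is a small standalone sketch that applies the same template to the first row of the dataset. It does not need the tokenizer: `EOS_TOKEN` is hard-coded to `<|end_of_text|>` (the token that appears in the generation output later in this notebook), and the names `prompt_template` and `format_batch` are illustrative stand-ins, not part of the notebook.

```python
# Stand-in for tokenizer.eos_token so the sketch runs without loading the model.
EOS_TOKEN = "<|end_of_text|>"

prompt_template = """Give an answer for the following medical question.

### Question:
{}

### Answer:
{}
"""


def format_batch(examples):
    """Build one training string per row from a batch given as a dict of lists."""
    texts = []
    for question, answer, explanation in zip(
        examples["question"], examples["answer"], examples["exp"]
    ):
        # The answer slot holds the short answer followed by the explanation.
        text = prompt_template.format(question, f"{answer}. {explanation}") + EOS_TOKEN
        texts.append(text)
    return {"text": texts}


# A batch with a single row, in the dict-of-lists shape that
# Dataset.map(..., batched=True) passes to the function.
batch = {
    "question": ["Pathognomic lesion of scabies is?"],
    "answer": ["Burrow"],
    "exp": ["BurrowRef : Rook's 8/e, p 38.36.39"],
}

formatted = format_batch(batch)
print(formatted["text"][0])
```

The printed string ends with the EOS token, which is what teaches the model to stop after the answer.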


In [16]:
ds = ds.map(formatting_prompts_func, batched = True,)


In [17]:
df = ds["train"].to_pandas()
df.head(5)

| | question | answer | exp | `__index_level_0__` | text |
|---|---|---|---|---|---|
| 0 | Pathognomic lesion of scabies is? | Burrow | BurrowRef : Rook's 8/e, p 38.36.39 | 132587 | Give an answer for the following medical quest... |
| 1 | WHO STEPS is used for: | Non-communicable diseases | diseases [Ref. http://wwwwho.int/mediacentre/f... | 72018 | Give an answer for the following medical quest... |
| 2 | Acute bilirubin encephalopathy is characterize... | Hypertonia | bilirubin encephalopathy is characterized by h... | 158061 | Give an answer for the following medical quest... |
| 3 | In some kidney transplants Hyperacute graft re... | Preformed antibodies | Bailey and love 25 e p1410 | 137653 | Give an answer for the following medical quest... |
| 4 | Clasp arms serves the function of | Both 2 and 3 | and Position of Clasp Assembly Parts | 176938 | Give an answer for the following medical quest... |


# Train

Now the Low-Rank Adapter (LoRA) needs to be added to the model. Since Llama is a huge model, retraining all of the pre-trained weights is a huge task. LoRA freezes the pre-trained weights and adds a small set of trainable weights on top of them, which significantly reduces the number of trainable parameters while maintaining model quality.
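As a rough illustration of the idea, the sketch below implements a single LoRA-adapted linear layer in plain Python. It is a toy, not how Unsloth or PEFT implement it: the helper names (`matvec`, `lora_forward`) and the tiny dimensions are made up for the example. The frozen weight `W` is left untouched, and the update is the product of two small matrices `A` and `B`, scaled by `alpha / r`. `B` starts at zero, so the adapted layer initially behaves exactly like the original one.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Compute W x + (alpha / r) * B (A x). W is frozen; A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 2  # toy sizes, chosen only for the example

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r, zero at start

x = [random.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W, x)
y_lora = lora_forward(x, W, A, B, alpha, r)

frozen_params = d_in * d_out          # size of W
trainable_params = r * (d_in + d_out)  # size of A plus size of B

print("same output before training:", y_lora == y_base)
print("frozen:", frozen_params, "trainable:", trainable_params)
```

With realistic layer sizes the gap is much larger, because the adapter grows with `r * (d_in + d_out)` while the frozen weight grows with `d_in * d_out`.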

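Let's print the model to see which modules are available. The printout below is truncated, and it already shows `lora.Linear4bit` wrappers, which suggests the cell was re-run after the adapter was attached.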

In [18]:
print(model_pretrained)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Identity()
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
            (lora_dropout): ModuleDict(
              (default): I

Let's target the projection modules in the LlamaAttention and LlamaMLP layers.

In [19]:
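model_training = FastLanguageModel.get_peft_model(
    model_pretrained,
    r = 16,  # LoRA rank: inner dimension of the low-rank A and B matrices
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
                      "gate_proj", "up_proj", "down_proj",],   # MLP projections
    lora_alpha = 16,  # scaling numerator; the update is scaled by lora_alpha / r
    lora_dropout = 0,  # no dropout on the adapter inputs
    bias = "none",  # leave bias terms untrained
    use_gradient_checkpointing = "unsloth",  # Unsloth's checkpointing mode, to save memory
    random_state = 3407,
    use_rslora = False,  # plain LoRA scaling rather than rank-stabilized LoRA
    loftq_config = None,  # no LoftQ initialization
)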
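As a sanity check on these settings, the number of trainable parameters can be worked out by hand. The sketch below assumes each adapted layer adds `r * (in_features + out_features)` parameters (one `A` and one `B` matrix, no bias). The `q_proj` and `k_proj` shapes come from the printout above; the remaining shapes are not visible in the truncated printout and are assumed from the published Llama 3.1 8B configuration (hidden size 4096, key/value width 1024, MLP width 14336). The result agrees with the "Number of trainable parameters = 41,943,040" line that the trainer prints later.

```python
# LoRA settings taken from the get_peft_model call.
r = 16
num_layers = 32  # "(0-31): 32 x LlamaDecoderLayer" in the printout

# (in_features, out_features) of each targeted linear layer.
# Only q_proj and k_proj are visible in the printout; the rest are assumed
# from the published Llama 3.1 8B configuration.
target_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_param_count(in_features, out_features, rank):
    """A has shape (rank, in_features) and B has shape (out_features, rank)."""
    return rank * in_features + out_features * rank


per_layer = sum(lora_param_count(i, o, r) for i, o in target_shapes.values())
total = per_layer * num_layers

print(f"{per_layer:,} per decoder layer, {total:,} in total")
# 1,310,720 per decoder layer, 41,943,040 in total
```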

Training is done with Hugging Face TRL's Supervised Fine-Tuning Trainer (`SFTTrainer`).

In [20]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model_training,
    tokenizer = tokenizer,
    train_dataset = ds["train"],
    dataset_text_field = "text",
    max_seq_length=max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,  # has no effect here: max_steps takes precedence
        max_steps = 60,  # stop after 60 optimizer steps
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        auto_find_batch_size=True,
    ),
)


max_steps is given, it will override any value given in num_train_epochs
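The message above matters for how much data the model actually sees. A quick back-of-the-envelope calculation, using only the values from the `TrainingArguments` and the dataset size, shows that 60 steps cover a small part of one epoch. The variable names below are illustrative, and a single GPU is assumed, as on the Colab T4 runtime.

```python
import math

# Values copied from the trainer configuration and the dataset printout.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: a single T4
max_steps = 60
num_examples = 10_000

# One optimizer step consumes this many examples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / num_examples
steps_for_full_epoch = math.ceil(num_examples / effective_batch_size)

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen in {max_steps} steps: {examples_seen} ({fraction_of_epoch:.1%} of the data)")
print(f"steps needed for one full epoch: {steps_for_full_epoch}")
```

To train on the whole dataset instead, one option is to drop `max_steps` and keep `num_train_epochs = 1`, at the cost of a much longer run.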


In [21]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,3.5375
2,3.7669
3,2.9003
4,2.7847
5,3.0131
6,2.471
7,2.1685
8,1.8101
9,2.0488
10,2.1191


Save the LoRA adapter and the tokenizer locally

In [22]:
model_training.save_pretrained("llama-3.1-model-medical")
tokenizer.save_pretrained("llama-3.1-model-medical")

('llama-3.1-model-medical/tokenizer_config.json',
 'llama-3.1-model-medical/special_tokens_map.json',
 'llama-3.1-model-medical/tokenizer.json')

Time to test the model for inference

In [24]:
FastLanguageModel.for_inference(model_training)

inputs = tokenizer(
    [
        promt.format(
            "What is X-linked muscular dystrophy?",
            "",  # this is for the answer
        )
    ], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model_training.generate(**inputs, streamer = text_streamer, max_new_tokens = max_seq_length)

<|begin_of_text|>Give an answer for the following medical question.

### Question:
What is X-linked muscular dystrophy?

### Answer:

Duchenne's muscular dystrophy. Duchenne's muscular dystrophy is X-linked muscular dystrophy.
<|end_of_text|>
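The streamer prints the whole sequence, prompt included. If only the answer is needed, for example to compare it with the reference answer, it can be cut out of the decoded text with plain string handling. The helper below is a hypothetical sketch (the name `extract_answer` is not part of the notebook), and it assumes the decoded text uses the same `### Answer:` marker and `<|end_of_text|>` token as the output above.

```python
def extract_answer(generated_text, eos_token="<|end_of_text|>"):
    """Return the text after the last '### Answer:' marker, without the EOS token."""
    # rsplit keeps everything after the final marker; if the marker is missing,
    # the whole text is returned unchanged apart from the cleanup below.
    answer = generated_text.rsplit("### Answer:", 1)[-1]
    return answer.replace(eos_token, "").strip()


# The decoded output shown above, pasted in as a string.
generated = """<|begin_of_text|>Give an answer for the following medical question.

### Question:
What is X-linked muscular dystrophy?

### Answer:

Duchenne's muscular dystrophy. Duchenne's muscular dystrophy is X-linked muscular dystrophy.
<|end_of_text|>"""

answer = extract_answer(generated)
print(answer)
```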


# Hurray, we now have a model that gives answers!

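# Push the model to Hugging Face Hub

Push the fine-tuned adapter and the tokenizer to the Hugging Face Hub. The `if False:` guard keeps the cell from uploading on every run; change it to `True` to push.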

In [25]:
if False:
  model_training.push_to_hub("quazirab/Llama-3.1-fine-tuning-with-LoRA-medical-qa-datasets", variant="test")
  tokenizer.push_to_hub("quazirab/Llama-3.1-fine-tuning-with-LoRA-medical-qa-datasets", variant="test")


Saved model to https://huggingface.co/quazirab/Llama-3.1-fine-tuning-with-LoRA-medical-qa-datasets


## Pull the model

This is an example of how to run inference next time with the fine-tuned model that was pushed to the Hugging Face Hub.

In [26]:
model_from_hub, tokenizer_from_hub = FastLanguageModel.from_pretrained("quazirab/Llama-3.1-fine-tuning-with-LoRA-medical-qa-datasets")

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [27]:
# Use the model and tokenizer pulled from the Hub, not the in-memory training objects.
FastLanguageModel.for_inference(model_from_hub)

inputs = tokenizer_from_hub(
    [
        promt.format(
            "What is the cause of cherry red spot",
            "",  # this is for the answer
        )
    ], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer_from_hub)
_ = model_from_hub.generate(**inputs, streamer = text_streamer, max_new_tokens = max_seq_length)

<|begin_of_text|>Give an answer for the following medical question.

### Question:
What is the cause of cherry red spot

### Answer:

Retinal haemorrhage. Retinal haemorrhage is seen in cases of head injury.
<|end_of_text|>
