# Fine-tuning Llama 3.1 on PubMedQA Dataset (Local Use)

This notebook demonstrates how to fine-tune the Llama 3.1 model on the PubMedQA dataset using the Unsloth library for efficient training, and how to use the model locally.

In [1]:
# Install necessary libraries
!pip install transformers datasets peft trl accelerate bitsandbytes
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install triton
!pip install xformers

# Import necessary libraries
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU not found. Please make sure to enable GPU in the runtime settings.")

# Ensure we're using an A100 GPU for optimal performance
if device.type == "cuda" and "A100" not in torch.cuda.get_device_name(0):
    print("Warning: For optimal performance, it's recommended to use an A100 GPU instance.")

Collecting unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-8zimv2jq/unsloth_6eb767774c7f4bc8ab35b14fa11b346b
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-8zimv2jq/unsloth_6eb767774c7f4bc8ab35b14fa11b346b
  Resolved https://github.com/unslothai/unsloth.git to commit 38663b01f5dd0e610b12475bd95b144303cff539
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
[?25hCollecting unsloth-zoo (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2024.10.1-py3-none-any.whl.metadata (48 kB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m48.2/48.2 kB[0m [31m583.4 kB/s[0m eta [36m0:00:00[0m [

## Load and Prepare the Model

In [2]:
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

# Prepare the model for fine-tuning with LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)

==((====))==  Unsloth 2024.10.0: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-80GB. Max memory: 79.254 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Unsloth 2024.10.0 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## Prepare the Dataset

In [3]:
# Apply chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

# Function to format PubMedQA dataset entries
def format_pubmedqa(example):
    context = f"Context: {example['context']}\n\n"
    question = f"Question: {example['question']}\n\n"
    answer = f"Answer: {example['long_answer']}"

    messages = [
        {"from": "human", "value": context + question},
        {"from": "gpt", "value": answer}
    ]

    return {"conversations": messages}

# Function to apply the chat template to dataset entries
def apply_template(examples):
    # Correctly access the 'conversations' list in the examples dict
    conversations_list = examples['conversations']
    text = [
        tokenizer.apply_chat_template(
            conversations,
            tokenize=False,
            add_generation_prompt=True
        ) for conversations in conversations_list
    ]
    return {"text": text}

# Load and preprocess the PubMedQA dataset
dataset = load_dataset("pubmed_qa", "pqa_labeled", split="train")
dataset = dataset.map(format_pubmedqa)
dataset = dataset.map(apply_template, batched=True, remove_columns=dataset.column_names)

# Display dataset information
print(f"Dataset size: {len(dataset)}")
print("Sample entry:")
print(dataset[0]['text'][:500] + "...")


Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.


Downloading readme:   0%|          | 0.00/5.19k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset size: 1000
Sample entry:
<|im_start|>user
Context: {'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been...


## Set Up Training Arguments and Start Training

In [4]:
# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=1,  # Reduced batch size for Colab
        gradient_accumulation_steps=8,  # Adjusted to maintain effective batch size
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="paged_adamw_8bit",  # Corrected optimizer name
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

# Start training
trainer.train()

Generating train split: 0 examples [00:00, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 263 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 32
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
10,1.4605
20,1.2702
30,1.2511


TrainOutput(global_step=32, training_loss=1.3238190189003944, metrics={'train_runtime': 181.4876, 'train_samples_per_second': 1.449, 'train_steps_per_second': 0.176, 'total_flos': 2.3740393073934336e+16, 'train_loss': 1.3238190189003944, 'epoch': 0.973384030418251})

## Save the Model Locally

In [5]:
# Merge LoRA weights into the base model
model = model.merge_and_unload()

# Save the merged model
model.save_pretrained("fine_tuned_llama_3_1_pubmedqa")
tokenizer.save_pretrained("fine_tuned_llama_3_1_pubmedqa")
print("Model saved locally in 'fine_tuned_llama_3_1_pubmedqa' directory.")



Model saved locally in 'fine_tuned_llama_3_1_pubmedqa' directory.


## Use the Locally Saved Model

In [16]:
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import AutoTokenizer
import torch

# Load the saved model and tokenizer
model_path = "fine_tuned_llama_3_1_pubmedqa"
max_seq_length = 2048
tokenizer = AutoTokenizer.from_pretrained(model_path)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=max_seq_length,
    device_map="auto",
    dtype=torch.float16
)

# Prepare the model for inference
FastLanguageModel.for_inference(model)

# Apply the chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

def generate_response(prompt, max_length=200):
    # Format the prompt using the chat template
    messages = [{"from": "human", "value": prompt}]
    prompt_formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Prevent tokenizer from returning 'token_type_ids'
    inputs = tokenizer(prompt_formatted, return_tensors="pt", return_token_type_ids=False).to(model.device)

    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=1,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Remove the prompt from the output to get only the response
    response = output_text[len(prompt_formatted):].strip()
    return response


==((====))==  Unsloth 2024.10.0: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-80GB. Max memory: 79.254 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



### Test the Model

In [17]:
test_prompt = "What are the potential side effects of chemotherapy?"
response = generate_response(test_prompt)
print(f"Prompt: {test_prompt}")
print(f"Response: {response}")

Prompt: What are the potential side effects of chemotherapy?
Response: he potential side effects of chemotherapy?
Chemotherapy is a type of cancer treatment that uses one or more drugs to kill cancer cells. It works by interfering with the ability of cells to grow and divide. Chemotherapy can be used to cure cancer, to shrink or control the growth of cancer, or to kill cancer cells in the body.
Chemotherapy is often used in combination with other cancer treatments, such as surgery or radiation therapy. It can be given before or after these treatments.
The side effects of chemotherapy depend on the type of chemotherapy used and the dose given. Some of the most common side effects of chemotherapy include:
Fatigue (feeling very tired)
Chemotherapy can also cause side effects that are not related to the cancer itself, such as:
Increased risk of infection
These side effects can be caused by the chemotherapy drugs themselves or by the


### Interactive Loop for Testing

In [None]:
print("You can now interact with the fine-tuned model. Type 'quit' to exit.")
while True:
    user_input = input("\nEnter a medical question (or 'quit' to exit): ")
    if user_input.lower() == 'quit':
        break
    response = generate_response(user_input)
    print(f"Response: {response}")