# Build your MedBot

---
The goal of this notebook is to get you more familiar with LLM fine-tuning by creating a simple QA LLM that can answer medical questions. By the end of it you will be able to customize this LLM with any dataset.

**Just to give you a heads up:** We won't end up with a model that performs like ChatGPT or Bard, but we will get a sense of how to build our own smaller versions of such powerful LLMs.

## Importing and Installing Libraries/Packages
We will start by installing our necessary packages.

**bitsandbytes**: This package will allow us to run 4-bit quantization on our model.

**transformers**: This Hugging Face package will allow us to load state-of-the-art models easily into our notebook.

**peft**: This package allows us to add PEFT (parameter-efficient fine-tuning) techniques, such as LoRA, easily to our model.

**accelerate**: Accelerate is a handy package that handles boilerplate code, such as device placement, in a few lines of code.

**datasets**: This package allows us to import datasets directly from the Hugging Face platform.

In [None]:
import torch

if torch.cuda.is_available():
    print("CUDA is available.")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2)} GB")
else:
    print("No GPU found.")

CUDA is available.
GPU Name: NVIDIA A100-SXM4-80GB
Memory: 85.1 GB


In [None]:
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install datasets

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl (72.9 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.46.1
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-k6c32qyw
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-k6c32qyw
  Resolved https://github.com/hu

In [None]:
import torch
import transformers
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM

## Loading our model

Let's start by loading our model. We will use the GPT-NeoX-20B model by EleutherAI!

In [None]:
hf_model = "EleutherAI/gpt-neox-20b"

We will also set the bitsandbytes configuration needed for our model to run on our single Colab GPU. The parameters we need are 4-bit loading, double quantization, the quantization type (NF4), and the compute dtype, which needs to be set to bfloat16.

In [None]:
bitsbytes_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

We will then set up our tokenizer and our model using the AutoTokenizer and AutoModelForCausalLM classes.

In [None]:
#Load tokenizer and model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(hf_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    hf_model,
    quantization_config=bitsbytes_config,
    device_map="auto",
    trust_remote_code=True
)


Fetching 46 files:   0%|          | 0/46 [00:00<?, ?it/s]


Loading checkpoint shards:   0%|          | 0/46 [00:00<?, ?it/s]

## Model Preprocessing

We now have to apply some preprocessing to our model to prepare it for training. First we further reduce memory consumption by calling the gradient_checkpointing_enable() function on our model. We then call the prepare_model_for_kbit_training function so that we can train on top of the 4-bit quantized weights.

In [None]:
# Model Preprocessing
from peft import prepare_model_for_kbit_training

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Prepare model for k-bit (4-bit) training
model = prepare_model_for_kbit_training(model)


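**Q: How does 4-bit quantization affect accuracy?**

4-bit quantization reduces memory usage by storing model weights in lower precision.
While this enables training large models on limited hardware, it may cause a slight drop in accuracy due to reduced numerical precision.
However, the NF4 data type is designed to limit this loss, and the LoRA adapters, which are trained in higher precision, can recover part of it during fine-tuning. Double quantization saves further memory by also quantizing the quantization constants.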
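
As a rough back-of-the-envelope check on the memory savings described above, the sketch below estimates weight storage at different precisions. The parameter count is an assumption (the nominal 20 billion implied by the model name), and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 20e9  # assumption: nominal size implied by the "20b" in the model name

for label, bits in [("fp32", 32), ("bf16/fp16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>12}: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights alone drop from about 40 GB in 16-bit precision to about 10 GB in 4-bit, which is what makes a single GPU practical.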

We will also set a function that will print the number of trainable parameters our model has.

In [None]:
def print_trainable_parameters(model):
    # Count every parameter, and separately those that will receive gradients
    trainable_parameters = 0
    all_parameters = 0
    for _, param in model.named_parameters():
        all_parameters += param.numel()
        if param.requires_grad:
            trainable_parameters += param.numel()
    print(
        f"Trainable: {trainable_parameters} || All: {all_parameters} || Trainable %: {100 * trainable_parameters / all_parameters}"
    )

Finally we will set the configuration for our LoRA adapter. The parameters needed are the rank of the update matrices, the LoRA alpha scaling value, the target modules (which need to be set to query_key_value for this model), the LoRA dropout rate, the bias setting (none), and the task type for the model we are using.

In [None]:
config = LoraConfig(
    r=8, # Rank of LoRA update matrices
    lora_alpha=16, # Scaling factor
    target_modules=["query_key_value"], # Layer to apply LoRA on
    lora_dropout=0.05, # Dropout to help generalization
    bias="none",  # Only train LoRA weights, not biases
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

# Print the trainable parameters of the model
print_trainable_parameters(model)


Trainable: 8650752 || All: 10597552128 || Trainable %: 0.08162971878329976


That is, 8,650,752 of the 10,597,552,128 counted parameters (about 0.08%) are trainable.
If LoRA weren't applied, we'd see billions of trainable parameters.
Since only a tiny fraction is trainable, this confirms that the LoRA wrapper is active, enabling efficient fine-tuning with minimal resource usage.
Note that the total is roughly half of the model's nominal 20 billion parameters because bitsandbytes packs two 4-bit weights into each stored byte, so the true trainable share is closer to 0.04%.
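
The trainable count printed above can be reproduced by hand. LoRA adds two small matrices per adapted linear layer, one of shape rank x in_features and one of shape out_features x rank. The sketch below assumes the published GPT-NeoX-20B dimensions (hidden size 6144, 44 layers, one fused query_key_value projection per layer); these numbers are not stated in this notebook, and the helper name is ours.

```python
def lora_param_count(rank: int, in_features: int, out_features: int, num_modules: int) -> int:
    """Parameters added by LoRA: A (rank x in) plus B (out x rank) per adapted layer."""
    return rank * (in_features + out_features) * num_modules


# Assumed GPT-NeoX-20B dimensions: hidden size 6144, 44 transformer layers,
# one fused query_key_value projection (6144 -> 3 * 6144) in each layer.
hidden_size, num_layers = 6144, 44
neox_trainable = lora_param_count(
    rank=8, in_features=hidden_size, out_features=3 * hidden_size, num_modules=num_layers
)
print(neox_trainable)  # 8650752
```

The result matches the "Trainable" figure reported by print_trainable_parameters, which suggests that only the LoRA matrices on query_key_value are being trained.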

## Dataset Loading

Let's load our medical dataset from Hugging Face. We will use the `medalpaca/medical_meadow_wikidoc_patient_information` dataset. You can access it [here](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information).

In [None]:
data = load_dataset("medalpaca/medical_meadow_wikidoc_patient_information")

# Mapping the needed column as our data using a lambda statement
data = data.map(lambda samples: tokenizer(samples["output"]), batched=True)

# Map input_ids to labels for supervised training
data = data.map(lambda samples: {"labels": samples["input_ids"]})

Generating train split: 5942 examples
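
Copying input_ids into labels works because Hugging Face causal language models shift the labels internally, so the prediction at position t is scored against the token at position t + 1. The sketch below illustrates that pairing with plain lists; the token IDs are made up and the helper name is ours.

```python
def next_token_pairs(input_ids):
    """Pair each input position with the token the model is trained to predict next."""
    return list(zip(input_ids[:-1], input_ids[1:]))


example_ids = [101, 7, 42, 9]  # hypothetical token IDs
print(next_token_pairs(example_ids))  # [(101, 7), (7, 42), (42, 9)]
```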

## Model Training and Testing

Now we train the model using the transformers library. Before doing so, we set the tokenizer's padding token to the end-of-sequence token, since no padding token is set by default. Your goal here is to tune the parameters until the model trains on a single Colab GPU.

In [None]:
# Setting the tokenizer padding to be 'eos' tokens
tokenizer.pad_token = tokenizer.eos_token

from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./medbot_model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    optim="paged_adamw_8bit",
    save_strategy="no",
    report_to="none"
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    tokenizer=tokenizer,
    args=training_args
)


# Disable the key/value cache during training: it is incompatible with
# gradient checkpointing and otherwise triggers warnings
model.config.use_cache = False

# Train the model!

trainer.train()


  trainer = transformers.Trainer(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
  return fn(*args, **kwargs)


Step,Training Loss
10,2.0656
20,2.2522
30,2.078
40,2.0367
50,1.9399
60,2.0655
70,1.9998
80,2.0152
90,1.9967
100,1.8316


TrainOutput(global_step=1486, training_loss=1.9016111176235357, metrics={'train_runtime': 9147.6206, 'train_samples_per_second': 0.65, 'train_steps_per_second': 0.162, 'total_flos': 6.955353632386253e+16, 'train_loss': 1.9016111176235357, 'epoch': 1.0})

## Training Summary

The GPT-NeoX-20B model was fine-tuned using LoRA with 4-bit quantization over a medical dataset. Training was completed on an A100 GPU with gradient accumulation and memory-efficient optimization strategies to reduce compute load.

The full training process ran for approximately 2.5 hours (about 9,148 seconds) and completed one full epoch over the dataset in 1,486 optimizer steps. The average training loss over the epoch was about 1.90, and the run finished without interruption. The model is saved for inference in the cells below.

This process demonstrated the feasibility of adapting large-scale language models using LoRA and quantization on constrained hardware environments.
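
The throughput figures in the TrainOutput above can be cross-checked from the runtime, the dataset size, and the step count. This is only a sanity check on the reported numbers; the variable names are ours.

```python
train_runtime_s = 9147.6206  # 'train_runtime' from TrainOutput
num_examples = 5942          # size of the train split
global_steps = 1486          # 'global_step' from TrainOutput

hours = train_runtime_s / 3600
samples_per_second = num_examples / train_runtime_s
steps_per_second = global_steps / train_runtime_s

print(f"{hours:.2f} h, {samples_per_second:.2f} samples/s, {steps_per_second:.3f} steps/s")
```

The values agree with the reported train_samples_per_second (0.65) and train_steps_per_second (0.162).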



**Q: Explain four of the training arguments used in our Trainer and what they represent.**

1- per_device_train_batch_size=1: Sets the batch size to 1 sample per GPU for each forward/backward pass, before accumulation. This is useful for large models when memory is limited.

2- gradient_accumulation_steps=4: Simulates a larger batch size by accumulating gradients over 4 steps before updating model weights. This allows training with an effective batch size of 4.

3- learning_rate=2e-4: Sets how large each weight update is during training. A moderate learning rate helps the LoRA adapters learn effectively without overshooting.

4- optim="paged_adamw_8bit": Uses the bitsandbytes AdamW optimizer with 8-bit optimizer states and paged memory, which can move optimizer state to CPU memory when the GPU runs short. This reduces optimizer memory and helps train large models like GPT-NeoX on a single GPU.
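
To make the interaction between the first two arguments concrete, the sketch below computes the effective batch size and the number of optimizer steps in one epoch. It assumes a single GPU and that the final partial batch still counts as one step, which is consistent with the global_step of 1486 reported for the 5,942-example train split. The function name is ours.

```python
import math


def optimizer_steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Assumption: a trailing partial batch still triggers one optimizer step.
    return effective_batch, math.ceil(num_examples / effective_batch)


effective_batch, steps = optimizer_steps_per_epoch(
    num_examples=5942, per_device_batch_size=1, grad_accum_steps=4
)
print(effective_batch, steps)  # 4 1486
```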

In [None]:
# Clean up unused memory to free GPU resources and to prevent errors before saving or generating
import gc
gc.collect()
torch.cuda.empty_cache()

We now save our model with save_pretrained() so that it can be reloaded later. The next cell writes the wrapped base model and the LoRA adapter to separate folders.

In [None]:
# Extract and save clean base model
saved_model = model.base_model if hasattr(model, "base_model") else model

# Save to a new folder
saved_model.save_pretrained("outputs_clean")
model.save_pretrained("lora_outputs_clean")

After training, we save both the base model and the LoRA adapter separately.
"outputs_clean" contains the base model, while "lora_outputs_clean" stores only the lightweight fine-tuned LoRA parameters.
This allows us to reload and reuse the adapter efficiently without storing the full model again.


Before testing a reloaded model, we would read the LoRA configuration from the saved adapter folder and apply it to the base model using the get_peft_model() function.

In [None]:
#lora_configs = LoraConfig.from_pretrained("lora_outputs_clean")
#model = get_peft_model(saved_model, lora_configs)



Note: LoRA is already applied to the model in memory, so we skip this cell to avoid duplicate adapter warnings.
Keep in mind that get_peft_model() attaches a freshly initialized adapter; to restore the trained adapter weights in a new session, load them with PeftModel.from_pretrained() instead.
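
For a fresh session, a reload could look roughly like the sketch below. It is not the notebook's own code: the function name is hypothetical, the adapter folder is assumed to be the "lora_outputs_clean" directory written above, and the quantization settings simply mirror the ones used for training.

```python
def load_finetuned_model(base_model_id: str, adapter_dir: str):
    """Reload the quantized base model and attach the saved LoRA adapter weights."""
    # Heavy dependencies are imported lazily so this function can be defined
    # before the libraries (and a GPU) are available.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, quantization_config=bnb_config, device_map="auto"
    )
    # PeftModel.from_pretrained loads the trained adapter weights from disk;
    # get_peft_model would create a new, untrained adapter from the config alone.
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()  # turn off dropout for inference
    return model


# Usage (not executed here because it downloads the full model):
# model = load_finetuned_model("EleutherAI/gpt-neox-20b", "lora_outputs_clean")
```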

In [None]:
# Clean up unused memory to free GPU resources and to prevent errors before saving or generating
import torch, gc

gc.collect()
torch.cuda.empty_cache()


We need to set our prompt as a variable, and also our device currently in use.

In [None]:
prompt = """### Instruction:
What are the symptoms of diabetes?

### Response:"""

device = "cuda:0"


Finally, we will make our LLM generate text based on the prompt. First we use the tokenizer() function on our prompt.

In [None]:
inputs = tokenizer(prompt, return_tensors="pt").to(device)

Let's now use the generate() function on our model, and print the decoded version of our output.

In [None]:
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).strip())

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Caching is incompatible with gradient checkpointing in GPTNeoXLayer. Setting `layer_past=None`.

### Instruction:
What are the symptoms of diabetes?

### Response:


Note: MedBot is very enthusiastic about "the first time to be found."
Unfortunately, it hasn't quite learned how to answer medical questions yet.
Sometimes it repeated "the first time to be found" many times as its answer, and sometimes it produced nothing at all.
Future improvement: fine-tune on structured instruction-response pairs!
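
One way to act on that improvement is to train on the question and answer together, in the same prompt layout used at inference time, rather than on the answer text alone. The sketch below is an illustration under assumptions: the helper name is ours, and it assumes the question lives in a column called "input" (only the "output" column is confirmed by the code above). The sample answer is a placeholder, not real data.

```python
PROMPT_TEMPLATE = "### Instruction:\n{question}\n\n### Response:\n{answer}"


def format_example(sample, eos_token=""):
    """Join question and answer into one training string that mirrors the inference prompt."""
    text = PROMPT_TEMPLATE.format(
        question=sample["input"].strip(),  # assumed column holding the question
        answer=sample["output"].strip(),
    )
    # Appending the end-of-sequence token teaches the model where an answer stops.
    return text + eos_token


sample = {"input": "What are the symptoms of diabetes?", "output": "Placeholder answer text."}
print(format_example(sample, eos_token="<|endoftext|>"))
```

The formatted string could then be tokenized in place of samples["output"], for example with data.map(lambda s: tokenizer(format_example(s, tokenizer.eos_token))).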


### We swapped out the GPT-NeoX-20B model for a smaller one (GPT-Neo 2.7B) to try training on a Colab T4 GPU, while keeping the same dataset, tokenizer setup, LoRA hyperparameters, and quantization.

## Importing and Installing Libraries/Packages

In [None]:
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install datasets

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-ih_vbn_m
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-ih_vbn_m
  Resolved https://github.com/huggingface/transformers.git to commit 4fcf45551775b05a3a78481ad53552635026c7d2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/peft.git
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-i8kd60fa
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-i8kd60fa
  Resolved https://github.com/huggingface/peft.git to commit a91ec33fc515ad71d8acdc67f396bfec7e38873f
  Installing build dependencies ... done
  Getting requirements

In [None]:
import torch
import transformers
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM

## Loading our model

Let's start by loading our model. We will use the GPT-Neo 2.7B model by EleutherAI!

In [None]:
hf_model = "EleutherAI/gpt-neo-2.7B"

We will also set the bitsandbytes configuration needed for our model to run on our single Colab GPU. The parameters we need are 4-bit loading, double quantization, the quantization type (NF4), and the compute dtype, which needs to be set to bfloat16.

In [None]:
bitsbytes_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

We will then set up our tokenizer and our model using the AutoTokenizer and AutoModelForCausalLM classes.

In [None]:
# Load tokenizer and model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(hf_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    hf_model,
    quantization_config=bitsbytes_config,
    device_map="auto",
    trust_remote_code=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Model Preprocessing

We now have to apply some preprocessing to our model to prepare it for training. First we further reduce memory consumption by calling the gradient_checkpointing_enable() function on our model. We then call the prepare_model_for_kbit_training function so that we can train on top of the 4-bit quantized weights.

In [None]:
# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Prepare model for k-bit (4-bit) training
model = prepare_model_for_kbit_training(model)


We will also set a function that will print the number of trainable parameters our model has.

In [None]:
def print_trainable_parameters(model):
    # Count every parameter, and separately those that will receive gradients
    trainable_parameters = 0
    all_parameters = 0
    for _, param in model.named_parameters():
        all_parameters += param.numel()
        if param.requires_grad:
            trainable_parameters += param.numel()
    print(
        f"Trainable: {trainable_parameters} || All: {all_parameters} || Trainable %: {100 * trainable_parameters / all_parameters}"
    )

Finally we will set the configuration for our LoRA adapter. The parameters needed are the rank of the update matrices, the LoRA alpha scaling value, the target modules (for GPT-Neo, the separate k_proj, q_proj, and v_proj attention projections), the LoRA dropout rate, the bias setting (none), and the task type for the model we are using.

In [None]:
# Set LoRA configuration
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["k_proj", "q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

# Print the trainable parameters of the model
print_trainable_parameters(model)

Trainable: 3932160 || All: 1396948480 || Trainable %: 0.28148210591130746


## Dataset Loading

Let's load our medical dataset from Hugging Face. We will use the `medalpaca/medical_meadow_wikidoc_patient_information` dataset. You can access it [here](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information).

In [None]:
# Load the medical dataset
data = load_dataset("medalpaca/medical_meadow_wikidoc_patient_information")

# Mapping the needed column as our data using a lambda statement
data = data.map(lambda samples: tokenizer(samples["output"]), batched=True)

In [None]:
# Add labels
data = data.map(lambda samples: {"labels": samples["input_ids"]}, batched=True)

## Model Training and Testing

Now we train the model using the transformers library. Before doing so, we set the tokenizer's padding token to the end-of-sequence token, since no padding token is set by default. Your goal here is to tune the parameters until the model trains on a single Colab GPU.

In [None]:
# Setting the tokenizer padding to be 'eos' tokens
tokenizer.pad_token = tokenizer.eos_token

# Define training arguments
training_args = transformers.TrainingArguments(
    output_dir="./medbot_model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    optim="paged_adamw_8bit",
    save_strategy="no",
    report_to="none"
)


trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    tokenizer=tokenizer,
    args=training_args
)

# Disable the key/value cache during training: it is incompatible with
# gradient checkpointing and otherwise triggers warnings
model.config.use_cache = False

# Train the model!

trainer.train()


  trainer = transformers.Trainer(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
  return fn(*args, **kwargs)


Step,Training Loss
10,2.3019
20,2.4263
30,2.2675
40,2.2452
50,2.1789
60,2.3426
70,2.2734
80,2.2472
90,2.2346
100,2.036


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [None]:
# Note: The training of the GPT-Neo 2.7B model was interrupted at 87% (0.87 / 1 epoch)
# due to a CUDA error. As a result, the model and LoRA adapter were not saved,
# and inference was not possible.
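
The notebook does not diagnose this error. One plausible cause, stated here as an assumption rather than a confirmed finding, is that an untruncated answer produced more tokens than GPT-Neo's 2048 learned position embeddings, which leads to an out-of-range index on the GPU. The sketch below shows how sequences could be checked and truncated at tokenization time; both helper names are ours, and `tokenizer` is expected to be a Hugging Face tokenizer passed in by the caller.

```python
def overlong_indices(token_lists, max_length=2048):
    """Return the indices of examples whose token count exceeds max_length."""
    return [i for i, ids in enumerate(token_lists) if len(ids) > max_length]


def tokenize_truncated(samples, tokenizer, max_length=2048):
    """Tokenize the answer column, cutting each sequence to at most max_length tokens."""
    # truncation and max_length are standard Hugging Face tokenizer arguments.
    return tokenizer(samples["output"], truncation=True, max_length=max_length)


# Usage (assumes `data` and `tokenizer` from the cells above):
# data = data.map(lambda s: tokenize_truncated(s, tokenizer), batched=True)
```

Running overlong_indices on data["train"]["input_ids"] before training would show whether any example actually exceeds the limit.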

In [None]:
#saved_model = model.base_model if hasattr(model, "base_model") else model
#saved_model.save_pretrained("outputs")

In [None]:
# Save LoRA adapter (after training)
#model.save_pretrained("lora_outputs")

Before testing our model, we would read the LoRA configuration from the saved adapter and apply it to the base model using the get_peft_model() function. This step was skipped here because nothing was saved.

In [None]:
#lora_configs = LoraConfig.from_pretrained("lora_outputs")
#model = get_peft_model(saved_model, lora_configs)

We need to set our prompt as a variable, and also our device currently in use.

In [None]:
#prompt = "What are the symptoms of diabetes?"
#device = "cuda:0"

Finally, we will make our LLM generate text based on the prompt. First we use the tokenizer() function on our prompt.

In [None]:
#inputs = tokenizer(prompt, return_tensors="pt").to(device)


Let's now use the generate() function on our model, and print the decoded version of our output.

In [None]:
#outputs = model.generate(**inputs, max_new_tokens=40)
#print(tokenizer.decode(outputs[0], skip_special_tokens=True))


# Project Summary:

In this project, we explored the fine-tuning of LLMs for medical question answering using LoRA and 4-bit quantization.

The initial model used was **GPT-NeoX 20B**, followed by an attempt with
**GPT-Neo 2.7B** for efficiency comparison.

The **GPT-NeoX 20B model** was successfully loaded and fine-tuned using LoRA with 4-bit quantization. Despite its size, the training and inference pipelines executed correctly without runtime errors. Approximately 0.08% of the model’s parameters were updated using LoRA, enabling efficient training on a limited GPU setup.

The model was saved in two parts:
- Base model: outputs_clean/
- LoRA adapter: lora_outputs_clean/


The full pipeline executed without errors. However, in test cases such as:

**Instruction**: What are the symptoms of diabetes?  
**Response:** (No response was generated in this specific test case).

The output remained either **repetitive** or **empty**. This highlights a limitation of fine-tuning without highly structured instruction-response pairs.


This behavior may also be due to limited training time (only 1 epoch), the dataset, conservative decoding settings (e.g., low max_new_tokens), or lack of diversity in output patterns during training.
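
As a starting point for experimenting with decoding, the settings below enable sampling and penalize repetition. All keys are standard arguments of the Hugging Face generate() method, but the values are illustrative guesses that were not tuned or tested on this model.

```python
# Illustrative, untuned decoding settings (assumptions, not results from this project).
generation_kwargs = {
    "max_new_tokens": 200,      # leave room for a full answer
    "do_sample": True,          # sample instead of greedy decoding
    "temperature": 0.7,         # soften the next-token distribution
    "top_p": 0.9,               # nucleus sampling cutoff
    "repetition_penalty": 1.2,  # discourage repeated phrases
}

# Usage (assumes `model`, `tokenizer`, and `inputs` from the cells above):
# model.eval()  # inference mode: turns off LoRA dropout
# outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, **generation_kwargs)
```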


A second attempt was made using **GPT-Neo 2.7B**, a smaller model aimed at reducing training time and memory usage. Training reached **87%** completion (0.87 of 1 epoch) before being interrupted by a CUDA error. The model was not saved, and inference could not be tested.




### Key Learnings:
- Fine-tuning large models using LoRA and 4-bit quantization is achievable on constrained resources.
- Training only a small fraction of the parameters (about 0.08% as reported for GPT-NeoX 20B) is enough to apply lightweight updates.
- Prompt design, decoding strategy, and structured training data are critical to improving model output quality.