### Installation

In [1]:
%%capture
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

!pip install pip3-autoremove
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu124
!pip install unsloth
!pip install --upgrade transformers==4.53.2 "huggingface_hub>=0.34.0" "datasets>=3.4.1,<4.0.0"


### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
import re
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.6: Fast Llama patching. Transformers: 4.53.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.8.6 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
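
The adapter size can be checked by hand. The sketch below is a rough calculation, assuming the layer shapes from the public Llama-3.2-3B config (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers); these shapes are not stated in this notebook. Each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` LoRA weights.

```python
# Assumed shapes for Llama-3.2-3B (from the public model config, not from this notebook).
HIDDEN = 3072
INTERMEDIATE = 8192
KV_DIM = 8 * 128  # 8 key/value heads of dimension 128
NUM_LAYERS = 28
RANK = 16         # the `r` passed to get_peft_model

# (in_features, out_features) for every module listed in target_modules.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, projections, num_layers):
    """LoRA adds an (in x r) and an (r x out) matrix per adapted layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in projections.values())
    return per_layer * num_layers

trainable = lora_param_count(RANK, PROJECTIONS, NUM_LAYERS)
print(f"{trainable:,} trainable LoRA parameters")
```

Under these assumptions the count is 24,313,856, which matches the `Trainable parameters` figure Unsloth prints when training starts (about 0.75% of the model).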


<a name="Data"></a>
### Data Prep
We use the `Llama-3.1` chat format for conversation-style finetunes. Our dataset is stored in the Llama 2 instruction format (`<s>[INST]...[/INST]...</s>`), so we convert each sample to HuggingFace's normal multi-turn format with `("role", "content")` keys instead of `("from", "value")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
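
The layout above can be sketched in a few lines of plain Python. `render_llama3` is a hypothetical helper written only for illustration; in the notebook the real rendering is done by `tokenizer.apply_chat_template`, which may add more than this sketch does (for example a default system block).

```python
def render_llama3(messages):
    """Render role/content messages in the Llama-3 layout (illustrative helper only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(conversation)
print(rendered)
```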

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

In [5]:
from datasets import load_dataset, concatenate_datasets
ds1 = load_dataset("DR-DRR/Medical_Customer_care", split="train")
ds2 = load_dataset("shetumohanto/doctor_qa_bangla",split="train" )

ds1_4k = ds1.select(list(range(4000)))
ds2_4k = ds2.select(list(range(4000)))

merged_dataset = concatenate_datasets([ds1_4k, ds2_4k])

dataset = merged_dataset.shuffle(seed=42)

dataset.txt: 0.00B [00:00, ?B/s]

final_dataset.txt:   0%|          | 0.00/191M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/207487 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

dataset_mistral.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/5135 [00:00<?, ? examples/s]

In [6]:
def parse_llama2_to_list(llama_text):
    """
    Converts a Llama 2 template string into a list of dictionaries
    for the chat template function.
    """
    # Remove the <s> and </s> markers
    cleaned_text = llama_text.replace("<s>", "").replace("</s>", "").strip()
    
    # Use regex to find the content inside [INST]...[/INST] and the assistant's response
    match = re.search(r"\[INST\](.*?)\[/INST\](.*)", cleaned_text, re.DOTALL)
    
    if not match:
        # Handle cases where there's no assistant response (e.g., a prompt for a new turn)
        user_message = cleaned_text.replace("[INST]", "").strip()
        return [{"role": "user", "content": user_message}]

    user_message = match.group(1).strip()
    assistant_message = match.group(2).strip()
    
    return [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message}
    ]
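
The function above handles a single `[INST]...[/INST]` pair. If a sample held several turns, a variant along the following lines could split them. This is only a sketch: `parse_llama2_multiturn` is a hypothetical name, and it assumes turns are simply concatenated with no `<<SYS>>` block.

```python
import re

def parse_llama2_multiturn(llama_text):
    """Split a Llama 2 style string with one or more [INST] turns into messages."""
    cleaned = llama_text.replace("<s>", "").replace("</s>", "").strip()
    # Each pair is (user text, assistant text up to the next [INST] or end of string).
    pairs = re.findall(r"\[INST\](.*?)\[/INST\](.*?)(?=\[INST\]|$)", cleaned, re.DOTALL)
    messages = []
    for user_text, assistant_text in pairs:
        messages.append({"role": "user", "content": user_text.strip()})
        if assistant_text.strip():  # skip a missing reply (prompt-only sample)
            messages.append({"role": "assistant", "content": assistant_text.strip()})
    return messages

sample = "<s>[INST] Hi [/INST] Hello </s><s>[INST] How are you? [/INST] Fine </s>"
for message in parse_llama2_multiturn(sample):
    print(message)
```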

In [7]:
def formatting_prompts_func(examples):
    """
    Converts Llama 2 formatted strings to Llama 3.1 chat templates.
    """
    llama2_strings = examples["text"]
    
    # First, convert the Llama 2 strings into the list of dicts format
    conversations_list = [parse_llama2_to_list(text) for text in llama2_strings]
    
    # Then, apply the chat template to each list of dictionaries
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in conversations_list
    ]
    
    return {"text": texts}

In [8]:
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Our dataset did not need it, but ShareGPT style datasets can be converted into HuggingFace's generic format with `standardize_sharegpt`. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
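
The key and role renaming shown above can be sketched as follows. `sharegpt_to_messages` and `ROLE_MAP` are hypothetical names used for illustration; this is not a claim about how `standardize_sharegpt` is implemented internally.

```python
# Role names on the left are ShareGPT style, on the right HuggingFace style.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(turns):
    """Rename ("from", "value") keys to ("role", "content") and map the role names."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in turns]

sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = sharegpt_to_messages(sharegpt_turns)
for message in converted:
    print(message)
```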

In [9]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n5 days ago I fell down three steps and hit my head and back. The headache and soreness as a result of the fall have gone, however now I am experiencing light dizziness - similar to how one feels when they are short of breath. Is this a result of the fall and how should I handle this?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello, It is due to minor concussion and will settle by itself. As of now  ChatDoctor.<|eot_id|>'

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` by default, so do not be alarmed!

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`. We also support TRL's `DPOTrainer`!

In [10]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False, # Can make training 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/8000 [00:00<?, ? examples/s]
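
To put the 60-step limit in perspective, the arithmetic below works out how much of the dataset such a short run covers. It is a back-of-the-envelope sketch using the values from the config above and the single GPU reported at load time.

```python
import math

per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1           # single Tesla T4
max_steps = 60
num_examples = 8000    # size of the merged dataset

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Fraction of one epoch: {examples_seen / num_examples:.0%}")
print(f"Steps needed for a full epoch: {steps_per_epoch}")
```

So the quick run sees 480 of the 8,000 examples, while a full epoch would take 1,000 optimizer steps.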

We also use Unsloth's `train_on_responses_only` function to only train on the assistant outputs and ignore the loss on the user's inputs.

In [11]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=4):   0%|          | 0/8000 [00:00<?, ? examples/s]

We verify masking is actually done:

In [12]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n5 days ago I fell down three steps and hit my head and back. The headache and soreness as a result of the fall have gone, however now I am experiencing light dizziness - similar to how one feels when they are short of breath. Is this a result of the fall and how should I handle this?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello, It is due to minor concussion and will settle by itself. As of now  ChatDoctor.<|eot_id|>'

In [13]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                    Hello, It is due to minor concussion and will settle by itself. As of now  ChatDoctor.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
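
The idea behind this masking can be illustrated on toy token ids. The sketch below is not Unsloth's implementation; `mask_non_responses` and `find_sub` are hypothetical helpers, and the integer ids stand in for the tokenized `instruction_part` and `response_part` markers. Positions labelled `-100` are ignored by the loss.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

USER, ASSISTANT = [1, 2], [3, 4]  # toy marker ids
toy_ids = [0] + USER + [10, 11, 9] + ASSISTANT + [20, 21, 9] + USER + [12, 9] + ASSISTANT + [22, 9]
toy_labels = mask_non_responses(toy_ids, USER, ASSISTANT)
print(toy_labels)
```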

In [14]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
3.441 GB of memory reserved.


In [15]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.7494
2,2.694
3,1.8825
4,1.3566
5,1.8017
6,1.5879
7,1.2309
8,1.3644
9,0.9039
10,1.2951


In [16]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

334.5142 seconds used for training.
5.58 minutes used for training.
Peak reserved memory = 5.342 GB.
Peak reserved memory for training = 1.901 GB.
Peak reserved memory % of max memory = 36.239 %.
Peak reserved memory for training % of max memory = 12.896 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!



We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
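
As a rough illustration of what these two settings do: temperature divides the logits before the softmax, and min-p then drops every token whose probability is below `min_p` times the probability of the most likely token. The sketch below is plain Python with a hypothetical helper `min_p_filter` and made-up logits; it assumes temperature is applied before the filter and is not the library's implementation.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Temperature-scale logits, softmax, drop tokens below min_p * max prob, renormalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

toy_logits = [3.0, 2.0, -3.0]  # made-up values for three candidate tokens
filtered = min_p_filter(toy_logits)
print([round(p, 3) for p in filtered])
```

A higher temperature flattens the distribution, while the min-p cut removes the unlikely tail that the flattening would otherwise make easier to sample.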

In [17]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Explain the importance of staying hydrated and its benefits for overall health."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nExplain the importance of staying hydrated and its benefits for overall health.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi, Welcome to Chat Doctor, I am Chat Doctor. Today I will answer your question. Hydration is the process by which your body is maintaining adequate amounts of fluid in order to perform its various physiological functions. Adequate fluid intake helps in maintaining an effective functioning of organs, body, tissues and enzymes. It helps']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [18]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Explain the importance of staying hydrated and its benefits for overall health."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Hello, Welcome to ChatDoctor. Thank you for visiting ChatDoctor. Hope you got the complete information for your query. Wish you good health. Have any query. Let me know. Thanks for using ChatDoctor.  Chat Doctor  You may chat Doctor 24/7. Have any query  Just type it Chat Doctor. Don't worry about privacy. We are 100% confidential. Hope this helps. If you  still have  doubt.  Go ahead. Chat Doctor  Chat Doctor<|eot_id|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
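
Access tokens are best kept out of the notebook itself. A minimal sketch, assuming the token has been stored in an environment variable named `HF_TOKEN` (for example through the platform's secrets manager); `get_hf_token` is a hypothetical helper, not part of any library.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment instead of hard-coding it."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before pushing to the Hub.")
    return token

# Usage with the push calls in this section (left commented out here):
# model.push_to_hub(huggingface_model_name, token = get_hf_token())
# tokenizer.push_to_hub(huggingface_model_name, token = get_hf_token())
```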

In [19]:
huggingface_model_name = "samanta-scratch/bilingual-health-qna"
model.push_to_hub(huggingface_model_name, token = "hf_...") # Online saving; use your own token and never publish a real one
tokenizer.push_to_hub(huggingface_model_name, token = "hf_...") # Online saving

README.md:   0%|          | 0.00/620 [00:00<?, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...pj98zn471/adapter_model.safetensors:   0%|          | 45.8kB / 97.3MB            

Saved model to https://huggingface.co/samanta-scratch/bilingual-health-qna


Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  /tmp/tmpf_3se4bq/tokenizer.json       : 100%|##########| 17.2MB / 17.2MB            