### **This notebook is edited from the original Unsloth notebook**

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth
# Install latest Hugging Face for Gemma-3!
!pip install --no-deps git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [2]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.14: Fast Gemma3 patching. Transformers: 4.50.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



<div style="font-family: Arial, sans-serif; color: #2e3b4e; line-height: 1.6;">
  <h2>Explanation of the PEFT Model Configuration</h2>
  <ul>
    <li><strong>finetune_vision_layers = False</strong>: Disables fine-tuning of vision layers. Use this when working with text-only tasks since visual processing is not required.</li>
    <li><strong>finetune_language_layers = True</strong>: Enables fine-tuning of the language layers, which is essential for adapting the model to textual tasks.</li>
    <li><strong>finetune_attention_modules = True</strong>: Fine-tunes the attention modules, helping the model capture relationships and dependencies within the text, which is beneficial for tasks like GRPO.</li>
    <li><strong>finetune_mlp_modules = True</strong>: Keeps the multi-layer perceptron (MLP) modules active for fine-tuning, generally improving the model's overall performance.</li>
    <li><strong>r = 8</strong>: Sets the rank of the LoRA update matrices. Higher ranks add capacity and can improve accuracy, but they also use more memory and increase the risk of overfitting; 8 is a comparatively small rank.</li>
    <li><strong>lora_alpha = 8</strong>: The scaling factor for LoRA updates. It is recommended to set this to at least the value of <code>r</code>.</li>
    <li><strong>lora_dropout = 0</strong>: Indicates that no dropout is applied during the LoRA fine-tuning process.</li>
    <li><strong>bias = "none"</strong>: Specifies that bias terms are not adjusted during fine-tuning.</li>
    <li><strong>random_state = 3407</strong>: Sets a fixed random seed for reproducibility, so that repeated runs start from the same random initialization.</li>
  </ul>
</div>
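
To make the effect of `r` and `lora_alpha` more concrete, here is a minimal sketch of the arithmetic for a single linear layer. It assumes the standard LoRA formulation (two low-rank matrices of shapes `d_in x r` and `r x d_out`, with the update scaled by `lora_alpha / r`). The layer size and both helper functions are hypothetical and are not taken from Gemma-3 or Unsloth.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    # A is (d_in x r) and B is (r x d_out); the frozen base weight is not counted.
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update in the standard LoRA formulation."""
    return lora_alpha / r


# Hypothetical layer size, chosen only for illustration.
d_in, d_out = 2048, 2048
for rank in (8, 16, 32):
    added = lora_extra_params(d_in, d_out, rank)
    frozen = d_in * d_out
    print(f"r={rank:>2}: {added:,} LoRA params vs {frozen:,} frozen ({added / frozen:.2%})")

print("scaling with r=8, lora_alpha=8:", lora_scaling(8, 8))
```

The added parameter count grows linearly with `r`, which is why doubling the rank roughly doubles the adapter size.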


In [3]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,  # Should always be left on!

    r = 8,
    lora_alpha = 8,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

### **Data formatting for Gemma model training**
This section explains how we prepare conversation data for fine-tuning our model using the "Gemma-3" format. Here's a simple breakdown:

- **Gemma-3 Format:**  
  We use a specific format that organizes conversations into clear turns. Special tokens mark the beginning of the conversation (`<bos>`), when a user starts speaking (`<start_of_turn>user`), and when their turn ends (`<end_of_turn>`). Then, the model's response is similarly marked. This helps the model learn the structure of a conversation.


- **Chat Template Function:**  
  The `get_chat_template` function is used to automatically format the conversation data in the correct style. This function supports various conversation formats (like zephyr, chatml, mistral, llama, alpaca, vicuna, and more) so that we can adapt our training to different standards if needed.
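
To show what the format looks like on a small input, here is a hand-written approximation of the Gemma-3 layout. It is only a sketch: `render_gemma3` is a hypothetical helper, and the real template applied by the tokenizer may treat system prompts, whitespace, and images differently.

```python
def render_gemma3(conversation):
    """Render a list of {'role', 'content'} turns as Gemma-3 style text."""
    # The template labels the assistant's turns "model".
    role_names = {"user": "user", "assistant": "model"}
    parts = ["<bos>"]
    for turn in conversation:
        role = role_names[turn["role"]]
        parts.append(f"<start_of_turn>{role}\n{turn['content']}<end_of_turn>\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "What is hypertension?"},
    {"role": "assistant", "content": "Hypertension is high blood pressure."},
]
print(render_gemma3(example))
```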



In [4]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [5]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")


README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

**This is the format of the data used in the original Gemma 3 notebook:**

In [6]:
dataset[10]

{'conversations': [{'from': 'human',
   'value': 'Write Python code to solve the task:\nIn programming, hexadecimal notation is often used.\n\nIn hexadecimal notation, besides the ten digits 0, 1, ..., 9, the six letters `A`, `B`, `C`, `D`, `E` and `F` are used to represent the values 10, 11, 12, 13, 14 and 15, respectively.\n\nIn this problem, you are given two letters X and Y. Each X and Y is `A`, `B`, `C`, `D`, `E` or `F`.\n\nWhen X and Y are seen as hexadecimal numbers, which is larger?\n\nConstraints\n\n* Each X and Y is `A`, `B`, `C`, `D`, `E` or `F`.\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nX Y\n\n\nOutput\n\nIf X is smaller, print `<`; if Y is smaller, print `>`; if they are equal, print `=`.\n\nExamples\n\nInput\n\nA B\n\n\nOutput\n\n<\n\n\nInput\n\nE C\n\n\nOutput\n\n>\n\n\nInput\n\nF F\n\n\nOutput\n\n='},
  {'from': 'gpt',
   'value': "Step 1:  To compare two hexadecimal numbers represented by the letters X and Y, we need to understand the

### Loading our own medical dataset from Hugging Face

The **lavita/ChatDoctor-HealthCareMagic-100k** dataset is a collection of roughly 100,000 healthcare-related conversations (112,165 rows in the train split) intended for training and fine-tuning medical chatbots. Here’s a brief overview:

- **Domain-Specific:**  
  The dataset consists of dialogues between patients and doctors, covering a wide range of medical topics and queries.

In [7]:
ds = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k")

README.md:   0%|          | 0.00/542 [00:00<?, ?B/s]

(…)-00000-of-00001-5e7cb295b9cff0bf.parquet:   0%|          | 0.00/70.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/112165 [00:00<?, ? examples/s]

In [8]:
ds['train'][10]

{'instruction': "If you are a doctor, please answer the medical questions based on the patient's description.",
 'input': 'I have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???',
 'output': 'Cellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health

**This dataset is not in the format Gemma expects, so we need to convert it.**

The conversion takes each record's `input` (the user's message) and `output` (the assistant's response) and restructures them into a two-turn conversation with explicit `user` and `assistant` roles. The `instruction` field is not carried over. Organizing the data as a series of turns with clear roles helps the model learn the structure of a dialogue.

The `Dataset.from_list()` method is part of the Hugging Face Datasets library. It allows you to create a Dataset object directly from a Python list of records (usually dictionaries). Each dictionary in the list represents a single data example, and its keys become the columns of the dataset.

In [9]:
from datasets import Dataset
def convert_dataset(input_dataset):
    converted_records = []

    # Process each record in the dataset
    for record in input_dataset:
        # Extract input and output from the original record
        user_content = record['input']
        assistant_content = record['output']

        # Create the new format
        converted_record = {
            'conversations': [
                {
                    'content': user_content,
                    'role': 'user'
                },
                {
                    'content': assistant_content,
                    'role': 'assistant'
                }
            ]
        }

        converted_records.append(converted_record)

    # Create a new HuggingFace dataset from the converted records
    converted_dataset = Dataset.from_list(converted_records)

    return converted_dataset

# Convert the dataset
converted_dataset = convert_dataset(ds['train'])



In [10]:
converted_dataset[10]

{'conversations': [{'content': 'I have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???',
   'role': 'user'},
  {'content': 'Cellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.',
   'role': 'assistant'}]}

The function `standardize_data_formats(converted_dataset)` takes your converted dataset (which already has a conversation structure) and ensures that all records adhere to this consistent format.
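
As a rough illustration of what standardizing means here, the sketch below maps ShareGPT-style turns (the `from`/`value` keys seen in the FineTome sample) onto the `role`/`content` keys used in our converted dataset. It is a hand-written approximation, not Unsloth's implementation; `ROLE_ALIASES` and both helper functions are hypothetical names.

```python
# Hypothetical mapping from ShareGPT-style speaker names to chat roles.
ROLE_ALIASES = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "system": "system",
}


def standardize_turn(turn):
    """Return one turn as {'role': ..., 'content': ...}."""
    speaker = turn["role"] if "role" in turn else turn["from"]
    text = turn["content"] if "content" in turn else turn["value"]
    if speaker not in ROLE_ALIASES:
        raise ValueError(f"Unknown speaker: {speaker!r}")
    return {"role": ROLE_ALIASES[speaker], "content": text}


def standardize_conversation(conversation):
    """Standardize every turn of one conversation."""
    return [standardize_turn(turn) for turn in conversation]


sharegpt_style = [
    {"from": "human", "value": "Which is larger, A or B?"},
    {"from": "gpt", "value": "B is larger."},
]
print(standardize_conversation(sharegpt_style))
```

Records that already use `role`/`content`, like ours, pass through unchanged in this sketch.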

In [11]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(converted_dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/112165 [00:00<?, ? examples/s]

Let's see what row 10 looks like!

In [12]:
dataset[10]

{'conversations': [{'content': 'I have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???',
   'role': 'user'},
  {'content': 'Cellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.',
   'role': 'assistant'}]}

We now have to apply the chat template for `Gemma-3` to the conversations and save the result to `text`.

In [13]:
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"])
    return { "text" : texts }
pass
dataset = dataset.map(apply_chat_template, batched = True)

Map:   0%|          | 0/112165 [00:00<?, ? examples/s]

Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` by default!

In [14]:
dataset[10]["text"]

'<bos><start_of_turn>user\nI have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???<end_of_turn>\n<start_of_turn>model\nCellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.<end_of_turn>\n'

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run and remove the step limit with `max_steps=None`.
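
To put the short run in perspective, here is a small sketch of the batch arithmetic. It uses the values from the training configuration (`per_device_train_batch_size = 2`, `gradient_accumulation_steps = 4`, `max_steps = 30`) and the 112,165-row train split; the steps-per-epoch figure is an estimate, and the trainer's own rounding may differ by one step.

```python
import math

num_examples = 112_165             # rows in the train split
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 30

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Approximate optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Examples actually seen in the short demo run.
examples_seen = max_steps * effective_batch

print(f"effective batch size: {effective_batch}")
print(f"approx. steps for one epoch: {steps_per_epoch:,}")
print(f"examples seen in {max_steps} steps: {examples_seen} "
      f"({examples_seen / num_examples:.2%} of the dataset)")
```

The demo run therefore touches only a tiny fraction of the data, which is worth keeping in mind when judging the outputs later on.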

In [15]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16
Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/112165 [00:00<?, ? examples/s]
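
The `lr_scheduler_type = "linear"` and `warmup_steps = 5` settings shape the learning rate over the 30 steps. Below is a minimal sketch of that shape, assuming the usual linear warmup from zero followed by linear decay to zero. `linear_schedule_lr` is a hypothetical helper, not the trainer's code, and the exact step indexing inside the trainer may differ by one.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=30):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1, 5, 15, 30):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```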

We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

- **Template Parts:**  
  - `instruction_part = "<start_of_turn>user\n"` indicates the prefix used for the user’s prompt.  
  - `response_part = "<start_of_turn>model\n"` indicates the prefix for the assistant’s (model’s) response.

In [16]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/112165 [00:00<?, ? examples/s]

Let's verify that masking the instruction part worked! First, let's print row 10 again:

In [17]:
tokenizer.decode(trainer.train_dataset[10]["input_ids"])

'<bos><start_of_turn>user\nI have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???<end_of_turn>\n<start_of_turn>model\nCellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.<end_of_turn>\n'

Now let's print the masked out example - you should see only the answer is present:
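
A minimal, self-contained sketch of that masking follows, using a made-up vocabulary and token IDs rather than the real Gemma-3 tokenizer. It assumes the common convention that the label `-100` is ignored by the loss. `mask_to_responses` and `TOY_VOCAB` are hypothetical names, and the code is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

# Made-up vocabulary; each turn marker is treated as a single token for simplicity.
TOY_VOCAB = [
    "<pad>", "<bos>", "<start_of_turn>user\n", "<start_of_turn>model\n",
    "<end_of_turn>\n", "What", " is", " hypertension", "?",
    "High", " blood", " pressure", ".",
]
USER_MARKER, MODEL_MARKER = 2, 3


def mask_to_responses(input_ids):
    """Copy input_ids to labels, replacing everything outside model turns with -100."""
    labels, in_response = [], False
    for token_id in input_ids:
        if token_id == MODEL_MARKER:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a training target
        elif token_id == USER_MARKER:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id if in_response else IGNORE_INDEX)
    return labels


input_ids = [1, 2, 5, 6, 7, 8, 4, 3, 9, 10, 11, 12, 4]
labels = mask_to_responses(input_ids)

# Show masked positions as spaces so only the trained-on text stays visible.
visible = "".join(" " if label == IGNORE_INDEX else TOY_VOCAB[label] for label in labels)
print(repr(visible))
```

Only the assistant's answer and its closing turn marker survive in `labels`; the user's question contributes nothing to the loss.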

In [19]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
4.283 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [20]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 112,165 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,901,248/4,000,000,000 (0.37% trained)
It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 4.6602 |
| 2 | 4.7419 |
| 3 | 4.8622 |
| 4 | 5.037 |
| 5 | 4.6915 |
| 6 | 4.7268 |
| 7 | 4.8831 |
| 8 | 4.9874 |
| 9 | 4.6283 |
| 10 | 4.6924 |


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
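
To give a feel for what `temperature`, `top_k`, and `top_p` do, here is a simplified sketch that filters a toy distribution over four tokens. It is a plain-Python illustration of the general technique, not the code `model.generate` runs; `filter_distribution` and the toy logits are hypothetical.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep the k most likely tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept_mass = sum(probs[i] for i in ranked)
    ranked_probs = [(i, probs[i] / kept_mass) for i in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i, p in ranked_probs:
        nucleus.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they form a distribution to sample from.
    final_mass = sum(p for _, p in nucleus)
    return {i: p / final_mass for i, p in nucleus}


toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
```

Lower `top_k` or `top_p` values cut off more of the unlikely tokens, while a `temperature` of 1.0 leaves the model's distribution unchanged before filtering.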

In [21]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "I had an icecream and immediately i felt giddy .What is wrong with me ",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

["<bos><start_of_turn>user\nI had an icecream and immediately i felt giddy .What is wrong with me<end_of_turn>\n<start_of_turn>model\nOkay, let's break down what might be happening when you felt giddy after eating ice cream. It's a surprisingly common phenomenon, and there are several potential explanations, ranging from perfectly normal to needing a little more attention. Here's a breakdown of the possibilities, categorized by likelihood:\n\n**1. The Most Likely: Sugar Rush & Dopamine**\n\n* **Sugar High:** Ice cream is loaded with sugar.  A rapid influx of sugar into your bloodstream can cause a temporary spike in energy and a feeling of euphoria. This is due to the body releasing insulin to process the sugar, which can briefly boost dopamine levels.\n* **Dopamine Release:** Dopamine is a neurotransmitter associated with pleasure, reward, and motivation. Eating something sweet, especially something you enjoy, can trigger a release of dopamine in the brain, leading to that giddy feeli

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting the whole time!

In [23]:
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is Hypertension",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Okay, let's break down hypertension, also known as high blood pressure. It's a really important condition to understand, as it often has no symptoms and can lead to serious health problems if left unmanaged.

**1. What is Blood Pressure?**

* **Blood pressure** is the force of your blood pushing against the walls of your arteries as it circulates throughout your body.
* It’s measured with two numbers:
    * **Systolic (the top number):** This is the pressure when your heart beats (contracts).
    * **Diastolic (the bottom number):** This is the pressure when your heart rests between beats.

**2. What is Hypertension (High Blood Pressure)?**

* **Hypertension** is when your blood pressure consistently remains elevated – generally considered to be:
    * **Systolic ≥ 130 mmHg**
    * **Diastolic ≥ 80 mmHg**
*  It’s important to note that these are general guidelines, and some people may have slightly different thresholds based on individual factors and medical evaluation.

**3. Types of 

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")
# model.push_to_hub("HF_ACCOUNT/gemma-3_finetune_medical_data", token = "hf_token") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3_finetune_medical_data", token = "hf_token") # Online saving

['gemma-3/processor_config.json']