# Fine-tuning LFM2.5-1.2B on Medical Data

This notebook demonstrates how to fine-tune Liquid AI's LFM2.5-1.2B-Instruct model on medical instruction data using Unsloth. The model has 16 layers: 10 double-gated LIV convolution blocks and 6 GQA (grouped-query attention) blocks.

**Requirements:**
- GPU: T4 (free on Google Colab)
- RAM: 12GB+
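- Time: about 6 minutes of training for the 200 steps configured below (measured on a Tesla T4 in this run); installation and model export add to this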

**Important:**
1. Runtime → Change runtime type → T4 GPU
2. Run cells in order
3. This model is for educational purposes only - not medical advice!

## 1. Install Dependencies

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.57.3
!pip install --no-deps trl==0.22.2
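The Colab branch of the install cell pins an xformers build to the installed PyTorch minor version. As a standalone sketch of that selection logic (the helper name `pick_xformers` is hypothetical; the version mapping is copied from the cell above):

```python
import re

def pick_xformers(torch_version: str) -> str:
    """Return the xformers pin the install cell would choose for a torch version."""
    # Keep only "major.minor", e.g. "2.8.0+cu126" -> "2.8"
    minor = re.match(r"[0-9]+\.[0-9]+", torch_version).group(0)
    if minor == "2.9":
        return "xformers==0.0.33.post1"
    if minor == "2.8":
        return "xformers==0.0.32.post2"
    return "xformers==0.0.29.post3"  # fallback used for every other version

print(pick_xformers("2.8.0+cu126"))
```

For the Torch 2.8.0+cu126 environment reported in the model-loading banner of this notebook, this selects 0.0.32.post2, which matches the Xformers version shown in that banner.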

In [2]:
# !pip install --upgrade --no-deps --force-reinstall unsloth unsloth_zoo

## 2. Import Libraries

In [3]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Check GPU
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


🦥 Unsloth Zoo will now patch everything to make training faster!
GPU Available: True
GPU Name: Tesla T4
GPU Memory: 14.56 GB


## 3. Load Model and Tokenizer

In [4]:
# Configuration
model_name = "unsloth/LFM2.5-1.2B-Instruct" # Unsloth's copy of LiquidAI/LFM2.5-1.2B-Instruct

# Load the model in 16-bit precision (no quantization) for 16-bit LoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048, # Can go up to 32,768 for LFM2.5
    load_in_4bit = False, # Set True for 4-bit quantization to reduce memory
    load_in_8bit = False, # A bit more accurate, uses 2x memory
    load_in_16bit = True, # Enables 16bit LoRA
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...", # use one if using gated models
    # device_map = "balanced",
)

print("Model loaded successfully!")

==((====))==  Unsloth 2026.1.4: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.34G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Model loaded successfully!
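The `load_in_4bit`, `load_in_8bit` and `load_in_16bit` flags trade memory for precision. A rough back-of-the-envelope sketch of the weight footprint at each width follows. The parameter count is the total reported in the training banner later in this notebook; the estimate covers weights only and ignores activations, optimizer state, quantization metadata and CUDA overhead, so treat it as a lower bound rather than a measurement.

```python
BYTES_PER_PARAM = {16: 2.0, 8: 1.0, 4: 0.5}  # bits per weight -> bytes per weight

def weight_memory_gib(num_params: int, bits: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[bits] / 1024**3

# Total parameter count as printed in the training banner of this notebook
NUM_PARAMS = 1_181_448_960

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

At 16 bits the weights come to roughly 2.2 GiB, which is why a 1.2B model fits comfortably on a T4 without quantization.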


In [5]:
messages = [
    {"role": "system", "content": "You are a knowledgeable medical assistant."},
    {"role": "user", "content": "What are the common symptoms of diabetes?"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

from transformers import TextStreamer
_ = model.generate(
    **inputs,
    max_new_tokens = 128, # Increase for longer outputs!
    # Recommended Liquid settings!
    temperature = 0.1, top_k = 50, top_p = 0.1, repetition_penalty = 1.05,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Diabetes is a chronic condition characterized by elevated blood sugar levels due to problems with insulin production or use. Common symptoms include:

1. **Increased Thirst (Polydipsia)** – The body loses more fluids than usual, leading to frequent urination.
2. **Frequent Urination (Polyuria)** – Excess glucose in the blood draws water into the urine, increasing urine output.
3. **Unexplained Weight Loss** – Despite eating more, some people lose weight due to the body breaking down fat and muscle for energy.
4. **Extreme Fatigue** – High blood sugar can
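The generation settings used here (`temperature = 0.1`, `top_k = 50`, `top_p = 0.1`) make sampling close to greedy decoding. The sketch below illustrates why, using a simplified pure-Python version of temperature scaling followed by top-k and top-p filtering. It is not the `transformers` implementation, and the function name and toy logits are made up for illustration.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalise
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(prob for prob, _ in ranked)
    ranked = [(prob / mass, idx) for prob, idx in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens
    norm = sum(prob for prob, _ in kept)
    return {idx: prob / norm for prob, idx in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token scores
print(filter_distribution(toy_logits, temperature=0.1, top_k=50, top_p=0.1))
print(filter_distribution(toy_logits, temperature=1.0, top_k=50, top_p=0.9))
```

With the notebook's settings only the single most likely token survives the filter, while the looser settings in the second call keep three candidates.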


## 4. Configure LoRA

In [6]:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "in_proj",
                      "w1", "w2", "w3"],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable_params:,} ({100 * trainable_params / all_params:.2f}%)")

Unsloth: Making `model.base_model.model.model` require gradients
Trainable params: 11,108,352 (0.94%)
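To see where the trainable-parameter count comes from: a LoRA adapter on a linear layer adds two low-rank matrices, so its size grows linearly with the rank `r`. A minimal sketch follows; the layer shape is hypothetical and chosen only for illustration (it is not read from the model), and the totals are the counts printed in this notebook.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

# Hypothetical layer shape, for illustration only
print(lora_params(in_features=2048, out_features=2048, r=16))  # 65536

# Trainable fraction using the counts reported in this notebook's training banner
trainable, total = 11_108_352, 1_181_448_960
print(f"{100 * trainable / total:.2f}%")  # 0.94%
```

Doubling `r` doubles the adapter size for every targeted module, so the rank is the main knob for trading capacity against memory.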


## 5. Load and Prepare Dataset

In [7]:
# Load medical dataset
dataset = load_dataset("medalpaca/medical_meadow_wikidoc_patient_information", split="train")
print(f"Dataset size: {len(dataset)} samples")

# Show sample
print("\nSample data:")
print(dataset[0])

README.md: 0.00B [00:00, ?B/s]

medical_meadow_wikidoc_patient_info.json:   0%|          | 0.00/3.49M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5942 [00:00<?, ? examples/s]

Dataset size: 5942 samples

Sample data:
{'input': 'What are the symptoms of Allergy?', 'output': 'Allergy symptoms vary, but may include:\nBreathing problems (coughing, shortness of breath) Burning, tearing, or itchy eyes Conjunctivitis (red, swollen eyes) Coughing Diarrhea Headache Hives Itching of the nose, mouth, throat, skin, or any other area Runny nose Skin rashes Stomach cramps Vomiting Wheezing\nWhat part of the body is contacted by the allergen plays a role in the symptoms you develop. For example:\nAllergens that are breathed in often cause a stuffy nose, itchy nose and throat, mucus production, cough, or wheezing. Allergens that touch the eyes may cause itchy, watery, red, swollen eyes. Eating something you are allergic to can cause nausea, vomiting, abdominal pain, cramping, diarrhea, or a severe, life-threatening reaction. Allergens that touch the skin can cause a skin rash, hives, itching, blisters, or even skin peeling. Drug allergies usually involve the whole body and 

In [8]:
# Format dataset for LFM2.5
def format_instruction(example):
    # system_message = "You are a knowledgeable medical assistant providing accurate health information. Always recommend consulting healthcare professionals for medical advice."

    conversation = [
        # {"role": "system", "content": system_message},
        {"role": "user", "content": example.get('input', '')},
        {"role": "assistant", "content": example.get('output', '')}
    ]

    return {"text": conversation}

# Apply formatting
dataset = dataset.map(
    format_instruction,
    remove_columns=dataset.column_names,
    desc="Formatting dataset"
)

print("✓ Dataset formatted!")

Formatting dataset:   0%|          | 0/5942 [00:00<?, ? examples/s]

✓ Dataset formatted!


In [9]:
from unsloth.chat_templates import standardize_data_formats
dataset_fmt = standardize_data_formats(dataset)

In [10]:
dataset_fmt[0]

{'text': [{'content': 'What are the symptoms of Allergy?', 'role': 'user'},
  {'content': 'Allergy symptoms vary, but may include:\nBreathing problems (coughing, shortness of breath) Burning, tearing, or itchy eyes Conjunctivitis (red, swollen eyes) Coughing Diarrhea Headache Hives Itching of the nose, mouth, throat, skin, or any other area Runny nose Skin rashes Stomach cramps Vomiting Wheezing\nWhat part of the body is contacted by the allergen plays a role in the symptoms you develop. For example:\nAllergens that are breathed in often cause a stuffy nose, itchy nose and throat, mucus production, cough, or wheezing. Allergens that touch the eyes may cause itchy, watery, red, swollen eyes. Eating something you are allergic to can cause nausea, vomiting, abdominal pain, cramping, diarrhea, or a severe, life-threatening reaction. Allergens that touch the skin can cause a skin rash, hives, itching, blisters, or even skin peeling. Drug allergies usually involve the whole body and can lead

In [11]:
def formatting_prompts_func(examples):
    texts = tokenizer.apply_chat_template(
        examples["text"],
        tokenize = False,
        add_generation_prompt = False,
    )
    return { "text" : [x.removeprefix(tokenizer.bos_token) for x in texts] }

dataset_fmt = dataset_fmt.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/5942 [00:00<?, ? examples/s]
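The cell above turns each conversation into a single ChatML string and strips the leading BOS token. The sketch below imitates that step without a tokenizer so the shape of the result is easy to inspect. `render_chatml` is a hand-rolled approximation, not the tokenizer's real chat template, and the BOS string is an assumption; check `tokenizer.bos_token` for the actual value.

```python
def render_chatml(messages, bos_token=""):
    """Hand-rolled ChatML rendering; the real template lives in the tokenizer."""
    parts = [bos_token]
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "What are the symptoms of Allergy?"},
    {"role": "assistant", "content": "Allergy symptoms vary."},  # shortened placeholder answer
]

BOS = "<|startoftext|>"  # assumed BOS string, for illustration only
text = render_chatml(conversation, bos_token=BOS)
# Same step as formatting_prompts_func: drop the leading BOS,
# presumably so it is not added a second time during tokenization
text = text.removeprefix(BOS)
print(text)
```

Note that `str.removeprefix` requires Python 3.9 or newer.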

In [12]:
dataset_fmt[0]["text"]

'<|im_start|>user\nWhat are the symptoms of Allergy?<|im_end|>\n<|im_start|>assistant\nAllergy symptoms vary, but may include:\nBreathing problems (coughing, shortness of breath) Burning, tearing, or itchy eyes Conjunctivitis (red, swollen eyes) Coughing Diarrhea Headache Hives Itching of the nose, mouth, throat, skin, or any other area Runny nose Skin rashes Stomach cramps Vomiting Wheezing\nWhat part of the body is contacted by the allergen plays a role in the symptoms you develop. For example:\nAllergens that are breathed in often cause a stuffy nose, itchy nose and throat, mucus production, cough, or wheezing. Allergens that touch the eyes may cause itchy, watery, red, swollen eyes. Eating something you are allergic to can cause nausea, vomiting, abdominal pain, cramping, diarrhea, or a severe, life-threatening reaction. Allergens that touch the skin can cause a skin rash, hives, itching, blisters, or even skin peeling. Drug allergies usually involve the whole body and can lead to 

## 6. Train the Model

In [13]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.563 GB.
2.27 GB of memory reserved.


In [14]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_fmt,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        max_steps = 200, # Takes precedence over num_train_epochs when set
        num_train_epochs = 1, # Only used if max_steps is removed
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit", # "ademamix"
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=8):   0%|          | 0/5942 [00:00<?, ? examples/s]
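Two consequences of this configuration are worth working out by hand: the effective batch size under gradient accumulation, and the learning rate the `linear` scheduler produces at each step. The sketch below assumes a single data-parallel GPU and mirrors the usual shape of a linear warmup and decay schedule; it is an illustration, not the scheduler code the trainer runs.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=200):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

per_device_batch, grad_accum, num_gpus = 2, 4, 1  # num_gpus = 1 is an assumption
effective_batch = per_device_batch * grad_accum * num_gpus  # sequences per optimizer step
examples_seen = effective_batch * 200                       # sequences over max_steps = 200

print(effective_batch, examples_seen)
print(f"fraction of the 5,942-example dataset seen: {examples_seen / 5942:.2f}")
for step in (0, 5, 100, 200):
    print(step, linear_schedule_lr(step))
```

With these settings the run covers roughly a quarter of the dataset, which is why `num_train_epochs = 1` has no effect while `max_steps` is set.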

In [15]:
# Start training
trainer_stats = trainer.train()
print(f"\n✓ Training completed! Final loss: {trainer_stats.training_loss:.4f}")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,942 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,108,352 of 1,181,448,960 (0.94% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.9439
2,3.0991
3,4.0748
4,3.0578
5,2.9621
6,2.8913
7,2.8599
8,3.289
9,2.7001
10,2.4222



✓ Training completed! Final loss: 1.8382
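A note on reading these numbers: `trainer_stats.training_loss` is the loss averaged over all training steps, not the loss at the last step, and with `logging_steps = 1` the per-step values are noisy. The sketch below uses the ten logged values from the table above to show the difference between a mean and the latest value, plus a simple trailing average for smoothing. The helper `running_mean` is illustrative and not part of the trainer.

```python
# First ten logged losses, copied from the table above
logged = [2.9439, 3.0991, 4.0748, 3.0578, 2.9621, 2.8913, 2.8599, 3.289, 2.7001, 2.4222]

def running_mean(values, window=5):
    """Simple trailing moving average; shorter windows are used at the start."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

print(f"mean of logged steps: {sum(logged) / len(logged):.4f}")
print(f"last logged step:     {logged[-1]:.4f}")
print([round(v, 3) for v in running_mean(logged)])
```

When the loss is falling, the average sits above the most recent values, so the reported figure tends to understate how far training has progressed.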


In [16]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

367.2401 seconds used for training.
6.12 minutes used for training.
Peak reserved memory = 3.152 GB.
Peak reserved memory for training = 0.882 GB.
Peak reserved memory % of max memory = 21.644 %.
Peak reserved memory for training % of max memory = 6.056 %.
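The measured runtime can be used to estimate how long a longer run would take. The sketch below projects one full epoch from the numbers printed above, assuming the per-step time stays constant and the effective batch size is 8 (2 per device, 4 accumulation steps, 1 GPU). It is a rough extrapolation, not a benchmark.

```python
import math

train_runtime_s = 367.2401   # from the output above
steps_run = 200
dataset_size = 5942
effective_batch = 8          # 2 per device x 4 accumulation steps x 1 GPU

seconds_per_step = train_runtime_s / steps_run
steps_per_epoch = math.ceil(dataset_size / effective_batch)
epoch_minutes = steps_per_epoch * seconds_per_step / 60

print(f"{seconds_per_step:.2f} s/step")
print(f"~{epoch_minutes:.1f} minutes for one full epoch ({steps_per_epoch} steps)")
```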


## 7. Test the Fine-tuned Model

In [17]:
messages = [
    {"role": "system", "content": "You are a knowledgeable medical assistant."},
    {"role": "user", "content": "What are the common symptoms of diabetes?"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

from transformers import TextStreamer
# 1. Define the stop token
# LFM2.5 uses the ChatML format, so register <|im_end|> as the tokenizer's EOS token

tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})

# 2. Run Generation
_ = model.generate(
    **inputs,
    max_new_tokens = 256,
    temperature = 0.1,
    top_k = 50,
    top_p = 0.1,
    repetition_penalty = 1.05,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Symptoms of diabetes include:
Frequent urination Frequent thirst Increased hunger Increased fatigue Weight loss<|im_end|>


## 8. Save the Fine-tuned Model

In [20]:
# Save LoRA adapters (small size)
model.save_pretrained("lfm25_medical_lora")
tokenizer.save_pretrained("lfm25_medical_lora")
print("✓ LoRA adapters saved!")

# Merge and save 16-bit model
model.save_pretrained_merged("lfm25_medical_merged", tokenizer, save_method="merged_16bit")
print("✓ Merged model saved!")

# Export to GGUF for llama.cpp
model.save_pretrained_gguf("lfm25_medical_gguf", tokenizer, quantization_method="q4_k_m")
print("✓ GGUF model saved!")

✓ LoRA adapters saved!
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `lfm25_medical_merged`: 100%|██████████| 1/1 [00:01<00:00,  1.38s/it]


Successfully copied all 1 files from cache to `lfm25_medical_merged`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 9554.22it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:15<00:00, 15.96s/it]


Unsloth: Merge process complete. Saved to `/kaggle/working/lfm25_medical_merged`
✓ Merged model saved!
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `lfm25_medical_gguf`: 100%|██████████| 1/1 [00:01<00:00,  1.29s/it]


Successfully copied all 1 files from cache to `lfm25_medical_gguf`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 9845.78it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:15<00:00, 15.73s/it]


Unsloth: Merge process complete. Saved to `/kaggle/working/lfm25_medical_gguf`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['LFM2.5-1.2B-Instruct.F16.

## 9. Upload to Hugging Face (Optional)

In [22]:
# Log in to the Hugging Face Hub, then uncomment the push_to_hub lines in the next cell to upload
from huggingface_hub import notebook_login
notebook_login()


In [21]:
# model.push_to_hub("your-username/lfm25-medical-1.2b", token=True)
# tokenizer.push_to_hub("your-username/lfm25-medical-1.2b", token=True)

## 🎉 Congratulations!

You've successfully fine-tuned LFM2.5 on medical data!

**Next Steps:**
1. Train for more steps (increase `max_steps` or use `num_train_epochs`)
2. Experiment with different hyperparameters
3. Try different medical datasets
4. Deploy your model using vLLM or llama.cpp
5. Share your model on Hugging Face Hub

**Important Reminder:**
This model is for educational purposes only. Always consult healthcare professionals for medical advice.

**Resources:**
- [LFM2.5 Documentation](https://docs.liquid.ai/lfm)
- [Unsloth Documentation](https://docs.unsloth.ai/)
- [GitHub Repository](https://github.com/yourusername/lfm25-medical-finetuning)