# Fine-tuning Phi-4 for Medical Reasoning: From Traditional to Reasoning SLM

## Overview

This notebook demonstrates how to turn Microsoft's Phi-4, a traditional small language model (SLM), into a reasoning-capable model for medical applications. We'll use the `unsloth` library to efficiently fine-tune the model on the `medical-o1-reasoning-SFT` dataset, teaching it to "think" through medical problems step by step before providing answers.

### What You'll Learn

1. **Small Language Models (SLMs)** and their advantages
2. **Reasoning vs Non-Reasoning Models** and their differences  
3. **Model Quantization** for memory-efficient training
4. **Parameter Efficient Fine-Tuning (PEFT)** with LoRA
5. **Supervised Fine-Tuning (SFT)** techniques
6. **Model deployment** with interactive interfaces

### Prerequisites

- Basic Python programming knowledge
- Understanding of machine learning concepts
- Access to a GPU with at least 16 GB of VRAM (recommended)

---

![](./assets/finetuning.png)

## 1. Understanding Small Language Models (SLMs)

### What are Small Language Models?

Small Language Models (SLMs) are compact language models, typically containing **1-15 billion parameters**, compared with the **100+ billion parameters** of the largest models (exact sizes for proprietary models such as GPT-4 or Claude are not public). Despite their smaller size, they offer several advantages:

**Key Advantages of SLMs:**
- 🚀 **Speed**: Faster inference times due to fewer parameters
- 💾 **Memory Efficiency**: Lower VRAM requirements (can run on consumer GPUs)
- ⚡ **Cost-Effective**: Cheaper to train and deploy
- 🎯 **Task-Specific**: Can be highly optimized for specific domains
- 🏠 **On-Device Deployment**: Can run locally without internet connectivity

![](./assets/llm-vs-slm.png)

### Microsoft Phi-4: Our Base Model

**Phi-4** is a small language model from Microsoft with:
- **~14.7 billion parameters** (often rounded to 14B)
- **BF16 precision** (~25GB when fully loaded)
- **High performance** on reasoning benchmarks
- **Optimized architecture** for efficiency

![](./assets/phi-4.png)

---

## 2. Reasoning vs Non-Reasoning Models

### Traditional (Non-Reasoning) Models

Most language models, including the base Phi-4, are **non-reasoning models**. They:
- Generate responses **immediately** based on patterns learned during training
- **Don't show their work** or explain their thought process
- Can make mistakes due to **lack of deliberation**
- Are **fast** but may lack depth in complex problem-solving

### Reasoning Models (Like OpenAI's o1)

Reasoning models introduce a **"thinking" phase** before responding:
- **Chain-of-Thought**: Step-by-step problem breakdown
- **Self-reflection**: Ability to reconsider and correct reasoning
- **Explicit reasoning**: Show intermediate steps
- **Higher accuracy** on complex problems, especially in STEM and medical domains

![](./assets/regular-vs-reasoning.png)

### The Medical Reasoning Challenge

Medical diagnosis requires:
1. **Symptom analysis**: Understanding patient presentation
2. **Differential diagnosis**: Considering multiple possibilities  
3. **Evidence weighing**: Balancing different clinical indicators
4. **Systematic thinking**: Following medical reasoning protocols


### Our Goal: Transform Phi-4

We'll teach Phi-4 to:
- **Think step-by-step** using `<think>` tags
- **Show medical reasoning** process explicitly
- **Provide final answers** after deliberation
- **Handle complex medical scenarios** accurately

---

## 3. Setting Up the Environment

### What is Unsloth?

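[**Unsloth**](https://docs.unsloth.ai/) is an open-source library for fine-tuning large language models. Its authors report roughly **2x faster** training and about **30% less memory** use than a standard setup; actual gains depend on the model and hardware.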

### Key features:

- 🚀 **Speed optimization**: Optimized kernels for training
- 💾 **Memory efficiency**: Advanced memory management
- 🔧 **Easy-to-use**: Simplified APIs for complex operations
- 🎯 **PEFT support**: Built-in LoRA and other efficient fine-tuning methods
- 📊 **Multi-format support**: Works with various model architectures

### Installation

Let's install the Unsloth library to get started:

In [None]:
%%capture

!pip install unsloth
!pip install transformers==4.55.4   # Installing this version of transformers as it is compatible


## 4. Model Quantization and Loading

### Understanding Quantization

**Quantization** reduces model memory usage by using lower-precision number formats:

- **BF16** (16-bit): Original Phi-4 format (~25GB)
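- **4-bit (NF4)**: ~75% fewer bytes for the weights themselves (~7.4 GB for Phi-4); roughly 10 GB in practice once quantization metadata and layers kept in higher precision are included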

**4-bit Quantization Benefits:**
- 🎯 **Memory Efficient**: Fits larger models on consumer GPUs
- ⚡ **Faster Loading**: Reduced data transfer
- 🎪 **Maintained Quality**: Minimal performance degradation with modern techniques
- 🏗️ **Training Compatible**: Can fine-tune quantized models
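
To make the idea concrete, here is a minimal sketch of blockwise 4-bit quantization using a simple absmax scheme with 16 evenly spaced levels. It is an illustration only: the NF4 format used for QLoRA-style loading spaces its 16 levels differently, and the function names and sample weights below are hypothetical.

```python
def quantize_block(values, levels=16):
    """Map floats to integer codes 0..levels-1 using one absmax scale per block."""
    scale = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    half = (levels - 1) / 2                       # 7.5 for 16 levels
    codes = [round((v / scale) * half + half) for v in values]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Invert quantize_block up to rounding error."""
    half = (levels - 1) / 2
    return [((c - half) / half) * scale for c in codes]


# Hypothetical block of weights; real blocks typically hold 64 values.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.00]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale}, max reconstruction error={max_error:.4f}")
```

Each weight is stored as a 4-bit code plus a share of the block's scale, which is where both the memory saving and the small reconstruction error come from.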


### Loading Phi-4 with Unsloth

We'll load the Phi-4 model with the following configuration:
- **4-bit quantization** for memory efficiency
- **2048 token context** length  
- **RoPE scaling** support for longer sequences

In [None]:
from unsloth import FastLanguageModel 
import torch
max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-4",                         # Use this to load the base Phi-4 model from Unsloth
    # model_name = "/kaggle/input/phi-4-base/transformers/default/1/phi-4-base",   # This is the same base Phi-4 model, downloaded before hand, to save time during the demo
    max_seq_length = max_seq_length,
    load_in_4bit = load_in_4bit,
)

| Model Version               | Precision / Quantization            | Parameter Count | Storage per Weight       | Expected Size (naive) | Actual Size (reported) | Why the Difference                                                                                     |
| --------------------------- | ----------------------------------- | --------------- | ------------------------ | --------------------- | ---------------------- | ------------------------------------------------------------------------------------------------------ |
| **`microsoft/phi-4`**       | BF16 (16-bit)                       | 14.7B           | 2 bytes                  | \~29.4 GB             | \~25 GB                | Safetensors stores tensors uncompressed, so the gap likely reflects units (29.4 GB ≈ 27.4 GiB) and rounding in the reported figure |
| **`unsloth/phi-4`** | 4-bit quantized (NF4 / QLoRA style) | 14.7B           | 0.5 bytes (weights only) | \~7.35 GB             | \~10.39 GB             | Extra storage for quantization scales/offsets, some weights left in higher precision, padding/overhead |
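
The "Expected Size (naive)" column can be reproduced with a few lines of arithmetic. The sketch below also estimates the overhead of per-block scales under one stated assumption (a 32-bit scale per block of 64 weights); the helper name is hypothetical, and the numbers cover weights only, not activations or optimizer state.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PHI4_PARAMS = 14.7e9  # parameter count taken from the table above

bf16_gb = weight_memory_gb(PHI4_PARAMS, 16)
nf4_gb = weight_memory_gb(PHI4_PARAMS, 4)

# Assumption: one 32-bit scale per block of 64 weights, which adds
# 32 / 64 = 0.5 bits per weight on top of the 4-bit codes.
nf4_with_scales_gb = weight_memory_gb(PHI4_PARAMS, 4 + 32 / 64)

print(f"BF16: {bf16_gb:.2f} GB")
print(f"4-bit codes only: {nf4_gb:.2f} GB")
print(f"4-bit codes + block scales: {nf4_with_scales_gb:.2f} GB")
```

Block scales alone do not account for the full reported size; the remainder is consistent with the table's note that some weights stay in higher precision.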

## 5. Parameter Efficient Fine-Tuning (PEFT) with LoRA

### What is PEFT?

**Parameter Efficient Fine-Tuning** allows us to adapt large models while training only a small fraction of parameters:

- **Traditional Fine-tuning**: Updates all 14B parameters of Phi-4
- **PEFT**: Updates only ~0.1-1% of parameters (millions vs billions)
- **Benefits**: Lower memory, faster training, less overfitting risk

### LoRA (Low-Rank Adaptation)

**LoRA** is a widely used PEFT technique that:
- **Freezes** original model weights
- **Adds** small trainable matrices (rank decomposition)
- **Approximates** weight updates with low-rank matrices
- **Maintains** model performance with far fewer parameters

### LoRA Mathematics (Simplified)

```
Original: W' = W + ΔW (full update)
LoRA: W' = W + A×B (low-rank approximation)
```

Where:
- **W**: Original frozen weights
- **A, B**: Small trainable matrices of shape $d \times r$ and $r \times k$, where the rank $r$ is much smaller than $d$ and $k$
- **A×B**: Approximates the full update ΔW
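
As a rough illustration of the update rule, the sketch below builds the adapted weight from tiny hand-written matrices in plain Python. It includes the `alpha / r` scaling from the LoRA paper, which the simplified formula above leaves out. All names and values are hypothetical, and real implementations keep `A` and `B` separate rather than materializing `W'`.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (A x B) without modifying W."""
    delta = matmul(A, B)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


d, k, r, alpha = 4, 3, 2, 2
W = [[0.1 * (i + j) for j in range(k)] for i in range(d)]  # frozen weights, d x k
A = [[0.5, -0.5] for _ in range(d)]                        # trainable, d x r
B_zero = [[0.0] * k for _ in range(r)]                     # trainable, r x k, zero init
B_trained = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]             # pretend values after training

# With one factor initialised to zero, the adapted weight starts out equal to W.
unchanged = lora_effective_weight(W, A, B_zero, alpha, r)
W_adapted = lora_effective_weight(W, A, B_trained, alpha, r)

print(unchanged == W)
print(W_adapted[0])
```

Starting one factor at zero means fine-tuning begins from exactly the pretrained behaviour and only drifts away as `A` and `B` are trained.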

![](./assets/lora.png)

For a weight matrix W of size $d \times k$ with $d = k = 4096$, here's how LoRA (with rank $r = 16$) drastically reduces the number of trainable parameters:

| Method               | Trainable Parameters Formula                      | Example (d=4096, k=4096, r=16)        | Result         |
| -------------------- | ------------------------------------------------- | ------------------------------------- | -------------- |
| **Full Fine-tuning** | $d \times k$                                      | $4096 \times 4096$                    | **16.8M**      |
| **LoRA**             | $(d \times r) + (r \times k)$                     | $(4096 \times 16) + (16 \times 4096)$ | **0.13M**      |

**Reduction Factor** = $\dfrac{d \times k}{(d \times r) + (r \times k)} = \dfrac{16.8M}{0.13M}$ = **128× fewer** 
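
The same arithmetic can be checked in code, which also makes it easy to see how the saving changes with the rank. This is a small sketch for a single weight matrix; the function names are illustrative.

```python
def full_finetune_params(d, k):
    """Trainable values when the whole d x k matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable values in the two LoRA factors (d x r and r x k)."""
    return d * r + r * k


d = k = 4096
for r in (8, 16, 64):
    full = full_finetune_params(d, k)
    lora = lora_params(d, k, r)
    print(f"r={r:>3}: LoRA trains {lora:,} of {full:,} values ({full / lora:.0f}x fewer)")
```

Doubling the rank doubles the number of trainable values, so the reduction factor halves.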


### Our LoRA Configuration

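- **`r=16`**: Rank of the adaptation matrices (higher = more expressive, but more trainable parameters)
- **`lora_alpha=16`**: Scaling factor for LoRA updates; the update is scaled by `lora_alpha / r`, which is 1.0 here
- **Target modules**: The attention and MLP projection layers we'll adapt
- **`lora_dropout=0`**: No dropout; this is the setting Unsloth's optimized path is built for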

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # Per Unsloth, the "unsloth" mode uses about 30% less VRAM and fits larger batch sizes
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

## 6. The Medical O1-Reasoning Dataset

### Dataset Overview

The [**medical-o1-reasoning-SFT**](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT) dataset contains medical questions with:
- **Question**: Real medical scenarios and case studies
- **Complex_CoT**: Detailed chain-of-thought reasoning process  
- **Response**: Final medical answers and recommendations


![image.png](./assets/dataset.png)

### Dataset Structure

Each example contains:
1. **Medical Question**: Patient presentation, symptoms, history
2. **Reasoning Process**: Step-by-step medical thinking
3. **Final Answer**: Diagnosis, treatment, or medical advice

### Why This Dataset?

- 🏥 **Medical Domain**: Specialized medical knowledge
- 🧠 **Reasoning Focus**: Emphasizes thinking process

### Loading the Dataset

Let's load the English portion of the medical reasoning dataset:

In [None]:
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", 'en', split = "train")

## 7. Data Preprocessing and Conversation Formatting

### Converting to Conversation Format

To train our model effectively, we need to convert the dataset into a **conversation format** that:
- Follows **chat templates** (user/assistant pairs)
- Includes **reasoning tokens** (`<think>` tags)
- Maintains **medical context** and structure

### The Reasoning Pattern

Our target format:
```
User: [Medical Question]
Assistant: <think>[Medical Reasoning]</think>
[Final Medical Answer]
```

This teaches the model to:
1. **Think first** before responding
2. **Show medical reasoning** explicitly  
3. **Provide clear answers** after deliberation

### Data Transformation Process

We'll transform each dataset example:
- **Question** → User message
- **Complex_CoT** → Reasoning within `<think>` tags
- **Response** → Final assistant response
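
The mapping above can be sketched on a single record before applying it to the whole dataset. The record below is a hypothetical example that only reuses the dataset's three column names; it is not taken from the dataset.

```python
def to_conversation(question, reasoning, answer):
    """Build one user/assistant pair with the reasoning wrapped in <think> tags."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"<think>{reasoning}</think>\n{answer}"},
    ]


# Hypothetical record using the dataset's column names.
record = {
    "Question": "Which vitamin deficiency causes scurvy?",
    "Complex_CoT": "Scurvy involves impaired collagen synthesis, which depends on vitamin C.",
    "Response": "Vitamin C (ascorbic acid) deficiency.",
}

conversation = to_conversation(
    record["Question"], record["Complex_CoT"], record["Response"]
)
for message in conversation:
    print(f"{message['role']}: {message['content']}")
```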

Let's examine our dataset structure first:

In [None]:
dataset

Now let's convert our dataset to the conversation format:

In [None]:
def convert_to_conversations(examples):
    questions = examples["Question"]
    reasonings = examples["Complex_CoT"]
    answers = examples["Response"]

    conversations = []
    for question, reasoning, answer in zip(questions, reasonings, answers):
        conversation = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": f"<think>{reasoning}</think>\n{answer}"}
        ]
        conversations.append(conversation)
    return {
        "conversations": conversations,
    }


dataset = dataset.map(
    convert_to_conversations,
    batched=True,
)
print(dataset)

## 8. Chat Templates and Tokenization

### What are Chat Templates?

**Chat templates** define how conversations are formatted for different models:
- **Special tokens**: Mark user/assistant boundaries
- **Consistent formatting**: Ensures proper model understanding
- **Model-specific**: Each model family has its own template

### Phi-4 Chat Template

Phi-4 uses specific tokens:
- `<|im_start|>user<|im_sep|>`: Start of user message
- `<|im_start|>assistant<|im_sep|>`: Start of assistant response
- `<|im_end|>`: End of message
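
To see how these tokens fit together, here is a hand-rolled approximation of the layout in plain Python. It is only a sketch built from the three markers listed above: the function name is hypothetical, and the real template applied by `tokenizer.apply_chat_template` may differ in details such as system-prompt handling.

```python
def render_phi4_style(messages, add_generation_prompt=False):
    """Concatenate messages using the marker tokens listed above."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}<|im_sep|>{message['content']}<|im_end|>"
    if add_generation_prompt:
        text += "<|im_start|>assistant<|im_sep|>"  # cue the model to start its reply
    return text


messages = [{"role": "user", "content": "What causes scurvy?"}]
rendered = render_phi4_style(messages, add_generation_prompt=True)
print(rendered)
```

The trailing assistant marker is what `add_generation_prompt=True` contributes at inference time; during training it is left off because the assistant turn is already part of the text.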

### Tokenization Process

We'll:
1. **Apply chat template** to format conversations
2. **Tokenize text** into model-readable format
3. **Add special tokens** for training
4. **Prepare data** for the training pipeline

Let's set up the chat template and tokenize our dataset:

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-4",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize = False, add_generation_prompt = False
        )
        for convo in convos
    ]
    return { "text" : texts, }


dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)
print(dataset)

### Examining the Processed Data

Let's look at how our data has been transformed:

In [None]:
import json
print(json.dumps(dataset["conversations"][5], indent=4))

In [None]:
dataset["text"][5]

## 9. Pre-Fine-Tuning Baseline Test

### Testing the Original Model

Before fine-tuning, let's test how the original Phi-4 model performs on medical reasoning tasks. This will give us a **baseline** to compare against after training.

### Expected vs Actual Behavior

**Expected (after fine-tuning)**: 
- Step-by-step reasoning in `<think>` tags
- Medical knowledge application
- Systematic differential diagnosis

**Current (before fine-tuning)**:
- Direct response without reasoning
- May lack medical depth
- No explicit thought process

Let's define our test case:

In [None]:
TEST_SAMPLE = [
    {
        "role": "user",
        "content": "A 50-year-old man presents with progressive decrease in visual acuity and excessive sensitivity to light over a period of 6 months. \
During a slit lamp examination, discrete brown deposits on the corneal epithelium are observed in both eyes. \
Considering his long history of schizophrenia managed with a single antipsychotic drug, which medication would likely cause these corneal deposits?"
    }
]

Here's what we want our fine-tuned model to produce - a detailed reasoning process followed by a clear answer:

In [None]:
SAMPLE_REASONING = "So, we've got this 50-year-old guy who's having trouble with his vision getting worse and he's really sensitive to light now. \
This has been gradually happening over six months, which is pretty significant. Then, during a slit lamp exam, they find these unusual brown spots in his cornea, kind of like deposits. Interesting.\
\n\n\
Now, he’s been dealing with schizophrenia for a long time and has been on a single antipsychotic this whole time. \
That’s a big clue because not all antipsychotics affect the eyes in this way. But I remember that some specific medications do have unusual side effects like this.\
\n\n\
Okay, let me think. Which antipsychotic could cause brown deposits on the cornea? It's not something very common with all of them. Oh, right! Chlorpromazine. \
I've read that Chlorpromazine, which is a first-generation antipsychotic, can cause these kinds of eye changes, especially with long-term use. \
Things like brown or golden deposits on the cornea and even sensitivity to light.\
\n\n\
Chlorpromazine is kind of the textbook example for this. So, when you're talking about eye issues like these with a history of schizophrenia and antipsychotic use, this particular drug stands out. \
It’s like everything he's experiencing fits neatly into the expected side effects of Chlorpromazine.\
\n\n\
Given all that, it makes sense that Chlorpromazine is probably the medication that's causing these visual symptoms. \
I can't think of any other antipsychotic that has quite the same eye-related effects. Yeah, I’m pretty confident this makes sense."





SAMPLE_ANSWER = "The medication likely causing the brown corneal deposits in this patient is Chlorpromazine. \
Chlorpromazine, a first-generation antipsychotic, is known for its potential side effects involving the eyes, especially with long-term use. \
These side effects can include the development of brown or golden deposits on the cornea and increased sensitivity to light, both of which align with the symptoms and findings observed in this patient."

### Running Baseline Inference

Let's see how the original model responds to our medical question:

In [None]:
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-4",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer.apply_chat_template(
    TEST_SAMPLE,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")


text_streamer = TextStreamer(tokenizer, skip_prompt = True)

outputs = model.generate(
    input_ids = inputs, streamer = text_streamer, max_new_tokens = 1000,
    use_cache = True, temperature = 1.5, min_p = 0.1
)

In [None]:
response = tokenizer.batch_decode(outputs)
print(response[0].split("<|im_start|>assistant<|im_sep|>")[1])

## 10. Supervised Fine-Tuning (SFT) Setup

### What is Supervised Fine-Tuning?

**Supervised Fine-Tuning (SFT)** is a training method where:
- Model learns from **input-output pairs**
- **Supervised learning**: We provide the "correct" responses
- **Task-specific**: Adapts model for specific domains/tasks
- **Behavior shaping**: Teaches new response patterns

### SFT vs Other Training Methods

| Method | Data Type | Purpose | Use Case |
|--------|-----------|---------|----------|
| **Pre-training** | Raw text | General language understanding | Foundation model |
| **SFT** | Input-output pairs | Task-specific behavior | Medical reasoning |
| **RLHF** | Human preferences | Alignment & safety | Helpful assistant |

### Our Training Configuration

**Memory & Performance:**
- `per_device_train_batch_size=2`: Small batches for memory efficiency
- `gradient_accumulation_steps=4`: Effective batch size of 8 (2 × 4)
- `adamw_8bit`: Memory-optimized optimizer

**Learning Parameters:**
- `learning_rate=2e-4`: Moderate learning rate for stability
- `warmup_steps=5`: Gradual learning rate increase
- `max_steps=20`: Short training for demonstration (increase for production)

**Optimization:**
- `weight_decay=0.01`: Regularization to prevent overfitting
- `lr_scheduler_type="linear"`: Learning rate decays linearly after warmup
- `packing=False`: Process sequences individually
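
The numbers in this configuration can be worked through by hand. The sketch below computes the effective batch size and the general shape of a linear warmup-then-decay schedule; the helper names are hypothetical, and the trainer's exact step indexing may differ slightly from this simplified version.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
print(f"Effective batch size: {batch}")
print(f"Examples seen in 20 steps: {batch * 20}")
for step in (0, 1, 5, 10, 20):
    print(f"step {step:>2}: lr = {linear_lr(step, 2e-4, 5, 20):.2e}")
```

With `max_steps=20` the model sees only a small slice of the dataset, which is why the configuration is described as a demonstration run.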

*[Image Placeholder: Training process diagram showing Input → Model → Output → Loss → Backpropagation]*

In [None]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False, # Setting True can speed up training on short sequences (up to 5x, per Unsloth)
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 20,         # Currently, using this to show the training process quickly
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" (or another tracker) to log metrics
    ),
)

### Response-Only Training

**Response-only training** is a crucial technique where:
- **Loss calculated only on assistant responses**
- **User messages ignored** during loss computation  
- **Prevents overfitting** to user input patterns
- **Focuses learning** on generating better responses

This is important because:
- We don't want the model to "memorize" user questions
- We want it to learn how to respond appropriately
- It improves generalization to new questions
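
Conceptually, response-only training replaces the labels of every non-response token with `-100`, the value the loss function ignores. The sketch below shows that masking on toy integer ids. It is a simplified stand-in, not Unsloth's implementation, and the marker ids and helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def find_sublist(seq, pattern, start=0):
    """Index of the first occurrence of pattern in seq at or after start, or -1."""
    for i in range(start, len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sublist(input_ids, response_ids, pos)
        if start == -1:
            break
        start += len(response_ids)              # first token of the reply
        end = find_sublist(input_ids, instruction_ids, start)
        if end == -1:
            end = len(input_ids)                # reply runs to the end of the sequence
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels


# Toy ids: 1 = user marker, 2 = assistant marker, 9 = end-of-message token.
input_ids = [1, 11, 12, 9, 2, 21, 22, 9]
labels = mask_non_responses(input_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

Only the assistant's tokens keep their ids as labels, so gradients come from the reasoning and answer rather than from reproducing the question.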

Let's configure response-only training:

In [None]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user<|im_sep|>",
    response_part="<|im_start|>assistant<|im_sep|>",
)

### Examining Training Data Processing

Let's inspect how our data looks after processing and verify the response-only training setup:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

## 11. Training Process

### Starting Fine-Tuning

Now we'll begin the actual fine-tuning process. The training will:
- **Update LoRA weights** based on medical reasoning examples
- **Learn reasoning patterns** from the dataset
- **Adapt to medical terminology** and concepts
- **Develop step-by-step thinking** abilities

**Training Progress Monitoring:**
- **Loss values**: Should generally decrease over time
- **Learning rate**: Will follow the linear schedule
- **Memory usage**: Monitor GPU utilization
- **Training speed**: Unsloth optimizations in action

Let's start training:

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## 12. Post-Training Evaluation

### Testing Our Fine-Tuned Model

Now let's test our fine-tuned model on the same medical question to see the improvement:

**Expected Improvements:**
- ✅ **Reasoning Process**: Should show `<think>` tags with medical reasoning
- ✅ **Medical Knowledge**: Better application of medical concepts
- ✅ **Systematic Approach**: Step-by-step diagnostic thinking
- ✅ **Clear Conclusions**: Well-reasoned final answers

**Comparison Points:**
- **Before fine-tuning**: Direct answers without reasoning
- **After fine-tuning**: Structured thinking → clear answers
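
One simple way to compare the two outputs is to check whether a response follows the target format at all. The sketch below uses a regular expression for that; the function name and the two sample strings are hypothetical, and a format check says nothing about whether the medical content is correct.

```python
import re

# A non-empty <think> block followed by a non-empty answer.
REASONING_PATTERN = re.compile(r"^\s*<think>(.+?)</think>\s*(\S.*)$", re.DOTALL)


def has_reasoning_format(text):
    """True if text is a <think>...</think> block followed by an answer."""
    return REASONING_PATTERN.match(text) is not None


# Hypothetical model outputs for illustration.
before = "The most likely medication is chlorpromazine."
after = "<think>Brown corneal deposits suggest a phenothiazine.</think>\nChlorpromazine."

print(has_reasoning_format(before), has_reasoning_format(after))
```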

Let's test the fine-tuned model:

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-4",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer.apply_chat_template(
    TEST_SAMPLE,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)

outputs = model.generate(
    input_ids = inputs, streamer = text_streamer, max_new_tokens = 1000,
    use_cache = True, temperature = 1.5, min_p = 0.1
)

## 13. Model Saving and Deployment

### Saving the Fine-Tuned Model

After successful training, we need to save our model for:
- **Future use**: Load the model later without retraining
- **Deployment**: Use in production applications
- **Sharing**: Distribute to other researchers/developers
- **Backup**: Preserve our training results

**What Gets Saved:**
- **Model weights**: The fine-tuned LoRA adapters
- **Tokenizer**: With chat template configuration
- **Configuration**: Model architecture and settings

Let's save our fine-tuned model:

In [None]:
model.save_pretrained("phi-4-medical-reasoning")
tokenizer.save_pretrained("phi-4-medical-reasoning")

In [None]:
!zip -r phi-4-medical-reasoning.zip phi-4-medical-reasoning/

In [None]:
!unzip -o phi-4-medical-reasoning.zip  # Only needed when restoring in a fresh session; -o overwrites without prompting

### [Optional] Save to Hugging Face Hub

You need to be logged in to push to the Hugging Face Hub. If you are not, uncomment the two login lines below before running the cell.

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

# Push the LoRA model and tokenizer to HuggingFace Hub
model.push_to_hub(
    "tezansahu/phi-4-medical-reasoning",  # Replace "tezansahu" with your HuggingFace username
    token=True,
    private=False,  # Set to True if you want a private repository
    commit_message="Fine-tuned Phi-4 for medical reasoning with LoRA"
)

tokenizer.push_to_hub(
    "tezansahu/phi-4-medical-reasoning",  # Replace "tezansahu" with your HuggingFace username
    token=True,
    private=False,
    commit_message="Tokenizer for Phi-4 medical reasoning model"
)

print("Model and tokenizer successfully pushed to HuggingFace Hub!")

### Load the (Saved) Fine-Tuned Model

In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "tezansahu/phi-4-medical-reasoning",  # This is the model finetuned on the dataset for 1 epoch
    model_name = "/kaggle/input/phi-4-medical-reasoning/transformers/default/1/phi-4-medical-reasoning", # Load from the unzipped directory
    max_seq_length = 2048, # Use the same max_seq_length as during training
    load_in_4bit = True,   # Use the same quantization as during training
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

## 14. Interactive Medical Reasoning Interface

### Building a User-Friendly Interface

To make our medical reasoning model accessible, we'll create an interactive interface using **Gradio**:

**Interface Features:**
- 🩺 **Medical Question Input**: Text box for medical queries
- 🧠 **Reasoning Display**: Shows the model's thought process
- 💡 **Final Answer**: Clear medical conclusions
- ⚡ **Streaming Response**: Real-time response generation

### Streaming Inference

Our interface implements **streaming inference** to:
- **Show thinking in real-time**: Users see the reasoning as it develops
- **Improve user experience**: No waiting for complete responses
- **Handle long reasoning**: Medical cases can have extensive thought processes
- **Interactive feel**: More engaging than batch processing
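
The streaming pattern boils down to a background thread that produces text while the main thread consumes it chunk by chunk. The sketch below imitates that with the standard library only; `fake_generate` is a hypothetical stand-in for the model's generation call, and the chunks are made up.

```python
import queue
import threading


def fake_generate(chunks, out_queue):
    """Stand-in for model generation: pushes text chunks, then a sentinel."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(None)  # signals that generation has finished


def stream_chunks(chunks):
    """Yield chunks as the background thread produces them."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out_queue))
    worker.start()
    while True:
        chunk = out_queue.get()   # blocks until the next chunk is ready
        if chunk is None:
            break
        yield chunk
    worker.join()


received = list(
    stream_chunks(["<think>", "Weigh the ", "findings.", "</think>", "Answer."])
)
print("".join(received))
```

Running generation in its own thread is what lets the interface update while the model is still producing tokens.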

### Medical Reasoning UI Components

1. **Thinking Accordion**: Collapsible section showing `<think>` content
2. **Answer Box**: Final medical advice and conclusions  
3. **Streaming Logic**: Separates reasoning from final answers
4. **Medical Context**: Optimized for medical question formats
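
The separation of reasoning from the final answer can be sketched as a small pure function that is re-run on the accumulated text after every chunk. This is one possible approach rather than the only one; the function name and sample strings are hypothetical.

```python
def split_reasoning(text, end_token="<|im_end|>"):
    """Split generated text into (reasoning, answer) around the <think> tags."""
    text = text.replace(end_token, "")
    before, open_tag, rest = text.partition("<think>")
    if not open_tag:
        return "", before.strip()             # no reasoning block at all
    reasoning, _, answer = rest.partition("</think>")
    return reasoning.strip(), answer.strip()  # answer stays "" until </think> arrives


# Hypothetical partial and complete outputs.
partial = "<think>ST elevation in all leads points away from"
complete = "<think>Pleuritic pain after a viral illness.</think>\nAcute pericarditis.<|im_end|>"

print(split_reasoning(partial))
print(split_reasoning(complete))
```

Because the function works on the full text so far, it does not matter whether a tag arrives in its own chunk, together with other text, or split across chunks.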


### Sample Questions to Ask

- What drug is known to reduce alcohol cravings and decrease the likelihood of resumed heavy drinking in alcoholics after completing a detoxification program?

- A 21-year-old college student faints while giving a presentation in class. Friends say he looked pale and sweaty before collapsing. He regained consciousness within a minute and felt weak but alert. He denies chest pain but recalls skipping breakfast that morning. He drinks a lot of energy drinks during exams and recently started going to the gym. His ECG shows a fast irregular heartbeat. What is the most likely diagnosis?

- A 29-year-old startup founder had a bad flu two weeks ago. Now he has sharp chest pain that worsens when lying flat or taking deep breaths, but eases when sitting up and leaning forward. He gets worried it might be a heart attack. On exam, there’s a scratchy sound over the chest. ECG shows diffuse ST elevation and PR depression. What is the most likely diagnosis? What is the first-line treatment?

- A 52-year-old IT manager comes for a routine check-up. He feels fine but says he often gets mild headaches after long workdays and sometimes feels “pressure” in his head when climbing stairs. He has a BMI 31. On examination, his blood pressure is 168/102 mmHg. Fundoscopy shows early retinal changes, but his kidney function is normal. What is the most likely diagnosis?


### Creating the Interactive Interface

In [None]:
import gradio as gr
import torch
from transformers import TextIteratorStreamer
import threading

def stream_inference(prompt, max_new_tokens=1024):
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
    generation_kwargs = dict(
        input_ids=inputs,
        max_new_tokens=max_new_tokens,
        temperature=1.2,
        streamer=streamer,
        use_cache=True,
    )

    thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Re-split the accumulated output on every chunk. This keeps text that
    # arrives in the same chunk as a tag and tolerates tags split across chunks.
    full_text = ""
    for new_text in streamer:
        full_text += new_text
        text = full_text.replace("<|im_end|>", "")  # hide the end-of-message token
        before, open_tag, rest = text.partition("<think>")
        if open_tag:
            thinking_text, _, answer_text = rest.partition("</think>")
        else:
            thinking_text, answer_text = "", before
        yield (
            gr.update(value=thinking_text.strip(), visible=True),
            gr.update(value=answer_text.strip()),
        )

    thread.join()  # make sure generation has finished before returning


with gr.Blocks() as demo:
    gr.Markdown("### 🧠 Fine-Tuned Phi-4 Model for Medical Reasoning")

    with gr.Row():
        with gr.Column():
            user_input = gr.Textbox(label="Your Question")
            submit = gr.Button("Ask")
        with gr.Column():
            with gr.Accordion("Model is thinking...", open=True) as acc:
                thought_stream = gr.Textbox(
                    label="Chain of Thought", value="", interactive=False, visible=True
                )
            answer_box = gr.Textbox(label="Final Answer", value="", interactive=False)

    submit.click(
        stream_inference,
        inputs=[user_input],
        outputs=[thought_stream, answer_box]
    )

demo.launch()


## 15. Conclusion and Next Steps

### What We Accomplished

In this notebook, we successfully:

✅ **Transformed a Traditional SLM**: Converted Microsoft's Phi-4 from a standard response model to a reasoning-capable medical assistant

✅ **Implemented Advanced Techniques**:
- **4-bit quantization** for memory efficiency
- **LoRA** for parameter-efficient fine-tuning
- **Response-only training** for focused learning
- **Chat template optimization** for proper formatting

✅ **Medical Reasoning Enhancement**:
- Taught the model to use `<think>` tags for reasoning
- Trained on high-quality medical reasoning data
- Created step-by-step diagnostic thinking patterns

✅ **Built Interactive Interface**: Created a user-friendly medical reasoning assistant with streaming responses

### Key Technical Insights

1. **SLMs can reach strong specialized performance** when fine-tuned on domain data, although this notebook does not benchmark against larger models
2. **LoRA enables efficient fine-tuning** with modest computational resources
3. **Reasoning patterns can be learned** through structured training data
4. **Response-only training** focuses the loss on the assistant's output rather than on the prompt
5. **4-bit quantization** made training feasible on a single GPU; its effect on reasoning quality was not measured here

### Performance Improvements

**Before Fine-tuning:**
- Direct answers without reasoning
- Limited medical knowledge application
- No systematic approach

**After Fine-tuning:**
- Structured medical reasoning
- Step-by-step diagnostic thinking
- Clear, evidence-based conclusions

### Next Steps and Extensions

#### 🔬 **Research Directions**
- **Multi-modal integration**: Add medical images and charts
- **Longer reasoning chains**: Support complex multi-step diagnoses
- **Uncertainty quantification**: Express confidence in medical conclusions
- **Differential diagnosis**: Generate multiple potential diagnoses

#### 🏥 **Clinical Applications**
- **Medical education**: Training tool for medical students
- **Clinical decision support**: Assist healthcare professionals
- **Patient education**: Explain medical concepts clearly
- **Telemedicine integration**: Support remote consultations

#### ⚙️ **Technical Improvements**
- **Larger training datasets**: More diverse medical cases
- **Longer training**: More epochs for better performance
- **Evaluation metrics**: Systematic assessment of medical reasoning quality
- **Safety measures**: Add medical disclaimers and safety checks


### Important Disclaimers

⚠️ **Medical Disclaimer**: This model is for educational and research purposes only. It should never replace professional medical advice, diagnosis, or treatment.

⚠️ **Validation Required**: All medical AI systems require extensive clinical validation before real-world use.

⚠️ **Continuous Learning**: Medical knowledge evolves rapidly; models need regular updates.

### Resources for Further Learning

**Technical Resources:**
- [Unsloth Documentation](https://docs.unsloth.ai/)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)


---

### 🎉 Congratulations!

You've successfully created a reasoning-capable medical AI assistant from a traditional language model. This represents the cutting edge of AI development - teaching models to think before they respond, just like human experts do in complex domains like medicine.