# **Fine-Tuning DeepSeek-R1 for a Medical Use Case**
Fine-tuning adapts a pre-trained language model to a specific task or dataset by training it further on new examples. This is usually done with Hugging Face’s Transformers library, which on its own demands substantial compute and memory for a model of this size.
- **Unsloth** offers a more optimized approach, making fine-tuning feasible even on modest GPUs.
- It reduces memory usage, speeds up model downloads, and uses techniques like **LoRA** to fine-tune large models with minimal resources.

In [None]:
%%capture

!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

## Loading DeepSeek R1 and the Tokenizer
Here we use [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) for faster computation. This model generally requires **32GB** of RAM and a GPU with **16GB** of VRAM for optimal performance.

The `FastLanguageModel` class is used to optimize inference and fine-tuning.

**Parameters:**
- `max_seq_length`: maximum number of tokens per input
- `dtype=None`: automatically choose the best data type (float16, bfloat16, etc.) for the GPU
- `load_in_4bit=True`: enables 4-bit quantization to reduce memory usage, storing weights in 4 bits instead of 16 or 32

This allows **large language models to run efficiently on consumer GPUs** without needing massive amounts of memory.
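
As a rough back-of-the-envelope check of why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a round 8 billion parameters and ignores activations, the KV cache, and quantization overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper written for this illustration, not part of Unsloth.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: a round 8 billion parameters

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("4-bit", 4)]:
    print(f"{name:>10}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Under these assumptions the weights shrink from about 32 GB in FP32 to about 4 GB in 4-bit, which is what brings an 8B model within reach of a single 16 GB GPU.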



In [None]:
from unsloth import FastLanguageModel
import torch

# configurations for loading model
max_seq_length = 2048
dtype = None          # None: choose the best data type automatically (float16, bfloat16, etc.)
load_in_4bit = True   # enable 4-bit quantization to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",
                                                      max_seq_length=max_seq_length,
                                                      dtype=dtype,
                                                      load_in_4bit=load_in_4bit)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch SmolVLMForConditionalGeneration forward function.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.4.1: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Testing DeepSeek R1 on a medical use-case before fine-tuning


### Defining a system prompt
A system prompt is defined with placeholders for the question and the generated response. The prompt guides the model to think step by step and provide a logical, accurate response.

In [None]:
prompt_style = """Below is an instruction that outlines a task, along with an input that adds context.
                  Write a response that properly completes the request.
                  Before responding, thoroughly consider the issue and develop a step-by-step chain of thoughts to provide a logical and accurate response.

### Instruction:
You are a medical specialist with extensive knowledge of clinical reasoning, diagnosis, and treatment planning.
Please respond to the following medical inquiry.

### Question:
{}

### Response:
<think>{}"""

### **Running inference on the model**
The model is tested by providing a question and generating a response through the following steps:
- Define a **question** based on a medical condition.
- Call `for_inference`, which switches the Unsloth model into its optimized inference mode.
- Format the **question** with `prompt_style` so that the model goes through the step-by-step reasoning process.
- Tokenize the formatted prompt and move the tensors to the GPU **(cuda)** through PyTorch.
- Generate the response with the following parameters:
  - `input_ids`: tokenized input question
  - `attention_mask`: attention mask to handle padding
  - `max_new_tokens=1200`: limits the response to 1200 new tokens to prevent excessive output
  - `use_cache=True`: enables key/value caching for faster inference
- Decode the output tokens back into readable text.

In [None]:
question = """A 45-year-old male presents with persistent fatigue, weight gain, cold intolerance, and constipation.
            On examination, he has dry skin, bradycardia, and mild facial puffiness.
            What is the most likely diagnosis, and which laboratory test would best confirm it?"""

FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)

# decode the output tokens into readable text
response = tokenizer.batch_decode(outputs)

# extract and print only the relevant response part (after "### Response:")
print(response[0].split("### Response:")[1])


<think>
Alright, so I've got this medical case to think through. Let me try to break it down step by step. The patient is a 45-year-old male with persistent fatigue, weight gain, cold intolerance, and constipation. On exam, he has dry skin, bradycardia, and mild facial puffiness. I need to figure out the most likely diagnosis and the best lab test to confirm it.

First, I'll list out the symptoms: fatigue, weight gain, cold intolerance, constipation. On exam, dry skin, bradycardia, and facial puffiness. Let's think about what each of these points towards.

Fatigue and cold intolerance make me think of hypothyroidism. Weight gain is a common symptom of hypothyroidism, as is constipation. Dry skin is also a sign of hypothyroidism because the skin becomes thickened and dry. Bradycardia is another clue; hypothyroidism can cause slow heart rate.

But wait, I should also consider other possibilities. Could it be something else? Let's think about hyperthyroidism. Hyperthyroidism causes weigh

Even without fine-tuning, the model produced a chain of reasoning before giving its final answer; the `<think>` and `</think>` tags enclose that reasoning. Fine-tuning is still necessary because the reasoning was long and not concise, and we also want the final response to follow a specific style consistently.
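
Because the reasoning sits between `<think>` and `</think>`, it can be separated from the final answer with plain string handling. The following is a minimal sketch, not code from the original notebook: `split_reasoning` is a hypothetical helper, and the sample string is made up for illustration. If generation is cut off before `</think>` appears, it returns an empty answer.

```python
def split_reasoning(decoded: str, marker: str = "### Response:") -> tuple[str, str]:
    """Return (reasoning, answer) extracted from decoded model output."""
    # Keep only the text after the last response marker (whole string if absent).
    body = decoded.rpartition(marker)[2]
    # Everything before </think> is reasoning; everything after is the answer.
    reasoning, closed, answer = body.partition("</think>")
    reasoning = reasoning.replace("<think>", "").strip()
    if not closed:
        # The reasoning was never closed, so there is no final answer yet.
        return reasoning, ""
    return reasoning, answer.strip()


# Made-up sample output, for illustration only
sample = "### Response:\n<think>\nSymptoms suggest hypothyroidism.\n</think>\nCheck serum TSH."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```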

## Fine-tuning


### Step 1 — Update the system prompt

We slightly change the prompt style used to process the dataset by adding a third placeholder: the template now takes the question, the complex chain of thought column, and the final response.

In [None]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
                        Write a response that appropriately completes the request.
                        Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### Step 2 — Download the fine-tuning dataset
Download the **medical-o1-reasoning-SFT** dataset from [Hugging Face](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT). Only the first 1,000 rows of the English (`en`) subset are used here.

In [None]:
from datasets import load_dataset
dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",
    "en",
    split="train[0:1000]",  # keep only the first 1,000 rows
    trust_remote_code=True,
)
dataset

Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 1000
})

In [None]:
dataset[1]

{'Question': 'A 33-year-old woman is brought to the emergency department 15 minutes after being stabbed in the chest with a screwdriver. Given her vital signs of pulse 110/min, respirations 22/min, and blood pressure 90/65 mm Hg, along with the presence of a 5-cm deep stab wound at the upper border of the 8th rib in the left midaxillary line, which anatomical structure in her chest is most likely to be injured?',
 'Complex_CoT': "Okay, let's figure out what's going on here. A woman comes in with a stab wound from a screwdriver. It's in her chest, upper border of the 8th rib, left side, kind of around the midaxillary line. First thought, that's pretty close to where the lung sits, right?\n\nLet's talk about location first. This spot is along the left side of her body. Above the 8th rib, like that, is where a lot of important stuff lives, like the bottom part of the left lung, possibly the diaphragm too, especially considering how deep the screwdriver went.\n\nThe wound is 5 cm deep. Tha

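Following the training prompt style, each example in the fine-tuning dataset is structured as:<br>
Question --> chain of thought --> Response <br>
An EOS (end-of-sequence) token is appended to every example so that the model learns to stop after the expected response instead of continuing to generate.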
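
Before training, it can help to confirm that every formatted example really has this shape. The check below is a small sketch that is not part of the original notebook: `is_well_formed` is a hypothetical helper, and the template and strings are toy stand-ins rather than the real prompt. The EOS string is the one the tokenizer reports in this notebook.

```python
def is_well_formed(text: str, question: str, cot: str, response: str, eos: str) -> bool:
    """True if question, reasoning, and response appear in that order and text ends with eos."""
    q = text.find(question)
    if q == -1:
        return False
    c = text.find(cot, q + len(question))
    if c == -1:
        return False
    r = text.find(response, c + len(cot))
    return r != -1 and text.endswith(eos)


EOS = "<｜end▁of▁sentence｜>"
toy_template = "Q: {}\nReasoning: {}\nA: {}"  # toy stand-in for the real template

good = toy_template.format("What is TSH?", "It is a pituitary hormone.", "Thyroid-stimulating hormone.") + EOS
bad = toy_template.format("What is TSH?", "It is a pituitary hormone.", "Thyroid-stimulating hormone.")

print(is_well_formed(good, "What is TSH?", "It is a pituitary hormone.", "Thyroid-stimulating hormone.", EOS))
print(is_well_formed(bad, "What is TSH?", "It is a pituitary hormone.", "Thyroid-stimulating hormone.", EOS))
```

The first call prints `True` and the second `False`, since the second string is missing the EOS token.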

In [None]:
EOS_TOKEN = tokenizer.eos_token  # end-of-sequence token; appended to each example so the model learns where a response ends
EOS_TOKEN

'<｜end▁of▁sentence｜>'

In [None]:
def formatting_prompts_func(examples):
    """Format a batch of examples into training prompts stored under a "text" column."""
    # with batched=True, each column arrives as a list of values
    questions = examples["Question"]
    cots = examples["Complex_CoT"]
    responses = examples["Response"]

    texts = []  # formatted prompts, one per example

    # fill the template with each question, chain of thought, and response
    for question, cot, response in zip(questions, cots, responses):
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN  # append EOS token
        texts.append(text)

    return {
        "text": texts,  # added to the dataset as the "text" column
    }

In [None]:
# update dataset formatting
dataset_finetune = dataset.map(formatting_prompts_func, batched = True)
dataset_finetune["text"][0]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

"Below is an instruction that describes a task, paired with an input that provides further context.\n                        Write a response that appropriately completes the request.\n                        Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.\nPlease answer the following medical question.\n\n### Question:\nGiven the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?\n\n### Response:\n<think>\nOkay, let's see what's going on here. We've got sudden weakness in the person's left arm and leg - and that screams something neuro-related, maybe a stroke?\n\nB

### Step 3 - Apply LoRA Adapters for Efficient Fine-Tuning

Full fine-tuning of an 8B-parameter model is not feasible here, so we use a **Parameter-Efficient Fine-Tuning** (PEFT) technique such as LoRA or QLoRA.

**Low-Rank Adaptation of Large Language Models** (LoRA) trains only a small set of added low-rank matrices while the base weights stay frozen, making training faster and more memory efficient. <br>
`get_peft_model`: enables LoRA fine-tuning <br>
`model`: the DeepSeek-R1-Distill-Llama-8B model to which LoRA is applied<br>
`r=16`: rank of the LoRA matrices. With the target modules below, this gives about 42M trainable parameters (0.52% of the model), far fewer than the full 8B.<br>
`target_modules`: [<br>
    `q_proj`,  Query projection in attention<br>
    `k_proj`,  Key projection<br>
    `v_proj`,  Value projection<br>
    `o_proj`,  Output projection<br>
    `gate_proj`, Gate projection inside the MLP (feedforward)<br>
    `up_proj`,   "Up" projection in the MLP (enlarges dimensions)<br>
    `down_proj`, "Down" projection in the MLP (reduces dimensions)<br>
]<br>
`lora_alpha=16`: after the low-rank matrices are multiplied, the result is scaled by `lora_alpha / r`. This controls how much "strength" the LoRA layers have relative to the base model. `lora_alpha=16` with `r=16` gives a scale of 1.0, so the update is neither amplified nor reduced.<br>
`lora_dropout=0`: no dropout is applied to the LoRA layers<br>
`bias="none"`: no bias terms are trained alongside the LoRA matrices<br>
`use_gradient_checkpointing="unsloth"`: memory-saving technique that recomputes activations during the backward pass instead of storing them<br>
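
The number of trainable LoRA parameters can be worked out by hand: each adapted linear layer with `in_features` inputs and `out_features` outputs gains two matrices of shapes `r x in_features` and `out_features x r`. The sketch below does this count under assumed Llama-8B layer shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers); these shapes are not stated in the notebook, so treat them as assumptions.

```python
# Assumed layer shapes for the Llama-8B architecture (not stated in the notebook)
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32

# (in_features, out_features) of each targeted linear layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """Trainable parameters added by LoRA: r * (in + out) per layer, over all blocks."""
    per_block = sum(r * (n_in + n_out) for n_in, n_out in SHAPES.values())
    return per_block * NUM_LAYERS


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the base output."""
    return lora_alpha / r


print(f"{lora_param_count(16):,} trainable parameters")
print(f"scaling = {lora_scaling(16, 16)}")
```

With `r=16` this comes to 41,943,040 parameters, the same figure Unsloth reports as trainable when training starts, and a scaling factor of 1.0.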


In [None]:
# Low-Rank Adaptation fine-tuning to the model
model_lora = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj","k_proj","v_proj",
        "o_proj","gate_proj","up_proj",
        "down_proj",],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)


Unsloth 2025.4.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Now, we initialize `SFTTrainer`, a supervised fine-tuning trainer from `trl` (Transformer Reinforcement Learning), to fine-tune our model efficiently on a dataset.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model_lora,
    tokenizer=tokenizer,
    train_dataset=dataset_finetune,
    dataset_text_field="text",  # dataset column that contains the training text
    max_seq_length=max_seq_length,
    dataset_num_proc=2,  # use 2 processes to speed up data preprocessing
    # training arguments
    args=TrainingArguments(
        per_device_train_batch_size=2,  # examples processed per device (GPU) at a time
        gradient_accumulation_steps=4,  # accumulate gradients over 4 steps before updating weights
        num_train_epochs=1,             # not reached here because max_steps is set
        warmup_steps=5,                 # ramp the learning rate up linearly over the first 5 steps
        max_steps=60,                   # stop after 60 optimizer steps (takes precedence over num_train_epochs)
        learning_rate=2e-4,             # peak learning rate (tuned for LoRA fine-tuning)
        fp16=not is_bfloat16_supported(),  # fall back to FP16 when BF16 is not supported
        bf16=is_bfloat16_supported(),      # prefer BF16 on newer GPUs for better numerical stability
        logging_steps=10,               # log training progress every 10 steps
        optim="adamw_8bit",             # AdamW with 8-bit optimizer states to save memory
        weight_decay=0.01,              # regularization to reduce overfitting on the dataset
        lr_scheduler_type="linear",     # decay the learning rate linearly after warmup
        seed=3407,                      # fixed seed for reproducibility
        output_dir="outputs",           # directory where fine-tuned model checkpoints are saved
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/1000 [00:00<?, ? examples/s]
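
To make the `warmup_steps`, `max_steps`, and `lr_scheduler_type="linear"` settings concrete, here is a minimal sketch of a linear warmup followed by linear decay. It mirrors the general shape of the linear-with-warmup schedule in Transformers but is a standalone approximation written for illustration; `linear_schedule_lr` is a hypothetical helper, and the exact values logged by the trainer may differ slightly.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # decay from base_lr at the end of warmup down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 2, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

The rate peaks at `2e-4` once warmup ends at step 5 and reaches zero at step 60.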

### Step 4 — Model training

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)

Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|---|---|
| 10 | 1.9702 |
| 20 | 1.4251 |
| 30 | 1.3964 |
| 40 | 1.3569 |
| 50 | 1.398 |
| 60 | 1.3324 |
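
The training banner and the run summary can be cross-checked with simple arithmetic. The sketch below uses the values from this run (batch size 2, 4 accumulation steps, 1 GPU, 60 steps, 1,000 examples); `training_budget` is a hypothetical helper written for this check.

```python
def training_budget(per_device_batch: int, grad_accum: int, num_gpus: int,
                    max_steps: int, num_examples: int) -> tuple[int, int, float]:
    """Return (effective batch size, examples seen, fraction of an epoch completed)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    examples_seen = effective_batch * max_steps
    return effective_batch, examples_seen, examples_seen / num_examples


effective_batch, examples_seen, epochs = training_budget(2, 4, 1, 60, 1000)
print(f"effective batch size: {effective_batch}")
print(f"examples seen:        {examples_seen}")
print(f"epochs completed:     {epochs:.2f}")
```

With 60 steps of 8 examples each, the model sees 480 of the 1,000 examples, which is why the run reports `train/epoch` as 0.48 rather than a full epoch.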


`wandb`: the Weights & Biases client, used here to track the fine-tuning experiment (loss, learning rate, gradient norm).

In [None]:
import wandb
wandb.finish()  # end the W&B run and flush its logs (this does not save the model)

Run history:

| Metric | Trend |
|---|---|
| train/epoch | ▁▂▄▅▇██ |
| train/global_step | ▁▂▄▅▇██ |
| train/grad_norm | █▃▂▂▁▂ |
| train/learning_rate | █▇▅▄▂▁ |
| train/loss | █▂▂▁▂▁ |

Run summary:

| Metric | Value |
|---|---|
| total_flos | 1.6807568706453504e+16 |
| train/epoch | 0.48 |
| train/global_step | 60.0 |
| train/grad_norm | 0.26694 |
| train/learning_rate | 0.0 |
| train/loss | 1.3324 |
| train_loss | 1.47983 |
| train_runtime | 1070.3798 |
| train_samples_per_second | 0.448 |
| train_steps_per_second | 0.056 |


### Step 5 — Run model after fine-tuning

In [None]:
question = """A 45-year-old male presents with persistent fatigue, weight gain, cold intolerance, and constipation.
            On examination, he has dry skin, bradycardia, and mild facial puffiness.
            What is the most likely diagnosis, and which laboratory test would best confirm it?"""

FastLanguageModel.for_inference(model_lora)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

# Generate a response using LoRA fine-tuned model with specific parameters
outputs = model_lora.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)

response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, let's see what's going on here. This guy is 45 years old, and he's been feeling really tired for a while now. That's a big red flag for me. Plus, he's gained some weight, and he feels cold all the time. That's interesting. He's also having trouble with his bowels, meaning he's constipated. Hmm, constipation is a bit unusual for someone his age. 

Looking at his skin, it's dry. That's kind of worrying. And he's got a slow heartbeat, which is pretty unusual for someone his age. Oh, and he's got some puffiness on his face. This makes me think of something that affects the whole body, not just specific organs. 

Let me think about what could cause all of these symptoms. There's a condition that comes to mind: hypothyroidism. That's when the thyroid gland doesn't produce enough hormones. I've heard about this before. It can cause symptoms like fatigue, weight gain, and cold intolerance, which fit well with what this guy is experiencing.

Now, let's think about the symptoms ag

In [None]:
question = """A 45-year-old man presents to the clinic with a 3-week history of progressive lower back pain, which began after lifting a heavy object.
              He describes the pain as sharp and radiating down the posterior aspect of his left thigh into the lateral calf and dorsum of the foot.
              He also reports occasional numbness and tingling in the same distribution, along with mild weakness when trying to lift his left foot upwards.
              On physical examination, there is reduced sensation over the dorsum of the foot and weakness of ankle dorsiflexion.
              The straight leg raise test on the left side is positive. MRI of the lumbar spine reveals a posterolateral disc herniation at the L4-L5 level.
                Based on the clinical presentation and imaging findings, which nerve root is most likely compressed?"""


FastLanguageModel.for_inference(model_lora)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

# generate with the LoRA fine-tuned model, as in the previous cell
outputs = model_lora.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, so we have a 45-year-old man who's been having back pain for three weeks now. It started when he lifted a heavy object, which makes sense because that's a common scenario for developing back issues. The pain is pretty sharp and it's going down the back of his left thigh, and even further down into his calf and the back of his foot. That's interesting because it suggests there's something going on with the nerve roots that are responsible for sending signals down those areas.

He also mentions numbness and tingling in the same areas, which tells me that there's definitely something going on with the nerves. And he's having some weakness when he tries to lift his foot up, which points to a problem with the nerve that controls that movement. It's making me think that the nerves at the base of the spine, like the ones that go to the legs, are probably involved.

Looking at the physical exam, the doctor found that there's reduced sensation on the back of his foot and weakness

### Step 6: Saving the Model & Tokenizer



In [None]:
new_model_local = "DeepSeek-R1-Medical-COT"
model_lora.save_pretrained(new_model_local)  # local save of the fine-tuned model (for a PEFT model this writes the LoRA adapter weights and config)
tokenizer.save_pretrained(new_model_local)

('DeepSeek-R1-Medical-COT/tokenizer_config.json',
 'DeepSeek-R1-Medical-COT/special_tokens_map.json',
 'DeepSeek-R1-Medical-COT/tokenizer.json')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp -r DeepSeek-R1-Medical-COT /content/drive/MyDrive/


## Deployment

In [None]:
!pip install streamlit

In [None]:
!pip uninstall -y bitsandbytes
!pip install bitsandbytes-cuda116

In [None]:
import torch
print(torch.cuda.is_available())

In [None]:
%%writefile main.py
import streamlit as st
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

@st.cache_resource
def load_model_and_tokenizer():
    tokenizer = AutoTokenizer.from_pretrained("DeepSeek-R1-Medical-COT")
    model = AutoModelForCausalLM.from_pretrained(
        "DeepSeek-R1-Medical-COT",
        trust_remote_code=True,
        device_map="cpu",
        torch_dtype=torch.float32
    )
    return tokenizer, model

tokenizer, model = load_model_and_tokenizer()

st.title("🧠 Medical LLM QA (DeepSeek-R1-Medical-COT)")

user_input = st.text_area("Enter your medical question:")

if st.button("Generate Answer"):
    if user_input:
        with st.spinner('Generating...'):
            inputs = tokenizer(user_input, return_tensors="pt")
            outputs = model.generate(**inputs, max_new_tokens=256)
            answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        st.subheader("Answer:")
        st.success(answer)
    else:
        st.warning("Please type a question first!")


Overwriting main.py
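
The app above sends the raw text from the text area straight to the tokenizer, while the model was tested and fine-tuned with a structured prompt. A minimal sketch of wrapping the user's question in the same kind of template before tokenizing is shown below. `PROMPT_TEMPLATE` is a re-typed copy of the inference prompt without its leading indentation, and `build_prompt` is a hypothetical helper; neither is part of the original `main.py`.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that outlines a task, along with an input that adds context.\n"
    "Write a response that properly completes the request.\n"
    "Before responding, thoroughly consider the issue and develop a step-by-step "
    "chain of thoughts to provide a logical and accurate response.\n\n"
    "### Instruction:\n"
    "You are a medical specialist with extensive knowledge of clinical reasoning, "
    "diagnosis, and treatment planning.\n"
    "Please respond to the following medical inquiry.\n\n"
    "### Question:\n{question}\n\n"
    "### Response:\n<think>"
)


def build_prompt(question: str) -> str:
    """Wrap a user question in the prompt template used at inference time."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return PROMPT_TEMPLATE.format(question=question)


print(build_prompt("Which lab test confirms hypothyroidism?"))
```

In the Streamlit handler, the idea would be to tokenize `build_prompt(user_input)` instead of `user_input`, so the deployed model sees the format it was trained on.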


In [None]:
!streamlit run main.py & npx localtunnel --port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.91.88.242:8501

your url is: https://tall-ties-care.loca.lt
