# Fine-Tuning DeepSeek R1 (Reasoning Model)
Fine-tuning the world's first open-source reasoning model on the medical chain of thought dataset to build better AI doctors for the future.

Reference:
* [Tutorial](https://www.datacamp.com/tutorial/fine-tuning-deepseek-r1-reasoning-model)

## 1. Setting Up
For this project, we are using Kaggle as our Cloud IDE because it provides free access to GPUs, which are often more powerful than those available in Google Colab. To get started, launch a new Kaggle notebook and add your Hugging Face token and Weights & Biases token as secrets.

You can add secrets by navigating to the `Add-ons` tab in the Kaggle notebook interface and selecting the `Secrets` option.
* **HF_TOKEN**: the Hugging Face token
* **wnb**: the Weights & Biases API key

After setting up the secrets, install the unsloth Python package. Unsloth is an open-source framework designed to make fine-tuning large language models (LLMs) 2X faster and more memory-efficient.

In [None]:
%%capture

!pip install unsloth # install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
# Modules for fine-tuning
from unsloth import FastLanguageModel
import torch # Import PyTorch
from trl import SFTTrainer # Trainer for supervised fine-tuning (SFT)
from unsloth import is_bfloat16_supported # Checks if the hardware supports bfloat16 precision

max_seq_length = 2048 
dtype = None 
load_in_4bit = True 

In [None]:
# Hugging Face modules
from huggingface_hub import login # Lets you log in to the Hugging Face Hub
from transformers import TrainingArguments # Defines training hyperparameters
from datasets import load_dataset # Lets you load fine-tuning datasets

Log in to the Hugging Face Hub using the Hugging Face API token that we securely retrieve from Kaggle Secrets.

In [None]:
# Read the Hugging Face token from Kaggle Secrets
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

hf_token = user_secrets.get_secret("HF_TOKEN")
login(hf_token)

Log in to Weights & Biases (`wandb`) using your API key and create a new project to track the experiments and fine-tuning progress.

In [None]:
import wandb

wb_token = user_secrets.get_secret("wnb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset', 
    job_type="training", 
    anonymous="allow"
)

## 2. Loading the model and tokenizer
For this project, we are loading the Unsloth version of [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B).  Additionally, we will load the model in 4-bit quantization to optimize memory usage and performance.

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 2048 
dtype = None 
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token, 
)
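As a rough check on why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of about 8.03 billion is an assumption for this 8B model, and the estimate ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
# Back-of-the-envelope weight memory for an ~8B-parameter model.
# ASSUMPTION: ~8.03e9 parameters; overheads such as activations,
# the KV cache, and quantization scales are ignored.
NUM_PARAMS = 8.03e9


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the 16-bit weights come to roughly 16 GB, while 4-bit weights need about a quarter of that, which is what makes the model fit on a single free-tier GPU.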

## 3. Model inference before fine-tuning
To create a prompt style for the model, we will define a system prompt and include placeholders for the question and response generation. The prompt will guide the model to think step-by-step and provide a logical, accurate response.

In [None]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>{}"""

In this example, we will provide a medical question to the prompt_style, convert it into tokens, and then pass the tokens to the model for response generation. 

In [None]:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model) 
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

Even without fine-tuning, our model successfully generated a chain of thought and provided reasoning before delivering the final answer. The reasoning process is encapsulated within the `<think></think>` tags.

So, why do we still need fine-tuning? The reasoning process, while detailed, was long-winded and not concise. Additionally, the final answer was presented in a bullet-point format, which deviates from the structure and style of the dataset that we want to fine-tune on. 
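Because the reasoning sits inside the `<think></think>` tags, it can be separated from the final answer with plain string handling. The helper below is a minimal sketch; `split_reasoning` is a hypothetical name, and the sample string is a made-up placeholder rather than real model output.

```python
def split_reasoning(response_text: str):
    """Split generated text into (reasoning, answer) around the </think> tag.

    If no closing tag is found, the whole text is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in response_text:
        return "", response_text.strip()
    reasoning, answer = response_text.split(close_tag, 1)
    # Drop the opening tag if the prompt or the model included it.
    reasoning = reasoning.replace(open_tag, "", 1)
    return reasoning.strip(), answer.strip()


# HYPOTHETICAL placeholder text, not actual model output.
sample = "<think>step 1, then step 2</think>final answer"
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the notebook, the string passed in would be the part of `response[0]` that follows `### Response:`.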

## 4. Loading and processing the dataset
We will slightly change the prompt style for processing the dataset by adding a third placeholder for the complex chain of thought column.

In [None]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

Write a Python function that creates a "text" column in the dataset by filling the placeholders of the train prompt style with the questions, chains of thought, and answers.

In [None]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for question, cot, response in zip(inputs, cots, outputs):  # avoid shadowing the built-in input()
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

We will load the first 500 samples from the [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT?row=46) dataset, which is available on the Hugging Face hub. After that, we will map the text column using the formatting_prompts_func function.

As we can see, the text column has a system prompt, instructions, chain of thought, and the answer. 

In [None]:
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset["text"][0]
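To see what the formatting step produces without downloading the dataset or loading a tokenizer, here is a small stand-alone sketch. The shortened template, the `<eos>` string, and the toy rows are all hypothetical stand-ins; the notebook itself uses `train_prompt_style` and `tokenizer.eos_token`.

```python
# Shortened stand-in for train_prompt_style (same three placeholders).
toy_prompt_style = """### Question:
{}

### Response:
<think>
{}
</think>
{}"""

TOY_EOS = "<eos>"  # HYPOTHETICAL; the notebook uses tokenizer.eos_token


def toy_formatting_func(examples):
    """Build one training string per row from a batch of columns."""
    texts = []
    for question, cot, response in zip(
        examples["Question"], examples["Complex_CoT"], examples["Response"]
    ):
        texts.append(toy_prompt_style.format(question, cot, response) + TOY_EOS)
    return {"text": texts}


# With batched=True, map() passes a dict of columns like this one.
toy_batch = {
    "Question": ["Q1?", "Q2?"],
    "Complex_CoT": ["reasoning 1", "reasoning 2"],
    "Response": ["answer 1", "answer 2"],
}
formatted = toy_formatting_func(toy_batch)
print(formatted["text"][0])
```

The end-of-sequence marker is appended to every example so that the fine-tuned model learns where a response stops.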

## 5. Setting up the model
Using the target modules, we will set up the model by adding the low-rank adapter (LoRA) to the model.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)
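To get a feel for how small the adapter is, the sketch below counts the LoRA weights added by `r=16` on the seven target modules. The layer shapes are assumptions based on a Llama-3.1-8B-style architecture and are not stated in this tutorial; check the model's `config.json` for the actual values.

```python
# ASSUMED Llama-3.1-8B-style shapes (not stated in this tutorial).
HIDDEN = 4096          # model width
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32
R = 16                 # LoRA rank from the cell above
LORA_ALPHA = 16        # lora_alpha from the cell above

# (in_features, out_features) of each targeted projection.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in) and B is (out x rank): rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # standard LoRA scaling, since use_rslora=False

print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
print(f"LoRA scaling factor:    {scaling}")
```

Under these assumed shapes the adapter has about 42 million trainable weights, roughly half a percent of the base model, which is why LoRA fine-tuning fits in limited GPU memory.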

Next, we will set up the training arguments and the trainer by providing the model, tokenizer, dataset, and other important training parameters that will optimize our fine-tuning process.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
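The arithmetic behind these settings can be sketched in plain Python. The example assumes training runs on a single GPU and reimplements the usual linear warmup and decay rule by hand for illustration; `linear_lr` is a hypothetical helper, not a library function.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
NUM_GPUS = 1        # ASSUMPTION: training runs on a single GPU
MAX_STEPS = 60
WARMUP_STEPS = 5
BASE_LR = 2e-4
DATASET_SIZE = 500  # first 500 samples, as loaded earlier

# Examples that contribute to each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
samples_seen = effective_batch * MAX_STEPS
epochs = samples_seen / DATASET_SIZE


def linear_lr(step: int) -> float:
    """Linear warmup to BASE_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))


print(f"Effective batch size: {effective_batch}")
print(f"Samples seen: {samples_seen} (~{epochs:.2f} epochs)")
for step in (0, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

With these values, 60 steps cover slightly less than one pass over the 500 examples, which suits a quick demonstration run rather than a full training run.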

## 6. Model training
Run the following command to start training.  

In [None]:
trainer_stats = trainer.train()

You can view the full model evaluation report on the Weights & Biases dashboard by logging into the website and viewing the [project](https://wandb.ai/ryan-wibawa-california-state-university/Fine-tune-DeepSeek-R1-Distill-Llama-8B%20on%20Medical%20COT%20Dataset?nw=nwuserryanwibawa).

## 7. Model inference after fine-tuning
To compare the results, we will ask the fine-tuned model the same question as before to see what has changed.

In [None]:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


This is much better and more accurate. The chain of thought was direct, and the answer was straightforward and in one paragraph. The fine-tuning was successful.

## 8. Saving the model locally
Now, let's save the adapter, full model, and tokenizer locally so that we can use them in other projects.

In [None]:
new_model_local = "DeepSeek-R1-Medical-COT"
model.save_pretrained(new_model_local) 
tokenizer.save_pretrained(new_model_local)

model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)

## 9. Pushing the model to Hugging Face Hub
We will also push the adapter, tokenizer, and model to Hugging Face Hub so that the AI community can take advantage of this model by integrating it into their systems.

In [None]:
new_model_online = "rwibawa/DeepSeek-R1-Medical-COT"
model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)

model.push_to_hub_merged(new_model_online, tokenizer, save_method = "merged_16bit")

The next step in your learning journey is to serve and deploy your model to the cloud. You can follow the [How to Deploy LLMs with BentoML](https://www.datacamp.com/tutorial/deploy-llms-with-bentoml) guide, which provides a step-by-step process for deploying large language models efficiently and cost-effectively using BentoML and tools like vLLM.

Alternatively, if you prefer to use the model locally, you can convert it into GGUF format and run it on your machine. For this, check out the [Fine-tuning Llama 3.2 and Using It Locally](https://www.datacamp.com/tutorial/fine-tuning-llama-3-2) guide, which provides detailed instructions for local usage.

## 10. Converting the Model to Llama.cpp GGUF
We can’t use the safetensors files locally as most local AI chatbots don’t support them. Instead, we'll convert it into the *llama.cpp* GGUF file format.

### Setting up
Clone the `llama.cpp` repository by running the following command in the Kaggle Notebook cell.

In [2]:
%cd /kaggle/working
!git clone --depth=1 https://github.com/ggerganov/llama.cpp.git

/kaggle/working
Cloning into 'llama.cpp'...
remote: Enumerating objects: 1369, done.
remote: Counting objects: 100% (1369/1369), done.
remote: Compressing objects: 100% (1042/1042), done.
remote: Total 1369 (delta 297), reused 1025 (delta 280), pack-reused 0 (from 0)
Receiving objects: 100% (1369/1369), 19.43 MiB | 25.25 MiB/s, done.
Resolving deltas: 100% (297/297), done.


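### Build `llama.cpp`
Note that the `make` invocation in the cell below no longer works on current `llama.cpp` checkouts: as the recorded output shows, the Makefile build is deprecated in favor of the CMake build described in the linked build documentation. If you build with CMake instead, the binaries (including `llama-quantize`) are typically placed under `build/bin/`.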

In [7]:
%cd /kaggle/working/llama.cpp
!sed -i 's|MK_LDFLAGS   += -lcuda|MK_LDFLAGS   += -L/usr/local/nvidia/lib64 -lcuda|' Makefile
!LLAMA_CUDA=1 make all -j

/kaggle/working/llama.cpp
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.


### Converting Safetensors to GGUF model format
Run the following command in the Kaggle Notebook cell to convert the model into the GGUF format.

The `convert_hf_to_gguf.py` script requires an input model directory, an output file path, and an output type.

In [9]:
%cd /kaggle/working
!python llama.cpp/convert_hf_to_gguf.py DeepSeek-R1-Medical-COT/ \
    --outfile /kaggle/working/llama-3-8b-chat-doctor.gguf \
    --outtype f16

/kaggle/working
2025-03-12 02:27:45.970174: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-12 02:27:46.297422: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-12 02:27:46.386570: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Writing:  26%|██████▉                    | 4.15G/16.1G [00:31<01:34, 126Mbyte/s]Traceback (most recent call last):
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 5117, in <module>
    main()
  File "/kaggle/working/llama.cpp/convert_hf_to_gguf.py", line 5111, in main
    model_instance.write()
  File "/kaggle/working/llama.cpp/conv

### Quantizing the GGUF model
Regular laptops don’t have enough RAM and GPU memory to load the entire model, so we have to quantize the GGUF model, reducing the 16 GB model to around 4-5 GB.

The quantize tool requires an input GGUF file, an output file path, and a quantization method. We are converting the model using the `Q4_K_M` method.

In [None]:
%cd /kaggle/working/

!./llama.cpp/llama-quantize \
llama-3-8b-chat-doctor.gguf \
llama-3-8b-chat-doctor-Q4_K_M.gguf \
Q4_K_M
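To illustrate the general idea behind 4-bit quantization, here is a deliberately simplified sketch: each block of weights shares one scale, and every weight is stored as a signed 4-bit integer. This is not the `Q4_K_M` format, which is more elaborate; the block size of 32, the 16-bit scale, and the function names are assumptions made for this toy example.

```python
import random

BLOCK_SIZE = 32  # ASSUMPTION for this toy scheme


def quantize_block(values):
    """Quantize floats to signed 4-bit codes (-8..7) with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to 7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the shared scale and 4-bit codes."""
    return [scale * c for c in codes]


random.seed(0)
block = [random.uniform(-1, 1) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Storage cost: 4 bits per value plus one 16-bit scale per block.
bits_per_weight = (BLOCK_SIZE * 4 + 16) / BLOCK_SIZE

print(f"Max reconstruction error: {max_error:.4f} (bound: {scale / 2:.4f})")
print(f"Bits per weight: {bits_per_weight}")
```

Each restored value differs from the original by at most half a scale step, and the storage cost falls from 16 bits to a little over 4 bits per weight, which is the trade-off that shrinks the model file.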

## 11. Pushing the model file to Hugging Face
To push the single file to the Hugging Face Hub, we'll:

1. Log in to the Hugging Face Hub using the API key.
2. Create the API object.
3. Upload the file by providing the local path, repo path, repo id, and repo type.

In [None]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
from huggingface_hub import HfApi
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")
login(token = hf_token)

api = HfApi()
api.upload_file(
    path_or_fileobj="/kaggle/working/llama-3-8b-chat-doctor-Q4_K_M.gguf",
    path_in_repo="llama-3-8b-chat-doctor-Q4_K_M.gguf",
    repo_id="rwibawa/llama-3-8b-chat-doctor",
    repo_type="model",
)


## Conclusion
Open-source large language models (LLMs) are becoming better, faster, and more efficient, making it easier than ever to fine-tune them on lower compute and memory resources.

In this tutorial, we explored the DeepSeek R1 reasoning model and learned how to fine-tune its distilled version for medical Q&A tasks. A fine-tuned reasoning model not only enhances performance but also enables its application in critical fields such as medicine, emergency services, and healthcare.