## Credits
This notebook's code is adapted from DataCamp's "Fine-Tuning DeepSeek R1".

In [None]:
!pip install unsloth



### Unsloth

Unsloth is an open-source framework designed to accelerate LLM fine-tuning while reducing memory usage. It reports roughly 2x faster training and about 70% lower GPU memory consumption compared to traditional methods like Hugging Face's Transformers.
### Why Use Unsloth?

Speed: Reported benchmarks show training times reduced by 8.8x (e.g., 23 hours → 2.5 hours on a T4 GPU).

Accessibility: Makes advanced fine-tuning feasible for users without high-end infrastructure.

#### PEFT (Parameter-Efficient Fine-Tuning)

PEFT is a library by Hugging Face that enables parameter-efficient adaptation of LLMs. Instead of updating all model weights, it freezes most parameters and trains small "adapters" using methods such as LoRA (Low-Rank Adaptation) or QLoRA (Quantized LoRA).

### trl
TRL (Transformer Reinforcement Learning) is a Hugging Face library that supports supervised fine-tuning of the model; we will use its SFTTrainer wrapper.
### datasets
Used to fetch reasoning datasets from the Hugging Face Hub.
### torch
Deep learning framework used for training.
### W&B (Weights & Biases)
W&B is a machine learning experiment-tracking platform that logs training metrics, hyperparameters, and model outputs.

### transformers Library
Hyperparameter configuration: the TrainingArguments class from Hugging Face’s transformers library is the *control center for defining training settings.* It standardizes critical configurations to ensure reproducibility and efficiency.

In [None]:
from unsloth import FastLanguageModel
import torch # Import PyTorch
from trl import SFTTrainer # Trainer for supervised fine-tuning (SFT)
from unsloth import is_bfloat16_supported # Checks if the hardware supports bfloat16 precision
# Hugging Face modules
from huggingface_hub import login # Lets you login to API
from transformers import TrainingArguments # Defines training hyperparameters
from datasets import load_dataset # Lets you load fine-tuning datasets
# Import weights and biases
import wandb
# Import kaggle secrets
# from kaggle_secrets import UserSecretsClient

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill',
    job_type="training",
    anonymous="allow"
)

Next, we load the DeepSeek R1 model and its tokenizer using FastLanguageModel.from_pretrained(), and configure key parameters for efficient inference and fine-tuning. We use a distilled 8B version of R1 for faster computation.

### Intuition behind 4-bit quantization

Imagine compressing a high-resolution image to a smaller size—it takes up less space but still looks good enough. Similarly, 4-bit quantization reduces the precision of model weights, making the model smaller and faster while keeping most of its accuracy. Instead of storing precise 32-bit or 16-bit numbers, we compress them into 4-bit values. This allows large language models to run efficiently on consumer GPUs without needing massive amounts of memory.
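
To make the idea concrete, here is a minimal sketch of 4-bit quantization in plain Python. It uses a simple symmetric absmax scheme chosen purely for illustration; this is an assumption, not the scheme that is actually applied when `load_in_4bit=True`. The helper names (`quantize_4bit`, `dequantize`, `weight_memory_gb`) and the sample weights are hypothetical.

```python
def quantize_4bit(weights):
    """Toy symmetric absmax quantization to integer codes in [-7, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # one shared scale for this block of weights
    codes = [round(w / scale) for w in weights]  # each code fits in 4 bits
    return codes, scale


def dequantize(codes, scale):
    """Map the 4-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone (ignores scales and activations)."""
    return num_params * bits_per_weight / 8 / 1e9


weights = [0.12, -0.53, 0.98, -0.05, 0.44]  # hypothetical sample values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max rounding error:", max_error)
for bits in (32, 16, 4):
    print(f"8B parameters at {bits}-bit: ~{weight_memory_gb(8e9, bits):.0f} GB")
```

In this toy scheme the rounding error per weight is at most half the scale. The memory estimate also shows why quantization matters here: at 16-bit precision the weights of an 8B-parameter model alone would need about 16 GB, more than the 14.7 GB reported for the T4 in the loading log, while 4-bit weights need roughly 4 GB.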

In [None]:
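import os

# Set parameters
max_seq_length = 2048 # Define the maximum sequence length a model can handle (i.e. how many tokens can be processed at once)
dtype = None # None lets Unsloth pick FP16 or BF16 based on hardware support
load_in_4bit = True # Enables 4-bit quantization, a memory-saving optimization
# Assumption: the Hugging Face token is stored in the HF_TOKEN environment variable (None is fine for public models)
hugging_face_token = os.environ.get("HF_TOKEN")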

# Load the DeepSeek R1 model and tokenizer using unsloth — imported using: from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # Load the pre-trained DeepSeek R1 model (8B parameter version)
    max_seq_length=max_seq_length, # Ensure the model can process up to 2048 tokens at once
    dtype=dtype, # Use the default data type (e.g., FP16 or BF16 depending on hardware support)
    load_in_4bit=load_in_4bit, # Load the model in 4-bit quantization to save memory
    token=hugging_face_token, # Use hugging face token
)

==((====))==  Unsloth 2025.2.5: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/52.9k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

In [None]:
# Define a system prompt under prompt_style
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""

## Testing DeepSeek R1 on a medical use-case before fine-tuning

--> Define a test question related to a medical case

--> Format the question using the structured prompt (prompt_style) to ensure the model follows a logical reasoning process.

--> Tokenize the input and move it to the GPU (cuda) for faster inference.

--> Generate a response using the model, specifying key parameters like max_new_tokens=1200 (limits response length).

--> Decode the output tokens back into text to obtain the final readable answer.

In [None]:
ques = ' how to reduce the body heat'

FastLanguageModel.for_inference(model)

inputs = tokenizer([prompt_style.format(ques, "")], return_tensors="pt").to("cuda")  # Convert input to PyTorch tensor & move to GPU

outputs = model.generate(
    input_ids=inputs.input_ids, # Tokenized input question
    attention_mask=inputs.attention_mask, # Attention mask to handle padding
    max_new_tokens=1200, # Limit response length to 1200 tokens (to prevent excessive output)
    use_cache=True, # Enable caching for faster inference
)


# Decode the generated output tokens into human-readable text
response = tokenizer.batch_decode(outputs)

# Extract and print only the relevant response part (after "### Response:")
print(response[0].split("### Response:")[1])


<think>
Okay, so I need to figure out how to reduce body heat. Hmm, I remember that when it's hot outside or after exercising, our bodies generate more heat. I think it has something to do with sweating. So, the body sweats to cool down, right? But I'm not exactly sure about all the methods. Let me think step by step.

First, I know that drinking water helps because when you're dehydrated, your body retains more heat. So, staying hydrated is important. But how much water should I drink? Maybe a few glasses a day? I'm not sure, but I think it's something to keep in mind.

Next, I've heard that taking cool showers or baths can help lower your body temperature. But does that work for everyone? Maybe some people prefer just splashing their face with cold water. I guess it's about how you feel comfortable. But I'm not a medical expert, so I'm not certain about the exact benefits.

Then there's the idea of wearing loose, breathable clothing. I remember that fabrics like cotton can help evap

#### Why Different Prompt Types Are Needed and the Role of Prompt Modification Before Fine-Tuning
Large language models (LLMs) require structured guidance to produce accurate, context-aware outputs, especially for specialized tasks like medical reasoning. The inference prompt above ends at the opening `<think>` tag so that the model generates its own reasoning. The training prompt instead contains the complete example:

* Each question is paired with its chain-of-thought reasoning and the final response.
* Every training example follows the same consistent pattern.
* An EOS token is appended so the model learns to stop instead of continuing beyond the expected response length.
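
The difference between the two prompt types can be sketched with shortened stand-in templates. Everything in this example is hypothetical placeholder text (including the `<eos>` string, which stands in for `tokenizer.eos_token`); only the structure mirrors the notebook's prompts.

```python
# Shortened stand-ins for the notebook's prompt templates (hypothetical text).
inference_template = "### Question:\n{}\n\n### Response:\n<think>{}"
training_template = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"
EOS = "<eos>"  # placeholder; the real value comes from tokenizer.eos_token

question = "<question text>"
reasoning = "<step-by-step reasoning>"
answer = "<final answer>"

# At inference time the prompt stops right after <think>, so the model writes the rest.
inference_prompt = inference_template.format(question, "")

# At training time the full example is filled in and terminated with EOS.
training_text = training_template.format(question, reasoning, answer) + EOS

print(inference_prompt)
print("---")
print(training_text)
```

The training text closes the `<think>` block before the answer, which gives the model a consistent signal for where reasoning ends and the final response begins.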


In [None]:
# Updated training prompt style to add </think> tag
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""


In [None]:
# Download the dataset using Hugging Face — function imported using from datasets import load_dataset
dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",
    "en",
    split="train[0:500]",  # Keep only the first 500 rows
    trust_remote_code=True,
)
dataset

README.md:   0%|          | 0.00/1.25k [00:00<?, ?B/s]

medical_o1_sft.json:   0%|          | 0.00/74.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25371 [00:00<?, ? examples/s]

Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 500
})

In [None]:
dataset[33]

{'Question': 'A 3-year-old child presents with tall stature, developmental delay, joint hypermobility, hyperelastic skin, fair complexion, prominent sternum, and downward lens subluxation in the right eye. Considering these features, what complication is this child most likely to develop?',
 'Complex_CoT': "Alright, let's think about this. We've got a 3-year-old child showing quite a few distinct features: tall stature, developmental delay, joint hypermobility, hyperelastic skin, a fair complexion, a prominent sternum, and a curious lens issue—it's subluxed downward in the right eye. Hmm... these seem to be pointing towards something genetic, maybe a connective tissue disorder?\n\nNow, I know Marfan syndrome often pops up when we talk tall stature and joint flexibility. But wait, Marfan's typically has upward lens dislocation, right? This is downward. Oh, and developmental delay isn't something we strongly associate with Marfan syndrome, especially not at this age.\n\nThat brings me to

In [None]:
# We need to format the dataset to fit our prompt training style
EOS_TOKEN = tokenizer.eos_token  # Define EOS_TOKEN, which tells the model when to stop generating text
EOS_TOKEN

'<｜end▁of▁sentence｜>'

In [None]:
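def formating_prompts_function(examples):
    """Build one training string per example: prompt + reasoning + answer + EOS."""
    questions = examples["Question"]
    cots = examples["Complex_CoT"]
    responses = examples["Response"]

    texts = []
    # Loop variables avoid the name `input`, which would shadow the Python built-in
    for question, cot, response in zip(questions, cots, responses):
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}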

In [None]:
# Update dataset formatting
dataset_finetune = dataset.map(formating_prompts_function, batched = True)
dataset_finetune["text"][0]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

"Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.\nPlease answer the following medical question.\n\n### Question:\nA 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?\n\n### Response:\n<think>\nOkay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her abdomi

## Setting up the model using LoRA

Large language models (LLMs) have millions or even billions of weights that determine how they process and generate text. Full fine-tuning updates all of these weights, which requires massive computational resources and memory.

LoRA (Low-Rank Adaptation) makes fine-tuning more efficient:

* Instead of modifying all weights, LoRA adds small, trainable adapters to specific layers.
* These adapters capture task-specific knowledge while leaving the original model weights frozen.
* This reduces the number of trainable parameters by more than 90%, making fine-tuning faster and more memory-efficient. In this notebook, only about 21 million parameters are trained for an 8B-parameter model (see the training log further down).

Think of an LLM as a complex factory. Instead of rebuilding the entire factory to produce a new product, LoRA adds small, specialized tools to existing machines. This allows the factory to adapt quickly without disrupting its core structure.
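
The size of these adapters can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not something read from the loaded model: the layer dimensions are assumptions based on the Llama 3.1 8B architecture that this distilled model builds on (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), and `lora_params` is a hypothetical helper.

```python
HIDDEN = 4096          # assumed hidden size
INTERMEDIATE = 14336   # assumed feed-forward (MLP) size
KV_DIM = 8 * 128       # assumed: 8 key/value heads x head dimension 128
NUM_LAYERS = 32        # assumed number of transformer layers
RANK = 8               # the LoRA rank r used in this notebook

# (input features, output features) of each projection that receives an adapter
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds two matrices: A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in projections.values())
trainable = per_layer * NUM_LAYERS

print(f"adapter parameters per layer: {per_layer:,}")
print(f"total trainable parameters:  {trainable:,}")
```

Under these assumptions the total comes to 20,971,520, the same figure Unsloth reports as the number of trainable parameters when training starts. That is roughly 0.3% of an 8-billion-parameter model.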

Below, we use the get_peft_model() function (PEFT stands for Parameter-Efficient Fine-Tuning). It wraps the base model (model) with LoRA adapters so that only the adapter parameters are trained.

In [None]:
# Apply LoRA (Low-Rank Adaptation) fine-tuning to the model
finetuned_model_lora = FastLanguageModel.get_peft_model(
    model,
    r=8,  # LoRA rank: Determines the size of the trainable adapters (higher = more parameters, lower = more efficiency)
    target_modules=[  # List of transformer layers where LoRA adapters will be applied
        "q_proj",   # Query projection in the self-attention mechanism
        "k_proj",   # Key projection in the self-attention mechanism
        "v_proj",   # Value projection in the self-attention mechanism
        "o_proj",   # Output projection from the attention layer
        "gate_proj",  # Used in feed-forward layers (MLP)
        "up_proj",    # Part of the transformer’s feed-forward network (FFN)
        "down_proj",  # Another part of the transformer’s FFN
    ],
    lora_alpha=16,  # Scaling factor for LoRA updates (higher values allow more influence from LoRA layers)
    lora_dropout=0,  # Dropout rate for LoRA layers (0 means no dropout, full retention of information)
    bias="none",  # Specifies whether LoRA layers should learn bias terms (setting to "none" saves memory)
    use_gradient_checkpointing="unsloth",  # Saves memory by recomputing activations instead of storing them (recommended for long-context fine-tuning)
    random_state=3407,  # Sets a seed for reproducibility, ensuring the same fine-tuning behavior across runs
    use_rslora=False,  # Whether to use Rank-Stabilized LoRA (disabled here, so the standard alpha / r scaling is used)
    loftq_config=None,  # LoftQ (LoRA-Fine-Tuning-aware Quantization) initialization is disabled in this configuration
)

Unsloth 2025.2.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
# Initialize the fine-tuning trainer — Imported using from trl import SFTTrainer
trainer = SFTTrainer(
    model=finetuned_model_lora,  # The model to be fine-tuned
    tokenizer=tokenizer,  # Tokenizer to process text inputs
    train_dataset=dataset_finetune,  # Dataset used for training
    dataset_text_field="text",  # Specifies which field in the dataset contains training text
    max_seq_length=max_seq_length,  # Defines the maximum sequence length for inputs
    dataset_num_proc=2,  # Uses 2 worker processes to speed up data preprocessing

    # Define training arguments
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Number of examples processed per device (GPU) at a time
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps before updating weights
        num_train_epochs=1,  # One pass over the data; max_steps below takes precedence when it is set
        warmup_steps=5,  # Gradually increases learning rate for the first 5 steps
        max_steps=60,  # Limits training to 60 optimizer steps (useful for debugging; increase for full fine-tuning)
        learning_rate=2e-4,  # Learning rate for weight updates (tuned for LoRA fine-tuning)
        fp16=not is_bfloat16_supported(),  # Use FP16 (if BF16 is not supported) to speed up training
        bf16=is_bfloat16_supported(),  # Use BF16 if supported (better numerical stability on newer GPUs)
        logging_steps=10,  # Logs training progress every 10 steps
        optim="adamw_8bit",  # Uses memory-efficient AdamW optimizer in 8-bit mode
        weight_decay=0.01,  # Regularization to prevent overfitting
        lr_scheduler_type="linear",  # Uses a linear learning rate schedule
        seed=3407,  # Sets a fixed seed for reproducibility
        output_dir="outputs",  # Directory where fine-tuned model checkpoints will be saved
    ),
)


Map (num_proc=2):   0%|          | 0/500 [00:00<?, ? examples/s]
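
The `warmup_steps`, `max_steps`, `learning_rate` and `lr_scheduler_type="linear"` settings together determine how the learning rate changes during training. Here is a minimal sketch of that shape in plain Python, assuming linear warmup followed by linear decay to zero. The function `lr_at_step` is hypothetical and does not call the transformers scheduler.

```python
def lr_at_step(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Linear warmup to base_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / (max_steps - warmup_steps)


for step in (0, 1, 5, 10, 30, 60):
    print(f"step {step:2d}: lr = {lr_at_step(step):.6f}")
```

The rate peaks at the end of warmup and falls steadily afterwards, which is consistent with the final `train/learning_rate` of 0.0 in the W&B run summary.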

In [None]:
trainer_status = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 20,971,520


| Step | Training Loss |
|------|---------------|
| 10 | 1.9129 |
| 20 | 1.4672 |
| 30 | 1.4083 |
| 40 | 1.3152 |
| 50 | 1.3509 |
| 60 | 1.3212 |
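
The numbers in the training banner follow directly from the trainer settings. This short calculation reproduces them; the variable names are illustrative, and a single GPU is assumed, as reported in the banner.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1          # as reported in the training banner
max_steps = 60
num_examples = 500    # rows kept from the dataset

# Gradients from several small batches are accumulated before each weight update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Each optimizer step consumes one effective batch
examples_seen = effective_batch_size * max_steps
epochs_completed = examples_seen / num_examples

print("effective batch size:", effective_batch_size)
print("examples seen:", examples_seen)
print("epochs completed:", epochs_completed)
```

With 60 steps of 8 examples each, the run sees 480 of the 500 examples, which is why the W&B summary reports `train/epoch` as 0.96 rather than a full epoch.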


In [None]:
# Finish the W&B run so the logged metrics are synced
wandb.finish()

Run history:
train/epoch,▁▂▄▅▇██
train/global_step,▁▂▄▅▇██
train/grad_norm,█▂▂▁▂▂
train/learning_rate,█▇▅▄▂▁
train/loss,█▃▂▁▁▁

Run summary:
total_flos,1.787791692201984e+16
train/epoch,0.96
train/global_step,60.0
train/grad_norm,0.37273
train/learning_rate,0.0
train/loss,1.3212
train_loss,1.46263
train_runtime,1192.3215
train_samples_per_second,0.403
train_steps_per_second,0.05


In [None]:
question = """how to reduce the body heat"""

# Load the inference model using FastLanguageModel (Unsloth optimizes for speed)
FastLanguageModel.for_inference(finetuned_model_lora)  # Unsloth has 2x faster inference!

# Tokenize the input question with a specific prompt format and move it to the GPU
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

# Generate a response using LoRA fine-tuned model with specific parameters
outputs = finetuned_model_lora.generate(
    input_ids=inputs.input_ids,          # Tokenized input IDs
    attention_mask=inputs.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200,                  # Maximum length for generated response
    use_cache=True,                        # Enable cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response = tokenizer.batch_decode(outputs)

# Extract and print only the model's response part after "### Response:"
print(response[0].split("### Response:")[1])


<think>
Okay, so I want to reduce body heat. Hmm, I remember from school that staying cool is important, especially during the summer when it's really hot out. I've heard people talk about things like drinking water a lot, wearing light clothing, and maybe even using fans or air conditioning. But I'm not exactly sure how all of that works together. Let me think about each one.

First, water. I know it's a basic thing, but why does it help? I guess it's because drinking water keeps you hydrated, and when you're hydrated, you don't feel as thirsty. When you're really hot, your body sweats to cool down, but that makes you lose water. If you don't replace it, you might get dehydrated and feel worse. So, keeping well hydrated is important for staying cool.

Next, light clothing. I think about it as being like a layer that doesn't trap heat. The idea is that when it's hot outside, you want something that allows your skin to release heat easily. Thick or dark clothes trap more heat because t

In [None]:
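# Save only the LoRA adapter weights and the tokenizer
finetuned_model_lora.save_pretrained("fine-tuned-Deepseek-medical")
tokenizer.save_pretrained("fine-tuned-Deepseek-medical")

# Merge the LoRA weights into the base model and save a standalone 16-bit copy
finetuned_model_lora.save_pretrained_merged("fine-tuned-Deepseek-medical-merged", tokenizer, save_method="merged_16bit")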

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1.26 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 34%|███▍      | 11/32 [00:01<00:01, 13.32it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [07:18<00:00, 13.69s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving fine-tuned-Deepseek-medical-merged/pytorch_model-00001-of-00004.bin...
Unsloth: Saving fine-tuned-Deepseek-medical-merged/pytorch_model-00002-of-00004.bin...
Unsloth: Saving fine-tuned-Deepseek-medical-merged/pytorch_model-00003-of-00004.bin...
Unsloth: Saving fine-tuned-Deepseek-medical-merged/pytorch_model-00004-of-00004.bin...
Done.


In [None]:
s = '/content/fine-tuned-Deepseek-medical'  # Path of the saved LoRA adapter directory on Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')
