# Fine-Tuning DeepSeek R1

This tutorial walks through fine-tuning a distilled **DeepSeek R1** model (`DeepSeek-R1-Distill-Llama-8B`) on a medical chain-of-thought dataset, using Unsloth on a free Kaggle GPU.

## 1. Setting Up

For this project, we will use **Kaggle** as our Cloud IDE. Kaggle offers **free access to GPUs**, which are often more powerful than those available in **Google Colab**. 

### Steps to Get Started:
1. **Launch a new Kaggle notebook**.
2. **Add your Hugging Face and Weights & Biases tokens as secrets**:
   - Navigate to the **Add-ons** tab in the Kaggle notebook interface.
   - Select the **Secrets** option.
   - Add the required API tokens.
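
As a minimal sketch of how those secrets can be read in code, the helper below tries Kaggle's `UserSecretsClient` first and falls back to an environment variable when the notebook runs elsewhere. The helper name `get_token` and the environment-variable fallback are assumptions for illustration; the secret names must match the labels you chose in the Secrets panel.

```python
import os


def get_token(name: str) -> str:
    """Return a secret by name (hypothetical helper, not part of any library)."""
    try:
        # kaggle_secrets is available only inside Kaggle notebooks
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Assumption: outside Kaggle, the token is exported as an environment variable
        token = os.environ.get(name)
        if token is None:
            raise KeyError(f"Secret {name!r} not found in Kaggle Secrets or the environment")
        return token
```

Inside Kaggle, a call such as `get_token("HF_TOKEN")` would return the stored secret; elsewhere it reads the variable of the same name.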

### Installing Dependencies

Once the secrets are set up, install the **Unsloth** Python package. Unsloth is an open-source framework that, according to its authors, makes fine-tuning of large language models (LLMs) about **2x faster and more memory-efficient**.

```python
%%capture
# Install the Unsloth library for efficient LLM fine-tuning
!pip install unsloth

# Force-reinstall Unsloth from its GitHub repository to get the latest development version
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
```

Code explanation:

- `%%capture` suppresses the cell's output to keep the notebook clean.
- The first command installs Unsloth, an optimized library for fine-tuning large language models, from PyPI.
- The second command force-reinstalls Unsloth directly from GitHub (without touching its dependencies), so the notebook picks up the latest features and fixes.

```python
from unsloth import FastLanguageModel  # Import Unsloth's optimized LLM module
import torch  # Import PyTorch for tensor operations and model handling

# Define maximum sequence length for the model
max_seq_length = 2048

# Data type for model weights (None allows default settings)
dtype = None

# Enable 4-bit quantization for memory-efficient model loading
load_in_4bit = True
```

```
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
```


## 2. Logging into Hugging Face CLI

To authenticate with Hugging Face, we will use the **Hugging Face API key** stored securely in **Kaggle Secrets**.

### Steps to Log In:
1. Retrieve the API key from **Kaggle Secrets**.
2. Use the key to log in to the Hugging Face CLI.

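```python
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

# Read the Hugging Face token stored under the Kaggle secret named "HF_TOKEN"
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")

# Authenticate this session with the Hugging Face Hub
login(hf_token)
```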

## 3. Logging into Weights & Biases (wandb) and Setting Up a Project

We will use **Weights & Biases (wandb)** to track our fine-tuning experiments, visualize training metrics, and manage logs efficiently.

### Steps to Set Up:
1. **Retrieve the Weights & Biases API key** stored securely in **Kaggle Secrets**.
2. **Log in to wandb** using the API key.
3. **Initialize a new wandb project** for experiment tracking.

```python
import wandb  # Import Weights & Biases for experiment tracking

# Retrieve the Weights & Biases (wandb) API token from Kaggle Secrets
wb_token = user_secrets.get_secret("wandb")

# Log in to Weights & Biases using the retrieved API token
wandb.login(key=wb_token)

# Initialize a new Weights & Biases run for tracking fine-tuning progress
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset',  # Project name
    job_type="training",  # Define the job type as training
    anonymous="allow"  # Allow anonymous logging if needed
)
```



## 4. Loading the Model and Tokenizer

For this project, we will use the **Unsloth version** of `DeepSeek-R1-Distill-Llama-8B`.  
To optimize **memory usage** and **performance**, we will load the model using **4-bit quantization**.

### Steps to Load the Model:
1. **Ensure dependencies are installed** (Unsloth, etc.).
2. **Load the tokenizer and model** from Unsloth with **4-bit quantization** for efficiency.
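3. **Confirm the model is on the GPU**: with 4-bit loading, the weights are placed on the available CUDA device automatically, so no explicit `.to("cuda")` call is needed.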

```python
from unsloth import FastLanguageModel  # Import Unsloth's optimized LLM module

# Define model configuration
max_seq_length = 2048  # Maximum sequence length for input processing
dtype = None  # Data type for model weights (None allows default settings)
load_in_4bit = True  # Enable 4-bit quantization for memory efficiency

# Load the pre-trained DeepSeek-R1-Distill-Llama-8B model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # Model name from Hugging Face
    max_seq_length=max_seq_length,  # Set sequence length for model processing
    dtype=dtype,  # Use default data type settings
    load_in_4bit=load_in_4bit,  # Load model with 4-bit quantization
    token=hf_token,  # Use the Hugging Face authentication token for access
)
```

```
==((====))==  Unsloth 2025.1.8: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
```
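
To see why 4-bit loading matters on the Tesla T4 reported above (about 14.7 GB of memory), here is a rough, weights-only estimate. The parameter count of about 8 billion is an assumption taken from the model name, and the figure ignores activations, the KV cache, quantization metadata, and any layers kept in higher precision, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, the 16-bit weights alone (about 16 GB) would not fit on the T4, while the 4-bit weights (about 4 GB) leave headroom for the LoRA adapters, activations, and optimizer state.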


## 5. Model Inference Before Fine-Tuning

Before fine-tuning the model, we will test its inference capabilities by defining a structured **prompt style**.  
This prompt will include:
- A **system prompt** to set the context.
- **Placeholders** for the question and response generation.
- A **step-by-step approach** to encourage logical and accurate answers.

### Steps:
1. **Define a structured prompt format**.
2. **Run an initial inference using the pre-trained model**.
3. **Analyze the model's response before fine-tuning**.

```python
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>{}"""
```

Before fine-tuning the model, we will test its inference capabilities using a **medical question**.  
We will follow these steps:
1. **Define a structured prompt style** with a **medical question**.
2. **Convert the prompt into tokens**.
3. **Pass the tokens to the model for response generation**.
4. **Analyze the model's response** before fine-tuning.
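
The prompt-filling and response-extraction parts of these steps are plain string handling and can be tried without a GPU. The sketch below uses a shortened stand-in template (`demo_prompt`, not the full prompt above) and a made-up decoded string to show how `str.format` fills the two placeholders and how splitting on `### Response:` isolates the model's answer.

```python
# Shortened stand-in for the tutorial's prompt_style (assumption: same two placeholders)
demo_prompt = """### Question:
{}

### Response:
<think>{}"""

question = "What does a positive Q-tip test suggest?"

# The second placeholder is left empty at inference time so the model continues after <think>
prompt = demo_prompt.format(question, "")

# Pretend the model echoed the prompt and appended some text (made-up output)
decoded = prompt + "\nSome reasoning...\n</think>\nFinal answer."

# Everything after the marker is the response, including the <think> block
response = decoded.split("### Response:")[1]
print(response)
```

Because the decoded output contains the prompt followed by the generated tokens, splitting on the marker is a simple way to drop the prompt; it relies on the marker appearing exactly once.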

```python
# Define a medical question for model inference
question = ("A 61-year-old woman with a long history of involuntary urine loss "
            "during activities like coughing or sneezing but no leakage at night "
            "undergoes a gynecological exam and Q-tip test. Based on these findings, "
            "what would cystometry most likely reveal about her residual volume and detrusor contractions?")

# Optimize the model for inference using Unsloth's faster inference method
FastLanguageModel.for_inference(model)  # Enables 2x faster inference

# Tokenize the input question using the predefined prompt format
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")  # Move input to GPU

# Generate model response with specified constraints
outputs = model.generate(
    input_ids=inputs.input_ids,  # Input token IDs
    attention_mask=inputs.attention_mask,  # Attention mask for valid tokens
    max_new_tokens=1200,  # Maximum number of tokens the model can generate
    use_cache=True,  # Enables caching for efficient decoding
)

# Decode and extract the model's response
response = tokenizer.batch_decode(outputs)  # Convert tokenized output into text
print(response[0].split("### Response:")[1])  # Extract and print only the response content
```

```
<think>
Okay, so I'm trying to figure out what cystometry would show for this 61-year-old woman. Let me break this down step by step.

First, the patient has a history of involuntary urine loss when she coughs or sneezes. That makes me think of stress urinary incontinence, especially since she doesn't leak at night. So, her main issue is likely related to bladder control during activities that put pressure on the bladder.

She undergoes a gynecological exam and a Q-tip test. I'm a bit fuzzy on what exactly the Q-tip test entails, but I think it's a diagnostic tool used to assess urethral function. It probably involves inserting a Q-tip catheter into the urethra to measure pressure or other parameters. If the test shows abnormal results, it could indicate urethral dysfunction, which might be contributing to her incontinence.

Now, considering the possible findings, if there's involuntary leakage on coughing, it suggests that the bladder's ability to resist pressure isn't working proper
```

### Model Inference Analysis: Why Is Fine-Tuning Necessary?

Even without fine-tuning, our model successfully generated a **chain of thought**, providing reasoning before delivering the final answer. The reasoning process was encapsulated within the `<think></think>` tags, demonstrating the model’s ability to structure its response logically.

### So, why do we still need fine-tuning?

1. **Lack of Conciseness**  
   - While the reasoning process was detailed, it was also **long-winded** and not as concise as we would like.
   
2. **Formatting Issues**  
   - The final answer was presented in a **bullet-point format**, which **deviates** from the structured, narrative style of the dataset we aim to fine-tune on.

Fine-tuning will help the model:
- **Refine its reasoning process** to be **clear and to the point**.
- **Adapt its response style** to match the desired **dataset structure**.
- **Ensure coherence** and **consistency** across all generated outputs.

With this understanding, we can now move forward to **preparing the dataset for fine-tuning**! 

## 6. Loading and Processing the Dataset

To ensure the model aligns with our desired response style, we will **slightly modify the prompt structure** when processing the dataset. Specifically, the training prompt closes the `<think>` block with `</think>` and uses **three placeholders**: the question, the **complex chain of thought** column, and the final response.

### Why is this necessary?
- **Enhances structured reasoning:** The additional placeholder allows the model to generate a more **well-structured and logical response**.
- **Maintains coherence with the dataset format:** By incorporating a **complex reasoning step**, we ensure that the model follows a **consistent thought process** before producing the final answer.
- **Improves fine-tuning efficiency:** A structured prompt style makes it easier to train the model on **step-by-step explanations** while ensuring the generated responses adhere to the desired format.

With this modified prompt structure in place, we can now proceed to **loading and preparing the dataset** for fine-tuning! 

```python
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}"""
```

## 7. Downloading the Dataset

We will use the **medical-o1-reasoning-SFT** dataset from **Hugging Face**, available at:  
🔗 [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT)

### About the Dataset:
- This dataset was used to fine-tune **HuatuoGPT-o1**, a **medical LLM** designed for **advanced medical reasoning**.
- Constructed using **GPT-4o**, the dataset includes solutions to **verifiable medical problems**.
- All solutions are validated through a **medical verifier**, ensuring accuracy and reliability.

### Why use this dataset?
- It enhances the model's ability to **reason through complex medical cases**.
- It provides structured, step-by-step **medical problem-solving** examples.
- It ensures that **generated responses** are **factually accurate** and aligned with medical best practices.

### Formatting the Dataset
To prepare the data for training, we will write a Python function that creates a new `"text"` column in the dataset. This column will follow the **training prompt style**, where the placeholders are filled with:
1. **Questions**
2. **Chains of Thought** (step-by-step reasoning)
3. **Answers** (final response)

```python
# Define the End-of-Sequence (EOS) token to properly terminate responses
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN to indicate response completion

def formatting_prompts_func(examples):
    """
    Formats the dataset examples using the predefined training prompt style.
    
    Args:
        examples (dict): A dictionary containing batches of questions, 
                         chain-of-thought reasoning (CoT), and responses.
    
    Returns:
        dict: A dictionary with a new 'text' column containing formatted prompts.
    """
    # Extract inputs from dataset columns
    questions = examples["Question"]  # The question prompt
    cots = examples["Complex_CoT"]  # The complex chain-of-thought reasoning
    responses = examples["Response"]  # The final answer

    texts = []  # List to store formatted text entries

    # Format each example using the training prompt style
    # (loop names avoid shadowing the built-in `input`)
    for question, cot, response in zip(questions, cots, responses):
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN  # Append EOS token
        texts.append(text)

    # Return a dictionary containing the formatted text data
    return {"text": texts}
```
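
With `batched=True`, `dataset.map` passes the function a dictionary of column lists rather than a single row. The sketch below mimics that call on a two-row toy batch; the stand-in template, the sample rows, and the `</s>` EOS string are assumptions for illustration (the real EOS token comes from `tokenizer.eos_token`).

```python
# Stand-ins so the example runs without a tokenizer or the real dataset
EOS = "</s>"  # assumption: placeholder for tokenizer.eos_token
template = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"


def format_batch(examples: dict) -> dict:
    """Build one training string per row from a batch of column lists."""
    texts = [
        template.format(q, cot, ans) + EOS
        for q, cot, ans in zip(
            examples["Question"], examples["Complex_CoT"], examples["Response"]
        )
    ]
    return {"text": texts}


toy_batch = {
    "Question": ["Q1?", "Q2?"],
    "Complex_CoT": ["reasoning 1", "reasoning 2"],
    "Response": ["answer 1", "answer 2"],
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Appending the EOS token to every example matters because it teaches the model where a response ends; without it, generation tends to run on until `max_new_tokens` is reached.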

```python
from datasets import load_dataset  # Import Hugging Face's dataset loading function

# Load the medical-o1-reasoning-SFT dataset (English subset) from Hugging Face
dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",  # Dataset repository
    "en",  # Load the English language subset
    split="train[0:500]",  # Load only the first 500 training examples
    trust_remote_code=True  # Allow execution of dataset-defined preprocessing code
)

# Apply the formatting function to transform dataset into training-ready text format
dataset = dataset.map(formatting_prompts_func, batched=True)

# Display the first formatted training example
dataset["text"][0]
```

```
"Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request. \nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \nPlease answer the following medical question. \n\n### Question:\nA 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?\n\n### Response:\n<think>\nOkay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her ab
```

## 8. Setting Up the Model

To fine-tune efficiently, we attach **LoRA (Low-Rank Adaptation)** adapters to the model's attention and MLP projection layers (the **target modules**), so that only the adapters are trained.

### Why use a Low-Rank Adapter?
- **Reduces computational cost**: LoRA enables efficient fine-tuning by training only a small number of additional parameters.
- **Improves memory efficiency**: Instead of updating the full model, it trains small low-rank matrices added alongside the frozen weights, making fine-tuning feasible for large models.
- **Preserves pre-trained knowledge**: Because the base weights stay frozen and only the target modules receive adapters, the model can **retain what it learned in pre-training** while adapting to the new task.

### Steps to Set Up:
1. **Define the target modules** for LoRA integration.
2. **Attach the low-rank adapter** to these modules.
3. **Prepare the model for fine-tuning** with optimized parameter updates.

```python
# Apply Parameter-Efficient Fine-Tuning (PEFT) using LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank of the low-rank adaptation (controls trainable parameters)
    target_modules=[  # Layers that receive LoRA adapters
        "q_proj",  # Query projection in attention mechanism
        "k_proj",  # Key projection in attention mechanism
        "v_proj",  # Value projection in attention mechanism
        "o_proj",  # Output projection in attention mechanism
        "gate_proj",  # Gate projection in MLP layers
        "up_proj",  # Up projection in MLP layers
        "down_proj",  # Down projection in MLP layers
    ],
    lora_alpha=16,  # Scaling factor for LoRA weights (higher values increase adaptation strength)
    lora_dropout=0,  # Dropout rate on the LoRA layers (0 means no dropout)
    bias="none",  # No additional bias terms in LoRA layers
    use_gradient_checkpointing="unsloth",  # Enables memory-efficient training for long sequences
    random_state=3407,  # Seed for reproducibility
    use_rslora=False,  # Disables rank-stabilized LoRA (rsLoRA), using standard LoRA instead
    loftq_config=None,  # LoftQ initialization is not used
)
```

```
Unsloth 2025.1.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
```
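
The trainable-parameter count that Unsloth reports later, when training starts (41,943,040), can be reproduced by hand. Each adapted linear layer with input size `in` and output size `out` gains two matrices holding `r × (in + out)` parameters in total. The layer dimensions below are assumptions based on the Llama 3.1 8B architecture underlying this distilled model (hidden size 4096, MLP size 14336, 1024-dimensional key/value projections, 32 layers); they are not read from the model object.

```python
RANK = 16
NUM_LAYERS = 32

# (input_dim, output_dim) per target module; assumed Llama-3.1-8B shapes
MODULE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(in_dim: int, out_dim: int, r: int) -> int:
    """LoRA adds A (r x in_dim) and B (out_dim x r) next to a frozen weight."""
    return r * (in_dim + out_dim)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

That is about half a percent of an 8-billion-parameter model, which is why LoRA fits comfortably on a single T4.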


### Configuring Training Arguments and Trainer

With the model set up, we will now define the **training arguments** and initialize the **trainer**. This step involves specifying key parameters that will **optimize the fine-tuning process**.

### What will we configure?
1. **Model and Tokenizer**  
   - Ensure compatibility between the model and input text processing.
   
2. **Dataset**  
   - Pass the preprocessed **medical-o1-reasoning-SFT** dataset (with its new `text` column) to the trainer.

3. **Training Parameters**  
   - Define important hyperparameters such as:
     - **Learning Rate:** Controls how much the model updates weights.
     - **Batch Size:** Determines how many examples are processed per step.
     - **Epochs:** Number of times the model goes through the entire dataset.
     - **Gradient Accumulation:** Helps when working with large models on limited resources.
     - **Weight Decay:** Prevents overfitting by applying regularization.

4. **Trainer Setup**  
   - Initialize TRL's **`SFTTrainer`**, which builds on the Hugging Face `Trainer`, to manage model training.

### Why is this important?
- **Optimizes model performance** by fine-tuning the right parameters.
- **Ensures stability during training** with proper gradient accumulation.
- **Speeds up training** while preventing overfitting.

```python
from trl import SFTTrainer  # Import the trainer for supervised fine-tuning (SFT)
from transformers import TrainingArguments  # Import training arguments for fine-tuning
from unsloth import is_bfloat16_supported  # Check if bfloat16 is supported for training

# Initialize the Supervised Fine-Tuning (SFT) Trainer
trainer = SFTTrainer(
    model=model,  # Model with LoRA adapters attached
    tokenizer=tokenizer,  # Tokenizer for processing input text
    train_dataset=dataset,  # Training dataset
    dataset_text_field="text",  # The column in the dataset containing formatted prompts
    max_seq_length=max_seq_length,  # Maximum sequence length for processing input
    dataset_num_proc=2,  # Number of processes for dataset pre-processing

    # Define training arguments
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Number of samples per device per batch
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 batches (effective batch size 8)
        num_train_epochs=5,  # Ignored in this run because max_steps below takes precedence
        warmup_steps=5,  # Ramp the learning rate up over the first 5 steps
        max_steps=60,  # Stop after 60 optimizer steps (set to -1 to train by num_train_epochs)
        learning_rate=2e-4,  # Peak learning rate
        fp16=not is_bfloat16_supported(),  # Use fp16 (16-bit precision) if bfloat16 is not supported
        bf16=is_bfloat16_supported(),  # Use bf16 if available (better for certain hardware)
        logging_steps=10,  # Log metrics every 10 steps
        optim="adamw_8bit",  # Use 8-bit AdamW optimizer for memory-efficient training
        weight_decay=0.01,  # Weight decay for preventing overfitting
        lr_scheduler_type="linear",  # Use a linear learning rate scheduler
        seed=3407,  # Random seed for reproducibility
        output_dir="outputs",  # Directory to save model checkpoints and logs
    ),
)
```

## 9. Setting Up the Trainer for Fine-Tuning

To fine-tune our model efficiently, we use **Supervised Fine-Tuning (SFT)** with `SFTTrainer`.  
This setup ensures **optimized training performance** while keeping **memory usage low**.

### Key Components:
1. **SFTTrainer** → Handles supervised fine-tuning efficiently.
2. **Gradient Accumulation** → Reduces memory usage by accumulating gradients over multiple steps.
3. **Mixed Precision (fp16/bf16)** → Improves training speed and efficiency based on hardware support.
4. **8-bit AdamW Optimizer (`adamw_8bit`)** → Uses less memory while maintaining performance.
5. **Linear Learning Rate Scheduler (`lr_scheduler_type="linear"`)** → Gradually reduces learning rate for stability.

### Training Arguments:
| Parameter | Description |
|-----------|------------|
| `per_device_train_batch_size=2` | Trains with **2 samples per batch** per device. |
| `gradient_accumulation_steps=4` | Accumulates gradients over **4 steps** before updating. |
| `num_train_epochs=5` | Requests **5 training epochs**, but is **overridden by `max_steps`** in this run. |
| `warmup_steps=5` | Ramps the learning rate up over the first **5 steps**. |
| `max_steps=60` | Stops training after **60 optimizer steps** (set `-1` to train by `num_train_epochs` instead). |
| `learning_rate=2e-4` | Sets the **initial learning rate**. |
| `fp16=not is_bfloat16_supported()` | Uses **16-bit precision** if `bfloat16` is unavailable. |
| `bf16=is_bfloat16_supported()` | Uses **bfloat16 precision** if supported (better for some hardware). |
| `logging_steps=10` | Logs metrics **every 10 steps**. |
| `optim="adamw_8bit"` | Uses **8-bit AdamW optimizer** for memory-efficient training. |
| `weight_decay=0.01` | Applies **decoupled weight decay** (AdamW-style regularization) to reduce overfitting. |
| `lr_scheduler_type="linear"` | Uses a **linear learning rate decay** for stability. |
| `seed=3407` | Ensures **reproducibility**. |
| `output_dir="outputs"` | Saves model checkpoints and logs in the `"outputs"` directory. |
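
Because `max_steps` takes precedence over `num_train_epochs` in `TrainingArguments`, it is worth checking how much data these settings actually cover. A quick calculation, assuming a single GPU and the 500-example subset loaded earlier:

```python
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1          # assumption: single-GPU Kaggle session
max_steps = 60
num_examples = 500

effective_batch = per_device_batch * grad_accum_steps * num_gpus
samples_seen = effective_batch * max_steps
epochs_covered = samples_seen / num_examples

print(effective_batch, samples_seen, epochs_covered)  # 8 480 0.96
```

So this configuration stops just short of one pass over the data, which matches the `train/epoch` value of 0.96 in the run summary. To train for the full five epochs, set `max_steps=-1`.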

###  Why This Matters:
- **Memory Optimization** → **8-bit optimization & gradient accumulation** allow training on limited hardware.
- **Faster Training** → **Mixed precision (`fp16/bf16`)** speeds up computation.
- **Stable Learning** → **Linear LR scheduler** ensures smooth training progress.
- **Reproducibility** → Setting **seeds** makes runs more repeatable, although GPU kernels can still introduce small run-to-run differences.
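
To make the warm-up and linear decay concrete, the sketch below re-implements the shape of a linear schedule with warm-up in plain Python. The function `linear_lr` is a hypothetical stand-in written for this example, not a library call; it is intended to follow the same rule as the `linear` scheduler: ramp from 0 to the peak rate over the warm-up steps, then decay linearly to 0 at the final step.

```python
def linear_lr(step: int, peak_lr: float = 2e-4, warmup: int = 5, total: int = 60) -> float:
    """Learning rate after `step` optimizer updates (hypothetical helper)."""
    if step < warmup:
        return peak_lr * step / warmup  # linear ramp-up
    return peak_lr * max(0.0, (total - step) / (total - warmup))  # linear decay


for step in (0, 5, 30, 60):
    print(step, f"{linear_lr(step):.2e}")
```

The value of 0 at step 60 agrees with the final `train/learning_rate` of 0.0 in the run summary.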

## 10. Model Training

```python
trainer_stats = trainer.train()
```

```
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
```

| Step | Training Loss |
|------|---------------|
| 10 | 1.9188 |
| 20 | 1.4615 |
| 30 | 1.4023 |
| 40 | 1.3088 |
| 50 | 1.3443 |
| 60 | 1.314 |


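```python
# Close the Weights & Biases run and upload the remaining logs
wandb.finish()
```

Run history:

| Metric | Trend |
|-----------|------------|
| `train/epoch` | ▁▂▄▅▇██ |
| `train/global_step` | ▁▂▄▅▇██ |
| `train/grad_norm` | █▂▂▁▂▂ |
| `train/learning_rate` | █▇▅▄▂▁ |
| `train/loss` | █▃▂▁▁▁ |

Run summary:

| Metric | Value |
|-----------|------------|
| `total_flos` | 1.8014312853602304e+16 |
| `train/epoch` | 0.96 |
| `train/global_step` | 60.0 |
| `train/grad_norm` | 0.26026 |
| `train/learning_rate` | 0.0 |
| `train/loss` | 1.314 |
| `train_loss` | 1.45828 |
| `train_runtime` | 1239.958 |
| `train_samples_per_second` | 0.387 |
| `train_steps_per_second` | 0.048 |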
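
The summary figures are consistent with the logged losses and the batch settings, which makes for a useful sanity check. With `logging_steps=10`, each logged loss is an average over ten steps, so the mean of the six logged values should match `train_loss`; throughput follows from the runtime. The numbers below are copied from the output above, and the effective batch size of 8 is taken from the training banner.

```python
logged_losses = [1.9188, 1.4615, 1.4023, 1.3088, 1.3443, 1.314]
train_runtime_s = 1239.958
total_steps = 60
effective_batch = 8

mean_loss = sum(logged_losses) / len(logged_losses)
steps_per_second = total_steps / train_runtime_s
samples_per_second = total_steps * effective_batch / train_runtime_s

print(round(mean_loss, 5), round(steps_per_second, 3), round(samples_per_second, 3))
# 1.45828 0.048 0.387
```

At roughly 21 seconds per optimizer step, the 60-step run takes about 20 minutes on the T4.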


## 11. Model Inference After Fine-Tuning

Now that we have fine-tuned the model, we will evaluate its performance by running **inference on the same question** we used earlier. This comparison helps us understand the **impact of fine-tuning** on the model’s reasoning and response quality.

### Why Compare Before and After?
- **Assess Improvements**: Check if the model’s response is more **concise, structured, and aligned** with the dataset format.
- **Evaluate Reasoning**: Determine if the fine-tuned model provides a **more logical and accurate chain of thought**.
- **Measure Formatting Consistency**: Ensure that the output follows the **desired response style** instead of deviating into bullet points or unstructured answers.

With this post-training inference test, we can validate whether our **fine-tuning process successfully improved the model’s performance**!
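
One way to make this before-and-after comparison less subjective is to measure a few simple properties of each response. The helper below is a hypothetical sketch, not part of the original notebook: it separates the `<think>` block from the final answer, counts the words in the reasoning, and checks whether the answer contains bullet-style lines. The sample string is made up.

```python
import re


def summarize_response(text: str) -> dict:
    """Split a response into reasoning and answer and report simple style metrics."""
    match = re.search(r"<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        # No closed <think> block (for example, generation hit max_new_tokens)
        return {"reasoning_words": None, "answer": None, "has_bullets": None}
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    has_bullets = any(
        re.match(r"\s*([-*•]|\d+\.)\s+", line) for line in answer.splitlines()
    )
    return {
        "reasoning_words": len(reasoning.split()),
        "answer": answer,
        "has_bullets": has_bullets,
    }


sample = "<think>\nStress incontinence, so think urethra.\n</think>\nLikely normal residual volume."
print(summarize_response(sample))
```

Running the same helper on the pre- and post-fine-tuning outputs gives comparable numbers for reasoning length and answer format; it says nothing about whether the medical content is correct.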

```python
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
```

```
<think>
Okay, so let's think about this. We have a 61-year-old woman who's been dealing with involuntary urine loss during things like coughing or sneezing, but she's not leaking at night. That suggests she might have some kind of problem with her pelvic floor muscles or maybe her bladder.

Now, she's got a gynecological exam and a Q-tip test. Let's break that down. The Q-tip test is usually used to check for urethral obstruction. If it's positive, that means there's something blocking the urethra, like a urethral stricture or something else.

Given that she's had a positive Q-tip test, it's likely there's a urethral obstruction. That would mean her urethra is narrow, maybe due to a stricture or some kind of narrowing. So, her bladder can't empty properly during activities like coughing because the urethral obstruction is making it hard.

Now, let's think about what happens when her bladder can't empty. If there's a urethral obstruction, the bladder is forced to hold more urine, incre
```

## 12. Evaluating the Fine-Tuned Model

After running inference on the fine-tuned model, the response style changes noticeably:

- **More Direct Reasoning**: The chain of thought is **more direct and structured**, with fewer digressions.
- **Concise and Clear Answer**: The response gets **straight to the point**, without excessive explanation.
- **Improved Formatting**: The answer is presented as a **paragraph**, rather than in a scattered or bullet-point format.

### Conclusion:
The fine-tuning run shifted the model toward **more refined, structured, and contextually appropriate responses** in the style of the dataset. This suggests that our adjustments to the **dataset, prompt format, and training parameters** are having the intended effect.

Next, we can proceed with **evaluating performance metrics**, including the factual accuracy of the answers, and, if needed, fine-tune further for additional improvements!

## 13. Saving the Model Locally

Now that our fine-tuning process is complete, we will save the following components **locally** for future use:

- **Adapter**: Stores the **fine-tuned LoRA adapter** separately, allowing efficient loading and reuse.  
- **Full Model**: Saves the **entire fine-tuned model** to preserve the learned weights.  
- **Tokenizer**: Ensures that tokenization remains consistent across different projects.

### Why Save the Model?
- **Reuse in Other Projects**: Load the fine-tuned model without retraining from scratch.
- **Deployment Readiness**: Prepare the model for use in applications, inference pipelines, or APIs.
- **Experiment Tracking**: Maintain a checkpoint of the fine-tuned model for comparison with future versions.

With the model, adapter, and tokenizer saved, we can now use them seamlessly in **other projects and deployments**!

```python
# Define the local directory name for saving the fine-tuned model
new_model_local = "DeepSeek-R1-Medical-COT"

# Save the LoRA adapter weights and configuration locally
model.save_pretrained(new_model_local)

# Save the tokenizer to ensure consistent tokenization during future use
tokenizer.save_pretrained(new_model_local)

# Merge the LoRA weights into the base model and save the result in 16-bit precision,
# producing a standalone model that can be loaded without the adapter
model.save_pretrained_merged(new_model_local, tokenizer, save_method="merged_16bit")
```

```
Unsloth: You have 2 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 18.53 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
We will save to Disk and not RAM now.
Unsloth: Saving tokenizer... Done.
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00001-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00002-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00003-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00004-of-00004.bin...
Done.
```


## 14. Pushing the Model to Hugging Face Hub

To make our fine-tuned model accessible to the **AI community**, we will **push the adapter, tokenizer, and full model** to **Hugging Face Hub**. This allows developers and researchers to **easily integrate** the model into their systems.

### Why Push to Hugging Face?
- **Global Accessibility** → Anyone can download and use the model in their projects.
- **Collaboration** → Enables the AI community to fine-tune or improve upon our model.
- **Version Control** → Stores different versions for tracking improvements over time.
- **Seamless Deployment** → Can be directly integrated into **Hugging Face Pipelines, APIs, and other frameworks**.

Once uploaded, our fine-tuned **DeepSeek-R1-Medical-COT** model will be **publicly available** for research and development!

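```python
# Target repository on the Hugging Face Hub (replace with your own username/repository)
new_model_online = "sabahatatta/DeepSeek-R1-Medical-COT"

# Upload the LoRA adapter and the tokenizer
model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)
```

```python
# Save the merged 16-bit model locally, then upload it to the same repository
model.save_pretrained_merged(new_model_local, tokenizer, save_method="merged_16bit")
model.push_to_hub_merged(new_model_online, tokenizer, save_method="merged_16bit")
```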

## 15. Conclusion

The field of **Artificial Intelligence** is evolving at an unprecedented pace. The **open-source AI movement** is now challenging the dominance of proprietary models that have ruled the landscape for the past three years.

### The Rise of Open-Source LLMs  
- Open-source **Large Language Models (LLMs)** are becoming **more powerful, efficient, and accessible**.
- Fine-tuning these models is now possible **even on lower compute and memory resources**.
- This democratization of AI enables researchers and developers to **build custom, domain-specific solutions** without relying on closed systems.

### What We Achieved in This Tutorial  
In this tutorial, we explored:

- **DeepSeek R1** as a reasoning model and its **distilled version**.
- How to **fine-tune** the model for **medical Q&A tasks**.
- The importance of **structured reasoning models** for applications in **medicine, emergency services, and healthcare**.

### The Future of AI  
Around the time of the **DeepSeek R1** release, OpenAI introduced:
- **OpenAI's o3** → A more advanced **reasoning-focused LLM**.
- **Operator** → An AI agent, powered by the **Computer-Using Agent (CUA)** model, capable of **autonomously navigating websites and performing tasks**.

These advancements signal a shift toward **more intelligent, autonomous, and highly capable AI systems**. With continued innovation, **open-source AI will shape the future**, bringing powerful models into the hands of researchers, developers, and enterprises worldwide.