# Fine-tune DeepSeek-R1

- Fine-Tune DeepSeek R1 1.5B on Free GCP Colab T4: A Hands-On Guide with LoRA, [link](https://www.linkedin.com/pulse/fine-tune-deepseek-r1-15b-free-gcp-colab-t4-hands-on-konathala-phd--4bluf/)
- SFT Trainer [link](https://huggingface.co/docs/trl/en/sft_trainer)

## Installation

In [None]:
# install the necessary packages
! pip install transformers datasets peft torch accelerate huggingface_hub



## Imports and Model Loading

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model name
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
# model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B" # requires far more GPU memory than a free Colab T4 provides

# Load pre-trained model & tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
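
Before loading a checkpoint, it can help to estimate whether the weights will fit in GPU memory. The sketch below is a back-of-the-envelope calculation, not a measurement. The parameter count is an assumption inferred from the 3.55 GB `model.safetensors` download reported later in this notebook (taken to be stored at 2 bytes per parameter), and `estimate_weight_memory_gb` is a hypothetical helper.

```python
# Rough weight-memory estimate; ignores activations, gradients and optimizer state.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def estimate_weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the model weights in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumption: parameter count inferred from a 3.55 GB checkpoint at 2 bytes per parameter.
assumed_params = 3.55e9 / 2

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{estimate_weight_memory_gb(assumed_params, dtype):.2f} GB")
```

Under these assumptions the weights alone take roughly 7 GB in float32, which leaves headroom on a 16 GB T4 for LoRA gradients and activations.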


## Generate Domain-specific Document

In [None]:
text = """
Artificial Intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI applications are revolutionizing the way we approach problem-solving and decision-making. The integration of AI into daily operations enhances efficiency, accuracy, and the ability to predict future trends. As AI technology continues to evolve, it is crucial for professionals to stay informed about the latest developments and understand how to leverage these tools effectively.
"""

## Convert Text Data into a Hugging Face Dataset

In [None]:
from datasets import Dataset

# Split the text into sentences for better learning
sentences = text.split(". ")

# Create a Hugging Face dataset
dataset = Dataset.from_dict({"text": sentences})

Examine the data schema

In [None]:
print(type(dataset))

<class 'datasets.arrow_dataset.Dataset'>


In [None]:
dataset

Dataset({
    features: ['text'],
    num_rows: 4
})

### Set Up Tokenizer

In [None]:
def preprocess_function(examples, key='text'):
    inputs = tokenizer(
        examples[key], truncation=True, padding="max_length", max_length=512
    )

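    # Labels mirror input_ids (the model shifts them internally when computing the loss);
    # padding positions are set to -100 so the loss ignores them
    inputs["labels"] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(inputs["input_ids"], inputs["attention_mask"])
    ]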
    return inputs

# Apply tokenization
tokenized_dataset = dataset.map(preprocess_function, batched=True)


In [None]:
tokenized_dataset

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 4
})
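
In Hugging Face causal LM models, the one-position shift between inputs and targets happens inside the model's loss computation, which is why the `labels` column can be built directly from `input_ids`. The pure-Python sketch below illustrates that pairing and how positions labelled `-100` are skipped. The token IDs are made up and `shifted_pairs` is a hypothetical helper, so treat this as an illustration of the idea rather than the library's implementation.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def shifted_pairs(input_ids, labels):
    """Pair the token at position i with the label at position i + 1,
    dropping pairs whose label is IGNORE_INDEX."""
    pairs = []
    for context_token, target in zip(input_ids[:-1], labels[1:]):
        if target != IGNORE_INDEX:
            pairs.append((context_token, target))
    return pairs

# Hypothetical token IDs: three real tokens followed by two padding tokens (ID 0).
input_ids = [11, 22, 33, 0, 0]
attention_mask = [1, 1, 1, 0, 0]

# Mask padding so it does not contribute to the loss.
labels = [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]

print(labels)
print(shifted_pairs(input_ids, labels))
```

Only the real tokens produce training targets here; the padded tail contributes nothing to the loss.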

In [None]:
tokenized_dataset['text']

['\nArtificial Intelligence (AI) is transforming industries across the globe',
 'From healthcare to finance, AI applications are revolutionizing the way we approach problem-solving and decision-making',
 'The integration of AI into daily operations enhances efficiency, accuracy, and the ability to predict future trends',
 'As AI technology continues to evolve, it is crucial for professionals to stay informed about the latest developments and understand how to leverage these tools effectively.\n']
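
The output above shows that `text.split(". ")` keeps the leading newline on the first sentence and the trailing `.\n` on the last one. A minimal sketch of a tidier split follows. `split_sentences` is a hypothetical helper, the sample text is shortened for illustration, and the regular expression is a simple heuristic that will mis-split abbreviations such as "e.g. ".

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample = """
AI is transforming industries. It improves efficiency and accuracy.
Professionals should stay informed.
"""
sentences = split_sentences(sample)
print(sentences)
```

The resulting list can be passed to `Dataset.from_dict({"text": sentences})` in the same way as before.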

## Generate Domain-specific Document: Real Dataset

In [None]:
from datasets import load_dataset

# dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:100]", trust_remote_code=True)
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train", trust_remote_code=True)


In [None]:
# Format the dataset
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

In [None]:
def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    # Avoid shadowing the built-in `input`; append EOS so the model learns where to stop
    for question, cot, response in zip(inputs, cots, outputs):
        text = train_prompt_style.format(question, cot, response) + tokenizer.eos_token
        texts.append(text)
    return {
        "text": texts,
    }

dataset = dataset.map(formatting_prompts_func, batched=True)


In [None]:
dataset

Dataset({
    features: ['Question', 'Complex_CoT', 'Response', 'text'],
    num_rows: 25371
})
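
At inference time, one common approach is to reuse the training template but stop right after the opening `<think>` tag, so the model writes the reasoning and the answer itself. The sketch below is an assumption about how such a prompt could be built, not part of the original notebook. `inference_prompt_style`, `build_inference_prompt` and the sample question are hypothetical, and the template is abbreviated; in practice the preamble should match the training template exactly.

```python
# Abbreviated version of the training template, ending where generation should begin.
inference_prompt_style = """### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
"""

def build_inference_prompt(question: str) -> str:
    """Fill the question slot; the chain of thought and answer are left to the model."""
    return inference_prompt_style.format(question.strip())

prompt = build_inference_prompt("  What are common causes of iron-deficiency anemia?  ")
print(prompt)
```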

In [None]:
dataset['text'][0]

"Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.\nPlease answer the following medical question.\n\n### Question:\nA 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?\n\n### Response:\n<think>\nOkay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her abdomi


In [None]:
def preprocess_function(examples, key='text'):
    inputs = tokenizer(
        examples[key], truncation=True, padding="max_length", max_length=512
    )

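    # Labels mirror input_ids (the model shifts them internally when computing the loss);
    # padding positions are set to -100 so the loss ignores them
    inputs["labels"] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(inputs["input_ids"], inputs["attention_mask"])
    ]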
    return inputs

In [None]:
# Apply tokenization
tokenized_dataset = dataset.map(preprocess_function, batched=True)


## Set Up LoRA for Efficient Fine-tuning

In [None]:
from peft import get_peft_model, LoraConfig, TaskType

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)
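
LoRA learns the update to each targeted weight matrix of shape `d_out × d_in` as two low-rank factors, adding `r * (d_in + d_out)` trainable parameters per matrix and scaling the update by `lora_alpha / r`. The sketch below counts those parameters for the configuration above. The layer count and projection shapes are assumptions about the 1.5B Qwen-based checkpoint (check `model.config` to confirm them), and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the LoRA factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

r, lora_alpha = 16, 32

# Assumed (in_features, out_features) per projection and assumed depth; verify against model.config.
assumed_shapes = {"q_proj": (1536, 1536), "v_proj": (1536, 256)}
assumed_num_layers = 28

per_layer = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in assumed_shapes.values())
total = per_layer * assumed_num_layers

print(f"scaling factor: {lora_alpha / r}")
print(f"trainable LoRA parameters: {total:,}")
print(f"approx. adapter size in float32: {total * 4 / 1e6:.2f} MB")
```

Under these assumptions the float32 adapter comes to roughly 8.7 MB, in line with the 8.73 MB `adapter_model.safetensors` upload shown later. Calling `model.print_trainable_parameters()` on the wrapped model reports the actual count.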

## Configuration of Training Hyperparameters

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=1,  # Adjusted for GPU memory limitations
    gradient_accumulation_steps=8,  # To simulate a larger batch size
    warmup_steps=200,
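    # max_steps=100,  # Uncomment for a short run; when set, it overrides num_train_epochs
    num_train_epochs=1000,  # Suited only to the tiny toy dataset; use a small value (or max_steps) for the full medical dataset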
    learning_rate=2e-4,
    fp16=True,  # Enable mixed precision training
    logging_steps=10,
    output_dir="outputs",
    report_to="none",
    remove_unused_columns=False,
)
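
The batch-size settings above interact with the dataset size to determine how long training runs. The sketch below works through that arithmetic for the values in this notebook. The single-GPU setting is an assumption, and the steps-per-epoch figure is approximate because the handling of the final partial batch depends on the `Trainer` version.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1        # assumption: a single GPU
num_examples = 25371   # size of the train split loaded above
warmup_steps = 200

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Approximate: the exact rounding of the last partial batch depends on the Trainer version.
steps_per_epoch = num_examples // effective_batch_size

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: ~{steps_per_epoch}")
print(f"warmup covers ~{warmup_steps / steps_per_epoch:.1%} of one epoch")
```

At roughly 3,171 optimizer steps per epoch, a large `num_train_epochs` translates into millions of steps on the full dataset, so `max_steps` is the more practical control for a quick experiment.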

## Initialize `trainer` and Free Up Memory

In [None]:
import gc
import torch

# Free cached GPU memory left over from earlier cells
gc.collect()  # Garbage collection
torch.cuda.empty_cache()  # Clears CUDA cache

# Initialize the Trainer; it moves the model to the training device itself,
# so no manual .to("cpu") / .to("cuda") round trip is needed
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Note: calling torch.compile(model) after this point would not affect training,
# because the Trainer keeps its own reference to the uncompiled model.
# To compile, set torch_compile=True in TrainingArguments instead.

In [None]:
%%time

# Start training
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 2.6087        |
| 20   | 2.7304        |
| 30   | 2.6445        |
| 40   | 2.5709        |
| 50   | 2.4375        |
| 60   | 2.4186        |
| 70   | 2.3112        |
| 80   | 2.2021        |
| 90   | 1.9731        |
| 100  | 1.8849        |
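
The logged training loss is a mean cross-entropy in nats, so exponentiating it gives perplexity, which can be easier to interpret. The sketch below converts the values logged above and reports the relative loss reduction over the run. It only post-processes the numbers shown in the log, and the variable names are illustrative.

```python
import math

# Training loss logged every 10 steps (copied from the run above).
loss_log = {
    10: 2.6087, 20: 2.7304, 30: 2.6445, 40: 2.5709, 50: 2.4375,
    60: 2.4186, 70: 2.3112, 80: 2.2021, 90: 1.9731, 100: 1.8849,
}

for step, loss in loss_log.items():
    print(f"step {step:>3}: loss={loss:.4f}  perplexity={math.exp(loss):.2f}")

first, last = loss_log[10], loss_log[100]
relative_drop = (first - last) / first
print(f"relative loss reduction: {relative_drop:.1%}")
```

Training loss alone does not show generalization; holding out part of the dataset for evaluation would give a more reliable signal.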



## Save Models

In [None]:
domain = "medical-o1-reasoning"
model.save_pretrained(f"fine-tuned-deepseek-r1-1.5b-{domain}")
tokenizer.save_pretrained(f"fine-tuned-deepseek-r1-1.5b-{domain}")

## Push Artifact to HuggingFace

In [None]:
# from huggingface_hub import login

# login()


In [None]:
from huggingface_hub import HfApi

repo_name = f"fine-tuned-deepseek-r1-1.5b-{domain}"  # Change this to your desired repo name
username = "eagle0504"  # Replace with your Hugging Face username

api = HfApi()
repo_url = api.create_repo(repo_id=f"{username}/{repo_name}", exist_ok=True)
print(f"Model repository created at: {repo_url}")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import HfApi

# Define paths
local_dir = f"fine-tuned-deepseek-r1-1.5b-{domain}"
repo_id = f"{username}/{repo_name}"  # Your Hugging Face repo ID

# Load the model and tokenizer from local storage
model = AutoModelForCausalLM.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)

# Push to Hugging Face Hub
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

print(f"Model successfully uploaded to: https://huggingface.co/{repo_id}")


Model successfully uploaded to: https://huggingface.co/eagle0504/fine-tuned-deepseek-r1-1.5b-some-ai-domain-v3


## Load Fine-tuned Model

https://huggingface.co/eagle0504/fine-tuned-deepseek-r1-1.5b-medical-o1-v2

If this is your first time using the Hugging Face Hub in this environment, you'll need to log in.

In [None]:
# from huggingface_hub import login

# login()

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Define the path where the fine-tuned model is saved
model_path = "eagle0504/fine-tuned-deepseek-r1-1.5b-medical-o1-v2"

# Load the fine-tuned model and tokenizer directly from the Hub
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# The model loads on CPU; `device` records where it could run instead
device = "cuda" if torch.cuda.is_available() else "cpu"
# model.to(device)  # Uncomment to run on GPU; the inputs must then be on the same device


## Inference

In [None]:
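def generate_text(prompt, max_length=100):
    # Place the inputs on whichever device the model lives on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # do_sample=True makes temperature, top_k and top_p take effect;
        # max_length counts the prompt tokens as well as the generated ones
        output = model.generate(
            **inputs, max_length=max_length, do_sample=True,
            temperature=0.7, top_k=50, top_p=0.9,
        )

    return tokenizer.decode(output[0], skip_special_tokens=True)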
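
The `temperature`, `top_k` and `top_p` arguments reshape the next-token distribution before a token is sampled. The pure-Python sketch below mirrors the idea on made-up logits: scale by temperature, keep the `top_k` best tokens, then keep the smallest set whose cumulative probability reaches `top_p`. `filter_probs` is a hypothetical helper and is not the library's exact implementation.

```python
import math

def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature scaling, top-k, then top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring tokens, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    norm = sum(weights)
    probs = [w / norm for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {idx: p / total for idx, p in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
dist = filter_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(dist)
```

Lower temperatures sharpen the distribution toward the top token, while smaller `top_k` or `top_p` values cut off more of the low-probability tail.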

```
text = """
Artificial Intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI applications are revolutionizing the way we approach problem-solving and decision-making. The integration of AI into daily operations enhances efficiency, accuracy, and the ability to predict future trends. As AI technology continues to evolve, it is crucial for professionals to stay informed about the latest developments and understand how to leverage these tools effectively.
"""
```

In [None]:
%%time

# Test
prompt = "Artificial Intelligence (AI) is transforming industries"
output = generate_text(prompt, max_length=1024)
print(output)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Artificial Intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI applications are widespread. However, the integration of AI into existing systems poses challenges. One of these challenges is the difficulty in accurately estimating the number of existing AI systems in a given region. This estimation is crucial for planning the deployment of AI tools that can enhance the efficiency and effectiveness of these systems. How can we address this challenge?

The answer should be a paragraph that starts with "Artificial Intelligence is transforming industries..." and ends with "The answer is...". It should be concise and clear, avoiding any unnecessary details.

The answer should mention at least three different methods to estimate the number of existing AI systems in a region, and for each method, provide a brief explanation and example of how it could be implemented.
To start, I can think about how we can observe existing AI systems in specific regions
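
Because the fine-tuning template wraps the chain of thought in `<think>` tags, generated text for medical prompts can be split into reasoning and final answer. The sketch below shows one way to do that with a regular expression. `split_reasoning` is a hypothetical helper, the sample string is made up, and the approach assumes the tags survive decoding.

```python
import re

def split_reasoning(generated: str):
    """Return (reasoning, answer); reasoning is None when no complete <think> block exists."""
    match = re.search(r"<think>(.*?)</think>", generated, flags=re.DOTALL)
    if match is None:
        return None, generated.strip()
    return match.group(1).strip(), generated[match.end():].strip()

# Made-up sample in the shape of the training template's response section.
sample = "<think>\nstep one, then step two\n</think>\nfinal answer"
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

If generation stops before `</think>` is emitted, for example because `max_length` was reached, the helper returns the whole text as the answer.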