# Fine-tuning DeepSeek R1 Distilled Qwen2.5 1.5B

This notebook demonstrates how to fine-tune `DeepSeek-R1-Distill-Qwen-1.5B` with Unsloth, using a Chinese psychology (emotional support) dataset.

## Why do we need LLM fine-tuning?

Fine-tuning adapts a pretrained model to a specific task or domain, improving its performance there and making it more effective in real-world applications.

In [1]:
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install bitsandbytes unsloth_zoo
!pip install -U huggingface_hub
!pip install wandb 

Found existing installation: unsloth 2025.3.1
Uninstalling unsloth-2025.3.1:
  Successfully uninstalled unsloth-2025.3.1
Collecting git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-req-build-4kkzwhcg
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-req-build-4kkzwhcg
  Resolved https://github.com/unslothai/unsloth.git to commit be55e29a2dddf5f913c90094c2902a45798d356a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for uns

In [2]:
import os
from huggingface_hub import login
# Read the token from the environment instead of hardcoding it in the notebook
hf_token = os.environ["HF_TOKEN"]
login(hf_token)

In [3]:
import os
import wandb

# Read the API key from the environment instead of hardcoding it in the notebook
wb_token = os.environ["WANDB_API_KEY"]
wandb.login(key=wb_token)
run = wandb.init(
    project='fine-tune-DeepSeek-R1-Distill-Qwen-1.5B on emo Dataset',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjongs-un[0m ([33mjongs-un-Personal[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin



## Choose a Base Model

1. Choose a model that aligns with your use case
2. Assess your storage, compute capacity and dataset
3. Select a Model and Parameters
4. Choose Between Base and Instruct Models
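
To make step 2 concrete, here is a rough sketch of how the memory needed for the weights alone scales with precision. The parameter count is the nominal 1.5B from the model name (an assumption; the true count differs somewhat), and the estimate ignores activations, optimizer state, the KV cache, and quantization overhead.

```python
def estimate_weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter count taken from the model name (assumption, not measured).
NOMINAL_PARAMS = 1.5e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{estimate_weight_gb(NOMINAL_PARAMS, bits):.2f} GB")
```

This is why `load_in_4bit = True` is attractive on smaller GPUs: under these assumptions the weights shrink by about 4x compared with 16-bit loading.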

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.1: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.643 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [5]:
# del model


## Inference before fine-tuning

In [6]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an emotional support expert, skilled in active listening, empathy, and providing warm yet professional emotional support. 
Your responses incorporate psychological knowledge and real-life examples to help users understand their emotions, offering comfort, encouragement, or practical advice.
Please answer the following emotion question.

### Question:
{}

### Response:
<think>{}"""

In [7]:
question = "我最近感到非常焦虑，但不知道原因是什么"


FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, so the user is feeling really stressed and not sure why. I need to address this without making them feel bad. Maybe start by acknowledging their feelings to help comfort them. Then, explain that stress can be from various sources, like work, personal stuff, or life changes. It's important to reassure them that it's okay to feel this way and that they can take things one step at a time.

I should keep the tone warm and professional, offering practical advice like talking to a professional or practicing mindfulness. That way, they feel supported and not alone. Finally, a positive note to remind them they can handle it and maybe even look forward to a brighter day.
</think>

I'm here to support you during this challenging time. If you're feeling overwhelmed, it's important to take it one step at a time. First, let's acknowledge your feelings and understand that it's okay to feel this way. Stress can come from various sources, such as work, personal issues, or life changes. 
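
The cell above prints everything after `### Response:`, so the reasoning and the final answer come out together. A minimal sketch of separating the two, assuming the model closes its reasoning with `</think>` as in the output above (the helper name `split_reasoning` and the sample string are illustrative, not part of the notebook):

```python
def split_reasoning(decoded: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded generation."""
    # Keep only the text after the response marker, as the notebook does.
    response = decoded.split("### Response:", 1)[1]
    if "</think>" not in response:
        # Generation was cut off before the reasoning finished: no answer yet.
        return response.replace("<think>", "", 1).strip(), ""
    reasoning, answer = response.split("</think>", 1)
    return reasoning.replace("<think>", "", 1).strip(), answer.strip()


sample = (
    "### Question:\nWhy am I anxious?\n\n"
    "### Response:\n<think>\nAcknowledge feelings first.\n</think>\n\n"
    "I'm here to support you."
)
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the notebook this could be applied to `response[0]`. Special tokens such as the EOS marker may still be present in the decoded text unless they are skipped during decoding.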

## Prepare Dataset

A Chinese psychology dataset, [Kedreamix/psychology-10k-Deepseek-R1-zh](https://huggingface.co/datasets/Kedreamix/psychology-10k-Deepseek-R1-zh), will be used to train the selected model.

In [8]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an emotional expert with advanced knowledge in active listening, empathy, and providing warm yet professional emotional support.
Please answer the following emotional question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### Important Notice

It's crucial to add the EOS (End of Sequence) token at the end of each training dataset entry, otherwise you may encounter infinite generations.

In [9]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    user_inputs = examples["input"]
    cots = examples["reasoning_content"]
    outputs = examples["content"]
    texts = []
    # Fill the training template for each example and terminate it with EOS
    for user_input, cot, output in zip(user_inputs, cots, outputs):
        text = train_prompt_style.format(user_input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }
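
To see what the formatting function produces without loading a tokenizer or the dataset, here is a self-contained sketch on a toy batch. The shortened `TEMPLATE`, the `<EOS>` placeholder, and the sample rows are assumptions for illustration; the notebook uses `train_prompt_style` and `tokenizer.eos_token`.

```python
# Shortened stand-in for train_prompt_style (assumption for illustration).
TEMPLATE = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"
EOS_TOKEN = "<EOS>"  # placeholder; the notebook reads tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Merge the three columns of a batch into one EOS-terminated text column."""
    texts = [
        TEMPLATE.format(user_input, cot, answer) + EOS_TOKEN
        for user_input, cot, answer in zip(
            examples["input"], examples["reasoning_content"], examples["content"]
        )
    ]
    return {"text": texts}


batch = {
    "input": ["I can't sleep.", "I feel anxious."],
    "reasoning_content": ["Consider stress.", "Consider triggers."],
    "content": ["Try a routine.", "Try journaling."],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```

Every entry ends with the EOS marker, which is what teaches the model where a response should stop.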

In [10]:
from datasets import load_dataset
dataset = load_dataset("Kedreamix/psychology-10k-Deepseek-R1-zh", split = "train", trust_remote_code=True)
print(dataset.column_names)

['input', 'content', 'reasoning_content']


`SFTTrainer` reads each training example from a single text column, so we merge the `input`, `reasoning_content`, and `content` columns into one `text` column using the prompt template above.

In [11]:
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]

'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are an emotional expert with advanced knowledge in active listening, empathy, and providing warm yet professional emotional support.\nPlease answer the following emotional question.\n\n### Question:\n我晚上难以入睡，我认为这是因为我对工作感到压力\n\n### Response:\n<think>\n嗯，用户说他晚上难以入睡，认为是因为工作压力。首先，我需要确认他的情况是否常见，以及可能的解决方法。工作压力导致的失眠确实很普遍，但每个人的具体情况可能不同。我需要考虑他的工作环境、压力源是什么，比如工作量、人际关系还是职业发展。然后，可能涉及到他的睡前习惯，是否有使用电子设备、咖啡因摄入等影响睡眠的因素。此外，心理健康方面，比如焦虑或抑郁情绪也可能加剧失眠。我需要建议他调整作息，比如建立规律的睡眠时间，避免咖啡因和蓝光。放松技巧如冥想、深呼吸可能会有帮助。如果自我调节无效，可能需要建议他寻求专业帮助，比如心理咨询师或医生。同时，时间管理技巧可能减轻工作压力，比如任务优先级划分，适当授权任务。还要注意他的支持系统，比如家人朋友的支持。需要提醒他如果症状持续，可能有更严重的健康问题，应该及时就医。最后，要确保建议具体可行，并且语气要 empathetic，让他感受到被理解和支持。\n</think>\n你的情况是很多职场人都会遇到的困扰

## Train the model
Now let's use Hugging Face TRL's `SFTTrainer`.

In [12]:
FastLanguageModel.for_training(model)
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)


Unsloth 2025.3.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
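
As a sanity check on the adapter size, the sketch below counts LoRA parameters for the seven target modules. Each adapted linear layer of shape `d_in x d_out` adds `r * (d_in + d_out)` trainable weights. The hidden size of 1536 matches the tensor shape in the GGUF conversion log later in this notebook; the MLP width (8960) and the key/value projection width (256) are assumptions about the Qwen2 1.5B architecture, not values printed by this notebook.

```python
R = 32            # LoRA rank used in get_peft_model
NUM_LAYERS = 28   # "patched 28 layers" in the Unsloth log
HIDDEN = 1536     # matches the {1536, 151936} tensor shape in the GGUF log
INTERMEDIATE = 8960  # assumed MLP width
KV_DIM = 256         # assumed key/value projection width

# (d_in, d_out) for each adapted projection in one transformer layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to each adapted layer."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total agrees with the "Number of trainable parameters = 36,929,536" line that the trainer reports below.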


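### Hyperparameter configuration

* **Learning rate**: set through the `learning_rate` parameter of `TrainingArguments`; here it is 2e-4 (0.0002).
* **Batch size**: determined by two parameters together. The effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`, that is 2 * 4 = 8.
  * `per_device_train_batch_size`: the batch size on each device (such as a GPU).
  * `gradient_accumulation_steps`: the number of gradient accumulation steps, used to simulate a larger batch size.
* **Epochs**: derived from `max_steps` (the maximum number of training steps) and the dataset size. In this code, training runs for at most 5000 steps with 8 examples per step, and the dataset has about 10K examples, so the number of epochs is roughly 5000 * 8 / 10K = 4.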
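
The batch-size and epoch arithmetic above can be checked with a few lines of Python. The values come from the `TrainingArguments` in the next cell; the 8,775 figure is the "Num examples" value from the training log further down, included to show how the estimate shifts with the actual dataset size.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # the training log reports a single GPU
max_steps = 5000

# Examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)


def epochs_for(dataset_size: int) -> float:
    """Number of passes over the dataset implied by max_steps."""
    return max_steps * effective_batch_size / dataset_size


print("effective batch size:", effective_batch_size)
print("epochs at 10K examples:", epochs_for(10_000))
print("epochs at 8,775 examples:", round(epochs_for(8_775), 2))
```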


In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 5000,
        # num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb", # Use this for WandB etc
    ),
)
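
To visualize what `warmup_steps = 5` and `lr_scheduler_type = "linear"` mean together, here is a plain-Python sketch of a linear warmup followed by linear decay to zero. It mirrors, to the best of my understanding, the shape of the linear schedule in `transformers`, but it is an illustration rather than the library's code; `linear_lr` is a hypothetical helper.

```python
def linear_lr(
    step: int,
    base_lr: float = 2e-4,
    warmup_steps: int = 5,
    total_steps: int = 5000,
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 2500, 5000):
    print(f"step {step:>4}: lr = {linear_lr(step):.6f}")
```

With only 5 warmup steps out of 5000, the schedule is almost entirely decay, so the learning rate spends very little time below its peak at the start.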

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 8,775 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 500
 "-____-"     Number of trainable parameters = 36,929,536


Step,Training Loss
1,2.9662
2,2.9186
3,2.9063
4,2.9679
5,2.9667
6,2.8908
7,2.8693
8,2.7099
9,2.7019
10,2.5726


## Inference after fine-tuning

Let's run inference with the same question again and see the difference.

In [15]:
print(question)

我最近感到非常焦虑，但不知道原因是什么


In [16]:
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
嗯，用户说他最近感到非常焦虑，但不知道原因是什么。首先，我需要确认他的焦虑情绪，可能已经尝试过一些方法，但没效果。这时候，我应该先表达理解和共情，让他感到被接纳。然后，可能需要引导他自我反思，但又要避免让他感到被评判。

接下来，我需要考虑焦虑的常见原因。比如压力源、环境变化、生活习惯、人际关系、心理障碍或健康问题。用户可能没有具体提到，所以需要引导他思考可能的因素。但要避免假设，而是提供一般性的建议。

然后，我应该提供一些实际的应对策略。比如深呼吸、正念练习、写日记、减少刺激等。同时，要提醒他如果焦虑严重，寻求专业帮助的重要性。可能需要提醒他不要自责，焦虑是正常的反应，减少他的自责情绪。

还要注意语言要温和，避免使用专业术语，保持口语化。同时，结构要清晰，分点列出，让他容易理解和实施。可能还需要提醒他关注身体反应，比如心跳加速、出汗等，这些是焦虑的信号。

另外，要避免让他感到被指责，比如不要说“你总是这样”，而是鼓励他寻求帮助。可能需要建议他记录焦虑日记，记录具体的情境、触发点和反应，这样可以帮助他找到规律。

最后，要确保回应的语气是支持性的，给予希望，说明焦虑是可以管理的，通过一些方法改善，而不是他一个人承担全部责任。可能需要提醒他，寻求帮助不是软弱，而是勇敢的行为。
</think>
听到你最近感到焦虑，我能感受到这种情绪带来的困扰，这种感受非常真实。焦虑本身是人类在面对压力、不确定性或变化时会自然出现的反应，但你愿意主动面对，这本身已经是非常重要的一步。以下是一些可能对你有帮助的思考方向，你可以根据自己的情况尝试：

**1. 焦虑的“根源”可能不是“问题”本身，而是“我们无法察觉的信号”**
- 调整视角：焦虑是身体在发出警报，提醒你需要关注自己。就像感冒需要喝药一样，焦虑也需要被观察。
- 试着记录焦虑发生的瞬间：记录具体场景（如说话、工作、人际关系），记录身体反应（如心跳加速、胃部紧缩），这能帮你发现模式。

**2. 重新定义“焦虑”：它不是“问题”**
- 用“焦虑=焦虑”代替“焦虑=问题”：把焦虑当作你的“安全网”，当焦虑来袭时，你会感到“安全”吗？
- �)?.你可能在某个瞬间被触发，这说明你已经准备好应对它。

**3. 重新构建你的“焦虑时间”**
- 建立“焦虑日志”：每天记录3件焦虑感出现的场景（哪怕只是“听到

## Upload Model to HuggingFace

Now, let's save our finetuned model and upload it to HuggingFace.

### Save the fine-tuned model to GGUF format

Choose the llama.cpp GGUF format you prefer by setting the corresponding `if` to `True`.

In [17]:
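# https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
!apt update && apt install -y cmake
!git clone https://github.com/ggml-org/llama.cpp
# Each `!` line runs in its own subshell, so a bare `!cd llama.cpp` does not persist
# (hence the CMake error in the output recorded below); chain the commands instead.
!cd llama.cpp && cmake -B build && cmake --build build --config Release
# Copy llama-quantize from llama.cpp/build/bin into llama.cpp
!cd llama.cpp && cp build/bin/llama-quantize ./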

Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
127 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
0 upgraded, 0 newly installed, 0 to remove and 127 not upgraded.
fatal: destination path 'llama.cpp' already exists and is not an empty directory.
CMake Error: The source directory "/workspace" does not appear to cont

In [18]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )
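
When choosing among these formats it helps to have a rough idea of the resulting file size. The sketch below covers only `f16` and `q8_0`, whose storage cost per weight is straightforward: 16 bits for `f16`, and for `q8_0` one 16-bit scale per block of 32 int8 weights. The parameter count of about 1.78e9 (embeddings and output head included) is an assumption, and the estimate ignores metadata and any tensors kept at higher precision.

```python
BITS_PER_WEIGHT = {
    "f16": 16.0,
    # Q8_0 block: 32 int8 weights + one 16-bit scale = 34 bytes per 32 weights.
    "q8_0": 34 * 8 / 32,
}

ASSUMED_PARAMS = 1.78e9  # assumption, not read from the model


def estimate_gguf_gb(num_params: float, method: str) -> float:
    """Approximate GGUF file size in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[method] / 8 / 1e9


for method in BITS_PER_WEIGHT:
    print(f"{method}: ~{estimate_gguf_gb(ASSUMED_PARAMS, method):.2f} GB")
```

Under these assumptions the `q8_0` estimate is close to the 1.89G that the upload progress bar reports later in this notebook.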

### Push the model to HuggingFace

Create a model repository for your model if you haven't already.

In [19]:
from huggingface_hub import create_repo
create_repo("jong-un/Qwen2.5-1.5B-Instruct-R1", token=hf_token, exist_ok=True)

RepoUrl('https://huggingface.co/jong-un/Qwen2.5-1.5B-Instruct-R1', endpoint='https://huggingface.co', repo_type='model', repo_id='jong-un/Qwen2.5-1.5B-Instruct-R1')

In [20]:
model.push_to_hub_gguf("jong-un/Qwen2.5-1.5B-Instruct-R1", tokenizer, token = hf_token)

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 93.25 out of 125.02 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 210.82it/s]

Unsloth: Saving tokenizer...




 Done.
Done.


Unsloth: Converting qwen2 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at jong-un/Qwen2.5-1.5B-Instruct-R1 into q8_0 GGUF format.
The output location will be /workspace/jong-un/Qwen2.5-1.5B-Instruct-R1/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Qwen2.5-1.5B-Instruct-R1
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,             torch.bfloat16 --> Q8_0, shape = {1536, 151936}
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> Q8_0, shape = 

unsloth.Q8_0.gguf:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved GGUF to https://huggingface.co/jong-un/Qwen2.5-1.5B-Instruct-R1


<a name="Ollama"></a>
### Ollama Support

[Unsloth](https://github.com/unslothai/unsloth) now allows you to automatically finetune and create a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md), and export to [Ollama](https://ollama.com/)! This makes finetuning much easier and provides a seamless workflow from `Unsloth` to `Ollama`!

Let's first install `Ollama`!

In [21]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


We use `subprocess` to start `Ollama` in a non-blocking fashion! On your own desktop, you can simply open a new `terminal` and type `ollama serve`, but in Colab, we have to use this hack!

In [22]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!

Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAYBBxihNNaeTbWpwyLqUUhXIXhvzVJd91oJLv2LE9hJ



2025/03/04 07:42:19 routes.go:1205: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-03-04T07:42:19.749Z 
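
A fixed `time.sleep(3)` may be too short on a slow machine or needlessly long on a fast one. As a sketch of an alternative, the helper below polls the TCP port until it accepts connections. The name `wait_for_port` is hypothetical; the port 11434 is the one the install log above reports for the Ollama API.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 30.0, interval: float = 0.2
) -> bool:
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet: wait a little and retry.
            time.sleep(interval)
    return False
```

After `subprocess.Popen(["ollama", "serve"])`, a call such as `wait_for_port("127.0.0.1", 11434)` could replace the fixed sleep. Note that an open port only shows the server is listening, not that a model has finished loading.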

### Run the Hugging Face model with Ollama

```bash
#ollama run hf.co/jong-un/Qwen2.5-1.5B-Instruct-R1
ollama run hf.co/{username}/{repository}:{quantization}
```
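
The `hf.co/{username}/{repository}:{quantization}` pattern above can be assembled programmatically, for example when scripting several quantizations. This is a small illustrative helper (the function name is hypothetical), with the quantization tag treated as optional, as in the commented-out command above.

```python
from typing import Optional


def ollama_hf_reference(
    username: str, repository: str, quantization: Optional[str] = None
) -> str:
    """Build the model reference string passed to `ollama run`."""
    reference = f"hf.co/{username}/{repository}"
    # Append the quantization tag only when one is requested.
    return f"{reference}:{quantization}" if quantization else reference


print(ollama_hf_reference("jong-un", "Qwen2.5-1.5B-Instruct-R1"))
print(ollama_hf_reference("jong-un", "Qwen2.5-1.5B-Instruct-R1", "Q8_0"))
```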

### Ollama inference

```bash
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/jong-un/Qwen2.5-1.5B-Instruct-R1",
    "messages": [
      { "role": "user", "content": "我最近感到非常焦虑，但不知道原因是什么" }
    ]
  }'

```
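
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the Ollama server is running locally on the default port; the helper names are hypothetical. It sets `"stream": false` so the server returns one JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, content: str) -> urllib.request.Request:
    """Prepare a POST request carrying a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,  # ask for a single JSON response
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, content: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, content)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With the server running, `chat("hf.co/jong-un/Qwen2.5-1.5B-Instruct-R1", "我最近感到非常焦虑，但不知道原因是什么")` would return the model's reply as a string.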