# Fine-tuning DeepSeek R1 Distilled Qwen2.5 1.5B

This notebook demonstrates how to fine-tune `DeepSeek-R1-Distill-Qwen-1.5B` with Unsloth, using a Chinese psychology (emotional support) dataset.

## Why do we need LLM fine-tuning?

Fine-tuning adapts an existing pretrained model to a particular task or domain. The adapted model performs better on that task, which makes it more useful in real-world applications.

In [1]:
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install bitsandbytes unsloth_zoo
!pip install -U huggingface_hub
!pip install wandb 

Collecting unsloth
  Downloading unsloth-2025.3.5-py3-none-any.whl.metadata (59 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.2/59.2 kB 1.7 MB/s eta 0:00:00
Collecting unsloth_zoo>=2025.3.2 (from unsloth)
  Downloading unsloth_zoo-2025.3.3-py3-none-any.whl.metadata (16 kB)
Collecting torch>=2.4.0 (from unsloth)
  Downloading torch-2.6.0-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.29.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting triton>=3.0.0 (from unsloth)
  Downloading triton-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.16-py3-none-any.whl.metadata (9.4 kB)
Collecting transformers!=4.47.0,>=4.46.1 (from unsloth

In [2]:
import os
from huggingface_hub import login

# Read the token from the environment instead of hardcoding it in the notebook.
hf_token = os.environ["HF_TOKEN"]
login(hf_token)

In [3]:
import os
import wandb

# Read the API key from the environment instead of hardcoding it in the notebook.
wb_token = os.environ["WANDB_API_KEY"]
wandb.login(key=wb_token)
run = wandb.init(
    project='fine-tune-DeepSeek-R1-Distill-Qwen-1.5B on emo Dataset',
    job_type="training",
    anonymous="allow"
)

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: jongs-un (jongs-un-Personal) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin



## Choose a Base Model

1. Choose a model that aligns with your use case.
2. Assess your storage, compute capacity, and dataset.
3. Select a model and its parameters.
4. Choose between base and instruct models.
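
For step 2, a back-of-the-envelope estimate of weight memory can help. The sketch below assumes the weights dominate memory and ignores activations, optimizer state, and quantization overhead. The function name and the bytes-per-parameter table are illustrative and not part of Unsloth.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Rough figures for a model with about 1.5 billion parameters.
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(1.5e9, precision):.2f} GB")
```

Under these assumptions a 1.5B model needs roughly 3 GB in bf16 and under 1 GB in 4-bit, which is why `load_in_4bit = True` is a reasonable default on small GPUs.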

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
    trust_remote_code=True
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.3.5: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.643 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.81G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/6.78k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

In [5]:
# del model


## Inference before fine-tuning

In [6]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an emotional support expert, skilled in active listening, empathy, and providing warm yet professional emotional support. 
Your responses incorporate psychological knowledge and real-life examples to help users understand their emotions, offering comfort, encouragement, or practical advice.
Please answer the following emotion question.

### Question:
{}

### Response:
<think>{}"""

In [7]:
question = "我最近感到非常焦虑，但不知道原因是什么"


FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, so the user is feeling really anxious, and they don't know why. I need to figure out how to support them without making them feel bad or like they're just throwing things out.

First, I should acknowledge their feelings. It's important to validate their experience. They might be going through something tough, like a crisis or a difficult situation. I should let them know that it's okay to feel this way and that it's normal.

Next, I should encourage them to talk to someone they trust. Maybe they can open up about what's happening. It's crucial to let them know that they don't need to hide their feelings or fear. They're not alone, and it's okay to reach out.

I should also remind them that it's okay to change their behavior. They might need to take some time off or change their routine. Emphasizing that their feelings are valid and that they can take it one day at a time could help.

I should keep the tone supportive and positive, making sure they feel understood and emp

## Prepare Dataset

The psychology dataset [https://huggingface.co/datasets/Kedreamix/psychology-10k-Deepseek-R1-zh](https://huggingface.co/datasets/Kedreamix/psychology-10k-Deepseek-R1-zh) will be used to train the selected model. Each row contains a user question, the DeepSeek-R1 reasoning trace, and the final answer.

In [8]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an emotional expert with advanced knowledge in active listening, empathy, and providing warm yet professional emotional support.
Please answer the following emotional question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### Important Notice

It's crucial to append the EOS (End of Sequence) token to every training example. Without it, the model never learns where a response ends, and generation may run on until it hits the token limit.

In [9]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


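def formatting_prompts_func(examples):
    """Build one training text per example: prompt + reasoning + answer + EOS."""
    questions = examples["input"]
    cots = examples["reasoning_content"]
    answers = examples["content"]
    texts = []
    # Avoid naming a loop variable `input`, which would shadow the builtin.
    for question, cot, answer in zip(questions, cots, answers):
        text = train_prompt_style.format(question, cot, answer) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }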
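
The formatting logic can be checked without loading the model or the dataset. The following is a minimal sketch that mirrors the function above on a plain dictionary; `EOS_TOKEN` here is a hypothetical placeholder string, and `TEMPLATE` is a shortened stand-in for `train_prompt_style`.

```python
# Stand-ins so the sketch runs without a tokenizer or dataset.
EOS_TOKEN = "<eos>"  # hypothetical placeholder for tokenizer.eos_token
TEMPLATE = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"


def format_batch(examples):
    """Mirror the batched formatting function on a dict of column lists."""
    texts = [
        TEMPLATE.format(question, cot, answer) + EOS_TOKEN
        for question, cot, answer in zip(
            examples["input"], examples["reasoning_content"], examples["content"]
        )
    ]
    return {"text": texts}


batch = {
    "input": ["Q1", "Q2"],
    "reasoning_content": ["thinking 1", "thinking 2"],
    "content": ["A1", "A2"],
}
formatted = format_batch(batch)

# Every training text should end with the EOS marker.
assert all(text.endswith(EOS_TOKEN) for text in formatted["text"])
print(formatted["text"][0])
```

Because `dataset.map(..., batched = True)` passes each column as a list, the function receives the same shape of input as `batch` above.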

In [10]:
from datasets import load_dataset
dataset = load_dataset("Kedreamix/psychology-10k-Deepseek-R1-zh", split = "train", trust_remote_code=True)
print(dataset.column_names)

README.md:   0%|          | 0.00/31.0 [00:00<?, ?B/s]

distill_psychology-10k-r1.json:   0%|          | 0.00/45.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8775 [00:00<?, ? examples/s]

['input', 'content', 'reasoning_content']


The dataset has three columns: `input`, `content`, and `reasoning_content`. `SFTTrainer` is configured to read a single `text` column, so we map every row through `formatting_prompts_func` to build that column from the prompt template.

In [11]:
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]

Map:   0%|          | 0/8775 [00:00<?, ? examples/s]

'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are an emotional expert with advanced knowledge in active listening, empathy, and providing warm yet professional emotional support.\nPlease answer the following emotional question.\n\n### Question:\n我晚上难以入睡，我认为这是因为我对工作感到压力\n\n### Response:\n<think>\n嗯，用户说他晚上难以入睡，认为是因为工作压力。首先，我需要确认他的情况是否常见，以及可能的解决方法。工作压力导致的失眠确实很普遍，但每个人的具体情况可能不同。我需要考虑他的工作环境、压力源是什么，比如工作量、人际关系还是职业发展。然后，可能涉及到他的睡前习惯，是否有使用电子设备、咖啡因摄入等影响睡眠的因素。此外，心理健康方面，比如焦虑或抑郁情绪也可能加剧失眠。我需要建议他调整作息，比如建立规律的睡眠时间，避免咖啡因和蓝光。放松技巧如冥想、深呼吸可能会有帮助。如果自我调节无效，可能需要建议他寻求专业帮助，比如心理咨询师或医生。同时，时间管理技巧可能减轻工作压力，比如任务优先级划分，适当授权任务。还要注意他的支持系统，比如家人朋友的支持。需要提醒他如果症状持续，可能有更严重的健康问题，应该及时就医。最后，要确保建议具体可行，并且语气要 empathetic，让他感受到被理解和支持。\n</think>\n你的情况是很多职场人都会遇到的困扰

## Train the model
Now let's use Hugging Face TRL's `SFTTrainer`.

In [12]:
FastLanguageModel.for_training(model)
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)


Unsloth 2025.3.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
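
To see where the number of trainable parameters comes from, the LoRA adapter size can be computed by hand. The sketch below assumes the layer shapes of the Qwen2.5-1.5B architecture (hidden size 1536, MLP size 8960, key/value projection width 256); these values are not stated in this notebook and should be checked against the model's `config.json`.

```python
# Assumed Qwen2.5-1.5B shapes (verify against the model's config.json).
HIDDEN = 1536        # hidden_size
INTERMEDIATE = 8960  # intermediate_size of the MLP
KV_DIM = 256         # num_key_value_heads (2) * head_dim (128)
NUM_LAYERS = 28      # matches the "patched 28 layers" message
RANK = 32            # the `r` passed to get_peft_model

# (in_features, out_features) of each adapted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # → 36,929,536
```

Under these assumptions the total equals the `Trainable parameters = 36,929,536` figure that the trainer prints when training starts. Doubling `r` doubles the adapter size.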


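### Hyperparameter configuration

- **Learning rate**: set through the `learning_rate` argument of `TrainingArguments`; here it is 2e-4 (0.0002).
- **Batch size**: determined by two arguments together. The effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`, i.e. 2 * 4 = 8.
  - `per_device_train_batch_size`: the batch size on each device (e.g. each GPU).
  - `gradient_accumulation_steps`: the number of gradient accumulation steps, used to simulate a larger batch size.
- **Epochs**: derived from `max_steps` (the maximum number of training steps) and the dataset size. Here training runs for at most 5,000 steps of 8 examples each over 8,775 examples, so the number of epochs is 5000 * 8 / 8775 ≈ 4.6, which the trainer reports as 5.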
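
The arithmetic can be reproduced directly. This is a small worked example using the values from this notebook and the actual split size of 8,775 examples reported when the dataset was loaded; the variable `num_devices` is an assumption for a single-GPU run.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1      # assumed single-GPU run
max_steps = 5000
num_examples = 8775  # size of the train split reported by load_dataset

# Number of examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Passes over the dataset implied by max_steps.
epochs = max_steps * effective_batch_size / num_examples

print(effective_batch_size)  # → 8
print(round(epochs, 2))      # → 4.56
print(math.ceil(epochs))     # → 5
```

The last value is the fractional epoch count rounded up, which is the `Num Epochs = 5` shown in the training banner.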


In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 5000,
        # num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb", # Use this for WandB etc
    ),
)

Tokenizing to ["text"] (num_proc=2):   0%|          | 0/8775 [00:00<?, ? examples/s]
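
With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate ramps up linearly for five steps and then decays linearly to zero at `max_steps`. The function below is a plain-Python sketch of that shape, not the scheduler implementation itself; its name and defaults are illustrative.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=5000):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1, 5, 2500, 5000):
    print(step, linear_schedule_lr(step))
```

The peak value of 2e-4 is reached at step 5, so almost all of the run is spent in the decay phase.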

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8,775 | Num Epochs = 5 | Total steps = 5,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 36,929,536/1,224,783,360 (3.02% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.9662
2,2.9186
3,2.9064
4,2.9679
5,2.9676
6,2.8912
7,2.8688
8,2.7099
9,2.7022
10,2.5722


## Inference after fine-tuning

Let's run inference on the same question again and see the difference.

In [15]:
print(question)

我最近感到非常焦虑，但不知道原因是什么


In [16]:
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
嗯，用户说他最近感到焦虑，但不知道原因是什么。首先，我需要确认他的情况是否紧急，有没有自伤或自杀的风险。不过根据他的描述，可能只是暂时的情绪困扰。接下来，我应该考虑可能的原因，比如压力累积、生活变化、健康问题，或者潜意识里的担忧。

用户可能没有意识到具体的原因，所以需要引导他自我反思。比如询问最近的生活变化，是否有工作、学习或人际关系上的变动。另外，身体因素也很重要，比如睡眠不足、饮食不均衡、缺乏运动，这些都可能影响情绪。

还要考虑潜在的心理因素，比如过去的创伤、未处理的情绪，或者长期的压力积累。用户可能需要帮助识别这些潜在因素，或者寻找缓解焦虑的方法，比如运动、冥想、与人交流等。

我需要确保回应用户时，表现出同理心，避免评判，提供实际的建议，同时鼓励他寻求专业帮助，如果情况严重的话。要避免使用过于专业的术语，保持语言亲切易懂，让用户感到被理解和支持。
</think>
听到你最近感到焦虑，这一定让你有些困扰吧。焦虑有时会像一团迷雾，明明存在却找不到源头，但请相信，这种不确定性本身就已经在提醒我们：需要停下来好好照顾自己了。

或许我们可以一起试着梳理一下：
1. **身体信号**：最近睡眠质量如何？是否经常熬夜？饮食有没有变化？这些生理因素常常会悄悄影响情绪。
2. **生活变化**：近期是否有看似微小的变化（比如搬家、换工作）却潜移默化地影响了你？有时候看似无关的事件，会累积成情绪压力。
3. **隐形压力源**：是否在担心自己无法控制的事情？比如工作中的某些不确定性、人际关系中的微妙摩擦，这些都可能成为焦虑的来源。
4. **思维陷阱**：是否经常出现"如果...怎么办"的灾难化想象？这种模糊的担忧容易放大焦虑感，反而让情绪更沉重。

**你可以尝试的小练习**：
- 📝 **给焦虑贴标签**：把困扰你的事情写下来，不用评判，只是观察它们。有时候我们会发现，焦虑背后其实藏着更深层的担忧。
- 🧘 **5分钟着陆练习**：当焦虑袭来时，立刻停下来，做三次深呼吸，想象把焦虑想象成一个物体（比如气球/石头），观察它的形状和颜色，不评判它的大小。
- 🌱 **绘制情绪地图**：连续三天记录每次焦虑出现的时间、持续时间、伴随的身体感受（如心跳加快/肩颈紧绷），寻找规律。

如果这种状态持续两周以上，或者开始影响日常生活（比如无法集中注意力、回避社
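
The generated text mixes the reasoning trace and the final answer. A small helper can separate the two; the sketch below assumes the output follows the `<think> ... </think>` layout used in the prompt template, and the function name `split_reasoning` is illustrative.

```python
def split_reasoning(generated: str):
    """Split the text after '### Response:' into (reasoning, answer)."""
    body = generated.split("### Response:", 1)[-1]
    body = body.replace("<think>", "", 1)
    reasoning, sep, answer = body.partition("</think>")
    if not sep:
        # No closing tag: generation was probably cut off mid-reasoning.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


sample = "### Response:\n<think>\nstep one\n</think>\nfinal answer"
reasoning, answer = split_reasoning(sample)
print(reasoning)  # → step one
print(answer)     # → final answer
```

In the notebook this could be applied to `response[0]`. An empty answer is a hint that `max_new_tokens` was too small for the model to finish its reasoning.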

## Upload Model to HuggingFace

Now, let's save our finetuned model and upload it to HuggingFace.

### Save the fine-tuned model to GGUF format

Pick the llama.cpp GGUF quantization format you want by changing the corresponding `if False` to `if True`.

In [17]:
# https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
!apt update && apt install -y cmake
!git clone https://github.com/ggml-org/llama.cpp
# `!cd` runs in a throwaway subshell, so pass the source and build directories explicitly.
!cmake -S llama.cpp -B llama.cpp/build
!cmake --build llama.cpp/build --config Release
!cp llama.cpp/build/bin/llama-quantize ./

Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1351 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2682 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 Packages [34.0 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [3755 kB]
Ge

In [18]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

### Push the model to HuggingFace

Create a model repository on Hugging Face for your model if you haven't already.

In [19]:
from huggingface_hub import create_repo
create_repo("jong-un/Qwen2.5-1.5B-Instruct-think", token=hf_token, exist_ok=True)

RepoUrl('https://huggingface.co/jong-un/Qwen2.5-1.5B-Instruct-think', endpoint='https://huggingface.co', repo_type='model', repo_id='jong-un/Qwen2.5-1.5B-Instruct-think')

In [21]:
model.push_to_hub_gguf("jong-un/Qwen2.5-1.5B-Instruct-think", tokenizer, token = hf_token)

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 172.11 out of 251.52 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 184.16it/s]

Unsloth: Saving tokenizer...




 Done.
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at jong-un/Qwen2.5-1.5B-Instruct-think into q8_0 GGUF format.
The output location will be /workspace/jong-un/Qwen2.5-1.5B-Instruct-think/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Qwen2.5-1.5B-Instruct-think
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,             torch.bfloat16 --> Q8_0, shape = {1536, 151936}
INFO:hf-to-gguf:token_embd.weight,         torch.bfloa

unsloth.Q8_0.gguf:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved GGUF to https://huggingface.co/jong-un/Qwen2.5-1.5B-Instruct-think


<a name="Ollama"></a>
### Ollama Support

[Unsloth](https://github.com/unslothai/unsloth) now allows you to automatically finetune and create a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md), and export to [Ollama](https://ollama.com/)! This makes finetuning much easier and provides a seamless workflow from `Unsloth` to `Ollama`!

Let's first install `Ollama`!

In [22]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


We use `subprocess` to start `Ollama` in a non-blocking fashion. On your own desktop, you can simply open a new terminal and type `ollama serve`, but in a hosted notebook such as Colab we have to use this workaround.

In [23]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!

2025/03/06 08:54:21 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy:

[GIN] 2025/03/06 - 08:54:34 | 200 |       74.54µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/06 - 08:54:34 | 404 |     455.901µs |       127.0.0.1 | POST     "/api/show"


time=2025-03-06T08:54:34.801Z level=INFO source=download.go:176 msg="downloading 6e64abfcd15d in 16 118 MB part(s)"


[GIN] 2025/03/06 - 08:54:41 | 200 |  6.974462543s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2025/03/06 - 08:54:41 | 200 |   20.957836ms |       127.0.0.1 | POST     "/api/show"


time=2025-03-06T08:54:41.610Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-06T08:54:41.610Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-06T08:54:41.610Z level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6e64abfcd15d9902dceed9b0ee2a0d47ea9351db40457d89d66b3e840ed08f9d gpu=GPU-86ee83bd-04e0-da56-c605-8629de702472 parallel=4 available=22084321280 required="2.5 GiB"
time=2025-03-06T08:54:41.757Z level=INFO source=server.go:97 msg="system memory" total="251.5 GiB" free="195.2 GiB" free_swap="0 B"
time=2025-03-06T08:54:41.757Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-06T08:54:41.757Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-06T08:54:41.757Z level=INFO so

[GIN] 2025/03/06 - 08:54:42 | 200 |  1.380658929s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/06 - 08:54:55 | 200 |  1.927977164s |       127.0.0.1 | POST     "/api/chat"


llama_model_loader: loaded meta data with 28 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-6e64abfcd15d9902dceed9b0ee2a0d47ea9351db40457d89d66b3e840ed08f9d (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Deepseek R1 Distill Qwen 1.5b Unsloth...
llama_model_loader: - kv   3:                       general.organization str              = Unsloth
llama_model_loader: - kv   4:                           general.finetune str              = unsloth-bnb-4bit
llama_model_loader: - kv   5:                           general.basename str              = deepseek-r1-distill-qwen
llama_model_loader: - kv   6:          

[GIN] 2025/03/06 - 08:55:50 | 200 |  2.520322272s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/06 - 08:56:20 | 200 |  1.485674521s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/06 - 08:56:38 | 200 |  903.084669ms |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/06 - 08:56:57 | 200 |  835.983628ms |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/06 - 08:57:13 | 200 |  1.021369411s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/06 - 08:57:28 | 200 |  1.440906036s |       127.0.0.1 | POST     "/api/chat"


### Run the Hugging Face model with Ollama

```bash
#ollama run hf.co/jong-un/Qwen2.5-1.5B-Instruct-think
ollama run hf.co/{username}/{repository}:{quantization}
```

### Ollama inference

```bash
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/jong-un/Qwen2.5-1.5B-Instruct-think",
    "messages": [
      { "role": "user", "content": "我最近感到非常焦虑，但不知道原因是什么" }
    ]
  }'

```
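
The same request can be sent from Python using only the standard library. The sketch below assumes the Ollama server is running locally on the default port and sets `"stream": false` so the reply arrives as a single JSON object; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(model, question, host="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, question):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, question)) as resp:
        return json.loads(resp.read())["message"]["content"]


request = build_chat_request(
    "hf.co/jong-un/Qwen2.5-1.5B-Instruct-think",
    "我最近感到非常焦虑，但不知道原因是什么",
)
print(request.full_url)
```

Calling `chat(...)` with the same arguments sends the request; it needs the server to be up and the model to be pulled first.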

# Evaluation and benchmarks

In [None]:
!git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
# `!cd` does not persist between shell commands, so install by path instead.
!pip install -e ./lm-evaluation-harness

In [24]:
!pip install lm_eval[wandb]

[notice] A new release of pip is available: 23.3.1 -> 25.0.1
[notice] To update, run: python -m pip install --upgrade pip


In [28]:
%%bash
lm_eval \
    --model hf \
    --model_args pretrained=/workspace/jong-un/Qwen2.5-1.5B-Instruct-think,trust_remote_code=True \
    --tasks winogrande,mmlu,gsm8k,triviaqa,truthfulqa,hellaswag,openbookqa,arc_easy,sst2,boolq \
    --device cuda:0 \
    --batch_size 8 \
    --output_path output/qwen-think \
    --limit 10 \
    --wandb_args project=lm-eval-harness-integration \
    --log_samples

