# Fine-tuning DeepSeek R1 Distilled Qwen2.5 7B

This notebook demonstrates how to fine-tune `DeepSeek-R1-Distill-Qwen-7B` with Unsloth on a Chinese psychology (emotional support) dataset.

## Why do we need LLM fine-tuning?

Fine-tuning adapts an existing pretrained model to a particular task or domain, so that it performs better there than the general-purpose base model and is more useful in real-world applications.

In [1]:
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install bitsandbytes unsloth_zoo
!pip install -U huggingface_hub
!pip install wandb 

[notice] A new release of pip is available: 23.3.1 -> 25.0.1
[notice] To update, run: python -m pip install --upgrade pip
Found existing installation: unsloth 2025.3.9
Uninstalling unsloth-2025.3.9:
  Successfully uninstalled unsloth-2025.3.9
Collecting git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-req-build-m_5o53zx
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-req-build-m_5o53zx
  Resolved https://github.com/unslothai/unsloth.git to commit 2b5d81d75281c02480927cf3ca0dea7c8e98d484
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for uns

In [2]:
import os
from huggingface_hub import login
# Never commit a real token; HF_TOKEN is an assumed environment variable name.
hf_token = os.environ["HF_TOKEN"]
login(hf_token)

In [3]:
import os
import wandb

# Never commit a real API key; read it from the WANDB_API_KEY environment variable.
wb_token = os.environ["WANDB_API_KEY"]
wandb.login(key=wb_token)
run = wandb.init(
    project='fine-tune-DeepSeek-R1-Distill-Qwen-7B on emo Dataset',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjongs-un[0m ([33mjongs-un-Personal[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin



## Choose a Base Model

1. Choose a model that aligns with your use case
2. Assess your storage, compute capacity, and dataset
3. Select a model size and parameters
4. Choose between base and instruct models

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-7B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
    trust_remote_code=True
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Standard import failed for UnslothDPOTrainer: No module named 'UnslothDPOTrainer'. Using tempfile instead!
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.209 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
# del model


## Inference before fine-tuning

In [6]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an emotional support expert, skilled in active listening, empathy, and providing warm yet professional emotional support. 
Your responses incorporate psychological knowledge and real-life examples to help users understand their emotions, offering comfort, encouragement, or practical advice.
Please answer the following emotion question.

### Question:
{}

### Response:
<think>{}"""

In [7]:
# English: "I have been feeling very anxious lately, but I don't know why."
question = "我最近感到非常焦虑，但不知道原因是什么"


FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
好的，我现在要帮用户分析最近的焦虑情绪。首先，我得理解用户的感受，焦虑通常是让人感到不安和紧张，可能影响到日常生活。然后，我应该考虑可能的原因，比如压力、生活变化、健康问题或者人际关系。接下来，我可以建议用户记录情绪，这样更容易找到触发因素。同时，提供一些放松技巧，比如深呼吸或者运动，帮助缓解焦虑。最后，提醒用户寻求专业帮助，如果情况严重的话。整个过程要保持温暖和支持的语气，让用户感到被理解和支持。
</think>

焦虑情绪是人正常的反应，但当它变得过于强烈或影响到日常生活时，确实需要关注。以下是一些可能的原因以及应对建议：

**可能的原因：**
1. **生活压力**：工作、学业或其他重要事务的突然变化可能导致焦虑。
2. **人际关系**：与朋友、家人或同事的关系紧张，可能导致情绪波动。
3. **健康问题**：焦虑常常与身体状况有关，如睡眠困难、饮食问题或慢性疾病。
4. **自我评价**：对自己能力的过高或过低估计，可能导致焦虑。
5. **环境变化**：突然的环境变化，如搬家、换工作等，可能引发焦虑。

**应对建议：**
1. **识别触发因素**：尝试记录每天发生焦虑的事件，找出是否有共同点。
2. **放松技巧**：深呼吸、冥想、瑜伽等方法可以帮助缓解焦虑。
3. **与人交流**：与信任的朋友或家人倾诉，可能带来情感支持。
4. **设定小目标**：将大目标分解为小步骤，逐步完成，减少压力。
5. **关注健康**：确保睡眠充足，保持良好的饮食习惯，适当锻炼，帮助缓解焦虑。

如果你的焦虑持续时间较长，影响到日常生活，建议咨询专业的心理医生或治疗师。他们可以帮助你找到问题的根源，并提供更专业的治疗方案。记住，焦虑并不可怕，重要的是学会管理它。<｜end▁of▁sentence｜>
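
The decoded output above mixes the reasoning inside `<think>...</think>` with the final answer and a trailing end-of-sentence marker. A minimal sketch of separating the two parts is shown here; the helper name `split_reasoning` is hypothetical and not part of Unsloth or Transformers, and it assumes the output has the shape seen above.

```python
import re

# EOS string as it appears in the decoded output above.
EOS_MARKER = "<｜end▁of▁sentence｜>"


def split_reasoning(generated: str, eos: str = EOS_MARKER) -> tuple[str, str]:
    """Return (reasoning, answer) from text shaped like '<think>...</think>answer<eos>'."""
    text = generated.replace(eos, "").strip()
    match = re.search(r"<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        # No complete think block (for example, generation was cut off):
        # treat everything as the answer.
        return "", text
    return match.group(1).strip(), match.group(2).strip()


sample = (
    "\n<think>\nList likely causes, then suggest coping steps.\n</think>\n\n"
    "Anxiety is a normal reaction.<｜end▁of▁sentence｜>"
)
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the notebook, the same helper could be applied to `response[0].split("### Response:")[1]`.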


## Prepare Dataset

A Chinese psychology dataset with DeepSeek-R1 reasoning traces, [https://huggingface.co/datasets/Kedreamix/psychology-10k-Deepseek-R1-zh](https://huggingface.co/datasets/Kedreamix/psychology-10k-Deepseek-R1-zh), will be used to train the selected model.

In [8]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an emotional expert with advanced knowledge in active listening, empathy, and providing warm yet professional emotional support.
Please answer the following emotional question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### Important Notice

It's crucial to add the EOS (End of Sequence) token at the end of each training dataset entry, otherwise you may encounter infinite generations.

In [9]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


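def formatting_prompts_func(examples):
    """Build one training string per row: prompt + reasoning + answer + EOS."""
    questions = examples["input"]
    cots = examples["reasoning_content"]
    answers = examples["content"]
    texts = []
    # Loop variables are named so they do not shadow the built-in `input`.
    for question, cot, answer in zip(questions, cots, answers):
        text = train_prompt_style.format(question, cot, answer) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }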
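
Before mapping the real dataset, the formatting logic can be checked on a toy batch. The sketch below is a stand-in and not the notebook's own function: `TOY_TEMPLATE`, `TOY_EOS`, and `format_batch` are hypothetical names, and the placeholder EOS string replaces `tokenizer.eos_token` so the check runs without loading a model.

```python
# Shortened stand-in for train_prompt_style with the same three placeholders.
TOY_TEMPLATE = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"
TOY_EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Mirror the batched formatting: one text per (input, reasoning, answer) row."""
    texts = [
        TOY_TEMPLATE.format(question, cot, answer) + TOY_EOS
        for question, cot, answer in zip(
            examples["input"], examples["reasoning_content"], examples["content"]
        )
    ]
    return {"text": texts}


toy_batch = {
    "input": ["q1", "q2"],
    "reasoning_content": ["r1", "r2"],
    "content": ["a1", "a2"],
}
formatted = format_batch(toy_batch)

# Every training text must end with the EOS marker.
assert all(text.endswith(TOY_EOS) for text in formatted["text"])
print(formatted["text"][0])
```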

In [10]:
from datasets import load_dataset
dataset = load_dataset("Kedreamix/psychology-10k-Deepseek-R1-zh", split = "train", trust_remote_code=True)
print(dataset.column_names)

# from datasets import load_dataset, concatenate_datasets

# # Load multiple datasets
# dataset1 = load_dataset("Kedreamix/psychology-10k-Deepseek-R1-zh", split="train", trust_remote_code=True)
# dataset2 = load_dataset("Congliu/Chinese-DeepSeek-R1-Distill-data-110k", split="train", trust_remote_code=True)

# # Check the column names of each dataset
# print("Dataset 1 columns:", dataset1.column_names)
# print("Dataset 2 columns:", dataset2.column_names)

# # Find the common columns
# common_columns = list(set(dataset1.column_names) & set(dataset2.column_names))
# print("Common columns:", common_columns)

# # Keep only the first three common columns, or specify them manually
# common_columns = common_columns[:3]  # take the first three
# print("common columns found, using:", common_columns)

# # Align the datasets, keeping only the common columns
# dataset1_aligned = dataset1.select_columns(common_columns)
# dataset2_aligned = dataset2.select_columns(common_columns)

# # Concatenate the datasets
# combined_dataset = concatenate_datasets([dataset1_aligned, dataset2_aligned])

# # Concatenate the datasets
# combined_dataset = concatenate_datasets([dataset1, dataset2])

# # Check the combined result
# print("Combined dataset columns:", combined_dataset.column_names)
# print("Combined dataset size:", len(combined_dataset))

# Optional: save the combined dataset to disk
#combined_dataset.save_to_disk("./combined_dataset")

['input', 'content', 'reasoning_content']


`SFTTrainer` reads each training example from a single text column, so we need to transform the dataset: every row's `input`, `reasoning_content`, and `content` columns are merged into one formatted `text` string.

In [11]:
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]

'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are an emotional expert with advanced knowledge in active listening, empathy, and providing warm yet professional emotional support.\nPlease answer the following emotional question.\n\n### Question:\n我晚上难以入睡，我认为这是因为我对工作感到压力\n\n### Response:\n<think>\n嗯，用户说他晚上难以入睡，认为是因为工作压力。首先，我需要确认他的情况是否常见，以及可能的解决方法。工作压力导致的失眠确实很普遍，但每个人的具体情况可能不同。我需要考虑他的工作环境、压力源是什么，比如工作量、人际关系还是职业发展。然后，可能涉及到他的睡前习惯，是否有使用电子设备、咖啡因摄入等影响睡眠的因素。此外，心理健康方面，比如焦虑或抑郁情绪也可能加剧失眠。我需要建议他调整作息，比如建立规律的睡眠时间，避免咖啡因和蓝光。放松技巧如冥想、深呼吸可能会有帮助。如果自我调节无效，可能需要建议他寻求专业帮助，比如心理咨询师或医生。同时，时间管理技巧可能减轻工作压力，比如任务优先级划分，适当授权任务。还要注意他的支持系统，比如家人朋友的支持。需要提醒他如果症状持续，可能有更严重的健康问题，应该及时就医。最后，要确保建议具体可行，并且语气要 empathetic，让他感受到被理解和支持。\n</think>\n你的情况是很多职场人都会遇到的困扰

## Train the model
Now let's use Hugging Face TRL's `SFTTrainer`.

In [12]:
FastLanguageModel.for_training(model)
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)


Unsloth 2025.3.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
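
The number of trainable parameters that LoRA adds can be estimated by hand: each adapted linear layer of shape `(in_features, out_features)` gets two low-rank matrices with `r * (in_features + out_features)` weights in total. The sketch below applies this to the seven target modules with `r = 32`. The layer dimensions are assumptions taken from the publicly documented Qwen2-style 7B architecture and are not stated in this notebook; with them, the result agrees with the `Trainable parameters = 80,740,352` line in the training log.

```python
# Assumed Qwen2-style 7B dimensions (not stated in the notebook).
HIDDEN = 3584          # hidden size
INTERMEDIATE = 18944   # MLP intermediate size
NUM_LAYERS = 28        # matches "patched 28 layers" in the log above
KV_DIM = 4 * 128       # 4 key/value heads x head size 128

# (in_features, out_features) of each linear layer targeted by LoRA.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(rank: int, shapes: dict) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted linear layer."""
    return sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())


per_layer = lora_params_per_layer(32, TARGET_SHAPES)
total_lora_params = per_layer * NUM_LAYERS
print(f"LoRA trainable parameters: {total_lora_params:,}")  # 80,740,352
```

Doubling `r` doubles this count, which is one reason the suggested ranks trade adapter capacity against memory.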


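### Hyperparameter configuration

* **Learning rate**: set through the `learning_rate` argument of `TrainingArguments`; here it is 2e-4 (0.0002).
* **Batch size**: determined by two arguments together. The effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`, that is 2 * 4 = 8.
  * `per_device_train_batch_size`: the batch size on each device (such as a GPU).
  * `gradient_accumulation_steps`: the number of gradient accumulation steps, used to simulate a larger batch size.
* **Epochs**: derived from `max_steps` (the maximum number of training steps) and the dataset size. With `max_steps = 2000`, 8 examples per step, and the 8,775 examples reported in the training log, training covers 2000 * 8 / 8775 ≈ 1.8 epochs, which the log reports as 2.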
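
The arithmetic behind these quantities can be written out as a short calculation. This is a sketch using the values from the trainer configuration (`max_steps = 2000`, batch size 2, accumulation 4) and the example count from the training log further down (8,775); the variable names are chosen for illustration.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 2000
num_examples = 8775  # "Num examples" in the training log

# Examples consumed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Total examples seen over the run, and how many passes over the data that is.
examples_seen = max_steps * effective_batch_size
epochs = examples_seen / num_examples

# Optimizer steps needed for one full pass over the data.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen}")
print(f"epochs: {epochs:.2f}")
print(f"steps per epoch: {steps_per_epoch}")
```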


In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 2000,
        # num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb", # Use this for WandB etc
    ),
)
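
With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate ramps up linearly to its peak over the first 5 steps and then decays linearly to zero at `max_steps`. The function below is a plain-Python sketch of that shape under the settings used here; it is written for illustration and is not the scheduler implementation from Transformers.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 2e-4,
    warmup_steps: int = 5,
    total_steps: int = 2000,
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: 0 -> base_lr over `warmup_steps` steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay: base_lr -> 0 between `warmup_steps` and `total_steps`.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 1000, 2000):
    print(step, f"{linear_schedule_lr(step):.3e}")
```

Because the warmup is only 5 steps, nearly the whole run is spent in the decay phase.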

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8,775 | Num Epochs = 2 | Total steps = 2,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 80,740,352/5,423,699,456 (1.49% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.5927
2,2.5509
3,2.5413
4,2.6027
5,2.6222
6,2.4934
7,2.4232
8,2.2372
9,2.201
10,2.1198


## Inference after fine-tuning

Let's run inference with the same question again and see the difference.

In [15]:
print(question)

我最近感到非常焦虑，但不知道原因是什么


In [16]:
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
嗯，用户最近感到焦虑，但不知道原因是什么。首先，我需要理解焦虑的常见原因，可能包括工作压力、人际关系、健康问题或者生活中的变化。用户可能没有意识到这些潜在因素，所以需要引导他们自我反思。

接下来，我应该考虑如何结构化回答。通常，分点列出可能的原因和应对策略会比较清晰。比如，列出可能的原因，如工作压力、人际关系、健康、经济、季节变化等，然后针对每个原因提供具体建议。

还要注意语气，要表现出同理心，避免让用户感到被评判。可能需要使用一些温和的语言，比如“你可能正在经历一些不易察觉的变化”或者“焦虑是身体在提醒你关注某些事情”。

另外，用户可能需要实际的步骤来缓解焦虑，比如深呼吸、运动、时间管理等。这些方法需要简单易行，适合不同情况的人尝试。

还要考虑用户可能没有提到的深层需求，比如是否需要专业帮助。如果焦虑持续或影响生活，建议寻求心理咨询是重要的。但需要以鼓励的方式提出，避免让用户觉得问题被轻视。

可能还需要提醒用户避免自我诊断，强调专业帮助的重要性。同时，可以推荐一些放松技巧，比如冥想、正念练习，或者写日记来整理思绪。

最后，确保整个回答逻辑清晰，信息准确，同时保持支持和鼓励的态度。避免使用专业术语，保持口语化，让用户感到被理解和支持。
</think>
你的感受是很常见的，许多人在生活中都会经历“无名焦虑”。这种情绪可能源于生活中的潜意识变化、生理状态调整，或是未被察觉的压力源。以下是一些可能的方向和应对建议，希望能帮你找到突破口：

---

### 🌱 **可能隐藏的焦虑来源**
1. **隐性压力累积**  
   - 工作/学习中的微小变化（如截止日期、人际关系）、生活节奏被打乱、甚至季节变化（比如日照减少、空气quality下降）都可能引发潜意识的焦虑。
   - **试试**：记录一周内让你稍微紧张的事情，哪怕只是“今天开会前心慌了一下”。

2. **未被处理的情绪**  
   - 长期压抑的愤怒、悲伤，或未解决的矛盾（比如家庭矛盾、职场冲突）可能以焦虑的形式浮现。
   - **试试**：用纸笔写下让你感到压力或委屈的事情，不必急着解决，只是观察它们的存在。

3. **身体信号的警报**  
   - 焦虑有时会通过身体反应提醒你：睡眠不足、饮食不规律、缺乏运动等。
   - **试试**：先调整基础生活习惯（如早睡

## Upload Model to HuggingFace

Now, let's save our finetuned model and upload it to HuggingFace.

### Save the fine-tuned model to GGUF format

Pick the llama.cpp GGUF quantization you want by changing the corresponding `if False` to `if True`.

In [17]:
%%bash
# Build llama.cpp with CUDA support. The %%bash cell magic is required:
# without it the notebook tries to run these shell commands as Python.
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

### Push the model to HuggingFace

Create a model repository on the Hub for your model if you haven't done so already.

In [18]:
from huggingface_hub import create_repo
create_repo("jong-un/Qwen2.5-7B-Instruct-think", token=hf_token, exist_ok=True)

RepoUrl('https://huggingface.co/jong-un/Qwen2.5-7B-Instruct-think', endpoint='https://huggingface.co', repo_type='model', repo_id='jong-un/Qwen2.5-7B-Instruct-think')

In [19]:
model.push_to_hub_gguf("jong-un/Qwen2.5-7B-Instruct-think", tokenizer, token = hf_token)

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1653.99 out of 2015.49 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 75.00it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting qwen2 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at jong-un/Qwen2.5-7B-Instruct-think into q8_0 GGUF format.
The output location will be /workspace/jong-un/Qwen2.5-7B-Instruct-think/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Qwen2.5-7B-Instruct-think
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> Q8_0

unsloth.Q8_0.gguf:   0%|          | 0.00/8.10G [00:00<?, ?B/s]

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved GGUF to https://huggingface.co/jong-un/Qwen2.5-7B-Instruct-think


<a name="Ollama"></a>
### Ollama Support

[Unsloth](https://github.com/unslothai/unsloth) now allows you to automatically finetune and create a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md), and export to [Ollama](https://ollama.com/)! This makes finetuning much easier and provides a seamless workflow from `Unsloth` to `Ollama`!

Let's first install `Ollama`!

In [20]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


We use `subprocess` to start `Ollama` in a non-blocking fashion. On your own desktop, you can simply open a new `terminal` and type `ollama serve`, but in a hosted notebook such as Colab we have to use this workaround.

In [21]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!

Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFV8IrpE26w6dYE4GaWBsHiGruPgUmQRHMb1aYmuVWp/



2025/03/12 07:23:37 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy:

[GIN] 2025/03/12 - 07:23:45 | 200 |      94.109µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/12 - 07:23:45 | 404 |      316.09µs |       127.0.0.1 | POST     "/api/show"


time=2025-03-12T07:23:46.934Z level=INFO source=download.go:176 msg="downloading 54b06104b852 in 16 506 MB part(s)"
time=2025-03-12T07:24:13.182Z level=INFO source=download.go:176 msg="downloading 369ca498f347 in 1 387 B part(s)"
time=2025-03-12T07:24:14.280Z level=INFO source=download.go:176 msg="downloading b31c130852cc in 1 107 B part(s)"
time=2025-03-12T07:24:15.383Z level=INFO source=download.go:176 msg="downloading 9ae14bd2c052 in 1 193 B part(s)"


[GIN] 2025/03/12 - 07:24:24 | 200 | 38.990101408s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2025/03/12 - 07:24:25 | 200 |    32.12716ms |       127.0.0.1 | POST     "/api/show"


time=2025-03-12T07:24:25.445Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-12T07:24:25.445Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-12T07:24:25.445Z level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-54b06104b852de8e6404dc4a00e84d23975e90a2b00d460b1e10afc627428ba8 gpu=GPU-751f3245-4432-1555-5db5-460aedca8119 parallel=4 available=73875521536 required="8.6 GiB"
time=2025-03-12T07:24:26.147Z level=INFO source=server.go:105 msg="system memory" total="2015.5 GiB" free="1962.2 GiB" free_swap="0 B"
time=2025-03-12T07:24:26.147Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-12T07:24:26.147Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-12T07:24:26.147Z level=INFO

[GIN] 2025/03/12 - 07:24:28 | 200 |  3.667117944s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/12 - 07:24:34 | 200 |  3.335225562s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/12 - 07:24:57 | 200 |  1.928830766s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/12 - 07:25:11 | 200 |   1.38031164s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/12 - 07:25:30 | 200 |  2.271597078s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/12 - 07:25:48 | 200 |  6.174187442s |       127.0.0.1 | POST     "/api/chat"


### Ollama run HuggingFace model

```bash
#ollama run hf.co/jong-un/Qwen2.5-7B-Instruct-think
ollama run hf.co/{username}/{repository}:{quantization}
```

### Ollama inference

```bash
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/jong-un/Qwen2.5-7B-Instruct-think",
    "messages": [
      { "role": "user", "content": "我最近感到非常焦虑，但不知道原因是什么" }
    ]
  }'

```
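
The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the Ollama server is running locally on the default port and sets `"stream": false` so that the reply arrives as a single JSON object. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(
    model: str, user_message: str, url: str = OLLAMA_CHAT_URL
) -> urllib.request.Request:
    """Build a POST request with one user message for the Ollama chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # return one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_message: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_message)) as response:
        body = json.load(response)
    return body["message"]["content"]


# Usage (requires a running Ollama server with the model pulled):
#     print(chat("hf.co/jong-un/Qwen2.5-7B-Instruct-think", "我最近感到非常焦虑，但不知道原因是什么"))
```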

## Evaluation and benchmarks

In [None]:
# Shell commands need a `!` prefix (or a magic) inside a notebook cell.
!git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
%cd lm-evaluation-harness
%pip install -e .

In [None]:
%pip install "lm_eval[wandb]"

In [None]:
%%bash
lm_eval \
    --model hf \
    --model_args pretrained=jong-un/Qwen2.5-7B-Instruct-think,trust_remote_code=True \
    --tasks winogrande,mmlu,gsm8k,triviaqa,truthfulqa,hellaswag,openbookqa,arc_easy,sst2,boolq \
    --device cuda:0 \
    --batch_size 8 \
    --output_path output/qwen-think \
    --limit 10 \
    --wandb_args project=lm-eval-harness-integration \
    --log_samples