# 1. Setting Up
## 1.1. Log in to Hugging Face

In [12]:
!pip install huggingface_hub
# !pip install kaggle_secrets
!pip install -U accelerate



In [1]:
from huggingface_hub import notebook_login
notebook_login()

# login(token="<HF_TOKEN>", add_to_git_credential=True, new_session=True)

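Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the token has been exported as an environment variable named `HF_TOKEN`, it could be read at runtime instead. The helpers `read_secret` and `login_from_env` are hypothetical names introduced here for illustration; they are not part of `huggingface_hub`.

```python
import os


def read_secret(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def login_from_env() -> None:
    """Log in to Hugging Face with a token taken from the environment."""
    # Imported lazily so read_secret can be used without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_secret("HF_TOKEN"))
```

Calling `login_from_env()` would then replace the interactive `notebook_login()` widget. The same `read_secret` helper can supply other keys used later in the notebook, such as the W&B API key.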

## 1.2. Log in to Weights & Biases

In [3]:
!pip install wandb

Collecting wandb
  Downloading wandb-0.19.8-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting click!=8.0.0,>=7.1 (from wandb)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting docker-pycreds>=0.4.0 (from wandb)
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting gitpython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.44-py3-none-any.whl.metadata (13 kB)
Collecting pydantic<3,>=2.6 (from wandb)
  Downloading pydantic-2.10.6-py3-none-any.whl.metadata (30 kB)
Collecting sentry-sdk>=2.0.0 (from wandb)
  Downloading sentry_sdk-2.22.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting setproctitle (from wandb)
  Downloading setproctitle-1.3.5-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.29,>=1.0.0->wandb)
  Downloading gitdb-4.0.12-py3-none-any.whl.metadata (1.2 kB)
Collecting an

In [2]:
import wandb

wandb.login(key="<WandB_TOKEN>")
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ryan/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mryan-wibawa[0m ([33mryan-wibawa-california-state-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


# 2. Loading the model and tokenizer

In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 
dtype = None 
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "<HF_TOKEN>",  # placeholder: replace with your own Hugging Face token
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA GeForce RTX 4070. Max memory: 11.994 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.3.0. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 2.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

# 3. Model inference before fine-tuning
To create a prompt style for the model, we will define a system prompt and include placeholders for the question and response generation. The prompt will guide the model to think step-by-step and provide a logical, accurate response.

In [4]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>{}"""

In this example, we will provide a medical question to the prompt_style, convert it into tokens, and then pass the tokens to the model for response generation. 

In [5]:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model) 
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)

response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out what the cystometry would show for this 61-year-old woman. She's had a long history of involuntary urine loss when she coughs or sneezes, but she doesn't leak at night. She had a gynecological exam and a Q-tip test. I need to think through this step by step.

First, let's break down the information. She's experiencing urine loss during activities that involve increased abdominal pressure, like coughing or sneezing. That makes me think of stress urinary incontinence. Stress incontinence usually happens when the urethral closure mechanism doesn't work properly, especially when there's increased pressure, like from coughing.

Now, she's had a gynecological exam. I'm not entirely sure what specific findings they might have noted, but the Q-tip test is a common diagnostic tool for incontinence. The Q-tip test involves inserting a catheter with a balloon at the end and then slowly withdrawing it to check for urethral obstruction. If the balloon does

Even without fine-tuning, our model successfully generated a chain of thought and provided reasoning before delivering the final answer. The reasoning process is encapsulated within the <think></think> tags.

So why do we still need fine-tuning? The reasoning, while detailed, was long-winded. In addition, the final answer was presented as bullet points, which deviates from the structure and style of the dataset we want to fine-tune on.
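To compare the reasoning and the final answer separately, the decoded output can be split on the `</think>` tag. The following is a minimal sketch, assuming the prompt ends with `### Response:` followed by `<think>` as in `prompt_style`; `split_reasoning` is a hypothetical helper, not part of any library used here.

```python
def split_reasoning(decoded: str) -> tuple[str, str]:
    """Split a decoded generation into (reasoning, answer).

    Assumes the prompt ends with "### Response:" followed by "<think>",
    and that the model closes its reasoning with "</think>".
    """
    # Keep only what follows the last "### Response:" marker.
    response = decoded.split("### Response:")[-1]
    reasoning, sep, answer = response.partition("</think>")
    reasoning = reasoning.replace("<think>", "", 1).strip()
    if not sep:
        # Generation was cut off before </think>, so there is no final answer yet.
        return reasoning, ""
    return reasoning, answer.strip()
```

For example, `reasoning, answer = split_reasoning(response[0])` would separate the two parts of the output printed above. Depending on how the output is decoded, the answer may still end with the end-of-sequence marker.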


# 4. Loading and processing the dataset
We will slightly change the prompt style for processing the dataset by adding a third placeholder for the complex chain-of-thought column.

In [6]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

Next, write a Python function that creates a `text` column in the dataset by filling the training prompt template with each question, chain of thought, and answer.

In [7]:
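EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN so the model learns where a response ends


def formatting_prompts_func(examples):
    # With batched=True, each column arrives as a list of values.
    questions = examples["Question"]
    cots = examples["Complex_CoT"]
    responses = examples["Response"]
    texts = []
    # The loop variables avoid the name `input`, which would shadow the built-in.
    for question, cot, response in zip(questions, cots, responses):
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }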
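To see what this function receives and returns, here is a small self-contained sketch. With `batched=True`, the mapping function is given a dict of column name to list of values and returns a dict of lists. The template is a shortened stand-in for `train_prompt_style`, the `<eos>` string is a placeholder for `tokenizer.eos_token`, and the two rows are made-up toy data, not taken from the dataset.

```python
# Shortened stand-in for train_prompt_style; the real template is longer.
demo_template = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"
DEMO_EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Build one training string per row from a batch given as a dict of lists."""
    texts = [
        demo_template.format(question, cot, answer) + DEMO_EOS
        for question, cot, answer in zip(
            examples["Question"], examples["Complex_CoT"], examples["Response"]
        )
    ]
    return {"text": texts}


# Hypothetical two-row batch in the shape that a batched map provides.
toy_batch = {
    "Question": ["What is 2 + 2?", "Is water wet?"],
    "Complex_CoT": ["Add the numbers.", "Consider the definition."],
    "Response": ["4", "Yes"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Each output string ends with the end-of-sequence marker, which is what tells the model where a training example stops.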

We will load the first 500 samples from the [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT?row=46) dataset, which is available on the Hugging Face Hub. After that, we will add the `text` column by mapping `formatting_prompts_func` over the dataset.

In [8]:
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset["text"][0]

"Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request. \nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \nPlease answer the following medical question. \n\n### Question:\nA 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?\n\n### Response:\n<think>\nOkay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her ab

As we can see, the text column has a system prompt, instructions, chain of thought, and the answer. 

# 5. Setting up the model
Using the target modules, we will set up the model by adding low-rank adapters (LoRA) to it.

In [9]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)

Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
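The number of parameters that LoRA adds can be worked out by hand: each adapted linear layer with `in_features` inputs and `out_features` outputs gets two small matrices with `r * (in_features + out_features)` weights in total. The sketch below assumes Llama-3-8B-style dimensions. The hidden size 4096 and MLP size 14336 match the tensor shapes in the GGUF conversion log later in this notebook; the key/value projection width of 1024 is an assumption based on the base architecture.

```python
# Llama-3-8B-style dimensions. kv_dim = 1024 (8 KV heads x 128) is an assumption.
hidden, ffn, kv_dim, n_layers, r = 4096, 14336, 1024, 32, 16

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, ffn),
    "up_proj": (hidden, ffn),
    "down_proj": (ffn, hidden),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * n_layers
print(f"{per_layer:,} per layer, {total:,} total")
# prints: 1,310,720 per layer, 41,943,040 total
```

Under these assumptions the total agrees with the `Number of trainable parameters = 41,943,040` line that the trainer prints when training starts.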


Next, we will set up the training arguments and the trainer by providing the model, tokenizer, dataset, and other important training parameters that will optimize our fine-tuning process.

In [10]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        # auto_find_batch_size=True,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        # per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/500 [00:00<?, ? examples/s]
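It helps to translate these arguments into how much data the run actually sees. The sketch below is plain arithmetic on the values passed above, plus a simplified version of a linear schedule with warmup. The function `linear_lr` is a hypothetical illustration of the schedule's shape, not the trainer's own implementation, and exact values may differ by a step depending on how the trainer counts.

```python
per_device_batch = 1
grad_accum = 4
num_gpus = 1
max_steps = 60
warmup_steps = 5
base_lr = 2e-4
num_examples = 500

# One optimizer step consumes per_device_batch * grad_accum * num_gpus examples.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


print(effective_batch, examples_seen, epoch_fraction)
print([linear_lr(s) for s in (0, 5, 30, 60)])
```

With 60 steps of 4 examples each, the run processes 240 of the 500 samples, so it stops before completing a full pass over the data.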

# 6. Model training
Run the following command to start training. 

In [11]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 10   | 1.9481        |
| 20   | 1.4964        |
| 30   | 1.4217        |
| 40   | 1.3798        |
| 50   | 1.3828        |
| 60   | 1.3638        |
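One way to read these numbers is to convert them to perplexity. This is a rough sketch that assumes the logged value is the mean per-token cross-entropy in nats, in which case perplexity is simply `exp(loss)`. The losses are copied from the training log above.

```python
import math

# Training loss logged every 10 steps (values copied from the training log).
losses = {10: 1.9481, 20: 1.4964, 30: 1.4217, 40: 1.3798, 50: 1.3828, 60: 1.3638}

for step, loss in losses.items():
    # exp(mean cross-entropy in nats) gives the perplexity.
    print(f"step {step:>2}: loss {loss:.4f}  perplexity {math.exp(loss):.2f}")

# Relative improvement between the first and last logged values.
relative_drop = (losses[10] - losses[60]) / losses[10]
print(f"loss fell by {relative_drop:.0%} between steps 10 and 60")
```

Most of the improvement happens in the first 20 steps, after which the loss flattens out.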


In [13]:
#@title Show current memory stats
# Note: to isolate the memory used by training, this baseline must be captured
# before trainer.train(). Here it runs afterwards, so the training delta below is 0.
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

GPU = NVIDIA GeForce RTX 4070. Max memory = 11.994 GB.
7.826 GB of memory reserved.
242.7146 seconds used for training.
4.05 minutes used for training.
Peak reserved memory = 7.826 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 65.249 %.
Peak reserved memory for training % of max memory = 0.0 %.
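The training share is reported as 0.0 GB because the baseline was read in the same cell as the peak, after training had already finished. The sketch below separates the arithmetic from the measurement so the effect is easy to see. `gib` and `memory_report` are hypothetical helpers, and the inputs are the figures printed above.

```python
def gib(num_bytes: int) -> float:
    """Convert bytes to GiB, rounded the same way as the report above."""
    return round(num_bytes / 1024**3, 3)


def memory_report(baseline_gb: float, peak_gb: float, total_gb: float) -> dict:
    """Summarize peak reserved memory and the share attributable to training."""
    training_gb = round(peak_gb - baseline_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# Figures from the run above: the baseline was read after training,
# so it equals the peak and the training share comes out as zero.
print(memory_report(baseline_gb=7.826, peak_gb=7.826, total_gb=11.994))
```

To get a meaningful training figure, the baseline would need to come from `gib(torch.cuda.max_memory_reserved())` evaluated before `trainer.train()`, and the peak from the same call evaluated afterwards.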


You can view the full model evaluation report on the Weights & Biases dashboard by logging into the website and opening the [project](https://wandb.ai/ryan-wibawa-california-state-university/Fine-tune-DeepSeek-R1-Distill-Llama-8B%20on%20Medical%20COT%20Dataset/runs/1yu957uq?nw=nwuserryanwibawa).

# 7. Model inference after fine-tuning
To compare the results, we will ask the fine-tuned model the same question as before to see what has changed.

In [14]:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, let's see. This woman is 61, and she's been dealing with involuntary urine loss, especially when she coughs or sneezes. That sounds like a classic case of stress incontinence. I'm pretty sure we're dealing with some kind of urethral issue here. 

Now, she's undergone a gynecological exam, and the Q-tip test was done. I'm trying to recall what those tests usually show. The Q-tip test involves inserting a Q-tip catheter into the urethra and then removing it. If the tip is wet, it suggests that the urethral opening is at a lower position, which might be causing some issues.

But wait, she doesn't leak at night. That's interesting. Nighttime incontinence usually points toward something like a neurological issue or a problem with the bladder capacity. However, since she's leaking involuntarily during activities like sneezing or coughing, it's more likely a urethral issue rather than a bladder capacity problem.

So, thinking about stress incontinence, we usually look at how th

This is much better and more accurate. The chain of thought was direct, and the answer was straightforward and in one paragraph. The fine-tuning was successful.

# 8. Saving the model locally
Now, let's save the adapter, full model, and tokenizer locally so that we can use them in other projects.

In [15]:
new_model_local = "DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT"
model.save_pretrained(new_model_local) 
tokenizer.save_pretrained(new_model_local)

model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 29.73 out of 49.39 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 16%|████████████▎                                                                  | 5/32 [00:00<00:02, 11.19it/s]
We will save to Disk and not RAM now.
100%|██████████████████████████████████████████████████████████████████████████████| 32/32 [04:50<00:00,  9.09s/it]


Unsloth: Saving tokenizer... Done.
Done.


# 9. Pushing the model to Hugging Face Hub
We will also push the adapter, tokenizer, and merged model to the Hugging Face Hub so that others in the AI community can integrate the model into their own systems.

In [16]:
new_model_online = "rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT"
model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)

model.push_to_hub_merged(new_model_online, tokenizer, save_method = "merged_16bit")

README.md:   0%|          | 0.00/626 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT


tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Unsloth: You are pushing to hub, but you passed your HF username = rwibawa.
We shall truncate rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT to DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 29.61 out of 49.39 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████████████████████████████████████████████████████████████████████████| 32/32 [00:17<00:00,  1.88it/s]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.


model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT


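# 10. Converting the Model to Llama.cpp GGUF
We can't use the safetensors files locally, as most local AI chatbots don't support them. Instead, we'll convert the model to the llama.cpp GGUF file format. Note that the `make` commands below fail because llama.cpp has deprecated its Makefile build in favor of CMake. The conversion still works, because `convert_hf_to_gguf.py` is a Python script and does not need the compiled binaries.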

In [17]:
!git clone --recursive https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 45841, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 45841 (delta 9), reused 6 (delta 6), pack-reused 45829 (from 2)
Receiving objects: 100% (45841/45841), 96.01 MiB | 4.39 MiB/s, done.
Resolving deltas: 100% (33169/33169), done.
Submodule 'kompute' (https://github.com/nomic-ai/kompute.git) registered for path 'ggml/src/ggml-kompute/kompute'
Cloning into '/home/ryan/Documents/workspaces/workspace_ai/Finetuning-DeepSeek-R1/llama.cpp/ggml/src/ggml-kompute/kompute'...
remote: Enumerating objects: 9122, done.        
remote: Counting objects: 100% (155/155), done.        
remote: Compressing objects: 100% (70/70), done.        
remote: Total 9122 (delta 108), reused 86 (delta 85), pack-reused 8967 (from 3)        
Receiving objects: 100% (9122/9122), 17.59 MiB | 33.80 MiB/s, done.
Resolving deltas: 100% (5726/5726), done.
Submodule path 'ggml/src/ggml-komp

In [18]:
!make clean -C llama.cpp

make: Entering directory '/home/ryan/Documents/workspaces/workspace_ai/Finetuning-DeepSeek-R1/llama.cpp'
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.
make: Leaving directory '/home/ryan/Documents/workspaces/workspace_ai/Finetuning-DeepSeek-R1/llama.cpp'


In [19]:
!make all -j -C llama.cpp

make: Entering directory '/home/ryan/Documents/workspaces/workspace_ai/Finetuning-DeepSeek-R1/llama.cpp'
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.
make: Leaving directory '/home/ryan/Documents/workspaces/workspace_ai/Finetuning-DeepSeek-R1/llama.cpp'


In [20]:
!pip install gguf protobuf

Collecting gguf
  Using cached gguf-0.14.0-py3-none-any.whl.metadata (3.7 kB)
Using cached gguf-0.14.0-py3-none-any.whl (76 kB)
Installing collected packages: gguf
Successfully installed gguf-0.14.0


In [21]:
!python llama.cpp/convert_hf_to_gguf.py DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT --outfile llama-3-8b-chat-doctor.gguf --outtype q8_0

INFO:hf-to-gguf:Loading model: DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> Q8_0, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.w
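The size of the resulting file can be estimated before converting. Q8_0 packs weights in blocks of 32, each block holding 32 8-bit integers plus one 16-bit scale, which works out to 8.5 bits per weight. The sketch below is a back-of-the-envelope estimate that assumes roughly 8.03 billion weights for this model class and ignores the small tensors kept in F32 and the file metadata.

```python
# Assumption: roughly 8.03e9 weights for a Llama-3-8B-class model.
n_params = 8.03e9

# Q8_0 stores weights in blocks of 32: 32 int8 values plus one 16-bit scale.
block_size = 32
bytes_per_block = 32 * 1 + 2

bits_per_weight = bytes_per_block * 8 / block_size
estimated_bytes = n_params / block_size * bytes_per_block

print(f"{bits_per_weight} bits per weight")
print(f"about {estimated_bytes / 1e9:.2f} GB")
```

The estimate lands close to the 8.54G reported by the upload progress bar in the next section, and well below the roughly 16 GB of the merged 16-bit safetensors shards.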

# 11. Pushing the model file to Hugging Face
To push the single file to the Hugging Face Hub, we'll:

1. Log in to the Hugging Face Hub using the API key.
2. Create the API object.
3. Upload the file by providing the local path, repo path, repo id, and repo type.

In [23]:
# from huggingface_hub import login
# from kaggle_secrets import UserSecretsClient
from huggingface_hub import HfApi
# user_secrets = UserSecretsClient()
# hf_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
# login(token = hf_token)

api = HfApi()
api.upload_file(
    path_or_fileobj="./llama-3-8b-chat-doctor.gguf",
    path_in_repo="llama-3-8b-chat-doctor-Q8_0.gguf",
    repo_id="rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT",
    repo_type="model",
)


llama-3-8b-chat-doctor.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT/commit/07f8b323a79d55a5570f1ddee1fa47e7140ad255', commit_message='Upload llama-3-8b-chat-doctor-Q8_0.gguf with huggingface_hub', commit_description='', oid='07f8b323a79d55a5570f1ddee1fa47e7140ad255', pr_url=None, repo_url=RepoUrl('https://huggingface.co/rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT', endpoint='https://huggingface.co', repo_type='model', repo_id='rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT'), pr_revision=None, pr_num=None)

# 12. How to use from `Ollama`
```shell
$ ollama run hf.co/rwibawa/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit-Medical-COT
```