### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch

model_size = "14B"
fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 1.7B 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = f"unsloth/Qwen3-{model_size}",
    max_seq_length = 8192,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.7: Fast Qwen3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

NameError: name 'FastLanguageModel' is not defined
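
To make the "1 to 10% of all parameters" claim concrete, here is a minimal sketch that counts trainable LoRA parameters for a single linear layer. The layer shape (4096 x 4096) is a hypothetical value chosen for illustration, not a dimension read from Qwen3, and `lora_param_counts` is an illustrative helper, not part of Unsloth or PEFT.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full_weight_params, lora_params) for one linear layer."""
    full = d_in * d_out
    # LoRA adds two low-rank matrices: A (r x d_in) and B (d_out x r).
    lora = r * (d_in + d_out)
    return full, lora


# Hypothetical layer shape, used only to illustrate the arithmetic.
full, lora = lora_param_counts(d_in=4096, d_out=4096, r=32)
fraction = lora / full
print(f"full={full:,} lora={lora:,} fraction={fraction:.2%}")
```

The fraction grows linearly with `r`, so doubling the rank roughly doubles the number of trainable parameters per adapted layer.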

<a name="Data"></a>
### Data Prep
Qwen3 has both a reasoning mode and a non-reasoning mode, so we should use 2 datasets:

1. We use the [Open Math Reasoning]() dataset which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of verifiable reasoning traces that used DeepSeek R1 and got > 95% accuracy.

2. We also leverage [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we need to convert it to HuggingFace's normal multiturn format as well.
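
As a rough illustration of the ShareGPT to multiturn conversion mentioned in item 2, the sketch below maps `from`/`value` turns to `role`/`content` messages. It is a hand-written stand-in, not Unsloth's `standardize_sharegpt` implementation, and the role mapping in `ROLE_MAP` is an assumption about the common ShareGPT speaker names.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(conversation: list[dict]) -> list[dict]:
    """Convert ShareGPT-style turns into role/content message dicts."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


sharegpt_row = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4"},
]
messages_row = sharegpt_to_messages(sharegpt_row)
print(messages_row)
```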

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
from datasets import load_from_disk
reasoning_dataset = load_from_disk("/content/drive/MyDrive/Colab Notebooks/Training Data/aidit/sft_filtered_dataset__5000.json")


In [None]:
from datetime import datetime

model_size = "14B"
def save_model(model, tokenizer, model_size="14B"):
  dt = datetime.now()
  timestamp = dt.strftime("%Y%m%d_%H")

  lora_model_name = f"qwen3_sft_saved_lora_{model_size}_{timestamp}"
  model.save_pretrained(f"/content/drive/MyDrive/Colab Notebooks/models/{lora_model_name}")  # Local saving
  tokenizer.save_pretrained(f"/content/drive/MyDrive/Colab Notebooks/models/{lora_model_name}")
  # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
  # tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Next we define the system prompt with the thinking and response guidelines for the assistant:

In [None]:
system_prompt_str = """
/think You are a helpful assistant that helps users with making and understanding the clinical diagnosis of their medical conditions.
In interacting with the patient try to obtain as much critical information about their symptoms and medical history as you can
Think carefully before you answer. The following are some guidelines for how to think.
[Thought Instructions]
* At each step determine the differential diagnosis. Do not share this with the user but use it to determine the next question to ask.
* Identify the top question to ask the member to gain higher confidence on a particular diagnosis.
* Take a moment to recollect how many questions you have asked since the start and since the last time you asked the member if they are willing to answer more questions. Start the count at 0 and increment by 1 after each question.

Following are some guidelines for how to respond to the user.
[Response Instructions]
Response to the member should be ONE of the following -
1) Ask the member 1 question at a time. Based on the diagnosis, pick the question that will provide you most information to narrow down the diagnosis. Do not repeat questions.
2) OR Once you have asked ~40 questions, provide your best diagnosis and what extra information you might need to narrow down.
3) OR Every 10 questions or so, ask the member if they are willing to answer more questions.
4) When asked for an assessment, give an assessment. DO NOT ask any questions.

Examples:
---------
User: I have a cold and fever.
Assistant: <think> A partial differential diagnosis for these symptoms: a. Cold b. Influenza c. Covid-19 d. RSV e. Pneumonia f. Bronchitis and many more. I should find out if this is acute or chronic.
This is my 5th question since the start or the last time I asked them if they are ok with answering more questions.
</think>Was the onset sudden or gradual?
User: It was pretty sudden
Assistant: <think> A partial differential diagnosis for these symptoms: a. Cold b. Influenza c. Covid-19 d. RSV e. Pneumonia f. Bronchitis and many more. I should find out if this is acute or chronic.
This is my 6th question since the start or the last time I asked them if they are ok with answering more questions.
</think>Are you experiencing any chills or chest pain?
User: Yes

"""

We now convert the reasoning dataset into conversational format:

In [None]:
def generate_conversation_corrected(examples):
    """
    Generates conversation structures from prompts and labels.

    Args:
        examples (dict): A dictionary likely from dataset.map(batched=True),
                         containing 'prompt' and 'label' keys.
                         examples['prompt'] is a list of individual prompt entries.
                         examples['label'] is a list of individual label entries.

    Returns:
        dict: A dictionary with a single key "conversations", where the value
              is a list of generated conversation structures. Each structure
              is a list of message dictionaries.
    """
    prompts_batch = examples["prompt"]
    labels_batch = examples["label"]

    processed_conversations_batch = []

    for single_prompt_data, single_label_content in zip(prompts_batch, labels_batch):

        # For debugging: print types of individual prompt/label from the batch
        # print(f"Processing - Prompt type: {type(single_prompt_data)}, Label type: {type(single_label_content)}")
        # print(f"Prompt data: {single_prompt_data}, Label data: {single_label_content}")

        current_conversation_turns = []

        # --- Handle the 'prompt' part ---
        if isinstance(single_prompt_data, list):
            # Expect a list of message dictionaries; validate before touching anything.
            valid_prompt_dicts = all(
                isinstance(item, dict) and "role" in item and "content" in item
                for item in single_prompt_data
            )

            if valid_prompt_dicts:
                # Copy each message dict so the rows stored in the dataset are not mutated.
                current_conversation_turns = [dict(item) for item in single_prompt_data]
                # Overwrite the first message (assumed to be the system turn) with our prompt.
                if current_conversation_turns:
                    current_conversation_turns[0]["content"] = system_prompt_str
            else:
                # print(f"Warning: Prompt is a list but not of expected dicts: {single_prompt_data}. Treating as new conversation.")
                # Fallback: treat problematic list content as a new user message if possible, or handle as error
                current_conversation_turns.append({"role": "user", "content": str(single_prompt_data)})

        elif isinstance(single_prompt_data, str):
            # If prompt is a string, assume it's the first user message
            current_conversation_turns.append({"role": "user", "content": single_prompt_data})
        else:
            # Handle unexpected types for prompt (e.g., None, int).
            # This is a potential source of ArrowInvalid errors if not made consistent.
            # print(f"Warning: Unexpected type for prompt: {type(single_prompt_data)}. Value: '{single_prompt_data}'. Converting to user message.")
            # Convert to string and treat as a user message, or create a placeholder.
            current_conversation_turns.append({"role": "user", "content": str(single_prompt_data)})

        # --- Handle the 'label' part (assistant's response) ---
        # Ensure the label (assistant's content) is a string.
        assistant_content = ""
        if isinstance(single_label_content, str):
            assistant_content = single_label_content
        elif single_label_content is not None:
            # print(f"Warning: Label is not a string ({type(single_label_content)}). Converting to string: '{single_label_content}'")
            assistant_content = str(single_label_content)
        # else: label is None, assistant_content remains ""

        # Append the assistant's message dictionary directly
        current_conversation_turns.append(
            {"role": "assistant", "content": assistant_content}
        )
        # Include only the last 2 turns
        processed_conversations_batch.append(current_conversation_turns[-2:])

    return {"conversations": processed_conversations_batch}
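
To show the input and output shapes that a batched `map` function works with, here is a stripped-down sketch of the same conversion on a one-row toy batch. `to_conversations` and the sample data are illustrative only. Note that keeping just the last two turns drops any earlier system message, so each resulting conversation holds one user turn and one assistant turn.

```python
def to_conversations(examples: dict) -> dict:
    """Toy stand-in for the batched conversion: one conversation per row."""
    conversations = []
    for prompt, label in zip(examples["prompt"], examples["label"]):
        turns = [dict(message) for message in prompt]  # copy, do not mutate input
        turns.append({"role": "assistant", "content": str(label)})
        conversations.append(turns[-2:])  # keep only the last two turns
    return {"conversations": conversations}


# A batch is a dict of columns, each column a list with one entry per row.
batch = {
    "prompt": [[
        {"role": "system", "content": "(system prompt)"},
        {"role": "user", "content": "I have a cold and fever."},
    ]],
    "label": ["<think>...</think>Was the onset sudden or gradual?"],
}
result = to_conversations(batch)
print(result["conversations"][0])
```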


In [None]:
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation_corrected, batched = True)["conversations"],
    tokenize = False,
)


Map:   0%|          | 0/2178 [00:00<?, ? examples/s]

Let's look at one transformed row:

In [None]:
reasoning_conversations[8]

'<|im_start|>user\nSO I GET IT IN SOME OF THE JOINTS IN MY HANDS AND THEN ALSO MY FEET AS WELL SO YEAH MY YEAH REALLY IN MY FEET AND MY HANDS<|im_end|>\n<|im_start|>assistant\n<think>\nOk let us see. Here is a summary of the case so far: The patient reports joint pain that has been ongoing for several months and has been worsening over this period. The pain is located in some joints of their hands and feet.\nBased on this, we have the differential diagnosis: Given the chronic, worsening nature of the pain affecting multiple joints in the hands and feet, potential conditions include inflammatory arthritis (such as Rheumatoid Arthritis) and osteoarthritis. Other types of arthritis or connective tissue disorders could also be considered.\nOk, lets think about what we should do next. I need to ask a question to get more specific details about the location of the pain within the hands and feet, as this can help differentiate between potential causes.\n\nOk let me remember the Question count

Next we would take the non-reasoning dataset and convert it to conversational format as well, using Unsloth's `standardize_sharegpt` function to fix up its format first. That step is skipped in this run, which trains on the reasoning conversations only.

Now let's see how long the dataset is:

In [None]:
print(len(reasoning_conversations))

2178


A non-reasoning chat dataset such as FineTome-100k is much longer than the reasoning set. If we want the model to retain some reasoning capabilities while being primarily a chat model, we can define a ratio of chat-only data and mix both sets, for example 25% reasoning and 75% chat.

In this run only the reasoning conversations are loaded, so the training set below is built from `reasoning_conversations` alone:

In [None]:
import pandas as pd
data = pd.concat([
    pd.Series(reasoning_conversations),
])
data.name = "text"

from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)

In [None]:
combined_dataset

Dataset({
    features: ['text'],
    num_rows: 2178
})
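
If a chat dataset were mixed in at the 25% reasoning / 75% chat ratio discussed above, the number of chat rows to sample follows from the reasoning row count. The sketch below shows that arithmetic; `chat_rows_needed` is a hypothetical helper, and the 2178 figure is the row count printed earlier.

```python
def chat_rows_needed(n_reasoning: int, chat_fraction: float) -> int:
    """Rows of chat data needed so chat makes up `chat_fraction` of the mix."""
    if not 0.0 <= chat_fraction < 1.0:
        raise ValueError("chat_fraction must be in [0, 1)")
    return int(n_reasoning * chat_fraction / (1.0 - chat_fraction))


n_reasoning = 2178
n_chat = chat_rows_needed(n_reasoning, chat_fraction=0.75)
reasoning_share = n_reasoning / (n_reasoning + n_chat)
print(n_chat, reasoning_share)
```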

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
max_seq_length = 2048

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,

        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps = 5,
        max_steps = 60,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map:   0%|          | 0/2178 [00:00<?, ? examples/s]

ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The trainer configured above runs 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove `max_steps`.
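
To get a feel for how much data 60 steps actually covers, here is a back-of-the-envelope sketch using the batch settings and row count from this notebook. It assumes a single GPU, and the steps-per-epoch figure is an approximation that may differ slightly from what the trainer reports.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1          # assumption: single-GPU run
max_steps = 60
num_rows = 2178       # size of the training set

# Rows consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
rows_seen = effective_batch * max_steps
epochs_covered = rows_seen / num_rows
approx_steps_per_epoch = math.ceil(num_rows / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"rows seen in {max_steps} steps: {rows_seen}")
print(f"fraction of one epoch: {epochs_covered:.2f}")
print(f"approx. steps for one epoch: {approx_steps_per_epoch}")
```

With these settings the short run sees well under one pass over the data, which is why a full run would use `num_train_epochs` instead.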

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
9.344 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [None]:
trainer_stats = trainer.train()

NameError: name 'trainer' is not defined

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'start_gpu_memory' is not defined

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the Qwen3 team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`.

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`
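
One way to keep the two sets of sampling settings from getting mixed up is to store them in a small lookup keyed by mode. The sketch below is illustrative: `SAMPLING_PRESETS` and `generation_kwargs` are hypothetical names, and the values are the ones quoted above.

```python
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "chat": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
}


def generation_kwargs(enable_thinking: bool, max_new_tokens: int = 256) -> dict:
    """Pick sampling settings that match the chat template's thinking flag."""
    mode = "thinking" if enable_thinking else "chat"
    return {**SAMPLING_PRESETS[mode], "max_new_tokens": max_new_tokens}


thinking_kwargs = generation_kwargs(enable_thinking=True)
chat_kwargs = generation_kwargs(enable_thinking=False, max_new_tokens=128)
print(thinking_kwargs)
print(chat_kwargs)
```

The returned dict could then be unpacked into a `generate` call, using the same boolean that is passed as `enable_thinking` to `apply_chat_template`.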

In [None]:
messages = [
   {"role": "system", "content": system_prompt_str},
    {"role" : "user", "content" : "I have a fever and cold"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Enable thinking (set to False to disable)
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # Recommended for thinking mode
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<think>
Ok let us see. Here is a summary of the case so far: The patient reports experiencing a fever and a cold. They have not specified the duration of these symptoms, but the onset of the cold was sudden.
Based on this, we have the differential diagnosis: Viral Upper Respiratory Infection (common cold, flu, RSV), Bronchitis, Pneumonia.
Ok, lets think about what we should do next. [Asking another question because more info is needed to narrow down the differential diagnosis.]

Ok let me remember the Question count for assistant (questions asked since start or last check-in). [10]. [Is this a multiple of 10?] Yes, this is a multiple of 10. [Your decision if we should continue questions or ask for permission to ask more questions.] I should ask the member if they are willing to answer a few more questions.
</think>

I understand. Are you willing to answer a few more questions to help me understand what might be going on?<|im_end|>


In [None]:

import torch
#messages = corrected_reasoning_dataset[0]['prompt']

def test_model(model, tokenizer, messages=None):
  messages = [
    {"role": "system", "content": system_prompt_str},
    {"role": "system", "content": "Question count so far: 0"},
    {"role": "user", "content": "I have the worst stomach pain of my life"}
  ] if not messages else messages

  question_cnt = 0
  while True:
    # 1. Apply chat template to the current messages
    # The 'enable_thinking' parameter is specific to certain chat templates (e.g., Qwen3).
    # If it's not supported by your tokenizer, it might be ignored or cause an error.
    # For standard tokenizers, it's often not a recognized parameter.
    text_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True, # Crucial for instructing the model to generate a response
        enable_thinking=True, # Keep if your model/tokenizer supports this for specific behavior
    )

    # 2. Tokenize the formatted prompt
    # Ensure the tokenizer has a pad_token_id; for causal LMs, it's often set to eos_token_id
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    inputs = tokenizer(text_prompt, return_tensors="pt", padding=False, truncation=True).to(model.device)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask

    # Store the length of the input prompt tokens to slice the output later
    prompt_tokens_length = input_ids.shape[1]

    # 3. Generate the response from the model
    # from transformers import TextStreamer # TextStreamer can be used for streaming output
    # streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    with torch.no_grad(): # Important for inference
        generated_token_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
            temperature=0.7,
            top_p=0.8,
            top_k=20,
            pad_token_id=tokenizer.pad_token_id, # Ensure pad_token_id is passed
            eos_token_id=tokenizer.eos_token_id, # Useful for stopping generation
            # streamer=streamer, # Uncomment to use TextStreamer for real-time output
            **{} # Add any other generation kwargs here
        )

    # 4. Slice the output to get only the newly generated tokens
    # generated_token_ids[0] has the full sequence (prompt + new tokens)
    newly_generated_tokens = generated_token_ids[0, prompt_tokens_length:]

    # 5. Decode only the newly generated tokens
    assistant_reply = tokenizer.decode(newly_generated_tokens, skip_special_tokens=False)

    # Print only the assistant's new reply
    print(f"\nAssistant: {assistant_reply}")

    # 6. Append the assistant's new reply to the messages history
    messages.append({'role': 'assistant', 'content': assistant_reply.strip()})

    # 7. Get next user input
    user_input = input("\nYou: ")
    if user_input.lower() == "exit":
        print("Exiting chat.")
        break
    question_cnt +=1
    reask_permission = ""
    if question_cnt % 10 == 0:
      reask_permission = "The member has answered a lot of questions. Check with them if they are ok to answer a few more questions."
    messages.append({'role': 'system', 'content': f"Question count so far is: {question_cnt}. {reask_permission}"})
    print(messages[-1])
    messages.append({'role': 'user', 'content': user_input})
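
The question-count bookkeeping at the end of the loop can be isolated into a small function, which makes the "check in every 10 questions" rule easy to verify without loading a model. This is a sketch of the same logic; `count_message` and its `check_in_every` parameter are illustrative names.

```python
def count_message(question_cnt: int, check_in_every: int = 10) -> dict:
    """Build the system message that reports the running question count."""
    reask_permission = ""
    if question_cnt > 0 and question_cnt % check_in_every == 0:
        reask_permission = (
            "The member has answered a lot of questions. "
            "Check with them if they are ok to answer a few more questions."
        )
    content = f"Question count so far is: {question_cnt}. {reask_permission}".strip()
    return {"role": "system", "content": content}


print(count_message(3))
print(count_message(10))
```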


In [None]:
model

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
from datetime import datetime

model_size = "14B"
def save_model(model, tokenizer, model_size="14B"):
  dt = datetime.now()
  timestamp = dt.strftime("%Y%m%d_%H")

  lora_model_name = f"qwen3_sft_saved_lora_{model_size}_{timestamp}"
  model.save_pretrained(f"/content/drive/MyDrive/Colab Notebooks/models/{lora_model_name}")  # Local saving
  tokenizer.save_pretrained(f"/content/drive/MyDrive/Colab Notebooks/models/{lora_model_name}")
  # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
  # tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

save_model(model, tokenizer, model_size)
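
The adapter directory name is derived from the model size and an hour-granularity timestamp. The sketch below reproduces only that naming step with a fixed example time, so the result can be checked without a model; `lora_save_dir` is a hypothetical helper and the root path is the one used in this notebook.

```python
from datetime import datetime
from pathlib import PurePosixPath


def lora_save_dir(
    model_size: str,
    now: datetime,
    root: str = "/content/drive/MyDrive/Colab Notebooks/models",
) -> str:
    """Return the directory a LoRA adapter would be saved to at time `now`."""
    timestamp = now.strftime("%Y%m%d_%H")  # date plus hour, no minutes
    return str(PurePosixPath(root) / f"qwen3_sft_saved_lora_{model_size}_{timestamp}")


save_dir = lora_save_dir("14B", datetime(2025, 5, 23, 21, 15))
print(save_dir)
```

Because the timestamp stops at the hour, two saves within the same hour resolve to the same directory and the later one overwrites the earlier one.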

Now if you want to load the LoRA adapters we just saved for inference, call `load_model` with the name of the saved adapter directory:

In [None]:
from unsloth import FastLanguageModel
#sft model -> qwen3_sft_saved_lora_14B_20250523_21

def load_model(lora_model_name):
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name = f"/content/drive/MyDrive/Colab Notebooks/models/{lora_model_name}", # YOUR MODEL YOU USED FOR TRAINING
      max_seq_length = 2048,
      load_in_4bit = True,
  )
  return model, tokenizer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
messages = [
  {"role": "system", "content": system_prompt_str},
  {"role": "system", "content": f"Question count so far: 0"},
    {"role" : "user", "content" : "My eyes are itchy and i get headaches in bright light"}
]

test_model(model, tokenizer, messages)


Assistant: <think>
Ok let us see. Here is a summary of the case so far: The patient reports experiencing itchy eyes and headaches in bright light. These are the primary symptoms mentioned so far.
Based on this, we have the differential diagnosis: A. Allergic conjunctivitis (itchy eyes, light sensitivity) B. Migraine (headache, light sensitivity) C. Tension headache (headache, light sensitivity) D. Dry eye syndrome (itchy eyes, light sensitivity)
Ok, lets think about what we should do next. [Asking another question because more info is needed to narrow down the diagnosis.]

Ok let me remember the Question count for assistant (questions asked before this turn): 1. [Is this a multiple of 10?] No, it is not a multiple of 10. [Your decision if we should continue questions or ask permission to ask more questions.] I will continue asking questions.
</think>

Thank you for sharing that. Are you experiencing any other symptoms like sneezing, runny nose, or nasal congestion?<|im_end|>

You: no


KeyboardInterrupt: Interrupted by user

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

In [None]:
!ls

dpo_train.jsonl  huggingface_tokenizers_cache  unsloth_compiled_cache
drive		 sample_data		       unsloth_training_checkpoints
