# Full Finetuning Qwen3-0.6B for Question Decomposition

This notebook demonstrates **full finetuning** (not LoRA) of the Qwen3-0.6B model using Unsloth for question decomposition tasks.

The model learns to break down complex multi-hop questions into sequences of single-hop, atomic questions.

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!

<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ‚≠ê <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ‚≠ê
</div>

To install Unsloth your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train) with full finetuning, how to [run the model](#Inference), & [how to save it](#Save)

## Key Features:
- **Full finetuning** - All model parameters are updated (not just LoRA adapters)
- **Question decomposition task** - Breaks complex questions into atomic sub-questions
- **Lower learning rate** required (5e-6 vs 2e-4 for LoRA)
- **More memory intensive** but potentially better quality for domain-specific tasks
- **Longer training time** as all parameters are updated
- Uses **BF16 precision** and **gradient checkpointing** for memory efficiency

## Dataset:
- Custom question decomposition dataset with 802 examples
- Each example contains a complex question and its decomposition into atomic sub-questions
- Formatted as user-assistant conversations for instruction tuning

### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Unsloth

In [1]:
from unsloth import FastLanguageModel
import torch
import os 
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen 14B 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

# FULL FINETUNING MODE - Using Qwen3-0.6B
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-0.6B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = False,    # Must be False for full finetuning
    load_in_8bit = False,    # Must be False for full finetuning - will use BF16
    full_finetuning = True,  # ENABLED: Full finetuning instead of LoRA!
    # token = "hf_...",      # use one if using gated models
)

ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.


Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu128 for torchao version 0.14.1             Please see https://github.com/pytorch/ao/issues/2919 for more info


INFO 11-12 14:35:56 [__init__.py:216] Automatically detected platform cuda.
ü¶• Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.9: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.0.
   \\   /|    NVIDIA RTX A5000. Num GPUs = 5. Max memory: 23.673 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.


**FULL FINETUNING MODE** - We are NOT using LoRA adapters. All model parameters will be updated during training.

In [2]:
# FULL FINETUNING - Skip LoRA adapter setup
# For full finetuning, we train all parameters directly without LoRA adapters
# The model is already prepared for full finetuning from the previous cell

print(f"Model loaded with full finetuning enabled.")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

Model loaded with full finetuning enabled.
Total parameters: 596,049,920
Trainable parameters: 596,049,920


<a name="Data"></a>
### Data Prep - Question Decomposition Dataset

We use a custom dataset of question decompositions for training. The dataset contains:
- Original multi-hop questions
- Decomposed single-hop questions with retrieval flags
- Each example is formatted as a user-assistant conversation

In [3]:
import json
from datasets import Dataset

# Load the question decomposition dataset
dataset_path = "/mnt/SSD6/yigit/gsw-memory/playground/question_decomp_local/q_decomp_training_5.json"

print(f"Loading dataset from: {dataset_path}")
with open(dataset_path, 'r', encoding='utf-8') as f:
    decomposition_results = json.load(f)

print(f"Loaded {len(decomposition_results)} question decompositions")

# Convert to list format
decomposition_list = list(decomposition_results.values())

print(f"\nExample structure:")
print(f"Keys: {decomposition_list[0].keys()}")
print(f"\nFirst example:")
print(f"Question ID: {decomposition_list[0]['question_id']}")
print(f"Original Question: {decomposition_list[0]['original_question']}")
print(f"Decomposed Questions: {decomposition_list[0]['decomposed_questions']}")

Loading dataset from: /mnt/SSD6/yigit/gsw-memory/playground/question_decomp_local/q_decomp_training_5.json
Loaded 802 question decompositions

Example structure:
Keys: dict_keys(['question_id', 'original_question', 'decomposed_questions'])

First example:
Question ID: 2hop__42543_20093
Original Question: What year did the writer of Crazy Little Thing Called Love die?
Decomposed Questions: [{'question': 'Who wrote "Crazy Little Thing Called Love"?', 'requires_retrieval': True}, {'question': 'In what year did <ENTITY_Q1> die?', 'requires_retrieval': True}]


### Convert to Chat Format

Now we'll convert each example into chat format for training

In [4]:
def create_chat_messages(example):
    """
    Convert a single example into chat format for training.
    
    Args:
        example: Dict with 'original_question' and 'decomposed_questions' keys
    
    Returns:
        Dict with 'messages' key containing the chat-formatted data
    """
    original_question = example['original_question']
    decomposed_questions = example['decomposed_questions']
    
    # Serialize the decomposed questions to JSON format (this is what the model should output)
    assistant_response = json.dumps(
        {"questions": decomposed_questions},
        indent=4,
        ensure_ascii=False
    )
    
    # Create the instruction prompt for the user (same as used in question_decomp_lora_ft.py)
    user_prompt = f"""Your task is to break down a complex multi-hop question into the most efficient sequence of single-hop, **atomic** questions.

## Your Main Goal: Build Smart Bridges, Don't Just Collect Nouns
The most critical skill is to convert complex logical clauses (like "despite," "the country where," "the year before") into a single, powerful **bridging question**. This question should use a known entity as context to find the next one. Avoid finding all the entities separately and then trying to figure out how they connect.

---
## A Simple Analogy for Efficiency

**Question:** "What is the phone number of the mother of the tallest player on the Lakers?"

** Inefficient Path:**
1.  Who are the players on the Lakers?
2.  What are all their heights?
3.  Who is the mother of the tallest player? *(This step is a logical leap)*

** Efficient Path:**
1.  Who is the tallest player on the Lakers?
2.  Who is the mother of `<ENTITY_Q1>`?
3.  What is the phone number of `<ENTITY_Q2>`?

---
## How to Decompose a Question
This process follows a logical flow from high-level analysis to the fine-tuning of your question chain.

### 1. Analyze the Query's Components
First, break down the original question into its fundamental building blocks. Identify the core **entities** (people, places, organizations), their **properties** (attributes like rank, location, date), and the **relationships** that connect them.

### 2. Construct an Atomic Chain
Next, formulate a sequence of questions where each question retrieves a single fact.
* **Isolate Comparisons:** Don't ask "who is faster?" Ask for the specific rank or time of each person involved.
* **Link with Placeholders:** Use `<ENTITY_Qn>` to pass the answer from a previous question (`Qn`) into the next one.

### 3. Optimize for Efficiency and Precision
Your final goal is the **shortest and most direct path** to the answer.
* **Embed Constraints to Build Bridges:** If a piece of information is only a filter (like a date or location), embed it as a constraint in the next question instead of asking for it directly.
**Important note for bridges:** There can be no `<ENTITY_Qn>` in the first question if the nth question DOES NOT require retrieval.

## Formatting
Format each decomposed question as follows:

<decomposition>
Question: [the question text]
Requires retrieval: [true/false]

And provide the response in the following json format:
{{
  "questions": [
    {{
      "question": "the decomposed question text",
      "requires_retrieval": "true/false"
    }}
  ]
}}

Examples:

Input: "What is the birth year of the spouse of the director of Casablanca?"
Output:
{{
    "questions": [
        {{
            "question": "Who directed Casablanca?",
            "requires_retrieval": "true"
        }},
        {{
            "question": "Who was <ENTITY_Q1>'s spouse?",
            "requires_retrieval": "true"
        }},
        {{
            "question": "What is <ENTITY_Q2>'s birth year?",
            "requires_retrieval": "true"
        }}
    ]
}}

Input: "Which film has the director who is older, Dune or The Dark Knight?"
Output:
{{
    "questions": [
        {{
            "question": "Who directed Dune?",
            "requires_retrieval": "true"
        }},
        {{
            "question": "Who directed The Dark Knight?",
            "requires_retrieval": "true"
        }},
        {{
            "question": "Who is older, <ENTITY_Q1> or <ENTITY_Q2>?",
            "requires_retrieval": "true"
        }},
        {{
            "question": "Who is older, <ENTITY_Q1> or <ENTITY_Q2>?",
            "requires_retrieval": "false"
        }}
    ]
}}


IMPORTANT:
    AVOID over-decomposition like this:
    DON'T break "Who is John Doe?" into:
    1. Who is John Doe? ‚Üí "English"
    2. When was <ENTITY_Q1> born? ‚Üí "When was English born?"

    DO ask directly: "When was John Doe born?"

Now decompose this question:
Input: "{original_question}"
Output:
"""
    
    # Create the chat messages in the format expected by chat models
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_response},
    ]
    
    return {"messages": messages}

print("Chat formatting function created!")

Chat formatting function created!


In [5]:
# Create HuggingFace Dataset from the decomposition list
raw_dataset = Dataset.from_list(decomposition_list)

print(f"Raw dataset info:")
print(raw_dataset)
print(f"\nColumn names: {raw_dataset.column_names}")

# Apply the preprocessing to create the final training dataset
training_dataset = raw_dataset.map(
    create_chat_messages,
    remove_columns=raw_dataset.column_names,  # Remove original columns, keep only 'messages'
    desc="Creating chat-formatted training data"
)

print(f"\nTraining dataset created!")
print(training_dataset)
print(f"\nColumn names: {training_dataset.column_names}")
print(f"\nFirst example messages (user prompt first 500 chars):")
print(training_dataset[0]['messages'][0]['content'][:500] + "...")
print(f"\nAssistant response (first 300 chars):")
print(training_dataset[0]['messages'][1]['content'][:300] + "...")

Raw dataset info:
Dataset({
    features: ['question_id', 'original_question', 'decomposed_questions'],
    num_rows: 802
})

Column names: ['question_id', 'original_question', 'decomposed_questions']


Creating chat-formatted training data:   0%|          | 0/802 [00:00<?, ? examples/s]


Training dataset created!
Dataset({
    features: ['messages'],
    num_rows: 802
})

Column names: ['messages']

First example messages (user prompt first 500 chars):
Your task is to break down a complex multi-hop question into the most efficient sequence of single-hop, **atomic** questions.

## Your Main Goal: Build Smart Bridges, Don't Just Collect Nouns
The most critical skill is to convert complex logical clauses (like "despite," "the country where," "the year before") into a single, powerful **bridging question**. This question should use a known entity as context to find the next one. Avoid finding all the entities separately and then trying to figure o...

Assistant response (first 300 chars):
{
    "questions": [
        {
            "question": "Who wrote \"Crazy Little Thing Called Love\"?",
            "requires_retrieval": true
        },
        {
            "question": "In what year did <ENTITY_Q1> die?",
            "requires_retrieval": true
        }
    ]
}...


### Apply Chat Template

Convert messages to text format using the tokenizer's chat template

In [6]:
# Apply chat template to convert to final training format
# This will be done automatically by SFTTrainer, but we test it here

print("Testing chat template formatting...")
sample_formatted = tokenizer.apply_chat_template(
    training_dataset[0]["messages"],
    tokenize=False,
    add_generation_prompt=False
)

print(f"\nFormatted sample (first 500 chars):")
print(sample_formatted[:500])
print("\n... [content truncated] ...")
print(f"\nLast 200 chars:")
print(sample_formatted[-200:])

print(f"\n‚úì Chat template formatting works!")
print(f"Training dataset ready with {len(training_dataset)} examples")

Testing chat template formatting...

Formatted sample (first 500 chars):
<|im_start|>user
Your task is to break down a complex multi-hop question into the most efficient sequence of single-hop, **atomic** questions.

## Your Main Goal: Build Smart Bridges, Don't Just Collect Nouns
The most critical skill is to convert complex logical clauses (like "despite," "the country where," "the year before") into a single, powerful **bridging question**. This question should use a known entity as context to find the next one. Avoid finding all the entities separately and then t

... [content truncated] ...

Last 200 chars:
d Love\"?",
            "requires_retrieval": true
        },
        {
            "question": "In what year did <ENTITY_Q1> die?",
            "requires_retrieval": true
        }
    ]
}<|im_end|>


‚úì Chat template formatting works!
Training dataset ready with 802 examples


### Dataset Ready for Training

The dataset is now prepared and ready for full finetuning. Proceed to the training section below.

<a name="Train"></a>
### Train the model with Full Finetuning

Now let's train our model using **full finetuning** (all parameters are updated, not just LoRA adapters). 

We do 30 steps for quick testing, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`.

**Important notes for full finetuning:**
- Learning rate is much lower (5e-6 vs 2e-4 for LoRA)
- Training will take longer as all parameters are updated
- Memory usage is higher, but gradient checkpointing helps
- The model quality can be better than LoRA since all parameters are trained

In [7]:
from trl import SFTTrainer, SFTConfig
from typing import Union, List, Dict, Any

def formatting_function(example):
    """
    Unsloth requires List[str].
    - If 'example["messages"]' is a list of message-lists (batched), return one
      formatted string per item.
    - If it's a single list of messages (single sample), return a 1-element list.
    """
    msgs = example["messages"]

    # Batched: msgs is like [ [ {role, content}, ... ], [ {role, content}, ... ], ... ]
    if isinstance(msgs, list) and msgs and isinstance(msgs[0], list):
        texts = [
            tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=False)
            for m in msgs
        ]
    else:
        # Single example
        texts = [
            tokenizer.apply_chat_template(
                msgs, tokenize=False, add_generation_prompt=False
            )
        ]

    # Ensure flat list[str] with no empties
    return [t for t in texts if isinstance(t, str) and t.strip()]


# Create SFTConfig (formatting_func goes to SFTTrainer, not here)
training_args = SFTConfig(
    output_dir="./qwen3_0.6b_question_decomp_full_ft",  # Required parameter
    per_device_train_batch_size = 2,  # Can increase for 0.6B model
    gradient_accumulation_steps = 4,   # Use GA to mimic batch size!
    warmup_steps = 5,
    # num_train_epochs = 1,            # Set this for 1 full training run.
    max_steps = 30,                    # For quick testing
    # max_length=2048,
    learning_rate = 5e-6,              # Lower LR for full finetuning (was 2e-4 for LoRA)
    logging_steps = 1,
    optim = "adamw_8bit",
    weight_decay = 0.01,               # Increased weight decay for full finetuning
    lr_scheduler_type = "linear",
    seed = 3407,
    # bf16 = True,                       # Use BF16 precision for full finetuning
    gradient_checkpointing = True,     # Enable to save memory
    report_to = "wandb",                # Use TrackIO/WandB etc
)

# Create trainer - formatting_func is passed to SFTTrainer, not SFTConfig
# Note: We removed the custom data_collator that was causing the "no loss" error
# SFTTrainer will use its default DataCollatorForLanguageModeling which properly handles labels
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = training_dataset,  # Using the question decomposition dataset
    eval_dataset = None, # Can set up evaluation!
    formatting_func = formatting_function,  # Pass to SFTTrainer in newer trl versions
    args = training_args,
)

Unsloth: Tokenizing ["text"] (num_proc=68):   0%|          | 0/802 [00:00<?, ? examples/s]

In [8]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA RTX A5000. Max memory = 23.673 GB.
1.135 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [9]:
trainer_stats = trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 802 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 596,049,920 of 596,049,920 (100.00% trained)
[34m[1mwandb[0m: Currently logged in as: [33mmyigitturali[0m ([33mmyigitturali-UCLA[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Initializing weave.


RuntimeError: Expected attn_mask dtype to be bool or float or to match query dtype, but got attn_mask.dtype: long int and  query.dtype: c10::BFloat16 instead.

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2476.6345 seconds used for training.
41.28 minutes used for training.
Peak reserved memory = 14.508 GB.
Peak reserved memory for training = 2.61 GB.
Peak reserved memory % of max memory = 98.419 %.
Peak reserved memory for training % of max memory = 17.706 %.


<a name="Inference"></a>
### Inference - Question Decomposition

Let's test the model on question decomposition! The model should break down complex multi-hop questions into atomic single-hop questions.

The model will output JSON format with:
- `questions`: List of decomposed sub-questions
- `requires_retrieval`: Boolean flag for each question indicating if it needs retrieval

For structured JSON output, use lower temperature (0.0-0.3) for consistency.

In [None]:
# Example 1: Test question decomposition with a 2-hop question
test_question = "Where did Lothair II's mother die?"

# Create the prompt using the same format as training
user_prompt = f"""Your task is to break down a complex multi-hop question into the most efficient sequence of single-hop, **atomic** questions.

## Your Main Goal: Build Smart Bridges, Don't Just Collect Nouns
The most critical skill is to convert complex logical clauses (like "despite," "the country where," "the year before") into a single, powerful **bridging question**. This question should use a known entity as context to find the next one. Avoid finding all the entities separately and then trying to figure out how they connect.

Now decompose this question:
Input: "{test_question}"
Output:
"""

messages = [
    {"role": "user", "content": user_prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

from transformers import TextStreamer
print(f"Question: {test_question}\n")
print("Decomposition:")
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=512,  # Enough for JSON output
    temperature=0.1,     # Low temperature for consistent JSON
    top_p=0.9,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

In [None]:
# Example 2: Test with a more complex 3-hop question
test_question_2 = "What is the birth year of the spouse of the director of Casablanca?"

user_prompt_2 = f"""Your task is to break down a complex multi-hop question into the most efficient sequence of single-hop, **atomic** questions.

## Your Main Goal: Build Smart Bridges, Don't Just Collect Nouns
The most critical skill is to convert complex logical clauses (like "despite," "the country where," "the year before") into a single, powerful **bridging question**. This question should use a known entity as context to find the next one. Avoid finding all the entities separately and then trying to figure out how they connect.

Now decompose this question:
Input: "{test_question_2}"
Output:
"""

messages_2 = [
    {"role": "user", "content": user_prompt_2}
]

text_2 = tokenizer.apply_chat_template(
    messages_2,
    tokenize=False,
    add_generation_prompt=True,
)

from transformers import TextStreamer
print(f"\nQuestion: {test_question_2}\n")
print("Decomposition:")
_ = model.generate(
    **tokenizer(text_2, return_tensors="pt").to("cuda"),
    max_new_tokens=512,
    temperature=0.1,
    top_p=0.9,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

### Saving the fully finetuned model
For full finetuning, we save the **entire model** (not just adapters like LoRA). You can use Huggingface's `push_to_hub` for online save or `save_pretrained` for local save.

**[NOTE]** This saves the complete model with all trained parameters.

In [None]:
# Save full model locally
model.save_pretrained("full_finetuned_model")
tokenizer.save_pretrained("full_finetuned_model")

# Save to Hugging Face Hub (uncomment and add your token)
# model.push_to_hub("your_name/qwen3-0.6b-finetuned", token = "...")
# tokenizer.push_to_hub("your_name/qwen3-0.6b-finetuned", token = "...")

Now if you want to load the fully finetuned model we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "full_finetuned_model", # Path to your saved model
        max_seq_length = 2048,
        load_in_4bit = True,  # Can use 4bit for inference to save memory
    )

### Saving to float16 for VLLM

For full finetuning, you can save the model in different formats. Select `merged_16bit` for float16 or `merged_4bit` for int4. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

**Note:** For full finetuning, the model is already "merged" (no adapters to merge), so these methods will save the complete model in the specified format.

In [None]:
# Save full model to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Save full model to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Save full model in standard format
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

For full finetuning, the entire model will be converted to GGUF format.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )