<a href="https://colab.research.google.com/github/viragtiwari/text2svg/blob/main/notebook_for_qwen_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

qwen_models = [
    "unsloth/Qwen2.5-Coder-32B-Instruct",      # Qwen 2.5 Coder 2x faster
    "unsloth/Qwen2.5-Coder-7B",
    "unsloth/Qwen2.5-14B-Instruct",            # 14B fits in a 16GB card
    "unsloth/Qwen2.5-7B",
    "unsloth/Qwen2.5-72B-Instruct",            # 72B fits in a 48GB card
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-Coder-7B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.50.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update a small fraction of all parameters (0.58% in this run)!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
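As a rough check on how many weights the adapters add, the sketch below counts LoRA parameters for the seven `target_modules` at `r = 16`. The layer shapes are assumptions taken from the public Qwen2.5-7B config (hidden size 3584, MLP size 18944, 4 key/value heads of dimension 128, 28 layers); they do not appear in this notebook. Each adapted linear layer gets two low-rank matrices, so it adds `r * (in_features + out_features)` parameters.

```python
# Assumed Qwen2.5-7B shapes (from the public model config, not from this notebook).
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 512       # 4 key/value heads x 128 head dimension
NUM_LAYERS = 28
RANK = 16          # same value as r in get_peft_model above

# (in_features, out_features) for every module listed in target_modules.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to the frozen weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumed shapes the count agrees with the `Trainable parameters` figure that the training log prints further down.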


<a name="Data"></a>
### Data Prep
We now use the `Qwen-2.5` format for conversation style finetunes. We use the `viragorangecat/text2svg` dataset, which stores a `prompt` and an `svg` column per example, and convert each pair into a single user/assistant turn. Qwen renders multi turn conversations like below:

```
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
It's 4.<|im_end|>

```
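To make the layout above concrete, here is a minimal pure-Python sketch that renders `role`/`content` messages into the same ChatML markup. `render_chatml` is a hypothetical helper written for illustration; in the notebook the real rendering is done by `tokenizer.apply_chat_template`, which may also insert a default system prompt.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages in the ChatML layout shown above."""
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "It's 4."},
]
rendered = render_chatml(messages)
print(rendered)
```

Passing `add_generation_prompt=True` with only the user message gives the prompt shape used at inference time.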

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)

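# Template helper for datasets that have a "conversations" column.
# It is not used below, because text2svg only has "prompt" and "svg" columns.
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }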

from datasets import load_dataset
dataset = load_dataset("viragorangecat/text2svg")

print(dataset)

README.md:   0%|          | 0.00/333 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/25.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/20 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['prompt', 'svg'],
        num_rows: 20
    })
})


Input format for the dataset
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```

In [None]:
def convert_dataset_dict(dataset_dict):
    """Converts a DatasetDict to the desired format.

    Args:
        dataset_dict: The DatasetDict to convert.

    Returns:
        A list of ChatML-formatted strings, one per example.
    """

    converted_data = []
    for example in dataset_dict['train']:
        # Adding the prefixes here
        prompt_with_prefix = "<|im_start|>user\n" + example['prompt'] + "<|im_end|>\n"
        svg_with_prefix = "<|im_start|>assistant\n" + example['svg'] + "<|im_end|>\n"

        converted_data.append(
            prompt_with_prefix + svg_with_prefix
        )
    return converted_data

# Example usage:
converted_dataset = convert_dataset_dict(dataset)  # Assuming 'dataset' is your DatasetDict

from datasets import Dataset
trainer_dataset = Dataset.from_dict({"text": converted_dataset})

print(trainer_dataset['text'])

['<|im_start|>user\nA clean, minimalist mountain landscape design featuring layered mountains in grayscale tones with a golden sun element. The design uses simple geometric shapes to create depth and perspective.<|im_end|>\n<|im_start|>assistant\n<?xml version="1.0" encoding="UTF-8"?>\n<svg width="800" height="600" viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg">\n    <!-- Background -->\n    <rect width="800" height="600" fill="#f5f5f5"/>\n    \n    <!-- Mountains -->\n    <path d="M0 600 L300 200 L500 400 L800 100 L800 600 Z" fill="#e0e0e0"/>\n    <path d="M0 600 L200 300 L400 500 L600 200 L800 400 L800 600 Z" fill="#d0d0d0"/>\n    <path d="M0 600 L100 400 L300 550 L500 350 L700 500 L800 450 L800 600 Z" fill="#c0c0c0"/>\n    \n    <!-- Sun -->\n    <circle cx="700" cy="150" r="60" fill="#ffd700"/>\n</svg><|im_end|>\n', '<|im_start|>user\nAn abstract composition of overlapping circles in vibrant colors against a dark background. The design creates visual interest through tran

We now set up the `SFTTrainer` on the converted dataset:

In [None]:
from datasets import Dataset

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=trainer_dataset,  # Updated line
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=4,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=30,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="paged_adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/20 [00:00<?, ? examples/s]

The trainer tokenizes the `text` field of all 20 examples, as the progress bar above shows.

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!
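The relationship between `max_steps`, batch size and epochs can be worked out by hand. The sketch below uses the values from the trainer configuration above (20 examples, batch size 1, 4 gradient accumulation steps, `max_steps=30`); the exact rounding that the trainer applies internally is an assumption.

```python
import math

num_examples = 20                  # rows in the text2svg train split
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1
max_steps = 30

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(num_examples / effective_batch)

# How many passes over the data max_steps corresponds to.
epochs = max_steps / steps_per_epoch
print(effective_batch, steps_per_epoch, epochs)
```

With only 20 examples, 30 steps already means several passes over the same data, so the training loss alone says little about generalization.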

The trainer includes our **gradient accumulation bug fix**. Read more about it here: [Blog post](https://unsloth.ai/blog/gradient)

We also use Unsloth's `train_on_responses_only` function to only train on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
print(trainer_dataset[0])  # Print the first example in the dataset

{'text': '<|im_start|>user\nA clean, minimalist mountain landscape design featuring layered mountains in grayscale tones with a golden sun element. The design uses simple geometric shapes to create depth and perspective.<|im_end|>\n<|im_start|>assistant\n<?xml version="1.0" encoding="UTF-8"?>\n<svg width="800" height="600" viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg">\n    <!-- Background -->\n    <rect width="800" height="600" fill="#f5f5f5"/>\n    \n    <!-- Mountains -->\n    <path d="M0 600 L300 200 L500 400 L800 100 L800 600 Z" fill="#e0e0e0"/>\n    <path d="M0 600 L200 300 L400 500 L600 200 L800 400 L800 600 Z" fill="#d0d0d0"/>\n    <path d="M0 600 L100 400 L300 550 L500 350 L700 500 L800 450 L800 600 Z" fill="#c0c0c0"/>\n    \n    <!-- Sun -->\n    <circle cx="700" cy="150" r="60" fill="#ffd700"/>\n</svg><|im_end|>\n'}


In [None]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)

Map (num_proc=2):   0%|          | 0/20 [00:00<?, ? examples/s]

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|im_start|>user\nA circuit board inspired design featuring clean lines and connection points. The design creates a modern tech aesthetic with geometric shapes and subtle opacity variations.<|im_end|>\n<|im_start|>assistant\n<?xml version="1.0" encoding="UTF-8"?>\n<svg width="800" height="600" viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg">\n    <!-- Background -->\n    <rect width="800" height="600" fill="#2d3436"/>\n    \n    <!-- Circuit Board Pattern -->\n    <g stroke="#00b894" stroke-width="2" fill="none">\n        <!-- Horizontal Lines -->\n        <line x1="100" y1="100" x2="700" y2="100"/>\n        <line x1="200" y1="300" x2="600" y2="300"/>\n        <line x1="100" y1="500" x2="700" y2="500"/>\n        \n        <!-- Vertical Lines -->\n        <line x1="200" y1="50" x2="200" y2="550"/>\n        <line x1="400" y1="50" x2="400" y2="550"/>\n        <line x1="600" y1="50" x2="600" y2="550"/>\n    </g>\n    \n    <!-- Connection Points -->\n    <g fill="#00cec9">\n     

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                   <?xml version="1.0" encoding="UTF-8"?>\n<svg width="800" height="600" viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg">\n    <!-- Background -->\n    <rect width="800" height="600" fill="#2d3436"/>\n    \n    <!-- Circuit Board Pattern -->\n    <g stroke="#00b894" stroke-width="2" fill="none">\n        <!-- Horizontal Lines -->\n        <line x1="100" y1="100" x2="700" y2="100"/>\n        <line x1="200" y1="300" x2="600" y2="300"/>\n        <line x1="100" y1="500" x2="700" y2="500"/>\n        \n        <!-- Vertical Lines -->\n        <line x1="200" y1="50" x2="200" y2="550"/>\n        <line x1="400" y1="50" x2="400" y2="550"/>\n        <line x1="600" y1="50" x2="600" y2="550"/>\n    </g>\n    \n    <!-- Connection Points -->\n    <g fill="#00cec9">\n        <circle cx="200" cy="100" r="8"/>\n        <circle cx="400" cy="100" r="8"/>\n        <circle cx="600" cy="100" r="8"/>\n        <circle cx="200" cy="300" r="8"/>\n        <circle cx="400

We can see the instruction prompt is successfully masked!
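The masking idea can be sketched in plain Python. This is an illustration of the technique on toy token ids, not Unsloth's implementation: `mask_non_responses` and `find_sub` are hypothetical helpers, and the marker ids stand in for the tokenized `instruction_part` and `response_part` strings. Every label outside an assistant response is set to `-100`, the value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by the loss

def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    n = len(sub)
    for i in range(start, len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        # The response runs until the next instruction marker or the end.
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy ids: [1, 2] marks a user turn, [3, 4] marks an assistant turn.
input_ids = [1, 2, 10, 11, 3, 4, 20, 21, 1, 2, 12, 3, 4, 22]
labels = mask_non_responses(input_ids, [1, 2], [3, 4])
print(labels)
```

Only the ids `20, 21` and `22` keep their labels, which mirrors the decoded output above where the prompt is replaced by spaces and the SVG remains.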

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
10.438 GB of memory reserved.


We fixed a major gradient accumulation bug in all trainers. See [blog](https://unsloth.ai/blog/gradient) for more details.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20 | Num Epochs = 6 | Total steps = 30
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 40,370,176/7,000,000,000 (0.58% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.2319
2,0.2877
3,0.2669
4,0.2981
5,0.2894
6,0.214
7,0.2682
8,0.2618
9,0.1675
10,0.2489


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

412.9074 seconds used for training.
6.88 minutes used for training.
Peak reserved memory = 10.438 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 70.809 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Unsloth_Studio.ipynb)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
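For intuition, here is a minimal sketch of min-p filtering on a toy logit vector. It assumes the usual definition: after temperature scaling, tokens whose probability is below `min_p` times the top token's probability are dropped and the rest are renormalized. `min_p_filter` is a hypothetical helper, and the order in which a given library applies temperature and min-p may differ.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature and min-p filtering."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep tokens at least min_p times as likely as the best token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

probs = min_p_filter([2.0, 1.0, 0.0, -3.0])
print([round(p, 3) for p in probs])
```

The high temperature flattens the distribution to encourage variety, while the min-p cut removes the unlikely tail that high temperature would otherwise make reachable.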

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "purple pyramids spiraling around a bronze cone"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

# max_new_tokens must be large enough to hold a whole SVG; 1024 is an assumed budget
outputs = model.generate(input_ids = inputs, max_new_tokens = 1024, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

['<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\npurple pyramids spiraling around a bronze cone<|im_end|>\n<|im_start|>assistant\n<?xml version="1.0" encoding="UTF-8"?>\n<svg width="400" height="400" viewBox="0 0 400 400" xmlns="http://www.w3.org/2000/svg">\n    <!-- Background -->\n    <rect width="400" height="400" fill="#f0f4f8"/>\n    \n    <!-- Pyramids -->\n    <g transform="translate(200, 200)">\n        <!-- Purple Pyramid 1 -->\n        <path d="M0,0 L50,-100 L100,0 Z" fill="#7e57c2"/>\n        \n        <!-- Purple Pyramid 2 -->\n        <path d="M0,0 L-50,-100 L-100,0 Z" fill="#7e57c2" transform="rotate(120)"/>\n        \n        <!-- Purple Pyramid 3 -->\n        <path d="M0,0 L-50,-100 L50,-100 Z" fill="#7e57c2" transform="rotate(240)"/>\n    </g>\n    \n    <!-- Bronze Cone -->\n    <path d="M200,400 L300,300 L100,300 Z" fill="#cd7f32"/>\n    \n    <!-- Center Point -->\n    <circle cx="200" cy="200" r

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "purple pyramids spiraling around a bronze cone"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 5000,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

<?xml version="1.0" encoding="UTF-8"?>
<svg width="400" height="400" viewBox="0 0 400 400" xmlns="http://www.w3.org/2000/svg">
    <!-- Background -->
    <rect width="400" height="400" fill="#f0f4f8"/>
    
    <!-- Pyramids -->
    <g transform="translate(200, 200)">
        <!-- Purple Pyramid 1 -->
        <path d="M0,0 L50,-100 L100,0 Z" fill="#7e57c2"/>
        
        <!-- Purple Pyramid 2 -->
        <path d="M0,0 L-50,-100 L-100,0 Z" fill="#7e57c2" transform="rotate(120)"/>
        
        <!-- Purple Pyramid 3 -->
        <path d="M0,0 L-50,-100 L50,-100 Z" fill="#7e57c2" transform="rotate(240)"/>
    </g>
    
    <!-- Bronze Cone -->
    <path d="M200,400 L300,300 L100,300 Z" fill="#cd7f32"/>
    
    <!-- Center Point -->
    <circle cx="200" cy="200" r="5" fill="#333"/>
</svg><|repo_name|><?xml version="1.0" encoding="UTF-8"?>
<svg width="400" height="400" viewBox="0 0 400 400" xmlns="http://www.w3.org/2000/svg">
    <!-- Background -->
    <rect width="400" height="400

KeyboardInterrupt: 
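In the run above the model kept generating after the first `</svg>`, so the raw text needs post-processing before it can be used as an image. One possible approach, sketched below with only the standard library, is to keep the first complete `<svg>...</svg>` element and check that it parses as XML. `extract_first_svg` is a hypothetical helper, and the sample string is a shortened stand-in for the streamed output.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from the first <svg ...> to the first </svg>.
SVG_PATTERN = re.compile(r"<svg\b.*?</svg>", re.DOTALL)

def extract_first_svg(generated_text):
    """Return the first well-formed SVG element in the text, or None."""
    match = SVG_PATTERN.search(generated_text)
    if match is None:
        return None
    svg = match.group(0)
    try:
        ET.fromstring(svg)  # raises ParseError if the markup is broken
    except ET.ParseError:
        return None
    return svg

sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<svg width="400" height="400" xmlns="http://www.w3.org/2000/svg">\n'
    '    <circle cx="200" cy="200" r="5" fill="#333"/>\n'
    '</svg><|repo_name|><?xml version="1.0" encoding="UTF-8"?>\n'
    '<svg width="400" height="400'
)
first_svg = extract_first_svg(sample)
print(first_svg)
```

This only trims the output after the fact; it does not stop generation early, so a smaller `max_new_tokens` still saves time.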

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

<?xml version="1.0" encoding="UTF-8"?>
<scene>
    <title>Tall Tower in Paris</title>
    <description>A majestic tower standing tall in the heart of Paris.</description>
    <location>Paris, France</location>
    <architecture>
        <type>Tower</type>
        <height>100 meters</height>
        <material>Stone and metal</material>
    </architecture>
    <environment>
        <sky>Clear blue sky</sky>
        <weather>Sunny</weather>
        <background>Cityscape of Paris</background>
    </environment>



You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model",  # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit=load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
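As background on what "merged" means, the sketch below applies the standard LoRA merge formula `W' = W + (lora_alpha / r) * B @ A` to tiny matrices in plain Python. It illustrates the general technique under that assumption; how `save_pretrained_merged` performs the merge internally is not shown in this notebook, and `merge_lora` and `matmul` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_a, lora_b, lora_alpha, rank):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out x in)
lora_a = [[1.0, 2.0]]              # A: rank x in
lora_b = [[3.0], [4.0]]            # B: out x rank
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merged checkpoint can be served by tools that do not know about LoRA.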

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)