To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants which outperforms other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [2]:
from unsloth import FastLanguageModel
import torch

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import logging
import kagglehub
import os
from datasets import Dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
import sys
print(sys.executable)  # Путь к Python-интерпретатору
!pip list             # Список установленных пакетов
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
/home/student/kuzin/python_project/TextToSvg_MLCUP/.venv/bin/python3
Package                  Version
------------------------ ----------------
absl-py                  2.3.0
accelerate               1.6.0
aiohappyeyeballs         2.6.1
aiohttp                  3.11.18
aiosignal                1.3.2
asttokens                3.0.0
async-timeout            5.0.1
attrs                    21.2.0
Automat                  20.2.0
Babel                    2.8.0
bcrypt                   3.2.0
bitsandbytes             0.45.3
blinker                  1.4
CacheControl             0.12.10
cachy                    0.3.0
cairocffi                1.7.1
CairoSVG                 2.7.1
certifi                  2020.6.20
cffi                     1.17.1
chardet                  4.0.0
charset-normalizer       3.4.2
cleo                     0.8.1
cl

### Unsloth

In [3]:
wandb login

In [4]:
model_name = "Qwen/Qwen3-8B"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    load_in_4bit=False,
    load_in_8bit=True,
    max_seq_length = 1024,   # Context length - can be longer, but uses more memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

==((====))==  Unsloth 2025.5.7: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA A100 80GB PCIe. Num GPUs = 1. Max memory: 79.252 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients


In [6]:
from datasets import load_dataset

for_fine_tune = load_dataset("Azrail/filtered_svg")
def remove_any_none(example):
    return all(value is not None for value in example.values())
print("До очистки:", for_fine_tune["train"].num_rows)
# Применяем фильтр ко всем разделам (train/test/val)
for_fine_tune = for_fine_tune.filter(remove_any_none)
print("После очистки:", for_fine_tune["train"].num_rows)
shuffled_dataset = for_fine_tune["train"].shuffle(seed=42)  # seed для воспроизводимости
print(shuffled_dataset)
for_fine_tune =  shuffled_dataset.select(range(len(shuffled_dataset) // 2))
print(f"После шаффла и обрезки: {len(for_fine_tune)}")

README.md:   0%|          | 0.00/368 [00:00<?, ?B/s]

До очистки: 131419
После очистки: 131263
Dataset({
    features: ['id', 'svg', 'description'],
    num_rows: 131263
})
После шаффла и обрезки: 65631


In [7]:
def get_samples(batch):
    samples = []
    for desc, svg in zip(batch["description"], batch["svg"]):
        user_msg = {"role": "user", "content": f'Generate SVG image from description "{desc}"'}
        assistant_msg = {"role": "assistant", "content": f"{svg}"}
        samples.append([user_msg, assistant_msg])
    return {"data": samples} 

for_fine_tune = for_fine_tune.map(
    get_samples,
    batched=True,
    remove_columns=["id","description", "svg"],
)

In [8]:
dataset = for_fine_tune.map(lambda x: {"text": tokenizer.apply_chat_template(x["data"], tokenize=False, add_generation_prompt=False, enable_thinking=False)}, remove_columns=["data"])

Map:   0%|          | 0/65631 [00:00<?, ? examples/s]

In [9]:
plug_prompt =  "Okay, I think I have finished thinking"


In [10]:
import re

def insert_text_between_think_tags(original_text):
    original_text = original_text["text"]
    return {
        "text":
        re.sub(
        r'(<think>).*?(</think>)',
        r'\1{}\2'.format(plug_prompt),
        original_text,
        flags=re.DOTALL
    )
    }

In [11]:
dataset = dataset.map(insert_text_between_think_tags)

Map:   0%|          | 0/65631 [00:00<?, ? examples/s]

In [12]:
print(dataset), dataset[0]

Dataset({
    features: ['text'],
    num_rows: 65631
})


(None,
 {'text': '<|im_start|>user\nGenerate SVG image from description "Stylized black QR code elements form a unique pattern on a white background."<|im_end|>\n<|im_start|>assistant\n<think>Okay, I think I have finished thinking</think>\n\n<svg xmlns="http://www.w3.org/2000/svg" viewBox="0.0 0.0 200.0 200.0" height="200.0px" width="200.0px"><path fill="#000000" fill-opacity="1.0" d="M75.0 25.0 A16.67 16.67 0.0 0 1 91.67 41.67 L91.67 75.0 A16.67 16.67 0.0 0 1 75.0 91.67 L41.67 91.67 A16.67 16.67 0.0 0 1 25.0 75.0 L25.0 41.67 A16.67 16.67 0.0 0 1 41.67 25.0 L75.0 25.0 Z M75.0 37.5 L41.67 37.5 A4.17 4.17 0.0 0 0 37.57 40.92 L37.5 41.67 L37.5 75.0 A4.17 4.17 0.0 0 0 40.92 79.1 L41.67 79.17 L75.0 79.17 A4.17 4.17 0.0 0 0 79.1 75.75 L79.17 75.0 L79.17 41.67 A4.17 4.17 0.0 0 0 75.75 37.57 L75.0 37.5 Z M75.0 108.33 A16.67 16.67 0.0 0 1 91.67 125.0 L91.67 158.33 A16.67 16.67 0.0 0 1 75.0 175.0 L41.67 175.0 A16.67 16.67 0.0 0 1 25.0 158.33 L25.0 125.0 A16.67 16.67 0.0 0 1 41.67 108.33 L75.0 10

In [13]:
dataset_split= dataset.train_test_split(test_size=0.1, seed=42)
print(dataset_split)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 59067
    })
    test: Dataset({
        features: ['text'],
        num_rows: 6564
    })
})


Let's sample the reasoning dataset by 25% (or whatever is 100% - chat_percentage)

Finally combine both datasets:

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`.

In [14]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_split['train'],
    eval_dataset = dataset_split['test'],
    args = SFTConfig(
        output_dir="Qwen16B/checkpoints",
        logging_dir="./logs",
        eval_strategy="epoch",
        dataset_text_field = "text",
        per_device_train_batch_size = 64,
        warmup_steps = 5,
        num_train_epochs = 3, # Set this for 1 full training run.
        learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 10,
        save_steps=100,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to="wandb", 
        run_name="qwen3-1b5566-lora-finetune",
        torch_compile= True
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=32):   0%|          | 0/59067 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=32):   0%|          | 0/6564 [00:00<?, ? examples/s]

In [15]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100 80GB PCIe. Max memory = 79.252 GB.
15.658 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [16]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 59,067 | Num Epochs = 3 | Total steps = 1,386
O^O/ \_/ \    Batch size per device = 64 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (64 x 2 x 1) = 128
 "-____-"     Trainable parameters = 87,293,952/8,278,029,312 (1.05% trained)


Unsloth: Enabled auto compiling


[34m[1mwandb[0m: Currently logged in as: [33mkuzinmails[0m ([33mkuzinmails-higher-school-of-economics[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

class GraphModule(torch.nn.Module):
    def forward(self):
        pass
        


Unsloth: Will smartly offload gradients to save VRAM!


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
model.save_pretrained("Qwen-16B_lora_finetune") 
tokenizer.save_pretrained("Qwen-16B_lora_finetune")
model.save_pretrained_merged("Qwen-16B_lora_finetune_bfloat16", tokenizer, save_method = "merged_16bit",)

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

261.8103 seconds used for training.
4.36 minutes used for training.
Peak reserved memory = 14.066 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 35.559 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
import torch
model_name = "Qwen/Qwen3-14B"
check_point_path=  "Qwen14B/checkpoints/checkpoint-400"
tokenizer = AutoTokenizer.from_pretrained(check_point_path)
model = AutoModelForCausalLM.from_pretrained(
    check_point_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
model.bfloat16()

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 5120, padding_idx=151654)
    (layers): ModuleList(
      (0-39): 40 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): lora.Linear(
            (base_layer): Linear(in_features=5120, out_features=5120, bias=False)
            (lora_dropout): ModuleDict(
              (default): Identity()
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=5120, out_features=32, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=32, out_features=5120, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear(
            (base_layer): Linear(in_features=5120, out_features=1024, bias=False)
            (lora_dropout): ModuleDict(
              (default

In [3]:
from peft import PeftModel

# Загрузите обученную модель с адаптерами
model = PeftModel.from_pretrained(model, "Qwen16B/checkpoints/checkpoint-1000")
model = model.merge_and_unload()

In [4]:
model.eval()

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 5120)
    (layers): ModuleList(
      (0-39): 40 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear(in_features=5120, out_features=1024, bias=False)
          (v_proj): Linear(in_features=5120, out_features=1024, bias=False)
          (o_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=5120, out_features=17408, bias=False)
          (up_proj): Linear(in_features=5120, out_features=17408, bias=False)
          (down_proj): Linear(in_features=17408, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((5120,), eps=1e-06)
        (post_attention_layernorm): 

In [9]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

We are given the equation:

$$
(x + 2)^2 = 0
$$

### Step 1: Take the square root of both sides

$$
\sqrt{(x + 2)^2} = \sqrt{0}
$$

$$
|x + 2| = 0
$$

### Step 2: Solve the absolute value equation

$$
x + 2 = 0
$$

$$
x = -2
$$

### ✅ Final Answer:

$$
\boxed{-2}
$$<|im_end|>


In [2]:
messages = [
 {"role": "user", "content": "Generate SVG image from description: two lakes near to mountins .The SVG scene should be of the following dimensions 256*256. First, think about the scene, about the objects on the stage and their characteristics. Draw simple. Make sure that all tags are closed and the svg is correct!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    repetition_penalty=10,  # немного штрафует за повторы
    enable_thinking = True, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 3000, # Increase for longer outputs!
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<think>Okay, I think I have finished thinking</think>

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0.0 0.0 256.0 256.0" height="256.0px" width="256.0px"><path fill="#e5e9f0" fill-opacity="1.0" d="M256.0 128.0 A128.0 128.0 0.0 1 1 0.0 128.0 A128.0 128.0 0.0 1 1 256.0 128.0 Z" /><path fill="#e5e9f0" fill-opacity="1.0" d="M256.0 128.0 A128.0 128.0 0.0 1 1 0.0 128.0 A128.0 128.0 0.0 1 1 256.0 128.0 Z" /><path fill="#f6f8fa" fill-opacity="1.0" d="M216.32 128.0 L128.0 128.0 C122.67 128.0 118.13 123.46 118.13 118.13 L118.13 116.32 C118.13 111.0 122.67 106.46 128.0 106.46 L216.32 106.46 C221.65 106.46 226.19 111.0 226.19 116.32 L226.19 118.13 C226.19 123.46 221.65 128.0 216.32 128.0 Z" /><path fill="#f6f8fa" fill-opacity="1.0" d="M128.0 128.0 L118.13 128.0 C118.13 123.46 113.59 118.92 108.26 118.92 L108.26 116.32 C108.26 111.0 112.8 106.46 118.13 106.46 L128.0 106.46 C133.33 106.46 137.87 111.0 137.87 116.32 L137.87 118.92 C137.87 123.46 133.33 128.0 128.0 128.0 Z" /><path fill="#f6f8fa" 

In [6]:
model.save_pretrained_merged("model_test_save", tokenizer, save_method = "merged_16bit",)

AttributeError: 'Qwen3ForCausalLM' object has no attribute 'save_pretrained_merged'

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
