To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Tokenizer setup: reuse EOS as the padding token and add a fallback chat template
tokenizer.pad_token = tokenizer.eos_token
if tokenizer.chat_template is None:
    tokenizer.chat_template = "{% for message in messages %}{{ message['role'] + ': ' + message['content'] + '\n' }}{% endfor %}"

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.11: Fast Llama patching. Transformers: 4.54.0.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Higher rank for Turkish, giving the adapter more capacity to learn
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64, # Alpha raised along with the rank
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.7.11 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
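
As a rough check on the "1 to 10%" figure, the adapter size can be worked out by hand: a LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights. The sketch below assumes the publicly documented Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); those numbers are not stated in this notebook.

```python
# Assumed Llama-3.1-8B dimensions (from the public model config, not from this notebook).
HIDDEN = 4096
MLP = 14336
KV_DIM = 8 * 128   # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
RANK = 32          # r used in get_peft_model above

# (in_features, out_features) of every projection listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in) and B (out x rank) next to each frozen weight.
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total comes to 83,886,080, which agrees with the "Trainable parameters" line in the training banner further down.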


<a name="Data"></a>
### Data Prep
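We now use the `ulasdesouza/jedai` dataset from the Hugging Face Hub: 9,000 Turkish customer-service dialogues stored in a `conversation` column, one turn per line, with each turn prefixed by `müşteri:` (customer) or `temsilci:` (agent). It replaces the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html), which this notebook template used before. You can replace this code section with your own data prep.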

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Alpaca.ipynb)

For text completions like novel writing, try this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb).
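
Training on completions only, as mentioned in the first note above, comes down to masking the prompt tokens in the labels so that they do not contribute to the loss. The sketch below illustrates the idea with plain Python lists. The helper name `mask_prompt_labels` and the toy token ids are hypothetical, and -100 is the label value that PyTorch's cross-entropy loss ignores by default. TRL's own implementation may differ in detail.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_prompt_labels(prompt_ids, completion_ids):
    """Return (input_ids, labels) where only completion tokens carry a loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy ids standing in for a tokenized prompt and response (made up for illustration).
prompt_ids = [11, 12, 13, 14]
completion_ids = [21, 22, 2]   # the last id plays the role of EOS

input_ids, labels = mask_prompt_labels(prompt_ids, completion_ids)
print(input_ids)
print(labels)
```

The model still sees the full sequence as input; only the positions labelled -100 are excluded when the loss is averaged.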

In [5]:
# 1. Prompt template, kept in Turkish because the model is trained to answer in Turkish.
# It casts the model as a professional Turkish customer-service agent and lists tone and reply rules.
customer_service_prompt = """<|begin_of_text|>Sen profesyonel bir Türk müşteri hizmetleri temsilcisisin. Müşterinin sorununu anlayarak, kibar, yardımcı ve çözüm odaklı bir yanıt ver.

GÖREVİN:
- Türkçe dilbilgisi kurallarına dikkat et
- Müşterinin duygularını anla ve empati göster
- Konkret ve uygulanabilir çözümler öner
- Profesyonel ama samimi bir dil kullan
- Gerektiğinde özür dile ve sorumluluğu al

YANIT VERİRKEN:
- "Merhaba" ile başla
- Sorunu özetle ve anlayış göster
- Adım adım çözüm sun
- "Başka yardımcı olabileceğim bir konu var mı?" ile bitir

### Müşteri Mesajı:
{}

### Müşteri Temsilcisi Yanıtı:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

# 2. New formatting function
# Splits each dialogue into turns and emits one training example per agent reply.
def formatting_turns_func(examples):
    conversations = examples["conversation"]
    texts = []

    for conv in conversations:
        # Split the conversation into lines; each line is one turn with a role prefix
        turns = conv.strip().split('\n')

        # Running conversation history
        context = []

        for turn in turns:
            if turn.startswith('müşteri:'):
                # Add the customer turn to the context
                context.append(turn)
            elif turn.startswith('temsilci:'):
                # Only use an agent turn if at least one customer turn precedes it
                if context:
                    # Join the history so far and strip the role prefixes
                    user_message = "\n".join(context).replace("müşteri: ", "").replace("temsilci: ", "")

                    # The agent's reply
                    assistant_message = turn.replace("temsilci: ", "")

                    # Format the text for training
                    text = customer_service_prompt.format(user_message, assistant_message) + EOS_TOKEN
                    texts.append(text)

                    # Add this agent turn to the context so it becomes history for the next turn
                    context.append(turn)

    return { "text": texts }

# 3. Load and format the dataset
from datasets import load_dataset
dataset = load_dataset("ulasdesouza/jedai", split="train")

# Map the dataset with the new formatting function.
# This yields far more training examples than the original 9,000 dialogues (19,308 in this run).
dataset = dataset.map(formatting_turns_func, batched=True, remove_columns=dataset.column_names)

# Check what one formatted example looks like
print("Yeni formatlanmış bir veri örneği:")
print(dataset[0]['text'])

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Yeni formatlanmış bir veri örneği:
<|begin_of_text|>Sen profesyonel bir Türk müşteri hizmetleri temsilcisisin. Müşterinin sorununu anlayarak, kibar, yardımcı ve çözüm odaklı bir yanıt ver.

GÖREVİN:
- Türkçe dilbilgisi kurallarına dikkat et
- Müşterinin duygularını anla ve empati göster
- Konkret ve uygulanabilir çözümler öner
- Profesyonel ama samimi bir dil kullan
- Gerektiğinde özür dile ve sorumluluğu al

YANIT VERİRKEN:
- "Merhaba" ile başla
- Sorunu özetle ve anlayış göster
- Adım adım çözüm sun
- "Başka yardımcı olabileceğim bir konu var mı?" ile bitir

### Müşteri Mesajı:
Merhaba, mikrodalga siparişimin durumunu öğrenmek istiyorum.

### Müşteri Temsilcisi Yanıtı:
Merhaba! Sipariş numaranızı paylaşabilir misiniz? Kontrol edip size bilgi vereceğim.<|end_of_text|>
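
To see why 9,000 dialogues turn into more training rows, the sketch below applies the same turn-splitting idea to one made-up dialogue. It is a simplified stand-in for the cell above, not a copy of it: the shortened `TEMPLATE`, the hard-coded `EOS` string and the sample conversation are assumptions for illustration.

```python
TEMPLATE = "### Müşteri Mesajı:\n{}\n\n### Müşteri Temsilcisi Yanıtı:\n{}"
EOS = "<|end_of_text|>"  # assumed EOS string, as seen in the printed sample above

def split_turns(conversation):
    """Emit one training text per agent turn, with all earlier turns as context."""
    context, texts = [], []
    for turn in conversation.strip().split("\n"):
        if turn.startswith("müşteri:"):
            context.append(turn)
        elif turn.startswith("temsilci:") and context:
            history = "\n".join(context).replace("müşteri: ", "").replace("temsilci: ", "")
            reply = turn.replace("temsilci: ", "")
            texts.append(TEMPLATE.format(history, reply) + EOS)
            context.append(turn)  # the reply becomes history for the next turn
    return texts

# Made-up dialogue with two agent replies.
sample = (
    "müşteri: Merhaba, siparişim nerede?\n"
    "temsilci: Merhaba! Sipariş numaranızı alabilir miyim?\n"
    "müşteri: Numaram 123.\n"
    "temsilci: Teşekkürler, kontrol ediyorum."
)

examples = split_turns(sample)
print(len(examples))
print(examples[1])
```

Each agent reply becomes its own example, and the second example carries the first exchange as context, so a dialogue with two agent replies contributes two rows.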


<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 300 steps to keep the run short, but for a full run you can set `num_train_epochs=1` and `max_steps=None`. We also support TRL's `DPOTrainer`!

In [6]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True, # 🚀 Packs short examples together for faster training
    args = SFTConfig(
        per_device_train_batch_size = 4, # Batch size increased
        gradient_accumulation_steps = 2, # Reduced, for faster steps
        warmup_steps = 20, # Short warmup
        max_steps = 300, # ⏱️ Expected to take roughly 20-30 minutes
        learning_rate = 3e-4, # Raised for faster learning
        logging_steps = 50, # Less log spam
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine", # Cosine decay schedule
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
        save_steps = 150, # Intermediate checkpoint
        save_total_limit = 2, # Keep only 2 checkpoints
        dataloader_num_workers = 2, # Faster data loading
        fp16 = False,
        remove_unused_columns = False,
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/19308 [00:00<?, ? examples/s]

In [7]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
7.135 GB of memory reserved.


In [9]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 19,308 | Num Epochs = 1 | Total steps = 300
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Step,Training Loss
50,0.1212
100,0.0886
150,0.0748
200,0.069
250,0.0639
300,0.0688
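
The banner and the `SFTConfig` values can be combined into a quick back-of-the-envelope check of how much data this run covers. The sketch only does arithmetic on numbers printed above. It ignores any effect of `packing`, which can merge several short examples into one sequence, so treat the epoch fraction as a rough figure.

```python
per_device_batch = 4
grad_accum_steps = 2
num_gpus = 1
max_steps = 300
num_examples = 19_308  # "Num examples" from the training banner

# One optimizer step consumes this many sequences.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
sequences_seen = effective_batch * max_steps
epoch_fraction = sequences_seen / num_examples

print(f"effective batch size = {effective_batch}")
print(f"sequences seen       = {sequences_seen}")
print(f"fraction of an epoch = {epoch_fraction:.1%}")
```

By this estimate the 300-step run sees roughly an eighth of the formatted dataset, which is why a full run uses `num_train_epochs=1` instead of a fixed step count.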


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

462.7198 seconds used for training.
7.71 minutes used for training.
Peak reserved memory = 7.922 GB.
Peak reserved memory for training = 1.938 GB.
Peak reserved memory % of max memory = 53.716 %.
Peak reserved memory for training % of max memory = 13.141 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the customer message - leave the response slot blank!



In [10]:
# 1. Define the same prompt template that was used in training.
customer_service_prompt = """<|begin_of_text|>Sen profesyonel bir Türk müşteri hizmetleri temsilcisisin. Müşterinin sorununu anlayarak, kibar, yardımcı ve çözüm odaklı bir yanıt ver.

GÖREVİN:
- Türkçe dilbilgisi kurallarına dikkat et
- Müşterinin duygularını anla ve empati göster
- Konkret ve uygulanabilir çözümler öner
- Profesyonel ama samimi bir dil kullan
- Gerektiğinde özür dile ve sorumluluğu al

YANIT VERİRKEN:
- "Merhaba" ile başla
- Sorunu özetle ve anlayış göster
- Adım adım çözüm sun
- "Başka yardımcı olabileceğim bir konu var mı?" ile bitir

### Müşteri Mesajı:
{}

### Müşteri Temsilcisi Yanıtı:
{}"""

# 2. Prepare the model for inference (enables Unsloth's 2x faster inference).
FastLanguageModel.for_inference(model)

# 3. Set the customer message to test.
user_input = "Merhaba, dün sipariş ettiğim kahve makinesi bozuk geldi. Kutusu da ezilmişti. Ne yapmam gerekiyor?"

# 4. Format the prompt and tokenize it.
# Note: there are only 2 arguments. The first is the customer message; the second is left empty for the model to fill.
inputs = tokenizer(
[
    customer_service_prompt.format(
        user_input, # Customer message
        "",         # Agent reply - leave this empty!
    )
], return_tensors = "pt").to("cuda")

# 5. Run the model and stream the reply to the screen as it is generated.
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True) # skip_prompt=True shows only the generated reply.

outputs = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 256, use_cache = True)

Ürünün arızalı çıkmasına üzüldük. İade veya değişim talep etmek için sipariş numaranızı alabilir miyim?<|end_of_text|>
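
When the reply is needed as a string rather than streamed to the screen, the decoded output can be split on the response header. The helper below is a hypothetical sketch that works on plain strings. `extract_reply` and the sample text are not part of the notebook, and the sketch assumes the decoded text still contains the `### Müşteri Temsilcisi Yanıtı:` header and the `<|end_of_text|>` marker shown above.

```python
RESPONSE_HEADER = "### Müşteri Temsilcisi Yanıtı:"
EOS = "<|end_of_text|>"

def extract_reply(decoded):
    """Return only the agent's reply from a decoded prompt + completion string."""
    # Keep what follows the last response header, then drop the EOS marker.
    reply = decoded.rsplit(RESPONSE_HEADER, 1)[-1]
    return reply.replace(EOS, "").strip()

# Made-up decoded text standing in for real model output.
decoded = (
    "### Müşteri Mesajı:\nÜrün bozuk geldi.\n\n"
    "### Müşteri Temsilcisi Yanıtı:\nSipariş numaranızı alabilir miyim?<|end_of_text|>"
)
print(extract_reply(decoded))
```

With the real model, `decoded` would typically come from `tokenizer.batch_decode(outputs)[0]`.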


You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [11]:
# 1. Define the same prompt template that was used in training.
customer_service_prompt = """<|begin_of_text|>Sen profesyonel bir Türk müşteri hizmetleri temsilcisisin. Müşterinin sorununu anlayarak, kibar, yardımcı ve çözüm odaklı bir yanıt ver.

GÖREVİN:
- Türkçe dilbilgisi kurallarına dikkat et
- Müşterinin duygularını anla ve empati göster
- Konkret ve uygulanabilir çözümler öner
- Profesyonel ama samimi bir dil kullan
- Gerektiğinde özür dile ve sorumluluğu al

YANIT VERİRKEN:
- "Merhaba" ile başla
- Sorunu özetle ve anlayış göster
- Adım adım çözüm sun
- "Başka yardımcı olabileceğim bir konu var mı?" ile bitir

### Müşteri Mesajı:
{}

### Müşteri Temsilcisi Yanıtı:
{}"""

# 2. Prepare the model for inference.
FastLanguageModel.for_inference(model)

# 3. Set a new customer message to test.
user_input = "Ürünü iade etmek istiyorum ama nasıl yapacağımı bilmiyorum. Yardımcı olur musunuz?"

# 4. Format the prompt and tokenize it.
inputs = tokenizer(
[
    customer_service_prompt.format(
        user_input, # Customer message
        "",         # Agent reply - leave this empty!
    )
], return_tensors = "pt").to("cuda")

# 5. Use a TextStreamer to print the model's reply as it is generated.
from transformers import TextStreamer
# skip_prompt=True prints only the model's reply, without repeating the prompt.
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 256, use_cache = True)

İade talebiniz için size memnuniyetle yardımcı olacağız. Sipariş numaranızı paylaşır mısınız?<|end_of_text|>


In [12]:
# Test how the model continues a conversation

# Step 1: first customer message
user_message_1 = "Ürünü iade etmek istiyorum ama nasıl yapacağımı bilmiyorum. Yardımcı olur musunuz?"

# Step 2: the model's first reply (generated in the previous cell)
assistant_response_1 = "İade talebiniz için size memnuniyetle yardımcı olacağız. Sipariş numaranızı paylaşır mısınız?"

# Step 3: the customer's follow-up message
user_message_2 = "Tabii, sipariş numaram XYZ12345."

# Step 4: join the WHOLE HISTORY into one block of text for a new prompt
full_context = f"{user_message_1}\n{assistant_response_1}\n{user_message_2}"

print("--- YENİ PROMPT (TÜM GEÇMİŞ İLE) ---")
print(full_context)
print("------------------------------------")

# Step 5: call the model again with this longer prompt
inputs = tokenizer(
[
    customer_service_prompt.format(
        full_context, # The whole history goes into the customer-message slot
        "",           # The agent reply is again left empty
    )
], return_tensors = "pt").to("cuda")

# Stream the reply as it is generated
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 256, use_cache = True)

--- YENİ PROMPT (TÜM GEÇMİŞ İLE) ---
Ürünü iade etmek istiyorum ama nasıl yapacağımı bilmiyorum. Yardımcı olur musunuz?
İade talebiniz için size memnuniyetle yardımcı olacağız. Sipariş numaranızı paylaşır mısınız?
Tabii, sipariş numaram XYZ12345.
------------------------------------
Teşekkürler. İade süreci kapsamında ürünü anlaşmalı kargo ile ücretsiz gönderebilirsiniz. İade depomuza ulaştıktan sonra ödemeniz 3 iş günü içinde iade edilecektir.<|end_of_text|>
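
The same pattern extends to longer chats by keeping a running list of messages and joining it before every call. The sketch below is a minimal illustration with hypothetical names (`ChatHistory`, `as_prompt_context`). It mirrors the training format, where role prefixes were stripped and turns were joined with newlines.

```python
class ChatHistory:
    """Collects alternating customer and agent messages for prompt building."""

    def __init__(self):
        self.turns = []

    def add(self, message):
        # Messages are stored without role prefixes, as in the training data.
        self.turns.append(message.strip())

    def as_prompt_context(self):
        # Newline-joined history goes into the first slot of the prompt template.
        return "\n".join(self.turns)

history = ChatHistory()
history.add("Ürünü iade etmek istiyorum.")
history.add("Sipariş numaranızı paylaşır mısınız?")
history.add("Tabii, sipariş numaram XYZ12345.")
print(history.as_prompt_context())
```

After each generation, the model's reply would be added with `history.add(...)` before the next customer message, so every call sees the full conversation.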


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [13]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.chat_template = "{% for message in messages %}{{ message['role'] + ': ' + message['content'] + '\n' }}{% endfor %}"
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [22]:
if True:  # Load the LoRA adapters and use them
    from unsloth import FastLanguageModel
    from peft import PeftModel # Used to attach the saved adapters
    import gc
    import torch

    # Drop the old references first, then free GPU memory.
    # Emptying the cache while the model is still referenced frees nothing.
    model = None
    tokenizer = None
    trainer = None  # the trainer also holds a reference to the model
    gc.collect()
    torch.cuda.empty_cache()

    try:
        # Step 1: load the original base model and tokenizer
        print("Adım 1: Temel model (unsloth/Meta-Llama-3.1-8B) ve tokenizer'ı yükleniyor...")
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name = "unsloth/Meta-Llama-3.1-8B", # Load the original base model to avoid errors
            max_seq_length = max_seq_length,
            dtype = dtype,
            load_in_4bit = load_in_4bit,
            device_map = "cuda",
        )

        # Step 2: apply the saved LoRA adapters on top of the base model
        print("Adım 2: 'lora_model' klasöründeki adaptörler uygulanıyor...")
        model = PeftModel.from_pretrained(model, "lora_model")

        print("✅ Model ve LoRA adaptörleri başarıyla birleştirildi!")
        FastLanguageModel.for_inference(model) # Enable 2x faster inference

    except Exception as e:
        print(f"❌ Model ve LoRA yükleme hatası: {e}")


# Redefine the customer service prompt.
# Note: this template differs from the one used in training (no <|begin_of_text|>, different rules).
customer_service_prompt = """Sen profesyonel bir müşteri hizmetleri temsilcisisin. Müşterinin sorununu anlayarak, kibar, yardımcı ve çözüm odaklı bir yanıt ver. Müşteri memnuniyeti önceliğindir.

KURALLARIN:
- Her zaman saygılı ve anlayışlı ol
- Müşterinin sorununu tam olarak dinle ve anla
- Konkret çözümler öner
- Gerektiğinde özür dile
- Müşterinin duygularını dikkate al
- Net ve anlaşılır bir dil kullan
- Ek yardım gerekiyorsa proaktif ol

YANIT TARZI:
- Empati göster
- Çözüm odaklı yaklaş
- Profesyonel ama sıcak bir ton kullan
- Gerekirse adım adım rehberlik et

### Müşteri Mesajı:
{}

### Müşteri Temsilcisi Yanıtı:
{}"""

# Test the reloaded model with the adapters applied
if model and tokenizer:
    user_input = "Aldığım ürün beklediğim gibi değil ve bedeni yanlış geldi. Ayrıca ambalajı da hasarlıydı. Bu ürünü iade etmek ve param geri almak istiyorum."

    inputs = tokenizer(
    [
        customer_service_prompt.format(
            user_input,
            "",
        )
    ], return_tensors = "pt").to("cuda")

    from transformers import TextStreamer
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Adım 1: Temel model (unsloth/Meta-Llama-3.1-8B) ve tokenizer'ı yükleniyor...
==((====))==  Unsloth 2025.7.11: Fast Llama patching. Transformers: 4.54.0.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
❌ Model ve LoRA yükleme hatası: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 896.00 KiB is free. Process 2569 has 39.54 GiB memory in use. Of the allocated memory 39.01 GiB is allocated by PyTorch, and 12.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documenta

You can also use PEFT's `AutoPeftModelForCausalLM` from Hugging Face. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
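
For a feel of what these choices mean on disk, file size can be estimated from bits per weight. The sketch below covers only `f16` (16 bits per weight) and `q8_0`, assuming the usual `q8_0` block layout of 32 int8 weights plus one 16-bit scale (8.5 bits per weight). The parameter count is the trainer's total minus the LoRA parameters, both taken from the training banner above. Real files differ somewhat because of metadata and tensors kept at other precisions. The k-quants are left out because their average bit width depends on the tensor mix.

```python
TOTAL_WITH_LORA = 8_114_147_328   # from the training banner
LORA_PARAMS = 83_886_080          # from the training banner
base_params = TOTAL_WITH_LORA - LORA_PARAMS

# Approximate bits stored per weight for two quantization methods.
bits_per_weight = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 32 int8 values + one fp16 scale per block
}

def estimate_gb(num_params, bits):
    # Decimal gigabytes; ignores GGUF metadata and non-quantized tensors.
    return num_params * bits / 8 / 1e9

for name, bits in bits_per_weight.items():
    print(f"{name}: ~{estimate_gb(base_params, bits):.2f} GB")
```

Under these assumptions `q8_0` comes out at a little over half the size of `f16`, which is a useful baseline when weighing the smaller k-quants.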

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
