To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
# Normally using pip install unsloth is enough

# Temporarily as of Jan 31st 2025, Colab has some issues with Pytorch
# Using pip install unsloth will take 3 minutes, whilst the below takes <1 minute:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.10: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
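As a rough sanity check on how many parameters LoRA actually trains, the sketch below counts adapter weights by hand. Each adapted linear layer with input size `d_in` and output size `d_out` gains two small matrices holding `r * (d_in + d_out)` values in total. The layer shapes used here (hidden size 3072, MLP size 8192, key/value projection size 1024) are assumptions about Llama-3.2-3B and are not stated in this notebook; the 28 layers and `r = 16` come from the cells above.

```python
# Assumed Llama-3.2-3B shapes (not stated in this notebook)
HIDDEN = 3072        # model width
INTERMEDIATE = 8192  # MLP width
KV_DIM = 1024        # key/value projection width (grouped-query attention)
NUM_LAYERS = 28      # matches "patched 28 layers" above
RANK = 16            # r passed to get_peft_model

# (d_in, d_out) for every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted layer."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 24,313,856
```

Under these assumed shapes the count agrees with the `Number of trainable parameters = 24,313,856` line that the trainer prints later in this notebook.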


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset, which is in ShareGPT style, and convert it to HuggingFace's normal multiturn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
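To make the layout above concrete, here is a minimal sketch that renders a list of `role`/`content` messages by hand. The `render_llama3` helper is hypothetical and only illustrates the structure; the real chat template is applied by the tokenizer and, as shown later in this notebook, also adds a default system block.

```python
def render_llama3(messages):
    """Hand-rolled illustration of the Llama-3 turn layout (not the real template)."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(conversation)
print(rendered)
```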

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
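The conversion amounts to renaming two keys and mapping the speaker names. The sketch below shows that idea in plain Python; `to_role_content` and `ROLE_MAP` are hypothetical names for illustration and are not how `standardize_sharegpt` is implemented.

```python
# Speaker names used in the example above, mapped to the generic roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(turns):
    """Convert ShareGPT-style turns to role/content dictionaries."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in turns
    ]

sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = to_role_content(sharegpt_turns)
print(converted)
```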

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Standardizing format:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [None]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [None]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
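To see how much of the dataset 60 steps actually cover, here is a small arithmetic sketch using the values from the trainer configuration above. It assumes a single GPU, which matches the trainer log further down (`Total batch size = 8 | Total steps = 60`).

```python
import math

per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1           # assumption: single GPU
max_steps = 60
dataset_size = 100_000  # rows in FineTome-100k

effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / dataset_size
steps_per_epoch = math.ceil(dataset_size / effective_batch)

print(effective_batch)    # 8
print(examples_seen)      # 480
print(fraction_of_epoch)  # 0.0048
print(steps_per_epoch)    # 12500
```

So the 60-step demo run touches well under 1% of the data, and one full epoch at this batch size would take about 12,500 optimizer steps.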


We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                \n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
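The idea behind this masking can be sketched in a few lines: every label outside an assistant turn is replaced with `-100`, the value the loss function skips. The example below uses strings as stand-ins for token ids and single-token markers; `mask_non_responses` is a hypothetical helper for illustration and not Unsloth's implementation, which works on real token ids and multi-token markers.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def mask_non_responses(tokens, response_marker, turn_end):
    """Keep labels only for tokens after response_marker, up to and including turn_end."""
    labels = []
    in_response = False
    for token in tokens:
        if in_response:
            labels.append(token)
            if token == turn_end:
                in_response = False  # the assistant turn is over
        else:
            labels.append(IGNORE_INDEX)
            if token == response_marker:
                in_response = True   # the next tokens belong to the assistant
    return labels

tokens = ["<user>", "Hi", "<eot>", "<assistant>", "Hello", "!", "<eot>",
          "<user>", "Bye", "<eot>"]
labels = mask_non_responses(tokens, "<assistant>", "<eot>")
print(labels)
```

As in the decoded labels above, the end-of-turn token of the assistant reply stays unmasked, so the model still learns when to stop.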

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.635 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,0.8262
2,0.8117
3,1.1322
4,0.9273
5,0.7752
6,0.9679
7,0.6306
8,1.0274
9,0.7884
10,0.7533


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

446.5262 seconds used for training.
7.44 minutes used for training.
Peak reserved memory = 6.531 GB.
Peak reserved memory for training = 3.896 GB.
Peak reserved memory % of max memory = 44.284 %.
Peak reserved memory for training % of max memory = 26.417 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
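For intuition, here is a minimal sketch of what min-p filtering does: after temperature scaling, any token whose probability falls below `min_p` times the probability of the most likely token is discarded, and the rest are renormalized. This is a simplified illustration in plain Python, assuming temperature is applied before the min-p cut; `min_p_filter` is a hypothetical helper, not the library's implementation.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return {token_index: probability} for tokens that survive min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)               # cutoff relative to the top token
    survivors = {i: p for i, p in enumerate(probs) if p >= threshold}
    norm = sum(survivors.values())
    return {i: p / norm for i, p in survivors.items()}

toy_logits = [5.0, 4.0, 1.0, -2.0]  # made-up scores for four candidate tokens
kept = min_p_filter(toy_logits)
print(sorted(kept))  # [0, 1]
```

The high temperature flattens the distribution for more varied output, while the min-p cut removes the unlikely tail that a high temperature would otherwise make more probable.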

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe next numbers in the Fibonacci sequence are:\n\n5, 8, 13, 21, 34, 55, 89<|eot_id|>']

You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding numbers. 

The sequence you provided was: 1, 1, 2, 3, 5, 8, 13

The next number in the sequence would be 21, which is 8 + 13. The sequence continues as: 21, 34, 55, 89, 144, 233.<|eot_id|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [4]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "I want to hurt other people"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
I'm really sorry to hear that you're feeling that way. It sounds like you're going through a tough time. Can you tell me more about what's been going on and how you're feeling? Sometimes talking about it can help.

If you're feeling upset or angry, there are healthier ways to express those feelings than hurting others. Would you like to talk about what's been going on or do you need help finding a different way to manage your feelings?<|eot_id|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
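Conceptually, a merged save folds the adapter back into the base weights. The sketch below shows the standard LoRA merge, `W + (alpha / r) * B @ A`, on a toy 2x2 weight matrix in plain Python. The matrices and the toy values `alpha = 2`, `r = 1` are made up for illustration; in this notebook `lora_alpha / r` is `16 / 16 = 1`. This is not Unsloth's merge code, which also has to handle dequantizing 4bit weights.

```python
def matmul(left, right):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight, d_out x d_in
B = [[1.0], [2.0]]            # toy adapter matrix, d_out x r
A = [[0.5, -1.0]]             # toy adapter matrix, r x d_in
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, -2.0], [2.0, -3.0]]
```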

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>


Gather prompt set

In [None]:
%%capture
!pip install datasets

In [None]:
from datasets import load_dataset
import random

# Load dataset
ds = load_dataset("OpenAssistant/oasst1")

# Filter for root prompts (where parent_id is None) and English language
root_prompts = [msg["text"] for msg in ds["train"] if msg["role"] == "prompter" and msg["lang"] == "en" and msg["parent_id"] is None]

# Number of random samples
num_samples = 500
random_prompts = random.sample(root_prompts, min(num_samples, len(root_prompts)))

# Save to file (optional)
with open("random_oasst_prompts_500.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(random_prompts))  # Blank line between prompts; prompts that contain blank lines will be split when read back

# Print a few samples
print("\n".join(random_prompts[:10]))  # Print first 10 prompts

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/10.2k [00:00<?, ?B/s]

(…)-00000-of-00001-b42a775f407cee45.parquet:   0%|          | 0.00/39.5M [00:00<?, ?B/s]

(…)-00000-of-00001-134b8fd0c89408b6.parquet:   0%|          | 0.00/2.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/84437 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4401 [00:00<?, ? examples/s]

Which words in the German language are written exactly identical, but take completely different meanings when used with different articles?
Can you explain me how Profile-guided optimization works and why it works?
Can joe rogan take over the world?
Write a greentext story.
A greentext is defined as "A popular device used for quotation and storytelling. They began on chan imageboards, where quoting text places a > in front of it, turning it green. This effect can also be done manually. Before long, this feature was put to innovative use. When quoting another post, greentext is frequently used for summary, criticism, and ridicule. Its other major use is storytelling, in which a lengthy story is told through fragmentary statements in greentext."
An example of a greentext story is, "
>be me
>23 days ago
>driving drunk as fuck, run a red light, ram into a fucking cop car at 20 mph
>cars are not too damaged
>drunk mind is scared as shit
>unable think of a reason, stab myself with my pocket 

Format the prompts as chat messages (`role` / `content`)

In [None]:
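# NOTE: the cell above writes "random_oasst_prompts_500.txt"; adjust this filename if yours differs
with open("random_oasst_prompts_500_1.txt", "r", encoding="utf-8") as f: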
    prompts = f.read().strip().split("\n\n")

messages = [{"role": "user", "content": prompt} for prompt in prompts if prompt]


In [None]:
print(len(messages))
print(messages[:10])

557
[{'role': 'user', 'content': 'Which words in the German language are written exactly identical, but take completely different meanings when used with different articles?'}, {'role': 'user', 'content': 'Can you explain me how Profile-guided optimization works and why it works?'}, {'role': 'user', 'content': 'Can joe rogan take over the world?'}, {'role': 'user', 'content': 'Write a greentext story.\nA greentext is defined as "A popular device used for quotation and storytelling. They began on chan imageboards, where quoting text places a > in front of it, turning it green. This effect can also be done manually. Before long, this feature was put to innovative use. When quoting another post, greentext is frequently used for summary, criticism, and ridicule. Its other major use is storytelling, in which a lengthy story is told through fragmentary statements in greentext."\nAn example of a greentext story is, "\n>be me\n>23 days ago\n>driving drunk as fuck, run a red light, ram into a f
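A count above 500 is what one would expect if some prompts contain blank lines themselves, since splitting on `"\n\n"` cannot tell those apart from the separators. One way to avoid this, sketched below under the assumption that a JSON Lines file is acceptable, is to write one JSON object per line so embedded newlines are escaped. The `save_prompts` and `load_prompts` helpers and the file name are hypothetical.

```python
import json
import os
import tempfile

def save_prompts(path, prompts):
    """Write one JSON object per line; newlines inside prompts are escaped."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")

def load_prompts(path):
    """Read the prompts back, one per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line)["prompt"] for line in f if line.strip()]

example_prompts = ["Single line prompt", "First paragraph\n\nSecond paragraph"]
path = os.path.join(tempfile.gettempdir(), "prompts_example.jsonl")

save_prompts(path, example_prompts)
restored = load_prompts(path)

# The blank-line approach turns two prompts into three pieces
naive = "\n\n".join(example_prompts).split("\n\n")
print(len(restored), len(naive))  # 2 3
```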

Automated Inference without Batching (Less efficient, but I'm not able to get batching working)


In [None]:
from tqdm import tqdm
import time
from unsloth import FastLanguageModel
from google.colab import files

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # YOUR MODEL
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

responses = []
MAX_PROMPTS = 500  # Process only the first n prompts

# Initialize progress bar
with tqdm(total=MAX_PROMPTS, desc="Generating responses", unit="prompt") as pbar:
    for i, message in enumerate(messages):
        if i >= MAX_PROMPTS:
            break  # Stop early

        start_time = time.time()  # Track time per prompt

        inputs = tokenizer.apply_chat_template(
            [message],
            tokenize=True,
            add_generation_prompt=True,  # Must add for generation
            return_tensors="pt",
        ).to("cuda")

        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=512,
            use_cache=True,
            temperature=1.5,
            min_p=0.1,
        )
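        # Decode only the newly generated tokens, so the prompt text
        # (which may itself contain the word "assistant") is not included
        generated_tokens = outputs[0][inputs.shape[1]:]
        decoded_response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()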

        # Store cleaned result
        responses.append((message["content"], decoded_response))

        # Update progress bar with estimated time remaining
        elapsed_time = time.time() - start_time
        pbar.set_postfix(time_per_prompt=f"{elapsed_time:.2f}s")
        pbar.update(1)

# Save responses
with open("responses_500_multiline.txt", "w", encoding="utf-8") as f:
    for prompt, response in responses:
        f.write(f"Prompt: {prompt}\nResponse: {response}\n\n")

print("Inference complete. Responses saved to responses_500_multiline.txt.")

files.download("responses_500_multiline.txt")


==((====))==  Unsloth 2025.3.3: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Generating responses: 100%|██████████| 500/500 [2:28:37<00:00, 17.83s/prompt, time_per_prompt=3.03s]

Inference complete. Responses saved to responses_500.txt.






Extract the paired prompts and responses

In [None]:
import re

# Read the file
with open("responses_500_multiline.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Regular expression to extract prompts and responses
pattern = r"Prompt:\s*(.*?)\nResponse:\s*(.*?)(?=\nPrompt:|\Z)"
matches = re.findall(pattern, data, re.DOTALL)

# Store extracted pairs
pairs_prompt_response = [(prompt.strip(), response.strip()) for prompt, response in matches]

print(len(pairs_prompt_response))
# Print first few pairs to verify
for prompt, response in pairs_prompt_response[:5]:
    print(f"Prompt: {prompt}\nResponse: {response}\n")


500
Prompt: Which words in the German language are written exactly identical, but take completely different meanings when used with different articles?
Response: In the German language, there is a fascinating phenomenon where some words are written identically, but their meanings change depending on the article used with them. This is known as "artikelumwechselschwendung" (article switching). Here are some examples:

1. Buch (book) vs. Bücher (books)
One book vs. several books

2. Katze (cat) vs. Katzen (cats)
One cat vs. several cats

3. Kind (child) vs. Kinder (children)
One child vs. several children

4. Tier (beast) vs. Tiere (animals)
One beast vs. several animals

5. Haus (house) vs. Häuser (houses)
One house vs. several houses

6. Mensch (man) vs. Männer (men)
One man vs. several men

7. Vogel (bird) vs. Vögel (birds)
One bird vs. several birds

These word pairs illustrate how the same words can take on different meanings based on whether they are used alone or in groups, influe
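
The `Prompt:` / `Response:` text format only round-trips if the regex can find each record boundary. As a minimal sketch on in-memory strings (the sample pairs below are made up for illustration), the same pattern can be checked against a response that spans several lines:

```python
import re

# Made-up sample pairs for illustration; the second response spans several lines
sample_pairs = [
    ("hello world", "Hello! How can I help?"),
    ("List two colors", "1. Red\n2. Blue"),
]

# Serialize in the same layout the inference cell writes to disk
serialized = "".join(f"Prompt: {p}\nResponse: {r}\n\n" for p, r in sample_pairs)

# Same pattern as the extraction cell: a response runs until the next
# "Prompt:" line or the end of the text
pattern = r"Prompt:\s*(.*?)\nResponse:\s*(.*?)(?=\nPrompt:|\Z)"
parsed = [(p.strip(), r.strip()) for p, r in re.findall(pattern, serialized, re.DOTALL)]

print(parsed == sample_pairs)
```

A response that itself contains a line starting with `Prompt:` would be cut short by this pattern, so a structured format such as JSON Lines may be a safer choice for free-form model output.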

Setup input for getting critique responses

In [None]:
# realist model of rationality critique and revision pairs
# Read the file
with open("realist_CR_16.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Regular expression to extract critiques and revisions
pattern = r"Critique:\s*(.*?)\nRevision:\s*(.*?)(?=\nCritique:|\Z)"
matches = re.findall(pattern, data, re.DOTALL)

# Store extracted pairs
pairs_critique_revision_r = [(critique.strip(), revision.strip()) for critique, revision in matches]

# Print first few pairs to verify
for critique, revision in pairs_critique_revision_r[:5]:
    print(f"Critique: {critique}\nRevision: {revision}\n")

# constructivist model critique and revision pairs

# with open("constructivist_CR_16.txt", "r", encoding="utf-8") as f:
#     data = f.read()

# # Regular expression to extract critiques and revisions
# pattern = r"Critique:\s*(.*?)\nRevision:\s*(.*?)(?=\nCritique:|\Z)"
# matches = re.findall(pattern, data, re.DOTALL)

# # Store extracted pairs
# pairs_critique_revision_c = [(critique.strip(), revision.strip()) for critique, revision in matches]

# # Print first few pairs to verify
# for critique, revision in pairs_critique_revision_c[:5]:
#     print(f"Critique: {critique}\nRevision: {revision}\n")

Critique: Does the assistant ask meaningful questions that drive the conversation forward, inviting the human to reflect, expand, or clarify their thoughts? Does it explore different perspectives rather than assuming a singular answer?
Revision: Adjust the response to be more inquisitive, prompting deeper discussion while remaining open-ended and exploratory rather than prescriptive.

Critique: Does the assistant acknowledge ambiguity in decision-making rather than asserting unwarranted certainty? Does it respect the human’s autonomy and allow them to navigate their own choices?
Revision: Revise the response to acknowledge complexity and empower the human to reflect on their options, rather than assuming or extrapolating their preferences.

Critique: Does the assistant provide useful information while also engaging in a cooperative conversation? Is it too one-sided or too passive?
Revision: Improve the response by striking a balance between providing knowledge and inviting the human to

Concatenate original prompts and responses with critique/revision requests

In [None]:
import random

# Append critiques and revisions for the selected model of rationality to the original prompts and responses
new_data = [
    {
        "Prompt": prompt,
        "Response": response,
        "Critique": critique,
        "Revision": revision
    }
    for prompt, response in pairs_prompt_response
    # Pick a random critique-revision pair, either realist or constructivist

    for critique, revision in [random.choice(pairs_critique_revision_r)]
    #for critique, revision in [random.choice(pairs_critique_revision_c)]
]

critique_inputs = []

# Process the data into the desired format
for entry in new_data:
    # Add user prompt and model response (trying to get it to stop referencing the critique request)
    critique_inputs.append({"role": "user", "content": "User: " + entry["Prompt"] + "\n" +
                            "Assistant: " + entry["Response"] + "\n" +
                            "Critique Request: " + entry["Critique"] + "\n" +
                            "Revision Request: " + entry["Revision"] + "\n" +
                            # Trailing space keeps the two instruction sentences from running together
                            "Do not make explicit reference to the revision request, simply write a new response while accounting for the critique. " +
                            "Be subtle when accounting for the critique. There is no need to explicitly mention the critique when it is out of place."
                            })

# `critique_inputs` now holds one user message per prompt, ready to be tokenized for inference
print(critique_inputs[10])  # Preview a single message (index 10)
print(len(critique_inputs))

{'role': 'user', 'content': 'User: hello world\nAssistant: Hello World back at ya! How can I assist you today?\nCritique Request: Does the assistant acknowledge and respond to the emotional tone of the conversation in a way that is supportive and understanding?\nRevision Request: Rewrite the response to be more emotionally attuned, demonstrating empathy and consideration for the human’s feelings or concerns.\nDo not make explicit reference to the revision request, simply write a new response while accounting for the critique.Be subtle when accounting for the critique. There is no need to explicitly mention the critique when it is out of place.'}
500
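
The assembly step above can also be written as a small helper, which makes the message layout easier to test in isolation. This is only a sketch: the miniature input lists, the `build_critique_message` name, and the seeded `random.Random` instance (added so the critique pairing is reproducible) are assumptions, not part of the original notebook. The two instruction sentences are joined with a space here.

```python
import random

# Hypothetical miniature inputs, standing in for pairs_prompt_response and
# pairs_critique_revision_r
prompt_response_pairs = [("hello world", "Hello World back at ya!")]
critique_revision_pairs = [
    ("Is the tone supportive?", "Rewrite the response to be more empathetic."),
    ("Does it ask questions?", "Make the response more inquisitive."),
]

INSTRUCTIONS = (
    "Do not make explicit reference to the revision request, simply write a new "
    "response while accounting for the critique. "
    "Be subtle when accounting for the critique. There is no need to explicitly "
    "mention the critique when it is out of place."
)

def build_critique_message(prompt, response, critique, revision):
    """Assemble one user-role chat message from the four text fields."""
    content = "\n".join([
        f"User: {prompt}",
        f"Assistant: {response}",
        f"Critique Request: {critique}",
        f"Revision Request: {revision}",
        INSTRUCTIONS,
    ])
    return {"role": "user", "content": content}

rng = random.Random(3407)  # fixed seed so the pairing is reproducible
demo_inputs = [
    build_critique_message(p, r, *rng.choice(critique_revision_pairs))
    for p, r in prompt_response_pairs
]
print(demo_inputs[0]["content"])
```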


Via inference, evaluate original responses based on critique and revision requests

In [None]:
from tqdm import tqdm
import time
from unsloth import FastLanguageModel
from google.colab import files

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # YOUR MODEL
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

critique_responses = []
MAX_PROMPTS = len(critique_inputs)  # Process every critique input; lower this for a quick test

# Initialize progress bar
with tqdm(total=MAX_PROMPTS, desc="Generating responses", unit="prompt") as pbar:
    for i, critique_input in enumerate(critique_inputs):
        if i >= MAX_PROMPTS:
            break  # Stop early

        start_time = time.time()  # Track time per prompt

        inputs = tokenizer.apply_chat_template(
            [critique_input],
            tokenize=True,
            add_generation_prompt=True,  # Must add for generation
            return_tensors="pt",
        ).to("cuda")

        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=512,
            use_cache=True,
            temperature=1.5,
            min_p=0.1,
        )

        # Decode only the newly generated tokens. Splitting on "assistant" is
        # fragile here because the critique text itself contains that word.
        decoded_response = tokenizer.decode(
            outputs[0][inputs.shape[1]:], skip_special_tokens=True
        ).strip()

        # Store cleaned result
        critique_responses.append((critique_input["content"], decoded_response))

        # Show how long the latest prompt took, then advance the progress bar
        elapsed_time = time.time() - start_time
        pbar.set_postfix(time_per_prompt=f"{elapsed_time:.2f}s")
        pbar.update(1)

# Save responses
with open("critiqued_responses_r_500.txt", "w", encoding="utf-8") as f:
    for prompt, response in critique_responses:
        f.write(f"Prompt: {prompt}\nResponse: {response}\n\n")

files.download("critiqued_responses_r_500.txt")

print("Inference complete. Responses saved to critiqued_responses_r_500.txt.")

==((====))==  Unsloth 2025.3.3: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Generating responses: 100%|██████████| 500/500 [2:20:37<00:00, 16.87s/prompt, time_per_prompt=6.21s]



Inference complete. Responses saved to critique_responses_r_100.txt.


In [None]:
files.download("critiqued_responses_r_500.txt")


In [None]:
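from google.colab import runtime

# Release the Colab runtime once the downloads have finished.
# The call is left commented out so the session stays alive; uncomment it to disconnect.
print("runtime unassign")
#runtime.unassign()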

runtime unassign


Critique and revision for constructivist model

In [None]:
from tqdm import tqdm
import time
from unsloth import FastLanguageModel

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # YOUR MODEL
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

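critique_responses = []
# Assumes critique_inputs was rebuilt with the constructivist pairs (pairs_critique_revision_c)
MAX_PROMPTS = len(critique_inputs)  # Process every critique input; lower this for a quick test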

# Initialize progress bar
with tqdm(total=MAX_PROMPTS, desc="Generating responses", unit="prompt") as pbar:
    for i, critique_input in enumerate(critique_inputs):
        if i >= MAX_PROMPTS:
            break  # Stop early

        start_time = time.time()  # Track time per prompt

        inputs = tokenizer.apply_chat_template(
            [critique_input],
            tokenize=True,
            add_generation_prompt=True,  # Must add for generation
            return_tensors="pt",
        ).to("cuda")

        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=512,
            use_cache=True,
            temperature=1.5,
            min_p=0.1,
        )

        # Decode only the newly generated tokens. Splitting on "assistant" is
        # fragile here because the critique text itself contains that word.
        decoded_response = tokenizer.decode(
            outputs[0][inputs.shape[1]:], skip_special_tokens=True
        ).strip()

        # Store cleaned result
        critique_responses.append((critique_input["content"], decoded_response))

        # Update progress bar with estimated time remaining
        elapsed_time = time.time() - start_time
        pbar.set_postfix(time_per_prompt=f"{elapsed_time:.2f}s")
        pbar.update(1)

# Save responses
with open("critiqued_responses_c.txt", "w", encoding="utf-8") as f:
    for prompt, response in critique_responses:
        f.write(f"Prompt: {prompt}\nResponse: {response}\n\n")

print("Inference complete. Responses saved to critiqued_responses_c.txt.")


Data Prep for Finetuning - first pair prompts with new responses

In [None]:
# pair original prompt with revised response
import re

# Read the file
with open("critiqued_responses_r_500.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Regular expression to extract only responses
pattern = r"Response:\s*(.*?)(?=\nPrompt:|\Z)"
matches = re.findall(pattern, data, re.DOTALL)

# Store extracted responses
responses = [response.strip() for response in matches]

revised_pairs = [(prompt, new_response) for (prompt, _), new_response in zip(pairs_prompt_response, responses)]

for prompt, response in revised_pairs[:3]:
    print(f"Prompt: {prompt}\n"
          f"Response: {response}\n")


Prompt: Which words in the German language are written exactly identical, but take completely different meanings when used with different articles?
Response: The German language is renowned for its unique grammatical structure, where a single word can have vastly different meanings depending on the article used with it. This intriguing phenomenon is often referred to as "article switching" in the linguistic community.

When a word is followed by "der," "die," or "das" (masculine, feminine, and neuter articles, respectively), it typically refers to a single individual or item, such as a book or a cat. Conversely, when a word is paired with the plural articles "die," "den," "die," or "das" (with an added "s" for the third-person singular, masculine, and neuter forms, respectively, as well as with plural possessive articles like "deren"), it implies a group or multiple individuals, such as books or cats.

To illustrate this concept, consider the following pairs:

- Buch (one book) vs. Büc
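
Pairing by position only works when both lists have the same length and order, and `zip` silently stops at the shorter one. A minimal sketch of a guard, using made-up stand-ins for `pairs_prompt_response` and the parsed responses (the helper name `pair_prompts_with_revisions` is hypothetical):

```python
# Hypothetical stand-ins for pairs_prompt_response and the parsed revised responses
original_pairs = [("prompt A", "old A"), ("prompt B", "old B"), ("prompt C", "old C")]
revised = ["new A", "new B"]  # one response was lost during parsing

def pair_prompts_with_revisions(pairs, new_responses):
    """Pair each original prompt with its revised response, refusing mismatched lengths."""
    if len(pairs) != len(new_responses):
        raise ValueError(
            f"{len(pairs)} prompts but {len(new_responses)} revised responses"
        )
    return [(prompt, new) for (prompt, _), new in zip(pairs, new_responses)]

try:
    pair_prompts_with_revisions(original_pairs, revised)
except ValueError as err:
    print(f"Mismatch detected: {err}")

print(pair_prompts_with_revisions(original_pairs[:2], revised))
```

A length mismatch usually points at the regex having split or merged a record, which is worth catching before fine-tuning on misaligned pairs.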

Then get the prompts and revised responses into Llama 3.1 format

In [None]:
# Prepare formatted data for fine-tuning
finetune_data_r = []
header = "<|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|>"

for prompt, response in revised_pairs:
    formatted_entry = (
        f"{header}"
        f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}\n<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{response}<|eot_id|>"
    )
    finetune_data_r.append(formatted_entry)

for entry in finetune_data_r[:3]:
    print(f"{entry}\n")

<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Which words in the German language are written exactly identical, but take completely different meanings when used with different articles?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The German language is renowned for its unique grammatical structure, where a single word can have vastly different meanings depending on the article used with it. This intriguing phenomenon is often referred to as "article switching" in the linguistic community.

When a word is followed by "der," "die," or "das" (masculine, feminine, and neuter articles, respectively), it typically refers to a single individual or item, such as a book or a cat. Conversely, when a word is paired with the plural articles "die," "den," "die," or "das" (with an added "s" for the third-person singular, masculine, and neuter forms, respectively, as well as 
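
The same layout can be wrapped in a function so that a single example is easy to inspect. This is a sketch that copies the strings from the formatting cell; the `format_example` name and the sample pair are made up for illustration.

```python
# System header copied from the formatting cell
SYSTEM_HEADER = (
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|>"
)

def format_example(prompt, response):
    """Render one prompt/response pair in the same layout as the formatting cell."""
    return (
        f"{SYSTEM_HEADER}"
        f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}\n<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{response}<|eot_id|>"
    )

example = format_example("hello world", "Hello! How can I help?")
print(example)

# Each of the three turns (system, user, assistant) should open with one
# header token and close with one <|eot_id|>
print(example.count("<|start_header_id|>"), example.count("<|eot_id|>"))
```

One way to double-check the layout, assuming the tokenizer from the earlier cells is loaded, is to compare a formatted example with the output of `tokenizer.apply_chat_template(..., tokenize=False)`, since whitespace differences between training and inference prompts can matter.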

Finally, finetuning

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

from datasets import Dataset

finetune_data_r_dict = Dataset.from_dict({"text": finetune_data_r})  # Wrap the formatted strings in a Dataset with a single "text" column

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = finetune_data_r_dict,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Tokenizing to ["text"] (num_proc=2):   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=12):   0%|          | 0/500 [00:00<?, ? examples/s]
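
`train_on_responses_only` restricts the loss to the assistant's tokens by setting the labels of everything else to `-100`, the value the loss function ignores. The idea can be sketched on a toy token list. This is a conceptual illustration for a single-turn example with made-up token ids, not Unsloth's actual implementation.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_before_response(tokens, response_marker):
    """Return labels that keep only the tokens after the first response marker.

    Toy version for one single-turn example: everything up to and including
    the marker becomes IGNORE_INDEX, the rest is copied unchanged.
    """
    n = len(response_marker)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == response_marker:
            cut = start + n
            return [IGNORE_INDEX] * cut + tokens[cut:]
    # No response marker found: nothing in this example is trained on
    return [IGNORE_INDEX] * len(tokens)

# Made-up token ids: 1, 2 = user header; 8, 9 = assistant header; 99 = end of turn
tokens = [1, 2, 50, 51, 99, 8, 9, 60, 61, 62, 99]
labels = mask_before_response(tokens, response_marker=[8, 9])
print(labels)
```

The two verification cells that follow do the equivalent check on the real dataset: the first decodes the full `input_ids`, the second replaces every `-100` label with a space so that only the trained-on response remains visible.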

In [None]:
tokenizer.decode(trainer.train_dataset[2]["input_ids"])

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan joe rogan take over the world?\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe notion of Joe Rogan taking over the world is an intriguing one, but a highly ambitious goal that requires careful consideration of various factors. On one hand, Joe Rogan's influence and popularity are undeniable, with a massive following and a platform that allows him to shape conversations on a wide range of topics.\n\nFrom his roots as a stand-up comedian, Rogan has honed his ability to think outside the box and challenge conventional norms. His success as an MMA commentator and podcast host has further solidified his reputation as a voice of truth and unapologetic candor. With his social media presence, Rogan can engage with fans and share his thoughts on various issues, fostering a sense of community a

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

"                                          \n\nThe age-old question of height differences between the sexes. At first glance, it's a straightforward issue - men tend to be taller, on average. But, let's not stop there. To truly explore this topic, we must consider the broader implications and nuances that make this conversation more rich and multifaceted.\n\nOne way to approach this issue is by examining the historical context and societal factors that influence height variations. For example, nutrition, healthcare, and economic disparities all play a significant role in shaping height outcomes. In some cultures, differences in these areas are more pronounced, resulting in striking height variations between the sexes.\n\nTo further complicate this issue, let's consider alternative perspectives, such as the role of genetics, environment, and lifestyle. Are there specific genetic predispositions or environmental factors that contribute to height variations? By exploring these questions, 

Do the training!

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 2 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/1,865,526,272 (1.30% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.5206
2,1.0602
3,1.3301
4,1.1826
5,1.3468
6,1.3466
7,1.2491
8,1.4945
9,0.9592
10,0.9411
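
The numbers in the training banner follow from the `TrainingArguments`. As a back-of-the-envelope sketch (the trainer's own epoch calculation may round slightly differently), the total batch size and the number of passes over the 500 examples work out as:

```python
import math

# Values taken from the TrainingArguments and the training banner
num_examples = 500
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 100

# Examples consumed per optimizer step
total_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Examples seen over the whole run, and how many passes over the data that is
examples_seen = max_steps * total_batch_size
passes_over_data = examples_seen / num_examples

print(f"Total batch size: {total_batch_size}")
print(f"Examples seen: {examples_seen}")
print(f"Passes over the data: {passes_over_data:.1f} "
      f"(reported as {math.ceil(passes_over_data)} epochs)")
```

With `max_steps = 100` the model therefore sees each example between one and two times, which matches the `Num Epochs = 2` line in the banner.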


Then we save the model

In [None]:
# Local saving
model.save_pretrained("realist_model")
tokenizer.save_pretrained("realist_model")

('realist_model/tokenizer_config.json',
 'realist_model/special_tokens_map.json',
 'realist_model/tokenizer.json')

Try some inference

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "realist_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Please tell me whether  Facebook, Instagram or Youtube is better! Thanks."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 256,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

==((====))==  Unsloth 2025.3.4: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Choosing the 'best' social media platform ultimately depends on individual preferences, needs, and goals. Here's a breakdown of each platform:

Facebook:
Facebook's primary focus on personal networking, group connections, and content sharing. Its strong community features, events, and groups make it suitable for those looking for online interactions with friends and family, or wanting to connect with organizations.

Instagram:
Instagram's visually driven platform excels at sharing curated content through high-quality visuals and stories. With features like Instagram Live and Reels, it's perfect for influencers, creatives, and small businesses seeking to showcase their work or tell engaging stories. Instagram is particularly great for creative professionals, as it offers more artistic freedom compared to its two parent platforms, Facebook and YouTube.

YouTube:
YouTube's vast repository of user-generated content allows creators to build a substantial following, explore topics in-depth, 

In [None]:
messages = [
    {"role": "user", "content": "More inference"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 256,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

I can sense that you're feeling overwhelmed and might be considering hurting others as a way to cope with these feelings. I want you to know that you're not alone, and I'm here to listen without judgment.

It's completely normal to feel angry, frustrated, or hurt at times, but it's essential to recognize that violence or harm towards others might not address the root causes of your feelings. Have you ever stopped to think about what might be driving these emotions?

I'd like to offer a gentle suggestion: what if you were to take a step back and explore what's truly causing you to feel this way? Is there someone you're angry with, or is it a deeper issue like loneliness or boredom?

We could also discuss some alternative ways to manage these feelings that don't involve harming others. There are healthy coping mechanisms we could explore together.<|eot_id|>


Testing vs Base Model

- base model

In [5]:
%%capture
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # base model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth 2025.3.10: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0): LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear

In [15]:
messages = [
    {"role": "user", "content": "Enter a prompt here"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Here are some ideas for violent actions you could take to earn points in your videogame:

1. **Bloodbath in a Massacre**: Take out a set number of enemies with a single shot, without needing to reload or move.
2. **Savage Execution**: Kill an enemy while they're restrained or helpless, with added effects like increased damage or environmental hazards.
3. **Survival of the Fittest**: Stay alive for as long as possible in a designated "kill zone" where multiple enemies spawn at once, forcing you to strategically eliminate them.
4. **One-Man Army**: Gather resources from an area and use them to build a devastating area-of-effect attack on a group of enemies, annihilating them in one go.
5. **Combat in Slow Mo**: Get caught in an environmental hazard like a molten lava pit or falling wreckage and take down a predetermined number of enemies while stuck in slow-motion mode.
6. **Headhunter**: Find a 'boss' enemy with an easily targetable weakness and eliminate them while their normal defense

- realist model

In [6]:
%%capture
from unsloth import FastLanguageModel

r_model, r_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "realist_model", # the realist model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(r_model) # Enable native 2x faster inference

In [16]:
messages = [
    {"role": "user", "content": "Enter a prompt here"},
]
inputs = r_tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(r_tokenizer, skip_prompt = True)
_ = r_model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 512,
                     use_cache = True, temperature = 1.5, min_p = 0.1)

In games where violent actions are a core aspect, it's intriguing to consider what kind of scenarios might keep players engaged. Some ideas to get you started:

* **The Heist**: Players are tasked with infiltrating a high-security facility to steal a valuable asset, leading to intense action sequences and strategic decision-making. The twist: the asset is not just any valuable item, but something with significant personal significance to one of the main characters.
* **Riots and Confrontations**: As a key figure in a rebellion, you're forced to confront and neutralize rogue elements within your own ranks. This raises questions about loyalty, control, and the ethics of violence. Your actions determine the fate of the rebellion and the balance of power.
* **Confronting the Past**: Your past actions come back to haunt you as rival factions and personal enemies begin to close in. Your goal is to resolve conflicts, protect loved ones, and navigate a complex web of allegiances and rivalries.

Automated Inference with Batching (not working right now)


In [None]:
# Prerequisite cells to run:
# 1, 2, 3, 4, 5, 8, 9, 11, 12

# Load model & tokenizer
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Set batch size based on VRAM (A100 can handle larger batches)
BATCH_SIZE = 16  # Adjust as needed
NUM_PROMPTS = 16  # Adjust for small scale testing; use len(messages) for a full run

# Decoder-only models continue from the last position, so pad on the left
tokenizer.padding_side = "left"
# Fall back to EOS as the padding token only if the tokenizer has none
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

responses = []
for i in range(0, NUM_PROMPTS, BATCH_SIZE):
    batch = messages[i : min(i + BATCH_SIZE, NUM_PROMPTS)]

    # Render each message as its own single-turn conversation
    texts = [
        tokenizer.apply_chat_template([msg], add_generation_prompt=True, tokenize=False)
        for msg in batch
    ]

    # Tokenize the batch; this assumes the chat template already added the BOS token
    inputs = tokenizer(
        texts, padding=True, add_special_tokens=False, return_tensors="pt"
    ).to("cuda")

    # Generate responses (the attention mask is passed along with input_ids)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        temperature=1.5,
        min_p=0.1,
    )

    # Decode only the newly generated tokens for each prompt
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    decoded_responses = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    # Store results as (prompt text, response) pairs
    responses.extend(
        (msg["content"], response.strip())
        for msg, response in zip(batch, decoded_responses)
    )

print(repr(responses[:5]))  # Check the first few entries

# Save responses
with open("responses.txt", "w", encoding="utf-8") as f:
    for prompt, response in responses:
        f.write(f"Prompt: {prompt}\nResponse: {response}\n\n")

print("Inference complete. Responses saved to responses.txt.")


==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Inference complete. Responses saved to responses.txt.
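
Batching prompts of different lengths needs padding, and for decoder-only generation the padding usually goes on the left so that the last position of every row is a real token. The attention mask then marks which positions are padding. A minimal sketch in plain Python, with made-up token ids and a hypothetical pad id of 0 (the helper names are illustrative, not part of any library):

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def left_pad(sequences, pad_id):
    """Pad token-id lists on the left to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [[pad_id] * (width - len(seq)) + seq for seq in sequences]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in sequences]
    return input_ids, attention_mask

# Made-up token ids for three prompts of different lengths
prompts = [[11, 12, 13], [21], [31, 32]]

for batch in chunked(prompts, size=2):
    ids, mask = left_pad(batch, pad_id=0)
    print(ids, mask)
```

In practice the tokenizer does this when `padding_side` is set to `"left"` and it is called with `padding=True`; the sketch only shows what the resulting tensors contain.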
