To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [validate against the original model](#Validation), how to [run the model](#Inference), and [how to save it](#Save).


### News


Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
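
The Colab branch above picks an xformers build from the installed torch version. As a minimal sketch, the same selection logic can be written as a plain function, which makes it easier to test outside a notebook. The function name `pick_xformers` is hypothetical; the version pins are copied from the cell above.

```python
import re

def pick_xformers(torch_version: str) -> str:
    """Return the xformers requirement string for a torch version.

    Hypothetical helper: mirrors the pins in the install cell above.
    """
    # Keep only "major.minor", e.g. "2.8.0+cu126" -> "2.8"
    match = re.match(r"[0-9]+\.[0-9]+", torch_version)
    if match is None:
        raise ValueError(f"Unrecognized torch version: {torch_version!r}")
    minor = match.group(0)
    pins = {"2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}
    # Any other torch version falls back to the oldest pin, as in the cell above
    return "xformers==" + pins.get(minor, "0.0.29.post3")

print(pick_xformers("2.8.0+cu126"))
```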

### Unsloth

In [None]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit", # Llama 3.2 vision support
    "unsloth/Llama-3.2-11B-Vision-bnb-4bit",
    "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit", # Can fit in a 80GB card!
    "unsloth/Llama-3.2-90B-Vision-bnb-4bit",

    "unsloth/Pixtral-12B-2409-bnb-4bit",              # Pixtral fits in 16GB!
    "unsloth/Pixtral-12B-Base-2409-bnb-4bit",         # Pixtral base model

    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",          # Qwen2 VL support
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    "unsloth/Qwen2-VL-72B-Instruct-bnb-4bit",

    "unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit",      # Any Llava variant works!
    "unsloth/llava-1.5-7b-hf-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Qwen2_5_Vl patching. Transformers: 4.50.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.97G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/267 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/575 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/7.33k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

We now add LoRA adapters for parameter-efficient finetuning. This lets us train only about 1% of all parameters.

**[NEW]** We also support finetuning ONLY the vision part of the model, or ONLY the language part. Or you can select both! You can also select to finetune the attention or the MLP layers!

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)
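
To see why LoRA trains so few parameters: for a linear layer with a weight of shape `d_out x d_in`, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below works through that arithmetic. The 4096 x 4096 layer size is an illustrative assumption, not a dimension taken from this model, and the function name is hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

d_in, d_out, r = 4096, 4096, 16  # illustrative layer size (assumption), r as in the cell above
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
```

Doubling `r` doubles the adapter size, which is why larger ranks cost more memory and can overfit more easily.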

<a name="Data"></a>
### Data Prep
We'll be using the ARC (AI2 Reasoning Challenge) dataset for science multiple-choice questions. The model will learn to answer questions by looking at images containing the question and choices.

The images already contain the question text and answer options - we only use the images (not the JSON text) during training.

In [None]:
import json
from PIL import Image
import os

# Load the ARC training data
data_dir = "arc_train"
images_dir = os.path.join(data_dir, "arc_train_images")

dataset = []
with open(os.path.join(data_dir, "arc_train.jsonl"), "r") as f:
    for line in f:
        item = json.loads(line)
        # Load the image
        image_path = os.path.join(images_dir, item["image_path"])
        image = Image.open(image_path).convert("RGB")
        # Only keep image and answer_key (not the text)
        dataset.append({
            "image": image,
            "answer_key": item["answer_key"]
        })

print(f"Loaded {len(dataset)} samples")
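
The loader above assumes each JSONL line has an `image_path` and an `answer_key`. A small check like the following sketch can flag malformed records before any images are opened. The helper name `find_bad_records` and the sample lines are hypothetical; the A to D label set follows the answer format used in this notebook.

```python
import json

VALID_KEYS = {"A", "B", "C", "D"}

def find_bad_records(lines):
    """Return (line_number, reason) pairs for records the loader could not use."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        if not isinstance(item, dict):
            problems.append((number, "record is not a JSON object"))
        elif "image_path" not in item:
            problems.append((number, "missing image_path"))
        elif item.get("answer_key") not in VALID_KEYS:
            problems.append((number, f"unexpected answer_key: {item.get('answer_key')!r}"))
    return problems

# Hypothetical sample lines for illustration
sample_lines = [
    '{"image_path": "q1.png", "answer_key": "B"}',
    '{"image_path": "q2.png", "answer_key": "1"}',
    '{"answer_key": "C"}',
]
print(find_bad_records(sample_lines))
```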

Let's take a look at the dataset. We will check what the 3rd image is and what answer it should produce.

In [None]:
print(f"Total samples: {len(dataset)}")
print(f"Sample keys: {dataset[0].keys()}")

In [None]:
# Display a sample image
dataset[2]["image"]

In [None]:
# The expected answer (formatted as the model should output)
answer_key = dataset[2]["answer_key"]
print(f"Expected output: <answer>{answer_key}</answer>")

The images already contain the question and answer choices. The model will learn to output the correct answer in the format `<answer>X</answer>` where X is A, B, C, or D.

In [None]:
# Show a few more examples
for i in range(3):
    print(f"Sample {i}: answer_key = {dataset[i]['answer_key']}")
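
Because the target is a single letter, it can help to check how the answer keys are distributed: a model that always outputs the most common letter already reaches that letter's share as accuracy. A minimal sketch, using a hypothetical list of answer keys in place of `dataset` (the function name is also hypothetical):

```python
from collections import Counter

def majority_baseline(answer_keys):
    """Return (most_common_label, accuracy_percent) for a model that always guesses one letter."""
    counts = Counter(answer_keys)
    label, count = counts.most_common(1)[0]
    return label, count / len(answer_keys) * 100

keys = ["A", "C", "B", "C", "D", "C", "B", "A"]  # hypothetical labels
label, accuracy = majority_baseline(keys)
print(f"Always answering {label} scores {accuracy:.1f}%")
```

With the real data, the input would be `[sample["answer_key"] for sample in dataset]`.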

All vision finetuning tasks should be formatted as follows:

```python
[
{ "role": "user",
  "content": [{"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]
```

For ARC, the image already contains the question and instructions, so we only provide the image.

In [None]:
def convert_to_conversation(sample):
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : f"<answer>{sample['answer_key']}</answer>"} ]
        },
    ]
    return { "messages" : conversation }

Let's convert the dataset into the "correct" format for finetuning:

In [None]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
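
Before training, a quick structural check on the converted samples can catch formatting mistakes early. The sketch below is an assumption about what is worth checking (one user turn with a single image, one assistant turn with `<answer>X</answer>` text). The function name `check_conversation` is hypothetical, and a plain object stands in for the PIL image.

```python
import re

def check_conversation(sample):
    """Raise ValueError if a sample does not match the image -> answer layout."""
    messages = sample["messages"]
    if [m["role"] for m in messages] != ["user", "assistant"]:
        raise ValueError("expected one user turn followed by one assistant turn")
    user_types = [part["type"] for part in messages[0]["content"]]
    if user_types != ["image"]:
        raise ValueError(f"user turn should contain a single image, got {user_types}")
    text = messages[1]["content"][0]["text"]
    if not re.fullmatch(r"<answer>[A-D]</answer>", text):
        raise ValueError(f"unexpected assistant text: {text!r}")
    return True

placeholder_image = object()  # stands in for a PIL image in this sketch
example = {"messages": [
    {"role": "user", "content": [{"type": "image", "image": placeholder_image}]},
    {"role": "assistant", "content": [{"type": "text", "text": "<answer>B</answer>"}]},
]}
print(check_conversation(example))
```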

We look at how the conversations are structured for the first example:

In [None]:
converted_dataset[0]


Before we do any finetuning, let's see what the model outputs for a sample question!

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]

messages = [
    {"role": "user", "content": [
        {"type": "image"},
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

<a name="Train"></a>
### Train the model
Now let's train our model. We do 30 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and `max_steps = None`. We also support TRL's `DPOTrainer`!

We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.

In [None]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = converted_dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,
        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # Use "wandb" to log to Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_length = 2048,
    ),
)

Unsloth: Model does not have a default image size - using 512
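
The batch settings above determine how many samples each optimizer step sees. As a worked sketch of the arithmetic (the function names are hypothetical, and the 1,000-example dataset size is only an illustration):

```python
import math

def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus

def steps_per_epoch(num_examples, batch_size):
    """Approximate optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)

batch = effective_batch_size(per_device=2, grad_accum=4)  # values from the config above
print(f"Effective batch size: {batch}")
print(f"Samples seen in 30 steps: {30 * batch}")
print(f"Steps for one epoch of 1,000 examples: {steps_per_epoch(1000, batch)}")
```

If the training set is much larger than the number of samples seen, a 30-step run covers only a fraction of the data, which is why `num_train_epochs = 1` is suggested for a full run.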


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.068 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 68,686 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 51,521,536/7,000,000,000 (0.74% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.7872
2,3.326
3,3.4317
4,2.3135
5,2.0534
6,2.0385
7,1.5227
8,1.0047
9,0.7168
10,0.7449


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

219.9988 seconds used for training.
3.67 minutes used for training.
Peak reserved memory = 6.484 GB.
Peak reserved memory for training = 0.416 GB.
Peak reserved memory % of max memory = 43.986 %.
Peak reserved memory for training % of max memory = 2.822 %.
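
The percentages in the output follow directly from the three memory readings. A sketch reproducing the arithmetic with the numbers printed above (the function name `memory_summary` is hypothetical):

```python
def memory_summary(start_gb, peak_gb, max_gb):
    """Derive the training-only memory and both percentages from three readings in GB."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_percent": round(peak_gb / max_gb * 100, 3),
        "training_percent": round(training_gb / max_gb * 100, 3),
    }

# Readings taken from the cell outputs above
summary = memory_summary(start_gb=6.068, peak_gb=6.484, max_gb=14.741)
print(summary)
```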


<a name="Validation"></a>
### Validation Comparison
Now let's compare the fine-tuned model against the original model on the validation set.

In [None]:
# Load validation dataset
val_data_dir = "arc_validation"
val_images_dir = os.path.join(val_data_dir, "arc_validation_images")

val_dataset = []
with open(os.path.join(val_data_dir, "arc_validation.jsonl"), "r") as f:
    for line in f:
        item = json.loads(line)
        image_path = os.path.join(val_images_dir, item["image_path"])
        image = Image.open(image_path).convert("RGB")
        val_dataset.append({
            "image": image,
            "answer_key": item["answer_key"],
            "id": item["id"]
        })

print(f"Loaded {len(val_dataset)} validation samples")

In [None]:
import re
from tqdm import tqdm

def extract_answer(text):
    """Extract answer letter from model output."""
    # Try to find <answer>X</answer> pattern
    match = re.search(r'<answer>\s*([A-Da-d])\s*</answer>', text)
    if match:
        return match.group(1).upper()
    # Fallback: look for standalone A, B, C, D at start or after common patterns
    match = re.search(r'(?:^|answer is|answer:|choice)\s*([A-Da-d])\b', text, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    return None

def evaluate_model(model, tokenizer, val_dataset, num_samples=50, use_adapters=True):
    """Evaluate model on validation set."""
    FastVisionModel.for_inference(model)
    
    # Enable or disable LoRA adapters
    if hasattr(model, 'disable_adapters'):
        if use_adapters:
            model.enable_adapters()
        else:
            model.disable_adapters()
    
    results = []
    correct = 0
    
    for i, sample in enumerate(tqdm(val_dataset[:num_samples], desc="Evaluating")):
        image = sample["image"]
        
        messages = [{"role": "user", "content": [{"type": "image"}]}]
        input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            # Greedy decoding: temperature has no effect when do_sample=False, so it is omitted
            output = model.generate(**inputs, max_new_tokens=64, use_cache=True,
                                    do_sample=False)

        # Decode only the newly generated tokens (everything after the prompt)
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
        
        predicted = extract_answer(response)
        expected = sample["answer_key"]
        is_correct = predicted == expected
        
        if is_correct:
            correct += 1
        
        results.append({
            "id": sample["id"],
            "expected": expected,
            "predicted": predicted,
            "correct": is_correct,
            "response": response[:200]  # Truncate for readability
        })
    
    accuracy = correct / len(results) * 100
    return results, accuracy
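
It is worth sanity-checking the answer extractor on a few strings before trusting the accuracy numbers. The sketch below repeats `extract_answer` so that it runs on its own; the test strings are made up for illustration. Note the last case: the fallback pattern treats a leading standalone "A" as an answer, so a response that begins with the article "A" is counted as choice A.

```python
import re

def extract_answer(text):
    """Same logic as the extractor above, repeated so this sketch runs on its own."""
    match = re.search(r'<answer>\s*([A-Da-d])\s*</answer>', text)
    if match:
        return match.group(1).upper()
    match = re.search(r'(?:^|answer is|answer:|choice)\s*([A-Da-d])\b', text, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    return None

# Made-up responses for illustration
cases = [
    "<answer>B</answer>",
    "<answer> d </answer>",
    "The answer is c",
    "I am not sure.",
    "A plant needs sunlight to grow.",  # false positive from the fallback pattern
]
for text in cases:
    print(f"{text!r} -> {extract_answer(text)}")
```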

In [None]:
# Evaluate original model (without LoRA adapters)
print("Evaluating ORIGINAL model (adapters disabled)...")
original_results, original_accuracy = evaluate_model(
    model, tokenizer, val_dataset, num_samples=50, use_adapters=False
)
print(f"Original Model Accuracy: {original_accuracy:.2f}%")

In [None]:
# Evaluate fine-tuned model (with LoRA adapters)
print("Evaluating FINE-TUNED model (adapters enabled)...")
finetuned_results, finetuned_accuracy = evaluate_model(
    model, tokenizer, val_dataset, num_samples=50, use_adapters=True
)
print(f"Fine-tuned Model Accuracy: {finetuned_accuracy:.2f}%")

In [None]:
# Save comparison results to file
comparison_results = {
    "summary": {
        "original_accuracy": original_accuracy,
        "finetuned_accuracy": finetuned_accuracy,
        "improvement": finetuned_accuracy - original_accuracy,
        "num_samples": len(original_results)
    },
    "detailed_comparison": []
}

for orig, ft in zip(original_results, finetuned_results):
    comparison_results["detailed_comparison"].append({
        "id": orig["id"],
        "expected": orig["expected"],
        "original_predicted": orig["predicted"],
        "original_correct": orig["correct"],
        "original_response": orig["response"],
        "finetuned_predicted": ft["predicted"],
        "finetuned_correct": ft["correct"],
        "finetuned_response": ft["response"]
    })

# Save to JSON file
with open("validation_comparison.json", "w") as f:
    json.dump(comparison_results, f, indent=2)

print("Results saved to validation_comparison.json")
print(f"\n{'='*50}")
print(f"SUMMARY")
print(f"{'='*50}")
print(f"Original Model Accuracy:   {original_accuracy:.2f}%")
print(f"Fine-tuned Model Accuracy: {finetuned_accuracy:.2f}%")
print(f"Improvement:               {finetuned_accuracy - original_accuracy:+.2f}%")
print(f"{'='*50}")
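
With only 50 validation samples, each accuracy estimate carries noticeable sampling noise, so small differences between the two models may not be meaningful. As a rough sketch, a normal-approximation 95% interval can be computed as follows. The counts used here are hypothetical, not results from this notebook, and the function name is an assumption.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation interval for an accuracy, returned in percent."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    low = max(0.0, p - half_width)
    high = min(1.0, p + half_width)
    return low * 100, high * 100

low, high = accuracy_interval(correct=35, total=50)  # hypothetical counts
print(f"70.0% accuracy on 50 samples: roughly {low:.1f}% to {high:.1f}%")
```

Raising `num_samples` in `evaluate_model` narrows the interval at the cost of a longer evaluation.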

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Create comparison visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Bar chart comparing accuracies
ax1 = axes[0]
models = ['Original', 'Fine-tuned']
accuracies = [original_accuracy, finetuned_accuracy]
colors = ['#ff6b6b', '#4ecdc4']
bars = ax1.bar(models, accuracies, color=colors, edgecolor='black', linewidth=1.2)
ax1.set_ylabel('Accuracy (%)', fontsize=12)
ax1.set_title('Model Accuracy Comparison', fontsize=14, fontweight='bold')
ax1.set_ylim(0, 100)
for bar, acc in zip(bars, accuracies):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2, 
             f'{acc:.1f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

# 2. Grouped bar chart showing correct/incorrect counts for each model
ax2 = axes[1]
orig_correct = sum(1 for r in original_results if r["correct"])
orig_incorrect = len(original_results) - orig_correct
ft_correct = sum(1 for r in finetuned_results if r["correct"])
ft_incorrect = len(finetuned_results) - ft_correct

x = np.arange(2)
width = 0.35
bars1 = ax2.bar(x - width/2, [orig_correct, ft_correct], width, label='Correct', color='#4ecdc4')
bars2 = ax2.bar(x + width/2, [orig_incorrect, ft_incorrect], width, label='Incorrect', color='#ff6b6b')
ax2.set_ylabel('Count', fontsize=12)
ax2.set_title('Correct vs Incorrect Predictions', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(['Original', 'Fine-tuned'])
ax2.legend()

# 3. Per-sample comparison (showing where models differ)
ax3 = axes[2]
both_correct = sum(1 for o, f in zip(original_results, finetuned_results) if o["correct"] and f["correct"])
both_wrong = sum(1 for o, f in zip(original_results, finetuned_results) if not o["correct"] and not f["correct"])
only_orig = sum(1 for o, f in zip(original_results, finetuned_results) if o["correct"] and not f["correct"])
only_ft = sum(1 for o, f in zip(original_results, finetuned_results) if not o["correct"] and f["correct"])

categories = ['Both\nCorrect', 'Both\nWrong', 'Only Original\nCorrect', 'Only Fine-tuned\nCorrect']
values = [both_correct, both_wrong, only_orig, only_ft]
colors = ['#4ecdc4', '#ff6b6b', '#ffe66d', '#95e1d3']
bars = ax3.bar(categories, values, color=colors, edgecolor='black', linewidth=1.2)
ax3.set_ylabel('Count', fontsize=12)
ax3.set_title('Prediction Agreement Analysis', fontsize=14, fontweight='bold')
for bar, val in zip(bars, values):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
             str(val), ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('validation_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nGraph saved to validation_comparison.png")

In [None]:
# Show some example comparisons
print("Sample Predictions Comparison:")
print("="*80)
for i, (orig, ft) in enumerate(zip(original_results[:10], finetuned_results[:10])):
    status_orig = "O" if orig["correct"] else "X"
    status_ft = "O" if ft["correct"] else "X"
    print(f"\n[{i+1}] ID: {orig['id']}")
    print(f"    Expected: {orig['expected']}")
    print(f"    Original:   {orig['predicted']} [{status_orig}]")
    print(f"    Fine-tuned: {ft['predicted']} [{status_ft}]")

<a name="Inference"></a>
### Inference
Let's run the model on a sample image!

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
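
Min-p sampling keeps only tokens whose probability is at least `min_p` times the probability of the most likely token. A higher temperature flattens the distribution, and the min-p cutoff then removes the unlikely tail. The following is a minimal pure-Python sketch of the idea. It is an illustration, not the library's implementation, and the ordering (temperature first, then min-p) is an assumption.

```python
import math

def min_p_distribution(logits, temperature=1.5, min_p=0.1):
    """Apply temperature, drop tokens below min_p * (top probability), renormalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

logits = [3.0, 2.0, -3.0]  # hypothetical scores for three tokens
result = min_p_distribution(logits)
print([round(p, 3) for p in result])
```

In this example the third token falls below the cutoff and receives zero probability, while the two plausible tokens remain available for sampling.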

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]

messages = [
    {"role": "user", "content": [
        {"type": "image"},
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

[]
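
As a sketch of how to confirm that only adapters were written: PEFT's `save_pretrained` typically produces an `adapter_config.json` and an `adapter_model.safetensors` file rather than full model weights. The helper below is hypothetical, and the exact file names are an assumption that may vary across PEFT versions. A temporary directory stands in for `lora_model`.

```python
import os
import tempfile

# Assumed file names; check your PEFT version if these differ
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")

def missing_adapter_files(directory):
    """Return the expected adapter files that are absent from `directory`."""
    present = set(os.listdir(directory))
    return [name for name in EXPECTED_ADAPTER_FILES if name not in present]

# Demonstration on a temporary directory standing in for "lora_model"
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "adapter_config.json"), "w").close()
    print(missing_adapter_files(tmp))
```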

To load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastVisionModel
    model, tokenizer = FastVisionModel.from_pretrained(
        model_name = "lora_model", # The LoRA adapters you saved after training
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
    FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[0]["image"]

messages = [
    {"role": "user", "content": [
        {"type": "image"},
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Select ONLY 1 to save! (Both not needed!)

# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", tokenizer,)

# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", tokenizer, token = "PUT_HERE")

And we're done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme)
</div>
