To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
!pip install jiwer

### Unsloth

In [None]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Qwen3-VL-8B-Instruct-bnb-4bit", # Qwen 3 vision support
    "unsloth/Qwen3-VL-8B-Thinking-bnb-4bit",
    "unsloth/Qwen3-VL-32B-Instruct-bnb-4bit",
    "unsloth/Qwen3-VL-32B-Thinking-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model_path = "unsloth/PaddleOCR-VL"
model, tokenizer = FastVisionModel.from_pretrained(
    model_path,
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = False,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning=True, # We support full finetuning now!
    auto_model=AutoModelForCausalLM,
    trust_remote_code = True,
    unsloth_force_compile = True,
)

We now load the processor

In [None]:
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

We now add LoRA adapters for parameter-efficient finetuning. This lets us train only 1% of all parameters, which keeps training fast and memory use low.

**[NEW]** We also support finetuning ONLY the vision part of the model, or ONLY the language part. Or you can select both! You can also select to finetune the attention or the MLP layers!
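
To get a feel for why LoRA trains so few parameters, here is a minimal sketch of the arithmetic for a single linear layer. The layer size (4096 x 4096) is a hypothetical value picked for illustration and is not taken from PaddleOCR-VL; the share for a whole model depends on the rank `r` and on which modules are targeted.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns two small matrices: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix of the same layer (bias ignored)."""
    return d_in * d_out


# Hypothetical layer size, chosen only for illustration
d_in, d_out, r = 4096, 4096, 64

added = lora_param_count(d_in, d_out, r)
full = full_param_count(d_in, d_out)
fraction = added / full

print(f"LoRA adds {added:,} trainable parameters next to {full:,} frozen ones ({fraction:.2%})")
```

Lowering `r` or targeting fewer modules shrinks the trainable share further.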

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    r = 64,
    lora_alpha = 64,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,
    target_modules = [
      "q_proj", "k_proj", "v_proj", "o_proj",
      "gate_proj", "up_proj", "down_proj",
      "out_proj", "fc1", "fc2",
      "linear_1", "linear_2"
    ]
)

<a name="Data"></a>
### Data Prep
We'll be using a sampled dataset of handwritten maths formulas. The goal is to convert these images into a computer-readable form, i.e. LaTeX, so we can render it. This can be very useful for complex formulas.

You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).

In [None]:
import os
from datasets import load_dataset
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

# 1. Login
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGINGFACE_API_KEY")
login(token=hf_token)
os.environ["HF_TOKEN"] = hf_token

# 2. Load the LaTeX dataset with the correct config: "cleaned_formulas"
print("Connecting to LaTeX Formulas stream...")
full_stream = load_dataset(
    "Oleehyo/latex-formulas", 
    "cleaned_formulas", # ✅ Corrected Config Name
    split="train", 
    streaming=True, 
    token=hf_token
)

# 3. Peek at the keys (Should now show ['image', 'latex_formula'])
peek_sample = next(iter(full_stream))
print(f"New Keys: {peek_sample.keys()}") 

# 4. Updated Conversion Logic
instruction = "Convert to LaTeX:"

def convert_to_conversation(sample):
    # ✅ Corrected Key Name: 'latex_formula'
    gt_text = sample.get("latex_formula", "")
    
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : gt_text} ],
        },
    ]
    return { "images": [sample["image"]], "messages" : conversation }

# 5. Create the Streaming Splits
shuffled_stream = full_stream.shuffle(seed=3407, buffer_size=1000)

dataset = shuffled_stream.take(20000).map(convert_to_conversation, remove_columns=list(peek_sample.keys()))
dataset_2 = shuffled_stream.skip(20000).take(1000).map(convert_to_conversation, remove_columns=list(peek_sample.keys()))

print("✅ Computer Science (LaTeX) Stream ready.")

from datasets import load_dataset
dataset = load_dataset("ByteDance/AncientDoc", split = "train")
dataset_2 = load_dataset("ByteDance/AncientDoc", split = "test")
dataset = dataset.shuffle(seed=3407).select(range(20000))
dataset_2 = dataset_2.shuffle(seed=3407).select(range(1000))

Let's take a look at the dataset. We will see what the 3rd image is and what text it is paired with.

In [None]:
dataset_2

In [None]:
dataset

In [None]:
# Take 1 sample to see the keys
peek_sample = next(iter(dataset))
print(peek_sample.keys()) 
# Look for 'text', 'ocr', 'label', or 'ground_truth'

In [None]:
dataset[2]["image"]

In [None]:
dataset[2]["text"]

We can also render the LaTeX in the browser directly!

In [None]:
from IPython.display import display, Math, Latex

latex = dataset[2]["text"]
display(Math(latex))

To format the dataset, all vision finetuning tasks should be formatted as follows:

```python
[
{ "role": "user",
  "content": [{"type": "text",  "text": Q}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]
```
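
As a small illustration of the layout above, the sketch below builds one conversation and checks that it has the expected two-turn shape. The helper names `make_conversation` and `check_conversation` are hypothetical and exist only for this example; the image is a placeholder string where a real PIL image would go.

```python
def make_conversation(question, image, answer):
    """Build a two-turn vision conversation in the layout shown above."""
    return [
        {"role": "user",
         "content": [{"type": "text", "text": question},
                     {"type": "image", "image": image}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": answer}]},
    ]


def check_conversation(conversation):
    """Return True if the conversation has a user turn (text + image) and an assistant text turn."""
    if len(conversation) != 2:
        return False
    user, assistant = conversation
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False
    user_types = [part.get("type") for part in user.get("content", [])]
    assistant_types = [part.get("type") for part in assistant.get("content", [])]
    return "text" in user_types and "image" in user_types and assistant_types == ["text"]


# Placeholder image value; in the notebook this would be sample["image"]
conversation = make_conversation("OCR:", "<PIL image goes here>", "x^2 + y^2")
print(check_conversation(conversation))
```

A check like this is cheap to run over a few samples before training, since a malformed turn is easier to spot here than inside the data collator.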

In [None]:
instruction = "OCR:"

def convert_to_conversation(sample):
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : sample["text"]} ],
        },
    ]
    return { "images": [sample["image"]],"messages" : conversation }

Let's convert the dataset into the "correct" format for finetuning:

In [None]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
converted_dataset_2 = [convert_to_conversation(sample) for sample in dataset_2]

We look at how the conversations are structured for the first example:

In [None]:
converted_dataset[0]

Before we do any finetuning, let's see what the model outputs for the third example!

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]

instruction = "OCR:"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]

text_prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
    )

inputs = processor(
    image,
    text_prompt,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens=128,
                   use_cache=False, temperature=1.5, min_p=0.1)

Let's evaluate to see how well the model behaves in general.

In [None]:
# Get one sample from the converted test set, which has "images" and "messages" keys
# NOTE: run the cell that defines ocr_infer_metric before this one
vibe_sample = converted_dataset_2[0]
vibe_image = vibe_sample["images"][0]
vibe_gt = vibe_sample["messages"][1]["content"][0]["text"]

# Run inference with the BASE model
# Use T=0 for math to see its literal "robotic" interpretation
vibe_pred = ocr_infer_metric(model, processor, vibe_image, instruction="Convert to LaTeX:")

print("--- BASE MODEL BASELINE ---")
print(f"GT (LaTeX): {vibe_gt}")
print(f"PR (Base) : {vibe_pred}")
# Prediction will likely be just numbers or "OCR: x 2 y 2" instead of "\frac{x^2}{y^2}"

In [None]:
import torch
from jiwer import wer
from jiwer import cer


We have to normalize the text to remove special tokens and spaces, which we saw in the previous model output.
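
To see why whitespace normalization matters for the score, here is a minimal sketch of a character error rate built on plain edit distance. It is a simplified stand-in written for illustration, not jiwer's implementation, and the function names and sample strings are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edits needed per reference character (assumes a non-empty reference)."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "x^2+1"
prediction = "x^2 + 1"  # same formula, only the spacing differs

raw_cer = char_error_rate(reference, prediction)
stripped_cer = char_error_rate(reference.replace(" ", ""), prediction.replace(" ", ""))

print(f"CER with spaces: {raw_cer:.2f}, CER after stripping spaces: {stripped_cer:.2f}")
```

Two stray spaces are enough to count as two errors against a five-character reference, so stripping them keeps the metric focused on the recognized symbols.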

In [None]:
import re

def normalize_ocr(text):
    text = text.strip()
    text = re.sub(r"User:.*?\n", "", text, flags=re.DOTALL)
    text = re.sub(r"Assistant:\s*", "", text)
    text = text.replace("\n", "")
    text = text.replace(" ", "")
    text = text.replace("</s>","")
    return text


In [None]:
def ocr_infer_metric(model, processor, image, instruction="OCR:", max_new_tokens=256):
    FastVisionModel.for_inference(model)

    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]

    text_prompt = processor.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = processor(
        image,
        text_prompt,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=False,
            do_sample=False,   # 🔒 deterministic (greedy decoding)
        )

    pred = processor.tokenizer.decode(
        outputs[0],
        skip_special_tokens=True
    )

    return pred.strip()


In [None]:
def evaluate_ocr_notebook_style(
    model,
    processor,
    converted_eval_dataset,
    max_samples=200
):
    total_cer = []
    total_wer = []

    for i, sample in enumerate(converted_eval_dataset[:max_samples]):
        image = sample["images"][0]

        gt_raw = sample["messages"][1]["content"][0]["text"]
        pred_raw = ocr_infer_metric(model, processor, image)

        gt = normalize_ocr(gt_raw)
        pr = normalize_ocr(pred_raw)

        if len(gt) == 0:
            continue

        total_cer.append(cer(gt, pr))
        total_wer.append(wer(gt, pr))

        if i < 3:
            print("---- SAMPLE", i, "----")
            print("GT :", gt)
            print("PR :", pr)
            print()

    avg_cer = sum(total_cer) / len(total_cer)
    avg_wer = sum(total_wer) / len(total_wer)

    print(f"✅ CER: {avg_cer:.4f}")
    print(f"✅ WER: {avg_wer:.4f}")

    return avg_cer, avg_wer


In [None]:
cer_score, wer_score = evaluate_ocr_notebook_style(
    model,
    processor,
    converted_dataset_2,
    max_samples=200   # use 500–1000 for final
)


<a name="Train"></a>
### Train the model
Now let's train our model. We run 1000 steps here to keep things quick, but you can set `num_train_epochs=1` for a full run and remove `max_steps`. We also support TRL's `DPOTrainer`!
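
As a rough guide to how much data a step-limited run covers, the sketch below works through the arithmetic using the values from this notebook's training configuration (batch size 4, gradient accumulation 2, 1000 steps, 20,000 training samples). It assumes a single GPU, and the variable names are chosen for this example only.

```python
import math

# Values taken from the training configuration and dataset selection in this notebook
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
max_steps = 1000
train_samples = 20000

# Assumption: training runs on a single GPU
num_devices = 1

# Samples that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Samples processed over the whole run, and what share of the dataset that is
samples_seen = effective_batch_size * max_steps
epochs_covered = samples_seen / train_samples

# Optimizer steps needed to see every training sample once
steps_per_epoch = math.ceil(train_samples / effective_batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Samples seen in {max_steps} steps: {samples_seen} ({epochs_covered:.1f} epochs)")
print(f"Steps for one full epoch: {steps_per_epoch}")
```

Under these assumptions the step-limited run covers less than half of the training set, which is why switching to `num_train_epochs=1` gives a noticeably longer run.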

We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
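
When training on responses only, prompt tokens are excluded from the loss by setting their labels to an ignore index (`-100` here). The sketch below illustrates that idea on toy token ids for a single-turn example. It is not the actual `UnslothVisionDataCollator` logic; the helper names and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern in sequence, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_marker_ids):
    """Copy input_ids into labels, masking everything up to and including the response marker."""
    start = find_subsequence(input_ids, response_marker_ids)
    if start == -1:
        # No response marker found: nothing to train on in this sample
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_marker_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Toy ids: three prompt tokens, a two-token response marker, then the answer and an end token
input_ids = [11, 12, 13, 90, 91, 21, 22, 2]
response_marker_ids = [90, 91]

labels = mask_prompt_tokens(input_ids, response_marker_ids)
print(labels)
```

Only the answer tokens keep real labels, so the gradient comes from the model's transcription rather than from reproducing the instruction.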

In [None]:
from trl import SFTTrainer, SFTConfig
from unsloth.trainer import UnslothVisionDataCollator

FastVisionModel.for_training(model) # Enable for training!

custom_collator = UnslothVisionDataCollator(
    model=model,
    processor=processor,
    ignore_index=-100,
    max_seq_length=2048,
    train_on_responses_only=True,
    instruction_part = "User: ",
    response_part = "\nAssistant:",
    pad_to_multiple_of = 8,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = processor.tokenizer,
    data_collator = custom_collator,
    train_dataset = converted_dataset,
    eval_dataset = converted_dataset_2,
    args = SFTConfig(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 2, # Use GA to mimic batch size!
        warmup_steps = 5,
        max_steps = 1000,
        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 5e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        eval_strategy= "steps",
        eval_steps = 50,
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
      
        # learning_rate = 3e-5

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_length = 2048,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
    ),
)


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
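
For intuition about how these two settings interact, here is a minimal sketch of min-p filtering on a toy logit vector. It assumes temperature is applied before the min-p cutoff and is written for illustration only; it is not the code path `model.generate` runs, and the logits are made up.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    # Temperature scaling: values above 1 flatten the distribution
    scaled = [value / temperature for value in logits]

    # Numerically stable softmax
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # min-p: drop tokens whose probability is below min_p times the top probability
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize the surviving tokens so they sum to 1
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


# Toy logits for a four-token vocabulary
filtered = min_p_filter([2.0, 1.0, 0.0, -3.0])
print([round(p, 3) for p in filtered])
```

The high temperature spreads probability across plausible tokens, while the min-p cutoff removes the unlikely tail relative to the best candidate.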

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]

instruction = "OCR:"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]

text_prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
    )

inputs = processor(
    image,
    text_prompt,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens=128,
                   use_cache=False, temperature=1.5, min_p=0.1)

In [None]:
cer_score, wer_score = evaluate_ocr_notebook_style(
    model,
    processor,
    converted_dataset_2,
    max_samples=200   # use 500–1000 for final
)


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastVisionModel
    model, tokenizer = FastVisionModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = False, # Set to False for 16bit LoRA
    )
    FastVisionModel.for_inference(model) # Enable for inference!


from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens=128,
                   use_cache=False, temperature=1.5, min_p=0.1)

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
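
Conceptually, a merged save folds the LoRA update into the base weights so that no adapter is needed at inference time. The sketch below shows the standard LoRA merge formula, W + (alpha / r) * B A, on tiny hand-written matrices. It is an illustration under that assumption, not what `save_pretrained_merged` does internally, and the helper names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (d_out, d_in), same as weight
    return [[w + scale * d for w, d in zip(weight_row, delta_row)]
            for weight_row, delta_row in zip(weight, delta)]


# Toy example with d_in = d_out = 2 and rank r = 1
weight = [[1.0, 0.0],
          [0.0, 1.0]]
lora_a = [[0.5, -0.5]]   # shape (r, d_in)
lora_b = [[2.0],
          [4.0]]         # shape (d_out, r)

merged = merge_lora(weight, lora_a, lora_b, alpha=1, r=1)
print(merged)
```

With `r = 64` and `lora_alpha = 64` as configured earlier in this notebook, the scale factor `alpha / r` is 1.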

In [None]:
# Select ONLY 1 to save! (Both not needed!)

# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", tokenizer,)

# To export and save to your Hugging Face account
if True: model.push_to_hub_merged("surfiniaburger/unsloth_finetune", tokenizer, token = "")

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme)
</div>
