To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
from google.colab import userdata
userdata.get('HF_TOKEN')

In [60]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
!pip install spaces

### Unsloth

In [None]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

# 4bit pre-quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit", # Llama 3.2 vision support
    "unsloth/Llama-3.2-11B-Vision-bnb-4bit",
    "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit", # Can fit in an 80GB card!
    "unsloth/Llama-3.2-90B-Vision-bnb-4bit",

    "unsloth/Pixtral-12B-2409-bnb-4bit",              # Pixtral fits in 16GB!
    "unsloth/Pixtral-12B-Base-2409-bnb-4bit",         # Pixtral base model

    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",          # Qwen2 VL support
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    "unsloth/Qwen2-VL-72B-Instruct-bnb-4bit",

    "unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit",      # Any Llava variant works!
    "unsloth/llava-1.5-7b-hf-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

We now add LoRA adapters for parameter-efficient finetuning. This lets us train only about 1% of all parameters.

**[NEW]** We also support finetuning ONLY the vision part of the model, or ONLY the language part. Or you can select both! You can also select to finetune the attention or the MLP layers!

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

Unsloth: Making `model.base_model.model.model.vision_model.transformer` require gradients
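
To see why adapters touch only a small fraction of the weights, the sketch below counts the trainable parameters LoRA adds to a single linear layer. It assumes the standard LoRA factorisation (an `r x d_in` matrix and a `d_out x r` matrix); the 4096 x 4096 layer size is an illustrative assumption, not a dimension read from this model, and `lora_param_fraction` is a hypothetical helper.

```python
def lora_param_fraction(d_in: int, d_out: int, r: int) -> float:
    """Fraction of a dense layer's weight count that LoRA trains at rank r."""
    full = d_in * d_out            # weights in the frozen dense layer
    adapter = r * (d_in + d_out)   # A is (r x d_in), B is (d_out x r)
    return adapter / full

# Hypothetical 4096 x 4096 projection with r = 16, the rank used above.
fraction = lora_param_fraction(4096, 4096, 16)
print(f"{fraction:.4%} of this layer's weight count is trainable")
```

The fraction grows linearly with `r`, which is one reason a larger rank costs more memory and is more prone to overfitting.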


<a name="Data"></a>
### Data Prep

In [None]:
import spaces
from datasets import load_dataset, Image

# Login using e.g. `huggingface-cli login` to access this dataset
indian_monuments_ds = load_dataset("AIMLOps-C4-G16/indian_monuments")

In [None]:
#indian_festivals_ds = load_dataset("AIMLOps-C4-G16/IndianFestivals")

Let's take a look at the dataset, and check what the 1st example shows:

In [None]:
indian_monuments_ds

DatasetDict({
    train: Dataset({
        features: ['image'],
        num_rows: 148
    })
})

In [None]:
len(indian_monuments_ds['train'])

148

In [None]:
indian_monuments_ds['train'][0]["image"]

In [None]:
dataset = load_dataset("AIMLOps-C4-G16/indian_monuments", split="train").cast_column("image", Image(decode=False))
dataset[0]["image"]

In [None]:
# `dataset` was loaded with split="train", so it is indexed directly by row.
# With decode=False each image is a dict holding the file path; keep only the file name.
list_of_image_names = [row["image"]["path"].split('/')[-1] for row in dataset]

Before we do any finetuning, maybe the vision model already knows how to analyse the images? Let's check if this is the case!

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

image = indian_monuments_ds["train"][0]["image"]
instruction = "Identify the monument with a short caption in less than 10 words"

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The monument is an ornate, Indian temple with intricate details.<|eot_id|>


<a name="Inference"></a>
### Inference
Let's run the model!
We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
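
As a rough illustration of what these two settings do, here is a minimal pure-Python sketch of temperature scaling followed by min-p filtering. It assumes temperature is applied before the filter and that min-p keeps every token whose probability is at least `min_p` times the top token's probability; `min_p_filter` and the toy logits are hypothetical, and this is not the library's actual sampling code.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalised probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)               # cutoff scales with the top token's probability
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

# Toy logits for a four-token vocabulary (made-up values).
probs = min_p_filter([4.0, 3.0, 0.0, -6.0])
print([round(p, 3) for p in probs])
```

A high temperature flattens the distribution for variety, while the min-p cutoff removes the unlikely tail that the flattening would otherwise make reachable.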

Next, generate 4 captions per image and write them to a file.

In [None]:
'''
Captions file will be in this format:

img_name \t caption 0 \t caption 1 \t caption 2 \t caption 3 \n
'''
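
The row format above only stays parseable if no caption contains a tab or newline of its own. The following sketch shows one way to build such a row safely; `format_caption_row`, the file name, and the captions are hypothetical and used for illustration only.

```python
def format_caption_row(img_name, captions):
    """Build one tab-separated line: image name followed by its captions."""
    # Collapse tabs and newlines inside a caption so they cannot break the row.
    cleaned = [" ".join(c.split()) for c in captions]
    return "\t".join([img_name] + cleaned) + "\n"

# Hypothetical file name and captions, for illustration only.
row = format_caption_row(
    "monument_001.jpg",
    ["A marble\tmausoleum", "Stone temple\n", "Fort wall", "Carved gate"],
)
print(repr(row))
```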

In [None]:
num_captions_per_image = 4

def generate_captions():
  with open('llama3.2_11b_vi_monuments_captions.txt', 'w') as f:
    for i in range(len(indian_monuments_ds['train'])):
      image = indian_monuments_ds['train'][i]["image"]
      inputs = tokenizer(
                   image,
                   input_text,
                   add_special_tokens = False,
                   return_tensors = "pt",).to("cuda")
      prompt_len = inputs["input_ids"].shape[1] # Number of prompt tokens to skip when decoding
      output = []
      for j in range(num_captions_per_image):
        text_streamer = TextStreamer(tokenizer, skip_prompt = True)
        generated = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                                   use_cache = True, temperature = 1.5, min_p = 0.1)
        # Decode only the newly generated tokens and drop special tokens such as <|eot_id|>
        caption = tokenizer.decode(generated[0][prompt_len:], skip_special_tokens = True)
        # Collapse tabs and newlines so the row stays tab-separated
        output.append(" ".join(caption.split()))
      f.write(list_of_image_names[i] + "\t" + "\t".join(output) + "\n")

In [None]:
generate_captions()

In [None]:
# Download llama3.2_11b_vi_monuments_captions.txt
from google.colab import files
files.download('llama3.2_11b_vi_monuments_captions.txt')

In [None]:
'''
Feedback file will be in this format

img_name \t caption 0 \t caption 1 \t caption 2 \t caption 3 \t best_caption_number(-1,0,1,2,3) \t alternate_caption \n
'''
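
A short sketch of how one feedback row in this format could be read back and validated. `parse_feedback_row` is a hypothetical helper, and treating `-1` as "use the alternate caption" is an assumption about how the feedback is meant to be consumed.

```python
def parse_feedback_row(line):
    """Split one feedback line into named fields and validate the chosen index."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    img_name, *captions, best, alternate = fields
    best = int(best)
    if best not in (-1, 0, 1, 2, 3):
        raise ValueError(f"best_caption_number out of range: {best}")
    # Assumption: -1 means no generated caption was acceptable, so use the alternate.
    preferred = alternate if best == -1 else captions[best]
    return {"img_name": img_name, "captions": captions, "best": best, "preferred": preferred}

# Hypothetical rows, for illustration only.
picked = parse_feedback_row("m1.jpg\tcap a\tcap b\tcap c\tcap d\t2\t\n")
rejected = parse_feedback_row("m2.jpg\tcap a\tcap b\tcap c\tcap d\t-1\tA better caption\n")
print(picked["preferred"], "|", rejected["preferred"])
```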

In [41]:
count = 0
def get_next_image_and_captions():
  global count # count is module-level state shared with run_rlhf
  with open('llama3.2_11b_vi_monuments_captions.txt') as f:
    for i, line in enumerate(f):
      if i == count:
        # img_name is the .jpg file name; strip the trailing newline before splitting
        img_name, c0, c1, c2, c3 = line.rstrip('\n').split('\t')
        # image is the decoded image from the same row of the dataset
        image = indian_monuments_ds['train'][count]["image"]
        count += 1
        return image, c0, c1, c2, c3
  return None # No rows left

In [48]:
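thanks_message = "Done"
def run_rlhf(c0, c1, c2, c3, best_caption_number, alternate_caption):
  if c0: # Empty on the first click, before any captions have been loaded
    # Append so that feedback for earlier images is not overwritten
    with open('rlhf_llama3.2_11b_monuments.txt', 'a') as f:
      f.write(list_of_image_names[count-1] + "\t" + c0 + "\t" + c1 + "\t" + c2 + "\t" + c3 + "\t" + best_caption_number + "\t" + alternate_caption + "\n")
  next_item = get_next_image_and_captions()
  if next_item is None: # Every image has been reviewed
    return "No more images", None, "", "", "", ""
  # Flatten so the six return values match the six Gradio output components
  return (thanks_message, *next_item)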

In [None]:
import gradio as gr

css = """
  #output {
    height: 500px;
    overflow: auto;
    border: 1px solid #ccc;
  }
"""
# Components are created up front and placed into the layout with .render()
rlhf_btn = gr.Button("Ok, Next Image")
output_img = gr.Image(label="Input Picture")
c0 = gr.Textbox(label="Caption 0")
c1 = gr.Textbox(label="Caption 1")
c2 = gr.Textbox(label="Caption 2")
c3 = gr.Textbox(label="Caption 3")
best_caption_number = gr.Textbox(label="Choose best caption number -1(None),0,1,2,3")
alternate_caption = gr.Textbox(label="Your suggestion for an alternate caption")
response_output = gr.Textbox(label="Response") # Add a textbox for the response

with gr.Blocks(css=css) as demo:
    gr.Markdown("RLHF")
    with gr.Tab(label="Real or Kidding?"):
        with gr.Row():
          with gr.Column():
            rlhf_btn.render()
            # Each component is rendered exactly once, directly in the layout
            output_img.render()
            c0.render()
            c1.render()
            c2.render()
            c3.render()
          with gr.Column():
            best_caption_number.render()
            alternate_caption.render()
            response_output.render()
        # Register the handler once every input and output component is in the layout
        rlhf_btn.click(run_rlhf, [c0, c1, c2, c3, best_caption_number, alternate_caption], [response_output, output_img, c0, c1, c2, c3])

demo.launch(debug=True)

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

