
# Finetuning a Vision-Language Model (Overview)

In this tutorial, we'll use LoRA training and SFT to guide a large vision-language model to produce short, concise answers grounded in visual input.

Specifically, we'll use the Oumi framework to streamline the process and achieve high-quality results fast.

We'll cover the following topics:
1. Prerequisites
2. Data Preparation & Sanity Checks
3. Training Config Preparation
4. Launching Training
5. Inference

# Prerequisites
## Oumi Installation
First, let's install Oumi and some additional packages. For this notebook you will need access to at least one NVIDIA or AMD GPU **with about 30 GB of memory**.

You can find detailed installation instructions [here](https://github.com/oumi-ai/oumi/blob/main/README.md), but for Oumi it should be as simple as:

```bash
pip install -e ".[gpu]"
```
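
Before moving on, it can help to confirm that the packages this notebook imports are visible to the current kernel. The sketch below uses only the standard library; the helper name `missing_packages` and the list of module names checked are illustrative assumptions, not part of Oumi.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Top-level modules imported later in this notebook (illustrative list).
required = ["oumi", "torch", "PIL"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, re-run the installation step in the same environment that backs the notebook kernel.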

In [None]:
# Additionally, install the following packages for widget visualization.
!pip install ipywidgets

# And deactivate the parallelism warning from the tokenizers library.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Deactivate relevant HF warnings

## Creating our working directory

For our experiments, we'll use the following folder to save the model, training artifacts, and our inference and training configs.

In [2]:
from pathlib import Path

tutorial_dir = "vision_language_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)

In what follows we use Meta's [Llama-3.2-11B-Vision-Instruct](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) model.

Llama-3.2-11B-Vision-Instruct is a high-performing, instruction-tuned multimodal model that uses a moderate amount of resources (11B parameters).

We will finetune this model with the [vqav2-small](https://huggingface.co/datasets/merve/vqav2-small) dataset, which will help the model respond in __a succinct manner__ to visually grounded questions.

The principles presented here are generic and "Oumi-flexible". 

To repeat this experiment with other models/data you can simply replace e.g., the `model_name` (a string) with the names of other supported models (see [here](https://oumi.ai/docs/en/latest/resources/models/supported_models.html)) and adapt the configurations.
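
When swapping in another model, a back-of-the-envelope estimate of the memory needed just to hold the weights is a useful first check against your GPU. The sketch below assumes a nominal parameter count and counts weights only; activations, LoRA adapters, optimizer state, and framework overhead come on top and are not modeled here. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes used per parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Approximate memory (in GB, 1 GB = 1e9 bytes) to store the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal 11B parameters, as in the model name; the exact count may differ.
print(f"bfloat16: {weight_memory_gb(11e9, 'bfloat16'):.1f} GB")
print(f"float32:  {weight_memory_gb(11e9, 'float32'):.1f} GB")
```

At two bytes per parameter, a nominal 11B-parameter model needs about 22 GB for its weights alone, so a GPU with noticeably more memory than that is needed once working memory is included.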

## First, let's initialize our dataset and build a tokenizer and an underlying data processor.


In [None]:
from oumi.builders import build_tokenizer
from oumi.core.configs import ModelParams
from oumi.datasets.vision_language.vqav2_small import Vqav2SmallDataset

model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"

tokenizer = build_tokenizer(ModelParams(model_name=model_name))

dataset = Vqav2SmallDataset(
    tokenizer=tokenizer,
    processor_name=model_name,
    limit=1000,  # Limit the number of examples for demonstration purposes (!)
)

print("\nExamples included:", len(dataset))

### Now let's see a few examples to get a feel for the dataset we are going to use.

In [None]:
import io

from PIL import Image

from oumi.core.types.conversation import Type

num_examples_to_display = 3

for i in range(num_examples_to_display):
    conversation = dataset.conversation(i)  # Retrieve the i-th example (conversation)

    print(f"Example {i}:")

    for message in conversation.messages:
        if message.role == "user":  # The `user` poses a question, regarding an image
            img_content = message.content[0]
            assert (
                img_content.type == Type.IMAGE_BINARY
            ), "Oumi encodes image content in binary for VQA-Small."

            image = Image.open(io.BytesIO(img_content.binary))
            image.save(f"{tutorial_dir}/example_{i}.png")  # Save the image locally
            display(image)

        print(f"{message.role}: {message.content}")
    print("\n")

As the examples above show, the ground-truth answers are **very short and succinct**, which is an advantage in scenarios where we want the model to generate concise answers.
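
To put a number on "short", one option is to count the words in each ground-truth answer. The sketch below works on a plain list of strings; the sample answers are made up for illustration, and how you collect the strings from `dataset` (for example, from the assistant message of each conversation) is left as an assumption.

```python
from statistics import mean


def answer_length_stats(answers):
    """Return (mean, max) word counts for a list of answer strings."""
    lengths = [len(answer.split()) for answer in answers]
    return mean(lengths), max(lengths)


# Hypothetical answers in the style of a short-answer VQA dataset.
sample_answers = ["yes", "2", "white", "on the table"]
avg_words, max_words = answer_length_stats(sample_answers)
print(f"mean words: {avg_words:.2f}, max words: {max_words}")
```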

In [None]:
# To inspect the underlying data directly (it is stored in a pandas DataFrame),
# run the following command:
dataset.data.head()

## Initial Model Responses

Let's now see how this model performs on a given prompt without any finetuning.
- For this, we will create and execute an `inference configuration` stored in a YAML file.

In [None]:
%%writefile $tutorial_dir/infer.yaml

model:
  model_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"
  torch_dtype_str: "bfloat16" # Good choice if you have access to Ampere or newer GPU
  chat_template: "llama3-instruct"
  model_max_length: 1024
  trust_remote_code: False # For other models this might need to be set to True
  
generation:
  max_new_tokens: 128
  batch_size: 1
  
engine: NATIVE 
# Let's use the `native` engine (i.e., the underlying machine's default)
# for inference.
# You can also consider VLLM for much faster inference if you are working with a GPU.
# To install an Oumi tested/compatible version, use:
# pip install "vllm>=0.6.3,<0.7.0"

In [None]:
from oumi.core.configs import InferenceConfig
from oumi.core.types.conversation import Conversation, Message, Role
from oumi.inference import NativeTextInferenceEngine

# Note: the *first* time you run inference, it will take a few minutes to download
# and cache the model (assuming you do not already have it downloaded locally).
inference_config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "infer.yaml"))
inference_engine = NativeTextInferenceEngine(inference_config.model)

example = dataset.conversation(1)
example = Conversation(messages=example.filter_messages(Role.USER))
inference_engine.infer([example], inference_config)

In [None]:
# Clean up to free the GPU memory used for inference above:
# delete the inference_engine and collect garbage.
import gc

import torch

del inference_engine
gc.collect()

# Clear GPU memory
torch.cuda.empty_cache()

In [11]:
# Note: You can run the same inference directly with our CLI (terminal) instead of the
# Python API. E.g., uncomment the last line of this cell and execute it:

conversation_id = 1
query_text_string = (
    dataset.conversation(conversation_id).messages[0].text_content_items[0].content
)
print(f"\n{query_text_string}")

# !echo "$query_text_string" | oumi infer -c "$tutorial_dir/infer.yaml" -i --image="$tutorial_dir/example_1.png" # noqa: E501

OK! As you can see, by default this model gives quite __verbose__ responses. Can we change this behavior?

## Preparing our training experiment
 - Specifically, let's create and execute a YAML file with our _training_ config!
 - You can find many more details about the listed hyper-parameters in our [docs](https://oumi.ai/docs/en/latest/user_guides/train/training_methods.html).
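
The config in this section enables LoRA (`use_peft: True`) with rank `lora_r: 8`. For a linear layer with a `d_in x d_out` weight matrix, LoRA keeps that matrix frozen and trains two low-rank factors of shapes `d_in x r` and `r x d_out`, so the trainable parameter count per adapted layer is `r * (d_in + d_out)`. The sketch below makes that arithmetic concrete; the 4096 x 4096 layer size is a hypothetical example, not a dimension read from the model, and the helper names are illustrative.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors LoRA adds to one linear layer."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix (bias ignored)."""
    return d_in * d_out


d_in, d_out, rank = 4096, 4096, 8  # hypothetical layer size; rank as in the config
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA params: {lora:,} ({100 * lora / full:.2f}% of {full:,})")
```

Under the standard LoRA formulation, the update is scaled by `lora_alpha / lora_r`, which equals 1 for the values used in this config.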

In [None]:
%%writefile $tutorial_dir/train.yaml

model:
  model_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"
  torch_dtype_str: "bfloat16"
  model_max_length: 1024  
  attn_implementation: "sdpa"
  chat_template: "llama3-instruct"
  freeze_layers:
    - "visual"     # Let's finetune only the language component of the model

data:
  train:
    collator_name: "vision_language_with_padding" # Simple padding collator
    use_torchdata: True

    datasets:
      - dataset_name: "merve/vqav2-small"
        split: "validation" # This dataset has only a validation split
        shuffle: True
        seed: 42
        transform_num_workers: "auto"
        dataset_kwargs:
          # The default for our model:
          processor_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"           
          limit: 1000 # Again, we downsample to 1000 examples for demonstration 
                      # purposes only.
          return_tensors: True      

training:
  output_dir: "vision_language_tutorial"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: True
  # You can decrease the following two params if you run out of memory
  per_device_train_batch_size: 2 
  gradient_accumulation_steps: 8 # Thus effective batch size is 2x8=16 on a single GPU
  use_peft: True
  
  # **NOTE**
  # We set `max_steps` to 10 steps to first verify that training works
  # Swap to `num_train_epochs: 1` to get more meaningful results
  # (One training epoch will take ~25 mins on a single A100-40GB GPU)
  max_steps: 10
  # num_train_epochs: 1

  gradient_checkpointing_kwargs:
    # Reentrant docs: https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint
    use_reentrant: False
  ddp_find_unused_parameters: False
  empty_device_cache_steps: 1

  optimizer: "adamw_torch_fused"
  learning_rate: 2e-5
  warmup_ratio: 0.03
  weight_decay: 0.0
  lr_scheduler_type: "cosine"

  logging_steps: 5
  save_steps: 0
  dataloader_main_process_only: False
  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 16
  include_performance_metrics: True
  enable_wandb: True # Set to False if you don't want to use Weights & Biases
  
peft: # Our LoRA configuration; we target several layers  
  lora_r: 8
  lora_alpha: 8
  lora_dropout: 0.1
  lora_target_modules:
    - "q_proj"
    - "v_proj"
    - "o_proj"
    - "k_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"
  lora_init_weights: GAUSSIAN

# The lines below only take effect if you have access to multiple GPUs.
# If you do, please uncomment them to train with all available GPUs:

# fsdp:
#   enable_fsdp: True
#   sharding_strategy: "HYBRID_SHARD"
#   forward_prefetch: True
#   auto_wrap_policy: "TRANSFORMER_BASED_WRAP"
#   transformer_layer_cls: "MllamaSelfAttentionDecoderLayer,MllamaCrossAttentionDecoderLayer,MllamaVisionEncoderLayer"
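
The batch-size comment in the config above can be turned into a quick calculation, which is handy when you lower `per_device_train_batch_size` or `gradient_accumulation_steps` to fit in memory. This is a sketch with hypothetical helper names; it assumes one optimizer step per accumulated batch and that a final partial batch still counts as a step.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_devices


def optimizer_steps_per_epoch(num_examples, batch_size):
    """Optimizer steps needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=8)
print("effective batch size:", batch)
print("steps per epoch for 1000 examples:", optimizer_steps_per_epoch(1000, batch))
print("examples seen in max_steps=10:", 10 * batch)
```

With these values, `max_steps: 10` covers only 160 of the 1000 examples, which is why it serves as a smoke test rather than a full run.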

In [None]:
## Let's launch the training!

!oumi train -c "$tutorial_dir/train.yaml"

# Or, if you have multiple GPUs you want to use:
# !oumi distributed torchrun -m oumi train -c "$tutorial_dir/train.yaml"
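
With `logging_steps: 5`, the run above reports the learning rate every few steps. The sketch below evaluates the usual linear-warmup plus cosine-decay formula for the values in the training config (`learning_rate: 2e-5`, `warmup_ratio: 0.03`, `max_steps: 10`), so you know roughly what to expect in the logs. It assumes warmup steps are computed as `ceil(warmup_ratio * total_steps)` and a single half-cosine decay to zero; the trainer's exact schedule may differ in details, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate at `step` under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 5, 10):
    lr = cosine_lr(step, total_steps=10, peak_lr=2e-5, warmup_ratio=0.03)
    print(step, f"{lr:.2e}")
```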

## Finally, let's use the Fine-tuned Model and see the effect of training!

Once we're happy with the results, we can serve the fine-tuned model for inference:

In [None]:
%%writefile $tutorial_dir/trained_infer.yaml

model:
  model_name: "meta-llama/Llama-3.2-11B-Vision-Instruct"  
  adapter_model: "vision_language_tutorial" # Directory with our saved LoRA parameters!
  torch_dtype_str: "bfloat16"
  chat_template: "llama3-instruct"
  model_max_length: 1024
  trust_remote_code: False

generation:
  max_new_tokens: 256
  batch_size: 1
  
engine: NATIVE

In [None]:
config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "trained_infer.yaml"))
inference_engine = NativeTextInferenceEngine(config.model)

example = dataset.conversation(1)
example = Conversation(messages=example.filter_messages(Role.USER))
inference_engine.infer([example], config)  # Use the config of the fine-tuned model

In [None]:
# Or, if you want to test it with your own image/question pair:
from oumi.core.types.conversation import ContentItem
from oumi.utils.image_utils import load_image_png_bytes_from_path

your_image_path = f"{tutorial_dir}/example_1.png"  # "Replace with your image path!"
image_bytes = load_image_png_bytes_from_path(your_image_path)

conversation = Conversation(
    messages=[
        Message(
            role=Role.USER,
            content=[
                ContentItem(type=Type.IMAGE_BINARY, binary=image_bytes),
                ContentItem(type=Type.TEXT, content="Your question here!"),
            ],
        )
    ]
)

inference_engine.infer([conversation], config)
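
Looking at one response is a useful smoke test, but a small numeric check makes the before/after comparison easier. The sketch below scores predictions against ground-truth answers with a normalized exact match. This is a simplified stand-in, not the official VQAv2 metric; the helper names and sample strings are made up, and collecting real predictions from the inference engine is left to you.

```python
import string


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and ground-truth answers.
predictions = ["Yes.", "The image shows two dogs playing.", "white"]
references = ["yes", "2", "White"]
print(f"exact match: {exact_match_rate(predictions, references):.2f}")
```

A verbose response rarely matches a one-word reference exactly, so this score would be expected to rise as the model learns to answer succinctly.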