# Vision Language Models


Vision language models (VLMs) take images and text as input and generate text. They handle many kinds of images, including documents and web pages, and support tasks such as chatting about an image, recognizing image content from instructions, and understanding documents. Some models can also localize content, either by outlining specific parts of an image or by reporting where objects are located in it. The available large vision language models differ in the data they were trained on, in how they encode images, and in their overall capabilities.

![image.png](attachment:image.png)


Source: [Hugging Face](https://huggingface.co/blog/vlms)

# How to select your VLM?

There are many open vision language models on the Hugging Face Hub.

![image.png](attachment:image.png)


* Source: [Hugging Face](https://huggingface.co/blog/vlms)

* Source: [Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)

# Benchmarking

* [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)
* [LMMS-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
* [MMMU](https://huggingface.co/datasets/MMMU/MMMU)
* [MMBench](https://huggingface.co/datasets/lmms-lab/MMBench)



# LLaVA Fine-tuning

Combining language and vision is a significant step forward for artificial intelligence, and among the models that do this, LLaVA is one of the most prominent open options.
Here, we look at what the Large Language-and-Vision Assistant (LLaVA) can do. Our main aim is to show AI developers how to fine-tune LLaVA, whether they use Colab Pro or their own setup, with easy-to-use code that simplifies the process.

Multimodal models are at the forefront of AI, bringing together the power of understanding language and images. Unlike traditional models that only focus on text, multimodal models like LLaVA handle both text and images, allowing them to understand and create responses that blend language and visuals seamlessly.



## How does it work?

LLaVA shows how language and vision work together effectively. At its core, LLaVA uses an advanced structure that combines a vision encoder with a Large Language Model (LLM). The vision encoder, CLIP ViT-L/14, is great at getting details from images, while the language model, Vicuna, is a more advanced version of the open-source LLaMA model, designed to follow instructions accurately.
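
As a rough illustration of what the vision encoder hands to the language model, the sketch below counts the patch tokens a ViT produces for a square image. The function name is hypothetical, and the 224 and 336 pixel input sizes are assumptions, since the text above names only the patch size of 14 implied by ViT-L/14.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches a ViT cuts a square image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


print(num_patch_tokens(224))  # 16 * 16 = 256
print(num_patch_tokens(336))  # 24 * 24 = 576
```

Each patch becomes one visual token, so a larger input resolution means a longer sequence for the language model to process.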

Training has two main phases. First, the model learns to connect visual features with language using pairs of images and text. Then it is trained on more complex visual instruction-following tasks. The second phase is more computationally demanding, but the resulting model performs well across a variety of tasks.

![image.png](attachment:image.png)

LLaVA's design is only part of the story; its benchmark results matter too. LLaVA 1.5 uses a multi-layer perceptron (MLP) as the connector between the vision encoder and the language model, where the original release used a single linear layer, and it adds task-specific data from academic sources. With these changes, LLaVA 1.5 improves on its predecessor and compares well with competing models.

Unlike closed models like GPT-4 Vision, LLaVA stands out as a leading open-source option. This openness encourages innovation and helps overcome potential limitations linked to closed models. While GPT-4 Vision remains strong, LLaVA's affordability, scalability, and impressive performance in various tests make it a compelling choice for those seeking open-source solutions.

# Pre-Training vs. Fine-Tuning

There are various ways to pretrain a vision language model. The main trick is to unify the image and text representation and feed it to a text decoder for generation. The most common and prominent models often consist of an image encoder, an embedding projector to align image and text representations (often a dense neural network) and a text decoder stacked in this order. As for the training parts, different models have been following different approaches.

LLaVA consists of a CLIP image encoder, a multimodal projector and a Vicuna text decoder. The authors fed a dataset of images and captions to GPT-4 and generated questions related to the caption and the image. The authors have frozen the image encoder and text decoder and have only trained the multimodal projector to align the image and text features by feeding the model images and generated questions and comparing the model output to the ground truth captions. After the projector pretraining, they keep the image encoder frozen, unfreeze the text decoder, and train the projector with the decoder. This way of pre-training and fine-tuning is the most common way of training vision language models.
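
The two-stage schedule described above can be summarized as a small freeze table. This is only a sketch: the stage and component names are hypothetical labels chosen for illustration, not identifiers from the LLaVA code base.

```python
# Hypothetical component names; True means the component's weights are updated.
STAGES = {
    "projector_pretraining": {
        "image_encoder": False,
        "projector": True,
        "text_decoder": False,
    },
    "fine_tuning": {
        "image_encoder": False,
        "projector": True,
        "text_decoder": True,
    },
}


def trainable(stage: str) -> list:
    """Names of the components that receive gradient updates in a stage."""
    return [name for name, is_trained in STAGES[stage].items() if is_trained]


for stage in STAGES:
    print(stage, "->", trainable(stage))
```

The image encoder stays frozen in both stages; only the set of components trained alongside the projector changes.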

![image.png](attachment:image.png)

## Using Vision Language Models with transformers

In [None]:
!pip install accelerate --q
!pip install -U "transformers>=4.39.0" --q
!pip install peft bitsandbytes --q
!pip install -U "trl>=0.8.3" --q

In [None]:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)
model.to(device)

We now pass the image and the text prompt to the processor, and then pass the processed inputs to `generate`.

In [None]:
from PIL import Image
import requests

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

# Keyword arguments avoid depending on the processor's positional argument order.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=100)

Call decode to decode the output tokens.

In [None]:
print(processor.decode(output[0], skip_special_tokens=True))
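
The decoded string generally contains the prompt as well as the answer, because `generate` returns the input tokens followed by the new ones. A minimal sketch of separating the two, assuming the `[/INST]` marker from the prompt above survives decoding; the helper name is hypothetical and the sample string is invented, not real model output.

```python
def extract_answer(decoded: str, end_of_prompt: str = "[/INST]") -> str:
    """Return the text after the last end-of-prompt marker, or everything if absent."""
    _, found, answer = decoded.rpartition(end_of_prompt)
    return answer.strip() if found else decoded.strip()


# Invented example string in the shape of a decoded prompt plus answer.
sample = "[INST]  \nWhat is shown in this image? [/INST] A radar chart comparing models."
print(extract_answer(sample))
```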

# Fine-tuning Vision Language Models with TRL

![image-3.png](attachment:image-3.png)

Source: [TRL](https://github.com/huggingface/trl)


TRL (Transformer Reinforcement Learning) is a Hugging Face library for training transformer models, with trainers for supervised fine-tuning (the `SFTTrainer` used later in this notebook) and for reinforcement-learning-based methods. The name reflects the idea of pairing transformers with reinforcement learning: a transformer encodes the current situation of an agent and its surroundings, which helps the agent make better choices and learn through trial and error.

## Intro

Let's say you're a student learning a new skill with a complicated textbook, so you seek help from a tutor who simplifies the material and gives you feedback.

Now, picture that you're not a student, but a machine learning agent in a tricky environment. You have a lot of data, but it's challenging to understand. This is where TRL comes into play.

TRL works like a tutor for these agents. It employs a kind of neural network known as a transformer to help the agent grasp its environment and improve its decisions. The transformer serves as a guide, emphasizing crucial details and filtering out distractions.

Like any good tutor, TRL also gives personalized feedback based on the agent's actions. This feedback helps the agent learn from its errors and make better choices moving forward.

TRL has shown promising results in various areas, such as gaming and robotics. It is expected to be useful in more complex and human-like tasks such as personalized language learning or making healthcare decisions.

In short, if we consider machine learning agents as students, TRL is the helpful tutor they need to excel.


## Background
Transformer Reinforcement Learning (TRL) is an innovative machine learning algorithm that combines two powerful techniques: transformers and reinforcement learning (RL).

Transformers were introduced by Vaswani et al. in 2017 as a neural network architecture for natural language processing. They are able to learn long-range dependencies between words in a sentence by using self-attention mechanisms. Transformers have achieved state-of-the-art performance in a variety of language-related tasks, such as language translation and text classification.
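
To make the self-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It omits the learned query, key, and value projections and the multiple heads that a real transformer layer has, and the toy vectors are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Three toy token vectors; every token attends to every token, near or far.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)
print(out)
```

Because every token scores itself against all others in one step, distance within the sequence does not limit which tokens can influence each other.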

Reinforcement learning, on the other hand, is a type of machine learning that allows an agent to learn by interacting with an environment. The agent takes actions in the environment and receives feedback in the form of rewards or penalties. The goal of the agent is to maximize its cumulative reward over time by learning which actions lead to the best outcomes.

By combining transformers and RL, TRL is able to handle complex, high-dimensional state spaces. The transformer component allows the agent to represent the state of the environment in a way that captures important features and filters out noise. The RL component allows the agent to learn from its actions and adjust its behavior accordingly.

## How does Transformer RL work?
At a high level, TRL operates as follows:

1. The agent observes the state of the environment and uses a transformer-based model to represent the state. The transformer helps the agent filter out irrelevant information and focus on the most important features.
2. The agent selects an action based on the current state, using a policy function that maps states to actions. The policy function is learned through RL, which allows the agent to learn from its past experiences and improve over time.
3. The agent receives feedback from the environment in the form of a reward signal, which indicates how well the agent is performing. The agent uses this feedback to update its policy function and adjust its behavior for future actions.
4. The process repeats, with the agent observing the new state of the environment, selecting a new action, receiving feedback, and updating its policy function.

By repeating this process over and over, the agent learns how to navigate the environment and maximize its cumulative reward over time.
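
The observe, act, receive reward, update cycle can be sketched with a toy agent. This is a deliberately simplified illustration: it has a single state, so the transformer state encoder is dropped entirely, the two-action environment is hypothetical, and epsilon-greedy is just one simple policy choice rather than anything the TRL library implements.

```python
import random

REWARDS = [0.0, 1.0]  # hypothetical environment: action 1 always pays more


def run_agent(steps: int = 200, epsilon: float = 0.2, seed: int = 0):
    """Learn per-action reward estimates by trial and error."""
    rng = random.Random(seed)
    values = [0.0, 0.0]  # the policy's estimate of the reward for each action
    counts = [0, 0]
    for _ in range(steps):
        # Select an action: sometimes explore at random, otherwise act greedily.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: values[a])
        reward = REWARDS[action]  # feedback from the environment
        counts[action] += 1
        # Update the estimate as a running mean of the rewards seen so far.
        values[action] += (reward - values[action]) / counts[action]
    return values


values = run_agent()
print(values)
```

Once exploration has tried the better action, its estimate rises above the other and the greedy branch keeps choosing it.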


## Architecture

Transformer RL uses sequence modeling to learn from past experiences and improve decision-making over time.

In simple terms, the architecture of Transformer RL consists of three main parts: the encoder, the decoder, and the value network. The encoder processes sequences of states and actions. The decoder generates future actions based on past ones. The value network estimates the rewards for certain actions, helping guide decisions.

The encoder uses a transformer structure, effective in tasks like language translation. It analyzes sequences using layers of attention and neural networks, focusing on relevant parts of the input to predict future outcomes.

Inputs to the encoder are state-action pairs converted into high-dimensional vectors. These go through transformer layers with attention mechanisms that pinpoint important sequence parts and neural networks that reveal complex relationships.

The encoder's output, a series of hidden states, feeds into the decoder. This part of the architecture, also built on the transformer model, creates sequences of actions. It uses a causal mask in its attention mechanism to ensure it only considers past data, crucial since future data isn't accessible in real-time decision-making.
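
The causal mask mentioned above can be illustrated on a small matrix of attention scores. In this sketch, position `i` may only attend to positions `0..i`, which gives the same result as setting the masked scores to negative infinity before the softmax. The scores are made-up numbers.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, ignoring positions j > i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # only the past and the present
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


scores = [
    [0.0, 5.0, 5.0],
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0],
]
for row in causal_attention_weights(scores):
    print(row)
```

Even though the first row has large scores for later positions, all of its weight stays on position 0, because the later positions are masked out.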

The decoder outputs possible actions, from which the next action is chosen based on a balance of exploration and learning.

Lastly, the value network calculates expected rewards from actions by taking the encoder's hidden states and outputting a reward value. It's trained on actual rewards received, refining its accuracy over time.
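
One common way to turn the rewards actually received into training targets for a value network is the discounted return. The sketch below assumes that formulation; the discount factor `gamma` and the function name are assumptions, since the text does not say how the targets are built.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```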


![image.png](attachment:image.png)


## Comparison

![image-2.png](attachment:image-2.png)

source: [The Power of Transformer Reinforcement Learning](https://dongreanay.medium.com/the-power-of-transformer-reinforcement-learning-5283ab1879c0)


## Load the model (4-bit quantized)


In [None]:
!pip install accelerate --q
!pip install bitsandbytes-cuda110

In [None]:
!pip install --upgrade huggingface_hub


In [None]:
import torch
from transformers import AutoTokenizer, AutoProcessor, TrainingArguments, LlavaForConditionalGeneration, BitsAndBytesConfig
from trl import SFTTrainer
from peft import LoraConfig


model_id = "llava-hf/llava-1.5-7b-hf"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
)

model = LlavaForConditionalGeneration.from_pretrained(model_id,
                                                      quantization_config=quantization_config,
                                                      torch_dtype=torch.float16)
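
A back-of-the-envelope calculation shows why 4-bit loading matters on a single GPU. This sketch assumes roughly 7 billion parameters (suggested by the "7b" in the model id) and counts weight storage only; it ignores quantization metadata, activations, and any modules kept in higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # assumption: about 7 billion parameters
print(weight_memory_gb(params, 16))  # 14.0 GB in float16
print(weight_memory_gb(params, 4))   # 3.5 GB in 4-bit
```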

## Create a chat template and set up the tokenizer and processor

In [None]:
# Adjacent string literals are concatenated, so code indentation does not leak into the prompt.
LLAVA_CHAT_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "{% for message in messages %}{% if message['role'] == 'user' %}"
    "USER: {% else %}ASSISTANT: {% endif %}"
    "{% for item in message['content'] %}{% if item['type'] == 'text' %}{{ item['text'] }}"
    "{% elif item['type'] == 'image' %}<image>{% endif %}{% endfor %}"
    "{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}"
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.chat_template = LLAVA_CHAT_TEMPLATE
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer = tokenizer
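
To see what the chat template is meant to produce, the sketch below mirrors its logic in plain Python, ignoring any incidental whitespace. It is for inspection only and is not how the tokenizer renders templates; the conversation is hypothetical and `</s>` is an assumed end-of-sequence token.

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
)


def render_chat(messages, eos_token="</s>"):
    """Plain-Python mirror of the template logic, for inspection only."""
    parts = [SYSTEM]
    for message in messages:
        is_user = message["role"] == "user"
        parts.append("USER: " if is_user else "ASSISTANT: ")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                parts.append("<image>")  # placeholder the processor later expands
        # User turns end with a space, assistant turns with the EOS token.
        parts.append(" " if is_user else eos_token)
    return "".join(parts)


messages = [  # hypothetical conversation
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is this?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "A cat."}]},
]
print(render_chat(messages))
```

The rendered string ends with `USER: <image>What is this? ASSISTANT: A cat.</s>`, which is the shape the model sees during fine-tuning.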

## Create a DataCollator

In [None]:
class LLavaDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            messages = example["messages"]
            text = self.processor.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=False
            )
            texts.append(text)
            images.append(example["images"][0])

        batch = self.processor(texts, images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels

        return batch

data_collator = LLavaDataCollator(processor)
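
The label-masking step in the collator can be shown on plain lists. The value `-100` is the label that PyTorch's cross-entropy loss ignores by default, so padded positions contribute nothing to the loss. The token ids and the pad id of `0` below are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, pad_token_id):
    """Copy token ids into labels, replacing padding with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        for row in input_ids
    ]


batch = [
    [5, 6, 7, 0, 0],      # hypothetical token ids, padded with 0
    [8, 9, 10, 11, 12],
]
print(mask_padding(batch, pad_token_id=0))
```

Note that only padding is masked: the prompt tokens stay in the labels, so the loss covers the user turns as well as the assistant turns.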

## Load the Dataset

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft")
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

In [None]:
train_dataset[0]

In [None]:
train_dataset.save_to_disk("/content/drive/MyDrive/MMRAG/LLAVA-Finetune/lava-instruct-mix-vsft-train")
eval_dataset.save_to_disk("/content/drive/MyDrive/MMRAG/LLAVA-Finetune/lava-instruct-mix-vsft-test")

## Set the Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="llava-1.5-7b-hf-ft-mix-vsft",
    report_to="tensorboard",
    learning_rate=1.4e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    logging_steps=5,
    num_train_epochs=1,
    push_to_hub=True,
    gradient_checkpointing=True,
    remove_unused_columns=False,
    fp16=True,
    bf16=False
)
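
To get a feel for how long a run with these arguments takes, the sketch below estimates the number of optimizer steps from the batch settings. It is an approximation that assumes the last partial batch is kept, and the dataset size is a made-up number, not the size of the dataset loaded above.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum=1, num_devices=1, epochs=1):
    """Approximate number of optimizer updates for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_examples / effective_batch) * epochs


# 10_000 examples is a made-up size used only for illustration.
print(optimizer_steps(10_000, per_device_batch=8))  # 1250
```

With `logging_steps=5`, a run of that length would log the training loss 250 times.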


## Set the LoRA config

### Brief Overview

LoRA makes fine-tuning more efficient by leaving the main weight matrix frozen and training two much smaller matrices (the update matrices), whose product is added to the frozen weights to produce the output. This reduces the number of trainable parameters, keeps the original weights unchanged so that they can be shared by several adapted models, and can be combined with other efficiency methods. Models fine-tuned with LoRA perform comparably to fully fine-tuned ones, and once the update matrices are merged into the base weights there is no added inference latency. In Transformer models, LoRA is typically applied to the attention blocks only, for simplicity and further parameter savings. The number of trainable parameters depends on the size of the update matrices, which is set mainly by the rank `r` and the shape of the original weight matrix.
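
The parameter saving is easy to check with arithmetic. This sketch compares a full update of a `d_out x d_in` weight matrix with its two rank-`r` factors; the 4096 x 4096 layer shape is an assumption chosen for illustration.

```python
def lora_param_counts(d_out: int, d_in: int, r: int):
    """Parameters in a full update versus rank-r factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora


# 4096 x 4096 is an assumed layer shape, not taken from the model config.
full, lora = lora_param_counts(4096, 4096, r=64)
print(full, lora, lora / full)  # 16777216 524288 0.03125
```

At rank 64, the update matrices for such a layer hold about 3% of the parameters of the full matrix.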

![lora.png](attachment:lora.png)

#### Common LoRA parameters in PEFT

As with other methods supported by PEFT, to fine-tune a model using LoRA, you need to:

1. Instantiate a base model.
2. Create a configuration (LoraConfig) where you define LoRA-specific parameters.
3. Wrap the base model with get_peft_model() to get a trainable PeftModel.
4. Train the PeftModel as you normally would train the base model.

LoraConfig allows you to control how LoRA is applied to the base model through the following parameters:

* r: the rank of the update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.
* target_modules: The modules (for example, attention blocks) to apply the LoRA update matrices.
* lora_alpha: LoRA scaling factor.
* bias: Specifies if the bias parameters should be trained. Can be 'none', 'all' or 'lora_only'.
* use_rslora: When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r.
* modules_to_save: List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. These typically include model’s custom head that is randomly initialized for the fine-tuning task.
* layers_to_transform: List of layers to be transformed by LoRA. If not specified, all layers in target_modules are transformed.
* layers_pattern: Pattern to match layer names in target_modules, if layers_to_transform is specified. By default PeftModel will look at common layer pattern (layers, h, blocks, etc.), use it for exotic and custom models.
* rank_pattern: The mapping from layer names or regexp expression to ranks which are different from the default rank specified by r.
* alpha_pattern: The mapping from layer names or regexp expression to alphas which are different from the default alpha specified by lora_alpha.

Source: [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)

In [None]:
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear"
)
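
Using the scaling rules listed above, the effect of `r=64` and `lora_alpha=16` can be worked out directly. The helper below is a hypothetical illustration of those two formulas, not a function from PEFT.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weights."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


print(lora_scaling(16, 64))                   # 0.25
print(lora_scaling(16, 64, use_rslora=True))  # 2.0
```

With the default rule, this configuration scales the update by 0.25, so raising `r` without raising `lora_alpha` shrinks the adapter's contribution.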

## Create the SFTTrainer object

In [None]:
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    dataset_text_field="text",  # need a dummy field
    tokenizer=tokenizer,
    data_collator=data_collator,
    dataset_kwargs={"skip_prepare_dataset": True},
)

## Load and set up TensorBoard for logging

In [None]:
%load_ext tensorboard
# Point TensorBoard at the Trainer's output_dir, where its logs are written by default.
%tensorboard --logdir llava-1.5-7b-hf-ft-mix-vsft

## Start the training!

In [None]:
trainer.train()


## Push the model to the HF Hub

In [None]:
trainer.push_to_hub()