In [None]:
# Copyright 2024 Reddit, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Use Case 1. Image Short Captions

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/reddit/kdd2024-tutorial-breaking-barriers/blob/master/Use_Case_1_Image_Short_Captions.ipynb)

## Overview

This notebook provides a hands-on guide to deploying and prompting different multimodal LLMs to generate short, descriptive captions for images. The challenges and limitations of using LLMs for image captioning will be discussed.

---

## Setting Up Google Colab
Google Colab provides a convenient platform to run Python code in the cloud, with access to powerful computing resources, including GPUs. For this tutorial, it is recommended to enable GPU acceleration:

1.   Click on *Runtime* in the top menu.
2.   Select *Change runtime type*.
3.   In the dialog that appears, under *Hardware accelerator*, choose **T4 GPU** (or any other GPU that you may have access to) if it is not already enabled.
4.   Click *Save*.

---

## Requirements

The code in this notebook is based on Transformers, so we need to install all necessary requirements.

In [None]:
# Install required Python packages
!pip install transformers bitsandbytes accelerate flash_attn

---

## Settings

Run the following cells to apply some convenience settings.

In [None]:
# Disable Transformer warnings
import logging
logging.basicConfig(level=logging.INFO)

import transformers
transformers.logging.set_verbosity_error()

import warnings
warnings.filterwarnings('ignore')

# Set GPU device
import torch
torch.set_default_device("cuda") # or "cpu" if a GPU is not available

# Import Garbage Collector
import gc

Run the following cell to display the run time of every cell execution:

In [None]:
!pip install ipython-autotime
%load_ext autotime

Run the following cell to enable line wrapping when printing long strings:

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML("""
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """))
get_ipython().events.register("pre_run_cell", set_css)

---

## Test Picture

We will use the first image in the tutorial dataset.

In [None]:
# Download images

from PIL import Image
import requests

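def download_image(url):
  # Fetch the image and fail early on HTTP errors (e.g. a wrong URL)
  response = requests.get(url, stream=True, timeout=30)
  response.raise_for_status()
  return Image.open(response.raw)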

In [None]:
image = download_image("https://raw.githubusercontent.com/reddit/kdd2024-tutorial-breaking-barriers/main/media/image1.jpg")
image

---

## Get Ready!

**❗Important❗** Despite our attempts to free the memory after testing each model, your GPU may run out of memory.

If this happens, you will need to restart your Colab session. To do so, go to the *Runtime* menu, click on *Restart session*, and re-run the cells in the **Settings** and **Test Picture** sections. Then continue testing the next model.
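
To see how close the session is to that limit before loading the next model, the free GPU memory can be inspected. The helper below is a minimal sketch: `gpu_memory_report` is a hypothetical name introduced here, and it relies on PyTorch's `torch.cuda.mem_get_info()`, which returns the free and total device memory in bytes.

```python
import importlib.util


def gpu_memory_report() -> str:
    """Summarize free and total GPU memory, or explain why it is unavailable."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if not torch.cuda.is_available():
        return "No CUDA device available"
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    gib = 1024 ** 3
    return f"{free_bytes / gib:.1f} GiB free of {total_bytes / gib:.1f} GiB"


print(gpu_memory_report())
```

If the reported free memory is much smaller than expected after a cleanup cell, restarting the session as described above is usually the quickest fix.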

---
## Model 1: LLaVA-1.5-7B

LLaVA (proposed in [1] and improved in [2]) is an open-source auto-regressive language model, based on the transformer architecture, trained by fine-tuning Llama/Vicuna on GPT-generated multimodal instruction-following data. It is sometimes seen as an "open-source version of GPT-4".

##### **References**

[1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. [arXiv:2304.08485](https://arxiv.org/abs/2304.08485) [cs.CV]

[2] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved Baselines with Visual Instruction Tuning. [arXiv:2310.03744](https://arxiv.org/abs/2310.03744) [cs.CV]

In [None]:
# Load Transformer pipeline

from transformers import BitsAndBytesConfig
from transformers import pipeline

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
pipe = pipeline("image-to-text",
                model="llava-hf/llava-1.5-7b-hf",
                model_kwargs={"quantization_config": quantization_config})
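
The 4-bit setting above matters mainly because of memory. The following back-of-the-envelope estimate is only a sketch: it assumes a round figure of 7 billion parameters and counts the weights alone, ignoring activations, the key-value cache, and any quantization overhead. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate memory taken by the model weights alone, in GB (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


num_parameters = 7e9  # assumed round figure for a "7B" model
for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(num_parameters, bits):.1f} GB")
```

Under these assumptions, the weights shrink from roughly 14 GB at 16 bits to roughly 3.5 GB at 4 bits, which leaves far more headroom on a single Colab GPU.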

It is important to prompt the model with a specific format, which is:
```bash
USER: <image>\n<prompt>\nASSISTANT:
```
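
To avoid typos when trying several instructions, the format above can be produced by a small helper. This is a minimal sketch; `build_llava_prompt` is a hypothetical name introduced here for illustration.

```python
def build_llava_prompt(instruction: str) -> str:
    """Wrap an instruction in the USER/ASSISTANT format shown above."""
    return f"USER: <image>\n{instruction.strip()}\nASSISTANT:"


print(build_llava_prompt("Generate a short caption for the image"))
```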

In [None]:
# Set prompt and max output tokens
prompt = "USER: <image>\nDescribe the image in detail\nASSISTANT:"
max_new_tokens = 256

# Prompt
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": max_new_tokens})

# Get caption (everything after the final "ASSISTANT:" marker)
caption = outputs[0]["generated_text"].split("ASSISTANT:")[-1].strip()

# Display caption
print(caption)

Now test with different prompts. For instance:
- `Generate a short caption for the image`
- `Write a very short caption for the image with less than 20 words`
- `Describe what you see in the picture`
- `You are an assistant for a person with visual impairment. Describe the picture in detail so that the person can have a full idea of the contents. But make it short, without overwhelming the user with unimportant details of the picture.`
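
When a prompt states a length constraint such as "less than 20 words", it is worth checking whether the model actually respects it. The check below is a minimal sketch that counts whitespace-separated words; `within_word_limit` and the sample caption are made up for illustration.

```python
def within_word_limit(caption: str, max_words: int = 20) -> bool:
    """Return True if the caption has fewer than `max_words` whitespace-separated words."""
    return len(caption.split()) < max_words


sample_caption = "A dog runs across a sunny beach."  # made-up caption for illustration
print(within_word_limit(sample_caption))
```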


In [None]:
# Clean memory (as far as possible)
del pipe
del outputs
gc.collect()
torch.cuda.empty_cache()
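
The same cleanup steps are repeated after every model in this notebook, so they can be wrapped in a helper. This is a sketch under the assumption that the caller has already deleted its own references with `del`, since a function cannot remove variables from the calling scope; `free_memory` is a hypothetical name.

```python
import gc


def free_memory() -> int:
    """Run the garbage collector and, if CUDA is in use, release cached GPU blocks.

    Returns the number of unreachable objects found by the garbage collector.
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected  # nothing GPU-related to clean up
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


# Usage: drop your own references first (e.g. `del pipe, outputs`), then call:
free_memory()
```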

---

## Model 2: nanoLLaVA

[nanoLLaVA](https://huggingface.co/qnguyen3/nanoLLaVA) is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.

It is based on the [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) small base LLM combined with a [CLIP-SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder.

Quyen-SE is a fine-tuned version of the small but capable Qwen1.5-0.5B model, with a 32k-token context window. SigLIP is a variant of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip), a multimodal model, trained with a sigmoid-based loss function in place of CLIP's softmax-based contrastive loss.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "qnguyen3/nanoLLaVA",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "qnguyen3/nanoLLaVA",
    trust_remote_code=True)

The prompt follows the ChatML format, but without a `\n` after each `<|im_end|>`.
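
For illustration, this layout can be reproduced by hand. The function below is a sketch, not the tokenizer's actual template: `render_chatml` is a hypothetical helper, and the real template used by `apply_chat_template` may add further elements, such as a default system message.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML, with no newline after each <|im_end|>."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so that the model continues from here
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "<image>\nDescribe the image in detail"}]
print(render_chatml(messages))
```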

In [None]:
# Text prompt
prompt = 'Describe the image in detail'

# Build actual prompt
messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(text)

In [None]:
# Process text
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# Process image
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# Generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]
caption = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True)

# Display caption
print(caption)

This model is clearly much more verbose than the previous one, so the prompt has to be adapted.
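
The text-processing step in the generation cell splits the prompt at `<image>`, tokenizes each part, and joins the parts with the placeholder index `-200`, which marks where the image features are inserted. The sketch below reproduces that splice with a toy tokenizer; `toy_tokenize` and `splice_image_token` are hypothetical names, and the text is assumed to contain `<image>`.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id used in the generation cell


def toy_tokenize(text):
    """Stand-in tokenizer: map each whitespace-separated word to its length."""
    return [len(word) for word in text.split()]


def splice_image_token(text, tokenize, image_token_index=IMAGE_TOKEN_INDEX):
    """Tokenize the text around '<image>' and insert the placeholder id between."""
    before, after = text.split("<image>", 1)
    return tokenize(before) + [image_token_index] + tokenize(after)


print(splice_image_token("user <image> Describe the image", toy_tokenize))
```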

In [None]:
# Clean memory (as far as possible)
del model
del tokenizer
del text_chunks
del input_ids
del image_tensor
del output_ids
gc.collect()
torch.cuda.empty_cache()

---

## Model 3: Phi-3-vision-128k-instruct

[Phi-3-Vision-128K-Instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.

In [None]:
# Load model

from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    torch_dtype="auto",
    device_map="cuda",
    trust_remote_code=True,
    _attn_implementation="eager") # "flash_attention_2" to enable flash attention
processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True)

In [None]:
# Prompt
prompt = "Describe the image in detail"

# Process inputs
messages = [
    {"role": "user", "content": f"<|image_1|>\n{prompt}"}
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

# Generate
generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
}
generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args)

# Remove input tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
caption = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False)[0]

# Display caption
print(caption)
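
Because `do_sample` is `False`, generation in the cell above is greedy: at each step the token with the highest score is chosen, so the output is deterministic and the temperature setting has no influence on it. The toy example below illustrates this on made-up scores; `greedy_pick` is a hypothetical helper, not part of Transformers.

```python
def greedy_pick(scores):
    """Return the index of the highest score (the greedy decoding choice)."""
    return max(range(len(scores)), key=lambda i: scores[i])


scores = [0.1, 2.0, -1.0]  # made-up next-token scores
for temperature in (0.5, 1.0, 2.0):
    scaled = [s / temperature for s in scores]
    # Dividing by a positive temperature never changes which score is largest
    print(temperature, greedy_pick(scaled))
```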

Check out some other interesting use cases of this model:
[6 Real-World Uses of Microsoft’s Newest Phi-3 Vision-Language Model](https://towardsdatascience.com/6-real-world-uses-of-microsofts-newest-phi-3-vision-language-model-8ebbfa317fe8)

In [None]:
# Clean memory (as far as possible)
del model
del processor
del inputs
del generate_ids
gc.collect()
torch.cuda.empty_cache()

---

## Model 4: imp-v1-3b

The **Imp project** aims to provide a family of strong multimodal small language models. Their [imp-v1-3b](https://huggingface.co/MILVLG/imp-v1-3b) is one of those models: with only 3B parameters, it is built upon the small yet powerful Phi-2 (2.7B) and the powerful visual encoder SigLIP, and trained on the LLaVA-v1.5 training set. According to its authors, this model significantly outperforms counterparts of similar size and even achieves slightly better performance than the strong LLaVA-7B model on various multimodal benchmarks.

### Getting Started

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "MILVLG/imp-v1-3b",
    trust_remote_code=True)

In [None]:
# Text prompt
#prompt = "Describe the image in detail"
#prompt = "Generate a short caption for the image"
prompt = "Write a very short caption for the image with less than 20 words"
#prompt = "Describe what you see in the picture"

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(text)

In [None]:
# Process text and image
input_ids = tokenizer(text, return_tensors="pt").input_ids
image_tensor = model.image_preprocess(image)

# Generate the answer
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]
caption = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True)

# Display caption
print(caption)

In [None]:
# Clean memory (as far as possible)
del model
del tokenizer
del input_ids
del image_tensor
del output_ids
gc.collect()
torch.cuda.empty_cache()

### Improved Implementation (MLLMv1)

The following code wraps the steps above in a reusable class that should be easier to integrate into a processing workflow.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

class MLLMv1:

  def __init__(self):
    torch.set_default_device("cuda")
    self.vision_model = AutoModelForCausalLM.from_pretrained(
      "MILVLG/imp-v1-3b",
      torch_dtype=torch.float16,
      device_map="auto",
      trust_remote_code=True)
    self.vision_tokenizer = AutoTokenizer.from_pretrained(
        "MILVLG/imp-v1-3b",
        trust_remote_code=True)

  def get_image_caption(self,
                        image: Image.Image,
                        base_prompt="Write a very short caption for the image with less than 20 words") -> str:
    return self.prompt_llm(image, base_prompt)

  def get_image_description(self,
                            image: Image.Image,
                            base_prompt="Write a short description for the image") -> str:
    return self.prompt_llm(image, base_prompt)

  def prompt_llm(self,
                 image: Image.Image,
                 prompt: str,
                 max_new_tokens: int = 256,
                 temperature: float = 0.9,
                 top_k: int = 50,
                 top_p: float = 0.95) -> str:
    # Build an image prompt when an image is given, otherwise a text-only prompt
    if image is not None:
      text = self.vision_tokenizer.apply_chat_template(
          [{"role": "user", "content": f"<image>\n{prompt}"}],
          tokenize=False,
          add_generation_prompt=True
      )
      image_tensor = self.vision_model.image_preprocess(image)
    else:
      text = self.vision_tokenizer.apply_chat_template(
          [{"role": "user", "content": f"{prompt}"}],
          tokenize=False,
          add_generation_prompt=True
      )
      image_tensor = None
    input_ids = self.vision_tokenizer(text, return_tensors="pt").input_ids
    output_ids = self.vision_model.generate(
      input_ids,
      max_new_tokens=max_new_tokens,
      images=image_tensor,
      temperature=temperature,
      do_sample=True,
      top_k=top_k,
      top_p=top_p,
      use_cache=True)[0]
    response = self.vision_tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True)
    # Collapse newlines and repeated spaces into single spaces
    response = " ".join(response.split())
    return response
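
`prompt_llm` samples its output using `temperature`, `top_k`, and `top_p`. As a simplified illustration of what these parameters do to the next-token distribution, the sketch below applies them to made-up scores. It is not the Transformers implementation, and `filter_next_token_probs` is a hypothetical name.

```python
import math
import random


def filter_next_token_probs(scores, temperature=0.9, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [s / temperature for s in scores]
    # Top-k: keep only the k highest-scoring tokens
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens
    peak = scaled[ranked[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in ranked}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


probs = filter_next_token_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.8)
token = random.choices(list(probs), weights=list(probs.values()))[0]
print(probs, token)
```

In this toy case only the two highest-scoring tokens survive the filtering, and the sampled token is drawn from those two. Lower values of `temperature`, `top_k`, or `top_p` make captions more predictable; higher values make them more varied.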

In [None]:
# Load model
ic = MLLMv1()

In [None]:
# Test one image
image = download_image("https://raw.githubusercontent.com/reddit/kdd2024-tutorial-breaking-barriers/main/media/image1.jpg")
display(image)
print(ic.get_image_caption(image))

In [None]:
# Test all images
for image_name in ["image1.jpg", "image2.png", "image3.png", "image4.png", "image5.png", "image6.png"]:
  image = download_image(f"https://raw.githubusercontent.com/reddit/kdd2024-tutorial-breaking-barriers/main/media/{image_name}")
  display(image)
  print(ic.get_image_caption(image))

In [None]:
# Clean memory (as far as possible)
del ic
gc.collect()
torch.cuda.empty_cache()

---

## Other Models

### Qwen-VL

Family of vision models [1], significantly upgraded for detailed recognition capabilities and text recognition abilities.

- [GitHub project](https://github.com/QwenLM/Qwen-VL)
- [Online demo](https://huggingface.co/spaces/Qwen/Qwen-VL-Max)

##### **References**
[1] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. [arXiv:2308.12966](https://arxiv.org/abs/2308.12966) [cs.CV]


### CogVLM 2

A new generation of strong open-source models [1] based on Meta-Llama-3-8B-Instruct, supporting an 8K context length, an image resolution of up to 1344 × 1344, and output in both Chinese and English.

- [GitHub project](https://github.com/THUDM/CogVLM2)
- [Model in Huggingface Hub](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B)
- [Online demo](https://huggingface.co/spaces/THUDM/CogVLM-CogAgent)

##### **References**
[1] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. 2023. CogAgent: A Visual Language Model for GUI Agents. [arXiv:2312.08914](https://arxiv.org/abs/2312.08914) [cs.CV]

### MiniCPM-Llama3-V-2.5

MiniCPM-V [1] is a series of end-side multimodal LLMs designed for vision-language understanding, described by its authors as "a GPT-4V level multimodal LLM on your phone". MiniCPM-Llama3-V 2.5 is the latest and most capable model in the series. According to its authors, with a total of 8B parameters the model surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 in overall performance. The model also supports multimodal conversation in over 30 languages, including English, Chinese, French, Spanish, and German.

- [Project in GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Model in Huggingface Hub](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5)
- [Online demo](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5)

##### **References**

[1] MiniCPM-V Team. MiniCPM-V 2.0: An Efficient End-side MLLM with Strong OCR and Understanding Capabilities. Online: https://openbmb.vercel.app/minicpm-v-2-en

### Florence-2

Florence-2 [1] is an advanced vision foundation model trained by Microsoft with a sequence-to-sequence architecture that uses a prompt-based approach to handle a wide range of vision and vision-language tasks like captioning, object detection, and segmentation.

- [Model in Huggingface Hub](https://huggingface.co/microsoft/Florence-2-large)
- [Sample notebook](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb)
- [Online demo](https://huggingface.co/spaces/SixOpen/Florence-2-large-ft)

##### **References**
[1] Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. 2023. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. [arXiv:2311.06242](https://arxiv.org/abs/2311.06242) [cs.CV]

---

# Discussion: How Image Captioning with Multimodal LLMs Can Improve Accessibility in Social Media
- **Automatic image captioning**: For visually impaired users, multimodal LLMs can automatically generate captions for images, making posts accessible and understandable.
- **Extended captions with context**: LLMs can provide detailed descriptions that go beyond simple captions, explaining the context, emotions, and key details in an image. This enhances the user experience for everyone.
- **Summarizing image content**: For users with limited time or attention spans, LLM-generated summaries of images can offer a quick overview of the content without having to examine the image in detail.
- **Translation of captions**: LLMs can translate captions into multiple languages, making Reddit posts accessible to a wider international audience.
- **Personalization based on user preferences**: LLMs can adapt their captions and descriptions based on user preferences and accessibility needs, offering a more personalized experience.
- **Improving content moderation**: By analyzing image content and captions, LLMs can help identify and flag potentially offensive or inappropriate images, contributing to a safer and more inclusive community.