# Running LLaVa-Onevision: a multi-modal model for image and video modalities on Google Colab

LLaVa-Onevision is a new Vision-Language Model that enables interaction with videos and images in one model. It builds on a previous series of models, including [LLaVa-Interleave](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) and [LLaVa-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf). The architecture is very similar to LLaVa-NeXT and employs the anyres technique to handle high-resolution images efficiently. The base LLM is [Qwen2-Instruct](https://huggingface.co/collections/Qwen/qwen2-6659360b33528ced941e557f), and the smallest model has only 0.5 billion parameters. That makes LLaVA-Onevision a good fit for those who are short on computational resources.

LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/llava_onevision
Project page: https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main



First we need to install the latest `transformers` from source (the `main` branch on GitHub), as the model was added only recently and may not be part of a tagged release yet. We'll also install `bitsandbytes` to load the model in lower precision for [memory efficiency](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [1]:
# !pip install GPUtil numba -q

# import torch
# from GPUtil import showUtilization as gpu_usage
# from numba import cuda

# def free_gpu_cache():
#     print("Initial GPU Usage")
#     gpu_usage()                             

#     torch.cuda.empty_cache()

#     cuda.select_device(0)
#     cuda.close()
#     cuda.select_device(0)

#     print("GPU Usage after emptying the cache")
#     gpu_usage()

# free_gpu_cache()

Initial GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |
|  1 |  0% |  0% |
GPU Usage after emptying the cache
| ID | GPU | MEM |
------------------
|  0 |  0% |  1% |
|  1 |  0% |  0% |


In [3]:
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-u2lmqrcc
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-u2lmqrcc
  Resolved https://github.com/huggingface/transformers.git to commit 2e24ee4dfa39cc0bc264b89edbccc373c8337086
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [4]:
!nvidia-smi

Sun Sep 29 05:14:49 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P0             28W /   70W |     103MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

In [5]:
# we need av to be able to read the video
!pip install -q av tiktoken

In [6]:
# !pip3 install torch torchvision torchaudio -q --index-url https://download.pytorch.org/whl/cu117 --upgrade --force-reinstall
# import torch
# print(torch.__version__)


## Load the model

Next, we load a model and corresponding processor from the hub.

We will specify a quantization config to load the model in 4 bits. Please refer to this [guide](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for more details.
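
To get a rough feel for why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone at different precisions. It is back-of-the-envelope arithmetic, not a measurement: the helper name `estimate_weight_memory_gb` is made up for this example, the parameter counts are the nominal sizes from the checkpoint names, and activations, the KV cache and quantization overhead are all ignored.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts read off the checkpoint names (assumption: the real
# totals, which include the vision tower, differ somewhat)
for name, num_params in [("0.5b", 0.5e9), ("7b", 7e9), ("72b", 72e9)]:
    fp16 = estimate_weight_memory_gb(num_params, 16)
    int4 = estimate_weight_memory_gb(num_params, 4)
    print(f"{name}: ~{fp16:.1f} GB in float16, ~{int4:.1f} GB in 4-bit")
```

By this estimate the 7B weights in float16 come close to the memory of a single 16 GB GPU before any activations are counted, which is why 4-bit loading or spreading the model over several GPUs with `device_map='auto'` is worth considering.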

In [7]:
# # !pip install mkl-static mkl-include
# !pip uninstall bitsandbytes -y
# !pip install bitsandbytes -q

In [8]:
# !git clone https://github.com/Dao-AILab/flash-attention.git
# !cd flash-attention
# !export FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE
# !python setup.py install

In [9]:
# !pip install flash-attn --no-build-isolation

In [10]:
from transformers import BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration, LlavaOnevisionProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

# model = LlavaOnevisionForConditionalGeneration.from_pretrained(
#     "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
#     # quantization_config=quantization_config,
#     torch_dtype="float16", 
#     device_map='auto',
#     # offload_buffers=True
#     )
processor = LlavaOnevisionProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor.tokenizer.padding_side = "left" # set to 'left' for generation and 'right' for training (default is 'right')

# Load the 7B model. To fit it on consumer hardware, uncomment `quantization_config` below
# Quantizing the model to 4 bits can cut weight memory by up to about 4 times compared to float16
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf",
    torch_dtype="float16", 
    # quantization_config=quantization_config,
    device_map='auto'
)

preprocessor_config.json:   0%|          | 0.00/1.73k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/178 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

video_processor/preprocessor_config.json:   0%|          | 0.00/428 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.53k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/78.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.23G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/126 [00:00<?, ?B/s]

In [11]:
# # # ULTRA HARD
# model_hard = LlavaOnevisionForConditionalGeneration.from_pretrained(
#     "llava-hf/llava-onevision-qwen2-72b-ov-hf",
#     quantization_config=quantization_config,
#     device_map='auto'
# )

## Preparing the video and image inputs

In order to read the video we'll use `av` and sample 8 frames. You can try to sample more frames if the video is long. The model was trained with 32 frames, but can ingest more as long as we're in the LLM backbone's max sequence length range.
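
As a minimal sketch of the uniform sampling described above, the helper below picks evenly spaced frame indices using plain integer arithmetic. The name `sample_frame_indices` is hypothetical and not part of PyAV or `transformers`. It also guards against a non-positive frame count, since a container may report 0 frames when the count is not stored in the file.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick up to `num_frames` evenly spaced frame indices from a video."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Integer arithmetic avoids floating-point rounding surprises
    indices = [i * total_frames // num_frames for i in range(num_frames)]
    # Short videos produce repeated indices; keep each frame only once
    return sorted(set(indices))


print(sample_frame_indices(100, 8))  # [0, 12, 25, 37, 50, 62, 75, 87]
print(sample_frame_indices(3, 8))    # [0, 1, 2]
```

When the returned list is shorter than `num_frames`, the video has fewer frames than requested, and the caller can decide whether to skip it or pad it.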

In [12]:
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

In [59]:
from huggingface_hub import hf_hub_download
import os

N = 8

video_folder = '/kaggle/input/rt-videos-test/test_tag_video/videos'
clips = []
for video_name in os.listdir(video_folder):
    if len(clips)>2:
        continue
    video_path = os.path.join(video_folder, video_name)
    if video_path.endswith('.DS_Store'):
        continue

    # # Download video from the hub
    # video_path_1 = hf_hub_download(repo_id="baltsat/RT-Videos", filename="0b7834cc1bb493636600674074345998.mp4", repo_type="dataset")
    # video_path_2 = hf_hub_download(repo_id="baltsat/RT-Videos", filename="0bbdcd822239ea2d32c0768364854fbb.mp4", repo_type="dataset")

    container = av.open(video_path)

    # sample uniformly N frames from the video (we can sample more for longer videos)
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / N).astype(int)
    clip = read_video_pyav(container, indices)
    print(video_path)
    
    # Check if the clip has the required number of frames
    if len(clip) < N:
        print(f"Skipping video {video_path} due to insufficient frames.")
        continue
    clips.append(clip)
    
print('Clips created')

/kaggle/input/rt-videos-test/test_tag_video/videos/7d6f46570858201ff82cb95e8b12d17a.mp4
/kaggle/input/rt-videos-test/test_tag_video/videos/a59414207bf543ac61fc763f9a17acc9.mp4
/kaggle/input/rt-videos-test/test_tag_video/videos/630d1fe128c5720f518a5b436247bf9b.mp4
Clips created


In [18]:
#@title Visualisation
# Will turn off for now. May use if you want.

# from matplotlib import pyplot as plt
# from matplotlib import animation
# from IPython.display import HTML

# # Select a random clip for visualization
# if clips:
#     random_index = np.random.randint(0, len(clips))
#     random_clip = clips[random_index]

#     # Visualisation
#     fig = plt.figure()
#     im = plt.imshow(random_clip[0, :, :, :])

#     plt.close()  # this is required to not display the generated image

#     def init():
#         im.set_data(random_clip[0, :, :, :])

#     def animate(i):
#         im.set_data(random_clip[i, :, :, :])
#         return im

#     anim = animation.FuncAnimation(fig, animate, init_func=init, frames=random_clip.shape[0], interval=100)
#     HTML(anim.to_html5_video())
# else:
#     print("No valid clips found.")

In [19]:
# Let's also load 2 images for generation from image data (uncomment to run the image cells in the legacy section)

# from PIL import Image
# import requests

# image_stop = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
# image_snowman = Image.open(requests.get("https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg", stream=True).raw)

# image_snowman

## Prepare a prompt and generate

In the prompt, you can refer to a video or an image using the special `<video>` or `<image>` token. To indicate which text comes from a human vs. the model, one uses `user` and `assistant` respectively. The chat format looks as follows:

`"<|im_start|>user <image>\n<prompt1><|im_end|><|im_start|>assistant <answer1><|im_end|><|im_start|>user <image>\n<prompt2><|im_end|><|im_start|>assistant "`

In other words, you always need to end your prompt with `<|im_start|>assistant` if you want to chat with the model.

Manually formatting your prompt can be error-prone. Luckily we can use `apply_chat_template` to make it easier.

Chat templates are special templates written in Jinja and added to the model's config. Whenever we call `apply_chat_template`, the Jinja template is filled in with your text instruction.

To use a chat template, simply build a list of messages with `role` and `content` keys, and then pass it to the `apply_chat_template()` method. Once you do that, you'll get output that's ready to go! When using chat templates as input for model generation, it's also a good idea to use `add_generation_prompt=True` to add a generation prompt. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.
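
To make the format above concrete, here is a small pure-Python approximation of what the template produces. This is an illustration only: `render_chat` is a made-up helper, the real template is the Jinja file shipped with the checkpoint, and the exact whitespace (for example the newline after `assistant`) is an assumption based on the formatted prompts shown in this notebook. `processor.apply_chat_template` remains the authoritative way to build prompts.

```python
def render_chat(messages, add_generation_prompt=False):
    """Approximate the prompt string built from a list of chat messages."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']} ")
        # Visual placeholders come first, then the text of the turn
        for item in message["content"]:
            if item["type"] in ("image", "video"):
                parts.append(f"<{item['type']}>\n")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
        parts.append("<|im_end|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "video"},
        ],
    },
]
print(repr(render_chat(conversation, add_generation_prompt=True)))
```

Note that the `<video>` placeholder is emitted before the text even though the text item comes first in `content`, which matches the formatted prompts printed in this notebook.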

In [21]:
import json

# Load the JSON file
with open('/kaggle/input/tag-list/IAB_tags_list.json', 'r', encoding='utf-8') as file:
    iab_tags = json.load(file)

# Initialize arrays to store tags
tags_lvl1 = []
tags_lvl2 = {}
tags_lvl3 = {}

# Extract L1 tags
tags_lvl1 = list(iab_tags.keys())

# Extract L2 and L3 tags
for lvl1_tag in tags_lvl1:
    tags_lvl2[lvl1_tag] = []
    for lvl2_tag, lvl3_tags in iab_tags[lvl1_tag].items():
        tags_lvl2[lvl1_tag].append(lvl2_tag)
        if isinstance(lvl3_tags, list):
            tags_lvl3[lvl2_tag] = lvl3_tags
        else:
            tags_lvl3[lvl2_tag] = []

# Print the extracted tags for verification
print("L1 Tags:", tags_lvl1)
print("L2 Tags:", tags_lvl2)
print("L3 Tags:", tags_lvl3)

L1 Tags: ['Транспорт', 'Книги и литература', 'Бизнес и финансы', 'Карьера', 'Образование', 'События и достопримечательности', 'Семья и отношения', 'Изобразительное искусство', 'Еда и напитки', 'Здоровый образ жизни', 'Хобби и интересы', 'Дом и сад', 'Медицина', 'Фильмы и анимация', 'Музыка и аудио', 'Новости и политика', 'Личные финансы', 'Животные', 'Массовая культура', 'Недвижимость', 'Религия и духовность', 'Наука', 'Покупки', 'Спорт', 'Стиль и красота', 'Информационные технологии', 'Телевидение', 'Путешествия', 'Игры', 'NaN']
L2 Tags: {'Транспорт': ['Типы кузова автомобиля', 'Типы автомобилей', 'Автомобильная культура', 'Видео с видеорегистраторов', 'Мотоциклы', 'Помощь на дороге', 'Скутеры', 'Покупка и продажа автомобилей', 'Автострахование', 'Автозапчасти', 'Авторемонт', 'Автобезопасность', 'Выставки автомобилей', 'Автомобильные технологии', 'Прокат автомобилей'], 'Книги и литература': ['Книги по искусству и фотографии', 'Биографии', 'Детская литература', 'Комиксы и графические р

In [22]:
#@markdown ANSWERS
ANSWERS = f'''
[Путешествия, События и достопримечательности: Исторические места и достопримечательности]
[Еда и напитки: Кулинария]
[Массовая культура: Юмор и сатира]
[Дом и сад: Дизайн интерьера]
[Хобби и интересы]
[Массовая культура: Юмор и сатира]
[Путешествия: Направления путешествий: Азия]
[Хобби и интересы: Декоративно-прикладное искусство]
[Фильмы и анимация: Документальные фильмы]
[Дом и сад: Садоводство, Бизнес и финансы: Промышленность и сфера услуг: Сельское хозяйство]
[Изобразительное искусство: Современное искусство]
[Спорт: Борьба, Массовая культура]
[Музыка и аудио: Комедия и стендап (Музыка и аудио)]
[События и достопримечательности: Концерты и музыкальные мероприятия]
[Семья и отношения: Брак и гражданские союзы]
[Массовая культура]
'''

answers_lvl1 = '''
[Книги и литература, Бизнес и финансы]
[Карьера, Образование]
[События и достопримечательности]
[Семья и отношения]
[Изобразительное искусство]
[Еда и напитки, Здоровый образ жизни, Хобби и интересы]
[Дом и сад]
'''

In [23]:
import tiktoken

def num_tokens_from_string(string, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    string = str(string)
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string('EXAMPLE STRING', "cl100k_base")

2

In [53]:
# Each "content" is a list of dicts and you can add image/video/text modalities

prompts = []
for i, clip in enumerate(clips):
    # LVL1
    # Prompt Setup with Best Practices and Grouped Tags
    tags_lvl1s = '\n'.join(tags_lvl1)
    PROMPT = f'''
    Вы эксперт в видео-тегировании с использованием IAB Content Taxonomy. 
    Ваша задача — на основе видео выбрать наиболее подходящие теги из списка ниже. Теги иерархичны по категориям и подкатегориям. Проанализируйте содержание видео и выберите соответствующие теги.

    **Возможные теги**:
    {tags_lvl1s}

    Ответь на русском. Твой ответ должен быть в квадратных скобках. Ответьте, используя один или несколько целевых тегов из списка возможных тегов. Примеры ответов:
    {answers_lvl1}
    
    Ты должен отвечать на русском языке! Разрешено использовать только русские символы в ответе. Начни ответ с квадратной скобки [
    '''

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "video"},
                ],
        },
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    prompts.append(prompt)
print('Prompts created')
print(num_tokens_from_string(prompts[0], "cl100k_base"))

Prompts created
615


In [54]:
# As you can see we got the formatted prompt
prompt
# prompt_2

"<|im_start|>user <video>\n\n    Вы эксперт в видео-тегировании с использованием IAB Content Taxonomy. \n    Ваша задача — на основе видео выбрать наиболее подходящие теги из списка ниже. Теги иерархичны по категориям и подкатегориям. Проанализируйте содержание видео и выберите соответствующие теги.\n\n    **Возможные теги**:\n    Транспорт\nКниги и литература\nБизнес и финансы\nКарьера\nОбразование\nСобытия и достопримечательности\nСемья и отношения\nИзобразительное искусство\nЕда и напитки\nЗдоровый образ жизни\nХобби и интересы\nДом и сад\nМедицина\nФильмы и анимация\nМузыка и аудио\nНовости и политика\nЛичные финансы\nЖивотные\nМассовая культура\nНедвижимость\nРелигия и духовность\nНаука\nПокупки\nСпорт\nСтиль и красота\nИнформационные технологии\nТелевидение\nПутешествия\nИгры\nNaN\n\n    Ответь на русском. Твой ответ должен быть в квадратных скобках. Ответьте, используя один или несколько целевых тегов из списка возможных тегов. Примеры ответов:\n    \n[Книги и литература, Бизнес

In [55]:
# import pandas as pd
# df = pd.read_csv('train_dataset_tag_video/baseline/train_data_categories.csv')
# tags_column = df['tags']

# # Find the tag with the largest number of tokens
# max_tokens_tag = max(tags_column, key=lambda tag: num_tokens_from_string(tag, "cl100k_base"))
# max_tokens = num_tokens_from_string(max_tokens_tag, "cl100k_base")

# # Print the result
# print(f"The tag with the largest number of tokens is: {max_tokens_tag}")
# print(f"Number of tokens: {max_tokens}")

In [56]:
generate_kwargs = {"max_new_tokens": 22, "do_sample": True, "top_p": 0.9}

# we still need to call the processor to tokenize the prompt and get pixel_values for videos
inputs = processor(text=prompts, videos=clips, padding=True, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [57]:
def extract_substrings_after_assistant(strings_list):
    result = []
    for string in strings_list:
        # Find the position of 'assistant\n'
        assistant_index = string.find('assistant\n')
        if assistant_index != -1:
            # Extract the substring after 'assistant\n'
            substring = string[assistant_index + len('assistant\n'):]
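            result.append(substring)
        else:
            # Keep one entry per input so results stay aligned with `clips`
            result.append('')
    return result

classes_lvl1 = extract_substrings_after_assistant(generated_text)
classes_lvl1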

['[Дом и сад, Музыка и аудио]', '[Спорт, Outdoor]', '[Музыка и аудио]']
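
The answers come back as bracketed strings and can contain tags that are not in the taxonomy (such as `Outdoor` above). A hedged sketch of turning such a string into a clean list of known tags could look like this. `parse_tags` is a hypothetical helper, and it assumes individual tags contain no commas, which may not hold for every entry in the taxonomy.

```python
def parse_tags(answer: str, allowed_tags) -> list[str]:
    """Split a bracketed answer into tags and drop anything outside the taxonomy."""
    allowed = set(allowed_tags)
    inner = answer.strip().strip("[]")
    candidates = [tag.strip() for tag in inner.split(",")]
    return [tag for tag in candidates if tag in allowed]


# Small stand-in for `tags_lvl1`; the real list comes from the taxonomy JSON
example_tags = ["Спорт", "Музыка и аудио", "Дом и сад"]
print(parse_tags("[Дом и сад, Музыка и аудио]", example_tags))
print(parse_tags("[Спорт, Outdoor]", example_tags))
```

Filtering against the known tag list keeps invented labels out of the next stage, where each level-1 tag is used as a dictionary key.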

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities

prompts = []
for i, clip in enumerate(clips):
    # LVL2
    # Prompt Setup with Best Practices and Grouped Tags
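    # classes_lvl1[i] is a string such as '[Дом и сад, Музыка и аудио]', not a dict key,
    # so split it into level-1 tags and collect the level-2 tags of the known ones
    lvl1_pred = [t.strip() for t in classes_lvl1[i].strip().strip('[]').split(',')]
    tags_lvl2s = '\n'.join(tag for t in lvl1_pred if t in tags_lvl2 for tag in tags_lvl2[t])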
    
    PROMPT = f'''
    Вы эксперт в видео-тегировании с использованием IAB Таксономии видео классификации. 
    Ваша задача — на основе видео выбрать наиболее подходящие теги из списка ниже. Теги иерархичны по категориям и подкатегориям. Проанализируйте содержание видео и выберите соответствующие теги.

    **Возможные теги**:
    {tags_lvl2s}

    Ответь на русском. Твой ответ должен быть в квадратных скобках. Ответьте, используя один или несколько целевых тегов из списка возможных тегов. Примеры ответов:
    {answers_lvl1}
    
    Ты должен отвечать на русском языке! Разрешено использовать только русские символы в ответе. Начни ответ с квадратной скобки [
    '''

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "video"},
                ],
        },
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    prompts.append(prompt)
print('Prompts created')
print(num_tokens_from_string(prompts[0], "cl100k_base"))

### LEGACY: Generate from images and image+video data

To generate from images we have to change the special token to `<image>` or indicate an "image" modality in the chat template. That's it! Let's see how it works.

In [23]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this image?"},
              {"type": "image"},
              ],
      },
]

conversation_2_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What color is the sign?"},
              {"type": "image"},
              ],
      },
]

prompt_image = processor.apply_chat_template(conversation_image, add_generation_prompt=True)
prompt_2_image = processor.apply_chat_template(conversation_2_image, add_generation_prompt=True)

In [24]:
prompt

"<|im_start|>user <video>\n\n    You are an expert in video tagging using IAB Content Taxonomy. The Content Taxonomy provides a “common language” that can be used when describing content.\n    Your task is based on the video  match it to the most relevant tags from the list below. The tags are ierarchial by categories and subcategories. Please analyze the content carefully and match it to the appropriate tags.\n\n    **Possible Tags**:\n    ['Транспорт', 'Книги и литература', 'Бизнес и финансы', 'Карьера', 'Образование', 'События и достопримечательности', 'Семья и отношения', 'Изобразительное искусство', 'Еда и напитки', 'Здоровый образ жизни', 'Хобби и интересы', 'Дом и сад', 'Медицина', 'Фильмы и анимация', 'Музыка и аудио', 'Новости и политика', 'Личные финансы', 'Животные', 'Массовая культура', 'Недвижимость', 'Религия и духовность', 'Наука', 'Покупки', 'Спорт', 'Стиль и красота', 'Информационные технологии', 'Телевидение', 'Путешествия', 'Игры']\n\n    Answer in Russian. Your answ

In [25]:
inputs = processor(images=[image_snowman, image_stop], text=[prompt_image, prompt_2_image], padding=True, return_tensors="pt").to(model.device, torch.float16)

NameError: name 'image_snowman' is not defined

In [None]:
generate_kwargs = {"max_new_tokens": 50, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [None]:
print(generated_text)

We can feed images and videos in one go instead of running separate generations for image and video. We can also interleave images with videos inside one prompt, although the model did not see that kind of example during training.

For the processing, just make sure to pass images/videos in the same order as they appear in the prompts, starting from the first prompt until the last prompt. You can pass all visual data as a flattened list as shown below; only the order matters.
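
Because a mismatch between placeholders and visual inputs is easy to introduce, a quick sanity check before calling the processor can help. The sketch below counts `<image>` and `<video>` tokens across the prompts and compares them with the lengths of the flattened lists. Both helpers are hypothetical, and the check covers counts only: it cannot tell whether the items are in the right order, and it would miscount if the instruction text itself contained a literal `<image>` or `<video>`.

```python
def count_placeholders(prompts):
    """Count `<image>` and `<video>` tokens across all prompts."""
    num_images = sum(prompt.count("<image>") for prompt in prompts)
    num_videos = sum(prompt.count("<video>") for prompt in prompts)
    return num_images, num_videos


def check_visual_inputs(prompts, images, videos):
    """Raise if the flattened image/video lists do not match the prompts."""
    expected_images, expected_videos = count_placeholders(prompts)
    if len(images) != expected_images:
        raise ValueError(
            f"prompts reference {expected_images} image(s), got {len(images)}"
        )
    if len(videos) != expected_videos:
        raise ValueError(
            f"prompts reference {expected_videos} video(s), got {len(videos)}"
        )


# Strings stand in for real prompts, images and clips in this example
example_prompts = [
    "<|im_start|>user <video>\nDescribe the video<|im_end|>",
    "<|im_start|>user <image>\nWhat do you see?<|im_end|>",
    "<|im_start|>user <image>\nWhat color is the sign?<|im_end|>",
]
check_visual_inputs(example_prompts, images=["img_a", "img_b"], videos=["clip_a"])
print(count_placeholders(example_prompts))
```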





In [None]:
inputs = processor(images=[image_snowman, image_stop], text=[prompt, prompt_image, prompt_2_image], videos=[clip_baby], padding=True, return_tensors="pt").to(model.device, torch.float16)

In [None]:
generate_kwargs = {"max_new_tokens": 40, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text)

In [None]:
# For multi-turn conversations just continue stacking up messages in the chat template
conversation_multiturn = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
      {
          "role": "assistant",
          "content": [
              {"type": "text", "text": "I see a baby reading a book."},
              ],
      },
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "Why is it funny?"},
              ],
      },
]

prompt_multiturn = processor.apply_chat_template(conversation_multiturn, add_generation_prompt=True)
print(prompt_multiturn)