<a href="https://colab.research.google.com/github/rapheal-sacr/blank-app/blob/main/transformers_doc/en/pytorch/video_text_to_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Transformers installation
#! pip install transformers datasets evaluate accelerate
# To install from source instead of the last release, comment the command above and uncomment the following one.
#! pip install git+https://github.com/huggingface/transformers.git
#! pip install -q transformers accelerate flash_attn



# Video-text-to-text

Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning.

These models have nearly the same architecture as [image-text-to-text](https://huggingface.co/docs/transformers/main/en/tasks/../image_text_to_text) models, with some changes to accept video data, since video is essentially a sequence of image frames with temporal dependencies. Some image-text-to-text models accept multiple images, but this alone is not enough for a model to handle videos. Moreover, video-text-to-text models are often trained on all vision modalities: each example might contain a single video, multiple videos, a single image, or multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token to the text, as in "What is happening in this video? `<video>`".

In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference.

To begin with, there are multiple types of video LMs:

- base models used for fine-tuning
- chat fine-tuned models for conversation
- instruction fine-tuned models

This guide focuses on inference with an instruction-tuned model, [llava-hf/llava-interleave-qwen-7b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf), which can take in interleaved data. Alternatively, you can try [llava-interleave-qwen-0.5b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf) if your hardware cannot run a 7B model.

Let's begin by installing the dependencies.

```bash
pip install -q transformers accelerate flash_attn
```

Let's initialize the model and the processor.

In [3]:
from transformers import LlavaProcessor, LlavaForConditionalGeneration
import torch
model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"

processor = LlavaProcessor.from_pretrained(model_id)

model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto", dtype=torch.float16)


Some models directly consume the `<video>` token, and others accept one `<image>` token per sampled frame. This model handles videos in the latter fashion. We will write a simple utility to handle image tokens, and another utility to get a video from a URL or a local path and sample frames from it.

In [4]:
import uuid
import requests
import cv2
from PIL import Image
import os

def replace_video_with_images(text, frames):
    # `frames` is the number of sampled frames, not the frames themselves
    return text.replace("<video>", "<image>" * frames)

def sample_frames(video_source, num_frames):
    if video_source.startswith("http://") or video_source.startswith("https://"):
        # Handle URL
        response = requests.get(video_source)
        response.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
        path_id = str(uuid.uuid4())
        path = f"./{path_id}.mp4"
        with open(path, "wb") as f:
            f.write(response.content)
    else:
        # Assume it's a local file path
        if not os.path.exists(video_source):
            raise FileNotFoundError(f"Local video file not found at: {video_source}")
        path = video_source

    video = cv2.VideoCapture(path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    if total_frames == 0:
        raise ValueError(f"Could not read any frames from video: {path}. Check if the file is valid.")

    # Keep every `interval`-th frame. max(1, ...) stops the interval from becoming 0
    # when num_frames exceeds total_frames, which would break the modulo below.
    interval = max(1, total_frames // num_frames) if num_frames > 0 else total_frames + 1
    frames = []
    for i in range(total_frames):
        ret, frame = video.read()
        if not ret:
            break # Break if we can't read more frames
        if i % interval == 0 and len(frames) < num_frames:
            pil_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(pil_img)
    video.release()
    return frames[:num_frames]  # At most num_frames frames (fewer if the video is short)
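
To make the sampling rule above easier to reason about, here is a minimal sketch that reproduces only the index arithmetic, without OpenCV. `interval_indices` mirrors the fixed-interval rule in `sample_frames`; `evenly_spaced_indices` is a hypothetical alternative (not part of the original guide) that spreads the samples across the whole clip. Both function names are assumptions introduced for illustration.

```python
def interval_indices(total_frames, num_frames):
    """Indices kept by the fixed-interval rule used in sample_frames."""
    if num_frames <= 0 or total_frames <= 0:
        return []
    interval = max(1, total_frames // num_frames)
    # Every `interval`-th index, truncated to the first num_frames hits
    return list(range(0, total_frames, interval))[:num_frames]


def evenly_spaced_indices(total_frames, num_frames):
    """Hypothetical alternative: spread indices from the first to the last frame."""
    if num_frames <= 0 or total_frames <= 0:
        return []
    num_frames = min(num_frames, total_frames)
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


print(interval_indices(11, 6))       # [0, 1, 2, 3, 4, 5]
print(evenly_spaced_indices(11, 6))  # [0, 2, 4, 6, 8, 10]
```

For short clips the fixed-interval rule can concentrate its samples at the start of the video, as the 11-frame case shows. Whether that matters depends on the task; the evenly spaced variant is one possible way to cover the full clip.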

Let's get our inputs. We will sample frames and concatenate them.

In [4]:
video_1 = "https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4"
video_2 = "https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_2.mp4"

video_1 = sample_frames(video_1, 6)
video_2 = sample_frames(video_2, 6)

videos = video_1 + video_2

videos


[<PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>,
 <PIL.Image.Image image mode=RGB size=1920x1080>]

Both videos have cats.

<div class="container">
  <div class="video-container">
    <video width="400" controls>
      <source src="https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4" type="video/mp4">
    </video>
  </div>

  <div class="video-container">
    <video width="400" controls>
      <source src="https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_2.mp4" type="video/mp4">
    </video>
  </div>
</div>

Now we can preprocess the inputs.

This model has a prompt template that looks like the following. First, we put all the sampled frames into one list. Since we sampled six frames from each of the two videos, that prompt needs 12 `<image>` tokens. The cell below applies the same recipe to a single local video sampled at up to 90 frames, so it inserts one `<image>` token per sampled frame. Add `assistant` at the end of the prompt to trigger the model to give answers. Then we can preprocess.

In [7]:
video_a = "/content/example-vid.mp4.mp4"  # Path to a local video file
print(video_a)
video_b = sample_frames(video_a, 90)
user_prompt = "Give me a prompt to generate a video like this video"
toks = "<image>" * len(video_b)  # One <image> token per frame actually sampled
prompt = "<|im_start|>user" + toks + f"\n{user_prompt}<|im_end|><|im_start|>assistant"
inputs = processor(text=prompt, images=video_b, return_tensors="pt").to(model.device, model.dtype)

/content/example-vid.mp4.mp4
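
The prompt construction above can be sketched as pure string handling, which makes it easy to check before running the model. The helper names `expand_video_tokens` and `build_prompt` and the example question are hypothetical, introduced here for illustration. Unlike `replace_video_with_images`, this sketch gives each `<video>` placeholder its own frame count, which is an assumption about how you might want to handle videos sampled at different rates.

```python
def expand_video_tokens(text, frame_counts):
    """Replace the k-th "<video>" in text with frame_counts[k] "<image>" tokens."""
    parts = text.split("<video>")
    if len(parts) - 1 != len(frame_counts):
        raise ValueError("number of <video> placeholders must match frame_counts")
    pieces = [parts[0]]
    for count, tail in zip(frame_counts, parts[1:]):
        pieces.append("<image>" * count)
        pieces.append(tail)
    return "".join(pieces)


def build_prompt(user_text):
    """Wrap the user turn in the chat markers used in the cell above."""
    return f"<|im_start|>user{user_text}<|im_end|><|im_start|>assistant"


# Two videos with six sampled frames each, followed by a hypothetical question
question = "<video><video>\nAre these two videos similar?"
prompt = build_prompt(expand_video_tokens(question, [6, 6]))
print(prompt.count("<image>"))  # 12
```

The count of `<image>` tokens in the prompt should equal the length of the frame list passed to the processor, so comparing the two before calling the processor is a cheap sanity check.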


We can now call [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) for inference. The decoded output contains both our question and the model's answer, so we keep only the text that follows the prompt and the `assistant` marker.

In [8]:
output = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7)
print(processor.decode(output[0][2:], skip_special_tokens=True)[len(user_prompt)+10:])

Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 540.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 128.12 MiB is free. Process 163287 has 14.61 GiB memory in use. Of the allocated memory 12.54 GiB is allocated by PyTorch, and 1.94 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
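
The out-of-memory error above is consistent with how the input grows with the number of frames: in LLaVA-style models each `<image>` placeholder stands for many embedding positions, so the sequence length scales linearly with the frame count. The sketch below only does that arithmetic. The function name, `tokens_per_frame=729`, and `text_tokens=32` are placeholder assumptions, not values read from this checkpoint; check the model or processor configuration for the real numbers.

```python
def estimated_sequence_length(num_frames, tokens_per_frame=729, text_tokens=32):
    """Rough input length: one block of visual tokens per frame plus the text.

    tokens_per_frame and text_tokens are illustrative assumptions.
    """
    return num_frames * tokens_per_frame + text_tokens


for n in (6, 12, 90):
    print(n, estimated_sequence_length(n))
```

Under these assumed numbers, 90 frames produce an input roughly 7.5 times longer than 12 frames, which is why reducing the frame count is usually the first thing to try.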

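The run above fails with an out-of-memory error instead of producing text: 90 sampled frames is too many for the roughly 15 GiB GPU reported in the traceback. Sampling fewer frames, such as the six per video used earlier, shortens the input sequence and lets generation complete on smaller hardware.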
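
The character-offset slicing in the `print` call above depends on the exact length of the prompt text. An alternative that is often more robust is to drop the prompt at the token level, since `generate()` on a decoder-only model returns the prompt ids followed by the newly generated ids. The sketch below shows the idea on plain lists; `new_token_ids` is a hypothetical helper, and the ids are made-up numbers.

```python
def new_token_ids(output_ids, prompt_length):
    """Return only the ids generated after the prompt."""
    return output_ids[prompt_length:]


# Toy stand-ins for real token ids
prompt_ids = [101, 7, 7, 7, 42]
output_ids = prompt_ids + [900, 901, 902]
print(new_token_ids(output_ids, len(prompt_ids)))  # [900, 901, 902]

# With the objects from this guide, the same idea would look like this (not run here):
# answer_ids = output[0][inputs["input_ids"].shape[1]:]
# print(processor.decode(answer_ids, skip_special_tokens=True))
```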

To learn more about chat templates and token streaming for video-text-to-text models, refer to the [image-text-to-text](https://huggingface.co/docs/transformers/main/en/tasks/../tasks/image_text_to_text) task guide because these models work similarly.