## Overview

In this notebook we walk through how to use the Twelve Labs Marengo embedding model to create video embeddings, use Chroma to store and query those embeddings, and use the Twelve Labs Pegasus model to chat with the returned videos. We also compare Pegasus with a leading open source model.

We will:
1. Create Video Embeddings Using the Twelve Labs Marengo Engine
2. Store Video Embeddings in a Chroma Database
3. Query Embeddings in our Chroma Database to Find Relevant Video Segments
4. Use Twelve Labs Pegasus to Chat with the Returned Video Segment
5. Use an Open Source Model to Chat with the Returned Video Segment
6. Compare Pegasus to the Open Source Model
7. Use Chroma and Twelve Labs Embeddings to Search Multiple Videos
8. Use Pegasus to Chat with a Full Video
9. Use an Open Source Model to Chat with a Full Video

## Install Libraries

In [1]:
#Install Twelve Labs and Chroma libraries
!pip install --upgrade twelvelabs
!pip install --upgrade chromadb



In [None]:
#Install libraries for use with the open source model
!pip install protobuf==3.20.3
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install av

In [None]:
# #Extra Things to install if you're not on colab
# !python -m pip install pillow
# !python -m pip install sentencepiece
# !python -m pip install matplotlib

## Preparing the Video Data

### Using our video Data

This demo uses video data from a Twelve Labs Google Drive folder. To use it, you'll need to link the folder to your Google Drive and then mount your Google Drive in this Colab.

### Linking the folder to our Google Drive:
Anyone can access the folder with this link: https://drive.google.com/drive/folders/1k6FmkVglFsdtJG4MTIK-2dk1Dk9gTPtu?usp=share_link

To link this to the correct spot in _your_ Google Drive:
1. Go to "Shared with me" in Google Drive.
2. Locate the shared folder you want to access.
3. Select "Organize" -> "Add Shortcut"
4. Choose "My Drive" as the destination and click "Add".

Now this folder should be accessible at `/content/drive/MyDrive/TwelveLabs-Chroma`

### Mounting Drive
The cell below mounts your Drive, which we can then use to load the videos.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Set Video Path

Here we set the path for the videos we will be working with.

In [4]:
video_folder_path = "/content/drive/MyDrive/its_nathalievictoria"

### Upscale Video Resolution
Some of our videos are too low resolution to use in the embedding engine, so we will upscale them to 854x480 with `upscale_video`.



In [None]:
import numpy as np
import subprocess
import os

def upscale_video(input_file, output_path, target_width=854, target_height=480):
    """
    Upscale a video to the target width and height using FFmpeg.

    Args:
        input_file (str): Path to the input video file.
        output_path (str): Directory in which to save the upscaled video.
        target_width (int): Desired output width. Default is 854.
        target_height (int): Desired output height. Default is 480.
    """
    output_file = os.path.join(output_path, os.path.basename(input_file))

    # Skip videos that were already upscaled in a previous run
    if os.path.exists(output_file):
        print(f"Skipping {input_file} as {output_file} already exists.")
        return

    # FFmpeg command to upscale the video
    ffmpeg_command = [
        'ffmpeg',
        '-y',                                           # Overwrite the output file without prompting
        '-i', input_file,                               # Input file
        '-vf', f'scale={target_width}:{target_height}', # Scale filter with target dimensions
        '-c:a', 'copy',                                 # Copy audio stream without re-encoding
        output_file,                                    # Output file
    ]

    # Run the FFmpeg command
    subprocess.run(ffmpeg_command)

    print(f"Upscaled video saved to {output_file}")
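The fixed 854x480 target will stretch any video whose aspect ratio is not close to 16:9. One possible variation, sketched below, passes `-2` as the width so that FFmpeg's scale filter derives a width that preserves the aspect ratio and is divisible by 2. `build_upscale_command` is a hypothetical helper that is not part of this notebook, and it only builds the argument list without running FFmpeg.

```python
def build_upscale_command(input_file, output_file, target_height=480):
    """Build an FFmpeg argument list that scales a video to target_height.

    A width of -2 asks the scale filter to choose the width that keeps the
    input aspect ratio, rounded to a multiple of 2.
    """
    return [
        "ffmpeg",
        "-y",                                 # overwrite the output without prompting
        "-i", input_file,                     # input file
        "-vf", f"scale=-2:{target_height}",   # aspect-preserving scale filter
        "-c:a", "copy",                       # copy audio without re-encoding
        output_file,                          # output file
    ]


# Hypothetical file names, used only to show the resulting command
command = build_upscale_command("in.mp4", "out.mp4")
print(" ".join(command))
```

The returned list could be passed to `subprocess.run` in the same way as in `upscale_video`.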


In [None]:
upscaled_video_dir = video_folder_path + "upscaled_videos/"

In [None]:
#Upscale all .mp4 videos
# Create output directory if it doesn't exist
if not os.path.exists(upscaled_video_dir):
    os.makedirs(upscaled_video_dir)

# Iterate over all files in the raw video directory
for filename in os.listdir(video_folder_path):
    # Check if the file is a video file
    input_filepath = os.path.join(video_folder_path, filename)
    if filename.endswith(".mp4"):
        upscale_video(input_filepath, upscaled_video_dir)

## Create Video Embeddings Using the Twelve Labs Marengo Engine

Here we will use the Twelve Labs Marengo Engine to create embeddings for our video.

In [5]:
from google.colab import userdata
TL_API_KEY=userdata.get('TL_API_KEY')

In [6]:
from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

# Initialize the Twelve Labs client
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)

In [7]:
import chromadb

# Initialize Chroma client
chroma_client = chromadb.Client()

### Create Video Embeddings and Format for Chroma
Here we create video embeddings using Marengo and format them for Chroma. To upload data to Chroma you need three parallel lists covering all the data that you want to upload: `embeddings`, `metadatas`, and `ids`.

In [None]:
def on_task_update(task: EmbeddingsTask):
    print(f"  Status={task.status}")

# Create video embeddings and format for Chroma
def create_video_embeddings(client,video_file,segment_length,task_id=None):

    #upload video to twelve labs if it does not already exist
    video_name = os.path.basename(video_file)

    if not task_id:  # no existing task (None or empty string): create a new one
        task = client.embed.task.create(
            engine_name="Marengo-retrieval-2.7",
            video_file=video_file,
            video_clip_length=segment_length
        )
        print(
            f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}"
        )

        status = task.wait_for_done(
            sleep_interval=2,
            callback=on_task_update
        )

        print(f"Embedding done: {status}")

        task_id = task.id

    #fetch embeddings
    task = client.embed.task.retrieve(task_id)

    #format for chroma
    embeddings = []
    metadatas = []
    ids = []

    idx = 0

    print("embeddings",task.video_embeddings)

    if task.video_embeddings is not None:
        for v in task.video_embeddings:

            metadata = {
                "embedding_scope":v.embedding_scope,
                "start_offset_sec":v.start_offset_sec,
                "end_offset_sec":v.end_offset_sec,
                "video_file":video_file,
                "video_name":video_name,
                "task_id":task.id,
                "video_segment_number":idx
            }


            embedding = v.values
            # Use a name that does not shadow the built-in id()
            segment_id = task.id + "_" + str(idx)

            metadatas.append(metadata)
            embeddings.append(embedding)
            ids.append(segment_id)

            idx += 1

    return (ids,metadatas,embeddings,task_id)
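To make the shape of the three parallel lists concrete, here is a minimal, self-contained sketch that builds them from made-up segment data. `to_chroma_lists` and the toy values are hypothetical and only mirror the formatting step of `create_video_embeddings`; no Twelve Labs or Chroma call is made.

```python
def to_chroma_lists(task_id, video_name, segments):
    """Turn (start_sec, end_sec, vector) tuples into parallel Chroma lists.

    The i-th entry of each list describes the same video segment.
    """
    ids, metadatas, embeddings = [], [], []
    for idx, (start_sec, end_sec, vector) in enumerate(segments):
        ids.append(f"{task_id}_{idx}")
        metadatas.append({
            "start_offset_sec": start_sec,
            "end_offset_sec": end_sec,
            "video_name": video_name,
            "video_segment_number": idx,
        })
        embeddings.append(vector)
    return ids, metadatas, embeddings


# Made-up two-dimensional vectors; real embeddings are much longer
toy_segments = [(0.0, 6.0, [0.1, 0.2]), (6.0, 12.0, [0.3, 0.4])]
demo_ids, demo_metadatas, demo_embeddings = to_chroma_lists(
    "task123", "demo.mp4", toy_segments
)
print(demo_ids)
```

Keeping the index inside both the id and the metadata is what later lets us map a search hit back to a segment file.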

In [None]:
#set the segment duration and the video we will be working with
segment_duration = 6
current_video_path = upscaled_video_dir + "How To Make Birria Tacos [4nIFJFgH99w].mp4"

In [None]:
#Get embeddings to upload to Chroma

#Set task_id if you already have one, otherwise set to empty string
task_id = ""
ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client,current_video_path,segment_duration,task_id)

## Store Video Embeddings in a Chroma Database

Now that we have our ids, metadata, and embeddings in an easy format, we can add them to a new collection in Chroma.

In [None]:
#Fetch or create a Chroma Collection
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

In [None]:
#Add embeddings and metadata to our collection
collection.add(
    metadatas = metadatas,
    embeddings = embeddings,
    ids=ids
)

## Query Embeddings in our Chroma Database to Find Relevant Video Segments

### Testing the Vector Search
Now that we have everything in the collection, we can test that querying by embedding works. We will search with the first returned embedding and expect its distance to itself to be zero.

In [None]:
#use first embedding as a test search
test_segment_embeddings = embeddings[0]

results = collection.query(
    query_embeddings=[test_segment_embeddings],
    n_results=4
)

print("search embeddings for:",ids[0])
print("found:", results["ids"][0][0])
print("distance:",results["distances"][0][0])

#assert that the first segment's embedding is (approximately) distance 0 from itself
assert results["ids"][0][0] == ids[0]
assert results["distances"][0][0] < 1e-6
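The sanity check above relies on a vector having zero distance to itself. The brute-force sketch below illustrates that idea with a squared L2 distance, which we assume matches the collection's default metric. `squared_l2` and `nearest` are hypothetical helpers, not Chroma functions, and Chroma's own index is approximate rather than exhaustive.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors):
    """Return (index, distance) of the closest vector by exhaustive search."""
    distances = [squared_l2(query, v) for v in vectors]
    best = min(range(len(vectors)), key=distances.__getitem__)
    return best, distances[best]


# Toy vectors standing in for stored segment embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
best_index, best_distance = nearest(toy_vectors[0], toy_vectors)
print(best_index, best_distance)
```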

### Querying our Vector Database
We can embed a text query with Marengo and then run a vector search against our database. Although each embedding corresponds to a segment of a video, its metadata includes the name of the full video it came from.


In [None]:
query = "What are the ingredients for birria tacos?"

In [None]:
import os


def query_chroma(collection,query,n_results=1):
    #Create embedding for query
    embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.7",  # same engine as the video embeddings
        text=query,
        text_truncate="start",
    )

    query_embeddings = embedding.text_embedding.float

    #Search Chroma database with query embedding

    response = collection.query(
        query_embeddings=query_embeddings,
        n_results=n_results,
    )


    return response

response = query_chroma(collection,query)

# Print the properties and distance of the most similar object
print(response["ids"][0][0])
print(response["distances"][0][0])
print(response["metadatas"][0][0])

# Get the path for the found video segment for the next step
found_video_metadata = response["metadatas"][0][0]
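The double indexing in `response["ids"][0][0]` reflects a nested layout: the outer list has one entry per query and the inner list one entry per result. The sketch below unpacks that layout from a hand-written dictionary. `top_hit` is a hypothetical helper and the response values are invented for illustration.

```python
def top_hit(response):
    """Return (id, distance, metadata) of the best match for the first query.

    Returns None when the first query produced no results.
    """
    if not response["ids"] or not response["ids"][0]:
        return None
    return (
        response["ids"][0][0],
        response["distances"][0][0],
        response["metadatas"][0][0],
    )


# Hand-written stand-in for a query response; the values are made up
example_response = {
    "ids": [["task123_4"]],
    "distances": [[0.42]],
    "metadatas": [[{"video_name": "demo.mp4", "video_segment_number": 4}]],
}
hit_id, hit_distance, hit_metadata = top_hit(example_response)
print(hit_id, hit_distance, hit_metadata["video_name"])
```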


## Splitting Videos into Segments
We will now split the videos into 6-second segments.
This matches the segment duration of our embeddings, allowing us to submit _only_ the relevant video chunk to our model for a RAG use case.

In [None]:
split_video_dir = video_folder_path + "split_videos/"

def split_video(input_path, output_dir, segment_duration=6):

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    filename = os.path.splitext(os.path.basename(input_path))[0]
    filetype = os.path.splitext(os.path.basename(input_path))[1]

    # Split video into segments
    ffmpeg_command = [
        'ffmpeg',
        '-i', input_path,             # Input video file
        '-c', 'copy',                  # Copy both video and audio codecs
        '-f', 'segment',               # Segment mode
        '-segment_time', str(segment_duration),  # Segment length
        '-reset_timestamps', '1',      # Reset timestamps for each segment
        output_dir + filename + '_%03d' + filetype  # Output filename pattern (e.g., output_000.mp4)
    ]

    # Run the command
    subprocess.run(ffmpeg_command)

    print(f"Video split into {segment_duration}-second segments successfully.")
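Because the split uses the same duration as the embeddings, a segment number maps to a nominal time range by simple arithmetic. The sketch below shows that mapping in both directions; both helper names are hypothetical. Note that with `-c copy` FFmpeg can only cut at keyframes, so the actual segment boundaries may drift from these nominal values.

```python
def segment_time_range(segment_number, segment_duration=6):
    """Nominal (start_sec, end_sec) of a zero-based segment number."""
    start = segment_number * segment_duration
    return start, start + segment_duration


def segment_number_for_time(seconds, segment_duration=6):
    """Zero-based number of the segment that nominally contains a timestamp."""
    return int(seconds // segment_duration)


print(segment_time_range(2))
print(segment_number_for_time(13.5))
```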

In [None]:
#Split the video into segments
split_video(input_path=current_video_path, output_dir=split_video_dir, segment_duration=segment_duration)

## Use Twelve Labs Pegasus to Chat with the Returned Video Segment
These next few cells show how simple it is to use Pegasus to chat with a video: everything comes ready out of the box.

### Uploading Video Segment to Pegasus
First we create an index that uses the Pegasus engine, then upload our video segment to it.

In [None]:
#create or retrieve pegasus index
engines = [
        {
            "name": "pegasus1.1",
            "options": ["visual", "conversation"]
        }
    ]

index_name = "cooking_video_index"
indices_list = twelvelabs_client.index.list(name=index_name)

if len(indices_list) == 0:
    index = twelvelabs_client.index.create(
        name=index_name,
        engines=engines,

    )
    print(f"A new index has been created: id={index.id} name={index.name} engines={index.engines}")
else:
    index = indices_list[0]
    print(f"Index already exists: id={index.id} name={index.name} engines={index.engines}")

### Get Video Segment File Name

In [None]:
#Get video segment filename
found_video_segment_number = int(found_video_metadata["video_segment_number"])
found_video_file = found_video_metadata["video_file"]
found_video_filename = os.path.splitext(os.path.basename(found_video_file))[0]
found_video_filetype = os.path.splitext(os.path.basename(found_video_file))[1]
found_video_segment_filename = found_video_filename + f"_{found_video_segment_number:03d}"

found_video_segment_path = split_video_dir + found_video_segment_filename + found_video_filetype
print(found_video_segment_path)
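The filename derivation above can also be wrapped in a small function so it is reusable for other search hits. This is a sketch under the assumption that segments follow the `_%03d` naming pattern used by `split_video`. `segment_path` is a hypothetical helper and the paths in the example are placeholders.

```python
import os


def segment_path(split_dir, video_file, segment_number):
    """Path of the split file holding a given zero-based segment number."""
    name, extension = os.path.splitext(os.path.basename(video_file))
    return os.path.join(split_dir, f"{name}_{segment_number:03d}{extension}")


# Placeholder paths, used only to show the resulting pattern
example_segment_path = segment_path("/videos/split/", "/videos/tacos.mp4", 3)
print(example_segment_path)
```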

### Upload Video to Pegasus and Get Video Id

In [None]:
def upload_video_to_twelve_labs(index,video_path):

    #upload our video to our twelve labs index
    task = twelvelabs_client.task.create(
        index_id=index.id,
        file = video_path
    )
    print(f"Task created: id={task.id} status={task.status}")

    task.wait_for_done(sleep_interval=5, callback=on_task_update)

    if task.status != "ready":
        raise RuntimeError(f"Indexing failed with status {task.status}")
    print(f"The unique identifier of your video is {task.video_id}.")

    #return the video id
    return task.video_id
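`wait_for_done` blocks until the indexing task finishes, reporting progress through the callback. The sketch below shows the general polling pattern behind that kind of call. It is not the SDK's implementation: `wait_until_done`, its parameters, and the `"pending"` and `"indexing"` status names are assumptions made for illustration.

```python
import time


def wait_until_done(get_status, done_statuses=("ready", "failed"),
                    sleep_interval=5.0, max_checks=100, callback=None):
    """Poll get_status() until it returns a terminal status.

    Calls callback(status) after every check and raises TimeoutError if no
    terminal status is seen within max_checks checks.
    """
    for _ in range(max_checks):
        status = get_status()
        if callback is not None:
            callback(status)
        if status in done_statuses:
            return status
        time.sleep(sleep_interval)
    raise TimeoutError(f"No terminal status after {max_checks} checks")


# Simulated status sequence standing in for a real indexing task
simulated_statuses = iter(["pending", "indexing", "ready"])
seen_statuses = []
final_status = wait_until_done(
    lambda: next(simulated_statuses),
    sleep_interval=0,
    callback=seen_statuses.append,
)
print(final_status, seen_statuses)
```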

In [None]:
#Set video_id if you already have one, otherwise set to empty string
video_id = ""

In [None]:
#Upload video to get video id to chat with in Pegasus
if video_id == "":
    video_id = upload_video_to_twelve_labs(index,found_video_segment_path)

### Calling Pegasus
Here we query the video segment with the query we used to find it. Because Twelve Labs handles all of the boilerplate behind the scenes, we can call our model with a simple function.

In [None]:
#chat with the video segment using Pegasus with the query we used to find it
res = twelvelabs_client.generate.text(
  video_id=video_id,
  prompt=query
)
segment_answer = res.data
print(f"query: {query}")
print(f"{segment_answer}")

## Use an Open Source Model to Chat with the Returned Video Segment
First, we need to sample frames from the videos ourselves for the model to consume. We'll modify the [LLaVA-NeXT-Video sampling code](https://colab.research.google.com/drive/1CZggLHrjxMReG-FNOmqSOdi4z7NPq6SO?usp=sharing#scrollTo=hqpPqDKuQUTq) to get a uniform sample of 8 frames for each video.

And we can do this for all of the video segments in our folder.

`read_video_pyav` comes directly from the [LLaVA-NeXT-Video Colab notebook](https://colab.research.google.com/drive/1CZggLHrjxMReG-FNOmqSOdi4z7NPq6SO?usp=sharing#scrollTo=hqpPqDKuQUTq) and it formats videos in the correct numpy representation for inference.


In [None]:
import av
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

def sample_video(video_path, num_samples=8):
    container = av.open(video_path)

    # sample uniformly num_samples frames from the video
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / num_samples).astype(int)

    sampled_frames = read_video_pyav(container, indices)

    return sampled_frames

def process_videos_in_folder(folder_path):
    sample_info = {}

    # Supported video file extensions
    video_extensions = ('.mp4', '.avi', '.mov', '.mkv')

    for filename in os.listdir(folder_path):
        simple_video_name = os.path.splitext(os.path.basename(filename))[0]
        if filename.lower().endswith(video_extensions):
            video_path = os.path.join(folder_path, filename)
            try:
                sampled_clip = sample_video(video_path)
                sample_info[simple_video_name] = {"sampled_video": sampled_clip, "video_path" : video_path}
            except Exception as e:
                print(f"Error processing {filename}: {str(e)}")

    return sample_info
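The index computation in `sample_video` can be reproduced with integer arithmetic, which makes it easy to inspect and to guard against edge cases. The sketch below is a hypothetical stand-alone variant, not a drop-in replacement: it rejects videos that report no frames (some containers may report 0) and caps the sample count for very short clips.

```python
def uniform_frame_indices(total_frames, num_samples=8):
    """Evenly spaced zero-based frame indices across a video.

    Uses at most total_frames samples so that no index is repeated.
    """
    if total_frames <= 0:
        raise ValueError("video reports no frames")
    num_samples = min(num_samples, total_frames)
    return [i * total_frames // num_samples for i in range(num_samples)]


print(uniform_frame_indices(80))
print(uniform_frame_indices(10))
```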

In [None]:
sampled_video_info = process_videos_in_folder(split_video_dir)

In [None]:
# Get video segment found in our Chroma query
video_segment = sampled_video_info[found_video_segment_filename]['sampled_video']

### Setting up our Model
We'll load the model with 4-bit quantization to reduce its GPU memory footprint.

In [None]:
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)
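A rough back-of-the-envelope calculation shows why quantization helps the model fit on a single GPU. The sketch below estimates memory for the weights only, assuming a nominal 7 billion parameters as suggested by the model name. It ignores activations, quantization metadata, and any layers kept at higher precision, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


# Nominal parameter count suggested by the "7B" in the model name
nominal_parameters = 7_000_000_000
fp16_gb = weight_memory_gb(nominal_parameters, 16)
int4_gb = weight_memory_gb(nominal_parameters, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```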

In [None]:
#To use later to play the videos in the notebook itself

from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
# use the first sampled video as a preview
video = sampled_video_info[list(sampled_video_info.keys())[0]]['sampled_video']

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())

### Running the Model
Now that we have our query and the relevant video, we can feed them into the model to get an output.

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": query},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

In [None]:
inputs = processor([prompt], videos=[video_segment], padding=True, return_tensors="pt").to(model.device)

In [None]:
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
open_source_segment_generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [None]:
print(open_source_segment_generated_text[0])
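The decoded text includes the prompt as well as the model's reply. If only the reply is wanted, one option is to cut at the assistant marker. The sketch below assumes the chat template ends the prompt with `ASSISTANT:`, which you should verify by printing `prompt`. `extract_answer` is a hypothetical helper and the sample string is made up.

```python
def extract_answer(decoded_text, marker="ASSISTANT:"):
    """Return the text after the last marker, or the whole text if absent."""
    _, found, answer = decoded_text.rpartition(marker)
    return answer.strip() if found else decoded_text.strip()


# Made-up decoded output used only to demonstrate the helper
sample_output = "USER: \nWhat is shown in the video? ASSISTANT: A taco being assembled."
print(extract_answer(sample_output))
```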

## Compare Pegasus to the Open Source Model

Printing the two answers side by side, we can see that Pegasus does a better job of answering our query.

In [None]:
print(f"query {query}")
print("pegasus answer")
print(segment_answer)
print("open source answer")
print(open_source_segment_generated_text[0])


## Using Chroma and Twelve Labs Embeddings to Search Multiple Videos
With a larger set of videos, we can better show the value of semantic search and RAG: semantic search finds the specific video needed to answer the query, out of a potentially large collection, and we then give that full video to our model.


### Embedding our Video Database:
First we will embed all of our videos and store those embeddings in Chroma.

In [None]:
#Fetch or create the Chroma collection that will hold all video embeddings
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

In [None]:
#store twelve labs task ids for each video
task_ids = {}

In [None]:
#get embeddings and metadata for each video
#store task ids so we don't upload videos multiple times
for filename in os.listdir(upscaled_video_dir):

    if filename.endswith(".mp4"):

        # Reuse the stored task id if this video was already embedded (None otherwise)
        task_id = task_ids.get(filename)

        file_path = os.path.join(upscaled_video_dir, filename)

        ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client,file_path,segment_duration,task_id)

        task_ids[filename] = task_id

        collection.add(
            metadatas = metadatas,
            embeddings = embeddings,
            ids=ids
        )


In [None]:
print(task_ids)
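The `task_ids` dictionary only lives as long as the notebook session. To avoid re-uploading videos after a restart, the mapping could be persisted to a small JSON file, as in the sketch below. `load_task_ids`, `save_task_ids`, and the cache file name are hypothetical, and the example writes to a temporary directory rather than to Drive.

```python
import json
import os
import tempfile


def load_task_ids(cache_path):
    """Load the filename -> task id mapping, or an empty dict if none exists."""
    if not os.path.exists(cache_path):
        return {}
    with open(cache_path) as f:
        return json.load(f)


def save_task_ids(cache_path, task_ids):
    """Write the filename -> task id mapping as JSON."""
    with open(cache_path, "w") as f:
        json.dump(task_ids, f, indent=2)


# Round trip in a temporary directory with a made-up task id
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_cache_path = os.path.join(tmp_dir, "task_ids.json")
    empty_cache = load_task_ids(demo_cache_path)
    save_task_ids(demo_cache_path, {"demo.mp4": "task123"})
    restored_cache = load_task_ids(demo_cache_path)

print(empty_cache, restored_cache)
```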

### Querying our Database
Here we query the database for full videos to chat with.

In [None]:
response = query_chroma(collection,query)

In [None]:
found_full_video_name = response["metadatas"][0][0]["video_name"]
print(found_full_video_name)
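The query above takes the video of the single closest segment. A possible variation, sketched here as an assumption rather than as what this notebook does, is to request several results and pick the video that appears most often among them, breaking ties by the smallest distance. `best_video` is a hypothetical helper and the inputs are made up.

```python
from collections import defaultdict


def best_video(metadatas, distances):
    """Pick the video with the most hits; ties go to the closest segment."""
    hits = defaultdict(int)
    closest = {}
    for metadata, distance in zip(metadatas, distances):
        name = metadata["video_name"]
        hits[name] += 1
        closest[name] = min(distance, closest.get(name, float("inf")))
    # Sort key: more hits first, then smaller best distance
    return min(hits, key=lambda name: (-hits[name], closest[name]))


# Made-up results for one query with three returned segments
toy_metadatas = [{"video_name": "a.mp4"}, {"video_name": "b.mp4"}, {"video_name": "b.mp4"}]
toy_distances = [0.10, 0.20, 0.30]
voted_video = best_video(toy_metadatas, toy_distances)
print(voted_video)
```

With a real response, the inputs would be `response["metadatas"][0]` and `response["distances"][0]` from a query made with `n_results` greater than 1.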

## Use Pegasus to Chat with a Full Video
We already have an index created, so we just need to upload the videos to this index then call Pegasus.

In [None]:
#store pegasus video ids so that we don't upload videos multiple times
pegasus_video_ids = {}

In [None]:

for upscaled_video in os.listdir(upscaled_video_dir):
    upscaled_video_path = os.path.join(upscaled_video_dir, upscaled_video)
    print(upscaled_video_path)
    if upscaled_video not in pegasus_video_ids:
        video_id = upload_video_to_twelve_labs(index,upscaled_video_path)
        pegasus_video_ids[upscaled_video] = video_id


In [None]:
print(pegasus_video_ids)

### Calling Pegasus to Chat with Full Video

In [None]:
video_id = pegasus_video_ids[found_full_video_name]
print(video_id)

In [None]:
res = twelvelabs_client.generate.text(
  video_id=video_id,
  prompt=query
)
full_video_answer = res.data
print(f"query {query}")
print(f"{full_video_answer}")

### Compare full video answer to segment answer

In [None]:
print(f"segment answer: \n{segment_answer}")

## Use an Open Source Model to Chat with a Full Video
After we sample all of the videos again, we can run our model on the full video returned by our query and see how its answer changes.

In [None]:
#sample all of the videos:
sampled_database_video_info = process_videos_in_folder(upscaled_video_dir)
print(sampled_database_video_info.keys())

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": query},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

In [None]:
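#Use the sampled frames of the full video found by our Chroma query, not the earlier segment
full_video = sampled_database_video_info[os.path.splitext(found_full_video_name)[0]]['sampled_video']
inputs = processor([prompt], videos=[full_video], padding=True, return_tensors="pt").to(model.device)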

In [None]:
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [None]:
print(generated_text[0])

### Compare Result to Pegasus Answer
As we can see, the open source model cannot give us an answer when chatting with the entire video.

In [None]:
print(f"Pegasus answer: \n{full_video_answer}")