SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

SPDX-License-Identifier: Apache-2.0

# Part 1: Understanding Videos using Metropolis Video Search and Summarization AI Blueprint (VSS)

## Summarization and Vector-RAG

In this notebook, we'll explore the [NVIDIA AI blueprint for video search and summarization (VSS)](https://build.nvidia.com/nvidia/video-search-and-summarization/blueprintcard), which can be used to generate insights from input videos. Here we use a zero-shot foundation model; however, fine-tuned models can also be integrated with VSS to enable specific tasks.

### Learning Objectives:
This notebook explores the following topics:
* VSS REST APIs for file management and summarization
* Implementation of summarization using Vector-RAG
* VSS configuration options through the REST APIs for specific summarization use cases 


### Table of Contents

**[Set Up the Environment](#Set-Up-the-Environment)**  
**[VSS API Overview](#VSS-API-Overview)**  
**[Video Summarization](#Video-Summarization)**  
&nbsp;&nbsp;&nbsp;&nbsp;[File Uploading](#File-Uploading)  
&nbsp;&nbsp;&nbsp;&nbsp;[Summarize](#Summarize)  
&nbsp;&nbsp;&nbsp;&nbsp;[Summarization Pipeline and CA-RAG](#Summarization-Pipeline-and-CA-RAG)  

**[Summarization Configurations - Intelligent Traffic System](#Summarization-Configurations---Intelligent-Traffic-System)**  
&nbsp;&nbsp;&nbsp;&nbsp;[Prompts](#Prompts)  
&nbsp;&nbsp;&nbsp;&nbsp;[Chunk Duration](#Chunk-Duration)  
&nbsp;&nbsp;&nbsp;&nbsp;[Model Parameters](#Model-Parameters)  

**[Review](#Review)**

<img src="assets/vss_arch_diagram.png" alt="vss arch diagram" width=1000>

---
### Part 0: Set Up the Environment

Make sure you have a VSS instance running before proceeding. If needed, adjust the vss_url parameter to point to your VSS instance. 

Two sample videos are also provided in the `assets` folder:
1) A video of a warehouse
2) A video of a synthetically generated traffic intersection

In [None]:
vss_url = "http://localhost:8100"
warehouse_video = "assets/warehouse.mp4"
traffic_video = "assets/traffic.mp4"

The next cell installs all the Python packages required for this notebook.

In [None]:
import sys 
python_exe = sys.executable
!{python_exe} -m pip install -r requirements.txt 

Let's also import the following libraries.

In [None]:
import json
import requests
from IPython.display import Markdown, Video, display
import time

---
### Part 1: VSS API Overview

VSS provides REST API endpoints that are used to interact with the blueprint. These endpoints are the integration point of VSS into your own application or service. 

The APIs include the following:
- Alerts
- Files
- Health Check
- Live Stream
- Metrics
- Models
- Recommended Configs
- Summarization and Q&A

You can also refer to the [API Glossary in the documentation](https://docs.nvidia.com/vss/latest/content/API_doc.html#vss-api-glossary) for more details on how to use each API.

<img src="assets/swagger_docs.png" alt="VSS Swagger" width=1000>

For the REST API endpoints, a 200 status code is returned on success and the response is in JSON format. The following helper function is defined to verify the request responses and help debug if any errors occur. 

In [None]:
#helper function to verify responses 
def check_response(response, text=False):
    print(f"Response Code: {response.status_code}")
    if response.status_code == 200:
        print("Response Status: Success")
        if text:
            print(response.text)
            return response.text
        else:
            print(json.dumps(response.json(), indent=4))
            return response.json()
    else:
        print("Response Status: Error")
        print(response.text)
        return None 

We will be exploring the endpoints below:

In [None]:
files_endpoint = vss_url + "/files" #upload and manage files
summarize_endpoint = vss_url + "/summarize" #summarize uploaded content 
health_endpoint = vss_url + "/health/ready" #check the status of the VSS server
models_endpoint = vss_url + "/models" #view the configured model in VSS

Let's use the health endpoint to verify your VSS instance is running. It should return a 200 status code. 

<div class="alert alert-block alert-info">
        <b>Note:</b>  VSS takes about 10 minutes to boot up after the server has started because it downloads and sets up the required models. Make sure the following cell outputs "Response Code: 200" before proceeding.
    </div>

In [None]:
try:
    resp = requests.get(vss_url + "/health/ready")
    resp = check_response(resp, text=True)
except Exception as e:
    print(f'Server not ready: {e}')
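Because startup can take several minutes, it may be more convenient to poll the health endpoint than to re-run the cell by hand. The following is a minimal sketch of such a loop. The helper name `wait_until_ready` and its default timeout and interval are assumptions made for illustration and are not part of VSS. The readiness check is passed in as a callable, so the helper itself makes no network calls.

```python
import time


def wait_until_ready(probe, timeout_s=900.0, interval_s=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses.

    probe is any zero-argument callable that returns True when the
    server is ready. Exceptions raised by probe (for example a refused
    connection while the server boots) are treated as "not ready".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception as exc:
            print(f"Not ready yet: {exc}")
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In this notebook the probe could be `lambda: requests.get(health_endpoint, timeout=5).status_code == 200`, which reuses the `health_endpoint` variable defined earlier.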

The models endpoint returns the VLM available for summarization requests. This is based on the startup configuration for VSS. The VLM can be one of the tightly integrated VLMs (VILA-1.5, NVILA) or a VLM with an OpenAI-compatible REST API interface, such as GPT-4o. 

In [None]:
try:
    resp = requests.get(vss_url + "/models")
    resp = check_response(resp)
    configured_vlm = resp["data"][0]["id"]
except Exception as e:
    print(f'Server not ready: {e}')

In [None]:
print(f"Configured VLM: {configured_vlm}")

---
### Part 2: Video Summarization 

This section shows how to upload a video file and request a simple summary from VSS for a synthetically generated two-minute video of a traffic intersection.

In [None]:
Video(traffic_video, width=1000, embed=True)

---
#### 2.1: File Uploading

The first step to summarize a video with VSS is to upload a video file through the REST APIs. Several endpoints are available to interact with the files. 

<img src="assets/file_endpoints.png" alt="file endpoints" width=1000>


To send a request with the video file, it should be opened with "rb" to get the binary content of the file. Then, we can add it as a file in the body of the request. The request should also specify the ```purpose``` as "vision" and ```media_type``` as "video". A single image could also be uploaded with media_type "image" to summarize a single image file. 

This request can then be posted to the ```/files``` endpoint as a multipart form. 

In [None]:
with open(traffic_video, "rb") as file:
    files = {"file": ("traffic_video", file)} #provide the file content along with a file name 
    data = {"purpose":"vision", "media_type":"video"}
    response = requests.post(files_endpoint, data=data, files=files) #post file upload request 
response = check_response(response)
video_id = response["id"] #save file ID for summarization request

Once posted, a unique ID will be returned that is used to reference this uploaded file in the summarization request. 

To view all the uploaded files, send a GET request to the ```/files``` endpoint. 

In [None]:
resp = requests.get(files_endpoint, params={"purpose":"vision"})
resp = check_response(resp)
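The upload request described above uses `media_type` "video" for video files and "image" for single images. As a small sketch, the form fields could be derived from the file extension with the standard-library `mimetypes` module. The helper name `upload_form_fields` is hypothetical, and the set of extensions VSS actually accepts is not covered here.

```python
import mimetypes


def upload_form_fields(path):
    """Build the form fields for a /files upload from a file path.

    The media type is guessed from the file extension only; the file
    itself is not opened or validated.
    """
    mime, _ = mimetypes.guess_type(path)
    top_level = (mime or "").split("/")[0]
    if top_level not in ("video", "image"):
        raise ValueError(f"Cannot infer a video or image media type for {path!r}")
    return {"purpose": "vision", "media_type": top_level}


print(upload_form_fields("assets/traffic.mp4"))
```

The returned dictionary can be passed as the `data` argument of `requests.post`, in the same way as the hand-written `data` dictionary in the upload cell.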

---
#### 2.2: Summarize

Once a video or image has been uploaded, the ```/summarize``` endpoint can be called to generate a summary. 

<img src="assets/summarize_endpoint.png" alt="summarize endpoint" width=1000>


In the body of the request, the video ID should be included along with the prompt and model options. We will explore all these options later in the notebook.

In [None]:
body = {
    "id": video_id, #id of file returned after upload 
    "prompt": "Write a caption based on the video clip.",
    "caption_summarization_prompt": "Combine sequential captions to create more concise descriptions.",
    "summary_aggregation_prompt": "Write a summary of the video. ",
    "model": configured_vlm,
    "max_tokens": 1024,
    "temperature": 0.8,
    "top_p": 0.8,
    "chunk_duration": 5,
    "chunk_overlap_duration": 0
}

The body can then be posted to the ```/summarize``` endpoint to start the summarization process. 
<div class="alert alert-block alert-info">
        <b>Note:</b>  Depending on the video length and configuration options this request will take some time to return. 
    </div>

In [None]:
response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
generic_summary = response["choices"][0]["message"]["content"]
summary_id = response["id"] #save to inspect later

The request response includes the summary output, some metadata and a unique request ID. The summary is returned in a format similar to the OpenAI API specification. You can extract the summary from the first message in the choices list: ```response["choices"][0]["message"]["content"]```. Run the next cell to render the summary output in the notebook. 

In [None]:
display(Markdown("### Summary Output")) 
markdown_string = "\n".join(f"> {line}" for line in generic_summary.splitlines())
display(Markdown(markdown_string)) #render summary output as markdown
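Indexing straight into `response["choices"][0]["message"]["content"]` fails with an unhelpful error when the request did not succeed and `check_response` returned `None`. The sketch below wraps the same lookup in a helper that raises a clearer error. The function names `extract_summary` and `as_blockquote` are illustrative and not part of any VSS client.

```python
def extract_summary(response_json):
    """Return the summary text from a parsed /summarize response.

    Raises ValueError if the response does not have the expected
    choices -> message -> content structure.
    """
    try:
        return response_json["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"Unexpected summarize response: {response_json!r}") from exc


def as_blockquote(text):
    """Prefix every line with '> ' so Markdown renders it as a quote."""
    return "\n".join(f"> {line}" for line in text.splitlines())


sample = {"choices": [{"message": {"content": "Line one\nLine two"}}]}
print(as_blockquote(extract_summary(sample)))
```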

The summary output is very short and does not capture many details or any timestamp information. This is because the parameters provided in the summarization request were very generic. To improve the response, the ```prompt```, ```caption_summarization_prompt```, ```summary_aggregation_prompt```, ```chunk_duration```, ```chunk_overlap_duration```, ```temperature``` and ```top_p``` can all be tuned. To understand how to best configure the summarization request, we need to dive into how the summarization pipeline works. 

---

####  2.3: Summarization Pipeline and CA-RAG

Summarization is a multi-stage pipeline that involves a series of VLM and LLM calls to understand the contents of the input video. This pipeline is GPU accelerated and can run with optimized VLM and LLM NIMs. To produce an informative summary, the LLM is augmented with details of the video generated by the VLM; this augmentation, together with a vector database, makes up the Context-Aware RAG (CA-RAG) module of VSS. 

When a summarization request is posted, the input video is first split into many smaller segments or 'chunks'. The size of these chunks is configured by the ```chunk_duration``` parameter. Typical sizes for a chunk are between 10 and 60 seconds. 

Each video chunk is processed in parallel by a VLM. The VLM will inspect a chunk of video by taking in a few frames sampled from the video segment and then produce a text description of what occurred in the chunk. The text description output of the VLM can be influenced by the ```prompt``` parameter. 

<img alt="summarization diagram" src="assets/summarization_diagram.png" width=1000>


A batch of dense VLM captions is then provided to an LLM along with the ```caption_summarization_prompt``` to condense the captions and remove repetitive information. The diagram shows this step with a batch size of 2. The batch size is configurable in the CA-RAG configuration YAML file when VSS is launched. 

The final step in summarization is a single LLM call that takes in the condensed captions and produces the final summary output. The generation of this summary is controlled by the ```summary_aggregation_prompt``` parameter. 
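The three stages described above can be sketched in a few lines of Python. This is a conceptual illustration only and not the VSS implementation. The functions `summarize_pipeline`, `batched`, `fake_vlm` and `fake_llm` are hypothetical stand-ins, the models are replaced by simple string functions, and the captions are produced sequentially rather than in parallel.

```python
def batched(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def summarize_pipeline(chunks, vlm, llm, prompt,
                       caption_summarization_prompt,
                       summary_aggregation_prompt, batch_size=2):
    # Stage 1: one VLM caption per video chunk.
    captions = [vlm(prompt, chunk) for chunk in chunks]
    # Stage 2: the LLM condenses each batch of captions.
    condensed = [llm(caption_summarization_prompt, "\n".join(batch))
                 for batch in batched(captions, batch_size)]
    # Stage 3: a single LLM call produces the final summary.
    return llm(summary_aggregation_prompt, "\n".join(condensed))


# Stand-in models so the sketch runs without a GPU or a VSS server.
def fake_vlm(prompt, chunk):
    return f"caption for {chunk}"


def fake_llm(prompt, text):
    return f"[{prompt}] " + text.replace("\n", " | ")


print(summarize_pipeline(["chunk0", "chunk1", "chunk2", "chunk3"],
                         fake_vlm, fake_llm,
                         "describe", "condense", "summarize"))
```

With four chunks and a batch size of 2, the LLM stand-in is called three times: twice to condense captions and once to aggregate the final summary.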

---
### Part 3: Summarization Configurations - Intelligent Traffic System

This section will walk through the available configuration options and how they can be tuned to turn VSS into an intelligent traffic system capable of producing traffic reports. 

Several options are available to tune the summary output. The most important are the prompts supplied to the VLM and LLM. 
These prompts can be supplied through the configuration file as defaults or passed directly to the summarize endpoint when making the request. 

<img alt="summarization prompts" src="assets/summarization_prompts_diagram.png" width=1000>

#### 3.1: Prompts

A set of three prompts is used to control summary generation through the three stages shown in the diagram. The following sections show how to improve upon the generic prompts used earlier to produce a more informative summary for traffic-related use cases. 

##### Prompt (VLM)

The VLM prompt must include enough information for the model to know what it should look for in the video. If the summary is missing important details, it is likely because the VLM did not extract those details from the video chunks in the first place. 

Often what works well is a three part prompt: 

1) Persona 
2) Details
3) Format

For example: 

> "You are an intelligent traffic system. You must monitor and take note of all traffic related events. Start each event description with a start and end time stamp."


Giving the VLM the persona of an intelligent traffic system orients its responses to include the details needed to generate a traffic report. We can then add to the prompt any specific details it should look for, such as traffic events. Finally, we often want the summary report to include timestamp information, so we must tell the VLM to include timestamps in its descriptions. 

When VSS chunks the video and provides sampled frames from a chunk to the VLM, it also includes timestamps so the model knows where in the video each frame occurred. The model can then use this timestamp information in its output to indicate when events occurred in the video. 

With this more specific prompt, the VLM will produce more detailed captions with relevant information which is critical to getting a good summary. 

In [None]:
prompt = "You are an intelligent traffic system. You must monitor and take note of all traffic related events. Start each event description with a start and end time stamp."
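The three-part structure (persona, details, format) can also be assembled programmatically, which makes it easier to swap a single part when experimenting. The helper `build_vlm_prompt` below is a hypothetical convenience function and not part of VSS. It reproduces the prompt string used in this section.

```python
def build_vlm_prompt(persona, details, output_format):
    """Join the persona, details and format parts into one prompt string."""
    parts = [persona, details, output_format]
    return " ".join(p.strip() for p in parts if p and p.strip())


example_prompt = build_vlm_prompt(
    persona="You are an intelligent traffic system.",
    details="You must monitor and take note of all traffic related events.",
    output_format="Start each event description with a start and end time stamp.",
)
print(example_prompt)
```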

##### Caption Summarization Prompt (LLM) 

Often, the text descriptions generated by the VLM are repetitive across sequential or overlapping chunks. For very long videos, this can waste tokens when producing the final summary. To condense the VLM captions, an LLM combines the VLM outputs into a more concise description over a batch of chunks.

This prompt can typically stay the same across use cases as it just needs to instruct the LLM to combine similar descriptions together. 

For example:
>"You will be given captions from sequential clips of a video. Aggregate captions in the format start_time:end_time:caption based on whether captions are related to one another or create a continuous scene"

In [None]:
caption_summarization_prompt = "You will be given captions from sequential clips of a video. Aggregate captions in the format start_time:end_time:caption based on whether captions are related to one another or create a continuous scene"

##### Summary Aggregation Prompt (LLM)

The summary aggregation prompt is used to generate the final summary returned by the summarization endpoint. It is used in a single LLM call along with all the aggregated captions to generate the summary output. 

This prompt should reiterate what details need to be included in the summary and any formatting options. Keep in mind that at this stage, the summary can only include details that are present in the aggregated captions produced in the previous stage.

For example:
>"Based on the available information, generate a traffic report that is organized chronologically and in logical sections. Give each section a descriptive heading of what occurs and the time range. This should be a concise, yet descriptive summary of all the important events. The format should be intuitive and easy for a user to read and understand what happened. Format the output in Markdown so it can be displayed nicely."

In [None]:
summary_aggregation_prompt = "Based on the available information, generate a traffic report that is organized chronologically and in logical sections. Give each section a descriptive heading of what occurs and the time range. This should be a concise, yet descriptive summary of all the important events. The format should be intuitive and easy for a user to read and understand what happened. Format the output in Markdown so it can be displayed nicely."

In [None]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": configured_vlm,
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0
}

response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
summary = response["choices"][0]["message"]["content"]


Run the following cells to render the summary outputs of the generic prompts and the tuned prompts side by side. 

In [None]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Generic Prompts </h1>
    \n{generic_summary}
  </div>
  <div style="flex: 1;">
  <h1> Tuned Prompts </h1>
    \n{summary}
  </div>
</div>
"""

In [None]:
Markdown(markdown_string)

The summary output with the tuned prompts should be significantly more informative than the summary generated with the generic prompts. It should include much more detailed information about the traffic video, timestamps of relevant events, and be in an easy-to-read format. 
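The side-by-side layout is built several times in this notebook, so it can be convenient to wrap it in a function. The following sketch uses a hypothetical helper named `side_by_side` that returns the same two-column structure as a string. Blank lines are placed around each body, mirroring the newline the notebook places before each summary.

```python
def side_by_side(left_title, left_body, right_title, right_body):
    """Return an HTML/Markdown string showing two texts in two columns."""
    def column(title, body):
        # Blank lines separate the HTML heading from the Markdown body.
        return (f'  <div style="flex: 1;">\n'
                f'  <h1> {title} </h1>\n\n{body}\n\n  </div>')

    return "\n".join([
        '<div style="display: flex; gap: 20px;">',
        column(left_title, left_body),
        column(right_title, right_body),
        "</div>",
    ])


print(side_by_side("Generic Prompts", "short summary",
                   "Tuned Prompts", "detailed summary"))
```

In the notebook, the returned string can be passed to `Markdown(...)` exactly like the hand-written `markdown_string`.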

---
#### 3.2: Chunk Duration

In addition to the prompts, the ```chunk_duration``` (also known as chunk size) is very important to tune based on the use case. The chunk size determines the temporal granularity at which the VLM will view the video.

<img alt="chunk duration" src="assets/chunk_duration.png" width=1000>

The number of chunks and frames processed by the VLM can be calculated with the following: 

$ Number\ of\ Chunks = \frac{Video\ Length\ (s)}{Chunk\ Size\ (s)} $  <!-- Display-style math -->  
$ Processed\ Frames = Frames\ per\ Chunk * Number\ of\ Chunks $  <!-- Display-style math -->  
$ Processed\ Frames = Frames\ per\ Chunk * \frac{Video\ Length\ (s)}{Chunk\ Size\ (s)}  $  <!-- Display-style math -->  


Now let's plug in some real numbers. 

Video Length = 2 minutes (120 seconds)   
Chunk Size = 5 seconds   
Frames per Chunk = 10 (default value in VSS)   

$ Number\ of\ Chunks = \frac{120}{5} = 24\ Chunks $  <!-- Display-style math -->  
$ Processed\ Frames = 10 * 24 = 240\ Frames  $  <!-- Display-style math -->  


If the chunk size is adjusted to 30 seconds: 

$ Number\ of\ Chunks = \frac{120}{30} = 4\ Chunks $  <!-- Display-style math -->  
$ Processed\ Frames = 10 * 4 = 40\ Frames  $  <!-- Display-style math -->  
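The same arithmetic can be checked with a few lines of Python. This is a small sketch: the function name `chunk_stats` is made up for illustration, and rounding a trailing partial chunk up to a full chunk is an assumption that goes beyond the formula above. Chunk overlap is ignored.

```python
import math


def chunk_stats(video_length_s, chunk_size_s, frames_per_chunk=10):
    """Return (number_of_chunks, processed_frames) for a video."""
    if video_length_s <= 0 or chunk_size_s <= 0:
        raise ValueError("video length and chunk size must be positive")
    # Assumption: a trailing partial chunk counts as one more chunk.
    n_chunks = math.ceil(video_length_s / chunk_size_s)
    return n_chunks, n_chunks * frames_per_chunk


for size in (5, 30):
    chunks, frames = chunk_stats(120, size)
    print(f"chunk size {size}s: {chunks} chunks, {frames} frames")
```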



Based on the formula, a smaller chunk size results in more frames from the video being processed and used in summary generation. A larger chunk size results in fewer frames being used to generate the summary. 

With fewer frames (longer chunk size), summary generation is faster; however, finer details or quick events may be missed because the VLM did not see as many frames from the video. 

With more frames (shorter chunk size), summary generation is slower but includes more details and is more likely to catch quick events such as a worker dropping a box. 

The optimal chunk size depends on the use case and must be tuned to find the right balance between processing time and temporal resolution of the summary. 

Additionally, a ```chunk_overlap_duration``` can also be added to the summarization request to configure overlap between chunks. This can help capture events that may occur at chunk boundaries. 
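To build intuition for how overlap affects chunk boundaries, the sketch below lists chunk windows under one plausible interpretation: each chunk has length `chunk_duration`, and each new chunk starts `chunk_duration - chunk_overlap_duration` seconds after the previous one. This windowing rule is an assumption for illustration. The exact scheme used by VSS may differ, and the function `chunk_windows` is hypothetical.

```python
def chunk_windows(video_length_s, chunk_duration_s, overlap_s=0.0):
    """List (start, end) windows in seconds under the assumed rule."""
    if chunk_duration_s <= 0:
        raise ValueError("chunk duration must be positive")
    if not 0 <= overlap_s < chunk_duration_s:
        raise ValueError("overlap must be >= 0 and smaller than the chunk duration")
    step = chunk_duration_s - overlap_s
    windows = []
    start = 0.0
    while start < video_length_s:
        end = min(start + chunk_duration_s, video_length_s)
        windows.append((start, end))
        if end >= video_length_s:
            break  # the last window already reaches the end of the video
        start += step
    return windows


for window in chunk_windows(120, 30, 5):
    print(window)
```

Under this assumption, an event that straddles the 30-second mark falls entirely inside the second window, which begins at 25 seconds.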

The following cells will produce a side-by-side comparison of a summary with a chunk size of 30 compared to a chunk size of 5. 

In [None]:
prompt = "You are an intelligent traffic monitoring system that will be given a clip from a camera overlooking a four way intersection. You must inspect the clip and write a detailed description of what occurs. End each sentence with a timestamp."
caption_summarization_prompt = "If any descriptions have the same meaning and are sequential then combine them under one sentence and merge the time stamps to a range. Format the timestamps as 'mm:ss'"
summary_aggregation_prompt = "Write out a detailed time line based on the descriptions. The output should be a bulleted list in the format 'mm:ss-mm:ss Description' that includes the timestamp and description of what occurred."

In [None]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": configured_vlm,
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "chunk_duration": 30,
    "chunk_overlap_duration": 5
}

start_t = time.time() 
response = requests.post(summarize_endpoint, json=body)
summary_30_time = time.time() - start_t 
response = check_response(response)
summary_30 = response["choices"][0]["message"]["content"]

In [None]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": configured_vlm,
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "chunk_duration": 5,
    "chunk_overlap_duration": 1
}

start_t = time.time() 
response = requests.post(summarize_endpoint, json=body)
summary_5_time = time.time() - start_t
response = check_response(response)
summary_5 = response["choices"][0]["message"]["content"]

In [None]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Chunk Size: 30 seconds </h1>
  <h1> Generation time: {round(summary_30_time, 2)} seconds </h1>
    \n{summary_30}
  </div>
  <div style="flex: 1;">
  <h1> Chunk Size: 5 seconds </h1>
  <h1> Generation time: {round(summary_5_time, 2)} seconds </h1>
    \n{summary_5}
  </div>
</div>
"""

In [None]:
Markdown(markdown_string)

Of the two summaries, the one produced with a chunk size of 5 seconds should have many more details and finer-grained timestamp information; however, its generation time is higher.
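The timing pattern used in the two request cells (record a start time, make the call, subtract) can be factored into a small helper. The name `timed_call` is hypothetical. `time.perf_counter` is used here instead of `time.time` because it is intended for measuring short durations.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


total, elapsed = timed_call(sum, range(1000))
print(f"result={total}, elapsed={elapsed:.6f} seconds")
```

In the notebook this could wrap the request itself, for example `timed_call(requests.post, summarize_endpoint, json=body)`.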

---
#### 3.3: Model Parameters

The summarize API also accepts parameters to control the LLM during summary generation. The important ones to note are:
- max_tokens 
- temperature 
- top_p 

The ```max_tokens``` parameter controls the maximum length of the generated summary. If the summary is longer than ```max_tokens```, it will be cut off. If you find the summary is getting cut off, increase ```max_tokens```. If the summary is too verbose, reduce ```max_tokens```. 

The ```temperature``` and ```top_p``` parameters influence the probabilities used when choosing the next output token. A higher temperature flattens the distribution, so the likelihoods of all tokens become more equal. A higher ```top_p``` allows a wider variety of tokens to be chosen. This variety can be useful for generating new ideas, creative writing or varied outputs. However, it can also lead to outputs with hallucinations. 

For use cases where repeatability and low chance of hallucinations are important, a low ```temperature``` and low ```top_p``` should be used.   
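The effect of these two parameters can be illustrated on a toy distribution. The sketch below applies temperature scaling to three made-up logits and then keeps the smallest set of tokens whose cumulative probability reaches `top_p` (often called nucleus sampling). It illustrates the general technique only. How a particular model server applies these parameters internally is not covered here, and both function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Indices of the most likely tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature, top_p in [(0.1, 0.1), (0.9, 0.9)]:
    probs = softmax_with_temperature(toy_logits, temperature)
    kept = top_p_filter(probs, top_p)
    print(temperature, top_p, [round(p, 3) for p in probs], kept)
```

With the low settings, nearly all probability mass sits on the first token and only that token survives the filter. With the high settings, the distribution is flatter and more than one token remains a candidate.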

The following cells will compare summaries with high and low values of ```temperature``` and ```top_p```. 

In [None]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": configured_vlm,
    "max_tokens": 1024,
    "temperature": 0.9,
    "top_p": 0.9,
    "chunk_duration": 20
}

response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
summary_high_t = response["choices"][0]["message"]["content"]

In [None]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": configured_vlm,
    "max_tokens": 1024,
    "temperature": 0.1,
    "top_p": 0.1,
    "chunk_duration": 20
} 
response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
summary_low_t = response["choices"][0]["message"]["content"]

In [None]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Low Temperature, Low Top P</h1>
    \n{summary_low_t}
  </div>
  <div style="flex: 1;">
  <h1> High Temperature, High Top P </h1>
    \n{summary_high_t}
  </div>
</div>
"""
Markdown(markdown_string)

There may be little difference in the quality of the outputs when adjusting `temperature` and `top_p`. Inspect the side-by-side comparison above and see if you can spot any differences in the output. Generally, a value of 0.2 for both `temperature` and `top_p` will provide good quality outputs for summarization use cases. In practice, the prompts and chunk size will have a larger effect on the quality of output compared to `temperature` and `top_p`. 
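The request bodies used throughout this section differ in only a few fields, so a small builder with defaults can reduce repetition when experimenting. This is a sketch: `build_summarize_body` and `DEFAULT_OPTIONS` are hypothetical names, and the default values are simply the ones used in the tuned-prompt request earlier in this notebook, not defaults defined by VSS.

```python
DEFAULT_OPTIONS = {
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0,
}


def build_summarize_body(video_id, model, prompt,
                         caption_summarization_prompt,
                         summary_aggregation_prompt, **overrides):
    """Return a /summarize request body with optional overrides."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown options: {sorted(unknown)}")
    body = {
        "id": video_id,
        "prompt": prompt,
        "caption_summarization_prompt": caption_summarization_prompt,
        "summary_aggregation_prompt": summary_aggregation_prompt,
        "model": model,
    }
    body.update(DEFAULT_OPTIONS)
    body.update(overrides)
    return body


example_body = build_summarize_body(
    "example-file-id", "example-model", "p", "c", "s",
    chunk_duration=5, chunk_overlap_duration=1,
)
print(example_body)
```

Rejecting unknown option names catches typos such as `chunk_size` before the request is sent.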

---
### Review

In this notebook you learned the following:
- How to use VSS to summarize a video
- How to adjust prompts to affect summarization quality
- How to adjust parameters to affect summarization quality

In Lab 2, we'll kick things up a notch by implementing a question-and-answer system.