# Introduction To Visual Search and Summarization Blueprint

In this notebook, we are going to use camera footage from a warehouse to check that safety policies are being followed.

## Environment Setup

These are some basic imports, settings, and helper functions. No need to make any changes, but feel free to take a look!

In [1]:
# python imports
from IPython.display import Markdown, Video
from pathlib import Path
import requests
from typing import Literal, Any
from urllib.parse import urljoin

In [2]:
# paths and locations
VSS_URL = "http://via-server:8100"
WAREHOUSE_VIDEO = Path("./warehouse.mp4")

In [3]:
# helper functions for talking to the API
def check_response(response):
    """Return the proper response."""
    print(f"Response Code: {response.status_code}")
    response.raise_for_status()
    try:
        return response.json()
    except requests.exceptions.JSONDecodeError:
        return response.text

def vss_api_call(
        target: str,
        params: None | dict[str, str] = None,
        verb: Literal["get", "post"] = "get",
        json: Any = None,
        files: Any = None,
        data: Any = None,
    ) -> dict[str, Any] | str:
    """Query the VIA Agent API and handle the response."""
    # determine the full url
    url = urljoin(VSS_URL, target)

    # query the api
    if verb == "get":
        response = requests.get(url, params=params)
    elif verb == "post":
        response = requests.post(url, params=params, json=json, data=data, files=files)
    else:
        raise ValueError("Verb must be either get or post.")

    return check_response(response)
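
One detail of `vss_api_call` is worth knowing: `urljoin` resolves `target` relative to `VSS_URL`, so the result depends on whether the base URL has a path and a trailing slash. The sketch below uses only the standard library. The `/api` prefix is a hypothetical example and is not part of the workshop setup.

```python
from urllib.parse import urljoin

# Base URL without a path (as in this notebook): the target is appended.
no_path = urljoin("http://via-server:8100", "files")
print(no_path)  # http://via-server:8100/files

# Hypothetical base URL with a path but no trailing slash:
# the last path segment ("api") is replaced by the target.
no_slash = urljoin("http://via-server:8100/api", "files")
print(no_slash)  # http://via-server:8100/files

# With a trailing slash, the path prefix is preserved.
with_slash = urljoin("http://via-server:8100/api/", "files")
print(with_slash)  # http://via-server:8100/api/files
```

If the server ever sits behind a path prefix, ending the base URL with a slash keeps that prefix in every request.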

## API Overview

VSS provides a REST API that operates the blueprint. 
This API is the integration point for your own custom applications.
You can browse the API documentation at http://HOSTNAME:8100/docs. The API actions are split into categories. 

Today, we are going to focus on:
- Health Check
- Models
- Files
- Summarization
- Alerts


### Health Checks API

Let's start by checking out the health checks. Typically, these are called by your deployment system to check the status of the server, but we can check them here using `curl`.

There are two endpoints: a liveness probe and a readiness probe. A liveness probe tells us if the service is working. The readiness probe tells us if the service is ready to handle requests. These probes will return a `200` status if they are healthy and an error status code (`4XX` or `5XX`) if the probe is not healthy.
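
The same rule (a `200` means healthy, anything else does not) can be sketched in Python. This is a minimal illustration that uses only the standard library instead of `requests`. The helper names `probe_status` and `is_healthy` are made up for this example, and treating an unreachable server as unhealthy is an assumption.

```python
from typing import Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def probe_status(base_url: str, probe: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code of a health probe, or None if unreachable."""
    url = f"{base_url}/health/{probe}"
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # 4XX and 5XX responses are raised as HTTPError; keep the status code.
        return err.code
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return None


def is_healthy(status_code: Optional[int]) -> bool:
    """A probe counts as healthy only when it answers with status 200."""
    return status_code == 200


# Example usage against the workshop server (not executed here):
#   for probe in ("live", "ready"):
#       print(probe, is_healthy(probe_status("http://via-server:8100", probe)))
```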

In [4]:
# Liveness Probe
# Is the service running?
! curl -i {VSS_URL}/health/live

HTTP/1.1 200 OK
date: Wed, 14 May 2025 18:12:25 GMT
server: uvicorn
content-length: 0



In [5]:
# Readiness Probe
# Is the service ready for requests?
! curl -i {VSS_URL}/health/ready

HTTP/1.1 200 OK
date: Wed, 14 May 2025 18:12:28 GMT
server: uvicorn
content-length: 0



### Models API

The models endpoint returns the LLM available for summarization requests. This is based on the startup configuration for VSS. This LLM could be configured to point to any OpenAI-compatible LLM.

In [6]:
models_response = vss_api_call("models")

models_response

Response Code: 200


{'object': 'list',
 'data': [{'id': 'nvila',
   'created': 1747203075,
   'object': 'model',
   'owned_by': 'NVIDIA',
   'api_type': 'internal'}]}
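
Later cells read the model ID with `models_response["data"][0]["id"]`. A slightly more defensive version might look like the sketch below. `pick_model_id` is a hypothetical helper, and the sample dictionary simply repeats the response shown above.

```python
from typing import Any, Optional


def pick_model_id(models_response: dict[str, Any], preferred: Optional[str] = None) -> str:
    """Return the preferred model ID if it is listed, else the first one."""
    ids = [model["id"] for model in models_response.get("data", [])]
    if not ids:
        raise ValueError("The models response contains no models.")
    if preferred is None:
        return ids[0]
    if preferred not in ids:
        raise ValueError(f"Model {preferred!r} is not available; found {ids}.")
    return preferred


# Same shape as the response above.
sample_models = {
    "object": "list",
    "data": [{"id": "nvila", "object": "model", "owned_by": "NVIDIA"}],
}
print(pick_model_id(sample_models))  # nvila
```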

### Files API

To start processing a video, it must first be uploaded to VSS. 
Several endpoints are available to interact with the files. 

For our first example, we will be using the `WAREHOUSE_VIDEO`.

In [7]:
Video(WAREHOUSE_VIDEO)

We can upload new videos using the `files` endpoint and the `post` verb.

This endpoint requires three inputs:
- the file contents
- the file's purpose
- the file's type

When uploading a video for processing, the purpose is `vision` and the type is `video`.

A single image can also be uploaded for summarization by setting `media_type` to `image`.
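
As a rough sketch, the form fields for an upload could be derived from the file name. The helpers `media_type_for` and `build_upload_data` are hypothetical, and guessing the type from the file extension with the standard `mimetypes` module is an assumption of this example, not something the API requires.

```python
import mimetypes

# The two media types mentioned in this notebook.
ALLOWED_MEDIA_TYPES = ("video", "image")


def media_type_for(filename: str) -> str:
    """Guess 'video' or 'image' from the file extension."""
    guessed, _encoding = mimetypes.guess_type(filename)
    top_level = (guessed or "").split("/")[0]
    if top_level not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"Cannot map {filename!r} to one of {ALLOWED_MEDIA_TYPES}.")
    return top_level


def build_upload_data(filename: str, purpose: str = "vision") -> dict[str, str]:
    """Build the form fields sent alongside the file contents."""
    return {"purpose": purpose, "media_type": media_type_for(filename)}


print(build_upload_data("warehouse.mp4"))
```

The returned dictionary has the same keys as the `data` argument used in the upload cell.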

In [8]:
with WAREHOUSE_VIDEO.open("rb") as file:
    # setup query data
    files = {"file": (WAREHOUSE_VIDEO.name, file)}
    data = {"purpose":"vision", "media_type":"video"}

    # query the vss api
    upload_response = vss_api_call(
        "files",
        verb="post",
        files=files,
        data=data
    )

upload_response

Response Code: 200


{'id': 'ba77324d-8a2f-4506-80ea-5304dcdfd6ff',
 'bytes': 156822870,
 'filename': 'warehouse.mp4',
 'purpose': 'vision',
 'media_type': 'video'}

Check out the unique id in the response. That is how we will reference this file going forward.

We can also list all of the available files.

In [9]:
vss_api_call("files", params={"purpose":"vision"})

Response Code: 200


{'data': [{'id': '98dfa079-e22e-4dbd-9ccb-8160463d2745',
   'bytes': 8141473,
   'filename': 'its.mp4',
   'purpose': 'vision',
   'media_type': 'video'},
  {'id': '70c97577-95b5-4b24-aec7-6ece51ce59f4',
   'bytes': 8141473,
   'filename': 'its.mp4',
   'purpose': 'vision',
   'media_type': 'video'},
  {'id': '7234f84f-7063-4627-a362-fc49584dc8c6',
   'bytes': 156822870,
   'filename': 'warehouse.mp4',
   'purpose': 'vision',
   'media_type': 'video'},
  {'id': '2d1557b8-7b74-408a-9bc9-22957f95280e',
   'bytes': 156822870,
   'filename': 'warehouse.mp4',
   'purpose': 'vision',
   'media_type': 'video'},
  {'id': '60d5fb0e-022d-485e-8b20-ca5d90fa0a0f',
   'bytes': 156822870,
   'filename': 'warehouse.mp4',
   'purpose': 'vision',
   'media_type': 'video'},
  {'id': '585b0e46-5ace-4045-802f-eb5aa87ec7f6',
   'bytes': 156822870,
   'filename': 'warehouse.mp4',
   'purpose': 'vision',
   'media_type': 'video'},
  {'id': '054530dd-bc28-40dc-8e4d-7052103a58cf',
   'bytes': 156822870,
   'fi
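
The listing above suggests that each upload of the same file creates a new entry with its own id. A small sketch for grouping ids by file name follows. `ids_by_filename` is a hypothetical helper, and the sample data repeats the first three entries of the listing.

```python
from collections import defaultdict
from typing import Any


def ids_by_filename(listing: dict[str, Any]) -> dict[str, list[str]]:
    """Group file ids by file name, keeping the order of the listing."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in listing.get("data", []):
        grouped[entry["filename"]].append(entry["id"])
    return dict(grouped)


# First three entries from the listing above.
sample_listing = {
    "data": [
        {"id": "98dfa079-e22e-4dbd-9ccb-8160463d2745", "filename": "its.mp4"},
        {"id": "70c97577-95b5-4b24-aec7-6ece51ce59f4", "filename": "its.mp4"},
        {"id": "7234f84f-7063-4627-a362-fc49584dc8c6", "filename": "warehouse.mp4"},
    ]
}
for name, ids in ids_by_filename(sample_listing).items():
    print(name, len(ids))
```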

### Summarization APIs - Summarize

Once a video or image has been uploaded, the `/summarize` endpoint can be called to generate a summary.

In the body of the request, we will include the video ID and the model ID from above. There are also some prompt and model options. We will explore all these options later in the workshop.

In [10]:
body = {
    "id": upload_response.get("id"),
    "prompt": "Write a caption based on the video clip.",
    "caption_summarization_prompt": "Combine sequential captions to create more concise descriptions.",
    "summary_aggregation_prompt": "Write a summary of the video. ",
    "model": models_response["data"][0]["id"],
    "max_tokens": 1024,
    "temperature": 0.8,
    "top_p": 0.8,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0
}

We can now request a summary. Depending on the video length and configuration options, this request may take some time to return.

In [11]:
summarize_response = vss_api_call("summarize", verb="post", json=body)

summarize_response

Response Code: 200


{'id': 'd8ac132e-cd39-455d-86c0-62deee7f50a3',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'message': {'content': 'The video appears to show a worker in a warehouse performing various tasks. The worker is seen picking up a box, driving a forklift, and walking through the warehouse while wearing a yellow vest. Additionally, a man in a yellow vest is shown setting up yellow caution tape and later walking through the warehouse, holding a yellow rope.',
    'tool_calls': [],
    'role': 'assistant'}}],
 'created': 1747246368,
 'model': 'nvila',
 'media_info': {'type': 'offset', 'start_offset': 0, 'end_offset': 210},
 'object': 'summarization.completion',
 'usage': {'query_processing_time': 14, 'total_chunks_processed': 11}}
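
The `usage` field reports 11 chunks, which is consistent with a 210-second video split into 20-second chunks: 210 / 20 = 10.5, rounded up to 11. The sketch below reproduces that arithmetic. `estimate_chunk_count` is a hypothetical helper, and the way it handles a non-zero overlap (each chunk starts `chunk_duration - chunk_overlap_duration` seconds after the previous one) is an assumption rather than documented VSS behavior.

```python
import math


def estimate_chunk_count(
    duration: float,
    chunk_duration: float,
    chunk_overlap_duration: float = 0,
) -> int:
    """Estimate how many chunks a video of the given length is split into."""
    # Assumed stride between the start times of consecutive chunks.
    stride = chunk_duration - chunk_overlap_duration
    if chunk_duration <= 0 or stride <= 0:
        raise ValueError("chunk_duration must be positive and larger than the overlap.")
    if duration <= chunk_duration:
        return 1
    return math.ceil((duration - chunk_overlap_duration) / stride)


# Values from the request body and response above.
print(estimate_chunk_count(210, 20, 0))  # 11
```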

Let's check out the summary we generated:

In [13]:
Markdown("> " + summarize_response["choices"][0]["message"]["content"])

> The video appears to show a worker in a warehouse performing various tasks. The worker is seen picking up a box, driving a forklift, and walking through the warehouse while wearing a yellow vest. Additionally, a man in a yellow vest is shown setting up yellow caution tape and later walking through the warehouse, holding a yellow rope.

Even if everything went perfectly, that answer was pretty basic. Without much context about the situation and your intentions, the model struggles to extract meaning. In every situation, the summarization prompts will need to be tuned.

### Customizing Summarization Prompts 

The VSS Blueprint processes videos with a four-stage pipeline:
- **Chunking:** split videos into windows of a few seconds
- **Dense Captioning:** generate descriptions of each window
- **Aggregated Captioning:** aggregate dense captions to cover longer windows
- **Summarization:** generate a summary that describes all chunks

![Summarization Pipeline](summarization_diagram.png)

A set of three prompts controls summary generation through the three stages shown in the diagram.

`prompt` must include enough information so the model knows what it should be looking for in the video. If the summary is missing important details, it is likely because the VLM did not extract those details from the video chunks in the first place.

Often what works well is a three-part prompt:

1) Persona 
2) Details
3) Format

For example: 

In [24]:
prompt = (
    "You are a warehouse monitoring system. "
    "Describe the events you are seeing. "
    "In addition to general descriptions, look for: "
    "policy violations, equipment usage, and restricted areas. "
    "All employees must wear helmets and vests in the warehouse. "
    "Employees must not cross yellow safety tape.  "
    "Start each sentence with start and end timestamp of the event."
)
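
To make the persona, details, and format structure explicit, the same prompt could be assembled from its parts. This is only a sketch: `build_vlm_prompt` is a hypothetical helper and not part of VSS, and the result is stored in a separate variable so the `prompt` defined above is left untouched.

```python
def build_vlm_prompt(persona: str, details: list[str], output_format: str) -> str:
    """Join persona, detail sentences, and format instructions into one prompt."""
    parts = [persona, *details, output_format]
    return " ".join(part.strip() for part in parts if part.strip())


prompt_from_parts = build_vlm_prompt(
    persona="You are a warehouse monitoring system.",
    details=[
        "Describe the events you are seeing.",
        "In addition to general descriptions, look for: "
        "policy violations, equipment usage, and restricted areas.",
        "All employees must wear helmets and vests in the warehouse.",
        "Employees must not cross yellow safety tape.",
    ],
    output_format="Start each sentence with start and end timestamp of the event.",
)
print(prompt_from_parts)
```

Keeping the three parts separate makes it easier to swap the persona or the policy rules for another use case without touching the format instructions.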

Often, the text descriptions generated by the VLM can be repetitive in sequential or overlapping chunks. To condense the VLM captions efficiently, an LLM is used.

`caption_summarization_prompt` is the prompt used to instruct this aggregation. This prompt can typically stay the same across use cases as it just needs to instruct the LLM to combine similar descriptions together.

For example:

In [25]:
caption_summarization_prompt = (
    "You will be given captions from sequential clips of a video. "
    "Aggregate captions in the format start_time:end_time:caption "
    "based on whether captions are related to one another or "
    "create a continuous scene."
)

`summary_aggregation_prompt` is used to generate the final summary. It is used in a single LLM call along with all the aggregated captions to generate the summary output. 

This prompt should reiterate what details need to be included in the summary and any formatting options. Keep in mind that at this stage, the summary can only include details that are present in aggregated captions produced in the previous stage.

For example:

In [26]:
summary_aggregation_prompt = (
    "Based on the available information, "
    "generate a summary that captures the important events in the video. "
    "The summary should be organized chronologically and in logical sections. "
    "This should be a concise, yet descriptive summary of all the important events. "
    "The format should be intuitive and easy for a user to read and understand what happened. "
    "Format the output in Markdown so it can be displayed nicely."
)

Now that our prompts follow best practices, let's rerun the summarization with `enable_chat` turned on.

This instructs the VSS agent to also create a database of video details. We'll talk more about this in the next section. Until then, expect this run to take a bit longer than before.

In [59]:
body = {
    "id": upload_response.get("id"),
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": models_response["data"][0]["id"],
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "seed": 1,
    "num_frames_per_chunk": 5,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0,
    "enable_chat": True
}

summarize_response = vss_api_call("summarize", verb="post", json=body)

summarize_response

Response Code: 200


{'id': 'f2f6e2d0-be8e-4719-bf3f-002db034ebe5',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
    'tool_calls': [],
    'role': 'assistant'}}],
 'created': 1747250367,
 'model': 'nvila',
 'media_info': {'type': 'offset', 'start_offset': 0, 'end_offset': 210},
 'object': 'summarization.completion',
 'usage': {'query_processing_time': 45, 'total_chunks_processed': 11}}

In [61]:
Markdown(summarize_response["choices"][0]["message"]["content"])

**Warehouse Safety Policy Compliance**
=====================================

**Initial Compliance (0:20)**
---------------------------

* The warehouse is well-lit and organized with clear aisle markings and labels on shelves.
* A worker is wearing a helmet and high-visibility vest, complying with safety policy.
* The worker is using a forklift to move a pallet of boxes, following proper equipment usage.

**Policy Violations (20:40)**
---------------------------

* Multiple workers are seen walking down aisles without wearing helmets or vests, violating safety policy.
* Another worker is picking up a box without wearing a helmet or vest, further violating policy.

**Mixed Compliance (40:60)**
-------------------------

* A man is not wearing a helmet or vest, violating policy, but is carrying a box properly and not crossing yellow safety tape.
* This sequence of events continues with the man repeating the same actions, showing both policy violation and proper behavior.

**Forklift Operation (60:80)**
---------------------------

* A worker is operating a forklift, wearing a helmet and vest, complying with safety policy.
* The worker continues to operate the forklift, moving it in different directions, with no policy violations or restricted areas visible.

**Compliance and Safety Tape Usage (80:100)**
-----------------------------------------

* An employee is wearing a helmet and vest, complying with warehouse safety policy.
* The employee is not crossing yellow safety tape, also complying with policy.
* This sequence of events continues with the employee repeating the same actions, showing compliance with safety policies.

**Safety Tape Blockage and Compliance (100:120)**
---------------------------------------------

* The employee is wearing a helmet and vest, complying with warehouse safety policy.
* The employee is using yellow safety tape to block off an area, following common warehouse safety practices.
* The employee is not crossing the yellow safety tape, complying with policy.

**Intermittent Compliance (120:180)**
-----------------------------------

* Workers are seen wearing helmets and vests, complying with safety policy, but with some instances of non-compliance.
* An employee is seen crossing yellow safety tape, violating warehouse safety policy.

**Non-Compliance (180:200)**
-------------------------

* The employee is not wearing a helmet, violating safety policy.
* This non-compliance continues until the end of the observed time period.

Now that's more like it! This is much more useful information.
Already, we can identify policy violations, and we should be able to flag them in real time.
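
As a rough illustration of flagging, the Markdown summary can be scanned for bullets that report a violation. This is a naive keyword filter and only a sketch: `find_violations` is a hypothetical helper, and it assumes the summary keeps the bold-heading and `*` bullet layout shown above, which the model is not guaranteed to produce.

```python
import re

# Section headings in the summary look like "**Title (20:40)**".
HEADING = re.compile(r"^\*\*(.+)\*\*$")


def find_violations(summary_markdown: str) -> list[tuple[str, str]]:
    """Return (section, bullet) pairs for bullets that mention 'violating'."""
    violations = []
    section = ""
    for raw_line in summary_markdown.splitlines():
        line = raw_line.strip()
        heading = HEADING.match(line)
        if heading:
            section = heading.group(1)
        elif line.startswith("* ") and "violating" in line.lower():
            violations.append((section, line[2:]))
    return violations


# Short excerpt from the summary above.
sample_summary = """\
**Policy Violations (20:40)**
---------------------------

* Multiple workers are seen walking down aisles without wearing helmets or vests, violating safety policy.

**Forklift Operation (60:80)**
---------------------------

* The worker continues to operate the forklift, moving it in different directions, with no policy violations or restricted areas visible.
"""
violations = find_violations(sample_summary)
for section, bullet in violations:
    print(f"{section}: {bullet}")
```

Matching on "violating" rather than "violation" skips bullets such as "with no policy violations", but a keyword match remains fragile compared with asking the model for structured output.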

### Summarization APIs - Chat Completions

That summary was great, but there is so much more insight in the video that we haven't used.

Part of the blueprint's Summarization APIs is a chat completion endpoint. 
Chatting with your videos allows you to discover new information without modifying your existing pipelines.
Here is a very basic class that will interact with this endpoint.

In [62]:
class Chat:
    """A simple wrapper to chat with a video."""

    def __init__(self, video_id: str, model_id: str):
        """Initialize the class."""
        self.video_id = video_id
        self.model_id = model_id

    def query(self, msg: str):
        """Chat with the VSS agent."""
        # format the message
        messages = [{
            "content": msg,
            "role": "user"
        }]

        # generate request payload
        payload = {
            "id": self.video_id,
            "messages": messages,
            "model": self.model_id,
        }

        # query the vss agent
        response_data = vss_api_call(
            "chat/completions",
            verb="post",
            json=payload
        )
        answer = response_data["choices"][0]["message"]["content"]

        return answer

In [63]:
chat_client = Chat(
    video_id=upload_response.get("id"),
    model_id=models_response["data"][0]["id"]
)

Uh oh! 🫨 **Surprise Audit Today** 🫨

Today we will be doing a new audit of forklift usage.
There is a sign-out sheet for the forklift, and we want to ensure it is accurate.
Our sign-out sheet indicates one sign-out about a minute after this recording started.

Because this is a new audit, we have not collected any historical data.
However, we can chat with our video library.

In [64]:
chat_client.query("Did the forklift appear?")

Response Code: 200


'Yes, the forklift appeared in the video. It was being operated by a worker in the warehouse.'

In [65]:
chat_client.query("When did the forklift first appear?")

Response Code: 200


'The forklift first appeared in the frame at time [60.0, 64.0].'

Excellent! We were able to complete the audit quickly. Even better, we were already in compliance.
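
The comparison behind that conclusion can be sketched in code. The helpers below are hypothetical. They assume the answer contains a `[start, end]` pair measured in seconds from the start of the recording, and the 15-second tolerance is an arbitrary value chosen for illustration.

```python
import re
from typing import Optional

# Matches a bracketed pair of numbers such as "[60.0, 64.0]".
TIME_RANGE = re.compile(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]")


def parse_time_range(answer: str) -> Optional[tuple[float, float]]:
    """Extract the first [start, end] pair from a chat answer, if any."""
    match = TIME_RANGE.search(answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def matches_sign_out(
    answer: str,
    sign_out_seconds: float,
    tolerance_seconds: float = 15.0,
) -> bool:
    """Check whether the reported start time is close to the sign-out time."""
    time_range = parse_time_range(answer)
    if time_range is None:
        return False
    start, _end = time_range
    return abs(start - sign_out_seconds) <= tolerance_seconds


answer = "The forklift first appeared in the frame at time [60.0, 64.0]."
print(parse_time_range(answer))  # (60.0, 64.0)
print(matches_sign_out(answer, sign_out_seconds=60))  # True
```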

# 🥳 Complete!

Congratulations! We have successfully built an intelligent surveillance system.

Feel free to save and close this file and return to the workshop instructions.