# Introduction To Visual Search and Summarization Blueprint

In this notebook, we are going to use camera footage from a warehouse to ensure safety policies are being followed.

## Environment Setup

These are some basic imports, settings, and helper functions. No need to make any changes, but feel free to take a look!

In [None]:
# python imports
from IPython.display import Markdown, Video
from pathlib import Path
import requests
from typing import Literal, Any
from urllib.parse import urljoin

In [None]:
# paths and locations
VSS_URL = "http://via-server:8100"
WAREHOUSE_VIDEO = Path("./assets/warehouse.mp4")

In [None]:
# helper functions for talking to the API
def check_response(response):
    """Return the proper response."""
    print(f"Response Code: {response.status_code}")
    response.raise_for_status()
    try:
        return response.json()
    except requests.exceptions.JSONDecodeError:
        return response.text

def vss_api_call(
        vss_url: str,
        target: str,
        params: None | dict[str, str] = None,
        verb: Literal["get", "post"] = "get",
        json: Any = None,
        files: Any = None,
        data: Any = None,
    ) -> dict[str, Any] | str:
    """Query the VIA Agent API and handle the response."""
    # determine the full url
    url = urljoin(vss_url, target)

    # query the api
    if verb == "get":
        response = requests.get(url, params=params)
    elif verb == "post":
        response = requests.post(url, params=params, json=json, data=data, files=files)
    else:
        raise ValueError("Verb must be either get or post.")

    return check_response(response)
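
The helper above builds the request URL with `urljoin`, which behaves differently depending on whether the base URL has a path and a trailing slash. The short sketch below illustrates this. The `/v1` prefix is a made-up example of a base path, not something this notebook's server is known to use.

```python
from urllib.parse import urljoin

# Base URL with no path: the target is appended after the host.
no_path = urljoin("http://via-server:8100", "files")

# Hypothetical base path without a trailing slash: the last segment is replaced.
no_slash = urljoin("http://via-server:8100/v1", "files")

# Hypothetical base path with a trailing slash: the target is appended to the path.
with_slash = urljoin("http://via-server:8100/v1/", "files")

print(no_path)
print(no_slash)
print(with_slash)
```

If `VSS_URL` ever includes a path prefix, ending it with a trailing slash keeps that prefix in the final URL.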

## API Overview

VSS provides a REST API that operates the blueprint. 
This API is the integration point for your own custom applications.
You can browse the API documentation at http://HOSTNAME:8100/docs. The API actions are split into categories. 

Today, we are going to focus on:
- Health Check
- Models
- Files
- Summarization
- Alerts


### Health Checks API

Let's start by checking out the health checks. Typically, these are called by your deployment system to check the status of the server, but we can check them here using `curl`.

There are two endpoints: a liveness probe and a readiness probe. A liveness probe tells us if the service is working. The readiness probe tells us if the service is ready to handle requests. These probes will return a `200` status if they are healthy and an error status code (`4XX` or `5XX`) if the probe is not healthy.

In [None]:
# Liveness Probe
# Is the service running?
! curl -i {VSS_URL}/health/live

In [None]:
# Readiness Probe
# Is the service ready for requests?
! curl -i {VSS_URL}/health/ready
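
A deployment system usually polls the readiness probe rather than calling it once. Here is a minimal sketch of that idea in Python. The name `wait_until_ready`, its parameters, and the retry policy are our own assumptions for illustration and are not part of VSS. The probe is passed in as a function that returns a status code, so the logic can be tried without a running server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], int],
    attempts: int = 5,
    delay: float = 1.0,
) -> bool:
    """Call `probe` until it returns HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        # Only sleep if another attempt is still coming.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Simulated probe: unhealthy twice, then healthy.
fake_statuses = iter([503, 503, 200])
print(wait_until_ready(lambda: next(fake_statuses), delay=0))
```

In this notebook, the probe could be something like `lambda: requests.get(f"{VSS_URL}/health/ready").status_code`, keeping in mind that `requests.get` raises an exception if the server cannot be reached at all.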

### Models API

The models endpoint returns the LLM available for summarization requests. This is based on the startup configuration for VSS. The LLM can be configured to point to any OpenAI-compatible LLM.

In [None]:
models_response = vss_api_call(VSS_URL, "models")

models_response
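
Later cells read the model ID with `models_response["data"][0]["id"]`. A small, hypothetical helper like the one below does the same lookup but fails with a clearer message when no model is listed. It assumes the response has the `{"data": [{"id": ...}]}` shape implied by that lookup, and the sample response is made up.

```python
from typing import Any


def first_model_id(models_response: dict[str, Any]) -> str:
    """Return the ID of the first listed model (hypothetical helper)."""
    models = models_response.get("data") or []
    if not models:
        raise ValueError("The models endpoint returned no models.")
    return models[0]["id"]


# Made-up response with the assumed shape, for illustration only.
example_response = {"data": [{"id": "example-model"}]}
print(first_model_id(example_response))
```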

### Files API

To start processing a video, it must first be uploaded to VSS. 
Several endpoints are available to interact with the files. 

For our first example, we will be using the `WAREHOUSE_VIDEO`.

In [None]:
Video(WAREHOUSE_VIDEO, width=350)

We can upload new videos using the `files` endpoint and the `post` verb.

This endpoint requires three inputs:
- the file contents
- the file's purpose
- the file's type

When uploading a video for processing, the purpose is `vision` and the type is `video`.

A single image can also be uploaded for summarization by setting `media_type` to `image`.
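
If you upload a mix of videos and images, the form data could be chosen from the file extension. This is only a sketch: `upload_form_data` is a hypothetical helper, and the extension sets are our own assumptions rather than the list of formats VSS supports.

```python
from pathlib import Path

# Assumed extension sets for illustration; check the VSS docs for supported formats.
VIDEO_SUFFIXES = {".mp4", ".mkv", ".mov"}
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def upload_form_data(path: Path) -> dict[str, str]:
    """Build the `data` fields for the files endpoint from a file's suffix."""
    suffix = path.suffix.lower()
    if suffix in VIDEO_SUFFIXES:
        media_type = "video"
    elif suffix in IMAGE_SUFFIXES:
        media_type = "image"
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return {"purpose": "vision", "media_type": media_type}


print(upload_form_data(Path("./assets/warehouse.mp4")))
```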

In [None]:
with WAREHOUSE_VIDEO.open("rb") as file:
    # setup query data
    files = {"file": (WAREHOUSE_VIDEO.name, file)}
    data = {"purpose":"vision", "media_type":"video"}

    # query the vss api
    upload_response = vss_api_call(
        VSS_URL, 
        "files",
        verb="post",
        files=files,
        data=data
    )

upload_response

Check out the unique id in the response. That is how we will reference this file going forward.

We can also list all of the available files.

In [None]:
vss_api_call(VSS_URL, "files", params={"purpose":"vision"})

### Summarization APIs - Summarize

Once a video or image has been uploaded, the `/summarize` endpoint can be called to generate a summary.

In the body of the request, we will include the video ID and model's ID from above. There are also some prompt and model options. We will explore all these options later in the workshop.

In [None]:
body = {
    "id": upload_response.get("id"),
    "prompt": "Write a caption based on the video clip.",
    "caption_summarization_prompt": "Combine sequential captions to create more concise descriptions.",
    "summary_aggregation_prompt": "Write a summary of the video. ",
    "model": models_response["data"][0]["id"],
    "max_tokens": 1024,
    "temperature": 0.8,
    "top_p": 0.8,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0
}

We can now request a summary. Depending on the video length and configuration options, this request may take some time to return.

In [None]:
summarize_response = vss_api_call(VSS_URL, "summarize", verb="post", json=body)

summarize_response

Let's check out the summary we generated:

In [None]:
Markdown("> " + summarize_response["choices"][0]["message"]["content"])

Even if everything went perfectly, that answer was pretty basic. Without much context on the situation and your intentions, the model struggles to extract meaning. In every situation, the summarization prompts will need to be tuned.

### Customizing Summarization Prompts 

The VSS Blueprint processes videos with a four-stage pipeline:
- **Chunking:** videos are split into windows of a few seconds
- **Dense Captioning:** generate descriptions of each window
- **Aggregated Captioning:** aggregate dense captions to cover longer windows
- **Summarization:** generate a summary that describes all chunks

![Summarization Pipeline](./assets/summarization_diagram.png)
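
To make the chunking stage more concrete, here is a rough sketch of how `chunk_duration` and `chunk_overlap_duration` could translate into time windows. This is our own illustration under simple assumptions (windows start at 0 and each new window begins `chunk_duration - chunk_overlap_duration` seconds after the previous one). It is not the VSS implementation, and `chunk_windows` is a hypothetical name.

```python
def chunk_windows(
    video_duration: float,
    chunk_duration: float,
    chunk_overlap_duration: float = 0.0,
) -> list[tuple[float, float]]:
    """Return (start, end) windows in seconds that cover the whole video."""
    if chunk_duration <= 0:
        raise ValueError("chunk_duration must be positive.")
    if not 0 <= chunk_overlap_duration < chunk_duration:
        raise ValueError("Overlap must be non-negative and shorter than a chunk.")

    step = chunk_duration - chunk_overlap_duration
    windows = []
    start = 0.0
    while start < video_duration:
        end = min(start + chunk_duration, video_duration)
        windows.append((start, end))
        # Stop once a window reaches the end of the video.
        if end >= video_duration:
            break
        start += step
    return windows


# A 50 second video with 20 second chunks and no overlap.
print(chunk_windows(50, 20, 0))
```

Shorter chunks give the VLM a closer look at each moment but produce more captions to aggregate, which is one reason these options are worth tuning.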

A set of three prompts is used to control the summary generation through the three stages shown in the diagram.

`prompt` must include enough information so the model knows what it should be looking for in the video. If the summary is missing important details, it is likely because the VLM did not extract those details from the video chunks in the first place.

A three-part prompt often works well:

1) Persona 
2) Details
3) Format

For example: 

In [None]:
prompt = (
    "You are a warehouse monitoring system. "
    "Describe the events you are seeing. "
    "In addition to general descriptions, look for: "
    "policy violations, equipment usage, and restricted areas. "
    "All employees must wear helmets and vests in the warehouse. "
    "Employees must not cross yellow safety tape.  "
    "Start each sentence with start and end timestamp of the event."
)
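
If you maintain prompts for several cameras or policies, the persona, details, and format parts can be kept separate and joined when needed. The helper below is a hypothetical convenience, not part of VSS. It simply joins the three parts with spaces.

```python
def build_prompt(persona: str, details: list[str], output_format: str) -> str:
    """Join persona, detail sentences, and format instructions into one prompt."""
    parts = [persona, *details, output_format]
    # Skip empty parts and normalize the spacing between sentences.
    return " ".join(part.strip() for part in parts if part.strip())


example_prompt = build_prompt(
    persona="You are a warehouse monitoring system.",
    details=[
        "All employees must wear helmets and vests in the warehouse.",
        "Employees must not cross yellow safety tape.",
    ],
    output_format="Start each sentence with start and end timestamp of the event.",
)
print(example_prompt)
```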

Often, the text descriptions generated by the VLM can be repetitive in sequential or overlapping chunks. To condense the VLM captions efficiently, an LLM is used.

`caption_summarization_prompt` is the prompt used to instruct this aggregation. This prompt can typically stay the same across use cases, as it just needs to instruct the LLM to combine similar descriptions.

For example:

In [None]:
caption_summarization_prompt = (
    "You will be given captions from sequential clips of a video. "
    "Aggregate captions in the format start_time:end_time:caption "
    "based on whether captions are related to one another or "
    "create a continuous scene."
)

`summary_aggregation_prompt` is used to generate the final summary. It is used in a single LLM call along with all the aggregated captions to generate the summary output. 

This prompt should reiterate what details need to be included in the summary and any formatting options. Keep in mind that at this stage, the summary can only include details that are present in aggregated captions produced in the previous stage.

For example:

In [None]:
summary_aggregation_prompt = (
    "Based on the available information, "
    "generate a summary that captures the important events in the video. "
    "The summary should be organized chronologically and in logical sections. "
    "This should be a concise, yet descriptive summary of all the important events. "
    "The format should be intuitive and easy for a user to read and understand what happened. "
    "Format the output in Markdown so it can be displayed nicely."
)

Now that our prompts follow best practices, let's rerun the summarization with `enable_chat` turned on.

This instructs the VSS agent to also create a database of video details. We'll talk more about this in the next section. Until then, expect this run to take a bit longer than before.

In [None]:
body = {
    "id": upload_response.get("id"),
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": models_response["data"][0]["id"],
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "seed": 1,
    "num_frames_per_chunk": 5,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0,
    "enable_chat": True
}

summarize_response = vss_api_call(VSS_URL, "summarize", verb="post", json=body)

summarize_response

In [None]:
Markdown(summarize_response["choices"][0]["message"]["content"])

Now that's a lot more like it! This is a lot more useful information. 
Already, we can identify policy violations and should be able to flag them in real-time.
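
Both summarization requests in this notebook share most of their fields, so a small helper could build the body from shared defaults plus per-request overrides. This is a sketch only: `DEFAULT_OPTIONS` and `build_summarize_body` are hypothetical names, the field names are copied from the request bodies above, and the IDs in the example are placeholders.

```python
from typing import Any

# Options that both request bodies in this notebook have in common.
DEFAULT_OPTIONS: dict[str, Any] = {
    "max_tokens": 1024,
    "top_p": 0.8,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0,
}


def build_summarize_body(
    video_id: str,
    model_id: str,
    prompts: dict[str, str],
    **overrides: Any,
) -> dict[str, Any]:
    """Merge shared defaults, prompts, and per-request overrides."""
    return {"id": video_id, "model": model_id, **DEFAULT_OPTIONS, **prompts, **overrides}


# Placeholder IDs for illustration; use the real upload and model IDs in practice.
example_body = build_summarize_body(
    "example-video-id",
    "example-model",
    {"prompt": "Write a caption based on the video clip."},
    temperature=0.2,
    enable_chat=True,
)
print(example_body)
```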

### Summarization APIs - Chat Completions

That summary was great, but there is so much more insight in the video that we haven't used.

Part of the blueprint's Summarization APIs is a chat completion endpoint. 
Chatting with your videos allows you to discover new information without modifying your existing pipelines.
Here is a very basic class that will interact with this endpoint.

In [None]:
class Chat:
    """A simple wrapper to chat with a video."""

    def __init__(self, vss_url: str, video_id: str, model_id: str):
        """Initialize the class."""
        self.vss_url = vss_url
        self.video_id = video_id
        self.model_id = model_id

    def query(self, msg: str):
        """Chat with the VSS agent."""
        # format the message
        messages = [{
            "content": msg,
            "role": "user"
        }]

        # generate request payload
        payload = {
            "id": self.video_id,
            "messages": messages,
            "model": self.model_id,
        }

        # query the vss agent
        response_data = vss_api_call(
            self.vss_url,
            "chat/completions",
            verb="post",
            json=payload
        )
        answer = response_data["choices"][0]["message"]["content"]

        return answer

In [None]:
chat_client = Chat(
    VSS_URL,
    video_id=upload_response.get("id"),
    model_id=models_response["data"][0]["id"]
)

# Uh oh! 

🫨 **Surprise Audit Today** 🫨

Today we will be doing a new audit of forklift usage.
There is a sign-out sheet for the forklift, and we want to ensure it is accurate.
Our sign-out sheet indicates one sign-out about a minute after this recording started.

Because this is a new audit, we have not collected any historical data.
However, we can chat with our video library.

In [None]:
chat_client.query("Did the forklift appear?")

In [None]:
chat_client.query("When did the forklift first appear?")

Excellent! We were able to complete the audit quickly. Even better, we were already in compliance.
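
The comparison against the sign-out sheet can also be written down explicitly. The sketch below assumes you have already read the forklift's first appearance from the chat answer and converted it to seconds by hand. The function name, the 30 second tolerance, and the observed value are our own assumptions for illustration.

```python
def matches_sign_out(
    observed_seconds: float,
    sheet_seconds: float,
    tolerance_seconds: float = 30.0,
) -> bool:
    """Check whether an observed time is close enough to the sign-out sheet entry."""
    return abs(observed_seconds - sheet_seconds) <= tolerance_seconds


# The sheet says about one minute in; 55 seconds is a made-up observation.
print(matches_sign_out(observed_seconds=55, sheet_seconds=60))
```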

# 🥳 Complete!

Congratulations! We have successfully built an intelligent surveillance system.

Feel free to save and close this file and return to the workshop instructions.