# Azure AI Content Understanding
## Video Search with Azure Content Understanding and Azure AI Search

<img src="https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/media/overview/component-overview.png">

## Objective
This notebook shows how to use the Azure AI Content Understanding video API together with Azure AI Search.
The sample walks through the following steps:
1. Process a video file from Azure Blob storage with the Azure Video Content Understanding service to generate a video description grounding document.
2. Process the video description grounding document with Azure Search client to generate an Azure Search index.
3. Utilize OpenAI completion and embedding models to search through content in the video search index.

## Settings
1. Azure AI services: in the resource, go to Access Control (IAM) and grant yourself the **Cognitive Services User** role.
2. Azure OpenAI: in the resource, go to Access Control (IAM) and grant yourself the **Cognitive Services OpenAI User** role.
3. Azure AI Search: in the resource, go to Access Control (IAM) and grant yourself the **Search Index Data Contributor** and **Search Service Contributor** roles.

For background, see the [Azure AI Content Understanding overview](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview).

In [1]:
import cv2
import datetime
import json
import matplotlib.pyplot as plt
import os
import requests
import sys
import time

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure_content_understanding import AzureContentUnderstandingClient
from dotenv import load_dotenv
from IPython.display import Video
from langchain.schema import Document, StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.vectorstores.azuresearch import AzureSearch
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

In [2]:
sys.version

'3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]'

In [3]:
print(f"Today is: {datetime.datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is: 31-Jan-2025 10:45:21


In [4]:
JSON_DIR = "json"
DOCUMENTS_DIR = "documents"

## Load environment variables

In [5]:
load_dotenv("azure.env")

AZURE_AI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")

AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_KEY = os.getenv("AZURE_SEARCH_KEY")
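Since `os.getenv` silently returns `None` for a missing variable, it can help to check the settings before any client is created. The following is a minimal sketch, not part of the original sample: the helper name `missing_settings` is hypothetical, and the variable names are the ones read in the cell above.

```python
import os

# Variable names read from azure.env in the cell above.
REQUIRED_VARS = [
    "AZURE_AI_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
]


def missing_settings(env=None):
    """Return the names of required settings that are unset or empty.

    Hypothetical helper: `env` defaults to os.environ and can be any mapping,
    which keeps the check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an in-memory mapping instead of the real environment.
example_env = {"AZURE_AI_ENDPOINT": "https://example.invalid"}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument right after `load_dotenv("azure.env")` would report which entries still need to be filled in.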

In [6]:
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = "gpt-4o"
AZURE_OPENAI_CHAT_API_VERSION = "2025-01-01-preview"

AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = "text-embedding-ada-002"
AZURE_OPENAI_EMBEDDING_API_VERSION = "2025-01-01-preview"
AZURE_AI_SERVICE_API_VERSION = "2024-12-01-preview"

AZURE_SEARCH_INDEX_NAME = "azure_cu_movieteaser"

## Helper

In [7]:
def convert_values_to_strings(json_obj):
    """
    Convert each segment in a list to its string representation.

    Args:
        json_obj (list): A list of dictionaries, one per video segment.

    Returns:
        list: A list containing the str() representation of each segment.
    """

    return [str(value) for value in json_obj]

In [8]:
def remove_markdown(json_obj):
    """
    Remove 'markdown' keys from all segments in a JSON object.

    Args:
        json_obj (list): A list of dictionaries representing the JSON object.

    Returns:
        list: The modified JSON object with 'markdown' keys removed from each segment.
    """
    for segment in json_obj:
        if 'markdown' in segment:
            del segment['markdown']

    return json_obj

In [9]:
def get_scene_description(scene_description):
    """
    Process a scene description to generate a list of Document objects.

    This function extracts audio-visual segments from the provided scene description,
    removes any 'markdown' keys, converts the segment values to strings, and formats
    them into JSON strings. Each formatted string is then wrapped in a Document object.

    Args:
        scene_description (dict): A dictionary containing the scene description with
                                  audio-visual segments.

    Returns:
        list: A list of Document objects, each containing a formatted JSON string
              representing a video segment with scene description and transcript.
    """
    audio_visual_segments = scene_description["result"]["contents"]

    filtered_audio_visual_segments = remove_markdown(audio_visual_segments)

    audio_visual_splits = [
        "The following is a json string representing a video segment with scene description and transcript ```"
        + v + "```"
        for v in convert_values_to_strings(filtered_audio_visual_segments)
    ]

    docs = [Document(page_content=v) for v in audio_visual_splits]

    return docs
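To make the transformation concrete, here is a small sketch of the same steps on a hand-written sample, using only the standard library. The sample values are illustrative, and `segments_to_texts` is a hypothetical name. Unlike `remove_markdown` above, which deletes the key from the input in place, this variant builds a cleaned copy and leaves the analysis result untouched.

```python
FENCE = "`" * 3  # the triple-backtick delimiter used around each segment
PREFIX = ("The following is a json string representing a video segment "
          "with scene description and transcript ")

# Illustrative stand-in for the analysis result; values are made up.
sample_result = {
    "result": {
        "contents": [
            {
                "markdown": "# Shot 0:0.0 => 0:1.418",
                "fields": {"segmentDescription": {"type": "string",
                                                  "valueString": "Opening logo."}},
                "kind": "audioVisual",
                "startTimeMs": 0,
                "endTimeMs": 1418,
            }
        ]
    }
}


def segments_to_texts(analysis):
    """Return one prefixed, fenced string per segment, without 'markdown'."""
    texts = []
    for segment in analysis["result"]["contents"]:
        # Copy every key except 'markdown' so the input is not modified.
        cleaned = {k: v for k, v in segment.items() if k != "markdown"}
        texts.append(PREFIX + FENCE + str(cleaned) + FENCE)
    return texts


texts = segments_to_texts(sample_result)
print(texts[0])
```

Each string could then be wrapped with `Document(page_content=...)` exactly as in `get_scene_description`.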

In [10]:
def load_into_index(docs):
    """
    Embed and index a list of documents using Azure OpenAI and Azure Search.

    This function creates embeddings for the provided documents using Azure OpenAI,
    and then indexes these documents in an Azure Search service.

    Args:
        docs (list): A list of Document objects to be embedded and indexed.

    Returns:
        AzureSearch: An AzureSearch object containing the indexed documents.
    """
    # Azure OpenAI Embeddings
    aoai_embeddings = AzureOpenAIEmbeddings(
        azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
        openai_api_version=AZURE_OPENAI_EMBEDDING_API_VERSION,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        azure_ad_token_provider=token_provider)

    # Loading to the vector store
    vector_store: AzureSearch = AzureSearch(
        azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
        azure_search_key=AZURE_SEARCH_KEY,
        index_name=AZURE_SEARCH_INDEX_NAME,
        embedding_function=aoai_embeddings.embed_query)
    vector_store.add_documents(documents=docs)

    return vector_store

In [11]:
def get_index_stats(index_name):
    """
    Retrieves and displays the statistics of an Azure AI Search index.

    Parameters:
    index_name (str): The name of the Azure AI Search index.

    Returns:
    tuple: A tuple containing the following:
        - document_count (int): The number of documents in the index.
        - storage_size (int): The storage size of the index in bytes.

    This function sends a GET request to the Azure Cognitive Search service to retrieve the statistics
    of the specified index. It prints the status and statistics of the index, including the document count
    and storage size. If the request fails, it prints the status code of the failed request.
    """
    url = AZURE_SEARCH_ENDPOINT + "/indexes/" + index_name + "/stats?api-version=2021-04-30-Preview"
    headers = {
        "Content-Type": "application/json",
        "api-key": AZURE_SEARCH_KEY,
    }
    response = requests.get(url, headers=headers)
    print("Azure Cognitive Search index status for:", index_name, "\n")

    # Default to None so a failed request does not raise UnboundLocalError at return
    document_count, storage_size = None, None

    if response.status_code == 200:
        res = response.json()
        print(json.dumps(res, indent=2))
        document_count = res['documentCount']
        storage_size = res['storageSize']

    else:
        print("Request failed with status code:", response.status_code)

    return document_count, storage_size

In [12]:
def index_status(index_name):
    """
    Retrieves and displays the status of an Azure AI Search index.

    Parameters:
    index_name (str): The name of the Azure AI Search index.

    Returns:
    None: This function prints the status of the specified index in a formatted JSON structure.
          If the request fails, it prints an error message.

    This function sends a GET request to the Azure AI Search service to retrieve the status
    of the specified index. It prints the index name and the status information in a formatted
    JSON structure. If the request fails, it catches the exception and prints a failure message.
    """
    print("Azure AI Search Index:", index_name, "\n")

    headers = {"Content-Type": "application/json", "api-key": AZURE_SEARCH_KEY}
    params = {"api-version": "2021-04-30-Preview"}
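    try:
        # Keep the request inside the try block so network errors are caught too
        response = requests.get(AZURE_SEARCH_ENDPOINT + "/indexes/" +
                                index_name,
                                headers=headers,
                                params=params)
        print(json.dumps(response.json(), indent=5))
    except (requests.RequestException, ValueError):
        # Network failure or a response body that is not valid JSON
        print("Request failed")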

In [13]:
def get_fields_result(res_string):
    """
    Extracts various fields from a string.

    Parameters:
    res_string (str): A string.

    Returns:
    tuple: A tuple containing the following extracted fields:
        - scene_desc (str): The scene description.
        - kind (str): The kind of the segment.
        - startTimeMs (int): The start time in milliseconds.
        - endTimeMs (int): The end time in milliseconds.
        - width (int): The width of the segment.
        - height (int): The height of the segment.
        - keyFrameTimesMs (list of int): A list of key frame times in milliseconds.
        - transcriptPhrases (list of str): A list of transcript phrases.
    """
    # Extract scene desc
    start_value_string = res_string.find('"valueString": "') + len(
        '"valueString": "')
    end_value_string = res_string.find('"}', start_value_string)
    scene_desc = res_string[start_value_string:end_value_string]

    # Extract kind
    start_kind = res_string.find('"kind": "') + len('"kind": "')
    end_kind = res_string.find('"', start_kind)
    kind = res_string[start_kind:end_kind]

    # Extract startTimeMs
    start_startTimeMs = res_string.find('"startTimeMs": ') + len(
        '"startTimeMs": ')
    end_startTimeMs = res_string.find(',', start_startTimeMs)
    startTimeMs = int(res_string[start_startTimeMs:end_startTimeMs])

    # Extract endTimeMs
    start_endTimeMs = res_string.find('"endTimeMs": ') + len('"endTimeMs": ')
    end_endTimeMs = res_string.find(',', start_endTimeMs)
    endTimeMs = int(res_string[start_endTimeMs:end_endTimeMs])

    # Extract width
    start_width = res_string.find('"width": ') + len('"width": ')
    end_width = res_string.find(',', start_width)
    width = int(res_string[start_width:end_width])

    # Extract height
    start_height = res_string.find('"height": ') + len('"height": ')
    end_height = res_string.find(',', start_height)
    height = int(res_string[start_height:end_height])

    # Extract KeyFrameTimesMs
    start_keyFrameTimesMs = res_string.find('"KeyFrameTimesMs": [') + len(
        '"KeyFrameTimesMs": [')
    end_keyFrameTimesMs = res_string.find(']', start_keyFrameTimesMs)
    keyFrameTimesMs_str = res_string[start_keyFrameTimesMs:end_keyFrameTimesMs]
    keyFrameTimesMs = [int(x) for x in keyFrameTimesMs_str.split(',')]

    # Extract transcriptPhrases
    start_transcriptPhrases = res_string.find('"transcriptPhrases": [') + len(
        '"transcriptPhrases": [')
    end_transcriptPhrases = res_string.find(']', start_transcriptPhrases)
    transcriptPhrases_str = res_string[
        start_transcriptPhrases:end_transcriptPhrases]
    transcriptPhrases = [
        x.strip() for x in transcriptPhrases_str.split(',') if x.strip()
    ]

    return scene_desc, kind, startTimeMs, endTimeMs, width, height, keyFrameTimesMs, transcriptPhrases

In [14]:
def display_results(docs):
    """
    Displays the extracted fields from a list of document objects.

    Parameters:
    docs (list): A list of document objects, each containing a page_content attribute with JSON-like data.

    Returns:
    None: This function prints the extracted fields for each document.
    """
    for i, doc in enumerate(docs, start=1):
        print("\033[1;31;34m")
        res_string = doc.page_content.split("```")[1].replace("'", "\"")

        print(f"Results {i}:\n")
        scene_desc, kind, startTimeMs, endTimeMs, width, height, keyFrameTimesMs, transcriptPhrases = get_fields_result(
            res_string)

        print("Scene description:", scene_desc)
        print("Kind:", kind)
        print("StartTimeMs:", startTimeMs)
        print("EndTimeMs:", endTimeMs)
        print("Width:", width)
        print("Height:", height)
        print("KeyFrameTimesMs:", keyFrameTimesMs)
        print("TranscriptPhrases:", transcriptPhrases)
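The text between the backticks in `page_content` is the Python `str()` of a dictionary, not JSON, so swapping every `'` for `"` also rewrites apostrophes inside the transcript text, and splitting on commas breaks up phrases. As an alternative sketch (not the approach used in this notebook), `ast.literal_eval` can rebuild the dictionary directly. The sample below is hand-written from field names that appear in this notebook's output; the empty `words` list is an assumption.

```python
import ast

FENCE = "`" * 3  # the triple-backtick delimiter used in page_content


def parse_segment(page_content: str) -> dict:
    """Rebuild the segment dictionary from a Document's page_content.

    Works only when the fenced text is a plain Python literal, which holds
    for dictionaries decoded from JSON (strings, numbers, lists, dicts,
    True/False/None).
    """
    literal = page_content.split(FENCE)[1]
    return ast.literal_eval(literal)


# Hand-written sample segment with an apostrophe in the transcript text.
segment = {
    "fields": {"segmentDescription": {"type": "string",
                                      "valueString": "A person stands in front of a white truck."}},
    "kind": "audioVisual",
    "startTimeMs": 15724,
    "endTimeMs": 16725,
    "transcriptPhrases": [{"speaker": "speaker", "startTimeMs": 15760,
                           "endTimeMs": 16480, "text": "That's me.",
                           "confidence": 1, "words": []}],
}
page_content = "A video segment " + FENCE + str(segment) + FENCE

parsed = parse_segment(page_content)
print(parsed["fields"]["segmentDescription"]["valueString"])
print(parsed["transcriptPhrases"][0]["text"])
```

With the dictionary restored, each field can be read by key, and the transcript text keeps its apostrophes and commas.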

In [15]:
def conversational_search(rag_chain, query):
    """
    Perform a conversational search using a Retrieval-Augmented Generation (RAG) chain.

    This function takes a query, invokes the RAG chain with the query, and prints the result.

    Args:
        rag_chain (Runnable): The LangChain runnable returned by setup_rag_chain, which combines
                          the context retriever, prompt, language model, and output parser.
        query (str): The search query to be processed by the RAG chain.

    Returns:
        None
    """
    print(rag_chain.invoke(query))

## File to Analyze

In [16]:
video_file = os.path.join(DOCUMENTS_DIR, "movie.mp4")

In [17]:
Video(video_file, width=512)

In [18]:
vid = cv2.VideoCapture(video_file)

fps = int(vid.get(cv2.CAP_PROP_FPS))
nbframes = int(vid.get(cv2.CAP_PROP_FRAME_COUNT))
duration = nbframes / fps
min_vid = int(duration // 60)
sec_vid = int(duration % 60)

print(f"Duration of {video_file} = {min_vid} minutes and {sec_vid} seconds")
print(f"Total number of frames = {nbframes}")
print(f"FPS = {fps}")
!ls $video_file -lh

Duration of documents/movie.mp4 = 2 minutes and 31 seconds
Total number of frames = 3474
FPS = 23
-rwxrwxrwx 1 root root 17M Jan 31 08:37 documents/movie.mp4
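`cv2.CAP_PROP_FPS` is returned as a float, and the cell above truncates it with `int()` before dividing. The sketch below shows how much that truncation can shift the computed duration. The 23.976 fps figure is an assumption (a common film frame rate); the notebook output only shows the truncated value 23.

```python
def duration_parts(frame_count: int, fps: float) -> tuple:
    """Return (minutes, seconds) for a clip with the given frame count and rate."""
    seconds = frame_count / fps
    return int(seconds // 60), int(seconds % 60)


# Frame count taken from the output above.
truncated = duration_parts(3474, 23)        # rate after int() truncation
fractional = duration_parts(3474, 23.976)   # assumed untruncated rate

print("With fps truncated to 23:", truncated)
print("With an assumed 23.976 fps:", fractional)
```

Keeping the frame rate as a float (`fps = vid.get(cv2.CAP_PROP_FPS)`) avoids the discrepancy when the container reports a fractional rate.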


## Generate Video Segment Description
Create a custom analyzer with a predefined schema. The custom analyzer schema is defined in [./json/video_content_understanding.json](./json/video_content_understanding.json), the template path used in the next cell.

In [19]:
ANALYZER_TEMPLATE_PATH = os.path.join(JSON_DIR, "video_content_understanding.json")

ANALYZER_ID = f"videoanalyzer{datetime.datetime.today().strftime('%d%b%Y%H%M%S')}"

In [20]:
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default")

# Create the Azure Content Understanding client
client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    token_provider=token_provider,
    x_ms_useragent=
    "azure-ai-content-understanding-python/search_with_video",  # This header is used for sample usage telemetry
)

In [21]:
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_PATH)
result = client.poll_result(response)

print("\033[1;31;35m")
print(json.dumps(result, indent=4))

{
    "id": "984e0b51-75af-4cd6-9420-3e604df15a01",
    "status": "Succeeded",
    "result": {
        "analyzerId": "videoanalyzer31Jan2025104543",
        "description": "Generating content understanding from video.",
        "createdAt": "2025-01-31T10:46:00Z",
        "lastModifiedAt": "2025-01-31T10:46:00Z",
        "config": {
            "locales": [
                "en-US",
                "es-ES",
                "es-MX",
                "fr-FR",
                "hi-IN",
                "it-IT",
                "ja-JP",
                "ko-KR",
                "pt-BR",
                "zh-CN"
            ],
            "returnDetails": true,
            "enableFace": false
        },
        "fieldSchema": {
            "name": "Content Understanding",
            "fields": {
                "segmentDescription": {
                    "type": "string",
                    "description": "Detailed summary of the video segment, focusing on people, places, and actions 

### Use the created analyzer to extract video segment description

In [22]:
start = time.time()

# Submit the video for content analysis
response = client.begin_analyze(ANALYZER_ID, file_location=video_file)

# Wait for the analysis to complete and get the content analysis result
video_result = client.poll_result(response, timeout_seconds=3600)

elapsed = time.time() - start
print(f"Done in {time.strftime('%H:%M:%S.' + str(elapsed % 1)[2:15], time.gmtime(elapsed))}")

Done in 00:02:17.3298244476318
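Video analysis is a long-running operation: the request is submitted, then its status is polled until it completes or the timeout expires. The implementation of `client.poll_result` is not shown in this notebook, so the following is only a generic sketch of that pattern. The function name, the `Running` and `Failed` status values, and the stand-in status source are assumptions; only `Succeeded` appears in the output above.

```python
import time


def poll_until_done(get_status, timeout_seconds=3600.0, interval_seconds=2.0):
    """Call get_status() until it reports success, failure, or the timeout passes.

    get_status is any callable returning a dict with a 'status' key.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        operation = get_status()
        state = str(operation.get("status", "")).lower()
        if state == "succeeded":
            return operation
        if state == "failed":
            raise RuntimeError(f"Operation failed: {operation}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation not finished after {timeout_seconds} seconds")
        time.sleep(interval_seconds)


# Stand-in for a status endpoint: two in-progress replies, then success.
_replies = iter([
    {"status": "Running"},
    {"status": "Running"},
    {"status": "Succeeded", "result": {}},
])
final = poll_until_done(lambda: next(_replies), timeout_seconds=5, interval_seconds=0.01)
print(final["status"])
```

A generous timeout, such as the 3600 seconds passed above, leaves room for longer videos, since processing time grows with the input.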


In [23]:
print(json.dumps(video_result, indent=4))

{
    "id": "bf16f076-1d6e-4504-87b6-ed7f961934fa",
    "status": "Succeeded",
    "result": {
        "analyzerId": "videoanalyzer31Jan2025104543",
        "apiVersion": "2024-12-01-preview",
        "createdAt": "2025-01-31T10:46:13Z",
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' detected with severity 'Low'. Please use the field content with caution."
            },
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' detected with severity 'Low'. Please use the field content with caution."
            },
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' detected with severity 'Low'. Please use the field content with caution."
            },
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' det

### Pre-process the video segment descriptions from Azure Content Understanding

In [24]:
segments = get_scene_description(video_result)
print(f"Number of documents: {len(segments)} segments.")

Number of documents: 102 segments.


In [25]:
for i, segment in enumerate(segments):
    print("\033[1;31;34m")
    print(f"Scene {i+1}: {segment.page_content}")

Scene 1: The following is a json string representing a video segment with scene description and transcript ```{'fields': {'segmentDescription': {'type': 'string', 'valueString': 'The video opens with the Warner Bros. Pictures logo, indicating the production company behind the film.'}}, 'kind': 'audioVisual', 'startTimeMs': 0, 'endTimeMs': 1418, 'width': 1280, 'height': 534, 'KeyFrameTimesMs': [737], 'transcriptPhrases': []}```
Scene 2: The following is a json string representing a video segment with scene description and transcript ```{'fields': {'segmentDescription': {'type': 'string', 'valueString': 'The Warner Bros. Pictures logo fades to black, signifying the end of the opening logo sequence.'}}, 'kind': 'audioVisual', 'startTimeMs': 1418, 'endTimeMs': 3253, 'width': 1280, 'height': 534, 'KeyFrameTimesMs': [2048, 2662], 'transcriptPhrases': []}```
Scene 3: The following is a json string representing a video segment with scene description and transcr
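Segment boundaries such as `startTimeMs` and `endTimeMs` are millisecond offsets from the start of the video. When showing search hits to a user, a player-style timestamp is often easier to read. A minimal sketch, with `ms_to_timestamp` as a hypothetical helper name:

```python
def ms_to_timestamp(ms: int) -> str:
    """Format a millisecond offset as MM:SS.mmm."""
    minutes, remainder = divmod(ms, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


# Offsets taken from segments shown in this notebook.
print(ms_to_timestamp(1418))    # 00:01.418
print(ms_to_timestamp(130380))  # 02:10.380
```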

## Embed and index the chunks
Add the scene description segments as documents to Azure Search.

In [26]:
start = time.time()

vector_store = load_into_index(segments)

elapsed = time.time() - start
print(f"Done in {time.strftime('%H:%M:%S.' + str(elapsed % 1)[2:15], time.gmtime(elapsed))}")

Done in 00:00:13.7576577663421
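The timing cells build the fractional part by slicing `str(elapsed % 1)`, which prints many digits and can misbehave when the fraction is rendered in scientific notation. As an optional alternative sketch (the helper name `format_elapsed` is hypothetical), the elapsed time can be rounded to milliseconds and formatted with integer arithmetic:

```python
def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS.mmm, rounded to milliseconds."""
    total_ms = round(seconds * 1000)
    total_seconds, millis = divmod(total_ms, 1000)
    total_minutes, secs = divmod(total_seconds, 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


# Durations matching the two timings printed in this notebook.
print(format_elapsed(137.3298244476318))  # 00:02:17.330
print(format_elapsed(13.7576577663421))   # 00:00:13.758
```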


In [42]:
# Please wait a couple of seconds to get the results
index_status(AZURE_SEARCH_INDEX_NAME)

Azure AI Search Index: azure_cu_movieteaser 

{
     "@odata.context": "https://azureaisearch-sr.search.windows.net/$metadata#indexes/$entity",
     "@odata.etag": "\"0x8DD41E4CD175395\"",
     "name": "azure_cu_movieteaser",
     "defaultScoringProfile": null,
     "fields": [
          {
               "name": "id",
               "type": "Edm.String",
               "searchable": false,
               "filterable": true,
               "retrievable": true,
               "sortable": false,
               "facetable": false,
               "key": true,
               "indexAnalyzer": null,
               "searchAnalyzer": null,
               "analyzer": null,
               "normalizer": null,
               "synonymMaps": []
          },
          {
               "name": "content",
               "type": "Edm.String",
               "searchable": true,
               "filterable": false,
               "retrievable": true,
               "sortable": false,
               "facetabl

In [43]:
document_count, storage_size = get_index_stats(AZURE_SEARCH_INDEX_NAME)

Azure Cognitive Search index status for: azure_cu_movieteaser 

{
  "@odata.context": "https://azureaisearch-sr.search.windows.net/$metadata#Microsoft.Azure.Search.V2021_04_30_Preview.IndexStatistics",
  "documentCount": 102,
  "storageSize": 2134579
}


In [44]:
print("Number of documents in the Azure AI Search index =", f"{document_count:,}")
print("Size of the index =", round(storage_size / (1024 * 1024), 2), "MB")

Number of documents in the Azure AI Search index = 102
Size of the index = 2.04 MB


## Retrieve relevant content
#### Execute a pure vector similarity search

In [30]:
query = "ford"

# Perform a similarity search
docs = vector_store.similarity_search(
    query=query,
    k=3,
    search_type="similarity",
)

display_results(docs)

Results 1:

Scene description: A car with multiple occupants inside is driving. The driver, identifiable by a red bandana, suggests a group potentially involved in a gang or criminal activity.
Kind: audioVisual
StartTimeMs: 130380
EndTimeMs: 131548
Width: 1280
Height: 534
KeyFrameTimesMs: [130976]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 127200', '"endTimeMs": 131840', '"text": "Tao and Sue are never going to find peace in this world as long as that gang"s around."', '"confidence": 1', '"words": [']
Results 2:

Scene description: Two young individuals stand outside at night, with visible emergency vehicle lights in the background, suggesting a recent or ongoing incident.
Kind: audioVisual
StartTimeMs: 127419
EndTimeMs: 128545
Width: 1280
Height: 534
KeyFrameTimesMs: [127986]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 127200', '"endTimeMs": 131840', '"text": "Tao and Sue are never going to find peace in this world as long as th
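A pure vector search ranks documents by how close their embeddings are to the query embedding, which is why a query like "ford" can return car scenes that never contain the word. The sketch below illustrates the idea with cosine similarity on toy two-dimensional vectors. Cosine is one common metric; the metric actually used depends on how the index is configured, and real embeddings have far more dimensions. The names and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k document names whose vectors are most similar to the query."""
    ranked = sorted(doc_vecs,
                    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
                    reverse=True)
    return ranked[:k]


# Toy embeddings: the first axis loosely stands for "vehicle", the second for "church".
doc_vecs = {"car scene": [0.9, 0.1], "church scene": [0.0, 1.0], "truck scene": [0.7, 0.7]}
print(top_k([1.0, 0.0], doc_vecs, k=2))
```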

#### Execute a hybrid search
Vector and non-vector text fields are queried in parallel, the results are merged, and the top matches of the unified result set are returned.

In [31]:
# Perform a hybrid search with the hybrid_search method
docs = vector_store.hybrid_search(query=query, k=3)

display_results(docs)

Results 1:

Scene description: A close-up shot of a person cleaning a "Gran Torino Sport" car with a cloth. This action highlights attention to detail and care for the vehicle, suggesting a connection or pride in its ownership.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
Results 2:

Scene description: A car with multiple occupants inside is driving. The driver, identifiable by a red bandana, suggests a group potentially involved in a gang or criminal activity.
Kind: audioVisual
StartTimeMs: 130380
EndTimeMs: 131548
Width: 1280
Height: 534
KeyFrameTimesMs: [130976]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 127200', '"endTimeMs": 131840', '"text": "Tao and Sue are never going to find peace in this world as lo
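The merging step of a hybrid search can be illustrated with Reciprocal Rank Fusion (RRF), the method Azure AI Search documents for combining the text and vector rankings. The following is a simplified sketch of the idea, not the service's implementation: the function name, the document ids, and the constant 60 (a commonly used value) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) in every list where it appears
    (rank starts at 1); the scores are summed across lists and the ids are
    returned from highest to lowest total.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up rankings for one query: one from vector search, one from keyword search.
vector_ranking = ["bandana_car", "gran_torino", "night_street"]
keyword_ranking = ["gran_torino", "credits", "bandana_car"]

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

A document that ranks well in both lists, like `gran_torino` here, rises above one that leads only a single list, which matches the reordering between the two result sets above.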

## Other queries

In [32]:
query = "priest"

# Perform a hybrid search with the hybrid_search method
docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: Inside the church, a congregation is seated, attentively listening or participating in the service. The setting denotes a formal gathering, with individuals dressed in suits and formal attire, highlighting the solemnity of the occasion.
Kind: audioVisual
StartTimeMs: 49007
EndTimeMs: 50676
Width: 1280
Height: 534
KeyFrameTimesMs: [49556, 50089]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 49120', '"endTimeMs": 52880', '"text": "Dorothy mentioned specifically that it was her desire for you to go to confession."', '"confidence": 1', '"words": [']
Results 2:

Scene description: Inside a room, possibly part of a post-service gathering, individuals are engaged in conversation. The setting is intimate, with people interacting in a home-like environment, indicating a social aspect following the formal church service.
Kind: audioVisual
StartTimeMs: 51552
EndTimeMs: 52928
Width: 1280
Height: 534
KeyFrameTimesMs: [51563, 52259]

In [33]:
query = "birthday cake"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: A birthday cake is revealed with the words "Happy Birthday Dad" written in blue icing. This indicates a celebratory occasion, likely a birthday gathering for a father.
Kind: audioVisual
StartTimeMs: 26443
EndTimeMs: 27819
Width: 1280
Height: 534
KeyFrameTimesMs: [27154]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
Results 2:

Scene description: An individual is seated at a table, seemingly reflective or engaged in thought, possibly considering the conversation or the birthday celebration.
Kind: audioVisual
StartTimeMs: 29154
EndTimeMs: 30781
Width: 1280
Height: 534
KeyFrameTimesMs: [29693, 30225]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 27920', '"endTimeMs": 30560', '"text": "Maybe it"s time you started thinking about taking it easier."', '"confidence": 1', '"words": [']
Re

In [34]:
query = "Clint Eastwood"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: The segment showcases a series of credits introducing the director, Clint Eastwood. The text gradually appears on a black screen, creating a dramatic and anticipatory atmosphere.
Kind: audioVisual
StartTimeMs: 19394
EndTimeMs: 23315
Width: 1280
Height: 534
KeyFrameTimesMs: [20191, 20969, 21747, 22526]
TranscriptPhrases: []
Results 2:

Scene description: The video segment features a title card displaying the credits for the film. It lists key contributors such as the story writers, screenplay writer, producer, and director. The names mentioned include David Johannson, Nick Schenk, Robert Lorenz, Bill Gerber, and Clint Eastwood. The text is prominently displayed on the screen, indicating the official website and production company logos.
Kind: audioVisual
StartTimeMs: 142851
EndTimeMs: 143894
Width: 1280
Height: 534
KeyFrameTimesMs: [143386]
TranscriptPhrases: []
Results 3:

Scene description: A close-up shot of a person cle

In [35]:
query = "American flag"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: An older man is seen standing outside a house, with an American flag visible in the background. He seems to be focused on something, possibly watching the couple from the previous segment. The setting suggests a typical suburban neighborhood.
Kind: audioVisual
StartTimeMs: 40541
EndTimeMs: 41416
Width: 1280
Height: 534
KeyFrameTimesMs: [40997]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 40480', '"endTimeMs": 42720', '"text": "These Chinese after moving this they rode for."', '"confidence": 1', '"words": [']
Results 2:

Scene description: The exterior of a house is shown at night. A car is parked in front, and an American flag is visible, indicating a residential area in the United States.
Kind: audioVisual
StartTimeMs: 61228
EndTimeMs: 62521
Width: 1280
Height: 534
KeyFrameTimesMs: [61884]
TranscriptPhrases: []
Results 3:

Scene description: The setting changes to a daytime scene on a porch. A person in fo

In [38]:
query = "Gun"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: An intense moment occurs as a gun is fired from a vehicle. The shot captures the muzzle flash, emphasizing the sudden and violent nature of the act.
Kind: audioVisual
StartTimeMs: 119036
EndTimeMs: 119536
Width: 1280
Height: 534
KeyFrameTimesMs: [119304]
TranscriptPhrases: []
Results 2:

Scene description: A person is seen holding a shotgun in a defensive posture. The scene is tense, suggesting a confrontation or a threat. The setting appears to be outside a residential home at night.
Kind: audioVisual
StartTimeMs: 71655
EndTimeMs: 73323
Width: 1280
Height: 534
KeyFrameTimesMs: [72205, 72737]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 71880', '"endTimeMs": 73200', '"text": "Get off my lawn."', '"confidence": 1', '"words": [']
Results 3:

Scene description: A person is seen standing at a doorway with a shotgun, suggesting vigilance or readiness. The setting is domestic, indicating a potential threat or the

In [39]:
query = "White pickup"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: The scene continues with the individual standing resolutely in front of a white truck, maintaining eye contact with someone off-screen.
Kind: audioVisual
StartTimeMs: 15724
EndTimeMs: 16725
Width: 1280
Height: 534
KeyFrameTimesMs: [15727, 16218]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 15760', '"endTimeMs": 16480', '"text": "That"s me."', '"confidence": 1', '"words": [']
Results 2:

Scene description: A close-up shot of a person cleaning a "Gran Torino Sport" car with a cloth. This action highlights attention to detail and care for the vehicle, suggesting a connection or pride in its ownership.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
Results 3:

Scene descripti

In [41]:
query = "emergency vehicle"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: Two young individuals stand outside at night, with visible emergency vehicle lights in the background, suggesting a recent or ongoing incident.
Kind: audioVisual
StartTimeMs: 127419
EndTimeMs: 128545
Width: 1280
Height: 534
KeyFrameTimesMs: [127986]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 127200', '"endTimeMs": 131840', '"text": "Tao and Sue are never going to find peace in this world as long as that gang"s around."', '"confidence": 1', '"words": [']
Results 2:

Scene description: A vehicle is seen driving past, possibly being watched or followed, as part of a larger narrative involving surveillance or pursuit.
Kind: audioVisual
StartTimeMs: 135844
EndTimeMs: 137095
Width: 1280
Height: 534
KeyFrameTimesMs: [136464]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 136520', '"endTimeMs": 137760', '"text": "They won"t have a chance."', '"confidence": 1', '"words": [']
Results 3:

Scene descrip

In [50]:
query = "a person who is cleaning a car"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

Results 1:

Scene description: A close-up shot of a person cleaning a "Gran Torino Sport" car with a cloth. This action highlights attention to detail and care for the vehicle, suggesting a connection or pride in its ownership.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
Results 2:

Kind: audioVisual
StartTimeMs: 108734
EndTimeMs: 109401
Width: 1280
Height: 534
KeyFrameTimesMs: [109065]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 108640', '"endTimeMs": 109360', '"text": "Just a gang."', '"confidence": 1', '"words": [']
Results 3:

Scene description: A car with multiple occupants inside is driving. The driver, identifiable by a red bandana, suggests a group potentially involved in a gang or criminal

## Video Q&A
We can use Azure OpenAI GPT chat completion models together with Azure AI Search to search the video conversationally and chat about the results. (If you are using GitHub Codespaces, the input prompt appears near the top of the screen.)

In [51]:
# Setup rag chain
prompt_video = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""

In [52]:
print(prompt_video)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:


In [53]:
def setup_rag_chain(vector_store):
    """
    Set up a Retrieval-Augmented Generation (RAG) chain using Azure OpenAI and Azure Search.

    This function configures a RAG chain by creating a retriever from the provided vector store,
    formatting documents, and setting up a language model with a prompt template.

    Args:
        vector_store (AzureSearch): An AzureSearch object used to retrieve documents based on similarity.

    Returns:
        Runnable: A LangChain runnable sequence that chains the context retriever, prompt,
              language model, and output parser.
    """
    # Retriever
    retriever = vector_store.as_retriever(search_type="similarity", k=3)
    # Prompt
    prompt = ChatPromptTemplate.from_template(prompt_video)
    # LLM
    llm = AzureChatOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_CHAT_API_VERSION,
        azure_deployment=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        azure_ad_token_provider=token_provider,
        temperature=0.7,
    )

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # RAG chain
    rag_chain = ({
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
                 | prompt
                 | llm
                 | StrOutputParser())

    return rag_chain
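The data flow inside the chain is: retrieve documents for the question, join them into one context string, fill the prompt template, call the model, and return the text. The sketch below mirrors that flow with plain Python callables so it runs without LangChain or Azure. `make_chain`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and the template is shortened from the one above.

```python
# Shortened version of the prompt template defined above.
PROMPT_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)


def format_docs(docs):
    """Join the retrieved texts with blank lines, as in setup_rag_chain."""
    return "\n\n".join(docs)


def make_chain(retrieve, generate):
    """Compose retrieval, prompt filling, and generation into one callable."""
    def invoke(question):
        context = format_docs(retrieve(question))
        prompt = PROMPT_TEMPLATE.format(question=question, context=context)
        return generate(prompt)
    return invoke


def fake_retrieve(question):
    # Stand-in for the vector store retriever: always returns one scene text.
    return ["The segment showcases credits introducing the director, Clint Eastwood."]


def fake_generate(prompt):
    # Stand-in for the chat model: echoes the prompt it would have received.
    return prompt


chain = make_chain(fake_retrieve, fake_generate)
answer = chain("who is the director?")
print(answer)
```

Printing the echoed prompt is a quick way to inspect exactly what context the model would receive for a given question.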

In [55]:
# Enter a query. Enter an empty query to stop
rag_chain = setup_rag_chain(vector_store)

while True:
    query = input("Ask the video: ")
    if query == "":
        break
    conversational_search(rag_chain, query)

Ask the video:  hello


Hello! How can I assist you today?


Ask the video:  what is this video?


The video consists of multiple scenes, including a car driving away in the dark suggesting an escape or pursuit, individuals arranging items possibly as a tribute, and a confrontation involving young individuals and a person holding a shotgun. It appears to involve themes of conflict, remembrance, and possibly a character's past related to military service. The dialogue hints at tension and unresolved issues among the characters.


Ask the video:  who is the director?


The director is Clint Eastwood.


Ask the video:  do we have some violence in it?


Yes, the context does suggest the presence of violence. There is a scene with a person holding a shotgun in a defensive posture, indicating a possible confrontation or threat. Additionally, the presence of emergency vehicle lights implies a recent or ongoing incident.


Ask the video:  Do we have cars?


Yes, the context describes multiple scenes involving cars, including a "Gran Torino Sport" being cleaned and cars driving in various scenarios.


Ask the video:  


## Post Processing

In [56]:
# Delete the analyzer if it is no longer needed
client.delete_analyzer(ANALYZER_ID)

<Response [204]>