# Azure AI Content Understanding
## Video Search with Azure Content Understanding and Azure AI Search

<img src="https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/media/overview/component-overview.png">

## Objective
This document is meant to present a guideline on how to leverage the Azure Video Content Understanding API for AI Search.
The sample will demonstrate the following steps:
1. Process a video file from Azure Blob storage with the Azure Video Content Understanding service to generate a video description grounding document.
2. Process the video description grounding document with Azure Search client to generate an Azure Search index.
3. Utilize OpenAI completion and embedding models to search through content in the video search index.

## Settings
1.	Azure AI services: Go to Access Control (IAM) in resource, grant yourself role **Cognitive Services User**
3.	Azure OpenAI: Go to Access Control (IAM) in resource, grant yourself role **Cognitive Services OpenAI User**
7.	Azure AI Search: Go to Access Control (IAM) in resource, grant yourself role **Search Index Data Contributor**. Go to Access Control (IAM) in resource, grant yourself role **Search Service Contributor**.

https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/overview

In [1]:
import cv2
import datetime
import json
import matplotlib.pyplot as plt
import os
import requests
import sys
import time

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure_content_understanding import AzureContentUnderstandingClient
from dotenv import load_dotenv
from IPython.display import Video
from langchain.schema import Document, StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.vectorstores.azuresearch import AzureSearch
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from moviepy import *

In [2]:
sys.version

'3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]'

In [3]:
print(f"Today is: {datetime.datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is: 31-Jan-2025 12:49:36


In [4]:
JSON_DIR = "json"
DOCUMENTS_DIR = "documents"

## Load environment variables

In [5]:
load_dotenv("azure.env")

AZURE_AI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")

AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_KEY = os.getenv("AZURE_SEARCH_KEY")

In [6]:
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = "gpt-4o"
AZURE_OPENAI_CHAT_API_VERSION = "2025-01-01-preview"

AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = "text-embedding-ada-002"
AZURE_OPENAI_EMBEDDING_API_VERSION = "2025-01-01-preview"
AZURE_AI_SERVICE_API_VERSION = "2024-12-01-preview"

AZURE_SEARCH_INDEX_NAME = "azure_cu_movieteaser"

## Helper

In [7]:
def convert_values_to_strings(json_obj):
    """
    Convert all values in a JSON object to strings.

    Args:
        json_obj (dict): A dictionary representing the JSON object.

    Returns:
        list: A list of string representations of the values in the JSON object.
    """

    return [str(value) for value in json_obj]

In [8]:
def remove_markdown(json_obj):
    """
    Remove 'markdown' keys from all segments in a JSON object.

    Args:
        json_obj (list): A list of dictionaries representing the JSON object.

    Returns:
        list: The modified JSON object with 'markdown' keys removed from each segment.
    """
    for segment in json_obj:
        if 'markdown' in segment:
            del segment['markdown']

    return json_obj

In [9]:
def get_scene_description(scene_description):
    """
    Process a scene description to generate a list of Document objects.

    This function extracts audio-visual segments from the provided scene description,
    removes any 'markdown' keys, converts the segment values to strings, and formats
    them into JSON strings. Each formatted string is then wrapped in a Document object.

    Args:
        scene_description (dict): A dictionary containing the scene description with
                                  audio-visual segments.

    Returns:
        list: A list of Document objects, each containing a formatted JSON string
              representing a video segment with scene description and transcript.
    """
    audio_visual_segments = scene_description["result"]["contents"]

    filtered_audio_visual_segments = remove_markdown(audio_visual_segments)

    audio_visual_splits = [
        "The following is a json string representing a video segment with scene description and transcript ```"
        + v + "```"
        for v in convert_values_to_strings(filtered_audio_visual_segments)
    ]

    docs = [Document(page_content=v) for v in audio_visual_splits]

    return docs

In [10]:
def load_into_index(docs):
    """
    Embed and index a list of documents using Azure OpenAI and Azure Search.

    This function creates embeddings for the provided documents using Azure OpenAI,
    and then indexes these documents in an Azure Search service.

    Args:
        docs (list): A list of Document objects to be embedded and indexed.

    Returns:
        AzureSearch: An AzureSearch object containing the indexed documents.
    """
    # Azure OpenAI Embeddings
    aoai_embeddings = AzureOpenAIEmbeddings(
        azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
        openai_api_version=AZURE_OPENAI_EMBEDDING_API_VERSION,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        azure_ad_token_provider=token_provider)

    # Loading to the vector store
    vector_store: AzureSearch = AzureSearch(
        azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
        azure_search_key=AZURE_SEARCH_KEY,
        index_name=AZURE_SEARCH_INDEX_NAME,
        embedding_function=aoai_embeddings.embed_query)
    vector_store.add_documents(documents=docs)

    return vector_store

In [11]:
def get_index_stats(index_name):
    """
    Retrieves and displays the statistics of an Azure AI Search index.

    Parameters:
    index_name (str): The name of the Azure AI Search index.

    Returns:
    tuple: A tuple containing the following:
        - document_count (int): The number of documents in the index.
        - storage_size (int): The storage size of the index in bytes.

    This function sends a GET request to the Azure Cognitive Search service to retrieve the statistics
    of the specified index. It prints the status and statistics of the index, including the document count
    and storage size. If the request fails, it prints the status code of the failed request.
    """
    url = AZURE_SEARCH_ENDPOINT + "/indexes/" + index_name + "/stats?api-version=2021-04-30-Preview"
    headers = {
        "Content-Type": "application/json",
        "api-key": AZURE_SEARCH_KEY,
    }
    response = requests.get(url, headers=headers)
    print("Azure Cognitive Search index status for:", index_name, "\n")

    if response.status_code == 200:
        res = response.json()
        print(json.dumps(res, indent=2))
        document_count = res['documentCount']
        storage_size = res['storageSize']

    else:
        print("Request failed with status code:", response.status_code)

    return document_count, storage_size

In [12]:
def index_status(index_name):
    """
    Retrieves and displays the status of an Azure AI Search index.

    Parameters:
    index_name (str): The name of the Azure AI Search index.

    Returns:
    None: This function prints the status of the specified index in a formatted JSON structure.
          If the request fails, it prints an error message.

    This function sends a GET request to the Azure AI Search service to retrieve the status
    of the specified index. It prints the index name and the status information in a formatted
    JSON structure. If the request fails, it catches the exception and prints a failure message.
    """
    print("Azure AI Search Index:", index_name, "\n")

    headers = {"Content-Type": "application/json", "api-key": AZURE_SEARCH_KEY}
    params = {"api-version": "2021-04-30-Preview"}
    index_status = requests.get(AZURE_SEARCH_ENDPOINT + "/indexes/" +
                                index_name,
                                headers=headers,
                                params=params)
    try:
        print(json.dumps((index_status.json()), indent=5))
    except:
        print("Request failed")

In [13]:
def get_fields_result(res_string):
    """
    Extracts various fields from a string.

    Parameters:
    res_string (str): A string.

    Returns:
    tuple: A tuple containing the following extracted fields:
        - scene_desc (str): The scene description.
        - kind (str): The kind of the segment.
        - startTimeMs (int): The start time in milliseconds.
        - endTimeMs (int): The end time in milliseconds.
        - width (int): The width of the segment.
        - height (int): The height of the segment.
        - keyFrameTimesMs (list of int): A list of key frame times in milliseconds.
        - transcriptPhrases (list of str): A list of transcript phrases.
    """
    # Extract scene desc
    start_value_string = res_string.find('"valueString": "') + len(
        '"valueString": "')
    end_value_string = res_string.find('"}', start_value_string)
    scene_desc = res_string[start_value_string:end_value_string]

    # Extract kind
    start_kind = res_string.find('"kind": "') + len('"kind": "')
    end_kind = res_string.find('"', start_kind)
    kind = res_string[start_kind:end_kind]

    # Extract startTimeMs
    start_startTimeMs = res_string.find('"startTimeMs": ') + len(
        '"startTimeMs": ')
    end_startTimeMs = res_string.find(',', start_startTimeMs)
    startTimeMs = int(res_string[start_startTimeMs:end_startTimeMs])

    # Extract endTimeMs
    start_endTimeMs = res_string.find('"endTimeMs": ') + len('"endTimeMs": ')
    end_endTimeMs = res_string.find(',', start_endTimeMs)
    endTimeMs = int(res_string[start_endTimeMs:end_endTimeMs])

    # Extract width
    start_width = res_string.find('"width": ') + len('"width": ')
    end_width = res_string.find(',', start_width)
    width = int(res_string[start_width:end_width])

    # Extract height
    start_height = res_string.find('"height": ') + len('"height": ')
    end_height = res_string.find(',', start_height)
    height = int(res_string[start_height:end_height])

    # Extract KeyFrameTimesMs
    start_keyFrameTimesMs = res_string.find('"KeyFrameTimesMs": [') + len(
        '"KeyFrameTimesMs": [')
    end_keyFrameTimesMs = res_string.find(']', start_keyFrameTimesMs)
    keyFrameTimesMs_str = res_string[start_keyFrameTimesMs:end_keyFrameTimesMs]
    keyFrameTimesMs = [int(x) for x in keyFrameTimesMs_str.split(',')]

    # Extract transcriptPhrases
    start_transcriptPhrases = res_string.find('"transcriptPhrases": [') + len(
        '"transcriptPhrases": [')
    end_transcriptPhrases = res_string.find(']', start_transcriptPhrases)
    transcriptPhrases_str = res_string[
        start_transcriptPhrases:end_transcriptPhrases]
    transcriptPhrases = [
        x.strip() for x in transcriptPhrases_str.split(',') if x.strip()
    ]

    return scene_desc, kind, startTimeMs, endTimeMs, width, height, keyFrameTimesMs, transcriptPhrases

In [14]:
def display_results(docs):
    """
    Displays the extracted fields from a list of document objects.

    Parameters:
    docs (list): A list of document objects, each containing a page_content attribute with JSON-like data.

    Returns:
    None: This function prints the extracted fields for each document.
    """
    for i, doc in enumerate(docs, start=1):
        print("\033[1;31;34m")
        res_string = doc.page_content.split("```")[1].replace("'", "\"")

        print(f"Results {i}:\n")
        scene_desc, kind, startTimeMs, endTimeMs, width, height, keyFrameTimesMs, transcriptPhrases = get_fields_result(
            res_string)

        print("Scene description:", scene_desc)
        print("Kind:", kind)
        print("StartTimeMs:", startTimeMs)
        print("EndTimeMs:", endTimeMs)
        print("Width:", width)
        print("Height:", height)
        print("KeyFrameTimesMs:", keyFrameTimesMs)
        print("TranscriptPhrases:", transcriptPhrases)

In [15]:
def conversational_search(rag_chain, query):
    """
    Perform a conversational search using a Retrieval-Augmented Generation (RAG) chain.

    This function takes a query, invokes the RAG chain with the query, and prints the result.

    Args:
        rag_chain (dict): A dictionary representing the RAG chain, which includes the context retriever,
                          prompt, language model, and output parser.
        query (str): The search query to be processed by the RAG chain.

    Returns:
        None
    """
    print(rag_chain.invoke(query))

In [16]:
def generate_subclip(video_file, start_time_ms, end_time_ms):
    """
    Generates a subclip from a video file.

    This function takes a video file and extracts a subclip from it, starting
    slightly before the specified start time and ending slightly after the
    specified end time. The subclip is then saved as "sample.mp4".

    Parameters
    ----------
    video_file : str
        The path to the video file.
    start_time_ms : int
        The start time of the subclip in milliseconds.
    end_time_ms : int
        The end time of the subclip in milliseconds.

    Returns
    -------
    None
    """
    start = int(start_time_ms / 1000) - 1
    end =  int(end_time_ms / 1000) + 1
    clip = VideoFileClip(video_file).subclipped(start, end)
    clip.write_videofile("sample.mp4", codec="libx264")

## File to Analyze

In [17]:
video_file = os.path.join(DOCUMENTS_DIR, "movie.mp4")

In [18]:
Video(video_file, width=512)

In [19]:
vid = cv2.VideoCapture(video_file)

fps = int(vid.get(cv2.CAP_PROP_FPS))
nbframes = int(vid.get(cv2.CAP_PROP_FRAME_COUNT))
duration = nbframes / fps
min_vid = int(duration // 60)
sec_vid = int(duration % 60)

print(f"Duration of {video_file} = {min_vid} minutes and {sec_vid} seconds")
print(f"Total number of frames = {nbframes}")
print(f"FPS = {fps}")
!ls $video_file -lh

Duration of documents/movie.mp4 = 2 minutes and 31 seconds
Total number of frames = 3474
FPS = 23
-rwxrwxrwx 1 root root 17M Jan 31 08:37 documents/movie.mp4


## Generate Video Segment Description
Create a custom analyzer with pre-defined schema. The custom analyzer schema is defined in [./video_content_understanding_basic.json](./video_content_understanding_basic.json)

In [20]:
ANALYZER_TEMPLATE_PATH = os.path.join(JSON_DIR, "video_content_understanding.json")

ANALYZER_ID = f"videoanalyzer{datetime.datetime.today().strftime('%d%b%Y%H%M%S')}"

In [21]:
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default")

# Create the Azure Content Understanding client
client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    token_provider=token_provider,
    x_ms_useragent=
    "azure-ai-content-understanding-python/search_with_video",  # This header is used for sample usage telemetry
)

In [22]:
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_PATH)
result = client.poll_result(response)

print("\033[1;31;35m")
print(json.dumps(result, indent=4))

[1;31;35m
{
    "id": "622f01b0-3804-4299-9973-4baf2b8d6e28",
    "status": "Succeeded",
    "result": {
        "analyzerId": "videoanalyzer31Jan2025124942",
        "description": "Generating content understanding from video.",
        "createdAt": "2025-01-31T12:49:43Z",
        "lastModifiedAt": "2025-01-31T12:49:43Z",
        "config": {
            "locales": [
                "en-US",
                "es-ES",
                "es-MX",
                "fr-FR",
                "hi-IN",
                "it-IT",
                "ja-JP",
                "ko-KR",
                "pt-BR",
                "zh-CN"
            ],
            "returnDetails": true,
            "enableFace": false
        },
        "fieldSchema": {
            "name": "Content Understanding",
            "fields": {
                "segmentDescription": {
                    "type": "string",
                    "description": "Detailed summary of the video segment, focusing on people, places, and actions 

### Use the created analyzer to extract video segment description

In [23]:
start = time.time()

# Submit the video for content analysis
response = client.begin_analyze(ANALYZER_ID, file_location=video_file)

# Wait for the analysis to complete and get the content analysis result
video_result = client.poll_result(response, timeout_seconds=3600)

elapsed = time.time() - start
print(f"Done in {time.strftime('%H:%M:%S.' + str(elapsed % 1)[2:5], time.gmtime(elapsed))}")

Done in 00:02:31.631


In [24]:
print(json.dumps(video_result, indent=4))

{
    "id": "80f9e90a-fcd9-4be5-9e2c-42875662c16a",
    "status": "Succeeded",
    "result": {
        "analyzerId": "videoanalyzer31Jan2025124942",
        "apiVersion": "2024-12-01-preview",
        "createdAt": "2025-01-31T12:49:45Z",
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' detected with severity 'Low'. Please use the field content with caution."
            },
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' detected with severity 'Low'. Please use the field content with caution."
            },
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' detected with severity 'Low'. Please use the field content with caution."
            },
            {
                "code": "HarmfulContentDetected",
                "message": "Content of category 'Violence' det

### Pre-process the video segmentation descriptions with Azure Content Understanding

In [25]:
segments = get_scene_description(video_result)
print(f"Number of documents: {len(segments)} segments.")

Number of documents: 102 segments.


In [26]:
for i, segment in enumerate(segments, start=1):
    print("\033[1;31;34m")
    print(f"Scene {i}: {segment.page_content}")

[1;31;34m
Scene 1: The following is a json string representing a video segment with scene description and transcript ```{'fields': {'segmentDescription': {'type': 'string', 'valueString': 'The video opens with the iconic Warner Bros. Pictures logo, signaling the beginning of a movie. The logo is presented in its typical format, showing the emblem with the text "A TimeWarner Company" beneath it.'}}, 'kind': 'audioVisual', 'startTimeMs': 0, 'endTimeMs': 1418, 'width': 1280, 'height': 534, 'KeyFrameTimesMs': [737], 'transcriptPhrases': []}```
[1;31;34m
Scene 2: The following is a json string representing a video segment with scene description and transcript ```{'fields': {'segmentDescription': {'type': 'string', 'valueString': 'The Warner Bros. Pictures logo fades out, and the screen transitions to black, maintaining a sense of anticipation for what is to follow.'}}, 'kind': 'audioVisual', 'startTimeMs': 1418, 'endTimeMs': 3253, 'width': 1280, 'height': 534, 'KeyFrameTimesMs': [2048, 26

## Embed and index the chunks
Add the scene description segments as documents to Azure Search.

In [27]:
start = time.time()

vector_store = load_into_index(segments)

elapsed = time.time() - start
print(f"Done in {time.strftime('%H:%M:%S.' + str(elapsed % 1)[2:5], time.gmtime(elapsed))}")

Done in 00:00:09.137


In [28]:
# Please wait a couple of seconds to get the results
index_status(AZURE_SEARCH_INDEX_NAME)

Azure AI Search Index: azure_cu_movieteaser 

{
     "@odata.context": "https://azureaisearch-sr.search.windows.net/$metadata#indexes/$entity",
     "@odata.etag": "\"0x8DD41F3835CFFD5\"",
     "name": "azure_cu_movieteaser",
     "defaultScoringProfile": null,
     "fields": [
          {
               "name": "id",
               "type": "Edm.String",
               "searchable": false,
               "filterable": true,
               "retrievable": true,
               "sortable": false,
               "facetable": false,
               "key": true,
               "indexAnalyzer": null,
               "searchAnalyzer": null,
               "analyzer": null,
               "normalizer": null,
               "synonymMaps": []
          },
          {
               "name": "content",
               "type": "Edm.String",
               "searchable": true,
               "filterable": false,
               "retrievable": true,
               "sortable": false,
               "facetabl

In [29]:
document_count, storage_size = get_index_stats(AZURE_SEARCH_INDEX_NAME)

Azure Cognitive Search index status for: azure_cu_movieteaser 

{
  "@odata.context": "https://azureaisearch-sr.search.windows.net/$metadata#Microsoft.Azure.Search.V2021_04_30_Preview.IndexStatistics",
  "documentCount": 102,
  "storageSize": 2135075
}


In [30]:
print("Number of documents in the Azure AI Search index =", f"{document_count:,}")
print("Size of the index =", round(storage_size / (1024 * 1024), 2), "MB")

Number of documents in the Azure AI Search index = 102
Size of the index = 2.04 MB


## Retrieve relevant content
#### Execute a pure vector similarity search

In [31]:
query = "ford"

# Perform a similarity search
docs = vector_store.similarity_search(
    query=query,
    k=3,
    search_type="similarity",
)

display_results(docs)

[1;31;34m
Results 1:

Scene description: A close-up shot shows someone meticulously polishing a Gran Torino Sport car. The focus is on the care and attention given to the vehicle, indicating its importance or sentimental value to the person cleaning it.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: A close-up shot of a person cleaning a "Gran Torino Sport" car with a cloth. This action highlights attention to detail and care for the vehicle, suggesting a sense of pride and ownership.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worke

In [32]:
generate_subclip(video_file, 25065, 25761)
Video("sample.mp4", width=512)

{'video_found': True, 'audio_found': True, 'metadata': {'major_brand': 'mp42', 'minor_version': '0', 'compatible_brands': 'isommp42', 'creation_time': '2018-11-15T20:06:06.000000Z'}, 'inputs': [{'streams': [{'input_number': 0, 'stream_number': 0, 'stream_type': 'video', 'language': None, 'default': True, 'size': [1280, 534], 'bitrate': 847, 'fps': 23.976023976023978, 'codec_name': 'h264', 'profile': '(Main)', 'metadata': {'Metadata': '', 'creation_time': '2018-11-15T20:06:06.000000Z', 'handler_name': 'ISO Media file produced by Google Inc. Created on: 11/15/2018.', 'vendor_id': '[0][0][0][0]'}}, {'input_number': 0, 'stream_number': 1, 'stream_type': 'audio', 'language': None, 'default': True, 'fps': 44100, 'bitrate': 127, 'metadata': {'Metadata': '', 'creation_time': '2018-11-15T20:06:06.000000Z', 'handler_name': 'ISO Media file produced by Google Inc. Created on: 11/15/2018.', 'vendor_id': '[0][0][0][0]'}}], 'input_number': 0}], 'duration': 144.92000000000002, 'bitrate': 978, 'start':

                                                       

MoviePy - Done.
MoviePy - Writing video sample.mp4



                                                                       

MoviePy - Done !
MoviePy - video ready sample.mp4


#### Execute hybrid search. Vector and nonvector text fields are queried in parallel, results are merged, and top matches of the unified result set are returned.

In [33]:
# Perform a hybrid search using the search_type parameter
docs = vector_store.hybrid_search(query=query, k=3)

display_results(docs)

[1;31;34m
Results 1:

Scene description: A close-up shot shows someone meticulously polishing a Gran Torino Sport car. The focus is on the care and attention given to the vehicle, indicating its importance or sentimental value to the person cleaning it.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: A close-up shot of a person cleaning a "Gran Torino Sport" car with a cloth. This action highlights attention to detail and care for the vehicle, suggesting a sense of pride and ownership.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worke

## Others queries

In [34]:
query = "priest"

# Perform a hybrid search using the search_type parameter
docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: Inside the church, several people are seated in the pews, dressed formally. The setting suggests a solemn or formal occasion, possibly a service or ceremony.
Kind: audioVisual
StartTimeMs: 49007
EndTimeMs: 50676
Width: 1280
Height: 534
KeyFrameTimesMs: [49556, 50089]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 49120', '"endTimeMs": 52880', '"text": "Dorothy mentioned specifically that it was her desire for you to go to confession."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: Inside the church, a group of people is seated in the pews. The atmosphere is somber, suggesting a service or ceremony is taking place. The attendees are dressed in formal attire, indicating the seriousness of the occasion.
Kind: audioVisual
StartTimeMs: 49007
EndTimeMs: 50676
Width: 1280
Height: 534
KeyFrameTimesMs: [49556, 50089]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 49120', '"endTimeMs": 52880', '"t

In [35]:
generate_subclip(video_file, 49556, 50089)
Video("sample.mp4", width=512)

{'video_found': True, 'audio_found': True, 'metadata': {'major_brand': 'mp42', 'minor_version': '0', 'compatible_brands': 'isommp42', 'creation_time': '2018-11-15T20:06:06.000000Z'}, 'inputs': [{'streams': [{'input_number': 0, 'stream_number': 0, 'stream_type': 'video', 'language': None, 'default': True, 'size': [1280, 534], 'bitrate': 847, 'fps': 23.976023976023978, 'codec_name': 'h264', 'profile': '(Main)', 'metadata': {'Metadata': '', 'creation_time': '2018-11-15T20:06:06.000000Z', 'handler_name': 'ISO Media file produced by Google Inc. Created on: 11/15/2018.', 'vendor_id': '[0][0][0][0]'}}, {'input_number': 0, 'stream_number': 1, 'stream_type': 'audio', 'language': None, 'default': True, 'fps': 44100, 'bitrate': 127, 'metadata': {'Metadata': '', 'creation_time': '2018-11-15T20:06:06.000000Z', 'handler_name': 'ISO Media file produced by Google Inc. Created on: 11/15/2018.', 'vendor_id': '[0][0][0][0]'}}], 'input_number': 0}], 'duration': 144.92000000000002, 'bitrate': 978, 'start':

                                                       

MoviePy - Done.
MoviePy - Writing video sample.mp4



                                                                       

MoviePy - Done !
MoviePy - video ready sample.mp4


In [36]:
query = "birthday cake"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: The camera focuses on a beautifully decorated birthday cake with the words "Happy Birthday Dad" written in blue frosting. This sets a celebratory and familial tone.
Kind: audioVisual
StartTimeMs: 26443
EndTimeMs: 27819
Width: 1280
Height: 534
KeyFrameTimesMs: [27154]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: A birthday cake with the message "Happy Birthday Dad" is displayed on a table, surrounded by plates and utensils. This signifies a celebration, likely a family gathering, honoring a father figure.
Kind: audioVisual
StartTimeMs: 26443
EndTimeMs: 27819
Width: 1280
Height: 534
KeyFrameTimesMs: [27154]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words"

In [38]:
generate_subclip(video_file, 26443, 27819)
Video("sample.mp4", width=512)

{'video_found': True, 'audio_found': True, 'metadata': {'major_brand': 'mp42', 'minor_version': '0', 'compatible_brands': 'isommp42', 'creation_time': '2018-11-15T20:06:06.000000Z'}, 'inputs': [{'streams': [{'input_number': 0, 'stream_number': 0, 'stream_type': 'video', 'language': None, 'default': True, 'size': [1280, 534], 'bitrate': 847, 'fps': 23.976023976023978, 'codec_name': 'h264', 'profile': '(Main)', 'metadata': {'Metadata': '', 'creation_time': '2018-11-15T20:06:06.000000Z', 'handler_name': 'ISO Media file produced by Google Inc. Created on: 11/15/2018.', 'vendor_id': '[0][0][0][0]'}}, {'input_number': 0, 'stream_number': 1, 'stream_type': 'audio', 'language': None, 'default': True, 'fps': 44100, 'bitrate': 127, 'metadata': {'Metadata': '', 'creation_time': '2018-11-15T20:06:06.000000Z', 'handler_name': 'ISO Media file produced by Google Inc. Created on: 11/15/2018.', 'vendor_id': '[0][0][0][0]'}}], 'input_number': 0}], 'duration': 144.92000000000002, 'bitrate': 978, 'start':

                                                       

MoviePy - Done.
MoviePy - Writing video sample.mp4



                                                                       

MoviePy - Done !
MoviePy - video ready sample.mp4


In [39]:
query = "Clint Eastwood"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: The segment showcases a series of credits introducing the director, Clint Eastwood. The text gradually becomes clearer, emphasizing the director"s name. This segment sets the stage for the film, highlighting the prominent role of Clint Eastwood in its creation.
Kind: audioVisual
StartTimeMs: 19394
EndTimeMs: 23315
Width: 1280
Height: 534
KeyFrameTimesMs: [20191, 20969, 21747, 22526]
TranscriptPhrases: []
[1;31;34m
Results 2:

Scene description: The segment showcases a series of credits introducing the director, Clint Eastwood. The text gradually appears on a black screen, creating a dramatic and anticipatory atmosphere.
Kind: audioVisual
StartTimeMs: 19394
EndTimeMs: 23315
Width: 1280
Height: 534
KeyFrameTimesMs: [20191, 20969, 21747, 22526]
TranscriptPhrases: []
[1;31;34m
Results 3:

Scene description: A close-up shot of a person cleaning a "Gran Torino Sport" car with a cloth. This action highlights attention to detail and care for the vehi

In [40]:
query = "American flag"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: The young woman stands in front of the porch, with an American flag visible in the background, suggesting a patriotic setting.
Kind: audioVisual
StartTimeMs: 100517
EndTimeMs: 101393
Width: 1280
Height: 534
KeyFrameTimesMs: [100956]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 99880', '"endTimeMs": 101280', '"text": "We"ve got beer', 'too."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: The exterior of a house is shown, illuminated by a porch light. An American flag is displayed prominently, and a car is parked in front of the house, suggesting a typical suburban setting. The scene conveys a sense of nighttime tranquility.
Kind: audioVisual
StartTimeMs: 61228
EndTimeMs: 62521
Width: 1280
Height: 534
KeyFrameTimesMs: [61884]
TranscriptPhrases: []
[1;31;34m
Results 3:

Scene description: An elderly man is standing outside a house, with an American flag visible in the background. He appears to be obse

In [41]:
query = "Gun"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: The person with the gun remains in a defensive stance, still facing the group. The setting continues to be the residential porch area, with tension in the air.
Kind: audioVisual
StartTimeMs: 74866
EndTimeMs: 76243
Width: 1280
Height: 534
KeyFrameTimesMs: [75563]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 75920', '"endTimeMs": 76480', '"text": "Thank you."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: A person is seen holding a gun in a defensive posture. The background indicates a residential area, possibly from a front porch, with dim lighting suggesting nighttime. The audio includes a stern command, "Get off my lawn."
Kind: audioVisual
StartTimeMs: 71655
EndTimeMs: 73323
Width: 1280
Height: 534
KeyFrameTimesMs: [72205, 72737]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 71880', '"endTimeMs": 73200', '"text": "Get off my lawn."', '"confidence": 1', '"words": [']
[1;31;34m
Result

In [42]:
query = "White pickup"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: The older man is now standing outside his white pickup truck, facing a group of young individuals on the street, suggesting a confrontation.
Kind: audioVisual
StartTimeMs: 11386
EndTimeMs: 13347
Width: 1280
Height: 534
KeyFrameTimesMs: [12041, 12696]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 10760', '"endTimeMs": 14760', '"text": "Never knows how you come across somebody once in a while you shouldn"t have messed with."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: The scene transitions to a street corner with a stop sign. A white pickup truck passes by, while a group of people stand near a brick building, indicating an urban setting.
Kind: audioVisual
StartTimeMs: 3253
EndTimeMs: 4963
Width: 1280
Height: 534
KeyFrameTimesMs: [3809, 4382]
TranscriptPhrases: []
[1;31;34m
Results 3:

Scene description: A scene is shown depicting a quiet urban street corner. A white pickup truck drives past a stop 

In [43]:
query = "emergency vehicle"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: A nighttime street view shows a vehicle rapidly driving away, possibly the shooter"s getaway. The scene then briefly displays a military medal, suggesting a connection to a character"s past or achievements.
Kind: audioVisual
StartTimeMs: 121038
EndTimeMs: 123081
Width: 1280
Height: 534
KeyFrameTimesMs: [121720, 122416]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 122720', '"endTimeMs": 123920', '"text": "What was it like to kill a man?"', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: Two individuals stand outside at night with a police car in the background, indicating a serious or possibly dangerous situation.
Kind: audioVisual
StartTimeMs: 127419
EndTimeMs: 128545
Width: 1280
Height: 534
KeyFrameTimesMs: [127986]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 127200', '"endTimeMs": 131840', '"text": "Tao and Sue are never going to find peace in this world as long as that gang"s around

In [44]:
query = "a person who is cleaning a car"

docs = vector_store.hybrid_search(query=query, k=3)
display_results(docs)

[1;31;34m
Results 1:

Scene description: A close-up shot of a person cleaning a "Gran Torino Sport" car with a cloth. This action highlights attention to detail and care for the vehicle, suggesting a sense of pride and ownership.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worked hard your whole life."', '"confidence": 1', '"words": [']
[1;31;34m
Results 2:

Scene description: A close-up shot shows someone meticulously polishing a Gran Torino Sport car. The focus is on the care and attention given to the vehicle, indicating its importance or sentimental value to the person cleaning it.
Kind: audioVisual
StartTimeMs: 24358
EndTimeMs: 26443
Width: 1280
Height: 534
KeyFrameTimesMs: [25065, 25761]
TranscriptPhrases: ['{"speaker": "speaker"', '"startTimeMs": 26240', '"endTimeMs": 27600', '"text": "Dad', 'you worke

## Video Q&A
We can utilize OpenAI GPT completion models + Azure Search to conversationally search for and chat about the results. (If you are using GitHub Codespaces, there will be an input prompt near the top of the screen)

In [45]:
# Setup rag chain
prompt_video = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""

In [46]:
print(prompt_video)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:


In [47]:
def setup_rag_chain(vector_store):
    """
    Set up a Retrieval-Augmented Generation (RAG) chain using Azure OpenAI and Azure Search.

    This function configures a RAG chain by creating a retriever from the provided vector store,
    formatting documents, and setting up a language model with a prompt template.

    Args:
        vector_store (AzureSearch): An AzureSearch object used to retrieve documents based on similarity.

    Returns:
        dict: A dictionary representing the RAG chain, which includes the context retriever, prompt,
              language model, and output parser.
    """
    # Retriever
    retriever = vector_store.as_retriever(search_type="similarity", k=3)
    # Prompt
    prompt = ChatPromptTemplate.from_template(prompt_video)
    # LLM
    llm = AzureChatOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_CHAT_API_VERSION,
        azure_deployment=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        azure_ad_token_provider=token_provider,
        temperature=0.7,
    )

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # RAG chain
    rag_chain = ({
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
                 | prompt
                 | llm
                 | StrOutputParser())

    return rag_chain

In [48]:
# Enter a query. Enter an empty query to stop
rag_chain = setup_rag_chain(vector_store)

while True:
    query = input("Ask the video: ")
    if query == "":
        break
    conversational_search(rag_chain, query)

Ask the video:  hello


Hello! How can I assist you today?


Ask the video:  what is this video?




Ask the video:  what are the people names?


The people mentioned in the context are Tao and Sue.


Ask the video:  who is the director?


The director is Clint Eastwood.


Ask the video:  Do we have some cars?


Yes, the context describes scenes with cars, including a Gran Torino Sport car being polished and a car filled with individuals driving by.


Ask the video:  Do we have a ford car?


Yes, the context mentions a "Gran Torino Sport" car, which is a model manufactured by Ford.


Ask the video:  


## Post Processing

In [49]:
# Delete the analyzer if it is no longer needed
client.delete_analyzer(ANALYZER_ID)

<Response [204]>