# Extract Content from Your File

This notebook demonstrates how to use the Content Understanding API to extract semantic content from multimodal files.

## Prerequisites
1. Ensure the Azure AI service is configured by following these [steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run the sample.
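
Before running the cells below, it can help to confirm that the endpoint setting is actually present. The following is a minimal sketch, assuming the configuration is exposed through the `AZURE_AI_ENDPOINT` environment variable that this notebook reads; `missing_settings` is a hypothetical helper and not part of the sample repository.

```python
import os


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Fall back to the process environment when no mapping is supplied.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# AZURE_AI_ENDPOINT is the variable name used by this notebook.
missing = missing_settings(["AZURE_AI_ENDPOINT"])
if missing:
    print("Missing settings:", ", ".join(missing))
```

If the settings live in a `.env` file, run this check after `load_dotenv` so the values are already loaded into the environment.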

In [None]:
%pip install -r ../requirements.txt

## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing functions to interact with the Content Understanding API. Until the official Content Understanding SDK is released, it can be regarded as a lightweight SDK.


In [1]:
import logging
import json
import os
import sys
import uuid
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2024-12-01-preview")

# Add the parent directory to the path to use shared modules
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    token_provider=token_provider,
    # This header is used for sample usage telemetry. Comment out the next line to opt out.
    x_ms_useragent="azure-ai-content-understanding-python/content_extraction",
)

INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Request method: 'GET'
Request headers:
    'User-Agent': 'azsdk-python-identity/1.19.0 Python/3.13.1 (macOS-15.3-arm64-arm-64bit-Mach-O)'
No body was attached to the request
INFO:azure.identity._credentials.chained:DefaultAzureCredential acquired a token from AzureCliCredential


## Video Content
Video output provides detailed information about audiovisual content, specifically video shots. Here are the key features it offers:

1. Shot Information: Each shot is defined by a start and end time, along with a unique identifier. For example, Shot 0:0.0 to 0:2.800 includes a transcript and key frames.
1. Transcript: The API includes a transcript of the audio, formatted in WEBVTT, which allows for easy synchronization with the video. It captures spoken content and specifies the timing of the dialogue.
1. Key Frames: It provides a series of key frames (images) that represent important moments in the video shot, allowing users to visualize the content at specific timestamps.
1. Description: Each shot is accompanied by a description, providing context about the visuals presented. This helps in understanding the scene or subject matter without watching the video.
1. Audio Visual Metadata: Each entry includes details about the video, such as its dimensions (`width` and `height`), its kind (`audioVisual`), and the key frame timestamps (`KeyFrameTimesMs`).
1. Transcript Phrases: The output includes specific phrases from the transcript, along with timing and speaker information, enhancing the usability for applications like closed captioning or search functionalities.
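
The features listed above map onto the entries of `result["result"]["contents"]`. The following is a minimal sketch of walking those entries, using a trimmed copy of the first shot from this notebook's output as sample data; `summarize_shots` is a hypothetical helper and not part of the client.

```python
# Trimmed copy of one entry from result["result"]["contents"] (sample data).
sample_contents = [
    {
        "kind": "audioVisual",
        "startTimeMs": 0,
        "endTimeMs": 1467,
        "width": 1080,
        "height": 608,
        "KeyFrameTimesMs": [726],
        "transcriptPhrases": [
            {"speaker": "speaker", "startTimeMs": 1400, "endTimeMs": 6560},
        ],
    }
]


def summarize_shots(contents):
    """Return one small summary dict per shot."""
    summaries = []
    for shot in contents:
        summaries.append({
            # Shot length is the difference between its end and start times.
            "duration_ms": shot["endTimeMs"] - shot["startTimeMs"],
            "key_frames": len(shot.get("KeyFrameTimesMs", [])),
            "phrases": len(shot.get("transcriptPhrases", [])),
        })
    return summaries


for summary in summarize_shots(sample_contents):
    print(summary)
```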

In [2]:
ANALYZER_ID = "content-video-sample-" + str(uuid.uuid4())
ANALYZER_TEMPLATE_FILE = '../analyzer_templates/content_video.json'
ANALYZER_SAMPLE_FILE = '../data/FlightSimulator.mp4'

# Create analyzer
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_FILE)
result = client.poll_result(response)

# Analyze the sample file
response = client.begin_analyze(ANALYZER_ID, file_location=ANALYZER_SAMPLE_FILE)
result = client.poll_result(response)

print(json.dumps(result, indent=2))


INFO:python.content_understanding_client:Analyzer content-video-sample-8d1242a6-cbe0-4414-af65-b53fee8ed15c create request accepted.
INFO:python.content_understanding_client:Request result is ready after 0.00 seconds.
INFO:python.content_understanding_client:Analyzing file ../data/FlightSimulator.mp4 with analyzer: content-video-sample-8d1242a6-cbe0-4414-af65-b53fee8ed15c
INFO:python.content_understanding_client:Request cdb3b11c-7df8-4e48-92ed-854478992b4e in progress ...
INFO:python.content_understanding_client:Request cdb3b11c-7df8-4e48-92ed-854478992b4e in progress ...
INFO:python.content_understanding_client:Request cdb3b11c-7df8-4e48-92ed-854478992b4e in progress ...
INFO:python.content_understanding_client:Request cdb3b11c-7df8-4e48-92ed-854478992b4e in progress ...
INFO:python.content_understanding_client:Request cdb3b11c-7df8-4e48-92ed-854478992b4e in progress ...
INFO:python.content_understanding_client:Request cdb3b11c-7df8-4e48-92ed-854478992b4e in progress ...
INFO:python.c

{
  "id": "cdb3b11c-7df8-4e48-92ed-854478992b4e",
  "status": "Succeeded",
  "result": {
    "analyzerId": "content-video-sample-8d1242a6-cbe0-4414-af65-b53fee8ed15c",
    "apiVersion": "2024-12-01-preview",
    "createdAt": "2025-03-08T08:40:35Z",
    "contents": [
      {
        "markdown": "# Shot 00:00.000 => 00:01.467\n## Transcript\n```\nWEBVTT\n\n00:01.400 --> 00:06.560\n<v Speaker>When it comes to the neural TTS, in order to get a good voice, it's better to have good data.\n```\n## Key Frames\n- 00:00.726 ![](keyFrame.726.jpg)",
        "fields": {},
        "kind": "audioVisual",
        "startTimeMs": 0,
        "endTimeMs": 1467,
        "width": 1080,
        "height": 608,
        "KeyFrameTimesMs": [
          726
        ],
        "transcriptPhrases": [
          {
            "speaker": "speaker",
            "startTimeMs": 1400,
            "endTimeMs": 6560,
            "text": "When it comes to the neural TTS, in order to get a good voice, it's better to have good 

In [4]:
import pandas as pd

result_df = pd.DataFrame(result['result']['contents'])

print(result_df)

                                             markdown fields         kind  \
0   # Shot 00:00.000 => 00:01.467\n## Transcript\n...     {}  audioVisual   
1   # Shot 00:01.467 => 00:03.233\n## Transcript\n...     {}  audioVisual   
2   # Shot 00:03.233 => 00:07.367\n## Transcript\n...     {}  audioVisual   
3   # Shot 00:07.367 => 00:08.200\n## Transcript\n...     {}  audioVisual   
4   # Shot 00:08.200 => 00:11.367\n## Transcript\n...     {}  audioVisual   
5   # Shot 00:11.367 => 00:13.567\n## Transcript\n...     {}  audioVisual   
6   # Shot 00:13.567 => 00:16.100\n## Transcript\n...     {}  audioVisual   
7   # Shot 00:16.100 => 00:19.433\n## Transcript\n...     {}  audioVisual   
8   # Shot 00:19.433 => 00:23.967\n## Transcript\n...     {}  audioVisual   
9   # Shot 00:23.967 => 00:30.033\n## Transcript\n...     {}  audioVisual   
10  # Shot 00:30.033 => 00:33.200\n## Transcript\n...     {}  audioVisual   
11  # Shot 00:33.200 => 00:35.267\n## Transcript\n...     {}  audioVisual   

In [5]:
KEY_FRAME_DIR = '../data/key_frames'

# Create the key frame directory if it does not exist
Path(KEY_FRAME_DIR).mkdir(parents=True, exist_ok=True)

# Delete files left in the key frame directory by a previous run
for file in os.listdir(KEY_FRAME_DIR):
    os.remove(os.path.join(KEY_FRAME_DIR, file))

In [None]:
# from IPython.display import Markdown, display

# for index, row in result_df.iterrows():
#     markdown_str = ""
#     for col in result_df.columns:
#         # Expand arrays for specific columns
#         if col in ["KeyFrameTimesMs", "transcriptPhrases"] and isinstance(row[col], list):
#             markdown_str += f"**{col}:**\n"
#             for i, item in enumerate(row[col]):
#                 markdown_str += f"  - **{i}:** {item}\n"
#         else:
#             markdown_str += f"**{col}:** {row[col]}\n"
#     markdown_str += "\n---\n"  # Markdown horizontal rule as a separator between rows
#     display(Markdown(markdown_str))

In [6]:
# Assuming result_df is your pandas DataFrame and each row in 'KeyFrameTimesMs' is a list of integers
for times in result_df['KeyFrameTimesMs']:
    for time_ms in times:
        key_frame = client.get_image_from_analyze_operation(response, f'keyFrame.{time_ms}')
        # Process key_frame as needed
        file_path = f'{KEY_FRAME_DIR}/keyFrame.{time_ms}.jpg'
        with open(file_path, 'wb') as file:
            file.write(key_frame)
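
The loop above derives both the image identifier and the output file name from each key frame time. The following is a minimal sketch of that naming step on its own, so it can be checked without calling the service; `key_frame_targets` is a hypothetical helper, and the sample times are taken from the first two shots in this notebook's output.

```python
from pathlib import Path


def key_frame_targets(key_frame_times, out_dir):
    """Map each key frame time (ms) to its (image id, output path) pair."""
    targets = {}
    for times in key_frame_times:
        for time_ms in times:
            # Same naming pattern as the download loop: keyFrame.<ms>
            image_id = f"keyFrame.{time_ms}"
            targets[time_ms] = (image_id, Path(out_dir) / f"{image_id}.jpg")
    return targets


targets = key_frame_targets([[726], [2046, 2640]], "../data/key_frames")
for time_ms, (image_id, path) in targets.items():
    print(time_ms, image_id, path)
```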

In [8]:
result_df_temp = result_df.copy()

result_df_temp

| | markdown | fields | kind | startTimeMs | endTimeMs | width | height | KeyFrameTimesMs | transcriptPhrases |
|---|---|---|---|---|---|---|---|---|---|
| 0 | # Shot 00:00.000 => 00:01.467\n## Transcript\n... | {} | audioVisual | 0 | 1467 | 1080 | 608 | [726] | [{'speaker': 'speaker', 'startTimeMs': 1400, '... |
| 1 | # Shot 00:01.467 => 00:03.233\n## Transcript\n... | {} | audioVisual | 1467 | 3233 | 1080 | 608 | [2046, 2640] | [{'speaker': 'speaker', 'startTimeMs': 1400, '... |
| 2 | # Shot 00:03.233 => 00:07.367\n## Transcript\n... | {} | audioVisual | 3233 | 7367 | 1080 | 608 | [4059, 4884, 5709, 6534] | [{'speaker': 'speaker', 'startTimeMs': 1400, '... |
| 3 | # Shot 00:07.367 => 00:08.200\n## Transcript\n... | {} | audioVisual | 7367 | 8200 | 1080 | 608 | [7788] | [{'speaker': 'speaker', 'startTimeMs': 7600, '... |
| 4 | # Shot 00:08.200 => 00:11.367\n## Transcript\n... | {} | audioVisual | 8200 | 11367 | 1080 | 608 | [8976, 9768, 10560] | [{'speaker': 'speaker', 'startTimeMs': 7600, '... |
| 5 | # Shot 00:11.367 => 00:13.567\n## Transcript\n... | {} | audioVisual | 11367 | 13567 | 1080 | 608 | [12078, 12804] | [{'speaker': 'speaker', 'startTimeMs': 7600, '... |
| 6 | # Shot 00:13.567 => 00:16.100\n## Transcript\n... | {} | audioVisual | 13567 | 16100 | 1080 | 608 | [14190, 14817, 15444] | [{'speaker': 'speaker', 'startTimeMs': 13440, ... |
| 7 | # Shot 00:16.100 => 00:19.433\n## Transcript\n... | {} | audioVisual | 16100 | 19433 | 1080 | 608 | [16929, 17754, 18579] | [{'speaker': 'speaker', 'startTimeMs': 13440, ... |
| 8 | # Shot 00:19.433 => 00:23.967\n## Transcript\n... | {} | audioVisual | 19433 | 23967 | 1080 | 608 | [20196, 20955, 21714, 22473, 23232] | [{'speaker': 'speaker', 'startTimeMs': 13440, ... |
| 9 | # Shot 00:23.967 => 00:30.033\n## Transcript\n... | {} | audioVisual | 23967 | 30033 | 1080 | 608 | [24816, 25674, 26532, 27390, 28248, 29106] | [{'speaker': 'speaker', 'startTimeMs': 24080, ... |


In [9]:
# Point the key frame image links at the locally saved files
result_df_temp['markdown'] = result_df_temp['markdown'].str.replace("![](", f"![]({KEY_FRAME_DIR}/", regex=False)
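
The effect of this replacement is easiest to see on a single string. The following is a minimal sketch that applies the same literal substitution with plain `str.replace`; `prefix_image_links` is a hypothetical helper, and the sample text is the key frame section of the first shot.

```python
def prefix_image_links(markdown, prefix):
    """Prepend `prefix` to the target of every `![](...)` image link."""
    # Literal substitution: only links with an empty alt text are matched.
    return markdown.replace("![](", f"![]({prefix}")


shot = "## Key Frames\n- 00:00.726 ![](keyFrame.726.jpg)"
print(prefix_image_links(shot, "../data/key_frames/"))
```

Note that the substitution is not idempotent: applying it twice prefixes the path twice, which is one reason to run it on a copy of the DataFrame rather than on the original.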

In [10]:
from IPython.display import Markdown, display

target_column = "markdown"

for index, row in result_df_temp.iterrows():
    content = row[target_column]
    if isinstance(content, list):
        for item in content:
            display(Markdown(str(item)))
    else:
        display(Markdown(str(content)))

# Shot 00:00.000 => 00:01.467
## Transcript
```
WEBVTT

00:01.400 --> 00:06.560
<v Speaker>When it comes to the neural TTS, in order to get a good voice, it's better to have good data.
```
## Key Frames
- 00:00.726 ![](../data/key_frames/keyFrame.726.jpg)

# Shot 00:01.467 => 00:03.233
## Transcript
```
WEBVTT

00:01.400 --> 00:06.560
<v Speaker>When it comes to the neural TTS, in order to get a good voice, it's better to have good data.
```
## Key Frames
- 00:02.046 ![](../data/key_frames/keyFrame.2046.jpg)
- 00:02.640 ![](../data/key_frames/keyFrame.2640.jpg)

# Shot 00:03.233 => 00:07.367
## Transcript
```
WEBVTT

00:01.400 --> 00:06.560
<v Speaker>When it comes to the neural TTS, in order to get a good voice, it's better to have good data.
```
## Key Frames
- 00:04.059 ![](../data/key_frames/keyFrame.4059.jpg)
- 00:04.884 ![](../data/key_frames/keyFrame.4884.jpg)
- 00:05.709 ![](../data/key_frames/keyFrame.5709.jpg)
- 00:06.534 ![](../data/key_frames/keyFrame.6534.jpg)

# Shot 00:07.367 => 00:08.200
## Transcript
```
WEBVTT

00:07.600 --> 00:13.320
<v Speaker>To achieve that, we build a universal TTS model based on 3,000 hours of data.
```
## Key Frames
- 00:07.788 ![](../data/key_frames/keyFrame.7788.jpg)

# Shot 00:08.200 => 00:11.367
## Transcript
```
WEBVTT

00:07.600 --> 00:13.320
<v Speaker>To achieve that, we build a universal TTS model based on 3,000 hours of data.
```
## Key Frames
- 00:08.976 ![](../data/key_frames/keyFrame.8976.jpg)
- 00:09.768 ![](../data/key_frames/keyFrame.9768.jpg)
- 00:10.560 ![](../data/key_frames/keyFrame.10560.jpg)

# Shot 00:11.367 => 00:13.567
## Transcript
```
WEBVTT

00:07.600 --> 00:13.320
<v Speaker>To achieve that, we build a universal TTS model based on 3,000 hours of data.
00:13.440 --> 00:23.640
<v Speaker>We actually accumulated tons of the data so that this universal model is able to capture the nuance of the audio and generate a more natural voice for the algorithm.
```
## Key Frames
- 00:12.078 ![](../data/key_frames/keyFrame.12078.jpg)
- 00:12.804 ![](../data/key_frames/keyFrame.12804.jpg)

# Shot 00:13.567 => 00:16.100
## Transcript
```
WEBVTT

00:13.440 --> 00:23.640
<v Speaker>We actually accumulated tons of the data so that this universal model is able to capture the nuance of the audio and generate a more natural voice for the algorithm.
```
## Key Frames
- 00:14.190 ![](../data/key_frames/keyFrame.14190.jpg)
- 00:14.817 ![](../data/key_frames/keyFrame.14817.jpg)
- 00:15.444 ![](../data/key_frames/keyFrame.15444.jpg)

# Shot 00:16.100 => 00:19.433
## Transcript
```
WEBVTT

00:13.440 --> 00:23.640
<v Speaker>We actually accumulated tons of the data so that this universal model is able to capture the nuance of the audio and generate a more natural voice for the algorithm.
```
## Key Frames
- 00:16.929 ![](../data/key_frames/keyFrame.16929.jpg)
- 00:17.754 ![](../data/key_frames/keyFrame.17754.jpg)
- 00:18.579 ![](../data/key_frames/keyFrame.18579.jpg)

# Shot 00:19.433 => 00:23.967
## Transcript
```
WEBVTT

00:13.440 --> 00:23.640
<v Speaker>We actually accumulated tons of the data so that this universal model is able to capture the nuance of the audio and generate a more natural voice for the algorithm.
```
## Key Frames
- 00:20.196 ![](../data/key_frames/keyFrame.20196.jpg)
- 00:20.955 ![](../data/key_frames/keyFrame.20955.jpg)
- 00:21.714 ![](../data/key_frames/keyFrame.21714.jpg)
- 00:22.473 ![](../data/key_frames/keyFrame.22473.jpg)
- 00:23.232 ![](../data/key_frames/keyFrame.23232.jpg)

# Shot 00:23.967 => 00:30.033
## Transcript
```
WEBVTT

00:24.080 --> 00:29.120
<v Speaker>What we liked about cognitive services offerings were that they had a much higher fidelity.
00:29.600 --> 00:32.880
<v Speaker>And they sounded a lot more like an actual human voice.
```
## Key Frames
- 00:24.816 ![](../data/key_frames/keyFrame.24816.jpg)
- 00:25.674 ![](../data/key_frames/keyFrame.25674.jpg)
- 00:26.532 ![](../data/key_frames/keyFrame.26532.jpg)
- 00:27.390 ![](../data/key_frames/keyFrame.27390.jpg)
- 00:28.248 ![](../data/key_frames/keyFrame.28248.jpg)
- 00:29.106 ![](../data/key_frames/keyFrame.29106.jpg)

# Shot 00:30.033 => 00:33.200
## Transcript
```
WEBVTT

00:29.600 --> 00:32.880
<v Speaker>And they sounded a lot more like an actual human voice.
```
## Key Frames
- 00:30.822 ![](../data/key_frames/keyFrame.30822.jpg)
- 00:31.614 ![](../data/key_frames/keyFrame.31614.jpg)
- 00:32.406 ![](../data/key_frames/keyFrame.32406.jpg)

# Shot 00:33.200 => 00:35.267
## Transcript
```
WEBVTT

00:33.720 --> 00:37.080
<v Speaker>Orlando ground 9555 requesting the end of pushback.
```
## Key Frames
- 00:33.891 ![](../data/key_frames/keyFrame.33891.jpg)
- 00:34.584 ![](../data/key_frames/keyFrame.34584.jpg)

# Shot 00:35.267 => 00:37.700
## Transcript
```
WEBVTT

00:33.720 --> 00:37.080
<v Speaker>Orlando ground 9555 requesting the end of pushback.
```
## Key Frames
- 00:36.069 ![](../data/key_frames/keyFrame.36069.jpg)
- 00:36.861 ![](../data/key_frames/keyFrame.36861.jpg)

# Shot 00:37.700 => 00:39.200
## Transcript
```
WEBVTT

00:38.800 --> 00:41.160
<v Speaker>9555 request to end pushback received.
```
## Key Frames
- 00:38.181 ![](../data/key_frames/keyFrame.38181.jpg)
- 00:38.676 ![](../data/key_frames/keyFrame.38676.jpg)

# Shot 00:39.200 => 00:42.033
## Transcript
```
WEBVTT

00:38.800 --> 00:41.160
<v Speaker>9555 request to end pushback received.
```
## Key Frames
- 00:39.897 ![](../data/key_frames/keyFrame.39897.jpg)
- 00:40.590 ![](../data/key_frames/keyFrame.40590.jpg)
- 00:41.283 ![](../data/key_frames/keyFrame.41283.jpg)

# Shot 00:42.033 => 00:43.866
## Transcript
```
WEBVTT

```
## Key Frames
- 00:42.636 ![](../data/key_frames/keyFrame.42636.jpg)
- 00:43.230 ![](../data/key_frames/keyFrame.43230.jpg)
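
Each shot's markdown embeds its transcript as WEBVTT cues, so the timing and speaker can be recovered from the text if needed. The following is a minimal sketch, assuming cues always use the `MM:SS.mmm` form seen in this output (WEBVTT also permits an hours component, which this pattern does not handle); `parse_cues` is a hypothetical helper, and the same data is available in structured form under `transcriptPhrases`.

```python
import re

# One cue: a timing line followed by a "<v Speaker>text" line.
CUE_RE = re.compile(
    r"(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2})\.(\d{3})\n<v ([^>]*)>(.*)"
)


def parse_cues(markdown):
    """Extract (start_ms, end_ms, speaker, text) tuples from shot markdown."""
    cues = []
    for m in CUE_RE.finditer(markdown):
        # Convert minutes, seconds, and milliseconds to total milliseconds.
        start_ms = (int(m[1]) * 60 + int(m[2])) * 1000 + int(m[3])
        end_ms = (int(m[4]) * 60 + int(m[5])) * 1000 + int(m[6])
        cues.append((start_ms, end_ms, m[7], m[8].strip()))
    return cues


# Transcript section of the shot starting at 00:07.367 (sample data).
shot_transcript = (
    "WEBVTT\n\n"
    "00:07.600 --> 00:13.320\n"
    "<v Speaker>To achieve that, we build a universal TTS model "
    "based on 3,000 hours of data.\n"
)
print(parse_cues(shot_transcript))
```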

In [11]:
# Clean up: delete the analyzer created for this sample
client.delete_analyzer(ANALYZER_ID)

INFO:python.content_understanding_client:Analyzer content-video-sample-8d1242a6-cbe0-4414-af65-b53fee8ed15c deleted.


<Response [204]>