# Video Analysis with Amazon Nova Models

In this notebook, we demonstrate how to use Amazon Nova models for a video understanding application. 

## Use Case Description

We will use Amazon Nova to analyze security footage from an AnyCompany Telecom retail store. First, we pass the video directly to Nova for a high-level description of the footage. Then, we complete a frame-by-frame analysis for a more detailed task. In this module you will complete the following exercises:


- **Summarizing Video:** Extracting a brief summary and description of the provided video
- **Frame by frame analysis:** Using image frames extracted from the video to complete a task.

## Setup

This module uses ffmpeg, an open-source framework for processing video.

In [None]:
!sudo apt-get update
#ffmpeg install
!sudo apt-get -q install ffmpeg -y

In [None]:
%pip install --upgrade -r requirements.txt -q

In [None]:
# AWS/SageMaker imports
import boto3
import sagemaker
from sagemaker import get_execution_role
import shutil

# Core functionality
import os
import json
import base64
from pathlib import Path

# Video and image processing
from PIL import Image
from IPython.display import Video, Markdown, display, HTML, Image as IPyImage
from lib.frames import VideoProcessor

# Logging (if needed)
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)



In [None]:
# Get the region from the SageMaker session
region_name = sagemaker.Session().boto_region_name
print(f"Current AWS Region: {region_name}")

In [None]:
MICRO_MODEL_ID = "us.amazon.nova-micro-v1:0"
LITE_MODEL_ID = "us.amazon.nova-lite-v1:0"
PRO_MODEL_ID = "us.amazon.nova-pro-v1:0"
PREMIER_MODEL_ID = "us.amazon.nova-premier-v1:0"

## 1. Summarizing Video: Extracting a brief summary and description of the provided video
In this use case we will use Amazon Nova to analyze video using the Invoke API. 

Amazon Nova models can process video content in two ways:
1. **Base64 Method**: Include encoded video directly in the payload (limited to 25MB total payload size)
2. **S3 URI Method**: Reference larger videos (up to 1GB) stored in S3 buckets
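
To make the difference between the two methods concrete, here is a minimal sketch of how the `video` content block might be built either way. The `s3Location` layout is our reading of the Nova request schema and should be checked against the current documentation. `build_video_block`, `MAX_PAYLOAD_BYTES`, and the example bucket name are hypothetical and not part of this notebook's code.

```python
import base64
import os
import tempfile

# Assumption: the 25 MB limit applies to the encoded payload, and base64
# turns every 3 raw bytes into 4 encoded characters.
MAX_PAYLOAD_BYTES = 25 * 1024 * 1024


def build_video_block(video_path=None, s3_uri=None, video_format="mp4"):
    """Return a 'video' content block (hypothetical helper)."""
    if s3_uri is not None:
        # S3 URI method: the request only carries a reference to the object.
        return {
            "video": {
                "format": video_format,
                "source": {"s3Location": {"uri": s3_uri}},
            }
        }

    # Base64 method: check the encoded size before reading the file.
    raw_size = os.path.getsize(video_path)
    encoded_size = 4 * ((raw_size + 2) // 3)
    if encoded_size > MAX_PAYLOAD_BYTES:
        raise ValueError(
            f"{video_path} is {encoded_size} bytes once encoded; use the S3 URI method"
        )
    with open(video_path, "rb") as video_file:
        encoded = base64.b64encode(video_file.read()).decode("utf-8")
    return {"video": {"format": video_format, "source": {"bytes": encoded}}}


# Demo with a tiny stand-in file so the sketch runs without a real video.
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
    tmp.write(b"abc")
inline_block = build_video_block(video_path=tmp.name)
os.remove(tmp.name)

s3_block = build_video_block(s3_uri="s3://example-bucket/store-footage-01.mp4")
print(inline_block["video"]["source"]["bytes"])
print(s3_block["video"]["source"])
```

Checking the encoded size up front avoids reading a large file into memory only to have the request rejected.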

We'll use the base64 method here. We define one function that calls Nova to analyze the video and another that base64-encodes the video file for the request payload.

In [None]:
def call_nova(
    model,
    messages,
    system_message="",
    streaming=False,
    max_tokens=3000,
    temp=0.1,
    top_p=0.99,
    top_k=20,
    tools=None,
    verbose=False,
):
    """Call Amazon Nova models with various parameters.
    
    Args:
        model (str): The model ID to use
        messages (list): List of message objects with role and content
        system_message (str, optional): System prompt. Defaults to "".
        streaming (bool, optional): Whether to use streaming API. Defaults to False.
        max_tokens (int, optional): Maximum tokens to generate. Defaults to 3000.
        temp (float, optional): Temperature parameter. Defaults to 0.1.
        top_p (float, optional): Top-p parameter. Defaults to 0.99.
        top_k (int, optional): Top-k parameter. Defaults to 20.
        tools (list, optional): List of tool specifications. Defaults to None.
        verbose (bool, optional): Whether to print request body. Defaults to False.
        
    Returns:
        tuple or stream: Model response and content text if not streaming, else stream
    """
    client = boto3.client("bedrock-runtime")
    
    # Prepare system prompt
    system_list = [{"text": system_message}]
    
    # Prepare inference parameters
    inf_params = {
        "max_new_tokens": max_tokens,
        "top_p": top_p,
        "top_k": top_k,
        "temperature": temp,
    }
    
    # Build request body
    request_body = {
        "messages": messages,
        "system": system_list,
        "inferenceConfig": inf_params,
    }
    
    # Add tool configuration if provided
    if tools is not None:
        tool_config = []
        for tool in tools:
            tool_config.append({"toolSpec": tool})
        request_body["toolConfig"] = {"tools": tool_config}
    
    if verbose:
        print("Request Body", request_body)
    
    if not streaming:
        # Use synchronous API
        response = client.invoke_model(modelId=model, body=json.dumps(request_body))
        model_response = json.loads(response["body"].read())
        return model_response, model_response["output"]["message"]["content"][0]["text"]
    else:
        # Use streaming API
        response = client.invoke_model_with_response_stream(
            modelId=model, body=json.dumps(request_body)
        )
        return response["body"]


def get_base64_encoded_value(media_path):
    """Convert media file to base64 encoded string.
    
    Args:
        media_path (str): Path to the media file
        
    Returns:
        str: Base64 encoded string
    """
    with open(media_path, "rb") as media_file:
        binary_data = media_file.read()
        base_64_encoded_data = base64.b64encode(binary_data)
        base64_string = base_64_encoded_data.decode("utf-8")
        return base64_string

def print_output(content_text):
    """Display model output as Markdown.
    
    Args:
        content_text (str): Text to display
    """
    display(Markdown(content_text))
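
When `streaming=True`, `call_nova` returns the raw event stream instead of text. The sketch below shows one way to collect the text from such a stream. It assumes each event carries a `chunk` whose `bytes` decode to JSON with a `contentBlockDelta` entry, which is our understanding of the Bedrock streaming format. `collect_stream_text` and the simulated events are hypothetical.

```python
import json


def collect_stream_text(stream):
    """Concatenate the text deltas found in a response stream."""
    parts = []
    for event in stream:
        chunk = event.get("chunk")
        if not chunk:
            continue  # skip events that carry no payload
        payload = json.loads(chunk["bytes"].decode("utf-8"))
        delta = payload.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
    return "".join(parts)


def _fake_event(payload):
    """Wrap a dict the way we assume the service wraps each chunk."""
    return {"chunk": {"bytes": json.dumps(payload).encode("utf-8")}}


# Simulated events, so the sketch runs without calling the service.
fake_stream = [
    _fake_event({"messageStart": {"role": "assistant"}}),
    _fake_event({"contentBlockDelta": {"delta": {"text": "Hello, "}}}),
    _fake_event({"contentBlockDelta": {"delta": {"text": "store"}}}),
    _fake_event({"messageStop": {"stopReason": "end_turn"}}),
]
print(collect_stream_text(fake_stream))
```

In the notebook, the same function could be applied to the value returned by `call_nova(..., streaming=True)`.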

Let's take a look at the video we are working with. It displays a telecom retail store with employees dressed in pink.

In [None]:
#display the video

video_path = "video/store-footage-01.mp4"

# Verify file exists
if os.path.exists(video_path):
    # Display video with controls and specified dimensions
    display(Video(video_path, 
                 embed=True, 
                 width=800,  # Adjust width as needed
                 height=450, 
                 html_attributes="controls autoplay loop"))
else:
    print(f"Error: Video file not found at {video_path}")

We'll use a helper class in lib/frames.py to get some technical stats on the video.

In [None]:
processor = VideoProcessor("video/store-footage-01.mp4")

The video is captured at 1080p resolution and 24 frames per second.
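
The internals of `VideoProcessor` are not shown here. As an independent sketch, the same kind of stats can be read with `ffprobe`, which ships with ffmpeg. `probe_video` and `parse_ffprobe_stream` are hypothetical names, and the sample JSON is made up to mirror the numbers above.

```python
import json
import subprocess
from fractions import Fraction


def parse_ffprobe_stream(ffprobe_json):
    """Extract width, height, and frame rate from ffprobe's JSON output."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return {
        "width": int(stream["width"]),
        "height": int(stream["height"]),
        # r_frame_rate is a ratio string such as "24/1" or "30000/1001"
        "fps": float(Fraction(stream["r_frame_rate"])),
    }


def probe_video(video_path):
    """Run ffprobe on the first video stream (requires ffmpeg to be installed)."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=width,height,r_frame_rate",
            "-of", "json",
            video_path,
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_ffprobe_stream(result.stdout)


# Made-up ffprobe output, so the parsing step runs without a video file.
sample = '{"streams": [{"width": 1920, "height": 1080, "r_frame_rate": "24/1"}]}'
stats = parse_ffprobe_stream(sample)
print(stats)
```

On a machine with ffmpeg installed, `probe_video("video/store-footage-01.mp4")` would return the same dictionary shape for the real file.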

### Video Summarization

First we will use Amazon Nova Pro to give a high level summary of the video. 

In [None]:

system_message = f"""You are an AI assistant specialized in visual analysis and multimodal understanding.
Your task is to analyze a video showing a retail store environment. """

prompt = "Summarize the events in this video"
messages = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {
                        "bytes": get_base64_encoded_value(
                            "video/store-footage-01.mp4"
                        )
                    },
                }
            },
            {
                "text":prompt
            },
        ],
    }
]
model_response, content_text = call_nova(
    PRO_MODEL_ID, messages, system_message=system_message, max_tokens=300
)

print("\n[Response Content Text]")
print_output(content_text)
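
The call above caps the output at 300 tokens, so it is worth checking whether the summary was cut short. The sketch below pulls the text, stop reason, and token counts out of a response dictionary. The `stopReason` and `usage` field names, and the `max_tokens` stop value, are assumptions about the response format. `summarize_response` and the sample response are hypothetical.

```python
def summarize_response(model_response):
    """Collect text, stop reason, and token counts from a response dict."""
    content = model_response["output"]["message"]["content"]
    text = "".join(block.get("text", "") for block in content)
    usage = model_response.get("usage", {})
    return {
        "text": text,
        "stop_reason": model_response.get("stopReason"),
        "input_tokens": usage.get("inputTokens"),
        "output_tokens": usage.get("outputTokens"),
    }


# Made-up response in the shape call_nova already relies on
# (output -> message -> content -> text), plus the assumed extra fields.
sample_response = {
    "output": {"message": {"content": [{"text": "The video shows a store."}]}},
    "stopReason": "max_tokens",
    "usage": {"inputTokens": 1200, "outputTokens": 300},
}

summary = summarize_response(sample_response)
if summary["stop_reason"] == "max_tokens":
    print("Output was truncated; consider raising max_tokens.")
print(summary["output_tokens"])
```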

In [None]:
system_message = f"""You are an AI assistant specialized in visual analysis and multimodal understanding.
Your task is to analyze a video showing a retail store environment. """

prompt = """Describe 5 events that happen in this video. Do not include timestamps.
Output your response as a bullet point list.

Apply these definitions of physical location in store (reference these in your response):
1. ENTRANCE ZONE - Glass door entry area at top of image
2. CENTER PRODUCT DISPLAY TABLES - Wooden tables in center with smartphones/tablets
3. LEFT ACCESSORY WALL - Wall displays on left side of the image/video
4. RIGHT ACCESSORY WALL - Wall displays on right side of the image/video
5. BACK PRODUCT DISPLAY TABLE - Wooden table closest to the camera with devices 
6. SERVICE COUNTER LEFT - Staffed counter on left
7. SERVICE COUNTER RIGHT - Staffed counter on right
8. CENTER AISLE - Main walkway through center of the store

DO NOT make up events you don't see."""


messages = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {
                        "bytes": get_base64_encoded_value(
                            "video/store-footage-01.mp4"
                        )
                    },
                }
            },
            {
                "text":prompt
            },
        ],
    }
]
model_response, content_text = call_nova(
    PREMIER_MODEL_ID, messages, system_message=system_message, max_tokens=300
)

print("\n[Response Content Text]")
print_output(content_text)

## 2. Frame by frame analysis

Frame-by-frame analysis provides greater precision and control over video understanding. It lets us identify specific moments, track changes between exact points in time, and correlate events with precise timestamps, all of which are essential for detailed behavioral analysis and actionable insights.

In this exercise, we're going to use frame by frame analysis to analyze a customer's behavior in the store. To do this you have to pre-process the video as follows:

### Steps we'll take for pre-processing:
1. **Extract frames:** Extract image frames from the video at 4 frames per second (fps)
2. **Analyze frames:** Analyze extracted frames to keep frames with distinct actions
3. **Create composite image:** Arrange the retained frames in a grid and save it as a single composite image for Nova's analysis
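
For step 1, extracting frames at a fixed rate is commonly done with ffmpeg's `fps` filter. The sketch below only builds the command, so it runs without ffmpeg installed. It is not necessarily how `VideoProcessor` does it. `build_extract_command`, the output folder, and the file name pattern are hypothetical.

```python
import os
import shlex


def build_extract_command(video_path, output_dir, fps=4, max_resolution=(1280, 720)):
    """Build an ffmpeg command that samples frames and caps their size."""
    width, height = max_resolution
    # fps=N keeps N frames per second; the scale filter fits each frame inside
    # the target box while preserving the aspect ratio.
    filters = (
        f"fps={fps},"
        f"scale={width}:{height}:force_original_aspect_ratio=decrease"
    )
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", filters,
        "-q:v", "2",  # JPEG quality (lower is better)
        os.path.join(output_dir, "frame_%04d.jpg"),
    ]


command = build_extract_command("video/store-footage-01.mp4", "frames")
print(shlex.join(command))
# To execute: create the output folder, then subprocess.run(command, check=True)
```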

We will use a Python class called [VideoProcessor](./lib/frames.py) to complete the pre-processing steps.

**Note:** You can also pass each individual frame to Nova for analysis. However, a composite of image frames is a useful approach when you are designing for scale and need to reduce cost further.
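
To illustrate steps 2 and 3, here is a simplified sketch of distinct-frame selection and grid layout. It is not the `VideoProcessor` implementation, which is not shown in this notebook. Frames are represented as flat lists of grayscale values (0 to 255) so the example needs no image library, and the similarity measure (one minus the mean absolute pixel difference) is an assumption chosen for simplicity.

```python
import math


def frame_similarity(frame_a, frame_b):
    """Similarity in [0, 1]: 1 minus the mean absolute pixel difference."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    total_diff = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return 1.0 - total_diff / (255 * len(frame_a))


def select_distinct(frames, similarity_threshold=0.85):
    """Return indices of frames that differ enough from the last kept frame."""
    kept = []
    for index, frame in enumerate(frames):
        if not kept or frame_similarity(frames[kept[-1]], frame) < similarity_threshold:
            kept.append(index)
    return kept


def grid_positions(n_frames, columns, cell_width, cell_height):
    """Return the row count and the top-left paste position of each frame."""
    rows = math.ceil(n_frames / columns)
    positions = [
        ((i % columns) * cell_width, (i // columns) * cell_height)
        for i in range(n_frames)
    ]
    return rows, positions


# Five tiny 2x2 "frames": near-duplicates of a kept frame are dropped.
frames = [[0] * 4, [10] * 4, [200] * 4, [205] * 4, [0] * 4]
print(select_distinct(frames))  # [0, 2, 4]

# 21 frames in 4 columns need 6 rows, filled left to right, top to bottom.
rows, positions = grid_positions(21, columns=4, cell_width=320, cell_height=180)
print(rows, positions[4], positions[20])  # 6 (0, 180) (0, 900)
```

Comparing each frame against the last kept frame, rather than its immediate predecessor, prevents slow gradual changes from being discarded entirely.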

In [None]:
# Extract frames at 4 fps
frames = processor.extract_frames(fps=4, max_resolution=(1280, 720))

# Create the composite of distinct frames
composite = processor.create_composite_from_distinct_frames(
    frame_paths=frames,
    columns=4,
    similarity_threshold=0.85
)

# Save and display the composite
composite_path = "distinct_frames_composite.jpg"
composite.save(composite_path)


In [None]:
# Display the composite image of distinct frames
with Image.open(composite_path) as img:
    resized_img = img.resize((900, int(900 * img.size[1] / img.size[0])), Image.LANCZOS)
    display(IPyImage(data=resized_img._repr_png_()))

#### Now we provide that composite image to Nova and ask it to analyze a customer's behavior in the store

In [None]:
system_message = f"""You are an AI assistant specialized in visual analysis and multimodal understanding.
Your task is to analyze footage showing an AnyCompany Telecom retail store. """

customer_description = """customer in a white button-down shirt who is closest to the center 
of the image in the first frame"""


prompt = """
You are analyzing security camera footage from an AnyCompany Telecom retail store displayed as an image. 
In the image you are looking at a grid of multiple image frames from the security footage.  
Analyze the frames in the grid from left to right and top to bottom.

In this footage, there are two types of individuals:
- Employees: Wearing pink/magenta shirts
- Customers: Not wearing the uniform; They are wearing regular clothing.

IMAGE DETAILS:
- 21 sequential image frames arranged in a 4 by 6 grid
- Extracted at 4 frames per second
- Total time span of the original video: 5 seconds


ALWAYS apply these store zone definitions to each image frame (reference these in your response):
1. ENTRANCE ZONE - Glass door entry area at top of image
2. CENTER PRODUCT DISPLAY TABLES - Wooden tables in center with smartphones/tablets
3. LEFT ACCESSORY WALL - Wall displays on left side
4. RIGHT ACCESSORY WALL - Wall displays on right side
5. BACK PRODUCT DISPLAY TABLE - Wooden table closest to the camera with devices 
6. SERVICE COUNTER LEFT - Staffed counter on left
7. SERVICE COUNTER RIGHT - Staffed counter on right
8. CENTER AISLE - Main walkway through center of the store

CUSTOMER TO ANALYZE:
{customer_description}

YOUR TASK:
Analyze this customer frame-by-frame, then provide a HIGH-LEVEL summary answering: What was this customer interested in?

Provide your analysis in JSON format with this structure:

{
  "journey_path": {
    "zones_visited_in_order": ["zone1", "zone2", "zone3"],
    "time_in_each_zone": "Describe time spent in each zone",
    "movement_pattern": "Describe their movement style"
  },
  
  "interest_analysis": {
    "primary_interest": "What are they most interested in?",
    "evidence": "What behaviors show this?",
    "engagement_level": "High or Medium or Low",
    "specific_items": "Any specific products they examined?"
  },
  
  "staff_interaction": {
    "did_they_interact_with_staff": "Yes or No or About to",
    "which_counter": "Left or Right or None",
    "interaction_type": "Describe the interaction or N/A"
  }
}

At all times, be precise, factual, and justify your classifications based on visual evidence from the video.

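DO NOT make up events you don't see.

""".replace("{customer_description}", customer_description)
# str.replace is used instead of an f-string because the JSON template in the
# prompt contains literal braces that an f-string would try to interpolate.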

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "format": "jpeg",
                    "source": {
                        "bytes": get_base64_encoded_value(
                            "distinct_frames_composite.jpg"
                        )
                    },
                }
            },
            {
                "text":prompt
            },
        ],
    }
]

model_response, content_text = call_nova(
    PREMIER_MODEL_ID, messages, system_message=system_message, max_tokens=300
)

print("\n[Response Content Text]")
print_output(content_text)
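
Because the prompt asks for JSON, the response can be parsed into a dictionary for downstream use. Models sometimes wrap JSON in extra text or a code fence, so the sketch below keeps only the span from the first `{` to the last `}` before parsing. `extract_json` and the sample text are hypothetical.

```python
import json


def extract_json(text):
    """Return the JSON object embedded in model output, or None if parsing fails."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        # A response cut off mid-object (for example when max_tokens is too
        # small) lands here.
        return None


# Made-up model output with text around the JSON object.
sample_text = (
    "Here is the analysis:\n"
    '{"staff_interaction": {"which_counter": "Left"}}\n'
    "Let me know if you need more detail."
)
analysis = extract_json(sample_text)
print(analysis["staff_interaction"]["which_counter"])  # Left
```

In the notebook, `extract_json(content_text)` would give access to fields such as `journey_path` and `interest_analysis` when the response is complete.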

In [None]:
# Remove the triple quotes (""") around the code below to run this cleanup cell, then run the notebook again if interested

"""
# Delete the ./video/storage-footage-01 folder and distinct_frames_composite.jpg to run the code again
folder_path = './video/storage-footage-01'
file_path = './distinct_frames_composite.jpg'

# Delete folder and contents
try:
    if os.path.exists(folder_path):
        shutil.rmtree(folder_path)
        print(f"Successfully deleted {folder_path} and all its contents")
    else:
        print(f"Folder {folder_path} does not exist")
except Exception as e:
    print(f"Error occurred while deleting folder: {e}")

# Delete file
try:
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"Successfully deleted {file_path}")
    else:
        print(f"File {file_path} does not exist")
except Exception as e:
    print(f"Error occurred while deleting file: {e}")

"""

# Conclusion

In this notebook, we demonstrated Amazon Nova's video analysis capabilities, showcasing both direct video processing and frame-by-frame analysis approaches.

## Key Learnings

1. **Video Analysis Approaches**
   - **Direct Analysis**: Used Nova for holistic video understanding
   - **Frame Analysis**: Extracted and analyzed individual frames for detailed insights
   - **Composite Analysis**: Combined frames for efficient batch processing

2. **Technical Implementation**
   - Leveraged a reusable video processing pipeline with FFmpeg
   - Implemented frame similarity detection to reduce redundancy
   - Created composite visualizations for analysis

3. **Business Applications**
   - Customer journey tracking in retail environments
   - Behavioral analysis and movement patterns
   - Decision support for customer service


This workshop demonstrated how Nova can be effectively used for detailed video understanding tasks while maintaining processing efficiency.