# Project Week 1: ActivityNet Video Data Preparation and Indexing

In this project we will use the ActivityNet dataset (https://github.com/activitynet/ActivityNet). The tasks are:

 - Select the 10 videos with the most moments.
 - Download these videos onto your computer.
 - Extract the frames for every video.
 - Read the textual descriptions of each video.
 - Index the video data in OpenSearch.

This week, you will index the video data and make it searchable with OpenSearch. You should refer to the OpenSearch tutorial laboratory.

## Select videos
Download the `activity_net.v1-3.min.json` file, which contains the list of videos, from the ActivityNet GitHub repository.
Parse this file and select the 10 videos with the most moments (the entries under each video's `annotations` key).

In [2]:
import json
from pprint import pprint
import pprint as pp


with open('activity_net.v1-3.min.json', 'r') as json_data:
    data = json.load(json_data)

# Complete here
# Select 10 videos with the most moments
video_moments = {video_id: len(details["annotations"]) for video_id, details in data["database"].items()}
selected_videos = sorted(video_moments.keys(), key=lambda x: video_moments[x], reverse=True)[:10]

print("Selected videos:")
pprint(selected_videos)

Selected videos:
['o1WPnnvs00I',
 'oGwn4NUeoy8',
 'VEDRmPt_-Ms',
 'qF3EbR8y8go',
 'DLJqhYP-C0k',
 't6f_O8a4sSg',
 '6gyD-Mte2ZM',
 'jBvGvVw3R-Q',
 'PJ72Yl0B1rY',
 'QHn9KyE-zZo']
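
The download step from the task list is not shown in this notebook. The following is a minimal sketch of one way to do it, assuming the third-party `yt-dlp` package is installed; the helper names `youtube_url` and `download_videos` and the `videos` output directory are hypothetical choices, not part of the assignment. The URL pattern matches the `url` values stored in the ActivityNet JSON file.

```python
from pathlib import Path


def youtube_url(video_id: str) -> str:
    """Build the watch URL for an ActivityNet video id (the ids are YouTube ids)."""
    return f"https://www.youtube.com/watch?v={video_id}"


def download_videos(video_ids, out_dir="videos"):
    """Download each video into out_dir and return the ids that failed.

    Assumption: yt-dlp is installed (pip install yt-dlp). It is imported here,
    inside the function, so the rest of the notebook still runs without it.
    """
    from yt_dlp import YoutubeDL
    from yt_dlp.utils import DownloadError

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    options = {
        # Name each file after its video id, e.g. videos/<id>.<ext>
        "outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"),
        "quiet": True,
    }
    failed = []
    with YoutubeDL(options) as ydl:
        for video_id in video_ids:
            try:
                ydl.download([youtube_url(video_id)])
            except DownloadError:
                # The video may have been removed or made private.
                failed.append(video_id)
    return failed
```

Calling `download_videos(selected_videos)` would then fetch the ten selected videos. Some ActivityNet videos may no longer be available online, which is why the sketch collects failures instead of stopping at the first one.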


## Video frame extraction

PyAV is a Python binding for the libraries behind `ffmpeg`, a command-line video processing tool. The example below extracts the keyframes of a sample video and saves them as JPEG images.

In [3]:
import av
import av.datasets

content = av.datasets.curated("pexels/time-lapse-video-of-night-sky-857195.mp4")
with av.open(content) as container:
    # Signal that we only want to look at keyframes.
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"

    for i, frame in enumerate(container.decode(stream)):
        print(frame)
        frame.to_image().save(f"night-sky.{i:04d}.jpg", quality=80)

<av.VideoFrame, pts=0 yuv420p 1280x720 at 0x15fb0d6f2e0>
<av.VideoFrame, pts=75 yuv420p 1280x720 at 0x15fb0d6f340>
<av.VideoFrame, pts=150 yuv420p 1280x720 at 0x15faf24fa00>
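
The same keyframe loop can be applied to the downloaded ActivityNet videos. Below is a minimal sketch under a few assumptions: the video already exists as a local file, PyAV and Pillow are installed, and the helper names `frame_path` and `extract_keyframes` as well as the `frames/<video_id>/` layout are hypothetical choices made for illustration.

```python
from pathlib import Path


def frame_path(out_dir, video_id, index):
    """Return the JPEG path for the index-th extracted frame of a video."""
    return Path(out_dir) / video_id / f"{video_id}.{index:04d}.jpg"


def extract_keyframes(video_file, video_id, out_dir="frames"):
    """Save the keyframes of video_file as JPEGs and return (path, seconds) pairs."""
    import av  # imported lazily; requires PyAV (and Pillow for to_image)

    saved = []
    with av.open(str(video_file)) as container:
        stream = container.streams.video[0]
        # Same setting as in the example above: decode keyframes only.
        stream.codec_context.skip_frame = "NONKEY"
        for i, frame in enumerate(container.decode(stream)):
            target = frame_path(out_dir, video_id, i)
            target.parent.mkdir(parents=True, exist_ok=True)
            frame.to_image().save(target, quality=80)
            # frame.time is the presentation timestamp in seconds.
            saved.append((target, frame.time))
    return saved
```

Keeping the timestamp next to each saved frame makes it possible to relate a frame to the annotated moment it falls in later on.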


## Video metadata

Process the video metadata provided in the `json` file and index the video data in OpenSearch.

In [4]:
# Extract video metadata
video_metadata = {vid: {"duration": data["database"][vid]["duration"], "url": data["database"][vid].get("url", "N/A")} for vid in selected_videos}

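# Sort video metadata by duration (ascending order)
video_metadata = dict(sorted(video_metadata.items(), key=lambda x: x[1]["duration"]))

# Note: pprint sorts dict keys by default, so the printout below is in key order,
# not duration order; pass sort_dicts=False (Python 3.8+) to keep the duration order.
print("Video metadata (ordered by duration):")
pprint(video_metadata)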

Video metadata (ordered by duration):
{'6gyD-Mte2ZM': {'duration': 188.245,
                 'url': 'https://www.youtube.com/watch?v=6gyD-Mte2ZM'},
 'DLJqhYP-C0k': {'duration': 186.968,
                 'url': 'https://www.youtube.com/watch?v=DLJqhYP-C0k'},
 'PJ72Yl0B1rY': {'duration': 206.332,
                 'url': 'https://www.youtube.com/watch?v=PJ72Yl0B1rY'},
 'QHn9KyE-zZo': {'duration': 196.279,
                 'url': 'https://www.youtube.com/watch?v=QHn9KyE-zZo'},
 'VEDRmPt_-Ms': {'duration': 232.07999999999998,
                 'url': 'https://www.youtube.com/watch?v=VEDRmPt_-Ms'},
 'jBvGvVw3R-Q': {'duration': 218.62,
                 'url': 'https://www.youtube.com/watch?v=jBvGvVw3R-Q'},
 'o1WPnnvs00I': {'duration': 229.86,
                 'url': 'https://www.youtube.com/watch?v=o1WPnnvs00I'},
 'oGwn4NUeoy8': {'duration': 153.09,
                 'url': 'https://www.youtube.com/watch?v=oGwn4NUeoy8'},
 'qF3EbR8y8go': {'duration': 204.1,
                 'url': 'https://www.y

## Video captions

The ActivityNet Captions dataset (https://cs.stanford.edu/people/ranjaykrishna/densevid/) provides textual descriptions of each video. Index the video captions in a text field of your OpenSearch index.
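
As a minimal sketch of reading those captions, the helper below assumes the layout commonly used by the ActivityNet Captions JSON files: a dictionary keyed by `v_<video_id>`, where each entry holds parallel `timestamps` and `sentences` lists. The function name `collect_captions` and the `sample` entry are hypothetical and made up for illustration; check the layout against the files you actually download.

```python
def collect_captions(entries, video_ids):
    """Collect caption segments for the given video ids.

    entries: a parsed ActivityNet Captions JSON file (assumed layout, see above).
    Returns {video_id: [{"start": ..., "end": ..., "text": ...}, ...]}.
    """
    captions = {vid: [] for vid in video_ids}
    for vid in video_ids:
        entry = entries.get(f"v_{vid}")  # keys are assumed to carry a "v_" prefix
        if entry is None:
            continue  # this video has no captions in this file
        for (start, end), sentence in zip(entry["timestamps"], entry["sentences"]):
            captions[vid].append(
                {"start": start, "end": end, "text": sentence.strip()}
            )
    return captions


# Made-up sample entry, only to show the expected shape of the data.
sample = {
    "v_abc": {
        "duration": 10.0,
        "timestamps": [[0.0, 4.5], [4.5, 10.0]],
        "sentences": ["A man walks in.", " He sits down."],
    }
}
print(collect_captions(sample, ["abc", "zzz"]))
```

With a real file, `entries` would come from `json.load` on one of the downloaded caption files, and `video_ids` would be `selected_videos`.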

In [6]:
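# Extract video captions
# Note: the entries in `activity_net.v1-3.min.json` are not expected to carry a "captions" key,
# so this lookup falls back to an empty list; the captions themselves ship with the
# ActivityNet Captions dataset linked above.
video_captions = {vid: data["database"][vid].get("captions", []) for vid in selected_videos}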

# Index video captions in OpenSearch
from opensearchpy import OpenSearch

host = 'api.novasearch.org'
port = 443

user = 'user04'
password = 'no.LIMITS2100'
index_name = user
    
# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = (user, password),
    use_ssl = True,
    url_prefix = 'opensearch_v2',
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False
)

if client.indices.exists(index=index_name):

    resp = client.indices.open(index = index_name)
    print(resp)

    print('\n----------------------------------------------------------------------------------- INDEX SETTINGS')
    settings = client.indices.get_settings(index = index_name)
    pp.pprint(settings)

    print('\n----------------------------------------------------------------------------------- INDEX MAPPINGS')
    mappings = client.indices.get_mapping(index = index_name)
    pp.pprint(mappings)

    print('\n----------------------------------------------------------------------------------- INDEX #DOCs')
    print(client.count(index = index_name))
else:
    print("Index does not exist.")

# Index video captions
# to_index = [{"video_id": vid, "captions": captions} for vid, captions in video_captions.items()]
# for doc in to_index:
#     client.index(index=index_name, body=doc)

{'acknowledged': True, 'shards_acknowledged': True}

----------------------------------------------------------------------------------- INDEX SETTINGS
{'user04': {'settings': {'index': {'creation_date': '1742992264835',
                                   'knn': 'true',
                                   'number_of_replicas': '0',
                                   'number_of_shards': '4',
                                   'provided_name': 'user04',
                                   'refresh_interval': '1s',
                                   'replication': {'type': 'DOCUMENT'},
                                   'uuid': 'uG4JrAVyRx-jTvLatZpCfg',
                                   'version': {'created': '136387927'}}}}}

----------------------------------------------------------------------------------- INDEX MAPPINGS
{'user04': {'mappings': {'dynamic': 'strict',
                         'properties': {'contents': {'analyzer': 'standard',
                                             

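
The mapping printed above is strict and the only field visible in the (truncated) output is `contents`, so documents with other field names may be rejected. The sketch below therefore writes the caption text into `contents` only. It is a minimal example that assumes captions are available as a list of strings per video; `build_actions` and `index_captions` are hypothetical helper names, while `helpers.bulk` is the bulk helper shipped with `opensearch-py`.

```python
def build_actions(index_name, video_captions):
    """Turn {video_id: [caption, ...]} into bulk-indexing actions."""
    actions = []
    for video_id, captions in video_captions.items():
        actions.append(
            {
                "_index": index_name,
                "_id": video_id,  # one document per video
                # Only `contents` is used because it is the only mapped field
                # visible in the truncated mapping output (assumption).
                "_source": {"contents": " ".join(captions)},
            }
        )
    return actions


def index_captions(client, index_name, video_captions):
    """Send all caption documents in one bulk request and refresh the index."""
    from opensearchpy import helpers  # imported lazily; requires opensearch-py

    result = helpers.bulk(client, build_actions(index_name, video_captions))
    client.indices.refresh(index=index_name)
    return result
```

Using the video id as `_id` means that re-running the cell overwrites the existing documents instead of creating duplicates. The index has to be open when `index_captions` is called.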
In [7]:
resp = client.indices.close(index = index_name)
print(resp)

{'acknowledged': True, 'shards_acknowledged': True, 'indices': {'user04': {'closed': True}}}


In [8]:
 # TODO
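
One possible way to fill in the remaining step is a full-text query over the indexed captions. This is a minimal sketch, assuming the caption text was indexed in the `contents` field and that the index has been reopened with `client.indices.open(index = index_name)` after the close call above; `caption_query` and `search_captions` are hypothetical helper names.

```python
def caption_query(text, size=5):
    """Build a match query over the `contents` text field."""
    return {"size": size, "query": {"match": {"contents": text}}}


def search_captions(client, index_name, text, size=5):
    """Run the query and return (document id, score) pairs, best match first."""
    response = client.search(index=index_name, body=caption_query(text, size))
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]
```

For example, `search_captions(client, index_name, "man playing guitar")` would return the ids of the videos whose captions best match the query text.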