# Text-Video search app example

Create, deploy, feed, and query the Vespa app using the Vespa Python API.

This example requires FFmpeg for video processing.

## Install required packages


In [None]:
!pip install -r requirements.txt

## CLIP models

List the model variations:

In [None]:
import clip

clip.available_models()

Each model may have a different embedding size,
and this information is needed when creating the schema of the text-video search application.
The code below outputs the embedding dimensions of the supported models.
Because it downloads every model, which is a large download, it is commented out and a copy of its output is listed:

In [None]:
#embedding_info = {
#    name: clip.load(name)[0].visual.output_dim for name in clip.available_models()
#}
#embedding_info

````
{'RN50': 1024,
 'RN101': 512,
 'RN50x4': 640,
 'RN50x16': 768,
 'RN50x64': 1024,
 'ViT-B/32': 512,
 'ViT-B/16': 512,
 'ViT-L/14': 768,
 'ViT-L/14@336px': 768}
````

## Create and deploy a text-video search app


### Create the Vespa application package


The function `create_text_video_app` below uses [Pyvespa](https://pyvespa.readthedocs.io/en/latest/)
to create an application package with fields to store image embeddings extracted from the videos
that we want to search based on the selected CLIP models.
It also declares the types of the text embeddings that we are going to send along with the query when searching for images,
and creates one ranking profile for each (text, image) embedding model.

For this demonstration we use only one CLIP model, but we could just as well index all the available models for comparison.
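
The schema printed further down uses names such as `vit_b_32_image` and `vit_b_32_similarity` for the model `ViT-B/32`. The sketch below shows one way such names and tensor types could be derived from the `{model_name: dimension}` mapping. The helper names are hypothetical and the actual logic inside `create_text_video_app` may differ.

```python
import re


def model_slug(model_name):
    # Hypothetical helper: lower-case the name and collapse every run of
    # non-alphanumeric characters into "_", e.g. "ViT-B/32" -> "vit_b_32".
    return re.sub(r"[^a-z0-9]+", "_", model_name.lower()).strip("_")


def image_field_name(model_name):
    # Name of the document field that stores frame embeddings (assumed pattern).
    return model_slug(model_name) + "_image"


def rank_profile_name(model_name):
    # Name of the per-model rank profile (assumed pattern).
    return model_slug(model_name) + "_similarity"


def tensor_type(dimension):
    # Vespa tensor type for a dense float vector of the given size.
    return f"tensor<float>(x[{dimension}])"


models = {"ViT-B/32": 512, "RN50x4": 640}
for name, dimension in models.items():
    print(image_field_name(name), rank_profile_name(name), tensor_type(dimension))
```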

In [1]:
from embedding import create_text_video_app

app_package = create_text_video_app({"ViT-B/32": 512})

Inspect what the `schema` of the resulting application package looks like:

In [2]:
print(app_package.schema.schema_to_text)

schema videosearch {
    document videosearch {
        field video_file_name type string {
            indexing: summary | attribute
        }
        field vit_b_32_image type tensor<float>(x[512]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 500
                }
            }
        }
    }
    rank-profile vit_b_32_similarity inherits default {
        first-phase {
            expression {
                closeness(vit_b_32_image)
            }
        }
    }
}
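
The rank profile above scores each frame with `closeness(vit_b_32_image)`. To our understanding of Vespa's rank features, closeness is computed as `1 / (1 + distance)`, where the distance follows the field's `distance-metric` (here Euclidean). Below is a minimal sketch of that formula in plain Python, using hypothetical helper names and toy two-dimensional vectors instead of real 512-dimensional CLIP embeddings:

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance between two equally sized vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closeness(a, b):
    # Maps a distance in [0, inf) to a score in (0, 1]; identical vectors score 1.0.
    return 1.0 / (1.0 + euclidean_distance(a, b))


query_embedding = [0.0, 0.0]
frame_embedding = [3.0, 4.0]
print(closeness(query_embedding, frame_embedding))
```

A smaller distance gives a closeness nearer to 1, so frames whose embeddings lie closer to the query embedding rank higher.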


### Deploy to Vespa Cloud


If the deployment below fails, refer to [Authenticating with Vespa Cloud](https://pyvespa.readthedocs.io/en/latest/authenticating-to-vespa-cloud.html).
Replace `mytenant` with your tenant name:

In [None]:
from vespa.deployment import VespaCloud

vespa_cloud = VespaCloud(
    tenant="mytenant",
    application="videosearch",
    application_package=app_package,
)

In [None]:
app = vespa_cloud.deploy(
    instance="clip-video-search"
)

Alternatively, check [this guide](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html) to deploy locally in a Docker container.


## Download and convert the data

We are going to use the UCF101 dataset to allow users to follow along from
their laptop. We downloaded a [zipped file](http://storage.googleapis.com/thumos14_files/UCF101_videos.zip)
containing 13320 trimmed videos, each including one action,
and a [text file](http://crcv.ucf.edu/THUMOS14/Class%20Index.txt) containing the list of action
classes and their numerical index.

In [None]:
import os

video_dir=os.getcwd()
print(video_dir)

Because `.mp4` files are better supported, convert the `.avi` files to `.mp4` using [ffmpeg](https://www.ffmpeg.org/):

In [None]:
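import os
import subprocess

def convert_from_avi_to_mp4(file_name):
    # Lower-case only the file name; lower-casing the directory part too
    # could point ffmpeg at a path that does not exist.
    directory, base_name = os.path.split(file_name)
    root, _ = os.path.splitext(base_name)
    output_file = os.path.join(directory, root.lower() + ".mp4")
    subprocess.call(["ffmpeg", "-i", file_name, output_file])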

The code below converts one file at a time, which takes quite a while; it could be sped up with multiprocessing:


In [None]:
import glob

video_files = glob.glob(os.path.join(video_dir, "*.avi"))
for file_name in video_files:
    convert_from_avi_to_mp4(file_name)
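
One way to parallelize the loop above is a thread pool. Each conversion runs in a separate `ffmpeg` process, so threads are enough to keep several conversions going at once. This is a minimal sketch: `convert_all` and `max_workers=4` are hypothetical choices, not part of this example's code.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_all(file_names, convert, max_workers=4):
    # Run `convert` on every file using a pool of worker threads.
    # Results come back in the same order as `file_names`;
    # list() also re-raises any exception raised inside a worker.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert, file_names))


# Usage with the names defined in the cells above:
# exit_codes = convert_all(video_files, convert_from_avi_to_mp4)
```

Because `subprocess.call` returns the process exit code, the returned list can be used to spot files that failed to convert.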

## Compute and feed embeddings

The function below assumes you have downloaded the UCF101 dataset and converted it to `.mp4`.
It extracts frames from each video, computes image embeddings with a CLIP model, and sends them to the Vespa app.

In [None]:
from embedding import compute_and_send_video_embeddings

compute_and_send_video_embeddings(
    app=app,
    batch_size=32,
    clip_model_names=["ViT-B/32"],
    number_frames_per_video=4,
    video_dir=video_dir,
)

The function `compute_and_send_video_embeddings` is a more robust version of the following loop:


In [None]:
from torch.utils.data import DataLoader

# Assumption: VideoFeedDataset lives in the same `embedding` module as the functions above
from embedding import VideoFeedDataset

clip_model_names = ["ViT-B/32"]
batch_size = 32

for model_name in clip_model_names:
    video_dataset = (
        VideoFeedDataset(  # PyTorch Dataset that outputs pyvespa-compatible data
            video_dir=video_dir,  # Folder containing video files
            model_name=model_name,  # CLIP model name used to convert image into vector
            number_frames_per_video=4,  # Number of image frames to use per video
        )
    )
    dataloader = DataLoader(  # PyTorch DataLoader to loop through the dataset
        video_dataset,
        batch_size=batch_size,
        collate_fn=lambda x: [
            item for sublist in x for item in sublist
        ],  # turn list of lists into a flat list
    )
    for batch in dataloader:
        app.update_batch(batch=batch)
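
The `number_frames_per_video` argument controls how many frames are embedded per video. How the frames are picked is decided inside `embedding.py` and is not shown here. One plausible strategy, sketched below with a hypothetical helper, is to split the video into equal segments and take the middle frame of each:

```python
def evenly_spaced_frame_indices(total_frames, number_frames):
    # Return up to `number_frames` frame indices spread evenly over the video.
    if total_frames <= 0 or number_frames <= 0:
        return []
    # Never ask for more frames than the video contains.
    number_frames = min(number_frames, total_frames)
    segment = total_frames / number_frames
    # Middle of each segment, rounded down to a valid frame index.
    return [int(segment * i + segment / 2) for i in range(number_frames)]


print(evenly_spaced_frame_indices(total_frames=100, number_frames=4))
```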

## Query the application


We created a custom class, `VideoSearchApp`, that implements a `query` method specific to the text-video use case demonstrated here.


In [None]:
from embedding import VideoSearchApp

video_app = VideoSearchApp(app=app, clip_model_name="ViT-B/32")

It takes a `text` query and transforms it into an embedding with the CLIP model. Each video is then scored by its best frame, that is, the frame closest to the text in the joint embedding space. We can also select the number of videos to retrieve.


In [None]:
result = video_app.query(text="playing soccer", number_videos=4)
result
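
The per-video scoring described above (a video is scored by its best-matching frame) can be sketched in plain Python. `top_videos`, the `(video_file_name, score)` input pairs, and the `"score"` key are assumptions for illustration; the actual `VideoSearchApp.query` implementation may differ.

```python
def top_videos(frame_hits, number_videos):
    # frame_hits: iterable of (video_file_name, score) pairs, one per matched frame.
    best = {}
    for name, score in frame_hits:
        # Keep only the highest frame score seen for each video.
        if name not in best or score > best[name]:
            best[name] = score
    # Rank videos by their best frame score, highest first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [
        {"video_file_name": name, "score": score}
        for name, score in ranked[:number_videos]
    ]


frame_hits = [("a.mp4", 0.2), ("b.mp4", 0.5), ("a.mp4", 0.7), ("c.mp4", 0.1)]
print(top_videos(frame_hits, number_videos=2))
```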

In [None]:
from IPython.display import Video, display

for hit in result:
    # Use the video_dir defined earlier in this notebook
    display(
        Video(os.path.join(video_dir, hit["video_file_name"]), embed=True)
    )