# Create a text-video search app with Vespa
> Create, deploy, feed and query the Vespa app using the Vespa Python API

## Install required packages

In [None]:
!pip install -r requirements.txt

## CLIP model

CLIP comes in several model variations:

In [2]:
import clip

clip.available_models()

['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16']

Each CLIP model might have a different embedding size. We need this information when creating the schema of the text-video search application.

In [3]:
embedding_info = {name: clip.load(name)[0].visual.output_dim for name in clip.available_models()}
embedding_info

{'RN50': 1024,
 'RN101': 512,
 'RN50x4': 640,
 'RN50x16': 768,
 'ViT-B/32': 512,
 'ViT-B/16': 512}

## Create and deploy a text-video search app

### Create the Vespa application package

The function `create_text_video_app` below uses [the Vespa Python API](https://pyvespa.readthedocs.io/en/latest/) to create an application package with fields that store the image embeddings extracted from the frames of the videos we want to search, one field per selected CLIP model. It also declares the types of the text embeddings that we are going to send along with the query when searching for videos, and creates one ranking profile for each (text, image) embedding model.

For this demonstration we are going to use only one CLIP model but we could very well index all the available models for comparison, just as we did for [the text-image sample app](https://github.com/vespa-engine/sample-apps/blob/master/text-image-search/src/python/compare-pre-trained-clip-for-text-image-search.ipynb).

In [4]:
from embedding import create_text_video_app

app_package = create_text_video_app({"ViT-B/32": 512})

We can inspect what the `schema` of the resulting application package looks like:

In [5]:
print(app_package.schema.schema_to_text)

schema video_search {
    document video_search {
        field video_file_name type string {
            indexing: summary | attribute
        }
        field vit_b_32_image type tensor<float>(x[512]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 500
                }
            }
        }
    }
    rank-profile vit_b_32_similarity inherits default {
        first-phase {
            expression: closeness(vit_b_32_image)
        }
    }
}
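
The schema above suggests a simple naming convention: the model name `ViT-B/32` becomes the prefix `vit_b_32`, and the embedding size becomes the tensor dimension. The snippet below is a minimal sketch of that mapping. The helper names `field_name_for` and `image_field_definition` are hypothetical and are not part of `embedding.py` or pyvespa; they only illustrate how the `embedding_info` dictionary relates to the field and rank-profile names printed above.

```python
import re


def field_name_for(model_name: str) -> str:
    """Turn a CLIP model name such as 'ViT-B/32' into a schema-safe prefix."""
    # Lowercase, then replace every run of non-alphanumeric characters with '_'.
    return re.sub(r"[^a-z0-9]+", "_", model_name.lower()).strip("_")


def image_field_definition(model_name: str, embedding_size: int) -> dict:
    """Describe the image field and rank profile for one CLIP model (illustrative only)."""
    prefix = field_name_for(model_name)
    return {
        "field": f"{prefix}_image",
        "type": f"tensor<float>(x[{embedding_size}])",
        "rank_profile": f"{prefix}_similarity",
    }


print(image_field_definition("ViT-B/32", 512))
```

Passing more entries of `embedding_info` to such a helper would yield one field and one rank profile per model, which matches the one-profile-per-model design described earlier.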


### Deploy to Vespa Cloud

Follow [this guide](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html) to learn how to set the environment variables below before deploying to Vespa Cloud.

In [None]:
import os

from vespa.deployment import VespaCloud

vespa_cloud = VespaCloud(
    tenant=os.environ["TENANT"],
    application=os.environ["APPLICATION"],
    key_location=os.environ["USER_KEY_PATH"],
    application_package=app_package,
)
app = vespa_cloud.deploy(
    instance="clip-video-search", disk_folder=os.environ["DISK_FOLDER"]
)

Alternatively, check [this guide](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-docker.html) to deploy locally in a Docker container.

## Feed data

### Download the data

We are going to use the UCF101 dataset to allow users to follow along from 
their laptop. We downloaded a [zipped file](http://storage.googleapis.com/thumos14_files/UCF101_videos.zip) 
containing 13320 trimmed videos, each including one action, 
and a [text file](http://crcv.ucf.edu/THUMOS14/Class%20Index.txt) containing the list of action 
classes and their numerical index.

After downloading and unzipping the data, set the `VIDEO_DIR` environment variable to the folder containing the video 
files.


### Convert .avi files to .mp4

There is better support for `.mp4` files, so we will convert the `.avi` files to `.mp4` using `ffmpeg`. The code below requires that your machine have `ffmpeg` installed.

In [None]:
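import os
import subprocess

def convert_from_avi_to_mp4(file_name):
    # Lowercase only the file name, not the directory, so the output path stays valid
    directory, name = os.path.split(file_name)
    outputfile = os.path.join(directory, os.path.splitext(name)[0].lower() + ".mp4")
    subprocess.call(['ffmpeg', '-i', file_name, outputfile])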

The loop below converts the files one at a time, which takes quite a while; it could be sped up by running several conversions in parallel:

In [None]:
import glob
import os

video_files = glob.glob(os.path.join(os.environ["VIDEO_DIR"], "*.avi"))
for file_name in video_files:
    convert_from_avi_to_mp4(file_name)
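
As a rough illustration of the speed-up mentioned above, the sketch below runs several `ffmpeg` processes at once with a thread pool. Threads are enough here because each worker only waits for an external process. The names `mp4_path_for`, `convert_one` and `convert_all`, and the choice of four workers, are assumptions made for this example and are not part of the sample app.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def mp4_path_for(avi_path):
    """Return the .mp4 path next to the .avi file, lowercasing only the file name."""
    directory, name = os.path.split(avi_path)
    stem, _ = os.path.splitext(name)
    return os.path.join(directory, stem.lower() + ".mp4")


def convert_one(avi_path):
    """Convert one file with ffmpeg and return (path, exit code)."""
    # -y overwrites an existing output so that reruns do not block on a prompt.
    result = subprocess.run(
        ["ffmpeg", "-y", "-i", avi_path, mp4_path_for(avi_path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return avi_path, result.returncode


def convert_all(avi_paths, convert=convert_one, max_workers=4):
    """Run `convert` over all paths with a small thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert, avi_paths))
```

A call such as `failed = [path for path, code in convert_all(video_files) if code != 0]` would then collect the files that `ffmpeg` could not convert.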

### Compute and send embeddings

The function below assumes you have downloaded the UCF101 dataset, converted it to `.mp4` and stored the resulting files in the `VIDEO_DIR` folder. It extracts frames from each video, computes image embeddings with a CLIP model and sends them to the Vespa app.

In [None]:
from embedding import compute_and_send_video_embeddings

compute_and_send_video_embeddings(
    app=app, 
    batch_size=32, 
    clip_model_names=["ViT-B/32"], 
    number_frames_per_video=4,
    video_dir=os.environ["VIDEO_DIR"]
)
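
The `number_frames_per_video` argument controls how many frames represent each video. How `embedding.py` picks those frames is not shown here, so the snippet below is only a sketch of one reasonable strategy: spreading the frames evenly between the first and the last frame. The function name `evenly_spaced_frame_indices` is hypothetical.

```python
def evenly_spaced_frame_indices(total_frames, number_frames):
    """Pick `number_frames` indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or number_frames <= 0:
        return []
    # Never ask for more frames than the video contains.
    number_frames = min(number_frames, total_frames)
    if number_frames == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    last = total_frames - 1
    return [i * last // (number_frames - 1) for i in range(number_frames)]


print(evenly_spaced_frame_indices(total_frames=100, number_frames=4))  # [0, 33, 66, 99]
```

Each selected frame would then be passed through the CLIP image encoder, giving four embeddings per video in the configuration used above.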

The function `compute_and_send_video_embeddings` is a more robust version of the following loop:

In [None]:
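from torch.utils.data import DataLoader
from embedding import VideoFeedDataset  # assumed to live next to the helpers imported above

clip_model_names = ["ViT-B/32"]
batch_size = 32

for model_name in clip_model_names: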
    video_dataset = VideoFeedDataset(       ## PyTorch Dataset that outputs pyvespa-compatible data 
        video_dir=os.environ["VIDEO_DIR"],   # Folder containing video files
        model_name=model_name,               # CLIP model name used to convert image into vector
        number_frames_per_video=4            # Number of image frames to use per video
    )
    dataloader = DataLoader(                ## PyTorch Dataloader to loop through the dataset
        video_dataset,                  
        batch_size=batch_size,
        collate_fn=lambda x: [item for sublist in x for item in sublist],  # turn list of list into flat list
    )
    for idx, batch in enumerate(dataloader):
        app.update_batch(batch=batch)
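
The `collate_fn` in the loop above flattens a list of lists. Assuming each dataset item is a list with one record per frame, the sketch below shows the effect on two small, made-up videos. The record contents are hypothetical placeholders, not the actual pyvespa payload.

```python
from itertools import chain


def flatten_collate(items):
    """Turn a list of per-video lists into one flat list of frame records."""
    return list(chain.from_iterable(items))


# Hypothetical records: two videos with two frames each.
video_a = [{"id": "a_0"}, {"id": "a_1"}]
video_b = [{"id": "b_0"}, {"id": "b_1"}]

batch = flatten_collate([video_a, video_b])
print([record["id"] for record in batch])  # ['a_0', 'a_1', 'b_0', 'b_1']
```

Under the same assumption, `batch_size=32` with `number_frames_per_video=4` would send up to 128 frame records per batch.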

## Query the application

We created a custom class `VideoSearchApp` that implements a `query` method specific to the text-video use case that we are demonstrating here.

In [6]:
from embedding import VideoSearchApp

video_app = VideoSearchApp(app=app, clip_model_name="ViT-B/32")

It takes a `text` query and transforms it into an embedding with the CLIP model. Each video is then scored by the frame that is closest to the text in the joint embedding space. We can also select the number of videos that we want to retrieve.
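
The scoring rule described above can be sketched in a few lines: group frame-level hits by video, keep the best score per video, and return the top videos. This is an illustration only, not the code inside `VideoSearchApp`; the function name `best_frame_per_video` and the sample hits are made up.

```python
def best_frame_per_video(frame_hits, number_videos):
    """Score each video by its best-matching frame and return the top videos."""
    best = {}
    for hit in frame_hits:
        name = hit["video_file_name"]
        score = hit["relevance"]
        # Keep only the highest relevance seen so far for this video.
        if name not in best or score > best[name]:
            best[name] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [
        {"video_file_name": name, "relevance": score}
        for name, score in ranked[:number_videos]
    ]


# Hypothetical frame-level hits: two frames of video "a", one frame of video "b".
frame_hits = [
    {"video_file_name": "a.mp4", "relevance": 0.41},
    {"video_file_name": "b.mp4", "relevance": 0.45},
    {"video_file_name": "a.mp4", "relevance": 0.47},
]
print(best_frame_per_video(frame_hits, number_videos=2))
```

Because several frames of the same video can appear among the nearest neighbors, a query for `number_videos` results generally needs to retrieve more than `number_videos` frames before grouping.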

In [7]:
result = video_app.query(text="playing soccer", number_videos=4)
result

[{'video_file_name': 'v_soccerjuggling_g03_c01.mp4',
  'relevance': 0.46539374113067183},
 {'video_file_name': 'v_soccerjuggling_g07_c03.mp4',
  'relevance': 0.4652734484154934},
 {'video_file_name': 'v_soccerjuggling_g21_c02.mp4',
  'relevance': 0.46525791091380464},
 {'video_file_name': 'v_soccerjuggling_g21_c04.mp4',
  'relevance': 0.4649292988294415}]
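
The relevance values are easier to read once converted back to distances. Vespa's `closeness` rank feature is defined as `1 / (1 + distance)`, so the first hit corresponds to a Euclidean distance of roughly 1.15. The sketch below inverts that formula. The second helper additionally assumes that both embeddings are normalized to unit length, which is not confirmed here, in which case the squared distance equals `2 - 2 * cosine`.

```python
import math


def distance_from_closeness(closeness):
    """Invert closeness = 1 / (1 + distance)."""
    return 1.0 / closeness - 1.0


def cosine_from_unit_distance(distance):
    """Cosine similarity of two unit-length vectors at the given Euclidean distance.

    Assumes unit-norm embeddings, where distance**2 == 2 - 2 * cosine.
    """
    return 1.0 - distance**2 / 2.0


distance = distance_from_closeness(0.46539374113067183)
print(round(distance, 2))
print(round(cosine_from_unit_distance(distance), 2))
```

The four hits above differ only in the fourth decimal place of their relevance, so their frames sit at almost the same distance from the query text.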

In [10]:
import os

from IPython.display import Video, display

for hit in result:
    display(Video(os.path.join(os.environ["VIDEO_DIR"], hit["video_file_name"]), embed=True))