# Deep Dive Reverse Video Search

In the [previous tutorial](./1_reverse_video_search_engine.ipynb), we learned how to build a reverse video search engine. Now let's make the solution more practical for production.

## Preparation

Let's first recall the preparation steps:
1. Install packages
2. Prepare data
3. Start Milvus

### Install packages

Make sure you have installed the required Python packages:

| package |
| -- |
| towhee |
| towhee.models |
| pillow |
| ipython |
| fastapi |

In [1]:
! python -m pip install -q towhee towhee.models

### Prepare data

This tutorial uses a small dataset extracted from [Kinetics400](https://www.deepmind.com/open-source/kinetics). You can download the subset from [GitHub](https://github.com/towhee-io/examples/releases/download/data/reverse_video_search.zip).

The data is organized as follows:
- **train:** candidate videos, 20 classes, 10 videos per class (200 in total)
- **test:** query videos, same 20 classes as train data, 1 video per class (20 in total)
- **reverse_video_search.csv:** a csv file containing an ***id***, ***path***, and ***label*** for each video in train data

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/reverse_video_search.zip -O
! unzip -q -o reverse_video_search.zip

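To make it easier to retrieve videos and measure results in later steps, we build some lookup tables in advance:
- **id_video:** a dictionary mapping each video id to its path
- **label_ids:** a dictionary mapping each label to the ids of the train videos with that label; the `ground_truth` function defined later uses it to get ground-truth video ids for a query video by its path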

In [1]:
import pandas as pd

df = pd.read_csv('./reverse_video_search.csv')

id_video = df.set_index('id')['path'].to_dict()
label_ids = {}
for label in set(df['label']):
    label_ids[label] = list(df[df['label']==label].id)
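
To make the role of these lookup tables concrete, here is a minimal sketch that uses only the standard library and a few made-up rows. The ids, paths, and labels below are hypothetical and only mimic the layout of `reverse_video_search.csv`. It builds the same two dictionaries and resolves a query path to its ground-truth ids through the parent directory name, which is how the `ground_truth` helper later in this tutorial works.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows mimicking reverse_video_search.csv (not the real data).
toy_csv = io.StringIO(
    "id,path,label\n"
    "0,./train/class_a/a.mp4,class_a\n"
    "1,./train/class_a/b.mp4,class_a\n"
    "2,./train/class_b/c.mp4,class_b\n"
)

id_video = {}                  # video id -> video path
label_ids = defaultdict(list)  # label -> ids of all train videos with that label
for row in csv.DictReader(toy_csv):
    video_id = int(row["id"])
    id_video[video_id] = row["path"]
    label_ids[row["label"]].append(video_id)

def ground_truth(path):
    # The class label is the name of the directory that contains the video.
    label = path.split("/")[-2]
    return label_ids[label]

print(ground_truth("./test/class_a/query.mp4"))  # [0, 1]
print(id_video[2])                               # ./train/class_b/c.mp4
```

Because the label is taken from the directory name, this lookup only works when test videos are stored under a folder named after their class, as in the dataset layout described above.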

### Start Milvus

Before getting started with the engine, we also need to get ready with Milvus. Please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

In [None]:
! python -m pip install -q pymilvus==2.2.11

Here we prepare a function to work with a Milvus collection with the following parameters:
- [L2 distance metric](https://milvus.io/docs/metric.md#Euclidean-distance-L2)
- [IVF_FLAT index](https://milvus.io/docs/index.md#IVF_FLAT).

In [2]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='127.0.0.1', port='19530')

def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='deep dive reverse video search')
    collection = Collection(name=collection_name, schema=schema)

    # create IVF_FLAT index for collection.
    index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist": 400}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

### Build Engine

Now we are ready to build a reverse video search engine. Here we show an engine built with the [`TimeSformer model`](https://towhee.io/action-classification/timesformer) and measure its performance so that we can make a comparison later.

In [3]:
def read_csv(csv_file):
    import csv
    with open(csv_file, 'r', encoding='utf-8-sig') as f:
        data = csv.DictReader(f)
        for line in data:
            yield line['id'], line['path'], line['label']

def ground_truth(path):
    label = path.split('/')[-2]
    return label_ids[label]

def mean_hit_ratio(actual, predicted):
    # Fraction of ground-truth ids found in the results, averaged over all queries.
    ratios = []
    for act, pre in zip(actual, predicted):
        hit_num = len(set(act) & set(pre))
        ratios.append(hit_num / len(act))
    return sum(ratios) / len(ratios)

def mean_average_precision(actual, predicted):
    # Precision is recorded at every rank (hit or miss), averaged per query,
    # and then averaged over all queries.
    aps = []
    for act, pre in zip(actual, predicted):
        precisions = []
        hit = 0
        for idx, i in enumerate(pre):
            if i in act:
                hit += 1
            precisions.append(hit / (idx + 1))
        aps.append(sum(precisions) / len(precisions))
    
    return sum(aps) / len(aps)
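
To see what the two metrics measure, the sketch below runs them on a tiny hand-made example. The ids are hypothetical, and the two functions are copied from the cell above so the block runs on its own. Note that this `mean_average_precision` averages the precision at every rank, including ranks without a hit, which is the definition used throughout this tutorial.

```python
def mean_hit_ratio(actual, predicted):
    ratios = []
    for act, pre in zip(actual, predicted):
        hit_num = len(set(act) & set(pre))
        ratios.append(hit_num / len(act))
    return sum(ratios) / len(ratios)

def mean_average_precision(actual, predicted):
    aps = []
    for act, pre in zip(actual, predicted):
        precisions = []
        hit = 0
        for idx, i in enumerate(pre):
            if i in act:
                hit += 1
            precisions.append(hit / (idx + 1))
        aps.append(sum(precisions) / len(precisions))
    return sum(aps) / len(aps)

# Two hypothetical queries: ground-truth ids and ranked search results.
actual = [[1, 2, 3, 4], [5, 6]]
predicted = [[1, 9, 2, 8], [7, 5]]

# Query 1 finds 2 of 4 relevant ids, query 2 finds 1 of 2 -> (0.5 + 0.5) / 2
print(mean_hit_ratio(actual, predicted))          # 0.5

# Query 1 precisions: 1/1, 1/2, 2/3, 2/4 -> mean 2/3
# Query 2 precisions: 0/1, 1/2           -> mean 1/4
print(mean_average_precision(actual, predicted))  # (2/3 + 1/4) / 2, about 0.458
```

Both queries have the same hit ratio, but the first one ranks its hits earlier and therefore gets the higher average precision.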

In [6]:
import glob
from towhee import pipe, ops
from towhee.datacollection import DataCollection

collection = create_milvus_collection('timesformer', 768)

insert_pipe = (
    pipe.input('csv_path')
        .flat_map('csv_path', ('id', 'path', 'label'), read_csv)
        .map('id', 'id', lambda x: int(x))
        .map('path', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 8}))
        .map('frames', ('labels', 'scores', 'features'), ops.action_classification.timesformer(skip_preprocess=True))
        .map('features', 'features', ops.towhee.np_normalize())
        .map(('id', 'features'), 'insert_res', ops.ann_insert.milvus_client(host='127.0.0.1', port='19530', collection_name='timesformer'))
        .output()
)

insert_pipe('reverse_video_search.csv')

collection.load()
eval_pipe = (
    pipe.input('path')
        .flat_map('path', 'path', lambda x: glob.glob(x))
        .map('path', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 8}))
        .map('frames', ('labels', 'scores', 'features'), ops.action_classification.timesformer(skip_preprocess=True))
        .map('features', 'features', ops.towhee.np_normalize())
        .map('features', 'result', ops.ann_search.milvus_client(host='127.0.0.1', port='19530', collection_name='timesformer', limit=10))  
        .map('result', 'predict', lambda x: [i[0] for i in x])
        .map('path', 'ground_truth', ground_truth)
        .window_all(('ground_truth', 'predict'), 'mHR', mean_hit_ratio)
        .window_all(('ground_truth', 'predict'), 'mAP', mean_average_precision)
        .output('mHR', 'mAP')
)

res = DataCollection(eval_pipe('./test/*/*.mp4'))
res.show()

| mHR | mAP |
| -- | -- |
| 0.715 | 0.7723293650793651 |
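
The pipelines above normalize each embedding with `np_normalize` before inserting it into a collection that uses the L2 metric. For unit-length vectors, squared L2 distance and cosine similarity are tied by `||a - b||^2 = 2 - 2 * cos(a, b)`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The NumPy sketch below checks this on synthetic vectors. The helper `normalize` is a stand-in written for this example, assuming the Towhee operator performs plain L2 normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Stand-in for the normalization step: scale vectors to unit L2 norm.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Synthetic data with the same shape as the tutorial's embeddings.
query = normalize(rng.normal(size=768))
candidates = normalize(rng.normal(size=(200, 768)))

l2_sq = np.sum((candidates - query) ** 2, axis=1)
cosine = candidates @ query

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert np.allclose(l2_sq, 2 - 2 * cosine)

# Smallest L2 distance first vs. largest cosine similarity first.
top_by_l2 = np.argsort(l2_sq)[:10]
top_by_cosine = np.argsort(-cosine)[:10]
print(np.array_equal(top_by_l2, top_by_cosine))
```

This is why the L2 metric is a reasonable choice here even though the embeddings are compared by direction rather than magnitude.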


## Dimensionality Reduction

In production, memory consumption is always a major concern, which can be relieved by reducing the embedding dimension. Random projection is a dimensionality reduction method for a set of vectors in Euclidean space. Since this method is fast and requires no training, we'll try it and compare the performance with the original TimeSformer engine.

For reference, the engine built above works without dimensionality reduction: its embedding dimension is 768, and it reaches an mHR of 0.715 and an mAP of about 0.772.

To reduce the dimension, we can multiply each original embedding by a projection matrix of the proper size. We can just add an operator `.map('features', 'features', lambda x: np.dot(x, projection_matrix))` right after a video embedding is generated. Let's see how the engine performs with the embedding dimension reduced to 128.
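
As a sanity check on the idea, the NumPy sketch below projects synthetic 768-dimensional vectors down to 128 dimensions with a Gaussian matrix and compares pairwise distances before and after. The vectors are random stand-ins rather than real TimeSformer embeddings, and the `1 / sqrt(128)` scaling is an addition made here so that distances are directly comparable. A constant scale does not change rankings, and it cancels out once the projected vectors are normalized.

```python
import numpy as np

rng = np.random.default_rng(42)
src_dim, dst_dim = 768, 128

# Synthetic stand-ins for video embeddings (not real model output).
embeddings = rng.normal(size=(50, src_dim))

# Gaussian random projection; the 1/sqrt(dst_dim) factor (an assumption of
# this sketch) keeps expected squared distances unchanged.
projection_matrix = rng.normal(scale=1.0, size=(src_dim, dst_dim)) / np.sqrt(dst_dim)
reduced = embeddings @ projection_matrix

def pairwise_distances(x):
    # Euclidean distance between every pair of rows.
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Compare each distinct pair once (upper triangle, diagonal excluded).
upper = np.triu_indices(len(embeddings), k=1)
ratio = pairwise_distances(reduced)[upper] / pairwise_distances(embeddings)[upper]

print(reduced.shape)              # (50, 128)
print(ratio.mean(), ratio.std())  # mean close to 1, small spread
```

Distances are preserved only approximately, so some neighbors can swap places after projection. That is the source of the accuracy loss to expect when the same projection is applied in the pipeline.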

In [9]:
import numpy as np

projection_matrix = np.random.normal(scale=1.0, size=(768, 128))

collection = create_milvus_collection('timesformer_128', 128)

insert_pipe = (
    pipe.input('csv_path')
        .flat_map('csv_path', ('id', 'path', 'label'), read_csv)
        .map('id', 'id', lambda x: int(x))
        .map('path', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 8}))
        .map('frames', ('labels', 'scores', 'features'), ops.action_classification.timesformer(skip_preprocess=True))
        .map('features', 'features', lambda x: np.dot(x, projection_matrix))
        .map('features', 'features', ops.towhee.np_normalize())
        .map(('id', 'features'), 'insert_res', ops.ann_insert.milvus_client(host='127.0.0.1', port='19530', collection_name='timesformer_128'))
        .output()
)

insert_pipe('reverse_video_search.csv')

collection.load()
eval_pipe = (
    pipe.input('path')
        .flat_map('path', 'path', lambda x: glob.glob(x))
        .map('path', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 8}))
        .map('frames', ('labels', 'scores', 'features'), ops.action_classification.timesformer(skip_preprocess=True))
        .map('features', 'features', lambda x: np.dot(x, projection_matrix))
        .map('features', 'features', ops.towhee.np_normalize())
        .map('features', 'result', ops.ann_search.milvus_client(host='127.0.0.1', port='19530', collection_name='timesformer_128', limit=10))  
        .map('result', 'predict', lambda x: [i[0] for i in x])
        .map('path', 'ground_truth', ground_truth)
        .window_all(('ground_truth', 'predict'), 'mHR', mean_hit_ratio)
        .window_all(('ground_truth', 'predict'), 'mAP', mean_average_precision)
        .output('mHR', 'mAP')
)

res = DataCollection(eval_pipe('./test/*/*.mp4'))
res.show()

| mHR | mAP |
| -- | -- |
| 0.61 | 0.6778511904761905 |


It's surprising that the performance is not affected much: both mHR and mAP decrease by about 0.1, while the embedding size is reduced by a factor of 6 (dimension from 768 to 128).
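
The size reduction is easy to quantify for the raw vectors alone. The sketch below assumes each dimension is stored as a 4-byte float, as in a `FLOAT_VECTOR` field, and ignores index and metadata overhead, so the numbers are a lower bound rather than a measurement of actual Milvus memory usage. The one-million-video corpus is a hypothetical scale added for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def raw_vector_bytes(num_vectors, dim):
    # Size of the embeddings only; index structures add more on top.
    return num_vectors * dim * BYTES_PER_FLOAT32

# Tutorial subset (200 videos) vs. a hypothetical larger corpus.
for num_vectors in (200, 1_000_000):
    full = raw_vector_bytes(num_vectors, 768)
    reduced = raw_vector_bytes(num_vectors, 128)
    print(f"{num_vectors:>9} vectors: {full / 2**20:.2f} MiB -> "
          f"{reduced / 2**20:.2f} MiB ({full // reduced}x smaller)")
```

For the 200 videos used here the saving is negligible in absolute terms, but the same factor of 6 applies at any scale.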