# How to Build a Video Deduplication System

This notebook illustrates how to build a video deduplication engine from scratch using [Milvus](https://milvus.io/) and [Towhee](https://towhee.io/).


**What is Video Deduplication?**

Video Deduplication, also known as Video Copy Detection or Video Identification by Fingerprinting, means that, given a query video, you need to find or retrieve the videos that have the same content as the query video.  
Due to the popularity of Internet-based video sharing services, the volume of video content on the Web has reached unprecedented scales. Besides copyright protection, a video copy detection system is important in applications like video classification, tracking, filtering and recommendation.  
The problem is particularly hard in the case of content-based video retrieval, where, given a query video, one needs to calculate its similarity with all videos in a database to retrieve and rank the videos based on relevance. However, using Milvus and Towhee can help you build a Video Deduplication System easily.

**What are Milvus & Towhee?**

- Milvus is an open-source vector database built for AI applications that supports nearest neighbor embedding search across tens of millions of entries.
- Towhee is a framework that provides ETL for unstructured data using SoTA machine learning models.

We'll go through the video retrieval procedure and evaluate its performance. Moreover, we managed to make the core functionality as simple as a few lines of code, with which you can start hacking your own video deduplication engine.



## Preparation

### Install packages

Make sure you have installed the required Python packages:

| package |
| -- |
| pymilvus |
| towhee |
| pillow |
| ipython |
| pandas |

In [1]:
! python -m pip install -q pymilvus towhee pillow ipython pandas

### Prepare the data

First, we need to prepare the dataset and Milvus environment.   

[VCDB: A Large-Scale Database for Partial Copy Detection in Videos](https://fvl.fudan.edu.cn/dataset/vcdb/list.htm) is a popular dataset for the video deduplication task. It contains over 100,000 web videos and more than 9,000 copied segment pairs found through careful manual annotation.   
VCDB consists of two parts: the core dataset and the background dataset. The core dataset (528 videos, approximately 27 hours) was collected using 28 carefully selected queries from YouTube and MetaCafe. After extensive manual annotation, 9,236 pairs of partial copies were found. Major transformations between the copies include "insertion of patterns", "camcording", "scale change", "picture in picture", etc. 

In this tutorial, we use a subset of the VCDB core dataset that contains 20 events, each with about 5 videos of the same or similar content. It takes about 1.3 GB of disk space.

Let's download it and take a quick look.

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/VCDB_core_sample.zip -O
! unzip -q -o VCDB_core_sample.zip

In [3]:
import random
from pathlib import Path
import torch
import pandas as pd
random.seed(6)

root_dir = './VCDB_core_sample'


min_sample_num = 5
sample_folder_num = 20

all_video_path_lists = []
all_video_path_list = []

# Collect rows in plain lists and build the DataFrames once at the end;
# DataFrame.append was removed in pandas 2.0.
rows = []
query_rows = []

video_idx = 0
for i, mid_dir_path in enumerate(Path(root_dir).iterdir()):
    if i >= sample_folder_num:
        break
    if mid_dir_path.is_dir():
        path_videos = list(Path(mid_dir_path).iterdir())
        if len(path_videos) < min_sample_num:
            print('len(path_videos) < min_sample_num, continue.')
            continue
        sample_video_path_list = random.sample(path_videos, min_sample_num)
        all_video_path_lists.append(sample_video_path_list)
        all_video_path_list += [str(path) for path in sample_video_path_list]
        for j, path in enumerate(sample_video_path_list):
            video_idx += 1
            row = {'path': str(path), 'event': path.parent.stem, 'id': video_idx}
            # The first sampled video of each event doubles as the query video.
            if j == 0:
                query_rows.append(row)
            rows.append(row)

df = pd.DataFrame(rows, columns=['path', 'event', 'id'])
query_df = pd.DataFrame(query_rows, columns=['path', 'event', 'id'])

all_sample_video_dicts = []
for i, sample_video_path_list in enumerate(all_video_path_lists):
    anchor_video = sample_video_path_list[0]
    pos_video_path_list = sample_video_path_list[1:]
    neg_video_path_lists = all_video_path_lists[:i] + all_video_path_lists[i + 1:]
    neg_video_path_list = [neg_video_path_list[0] for neg_video_path_list in neg_video_path_lists]
    all_sample_video_dicts.append({
        'anchor_video': anchor_video,
        'pos_video_path_list': pos_video_path_list,
        'neg_video_path_list': neg_video_path_list
    })

id2event = df.set_index(['id'])['event'].to_dict()
id2path = df.set_index(['id'])['path'].to_dict()

df_csv_path = 'video_info.csv'
query_df_csv_path = 'query_video_info.csv'
df.to_csv(df_csv_path)
query_df.to_csv(query_df_csv_path)
df

Unnamed: 0,path,event,id
0,VCDB_core_sample/obama_kicks_door/14c81d68b80d...,obama_kicks_door,1
1,VCDB_core_sample/obama_kicks_door/f26a39de8e8e...,obama_kicks_door,2
2,VCDB_core_sample/obama_kicks_door/1f73466d86f0...,obama_kicks_door,3
3,VCDB_core_sample/obama_kicks_door/df0c9e9664cf...,obama_kicks_door,4
4,VCDB_core_sample/obama_kicks_door/4df943d49033...,obama_kicks_door,5
...,...,...,...
90,VCDB_core_sample/the_last_samurai_last_battle/...,the_last_samurai_last_battle,91
91,VCDB_core_sample/the_last_samurai_last_battle/...,the_last_samurai_last_battle,92
92,VCDB_core_sample/the_last_samurai_last_battle/...,the_last_samurai_last_battle,93
93,VCDB_core_sample/the_last_samurai_last_battle/...,the_last_samurai_last_battle,94


Define some helper functions to convert videos to GIFs so that we can take a look at them.   

In [4]:
from IPython import display
from pathlib import Path
import towhee
from PIL import Image

def display_gif(video_path_list, text_list):
    html = ''
    for video_path, text in zip(video_path_list, text_list):
        html_line = '<img src="{}"> {} <br/><br/>'.format(video_path, text)
        html += html_line
    return display.HTML(html)

    
def convert_video2gif(video_path, output_gif_path, num_samples=16):
    frames = (
        towhee.glob(video_path)
              .video_decode.ffmpeg(start_time=0.0, end_time=1000.0, sample_type='time_step_sample', args={'time_step': 5})
              .to_list()[0]
    )
    imgs = [Image.fromarray(frame) for frame in frames]
    imgs[0].save(fp=output_gif_path, format='GIF', append_images=imgs[1:], save_all=True, loop=0)


def display_gifs_from_video(video_path_list, text_list, tmpdirname = './tmp_gifs'):
    Path(tmpdirname).mkdir(exist_ok=True)
    gif_path_list = []
    for video_path in video_path_list:
        video_name = str(Path(video_path).name).split('.')[0]
        gif_path = Path(tmpdirname) / (video_name + '.gif')
        convert_video2gif(video_path, gif_path)
        gif_path_list.append(gif_path)
    return display_gif(gif_path_list, text_list)

A positive video contains the same event content as the anchor video, while a negative video shows a different event.

In [5]:
random_video_pair = random.sample(all_sample_video_dicts, 1)[0]
neg_sample_num = min(5, sample_folder_num)
anchor_video = random_video_pair['anchor_video']
anchor_video_event = anchor_video.parent.stem
pos_video_list = random_video_pair['pos_video_path_list']
pos_video_list_events = [path.parent.stem for path in pos_video_list]
neg_video_list = random_video_pair['neg_video_path_list'][:neg_sample_num]
neg_video_list_events = [path.parent.stem for path in neg_video_list]

show_video_list = [str(anchor_video)] + [str(path) for path in pos_video_list] + [str(path) for path in neg_video_list]
# print(show_video_list)
caption_list = ['anchor video: ' + anchor_video_event] + ['positive video ' + str(i + 1) for i in range(len(pos_video_list))] + ['negative video ' + str(i + 1) + ': ' + neg_video_list_events[i] for i in range(len(neg_video_list))]
print(caption_list)
tmpdirname = './tmp_gifs'
display_gifs_from_video(show_video_list, caption_list, tmpdirname=tmpdirname)

['anchor video: saving_private_ryan_omaha_beach', 'positive video 1', 'positive video 2', 'positive video 3', 'positive video 4', 'negative video 1: obama_kicks_door', 'negative video 2: the_legend_of_1900_magic_waltz', 'negative video 3: kennedy_assassination_slow_motion', 'negative video 4: scent_of_woman_tango', 'negative video 5: bolt_beijing_100m']


### Create a Milvus Collection

Before getting started, please make sure you have [installed Milvus](https://milvus.io/docs/v2.0.x/install_standalone-docker.md). Let's first create a `video_deduplication` collection that uses the [L2 distance metric](https://milvus.io/docs/v2.0.x/metric.md#Euclidean-distance-L2) and an [IVF_FLAT index](https://milvus.io/docs/v2.0.x/index.md#IVF_FLAT).

In [6]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='127.0.0.1', port='19530')

def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='video deduplication')
    collection = Collection(name=collection_name, schema=schema)

    # create IVF_FLAT index for collection.
    index_params = {
        'metric_type':'L2',  # use 'IP' here for inner-product similarity instead
        'index_type':"IVF_FLAT",
        'params':{"nlist":2048}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

In [7]:
collection = create_milvus_collection('video_deduplication', 1024)

## Video Copy Detection

In this section, we'll show how to build our Video Copy Detection engine using Milvus. The basic idea behind Video Copy Detection is to extract embeddings from videos using a deep neural network and store them in Milvus, then extract the embeddings of the query videos and compare them with those stored in Milvus.
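
As a rough illustration of this store-then-compare idea, the sketch below keeps toy embeddings in a plain dictionary and ranks them by L2 distance to a query. It is a brute-force stand-in for what the vector database does at scale; the `search` helper and the 3-dimensional vectors are assumptions made for the example, not part of Milvus or Towhee.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(store, query_vec, limit=5):
    """Return the `limit` closest (video_id, distance) pairs, nearest first."""
    scored = [(video_id, l2_distance(vec, query_vec)) for video_id, vec in store.items()]
    return sorted(scored, key=lambda item: item[1])[:limit]

# Toy 3-dimensional "embeddings"; the real ones in this notebook have 1024 dimensions.
store = {
    1: [0.9, 0.1, 0.0],
    2: [0.8, 0.2, 0.1],  # close to video 1: a likely copy
    3: [0.0, 0.1, 0.9],  # far from video 1: unrelated content
}

hits = search(store, query_vec=[0.9, 0.1, 0.0], limit=2)
print(hits)
```

Because the query vector equals the embedding of video 1, that video comes back first with distance 0, followed by its near-duplicate.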

We use [Towhee](https://towhee.io/), a machine learning framework that allows for creating data processing pipelines. [Towhee](https://towhee.io/) also provides predefined operators that implement insert and query operations in Milvus.


### Load Video Embeddings into Milvus

We first extract embeddings from the videos with the Coarse Grained Student model from [DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval](https://arxiv.org/abs/2106.13266) and insert the embeddings into Milvus for indexing. Towhee provides a [method-chaining style API](https://towhee.readthedocs.io/en/main/index.html) so that users can assemble a data processing pipeline with operators.   


In [8]:
%%time
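import os
import towhee

# 'cuda:2' is the GPU used for the recorded run; change it to a device that exists
# on your machine (for example 'cuda:0'), or use the 'cpu' line below.
device = 'cuda:2'
# device = 'cpu'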

dc = (
    towhee.read_csv(df_csv_path).unstream() \
        .runas_op['id', 'id'](func=lambda x: int(x)) \
        .video_decode.ffmpeg['path', 'frames'](start_time=0.0, end_time=60.0, sample_type='time_step_sample', args={'time_step': 1}) \
        .runas_op['frames', 'frames'](func=lambda x: [y for y in x]) \
        .distill_and_select['frames', 'vec'](model_name='cg_student', device=device) \
        .to_milvus['id', 'vec'](collection=collection, batch=30)
)

CPU times: user 26min 10s, sys: 14.1 s, total: 26min 24s
Wall time: 1min 55s


Here is a detailed explanation of each line of the code:

- `towhee.read_csv(df_csv_path)`: read tabular data from the CSV file;

- `.runas_op['id', 'id'](func=lambda x: int(x))`: for each row, convert the data type of the column `id` from `str` to `int`;

- `.video_decode.ffmpeg` and `runas_op`: sample one frame per second from the first 60 seconds of the video and collect the frames into a list of images, which is the input of the embedding model;

- `.distill_and_select['frames', 'vec'](model_name='cg_student', device=device)`: extract an embedding from the video using the Coarse Grained Student model in DnS;

- `.to_milvus['id', 'vec'](collection=collection, batch=30)`: insert the video embeddings into Milvus.
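
The time-step sampling in the decoding step can be pictured with the small sketch below. It only computes which timestamps would be sampled for the `start_time`, `end_time`, and `time_step` values used above; the function name is hypothetical, and Towhee's actual decoder may treat boundaries differently.

```python
def time_step_sample(duration, start_time=0.0, end_time=60.0, time_step=1.0):
    """Return the timestamps (in seconds) at which one frame would be picked."""
    # Sampling stops at the end of the video or at end_time, whichever comes first.
    stop = min(duration, end_time)
    timestamps = []
    n = 0
    while start_time + n * time_step < stop:
        timestamps.append(start_time + n * time_step)
        n += 1
    return timestamps

print(len(time_step_sample(duration=300.0)))  # 60: a long video is capped at 60 frames
print(len(time_step_sample(duration=20.0)))   # 20: a short video yields fewer frames
```

Under this scheme, every video contributes at most 60 frames to the embedding model, no matter how long it is.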


In [9]:
dc.show()

Unnamed: 0,path,event,id,frames,vec
0,VCDB_core_sample/obama_kicks_doo...,obama_kicks_door,1,"[[0, 0, 0, ...] shape=(270, 480, 3),[0, 0, 0, ...] shape=(270, 480, 3),[0, 0, 0, ...] shape=(270, 480, 3),[0, 0, 0, ...] shape=(270, 480, 3),...] len=60","[0.009174303, 0.0009705887, -0.007265164, ...] shape=(1024,)"
1,VCDB_core_sample/obama_kicks_doo...,obama_kicks_door,2,"[[0, 0, 0, ...] shape=(480, 640, 3),[1, 1, 1, ...] shape=(480, 640, 3),[1, 1, 1, ...] shape=(480, 640, 3),[1, 1, 1, ...] shape=(480, 640, 3),...] len=20","[-0.0049728933, -0.017937368, 0.016198786, ...] shape=(1024,)"
2,VCDB_core_sample/obama_kicks_doo...,obama_kicks_door,3,"[[32, 40, 60, ...] shape=(480, 640, 3),[30, 39, 64, ...] shape=(480, 640, 3),[34, 42, 70, ...] shape=(480, 640, 3),[34, 43, 66, ...] shape=(480, 640, 3),...] len=60","[0.0013966287, 0.022458233, 0.0027558554, ...] shape=(1024,)"
3,VCDB_core_sample/obama_kicks_doo...,obama_kicks_door,4,"[[0, 0, 2, ...] shape=(360, 640, 3),[0, 0, 2, ...] shape=(360, 640, 3),[0, 0, 0, ...] shape=(360, 640, 3),[0, 0, 0, ...] shape=(360, 640, 3),...] len=46","[0.010604611, 0.0021730594, 0.00046078453, ...] shape=(1024,)"
4,VCDB_core_sample/obama_kicks_doo...,obama_kicks_door,5,"[[0, 2, 0, ...] shape=(480, 640, 3),[0, 0, 0, ...] shape=(480, 640, 3),[0, 0, 0, ...] shape=(480, 640, 3),[0, 0, 1, ...] shape=(480, 640, 3),...] len=20","[-0.0022860356, 0.013693579, 0.009067149, ...] shape=(1024,)"


In [10]:
print('Total number of inserted data is {}.'.format(collection.num_entities))

Total number of inserted data is 95.


## Evaluation

We have finished the core functionality of the Video Copy Detection engine. However, we don't know whether it achieves a reasonable performance. We need to evaluate the retrieval engine against the ground truth.

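In this section, we'll evaluate the retrieval quality of our video deduplication engine using mAP@topk:   
`mAP@topk` (mean average precision over the top-k results) summarizes both how many of the top-k results are relevant and how highly they are ranked. Its building block is precision at k, the proportion of relevant items among the top-k results: if precision at 10 is 40%, then 4 of the 10 retrieved videos are true positives. Average precision averages this precision over the ranks at which relevant results appear, and mAP is the mean of the average precision over all queries.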
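
A minimal sketch of this metric in plain Python follows. It uses one common definition of average precision (the mean of precision at each rank that holds a relevant item); Towhee's built-in `mean_average_precision` metric may normalize differently, so treat the functions and the toy event labels below as illustrative assumptions.

```python
def average_precision(ground_truth, ranked):
    """AP for one query: mean of precision@i over the ranks i that hold a relevant item."""
    relevant = set(ground_truth)
    hits = 0
    precisions = []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(ground_truths, rankings):
    """Mean of the per-query average precision."""
    aps = [average_precision(gt, ranked) for gt, ranked in zip(ground_truths, rankings)]
    return sum(aps) / len(aps)

# Two toy queries with hypothetical event labels.
ground_truths = [['event_a'], ['event_b']]
rankings = [
    ['event_a', 'event_a', 'event_a'],  # all results relevant, AP = 1.0
    ['event_b', 'event_x', 'event_b'],  # one wrong result at rank 2, AP = (1 + 2/3) / 2
]
print(mean_average_precision(ground_truths, rankings))
```

The second query is penalized because a wrong event appears at rank 2, which pulls the overall score down to roughly 0.917.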

In [11]:
%%time
dc = (
    towhee.read_csv(query_df_csv_path).unstream() \
      .runas_op['event', 'ground_truth_event'](func=lambda x:[x]) \
      .video_decode.ffmpeg['path', 'frames'](start_time=0.0, end_time=60.0, sample_type='time_step_sample', args={'time_step': 1}) \
      .runas_op['frames', 'frames'](func=lambda x: [y for y in x]) \
      .distill_and_select['frames', 'vec'](model_name='cg_student', device=device) \
      .milvus_search['vec', 'topk_raw_res'](collection=collection, limit=min_sample_num) \
      .runas_op['topk_raw_res', 'topk_events'](func=lambda res: [id2event[x.id] for i, x in enumerate(res)]) \
      .runas_op['topk_raw_res', 'topk_path'](func=lambda res: [id2path[x.id] for i, x in enumerate(res)])
)

CPU times: user 4min 16s, sys: 3.15 s, total: 4min 19s
Wall time: 20.9 s


In [12]:
dc.select['id', 'ground_truth_event', 'topk_raw_res', 'topk_events', 'topk_path']().show()

id,ground_truth_event,topk_raw_res,topk_events,topk_path
1,[obama_kicks_door] len=1,"[{""id"": 1, ""score"": 0.0},{""id"": 2, ""score"": 0.48000627756118774},{""id"": 5, ""score"": 0.5547954440116882},{""id"": 4, ""score"": 0.7438859343528748},...] len=5","[obama_kicks_door,obama_kicks_door,obama_kicks_door,obama_kicks_door,...] len=5","[VCDB_core_sample/obama_kicks_doo...,VCDB_core_sample/obama_kicks_doo...,VCDB_core_sample/obama_kicks_doo...,VCDB_core_sample/obama_kicks_doo...,...] len=5"
6,[the_legend_of_1900_magic_waltz] len=1,"[{""id"": 6, ""score"": 0.0},{""id"": 8, ""score"": 0.7755095958709717},{""id"": 10, ""score"": 0.7755095958709717},{""id"": 9, ""score"": 0.8679494857788086},...] len=5","[the_legend_of_1900_magic_waltz,the_legend_of_1900_magic_waltz,the_legend_of_1900_magic_waltz,the_legend_of_1900_magic_waltz,...] len=5","[VCDB_core_sample/the_legend_of_1...,VCDB_core_sample/the_legend_of_1...,VCDB_core_sample/the_legend_of_1...,VCDB_core_sample/the_legend_of_1...,...] len=5"
11,[kennedy_assassination_slow_motio...] len=1,"[{""id"": 11, ""score"": 0.0},{""id"": 13, ""score"": 1.0068117380142212},{""id"": 15, ""score"": 1.1083202362060547},{""id"": 14, ""score"": 1.136603832244873},...] len=5","[kennedy_assassination_slow_motio...,kennedy_assassination_slow_motio...,kennedy_assassination_slow_motio...,kennedy_assassination_slow_motio...,...] len=5","[VCDB_core_sample/kennedy_assassi...,VCDB_core_sample/kennedy_assassi...,VCDB_core_sample/kennedy_assassi...,VCDB_core_sample/kennedy_assassi...,...] len=5"
16,[scent_of_woman_tango] len=1,"[{""id"": 16, ""score"": 0.0},{""id"": 63, ""score"": 0.9324159026145935},{""id"": 18, ""score"": 1.005260944366455},{""id"": 19, ""score"": 1.010080099105835},...] len=5","[scent_of_woman_tango,mr_and_mrs_smith_tango,scent_of_woman_tango,scent_of_woman_tango,...] len=5","[VCDB_core_sample/scent_of_woman_...,VCDB_core_sample/mr_and_mrs_smit...,VCDB_core_sample/scent_of_woman_...,VCDB_core_sample/scent_of_woman_...,...] len=5"
21,[bolt_beijing_100m] len=1,"[{""id"": 21, ""score"": 0.0},{""id"": 25, ""score"": 0.05202530696988106},{""id"": 24, ""score"": 0.5932506918907166},{""id"": 22, ""score"": 0.6336506605148315},...] len=5","[bolt_beijing_100m,bolt_beijing_100m,bolt_beijing_100m,bolt_beijing_100m,...] len=5","[VCDB_core_sample/bolt_beijing_10...,VCDB_core_sample/bolt_beijing_10...,VCDB_core_sample/bolt_beijing_10...,VCDB_core_sample/bolt_beijing_10...,...] len=5"


In [13]:
benchmark = (
    dc.with_metrics(['mean_average_precision',]) \
        .evaluate['ground_truth_event', 'topk_events'](name='map_at_k') \
        .report()
)

Unnamed: 0,mean_average_precision
map_at_k,0.973977


We achieved a high mAP on this small, easy dataset. This means that when each event has k duplicate videos in the collection, almost all of them are retrieved in the top-k results, and almost all of the retrieved results are true positives.
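
To turn ranked search results into an actual duplicate decision, one simple option is to threshold the distance score. The sketch below assumes hits shaped like the search output shown earlier (an `id` and a `score`, where a smaller score means more similar under the L2 metric). The dictionaries, the `find_duplicates` helper, and the threshold value are all hypothetical and would need tuning on real data.

```python
def find_duplicates(query_id, hits, max_distance):
    """Keep hits no farther than `max_distance`, skipping the query's own entry."""
    return [
        hit['id'] for hit in hits
        if hit['id'] != query_id and hit['score'] <= max_distance
    ]

# Hypothetical hits for query video 1, sorted by increasing distance.
hits = [
    {'id': 1, 'score': 0.0},    # the query video itself
    {'id': 2, 'score': 0.48},
    {'id': 5, 'score': 0.55},
    {'id': 63, 'score': 0.93},  # too far away to count as a copy here
]
print(find_duplicates(query_id=1, hits=hits, max_distance=0.8))  # [2, 5]
```

Filtering by `id` rather than dropping the first hit keeps the logic correct even if the query video is not in the collection.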

## Show query results

With the Milvus search results in hand, we can take a look at some query videos and their retrieved results.

In [14]:
dc_list = dc.to_list()
# random_idx = random.randint(0, len(dc_list) - 1)
sample_num = 3
sample_idxs = random.sample(range(len(dc_list)), sample_num)
def get_query_and_predict_videos(idx):
    query_video = id2path[int(dc_list[idx].id)]
    print('query_video =', query_video)
    predict_topk_video_list = dc_list[idx].topk_path[1:]
    print('predict_topk_video_list =', predict_topk_video_list)
    return query_video, predict_topk_video_list
dsp_res_list = []
for idx in sample_idxs:
    query_video, predict_topk_video_list = get_query_and_predict_videos(idx)
    show_video_list = [query_video] + predict_topk_video_list
    caption_list = ['query video: ' + Path(query_video).parent.stem] + ['result{0} video'.format(i) for i in range(len(predict_topk_video_list))]
    dsp_res_list.append(display_gifs_from_video(show_video_list, caption_list, tmpdirname=tmpdirname))

query_video = VCDB_core_sample/t-mac_13_points_in_35_seconds/5df28e18b3d8fbdc0f4cd07ef5aefcdc1b4f8d42.flv
predict_topk_video_list = ['VCDB_core_sample/t-mac_13_points_in_35_seconds/e4b443e64c27a3364d16db8e11e6e85f2d3fd7ed.flv', 'VCDB_core_sample/t-mac_13_points_in_35_seconds/b61905d41276ccf2af59d4985158f8b1ce1d4990.flv', 'VCDB_core_sample/t-mac_13_points_in_35_seconds/3d0a3002441f682c7124806eb9b92c677af2ee9e.flv', 'VCDB_core_sample/t-mac_13_points_in_35_seconds/2bdf8029b38735a992a56e32cfc81466eea81286.flv']
query_video = VCDB_core_sample/obama_kicks_door/14c81d68b80d04743a107d4de859cb4724ccc2c1.flv
predict_topk_video_list = ['VCDB_core_sample/obama_kicks_door/f26a39de8e8ec290703f4937977fc17322974748.flv', 'VCDB_core_sample/obama_kicks_door/4df943d4903333df61bb3854d47365edf3076b5b.flv', 'VCDB_core_sample/obama_kicks_door/df0c9e9664cfa6720c94e13eae35ddb7a9b5b927.flv', 'VCDB_core_sample/president_obama_takes_oath/e29e65d0e362b8e7d450d833227ea3c0f5f65f12.flv']
query_video = VCDB_core_sampl

In [15]:
dsp_res_list[0]

In [16]:
dsp_res_list[1]

In [17]:
dsp_res_list[2]

In [18]:
# import shutil
# shutil.rmtree(tmpdirname)