# How to Build a Video Copy Detection Engine

This notebook illustrates how to build a video copy detection engine from scratch using [Milvus](https://milvus.io/) and [Towhee](https://towhee.io/).


**What is Video Copy Detection?**

Video Copy Detection, also known as Video Identification by Fingerprinting, is the task of retrieving videos that are identical or similar to a given query video.

Due to the popularity of Internet-based video sharing services, the volume of video content on the Internet has reached unprecedented scales. A video copy detection system is important in applications like video classification, tracking, filtering and recommendation, not to mention the field of copyright protection.  

However, content-based video retrieval is particularly hard in practice: one needs to calculate the similarity between the given video and every video in a database in order to retrieve similar ones and rank them by relevance. Therefore, we introduce Milvus and Towhee to help build a video deduplication system in just a few lines of code.
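
To make the cost of exhaustive comparison concrete, here is a minimal sketch of brute-force nearest-neighbor search over toy embeddings. The helper names (`l2_distance`, `brute_force_search`) and the three-dimensional vectors are illustrative assumptions and are not part of Milvus or Towhee; a vector database replaces this linear scan with an index.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, database, top_k=2):
    """Compare the query against every stored vector and return the top_k closest ids."""
    scored = [(l2_distance(query, vec), vid) for vid, vec in database.items()]
    scored.sort()  # smallest distance first
    return [vid for _, vid in scored[:top_k]]

# Toy 3-dimensional "embeddings" (hypothetical values for illustration only).
database = {
    'video_a': [0.9, 0.1, 0.0],
    'video_b': [0.0, 1.0, 0.0],
    'video_c': [1.0, 0.0, 0.1],
}
query = [1.0, 0.0, 0.0]
nearest = brute_force_search(query, database, top_k=2)
print(nearest)
```

Every query touches every stored vector, so the cost grows linearly with the size of the database. This is the motivation for using an indexed vector store.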

**What are Milvus & Towhee?**

- Milvus is an open-source vector database built for AI applications; it supports nearest-neighbor embedding search across tens of millions of entries.
- Towhee is a framework that provides ETL for unstructured data using state-of-the-art (SoTA) machine learning models.

We'll go through the video retrieval procedure and evaluate its performance. The core functionality takes only a few lines of code, so you can use it as a starting point for your own video copy detection engine.

## Preparation

### Install packages

Make sure you have installed the required Python packages:

| package |
| -- |
| towhee |
| pillow |
| ipython |
| pandas |

In [None]:
! python -m pip install -q towhee pillow ipython pandas

### Prepare the data

First, we need to prepare the dataset and Milvus environment.   

[VCDB: A Large-Scale Database for Partial Copy Detection in Videos](https://fvl.fudan.edu.cn/dataset/vcdb/list.htm) is a popular dataset for the video deduplication task. It contains over 100,000 Web videos and more than 9,000 copied segment pairs found through careful manual annotation.

VCDB consists of two parts: the core dataset and the background dataset. The core dataset (528 videos, approximately 27 hours) was collected using 28 carefully selected queries from YouTube and MetaCafe. After extensive manual annotation, 9,236 pairs of partial copies were found. Major transformations between the copies include "insertion of patterns", "camcording", "scale change", "picture in picture", etc. 

In this tutorial, we use a subset of the VCDB core dataset that contains 20 events, each with about 5 videos of the same or similar content. The download takes about 1.3 GB of disk space.

Let's take a quick look.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/VCDB_core_sample.zip -O
! unzip -q -o VCDB_core_sample.zip



In [3]:
import random
from pathlib import Path
import torch
import pandas as pd
random.seed(6)

root_dir = './VCDB_core_sample'


min_sample_num = 5
sample_folder_num = 20

all_video_path_lists = []
all_video_path_list = []

df = pd.DataFrame(columns=('path','event','id'))
query_df = pd.DataFrame(columns=('path','event','id'))

video_idx = 0
for i, mid_dir_path in enumerate(Path(root_dir).iterdir()):
    if i >= sample_folder_num:
        break
    if mid_dir_path.is_dir():
        path_videos = list(Path(mid_dir_path).iterdir())
        if len(path_videos) < min_sample_num:
            print('len(path_videos) < min_sample_num, continue.')
            continue
        sample_video_path_list = random.sample(path_videos, min_sample_num)
        all_video_path_lists.append(sample_video_path_list)
        all_video_path_list += [str(path) for path in sample_video_path_list]
        for j, path in enumerate(sample_video_path_list):
            video_idx += 1
            if j == 0:
                query_df = pd.concat([query_df, pd.DataFrame({'path': [str(path)],'event':[path.parent.stem],'id': [str(video_idx)]})], ignore_index=True)
            df = pd.concat([df, pd.DataFrame({'path': [str(path)],'event':[path.parent.stem],'id': [str(video_idx)]})], ignore_index = True)

all_sample_video_dicts = []
for i, sample_video_path_list in enumerate(all_video_path_lists):
    anchor_video = sample_video_path_list[0]
    pos_video_path_list = sample_video_path_list[1:]
    neg_video_path_lists = all_video_path_lists[:i] + all_video_path_lists[i + 1:]
    neg_video_path_list = [neg_video_path_list[0] for neg_video_path_list in neg_video_path_lists]
    all_sample_video_dicts.append({
        'anchor_video': anchor_video,
        'pos_video_path_list': pos_video_path_list,
        'neg_video_path_list': neg_video_path_list
    })

id2event = df.set_index(['id'])['event'].to_dict()
id2path = df.set_index(['id'])['path'].to_dict()

df_csv_path = 'video_info.csv'
query_df_csv_path = 'query_video_info.csv'
df.to_csv(df_csv_path)
query_df.to_csv(query_df_csv_path)
df.head(10)

| | path | event | id |
| -- | -- | -- | -- |
| 0 | VCDB_core_sample/mr_and_mrs_smith_tango/e2adc7... | mr_and_mrs_smith_tango | 1 |
| 1 | VCDB_core_sample/mr_and_mrs_smith_tango/be5178... | mr_and_mrs_smith_tango | 2 |
| 2 | VCDB_core_sample/mr_and_mrs_smith_tango/9ac82e... | mr_and_mrs_smith_tango | 3 |
| 3 | VCDB_core_sample/mr_and_mrs_smith_tango/e8aa83... | mr_and_mrs_smith_tango | 4 |
| 4 | VCDB_core_sample/mr_and_mrs_smith_tango/bf5822... | mr_and_mrs_smith_tango | 5 |
| 5 | VCDB_core_sample/the_legend_of_1900_magic_walt... | the_legend_of_1900_magic_waltz | 6 |
| 6 | VCDB_core_sample/the_legend_of_1900_magic_walt... | the_legend_of_1900_magic_waltz | 7 |
| 7 | VCDB_core_sample/the_legend_of_1900_magic_walt... | the_legend_of_1900_magic_waltz | 8 |
| 8 | VCDB_core_sample/the_legend_of_1900_magic_walt... | the_legend_of_1900_magic_waltz | 9 |
| 9 | VCDB_core_sample/the_legend_of_1900_magic_walt... | the_legend_of_1900_magic_waltz | 10 |


Define some helper functions that convert videos to GIFs so that we can preview them in the notebook.

In [4]:
from IPython import display
from pathlib import Path
import towhee
from towhee import pipe, ops
from PIL import Image

def display_gif(video_path_list, text_list):
    html = ''
    for video_path, text in zip(video_path_list, text_list):
        html_line = '<img src="{}"> {} <br/><br/>'.format(video_path, text)
        html += html_line
    return display.HTML(html)

    
def convert_video2gif(video_path, output_gif_path, num_samples=16):
    p = (
        pipe.input('video_file')
        .flat_map('video_file', 'frame', ops.video_decode.ffmpeg(start_time=0.0, end_time=1000.0, sample_type='time_step_sample', args={'time_step': 5}))
        .output('frame')
    )
    frames = p(video_path).to_list()
    imgs = [Image.fromarray(frame[0]) for frame in frames]
    imgs[0].save(fp=output_gif_path, format='GIF', append_images=imgs[1:], save_all=True, loop=0)


def display_gifs_from_video(video_path_list, text_list, tmpdirname = './tmp_gifs'):
    Path(tmpdirname).mkdir(exist_ok=True)
    gif_path_list = []
    for video_path in video_path_list:
        video_name = str(Path(video_path).name).split('.')[0]
        gif_path = Path(tmpdirname) / (video_name + '.gif')
        convert_video2gif(video_path, gif_path)
        gif_path_list.append(gif_path)
    return display_gif(gif_path_list, text_list)

A positive video contains the same event as the anchor video, while a negative video comes from a different event.

In [6]:
random_video_pair = random.sample(all_sample_video_dicts, 1)[0]
neg_sample_num = min(1, sample_folder_num)
anchor_video = random_video_pair['anchor_video']
anchor_video_event = anchor_video.parent.stem
pos_video_list = random_video_pair['pos_video_path_list'][:1]
pos_video_list_events = [path.parent.stem for path in pos_video_list][:1]
neg_video_list = random_video_pair['neg_video_path_list'][:neg_sample_num]
neg_video_list_events = [path.parent.stem for path in neg_video_list]

show_video_list = [str(anchor_video)] + [str(path) for path in pos_video_list] + [str(path) for path in neg_video_list]
# print(show_video_list)
caption_list = ['anchor video: ' + anchor_video_event] + ['positive video ' + str(i + 1) for i in range(len(pos_video_list))] + ['negative video ' + str(i + 1) + ': ' + neg_video_list_events[i] for i in range(len(neg_video_list))]
print(caption_list)
tmpdirname = './tmp_gifs'
display_gifs_from_video(show_video_list, caption_list, tmpdirname=tmpdirname)

['anchor video: t-mac_13_points_in_35_seconds', 'positive video 1', 'negative video 1: mr_and_mrs_smith_tango']


### Create a Milvus Collection

Before getting started, please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

In [None]:
! python -m pip install -q pymilvus==2.2.11

Let's first create a `video_deduplication` collection that uses the [L2 distance metric](https://milvus.io/docs/metric.md#Euclidean-distance-L2) and an [IVF_FLAT index](https://milvus.io/docs/index.md#IVF_FLAT).

In [5]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='127.0.0.1', port='19530')

def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        FieldSchema(name='id', dtype=DataType.VARCHAR, description='ids', max_length=500, is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='video deduplication')
    collection = Collection(name=collection_name, schema=schema)

    # create IVF_FLAT index for collection.
    index_params = {
        'metric_type':'L2', #IP
        'index_type':"IVF_FLAT",
        'params':{"nlist":2048}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

In [6]:
collection = create_milvus_collection('video_deduplication', 1024)
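
The idea behind an inverted-file index such as IVF_FLAT can be illustrated with a toy sketch: vectors are grouped into buckets around centroids (`nlist` buckets in Milvus), and a query scans only the buckets whose centroids are closest to it. The function names below (`build_ivf`, `ivf_search`) are hypothetical, the centroids are hand-picked instead of learned by clustering, and this is not how Milvus implements the index internally.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        buckets[nearest].append(vid)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe=1, top_k=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in buckets[i]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:top_k]

# Hand-picked centroids stand in for the clustering step (an assumption for brevity).
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {'v1': [0.5, 0.2], 'v2': [1.0, 1.0], 'v3': [9.0, 9.5], 'v4': [10.5, 10.0]}
buckets = build_ivf(vectors, centroids)
result = ivf_search([9.2, 9.4], vectors, centroids, buckets, nprobe=1, top_k=2)
print(buckets)
print(result)
```

With `nprobe=1` the query compares against only two of the four stored vectors. Probing fewer buckets is faster, but it can miss neighbors that fall in buckets that were not scanned.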

## Video Copy Detection

In this section, we'll show how to build our Video Copy Detection engine using Milvus. The basic idea is to extract embeddings from videos using a deep neural network and store them in Milvus, then compute the embedding of each query video and compare it with those stored in Milvus.

We use [Towhee](https://towhee.io/), a machine learning framework that helps users build data processing pipelines. [Towhee](https://towhee.io/) also provides pre-defined operators that implement the insert and query operations in Milvus.


### Load Video Embeddings into Milvus

We first extract embeddings from the videos and insert them into Milvus for indexing. Towhee provides a [method-chaining style API](https://towhee.readthedocs.io/en/main/index.html) so that users can assemble a data processing pipeline from operators.


In [7]:
%%time
import os
from csv import reader

import towhee
from towhee.datacollection import DataCollection
from towhee import pipe, ops
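# Use a GPU when one is available; otherwise fall back to the CPU.
# To pin a specific GPU, set e.g. device = 'cuda:1'.
device = 'cuda' if torch.cuda.is_available() else 'cpu'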

emb_pipe = (
    pipe.input('path', 'event', 'id')
        .flat_map('path', 'frames', ops.video_decode.ffmpeg(start_time=0.0, end_time=60.0, sample_type='time_step_sample', args={'time_step': 1}))
        .window_all('frames', 'video', lambda x: x)
        .map('video', 'emb', ops.video_copy_detection.distill_and_select(model_name='cg_student', device = device))
        .map(('id', 'emb'), 'insert_res', ops.ann_insert.milvus_client(host='127.0.0.1', port='19530', collection_name='video_deduplication'))
        .output()
)

with open(df_csv_path, 'r') as f:
    csv_reader = reader(f)
    _ = next(csv_reader)
    for n, row in enumerate(csv_reader):
        res = emb_pipe(*row[1:])

CPU times: user 3min 7s, sys: 34.4 s, total: 3min 42s
Wall time: 2min 49s


Here is a detailed explanation of each operator in the pipeline:

- `ops.video_decode.ffmpeg(start_time=0.0, end_time=60.0, sample_type='time_step_sample', args={'time_step': 1})`: Subsample and decode the first 60 seconds of the video to get a list of frames, one frame per second.

- `window_all('frames', 'video', lambda x: x)`: Collect all decoded frames into a single list.

- `ops.video_copy_detection.distill_and_select(model_name='cg_student', device = device)`: Extract an embedding from the frames using the coarse-grained student model of DnS (Distill-and-Select).

- `ops.ann_insert.milvus_client(host='127.0.0.1', port='19530', collection_name='video_deduplication')`: Insert the video embedding into the Milvus collection `video_deduplication`.
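
The effect of the time-step sampling parameters can be sketched in plain Python. The function `sample_timestamps` below is a hypothetical illustration of which timestamps are kept for a given `start_time`, `end_time` and `time_step`; it assumes the end point is exclusive and does not reproduce how the Towhee decoder selects frames internally.

```python
def sample_timestamps(duration, start_time=0.0, end_time=60.0, time_step=1.0):
    """Timestamps (in seconds) at which a frame is kept: every time_step seconds
    from start_time up to, but not including, min(end_time, duration)."""
    stop = min(end_time, duration)
    timestamps = []
    n = 0
    while start_time + n * time_step < stop:
        timestamps.append(start_time + n * time_step)
        n += 1
    return timestamps

short_clip = sample_timestamps(duration=4.5)    # a 4.5-second clip
long_clip = sample_timestamps(duration=300.0)   # a 5-minute video
print(short_clip)
print(len(long_clip))
```

Under these assumptions, a video of any length contributes at most 60 frames, which bounds the embedding cost per video.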

In [8]:
print('Total number of inserted data is {}.'.format(collection.num_entities))

Total number of inserted data is 95.


## Evaluation

We have finished the core functionality of the Video Copy Detection engine. However, we don't know whether it achieves a reasonable performance. We need to evaluate the retrieval engine against the ground truth.

In this section, we'll evaluate the quality of our video retrieval using mAP@topk:   
For each query, the code below computes the precision at every rank i from 1 to k, that is, the fraction of the top-i results that belong to the same event as the query. `mAP@topk` is then the average of these values over all ranks and all queries. For example, a precision of 40% at rank 10 means that 4 of the top 10 retrieved videos are true positives.
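
As a worked example, the sketch below computes the precision at each rank for one hypothetical query, in the same way as the `average_precision` helper in the next cell. The function name `precision_at_each_rank` and the event labels are made up for illustration.

```python
def precision_at_each_rank(ground_truth, retrieved):
    """Return precision@i for i = 1..len(retrieved)."""
    hits = 0
    precisions = []
    for rank, event in enumerate(retrieved, start=1):
        hits += int(event == ground_truth)  # count results from the query's own event
        precisions.append(hits / rank)
    return precisions

# Hypothetical top-5 result for a query whose true event is 'tango'.
retrieved = ['tango', 'tango', 'waltz', 'tango', 'waltz']
precisions = precision_at_each_rank('tango', retrieved)
mean_precision = sum(precisions) / len(precisions)
print(precisions)
print(round(mean_precision, 4))
```

Here the precision is 1.0 at ranks 1 and 2, drops to 2/3 at rank 3, and ends at 0.6 at rank 5, giving a mean of about 0.80 for this query. A wrong result near the top of the list lowers the score more than one near the bottom.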

In [10]:
%%time
collection.load()

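def average_precision(x, y):
    """Return precision@i for i = 1..len(y); x is the true event, y the ranked predicted events."""
    hits = 0  # running count of correct results (avoids shadowing the built-in sum)
    ret = []
    for i, e in enumerate(y):
        hits += (e == x)
        ret.append(hits / (i + 1))
    return ret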

search_pipe = (
    pipe.input('path', 'event', 'id')
        .map('event', 'ground_truth_event', lambda x: x)
        .flat_map('path', 'frames', ops.video_decode.ffmpeg(start_time=0.0, end_time=60.0, sample_type='time_step_sample', args={'time_step': 1}))
        .window_all('frames', 'video', lambda x: x)
        .map('video', 'emb', ops.video_copy_detection.distill_and_select(model_name='cg_student', device = device))
        .map('emb', 'topk_raw_res', ops.ann_search.milvus_client(host='127.0.0.1', port='19530', collection_name='video_deduplication', limit=min_sample_num, metric_type='L2'))
        .map('topk_raw_res', 'topk_events', lambda res: [id2event[x[0]] for x in res])
        .map('topk_raw_res', 'topk_path', lambda res: [id2path[x[0]] for x in res])    
        .map(('ground_truth_event', 'topk_events'), 'AP', average_precision)    
        .output('path', 'id', 'topk_events', 'topk_path', 'ground_truth_event', 'AP')
)

APs = []
dc_list = []
with open(query_df_csv_path, 'r') as f:
    csv_reader = reader(f)
    _ = next(csv_reader)
    for n, row in enumerate(csv_reader):
        res = search_pipe(*row[1:])
        res_dict = res.get_dict()
        dc_list.append(res_dict)
        APs += res_dict['AP']
        path = res_dict['path']
        top_k_path = res_dict['topk_path']
        print(f'video {path}\'s top {min_sample_num} search result: {top_k_path}')


video VCDB_core_sample/scent_of_woman_tango/ac1543adad8ecc32d98d9a67d75fb0a39f67b685.flv's top 5 search result: ['VCDB_core_sample/scent_of_woman_tango/ac1543adad8ecc32d98d9a67d75fb0a39f67b685.flv', 'VCDB_core_sample/scent_of_woman_tango/f3c8c0c9b93e0a49d2508eee4aae618c1d69e082.flv', 'VCDB_core_sample/scent_of_woman_tango/c29e02d9e847bcde39ee667b454e4d67b22b105c.flv', 'VCDB_core_sample/scent_of_woman_tango/d1a19730dcc2b5ea5bec61756c452772aae031dd.flv', 'VCDB_core_sample/beautiful_mind_game_theory/6171d3d87ae377e497199554033bca96a263277b.mp4']
video VCDB_core_sample/kennedy_assassination_slow_motion/71ba3405d1e896d04c89754ff618b4198d2e65d5.flv's top 5 search result: ['VCDB_core_sample/kennedy_assassination_slow_motion/71ba3405d1e896d04c89754ff618b4198d2e65d5.flv', 'VCDB_core_sample/kennedy_assassination_slow_motion/6a2f5ceaa65dd1187ff57ff1c1fccee9da91985b.flv', 'VCDB_core_sample/kennedy_assassination_slow_motion/acdf1a72b8cf37ee2609cdc03ff70747f36e0476.flv', 'VCDB_core_sample/kennedy_as

In [11]:
print(f'The Mean Average Precision at {min_sample_num} is: {sum(APs) / len(APs)}')

The Mean Average Precision at 5 is: 0.8873684210526316


This is a high score on a small and relatively easy dataset: when each event has about k copies in the collection, almost all of them are retrieved in the top k, and almost all of the retrieved videos are true positives.

## Show query results

With the Milvus search results in hand, we can look at a few example queries and the videos retrieved for them.

In [12]:
# dc_list = dc.to_list()
sample_num = 3
sample_idxs = random.sample(range(len(dc_list)), sample_num)
def get_query_and_predict_videos(idx):
    query_video = id2path[dc_list[idx]['id']]
    print('query_video =', query_video)
    predict_topk_video_list = dc_list[idx]['topk_path'][1:]
    print('predict_topk_video_list =', predict_topk_video_list)
    return query_video, predict_topk_video_list
dsp_res_list = []
for idx in sample_idxs:
    query_video, predict_topk_video_list = get_query_and_predict_videos(idx)
    show_video_list = [query_video] + predict_topk_video_list
    caption_list = ['query video: ' + Path(query_video).parent.stem] + ['result{0} video'.format(i) for i in range(len(predict_topk_video_list))]
    dsp_res_list.append(display_gifs_from_video(show_video_list, caption_list, tmpdirname=tmpdirname))

query_video = VCDB_core_sample/obama_kicks_door/1f73466d86f0a92140bd2f89eb95e8c147b47935.flv
predict_topk_video_list = ['VCDB_core_sample/david_beckham_lights_the_olympic_torch/09b682c899b0727e9990d8e347cdce3df7c5550e.flv', 'VCDB_core_sample/troy_achilles_and_hector/0b3f9e88e5ab73e19dc4d1a32115ea3457867128.flv', 'VCDB_core_sample/scent_of_woman_tango/d1a19730dcc2b5ea5bec61756c452772aae031dd.flv', 'VCDB_core_sample/troy_achilles_and_hector/ee417a6b882853ffcd3f78b380b0205a9411f4d6.flv']
query_video = VCDB_core_sample/scent_of_woman_tango/ac1543adad8ecc32d98d9a67d75fb0a39f67b685.flv
predict_topk_video_list = ['VCDB_core_sample/scent_of_woman_tango/f3c8c0c9b93e0a49d2508eee4aae618c1d69e082.flv', 'VCDB_core_sample/scent_of_woman_tango/c29e02d9e847bcde39ee667b454e4d67b22b105c.flv', 'VCDB_core_sample/scent_of_woman_tango/d1a19730dcc2b5ea5bec61756c452772aae031dd.flv', 'VCDB_core_sample/beautiful_mind_game_theory/6171d3d87ae377e497199554033bca96a263277b.mp4']
query_video = VCDB_core_sample/beaut

In [13]:
dsp_res_list[0]

In [14]:
dsp_res_list[1]

In [15]:
dsp_res_list[2]

In [None]:
# import shutil
# shutil.rmtree(tmpdirname)