# How to Build a Video Segment Copy Detection System

In the previous example, we demonstrated how to use Milvus and Towhee to build a simple video copy detection system at the video level. In this tutorial, we take video copy detection down to the segment level.

**What's the difference between video-level deduplication and segment-level deduplication?**


- Video-level deduplication targets cases where videos are duplicated almost in full. It finds duplicate videos by comparing the embeddings of whole videos. Since only one embedding is extracted per video, this method is fast. Its limitation is that it handles similar videos of different lengths poorly. For example, if the first quarter of video A and of video B are exactly the same, their video-level embeddings may still not be similar, so the copied content goes undetected.

- Segment-level deduplication detects the specific start and end times of repeated segments. It can handle complex clipping and insertion of video segments, as well as videos of unequal length, because it compares the similarity between video frames. This is the method needed for real-world duplicate checking over large video collections, although it is slower than the video-level method.
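
To make the difference concrete, here is a minimal sketch on synthetic data. It assumes that random unit vectors stand in for frame embeddings and that a video-level embedding is simply the mean of the frame embeddings; both are illustrative assumptions, not how any particular model works.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(n, dim=256):
    """Return n random unit vectors standing in for frame embeddings."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

shared = random_unit_vectors(10)                      # segment copied into both videos
video_a = np.vstack([shared, random_unit_vectors(30)])
video_b = np.vstack([shared, random_unit_vectors(30)])

# Video level: one vector per video (assumed here to be the mean of its frames).
video_level_sim = cosine(video_a.mean(axis=0), video_b.mean(axis=0))

# Segment level: compare every frame of A with every frame of B.
frame_sim = video_a @ video_b.T                       # shape (40, 40)
copied_diag = np.diag(frame_sim)[:10]                 # aligned copied frames

print(f"video-level similarity: {video_level_sim:.2f}")
print(f"lowest similarity among copied frames: {copied_diag.min():.2f}")
```

The copied frames match almost perfectly at frame level, while the single video-level score is diluted by the 30 unrelated frames in each video.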

**What are Milvus & Towhee?**

- [Milvus](https://milvus.io/) is the most advanced open-source vector database built for AI applications and supports nearest neighbor embedding search across tens of millions of entries.
- [Towhee](https://towhee.io/) is a framework that provides ETL for unstructured data using SoTA machine learning models.

In this tutorial, we will demonstrate segment-level video duplication detection using Towhee and Milvus. The core functionality takes only a few lines of code, which you can use as a starting point for your own video deduplication engine.


## Preparation

### Install packages

Make sure you have installed the required Python packages:

| package |
| -- |
| pymilvus |
| towhee |
| pillow |
| ipython |
| numpy |
| plyvel / happybase |


In [1]:
! python -m pip install -q pymilvus towhee pillow ipython numpy plyvel happybase

### Prepare the data

First, we need to prepare the dataset and Milvus environment.   

The VCDB core dataset consists almost entirely of full-length video copies, so it is not well suited to evaluating segment-level copy detection. In this tutorial, we use the [VCSL dataset](https://arxiv.org/abs/2203.02654).

VCSL is a large-scale, real-world dataset for video copy detection collected from YouTube and Bilibili. Compared with VCDB, the transformations applied to its video frames are more complex, including cropping, filters, text overlays, background changes, camcording, picture-in-picture, and even recent deepfakes. There is a wide range of content transformations among the more than 280k segment copies in VCSL, and these realistic, skillful transformations make segment-level copy detection challenging.

In this tutorial, we use only a mini set of VCSL. It contains 5 events with three videos in each event, which are copies of each other. There is also a broken video in folder `crashed_video` for robustness testing.

In [12]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/VCSL-demo.zip -O
! unzip -q -o VCSL-demo.zip



The demo dataset has the following directory structure:
```
./VCSL-demo/
├── baisuishan
│   ├── 03584e404c0847fcbe5f9c486e8f8fc7-【宇哥】原来百岁山的广告是这个意思！-1UE411V74D.flv
│   ├── 217a12c936414660a53b55b22e2aea59-20200607百岁山-1kf4y127GB.flv
│   └── 41c4eaced0d24ebba50d180026531025-廣東有線翡翠台「瞬間看地球」+插播「百歲山」廣告（錄影時間：2019年12月23日 上午8時55分）-1UJ411s7we.flv
├── crashed_video
│   └── Clip46_KingKong_MP4_Sorenson_750K_720x404_30fps_16x9_MPAA.mp4
├── donghua
│   ├── 043bee7a71f347f18e8576bb1a01c86b-Langrisser M 'Dream Simulation' OPレオンが戻ってきた-y3qAPvWnL18.mkv
│   ├── 1134884a6c544a00b3bce99457a2a98d-LANGRISSER Mobile OP-EA4PmSWGr8o.mp4
│   └── 1b4e048011714eab928650f64e6370a6-123-JqSBkvPsYBw.mkv
├── huangpi
│   ├── 01414786cbe74d2d95e2b71df26f39a3-【外星金轮】金轮....已经无所谓了  (反正马老师看不到)-1sh411B7g1.flv
│   ├── 0fba4debd48245699971c4c18608cde3-黄皮外星人 低成本大制作-1vP4y1s79h.flv
│   └── 9ec6f467472a428e91fd45bb688e3be1-酒醉的广场黄皮外星人-1uU4y1E7n6.flv
├── liulangdiqiu
│   ├── d62ce5becff14a0c9c7dab5eea6647dc-《流浪地球》吴京成青少年偶像，新闻联播盛赞：中国人的浪漫科幻-1qf4y1G7gM.flv
│   ├── e5dc80abd7a24b47accde190c9fdbcdc-【新闻联播】流浪地球上央视一套7点档啦-1db411m7f8.flv
│   └── ef65e0f662e646a88a13b6eddb640e48-《流浪地球》上《新闻联播》排面鸭！CCTV央视给力！-1xb411U7uE.flv
└── madongmei
    ├── 0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv
    ├── 8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv
    └── ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv
```

Define some helper functions that convert videos to GIFs so that we can take a look at them.

In [2]:
from IPython import display
from pathlib import Path
import towhee
from PIL import Image as PILImage
import os

def display_gif(video_path_list, text_list):
    html = ''
    for video_path, text in zip(video_path_list, text_list):
        html_line = '<img src="{}"> {} <br/><br/>'.format(video_path, text)
        html += html_line
    return display.HTML(html)

    
def convert_video2gif(video_path, output_gif_path, start_time=0.0, end_time=1000.0, num_samples=16):
    frames = (
        towhee.glob(video_path)
              .video_decode.ffmpeg(start_time=start_time, end_time=end_time, sample_type='time_step_sample', args={'time_step': 3})
              .to_list()[0]
    )
    imgs = [PILImage.fromarray(frame) for frame in frames]
    imgs = [img.resize((int(img.width/6), int(img.height/6)), PILImage.NEAREST) for img in imgs]
    imgs[0].save(fp=output_gif_path, format='GIF', append_images=imgs[1:], save_all=True, loop=0)


def display_gifs_from_video(video_path_list, text_list, start_time_list = None, end_time_list = None, tmpdirname = './tmp_gifs'):
    Path(tmpdirname).mkdir(exist_ok=True)
    gif_path_list = []
    for i, video_path in enumerate(video_path_list):
        video_name = str(Path(video_path).name).split('.')[0]
        gif_path = Path(tmpdirname) / (video_name + '.gif')
        if start_time_list is not None:
            convert_video2gif(video_path, gif_path, start_time=start_time_list[i], end_time=end_time_list[i])
        else:
            convert_video2gif(video_path, gif_path)
        gif_path_list.append(gif_path)
    return display_gif(gif_path_list, text_list)

In [3]:
import random
random.seed(9)
vcsl_demo_root = './VCSL-demo/'

event_list = os.listdir(vcsl_demo_root)
if 'crashed_video' in event_list:
    event_list.remove('crashed_video')

random_event = random.choice(event_list)
random_event_folder = os.path.join(vcsl_demo_root, random_event)
random_event_videos = [os.path.join(random_event_folder, video_file) for video_file in os.listdir(random_event_folder)]
tmpdirname = './tmp_gifs'
display_gifs_from_video(random_event_videos, random_event_videos, tmpdirname=tmpdirname)

In [4]:
import towhee

os.environ["CUDA_VISIBLE_DEVICES"] = '1'

def normalize(x):
    import numpy as np
    return x / np.linalg.norm(x, axis=0)

def merge_ndarray(x):
    import numpy as np
    return np.concatenate(x).reshape(-1, x[0].shape[0])

### Setup Milvus and create a Milvus Collection

The last thing to prepare is Milvus. For more options and detailed instructions, refer to the [Milvus doc](https://milvus.io/docs/v2.1.x). If you need more help with Milvus, feel free to submit tickets or join the discussion on [Milvus github](https://github.com/milvus-io/milvus).

In [None]:
# Download docker yaml for Milvus standalone
! wget https://github.com/milvus-io/milvus/releases/download/v2.1.1/milvus-standalone-docker-compose.yml -O docker-compose.yml
# Run command below under the same directory as the docker yaml
! docker-compose up -d

Let's first create a `video_deduplication` collection that uses the [inner product (IP) metric](https://milvus.io/docs/v2.1.x/metric.md) and an [IVF_FLAT index](https://milvus.io/docs/v2.1.x/index.md#IVF_FLAT).

In [5]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='127.0.0.1', port='19530')


def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    fields = [
        FieldSchema(name='id', dtype=DataType.INT64, description='the id of the embedding', is_primary=True, auto_id=True),
        FieldSchema(name='path', dtype=DataType.VARCHAR, description='the path of the embedding', max_length=500),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='video embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='video dedup')
    collection = Collection(name=collection_name, schema=schema)

    index_params = {'metric_type': 'IP', 'index_type': "IVF_FLAT", 'params': {"nlist": 1}}
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

collection = create_milvus_collection('video_deduplication', 256)
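
The index above uses the inner product, and the `normalize` helper defined earlier scales each frame embedding to unit length. The following sketch, on made-up three-dimensional vectors, illustrates why that combination is convenient: for unit vectors the inner product equals the cosine similarity.

```python
import numpy as np

def normalize(x):
    # Same idea as the helper defined earlier: scale a 1-D vector to unit length.
    return x / np.linalg.norm(x, axis=0)

# Two made-up vectors standing in for frame embeddings.
a = np.array([3.0, 4.0, 0.0])
b = np.array([6.0, 8.0, 1.0])

cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
inner_product_of_unit_vectors = float(normalize(a) @ normalize(b))

# After normalization the inner product is the cosine similarity,
# and the similarity of a vector with itself is 1.
assert np.isclose(cosine, inner_product_of_unit_vectors)
assert np.isclose(normalize(a) @ normalize(a), 1.0)
```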

## Video Copy Detection

In this section, we'll show how to build our Video Copy Detection engine using Milvus. The basic idea behind Video Copy Detection is to extract embeddings from videos using a deep neural network and store them in Milvus, then compute the embeddings of the query videos and compare them with those stored in Milvus.

We use [Towhee](https://towhee.io/), a machine learning framework that allows for creating data processing pipelines. [Towhee](https://towhee.io/) also provides predefined operators that implement insert and query operations in Milvus.


### Load Video Embeddings into Milvus

For every video, we decode it into image frames and then use a neural network to extract their embeddings. The frame embeddings are inserted into Milvus, and the stacked embeddings of each video are stored in LevelDB.
![](video_decopy_insert.png)
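
As a small illustration of the per-video matrix that ends up in the key-value store, the sketch below applies the `merge_ndarray` helper defined earlier to fake data. The 4-dimensional vectors are placeholders; real ISC embeddings have 256 dimensions.

```python
import numpy as np

def merge_ndarray(x):
    # Same helper as defined earlier: stack a list of 1-D embeddings into a 2-D array.
    return np.concatenate(x).reshape(-1, x[0].shape[0])

# Three fake frame embeddings, one per sampled frame.
frame_embeddings = [np.full(4, i, dtype=np.float32) for i in range(3)]
video_emb = merge_ndarray(frame_embeddings)

print(video_emb.shape)  # (3, 4): one row per sampled frame
```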


In [6]:
%%time
from towhee.dc2 import pipe, ops
import glob

emb_pipe = (
    pipe.input('url')
        .map('url', 'id', lambda x: x)
        .flat_map('url', 'frames', ops.video_decode.ffmpeg(sample_type='time_step_sample', args={'time_step': 1}))
        .map('frames', 'emb', ops.image_embedding.isc())
        .map('emb', 'emb', normalize)
        .map(('id', 'emb'), 'insert_res', ops.ann_insert.milvus_client(host='127.0.0.1', port='19530', collection_name='video_deduplication'))
        .window_all('emb', 'video_emb', merge_ndarray)
        # .map(('id', 'video_emb'), ('url_vec_status'), ops.kvstorage.insert_hbase('127.0.0.1', 9090, 'video_dedup'))
        .map(('url', 'video_emb'), ('url_vec_status'), ops.kvstorage.insert_leveldb('url_vec.db'))
        .output(tracer=True)
)

path = glob.glob('VCSL-demo/*/*')
for i in path:
    try:
        result = emb_pipe(i)
    except Exception:
        # Skip videos that cannot be processed (e.g. the corrupt sample video).
        pass

del emb_pipe

E0203 10:15:01.112718103 2399242 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
2023-02-03 10:16:30,832 - 139880157361920 - video_decoder.py-video_decoder:120 - ERROR: header damaged


CPU times: user 4min 7s, sys: 10.2 s, total: 4min 17s
Wall time: 2min 42s


If you see "ERROR: header damaged", it is because the sample dataset deliberately includes a corrupt video, which we skip. This simulates a situation that occurs in practice: a huge video collection may contain a small number of corrupt videos, and they should not affect the processing of the other, normal videos.
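
If you would rather keep track of which files failed than skip them silently, a loop along the following lines may help. This is a sketch that assumes a failing video raises an exception; `run_all` and `fake_pipeline` are hypothetical names, and `fake_pipeline` only stands in for the real embedding pipeline.

```python
def run_all(pipeline, paths):
    """Run `pipeline` on every path, collecting failures instead of hiding them."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = pipeline(path)
        except Exception as exc:  # a corrupt file should not stop the batch
            failures[path] = repr(exc)
    return results, failures


def fake_pipeline(path):
    # Stand-in for the real pipeline: fails for the "broken" file only.
    if 'crashed' in path:
        raise ValueError('header damaged')
    return 'ok'


results, failures = run_all(fake_pipeline, ['a.mp4', 'crashed_video/b.mp4', 'c.mp4'])
print(sorted(results))  # ['a.mp4', 'c.mp4']
print(failures)         # {'crashed_video/b.mp4': "ValueError('header damaged')"}
```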

### Query videos

In theory, each query video has to be matched against every video in the database, which causes huge overhead. In this tutorial, we solve this problem with a rough video selection step that filters out videos with low similarity.

First, for every query frame, we retrieve a certain number of similar frames from Milvus, each of which belongs to a specific video. The videos these frames belong to are then aggregated, sorted, and filtered. Then, the video embeddings of the remaining videos and the embedding of the query video are used to localize the copied segments. In this way, we filter out videos with low similarity, saving a lot of computation for the whole pipeline.
![](video_decopy_query.png)
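
The rough selection step can be pictured with the plain-Python sketch below. It is not the implementation of Towhee's `select_video` operator; it only illustrates the idea of summing frame-level scores per video and keeping the best few. The function name `select_candidates` and the sample hits are hypothetical.

```python
from collections import defaultdict

def select_candidates(retrieved_paths, scores, top_k=5):
    """Sum the frame-level scores per video and keep the top_k videos."""
    totals = defaultdict(float)
    for path, score in zip(retrieved_paths, scores):
        totals[path] += score
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [path for path, _ in ranked[:top_k]]

# Hypothetical frame-level hits: (video the matched frame belongs to, score).
hits = [('a.flv', 0.9), ('b.flv', 0.4), ('a.flv', 0.8), ('c.flv', 0.3), ('b.flv', 0.5)]
paths, scores = zip(*hits)
print(select_candidates(paths, scores, top_k=2))  # ['a.flv', 'b.flv']
```

Summing rewards videos that match many query frames, so a video sharing a long segment with the query tends to outrank one that matches only a single frame.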

In [7]:
%%time
from towhee.datacollection import DataCollection
collection.load()

search_pipe = (
    pipe.input('url')
        .map('url', 'id', lambda x: x)
        .flat_map('url', 'frames', ops.video_decode.ffmpeg(sample_type='time_step_sample', args={'time_step': 1}))
        .map('frames', 'emb', ops.image_embedding.isc())
        .map('emb', 'emb', normalize)
        .map('emb', 'res', ops.ann_search.milvus_client(host='127.0.0.1', port='19530', collection_name='video_deduplication', limit=64, output_fields=['path'], metric_type='IP'))
        .window_all('res', 'res', lambda x: [i for y in x for i in y])
        .map('res', ('retrieved_ids', 'score'), lambda x: ([i[2] for i in x], [i[1] for i in x]))
        .window_all('emb', 'video_emb', merge_ndarray)
        .flat_map(('retrieved_ids', 'score'), 'candidates', ops.video_copy_detection.select_video(top_k=5, reduce_function='sum', reverse=True))
        .map('candidates', 'retrieved_emb', ops.kvstorage.from_leveldb(path='url_vec.db', is_ndarray=True))
        # .map('candidates', 'retrieved_emb', ops.kvstorage.search_hbase('127.0.0.1', 9090, 'video_dedup', is_ndarray=True))
        .map(('video_emb', 'retrieved_emb'), ('similar_segment', 'segment_score'), ops.video_copy_detection.temporal_network(min_length=1))
        .output('id', 'candidates', 'similar_segment', 'segment_score', tracer=True)
)  

path = glob.glob('VCSL-demo/madongmei/*')
for i in path:
    try:
        result = search_pipe(i)
        DataCollection(result).show()
    except Exception:
        # Skip videos that cannot be processed.
        pass

del search_pipe

E0203 10:18:07.466673832 2399242 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E0203 10:18:07.638172901 2399242 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers


id,candidates,similar_segment,segment_score
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,"[[0, 0, 83, 83]] len=1",[1.012048176972263] len=1
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,"[[45, 144, 83, 167],[1, 141, 25, 162]] len=2","[0.4112515195471341,0.46153155432807075] len=2"
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,"[[14, 0, 82, 28],[1, 1, 43, 48]] len=2","[0.3636096343398094,0.3561842535318953] len=2"
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/baisuishan/41c4eaced0d24ebba50d180026531025-廣東有線翡翠台「瞬間看地球」+插播「百歲山」廣告（錄影時間：2019年12月23日 上午8時55分）-1UJ411s7we.flv,"[[37, 139, 61, 156],[14, 184, 36, 196],[52, 142, 70, 156]] len=3","[0.21124086438155756,0.20804748114417582,0.21043002232909203] len=3"
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/donghua/043bee7a71f347f18e8576bb1a01c86b-Langrisser M 'Dream Simulation' OPレオンが戻ってきた-y3qAPvWnL18.mkv,"[[17, 43, 37, 74]] len=1",[0.2017840591131472] len=1


id,candidates,similar_segment,segment_score
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,"[[0, 1, 19, 25],[0, 45, 42, 78]] len=2","[0.5262382307717967,0.2769587500890096] len=2"
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,"[[0, 141, 23, 164]] len=1",[0.6957248915796694] len=1
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,"[[0, 0, 52, 52]] len=1",[1.0192307531833649] len=1
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/baisuishan/41c4eaced0d24ebba50d180026531025-廣東有線翡翠台「瞬間看地球」+插播「百歲山」廣告（錄影時間：2019年12月23日 上午8時55分）-1UJ411s7we.flv,"[[20, 186, 51, 197],[0, 188, 19, 196]] len=2","[0.30379765374319895,0.3361414847550569] len=2"
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/liulangdiqiu/ef65e0f662e646a88a13b6eddb640e48-《流浪地球》上《新闻联播》排面鸭！CCTV央视给力！-1xb411U7uE.flv,"[[17, 25, 48, 61]] len=1",[0.23146820957980938] len=1


id,candidates,similar_segment,segment_score
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,"[[0, 0, 167, 167]] len=1",[1.0059880007526831] len=1
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,"[[10, 34, 84, 69],[141, 1, 162, 25]] len=2","[0.20891873005333297,0.4263895856009589] len=2"
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,"[[141, 0, 167, 25],[4, 4, 73, 48]] len=2","[0.6232124590406231,0.2644270837834451] len=2"
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/liulangdiqiu/ef65e0f662e646a88a13b6eddb640e48-《流浪地球》上《新闻联播》排面鸭！CCTV央视给力！-1xb411U7uE.flv,"[[105, 146, 139, 173],[33, 40, 69, 53],[0, 145, 23, 160],[147, 40, 162, 53]] len=4","[0.2060048267489574,0.2150471015852325,0.2527633309364319,0.3381644998277937] len=4"
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/baisuishan/41c4eaced0d24ebba50d180026531025-廣東有線翡翠台「瞬間看地球」+插播「百歲山」廣告（錄影時間：2019年12月23日 上午8時55分）-1UJ411s7we.flv,"[[39, 185, 67, 199],[107, 185, 125, 199]] len=2","[0.34858442772002446,0.38751399517059326] len=2"


CPU times: user 1min 53s, sys: 1min 23s, total: 3min 17s
Wall time: 50.6 s


For each frame of each query video, we retrieve the 64 most similar frames from Milvus. We aggregate and sort these results and select the top 5 candidate videos (`top_k=5`). For the query video and each of its 5 candidate videos, `temporal_network` is then run to obtain the detected duplicate segments.
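
The `temporal_network` operator is used as a black box in this tutorial. To give some intuition for what localizing a copied segment involves, the sketch below uses a much simpler stand-in technique: it scans the diagonals of the frame-to-frame similarity matrix for the longest run of consecutive matches. This is not the temporal network algorithm; the function name, the threshold of 0.8, and the synthetic embeddings are all assumptions made for illustration.

```python
import numpy as np

def best_diagonal_segment(query_emb, ref_emb, threshold=0.8):
    """Return [q_start, r_start, q_end, r_end] for the longest run of
    consecutive frame pairs whose similarity exceeds `threshold`, or None."""
    sim = query_emb @ ref_emb.T
    m, n = sim.shape
    best, best_len = None, 0
    for offset in range(-(m - 1), n):            # offset = ref index - query index
        diag = np.diagonal(sim, offset=offset)
        q0, r0 = max(0, -offset), max(0, offset)  # first cell on this diagonal
        run_start = None
        # A trailing False closes any run that reaches the end of the diagonal.
        for i, hit in enumerate(np.append(diag > threshold, False)):
            if hit and run_start is None:
                run_start = i
            elif not hit and run_start is not None:
                if i - run_start > best_len:
                    best_len = i - run_start
                    best = [q0 + run_start, r0 + run_start, q0 + i - 1, r0 + i - 1]
                run_start = None
    return best

rng = np.random.default_rng(0)

def unit(n, dim=128):
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

ref = unit(20)
# The query contains reference frames 8..13 at its own positions 5..10.
query = np.vstack([unit(5), ref[8:14], unit(3)])
print(best_diagonal_segment(query, ref))  # [5, 8, 10, 13]
```

A diagonal scan like this only finds copies played at the same speed and without gaps, which is one reason a dedicated alignment method is used in the actual pipeline.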

Note that our query uses the same dataset, in which there are 5 events, each with 3 videos that are copies of each other. Using this dataset to query itself means that, for each video, the correct result is the three videos of its own event.

The `similar_segment` column in the output is the list of detected segments, each in the format `[query_start_second, ref_start_second, query_end_second, ref_end_second]`. The `segment_score` column holds the corresponding similarity score of each segment. Because `top_k=5`, each query returns five candidates. For each query, the three videos of its own event have the highest best-segment scores, while the two candidates from other events score lower, which is consistent with the ground truth.
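
For downstream processing, it can be handy to turn these lists into named records and drop low-scoring segments. The sketch below does this for one row of the output above; the `Segment` class, the `confident_segments` helper, and the cut-off of 0.5 are hypothetical choices, not part of Towhee.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    query_start: int
    ref_start: int
    query_end: int
    ref_end: int
    score: float

    @property
    def query_duration(self):
        return self.query_end - self.query_start

def confident_segments(similar_segment, segment_score, min_score=0.5):
    """Pair each segment with its score and keep those at or above min_score."""
    segments = [Segment(*seg, score) for seg, score in zip(similar_segment, segment_score)]
    return [s for s in segments if s.score >= min_score]

# Values copied from one row of the output above.
similar_segment = [[141, 0, 167, 25], [4, 4, 73, 48]]
segment_score = [0.6232124590406231, 0.2644270837834451]

for seg in confident_segments(similar_segment, segment_score):
    print(seg.query_start, seg.query_end, seg.ref_start, seg.ref_end, seg.query_duration)
# prints: 141 167 0 25 26
```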
 
Take the following result as an example: `similar_segment` = [0, 141, 23, 164] indicates that seconds 0 to 23 of the query video repeat seconds 141 to 164 of the reference video. We can display these clips.
![example](example.png)

In [8]:
event_videos = ['./VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv',
               './VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv']
tmpdirname2 = './tmp_gifs2'

display_gifs_from_video(event_videos, event_videos, start_time_list=[0, 141], end_time_list=[23, 164], tmpdirname=tmpdirname2)

### Options

Some options are available for the image embedding operator and the KV storage operator:

- Image embedding operator: Check [Towhee Image Embedding](https://towhee.io/tasks/detail/operator?field_name=Computer-Vision&task_name=Image-Embedding) for more pre-trained models that Towhee encapsulates as operators. For the video copy detection task, we recommend `ISC`, a pre-trained model that works well for this task. We also support the following models from timm:

	'isc', 'gmixer_24_224', 'resmlp_24_224', 'resmlp_12_distilled_224', 'resmlp_12_224', 'coat_lite_mini', 'deit_small_patch16_224', 'resmlp_36_224', 'pit_xs_224', 'convit_small', 'resmlp_24_distilled_224', 'tnt_s_patch16_224', 'pit_ti_224', 'resmlp_36_distilled_224', 'twins_svt_small', 'convit_tiny', 'coat_lite_small', 'coat_lite_tiny', 'deit_tiny_patch16_224', 'coat_mini', 'gmlp_s16_224', 'cait_xxs24_224', 'cait_s24_224', 'levit_128', 'coat_tiny', 'cait_xxs36_224', 'levit_192', 'levit_256', 'levit_128s', 'vit_small_patch32_224', 'vit_small_patch32_384', 'vit_small_r26_s32_224', 'vit_small_patch16_224'.

	Note that the vector dimension of the Milvus collection created earlier must match the output dimension of the embedding model (e.g. 384 for `gmixer_24_224`, 256 for `ISC`);


- KV storage: Check [Towhee kv storage](https://towhee.io/kvstorage) for the supported key-value databases. To run the pipeline on a large dataset, we recommend `hbase`; otherwise `leveldb` is enough.
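
One way to avoid a dimension mismatch when switching models is to look the dimension up before creating the collection. The sketch below only knows the two sizes stated in this tutorial; `EMBEDDING_DIMS` and `collection_dim_for` are hypothetical names.

```python
# Embedding sizes stated in this tutorial; extend the table for other models
# after checking their actual output shape.
EMBEDDING_DIMS = {'isc': 256, 'gmixer_24_224': 384}

def collection_dim_for(model_name):
    """Return the vector dimension to use when creating the Milvus collection."""
    try:
        return EMBEDDING_DIMS[model_name]
    except KeyError:
        raise ValueError(
            f'Unknown embedding size for {model_name!r}; '
            'run the model once and inspect the embedding shape.'
        ) from None

print(collection_dim_for('gmixer_24_224'))  # 384
# collection = create_milvus_collection('video_deduplication', collection_dim_for('gmixer_24_224'))
```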

### Towhee Built-in Pipeline

For users' convenience, Towhee has encapsulated several built-in pipelines, including `video_embedding` and `video_copy_detection`, so the pipelines above can be created and run within a few lines of code.

In [15]:
from towhee.dc2 import AutoPipes, AutoConfig
from towhee.datacollection import DataCollection
import glob

emb_conf = AutoConfig.load_config('video_embedding')
emb_conf.collection='video_deduplication'
emb_conf.leveldb_path='url_vec.db'
# emb_conf.hbase_table='video_dedup'
emb_conf.device = 1
emb_pipe = AutoPipes.pipeline('video_embedding', emb_conf)

path = glob.glob('VCSL-demo/*/*')

for i in path:
    try:
        result = emb_pipe(i)
    except Exception:
        # Skip videos that cannot be processed (e.g. the corrupt sample video).
        pass

del emb_pipe

In [16]:
from towhee.dc2 import AutoPipes, AutoConfig
from towhee.datacollection import DataCollection

search_conf = AutoConfig.load_config('video_copy_detection')
search_conf.collection='video_deduplication'
search_conf.leveldb_path='url_vec.db'
# search_conf.hbase_table='video_dedup'
search_conf.search_params = {'limit': 64, 'metric_type':'IP'}
search_conf.top_k = 5
search_conf.device = 1
search_pipe = AutoPipes.pipeline('video_copy_detection', search_conf)

import glob
path = glob.glob('VCSL-demo/madongmei/*')

for i in path:
    try:
        result = search_pipe(i)
        DataCollection(result).show()
    except Exception:
        # Skip videos that cannot be processed.
        pass

del search_pipe

id,candidates,similar_segment,segment_score
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,"[[0, 0, 83, 83]] len=1",[1.012048188462315] len=1
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,"[[14, 0, 82, 28],[1, 1, 43, 48]] len=2","[0.3636065684258938,0.356180977285578] len=2"
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,"[[45, 144, 83, 167],[1, 141, 25, 162]] len=2","[0.41543314300599643,0.46151140530904133] len=2"
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/donghua/1134884a6c544a00b3bce99457a2a98d-LANGRISSER Mobile OP-EA4PmSWGr8o.mp4,,
VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,VCSL-demo/donghua/043bee7a71f347f18e8576bb1a01c86b-Langrisser M 'Dream Simulation' OPレオンが戻ってきた-y3qAPvWnL18.mkv,"[[17, 43, 37, 74]] len=1",[0.20176983580869787] len=1


id,candidates,similar_segment,segment_score
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,"[[0, 0, 52, 52]] len=1",[1.0192307302585015] len=1
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,"[[0, 1, 19, 25],[0, 45, 42, 78]] len=2","[0.5262252314146175,0.2769457721710205] len=2"
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,"[[0, 141, 23, 164]] len=1",[0.695718941481217] len=1
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/baisuishan/41c4eaced0d24ebba50d180026531025-廣東有線翡翠台「瞬間看地球」+插播「百歲山」廣告（錄影時間：2019年12月23日 上午8時55分）-1UJ411s7we.flv,"[[20, 186, 51, 197],[0, 188, 19, 196]] len=2","[0.30379228364853633,0.3361529023559005] len=2"
VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,VCSL-demo/huangpi/0fba4debd48245699971c4c18608cde3-黄皮外星人 低成本大制作-1vP4y1s79h.flv,,


id,candidates,similar_segment,segment_score
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,"[[0, 0, 167, 167]] len=1",[1.0059880114601043] len=1
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/madongmei/ad244c924f31461a9d809c77ae251ac1-夏洛特烦恼沈腾和大爷的经典对话，马什么梅，马冬什么，什么冬梅-1y7411n7y1.flv,"[[141, 0, 167, 25],[4, 4, 73, 48]] len=2","[0.6232093011631685,0.26442489687320403] len=2"
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/liulangdiqiu/ef65e0f662e646a88a13b6eddb640e48-《流浪地球》上《新闻联播》排面鸭！CCTV央视给力！-1xb411U7uE.flv,"[[105, 146, 139, 173],[33, 40, 69, 53],[0, 145, 23, 160],[147, 40, 162, 53]] len=4","[0.20599562613690486,0.21504592652223548,0.2527659253070229,0.33814812558037893] len=4"
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/madongmei/8ad81fc9fe0a47dbaab1b4cdc40bf07b-行了，大爷你一边凉快去吧，神曲《马冬梅》-1t54y117JK.flv,"[[10, 34, 84, 69],[141, 1, 162, 25]] len=2","[0.20891806401243998,0.42637169626024035] len=2"
VCSL-demo/madongmei/0640bd5d43d1499c962e275be6b804ef-大爷，马冬梅家住这吗？-1e64y1y799.flv,VCSL-demo/baisuishan/41c4eaced0d24ebba50d180026531025-廣東有線翡翠台「瞬間看地球」+插播「百歲山」廣告（錄影時間：2019年12月23日 上午8時55分）-1UJ411s7we.flv,"[[39, 185, 67, 199],[107, 185, 125, 199]] len=2","[0.3485658481007531,0.38749629631638527] len=2"


In [11]:
import shutil, os
from pathlib import Path

shutil.rmtree('VCSL-demo')
if Path('url_vec.db').exists():
    shutil.rmtree('url_vec.db')
os.remove('VCSL-demo.zip')