# Build a Video Action Classification System in 5 Minutes

This notebook illustrates how to build a video classification system from scratch using [Towhee](https://towhee.io/). A video classification system classifies videos into pre-defined categories. This tutorial uses a pretrained model whose labels are human activities as an example.

Using sample data covering different classes of human activities, we will build a basic video classification system in 5 lines of code and check its performance using Towhee. This tutorial also suggests some optimization options. At the end, we use [Gradio](https://gradio.app/) to create an interactive showcase.

## Preparation

### Install packages

Make sure you have installed the required Python packages:

| package |
| -- |
| towhee |
| towhee.models |
| pillow |
| ipython |
| gradio |

In [11]:
! python -m pip install -q towhee towhee.models pillow ipython gradio


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Prepare data

This tutorial uses a small dataset extracted from the validation data of [Kinetics400](https://www.deepmind.com/open-source/kinetics). You can download the subset from [GitHub](https://github.com/towhee-io/data/releases/download/video-data/reverse_video_search.zip). This tutorial uses only the 200 videos under `train` as an example.

The data is organized as follows:
- **train:** 20 classes, 10 videos per class (200 in total)
- **reverse_video_search.csv:** a CSV file containing an ***id***, ***path***, and ***label*** for each video in the `train` directory

Let's take a quick look:

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/reverse_video_search.zip -O
! unzip -q -o reverse_video_search.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  152M  100  152M    0     0  4455k      0  0:00:34  0:00:34 --:--:-- 5342k


In [9]:
import pandas as pd

df = pd.read_csv('./reverse_video_search.csv')
print(df.head(3))
print(df.label.value_counts())

   id                                          path                 label
0   0  ./train/country_line_dancing/bTbC3w_NIvM.mp4  country_line_dancing
1   1  ./train/country_line_dancing/n2dWtEmNn5c.mp4  country_line_dancing
2   2  ./train/country_line_dancing/zta-Iv-xK7I.mp4  country_line_dancing
country_line_dancing     10
pumping_fist             10
playing_trombone         10
shuffling_cards          10
tap_dancing              10
clay_pottery_making      10
eating_hotdog            10
eating_carrots           10
juggling_soccer_ball     10
juggling_fire            10
javelin_throw            10
dunking_basketball       10
chopping_wood            10
trimming_trees           10
using_segway             10
pushing_cart             10
dancing_gangnam_style    10
riding_mule              10
drop_kicking             10
doing_aerobics           10
Name: label, dtype: int64


To make it easier to look up videos and measure results in later steps, we define a helper function in advance:
- **ground_truth:** get ground-truth label for the video by its path

In [7]:
def ground_truth(path):
    label = df.set_index('path').at[path, 'label']
    return [label.replace('_', ' ')]
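
Note that `ground_truth` rebuilds the index on every call, which is fine for 200 videos. For larger datasets, a one-time lookup table may be preferable. The following is a minimal sketch using only the standard library; the helper names `build_label_lookup` and `ground_truth_from_lookup` are hypothetical, and the in-memory CSV reuses the three rows printed above in place of the real file.

```python
import csv
import io

# Stand-in for reverse_video_search.csv: the three rows printed above.
SAMPLE_CSV = """id,path,label
0,./train/country_line_dancing/bTbC3w_NIvM.mp4,country_line_dancing
1,./train/country_line_dancing/n2dWtEmNn5c.mp4,country_line_dancing
2,./train/country_line_dancing/zta-Iv-xK7I.mp4,country_line_dancing
"""

def build_label_lookup(csv_text):
    """Hypothetical helper: map each video path to its label in one pass."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row['path']: row['label'] for row in reader}

def ground_truth_from_lookup(lookup, path):
    """Same return shape as ground_truth: a one-element list, underscores as spaces."""
    return [lookup[path].replace('_', ' ')]

lookup = build_label_lookup(SAMPLE_CSV)
result = ground_truth_from_lookup(lookup, './train/country_line_dancing/bTbC3w_NIvM.mp4')
print(result)
```

With the real file, the same dictionary could be built once from `open('reverse_video_search.csv')` and reused for every video.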

## Build System

Now we are ready to build a video classification system using the sample data. We will use the [X3D_M](https://arxiv.org/abs/2004.04730) model to predict the most likely action labels for input videos. With the proper [Towhee operators](https://towhee.io/operators), you don't need to deal with video preprocessing or model details. The [method-chaining style API](https://towhee.readthedocs.io/en/main/index.html) makes it simple to wrap operators and apply them to batches of inputs.

### Predict labels

Let's take some 'tap_dancing' videos as an example to see how to predict labels for videos within 5 lines. By default, the system predicts the top 5 labels, sorted by score from high to low. You can control the number of labels returned by changing `topk`. Please note that the first run will take some time to download the model.

In [2]:
from towhee import ops, pipe, DataCollection
from glob import glob

p = (
    pipe.input('video_path_root')
    .flat_map('video_path_root', 'video_path', glob)
    .map('video_path', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 16}))
    .map('frames', ('predicts', 'scores', 'features'), ops.action_classification.pytorchvideo(model_name='x3d_m', skip_preprocess=True, topk=5))
    .output('video_path', 'predicts', 'scores')
)

DataCollection(p('./train/tap_dancing/*.mp4')).show()

Using cache found in /home/junjie.jiangjjj/.cache/torch/hub/facebookresearch_pytorchvideo_main


video_path,predicts,scores
./train/tap_dancing/2WdQuLmw-f4.mp4,tap dancing dancing charleston breakdancing jumpstyle dancing swing dancing,"[0.00492,0.00324,0.00255,0.00253,...] len=5"
./train/tap_dancing/Uf1PiOF8Poc.mp4,tap dancing dancing ballet country line dancing dancing charleston salsa dancing,"[0.0045,0.00362,0.00256,0.0025,...] len=5"
./train/tap_dancing/X7k8twydJIU.mp4,robot dancing tap dancing breakdancing krumping jumpstyle dancing,"[0.00542,0.00279,0.00265,0.00255,...] len=5"
./train/tap_dancing/PGPn8WhG3pM.mp4,tap dancing dancing charleston country line dancing jumpstyle dancing salsa dancing,"[0.00578,0.0029,0.0025,0.00249,...] len=5"
./train/tap_dancing/Krh21z_zyV8.mp4,tap dancing dancing ballet roller skating dancing charleston hopscotch,"[0.00673,0.0025,0.00249,0.00249,...] len=5"


#### Pipeline Explanation

Here are some details on the operators used in the assembled pipeline:

- `ops.video_decode.ffmpeg()`: an embedded Towhee operator that reads a video as frames, using the specified sampling method and number of samples. [learn more](https://towhee.io/video-decode/ffmpeg)

- `ops.action_classification.pytorchvideo()`: an embedded Towhee operator that applies the specified model to video frames; it can be used to predict labels and extract features for a video. [learn more](https://towhee.io/action-classification/pytorchvideo)
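
To give an intuition for `sample_type='uniform_temporal_subsample'`, the sketch below picks `num_samples` evenly spaced frame indices from a clip. It illustrates the general idea only and is not the operator's actual implementation; the function name `uniform_frame_indices` is hypothetical, and the exact rounding used by the operator may differ.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Hypothetical helper: evenly spaced frame indices from first to last frame."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError('total_frames and num_samples must be positive')
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps the endpoints exact: index 0 and index `last`.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]

# Assumed example: a 300-frame clip reduced to 16 frames, as in the pipeline above.
indices = uniform_frame_indices(total_frames=300, num_samples=16)
print(indices)
```

In this sketch, a clip with fewer frames than `num_samples` simply repeats some indices, so the output length stays fixed regardless of clip length.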

### Evaluation

We have just shown how to classify videos, but how well does the system perform?
In this section, we measure performance with the following metric, averaged over all videos:

- **mHR (recall@K):**
    - Mean Hit Ratio is the fraction of ground-truth labels that appear among the returned results, averaged over all videos.
    - Since we predict the top K labels and each video has exactly one ground-truth label, the mean hit ratio is equivalent to recall@K.
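
As a small worked example of the metric, the sketch below computes the hit ratio at K=1 and K=3 for two videos. The predictions are made up for illustration (loosely modeled on the tap dancing output above), and `hit_ratio_at_k` is a hypothetical helper, separate from the `mean_hit_ratio` function used in the evaluation pipeline.

```python
def hit_ratio_at_k(ground_truths, predictions, k):
    """Hypothetical helper: fraction of videos whose true label is in the top-k predictions."""
    hits = 0
    for truth, predicted in zip(ground_truths, predictions):
        if truth in predicted[:k]:
            hits += 1
    return hits / len(ground_truths)

# Illustrative data only: one ground-truth label and a ranked prediction list per video.
ground_truths = ['tap dancing', 'tap dancing']
predictions = [
    ['tap dancing', 'dancing charleston', 'breakdancing'],
    ['robot dancing', 'tap dancing', 'breakdancing'],
]

top1 = hit_ratio_at_k(ground_truths, predictions, k=1)  # only the first video hits
top3 = hit_ratio_at_k(ground_truths, predictions, k=3)  # both videos hit
print(top1, top3)
```

Because each video has a single ground-truth label, the per-video hit ratio is either 0 or 1, and the mean is simply the share of videos with a hit.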

In [34]:
import time

def read_csv(csv_file):
    import csv
    with open(csv_file, 'r', encoding='utf-8-sig') as f:
        data = csv.DictReader(f)
        for line in data:
            yield line['path']

def mean_hit_ratio(actual, *predicteds):
    rets = []
    for predicted in predicteds:
        ratios = []
        for act, pre in zip(actual, predicted):
            hit_num = len(set(act) & set(pre))
            ratios.append(hit_num / len(act))
        rets.append(sum(ratios) / len(ratios))
    return rets

p = (
    pipe.input('csv_file')
    .flat_map('csv_file', 'path', read_csv)
    .map('path', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 16}))
    .map('frames', ('predicts', 'scores', 'features'), ops.action_classification.pytorchvideo(model_name='x3d_m', skip_preprocess=True, topk=5))
    .map('predicts', ('top1', 'top3', 'top5'), lambda x: (x[:1], x[:3], x[:5]))
    .map('path', 'ground_truth', ground_truth)
    .window_all(('ground_truth', 'top1', 'top3', 'top5'), ('top1_mean_hit_ratio', 'top3_mean_hit_ratio', 'top5_mean_hit_ratio'),  mean_hit_ratio)
    .output('top1_mean_hit_ratio', 'top3_mean_hit_ratio', 'top5_mean_hit_ratio')
)

start = time.time()
DataCollection(p('reverse_video_search.csv')).show()
end = time.time()
print(f'Total time: {end-start}')


Using cache found in /home/junjie.jiangjjj/.cache/torch/hub/facebookresearch_pytorchvideo_main


top1_mean_hit_ratio,top3_mean_hit_ratio,top5_mean_hit_ratio
0.7,0.875,0.9


Total time: 34.97685408592224


## Optimization

You're always encouraged to play around with the tutorial. We present some optimization options here to make improvements in accuracy, latency, and resource usage. With these methods, you can make the classification system better in performance and more feasible in production.

### Change model

Many other video models are available, built on different networks. A larger or more complex model will normally give better results but cost more to run. You can try different models to trade off accuracy, latency, and resource usage. Here we show the performance of video classification using a SOTA model with a [multiscale vision transformer](https://arxiv.org/abs/2104.11227) as the backbone. Compared with X3D_M, the hit ratios increase by about 3 percentage points on average, at the cost of a heavier model; in the run below, the total time grew only slightly, from about 35 to about 37 seconds.

In [35]:

p = (
    pipe.input('csv_file')
    .flat_map('csv_file', 'path', read_csv)
    .map('path', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 32}))
    .map('frames', ('predicts', 'scores', 'features'), ops.action_classification.pytorchvideo(model_name='mvit_base_32x3', skip_preprocess=True, topk=5))
    .map('predicts', ('top1', 'top3', 'top5'), lambda x: (x[:1], x[:3], x[:5]))
    .map('path', 'ground_truth', ground_truth)
    .window_all(('ground_truth', 'top1', 'top3', 'top5'), ('top1_mean_hit_ratio', 'top3_mean_hit_ratio', 'top5_mean_hit_ratio'),  mean_hit_ratio)
    .output('top1_mean_hit_ratio', 'top3_mean_hit_ratio', 'top5_mean_hit_ratio')
)

import time
start = time.time()
DataCollection(p('reverse_video_search.csv')).show()
end = time.time()
print(f'Total time: {end-start}')

Using cache found in /home/junjie.jiangjjj/.cache/torch/hub/facebookresearch_pytorchvideo_main


top1_mean_hit_ratio,top3_mean_hit_ratio,top5_mean_hit_ratio
0.745,0.9,0.92


Total time: 36.61180877685547
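
To compare the two runs, the numbers printed above can be put side by side. The sketch below only does arithmetic on those reported values; timings will vary with hardware and with whether the model is already cached.

```python
# Values copied from the two evaluation outputs above.
x3d_m = {'top1': 0.7, 'top3': 0.875, 'top5': 0.9, 'seconds': 34.97685408592224}
mvit = {'top1': 0.745, 'top3': 0.9, 'top5': 0.92, 'seconds': 36.61180877685547}

# Absolute change in hit ratio for each K, in percentage points.
gains = {k: round((mvit[k] - x3d_m[k]) * 100, 1) for k in ('top1', 'top3', 'top5')}
mean_gain = round(sum(gains.values()) / len(gains), 1)

# Relative change in total evaluation time.
time_ratio = round(mvit['seconds'] / x3d_m['seconds'], 2)

print(gains, mean_gain, time_ratio)
```

The largest gain is at top-1, which is usually the number that matters most when only a single label is shown to the user.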


## Release a Showcase

We can build a quick demo around the classification pipeline (named `action_classification_pipe` below) with [Gradio](https://gradio.app/).

In [None]:
import gradio

topk = 3

def label(predicts: list, scores: list) -> dict:
    # Map each of the top-k predicted labels to its score,
    # which is the {label: confidence} shape that gradio.Label displays.
    return dict(zip(predicts[:topk], scores[:topk]))

action_classification_pipe = (
    pipe.input('video')
    .map('video', 'frames', ops.video_decode.ffmpeg(sample_type='uniform_temporal_subsample', args={'num_samples': 32}))
    .map('frames', ('predicts', 'scores', 'features'), ops.action_classification.pytorchvideo(model_name='mvit_base_32x3', skip_preprocess=True, topk=topk))
    .map(('predicts', 'scores'), 'label', label)
    .output('label')
)

def action_classification_function(video):
    return action_classification_pipe(video).to_list()[0][0]
    

interface = gradio.Interface(action_classification_function, 
                             inputs=gradio.Video(source='upload'),
                             outputs=[gradio.Label(num_top_classes=topk)]
                            )


interface.launch(inline=True, share=True)

Using cache found in /home/junjie.jiangjjj/.cache/torch/hub/facebookresearch_pytorchvideo_main


<img src='action_classification_demo.png' alt='action_classification_demo' width=700px/>