Video Classification Based on Temporal Segment Network

Video classification has drawn significant attention in the past few years. This page introduces how to perform video classification with PaddlePaddle Fluid on the public UCF-101 dataset, based on the state-of-the-art Temporal Segment Network (TSN) method.


Table of Contents

  • Installation
  • Data preparation
  • Training
  • Evaluation
  • Inference
  • Performance

    Installation

    Running the sample code in this directory requires PaddlePaddle Fluid v0.13.0 or later. If the PaddlePaddle version on your device is lower than this, please follow the instructions in the installation document and update it.

    Data preparation

    download UCF-101 dataset

    Users can download the UCF-101 dataset with the provided script data/download.sh.

    decode videos into frames

    To avoid decoding videos during network training, we decode them into frames offline and save them in pickle format, which is easily readable in Python.

    Users can refer to the script data/video_decode.py for video decoding.

    split data into train and test

    We follow split 1 of the UCF-101 dataset. After data splitting, users get 9537 videos for training and 3783 videos for validation. The reference script is data/split_data.py.

    save pickle for training

    As stated above, we save all data in pickle format for training. All information for each video is saved into one pickle file, including the video id, the frame binaries, and the label. Please refer to the script data/generate_train_data.py. After this operation, one gets two directories containing training and testing data in pickle format, and two files, train.list and test.list, with the fields on each line separated by spaces.
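
    The per-video pickle layout described above can be sketched as follows. The record structure and helper names here are illustrative assumptions, not necessarily what data/generate_train_data.py emits:

```python
import os
import pickle
import tempfile

# Hypothetical record layout (an assumption, not the exact format used by
# data/generate_train_data.py): one pickle file per video, holding the
# video id, the raw bytes of every decoded frame, and the class label.
def save_video_pickle(path, video_id, frame_bytes_list, label):
    with open(path, "wb") as f:
        pickle.dump((video_id, frame_bytes_list, label), f, protocol=2)

def load_video_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    frames = [b"frame0-jpeg-bytes", b"frame1-jpeg-bytes"]  # stand-in data
    path = os.path.join(tempfile.mkdtemp(), "v_YoYo_g01_c01.pkl")
    save_video_pickle(path, "v_YoYo_g01_c01", frames, 100)
    video_id, frame_list, label = load_video_pickle(path)
    print(video_id, len(frame_list), label)  # -> v_YoYo_g01_c01 2 100
```

    Keeping one self-contained record per video makes it easy for a data reader to shuffle at the video level and load all frames of a sample in one read.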

    Training

    After data preparation, users can start the PaddlePaddle Fluid training by:

    python train.py \
        --batch_size=128 \
        --total_videos=9537 \
        --class_dim=101 \
        --num_epochs=60 \
        --image_shape=3,224,224 \
        --model_save_dir=output/ \
        --with_mem_opt=True \
        --lr_init=0.01 \
        --num_layers=50 \
        --seg_num=7 \
        --pretrained_model={path_to_pretrained_model}
    

    parameter introduction:

  • batch_size: the size of each mini-batch.
  • total_videos: total number of videos in the training set.
  • class_dim: the class number of the classification task.
  • num_epochs: the number of epochs.
  • image_shape: input size of the network.
  • model_save_dir: the directory to save trained model.
  • with_mem_opt: whether to use memory optimization or not.
  • lr_init: initial learning rate.
  • num_layers: the number of layers for ResNet.
  • seg_num: the number of segments in TSN.
  • pretrained_model: path to the pretrained model used to initialize training.

  • data reader introduction: The data reader is defined in reader.py. Note that we apply a group operation to all frames of one video.

    training: The training log looks like:

    [TRAIN] Pass: 0    trainbatch: 0    loss: 4.630959    acc1: 0.0    acc5: 0.0390625    time: 3.09 sec
    [TRAIN] Pass: 0    trainbatch: 10    loss: 4.559069    acc1: 0.0546875    acc5: 0.1171875    time: 3.91 sec
    [TRAIN] Pass: 0    trainbatch: 20    loss: 4.040092    acc1: 0.09375    acc5: 0.3515625    time: 3.88 sec
    [TRAIN] Pass: 0    trainbatch: 30    loss: 3.478214    acc1: 0.3203125    acc5: 0.5546875    time: 3.32 sec
    [TRAIN] Pass: 0    trainbatch: 40    loss: 3.005404    acc1: 0.3515625    acc5: 0.6796875    time: 3.33 sec
    [TRAIN] Pass: 0    trainbatch: 50    loss: 2.585245    acc1: 0.4609375    acc5: 0.7265625    time: 3.13 sec
    [TRAIN] Pass: 0    trainbatch: 60    loss: 2.151489    acc1: 0.4921875    acc5: 0.8203125    time: 3.35 sec
    [TRAIN] Pass: 0    trainbatch: 70    loss: 1.981680    acc1: 0.578125    acc5: 0.8359375    time: 3.30 sec
    
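
    The segment-based frame grouping that TSN relies on (and that the seg_num flag controls) can be sketched as below. This is an illustrative re-implementation of the standard TSN sampling scheme, not the exact logic in reader.py:

```python
import random

def sample_segment_indices(num_frames, seg_num, training=True):
    """Split a video's frames into seg_num equal segments and pick one
    frame index per segment: a random position within the segment during
    training, the segment center at test time."""
    seg_len = max(num_frames // seg_num, 1)
    indices = []
    for s in range(seg_num):
        start = min(s * seg_len, num_frames - 1)
        offset = random.randrange(seg_len) if training else seg_len // 2
        indices.append(min(start + offset, num_frames - 1))
    return indices

# Deterministic test-time sampling: one centered frame per segment.
print(sample_segment_indices(70, 7, training=False))
# -> [5, 15, 25, 35, 45, 55, 65]
```

    The sampled frames of one video are then kept together as a group, so the network sees seg_num snippets per sample and averages their predictions.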

    Evaluation

    Evaluation measures the performance of a trained model. One can download a pretrained model and set its path as {path_to_pretrained_model}. Top-1/top-5 accuracy can then be obtained by running the following command:

    python eval.py \
        --batch_size=128 \
        --class_dim=101 \
        --image_shape=3,224,224 \
        --with_mem_opt=True \
        --num_layers=50 \
        --seg_num=7 \
        --test_model={path_to_pretrained_model}
    

    According to this evaluation configuration, the output log looks like:

    [TEST] Pass: 0    testbatch: 0    loss: 0.011551    acc1: 1.0    acc5: 1.0    time: 0.48 sec
    [TEST] Pass: 0    testbatch: 10    loss: 0.710330    acc1: 0.75    acc5: 1.0    time: 0.49 sec
    [TEST] Pass: 0    testbatch: 20    loss: 0.000547    acc1: 1.0    acc5: 1.0    time: 0.48 sec
    [TEST] Pass: 0    testbatch: 30    loss: 0.036623    acc1: 1.0    acc5: 1.0    time: 0.48 sec
    [TEST] Pass: 0    testbatch: 40    loss: 0.138705    acc1: 1.0    acc5: 1.0    time: 0.48 sec
    [TEST] Pass: 0    testbatch: 50    loss: 0.056909    acc1: 1.0    acc5: 1.0    time: 0.49 sec
    [TEST] Pass: 0    testbatch: 60    loss: 0.742937    acc1: 0.75    acc5: 1.0    time: 0.49 sec
    [TEST] Pass: 0    testbatch: 70    loss: 1.720186    acc1: 0.5    acc5: 0.875    time: 0.48 sec
    [TEST] Pass: 0    testbatch: 80    loss: 0.199669    acc1: 0.875    acc5: 1.0    time: 0.48 sec
    [TEST] Pass: 0    testbatch: 90    loss: 0.195510    acc1: 1.0    acc5: 1.0    time: 0.48 sec
    
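
    The acc1/acc5 numbers in the logs above are standard top-k accuracies. A minimal illustration (not the actual evaluation code in eval.py) is:

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring
    classes. scores: list of per-class score lists; labels: true class ids."""
    hits = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 2]
print(topk_accuracy(scores, labels, k=1))  # -> 0.5
print(topk_accuracy(scores, labels, k=3))  # -> 1.0
```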

    Inference

    Inference is used to obtain prediction scores or video features from a trained model.

    python infer.py \
        --class_dim=101 \
        --image_shape=3,224,224 \
        --with_mem_opt=True \
        --num_layers=50 \
        --seg_num=7 \
        --test_model={path_to_pretrained_model}
    

    The output contains the prediction results, including the maximum score (before softmax) and the corresponding predicted label.

    Test sample: PlayingGuitar_g01_c03, score: [21.418629], class [62]
    Test sample: SalsaSpin_g05_c06, score: [13.238657], class [76]
    Test sample: TrampolineJumping_g04_c01, score: [21.722862], class [93]
    Test sample: JavelinThrow_g01_c04, score: [16.27892], class [44]
    Test sample: PlayingTabla_g01_c01, score: [15.366951], class [65]
    Test sample: ParallelBars_g04_c07, score: [18.42596], class [56]
    Test sample: PlayingCello_g05_c05, score: [18.795723], class [58]
    Test sample: LongJump_g03_c04, score: [7.100088], class [50]
    Test sample: SkyDiving_g06_c03, score: [15.144707], class [82]
    Test sample: UnevenBars_g07_c04, score: [22.114838], class [95]
    
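
    Mapping raw network scores to the reported output can be illustrated as follows; reporting the pre-softmax maximum is safe because softmax rescales scores without reordering them. The helper names here are assumptions for illustration:

```python
import math

def predict(logits):
    """Return (max raw score, argmax class id) for one sample, matching the
    pre-softmax maximum reported in the inference output above."""
    best = max(range(len(logits)), key=lambda c: logits[c])
    return logits[best], best

def softmax(logits):
    """Numerically stable softmax; only needed if probabilities are wanted."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

score, cls = predict([1.0, 5.0, 2.0])
print(score, cls)  # -> 5.0 1
```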

    Performance

    Configuration        Top-1 accuracy
    seg=7,  size=224     0.859
    seg=10, size=224     0.863