## Optional pip installation (for Windows)

In [2]:
%pip install matplotlib torch==1.8.0 torchvision==0.9.0 gluoncv decord

Collecting matplotlib
  Downloading matplotlib-3.7.5-cp38-cp38-win_amd64.whl (7.5 MB)
Collecting torch==1.8.0
  Downloading torch-1.8.0-cp38-cp38-win_amd64.whl (190.5 MB)
Collecting torchvision==0.9.0
  Downloading torchvision-0.9.0-cp38-cp38-win_amd64.whl (852 kB)
Collecting gluoncv
  Using cached gluoncv-0.10.5.post0-py2.py3-none-any.whl (1.3 MB)
Collecting decord
  Using cached decord-0.6.0-py3-none-win_amd64.whl (24.7 MB)
Collecting numpy
  Downloading numpy-1.24.4-cp38-cp38-win_amd64.whl (14.9 MB)
Collecting pillow>=4.1.1
  Downloading pillow-10.2.0-cp38-cp38-win_amd64.whl (2.6 MB)
Collecting pyparsing>=2.3.1
  Using cached pyparsing-3.1.1-py3-none-any.whl (103 kB)
Collecting contourpy>=1.0.1
  Downloading contourpy-1.1.1-cp38-cp38-win_amd64.whl (477 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.4.5-cp38-cp38-win_amd64.whl (56 kB)
Collecting importlib-resources>=3.2.0
  Downloading importlib_resources-6.1.2-py3-none-any.whl (34 kB)
Collecting cycler>=0.10
  Using cac

You should consider upgrading via the 'c:\Users\Owner\.pyenv\pyenv-win\versions\3.8.10\python.exe -m pip install --upgrade pip' command.


In [10]:
%pip uninstall Pillow

^C
Note: you may need to restart the kernel to use updated packages.


In [18]:
%pip install Pillow

Note: you may need to restart the kernel to use updated packages.




In [12]:
%matplotlib inline

# 1. Getting Started with Pre-trained I3D Models on Kinetics400

`Kinetics400 <https://deepmind.com/research/open-source/kinetics>`_ is an action recognition dataset
of realistic action videos, collected from YouTube. With 306,245 short trimmed videos
from 400 action categories, it is one of the largest and most widely used datasets in the research
community for benchmarking state-of-the-art video action recognition models.

`I3D <https://arxiv.org/abs/1705.07750>`_ (Inflated 3D Networks) is a widely adopted 3D video
classification network. It uses 3D convolution to learn spatiotemporal information directly from videos.
I3D was proposed to improve on `C3D <https://arxiv.org/abs/1412.0767>`_ (Convolutional 3D Networks) by inflating 2D models into 3D.
This lets us not only reuse the architecture of 2D models (e.g., ResNet, Inception), but also bootstrap
the 3D model weights from 2D pretrained models. As a result, training 3D networks for video
classification becomes feasible and yields much better results.
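
As a rough illustration of what inflation means, the sketch below repeats a 2D kernel along a new time axis and divides it by the number of time steps, which is the bootstrapping idea described in the I3D paper. The helper names (`inflate_kernel`, `response_2d`, `response_3d`) and the nested-list representation are assumptions made for this example; this is not GluonCV's weight-loading code.

```python
def inflate_kernel(kernel_2d, num_frames):
    """Inflate a 2D kernel (list of rows) into a 3D kernel of depth num_frames.

    Each temporal slice is the 2D kernel divided by num_frames, so the 3D
    filter responds to a static clip (one image repeated num_frames times)
    exactly as the 2D filter responds to that image.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    scaled = [[w / num_frames for w in row] for row in kernel_2d]
    # Copy the scaled slice for every time step (independent lists, not aliases).
    return [[row[:] for row in scaled] for _ in range(num_frames)]


def response_2d(kernel_2d, patch):
    """Dot product of a 2D kernel with an image patch of the same shape."""
    return sum(w * p for k_row, p_row in zip(kernel_2d, patch)
               for w, p in zip(k_row, p_row))


def response_3d(kernel_3d, clip):
    """Dot product of a 3D kernel with a clip (list of patches) of the same shape."""
    return sum(response_2d(k, patch) for k, patch in zip(kernel_3d, clip))


kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]  # Sobel-like filter
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 2.0], [5.0, 9.0, 2.0]]      # made-up pixel values
static_clip = [patch] * 4  # the same frame repeated four times

inflated = inflate_kernel(kernel, num_frames=4)
print(response_2d(kernel, patch), response_3d(inflated, static_clip))
```

The two printed responses agree, which is why a 3D network initialized this way starts from the behaviour of its 2D counterpart on static input.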

In this tutorial, we will demonstrate how to load a pre-trained I3D model from `gluoncv-model-zoo`
and classify a video clip from the Internet or your local disk into one of the 400 action classes.

## Step by Step

We will try out a pre-trained I3D model on a single video clip.

First, please follow the `installation guide <../../index.html#installation>`__
to install ``PyTorch`` and ``GluonCV`` if you haven't done so yet.

## Simon's Fixes to Installation Instructions

1. Use Python 3.8
2. Run `pip install torch==1.6.0 torchvision==0.7.0 gluoncv decord`
3. Run `pip uninstall Pillow`
4. Run `pip install Pillow==9.5.0`
5. (Optional) Install JupyterLab to run the example notebook linked in the tutorial: `pip install jupyterlab`
6. Download the model config, which is needed to fetch the pretrained model used in the tutorial (when running the code block that loads the model, edit the config file path to point to where you saved this file): https://raw.githubusercontent.com/dmlc/gluon-cv/master/scripts/action-recognition/configuration/resnet50_v1b_kinetics400.yaml
7. Run the notebook and check if class 0 (abseiling) is the final output.
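
Before running the notebook, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` is an assumption for this example, and the package list simply mirrors the steps above.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python", ".".join(str(part) for part in sys.version_info[:3]))
for name, version in installed_versions(
        ["torch", "torchvision", "gluoncv", "decord", "Pillow"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If a package is reported as not installed, or Pillow shows a version other than the one pinned in step 4, repeat the corresponding step and restart the kernel.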

In [3]:
import numpy as np
import decord
import torch

from gluoncv.torch.utils.model_utils import download
from gluoncv.torch.data.transforms.videotransforms import video_transforms, volume_transforms
from gluoncv.torch.engine.config import get_cfg_defaults
from gluoncv.torch.model_zoo import get_model



Then, we download a video and extract a short 5-frame clip (frames 5 to 9) from it.



In [107]:
url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4'  # contains 250 frames
video_fname = download(url)
vr = decord.VideoReader(video_fname)
frame_id_list = [5, 6, 7, 8, 9]
video_data = vr.get_batch(frame_id_list).asnumpy()
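
The frame list above is written by hand. A small helper can build such lists and guard against indices past the end of the video. This is a sketch only: `sample_frame_ids` is a hypothetical function, not part of decord or GluonCV, and the 250-frame total is taken from the comment in the cell above.

```python
def sample_frame_ids(total_frames, clip_len, stride=1, start=0):
    """Return clip_len frame indices spaced stride apart, beginning at start.

    Raises ValueError if the requested clip would run past the end of the video.
    """
    if clip_len < 1 or stride < 1 or start < 0:
        raise ValueError("clip_len and stride must be >= 1 and start must be >= 0")
    last = start + (clip_len - 1) * stride
    if last >= total_frames:
        raise ValueError(
            f"clip needs frame {last}, but the video has only {total_frames} frames")
    return list(range(start, last + 1, stride))


# The five consecutive frames used in the cell above.
print(sample_frame_ids(total_frames=250, clip_len=5, start=5))         # [5, 6, 7, 8, 9]
# A longer, sparser clip: 32 frames taken from every second frame.
print(len(sample_frame_ids(total_frames=250, clip_len=32, stride=2)))  # 32
```

The returned list can be passed to `vr.get_batch(...)` in place of the hard-coded `frame_id_list`, with `len(vr)` as `total_frames`.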

Now we define transformations for the video clip.
This transformation function does four things:
(1) resize the shorter side of the video clip to ``short_side_size``,
(2) center crop the video clip to ``crop_size`` x ``crop_size``,
(3) transpose the video clip to ``num_channels*num_frames*height*width``,
and (4) normalize it with mean and standard deviation calculated across all ImageNet images.
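
To make the geometry of steps (1), (2) and (4) concrete, here is a minimal pure-Python sketch. The three helper functions and the 360 x 640 frame size are assumptions for illustration; GluonCV's own transforms may round differently, and the normalization helper assumes pixel values have already been scaled to [0, 1].

```python
def resize_short_side(height, width, short_side_size):
    """Return (new_height, new_width) after scaling the shorter side to short_side_size."""
    scale = short_side_size / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_box(height, width, crop_size):
    """Return (top, left, bottom, right) of a centred crop_size x crop_size box."""
    if crop_size > min(height, width):
        raise ValueError("crop_size is larger than the frame")
    top = (height - crop_size) // 2
    left = (width - crop_size) // 2
    return top, left, top + crop_size, left + crop_size


def normalize_pixel(rgb, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


new_h, new_w = resize_short_side(height=360, width=640, short_side_size=256)
print(new_h, new_w)                            # 256 455
print(center_crop_box(new_h, new_w, 224))      # (16, 115, 240, 339)
print(normalize_pixel((0.485, 0.456, 0.406)))  # (0.0, 0.0, 0.0)
```

A pixel equal to the ImageNet mean maps to zero in every channel, which is a quick sanity check for the normalization constants.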



In [108]:
crop_size = 224
short_side_size = 256
transform_fn = video_transforms.Compose([video_transforms.Resize(short_side_size, interpolation='bilinear'),
                                         video_transforms.CenterCrop(size=(crop_size, crop_size)),
                                         volume_transforms.ClipToTensor(),
                                         video_transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])


clip_input = transform_fn(video_data)
print('Video data is downloaded and preprocessed.')

Video data is downloaded and preprocessed.


Next, we load a pre-trained I3D model. Make sure to set the ``pretrained`` option in the configuration file to ``True`` so that the pre-trained weights are loaded.



In [109]:
config_file = './i3d_resnet50_v1_kinetics400.yaml'
cfg = get_cfg_defaults()
cfg.merge_from_file(config_file)
model = get_model(cfg)
model.eval()
print('%s model is successfully loaded.' % cfg.CONFIG.MODEL.NAME)

i3d_resnet50_v1_kinetics400 model is successfully loaded.


Finally, we prepare the video clip and feed it to the model.



In [110]:
with torch.no_grad():
    # Add a batch dimension and run the model; the output holds raw logits
    pred = model(torch.unsqueeze(clip_input, dim=0))

# Convert raw logits to probabilities using softmax
probs = torch.nn.functional.softmax(pred, dim=1).numpy()

# Get the top predicted class. The value reported as "confidence interval" is the
# spread between the highest and lowest class probability, not a statistical interval.
top_class = np.argmax(probs)
confidence_interval = np.max(probs) - np.min(probs)

print(f'The input video clip is classified as class {top_class} with confidence interval {confidence_interval}')

The input video clip is classified as class 0 with confidence interval 0.7715625166893005
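
The post-processing in the cell above can be followed step by step without a model. The sketch below uses a numerically stable softmax on made-up logits for four classes; `softmax` and `top_k` are helper names introduced for this example. With 400 classes the smallest probability is at most 1/400, so the max-minus-min value reported above is close to the top-1 probability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k=3):
    """Return the k (class_index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


logits = [4.0, 1.0, 0.5, -2.0]  # made-up scores for four classes
probs = softmax(logits)
best_class, best_prob = top_k(probs, k=1)[0]
print(best_class, round(best_prob, 3))
# The max-minus-min spread, computed the same way as in the cell above
print(round(max(probs) - min(probs), 3))
```

Reporting the top few classes with `top_k` is often more informative than a single number, since it shows how close the runner-up classes are.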


## Calculate confidence of frame windows
##### Adjust the 'N' value to set the step size
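
To see which frames each window covers before running the model, the index arithmetic from the loop below can be reproduced on its own. This is a sketch: `window_frame_ids` and `window_centers` are hypothetical helper names, and the 250-frame total comes from the comment in the download cell.

```python
def window_frame_ids(center, step):
    """Five frame indices centred on `center`, spaced `step` frames apart."""
    return list(range(center - 2 * step, center + 2 * step + 1, step))


def window_centers(total_frames, step, hop=2):
    """Centres for which the whole five-frame window fits inside the video."""
    return list(range(2 * step, total_frames - 2 * step, hop))


N = 4
centers = window_centers(total_frames=250, step=N)
print(centers[0], centers[-1])          # 8 240
print(window_frame_ids(centers[0], N))  # [0, 4, 8, 12, 16]
```

A larger `N` widens the temporal span of each window (4N + 1 frames from first to last) while still feeding only five frames to the model.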

In [124]:
N = 4

url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4' 
video_fname = download(url)
vr = decord.VideoReader(video_fname)
config_file = './i3d_resnet50_v1_kinetics400.yaml'
cfg = get_cfg_defaults()
cfg.merge_from_file(config_file)
model = get_model(cfg)
model.eval()
# Build the preprocessing pipeline once; it does not change between windows.
crop_size = 224
short_side_size = 256
transform_fn = video_transforms.Compose([video_transforms.Resize(short_side_size, interpolation='bilinear'),
                                         video_transforms.CenterCrop(size=(crop_size, crop_size)),
                                         volume_transforms.ClipToTensor(),
                                         video_transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

# Slide a 5-frame window (frames spaced N apart) across the video, moving the centre 2 frames at a time.
for i in range(2 * N, len(vr) - (2 * N), 2):
    frame_id_list = range(i - (2*N), i + (2*N) + 1, N)
    video_data = vr.get_batch(frame_id_list).asnumpy()
    clip_input = transform_fn(video_data)
    with torch.no_grad():
        pred = model(torch.unsqueeze(clip_input, dim=0)).numpy()

    # Convert raw logits to probabilities using softmax
    probs = torch.nn.functional.softmax(torch.tensor(pred), dim=1).numpy()

    # Get the top predicted class; the reported value is the spread between the
    # highest and lowest class probability
    top_class = np.argmax(probs)
    confidence_interval = np.max(probs) - np.min(probs)

    print(f'The input video clip is classified as class {top_class} with confidence interval {confidence_interval} for frame window {i}')


The input video clip is classified as class 0 with confidence interval 0.9928627610206604 for frame window 8
The input video clip is classified as class 0 with confidence interval 0.963922381401062 for frame window 10
The input video clip is classified as class 0 with confidence interval 0.8847163319587708 for frame window 12
The input video clip is classified as class 0 with confidence interval 0.9969870448112488 for frame window 14
The input video clip is classified as class 0 with confidence interval 0.9855148792266846 for frame window 16
The input video clip is classified as class 0 with confidence interval 0.9987284541130066 for frame window 18
The input video clip is classified as class 0 with confidence interval 0.9997177720069885 for frame window 20
The input video clip is classified as class 0 with confidence interval 0.9983910322189331 for frame window 22
The input video clip is classified as class 0 with confidence interval 0.99607914686203 for frame window 24
The input vide

##### Function to measure average confidence for a given window size 'N' (each window has 2N + 1 frames)

In [8]:
def frame_window_confidence(N, vr, model, true_class):
    """Classify a few (2N + 1)-frame windows of `vr` and summarize the results.

    Returns the summed confidence of correctly classified windows and the number
    of correctly classified windows, each divided by len(vr) - 2*N.
    """
    # Build the preprocessing pipeline once; it does not change between windows.
    crop_size = 224
    short_side_size = 256
    transform_fn = video_transforms.Compose([video_transforms.Resize(short_side_size, interpolation='bilinear'),
                                            video_transforms.CenterCrop(size=(crop_size, crop_size)),
                                            volume_transforms.ClipToTensor(),
                                            video_transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
    sum_confidence = 0
    sum_class = 0
    # Evaluate a handful of evenly spaced window centres (four for the 250-frame example video).
    for i in range(N, len(vr) - N, (len(vr) - 2*N - 1) // 3):
        frame_id_list = range(i-N, i+N+1)
        video_data = vr.get_batch(frame_id_list).asnumpy()
        clip_input = transform_fn(video_data)
        with torch.no_grad():
            pred = model(torch.unsqueeze(clip_input, dim=0)).numpy()
        probs = torch.nn.functional.softmax(torch.tensor(pred), dim=1).numpy()
        top_class = np.argmax(probs)
        # Spread between the highest and lowest class probability
        confidence_interval = np.max(probs) - np.min(probs)
        if top_class == true_class:
            sum_class += 1
            sum_confidence += confidence_interval
    # Note: both sums are divided by the number of possible window positions,
    # len(vr) - 2*N, not by the number of windows evaluated in the loop above.
    return sum_confidence / (len(vr) - 2*N), sum_class / (len(vr) - 2*N)
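
If per-window averages are wanted instead, one option is to divide by the number of windows that were actually evaluated. The sketch below shows that variant on made-up results; `summarize_windows` is a hypothetical helper, and averaging the confidence over correctly classified windows only is a design choice of this example, not something the notebook prescribes.

```python
def summarize_windows(results, true_class):
    """Average confidence and accuracy over the windows that were evaluated.

    `results` is a list of (top_class, confidence) pairs, one per window.
    Confidence is averaged over correctly classified windows only.
    """
    if not results:
        raise ValueError("results must contain at least one window")
    correct = [conf for top_class, conf in results if top_class == true_class]
    accuracy = len(correct) / len(results)
    mean_confidence = sum(correct) / len(correct) if correct else 0.0
    return mean_confidence, accuracy


# Made-up results for four windows: three classified as class 0, one as class 3.
results = [(0, 0.99), (0, 0.97), (3, 0.40), (0, 0.98)]
mean_confidence, accuracy = summarize_windows(results, true_class=0)
print(round(mean_confidence, 2), accuracy)  # 0.98 0.75
```

With this normalization, a video on which every evaluated window is classified correctly gets an accuracy of 1.0 regardless of how many windows were sampled.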


In [9]:
url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4' 
video_fname = download(url)
vr = decord.VideoReader(video_fname)
config_file = './i3d_resnet50_v1_kinetics400.yaml'
cfg = get_cfg_defaults()
cfg.merge_from_file(config_file)
model = get_model(cfg)
model.eval()

for i in range(2, 10):
    confidence, accuracy = frame_window_confidence(i, vr, model, 0)
    print(f'{i*2+1} frames has average confidence of {confidence} with an accuracy of {accuracy}')

2
83
164
245
5 frames has average confidence of 0.01606677218181331 with an accuracy of 0.016260162601626018
3
84
165
246
7 frames has average confidence of 0.016385315138785564 with an accuracy of 0.01639344262295082
4
84
164
244
9 frames has average confidence of 0.01638208637552813 with an accuracy of 0.01652892561983471
5
84
163
242
11 frames has average confidence of 0.016626195112864176 with an accuracy of 0.016666666666666666
6
85
164
243
13 frames has average confidence of 0.015279303578769459 with an accuracy of 0.01680672268907563
7
85
163
241
15 frames has average confidence of 0.01670145887439534 with an accuracy of 0.01694915254237288
8
85
162
239
17 frames has average confidence of 0.016835230792689528 with an accuracy of 0.017094017094017096
9
86
163
240
19 frames has average confidence of 0.01704691042160166 with an accuracy of 0.017241379310344827


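We can see that our pre-trained model predicts this video clip
to be the ``abseiling`` action (class 0) with high confidence.
The averages printed above look small only because ``frame_window_confidence`` evaluates four windows per call
but divides by the number of possible window positions, ``len(vr) - 2*N``. For the 5-frame case, the reported
accuracy of about 0.01626 equals 4/246, which means all four evaluated windows were classified as class 0.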



## Next Step

If you would like to dive deeper into fine-tuning SOTA video models on your own datasets,
feel free to read the next `tutorial on finetuning <finetune_custom.html>`__.

