# Action Recognition Using Inflated 3D CNN 

In this practice session, we will try the Inflated 3D ConvNet, or simply I3D, introduced by Joao Carreira and Andrew Zisserman in their paper “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, published in 2017 at CVPR (Conference on Computer Vision and Pattern Recognition).

The paper introduced a new architecture for video classification, the Inflated 3D ConvNet named in the title above. The model achieved state-of-the-art results on the HMDB51 and UCF101 datasets at the time of publication. Moreover, when pre-trained on the Kinetics dataset, it placed first in the CVPR 2017 Charades challenge.

To read more about it, please refer to [this](https://analyticsindiamag.com/action-recognition-using-inflated-3d-cnn/) article.

# Code Implementation of Inflated 3D CNN
## Setup and Importing Dependencies

imageio provides an easy interface for reading and writing a wide range of image data; here it is used to write GIF animations.

OpenCV (imported as cv2) provides computer vision routines; here it is used to read and resize video frames.

In [10]:

# Upgrade pip, then install the libraries used in this notebook.
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim tensorflow tensorflow-hub keras torch opencv-python torchvision \
    tqdm scikit-image pillow --user -q --no-warn-script-location

!python -m pip install imageio --user -q
!python -m pip install git+https://github.com/tensorflow/docs --user -q

# Restart the kernel so the newly installed packages can be imported.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)


  Building wheel for tensorflow-docs (setup.py) ... done


The imports below include the tools needed to display images and animations directly in the notebook cell.

Verbosity controls how much logging information appears in the output. Setting it to ERROR hides informational messages and warnings.

In [19]:
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow_docs.vis import embed

logging.set_verbosity(logging.ERROR)
import random
import re
import os
import tempfile
import ssl
import cv2
import numpy as np

# Some modules to display an animation using imageio.
import imageio
from IPython import display

from urllib import request  # requires python3

We will use the UCF101 dataset, served from the URL below.

The helper functions that follow list all the videos available in UCF101, download and cache individual videos, and load them as arrays of frames.
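As a minimal offline sketch of the listing step, the snippet below applies the same file-name pattern to a small index excerpt. `SAMPLE_INDEX` and `extract_video_names` are hypothetical names made up for illustration; the notebook itself downloads the real index page from `UCF_ROOT`.

```python
import re

# Hypothetical excerpt of a directory index page (an assumption for this
# sketch; the notebook fetches the real page over the network).
SAMPLE_INDEX = """
<a href="v_Archery_g01_c01.avi">v_Archery_g01_c01.avi</a>
<a href="v_Archery_g01_c02.avi">v_Archery_g01_c02.avi</a>
<a href="readme.txt">readme.txt</a>
"""

def extract_video_names(index_html):
    """Return sorted, de-duplicated UCF101-style file names found in the text."""
    # Each name appears twice per row (href and link text), hence the set().
    return sorted(set(re.findall(r"(v_[\w_]+\.avi)", index_html)))

names = extract_video_names(SAMPLE_INDEX)
print(names)
```

Because every file name shows up both in the link target and in the link text, the `set()` call is what keeps each video from being listed twice.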

In [21]:


# Utilities to fetch videos from UCF101 dataset
UCF_ROOT = "https://www.crcv.ucf.edu/THUMOS14/UCF101/UCF101/"
_VIDEO_LIST = None
_CACHE_DIR = tempfile.mkdtemp()
# As of July 2020, crcv.ucf.edu doesn't use a certificate accepted by the
# default Colab environment anymore.
unverified_context = ssl._create_unverified_context()

def list_ucf_videos():
  """Lists videos available in the UCF101 dataset."""
  global _VIDEO_LIST
  if not _VIDEO_LIST:
    index = request.urlopen(UCF_ROOT, context=unverified_context).read().decode("utf-8")
    # Raw string so that \w and \. reach the regex engine unchanged.
    videos = re.findall(r"(v_[\w_]+\.avi)", index)
    _VIDEO_LIST = sorted(set(videos))
  return list(_VIDEO_LIST)

def fetch_ucf_video(video):
  """Fetches a video and caches it in the local filesystem."""
  cache_path = os.path.join(_CACHE_DIR, video)
  if not os.path.exists(cache_path):
    urlpath = request.urljoin(UCF_ROOT, video)
    print("Fetching %s => %s" % (urlpath, cache_path))
    data = request.urlopen(urlpath, context=unverified_context).read()
    # Use a context manager so the file is flushed and closed before it is read.
    with open(cache_path, "wb") as f:
      f.write(data)
  return cache_path

# Utilities to open video files using CV2
def crop_center_square(frame):
  y, x = frame.shape[0:2]
  min_dim = min(y, x)
  start_x = (x // 2) - (min_dim // 2)
  start_y = (y // 2) - (min_dim // 2)
  return frame[start_y:start_y+min_dim,start_x:start_x+min_dim]

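def load_video(path, max_frames=0, resize=(224, 224)):
  """Loads a video as a float array of RGB frames scaled to [0, 1].

  max_frames=0 means no limit: the whole video is read.
  """
  cap = cv2.VideoCapture(path)
  frames = []
  try:
    while True:
      ret, frame = cap.read()
      if not ret:
        break
      frame = crop_center_square(frame)
      frame = cv2.resize(frame, resize)
      # OpenCV decodes frames as BGR; reorder the channels to RGB.
      frame = frame[:, :, [2, 1, 0]]
      frames.append(frame)

      if len(frames) == max_frames:
        break
  finally:
    cap.release()
  return np.array(frames) / 255.0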

def to_gif(images):
  converted_images = np.clip(images * 255, 0, 255).astype(np.uint8)
  imageio.mimsave('./animation.gif', converted_images, fps=25)
  return embed.embed_file('./animation.gif')
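
To make the frame preprocessing above concrete, here is a small sketch of the arithmetic behind the center crop and the channel reordering, written without OpenCV so it runs anywhere. `center_square_box` and `bgr_to_rgb` are hypothetical helper names, and the 240x320 frame size is only an example value.

```python
def center_square_box(height, width):
    """Return (start_y, start_x, side) of the largest centered square crop."""
    side = min(height, width)
    start_y = height // 2 - side // 2
    start_x = width // 2 - side // 2
    return start_y, start_x, side

def bgr_to_rgb(pixel):
    """Reverse the channel order of one pixel, like frame[:, :, [2, 1, 0]]."""
    blue, green, red = pixel
    return (red, green, blue)

# Example: a landscape frame that is 240 pixels high and 320 pixels wide.
box = center_square_box(240, 320)
print(box)                       # 40 columns are trimmed from each side
print(bgr_to_rgb((255, 0, 10)))  # the blue-heavy BGR pixel, reordered to RGB
print(255 / 255.0)               # the brightest 8-bit value maps to 1.0
```

The crop keeps the full shorter dimension and trims the longer one symmetrically, so the later resize to 224x224 does not distort the aspect ratio.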

# Using the UCF101 dataset

In [22]:
# Get the list of videos in the dataset.
ucf_videos = list_ucf_videos()
  
categories = {}
for video in ucf_videos:
  # Strip the "v_" prefix and the 12-character "_gXX_cYY.avi" suffix.
  category = video[2:-12]
  if category not in categories:
    categories[category] = []
  categories[category].append(video)
print("Found %d videos in %d categories." % (len(ucf_videos), len(categories)))

for category, sequences in categories.items():
  summary = ", ".join(sequences[:2])
  print("%-20s %4d videos (%s, ...)" % (category, len(sequences), summary))


Found 13320 videos in 101 categories.
ApplyEyeMakeup        145 videos (v_ApplyEyeMakeup_g01_c01.avi, v_ApplyEyeMakeup_g01_c02.avi, ...)
ApplyLipstick         114 videos (v_ApplyLipstick_g01_c01.avi, v_ApplyLipstick_g01_c02.avi, ...)
Archery               145 videos (v_Archery_g01_c01.avi, v_Archery_g01_c02.avi, ...)
BabyCrawling          132 videos (v_BabyCrawling_g01_c01.avi, v_BabyCrawling_g01_c02.avi, ...)
BalanceBeam           108 videos (v_BalanceBeam_g01_c01.avi, v_BalanceBeam_g01_c02.avi, ...)
BandMarching          155 videos (v_BandMarching_g01_c01.avi, v_BandMarching_g01_c02.avi, ...)
BaseballPitch         150 videos (v_BaseballPitch_g01_c01.avi, v_BaseballPitch_g01_c02.avi, ...)
BasketballDunk        131 videos (v_BasketballDunk_g01_c01.avi, v_BasketballDunk_g01_c02.avi, ...)
Basketball            134 videos (v_Basketball_g01_c01.avi, v_Basketball_g01_c02.avi, ...)
BenchPress            160 videos (v_BenchPress_g01_c01.avi, v_BenchPress_g01_c02.avi, ...)
Biking              
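
The grouping above depends on every file name ending in a fixed-width suffix such as `_g01_c01.avi`. As a hedged alternative sketch, the name can also be parsed with a regular expression, which additionally recovers the group and clip numbers. `NAME_PATTERN`, `parse_name`, and `group_by_category` are hypothetical names used only for this illustration.

```python
import re
from collections import defaultdict

# Assumed naming scheme, inferred from the file names printed above:
# v_<Category>_g<group>_c<clip>.avi
NAME_PATTERN = re.compile(r"^v_(?P<category>\w+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$")

def parse_name(filename):
    """Split a UCF101-style file name into (category, group, clip)."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("Unexpected file name: %s" % filename)
    return match["category"], int(match["group"]), int(match["clip"])

def group_by_category(filenames):
    """Map each category name to the list of its video file names."""
    categories = defaultdict(list)
    for name in filenames:
        category, _, _ = parse_name(name)
        categories[category].append(name)
    return dict(categories)

sample = ["v_ApplyEyeMakeup_g01_c01.avi", "v_ApplyEyeMakeup_g01_c02.avi", "v_Archery_g01_c01.avi"]
print(group_by_category(sample))
```

For names that follow the scheme, `parse_name(name)[0]` agrees with the slice `name[2:-12]`; the regular expression simply fails loudly on a name that does not fit instead of returning a wrong category.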

In [23]:
# Get a sample cricket video.
video_path = fetch_ucf_video("v_CricketShot_g04_c02.avi")
sample_video = load_video(video_path)

Fetching https://www.crcv.ucf.edu/THUMOS14/UCF101/UCF101/v_CricketShot_g04_c02.avi => /tmp/tmpahnp0id8/v_CricketShot_g04_c02.avi


In [24]:
sample_video.shape

(116, 224, 224, 3)

In [25]:
i3d = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-400/1").signatures['default']

In [28]:
def predict(sample_video):
  # Add a batch axis to the sample video.
  model_input = tf.constant(sample_video, dtype=tf.float32)[tf.newaxis, ...]

  logits = i3d(model_input)['default'][0]
  probabilities = tf.nn.softmax(logits)

  print("Top 5 actions:")
  for i in np.argsort(probabilities)[::-1][:5]:
    # No label list is loaded in this notebook, so the first column is the raw
    # logit of each top class rather than the action name.
    print(f"  {logits[i]:22}: {probabilities[i] * 100:5.2f}%")

In [29]:
predict(sample_video)

Top 5 actions:
      15.780726432800293: 97.77%
      10.851480484008789:  0.71%
      10.622776985168457:  0.56%
      10.613629341125488:  0.56%
       9.193199157714844:  0.13%
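
The first column above is a raw logit, because the notebook never loads the class names. The sketch below shows, in plain Python, how logits could be turned into probabilities and paired with names. The labels and logits are made-up placeholders; a real run would need the Kinetics-400 class names in class-index order, which is an assumption outside this notebook.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def top_k(logits, labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = softmax(logits)
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [(labels[i], probabilities[i]) for i in order[:k]]

# Placeholder data for illustration only (not model output).
labels = ["action_a", "action_b", "action_c"]
logits = [2.0, 0.5, 1.0]

for name, probability in top_k(logits, labels, k=2):
    print(f"  {name:22}: {probability * 100:5.2f}%")
```

With a real label list, the same pairing would replace the logit column in the output of `predict` with readable action names.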


Now try a new video, from: https://commons.wikimedia.org/wiki/Category:Videos_of_sports


In [1]:
# !curl -O https://upload.wikimedia.org/wikipedia/commons/8/86/End_of_a_jam.ogv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 55.0M  100 55.0M    0     0  2815k      0  0:00:20  0:00:20 --:--:-- 3475k


In [31]:
video_path = "https://gitlab.com/AnalyticsIndiaMagazine/practicedatasets/-/raw/main/action_recognition/End_of_a_jam.ogv"

In [32]:
sample_video = load_video(video_path)[:100]
sample_video.shape

(100, 224, 224, 3)

In [33]:
# to_gif(sample_video)


In [34]:
predict(sample_video)

Top 5 actions:
      14.983098030090332: 96.85%
      10.898600578308105:  1.63%
       8.869956016540527:  0.21%
       8.793314933776855:  0.20%
       8.579297065734863:  0.16%
