# Transfer Learning for Human Activity Recognition from Video with TensorFlow Hub


---


# Overview

[TensorFlow Hub](https://tfhub.dev/) is a repository of pre-trained TensorFlow models.

In this project, I use a pre-trained model from TensorFlow Hub with [`tf.keras`](https://www.tensorflow.org/api_docs/python/tf/keras) for video classification. Transfer learning saves training resources and can give good generalization even when training on a small dataset. The model used here is the [MoViNet A5 architecture](https://tfhub.dev/tensorflow/movinet/a5/base/kinetics-600/classification/1), pre-trained on Kinetics 600.

Note: Run the notebook on a GPU if one is available.

In [1]:
#!pip install pytube3                                                              # Download pytube to download videos from Youtube
!pip install git+https://github.com/ssuwani/pytube
from pytube import YouTube

!pip install flask-ngrok
import flask_ngrok
from flask import Flask
from flask_ngrok import run_with_ngrok

import time
import numpy as np
import pandas as pd

import cv2

from absl import logging
import tensorflow as tf
import tensorflow_hub as hub

print("Version: ", tf.__version__)
print("Hub version: ", hub.__version__)

Collecting git+https://github.com/ssuwani/pytube
  Cloning https://github.com/ssuwani/pytube to /tmp/pip-req-build-bcz0nrz2
  Running command git clone -q https://github.com/ssuwani/pytube /tmp/pip-req-build-bcz0nrz2
Building wheels for collected packages: pytube
  Building wheel for pytube (setup.py) ... done
  Created wheel for pytube: filename=pytube-10.8.1-cp37-none-any.whl size=46190 sha256=ca89bdedf48f4fcd8f06b4e7527142a9fdf69bde128bf21afc1e92e3a174d25d
  Stored in directory: /tmp/pip-ephem-wheel-cache-498ivfww/wheels/25/7c/91/be1312a77c2a7ff95e5f9dc1e0ff59113d10b67b4f80d2f4b8
Successfully built pytube
Version:  2.5.0
Hub version:  0.12.0


# Downloading Dataset for Kinetics 600

In [2]:
label = pd.read_csv('https://gist.githubusercontent.com/willprice/f19da185c9c5f32847134b87c1960769/raw/9dc94028ecced572f302225c49fcdee2f3d748d8/kinetics_600_labels.csv', index_col='id')
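# Note: this train split comes from the Kinetics 700 folder, while the labels and the test split are Kinetics 600,
# so some train labels may not appear in the Kinetics 600 label list used by the model.
train = pd.read_csv('https://raw.githubusercontent.com/rocksyne/kinetics-dataset-downloader/master/dataset_splits/kinetics700/train.csv')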
test  = pd.read_csv('https://raw.githubusercontent.com/rocksyne/kinetics-dataset-downloader/master/dataset_splits/kinetics600/test.csv')
test.head()

Unnamed: 0,label,youtube_id,time_start,time_end,split
0,sipping cup,--0l35AkU34,62,72,test
1,using inhaler,--71SekUwOA,5,15,test
2,climbing tree,--8YXc8iCt8,2,12,test
3,washing feet,--GkrdYZ9Tc,0,10,test
4,playing kickball,--NFqQGeShc,33,43,test


# Downloading a single video to avoid storage issues

Change the inputs here to test the model on a different video: choose the train or test split, then choose the video by its index.

In [3]:
df = test #@param ["None", "train", "test"] {type:"raw", allow-input: true}
# Valid indices run from 0 to len(df) - 1
i = int(input('Enter the index of video between 0 and {}: '.format(len(df) - 1)))

Enter the index of video between 0 and 56507: 1


In [5]:
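# Retry a few times because YouTube requests can fail transiently
for attempt in range(5):
  try:
    yt = YouTube('https://www.youtube.com/watch?v=' + df.youtube_id[i])
    video = yt.streams.first().download()
    break
  except Exception as err:
    print('Download failed ({}), retrying in 10 s'.format(err))
    time.sleep(10)
else:
  raise RuntimeError('Could not download video {}'.format(df.youtube_id[i]))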

### Get details of the video

In [6]:
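cap = cv2.VideoCapture(video)
n_frame = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # total number of frames
fps = int(cap.get(cv2.CAP_PROP_FPS))               # frames per second, truncated to an integer
cap.release()
length = yt.length                                 # duration in seconds, as reported by YouTube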
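
The frame count and frame rate read above determine which frames belong to the labelled clip. As a minimal sketch (the helper name `clip_frame_range` is hypothetical and not part of the notebook), the `time_start` and `time_end` values can be converted to frame indices and clamped to the frames that actually exist, which matters when the downloaded video is shorter than the annotation expects:

```python
def clip_frame_range(time_start, time_end, fps, n_frames):
    """Return (first, last) frame indices for a clip given in seconds.

    `last` is exclusive, so frames[first:last] selects the clip.
    Both indices are clamped to the range [0, n_frames].
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if time_end < time_start:
        raise ValueError("time_end must not precede time_start")
    n_frames = int(n_frames)  # OpenCV reports the frame count as a float
    first = min(max(int(time_start * fps), 0), n_frames)
    last = min(max(int(time_end * fps), 0), n_frames)
    return first, last


# Example: a 10-second clip starting at 5 s in a 30 fps video with 600 frames.
first, last = clip_frame_range(5, 15, 30, 600)
print(first, last, last - first)
```

If the video has fewer frames than the annotation implies, the returned range simply shrinks, so the number of frames should be read from the result rather than computed from the timestamps.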

### Reading and Converting Video to Numpy array

In [7]:
cap = cv2.VideoCapture(video)
if not cap.isOpened():
  raise IOError('Could not open video: {}'.format(video))

frames = []
res = None
while True:
  ret, frame = cap.read()
  if not ret:
    break
  if res is None:
    # Use the shorter side of the first frame as the square output size
    height, width, _ = frame.shape
    res = min(height, width)
  frame = cv2.resize(frame, (res, res))   # squashes the frame to a square, no cropping
  frame = frame[:, :, [2, 1, 0]]          # BGR (OpenCV order) -> RGB
  frames.append(frame)

cap.release()

# Keep only the labelled clip, scale pixel values to [0, 1], and add a batch dimension
frames = frames[df.time_start[i]*fps:df.time_end[i]*fps]
array = np.array(frames, dtype=np.float32) / 255.0
array = array[np.newaxis, ...]
array.shape

(1, 300, 360, 360, 3)
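
The shape above implies a sizeable input tensor, which is one reason the notebook handles a single video at a time. A rough back-of-the-envelope sketch (the helper `tensor_bytes` is hypothetical) multiplies the dimensions by the element size:

```python
def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of a dense array with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


shape = (1, 300, 360, 360, 3)          # batch, frames, height, width, channels
as_float32 = tensor_bytes(shape, 4)    # 466,560,000 bytes
as_float64 = tensor_bytes(shape, 8)    # 933,120,000 bytes
print("float32: {:.0f} MiB, float64: {:.0f} MiB".format(
    as_float32 / 2**20, as_float64 / 2**20))
```

Storing the frames as float32 halves the footprint compared with float64, which is what NumPy produces by default when an integer array is divided by `255.0`.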

# Importing MoViNet A-5 from TensorFlow Hub

#### [Theory](https://www.arxiv-vanity.com/papers/2103.11511/)

Mobile Video Networks (MoViNets) are a family of computation- and memory-efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, which makes them difficult to run on mobile devices.

The approach has three steps:

>First, define a MoViNet search space that lets Neural Architecture Search (NAS) efficiently trade off spatiotemporal feature representations.

>Then, introduce Stream Buffers, which process videos in small consecutive subclips. This requires constant memory without sacrificing long temporal dependencies, and enables online inference.

>Finally, create Temporal Ensembles of streaming MoViNets, which recover the small amount of accuracy lost to the stream buffer.
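
To make the stream-buffer idea concrete, here is a toy sketch, not MoViNet's actual implementation. It applies a causal temporal filter (a moving average over the current and previous `k - 1` values) to a sequence of per-frame numbers. Processing the sequence in small subclips while carrying the last `k - 1` inputs forward in a buffer gives the same result as processing the whole sequence at once, and the buffer size does not grow with video length. The function names and the choice of a moving-average filter are assumptions made for illustration.

```python
def causal_average(values, k, buffer=None):
    """Causal moving average over a window of k values.

    `buffer` holds the last k - 1 inputs from the previous subclip
    (zeros at the start of the stream). Returns (outputs, new_buffer).
    """
    if buffer is None:
        buffer = [0.0] * (k - 1)
    padded = list(buffer) + list(values)
    # Output t depends only on input t and the k - 1 inputs before it.
    outputs = [sum(padded[t:t + k]) / k for t in range(len(values))]
    # Keep the last k - 1 inputs for the next subclip.
    new_buffer = padded[len(padded) - (k - 1):]
    return outputs, new_buffer


def stream(values, k, subclip_len):
    """Process `values` in consecutive subclips, carrying the buffer along."""
    outputs, buffer = [], None
    for start in range(0, len(values), subclip_len):
        chunk = values[start:start + subclip_len]
        chunk_out, buffer = causal_average(chunk, k, buffer)
        outputs.extend(chunk_out)
    return outputs


signal = [float(v) for v in range(10)]
full, _ = causal_average(signal, k=3)
print(stream(signal, k=3, subclip_len=4) == full)
```

The comparison holds because each output only looks backwards in time, so the buffer carries everything a subclip needs from its predecessors.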





# Building the Model

In [8]:
logging.set_verbosity(logging.ERROR)

# Input shape is [frames, height, width, channels]; the batch dimension is implicit
inputs = tf.keras.layers.Input(shape=[None, None, None, 3], dtype=tf.float32)

encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/movinet/a5/base/kinetics-600/classification/1")

# Important: due to a bug in the tf.nn.conv3d CPU implementation, we must
# compile with tf.function to enforce correct behavior. Otherwise, the output
# on CPU may be incorrect.
# Newer TensorFlow releases spell this argument jit_compile.
encoder.call = tf.function(encoder.call, experimental_compile=True)

outputs = encoder(dict(image=inputs))

model = tf.keras.Model(inputs, outputs)

# Check Output

In [9]:
#example_input = tf.ones([1, 8, 320, 320, 3])
example_output = model(array)
print('Predicted: ', label.name[np.argmax(example_output[0])])
print('Actual:    ', df.label[i])

Predicted:  using inhaler
Actual:     using inhaler
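
A single arg-max hides how confident the model is. As a hedged sketch, assuming the model output is a vector of unnormalized class scores (logits), the scores can be turned into probabilities with a softmax and the top few classes listed. The helper `top_k` and the three-class example are hypothetical; in the notebook the scores would come from `example_output[0]` and the names from `label.name`. The version below avoids dependencies for clarity, though NumPy or TensorFlow operations would do the same job.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for three classes, for illustration only.
example_scores = [2.0, 0.5, -1.0]
example_names = ["using inhaler", "sipping cup", "washing feet"]
for name, prob in top_k(example_scores, example_names, k=2):
    print("{}: {:.3f}".format(name, prob))
```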
