# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook: Video Classification using LSTM

## Learning Objectives

At the end of the experiment, you will be able to:

* extract frames from a video
* build a CNN model to extract features from the video frames
* train an LSTM or GRU model to classify a video based on its frame sequence

## Information

**Background:** The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to support sequence prediction.

CNN LSTMs were developed for visual time series prediction problems and the application of generating textual descriptions from sequences of images (e.g. videos). Specifically, the problems of:



*   Activity Recognition: Generating a textual description of an activity demonstrated in a sequence of images
*   Image Description: Generating a textual description of a single image.
*   Video Description: Generating a textual description of a sequence of images.

**Applications:** Applications such as surveillance, video retrieval and human-computer interaction require methods for recognizing human actions in various scenarios. In robotics, tasks such as autonomous navigation or social interaction could also take advantage of the knowledge extracted from live video recordings. Typical scenarios include scenes with cluttered, moving backgrounds, non-stationary cameras, scale variations, individual variations in appearance and clothing, changes in lighting and viewpoint, and so forth. All of these conditions introduce challenging problems that can be addressed using deep learning (computer vision) models.

## Dataset



**Dataset:** This dataset consists of labelled videos of 6 human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors s1, outdoors with scale variation s2, outdoors with different clothes s3 and indoors s4 as illustrated below.

![img](http://www.nada.kth.se/cvap/actions/actions.gif)

All sequences were taken over homogeneous backgrounds with a static camera at a frame rate of 25 fps. The sequences were downsampled to a spatial resolution of 160x120 pixels and have a length of four seconds on average. In summary, there are 25x6x4=600 video files, one for each combination of 25 subjects, 6 actions and 4 scenarios. For this mini-project we have randomly selected 20% of the data as the test set.
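The counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the action names and the scenario codes `d1` to `d4` are illustrative assumptions, and the frame count uses the stated average length of four seconds.

```python
from itertools import product

subjects = range(1, 26)  # 25 subjects
# Illustrative names only; the dataset folders may be spelled differently
actions = ["walking", "jogging", "running", "boxing", "handwaving", "handclapping"]
scenarios = ["d1", "d2", "d3", "d4"]  # assumed codes for the four scenarios

combos = list(product(subjects, actions, scenarios))
n_total = len(combos)            # 25 * 6 * 4
n_test = n_total * 20 // 100     # 20% held out as test set
n_train = n_total - n_test
frames_per_video = 25 * 4        # 25 fps * ~4 seconds on average

print(n_total, n_train, n_test, frames_per_video)
```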

Dataset source: https://www.csc.kth.se/cvap/actions/

**Methodology:**

When performing image classification, we:

* input an image to our CNN
* obtain the predictions from the CNN
* choose the label with the largest corresponding probability

Since a video is just a series of image frames, in video classification we:

* loop over all frames in the video file
* pass each frame through the CNN
* classify each frame individually and independently of the others
* choose the label with the largest corresponding probability
* label the frame and write the output frame to disk
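The per-frame procedure can be sketched with NumPy. The probabilities below are made-up placeholders standing in for CNN outputs, the class order is arbitrary, and the majority vote at the end is one simple way (an assumption, not something prescribed here) to turn frame labels into a single video label.

```python
import numpy as np

# Illustrative class order; not taken from the dataset folders
class_names = ["boxing", "handclapping", "handwaving", "jogging", "running", "walking"]

# Hypothetical softmax outputs of a CNN for 4 frames (each row sums to 1)
frame_probs = np.array([
    [0.05, 0.05, 0.05, 0.15, 0.10, 0.60],
    [0.05, 0.05, 0.05, 0.50, 0.10, 0.25],
    [0.02, 0.03, 0.05, 0.20, 0.10, 0.60],
    [0.05, 0.05, 0.10, 0.10, 0.10, 0.60],
])

# Label each frame independently with its most probable class
frame_labels = frame_probs.argmax(axis=1)

# Majority vote over the frame labels gives one label for the whole video
votes = np.bincount(frame_labels, minlength=len(class_names))
video_label = class_names[int(votes.argmax())]

print([class_names[i] for i in frame_labels])
print(video_label)
```

Classifying frames independently ignores the order in which they occur, which is the information the LSTM and GRU models in this notebook are meant to capture.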

Refer to this [Video Classification using Keras](https://medium.com/video-classification-using-keras-and-tensorflow/action-recognition-and-video-classification-using-keras-and-tensorflow-56badcbe5f77) article for a complete explanation and an implementation example of video classification.

## Problem Statement

Train a CNN-LSTM based deep neural net to recognize the action being performed in a video.

## Grading = 10 Points

In [None]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Actions.zip
!unzip -qq Actions.zip
print("Dataset downloaded successfully!!")

### Import Required packages

In [None]:
import keras
from keras import applications
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import *
from keras.applications.vgg16 import VGG16
# Import layers from keras.layers directly; the keras.layers.pooling,
# keras.layers.recurrent and keras.layers.wrappers submodules are not
# available in recent Keras releases
from keras.layers import Dense, Input, GlobalAveragePooling2D, LSTM, TimeDistributed
from keras.optimizers import Adam, Nadam

import os, glob
import cv2
import numpy as np
from matplotlib import pyplot as plt

### Load the data and generate frames of video (2 points)

Detecting an action is possible by analyzing a series of images (that we name “frames”) that are taken in time.

Hint: Refer to the data preparation section in [keras_video_classification](https://keras.io/examples/vision/video_classification/)


In [None]:
data_dir = "/content/Actions/train/"
img_height , img_width = 160, 120
n_channels = 3
seq_len = 20
classes = os.listdir(data_dir)
classes

In [None]:
# Creating frames from videos; selecting seq_len frames evenly spaced across the video
def frames_extraction(video_path):
    frames_list = []
    vidObj = cv2.VideoCapture(video_path)
    frames = vidObj.get(cv2.CAP_PROP_FRAME_COUNT)
    fps = int(vidObj.get(cv2.CAP_PROP_FPS))
    # avoid a division by zero if the fps could not be read
    seconds = int(frames / fps) if fps else 0
    frame_count = 0
    frame_seq = 0  # fraction of the video at which the next frame is sampled
    while frame_count < frames:
        success, image = vidObj.read()
        if success and frame_count == int(frames * frame_seq):
            image = cv2.resize(image, (img_width, img_height))
            frames_list.append(image)
            frame_seq += 1 / seq_len
        elif not success:
            print("Defected frame")
            break
        frame_count += 1

    vidObj.release()  # free the video handle
    return frames_list, fps, seconds
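The cell above keeps a frame whenever the running index reaches `int(frames * frame_seq)`, so the kept frames are spread across the whole clip. A minimal sketch of the same sampling idea with integer arithmetic, which avoids accumulating a floating-point fraction; `evenly_spaced_indices` is a hypothetical helper, not part of the notebook:

```python
def evenly_spaced_indices(n_frames, seq_len):
    """Return seq_len frame indices spread evenly over [0, n_frames)."""
    return [(i * n_frames) // seq_len for i in range(seq_len)]

# A 4-second clip at 25 fps has about 100 frames
idx = evenly_spaced_indices(100, 20)
print(idx)
```

If a clip has fewer frames than `seq_len`, some indices repeat, so such clips need separate handling (the notebook simply skips videos that do not yield `seq_len` frames).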

In [None]:
# Alternative: selecting the first seq_len frames of the video
# (this definition overrides the previous one and is the one used below)
def frames_extraction(video_path):
    frames_list = []
    vidObj = cv2.VideoCapture(video_path)
    frames = vidObj.get(cv2.CAP_PROP_FRAME_COUNT)
    fps = int(vidObj.get(cv2.CAP_PROP_FPS))
    # avoid a division by zero if the fps could not be read
    seconds = int(frames / fps) if fps else 0
    frame_count = 1
    while frame_count <= seq_len:
        success, image = vidObj.read()
        if success:
            image = cv2.resize(image, (img_width, img_height))
            frames_list.append(image)
        else:
            print("Defected frame")
            break
        frame_count += 1

    vidObj.release()  # free the video handle
    return frames_list, fps, seconds

In [None]:
# testing with 1 sample of video
vid = frames_extraction("/content/Actions/train/Handclapping/person01_handclapping_d3_uncomp.avi")
np.array(vid[0]).shape, vid[1], vid[2]

Generate the frames for all the videos and store them in a NumPy array

In [None]:
def create_data(input_dir):
    X, Y, fps_list, duration = [], [], [], []
    classes_list = os.listdir(input_dir)
    for c in classes_list:
        print(c)
        files_list = os.listdir(os.path.join(input_dir, c))
        for f in files_list:
            frames, fps, sec_s = frames_extraction(os.path.join(input_dir, c, f))
            fps_list.append(fps)
            duration.append(sec_s)
            # keep only videos that yielded a full sequence of frames
            if len(frames) == seq_len:
                X.append(frames)
                Y.append(c)
    X = np.asarray(X)
    Y = np.asarray(Y)
    return X, Y, fps_list, duration

X, label, fps_list, duration = create_data(data_dir)

In [None]:
X.shape, label.shape

In [None]:
# fps is common for all the videos
set(fps_list)

In [None]:
# different duration for videos
print(set(duration))

In [None]:
# to_categorical is importable from keras.utils directly;
# keras.utils.np_utils is not available in recent Keras releases
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(label)
y = le.transform(label)
# one hot encoding Classes
Y = to_categorical(y)
Y.shape

#### Load Test data

In [None]:
test_data_dir = "/content/Actions/test/"
Xtest, Ytest, _, _ = create_data(test_data_dir)

In [None]:
# reuse the encoder fitted on the training labels
ytest = le.transform(Ytest)

Ytest = to_categorical(ytest, num_classes=len(le.classes_))
Ytest.shape
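To make the encoding step concrete, here is a NumPy-only sketch of what label encoding followed by one-hot encoding produces. The four labels are illustrative, and `np.unique` / `np.eye` are stand-ins chosen for this sketch rather than the calls used in the cells above.

```python
import numpy as np

# Illustrative string labels, one per video
labels = np.array(["Walking", "Boxing", "Walking", "Running"])

# Map each label to an integer id; classes come out in sorted order
class_names, y = np.unique(labels, return_inverse=True)

# One-hot encode: row i has a 1 in column y[i]
Y = np.eye(len(class_names), dtype="float32")[y]

# Decoding goes back from one-hot rows to the original strings
decoded = class_names[Y.argmax(axis=1)]

print(class_names, y)
print(Y)
```

Fitting the encoder once on the training labels and reusing it for the test labels keeps the column order identical in both one-hot matrices.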

In [None]:
np.where(label==classes[3])

In [None]:
plt.imshow(X[0][0])

In [None]:
fig = plt.figure(figsize=(seq_len * img_width / 100, len(classes)*img_height/100))
columns = seq_len
rows = 6
counter = 1
for i in list(range(len(label)))[::83]:
  for j in range(seq_len):
    fig.add_subplot(rows, columns, counter)
    counter += 1
    plt.imshow(X[i][j])
    plt.title(label[i])
plt.show()

In [None]:
# to play the video and to observe the ACTION in frames
from moviepy.editor import *
path="/content/Actions/train/Walking/person11_walking_d4_uncomp.avi"

clip=VideoFileClip(path)
ipython_display(clip,width=280)

### Create the neural network

We can build the model in several ways. We can wrap a well-known model in a TimeDistributed layer, or we can build our own.

With a custom ConvNet, each frame of the input sequence is passed through the same convolutional network. The goal is to extract features from every frame and then infer the class from the resulting sequence.

* Use a ConvNet wrapped in TimeDistributed to extract features from each frame.
* Feed the TimeDistributed output to a GRU or LSTM to model the time series.
* Apply fully connected (Dense) layers to make the final classification decision.
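Conceptually, a time-distributed wrapper applies the same function to every time step of a sequence. The NumPy sketch below illustrates that idea only; it is not how Keras implements the layer, and `time_distributed` and `global_max_pool` are hypothetical helpers using small random arrays in place of real frames.

```python
import numpy as np

def global_max_pool(frames):
    """(n, height, width, channels) -> (n, channels): max over spatial axes."""
    return frames.max(axis=(1, 2))

def time_distributed(fn, batch):
    """Apply fn to every frame of every video in a (batch, time, ...) array."""
    b, t = batch.shape[:2]
    merged = batch.reshape((b * t,) + batch.shape[2:])  # fold time into batch
    out = fn(merged)
    return out.reshape((b, t) + out.shape[1:])          # unfold time again

rng = np.random.default_rng(0)
videos = rng.random((2, 20, 16, 12, 3))  # 2 videos, 20 small frames each

features = time_distributed(global_max_pool, videos)
print(features.shape)
```

The result is one feature vector per frame, arranged as a sequence per video, which is the input format a recurrent layer such as GRU or LSTM expects.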

#### Build the ConvNet for the feature extraction

In [None]:
from keras.layers import Conv2D, BatchNormalization, MaxPool2D, GlobalMaxPool2D

def build_convnet(shape=(160, 120, 3)):
    model = keras.Sequential()
    model.add(Conv2D(64, (3,3), input_shape=shape, activation='relu'))
    model.add(Conv2D(64, (3,3), activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPool2D())

    model.add(Conv2D(128, (3,3), activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPool2D())

    model.add(Conv2D(512, (3,3), activation='relu'))
    model.add(BatchNormalization())

    # flatten...
    # You can also use Flatten but GlobalMaxPool2D will reduce the number of outputs (getting only maximum values from the last convolution)
    model.add(GlobalMaxPool2D())
    return model
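The comment about `Flatten` versus `GlobalMaxPool2D` can be quantified by tracing the feature-map size through the layers above. This sketch assumes the Keras defaults used in that cell: 3x3 convolutions with `'valid'` padding and 2x2 max pooling with stride 2 (batch normalization does not change the shape).

```python
def conv_valid(size, kernel=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool(size, window=2):
    """Output size of non-overlapping max pooling."""
    return size // window

h, w = 160, 120
# Layer order of build_convnet: conv, conv, pool, conv, pool, conv
for op in ["conv", "conv", "pool", "conv", "pool", "conv"]:
    if op == "conv":
        h, w = conv_valid(h), conv_valid(w)
    else:
        h, w = pool(h), pool(w)

channels = 512                       # filters in the last Conv2D
flatten_size = h * w * channels      # features per frame with Flatten
global_pool_size = channels          # features per frame with GlobalMaxPool2D

print((h, w), flatten_size, global_pool_size)  # (36, 26) 479232 512
```

With global pooling, the recurrent layer receives 512 values per frame instead of several hundred thousand, which keeps its weight matrices small.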

#### Build the Time Distributed model and Dense classifier

In [None]:
from keras.layers import TimeDistributed, GRU, Dense, Dropout

def action_model(shape=(40, 160, 120, 3), nbout=6):
    # Create our convnet with (160, 120, 3) input shape
    convnet = build_convnet(shape[1:]) # drops the leading sequence-length dimension

    # then create our final model
    model = keras.Sequential()

    # apply the convnet to every frame; the input shape is (seq_len, 160, 120, 3)
    model.add(TimeDistributed(convnet, input_shape=shape))

    # recurrent layer over the per-frame features; LSTM(64) can be used instead of GRU
    model.add(GRU(64))

    # and finally, we make a decision network
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(.5))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(nbout, activation='softmax'))
    return model
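Since the recurrent layer can be either a GRU or an LSTM, it may help to compare their sizes. The sketch below counts trainable parameters from the standard gate formulations (LSTM has four gate blocks, GRU has three). The `reset_after` flag mirrors the Keras GRU option of the same name, under the assumption that it adds a second bias vector per gate; `model.summary()` gives the authoritative numbers.

```python
def lstm_params(input_dim, units):
    """4 gate blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units, reset_after=True):
    """3 gate blocks; reset_after=True is assumed to use two bias vectors."""
    bias = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + bias)

# 512 features per frame from the ConvNet, 64 recurrent units
print(lstm_params(512, 64))  # 147712
print(gru_params(512, 64))   # 110976
```

Under these assumptions the GRU needs roughly a quarter fewer parameters than an LSTM of the same width.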

#### Set up the parameters and train the model in batches over several epochs

In [None]:
INSHAPE = (seq_len,) + (img_height,img_width) + (n_channels,) # (20, 160, 120, 3)
print(INSHAPE, len(classes))
model = action_model(INSHAPE, len(classes))
optimizer = keras.optimizers.Adam(0.00001)
model.compile(optimizer, 'categorical_crossentropy', metrics=['acc'])

In [None]:
model.fit(x = X, y = Y, validation_data=(Xtest, Ytest), epochs=25, batch_size = 10)

#### Model 2

Model 2 with Time Distributed layers of CNN

In [None]:
from keras.layers import TimeDistributed, Conv2D, Dense, MaxPooling2D, Flatten, LSTM, Dropout, BatchNormalization
from keras import models
model_cnlst = models.Sequential()
model_cnlst.add(TimeDistributed(Conv2D(128, (3, 3), strides=(1,1),activation='relu'),
                                input_shape=(seq_len, img_height, img_width, n_channels)))
model_cnlst.add(TimeDistributed(Conv2D(64, (3, 3), strides=(1,1),activation='relu')))
model_cnlst.add(TimeDistributed(MaxPooling2D(2,2)))
model_cnlst.add(TimeDistributed(Conv2D(64, (3, 3), strides=(1,1),activation='relu')))
model_cnlst.add(TimeDistributed(Conv2D(32, (3, 3), strides=(1,1),activation='relu')))
model_cnlst.add(TimeDistributed(MaxPooling2D(2,2)))
model_cnlst.add(TimeDistributed(BatchNormalization()))

model_cnlst.add(TimeDistributed(Flatten()))
model_cnlst.add(Dropout(0.2))

model_cnlst.add(LSTM(32,return_sequences=False,dropout=0.2)) # used 32 units

model_cnlst.add(Dense(64,activation='relu'))
model_cnlst.add(Dense(32,activation='relu'))
model_cnlst.add(Dropout(0.2))
model_cnlst.add(Dense(len(classes), activation='softmax'))
model_cnlst.summary()

In [None]:
from keras import optimizers

optimizer_new=optimizers.Adam(learning_rate=0.00001)
model_cnlst.compile(optimizer=optimizer_new,loss='categorical_crossentropy',metrics=['acc'])
# Training:
history_new_cnlst=model_cnlst.fit(x = X, y = Y, validation_data=(Xtest, Ytest),
                                  epochs=20, batch_size = 10)

In [None]:
!nvidia-smi

In [None]:
plt.plot(history_new_cnlst.history['loss'],'r',label='training loss')
plt.plot(history_new_cnlst.history['val_loss'],label='validation loss')
plt.xlabel('# epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

### Use pre-trained model for feature extraction (4 points)

To create a deep learning network for video classification:

* Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as VGG16, to extract features from each frame.

* Train an LSTM network on the sequences to predict the video labels.

* Assemble a network that classifies videos directly by combining layers from both networks.

Hint: [VGG-16 CNN and LSTM](https://riptutorial.com/keras/example/29812/vgg-16-cnn-and-lstm-for-video-classification)
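To see what "sequences of feature vectors" means in terms of shapes, the sketch below traces a 160x120 frame through VGG16 without its top layers. It relies on two facts about that architecture: the convolutions preserve spatial size, and each of the five 2x2 pooling stages halves it (rounding down). `vgg16_feature_map` is a hypothetical helper, and the sequence length of 20 matches `seq_len` in this notebook.

```python
def vgg16_feature_map(height, width, n_pools=5, channels=512):
    """Spatial size after the five pooling stages of VGG16 (include_top=False)."""
    for _ in range(n_pools):
        height, width = height // 2, width // 2
    return height, width, channels

fmap = vgg16_feature_map(160, 120)

# Global average pooling collapses the spatial axes: one vector per frame
feature_dim = fmap[2]
sequence_shape = (20, feature_dim)  # what the LSTM sees for one video

print(fmap, sequence_shape)  # (5, 3, 512) (20, 512)
```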

In [None]:
frames = seq_len
rows = img_height
columns = img_width

In [None]:
# Functional API
video = Input(shape=(seq_len,
                     img_height,
                     img_width,
                     n_channels))
cnn_base = VGG16(input_shape=(img_height,
                              img_width,
                              n_channels),
                 weights="imagenet",
                 include_top=False)
cnn_out = GlobalAveragePooling2D()(cnn_base.output)
cnn = Model(cnn_base.input, cnn_out)
cnn.trainable = False
encoded_frames = TimeDistributed(cnn)(video)
encoded_sequence = LSTM(256)(encoded_frames)
hidden_layer = Dense(1024, activation="relu")(encoded_sequence)
outputs = Dense(len(classes), activation="softmax")(hidden_layer)
model = Model([video], outputs)
optim = Adam(learning_rate=0.0002)
model.compile(loss="categorical_crossentropy",
              optimizer=optim,
              metrics=["accuracy"])

In [None]:
vgg_history = model.fit(x= X, y=Y, validation_data=(Xtest, Ytest), batch_size=20, epochs=10)

**Adding Dropout to reduce overfitting**

In [None]:
# Sequential API
cnn = keras.Sequential()
cnn.add(VGG16(input_shape=(img_height, img_width, n_channels), weights="imagenet", include_top=False))
cnn.add(GlobalAveragePooling2D())
cnn.trainable = False
model = keras.Sequential()
model.add(TimeDistributed(cnn))
model.add(LSTM(512))
model.add(Dense(4096, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(2048, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(512, activation="relu"))
model.add(Dense(len(classes), activation="softmax"))
optim = Adam(learning_rate=0.0002)
model.compile(loss="categorical_crossentropy", optimizer=optim, metrics=["accuracy"])

In [None]:
vgg_history = model.fit(x= X, y=Y, validation_data=(Xtest, Ytest), batch_size=20, epochs=10)

In [None]:
plt.plot(vgg_history.history['loss'],'r',label='training loss')
plt.plot(vgg_history.history['val_loss'],label='validation loss')
plt.xlabel('# epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

### Report Analysis

* Report the video frame sequences used to classify the videos correctly
* Discuss the impact of the LSTM, GRU and TimeDistributed layers
* Discuss the model convergence when using the pre-trained network versus the custom ConvNet
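One possible starting point for the report is to list which test sequences were classified correctly and to build a confusion matrix. The sketch below uses made-up probabilities and three illustrative classes; in the notebook, `probs` would come from something like `model.predict(Xtest)` and the one-hot targets from `Ytest`. `summarize_predictions` is a hypothetical helper.

```python
import numpy as np

def summarize_predictions(probs, y_true_onehot, n_classes):
    """Return the confusion matrix and the indices of correct predictions."""
    y_pred = probs.argmax(axis=1)
    y_true = y_true_onehot.argmax(axis=1)
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(confusion, (y_true, y_pred), 1)  # rows: true, columns: predicted
    correct_idx = np.flatnonzero(y_pred == y_true)
    return confusion, correct_idx

# Hypothetical softmax outputs for 4 test videos and 3 classes
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
])
y_true = np.eye(3)[[0, 2, 2, 1]]  # one-hot ground truth

confusion, correct_idx = summarize_predictions(probs, y_true, 3)
accuracy = len(correct_idx) / len(probs)

print(confusion)
print(correct_idx, accuracy)
```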