<a href="https://colab.research.google.com/github/rajiv-ranjan/cds-mini-projects/blob/Archana/M3_NB_MiniProject_3_Video_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook: Video based Action Classification using LSTM

## Learning Objectives

At the end of the experiment, you will be able to :

* extract frames out of a video
* build the CNN model to extract features from the video frames
* train LSTM/GRU model to perform action classification

## Information

**Background:** The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to support sequence prediction.

CNN LSTMs were developed for visual time series prediction problems and the application of generating textual descriptions from sequences of images (e.g. videos). Specifically, the problems of:



*   Activity Recognition: Generating a textual description of an activity demonstrated in a sequence of images
*   Image Description: Generating a textual description of a single image.
*   Video Description: Generating a textual description of a sequence of images.

**Applications:** Applications such as surveillance, video retrieval and
human-computer interaction require methods for recognizing human actions in various scenarios. In the area of robotics, the tasks of
autonomous navigation or social interaction could also take advantage of the knowledge extracted
from live video recordings. Typical scenarios
include scenes with cluttered, moving backgrounds, nonstationary camera, scale variations, individual variations in
appearance and cloth of people, changes in light and view
point and so forth. All of these conditions introduce challenging problems that can be addressed using deep learning (computer vision) models.

## Dataset



**Dataset:** This dataset consists of labelled videos of 6 human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors s1, outdoors with scale variation s2, outdoors with different clothes s3 and indoors s4 as illustrated below.

![img](https://cdn.iisc.talentsprint.com/CDS/Images/actions.gif)

All sequences were taken over homogeneous backgrounds with a static camera with 25fps frame rate. The sequences were downsampled to the spatial resolution of 160x120 pixels and have a length of four seconds in average. In summary, there are 25x6x4=600 video files for each combination of 25 subjects, 6 actions and 4 scenarios. For this mini-project we have randomly selected 20% of the data as test set.

Dataset source: https://www.csc.kth.se/cvap/actions/

**Methodology:**

When performing image classification, we input an image to our CNN; Obtain the predictions from the CNN;
Choose the label with the largest corresponding probability


Since a video is just a series of image frames, in a video classification, we Loop over all frames in the video file;
For each frame, pass the frame through the CNN; Classify each frame individually and independently of each other; Choose the label with the largest corresponding probability;
Label the frame and write the output frame to disk

Refer this [Video Classification using Keras](https://medium.com/video-classification-using-keras-and-tensorflow/action-recognition-and-video-classification-using-keras-and-tensorflow-56badcbe5f77) for complete understanding and implementation example of video classification.

## Problem Statement

Train a CNN-LSTM based deep neural net to recognize the action being performed in a video

## Grading = 10 Points

### Install and re-start the runtime

In [1]:
#!pip3 install imageio==2.4.1


In [4]:
# @title Download Dataset
# !wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Actions.zip
# !unzip -qq Actions.zip
# print("Dataset downloaded successfully!!")


from utility import download_and_unzip

download_and_unzip(
    filename="Actions.zip",
    url="https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Actions.zip",
)


False

### Import required packages

In [7]:
import keras
from keras import applications
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import *
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Dense, Input

# from keras.layers.pooling import GlobalAveragePooling2D
from keras.layers import GlobalAveragePooling2D
from keras.layers import LSTM
from keras.layers import TimeDistributed
from keras.layers import Conv2D, BatchNormalization, MaxPool2D, GlobalMaxPool2D
from keras.layers import GRU, Dense, Dropout
from keras.layers import Conv2D, BatchNormalization, MaxPool2D, GlobalMaxPool2D
from tensorflow.keras.optimizers import Adam

import os, glob
import cv2
import numpy as np
# import pandas as pd


### Load the data and generate frames of video (2 points)

Detecting an action is possible by analyzing a series of images (that we name “frames”) that are taken in time.

Hint: Refer data preparation section in [keras_video_classification](https://keras.io/examples/vision/video_classification/)


In [20]:
data_dir = "Content/Actions/train/"
test_data_dir = "Content/Actions/test/"
# YOUR CODE HERE
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048


In [21]:
def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


In [22]:
def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)
            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)


In [23]:
def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()


In [24]:
labels = []
for class_name in os.listdir(data_dir):
    labels.append(class_name)

label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique([labels])
)
print(label_processor.get_vocabulary())


[np.str_('Handclapping'), np.str_('Walking'), np.str_('boxing'), np.str_('handwaving'), np.str_('jogging'), np.str_('running')]


In [25]:
def get_video_path(data_dir):
    video_path = []
    for class_name in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_name)
        for video_name in os.listdir(class_path):
            video_path.append(os.path.join(class_path, video_name))

    return video_path


In [26]:
def prepare_all_videos(data_dir, root_dir):
    video_paths = get_video_path(data_dir)
    num_samples = len(video_paths)
    labels_str = label_processor.get_vocabulary()  # List of labels
    labels_str = np.array(labels_str)  # Convert to NumPy array
    labels = keras.ops.convert_to_numpy(label_processor(labels_str[..., None]))

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )
    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(
            shape=(
                1,
                MAX_SEQ_LENGTH,
            ),
            dtype="bool",
        )
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :],
                    verbose=0,
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(data_dir, "train")
test_data, test_labels = prepare_all_videos(test_data_dir, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

print(f"Frame features in test set: {test_data[0].shape}")
print(f"Frame masks in test set: {test_data[1].shape}")


OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person11_running_d1_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person21_running_d4_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person07_running_d1_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person08_running_d2_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person15_running_d3_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person05_running_d4_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person13_running_d4_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person03_running_d3_uncomp.avi"
OpenCV: Couldn't read video stream from file "train/Content/Actions/train/running/person

Frame features in train set: (479, 20, 2048)
Frame masks in train set: (479, 20)
Frame features in test set: (120, 20, 2048)
Frame masks in test set: (120, 20)


OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person16_running_d2_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person19_running_d1_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person13_running_d1_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person02_running_d2_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person05_running_d1_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person07_running_d4_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person07_running_d3_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person13_running_d3_uncomp.avi"
OpenCV: Couldn't read video stream from file "test/Content/Actions/test/running/person15_running_d4_unco

#### Visualize the frames and analyze the object in each frame. (1 point)

* Plot the frames of each class per row (6 rows)
* Plot the title as label on each subplot

In [None]:
# YOUR CODE HERE


### Create the Neural Network (4 points)

We can build the model in several ways. We can use a well-known model that we inject in time distributed layer, or we can build our own.

With custom ConvNet each input image of the sequence must pass to a convolutional network. The goal is to train that model for each frame and then decide the class to infer.

* Use ConvNet and Time distributed to detect features.
* Inject the Time distributed output to GRU or LSTM to treat as a time series.
* Apply a DenseNet to take the decision and classify.

##### Build the ConvNet for the feature extraction, GRU LSTM layers as a time series and Dense layers for classification

In [None]:
# YOUR CODE HERE


#### Setup the parameters and train the model with epochs, batch wise

* Use train data to fit the model and test data for validation
* Configure batch size and epochs
* Plot the loss of train and test data

In [None]:
# Note: There will be a high memory requirement for the training steps below.
# You should work on a GPU/TPU based runtime. See 'Change Runtime' in Colab
# Training time for each epoch could be ~30 mins
# To save and re-load your model later, see the reference below:
# https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb

# YOUR CODE HERE


### Use pre-trained model for feature extraction (3 points)

To create a deep learning network for video classification:

* Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as VGG16, to extract features from each frame.

* Train an LSTM network on the sequences to predict the video labels.

* Assemble a network that classifies videos directly by combining layers from both networks.

Hint: [VGG-16 CNN and LSTM](https://riptutorial.com/keras/example/29812/vgg-16-cnn-and-lstm-for-video-classification)

#### Load and fine-tune the pre-trained model

In [None]:
# YOUR CODE HERE


#### Setup the parameters and train the model with epochs, batch wise

* Use train data to fit the model and test data for validation
* Configure batch size and epochs
* Plot the loss of train and test data

In [None]:
# YOUR CODE HERE


### Report Analysis

* Discuss on FPS, Number of frames and duration of each video
* Analyze the impact of the LSTM, GRU and TimeDistributed layers
* Discuss about the model convergence using pre-trained and ConvNet
* *Additional Reading*: Read and discuss about the use of Conv3D in video classification