# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook: Video based Action Classification using LSTM

## Learning Objectives

At the end of the experiment, you will be able to:

* extract frames out of a video
* build the CNN model to extract features from the video frames
* train LSTM/GRU model to perform action classification

## Information

**Background:** The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to support sequence prediction.

CNN LSTMs were developed for visual time series prediction problems and the application of generating textual descriptions from sequences of images (e.g. videos). Specifically, the problems of:



*   Activity Recognition: Generating a textual description of an activity demonstrated in a sequence of images
*   Image Description: Generating a textual description of a single image.
*   Video Description: Generating a textual description of a sequence of images.

**Applications:** Applications such as surveillance, video retrieval and human-computer interaction require methods for recognizing human actions in various scenarios. In robotics, tasks such as autonomous navigation or social interaction could also take advantage of the knowledge extracted from live video recordings. Typical scenarios include scenes with cluttered, moving backgrounds, a non-stationary camera, scale variations, individual variations in people's appearance and clothing, changes in lighting and viewpoint, and so forth. All of these conditions introduce challenging problems that can be addressed using deep learning (computer vision) models.

## Dataset



**Dataset:** This dataset consists of labelled videos of 6 human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors s1, outdoors with scale variation s2, outdoors with different clothes s3 and indoors s4 as illustrated below.

![img](https://cdn.iisc.talentsprint.com/CDS/Images/actions.gif)

All sequences were taken over homogeneous backgrounds with a static camera at a frame rate of 25 fps. The sequences were downsampled to a spatial resolution of 160x120 pixels and have a length of four seconds on average. In summary, there are 25x6x4=600 video files, one for each combination of 25 subjects, 6 actions and 4 scenarios. For this mini-project, we have randomly selected 20% of the data as the test set.

Dataset source: https://www.csc.kth.se/cvap/actions/
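
The dataset figures quoted above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch: the variable names are illustrative, and the split sizes are nominal values derived from the stated 20% test fraction, not counts read from the downloaded archive.

```python
# Nominal dataset sizes derived from the description above.
subjects, actions, scenarios = 25, 6, 4
total_videos = subjects * actions * scenarios  # one file per combination

test_fraction = 0.20  # stated share of the data held out for testing
test_videos = round(total_videos * test_fraction)
train_videos = total_videos - test_videos

fps = 25  # stated frame rate
avg_sequence_seconds = 4  # stated average sequence length
avg_frames_per_sequence = fps * avg_sequence_seconds

print(f"total={total_videos}, train={train_videos}, test={test_videos}")
print(f"average frames per sequence: {avg_frames_per_sequence}")
```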

**Methodology:**

When performing image classification, we:

* input an image to our CNN
* obtain the predictions from the CNN
* choose the label with the largest corresponding probability

Since a video is just a series of image frames, video classification can be done in a similar way:

* loop over all frames in the video file
* pass each frame through the CNN
* classify each frame individually and independently of the others
* choose the label with the largest corresponding probability
* label the frame and write the output frame to disk
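
The frame-by-frame procedure above can be sketched with NumPy. The probabilities below are made-up stand-ins for the softmax output of a CNN, and the class order in `CLASS_NAMES` is an assumption for illustration. Averaging the per-frame probabilities to get a single label for the whole video is one common option and is not prescribed by the text above.

```python
import numpy as np

# Illustrative class order; the real order depends on how labels are encoded.
CLASS_NAMES = ["boxing", "handclapping", "handwaving", "jogging", "running", "walking"]


def classify_frames(frame_probs):
    """Return the most probable class index for every frame independently."""
    return np.argmax(frame_probs, axis=1)


def classify_video(frame_probs):
    """Average the per-frame probabilities and pick the most probable class."""
    return int(np.argmax(frame_probs.mean(axis=0)))


# Made-up CNN outputs: 4 frames x 6 classes, each row sums to 1.
frame_probs = np.array(
    [
        [0.05, 0.05, 0.05, 0.25, 0.50, 0.10],
        [0.05, 0.05, 0.05, 0.45, 0.30, 0.10],  # this frame looks like jogging
        [0.02, 0.03, 0.05, 0.20, 0.60, 0.10],
        [0.05, 0.05, 0.05, 0.15, 0.55, 0.15],
    ]
)

per_frame = classify_frames(frame_probs)
video_label = CLASS_NAMES[classify_video(frame_probs)]
print([CLASS_NAMES[i] for i in per_frame], "->", video_label)
```

Classifying frames independently ignores their temporal order, which is the motivation for feeding the per-frame features to an LSTM or GRU in the rest of this notebook.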

Refer to this [Video Classification using Keras](https://medium.com/video-classification-using-keras-and-tensorflow/action-recognition-and-video-classification-using-keras-and-tensorflow-56badcbe5f77) article for a complete explanation and an implementation example of video classification.

## Problem Statement

Train a CNN-LSTM based deep neural net to recognize the action being performed in a video

## Grading = 10 Points

### Install and re-start the runtime

In [18]:
# !pip3 install imageio==2.4.1


In [19]:
# @title Download Dataset
# !wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Actions.zip
# !unzip -qq Actions.zip
# print("Dataset downloaded successfully!!")


from utility import download_and_unzip

download_and_unzip(
    filename="Actions.zip",
    url="https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Actions.zip",
)


False

### Import required packages

In [20]:
import keras
from keras import applications
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import *
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Dropout, Input
from keras.layers import GlobalAveragePooling2D
from keras.layers import LSTM, GRU
from keras.layers import TimeDistributed
from keras.layers import Conv2D, BatchNormalization, MaxPool2D, GlobalMaxPool2D
from tensorflow.keras.optimizers import Adam

import os, glob
import cv2 as cv
import numpy as np
import pandas as pd


### Load the data and generate frames of video (2 points)

An action can be detected by analyzing a series of images (called “frames”) captured over time.

Hint: Refer to the data preparation section in [keras_video_classification](https://keras.io/examples/vision/video_classification/)
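
Videos differ in length, while a sequence model usually expects a fixed number of frames per sample. One possible approach, sketched below, is to pick evenly spaced frame indices from each video. The helper `sample_frame_indices` and the sequence length of 20 are assumptions for illustration and are not part of the assignment.

```python
import numpy as np


def sample_frame_indices(num_frames, seq_length):
    """Pick up to `seq_length` evenly spaced frame indices from a video."""
    if num_frames <= 0:
        return np.array([], dtype=int)
    if num_frames <= seq_length:
        # Short video: keep every frame (padding can be added later).
        return np.arange(num_frames)
    # Long video: spread the indices from the first to the last frame.
    return np.linspace(0, num_frames - 1, num=seq_length).round().astype(int)


# Example: a 350-frame video reduced to a 20-frame sequence.
indices = sample_frame_indices(num_frames=350, seq_length=20)
print(indices)
```

Sampling across the whole video keeps the start and the end of the action, whereas taking only the first `seq_length` frames may miss most of it.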


In [21]:
# data_dir = "/content/Actions/train/"
# test_data_dir = "/content/Actions/test/"
# YOUR CODE HERE
data_dir = "Actions_temp/train/"
test_data_dir = "Actions_temp/test/"


In [22]:
def play_video(video_path):
    cap = cv.VideoCapture(video_path)

    while cap.isOpened():
        ret, frame = cap.read()

        # if frame is read correctly ret is True
        if not ret:
            print("Can't receive frame (stream end?). Exiting ...")
            break
        gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)

        cv.imshow("frame", gray)
        if cv.waitKey(25) == ord("q"):
            break

    cap.release()
    cv.destroyAllWindows()


# play_video(data_dir + "boxing/person01_boxing_d1_uncomp.avi")


In [23]:
def load_video_into_array_(file_path):
    # Load the video from the file path
    cap = cv.VideoCapture(file_path)

    # Initialize an empty list to store the frames
    frames = []

    # Loop through the video frames
    while cap.isOpened():
        # Read the frame
        ret, frame = cap.read()

        # If the frame was not read correctly, break the loop
        if not ret:
            break

        # Convert the frame to grayscale
        frame = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)

        # Resize the frame to 224x224
        frame = cv.resize(frame, (224, 224))

        # Normalize the pixel values
        frame = frame / 255.0

        # Append the frame to the frames list
        frames.append(frame)

    # Release the VideoCapture object and close the window
    cap.release()
    cv.destroyAllWindows()

    # Convert the frames list to a numpy array
    frames = np.array(frames)

    return frames


In [24]:
def load_video_into_dataframe(root_path):
    # Initialize an empty list to store the file names and labels
    data = []

    # Loop through each subfolder in the folder
    for subfolder in os.listdir(root_path):
        subfolder_path = os.path.join(root_path, subfolder)
        if os.path.isdir(subfolder_path):
            # Loop through each file in the subfolder
            for file_name in os.listdir(subfolder_path):
                file_path = os.path.join(subfolder_path, file_name)
                if os.path.isfile(file_path):
                    # Append the file name and label (subfolder name) to the data list
                    data.append(
                        {
                            "folder": subfolder_path,
                            "file_name": file_name,
                            "label": subfolder,
                        }
                    )

    # Create a dataframe from the data list
    df = pd.DataFrame(data)

    return df


In [25]:
df_data = pd.DataFrame()
df_data = load_video_into_dataframe(data_dir)
df_data.head(5)


| | folder | file_name | label |
|---|---|---|---|
| 0 | Actions_temp/train/running | person01_running_d3_uncomp.avi | running |
| 1 | Actions_temp/train/running | person01_running_d1_uncomp.avi | running |
| 2 | Actions_temp/train/running | person01_running_d2_uncomp.avi | running |
| 3 | Actions_temp/train/handwaving | person01_handwaving_d2_uncomp.avi | handwaving |
| 4 | Actions_temp/train/handwaving | person01_handwaving_d1_uncomp.avi | handwaving |


In [26]:
df_data.groupby("label").count()


| label | folder | file_name |
|---|---|---|
| Handclapping | 3 | 3 |
| Walking | 3 | 3 |
| boxing | 3 | 3 |
| handwaving | 3 | 3 |
| jogging | 3 | 3 |
| running | 3 | 3 |


In [27]:
df_test_data = pd.DataFrame()
df_test_data = load_video_into_dataframe(test_data_dir)
df_test_data.head(5)


| | folder | file_name | label |
|---|---|---|---|
| 0 | Actions_temp/test/running | person02_running_d2_uncomp.avi | running |
| 1 | Actions_temp/test/running | person04_running_d4_uncomp.avi | running |
| 2 | Actions_temp/test/running | person02_running_d1_uncomp.avi | running |
| 3 | Actions_temp/test/handwaving | person02_handwaving_d1_uncomp.avi | handwaving |
| 4 | Actions_temp/test/handwaving | person02_handwaving_d2_uncomp.avi | handwaving |


In [28]:
df_test_data.groupby("label").count()


| label | folder | file_name |
|---|---|---|
| Handclapping | 3 | 3 |
| Walking | 3 | 3 |
| boxing | 3 | 3 |
| handwaving | 3 | 3 |
| jogging | 3 | 3 |
| running | 3 | 3 |


In [29]:
def create_preprocess_folder(name_preprocessed_folder, org_root_dir, datatype_dir):
    datatype_dir_path = os.path.join(name_preprocessed_folder, datatype_dir)
    if not os.path.exists(datatype_dir_path):
        os.makedirs(datatype_dir_path)

    # Loop through each subfolder in the data directory
    for subfolder in os.listdir(org_root_dir):
        class_dir_path = os.path.join(datatype_dir_path, subfolder)
        if not os.path.exists(class_dir_path):
            os.makedirs(class_dir_path)


In [30]:
create_preprocess_folder("preprocessed_videos", data_dir, "train")
create_preprocess_folder("preprocessed_videos", test_data_dir, "test")


In [31]:
def extract_frames_from_avi(
    video_path, output_folder, frame_prefix="frame_", file_format=".jpg"
):
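    # Check if video file exists (done before the try block so that the
    # `finally` clause never runs without a capture object)
    if not os.path.isfile(video_path):
        print(f"Error: Video file '{video_path}' does not exist.")
        return -1

    # Initialize video capture and the saved-frame counter up front so both
    # names are always defined when the `finally` clause and final print run
    cap = cv.VideoCapture(video_path)
    frames_saved = 0

    try:
        # Create output folder if it doesn't exist
        if not os.path.exists(output_folder):
            os.makedirs(output_folder)
            print(f"Created output directory: {output_folder}")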

        # Check if video opened successfully
        if not cap.isOpened():
            print(f"Error: Could not open video file '{video_path}'.")
            return -1

        # Get video properties
        fps = cap.get(cv.CAP_PROP_FPS)
        frame_count = int(cap.get(cv.CAP_PROP_FRAME_COUNT))
        duration = frame_count / fps if fps > 0 else 0
        frame_height = int(cap.get(cv.CAP_PROP_FRAME_HEIGHT))
        frame_width = int(cap.get(cv.CAP_PROP_FRAME_WIDTH))
        ret, frame = cap.read()
        channels = frame.shape[2] if ret else None

        print("Video properties:")
        print(f"- FPS: {fps}")
        print(f"- Total frames: {frame_count}")
        print(f"- Duration: {duration:.2f} seconds")
        print(f"- Height: {frame_height} ")
        print(f"- Width: {frame_width} ")
        print(f"- Channels: {channels} ")

        # Reset to beginning of video
        cap.set(cv.CAP_PROP_POS_FRAMES, 0)

        # Extract frames
        frame_number = 0
        frames_saved = 0

        while True:
            # Read next frame
            success, frame = cap.read()

            # Break the loop if we've reached the end of the video
            if not success:
                break

            # Construct output filename with leading zeros for proper sorting
            frame_filename = f"{frame_prefix}{frame_number:06d}{file_format}"
            output_path = os.path.join(output_folder, frame_filename)

            # Save the frame
            cv.imwrite(output_path, frame)
            frames_saved += 1

            # Increment frame counter
            frame_number += 1

    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Release the video capture object
        cap.release()

    print(f"Extraction complete. Saved {frames_saved} frames to {output_folder}")
    return frames_saved


In [32]:
df_data.head(5)


| | folder | file_name | label |
|---|---|---|---|
| 0 | Actions_temp/train/running | person01_running_d3_uncomp.avi | running |
| 1 | Actions_temp/train/running | person01_running_d1_uncomp.avi | running |
| 2 | Actions_temp/train/running | person01_running_d2_uncomp.avi | running |
| 3 | Actions_temp/train/handwaving | person01_handwaving_d2_uncomp.avi | handwaving |
| 4 | Actions_temp/train/handwaving | person01_handwaving_d1_uncomp.avi | handwaving |


In [33]:
def extract_frames_from_videos(df, preprocessed_folder):
    # Loop through each row in the dataframe
    for index, row in df.iterrows():
        # Get the file path
        file_path = os.path.join(row["folder"], row["file_name"])

        # Get the label
        label = row["label"]

        # Create a folder for the label if it doesn't exist
        label_folder = os.path.join(
            os.path.join(preprocessed_folder, label), row["file_name"].split(".")[0]
        )
        if not os.path.exists(label_folder):
            os.makedirs(label_folder)

        # Extract frames from the video

        number_of_frames_saved = extract_frames_from_avi(
            video_path=file_path, output_folder=label_folder
        )
        print(f"Extracted {number_of_frames_saved} frames in '{label_folder}'")


In [34]:
extract_frames_from_videos(df_data, "preprocessed_videos/train")


Video properties:
- FPS: 25.0
- Total frames: 350
- Duration: 14.00 seconds
- Height: 120 
- Width: 160 
- Channels: 3 
Extraction complete. Saved 350 frames to preprocessed_videos/train/running/person01_running_d3_uncomp
Extracted 350 frames in 'preprocessed_videos/train/running/person01_running_d3_uncomp'
Video properties:
- FPS: 25.0
- Total frames: 335
- Duration: 13.40 seconds
- Height: 120 
- Width: 160 
- Channels: 3 
Extraction complete. Saved 335 frames to preprocessed_videos/train/running/person01_running_d1_uncomp
Extracted 335 frames in 'preprocessed_videos/train/running/person01_running_d1_uncomp'
Video properties:
- FPS: 25.0
- Total frames: 365
- Duration: 14.60 seconds
- Height: 120 
- Width: 160 
- Channels: 3 
Extraction complete. Saved 365 frames to preprocessed_videos/train/running/person01_running_d2_uncomp
Extracted 365 frames in 'preprocessed_videos/train/running/person01_running_d2_uncomp'
Video properties:
- FPS: 25.0
- Total frames: 656
- Duration: 26.24 secon

In [35]:
extract_frames_from_videos(df_test_data, "preprocessed_videos/test")


Video properties:
- FPS: 25.0
- Total frames: 1492
- Duration: 59.68 seconds
- Height: 120 
- Width: 160 
- Channels: 3 
Extraction complete. Saved 1492 frames to preprocessed_videos/test/running/person02_running_d2_uncomp
Extracted 1492 frames in 'preprocessed_videos/test/running/person02_running_d2_uncomp'
Video properties:
- FPS: 25.0
- Total frames: 230
- Duration: 9.20 seconds
- Height: 120 
- Width: 160 
- Channels: 3 
Extraction complete. Saved 230 frames to preprocessed_videos/test/running/person04_running_d4_uncomp
Extracted 230 frames in 'preprocessed_videos/test/running/person04_running_d4_uncomp'
Video properties:
- FPS: 25.0
- Total frames: 314
- Duration: 12.56 seconds
- Height: 120 
- Width: 160 
- Channels: 3 
Extraction complete. Saved 314 frames to preprocessed_videos/test/running/person02_running_d1_uncomp
Extracted 314 frames in 'preprocessed_videos/test/running/person02_running_d1_uncomp'
Video properties:
- FPS: 25.0
- Total frames: 550
- Duration: 22.00 seconds
-

In [None]:


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]



In [None]:

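# Names used by the cells below but not defined earlier in this notebook.
# The hyperparameter values are assumptions; NUM_FEATURES matches the
# 2048-dimensional average-pooled output of InceptionV3.
IMG_SIZE = 224
EPOCHS = 10
MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048


def to_video_df(df):
    # Absolute paths make os.path.join(root_dir, path) ignore root_dir later on
    paths = [os.path.abspath(os.path.join(f, n)) for f, n in zip(df["folder"], df["file_name"])]
    return pd.DataFrame({"video_name": paths, "tag": df["label"].values})


train_df, test_df = to_video_df(df_data), to_video_df(df_test_data)


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv.VideoCapture(path)  # OpenCV is imported as `cv` in this notebook
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # BGR (OpenCV order) -> RGB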
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)


In [None]:
def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()


In [None]:
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())


In [None]:
def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = keras.ops.convert_to_numpy(label_processor(labels[..., None]))

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(
            shape=(
                1,
                MAX_SEQ_LENGTH,
            ),
            dtype="bool",
        )
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :], verbose=0,
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels
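
The padding and masking idea used in `prepare_all_videos` can be seen in isolation with a small NumPy sketch. The helper `pad_features` and the tiny shapes are hypothetical and chosen only to keep the example readable; the real features would come from the CNN feature extractor.

```python
import numpy as np


def pad_features(features, max_seq_length):
    """Pad or truncate (num_frames, num_features) features and build a mask."""
    num_features = features.shape[1]
    length = min(max_seq_length, features.shape[0])

    padded = np.zeros((max_seq_length, num_features), dtype="float32")
    mask = np.zeros((max_seq_length,), dtype="bool")

    padded[:length] = features[:length]
    mask[:length] = True  # True = real frame, False = padding
    return padded, mask


# A 3-frame video with 4 features per frame, padded to 5 timesteps.
short_video = np.ones((3, 4), dtype="float32")
padded, mask = pad_features(short_video, max_seq_length=5)
print(padded.shape, mask)
```

Passing the mask to the recurrent layer lets it skip the padded timesteps instead of treating the zero vectors as real frames.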



In [None]:


train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")


In [None]:
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model





In [None]:

# Utility for running experiments.
def run_experiment():
    filepath = "/tmp/video_classifier/ckpt.weights.h5"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


In [None]:

_, sequence_model = run_experiment()


#### Visualize the frames and analyze the object in each frame. (1 point)

* Plot the frames of each class per row (6 rows)
* Plot the title as label on each subplot

In [None]:
# YOUR CODE HERE


### Create the Neural Network (4 points)

We can build the model in several ways: we can wrap a well-known pre-trained model in a TimeDistributed layer, or we can build our own ConvNet.

With a custom ConvNet, each frame of the input sequence is passed through the same convolutional network. The goal is to extract features from every frame and then use the resulting sequence of features to infer the class.

* Use a ConvNet wrapped in a TimeDistributed layer to extract features from each frame.
* Feed the TimeDistributed output to a GRU or LSTM so that it is treated as a time series.
* Apply Dense layers to make the decision and classify.
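
The three steps above could be wired together roughly as follows. This is a minimal sketch and not a reference solution: the function name `build_cnn_rnn`, the layer sizes, the 64x64 grayscale input and the 20-frame sequence length are all assumptions chosen to keep the model small.

```python
import keras
from keras import layers


def build_cnn_rnn(seq_length=20, img_size=64, channels=1, num_classes=6):
    """Small ConvNet applied per frame, followed by a GRU and Dense layers."""
    # ConvNet that turns one frame into a feature vector.
    frame_cnn = keras.Sequential(
        [
            keras.Input((img_size, img_size, channels)),
            layers.Conv2D(16, 3, padding="same", activation="relu"),
            layers.MaxPool2D(),
            layers.Conv2D(32, 3, padding="same", activation="relu"),
            layers.GlobalMaxPool2D(),
        ],
        name="frame_cnn",
    )

    # Input is a sequence of frames: (time, height, width, channels).
    inputs = keras.Input((seq_length, img_size, img_size, channels))
    x = layers.TimeDistributed(frame_cnn)(inputs)  # same ConvNet on every frame
    x = layers.GRU(32)(x)  # summarize the sequence of frame features
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs, name="cnn_gru_classifier")
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model


model = build_cnn_rnn()
model.summary()
```

Swapping `layers.GRU` for `layers.LSTM` gives the LSTM variant; the rest of the model stays the same.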

##### Build the ConvNet for feature extraction, GRU/LSTM layers for the time series and Dense layers for classification

In [None]:
# YOUR CODE HERE


#### Set up the parameters and train the model batch-wise over several epochs

* Use train data to fit the model and test data for validation
* Configure batch size and epochs
* Plot the loss of train and test data

In [None]:
# Note: There will be a high memory requirement for the training steps below.
# You should work on a GPU/TPU based runtime. See 'Change Runtime' in Colab
# Training time for each epoch could be ~30 mins
# To save and re-load your model later, see the reference below:
# https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb

# YOUR CODE HERE


### Use pre-trained model for feature extraction (3 points)

To create a deep learning network for video classification:

* Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as VGG16, to extract features from each frame.

* Train an LSTM network on the sequences to predict the video labels.

* Assemble a network that classifies videos directly by combining layers from both networks.

Hint: [VGG-16 CNN and LSTM](https://riptutorial.com/keras/example/29812/vgg-16-cnn-and-lstm-for-video-classification)

#### Load and fine-tune the pre-trained model

In [None]:
# YOUR CODE HERE


#### Set up the parameters and train the model batch-wise over several epochs

* Use train data to fit the model and test data for validation
* Configure batch size and epochs
* Plot the loss of train and test data

In [None]:
# YOUR CODE HERE


### Report Analysis

* Discuss the FPS, number of frames and duration of each video
* Analyze the impact of the LSTM, GRU and TimeDistributed layers
* Discuss the model convergence using the pre-trained model and the custom ConvNet
* *Additional Reading*: Read about and discuss the use of Conv3D in video classification
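
For the first point, the duration of a video follows directly from its frame count and FPS. The sketch below reuses three frame counts copied from the extraction logs earlier in this notebook; the dictionary and helper names are illustrative.

```python
# (frame_count, fps) pairs copied from the extraction logs above.
logged_videos = {
    "person01_running_d3_uncomp": (350, 25.0),
    "person01_running_d1_uncomp": (335, 25.0),
    "person02_running_d2_uncomp": (1492, 25.0),
}


def duration_seconds(frame_count, fps):
    """Duration in seconds, guarding against a zero FPS value."""
    return frame_count / fps if fps > 0 else 0.0


durations = {
    name: duration_seconds(frame_count, fps)
    for name, (frame_count, fps) in logged_videos.items()
}
for name, seconds in durations.items():
    print(f"{name}: {seconds:.2f} s")
```

These files are noticeably longer than the four-second average quoted for individual sequences, which is worth addressing in the report.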