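# __Video Classification Using Hybrid Model__
This notebook classifies videos from the UCF101 dataset with a hybrid model: a pretrained InceptionV3 network (transfer learning) extracts per-frame features, and a recurrent GRU model classifies the resulting feature sequences.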


## Steps to Be Followed:
1. Downloading data and importing the required libraries
2. Reading the data from datasets and printing the ten rows
3. Defining the functions for cropping and loading video frames
4. Building a feature extraction model using InceptionV3 architecture
5. Creating a string lookup table for labels and printing the vocabulary of the label processor
6. Preparing video data for training and testing by extracting frame features
7. Defining and training a sequence model using GRU layers
8. Loading a test video, extracting frame features, and making predictions using the sequence model

### Step 1: Downloading Data and Importing the Required Libraries
- Download the dataset
- Import the required libraries

In [1]:
!wget -q --no-check-certificate https://www.crcv.ucf.edu/data/UCF101/UCF101.rar
!wget -q --no-check-certificate https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-RecognitionTask.zip

In [2]:
%%capture
!unrar e UCF101.rar data/
!unzip -qq UCF101TrainTestSplits-RecognitionTask.zip

In [22]:
!pip install opencv-python



In [3]:
from tensorflow import keras
from imutils import paths

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os
import shutil

from tqdm import tqdm  # progress bar used by move_videos

Open the __.txt__ file that lists the training videos.

Create a DataFrame of the video names. The final row is dropped because it comes from the trailing newline at the end of the file.


In [4]:

# Use a context manager so the file is closed after reading
with open("ucfTrainTestlist/trainlist01.txt", "r") as f:
    temp = f.read()
videos = temp.split("\n")

train = pd.DataFrame()
train['video_name'] = videos
train = train[:-1]
train.head()

| | video_name |
|---|---|
| 0 | ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1 |
| 1 | ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi 1 |
| 2 | ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c03.avi 1 |
| 3 | ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c04.avi 1 |
| 4 | ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c05.avi 1 |


Open the __.txt__ file that lists the test videos.

Create a DataFrame of the video names. Unlike the training list, the test entries carry no class index after the file name.

In [5]:

with open("ucfTrainTestlist/testlist01.txt", "r") as f:
    temp = f.read()
videos = temp.split("\n")

test = pd.DataFrame()
test["video_name"] = videos
test = test[:-1]
test.head()

| | video_name |
|---|---|
| 0 | ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi |
| 1 | ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c02.avi |
| 2 | ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c03.avi |
| 3 | ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c04.avi |
| 4 | ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c05.avi |


- Define the __extract_tag__ function, which extracts the class tag from a video path by splitting the path on "/" and returning the first part.
- Define the __separate_video_name__ function, which separates the file name from the video path by splitting on "/" and returning the second part.
- Define the __rectify_video_name__ function, which removes the trailing class index from a training entry by splitting the video name on " " and returning the first part.
- Define the __move_videos__ function:
   - Check if the output directory exists. If not, create the directory using __os.mkdir__.
   - Iterate over the DataFrame, __df__, using a progress bar from the __tqdm__ library.
   - For each row in the DataFrame, extract the video file name from the __video_name__ column, create its path, and then copy the video file to the output directory using __shutil.copy2__.
   - After the loop ends, print the total number of videos in the output directory.


In [6]:
def extract_tag(video_path):
    return video_path.split("/")[0]

def separate_video_name(video_name):
    return video_name.split("/")[1]

def rectify_video_name(video_name):
    return video_name.split(" ")[0]

def move_videos(df, output_dir):
    if not os.path.exists(output_dir):
        os.mkdir(output_dir)
    for i in tqdm(range(df.shape[0])):
        videoFile = df['video_name'][i].split("/")[-1]
        videoPath = os.path.join("data", videoFile)
        shutil.copy2(videoPath, output_dir)
    print()
    print(f"Total videos: {len(os.listdir(output_dir))}")
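As a quick check of the three string helpers, the sketch below applies them to one entry copied from the training list shown earlier. The helper definitions are repeated only so the snippet runs on its own; the variable names `entry`, `tag`, and `file_name` are illustrative.

```python
def extract_tag(video_path):
    # The class tag is the directory part before the first "/".
    return video_path.split("/")[0]


def separate_video_name(video_name):
    # The file name is the part after the first "/".
    return video_name.split("/")[1]


def rectify_video_name(video_name):
    # Training entries end with " <class index>"; keep only the file name.
    return video_name.split(" ")[0]


entry = "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1"
tag = extract_tag(entry)
file_name = rectify_video_name(separate_video_name(entry))

print(tag)        # ApplyEyeMakeup
print(file_name)  # v_ApplyEyeMakeup_g08_c01.avi
```

The order matters for training entries: the tag has to be extracted before the path is reduced to the bare file name.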

### Step 2: Reading the Data from Datasets and Printing the Ten Rows

- Define the values of **IMG_SIZE**, **BATCH_SIZE**, **EPOCHS**, **MAX_SEQ_LENGTH**, and **NUM_FEATURES**


In [7]:
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 2

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048
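These constants also fix the size of the feature arrays built later. The following back-of-the-envelope sketch estimates that size, assuming float32 features (4 bytes each) and the 9537 training videos reported in Step 6. The names `BYTES_PER_FLOAT32`, `bytes_per_video`, `num_videos`, and `total_gib` are illustrative.

```python
MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048
BYTES_PER_FLOAT32 = 4

# One video is stored as a (MAX_SEQ_LENGTH, NUM_FEATURES) float32 array.
bytes_per_video = MAX_SEQ_LENGTH * NUM_FEATURES * BYTES_PER_FLOAT32
print(bytes_per_video)  # 163840

# Number of training videos, taken from the shape printed in Step 6.
num_videos = 9537
total_gib = num_videos * bytes_per_video / 2**30
print(f"{total_gib:.2f} GiB")  # 1.46 GiB
```

Increasing **MAX_SEQ_LENGTH** grows this array linearly, which is worth keeping in mind before raising it.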

### Step 3: Defining the Functions for Cropping and Loading Video Frames
- Define a function named **crop_center_square** that crops a frame to a centered square: it takes the smaller of the frame's height and width as the side length and computes the starting coordinates that center the crop.
- Define a function named __load_video__ that loads a video file, crops each frame to a square, resizes it, and reorders the color channels from BGR (the order OpenCV returns) to RGB:
   - Open the video file using **cv2.VideoCapture** and initialize an empty list called **frames**.
   - Read frames from the video, crop them to a square shape, resize them, convert the color channels, and append them to the **frames** list.
   - If **max_frames** frames have been collected or the video ends, exit the loop. With the default **max_frames=0**, the limit never triggers, so the whole video is read.
   - Release the video capture.
   - Convert the frame list to a NumPy array.
   - Return the array of frames.

In [8]:
def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)

__Observation:__
- The code defines two functions, **crop_center_square** and __load_video__, which can be used to crop frames from videos and load videos as arrays of frames, respectively.
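To make the cropping and channel reordering concrete, here is a minimal sketch on a synthetic frame, so no video file is needed. The 240x320 frame size and the pixel values are assumptions chosen for illustration; **crop_center_square** is repeated from the cell above so the snippet runs on its own.

```python
import numpy as np


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


# Synthetic frame in BGR order: channel 0 (blue) = 10, channel 2 (red) = 30.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
frame[..., 0] = 10
frame[..., 2] = 30

square = crop_center_square(frame)
rgb = square[:, :, [2, 1, 0]]  # same channel reordering as in load_video

print(square.shape)        # (240, 240, 3)
print(rgb[0, 0].tolist())  # [30, 0, 10]
```

The wider dimension is trimmed equally on both sides, and the red value moves from the last channel to the first.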

### Step 4: Building a Feature Extraction Model Using InceptionV3 Architecture

- Create a feature extractor from the InceptionV3 model in **keras.applications**, using ImageNet weights, no classification head (**include_top=False**), and global average pooling (**pooling="avg"**).
- Assign the **preprocess_input** function from **keras.applications.inception_v3** to the variable **preprocess_input**.
- Create an input layer with the shape __(IMG_SIZE, IMG_SIZE, 3)__ using **keras.Input**.
- Preprocess the input using the **preprocess_input** function.
- Pass the preprocessed input through the feature extractor to obtain the outputs.
- Create a model with the inputs and outputs using **keras.Model** and assign it to the variable **feature_extractor**.

In [9]:
def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5


__Observations:__
- The code defines a function **build_feature_extractor** that creates a feature extractor model using InceptionV3 architecture.
- The model takes inputs of size __(IMG_SIZE, IMG_SIZE, 3)__, preprocesses the inputs, and produces the outputs.
- The created feature extractor model is assigned to the variable **feature_extractor**.

In [10]:
train["tag"] = train["video_name"].apply(extract_tag)
train["video_name"] = train["video_name"].apply(separate_video_name)
train.head()

| | video_name | tag |
|---|---|---|
| 0 | v_ApplyEyeMakeup_g08_c01.avi 1 | ApplyEyeMakeup |
| 1 | v_ApplyEyeMakeup_g08_c02.avi 1 | ApplyEyeMakeup |
| 2 | v_ApplyEyeMakeup_g08_c03.avi 1 | ApplyEyeMakeup |
| 3 | v_ApplyEyeMakeup_g08_c04.avi 1 | ApplyEyeMakeup |
| 4 | v_ApplyEyeMakeup_g08_c05.avi 1 | ApplyEyeMakeup |


In [11]:
train["video_name"] = train["video_name"].apply(rectify_video_name)
train.head()

| | video_name | tag |
|---|---|---|
| 0 | v_ApplyEyeMakeup_g08_c01.avi | ApplyEyeMakeup |
| 1 | v_ApplyEyeMakeup_g08_c02.avi | ApplyEyeMakeup |
| 2 | v_ApplyEyeMakeup_g08_c03.avi | ApplyEyeMakeup |
| 3 | v_ApplyEyeMakeup_g08_c04.avi | ApplyEyeMakeup |
| 4 | v_ApplyEyeMakeup_g08_c05.avi | ApplyEyeMakeup |


In [12]:
test["tag"] = test["video_name"].apply(extract_tag)
test["video_name"] = test["video_name"].apply(separate_video_name)
test.head()

| | video_name | tag |
|---|---|---|
| 0 | v_ApplyEyeMakeup_g01_c01.avi | ApplyEyeMakeup |
| 1 | v_ApplyEyeMakeup_g01_c02.avi | ApplyEyeMakeup |
| 2 | v_ApplyEyeMakeup_g01_c03.avi | ApplyEyeMakeup |
| 3 | v_ApplyEyeMakeup_g01_c04.avi | ApplyEyeMakeup |
| 4 | v_ApplyEyeMakeup_g01_c05.avi | ApplyEyeMakeup |


In [14]:
n = 10  # number of most frequent classes to show
train["tag"].value_counts().nlargest(n).reset_index()

| | tag | count |
|---|---|---|
| 0 | Punch | 121 |
| 1 | PlayingCello | 120 |
| 2 | ShavingBeard | 118 |
| 3 | CricketShot | 118 |
| 4 | PlayingGuitar | 117 |
| 5 | TennisSwing | 117 |
| 6 | Drumming | 116 |
| 7 | HorseRiding | 115 |
| 8 | PlayingDhol | 115 |
| 9 | BoxingPunchingBag | 114 |


In [15]:
n = 10
topNActs = train["tag"].value_counts().nlargest(n).reset_index()["tag"].tolist()
train_new = train[train["tag"].isin(topNActs)]
test_new = test[test["tag"].isin(topNActs)]
train_new.shape, test_new.shape

((1171, 2), (459, 2))

**Observation:**
- The output **((1171, 2), (459, 2))** is a tuple showing the shapes of **train_new** and **test_new**. The **train_new** DataFrame has 1171 rows and 2 columns, and the **test_new** DataFrame has 459 rows and 2 columns.
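The filtering step keeps only rows whose tag is among the most frequent ones. The sketch below shows the same idea in plain Python with `collections.Counter` instead of pandas; the toy tags, counts, and file names are made up for illustration and are not taken from UCF101.

```python
from collections import Counter

# Toy (video_name, tag) rows; the counts are invented for the example.
tags = ["Punch"] * 4 + ["Drumming"] * 3 + ["Archery"] * 2 + ["Biking"]
rows = [(f"v_{tag}_{i}.avi", tag) for i, tag in enumerate(tags)]

n = 2
# most_common(n) plays the role of value_counts().nlargest(n).
top_tags = [tag for tag, _ in Counter(tags).most_common(n)]
# Keeping rows whose tag is in top_tags mirrors the isin(topNActs) filter.
kept = [row for row in rows if row[1] in top_tags]

print(top_tags)   # ['Punch', 'Drumming']
print(len(kept))  # 7
```

The two approaches may order tied counts differently, so this is a conceptual equivalent rather than a drop-in replacement.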

In [16]:
train_new = train_new.reset_index(drop=True)
test_new = test_new.reset_index(drop=True)

In [17]:
train_new

| | video_name | tag |
|---|---|---|
| 0 | v_BoxingPunchingBag_g08_c01.avi | BoxingPunchingBag |
| 1 | v_BoxingPunchingBag_g08_c02.avi | BoxingPunchingBag |
| 2 | v_BoxingPunchingBag_g08_c03.avi | BoxingPunchingBag |
| 3 | v_BoxingPunchingBag_g08_c04.avi | BoxingPunchingBag |
| 4 | v_BoxingPunchingBag_g08_c05.avi | BoxingPunchingBag |
| ... | ... | ... |
| 1166 | v_TennisSwing_g25_c02.avi | TennisSwing |
| 1167 | v_TennisSwing_g25_c03.avi | TennisSwing |
| 1168 | v_TennisSwing_g25_c04.avi | TennisSwing |
| 1169 | v_TennisSwing_g25_c05.avi | TennisSwing |


### Step 5: Creating a String Lookup Table for Labels and Printing the Vocabulary of the Label Processor
- Create a label processor using **keras.layers.StringLookup**
- Set the number of out-of-vocabulary (OOV) indices to **0**
- Set the vocabulary of the label processor to the unique values from the **tag** column of the **train** DataFrame
- Retrieve the vocabulary of the label processor using **label_processor.get_vocabulary()**

In [18]:
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train["tag"])
)
print(label_processor.get_vocabulary())

['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress', 'Biking', 'Billiards', 'BlowDryHair', 'BlowingCandles', 'BodyWeightSquats', 'Bowling', 'BoxingPunchingBag', 'BoxingSpeedBag', 'BreastStroke', 'BrushingTeeth', 'CleanAndJerk', 'CliffDiving', 'CricketBowling', 'CricketShot', 'CuttingInKitchen', 'Diving', 'Drumming', 'Fencing', 'FieldHockeyPenalty', 'FloorGymnastics', 'FrisbeeCatch', 'FrontCrawl', 'GolfSwing', 'Haircut', 'HammerThrow', 'Hammering', 'HandstandPushups', 'HandstandWalking', 'HeadMassage', 'HighJump', 'HorseRace', 'HorseRiding', 'HulaHoop', 'IceDancing', 'JavelinThrow', 'JugglingBalls', 'JumpRope', 'JumpingJack', 'Kayaking', 'Knitting', 'LongJump', 'Lunges', 'MilitaryParade', 'Mixing', 'MoppingFloor', 'Nunchucks', 'ParallelBars', 'PizzaTossing', 'PlayingCello', 'PlayingDaf', 'PlayingDhol', 'PlayingFlute', 'PlayingGuitar', 'PlayingPiano', 'PlayingSitar', 'PlayingTabla', 'P

__Observations:__
- The code creates a label processor that maps labels from text to integer indices.
- It uses the unique values from the **tag** column of the **train** DataFrame as the vocabulary for the label processor.
- The output is the vocabulary of the label processor, which is a list of unique labels.
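Conceptually, the lookup assigns each label its position in the sorted vocabulary. The sketch below mimics that mapping in plain Python, without Keras, on a made-up list of tags. It assumes the configuration used above (no out-of-vocabulary slots and no mask token), under which the first vocabulary entry gets index 0. The names `index_of`, `encoded`, and `decoded` are illustrative.

```python
# Toy tags; in the notebook these come from train["tag"].
tags = ["Punch", "Archery", "Drumming", "Archery", "Punch"]

# np.unique returns sorted unique values; sorted(set(...)) does the same here.
vocabulary = sorted(set(tags))
index_of = {tag: i for i, tag in enumerate(vocabulary)}

encoded = [index_of[tag] for tag in tags]
decoded = [vocabulary[i] for i in encoded]

print(vocabulary)  # ['Archery', 'Drumming', 'Punch']
print(encoded)     # [2, 0, 1, 0, 2]
```

Because the vocabulary doubles as the index-to-label table, the same list is used later to turn predicted indices back into class names.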

### Step 6: Preparing Video Data for Training and Testing by Extracting Frame Features
- Define a function named **prepare_all_videos** that takes a DataFrame (**df**) and a root directory (**root_dir**).
- Retrieve the video paths and labels from the DataFrame and encode the labels using **label_processor**.
- Initialize arrays to store frame masks and frame features for each video.
- Iterate over each video in the dataset, load the frames, and extract features using the **feature_extractor** model.
- Update the arrays with the extracted features and masks for each video.
- Call the **prepare_all_videos** function on the train and test DataFrames, storing the returned values in **train_data**, **train_labels**, **test_data**, and __test_labels__.
- Finally, print the shape of the frame features in the train set and the shape of the frame masks in the train set.

In [19]:
def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = label_processor(labels[..., None]).numpy()

    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    for idx, path in enumerate(video_paths):
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :]
                )
            temp_frame_mask[i, :length] = 1

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train, "train")
test_data, test_labels = prepare_all_videos(test, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

Frame features in train set: (9537, 20, 2048)
Frame masks in train set: (9537, 20)


__Observations:__
- The code processes videos by extracting frame features and creating frame masks.
- It then returns the frame features, frame masks, and labels for the train and test sets.
- The output is the shape of the frame features in the train set and the shape of the frame masks in the train set.
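The padding and masking logic can be looked at in isolation. The sketch below replaces the InceptionV3 features with arrays of ones and uses a toy feature width of 4 instead of 2048. The helper `pad_and_mask` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 4  # toy width; the notebook uses 2048


def pad_and_mask(video_features):
    """Truncate or zero-pad to MAX_SEQ_LENGTH and mark the real frames."""
    length = min(MAX_SEQ_LENGTH, video_features.shape[0])
    features = np.zeros((MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")
    mask = np.zeros((MAX_SEQ_LENGTH,), dtype="bool")
    features[:length] = video_features[:length]
    mask[:length] = True  # True = real frame, False = padding
    return features, mask


short_video = np.ones((12, NUM_FEATURES), dtype="float32")  # 12 frames
long_video = np.ones((35, NUM_FEATURES), dtype="float32")   # 35 frames

short_features, short_mask = pad_and_mask(short_video)
long_features, long_mask = pad_and_mask(long_video)

print(short_features.shape)                         # (20, 4)
print(int(short_mask.sum()), int(long_mask.sum()))  # 12 20
```

Videos shorter than **MAX_SEQ_LENGTH** are padded with zeros that the mask tells the GRU to ignore, while longer videos are cut to their first 20 frames.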

### Step 7: Defining and Training a Sequence Model Using GRU Layers
- Define a function named **get_sequence_model** that creates a sequence model for video classification
- Create input layers for frame features and masks
- Apply two GRU layers to the frame features input, with the second GRU layer returning only the last output
- Add a dropout layer, a dense layer with ReLU activation, and a final dense layer with softmax activation for the output
- Compile the model with sparse categorical cross-entropy loss, Adam optimizer, and accuracy metric
- Define a function named __run_experiment__ for running the training and evaluation
- Set up a checkpoint to save the best model during training
- Create the sequence model using **get_sequence_model**
- Train the model on the training data with a validation split, specified number of epochs, and the checkpoint callback
- Load the best weights saved during training
- Evaluate the model on the test data and print the test accuracy
- Return the history object and the trained sequence model
- Call the **run_experiment** function and keep the returned model in **sequence_model**; the returned history object is discarded in this notebook
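One detail of the validation split is worth illustrating. Keras documents that **validation_split** takes the held-out fraction from the end of the arrays, before any shuffling. Because the training list is ordered class by class, the last 30% can consist of classes the model rarely or never sees during training. The sketch below shows the effect on toy labels and one possible remedy, shuffling all arrays with the same permutation before calling **fit**. All names and sizes here are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Toy stand-ins for train_labels and train_data[0]: 5 classes, 4 clips each,
# stored class by class like the entries in trainlist01.txt.
labels = np.repeat(np.arange(5), 4)
features = np.arange(len(labels), dtype="float32").reshape(-1, 1)

# A 30% validation split holds out the last 30% of the samples as given.
n_val = len(labels) * 3 // 10
train_classes = set(labels[:-n_val].tolist())
val_classes = set(labels[-n_val:].tolist())
print(sorted(train_classes))  # [0, 1, 2, 3]
print(sorted(val_classes))    # [3, 4]

# Shuffling every array with the same permutation keeps samples aligned.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(labels))
shuffled_features = features[order]
shuffled_labels = labels[order]
```

In the notebook, the frame masks would need to be indexed with the same `order` so that features, masks, and labels stay matched.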


In [20]:
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model


def run_experiment():
    filepath = "/tmp/video_classifier"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


_, sequence_model = run_experiment()

Epoch 1/2
Epoch 1: val_loss improved from inf to 4.78489, saving model to /tmp/video_classifier
Epoch 2/2
Epoch 2: val_loss did not improve from 4.78489
Test accuracy: 1.16%


__Observations:__
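- Keras prints the training progress and validation metrics for each epoch. Here the validation loss is 4.78489 after epoch 1 and does not improve in epoch 2.
- After training, the best checkpoint is reloaded and evaluated on the test data. The reported test accuracy of 1.16% is close to the chance level for 101 classes (about 0.99%), so the model has not learned to separate the classes in this run.
- A possible cause, judging only from the cells shown: **move_videos** is defined but never called, so the **train** and **test** directories read by **prepare_all_videos** may not contain the videos. In that case, every feature and mask array would stay at zero. Training for only two epochs may also contribute.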
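To put the reported test accuracy in context, it helps to compare it with the accuracy of uniform random guessing. The short calculation below assumes 101 equally likely classes, which is only an approximation because the UCF101 classes are not perfectly balanced. The variable names are illustrative.

```python
num_classes = 101  # UCF101 has 101 action classes
chance_accuracy = 1 / num_classes
reported_accuracy = 0.0116  # the 1.16% printed by run_experiment

print(f"{chance_accuracy * 100:.2f}%")  # 0.99%

# How many times better than uniform guessing the model is.
ratio = reported_accuracy / chance_accuracy
print(f"{ratio:.2f}")  # 1.17
```

A ratio this close to 1 suggests that the classifier is barely better than guessing in this run.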

### Step 8: Loading a Test Video, Extracting Frame Features, and Making Predictions Using the Sequence Model
- Pick a random video name from the **test** DataFrame.
- Call the **sequence_prediction** function with the test video path.
- Within the **sequence_prediction** function:

  a. Get the vocabulary of the classes.

  b. Load the frames of the video.

  c. Prepare the frames for sequence prediction by calling **prepare_single_video**, which extracts frame features and creates a frame mask.

  d. Use the trained sequence model to predict the probabilities of each class for the video.

  e. Print the predicted class probabilities in descending order.

- Assign the frames of the test video to the variable test_frames.

In [21]:
def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames



test_video = np.random.choice(test["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)


Test video path: v_SoccerJuggling_g05_c06.avi
  PlayingSitar:  1.08%
  Drumming:  1.08%
  PlayingDhol:  1.08%
  CricketShot:  1.08%
  HorseRiding:  1.08%
  Hammering:  1.08%
  BoxingPunchingBag:  1.08%
  PoleVault:  1.08%
  IceDancing:  1.08%
  HeadMassage:  1.08%
  PlayingCello:  1.08%
  Billiards:  1.08%
  Bowling:  1.08%
  Kayaking:  1.08%
  JumpRope:  1.08%
  PlayingDaf:  1.08%
  BenchPress:  1.08%
  PlayingGuitar:  1.07%
  HammerThrow:  1.07%
  CliffDiving:  1.07%
  Diving:  1.07%
  Archery:  1.07%
  BandMarching:  1.07%
  HandstandPushups:  1.07%
  GolfSwing:  1.07%
  ApplyEyeMakeup:  1.07%
  PlayingFlute:  1.07%
  BabyCrawling:  1.07%
  CricketBowling:  1.07%
  FrontCrawl:  1.06%
  BaseballPitch:  1.06%
  Nunchucks:  1.06%
  Biking:  1.06%
  Haircut:  1.06%
  BlowDryHair:  1.06%
  Basketball:  1.06%
  FieldHockeyPenalty:  1.06%
  BrushingTeeth:  1.06%
  LongJump:  1.06%
  FloorGymnastics:  1.06%
  BasketballDunk:  1.06%
  BoxingSpeedBag:  1.06%
  HorseRace:  1.06%
  Knitting:  1

__Observations:__
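- The randomly selected test video is printed first; in this run it is v_SoccerJuggling_g05_c06.avi.
- The predicted class probabilities are printed in descending order, one class label per line. In the output shown, they are nearly uniform (about 1.06% to 1.08% per class), so the model expresses no clear preference for any class.
- The frames of the test video are assigned to the **test_frames** variable.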
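Printing all 101 classes makes the output long. A common variation is to report only the top few predictions. The sketch below shows the idea on a made-up probability vector, using the same **np.argsort** call as **sequence_prediction**; the names `top_k`, `top_indices`, and `top`, and the toy vocabulary, are assumptions for illustration.

```python
import numpy as np

# Toy vocabulary and probabilities standing in for the model output.
class_vocab = ["Archery", "Biking", "Drumming", "Punch"]
probabilities = np.array([0.10, 0.05, 0.60, 0.25])

top_k = 2
# argsort sorts ascending, so reverse it and keep the first top_k indices.
top_indices = np.argsort(probabilities)[::-1][:top_k]
top = [(class_vocab[i], float(probabilities[i])) for i in top_indices]

for name, p in top:
    print(f"  {name}: {p * 100:5.2f}%")
```

For the toy input this prints Drumming followed by Punch, the two most probable classes.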