<a href="https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Prep_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data collection

To keep the runtime of this example relatively short, we use a subsampled version of the original UCF101 dataset. See [this notebook](https://github.com/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) for how the subsampling was done.

In [1]:
!wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz

## Setup

In [2]:
from tensorflow import keras

import tensorflow as tf
import pandas as pd 
import numpy as np
import cv2
import os

## Define hyperparameters

In [3]:
IMG_SIZE = 128
MAX_SEQ_LENGTH = 20
NUM_FEATURES = 1024

## Data preparation

In [4]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

Total videos for training: 594
Total videos for testing: 224


In [5]:
train_df.sample(10)

Unnamed: 0,video_name,tag
132,v_PlayingCello_g10_c02.avi,PlayingCello
94,v_CricketShot_g22_c03.avi,CricketShot
164,v_PlayingCello_g15_c02.avi,PlayingCello
502,v_TennisSwing_g11_c06.avi,TennisSwing
163,v_PlayingCello_g15_c01.avi,PlayingCello
500,v_TennisSwing_g11_c04.avi,TennisSwing
365,v_ShavingBeard_g08_c07.avi,ShavingBeard
222,v_PlayingCello_g23_c06.avi,PlayingCello
68,v_CricketShot_g17_c06.avi,CricketShot
289,v_Punch_g15_c04.avi,Punch


In [6]:
# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub

def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]

def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
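            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            # OpenCV returns frames in BGR order; reorder the channels to RGB.
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            # With the default max_frames=0 this never triggers, so every frame is read.
            if len(frames) == max_frames:
                break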
    finally:
        cap.release()
    return np.array(frames) 
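
The cropping and channel reordering used in `load_video` can be checked in isolation, without a video file. The snippet below is a minimal sketch that assumes a synthetic 240x320 frame (a made-up stand-in for a decoded BGR frame) and repeats `crop_center_square` so that it runs on its own.

```python
import numpy as np


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


# Hypothetical frame: 240 rows, 320 columns, 3 channels in BGR order.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
frame[:, :, 0] = 255  # fill only the blue channel (index 0 in BGR)

# The longer side is trimmed equally on both ends to get a square.
cropped = crop_center_square(frame)

# Same channel swap as in `load_video`: BGR -> RGB.
rgb = cropped[:, :, [2, 1, 0]]

print(cropped.shape)  # (240, 240, 3)
print(rgb[0, 0])      # blue now sits in the last channel
```

The crop keeps the full height here because the height is the shorter side; only the width is trimmed.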

In [7]:
def build_feature_extractor():
    feature_extractor = keras.applications.DenseNet121(weights="imagenet", 
                                                       include_top=False, pooling="avg",
                                                       input_shape=(IMG_SIZE, IMG_SIZE, 3))
    preprocess_input = keras.applications.densenet.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")

feature_extractor = build_feature_extractor()

In [8]:
label_processor = keras.layers.experimental.preprocessing.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"]),  mask_token=None
)
print(label_processor.get_vocabulary())

['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']
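
Because the vocabulary comes from `np.unique`, it is sorted alphabetically, and with `num_oov_indices=0` and `mask_token=None` each label should map to its position in that list. The snippet below is a NumPy-only sketch of that mapping, not the `StringLookup` layer itself; the `tags` array and the `label_to_index` dictionary are illustrative names.

```python
import numpy as np

# Hypothetical sample of the `tag` column.
tags = np.array(
    ["Punch", "CricketShot", "TennisSwing", "Punch", "PlayingCello", "ShavingBeard"]
)

# `np.unique` returns the sorted unique labels, which serve as the vocabulary.
vocabulary = np.unique(tags)

# Assumed equivalent of the lookup: a label's index is its position in the vocabulary.
label_to_index = {label: index for index, label in enumerate(vocabulary)}

# Keep a trailing axis so the result has shape (num_samples, 1),
# mirroring `labels[..., None]` in `prepare_all_videos`.
encoded = np.array([[label_to_index[tag]] for tag in tags])

print(vocabulary.tolist())
print(encoded.ravel().tolist())  # [2, 0, 4, 2, 1, 3]
```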


In [9]:
def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = label_processor(labels[..., None]).numpy()
    
    # `frame_features` are what we will feed to our sequence model.
    frame_features = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES),
                                dtype="float32")
    
    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        
        # Pad shorter videos.
        if len(frames) < MAX_SEQ_LENGTH:
            diff = MAX_SEQ_LENGTH - len(frames)
            padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
            # `np.concatenate` takes the arrays as a single sequence argument.
            frames = np.concatenate((frames, padding))

        frames = frames[None, ...]
        
        # Initialize placeholder to store the features of the current video. 
        temp_frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES),
                                dtype="float32")

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            # `batch` has shape (num_frames, IMG_SIZE, IMG_SIZE, 3),
            # so the frame count is the first axis.
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                # All-zero frames are padding; leave their features at zero.
                if np.mean(batch[j, :]) > 0.0:
                    temp_frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
                else:
                    temp_frame_features[i, j, :] = 0.0

        frame_features[idx] = temp_frame_features.squeeze()

    return frame_features, labels
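
The padding and truncation logic inside `prepare_all_videos` can be isolated into a small helper, which makes it easier to test on its own. The sketch below is one possible way to do this, assuming frames arrive as a 4-D array of shape `(num_frames, height, width, 3)`; the helper name `pad_or_truncate` and the returned boolean mask are additions for illustration and are not part of the original notebook.

```python
import numpy as np

IMG_SIZE = 128
MAX_SEQ_LENGTH = 20


def pad_or_truncate(frames, max_len=MAX_SEQ_LENGTH):
    """Return exactly `max_len` frames plus a mask marking the real ones.

    Hypothetical helper; assumes `frames` is a 4-D array
    (num_frames, height, width, channels) with at least one frame.
    """
    frames = frames[:max_len]  # drop frames beyond the limit
    diff = max_len - len(frames)
    if diff > 0:
        # Zero frames act as padding, as in `prepare_all_videos`.
        padding = np.zeros((diff,) + frames.shape[1:], dtype=frames.dtype)
        frames = np.concatenate((frames, padding), axis=0)
    # True for real frames, False for padded positions.
    mask = np.arange(max_len) < (max_len - diff)
    return frames, mask


# Synthetic videos: one shorter and one longer than MAX_SEQ_LENGTH.
short_video = np.ones((12, IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
long_video = np.ones((35, IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)

short_frames, short_mask = pad_or_truncate(short_video)
long_frames, long_mask = pad_or_truncate(long_video)

print(short_frames.shape, int(short_mask.sum()))  # (20, 128, 128, 3) 12
print(long_frames.shape, int(long_mask.sum()))    # (20, 128, 128, 3) 20
```

Returning an explicit mask avoids having to infer padded positions from pixel means later; whether a downstream sequence model uses it is a separate design choice.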

In [10]:
train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data.shape}")

Frame features in train set: (594, 20, 1024)


The above code block takes about 20 minutes to run, depending on the machine it is executed on.

## Serialize data for later use

In [11]:
np.save("train_data.npy", train_data, fix_imports=True, allow_pickle=False)
np.save("train_labels.npy", train_labels, fix_imports=True, allow_pickle=False)
np.save("test_data.npy", test_data, fix_imports=True, allow_pickle=False)
np.save("test_labels.npy", test_labels, fix_imports=True, allow_pickle=False)
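
Before archiving the files, it can be worth confirming that the arrays survive a save and load round trip. The snippet below is a minimal sketch using small synthetic arrays and a temporary directory, so it does not touch the real `.npy` files; the array sizes and variable names are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins with the same trailing dimensions as the real data.
features = np.random.rand(4, 20, 1024).astype("float32")
labels = np.array([[0], [1], [2], [3]], dtype="int64")

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "train_data.npy")
    labels_path = os.path.join(tmp_dir, "train_labels.npy")

    # Plain numeric arrays do not need pickling, so it can stay disabled.
    np.save(data_path, features, allow_pickle=False)
    np.save(labels_path, labels, allow_pickle=False)

    loaded_features = np.load(data_path, allow_pickle=False)
    loaded_labels = np.load(labels_path, allow_pickle=False)

# The loaded arrays should match the originals exactly.
print(loaded_features.shape, loaded_features.dtype)
print(np.array_equal(features, loaded_features))
print(np.array_equal(labels, loaded_labels))
```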

In [12]:
!tar czf top5_data_prepared.tar.gz train_data.npy train_labels.npy test_data.npy test_labels.npy

In [14]:
!cp top5_data_prepared.tar.gz /content/drive/MyDrive