1-Import and read the video, extract frames from it, and save them as images

2-Label a few images for training the model (Don’t worry, I have done it for you)

3-Build our model on training data

4-Make predictions for the remaining images

5-Calculate the screen time of both TOM and JERRY

### Let us start with importing all the necessary libraries

In [None]:
import cv2     # for capturing videos
import math   # for mathematical operations
import matplotlib.pyplot as plt    # for plotting the images
%matplotlib inline
import pandas as pd
from keras.preprocessing import image   # for preprocessing the images
import numpy as np    # for mathematical operations
from tensorflow.keras import utils as np_utils   # provides to_categorical(); newer Keras releases may not ship keras.utils.np_utils
from skimage.transform import resize   # for resizing images

### Step – 1: Read the video, extract frames from it and save them as images

Now we will load the video and convert it into frames. We will first capture the video from the given directory using the VideoCapture() function, then extract frames from it and save them as images using the imwrite() function. Let's code it.

In [None]:
count = 0
videoFile = "../input/video-classification-tutorial/Tom and jerry.mp4"
cap = cv2.VideoCapture(videoFile)   # capturing the video from the given path
frameRate = cap.get(cv2.CAP_PROP_FPS)   # frame rate (property id 5)
while cap.isOpened():
    frameId = cap.get(cv2.CAP_PROP_POS_FRAMES)   # index of the frame about to be read (property id 1)
    ret, frame = cap.read()
    if not ret:   # no more frames to read
        break
    if frameId % math.floor(frameRate) == 0:   # keep roughly one frame per second
        filename = "frame%d.jpg" % count
        count += 1
        cv2.imwrite(filename, frame)   # saved in the current working directory
cap.release()
print("Done!")

Once this process is complete, ‘Done!’ will be printed on the screen as confirmation that the frames have been created.

Let us try to visualize an image (frame). We will first read the image using the imread() function of matplotlib, and then plot it using the imshow() function.

In [None]:
img = plt.imread('./frame0.jpg')   # reading image using its name
plt.imshow(img)

Getting excited, yet?

This is the first frame from the video. We have extracted one frame for each second, from the entire duration of the video. Since the duration of the video is 4:58 minutes (298 seconds), we now have 298 images in total.
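
To make the "one frame per second" rule concrete, here is a minimal sketch of which frame numbers the loop keeps. The helper `sampled_frame_ids` and the clip sizes are hypothetical and only mirror the `frameId % math.floor(frameRate) == 0` test; no video is read. Note that for a non-integer frame rate such as 29.97, the floor makes the sampling interval slightly shorter than one second.

```python
import math

def sampled_frame_ids(total_frames, frame_rate):
    """Return the frame numbers that pass the modulo test used in the extraction loop."""
    step = math.floor(frame_rate)  # same rounding as the loop
    return [frame_id for frame_id in range(total_frames) if frame_id % step == 0]

# Hypothetical clip: 100 frames at 25 fps, so every 25th frame is kept
ids = sampled_frame_ids(100, 25.0)
print(ids)

# Hypothetical clip at 29.97 fps: the step is 29 frames, a little under one second
ids_ntsc = sampled_frame_ids(60, 29.97)
print(ids_ntsc)
```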

Our task is to identify which image has TOM, and which image has JERRY. If our extracted images had been similar to the ones in the popular ImageNet dataset, this challenge would have been a breeze. How? We could simply have used models pre-trained on ImageNet and achieved a high accuracy score! But then where's the fun in that?

We have cartoon images, so it'll be very difficult (if not impossible) for any pre-trained model to identify TOM and JERRY in a given video.

### Step – 2: Label a few images for training the model

So how do we go about handling this? A possible solution is to manually give labels to a few of the images and train the model on them. Once the model has learned the patterns, we can use it to make predictions on a previously unseen set of images.

Keep in mind that there could be frames in which neither TOM nor JERRY is present. So, we will treat this as a multi-class classification problem. The classes I have defined are:


0 – neither JERRY nor TOM

1 – for JERRY

2 – for TOM

Don't worry, you don't have to label the frames yourself. Go ahead and download the mapping.csv file, which contains each image name and its corresponding class (0, 1, or 2).

In [None]:
data = pd.read_csv('../input/video-classification-tutorial/mapping.csv')     # reading the csv file
data.head()      # printing first five rows of the file

The mapping file contains two columns:

Image_ID: Contains the name of each image

Class: Contains corresponding class for each image
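
Before training, it can help to check how many frames fall into each class. The sketch below does this with the standard library only; the four rows in `sample_csv` are made up for illustration (they are not taken from the real mapping.csv), and `CLASS_NAMES` is a hypothetical helper that mirrors the three classes defined above.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same two-column layout as mapping.csv
sample_csv = "Image_ID,Class\nframe0.jpg,1\nframe1.jpg,1\nframe2.jpg,2\nframe3.jpg,0\n"

# Hypothetical lookup table for the three classes
CLASS_NAMES = {0: "neither", 1: "JERRY", 2: "TOM"}

rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts = Counter(CLASS_NAMES[int(row["Class"])] for row in rows)
print(counts)
```

With the real file, the same count is available from the DataFrame via `data.Class.value_counts()`.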

Our next step is to read the images, which we will do based on their names, i.e., the Image_ID column.

In [None]:
X = []     # creating an empty list
for img_name in data.Image_ID:
    img = plt.imread(img_name)   # frames were saved in the current working directory
    X.append(img)  # storing each image in list X
X = np.array(X)    # converting list to array

In [None]:
X

Tada! We now have the images with us. Remember, we need two things to train our model:


Training images, and

Their corresponding class

Since there are three classes, we will one hot encode them using the to_categorical() function of keras.utils.

In [None]:
y = data.Class
dummy_y = np_utils.to_categorical(y)    # one hot encoding Classes
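
As a rough illustration of what one-hot encoding produces, here is a NumPy-only sketch. The helper `one_hot` is hypothetical and is not the Keras implementation; it only reproduces the shape of the result for integer labels 0, 1, and 2.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype="float32")[labels]

# Four hypothetical labels: neither, JERRY, TOM, JERRY
encoded = one_hot([0, 1, 2, 1], num_classes=3)
print(encoded)
```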

We will be using a **VGG16** pretrained model, which takes an input image of shape (224 X 224 X 3). Since our images are a different size, we need to resize all of them. We will use the resize() function of skimage.transform to do this.

In [None]:
image = []
for i in range(0,X.shape[0]):
    a = resize(X[i], preserve_range=True, output_shape=(224,224)).astype(int)      # resizing to 224*224*3
    image.append(a)
X = np.array(image)

All the images have been resized to 224 X 224 X 3. But before passing any input to the model, we must preprocess it as per the model's requirements. Otherwise, the model will not perform well. Use the preprocess_input() function of **keras.applications.vgg16** to perform this step.

In [None]:
from tensorflow.keras.applications.vgg16 import preprocess_input   # same package as the VGG16 model loaded later
X = preprocess_input(X)      # preprocessing the input data
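
For intuition, the VGG16 preprocessing amounts to flipping the channels from RGB to BGR and subtracting the per-channel ImageNet mean, with no further scaling. The NumPy sketch below imitates that behaviour; `vgg16_style_preprocess` is a hypothetical helper written for illustration, not the Keras function itself.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by VGG-style preprocessing
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype="float32")

def vgg16_style_preprocess(batch_rgb):
    """RGB -> BGR, then subtract the ImageNet mean of each channel."""
    batch = np.asarray(batch_rgb, dtype="float32")
    batch_bgr = batch[..., ::-1]          # reverse the channel axis
    return batch_bgr - IMAGENET_BGR_MEAN  # broadcast over the last axis

# One hypothetical 1x1 image with R=10, G=20, B=30
demo = np.array([[[[10, 20, 30]]]])
out = vgg16_style_preprocess(demo)
print(out)
```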

We also need a validation set to check the performance of the model on unseen images. We will use the train_test_split() function of the sklearn.model_selection module to randomly divide the images into training and validation sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, dummy_y, test_size=0.3, random_state=42)  
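
The split sizes can be worked out by hand. The sketch below assumes that, for a fractional test_size, the validation count is rounded up and the training set receives the remainder; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Estimate (train, validation) counts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumed rounding: up
    n_train = n_samples - n_test
    return n_train, n_test

# 298 labelled frames with test_size=0.3
n_train, n_valid = split_sizes(298, 0.3)
print(n_train, n_valid)
```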

### Step – 3: Building the model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import Dense, InputLayer, Dropout

We will now load the VGG16 pretrained model and store it as base_model:

In [None]:
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))    # include_top=False to remove the top layer

We will make predictions using this model for X_train and X_valid to get the features, and then use those features to train a new classifier on top of them.

In [None]:
X_train = base_model.predict(X_train)
X_valid = base_model.predict(X_valid)
X_train.shape, X_valid.shape

The shapes of X_train and X_valid are (208, 7, 7, 512) and (90, 7, 7, 512), respectively. To pass them to our neural network, we have to flatten each sample into a 1-D vector.

In [None]:
X_train = X_train.reshape(X_train.shape[0], -1)      # flattening each sample to a 1-D vector of 7*7*512 values
X_valid = X_valid.reshape(X_valid.shape[0], -1)

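We will now scale the features by dividing them by the maximum value of the training set. This rescales the training values to the range 0 to 1 (the VGG16 output here is non-negative) rather than zero-centering them; keeping the inputs on a common scale helps the model converge faster.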

In [None]:
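max_val = X_train.max()       # scale factor taken from the training features only
train = X_train/max_val       # scaling the training features to [0, 1]
X_valid = X_valid/max_val     # applying the same scale to the validation features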
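
A small sketch of what this division does, using made-up numbers: the training features end up between 0 and 1, their mean stays positive, and validation values can exceed 1 because the scale factor comes from the training set alone. The array names are illustrative only.

```python
import numpy as np

# Tiny made-up feature matrices (non-negative, like the VGG16 output)
train_feats = np.array([[0.0, 2.0], [4.0, 8.0]])
valid_feats = np.array([[1.0, 10.0]])

scale = train_feats.max()           # 8.0, computed on training data only
train_scaled = train_feats / scale  # values now lie in [0, 1]
valid_scaled = valid_feats / scale  # same factor; may exceed 1

print(train_scaled.min(), train_scaled.max(), valid_scaled.max())
```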

Finally, we will build our model. This step can be divided into 3 sub-steps:

1-Building the model

2-Compiling the model

3-Training the model

In [None]:
# i. Building the model
model = Sequential()
model.add(InputLayer((7*7*512,)))    # input layer
model.add(Dense(units=1024, activation='sigmoid')) # hidden layer
model.add(Dense(3, activation='softmax'))    # output layer

In [None]:
model.summary()
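
The parameter counts that a summary reports for fully connected layers can be verified by hand: each Dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(7 * 7 * 512, 1024)  # flattened VGG16 features -> hidden layer
output = dense_params(1024, 3)            # hidden layer -> 3 classes
total = hidden + output
print(hidden, output, total)
```

Almost all of the trainable parameters sit in the hidden layer, which is a direct consequence of the 25,088-value input vector.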

We have a hidden layer with 1,024 neurons and an output layer with 3 neurons (since we have 3 classes to predict). Now we will compile our model:

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In the final step, we will fit the model and, at the same time, check its performance on the unseen validation images:

In [None]:
model.fit(train, y_train, epochs=100, validation_data=(X_valid, y_valid))

We can see it is performing really well on the training as well as the validation images. We got an accuracy of around 92% on unseen images. And this is how we train a model on video data to get predictions for each frame.