## Video Analysis Using Deep Learning Algorithm

Identify a character in a TV series and calculate their screen time.

In this project, our aim is to turn an image classification problem into a video analysis problem. We wanted to build a model that automatically identifies a specific person in a given video at a particular time interval, and it turned out that we can.

A small application of this idea is calculating the screen time of a person in a video.

This project focuses on that application: we built a small proof of concept (POC) to check how it works.

### The complete steps from a bird's-eye view:
1. Import and read the video, extract frames from it, and save them as images

2. Label a few images for training the model 

3. Build our model on training data

4. Make predictions for the remaining images

5. Calculate the screen time 

In [21]:
import cv2

In [14]:
!pip install pytube

Collecting pytube
  Downloading pytube-12.0.0-py3-none-any.whl (56 kB)
Installing collected packages: pytube
Successfully installed pytube-12.0.0


## Step – 1: Read the video, extract frames from it and save them as images


We first open the video from the given directory with the `VideoCapture()` function, then extract its frames and save each one as an image with the `imwrite()` function.

In [3]:
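import os

# Open the video file
cap = cv2.VideoCapture('The Big Bang Theory - Science is dead.3gpp')
i = 0

image_folder = 'img'
os.makedirs(image_folder, exist_ok=True)  # cv2.imwrite does not create missing folders
while True:
    ret, frame = cap.read()

    # ret is False once there are no more frames to read
    if not ret:
        break
    cv2.imwrite(os.path.join(image_folder, str(i) + '.jpg'), frame)
    i += 1

cap.release()
cv2.destroyAllWindows()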

The video is now converted into individual frames. This is a binary classification problem with two classes, “Sheldon” and “No Sheldon”. To create a dataset, the frames have to be sorted into these two classes manually. For this, I created a folder named “data” with two sub-folders, “sheldon” and “no_sheldon” (the loading code matches the lowercase name `sheldon` exactly), and manually moved the images into them. With the dataset in place, we are ready to dive into the code and concepts.
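
Before training, it can help to confirm how many images ended up in each class folder. The following is a minimal sketch, assuming the layout described above (one sub-folder per class). The helper name `count_images_per_class` and the extension filter are illustrative choices, not part of the original notebook:

```python
import os

def count_images_per_class(root, extensions=('.jpg', '.jpeg', '.png')):
    """Return {sub-folder name: number of image files directly inside it}."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue  # ignore stray files next to the class folders
        counts[name] = sum(
            1 for f in os.listdir(folder)
            if f.lower().endswith(extensions)
        )
    return counts

# Example call (hypothetical path): count_images_per_class('data')
```

A strongly unbalanced count is worth knowing about before reading the accuracy figures later on.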

### Input Data and Preprocessing

Our data is a set of images. To prepare it as input for the neural network, we preprocess it with the following steps:

- Read the images one by one using OpenCV
- Resize each image to (224, 224, 3), the input size of the model
- Divide the pixel values by 255 so that all input features lie in the same [0, 1] range
- Append the image to the list for its class

In [3]:
from tqdm import tqdm
import cv2
import os
import numpy as np

img_path = '/content/drive/MyDrive/GG_DL_Project/data'

class1_data = []  # images from the 'sheldon' folder
class2_data = []  # images from every other folder
for classes in os.listdir(img_path):
    fin_path = os.path.join(img_path, classes)
    for fin_classes in tqdm(os.listdir(fin_path)):
        img = cv2.imread(os.path.join(fin_path, fin_classes))
        # cv2.imread returns None for files it cannot decode, so skip those
        if img is None:
            continue
        img = cv2.resize(img, (224, 224))
        img = img / 255.
        if classes == 'sheldon':
            class1_data.append(img)
        else:
            class2_data.append(img)

class1_data = np.array(class1_data)
class2_data = np.array(class2_data)

100%|██████████| 2330/2330 [00:15<00:00, 148.32it/s]
100%|██████████| 1145/1145 [00:24<00:00, 46.07it/s] 


Here we use the VGG16 model pretrained on the ImageNet dataset, through Keras, the high-level API of TensorFlow. With Keras, you can import the VGG16 model directly, as shown in the code below.

In [4]:
import keras
from keras.applications.vgg16 import VGG16

vgg_model = VGG16(include_top=False, weights='imagenet')

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


VGG16 trained on ImageNet predicts 1,000 classes, but this problem has only two, “Sheldon” and “No Sheldon”. That is why we pass `include_top=False` above, which leaves out the fully connected layers of VGG16. Next, we pass our input data through `vgg_model` to generate the features.

In [5]:
vgg_class1 = vgg_model.predict(class1_data)

In [6]:
vgg_class2 = vgg_model.predict(class2_data)

Since we left out the fully connected layers of VGG16, we need to build our own model with a few fully connected layers and a single sigmoid output unit that separates “Sheldon” from “No Sheldon”. The features produced by VGG16 have shape 7*7*512 per image, which, once flattened, is the input shape of our model. I also use dropout layers to reduce overfitting. Let’s see the code:

In [7]:
from keras.layers import Input, Dense, Dropout
from keras.models import Model

inputs = Input(shape=(7*7*512,))

dense1 = Dense(1024, activation = 'relu')(inputs)
drop1 = Dropout(0.5)(dense1)
dense2 = Dense(512, activation = 'relu')(drop1)
drop2 = Dropout(0.5)(dense2)
outputs = Dense(1, activation = 'sigmoid')(drop2)

model = Model(inputs, outputs)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 25088)]           0         
                                                                 
 dense (Dense)               (None, 1024)              25691136  
                                                                 
 dropout (Dropout)           (None, 1024)              0         
                                                                 
 dense_1 (Dense)             (None, 512)               524800    
                                                                 
 dropout_1 (Dropout)         (None, 512)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 513       
                                                                 
Total params: 26,216,449
Trainable params: 26,216,449
Non-tra
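
The figures in the summary can be checked by hand. A dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and the five max-pooling stages of VGG16 reduce a 224x224 input to 7x7 with 512 channels. A short sketch of that arithmetic follows (the helper `dense_params` is introduced here for illustration only):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# VGG16 halves the spatial size in each of its 5 max-pooling stages.
side = 224 // 2**5              # 7
n_features = side * side * 512  # 25088, the flattened input size

layer_sizes = [n_features, 1024, 512, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [25691136, 524800, 513]
print(sum(per_layer))  # 26216449
```

Almost all of the parameters sit in the first dense layer, because it is connected to all 25088 input features.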

Now we have the input features from VGG16 and our own network architecture defined above. The next step is to train this neural network.

In [8]:
train_data = np.concatenate((vgg_class1[:3000], vgg_class2[:2000]), axis = 0)
train_data = train_data.reshape(train_data.shape[0],7*7*512)

valid_data = np.concatenate((vgg_class1[3000:], vgg_class2[2000:]), axis = 0)
valid_data = valid_data.reshape(valid_data.shape[0],7*7*512)

In [9]:
train_label = np.array([0]*vgg_class1[:3000].shape[0] + [1]*vgg_class2[:2000].shape[0])
valid_label = np.array([0]*vgg_class1[3000:].shape[0] + [1]*vgg_class2[2000:].shape[0])
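
Note that the fixed slice indices above assume that the two classes have more than 3000 and 2000 samples respectively. In Python, a slice such as `x[3000:]` is simply empty when `x` is shorter than that. As a hedged alternative (not what the notebook does), the indices of each class could be shuffled and split by fraction. The function name, the 80/20 ratio, and the seed below are assumptions for illustration:

```python
import random

def split_indices(n_samples, valid_fraction=0.2, seed=0):
    """Shuffle range(n_samples) and return (train_idx, valid_idx) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, leaves global state alone
    n_valid = int(round(n_samples * valid_fraction))
    return indices[n_valid:], indices[:n_valid]

# Split each class separately so both classes appear in both sets.
train_1, valid_1 = split_indices(10)
train_2, valid_2 = split_indices(5)
print(len(train_1), len(valid_1), len(train_2), len(valid_2))  # 8 2 4 1
```

The resulting index lists can be used with NumPy indexing, for example `vgg_class1[train_1]`.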

## Training the Network

We will use stochastic gradient descent as the optimizer and binary cross-entropy as the loss function.

In [11]:
import tensorflow as tf
from keras.callbacks import ModelCheckpoint

#tf.logging.set_verbosity(tf.logging.ERROR)
model.compile(optimizer = 'sgd', loss = 'binary_crossentropy', metrics = ['accuracy'])

filepath="best_model.hdf5"
# With metrics=['accuracy'], the validation metric is logged as 'val_accuracy'
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
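
For reference, binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `y` is the 0/1 label and `p` is the sigmoid output. The following is a minimal pure-Python sketch of the loss named above. The clipping constant `eps` is an assumption added to avoid `log(0)` and does not necessarily match the value Keras uses internally:

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Confident, correct predictions give a small loss...
print(binary_cross_entropy([0, 1], [0.1, 0.9]))
# ...while confident, wrong predictions give a large one.
print(binary_cross_entropy([0, 1], [0.9, 0.1]))
```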

I am using a batch size of 64 and 10 epochs to train.

In [12]:
model.fit(train_data, train_label, epochs = 10, batch_size = 64, validation_data = (valid_data, valid_label), verbose = 2, callbacks = callbacks_list)

Epoch 1/10
50/50 - 10s - loss: 0.6143 - accuracy: 0.7099 - val_loss: 0.0150 - val_accuracy: 1.0000 - 10s/epoch - 195ms/step
Epoch 2/10
50/50 - 9s - loss: 0.3986 - accuracy: 0.8362 - val_loss: 0.3255 - val_accuracy: 0.9453 - 9s/epoch - 173ms/step
Epoch 3/10
50/50 - 9s - loss: 0.2646 - accuracy: 0.8985 - val_loss: 0.0505 - val_accuracy: 0.9848 - 9s/epoch - 171ms/step
Epoch 4/10
50/50 - 9s - loss: 0.2415 - accuracy: 0.8998 - val_loss: 0.1035 - val_accuracy: 0.9635 - 9s/epoch - 171ms/step
Epoch 5/10
50/50 - 9s - loss: 0.1465 - accuracy: 0.9478 - val_loss: 2.8720 - val_accuracy: 0.0000e+00 - 9s/epoch - 172ms/step
Epoch 6/10
50/50 - 9s - loss: 0.2211 - accuracy: 0.9243 - val_loss: 0.0174 - val_accuracy: 1.0000 - 9s/epoch - 173ms/step
Epoch 7/10
50/50 - 9s - loss: 0.1587 - accuracy: 0.9475 - val_loss: 0.0301 - val_accuracy: 0.9787 - 9s/epoch - 174ms/step
Epoch 8/10
50/50 - 9s - loss: 0.0643 - accuracy: 0.9857 - val_loss: 0.0102 - val_accuracy: 1.0000 - 9s/epoch - 176ms/step
Epoch 9/10
50/50 -

<keras.callbacks.History at 0x7efc6e0256d0>

## Test video: collecting a new video

In [15]:
# Test video


from pytube import YouTube as yt

video_link = 'https://www.youtube.com/watch?v=g_j869cfKDs'
vid = yt(video_link)

stream = vid.streams.first()
stream.download()

'/content/Sheldon On Teaching Women And then uses Google.3gpp'

## Extracting frames for test data

In [16]:
### Test video extraction 


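import os

# Open the video file
cap = cv2.VideoCapture('Sheldon On Teaching Women And then uses Google.3gpp')
i = 0

image_folder = '/content/drive/MyDrive/GG_DL_Project/test_data'
os.makedirs(image_folder, exist_ok=True)  # cv2.imwrite does not create missing folders
while True:
    ret, frame = cap.read()

    # ret is False once there are no more frames to read
    if not ret:
        break
    cv2.imwrite(os.path.join(image_folder, str(i) + '.jpg'), frame)
    i += 1

cap.release()
cv2.destroyAllWindows()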

## Calculating screen time

To test the trained model and calculate the screen time, I downloaded another video clip from YouTube and extracted its frames. First, the trained model predicts the class of each frame, either “Sheldon” or “No Sheldon”. Since the video runs at 24 frames per second, we count the number of frames predicted to contain “Sheldon” and divide it by 24 to get the number of seconds “Sheldon” was on screen.

In [19]:
import os
import numpy as np

sheldon_images = []
no_sheldon_images = []

test_path = '/content/drive/MyDrive/GG_DL_Project/test_data'

for test in tqdm(os.listdir(test_path)):
    test_img = cv2.imread(os.path.join(test_path, test))
    # Skip files that OpenCV could not read
    if test_img is None:
        continue
    test_img = cv2.resize(test_img, (224, 224))
    test_img = test_img / 255.
    test_img = np.expand_dims(test_img, 0)   # add a batch dimension
    pred_img = vgg_model.predict(test_img)   # VGG16 features
    pred_feat = pred_img.reshape(1, 7*7*512)
    out_class = model.predict(pred_feat)
    # Label 0 was assigned to the Sheldon class during training
    if out_class < 0.5:
        sheldon_images.append(out_class)
    else:
        no_sheldon_images.append(out_class)

100%|██████████| 2397/2397 [25:38<00:00,  1.56it/s]
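
The loop above calls `predict` once per frame, which took about 25 minutes for 2397 frames according to the progress bar. Passing several frames per call would likely be faster. The following is a minimal sketch of the batching logic only, with the model calls left out. `chunks` is a hypothetical helper and the batch size is an arbitrary choice:

```python
def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

frame_names = [f'{i}.jpg' for i in range(5)]
for batch in chunks(frame_names, 2):
    # In the notebook, each batch would be loaded, stacked into one array
    # and passed through vgg_model.predict and model.predict in one call.
    print(batch)
```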


In [20]:
print(len(sheldon_images))

544


This test clip runs at 24 frames per second, and 544 frames were predicted to contain “Sheldon”. So the screen time for Sheldon is 544/24 ≈ 22.7 seconds.
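
The same calculation can be written as a small function. The frame rate is a parameter here because 24 fps is the notebook's assumption; with OpenCV, the actual value can be read from the video via `cap.get(cv2.CAP_PROP_FPS)`. The function name is hypothetical:

```python
def screen_time_seconds(n_frames, fps=24.0):
    """Convert a count of frames into seconds of screen time."""
    if fps <= 0:
        raise ValueError('fps must be positive')
    return n_frames / fps

seconds = screen_time_seconds(544)
print(f'{seconds:.1f} s')  # 22.7 s
```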