# Object Detection on Video Inputs Using YOLOv3

**Having examined the performance of SSD300 on images, we now extend our analysis of object detection to video sequences. This work is motivated by use cases such as the following:**

- **Tracking suspicious activities in a crowded field. For example, the model can be retrained to capture instances of backpacks such as the one used by the Boston Bomber**
- **Self driving cars**
- **Object detection for video searching: rather than searching for videos based on file names, AI-based searches can find videos based on the objects in them**

## Introduction to YOLOv3

**YOLO stands for "You Only Look Once", and YOLOv3 is its third version. It is reported as an improvement over SSD [1]: the diagram below shows a mean Average Precision (mAP) of 60.6 for YOLOv3-spp, compared with a mAP of 41.2 for SSD300.**

<img src='imgs/yolov3-perfomance.png' alt='Yolo Comparison' style="height: 400px;width: 400px;"/>

## Yolo v3 Architecture

<img src='imgs/yolov3-artichecture.png' alt='yolov3-artichecture'/>

### Bounding Box Prediction

**YOLOv3 predicts four values per bounding box, tx, ty, tw and th, as indicated in the diagram below**

<img src='imgs/bound_box.png' alt='Bounding box prediction' style="height: 400px;width: 400px;"/>
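
As a rough sketch of how these four raw values become a box, the YOLOv3 paper [1] applies a sigmoid to tx and ty to obtain an offset inside the grid cell, and scales an anchor (prior) box by the exponentials of tw and th. The function name `decode_box` and the sample numbers are illustrative assumptions; the notebook itself never decodes boxes by hand, because the detections it consumes later are already normalized box coordinates.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw predictions into (center x, center y, width, height).

    cx, cy: offset of the grid cell from the top-left of the image, in grid units.
    pw, ph: width and height of the anchor (prior) box.
    """
    bx = sigmoid(tx) + cx   # the sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # the anchor is rescaled, never made negative
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Hypothetical values: all-zero predictions in grid cell (3, 4)
box = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=1.5)
print(box)  # (3.5, 4.5, 2.0, 1.5)
```

With all-zero predictions the box sits at the middle of its cell and has exactly the anchor's size, which is why well-chosen anchors give the network a sensible starting point.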

###  Class Prediction
**Each bounding box in YOLOv3 predicts the classes an object may belong to with multilabel classification, using independent logistic classifiers rather than a softmax**
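
The difference between the two choices can be seen in a small sketch. The class scores and the label names in the comment are hypothetical; the point is only that independent logistic (sigmoid) outputs let several overlapping labels pass a threshold at once, while softmax outputs compete for a single unit of probability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, -3.0]                # hypothetical raw scores, e.g. person, woman, car

independent = [sigmoid(v) for v in logits]
competing = softmax(logits)

# Indices of the labels whose probability exceeds 0.5 under each scheme
labels_sigmoid = [i for i, p in enumerate(independent) if p > 0.5]
labels_softmax = [i for i, p in enumerate(competing) if p > 0.5]
print(labels_sigmoid)  # [0, 1]
print(labels_softmax)  # [0]
```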

### Feature Extractor

**YOLOv3 uses the Darknet-53 architecture shown below to extract features from its inputs. This architecture is a 53-layer convolutional network. Darknet-53 uses pretrained weights obtained from the ImageNet dataset**

<img src='imgs/darknet-53.png' alt='Darknet-53 architecture'/>

### Non maximum suppression

We use non-maximum suppression to ensure that each object is identified only once. Suppose we have a 100 x 100 image divided into a 9 x 9 grid. If the truck below falls in multiple grid cells, non-maximum suppression picks the best detection from among all the cells in which the truck is detected.

Given a confidence threshold (confThreshold) and an overlap threshold (nmsThreshold), non-maximum suppression:
1. Discards candidates where the probability of the class being present is <= confThreshold
2. Uses the candidate with the greatest probability among those remaining as a prediction
3. Discards any remaining candidate whose Intersection over Union (IoU) with that prediction is > nmsThreshold, then repeats from step 2

<img src='imgs/non-max-suppression.png' alt='non-max-suppression.png'/>
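
The procedure can be sketched in plain Python as follows. This is a minimal illustration, not the routine the notebook actually calls (that is `cv2.dnn.NMSBoxes`); the helper names `iou` and `nms` and the sample boxes are assumptions made for the example. Boxes use the same (left, top, width, height) layout as the rest of the notebook.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (left, top, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, conf_threshold, nms_threshold):
    # Drop low-confidence candidates
    order = [i for i in range(len(boxes)) if scores[i] > conf_threshold]
    # Visit the remaining boxes from highest to lowest score
    order.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in keep):
            keep.append(i)
    return keep

# Hypothetical detections: three boxes on the same object, one elsewhere
boxes = [(10, 10, 50, 50), (12, 12, 50, 50), (200, 200, 40, 40), (11, 9, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
print(nms(boxes, scores, conf_threshold=0.5, nms_threshold=0.4))  # [0, 2]
```

Box 3 is removed by the confidence threshold, and box 1 is removed because it overlaps the higher-scoring box 0.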

## Yolov3 Implementation

**We will examine the performance of YOLOv3 on video input, using OpenCV 4 to read and write videos as needed**

**For this project, we reuse pretrained weights obtained by training the model on the COCO dataset [2]. We took this approach because we found that training the model from scratch for only a few epochs took several days, even on an AWS p3.2xlarge instance**

In [1]:
import pandas as pd
import numpy as np
import cv2 #import openCV

In [2]:
import time
from tqdm import tqdm_notebook, tqdm

**OpenCV 4 provides a convenient method for loading the model and weights** [3]

In [3]:
def load_model(model_config, model_weights):
    model = cv2.dnn.readNetFromDarknet(model_config, model_weights)  # load the model and weights
    model.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)  # use OpenCV as the backend
    model.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)  # run inference on the CPU
    return model

In [4]:
# Return the names of the output layers of the model
def getModelOutputLayers(model):
    layerNames = model.getLayerNames()  # names of all the layers in the network
    # getUnconnectedOutLayers() returns the 1-based indices of the layers whose
    # outputs are not connected to another layer, i.e. the output layers.
    # Older OpenCV versions return an Nx1 array and newer ones a flat array,
    # so flatten before indexing into layerNames.
    outLayerIds = np.array(model.getUnconnectedOutLayers()).flatten()
    namesModelOutputLayers = [layerNames[i - 1] for i in outLayerIds]
    return namesModelOutputLayers

The first step to understanding YOLO is understanding how it encodes its output [4]. The input image is divided into an S x S grid of cells. For each object present in the image, one grid cell is said to be "responsible" for predicting it: the cell into which the center of the object falls.

Each grid cell predicts B bounding boxes as well as C class probabilities. The bounding box prediction has 5 components: (x, y, w, h, confidence). The (x, y) coordinates represent the center of the box, relative to the grid cell location (remember that, if the center of the box does not fall inside the grid cell, then this cell is not responsible for it). These coordinates are normalized to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], relative to the image size.
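
As a small worked example of this encoding, the sketch below finds the responsible cell for an object and normalizes its center and size. The helper names, the 416 x 416 image, the 13 x 13 grid and the object coordinates are all assumptions chosen for illustration.

```python
def encode_center(cx_px, cy_px, img_w, img_h, S):
    """Return the responsible grid cell (row, col) and the center offset inside it."""
    cell_w = img_w / S
    cell_h = img_h / S
    # Clamp so that a center on the right or bottom edge still maps to a valid cell
    col = min(int(cx_px // cell_w), S - 1)
    row = min(int(cy_px // cell_h), S - 1)
    x = cx_px / cell_w - col   # offset relative to the cell, in [0, 1]
    y = cy_px / cell_h - row
    return (row, col), (x, y)

def encode_size(w_px, h_px, img_w, img_h):
    """Normalize the box size by the image size."""
    return w_px / img_w, h_px / img_h

# Hypothetical object centered at pixel (260, 130), 104 px wide and 208 px tall
cell, offset = encode_center(cx_px=260, cy_px=130, img_w=416, img_h=416, S=13)
size = encode_size(w_px=104, h_px=208, img_w=416, img_h=416)
print(cell, offset, size)  # (4, 8) (0.125, 0.0625) (0.25, 0.5)
```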

In [5]:
def drawBoundaryBox(classId, conf, left, top, right, bottom, frame):
    # Draw a bounding box.
    cv2.rectangle(frame, (left, top), (right, bottom), (255, 178, 50), 3)
    
    label = '%.2f' % conf        
    # Get the label for the class name and its confidence
    if classes:
        assert(classId < len(classes))
        label = '%s:%s' % (classes[classId], label)

    #Display the label at the top of the bounding box
    labelSize, baseLine = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
    top = max(top, labelSize[1])
    cv2.rectangle(frame, (left, top - round(1.5*labelSize[1])), (left + round(1.5*labelSize[0]), top + baseLine), (255, 255, 255), cv2.FILLED)
    cv2.putText(frame, label, (left, top), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0,0,0), 1)

Formally, we define confidence as Pr(Object) * IOU(pred, truth). If no object exists in that cell, the confidence score should be zero. Otherwise, we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
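
To make the definition concrete, here is a minimal sketch that computes the IOU of two boxes and the resulting confidence target. The function names and the sample boxes are assumptions, and boxes are given as corner coordinates (x1, y1, x2, y2) for simplicity.

```python
def iou_xyxy(pred, truth):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    return inter / (area_pred + area_truth - inter)

def confidence_target(object_present, pred, truth):
    # Pr(Object) is 1 when the cell contains an object and 0 otherwise
    return (1.0 if object_present else 0.0) * iou_xyxy(pred, truth)

pred = (0, 0, 10, 10)    # hypothetical predicted box
truth = (5, 0, 15, 10)   # hypothetical ground-truth box
print(confidence_target(True, pred, truth))   # 0.3333333333333333
print(confidence_target(False, pred, truth))  # 0.0
```

The two boxes share half of their area, so the intersection is 50 and the union is 150, giving an IOU of one third.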

In [6]:
def get_boundary(frame, detection):
    # detection[0:4] holds the box center (x, y) and size (w, h), normalized to [0, 1]
    fHeight = frame.shape[0]
    fWidth = frame.shape[1]

    # Scale the normalized center and size to pixel units
    Cx = int(detection[0] * fWidth)
    Cy = int(detection[1] * fHeight)
    width = int(detection[2] * fWidth)
    height = int(detection[3] * fHeight)

    # Convert the center to the top-left corner
    left = int(Cx - width / 2)
    top = int(Cy - height / 2)
    boundary = [left, top, width, height]
    return boundary

In [7]:
def postprocess(frame, finalLayerOutputs, score_threshold, nms_threshold):
    information = []
    frameHeight = frame.shape[0]
    frameWidth = frame.shape[1]

    # For all the bounding boxes output by the model, keep the boxes with high confidence scores.
    # The class with the highest score is assigned as the box's class.
    classIds = []
    scores = []
    bboxes = []
    for finalLayerOutput in finalLayerOutputs:
        for detection in finalLayerOutput:
            detectionScores = detection[5:]  # class scores follow the 4 box values and the objectness score
            classId = np.argmax(detectionScores)  # index of the highest score
            confidence = detectionScores[classId]  # confidence of that class
            if confidence > score_threshold:
                boundary  = get_boundary(frame, detection)               
                classIds.append(classId)
                scores.append(float(confidence))
                bboxes.append(boundary)

    # Perform non maximum suppression to eliminate redundant overlapping boxes with lower confidences.    
    indices = cv2.dnn.NMSBoxes(bboxes, scores, score_threshold, nms_threshold)
    # NMSBoxes returns an Nx1 array in older OpenCV versions and a flat array in newer ones
    for i in np.array(indices).flatten():
        box = bboxes[i]
        left = box[0]
        top = box[1]
        width = box[2]
        height = box[3]
        drawBoundaryBox(classIds[i], scores[i], left, top, left + width, top + height, frame)
        information.append(classIds[i])
    return information       


In [8]:
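## Preprocess input frame:
# This function scales, resizes and optionally crops an input frame into a 4D blob
# swapRB is left at its default (False), so channels stay in OpenCV's BGR order
def preprocesFrame(frame, scalefactor, size, mean, crop=False):
    processedFrame = cv2.dnn.blobFromImage(
                                image=frame,              # input frame
                                scalefactor=scalefactor,  # multiplier applied to pixel values
                                size=size,                # spatial size of the output blob
                                mean=mean,                # per-channel mean subtracted before scaling
                                crop=crop)                # whether to crop after resizing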
    return processedFrame 
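
To make the layout of the blob concrete, here is a NumPy-only sketch of the scaling and reordering steps, assuming the frame already has the target size (the resize and mean subtraction are omitted). `to_blob` is a hypothetical helper, not an OpenCV function; `cv2.dnn.blobFromImage` itself returns a 4-dimensional array in batch, channel, height, width order.

```python
import numpy as np

def to_blob(frame, scalefactor):
    """Turn an H x W x C uint8 frame into a 1 x C x H x W float32 blob."""
    blob = frame.astype(np.float32) * scalefactor  # scale pixel values
    blob = blob.transpose(2, 0, 1)                 # H x W x C -> C x H x W
    return blob[np.newaxis, ...]                   # add the batch dimension

# Hypothetical all-white frame, 4 pixels high and 6 pixels wide
frame = np.full((4, 6, 3), 255, dtype=np.uint8)
blob = to_blob(frame, 1 / 255)
print(blob.shape)  # (1, 3, 4, 6)
```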

**Using OpenCV to read and write Videos**

In [9]:
def detect_object(model, imgWidth,imgHeight, fpath_originalVideo,fpath_detectionVideo, score_threshold, nms_threshold):
    
    '''
    model = network returned by load_model
    imgWidth = width of the network input. We use 416 for this model
    imgHeight = height of the network input. We use 416 for this model
    fpath_originalVideo = /path/to/original_video
    fpath_detectionVideo = /path/to/detection_video
    score_threshold = minimum class confidence for a detection to be kept
    nms_threshold = overlap threshold used by non-maximum suppression
    '''
    
    log_scene = [] #Store scene information in this variable
    
    videoCapture = cv2.VideoCapture(fpath_originalVideo) #capture the originalVideo and  process it frame-by-frame 
    
    # Define video writer parameters
    # Save the video in Motion JPEG format. See: https://www.fourcc.org/mjpg/
    fourcc = cv2.VideoWriter_fourcc(*'MJPG')
    fps = 30  # output frame rate in frames per second
    imgW = round(videoCapture.get(cv2.CAP_PROP_FRAME_WIDTH))   # frame width of the source video
    imgH = round(videoCapture.get(cv2.CAP_PROP_FRAME_HEIGHT))  # frame height of the source video

    videoWriter = cv2.VideoWriter(filename=fpath_detectionVideo,  # name of the file to be written
                                  fourcc=fourcc,                  # video format
                                  fps=fps,                        # write speed in frames per second
                                  frameSize=(imgW, imgH))
 
    
    while cv2.waitKey(1) < 0:        
     
        # read current video frame
        hasFrame, frame = videoCapture.read()

        # If the videoCapture does not capture any frames, we have reached the end of the video
        # stop the detection process 
        if not hasFrame:           
            print('Object Detection Completed for: ', fpath_detectionVideo)
            cv2.waitKey(3000)
            break

        # preprocess frames
        processedFrame = preprocesFrame(frame = frame,
                                        scalefactor = 1/255,  # scale pixel values to [0, 1]
                                        size = (imgWidth, imgHeight),  # resize to the network input size
                                        mean = [0,0,0]  # no mean subtraction
                                       )


        # Set the input 
        model.setInput(processedFrame)

        # Get the names of output layers 
        namesModelOutputLayers = getModelOutputLayers(model)
        
        #get the output of the final layers
        finalLayerOutputs = model.forward(namesModelOutputLayers)

        # get and draw boundaries, as well as scene information
       
        information = postprocess(frame, finalLayerOutputs, score_threshold, nms_threshold)

        # Add efficiency information. getPerfProfile() returns the overall
        # inference time (in ticks) and the timings for each of the layers
        retval, _ = model.getPerfProfile()
        # getTickFrequency() returns the number of ticks per second
        label = 'Inference time: %.2f ms' % (retval * 1000.0 / cv2.getTickFrequency())
        
        #Add label to the frame
        cv2.putText(frame, label, (0, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))

        # Save frame with the detection boxes    
        videoWriter.write(frame.astype(np.uint8)) 
        
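        #Add scene information
        log_scene.append(information)

    # Release the reader and writer so that the output file is finalized
    videoCapture.release()
    videoWriter.release()
    return log_scene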
        

In [10]:
def describe_video(video_output):
    bucket = []
    for frame in video_output:
        # Map the class ids detected in this frame to class names
        row = [classes[classId] for classId in frame]
        bucket.append(row)
    scene = pd.DataFrame(bucket)
    scene = scene.dropna(how='all')
    scene = scene.set_index(1).stack()
    scene = scene.groupby(level=[0,1]).nunique(dropna=False).unstack(fill_value=0)
    scene = scene.T    
    return scene   
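
Note that `describe_video` calls `set_index(1)`, so it relies on the DataFrame having a second detection column, which only exists when at least one frame contains two or more detections. If only the set of detected classes is needed, a plain-Python alternative could look like the sketch below. The helper `summarize_classes` and the sample data are hypothetical and are not used elsewhere in this notebook.

```python
def summarize_classes(video_output, class_names):
    """Return the sorted, unique class names detected anywhere in a video.

    video_output: list with one entry per frame, each a list of class ids.
    """
    seen = set()
    for frame in video_output:
        seen.update(class_names[class_id] for class_id in frame)
    return sorted(seen)

names = ['person', 'bicycle', 'car']    # hypothetical class list
frames = [[0, 2], [], [2], [0, 0, 1]]   # hypothetical per-frame class ids
print(summarize_classes(frames, names))  # ['bicycle', 'car', 'person']
```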

# Testing

In [11]:
import os
TEST_DIR = 'test_videos'
TEST_DIR_DETECTION = 'test_videos_detection'

**Load Classes**

In [12]:
classes = pd.read_csv('class_names.csv')
classes = classes['class'].tolist()

In [13]:
classes[:10] #Display a few classes

['person',
 'bicycle',
 'car',
 'motorbike',
 'aeroplane',
 'bus',
 'train',
 'truck',
 'boat',
 'traffic light']

**We feed Darknet-53 input frames resized to 416 x 416**

In [14]:
imgWidth = 416       # network input width
imgHeight = 416      # network input height
score_threshold = 0.5  # Confidence threshold used to accept or reject a detection
nms_threshold = 0.4   # Non-maximum suppression threshold. Ensures that each object is only detected once

**We will use the original pretrained YOLOv3 weights for testing**

In [15]:
# Load the model configuration and model weights. These are the pretrained weights and their associated configuration
model_config = 'cfg/yolov3-spp.cfg'
model_weights = 'weights/yolov3-spp.weights'

In [16]:
from IPython.display import HTML

**Video 1: Before Object Detection**

In [17]:
%%HTML
<video width='640' height='300' controls>
  <source src='demo/thailand.mp4' type='video/mp4'>
</video>

**Video 1: After Object Detection**

In [18]:
%%HTML
<video width='640' height='300' controls>
  <source src='demo/thailand_detection.mp4' type='video/mp4'>
</video>

## USE CASE : AI BASED VIDEO ANNOTATION AND RETRIEVAL

**We have shown above that, for a given video, the model can detect classes within the video. What if we saved the bounding box information, the classes detected, the time of detection and so on in a database? Could we then search the database for videos that meet specific criteria?**
<br><br>
 **Example**
<table>
    <thead>
        <tr>
            <th>File Name</th>
            <th>File Location</th>
            <th>Classes Detected</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>video_01.mp4</td>
            <td>data/videos</td>
            <td>[person, dog, horse]</td>
        </tr>
        <tr>
            <td>video_02.mp4</td>
            <td>data/videos</td>
            <td>[backpack, bottle]</td>
        </tr>
        <tr>
            <td>video_03.mp4</td>
            <td>data/videos</td>
            <td>[car, truck, traffic light]</td>
        </tr>
    </tbody>
</table>

**In the table above, we only show three variables, but the table can include metadata from the video, e.g. the location where the video was shot, the time/date the video was shot, the length of the video, the focal length, etc. Further, the task can be expanded to address activity tracking rather than object detection, and those parameters can be added to the database. With this information, we can answer a question such as: "Show me the dunks made by Michael Jordan in the 1995 NBA season"**

**For our simple case study here, we look into a directory of saved videos and return any video that contains the searched objects**
- **Step 1: For every video in the directory that has not yet been added to the annotation database, we detect the objects in the video and update the annotation database. In practice, this annotation job could be done in a big data environment using Apache Spark or an equivalent environment**
- **Step 2: We search the annotation database and return the videos that meet the search criteria**

In [19]:
def annotate_videos(annnotation_df):
    for originalVideo in tqdm(os.listdir(TEST_DIR)):                        
        videosAlreadyAnnotated = annnotation_df['fname'].unique().tolist()
        
        if originalVideo in videosAlreadyAnnotated:
            print('Video Already Annotated ', originalVideo)
        else:
            fpath_originalVideo = os.path.join(TEST_DIR, originalVideo)
            detectionVideo = originalVideo[:-4]+'_detection.avi'
            fpath_detectionVideo = os.path.join(TEST_DIR_DETECTION, detectionVideo) 
            
            model = load_model(model_config, model_weights)
            data = detect_object(model, imgWidth,imgHeight, fpath_originalVideo,
                         fpath_detectionVideo, score_threshold, nms_threshold)
            
            scene = describe_video(video_output = data)
            
            row = pd.DataFrame([{'fname': originalVideo, 'floc': fpath_originalVideo,
                                 'classes': ' '.join(list(scene.columns))}])
            # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
            annnotation_df = pd.concat([annnotation_df, row], ignore_index=True)
    annnotation_df.to_csv(ANNOTATION_PATH, index=False)
    return annnotation_df

In [20]:
ANNOTATON_DIR = 'annotation_db'
ANNOTATION_FNAME = 'annotation.csv'
ANNOTATION_PATH = os.path.join(ANNOTATON_DIR, ANNOTATION_FNAME)

In [21]:
import pathlib

def load_annotations(path = ANNOTATION_PATH):
    # Load the annotation database if it exists, otherwise start with an empty one
    file = pathlib.Path(path)
    if file.exists():
        annnotation_df = pd.read_csv(path)
    else:
        annnotation_df = pd.DataFrame(data= {'fname':[] , 'floc': [], 'classes': []})
    return annnotation_df

In [22]:
annotation_df = load_annotations()

**Dataframe before annotation**

In [23]:
annotation_df

Unnamed: 0,fname,floc,classes


In [24]:
annnotation_df =  annotate_videos(annnotation_df = annotation_df)

 20%|██        | 1/5 [07:36<30:24, 456.22s/it]

Object Detection Completed for:  test_videos_detection/video_01_detection.avi


 40%|████      | 2/5 [13:47<21:32, 430.69s/it]

Object Detection Completed for:  test_videos_detection/video_04_detection.avi


 60%|██████    | 3/5 [17:37<12:20, 370.47s/it]

Object Detection Completed for:  test_videos_detection/video_05_detection.avi


 80%|████████  | 4/5 [28:53<07:42, 462.15s/it]

Object Detection Completed for:  test_videos_detection/video_02_detection.avi


100%|██████████| 5/5 [36:27<00:00, 459.72s/it]

Object Detection Completed for:  test_videos_detection/video_03_detection.avi





**Dataframe after annotation**

In [25]:
annnotation_df

Unnamed: 0,fname,floc,classes
0,video_01.mp4,test_videos/video_01.mp4,car motorbike person
1,video_04.mp4,test_videos/video_04.mp4,elephant person
2,video_05.mp4,test_videos/video_05.mp4,car truck
3,video_02.mp4,test_videos/video_02.mp4,backpack bench bottle car chair cup diningtabl...
4,video_03.mp4,test_videos/video_03.mp4,car cow horse person truck


In [26]:
def searchVideos(searchList, annnotation_df = annnotation_df):
    def filterVideos(searchList, classes):
        success = False
        classesList = classes.split(' ')
        if [x for x in classesList  if x in searchList]:
            success = True
        return success
    
    results = annnotation_df[annnotation_df['classes'].apply(lambda x: filterVideos(searchList, x))]    
    if results.empty:
        print('No Results found for:', ' or '.join(searchList))
    else:
        return results
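
One caveat: the class lists are joined and split on a single space, so a multi-word class name such as 'traffic light' (present in the class list shown earlier) would be stored and matched as two separate tokens. One possible workaround is to store the classes with a different separator. The sketch below uses a comma and plain dictionaries instead of a DataFrame; the helper `matches`, the separator and the sample records are assumptions, not the notebook's code.

```python
def matches(search_terms, classes_field, sep=','):
    """True if any search term is among the classes stored in one record."""
    stored = {name.strip() for name in classes_field.split(sep)}
    return any(term in stored for term in search_terms)

records = [  # hypothetical annotation rows
    {'fname': 'a.mp4', 'classes': 'car,traffic light,person'},
    {'fname': 'b.mp4', 'classes': 'elephant,person'},
]
hits = [r['fname'] for r in records if matches(['traffic light'], r['classes'])]
print(hits)  # ['a.mp4']
```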

**With the annotation database update complete, we can search the database (a dataframe) to get videos that match our search criteria**

In [31]:
#Show videos containing 'motorbike'
searchVideos(['motorbike'])

Unnamed: 0,fname,floc,classes
0,video_01.mp4,test_videos/video_01.mp4,car motorbike person


**Display Video to Confirm Results**

In [33]:
%%HTML
<video width='640' height='300' controls>
  <source src='test_videos/video_01.mp4' type='video/mp4'>
</video>

In [32]:
#Show videos containing 'elephant'
searchVideos(['elephant'])

Unnamed: 0,fname,floc,classes
1,video_04.mp4,test_videos/video_04.mp4,elephant person


In [34]:
%%HTML
<video width='640' height='300' controls>
  <source src='test_videos/video_04.mp4' type='video/mp4'>
</video>

# Conclusion Yolo V3

**We were able to test YOLOv3 on a selection of personal and internet videos. We did not have annotated video samples that contained ground truth, so we were unable to measure mAP for video.**

**FUTURE WORK**

- **Train the YOLOv3 network with new classes. We were not able to do so because of hardware limitations: the process of training the network took an average of 5 days**
- **Study the performance of YOLOv3 on real-time live data**
- **Video activity tracking**
- **Video captioning using RNN and NLP**

## References:
1. YOLOv3: An Incremental Improvement: https://arxiv.org/abs/1804.02767
2. COCO dataset: http://cocodataset.org/#home
3. OpenCV: https://docs.opencv.org/4.1.0/
4. Understanding YOLO: https://hackernoon.com/understanding-yolo-f5a74bbc7967