# Object Detection using TensorFlow

### Authors : Renu Hadke, Menita Koonani

## Objective 
Our objective is to identify and classify objects in images and real-time video using TensorFlow, and to determine the accuracy of each identification. We consider three models, namely SSD with MobileNet, SSD Inception V2 and Faster RCNN Inception V2, and compare their accuracy and size.
The principal difference between the models is that Faster RCNN Inception V2 is optimized for accuracy, while the MobileNets are optimized to be small and efficient, at the cost of some accuracy. SSD with MobileNets detects objects in a single shot with just two components in its architecture, namely Feature Extraction and Detection Generator, while Faster R-CNN consists of three components: Feature Extraction, Proposal Generation and Box Classifier.

In the Faster R-CNN Inception model, a region proposal network generates regions of interest, and then either fully-connected layers or position-sensitive convolutional layers classify those regions. SSD does the two in a “single shot,” simultaneously predicting the bounding box and the class as it processes the image.


## Understanding the terms
Let us first understand a few terms before we move on to the process of object detection and the comparison of the models.

### Convolutional Neural Networks (CNN):

A CNN is a class of deep, feed-forward artificial neural networks that has been applied to analyzing visual imagery.
CNNs are a special class of multilayer perceptrons that are well suited for pattern classification.
They are specifically designed to recognize 2D shapes with a high degree of invariance to translation, skewing and scaling.
They are made up of neurons that have learnable weights and biases. Each neuron receives some input, performs a dot product and optionally follows it with a non-linearity.
The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function.
There are three main types of layers used to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.
![title](images/cnn.png)

### SSD with MobileNet:
Out of the many detection models, we chose to work with the combination of Single Shot Detectors (SSDs) and the MobileNets architecture, as they are fast and efficient and do not require huge computational capability to perform the object detection task. The SSD approach is based on a feed-forward convolutional neural network which produces a fixed-size collection of bounding boxes and scores for the object class instances present in those boxes.

The main difference between a “traditional” CNN and the MobileNet architecture is that, instead of a single 3x3 convolution layer followed by batch norm and ReLU, MobileNets split the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv.
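
To see why this split keeps MobileNets small, the sketch below counts the weights of a standard 3x3 convolution and of the depthwise-plus-pointwise pair. The channel counts (128 in, 256 out) are illustrative assumptions rather than values taken from a particular MobileNet layer, and biases and batch-norm parameters are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    # every output channel has its own kernel x kernel filter over all input channels
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    # depthwise step: one kernel x kernel filter per input channel
    depthwise = kernel * kernel * c_in
    # pointwise step: a 1x1 convolution that mixes the channels
    pointwise = c_in * c_out
    return depthwise + pointwise


# hypothetical layer sizes, chosen only for illustration
c_in, c_out = 128, 256
standard = standard_conv_params(3, c_in, c_out)
separable = depthwise_separable_params(3, c_in, c_out)
print(standard, separable, round(standard / separable, 1))  # 294912 33920 8.7
```

For these assumed sizes the separable version needs roughly one ninth of the weights, which is where the size and speed advantage comes from.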


### SSD Inception V2 Model:

Given an input image and a set of ground truth labels, SSD does the following:

•	It passes the image through a series of convolutional layers, providing several sets of feature maps at different scales.
•	For each location in each of these feature maps, a 3x3 convolutional filter is used to evaluate a small set of default bounding boxes.
•	For each box, it simultaneously predicts the bounding box offset and the probabilities of each class.
•	During training, it matches the ground truth box with these predicted boxes based on IoU (Intersection over Union). The best predicted box is labeled a “positive”, along with all the other boxes that have an IoU with the ground truth greater than 0.5.
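
The matching rule in the last step can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used inside the detection models: the function names, the `(ymin, xmin, ymax, xmax)` box layout and the example boxes are assumptions made for the sketch.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (ymin, xmin, ymax, xmax)."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def match_positives(ground_truth, candidate_boxes, threshold=0.5):
    """Indices of boxes labeled positive: the best match plus every box with IoU > threshold."""
    overlaps = [iou(ground_truth, box) for box in candidate_boxes]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return sorted({best} | {i for i, o in enumerate(overlaps) if o > threshold})


# hypothetical boxes, for illustration only
truth = (0, 0, 10, 10)
candidates = [(0, 0, 10, 10), (0, 0, 10, 5), (0, 2, 10, 12), (20, 20, 30, 30)]
print([round(iou(truth, box), 2) for box in candidates])  # [1.0, 0.5, 0.67, 0.0]
print(match_positives(truth, candidates))                 # [0, 2]
```

The second candidate has an IoU of exactly 0.5, so it is not a positive under the strict "greater than 0.5" rule.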


### Faster RCNN Inception Model:

The main insight of Faster R-CNN was to replace the slow selective search algorithm that was used in R-CNN (Region-based Convolutional Neural Network) with a fast neural net, the region proposal network.
Faster R-CNN builds on Fast R-CNN, which improved on the detection speed of the original R-CNN through two augmentations:

•	It performs feature extraction over the image before proposing regions, thus running only one CNN over the entire image instead of running 2000 CNNs across 2000 overlapping regions.
•	It replaces the SVM with a softmax layer, thus extending the network for predictions instead of creating a new model.


### TensorFlow:

TensorFlow is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices.


### OpenCV:

OpenCV (Open Source Computer Vision) is a library of programming functions mainly aimed at real-time computer vision. Its API provides a class ‘VideoCapture’ for capturing video from cameras or for reading video files and image sequences. We use it to access the webcam of our computer and capture real-time video.


### Pre-Trained dataset: 
The COCO dataset is a large-scale object detection, segmentation and captioning dataset. It has 200K labelled images, and the label map used with the models here has class IDs running up to 90. The models pre-trained on COCO are downloaded from https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md.

Choose any pre-trained model you want from the following list
![title](images/models.png)

### COCO mAP:
The higher the mAP (mean average precision), the better the model. In our observations, SSD with MobileNets gave much better results in terms of speed, while the Faster RCNN Inception model gave higher accuracy at some cost in speed.
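
As a rough illustration of what the metric measures, the sketch below computes a simplified average precision per class and then takes the mean over classes. It is a deliberately reduced version: the official COCO metric also averages over several IoU thresholds and uses interpolated precision, which is omitted here. The function names and the example detections are assumptions made for the sketch.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Simplified AP for one class.

    is_true_positive: flags for the detections, sorted by descending score.
    num_ground_truth: number of annotated objects of this class.
    """
    hits = 0
    precision_sum = 0.0
    for rank, correct in enumerate(is_true_positive, start=1):
        if correct:
            hits += 1
            # precision at the rank where this correct detection appears
            precision_sum += hits / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def mean_average_precision(per_class):
    """per_class maps a class name to (flags, num_ground_truth)."""
    aps = [average_precision(flags, n) for flags, n in per_class.values()]
    return sum(aps) / len(aps)


# hypothetical detections, for illustration only
detections = {
    'person': ([True, True, False, True], 3),
    'laptop': ([True, False], 2),
}
print(round(average_precision(*detections['person']), 3))  # 0.917
print(round(mean_average_precision(detections), 3))        # 0.708
```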


## Code With Documentation:

### Installation

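### Install TensorFlow in Anaconda using the command below:
The code in this notebook uses the TensorFlow 1.x API (tf.Session, tf.GraphDef, tf.gfile), so a 1.x release is needed.

pip install "tensorflow<2"


### Install OpenCV using the following command:
conda install -c menpo opencv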


### Imports: 


In [None]:
# importing the necessary libraries
import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile
from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image
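# these helpers live in the utils folder of the TensorFlow Object Detection API
# (research/object_detection), so run the notebook from that directory
from utils import label_map_util
from utils import visualization_utils as vis_util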

# in order to display the images in line
%matplotlib inline

### Model URL
The model is downloaded from the URL below.

download_url = 'http://download.tensorflow.org/models/object_detection/'

### To use a different model, change the model name in the code
For example, change "model = 'ssd_mobilenet_v1_coco_11_06_2017'" to
"model = 'faster_rcnn_inception_v2_coco_2018_01_28'"
![title](images/ssd_model.png)
![title](images/fasterRCNN_model.png)
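
The cells further down refer to `model_tar`, `path`, `label_path` and `classes_num`, which are not defined anywhere in this notebook. A minimal sketch of those definitions follows. The values are assumptions based on the usual layout of the TensorFlow Object Detection API (the frozen graph inside a folder named after the model, the label map under `data/`), so adjust them to your own checkout.

```python
import os

# name of the pre-trained model to use
model = 'ssd_mobilenet_v1_coco_11_06_2017'
# model = 'faster_rcnn_inception_v2_coco_2018_01_28'

# archive that is fetched from the download URL (assumed naming scheme)
model_tar = model + '.tar.gz'

# frozen graph as it ends up on disk after extracting the archive (assumed layout)
path = os.path.join(model, 'frozen_inference_graph.pb')

# label map that translates class IDs into names (assumed location)
label_path = os.path.join('data', 'mscoco_label_map.pbtxt')

# highest class ID in the COCO label map
classes_num = 90
```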


### Protobuf file:
Protocol Buffers is a method of serializing structured data. The label map is stored in the protobuf text format (.pbtxt), which reads much like a JSON file. It lists all the general classes that are considered for detection.
You will find a protobuf file in the <folder> in the given GitHub link
<githublink>

A snapshot of the protobuf file (mscoco_label_map.pbtxt):
![title](images/protobuf.png)
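
To make the structure of the label map concrete, the sketch below parses a two-entry sample in the same `item { ... }` layout into a dictionary keyed by class ID. It is only an illustration with a hand-written regular expression: the notebook itself relies on `label_map_util` for this step, and the function name `parse_label_map` and the sample text are assumptions made for the sketch.

```python
import re

# two entries in the layout of mscoco_label_map.pbtxt (illustrative sample)
SAMPLE = '''
item {
  name: "/m/01g317"
  id: 1
  display_name: "person"
}
item {
  name: "/m/0199g"
  id: 2
  display_name: "bicycle"
}
'''


def parse_label_map(text):
    """Return {class_id: {'id': class_id, 'name': display_name}} for every item block."""
    index = {}
    for block in re.findall(r'item\s*\{(.*?)\}', text, flags=re.DOTALL):
        item_id = int(re.search(r'\bid:\s*(\d+)', block).group(1))
        display_name = re.search(r'display_name:\s*"([^"]*)"', block).group(1)
        index[item_id] = {'id': item_id, 'name': display_name}
    return index


print(parse_label_map(SAMPLE))
# {1: {'id': 1, 'name': 'person'}, 2: {'id': 2, 'name': 'bicycle'}}
```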


### Downloading the model you have chosen and extracting the tar

In [None]:
# download the model archive to our system
urllib.request.urlretrieve(download_url + model_tar, model_tar)

# open the tar file and extract only the frozen inference graph
with tarfile.open(model_tar) as tar:
    for each_file in tar.getmembers():
        each_file_name = os.path.basename(each_file.name)
        if 'frozen_inference_graph.pb' in each_file_name:
            tar.extract(each_file, os.getcwd())

### Loading the frozen TF model into memory by creating a graph

In [None]:
# loading a frozen tensorflow model into memory
graph_detection = tf.Graph()
with graph_detection.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, 'rb') as fid:
        graph_serialized = fid.read()
        graph_def.ParseFromString(graph_serialized)
        tf.import_graph_def(graph_def, name='')

### Loading the labels for the frozen model

In [None]:
# loading labels and their mappings
label_map = label_map_util.load_labelmap(label_path)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=classes_num, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

### Dimensions of an image:
We load each image into a NumPy array. The array has three dimensions: the height, the width and the three RGB intensity values of each pixel.



In [None]:
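# load an image into a numpy array of shape (height, width, 3) holding the RGB values of each pixel
def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    # convert to RGB first so that grayscale or RGBA images also yield three channels
    rgb_image = image.convert('RGB')
    return np.array(rgb_image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)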
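
The shapes involved are easy to mix up, because PIL reports the size as (width, height) while the array is indexed as (height, width, channel). The small sketch below walks through the reshape with a made-up 4x2 image. The pixel list stands in for `image.getdata()` and is an assumption for illustration.

```python
import numpy as np

width, height = 4, 2

# stand-in for image.getdata(): one (R, G, B) tuple per pixel, row by row
pixels = [(x, y, 0) for y in range(height) for x in range(width)]

image_np = np.array(pixels).reshape((height, width, 3)).astype(np.uint8)
print(image_np.shape)   # (2, 4, 3)
print(image_np[1, 3])   # pixel in row 1, column 3: [3 1 0]

# the detection model expects a batch dimension in front: [1, height, width, 3]
batch = np.expand_dims(image_np, axis=0)
print(batch.shape)      # (1, 2, 4, 3)
```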

### Giving path for images to detect objects

In [None]:
# path where the test images are stored
test_images_dir = 'test_images'
test_image_path = [ os.path.join(test_images_dir, 'ILSVRC2017_test_00000013.jpeg')]

# Size of the output images in inches
image_size = (8, 5)

### Detecting objects in an image or set of images
The code below detects objects in an image or a set of images.


In [None]:
with graph_detection.as_default():
    with tf.Session(graph=graph_detection) as sess:
        for image_path in test_image_path:
            
            # opening images from the path
            image = Image.open(image_path)
            
            # array representation of the image used later to prepare the result image with bounding boxes with labels on it
            image_np = load_image_into_numpy_array(image)
            
            # expanding the dimensions of the image as the model expects images to have the shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            image_tensor = graph_detection.get_tensor_by_name('image_tensor:0')
            
            # each box represents parts of the image where a particular object was detected
            boxes = graph_detection.get_tensor_by_name('detection_boxes:0')
            
            # each score represents the level of confidence for each of the objects
            # this score is shown on the result image along with the class label
            scores = graph_detection.get_tensor_by_name('detection_scores:0')
            classes = graph_detection.get_tensor_by_name('detection_classes:0')
            num_detections = graph_detection.get_tensor_by_name('num_detections:0')
            
            # Actual detection
            (boxes, scores, classes, num_detections) = sess.run([boxes, scores, classes, num_detections],
              feed_dict={image_tensor: image_np_expanded})
            
            # visualization of the results of an identified object
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(boxes),
                np.squeeze(classes).astype(np.int32),
                np.squeeze(scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)
            plt.figure(figsize=image_size)
            plt.imshow(image_np)
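
Besides drawing the boxes, it can be useful to read the results as plain values. The sketch below shows one way to do that, assuming the arrays have already been squeezed as in the cell above: it keeps the detections above a score threshold and converts the normalized `[ymin, xmin, ymax, xmax]` boxes into pixel coordinates. The function name, the threshold of 0.5, the class IDs and all example values are assumptions made for illustration.

```python
def summarize_detections(boxes, classes, scores, category_index,
                         image_width, image_height, min_score=0.5):
    """List label, score and pixel box for every detection scoring at least min_score."""
    results = []
    for box, class_id, score in zip(boxes, classes, scores):
        if score < min_score:
            continue
        ymin, xmin, ymax, xmax = box  # normalized to the range [0, 1]
        results.append({
            'label': category_index[int(class_id)]['name'],
            'score': float(score),
            # (left, top, right, bottom) in pixels
            'box_px': (round(xmin * image_width), round(ymin * image_height),
                       round(xmax * image_width), round(ymax * image_height)),
        })
    return results


# hypothetical model output for a 640x480 image
example_index = {1: {'id': 1, 'name': 'person'}, 73: {'id': 73, 'name': 'laptop'}}
example_boxes = [[0.1, 0.2, 0.9, 0.6], [0.5, 0.5, 0.8, 0.9], [0.0, 0.0, 0.1, 0.1]]
example_classes = [1, 73, 1]
example_scores = [0.83, 0.81, 0.2]

for detection in summarize_detections(example_boxes, example_classes, example_scores,
                                      example_index, 640, 480):
    print(detection)
```

With the real outputs, the arguments would be `np.squeeze(boxes)`, `np.squeeze(classes)`, `np.squeeze(scores)` and `category_index`, together with the width and height of the image.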

### Detecting objects through webcam
The code below detects objects through the webcam.
To capture a video, we need to create a VideoCapture object. Its argument can be either the device index or the name of a video file. The device index is just a number that specifies which camera to use. Normally one camera is connected, as in our case, so we simply pass 0 (or -1). You can select a second camera by passing 1, and so on. After that, you can capture frame by frame.

In [None]:
import cv2

# open the default webcam (device index 0)
cap = cv2.VideoCapture(0)
with graph_detection.as_default():
    with tf.Session(graph=graph_detection) as sess:
        # look up the input and output tensors once, outside the frame loop
        image_tensor = graph_detection.get_tensor_by_name('image_tensor:0')
        # each box represents a part of the image where a particular object was detected
        boxes = graph_detection.get_tensor_by_name('detection_boxes:0')
        # each score represents the level of confidence for each of the objects
        # the score is shown on the result image, together with the class label
        scores = graph_detection.get_tensor_by_name('detection_scores:0')
        classes = graph_detection.get_tensor_by_name('detection_classes:0')
        num_detections = graph_detection.get_tensor_by_name('num_detections:0')

        while True:
            ret, image_np = cap.read()
            if not ret:
                # stop when no frame could be read from the camera
                break
            # OpenCV delivers frames in BGR order, while the model expects RGB
            image_rgb = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
            image_np_expanded = np.expand_dims(image_rgb, axis=0)
            # actual detection
            (out_boxes, out_scores, out_classes, _) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # visualization of the results of a detection, drawn on the original frame
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(out_boxes),
                np.squeeze(out_classes).astype(np.int32),
                np.squeeze(out_scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)
            cv2.imshow('image', cv2.resize(image_np, (1280, 960)))
            # press 'q' to quit
            if cv2.waitKey(25) & 0xFF == ord('q'):
                break

# release the camera and close the display window once the loop has ended
cap.release()
cv2.destroyAllWindows()

## Results:

### Below are the results that we got.

For **SSD with MobileNet**, the accuracy of object detection in the image (the confidence score shown next to each detected object) is 83% for the person and 81% for the laptop. This model was the fastest but had the lowest accuracy.

![title](images/ssd_results.png)

For the **SSD Inception V2 model**, the accuracy of the objects detected in the image is 90% for the person and 95% for the laptop.

![title](images/ssdV2_result.png)

For the **Faster RCNN Inception Model**, the accuracy of object detection in the image is 99% for the person and 99% for the laptop.

![title](images/fasterRCNN_result.png)

In addition, we successfully accessed the webcam of our system using OpenCV to detect objects in real time. **The model used here is SSD with MobileNets, as it produces results much faster than the other two models.**
![title](images/real_time_detect.png)

## Conclusion:

**SSD with MobileNet model accuracy:**
Person 83% and laptop 81%

**SSD Inception V2 model accuracy:**
Person 90% and laptop 95%

**Faster RCNN Inception Model accuracy:**
Person 99% and laptop 99%

As we can see above, the Faster RCNN Inception Model gives the highest accuracy and SSD with MobileNet gives the lowest.
However, MobileNets are optimized to be small and efficient, at the cost of some accuracy, which is why we used SSD with MobileNets for real-time detection.


### References:

•	https://github.com/tensorflow/models/tree/master/research/object_detection
•	https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

#### License:
MIT License

Copyright (c) 2018 renuHadke, menitakoonani

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

    

The code in the document by Menita Koonani and Renu Hadke is licensed under the MIT License https://opensource.org/licenses/MIT