# General Description
This notebook implements a simple door detection method. We use the pretrained inception_resnet_v2 model which has been trained on the Open Images V4 dataset. This is a link to the Open Images V6 website where you can explore the dataset: https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=detection&c=%2Fm%2F02dgv. The Open Images V4 dataset is also available for download from that link.

For door detection on Google Street View images it is not enough to simply make predictions using inception_resnet_v2. The door makes up too small a portion of the image and is thus ignored by the model. Instead, we detect doors by first detecting a house in the image. If multiple houses exist in the image, we use the house whose bounding box has the largest area. The reasoning is that a house with a larger bounding box is closer to the camera and is more likely to contain the door we are interested in. Once a target house is identified, we zoom into that portion of the image. Since doors are often small and occluded, this zoomed-in portion of the image is still too large for the model to detect a door. We therefore crop small sections out of the zoomed-in image of the house and predict on them. These sections are small enough for the model to detect the door.

However, these cropped sections must overlap to avoid cutting the door in half. This means that we will likely have multiple croppings which contain the door. To choose the best one we define a centrality metric. The cropping whose door prediction is most central is considered the best cropping, and we use the prediction which arises from this cropping as our final prediction.
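
As a rough illustration of the selection rule described above, the sketch below picks the most central door among several croppings. It is a simplified stand-in rather than this notebook's own code: the names `door_centrality` and `pick_best_cropping` are hypothetical, and each cropping is assumed to be summarised by the `(xmin, xmax)` of its detected door as fractions of the cropping width, or `None` when no door was detected.

```python
def door_centrality(door_x_range):
    """Horizontal distance between the door's center and the cropping's center.

    door_x_range is (xmin, xmax) as fractions of the cropping width,
    or None when no door was detected in that cropping.
    """
    if door_x_range is None:
        return None
    xmin, xmax = door_x_range
    return abs(0.5 - (xmin + xmax) / 2)


def pick_best_cropping(door_x_ranges):
    """Return the index of the cropping with the most central door, or None."""
    scored = [
        (door_centrality(r), i)
        for i, r in enumerate(door_x_ranges)
        if r is not None
    ]
    if not scored:
        return None
    # The smallest distance from the center wins.
    return min(scored)[1]


# Hypothetical detections for three overlapping croppings.
detections = [(0.70, 0.90), (0.40, 0.62), None]
print(pick_best_cropping(detections))
```

In this made-up example the second cropping (index 1) is chosen, because its door is almost exactly in the middle of the cropping while the door in the first cropping sits near the right edge.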

# General Notes
- This notebook is a modified version of this code: https://www.kaggle.com/xhlulu/intro-to-tf-hub-for-object-detection
- The images with bounding boxes overlaid do not show all of the model's predictions. They only show the predictions with a confidence above a user-specified threshold
- `getCrops` has a useful `debug` parameter. When it is set, the function displays additional information, the zoomed target image, and the croppings themselves. Otherwise nothing is displayed
- predict_on_directory has three verbosity settings (0, 1, 2). These are useful for debugging and understanding where and how the model is making mistakes
- It takes ~5 min to predict on the entire test set with a GPU and >1 h with a CPU. This time can be significantly improved without great effort

# Improvement Notes
- Batch computation is not supported, so the GPU is not being fully utilized
- The network currently takes strings in binary format as input. This should probably be changed to something more
  convenient, like an array. This can be done by modifying the decoding step
- Only using the door prediction which arises from the best cropping allows for the possibility of false positives. If we instead consider the location of the door (relative to the original image) in each cropping, we can get more consistent results
- When there are multiple houses in the image the wrong house is sometimes selected. Using a weighted average of centrality, bounding box area, and prediction confidence will likely provide a good solution to this problem. Another option is to use GPS data. We can accurately select which house we want to target by knowing the camera's position and orientation. If we use GPS data we can also get multiple images of each house, which means we can get multiple predictions for multiple ground height calculations and thus a more accurate final calculation.
- There is an excessive amount of repeated computation between the overlapping cropped image sections. This can be avoided by using a fully convolutional architecture. Alternatively, the sliding window size of the model (the object size which the model can detect) can be lowered to match the size of a door.
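
The target house selection note above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code used anywhere in this notebook: the function names `house_score` and `pick_target_house`, the way centrality is normalised, and the equal default weights are all hypothetical and would need tuning on real data.

```python
def house_score(box, confidence, weights=(1.0, 1.0, 1.0)):
    """Combine centrality, area, and confidence into one score (higher is better).

    box: (ymin, xmin, ymax, xmax) as fractions of the image.
    weights: hypothetical (w_centrality, w_area, w_confidence).
    """
    ymin, xmin, ymax, xmax = box
    area = (ymax - ymin) * (xmax - xmin)
    x_center = (xmin + xmax) / 2
    # 1.0 when the box is horizontally centered, 0.0 when its center is at an image edge.
    centrality = 1.0 - 2.0 * abs(0.5 - x_center)
    w_c, w_a, w_s = weights
    return (w_c * centrality + w_a * area + w_s * confidence) / (w_c + w_a + w_s)


def pick_target_house(houses, weights=(1.0, 1.0, 1.0)):
    """Return the index of the highest-scoring house, or None if there are none."""
    if not houses:
        return None
    scores = [house_score(box, conf, weights) for box, conf in houses]
    return max(range(len(houses)), key=scores.__getitem__)


# Made-up detections: (box, confidence).
houses = [
    ((0.2, 0.0, 0.9, 0.45), 0.40),   # large box at the left edge, low confidence
    ((0.3, 0.35, 0.8, 0.70), 0.85),  # smaller box, central, confident
]
print(pick_target_house(houses))
```

With these made-up values the second house is selected, whereas a rule based only on the largest area would select the first one.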

# Results

#### Clarification about the definition of accuracy within this notebook
In the tests below, accuracy refers to the number of times a door was identified in an image. This does NOT mean that the correct door was found or that the bounding box for the door is correct! We use this metric because we do not have access to test data with bounding box labels. Also, the pretrained model we are using generally provides great bounding box predictions, so if a door is detected it is likely the correct door with a good bounding box.

#### Results on the test set
We test the door detection algorithm on a test set of **24 images** named "big-streetview-pics." These images are screenshots from Google Images. Some of the images have easy-to-detect doors, but there are many images where the door is not easy to detect. That being said, it is reasonable to expect that a good model can detect a door in every image.

The algorithm gets a score of **58%**. However, in **9 images (38% of the test set)** it failed because it selected the wrong house as the target house (the house which we zoom into). I believe this problem can be almost completely fixed by following the improvement note written above about target house selection. Additionally, the cropped sections that the model performs predictions on are very blurry. I was unable to download the full resolution images from Google Street View, so the dataset uses lower resolution screenshots. I think that if the full resolution images are used the accuracy will be greatly improved.

To test this claim I created a new dataset named "big-streetview-pics-zoomed" which contains the same houses as in the original test set, but I zoomed in to the correct house before taking a screenshot. This dataset simulates the algorithm selecting the correct house (which I think is very achievable) and simulates having a higher resolution image. On this new dataset the algorithm gets a score of **96%**. Therefore, we see that increasing resolution and selecting the correct target house greatly improves performance.

But this is a misleadingly high accuracy. As mentioned above the accuracy only refers to the number of times a door was detected in an image. It relies on the assumption that because the pretrained model is good, whenever a door is detected it is a good prediction. However, looking at the results we see that is not the case.

The model found a door in **23/24** of the images. In the one case where the model was unable to identify a door, the door was only visible through the leaves of a tree. But among the remaining images where a door was found it still made mistakes. In **1** of the images it identified a side door instead of the front door. This can be fixed by tweaking the method by which we select our final bounding box from among all the bounding boxes found within all the croppings. Finally, the model was confused by **5** images. In these images it mislabeled objects as doors. The commonly mislabeled objects were garage doors and windows. However, in most of these images it was still able to identify the door; it just so happened that the mislabeled object had a better centrality score, causing it to be selected as the best prediction. This can be fixed by editing the process we use to select our final door bounding box and by tuning the model to better distinguish between doors, garage doors, and windows. Note that the Open Images V4 dataset which was used to train this model contained very few training examples of house front doors. The doors in its training data were often very unusual (weird side alley doors, doors on ancient buildings, etc.). Thus, some fine-tuning on common house front doors will likely provide a good improvement.
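
One possible first step towards comparing door boxes across croppings is to express them all in the coordinates of the original image. The sketch below does only that conversion. `crop_box_to_image` and its arguments are hypothetical names, and the pixel offsets are made-up values rather than output from this notebook.

```python
def crop_box_to_image(box, crop_origin, crop_size):
    """Convert a box relative to a cropping into original-image pixel coordinates.

    box: (ymin, xmin, ymax, xmax) as fractions of the cropping.
    crop_origin: (y0, x0) pixel position of the cropping's top-left corner
                 in the original image.
    crop_size: (height, width) of the cropping in pixels.
    """
    ymin, xmin, ymax, xmax = box
    y0, x0 = crop_origin
    crop_h, crop_w = crop_size
    return (y0 + ymin * crop_h, x0 + xmin * crop_w,
            y0 + ymax * crop_h, x0 + xmax * crop_w)


# Two overlapping 280x280 croppings, 84 pixels apart, that see the same door.
door_a = crop_box_to_image((0.5, 0.6, 1.0, 0.8), crop_origin=(120, 200), crop_size=(280, 280))
door_b = crop_box_to_image((0.5, 0.3, 1.0, 0.5), crop_origin=(120, 284), crop_size=(280, 280))
print(door_a)
print(door_b)
```

Once every detection is in the same coordinate system, boxes from different croppings that land on (nearly) the same pixels can be treated as votes for the same door, while a garage door or window seen in only one cropping would stand out.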

#### Conclusion
In conclusion, I believe that this is a good method for detecting doors. It is not production ready, but it can be greatly improved without too much effort. I believe that this approach is still worthwhile even if the model is modified (as mentioned in the improvement notes) to be able to detect smaller objects without the use of zooming or cropping. This is because zooming into the target house greatly lowers the number of predictions we are considering and thus greatly lowers the chance of getting a false positive door detection. Ideally, we would be able to modify the model to detect smaller objects and we would simply ignore all predictions which are not contained within the target house's bounding box. However, this is practically identical to our current method, since both the challenge of identifying the correct target house and the challenge of identifying the correct door bounding box remain. The only difference would be that predictions on the bounding boxes are done simultaneously and without excess computation. Thus, it would provide a significant speed improvement.
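
The idea of ignoring predictions that are not contained within the target house's bounding box could look roughly like the sketch below. It is an illustration only: `is_inside`, `doors_in_house`, and the `tolerance` parameter are hypothetical, and the boxes are made-up values in the model's `(ymin, xmin, ymax, xmax)` fractional format.

```python
def is_inside(inner, outer, tolerance=0.0):
    """True when box `inner` lies within box `outer`.

    Both boxes are (ymin, xmin, ymax, xmax) as fractions of the same image.
    `tolerance` is a hypothetical slack for boxes that poke slightly outside.
    """
    i_ymin, i_xmin, i_ymax, i_xmax = inner
    o_ymin, o_xmin, o_ymax, o_xmax = outer
    return (i_ymin >= o_ymin - tolerance and i_xmin >= o_xmin - tolerance and
            i_ymax <= o_ymax + tolerance and i_xmax <= o_xmax + tolerance)


def doors_in_house(door_boxes, house_box, tolerance=0.0):
    """Keep only the door boxes contained in the target house's box."""
    return [box for box in door_boxes if is_inside(box, house_box, tolerance)]


house = (0.20, 0.30, 0.90, 0.80)
doors = [
    (0.60, 0.50, 0.85, 0.58),  # on the target house
    (0.55, 0.05, 0.80, 0.12),  # on a neighbouring house
]
print(doors_in_house(doors, house))
```

Only the first door survives the filter here, since the second one lies entirely to the left of the house box.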

In [None]:
import os
from pprint import pprint
from six import BytesIO

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from PIL import Image, ImageColor, ImageDraw, ImageFont, ImageOps
from tqdm import tqdm

In [None]:
import keras
import math

In [None]:
import base64

# Utility Functions

In this section, we define a few functions that will be used for processing images and formatting the output prediction. You can safely skip this section and use the following functions as is:
* `format_prediction_string(image_id, result)`: `image_id` is the ID of the test image you are trying to label. `result` is the dictionary created from running a `tf.Session`. The output is a formatted output row (i.e. `{Label Confidence XMin YMin XMax YMax},{...}`), so we need to modify the order from Tensorflow, which is by default `YMin XMin YMax XMax` (Thanks to [Nicolas for discovering this](https://www.kaggle.com/nhlr21/tf-hub-bounding-boxes-coordinates-corrected/notebook)).
* `draw_boxes(image, boxes, class_names, scores, max_boxes=10, min_score=0.1)`: `image` is a numpy array representing an image, `boxes`, `class_names`, and `scores` are directly retrieved from the model predictions.
* `display_image(image)`: Display a numpy array representing an `image`.

In [None]:
def format_prediction_string(image_id, result):
    prediction_strings = []
    
    for i in range(len(result['detection_scores'])):
        class_name = result['detection_class_names'][i].decode("utf-8")
        YMin,XMin,YMax,XMax = result['detection_boxes'][i]
        score = result['detection_scores'][i]
        
        prediction_strings.append(
            f"{class_name} {score} {XMin} {YMin} {XMax} {YMax}"
        )
        
    prediction_string = " ".join(prediction_strings)

    return {
        "ImageID": image_id,
        "PredictionString": prediction_string
    }

In [None]:
def display_image(image):
    fig = plt.figure(figsize=(20, 15))
    plt.grid(False)
    plt.axis('off')
    plt.imshow(image)

The cell below shows how the intermediate function `draw_bounding_box_on_image` is constructed.

In [None]:
def draw_bounding_box_on_image(image,
                               ymin,
                               xmin,
                               ymax,
                               xmax,
                               color,
                               font,
                               thickness=4,
                               display_str_list=()):
    """Adds a bounding box to an image."""
    draw = ImageDraw.Draw(image)
    im_width, im_height = image.size
    (left, right, top, bottom) = (xmin * im_width, xmax * im_width,
                                  ymin * im_height, ymax * im_height)
    draw.line([(left, top), (left, bottom), (right, bottom), (right, top),
               (left, top)],
              width=thickness,
              fill=color)

    # If the total height of the display strings added to the top of the bounding
    # box exceeds the top of the image, stack the strings below the bounding box
    # instead of above.
    display_str_heights = [font.getsize(ds)[1] for ds in display_str_list]
    # Each display_str has a top and bottom margin of 0.05x.
    total_display_str_height = (1 + 2 * 0.05) * sum(display_str_heights)

    if top > total_display_str_height:
        text_bottom = top
    else:
        text_bottom = bottom + total_display_str_height
    # Reverse list and print from bottom to top.
    for display_str in display_str_list[::-1]:
        text_width, text_height = font.getsize(display_str)
        margin = np.ceil(0.05 * text_height)
        draw.rectangle([(left, text_bottom - text_height - 2 * margin),
                        (left + text_width, text_bottom)],
                       fill=color)
        draw.text((left + margin, text_bottom - text_height - margin),
                  display_str,
                  fill="black",
                  font=font)
        text_bottom -= text_height - 2 * margin

In [None]:
def draw_boxes(image, boxes, class_names, scores, max_boxes=10, min_score=0.1):
    """Overlay labeled boxes on an image with formatted scores and label names."""
    colors = list(ImageColor.colormap.values())

    try:
        
        # Originally fixed at 25. The font now scales with the image height:
        # the full images are generally wider than they are tall,
        # but our cropped sections are squares.
        font_size = math.ceil(image.shape[0] / 20)
        
        font = ImageFont.truetype(
            "/usr/share/fonts/truetype/liberation/LiberationSansNarrow-Regular.ttf",
            font_size)
    except IOError:
        print("Font not found, using default font.")
        font = ImageFont.load_default()

    for i in range(min(boxes.shape[0], max_boxes)):
        if scores[i] >= min_score:
            ymin, xmin, ymax, xmax = tuple(boxes[i].tolist())
            display_str = "{}: {}%".format(class_names[i].decode("ascii"),
                                           int(100 * scores[i]))
            color = colors[hash(class_names[i]) % len(colors)]
            image_pil = Image.fromarray(np.uint8(image)).convert("RGB")
            draw_bounding_box_on_image(
                image_pil,
                ymin,
                xmin,
                ymax,
                xmax,
                color,
                font,
                display_str_list=[display_str])
            np.copyto(image, np.array(image_pil))
    return image

# Understanding the model

The notebook this code is based on used a Single Shot MultiBox Detector (SSD) model with MobileNet v2 as the backbone (SSD+MobileNetV2); its URL is left commented out in the code further down. This notebook instead loads the Faster R-CNN detector with an Inception-ResNet v2 backbone (`faster_rcnn/openimages_v4/inception_resnet_v2`), which is trained on the same dataset. If you are not familiar with the SSD literature, check out [this article about SSD](https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11), and [this post explaining what's new with MobileNet v2](https://towardsdatascience.com/mobilenetv2-inverted-residuals-and-linear-bottlenecks-8a4362f4ffd5). On a high level, you can think of MobileNet as a lightweight CNN that extracts features from the image, and SSD as a method to efficiently scale a set of default bounding boxes around the targets.

The model is trained on [Open Images V4](https://ai.googleblog.com/2018/04/announcing-open-images-v4-and-eccv-2018.html), which is the dataset used for last year's competition. Fortunately, the labels are still the same, so the outputs of the model can be directly submitted to this competition. Here's what the author says about the dataset:

> Today, we are happy to announce Open Images V4, containing 15.4M bounding-boxes for 600 categories on 1.9M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8 per image on average; visualizer). 

For our implementation, it is important to note the following points:
* This model does NOT support fine-tuning.
* The model does NOT support batching, so the input has to be **ONE** image of shape `(1, height, width, 3)`.
* It is recommended to run this module on GPU to get acceptable inference times.
* The model is loaded directly from Tensorflow Hub, so this code might not work offline.

## Running the model on a Sample Image

Let's start by running the model on a single image. We will go through each step of the process afterwards.

In [None]:
sample_image_path = "../input/d/andreimuresanu/jpgstreetviewimages/51_prince_philip.jpg"

with tf.Graph().as_default():
    #the decoding step should probably be changed to use numpy or something
    
    # Create our inference graph
    image_string_placeholder = tf.placeholder(tf.string)
    decoded_image = tf.image.decode_jpeg(image_string_placeholder)
    decoded_image_float = tf.image.convert_image_dtype(
        image=decoded_image, dtype=tf.float32
    )
    # Expanding image from (height, width, 3) to (1, height, width, 3)
    image_tensor = tf.expand_dims(decoded_image_float, 0)

    # Load the model from tfhub.dev, and create a detector_output tensor
    #model_url = "https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1"
    model_url = "https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1"
    detector = hub.Module(model_url)
    detector_output = detector(image_tensor, as_dict=True)
    
    # Initialize the Session
    init_ops = [tf.global_variables_initializer(), tf.tables_initializer()]
    sess = tf.Session()
    sess.run(init_ops)

    # Load our sample image into a binary string
    with tf.gfile.Open(sample_image_path, "rb") as binfile:
        image_string = binfile.read()

    # Run the graph we just created
    result_out, image_out = sess.run(
        [detector_output, decoded_image],
        feed_dict={image_string_placeholder: image_string}
    )

Let's see what it looks like:

In [None]:
image_with_boxes = draw_boxes(
    np.array(image_out), result_out["detection_boxes"],
    result_out["detection_class_entities"], result_out["detection_scores"]
)
display_image(image_with_boxes)

## Crop/zoom utility functions

In [None]:
def visible_print(text):
    print("||||||||||", text, "||||||||||")

In [None]:
def angry_print(text):
    print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    print(text)
    print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")

In [None]:
def show_image(image, size=None):
    '''
    image is an array
    size is a tuple (width, height)
    '''
    if size is not None: plt.rcParams["figure.figsize"] = (size[0], size[1])
        
    plt.imshow(image, interpolation='nearest')
    plt.axis("off")
    
    plt.show()
    
    if size is not None: plt.rcParams['figure.figsize'] = plt.rcParamsDefault['figure.figsize'] #resetting the plot size

In [None]:
def getCrops(base_image, predictions, shift=0.3, height=0.7, min_score=0.1, debug=False):
    '''
    the cropped sections will be square
    base_image is a numpy array
    
    predictions['detection_boxes'][i] is in the format: ymin, xmin, ymax, xmax
    and every value is a percentage
    
    when debug is True it will display additional information, the zoomed target image, and the croppings themselves
    otherwise nothing is displayed
    '''
    
    #we will only have a single target class
    target_class = "House".encode('utf-8') #this is the class we will zoom into
    
    if target_class in predictions["detection_class_entities"]:
        if debug: print("base_image")
        if debug: show_image(base_image)

        #we will only zoom into the target_class which takes up the most area
        largest_area = -1
        largest_area_idx = None
        for i in range(len(predictions['detection_boxes'])):
            if (predictions['detection_class_entities'][i] == target_class
                    and predictions["detection_scores"][i] >= min_score):
                ymin, xmin, ymax, xmax = predictions['detection_boxes'][i]
                cur_area = (ymax - ymin) * (xmax - xmin)
                if cur_area > largest_area:
                    largest_area = cur_area
                    largest_area_idx = i
        
        if largest_area_idx is None:
            raise Exception(target_class.decode('utf-8') + " was not found with a confidence above " + str(min_score))
        
        if debug: print("largest_area", largest_area, "largest_area_idx", largest_area_idx)
        
        ymin, xmin, ymax, xmax = predictions['detection_boxes'][largest_area_idx]
        target_image = np.array(base_image[int(ymin*base_image.shape[0]):int(ymax*base_image.shape[0]),
                                           int(xmin*base_image.shape[1]):int(xmax*base_image.shape[1]),
                                           :]) #[y, x, z]
        
        if debug: print("target_image, shape:", target_image.shape)
        if debug: show_image(target_image)
        
        #target_image.shape := [height, width, colour channel count] 
        target_h = target_image.shape[0]
        target_w = target_image.shape[1]
        
        crop_h = int(target_h * height) #cropping height
        shift_size = int(shift * crop_h)
    
        crop_count = int((target_w - target_h * height) // shift_size)
    
        if debug:
            print("cropped images")
            print("cropping height:", crop_h)
            print("shift size:", shift_size)
            print(crop_count, "croppings")
        
        crop_img_list = []
        for i in range(crop_count):
            cur_xmin = i * shift_size
            cur_xmax = cur_xmin + crop_h
            
            if debug: print("target_image", target_image.shape)
            if debug: print("xmin", cur_xmin, "xmax", cur_xmax)
            
            crop_img = np.array(target_image[int((1-height)*target_image.shape[0]):, cur_xmin:cur_xmax, :]) #[y, x, z]

            crop_img_list.append(crop_img)
            if debug: show_image(crop_img)
            
        return crop_img_list, (ymin, xmin, ymax, xmax)
        
    else:
        raise Exception(target_class.decode('utf-8') + " was not found in the list of predictions")
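
To make the window arithmetic in `getCrops` easier to follow, the sketch below reproduces just the horizontal layout of the croppings for a given zoomed image size. `crop_windows` is a hypothetical helper written for illustration. It mirrors the `crop_h`, `shift_size`, and `crop_count` calculations above but adds a guard for a zero step size that the original function does not have.

```python
def crop_windows(target_h, target_w, shift=0.3, height=0.7):
    """Return the (xmin, xmax) pixel range of each square cropping.

    Mirrors the arithmetic in getCrops: every cropping keeps the bottom
    `height` fraction of the zoomed house image and slides horizontally
    by `shift` times the cropping size.
    """
    crop_h = int(target_h * height)    # side length of each square cropping
    shift_size = int(shift * crop_h)   # horizontal step between croppings
    if shift_size == 0:
        return []                      # image too small to slide a window
    crop_count = int((target_w - target_h * height) // shift_size)
    return [(i * shift_size, i * shift_size + crop_h) for i in range(crop_count)]


windows = crop_windows(target_h=400, target_w=1000)
print(len(windows))
print(windows[0], windows[-1])
```

For a 400 by 1000 pixel house image and the default parameters this gives 8 croppings of 280 pixels, stepped by 84 pixels. The last one ends at x = 868, so in this example the rightmost 132 pixels of the house image are not covered by any cropping.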

### Testing the cropping function on the sample image

In [None]:
# Convert the sample image we predicted on earlier to a numpy array
sample_img_np = np.asarray(Image.open(sample_image_path))

# getCrops returns both the croppings and the zoom range (ymin, xmin, ymax, xmax)
crop_img_list, zoom_range = getCrops(sample_img_np, result_out, debug=True)

# Predicting on the cropped image sections

In [None]:
def find_centrality(predictions, target_class="Door", min_score=0.1):
    '''
    Given the list of prediction results, this function returns a measure of centrality for the target_class.
    The closer to 0 the result is, the closer the predicted bounding box is to the center.
    A large result means the predicted bounding box is far from the center.
    If the result is -1, the target_class was not found in the list of predictions.
    This measure does not give directional information.
    
    We ignore all predictions with a lower confidence than min_score
    '''
    
    target_class = target_class.encode('utf-8')
    
    if target_class in predictions["detection_class_entities"]:
        
        #this loops through all the bounding boxes
        for i in range(len(predictions['detection_boxes'])):
            if (predictions['detection_class_entities'][i] == target_class and
                    predictions["detection_scores"][i] >= min_score):
                
                #we will assume that only one door is detected
                #(this means that if two doors are found we calculate centrality based on the first one)
                
                #min and max are percentages
                ymin, xmin, ymax, xmax = predictions['detection_boxes'][i]
                
                predicted_center = ((xmax - xmin) /2 + xmin)
                
                return abs(0.5 - predicted_center) #because xmin and xmax are percentages, 0.5 is the image center
            
        return -1
        
    else:
        return -1

In [None]:
def predict_on_directory(directory, verbosity=1, one_prediction=False):
    '''
    if one_prediction is True it will only predict on the first image in the directory then stop
    
    verbosity = 0: only the final accuracy is output
    verbosity = 1: final accuracy, base image with base predictions overlaid, best cropping, best cropping with predictions
                   overlaid
    verbosity = 2: everything from verbosity 1, every cropping, every cropping with predictions overlaid
    '''
    
    my_results = []

    total_predictions = 0
    total_correct = 0

    for filename in os.listdir(directory):
        total_predictions += 1

        # Load the image string
        image_path = os.path.join(directory, filename)
        if verbosity > 0: print(image_path)
        with tf.gfile.Open(image_path, "rb") as binfile:
            image_string = binfile.read()


        #Here we make a prediction to find where the house is located in the image
        #We use this to know which portion of the image to zoom into

        # Run our session
        result_out = sess.run(
            detector_output,
            feed_dict={image_string_placeholder: image_string}
        )


        #We must check to see if a house was detected
        #(we will only zoom/crop if there is a house in the image)
        house_class = "House".encode('utf-8')

        if house_class in result_out["detection_class_entities"]:

            sample_img_np = np.asarray(Image.open(image_path)) #get numpy version of the image
            #zoom_range := (ymin, xmin, ymax, xmax)
            crop_img_list, zoom_range = getCrops(sample_img_np, result_out) #get all the croppings


            if verbosity > 0:
                visible_print("base prediction")

                image_with_boxes = draw_boxes(
                    sample_img_np.copy(), result_out["detection_boxes"],
                    result_out["detection_class_entities"], result_out["detection_scores"]
                )
                show_image(image_with_boxes, (16,10))


                zoom_image = np.array(sample_img_np[int(zoom_range[0]*sample_img_np.shape[0]):
                                                       int(zoom_range[2]*sample_img_np.shape[0]),
                                                    int(zoom_range[1]*sample_img_np.shape[1]):
                                                       int(zoom_range[3]*sample_img_np.shape[1]),
                                                    :]) #[y, x, z]

                visible_print("zoom image")
                show_image(zoom_image)    


            best_centrality = -1 #-1 means that the object wasn't detected
            best_centrality_idx = None
            crop_res_list = []
            for i, crop_img in enumerate(crop_img_list):

                #convert the numpy cropping to the correct format
                #Something clever can probably be done to avoid this saving
                #But the best option would be to modify the input of graph (the decoding step)

                crop_pil = Image.fromarray(crop_img)
                crop_pil.save("temp_img.jpg")

                with tf.gfile.Open("./temp_img.jpg", "rb") as binfile:
                    crop_string = binfile.read()

                # Run our session
                crop_res_out = sess.run(
                    detector_output,
                    feed_dict={image_string_placeholder: crop_string}
                )

                crop_res_list.append(crop_res_out)

                cur_centrality = find_centrality(crop_res_out)

                if verbosity > 1:
                    visible_print("detected classes for cropping " + str(i+1) + "/" + str(len(crop_img_list)))
                    print("centrality:", cur_centrality)

                if cur_centrality != -1 and (best_centrality == -1 or cur_centrality < best_centrality):
                    best_centrality = cur_centrality
                    best_centrality_idx = i


                if verbosity > 1:
                    show_image(crop_img)

                    image_with_boxes = draw_boxes(
                        crop_img.copy(), crop_res_out["detection_boxes"],
                        crop_res_out["detection_class_entities"], crop_res_out["detection_scores"]
                    )
                    show_image(image_with_boxes)



            #finding the best cropping

            if verbosity > 0: visible_print("best cropping")

            if best_centrality != -1:
                if verbosity > 0:
                    show_image(crop_img_list[best_centrality_idx])

                    image_with_boxes = draw_boxes(
                        crop_img_list[best_centrality_idx].copy(), crop_res_list[best_centrality_idx]["detection_boxes"],
                        crop_res_list[best_centrality_idx]["detection_class_entities"], crop_res_list[best_centrality_idx]["detection_scores"]
                    )
                    show_image(image_with_boxes)

                total_correct += 1

            else:
                if verbosity > 0: angry_print("no door found in " + filename)

        else:
            if verbosity > 0: angry_print("no house found in " + filename)

        if one_prediction: break

    angry_print("final accuracy: " + str(100 * total_correct / total_predictions) + "%")

### Predicting on the cropped sections of the image for door detection. Using a small dataset

In [None]:
predict_on_directory("../input/raw-street-view", verbosity=2)

# Testing on a larger dataset

In [None]:
predict_on_directory("../input/big-streetview-pics", verbosity=1)

# Testing on a zoomed in version of the previous dataset
We do this to see if an increased resolution improves results. Zooming in improves resolution because we are not working with the original full images; we are only using screenshots of Google Maps.

In [None]:
predict_on_directory("../input/big-streetview-pics-zoomed", verbosity=1)

## End the session

In [None]:
sess.close()