# Depth Estimation and Object Detection

In this notebook, we will be working on a depth estimation and object detection task. The goal is to estimate the depth of objects in an image and identify what those objects are.

We will be using several libraries to accomplish this task:

- **OpenCV**: Open Source Computer Vision Library, which includes several hundred computer vision algorithms.
- **scikit-image**: A collection of algorithms for image processing.
- **timm**: (PyTorch Image Models) includes a slew of models/architectures for PyTorch.
- **gTTS**: (Google Text-to-Speech) a Python interface for Google's Text to Speech API.
- **torchvision**: consists of popular datasets, model architectures, and common image transformations for computer vision.

Let's start by installing these libraries.

In [1]:
!pip install opencv-python
!pip install scikit-image
!pip install timm
!pip install gTTS
!pip install torchvision




Now that we have installed the necessary libraries, let's import them.

In [2]:

import cv2
import skimage
from skimage.io import imread, imshow
from skimage.color import rgb2hsv
import numpy as np
import matplotlib.pyplot as plt
import torch
import ipywidgets as widgets
from IPython.display import display
from PIL import Image
import io
from gtts import gTTS
from IPython.display import Audio
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision import transforms


In this task, we will be segmenting the depth map into different ranges and assigning a color to each range. We will also be providing a verbal description for each range. Let's define the ranges, their corresponding colors, and the verbal descriptions.

In [3]:

# Define the ranges for depth segmentation
ranges = [[0.0, 0.2], [0.2, 0.4], [0.4, 0.6], [0.6, 0.8], [0.8, 1.0]]

# Define verbal descriptions for each range
distance_descriptions = ["far", "mid-range", "close", "very close", "extremely close"]

# Define colors for each range
colors = [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0], [0, 255, 255]]


COCO_INSTANCE_CATEGORY_NAMES = [
    'background', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

Next, we will define a function to segment the depth map into the defined ranges and color each segment with the corresponding color. This function will also count the number of pixels in each segment.

In [4]:
def segment_and_color_depth_map(depth_map, ranges, colors):
    color_map = np.zeros((*depth_map.shape, 3), dtype=np.uint8)
    counts = []

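    for i, (r, color) in enumerate(zip(ranges, colors)):
        # Ranges are (low, high]; the first range also includes its lower
        # bound so that pixels equal to 0.0 are not left uncolored.
        lower = depth_map >= r[0] if i == 0 else depth_map > r[0]
        mask = lower & (depth_map <= r[1])
        color_map[mask] = color
        counts.append(np.count_nonzero(mask))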

    return color_map, counts
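
As a rough sketch of an alternative, the same bucketing can be expressed with `np.digitize`, which assigns every pixel to exactly one bin, including pixels equal to 0.0. The names `edges`, `palette`, and `segment_with_digitize` are assumptions made for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# Inner bin edges and palette mirror the `ranges` and `colors` lists above
edges = np.array([0.2, 0.4, 0.6, 0.8])
palette = np.array(
    [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0], [0, 255, 255]],
    dtype=np.uint8,
)

def segment_with_digitize(depth_map):
    # right=True places a value equal to an edge in the lower bin, e.g. 0.2 -> bin 0
    bins = np.digitize(depth_map, edges, right=True)
    color_map = palette[bins]  # look up one RGB triple per pixel
    counts = np.bincount(bins.ravel(), minlength=len(palette)).tolist()
    return color_map, counts

# Tiny synthetic depth map, already normalized to [0, 1]
toy_depth = np.array([[0.0, 0.1, 0.3], [0.5, 0.7, 1.0]])
toy_colors, toy_counts = segment_with_digitize(toy_depth)
print(toy_counts)
```

Because each pixel lands in exactly one bin, the counts always sum to the number of pixels in the depth map.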

We will be using the MiDaS model for depth estimation. MiDaS is a monocular depth estimation model from Intel Labs, published in the `intel-isl/MiDaS` repository. The model takes an RGB image as input and produces a relative inverse depth map as output. The values are not metric distances: they only order the pixels relative to one another, and larger values correspond to points closer to the camera. This is why the highest range defined above is labeled "extremely close".

Let's load the MiDaS model and move it to the GPU if available.

The MiDaS model requires the input image to be transformed in a certain way before it can be passed into the model. The required transformations are provided by the MiDaS team and can be loaded from their GitHub repository.

Let's load these transformations.

For object detection, we will be using the Faster R-CNN model with a ResNet-50 backbone pre-trained on the COCO dataset. The COCO dataset is a large-scale object detection, segmentation, and captioning dataset that contains over 200,000 labeled images.

Let's load the object detection model and move it to the GPU if available.

In [5]:

# Load the depth estimation model
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
midas.to(device)
midas.eval()

# Define the transformations for depth estimation
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
if model_type == "DPT_Large" or model_type == "DPT_Hybrid":
    transform_midas = midas_transforms.dpt_transform
else:
    transform_midas = midas_transforms.small_transform

# Load the object detection model
# (on torchvision >= 0.13, `weights="DEFAULT"` replaces the deprecated `pretrained=True`)
model_detection = fasterrcnn_resnet50_fpn(pretrained=True)
model_detection = model_detection.to(device)
model_detection.eval()


Using cache found in /root/.cache/torch/hub/intel-isl_MiDaS_master
Using cache found in /root/.cache/torch/hub/intel-isl_MiDaS_master


FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(
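
In evaluation mode, torchvision's detection models return one dictionary per input image with `boxes`, `labels`, and `scores` entries, ordered by decreasing score. The sketch below shows one way to keep only the confident detections. `filter_detections` is a hypothetical helper, and it assumes the tensors have already been moved to the CPU; plain arrays stand in for them here.

```python
import numpy as np

def filter_detections(prediction, names, threshold=0.5):
    """Hypothetical helper: (name, score, box) for each detection above threshold."""
    scores = np.asarray(prediction["scores"])
    keep = scores > threshold  # boolean mask; an all-False mask simply yields []
    boxes = np.asarray(prediction["boxes"])[keep]
    labels = np.asarray(prediction["labels"])[keep]
    return [
        (names[int(label)], float(score), box.tolist())
        for label, score, box in zip(labels, scores[keep], boxes)
    ]

# Toy stand-in for predictions[0]; real values come from the detector
toy_prediction = {
    "boxes": np.array([[10.0, 20.0, 200.0, 400.0], [5.0, 5.0, 50.0, 50.0]]),
    "labels": np.array([1, 3]),
    "scores": np.array([0.92, 0.31]),
}
toy_names = ["background", "person", "bicycle", "car"]
confident = filter_detections(toy_prediction, toy_names)
print(confident)
```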

The object detection model also requires the input image to be transformed in a certain way before it can be passed into the model. We will define these transformations using the `transforms` module from `torchvision`.

We will be using widgets to allow the user to upload an image and run the depth estimation and object detection on the uploaded image. Let's create these widgets.

In [6]:
# Define the transformations for object detection
transform_detection = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((800, 800)),
    transforms.ToTensor()
])

upload_button = widgets.FileUpload()
run_button = widgets.Button(description="Run depth estimation")
output_widget = widgets.Output()
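
The shape of `FileUpload.value` depends on the ipywidgets version: 7.x exposes a dict keyed by filename, while 8.x exposes a tuple of dicts. A minimal sketch of a version-tolerant reader follows; `first_upload_bytes` is a hypothetical helper introduced here, not part of ipywidgets.

```python
def first_upload_bytes(value):
    """Hypothetical helper: raw bytes of the first uploaded file, or None if empty."""
    if not value:
        return None
    # ipywidgets 7.x: {filename: {"metadata": ..., "content": bytes}}
    # ipywidgets 8.x: ({"name": ..., "content": memoryview, ...}, ...)
    first = next(iter(value.values())) if isinstance(value, dict) else value[0]
    return bytes(first["content"])

# Stand-ins for the two value layouts
old_style = {"photo.jpg": {"metadata": {}, "content": b"\xff\xd8"}}
new_style = ({"name": "photo.jpg", "content": memoryview(b"\xff\xd8")},)
print(first_upload_bytes(old_style) == first_upload_bytes(new_style))
```

Returning `None` for an empty value lets a callback show a message instead of failing when the button is clicked before a file is uploaded.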


Now, we will define a function that will be executed when the 'Run depth estimation' button is clicked. This function will perform the following steps:

1. Clear the output widget.
2. Retrieve the uploaded file and read it as an image.
3. Convert the image to RGB format.
4. Apply the necessary transformations and run depth estimation using the MiDaS model.
5. Normalize the output depth map.
6. Apply the necessary transformations and run object detection using the Faster R-CNN model.
7. Extract the bounding boxes and labels of the detected objects.
8. Segment and color the depth map using the function we defined earlier.
9. Determine the dominant distance range in the depth map and generate a verbal description.
10. If at least one object was detected, generate a verbal description of the highest-scoring object, using the dominant distance range of the whole image as its approximate distance from the camera.
11. Convert the verbal description to speech using the gTTS library and play the speech.
12. Display the depth map and the colored depth map in the output widget.
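
Step 5 is a min-max rescaling to the interval [0, 1]. The sketch below shows that step in isolation, with a guard for the edge case of a constant depth map, where the denominator would be zero. `normalize_depth` and its `eps` parameter are assumptions made for this illustration.

```python
import numpy as np

def normalize_depth(depth, eps=1e-8):
    """Min-max scale to [0, 1]; a constant map becomes all zeros."""
    low, high = depth.min(), depth.max()
    span = high - low
    if span < eps:
        # No depth variation: avoid dividing by zero
        return np.zeros_like(depth, dtype=np.float32)
    return ((depth - low) / span).astype(np.float32)

raw = np.array([[2.0, 4.0], [6.0, 10.0]])
scaled = normalize_depth(raw)
print(scaled.min(), scaled.max())
```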

In [7]:

def on_run_button_clicked(b):
    output_widget.clear_output()
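    # ipywidgets 7.x exposes uploads as a dict keyed by filename
    uploaded_file = list(upload_button.value.values())[0]
    image_data = uploaded_file['content']
    image = Image.open(io.BytesIO(image_data))
    # PIL decodes to RGB order already, so a BGR2RGB swap would corrupt the
    # channels; convert("RGB") also handles grayscale and RGBA uploads
    image_rgb = np.array(image.convert("RGB"))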

    # Apply transformations and run depth estimation
    input_batch = transform_midas(image_rgb).to(device)
    with torch.no_grad():
        prediction = midas(input_batch)
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=image_rgb.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()
    output = prediction.cpu().numpy()
    output = (output - output.min()) / (output.max() - output.min())

    # Apply transformations and run object detection
    input_batch_detection = transform_detection(image_rgb).unsqueeze(0).to(device)
    with torch.no_grad():
        predictions = model_detection(input_batch_detection)

    # Keep detections scoring above 0.5; a boolean mask avoids the IndexError
    # that indexing an empty list raises when nothing clears the threshold
    keep = predictions[0]['scores'].cpu().numpy() > 0.5
    pred_boxes = [[(i[0], i[1]), (i[2], i[3])] for i in predictions[0]['boxes'].cpu().numpy()[keep]]
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in predictions[0]['labels'].cpu().numpy()[keep]]

    color_map, counts = segment_and_color_depth_map(output, ranges, colors)
    dominant_description = distance_descriptions[np.argmax(counts)]
    dominant_range_text = f"The dominant distance is: {dominant_description}"

    if pred_class:
        object_description = f"The object in front of you is a {pred_class[0]} and it's {dominant_description}"
        print(object_description)
        # gTTS produces MP3 audio, so save with a matching extension
        tts = gTTS(object_description)
        tts.save('object_description.mp3')
        sound_file = 'object_description.mp3'
        display(Audio(sound_file, autoplay=True))

    with output_widget:
        plt.imshow(output)
        plt.show()
        plt.imshow(color_map)
        plt.show()
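
The callback above describes the detected object using the dominant range of the whole image. As a possible refinement, the depth could be sampled inside the object's bounding box instead. The sketch below is one way to do that, assuming the box is given in the coordinates of the 800 x 800 detector input and the depth map has the original image size. `describe_box_distance` and `detector_size` are hypothetical names introduced for this example.

```python
import numpy as np

ranges = [[0.0, 0.2], [0.2, 0.4], [0.4, 0.6], [0.6, 0.8], [0.8, 1.0]]
distance_descriptions = ["far", "mid-range", "close", "very close", "extremely close"]

def describe_box_distance(depth_map, box, detector_size=(800, 800)):
    """Hypothetical helper: median depth inside a box -> (description, depth).

    `box` is ((x1, y1), (x2, y2)) in detector-input pixels;
    `detector_size` is (height, width) of that input.
    """
    height, width = depth_map.shape
    scale_x = width / detector_size[1]
    scale_y = height / detector_size[0]
    (x1, y1), (x2, y2) = box
    # Rescale to depth-map pixels and clamp so the slice is never empty
    left = int(np.clip(x1 * scale_x, 0, width - 1))
    top = int(np.clip(y1 * scale_y, 0, height - 1))
    right = int(np.clip(x2 * scale_x, left + 1, width))
    bottom = int(np.clip(y2 * scale_y, top + 1, height))
    median_depth = float(np.median(depth_map[top:bottom, left:right]))
    # Upper edges of all but the last range; right=True keeps (low, high] bins
    upper_edges = [r[1] for r in ranges[:-1]]
    index = int(np.digitize(median_depth, upper_edges, right=True))
    return distance_descriptions[index], median_depth

# Synthetic depth map: left half far (0.0), right half near (0.9)
toy_depth = np.zeros((100, 200))
toy_depth[:, 100:] = 0.9
near_label, _ = describe_box_distance(toy_depth, ((400, 0), (800, 800)))
far_label, _ = describe_box_distance(toy_depth, ((0, 0), (400, 800)))
print(near_label, far_label)
```

The median is used rather than the mean so that background pixels near the box border have less influence on the result.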




Finally, let's display the widgets. The user can now upload an image and click the 'Run depth estimation' button to run depth estimation and object detection on the uploaded image.

In [8]:
run_button.on_click(on_run_button_clicked)
display(upload_button)
display(run_button)
display(output_widget)

FileUpload(value={}, description='Upload')

Button(description='Run depth estimation', style=ButtonStyle())

Output()

The object in front of you is a person and it's close


