# Explore the dataset


In this notebook, we will perform an exploratory data analysis (EDA) of the processed Waymo dataset (the data in the `processed` folder). In the first part, you will create a function to display images with their bounding boxes.

In [1]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import math
import numpy as np
import sys
import os

from utils import get_dataset

%matplotlib inline

In [None]:
# The dir argument is expected to be a glob pattern, see: https://github.com/tensorflow/models/blob/bce895d9c18fb47d9c304ebd49c58c03671c67a3/research/object_detection/builders/dataset_builder.py#L78
# Documentation on glob patterns (tf.io.gfile.glob): https://www.tensorflow.org/api_docs/python/tf/io/gfile/glob

tfrecord_path = "/app/project/data/*.tfrecord"
dataset = get_dataset(tfrecord_path)
dataset_size = dataset.reduce(0, lambda count, _: count + 1).numpy()

## Write a function to display an image and the bounding boxes

Implement the `display_instances` function below. This function takes a dataset as input and displays images with their corresponding bounding boxes. The only requirement is that the classes should be color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).

In [None]:
def display_instances(dataset, num_images=16):
    """
    This function takes a dataset with object detection examples [1]
    and displays images with the associated bounding boxes.

    Args:
        dataset: A `tf.data.Dataset` object with object detection examples [1]
        num_images: Number of example images to display

    [1] https://github.com/tensorflow/models/blob/a77f240fc39cd9baa2ab897af5fcec5551a0e85a/research/object_detection/data_decoders/tf_example_decoder.py#L499
    """
    # Class id -> box color (vehicles in red, pedestrians in blue, cyclists in green)
    clrs = {1: 'red', 2: 'blue', 4: 'green'}

    cols = 4
    rows = math.ceil(num_images / cols)
    # squeeze=False keeps axs two-dimensional even when there is only one row
    fig, axs = plt.subplots(rows, cols, figsize=(12, 12), squeeze=False)
    for ax in axs.flat:
        # Hide the ticks on every grid cell, including cells left without an image
        ax.get_yaxis().set_visible(False)
        ax.get_xaxis().set_visible(False)

    for n, ex in enumerate(dataset.take(num_images)):
        ax = axs[n // cols, n % cols]

        im = ex["image"].numpy()
        ax.imshow(im)

        boxes = ex["groundtruth_boxes"].numpy()
        classes = ex["groundtruth_classes"].numpy()

        for box, label in zip(boxes, classes):
            # Boxes are normalized [ymin, xmin, ymax, xmax]; scale them to pixels
            xy = box[1] * im.shape[1], box[0] * im.shape[0]
            h = (box[2] - box[0]) * im.shape[0]
            w = (box[3] - box[1]) * im.shape[1]
            # Class ids missing from clrs fall back to yellow instead of raising a KeyError
            color = clrs.get(int(label), 'yellow')
            rec = mpatches.Rectangle(xy, w, h, linewidth=1, edgecolor=color, facecolor='none')
            ax.add_patch(rec)

    plt.subplots_adjust(wspace=0.01, hspace=0.01, top=1, bottom=0, left=0, right=1)
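
The coordinate conversion inside `display_instances` can be checked in isolation. The sketch below assumes, as the function above does, that each box is stored as normalized `[ymin, xmin, ymax, xmax]` values in the range 0 to 1. The helper name `box_to_pixel_rect` and the example box are hypothetical and only serve as an illustration.

```python
def box_to_pixel_rect(box, image_height, image_width):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to (x, y, width, height) in pixels."""
    ymin, xmin, ymax, xmax = box
    x = xmin * image_width            # left edge in pixels
    y = ymin * image_height           # top edge in pixels
    width = (xmax - xmin) * image_width
    height = (ymax - ymin) * image_height
    return x, y, width, height


# Made-up box covering the central quarter of a 640x640 image.
rect = box_to_pixel_rect([0.25, 0.25, 0.75, 0.75], image_height=640, image_width=640)
print(rect)  # (160.0, 160.0, 320.0, 320.0)
```

The returned tuple matches the `(xy, width, height)` arguments that `mpatches.Rectangle` takes, with the origin at the top-left corner of the image.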

## Display 10 images 

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods `take` and `shuffle` on the dataset.

In [None]:
# dataset_size is 15947. The dataset cannot be fully shuffled
# because 32 GB of RAM is not enough to hold it in the shuffle
# buffer, so we resort to a smaller buffer_size.
buffer_size = 1000
dataset_shuffled = dataset.shuffle(buffer_size, seed=0, reshuffle_each_iteration=False)

In [None]:
display_instances(dataset_shuffled, 10)
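
A `buffer_size` smaller than the dataset means the shuffle is only approximate: `shuffle` fills a buffer with `buffer_size` elements and then repeatedly emits a random element from it, replacing that element with the next input. The pure-Python sketch below is a simplified model of that behavior, not TensorFlow's implementation; the function name `buffered_shuffle` is hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order a bounded shuffle buffer could produce (assumes buffer_size >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


order = list(buffered_shuffle(range(20), buffer_size=4))
# Output position k can only hold an element whose input index is below k + buffer_size.
assert all(value < k + 4 for k, value in enumerate(order))
assert sorted(order) == list(range(20))
```

With `buffer_size = 1000` and records read file by file, the first images displayed therefore come from roughly the first thousand records rather than from the whole dataset.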

## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...
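
One possible starting point is the class distribution. The sketch below counts object instances per class from per-image label sequences. The helper `count_classes` is hypothetical, the class names are inferred from the color legend used earlier in this notebook (1, 2 and 4), and the example labels are made up for illustration.

```python
from collections import Counter

# Label-to-name mapping assumed from the color legend earlier in this notebook.
CLASS_NAMES = {1: "vehicle", 2: "pedestrian", 4: "cyclist"}


def count_classes(class_arrays):
    """Count object instances per class over an iterable of per-image label sequences."""
    counts = Counter()
    for labels in class_arrays:
        counts.update(int(label) for label in labels)
    return counts


# Made-up labels for three images; the last image contains no objects.
example_labels = [[1, 1, 2], [1, 4], []]
counts = count_classes(example_labels)
for label, n in sorted(counts.items()):
    print(CLASS_NAMES.get(label, f"class {label}"), n)
```

On the real data, the same function could presumably be fed with `(ex["groundtruth_classes"].numpy() for ex in dataset.take(1000))`, and the resulting counts plotted as a bar chart.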