# Explore the dataset


In this notebook, we will perform exploratory data analysis (EDA) on the processed Waymo dataset (the data in the `processed` folder). In the first part, you will create a function to display an image and its bounding boxes.

In [None]:
from utils import get_dataset
from matplotlib.patches import Rectangle
import copy
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

%matplotlib inline

In [None]:
# Load the records from the workspace data files
dataset = get_dataset("/home/workspace/data/*/*.tfrecord")

## Write a function to display an image and the bounding boxes

Implement the `display_images` function below. This function takes a batch as input and displays an image with its corresponding bounding boxes. The only requirement is that the classes are color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).
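
The implementation in this notebook treats each box as normalized `[ymin, xmin, ymax, xmax]` values and scales them by the image height and width before drawing. The following is a minimal sketch of that scaling step on a made-up box; the helper name `to_pixel_boxes` and the 640x640 image size are assumptions chosen for illustration, not part of the project code.

```python
import numpy as np


def to_pixel_boxes(norm_boxes, height, width):
    """Scale normalized [ymin, xmin, ymax, xmax] boxes to pixel units.

    Hypothetical helper for illustration; it works on a copy so the
    input array is left untouched.
    """
    boxes = np.asarray(norm_boxes, dtype=np.float32).copy()
    boxes[:, [0, 2]] *= height  # ymin, ymax scale with image height
    boxes[:, [1, 3]] *= width   # xmin, xmax scale with image width
    return boxes


# One made-up box covering the lower-right part of a 640x640 image
norm = np.array([[0.25, 0.5, 0.75, 1.0]])
pixel_boxes = to_pixel_boxes(norm, height=640, width=640)
```

With pixel coordinates in hand, a matplotlib rectangle is anchored at `(xmin, ymin)` with width `xmax - xmin` and height `ymax - ymin`.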

In [None]:
def display_images(batch):
    """
    Take a batch from the dataset and display the image with
    the associated bounding boxes.
    """
    # To inspect a single record: display_images(next(iter(dataset.take(1))))
    
    #extract values of interest from batch record
    classes = batch['groundtruth_classes'].numpy()
    bboxes = batch['groundtruth_boxes'].numpy()
    image = batch['image'].numpy()
    height, width, _ = image.shape

    # scale the normalized bboxes to pixel coordinates (work on a copy)
    model_bboxes = copy.deepcopy(bboxes)
    model_bboxes[:, [0,2]] = model_bboxes[:, [0, 2]] * height
    model_bboxes[:, [1,3]] = model_bboxes[:, [1, 3]] * width
    #color to class map --> red is car, blue is pedestrian, and green is cyclist
    color_map = {1: 'red', 2: 'blue', 4: 'green'}
    
    #prepare visual
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.set_xlabel("Width(px)")
    ax.set_ylabel("Height(px)")
    
    #add image to plot
    ax.imshow(image)
    
    #add bboxes to image
    for bbox, cl in zip(model_bboxes, classes):
        y1, x1, y2, x2 = bbox
        # Rectangle is imported directly from matplotlib.patches above
        rec = Rectangle((x1, y1), x2 - x1, y2 - y1, facecolor='none', edgecolor=color_map[cl])
        ax.add_patch(rec)

## Display 10 images 

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods `take` and `shuffle` on the dataset.
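
One detail worth keeping in mind: `tf.data.Dataset` methods such as `shuffle` and `take` return a new dataset rather than changing the one they are called on. The sketch below illustrates this on a small toy dataset built with `tf.data.Dataset.range`, which stands in for the Waymo records; the buffer size and seed are arbitrary choices for the example.

```python
import tensorflow as tf

# Toy dataset with the elements 0..9, standing in for the real records
ds = tf.data.Dataset.range(10)

# Calling shuffle() without keeping the result leaves `ds` unchanged
ds.shuffle(10, seed=0)
unchanged = [int(x.numpy()) for x in ds.take(3)]

# Keeping (or chaining) the returned dataset is what applies the shuffle
shuffled = ds.shuffle(10, seed=0)
sample = [int(x.numpy()) for x in shuffled.take(3)]

print("first three, original order:", unchanged)
print("three shuffled elements:", sample)
```

A buffer size at least as large as the dataset gives a full shuffle; a smaller buffer only mixes elements within a sliding window.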

In [None]:
# shuffle() returns a new dataset, so chain it with take() instead of discarding the result
lucky_imgs = dataset.shuffle(99).take(10)

for batch in lucky_imgs:
    display_images(batch)

plt.show()

## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...
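
As a starting point for looking at the class distribution, the tallying step can be tried on synthetic label arrays before running it over the real records. This is a minimal sketch: the helper `count_classes` and the sample labels are made up for illustration, and the ID-to-name mapping (1, 2, 4) follows the color map used earlier in this notebook.

```python
from collections import Counter

import numpy as np

# Class IDs as used in this notebook's color map
CLASS_NAMES = {1: "Car", 2: "Pedestrian", 4: "Cyclist"}


def count_classes(label_arrays):
    """Tally class IDs across an iterable of per-image label arrays.

    Hypothetical helper; a Counter does not raise on IDs that were
    not anticipated, unlike a dict with fixed keys.
    """
    counts = Counter()
    for labels in label_arrays:
        counts.update(int(cl) for cl in np.asarray(labels).ravel())
    return counts


# Synthetic labels for three images (not real Waymo data)
fake_labels = [np.array([1, 1, 2]), np.array([4]), np.array([1, 2])]
counts = count_classes(fake_labels)
named = {CLASS_NAMES.get(k, f"unknown({k})"): v for k, v in counts.items()}
print(named)
```

On the real dataset, the same function could be fed `batch['groundtruth_classes'].numpy()` for each record.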

In [None]:
# Data distribution: how many objects of each class are present in the dataset?

class_names = ["Car", "Pedestrian", "Cyclist"]
class_counts = {'1': 0, '2':0, '4':0}

class_pos = np.arange(len(class_names))

# shuffle() returns a new dataset; keep the result instead of discarding it
b_size = 10000
data = dataset.shuffle(100).take(b_size)

for batch in data:
    gt_classes = batch['groundtruth_classes'].numpy()
    for cl in gt_classes:
        class_counts[str(cl)]+=1

plt.bar(class_pos,[class_counts['1'],class_counts['2'],class_counts['4']])
plt.xticks(class_pos, class_names)
plt.show()