# TensorFlow - Help Protect the Great Barrier Reef

# Table of Contents
1. Introduction
2. Load Dataset & EDA
3. Bounding Box Analysis
    1. Ratio between Frames with Objects and Frames with No Objects
    2. How to decode annotations from the dataframe loaded from the train.csv file
    3. Summary of Number of Boxes in Each Frame and Distributions of Number of Boxes
    4. Bounding Box Visualization
    5. Distribution of Bounding Box Center Coordinates on Image
4. Sequence Preview
    1. Generate Annotated Video from Sequences

# Introduction

<p style="text-align: justify;">In this competition, our goal is to predict the presence and position of crown-of-thorns starfish in sequences of underwater images taken at various times and locations around the Great Barrier Reef. Predictions take the form of a bounding box together with a confidence score for each identified starfish. An image may contain zero or more starfish. Our model should evaluate the images in the same order as they were recorded in the video.</p>

<img src="https://storage.googleapis.com/kaggle-media/competitions/Google-Tensorflow/video_thumb_kaggle.png" style="width:720px;height:480px"></img>

[Data Metadata](https://www.kaggle.com/c/tensorflow-great-barrier-reef/data): Based on the competition's data description, the structure of the data can be summarized as follows:
1. image_id has the format: video_id + "-" + video_frame
2. the bounding box format is (xmin, ymin, width, height) in pixels; the annotations are a list of dictionaries with the structure {x, y, width, height}
3. there are occasional gaps in the frame numbers within a video
4. a sequence is a gap-free subset of a video's frames; the sequence IDs themselves are not meaningfully ordered
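
To make the two formats above concrete, here is a minimal sketch that splits an image_id and converts one annotation dictionary to corner coordinates. The helper names `split_image_id` and `xywh_to_xyxy` are made up for this illustration and are not part of the competition API.

```python
from typing import List, Tuple


def split_image_id(image_id: str) -> Tuple[int, int]:
    """Split an image_id such as '0-16' into (video_id, video_frame)."""
    video_id, video_frame = image_id.split("-")
    return int(video_id), int(video_frame)


def xywh_to_xyxy(box: dict) -> List[int]:
    """Convert {x, y, width, height} to [xmin, ymin, xmax, ymax] in pixels."""
    return [box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"]]


# Illustrative values only (hypothetical image_id, box taken from the example format).
video_id, video_frame = split_image_id("0-16")
corners = xywh_to_xyxy({"x": 540, "y": 310, "width": 113, "height": 105})
print(video_id, video_frame, corners)  # 0 16 [540, 310, 653, 415]
```

Most drawing functions expect the corner form, which is why the conversion shows up again in the visualization code later on.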

<p style="text-align: justify;">If you have any suggestions or spot any issues with this interpretation, please point them out in the comments. Thank you! I hope this notebook helps you get a better understanding of the data.</p>

In [None]:
import os

from PIL import Image, ImageDraw
import cv2
import re
import pandas as pd
import numpy as np
from tqdm import tqdm

from matplotlib import pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
from IPython.display import Video, display

# Load Dataset & EDA

In [None]:
dataset = {
    'root_dir': '../input/tensorflow-great-barrier-reef',
    'train_csv': '../input/tensorflow-great-barrier-reef/train.csv',
    'test_csv': '../input/tensorflow-great-barrier-reef/test.csv',
    'sample_submission_csv': '../input/tensorflow-great-barrier-reef/example_sample_submission.csv',
    'video_img_dir': '../input/tensorflow-great-barrier-reef/train_images'
}

In [None]:
train_csv = pd.read_csv(dataset['train_csv'])
test_csv = pd.read_csv(dataset['test_csv'])

In [None]:
print("number of frames:", len(train_csv))

In [None]:
train_csv.head()

<p style="text-align: justify;">Let's first see how many frames there are for each video. Note that the frames might not be consecutive (there might be gaps within each video).</p>

In [None]:
frame_counts = train_csv['video_id'].value_counts().sort_values().to_frame()
frame_counts.head()

<p style="text-align: justify;">Then, let's run a sanity check to make sure the number of images in each video folder matches the number of records in the train file.</p>

In [None]:
# compare the number of train.csv records with the number of image files for each video
for vid in range(3):
    num_records = (train_csv['video_id'] == vid).sum()
    num_images = len(os.listdir(os.path.join(dataset['video_img_dir'], f'video_{vid}')))
    print(f"number of records in video_{vid} matched:", num_records == num_images)

<p style="text-align: justify;">There are 20 distinct sequences in the dataset. A sequence is a gap-free subset of a given video, so next we count the number of frames in each sequence.</p>

In [None]:
sequence_counts = train_csv['sequence'].value_counts().sort_values().reset_index()
sequence_counts.columns = ['sequence', 'num_frames']  # flat list; a nested list would create a MultiIndex
print("number of sequences:", len(sequence_counts))
sequence_counts.head()
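
The gap-free property of sequences can also be checked directly. The sketch below is one possible way to do it, shown on made-up rows rather than the real train.csv; the function name `find_gappy_sequences` is hypothetical, and it assumes each record carries the `sequence` and `video_frame` fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_gappy_sequences(records: Iterable[dict]) -> List[int]:
    """Return the IDs of sequences whose video_frame values are not consecutive."""
    frames_by_seq: Dict[int, List[int]] = defaultdict(list)
    for rec in records:
        frames_by_seq[rec["sequence"]].append(rec["video_frame"])

    gappy = []
    for seq_id, frames in frames_by_seq.items():
        frames.sort()
        # A gap-free sequence steps by exactly 1 between neighbouring frames.
        if any(b - a != 1 for a, b in zip(frames, frames[1:])):
            gappy.append(seq_id)
    return sorted(gappy)


# Toy rows, made up for illustration (not taken from train.csv).
toy_rows = [
    {"sequence": 101, "video_frame": 0},
    {"sequence": 101, "video_frame": 1},
    {"sequence": 101, "video_frame": 2},
    {"sequence": 202, "video_frame": 10},
    {"sequence": 202, "video_frame": 12},  # frame 11 is missing
]
print(find_gappy_sequences(toy_rows))  # [202]
```

On the real data, the records could be obtained with `train_csv[['sequence', 'video_frame']].to_dict('records')`; an empty result would confirm that every sequence is gap-free.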

# Bounding Box Analysis

In [None]:
num_no_obj_frame = train_csv[train_csv.annotations == '[]']['annotations'].count()
print("number of frames without objects:", num_no_obj_frame)

In [None]:
num_with_obj_frame = train_csv[train_csv.annotations != '[]']['annotations'].count()
print("number of frames with objects:", num_with_obj_frame)

<p style="text-align: justify;">From the counts above, the number of frames without objects is about 3.7 times the number of frames with objects; only about 21% of the frames in the training data contain objects.</p>
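
The two figures quoted above come from simple arithmetic on the two counts. A small sketch, using hypothetical round numbers instead of the real counts (the helper name `imbalance_summary` is also made up):

```python
def imbalance_summary(num_with: int, num_without: int) -> dict:
    """Summarize the balance between annotated and empty frames."""
    if num_with <= 0:
        raise ValueError("num_with must be positive to compute a ratio")
    total = num_with + num_without
    return {
        "fraction_with_objects": num_with / total,
        "empty_to_annotated_ratio": num_without / num_with,
    }


# Hypothetical counts for illustration only, not the real dataset values.
summary = imbalance_summary(num_with=20, num_without=80)
print(summary)  # {'fraction_with_objects': 0.2, 'empty_to_annotated_ratio': 4.0}
```

With the real data, `num_with_obj_frame` and `num_no_obj_frame` would be passed in instead.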

In [None]:
train_csv[train_csv.annotations != '[]'].head()

## Ratio between Frames with Objects and Frames with No Objects

In [None]:
print('ratio of frames with objects:', num_with_obj_frame / len(train_csv))

fig, axes = plt.subplots(1,1, figsize=(12, 6))

sns.barplot(ax=axes, x=['Number of Frames with Objects', 'Number of Frames with No Objects'], y=[num_with_obj_frame, num_no_obj_frame])
axes.set_title("Distribution of Frames with/without Objects")
axes.set_xlabel("Frame Types")
axes.set_ylabel("Count")

plt.show()

<p style="text-align: justify;">Let's write a function that reads the annotations and converts them to a format that can be used by an object detection algorithm. We also count the boxes in each frame to get a better understanding of the bounding box information.</p>


## Decode Annotations

In [None]:
# matches one box dictionary, e.g. {'x': 540, 'y': 310, 'width': 113, 'height': 105}
BOX_PATTERN = re.compile(r"\{'\w':\s\d+,\s'\w':\s\d+,\s'\w+':\s\d+,\s'\w+':\s\d+\}")
VAL_PATTERN = re.compile(r"\d+")


def decode_annotation(annot_line):
    """Convert an annotation string to a list of [x, y, width, height, confidence] boxes."""
    # annot_line example: [{'x': 540, 'y': 310, 'width': 113, 'height': 105}, {'x': 657, 'y': 501, 'width': 95, 'height': 56}]
    boxes = []

    for annot in BOX_PATTERN.findall(annot_line):
        x, y, width, height = (float(v) for v in VAL_PATTERN.findall(annot))
        confidence = 1.0  # ground-truth boxes get a fixed confidence

        boxes.append([x, y, width, height, confidence])

    return boxes


def count_boxes(annot_line):
    """Count the bounding boxes in an annotation string."""
    return len(BOX_PATTERN.findall(annot_line))


def test_decode_annotation(annot_line):
    print("sample:", annot_line)
    boxes = decode_annotation(annot_line)
    for i, box in enumerate(boxes):
        print(f"box {i}:", box)


In [None]:
test_samples = [
    "[{'x': 540, 'y': 310, 'width': 113, 'height': 105}, {'x': 657, 'y': 501, 'width': 95, 'height': 56}, {'x': 257, 'y': 101, 'width': 42, 'height': 59}]",
    "[{'x': 540, 'y': 310, 'width': 113, 'height': 105}, {'x': 657, 'y': 501, 'width': 95, 'height': 59}]",
    "[{'x': 12, 'y': 250, 'width': 143, 'height': 82}]",
    "[]"
]

for i, sample in enumerate(test_samples):
    num_boxes = count_boxes(sample)
    print(f"Test {i+1}:", f"found {num_boxes} boxes")
    
    test_decode_annotation(sample)
    print("")
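
As an alternative to the regex-based decoder above, the annotation strings look like Python literals, so they could presumably also be parsed with `ast.literal_eval` from the standard library. This is only a sketch under that assumption; the function name `decode_annotation_literal` is hypothetical and it is not the method used in the rest of this notebook.

```python
import ast
from typing import List


def decode_annotation_literal(annot_line: str) -> List[List[float]]:
    """Parse an annotation string into [x, y, width, height, confidence] rows."""
    boxes = []
    # literal_eval only evaluates Python literals; it does not execute arbitrary code.
    for item in ast.literal_eval(annot_line):
        boxes.append([
            float(item["x"]),
            float(item["y"]),
            float(item["width"]),
            float(item["height"]),
            1.0,  # fixed confidence for ground-truth boxes, mirroring decode_annotation
        ])
    return boxes


sample = "[{'x': 12, 'y': 250, 'width': 143, 'height': 82}]"
print(decode_annotation_literal(sample))  # [[12.0, 250.0, 143.0, 82.0, 1.0]]
print(decode_annotation_literal("[]"))    # []
```

The trade-off is that this version looks keys up by name, so it does not depend on the order of the keys inside each dictionary, whereas the regex relies on the exact layout of the string.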

## Number of Boxes for each Frame

In [None]:
train_csv['num_boxes'] = train_csv['annotations'].apply(count_boxes)

In [None]:
train_csv[train_csv.annotations != '[]'].head()

<p style="text-align: justify;">Among the frames that contain objects, most have only one bounding box, and 3 frames contain 18 bounding boxes.</p>

In [None]:
boxes_dist = train_csv[train_csv.annotations != '[]']['num_boxes'].value_counts().sort_values(ascending=False).reset_index()
boxes_dist.columns = ['num_boxes', 'num_frames']
boxes_dist

## Distribution of Number of Boxes (Frame Counts)

In [None]:
fig = plt.figure(figsize=(24, 8))
sns.barplot(x=boxes_dist.num_boxes, y=boxes_dist.num_frames)

plt.title("Box Distribution")
plt.xlabel("Number of Boxes")
plt.ylabel("Frame Counts")

plt.show()

<p style="text-align: justify;">Let's draw the bounding boxes on one sample image for each box count found in the previous step, to get a feel for the data.</p>

## Bounding Box Visualization

In [None]:
def gen_file_path(image_id):
    # extract file path by using the image_id in the train file
    video_id = image_id.split('-')[0]
    image_id = image_id.split('-')[1]
    return os.path.join(dataset['video_img_dir'], 'video_' + video_id, image_id + '.jpg')

def draw_boxes(image_path, annot_line):
    """Return a copy of the image with its annotated boxes drawn in red."""
    boxes = decode_annotation(annot_line)

    # convert (xmin, ymin, width, height) to (xmin, ymin, xmax, ymax) corners
    coords = [[x, y, x + w, y + h] for x, y, w, h, _ in boxes]

    image = Image.open(image_path)
    imgcp = image.copy()
    imgcp_draw = ImageDraw.Draw(imgcp)

    for coord in coords:
        imgcp_draw.rectangle(coord, fill=None, outline="red", width=5)

    return imgcp

In [None]:
train_csv['file_path'] = train_csv['image_id'].apply(gen_file_path)

In [None]:
train_csv.head()

In [None]:
# total number of bounding boxes
total_num_boxes = train_csv.num_boxes.sum()
total_num_boxes

In [None]:
# extract the first sample in each group
samples = train_csv.groupby('num_boxes').first()
samples

In [None]:
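plt.figure(figsize=(24, 36))

r, c = 7, 3
# enumerate the samples so the subplot position does not depend on the num_boxes values
for i, (num_boxes, row) in enumerate(samples.iterrows()):
    image_path = row['file_path']
    annot_line = row['annotations']
    plt.subplot(r, c, i + 1)
    dimg = draw_boxes(image_path, annot_line)
    plt.imshow(dimg)
    plt.title(f"{num_boxes} boxes")

plt.tight_layout()
plt.show()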

## Box Location Visualization

In [None]:
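all_boxes_xy = []  # box centers
all_boxes_wh = []  # box (width, height)

for index, row in tqdm(train_csv.iterrows(), total=len(train_csv)):
    if row['annotations'] != '[]':
        boxes = decode_annotation(row['annotations'])
        
        for box in boxes:
            # annotations store the top-left corner, so shift by half the box size to get the center
            all_boxes_xy.append([box[0] + box[2] / 2, box[1] + box[3] / 2])
            all_boxes_wh.append([box[2], box[3]])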
            
all_boxes_xy = np.array(all_boxes_xy)
all_boxes_wh = np.array(all_boxes_wh)

## Summary Statistics: Box Center Coordinates and Box Area

In [None]:
box_center_df = pd.DataFrame.from_records(all_boxes_xy, columns=['x', 'y'])

box_shape_df  = pd.DataFrame.from_records(all_boxes_wh, columns=['width', 'height'])
box_shape_df['area'] = box_shape_df['width'] * box_shape_df['height']

<p style="text-align: justify;">Summary statistics of the box (x, y) coordinates</p>

In [None]:
box_center_df.describe()

<p style="text-align: justify;">Summary statistics of the box width, height, and area</p>

In [None]:
box_shape_df.describe()
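
Because the annotations store the top-left corner (xmin, ymin), the center and area of a box follow from simple arithmetic: the center is (x + width / 2, y + height / 2) and the area is width * height. A minimal sketch with hypothetical helper names; the 1280x720 frame size used for normalization is an assumption and should be checked against the actual images.

```python
from typing import Tuple


def box_center_and_area(x: float, y: float, width: float, height: float) -> Tuple[float, float, float]:
    """Return (center_x, center_y, area) for a box given by its top-left corner and size."""
    return x + width / 2, y + height / 2, width * height


def normalize_center(cx: float, cy: float, image_width: int, image_height: int) -> Tuple[float, float]:
    """Scale a pixel center to the [0, 1] range relative to the image size."""
    return cx / image_width, cy / image_height


cx, cy, area = box_center_and_area(540, 310, 113, 105)
print(cx, cy, area)  # 596.5 362.5 11865

# Assumed frame size of 1280x720; verify with the real images before relying on it.
print(normalize_center(640, 360, image_width=1280, image_height=720))  # (0.5, 0.5)
```

Normalized centers are convenient when comparing box positions across images of different sizes or when feeding coordinates to detectors that expect relative values.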

## Distribution of Box Center on Image (all boxes)

In [None]:
plt.figure(figsize=(28, 16))
plt.scatter(x=all_boxes_xy[:,0], y=all_boxes_xy[:,1], s=0.5, color = 'red')
plt.gca().invert_yaxis()  # image coordinates grow downward, so flip the y axis to match the image layout
plt.title("Distribution of Box Center Coordinates on Image")
plt.xlabel("X value")
plt.ylabel("Y value")
plt.show()

# Sequence Preview

In [None]:
# calculate the number of boxes in each sequence
# sequences 29424, 37114 and 44160 do not contain any objects
train_csv.groupby('sequence')['num_boxes'].sum().sort_values(ascending=False).to_frame().T

In [None]:
# sequence length
train_csv.groupby('sequence')['image_id'].count().sort_values(ascending=False).to_frame().T

In [None]:
# pick one sequence, convert it to a video, and draw the annotations on the video
sample_seq = train_csv[train_csv.sequence == 22643]
sample_seq

In [None]:
def convert_frames_to_video(files, annotations, save_to, fps):
    """Draw the annotated boxes on each frame and write the frames to a video file."""
    out = None

    print(f"writing to {save_to}")
    for filename, annot_line in tqdm(zip(files, annotations), total=len(files)):
        img = cv2.imread(filename)  # BGR image as a numpy array
        if img is None:
            raise FileNotFoundError(f"could not read {filename}")

        if out is None:
            # create the writer once the frame size is known
            height, width = img.shape[:2]
            out = cv2.VideoWriter(save_to, cv2.VideoWriter_fourcc(*'DIVX'), fps, (width, height))

        for x, y, w, h, _ in decode_annotation(annot_line):
            # cv2 expects integer pixel coordinates and BGR colors, so (255, 0, 0) is blue
            cv2.rectangle(img, (int(x), int(y)), (int(x + w), int(y + h)), (255, 0, 0), 5)

        # write each frame right away instead of holding the whole sequence in memory
        out.write(img)

    if out is not None:
        out.release()

At this stage, you may need to download the output video to view it.

In [None]:
convert_frames_to_video(sample_seq['file_path'].values.tolist(), sample_seq['annotations'].values.tolist(), './sequence.avi', 25)