# Introduction

Welcome to the first of a series of notebooks on the TensorFlow Great Barrier Reef Competition. This first notebook is an exploratory data analysis (EDA) with animations, meant to give a better idea of how to process and display the data.

# Part 1. EDA

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import os
import ast
import seaborn as sns
import cv2
from matplotlib import animation, rc
import matplotlib.pyplot as plt
%matplotlib inline
rc('animation', html='jshtml')

In [None]:
train = pd.read_csv('../input/tensorflow-great-barrier-reef/train.csv')
test = pd.read_csv('../input/tensorflow-great-barrier-reef/test.csv')

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
test.head()

In [None]:
test.tail()

In [None]:
train.info()

In [None]:
test.info()

In [None]:
print(train.duplicated().sum())
print(test.duplicated().sum())

We can see that there are no missing values and no duplicated rows. However, note that the annotations are stored as strings. We need to use ast.literal_eval to convert them into Python lists of dictionaries.

In [None]:
train['annotations_list'] = train['annotations'].apply(ast.literal_eval)
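To make the conversion concrete, here is a minimal sketch of what ast.literal_eval does to a single annotation string. The string below is made up for illustration; only the key names (x, y, width, height) are taken from how the boxes are used later in this notebook.

```python
import ast

# Made-up annotation string, shaped like one cell of the 'annotations' column
raw = "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}, {'x': 10, 'y': 20, 'width': 30, 'height': 40}]"

# literal_eval safely parses Python literals (lists, dicts, numbers, strings)
boxes = ast.literal_eval(raw)
print(type(boxes).__name__, len(boxes))
print(boxes[0]["x"], boxes[0]["width"])

# A frame without starfish is stored as the string "[]" and parses to an empty list
empty = ast.literal_eval("[]")
print(len(empty))
```

Unlike eval, literal_eval refuses anything that is not a plain literal, which makes it the safer choice for parsing a column read from a CSV file.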

Displaying a few countplots to see distributions:

In [None]:
plt.figure(figsize=(15, 15))
plt.title("Number Of Frames Containing X Starfish")
starfish_count = sns.countplot(x = train['annotations_list'].apply(len))
starfish_count.bar_label(starfish_count.containers[0])

We can see here that most frames contain no starfish; some contain one, and a few contain as many as 18.

In [None]:
video_count = sns.countplot(x = train['video_id'])
video_count.bar_label(video_count.containers[0])
plt.title("Number Of Frames For Each Video ID")

The frames are split roughly equally among the videos, with video_id 2 holding the most.

In [None]:
for i in train['video_id'].unique():
    plt.figure(figsize=(10, 10))
    plt.title("Number of Starfish In Each Frame of Video " + str(i))
    temp = train[train['video_id'] == i]
    starfish_count_vid = sns.countplot(x = temp['annotations_list'].apply(len))
    starfish_count_vid.bar_label(starfish_count_vid.containers[0])
    plt.show()

We see here that video 0 contains relatively few starfish, while videos 1 and 2 contain significantly more. In addition, starfish in video 0 appear to stay on screen for longer than those in videos 1 and 2.
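The per-video comparison can also be checked numerically rather than read off the plots. The following is a minimal sketch on a toy DataFrame; the values are invented and the column names n_starfish and has_starfish are introduced here only for illustration. The same groupby should carry over to train once annotations_list exists.

```python
import pandas as pd

# Toy stand-in for train; the values are invented for illustration
box = {"x": 1, "y": 2, "width": 3, "height": 4}
toy = pd.DataFrame({
    "video_id": [0, 0, 0, 1, 1, 2],
    "annotations_list": [[], [box], [], [], [box, box], []],
})

# Number of boxes per frame, and whether the frame has any box at all
toy["n_starfish"] = toy["annotations_list"].apply(len)
toy["has_starfish"] = toy["n_starfish"] > 0

# One row per video: frame count, annotated frames, total and max boxes
summary = toy.groupby("video_id").agg(
    frames=("n_starfish", "count"),
    frames_with_starfish=("has_starfish", "sum"),
    total_starfish=("n_starfish", "sum"),
    max_per_frame=("n_starfish", "max"),
)
print(summary)
```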

In [None]:
train['temp'] = list(zip(train.video_id, train.sequence))
sns.countplot(x = train['temp'])
plt.xticks(rotation = 90)

Lastly, we can see that there are 20 different sequences across the videos, with the most frames coming from sequences 8503, 29859, 37114, and 60754. Notably, those sequences come primarily from video IDs 1 and 2; video 0 instead has its frames spread more evenly across its sequences.
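As a rough cross-check of the plot, the (video_id, sequence) pairs can be tallied directly. This sketch uses only the standard library and invented IDs; in the notebook the two lists would come from train.video_id and train.sequence.

```python
from collections import Counter

# Invented IDs for illustration; one entry per frame
video_ids = [0, 0, 0, 1, 1, 2]
sequences = [101, 101, 102, 201, 201, 301]

# Frames per (video_id, sequence) pair, mirroring the zipped 'temp' column
frames_per_sequence = Counter(zip(video_ids, sequences))
for (vid, seq), n_frames in frames_per_sequence.most_common():
    print(f"video {vid}, sequence {seq}: {n_frames} frames")

# Number of distinct sequences in each video
sequences_per_video = Counter(vid for vid, _ in frames_per_sequence)
print(dict(sequences_per_video))
```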

With this out of the way, we can move on to the videos themselves.

In [None]:
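def createVideo(video_id, num_frames, start_frame):
    # Build a list of RGB frames with the annotated starfish boxes drawn in red
    video = []
    directory = ('../input/tensorflow-great-barrier-reef/train_images/video_' + str(video_id) + '/')
    for i in range(num_frames):
        frame = start_frame + i
        image = cv2.imread(directory + str(frame) + '.jpg', 1)
        rows = train[(train['video_id'] == video_id) & (train['video_frame'] == frame)]
        # cv2.imread returns None for a missing file; skip frames without an image or a CSV row
        if image is None or rows.empty:
            continue
        # OpenCV loads images as BGR, while matplotlib expects RGB
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        starfish_box = rows.iloc[0].annotations_list
        for j in starfish_box:
            # Annotations give the top-left corner plus width and height
            x0, y0, x1, y1 = (j['x'], j['y'], j['x'] + j['width'], j['y'] + j['height'])
            cv2.rectangle(image, (x0, y0), (x1, y1), (255,0,0), 3)
        video.append(image)
    return video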
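The box arithmetic inside createVideo can be isolated into a small helper. This is a sketch only: to_corners is a hypothetical name, the clipping to the image bounds is an addition that the original code does not perform, and the 1280x720 frame size is an assumption about the images.

```python
def to_corners(box, img_w, img_h):
    """Convert an {x, y, width, height} box to (x0, y0, x1, y1) corners.

    Corners are clipped to the image bounds (an added safeguard,
    not something the original notebook code does).
    """
    x0 = max(0, box["x"])
    y0 = max(0, box["y"])
    x1 = min(img_w - 1, box["x"] + box["width"])
    y1 = min(img_h - 1, box["y"] + box["height"])
    return x0, y0, x1, y1

# Assumed frame size; adjust to the actual image shape
IMG_W, IMG_H = 1280, 720

inside = to_corners({"x": 100, "y": 50, "width": 30, "height": 20}, IMG_W, IMG_H)
at_edge = to_corners({"x": 1270, "y": 710, "width": 30, "height": 30}, IMG_W, IMG_H)
print(inside)
print(at_edge)
```

The resulting tuple splits into the two corner points that cv2.rectangle expects, (x0, y0) and (x1, y1).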

In [None]:
def showVideo(video_array):
    # Animate a list of RGB frames inline at roughly 24 frames per second
    fig = plt.figure(figsize=(9, 9))
    plt.axis('off')
    im = plt.imshow(video_array[0])
    def animate_func(i):
        # Swap in the i-th frame instead of redrawing the whole figure
        im.set_array(video_array[i])
        return [im]
    return animation.FuncAnimation(fig, animate_func, frames = len(video_array), interval = 1000 // 24)

In [None]:
video_array = createVideo(video_id = 0, num_frames = 100, start_frame = 0)
showVideo(video_array)

Credit to Diego Gomez for the inspiration: https://www.kaggle.com/diegoalejogm/great-barrier-reefs-eda-with-animations

Next time, we proceed with a baseline model. See you then!

UPDATE 1/1/22: Added in sequence countplot. 