# TensorFlow Great Barrier Reef

This notebook is a quick exploration of the training data from the tensorflow-great-barrier-reef competition. Feel free to reuse it.

**Objectives**
1. Load the data
2. Get a high-level view of the data structure
3. Get a high-level view of data distribution
4. Visualise some images with starfish

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches

## Get the Data

In [None]:
# Get the data

path = '../input/tensorflow-great-barrier-reef/'
train = pd.read_csv(path+'train.csv')
test = pd.read_csv(path+'test.csv')

# Add the image path to the data
# image_id has the form '<video_id>-<video_frame>', so the part after '-' is the frame file name

train['image_path'] = path + 'train_images/video_' + train['video_id'].astype(str) + '/' + train['image_id'].apply(lambda x: x.split('-')[1]) + '.jpg'

# Reorganise columns

cols = train.columns[:-2].tolist()+['image_path']+[train.columns[-2]]
train = train[cols]

print('First rows of training data:\n')
print(train.head(), '\n')

print('Training data types:\n')
print(train.dtypes)
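
As a minimal sketch of how `image_path` is assembled, the toy frame below mimics the `video_id` and `image_id` columns. The values are made up for illustration, and the `<video_id>-<video_frame>` layout of `image_id` is assumed from the split used in the cell above.

```python
import pandas as pd

# Toy data: illustrative values, not taken from train.csv
toy = pd.DataFrame({'video_id': [0, 1], 'image_id': ['0-16', '1-4159']})

base = '../input/tensorflow-great-barrier-reef/train_images/'

# Keep the part of image_id after the dash, i.e. the frame number
frame_number = toy['image_id'].str.split('-').str[1]

toy['image_path'] = base + 'video_' + toy['video_id'].astype(str) + '/' + frame_number + '.jpg'

print(toy['image_path'].tolist())
```

The vectorised `.str.split('-').str[1]` is equivalent to the `apply(lambda ...)` form used in the notebook; either works at this data size.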

## Exploratory Data Analysis

### Training data

In [None]:
# How many videos are there in the training set?

print('Number of unique videos in the training set: {}\n'.format(train['video_id'].nunique()))

# How many frames are there in each video of the training set?

print('Number of video frames per video in the training set:')
train.groupby('video_id', as_index=False)['video_frame'].count()

In [None]:
# How does the presence of a starfish show up in the data?

mask_starfish = train['annotations'] != '[]'
train[mask_starfish].head()

In [None]:
# "annotations" is stored as a string. Let's parse it into a list of dicts
# It will help later
# ast.literal_eval only accepts Python literals, so it is safer than eval

import ast

train['annotations'] = train['annotations'].apply(ast.literal_eval)
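
A minimal sketch of what this conversion does, run on made-up annotation strings. The `x`, `y`, `width` and `height` keys match the ones used in the plotting cell at the end of the notebook; the numbers are illustrative. `ast.literal_eval` only accepts Python literals, so it cannot execute arbitrary code hidden in a string.

```python
import ast
import pandas as pd

# Illustrative annotation strings: no box, one box, two boxes
raw = pd.Series([
    '[]',
    "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}]",
    "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}, {'x': 5, 'y': 6, 'width': 7, 'height': 8}]",
])

# Each string becomes a list of dicts, one dict per bounding box
parsed = raw.apply(ast.literal_eval)

# Once parsed, counting boxes per frame is a simple len()
n_boxes = parsed.apply(len)
print(n_boxes.tolist())
```

After parsing, an empty list means no starfish in the frame, which is what the `is_starfish` flag below relies on.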

In [None]:
# How are starfish distributed in the training data?

train['is_starfish'] = train['annotations'].apply(lambda x: 1 if len(x)>0 else 0)

plt.figure(figsize=(8, 6))
plt.title('Distribution of frames showing a starfish [1] vs. frames without starfish [0]')
plt.xlabel('Presence of a starfish');
plt.xticks([0, 1]);
plt.ylabel('Number of video frames');
train['is_starfish'].hist();
plt.grid(False);

In [None]:
# How many bounding boxes are there per frame, among frames showing a starfish?

plt.figure(figsize=(8, 6))
plt.title('Distribution of the number of starfish per video frame in the training data')
# Histogram of the per-frame box counts themselves (not of their value counts)
train[mask_starfish]['annotations'].apply(len).hist();
plt.xlabel('Number of bounding boxes per video frame');
plt.ylabel('Number of video frames');
plt.grid(False)

### Visualise some random images

In [None]:
# Pick a random frame in the training data
frame = train.sample(n=1)
n_starfish = len(frame['annotations'].iloc[0])
if n_starfish > 0:
    print('This frame contains {} starfish.'.format(n_starfish))
else:
    print('There is no starfish in this frame.')

plt.figure(figsize=(15, 10))
img_path = frame['image_path'].tolist()[0]
img = plt.imread(img_path)
ax = plt.gca()

# The sampled row already carries its own annotations
annotations = frame['annotations'].iloc[0]
for bbox in annotations:
    x, y, w, h = bbox['x'], bbox['y'], bbox['width'], bbox['height']
    rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='darkorange', facecolor='orange', alpha=.5)
    ax.add_patch(rect)
plt.imshow(img);
plt.axis('off');