# Protecting the Great Barrier Reef 

In this notebook, I will present an approach to object detection using deep computer vision, for the purpose of solving the Kaggle challenge by TensorFlow: finding the crown-of-thorns starfish that are damaging the ecology of the Great Barrier Reef.

## Imports

In [None]:
!pip install --user webcolors  # used to convert RGB values to color names (in English)

In [None]:
import ast, os, sys

import matplotlib.pyplot as plt
import numpy as np  # array operations
import pandas as pd 
from PIL import Image  # image processing
import seaborn as sns  # another plotting library
import scipy  # scientific computing 
import tensorflow as tf  # AI/ML
import webcolors 

In [None]:
# plot pretty figures
%matplotlib inline
sns.set_style('darkgrid')

### Double-check Versions

In [None]:
print(sys.version)  # Python version
print(tf.__version__) # on Kaggle, this will be 2.6

## Exploratory Data Analysis

Here I will share what I've found interesting when exploring this dataset so far, since it might help contextualize the decisions I make later on, when implementing the machine learning models.

*Abbreviations*:     
- COTS = "Crown-of-thorns starfish"

Note: for the `data_path` variable - I clicked on the "copy" button icon next to the dataset folder icon in the "Data" tab (on the Kaggle kernel).

In [None]:
data_path = '../input/tensorflow-great-barrier-reef'
df = pd.read_csv(f"{data_path}/train.csv")

In [None]:
df.head(56)  # show enough rows to see what annotations look like

### How to Be an Observant Surveyor

My goal in this analysis is to build an object-detection system that can scale up the efforts of manual surveyors in the Great Barrier Reef. With that in mind... what makes a human surveyor great at spotting COTS in the first place?

**Question 1**: Do the COTS tend to lump close together?

*Part 1:* On average, how many COTS are seen together in a single video frame?
To do this, let's start by first adding a column with the counts of COTS seen in each particular frame:

In [None]:
type(df['annotations'][17])  # although the data type visually looks like a list, the CSV is all text

We know there may be multiple COTS spotted in a single frame, so let's count them up in a new column. Each bounding box is written as its own `{...}` dictionary, so we'll count the `{` characters to know how many COTS are in each frame: 

In [None]:
count_func = lambda string: string.count('{')
spotted = df['annotations'].apply(count_func)
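Counting `{` characters works because each bounding box is written as one dictionary. As a cross-check, the sketch below parses the text with `ast.literal_eval` and counts the resulting list entries instead. The sample strings are made up for illustration, and `count_boxes` is a hypothetical helper rather than part of the dataset tooling.

```python
import ast


def count_boxes(annotation_text):
    """Parse the annotation text into a list of dicts and return its length."""
    return len(ast.literal_eval(annotation_text))


# Made-up rows in the same textual format as the `annotations` column
samples = [
    "[]",
    "[{'x': 559, 'y': 213, 'width': 50, 'height': 32}]",
    "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}, {'x': 5, 'y': 6, 'width': 7, 'height': 8}]",
]

parsed_counts = [count_boxes(s) for s in samples]
brace_counts = [s.count('{') for s in samples]
print(parsed_counts, brace_counts)
```

If the two lists ever disagree on the real data, that would point to a row whose text is not a plain list of dictionaries.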

In [None]:
df = df.assign(starfish_spotted=spotted)
df.head()

Cool beans! Now we can calculate the average of COTS spotted in a given frame:

In [None]:
mean_starfish_spotted_in_a_frame = round(df["starfish_spotted"].mean(), 4)
print(f"On average, {mean_starfish_spotted_in_a_frame} COTS are seen together in a single video frame.")

Wowza, that seems very low. Let's also visualize the distribution of the `starfish_spotted` column using a histogram and PDF:

In [None]:
def plot_histogram_from_df(df, column, title, x_axis, y_axis):
    """
    Plots the PDF of a column in a given DataFrame, using Matplotlib.
    
    Credit for the equation used for plotting the PDF goes to the NumPy documentation:
        https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
    
    Args:
        df(pandas.DataFrame)
        column(str): name of the column being plotted
        title(str), x_axis(str), y_axis(str): will be added to the plot
        
    Returns: None
    """
    # A: calculate the mean and std dev of the column
    mu, sigma = df[column].mean(), df[column].std()
    # B: init the histogram (plt.hist returns the bin heights, the bin edges, and the patches)
    _, bin_edges, _ = plt.hist(df[column], density=True)
    # C: plot the PDF of a normal distribution, evaluated at the bin edges
    plt.plot(bin_edges, 1/(sigma * np.sqrt(2 * np.pi)) *
               np.exp(-(bin_edges - mu)**2 / (2 * sigma**2)),
             linewidth=2, color='r')
    # D: make the plot more presentable
    plt.title(title)
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    plt.show()
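For reference, the red curve drawn in step C is the density of a normal distribution with the column's sample mean $\mu$ and sample standard deviation $\sigma$, evaluated at each bin edge $x$:

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

It is only a visual reference: nothing here checks that the data actually follows a normal distribution.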

In [None]:
plot_histogram_from_df(df, 'starfish_spotted', "PDF of COTS Spotted in Video Frames", "No. of COTS", "Probability")

One takeaway on this: the distribution of COTS per video frame is heavily skewed, and the majority of them have none at all. This reinforces the idea that we'll want to weigh the `recall` highly in evaluating the eventual model we build, so we can detect the relatively low number of COTS that exist per image.
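To make the recall point concrete, here is a minimal sketch of how precision and recall are computed from detection counts. The counts are invented for illustration, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Invented example: 8 COTS found, 2 false alarms, 4 COTS missed
precision, recall = precision_recall(8, 2, 4)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A model that misses COTS lowers its recall even if every detection it does make is correct.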

*Part 2:* On average, how many video frames do we go without seeing any COTS in the provided videos?

With this question I am trying to get another measurement of how close the groups of COTS are to one another. The approach I'll take here is to gather the distribution of the lengths of consecutive runs of frames in the training data in which zero COTS are spotted.

Note that one limitation of this approach is that certain frames might show the same location on the Great Barrier Reef (since we don't know if the camera-person is always moving). Regardless, I think we'll go ahead with this approach anyway, since I believe it's reasonable to assume the camera is moving for most of the time in the given videos; therefore, the number of frames between COTS sightings is like a "proxy" for how close together they are.

In [None]:
def zero_sequence_lengths(a):
    """Compute the lengths of the sequences of consecutive zeros in an array.
    
    This is a modification of code by Warren Weckesser, originally posted on Stack Overflow:
    https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array
    
    Example:
        >>> a = np.array([1, 2, 3, 0, 0, 0, 0, 0, 0, 4, 5, 6, 0, 0, 0, 0, 9, 8, 7, 0, 10, 11])
        >>> zero_sequence_lengths(a)
            array([6, 4, 1])
            
    Args:
        a(numpy.ndarray): 1-dimensional. Can have positive or negative numbers.
        
    Returns: ndarray. 1D array-like object.
    """
    # A: Create an array that is 1 where a is 0, and pad each end with an extra 0
    is_zero = np.concatenate(([0], (np.asarray(a) == 0).astype(np.int8), [0]))
    # B: Zero out any of the "in between" 1's - only 1's at edges remain
    ones_at_edges = np.abs(np.diff(is_zero))
    # C: Get the indices of all the remaining 1's (the starts and ends)
    sequences = np.where(ones_at_edges == 1)[0].reshape(-1, 2)
    # D: Compute a 1D array with just the lengths of these sequences
    #    (ravel keeps the result 1D even when there is only one sequence)
    return np.diff(sequences, axis=1).ravel()
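As a sanity check on the run-length idea, here is a small pure-Python sketch that gets the same lengths with `itertools.groupby` instead of NumPy differencing. `zero_run_lengths` is a hypothetical helper written for this comparison, not the function used in the rest of the notebook.

```python
from itertools import groupby


def zero_run_lengths(values):
    """Return the length of each run of consecutive zeros, in order."""
    return [
        sum(1 for _ in run)  # count the items in this run
        for is_zero, run in groupby(values, key=lambda v: v == 0)
        if is_zero           # keep only the runs made of zeros
    ]


example = [1, 2, 3, 0, 0, 0, 0, 0, 0, 4, 5, 6, 0, 0, 0, 0, 9, 8, 7, 0, 10, 11]
print(zero_run_lengths(example))  # [6, 4, 1]
```

The NumPy version is the better fit for the full dataset; this one is just easier to verify by eye.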

I will define an "empty frame" as one having no COTS, and use that to make the variable names more brief:

In [None]:
# Compute the lengths of these consecutive frames with zero COTS
consecutive_empty_frames = zero_sequence_lengths(df["starfish_spotted"])
# print the mean
avg_empty = round(consecutive_empty_frames.mean(), 1)
print(f"There is an average of {avg_empty} 'empty' frames in-between the video frames that contain COTS.")

So there's our answer! Out of curiosity, let's make a PDF from the distribution of these lengths:

In [None]:
def plot_histogram_from_arr(array, title, x_axis, y_axis):
    """
    Plots the PDF of an array, using Matplotlib.
    
    Credit for the equation used for plotting the PDF goes to the NumPy documentation:
        https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
    
    Args:
        array(array-like object): 1-dimensional, has numerical values
        title(str), x_axis(str), y_axis(str): will be added to the plot
        
    Returns: None
    """
    # A: calculate the mean and std dev of the array
    mu, sigma = array.mean(), array.std()
    # B: init the histogram (plt.hist returns the bin heights, the bin edges, and the patches)
    _, bin_edges, _ = plt.hist(array, density=True)
    # C: plot the PDF of a normal distribution, evaluated at the bin edges
    plt.plot(bin_edges, 1/(sigma * np.sqrt(2 * np.pi)) *
               np.exp(-(bin_edges - mu)**2 / (2 * sigma**2)),
             linewidth=2, color='r')
    # D: make the plot more presentable
    plt.title(title)
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    plt.show()

In [None]:
plot_histogram_from_arr(consecutive_empty_frames, "PDF of Empty Video Frames", "No. of Frames", "Probability")

These plots might suggest that the COTS don't tend to gather near each other, as the majority of empty sequences have a length between 0 and 1,000 frames. This raises another question though: what percentage of the entire training dataset do empty video frames make up?

In [None]:
def percent_fmt(percent):
    return f"{round(percent, 3)}%"

all_empty = np.sum(consecutive_empty_frames)
# use len(df), the number of frames; df.size would count every cell in the table
wedges = [all_empty, (len(df) - all_empty)]
plt.pie(wedges, labels=['Empty Frames', 'Non-empty Frames'], autopct=percent_fmt)
plt.title("Percentages of Empty and Non-Empty Video Frames")
plt.show()

So, should we be using a sequential model for this problem? The answer might be a very definite maybe.

On the one hand, there are some pretty large sequences of video frames without any COTS - this would suggest that they spread far out from each other.

On the other hand, frames with at least 1 COTS still make up a visible slice of the pie chart above, even though empty frames are the majority (consistent with the histogram earlier). So the dataset is not exactly "sparse" for finding COTS.
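A short sketch of the arithmetic behind the chart, assuming a made-up array of per-frame counts: the share of empty frames is the number of zero entries divided by the number of frames (rows), not by the total number of cells in the table.

```python
import numpy as np

# Made-up per-frame COTS counts (one entry per video frame)
counts = np.array([0, 0, 1, 3, 0, 0, 0, 2, 0, 0])

empty_share = np.mean(counts == 0) * 100  # percentage of frames with no COTS
non_empty_share = 100 - empty_share
print(f"{empty_share:.1f}% empty, {non_empty_share:.1f}% non-empty")
```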

For now, let's do a few more exploratory questions - and then later, it might be useful to try both an object detection model (e.g. YOLOv5) and some kind of sequential model for this problem.

**Question 2:** What are some "giveaways" that a certain object in a video frame is a COTS?

- *Part 1:* What is the distribution of the observed colors of COTS in the videos?

To approach this question, we need to first figure out what we mean by the "observed color" of a single COTS. For the code below, I decided to just take the pixel value that remains after the bounding box is downsampled to a single pixel with RGB channels. This of course means it holds just three values (one per channel), which the code flattens into an array of shape `(3,)`.

In [None]:
from scipy.spatial import KDTree


def rgb_to_color(rgb):
    """Finds the color that most closely names the given RGB value.
    
    Adapted from the code by Mir AbdulHaseeb at this link:
    https://medium.com/codex/rgb-to-color-names-in-python-the-robust-way-ec4a9d97a01f
    
    Arg:
        rgb(np.ndarray): 3 values in an array, one for each of the RGB channels
        
    Returns: str
    """
    # A: map all the hex codes in CSS3 --> respective color names
    css3_dict = webcolors.CSS3_HEX_TO_NAMES
    color_names = []
    rgb_values = []
    for color_hex, color_name in css3_dict.items():
        color_names.append(color_name)
        rgb_values.append(webcolors.hex_to_rgb(color_hex))
    # B: retrieve the closest matching color
    kdt_db = KDTree(rgb_values)
    _, index = kdt_db.query(rgb)
    return color_names[index]


def get_downsampled_color(img, annotation):
    """Compute the downsampled color from a 3-dimensional array (of pixel values).
    
    To account for the variation of color within a single bounding box, 
    this function first downsamples the pixels representing the COTS down to a 
    single pixel across the RGB channels (using bicubic interpolation). 
    Then, it uses a KDTree to convert that single RGB triple to a color name.
    
    Args:
        img(PIL.Image)
        annotation(dict): contains the coords of the upper left corner of a single bounding box,
                          along with its width and height
    
    Returns: str - the name of the color closest to the downsampled pixel
    """
    # A: get the pixels of just the COTS (using the bounding box)
    upper_left_x, upper_left_y = (annotation['x'], annotation['y'])
    width, height = annotation['width'],  annotation['height']
    lower_right_x, lower_right_y = upper_left_x + width, upper_left_y + height
    cots_pixels = img.crop((upper_left_x, upper_left_y, lower_right_x, lower_right_y))
    # B: resize it to a single pixel (with 3 channels)
    cots_pixel = np.array(cots_pixels.resize((1, 1), resample=Image.BICUBIC).getdata()).reshape(-1)
    # C: return the name of this pixel
    return rgb_to_color(cots_pixel)


def compute_cots_color_distribution(df, data_path):
    """Compute the distribution of colors shown by the COTS bounding boxes.
    
    By colors, we mean the "downsampled" color of each bounding box, 
    found using bicubic interpolation.
    
    Args: 
        df(pandas.DataFrame): the rows of the train.csv
        data_path(str): the root of the dataset
    
    Returns: ndarray of strings
    """
    # A: init a list
    colors = list()
    # B: traverse the DataFrame rows
    for _, frame in df.iterrows():
        f = frame
        # B1: get the image
        img_path = f"{data_path}/train_images/video_{f['video_id']}/{f['video_frame']}.jpg"
        img = Image.open(img_path)
        # B2: get the annotations of this image
        annotations = ast.literal_eval(f["annotations"])
        # B3: for each annotation - get the downsampled color
        colors.extend([get_downsampled_color(img, annotation) for annotation in annotations])
    # C: return the list
    return np.array(colors)

Warning: the code below still needs to be vectorized! Its runtime is on the order of minutes, not seconds (on the GPU).

In [None]:
colors = compute_cots_color_distribution(df, data_path)  
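One possible way to speed this up is to name many pixels in a single call, rather than rebuilding a lookup structure for every pixel. The sketch below swaps the KDTree for a brute-force NumPy distance computation, which should be adequate for a palette of only a few hundred colors. The four-color palette and the `nearest_color_names` helper are hypothetical stand-ins for the CSS3 table.

```python
import numpy as np

# Hypothetical mini-palette standing in for the CSS3 color table
palette_names = np.array(["black", "white", "red", "teal"])
palette_rgb = np.array([[0, 0, 0], [255, 255, 255], [255, 0, 0], [0, 128, 128]])


def nearest_color_names(pixels):
    """Name each RGB row in `pixels` (shape (n, 3)) by its closest palette entry."""
    pixels = np.asarray(pixels, dtype=float)
    # Squared Euclidean distance from every pixel to every palette color: shape (n, k)
    diffs = pixels[:, None, :] - palette_rgb[None, :, :]
    distances = (diffs ** 2).sum(axis=2)
    # Index of the closest palette color for each pixel
    return palette_names[distances.argmin(axis=1)]


pixels = [[10, 5, 0], [250, 240, 245], [200, 30, 20], [20, 120, 110]]
print(nearest_color_names(pixels))
```

Opening and cropping the images would still be a per-frame loop, so this only removes the repeated palette setup.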

In [None]:
colors

In [None]:
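from collections import Counter

def compute_mode_color(color_strings):
    """Computes the most frequent color name among the COTS."""
    # Counter works directly on strings, whereas scipy.stats.mode is meant for numeric data
    return Counter(color_strings).most_common(1)[0][0]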

In [None]:
mode_color = compute_mode_color(colors)
print(f"Most often, the COTS is a '{mode_color}' color.")

Because the color is now a categorical variable, we'll finally plot a bar chart to see its distribution.

In [None]:
from collections import Counter

def plot_histogram_from_strings(arr_strings, title, x_axis, y_axis):
    """
    Plots the relative frequency of each value in an array of strings, using Matplotlib.
    
    Credit to the Matplotlib documentation for giving example of how to plot categorical variables:
        https://matplotlib.org/stable/gallery/lines_bars_and_markers/categorical_variables.html#sphx-glr-gallery-lines-bars-and-markers-categorical-variables-py
    
    Args:
        arr_strings(array-like object): 1-dimensional, has string values
        title(str), x_axis(str), y_axis(str): will be added to the plot
        
    Returns: None
    """
    # A: map each unique value to its relative frequency
    string_counts = Counter(arr_strings)
    total = sum(string_counts.values())
    probabilities = [count / total for count in string_counts.values()]
    # B: make the plot 
    fig, ax = plt.subplots(1, 1, figsize=(45, 12))
    ax.bar(list(string_counts.keys()), probabilities)
    plt.title(title)
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    plt.show()

In [None]:
plot_histogram_from_strings(colors, "PDF of the Colors of COTS", "Color", "Probability")

- *Part 2:* What is the distribution of the observed sizes of COTS in the videos?

For this question, I will choose the area of the bounding box around each COTS to approximate their sizes:

In [None]:
def compute_cots_size_distribution(df):
    """Compute the distribution of sizes taken up by the COTS.
    
    By size, we mean the area of the bounding box, (width * height), 
    which both come from the "annotations" column.
    
    Args: 
        df(pandas.DataFrame): the rows of the train.csv
    
    Returns: ndarray of numerical values
    """
    ### HELPER 
    def compute_area(box) -> float:
        """Returns area of bounding box found in the image."""
        return box["width"] * box["height"]
    
    ### MAIN
    # A: init a list
    sizes = list()
    # B: traverse the DataFrame rows
    for _, frame in df.iterrows():
        f = frame
        # B1: get the bounding boxes of this image
        boxes = ast.literal_eval(f["annotations"])
        # B2: for each annotation - get the area
        sizes.extend([compute_area(box) for box in boxes])
    # C: return the list
    return np.array(sizes)
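The same areas could also be computed without an explicit Python loop over rows. The following is a rough sketch using `pandas.Series.explode`, run on made-up annotation strings; the `toy` frame and its values are assumptions for illustration only.

```python
import ast

import pandas as pd

# Made-up rows in the same textual format as the `annotations` column
toy = pd.DataFrame({"annotations": [
    "[]",
    "[{'x': 1, 'y': 2, 'width': 10, 'height': 20}]",
    "[{'x': 3, 'y': 4, 'width': 5, 'height': 6}, {'x': 7, 'y': 8, 'width': 30, 'height': 40}]",
]})

# One row per bounding box; frames without boxes become NaN and are dropped
boxes = toy["annotations"].apply(ast.literal_eval).explode().dropna()
# Turn the list of dicts into columns, then multiply width by height
box_table = pd.DataFrame(boxes.tolist())
areas = (box_table["width"] * box_table["height"]).to_numpy()
print(areas)
```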

In [None]:
sizes = compute_cots_size_distribution(df)

In [None]:
sizes

In [None]:
plot_histogram_from_arr(sizes, "PDF of COTS Sizes", "Area of Bounding Box (sq. pixels)", "Probability")

Woah - this PDF is heavily skewed. Instead of using the same `plot_histogram_from_arr()` function from before, let's look deeper into the range of values, and see if there might be outliers:

In [None]:
plt.boxplot(sizes)
plt.title("Box Plot of COTS Sizes")
plt.ylabel("Area of Bounding Box (sq. pixels)")
plt.xlabel("Videos 0-2")
plt.show()

Yup. Looks like although the majority of bounding box sizes fall in a relatively small range, the overall range of sizes is huge. Perhaps there are some close-up shots of the COTS in the data?

To get a more accurate picture of this distribution, let's plot the PDF again after removing outliers (based on the IQR):

In [None]:
def find_remove_outlier_iqr(counts):
    """Find and delete the outliers from a dataset.
    
    Here, we assume the data is not normally distributed. 
    Therefore we use the IQR to find the outliers - specifically, by 
    taking out any values greater than (the 75th percentile + 1.5 * IQR), or 
    less than (the 25th percentile - 1.5 * IQR).
    
    Args:
        counts(np.ndarray) - presumably an array of numerical values
        
    Returns: Tuple[np.ndarray] - 2 new arrays:
        a new array without any outliers
        a new array containing the outliers that were removed
    """
    # A: calculate interquartile range
    q25, q75 = np.percentile(counts, 25), np.percentile(counts, 75)
    iqr = q75 - q25
    print(f"The IQR of this array is: {iqr}.")
    # B: calculate the outlier thresholds
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    # C: identify outliers (element-wise comparison against both thresholds)
    outliers = counts[np.logical_or(counts < lower, counts > upper)]
    # D: remove outliers (values exactly on a threshold are kept)
    outliers_removed = counts[np.logical_and(lower <= counts, counts <= upper)]
    return outliers_removed, outliers
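Here is a small worked example of the IQR rule on made-up numbers, to show where the thresholds land. The `data` array is invented for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 100])

# Quartiles and interquartile range
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25
# Anything beyond 1.5 * IQR from the quartiles counts as an outlier
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

is_outlier = (data < lower) | (data > upper)
print(f"bounds: [{lower}, {upper}]")
print("outliers:", data[is_outlier], "kept:", data[~is_outlier])
```

Only the value 100 falls outside the bounds here, so it is the single point removed.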

In [None]:
sizes_no_outliers, _ = find_remove_outlier_iqr(sizes)

In [None]:
plot_histogram_from_arr(sizes_no_outliers, "PDF of COTS Sizes", "Area of Bounding Box (sq. pixels)", "Probability")

*Much* better! So now we can see that the area of the bounding boxes containing the COTS is relatively small for the most part, with most falling somewhere between 1,000 and 2,500 square pixels.

## Modelling [TODO]

**Step 1:** Now let's get the YOLOv5 model by cloning its GitHub repository. Note, there is also an implementation of this model [available on GitHub](https://github.com/ultralytics/yolov5/blob/c1249a47c7fe19e2067cb25ed8347e67d26ff1f1/models/tf.py#L323), which uses TensorFlow 2.x and is primarily authored by Jiacong Fang. 

In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install

In [None]:
!ls

In [None]:
import utils
display = utils.notebook_init()  # running the checks

**Step 2:** Now, let's export to a `SavedModel` (this is using the "nano" version of YOLOv5, the smallest). Note that we only need to do this *once*:

In [None]:
!python export.py --weights yolov5n.pt --include saved_model

**Step 3:** Now, let's test loading the model back in via TensorFlow 2:

In [None]:
tf_model = tf.keras.models.load_model('./yolov5n_saved_model')

Woohoo! We got a warning that the model has no training configuration, but that is to be expected (since we want to write all the training code ourselves from here on, using TensorFlow). 

Let's get started!