# **EDA Solution Simplified in TensorFlow Barrier Reef Competition (Adapted from Baek Kyun Shin, Diego Gomez, and Aryan Lala)** #

## Introduction before Coding ##
After we classified all the Crown-of-Thorns starfish (COTS) in the previous notebook (https://www.kaggle.com/dinowun/the-help-protect-the-great-barrier-reef-code), the COTS have returned, and they want payback! But there's a way to foil the COTS' payback plan: use the EDA solution. In this second notebook, I go over my simplified walkthrough of the EDA solution.

## Load "Train" Data (and Imports) ##
Before classifying the COTS again, we need to import a module called pandas. We then store the path of the competition data in the PATH_OF_DATA variable and define the train_file variable, which concatenates that path with the train.csv file name and reads the file into a DataFrame.

In [None]:
import pandas as pd

PATH_OF_DATA = '/kaggle/input/tensorflow-great-barrier-reef/'
train_file = pd.read_csv(PATH_OF_DATA + 'train.csv')
train_file

## A Peek of the Train Data ##
Now that train.csv is loaded, let's analyze the data inside. As a first analysis, we get information about the train_file variable using the info() method.

In [None]:
train_file.info()

The output of info() shows that train_file is a DataFrame object with 23501 entries according to its RangeIndex. There are 6 data columns: video_id (0), sequence (1), video_frame (2), sequence_frame (3), image_id (4), and last but not least, annotations (5). Of these columns, 4 have the int64 dtype and 2 have the object dtype. The train_file variable uses over 1.1 MB of memory. As a second analysis, we check train_file for duplicated rows.

In [None]:
train_file.duplicated().sum()

**Excellent!** There are no duplicated rows in train_file. As a last analysis, we build a summary of the features, known as a feature summary.

In [None]:
def a_legit_resumetable(df):
    '''This is a function to create summary but with features'''
    print(f"Shape: {df.shape}")
    one_summary = pd.DataFrame(df.dtypes, columns=['Data Type'])
    one_summary = one_summary.reset_index()
    one_summary = one_summary.rename(columns={'index': 'Features'})
    one_summary['Num of Null Value'] = df.isnull().sum().values
    one_summary['Num of Unique Value'] = df.nunique().values
    one_summary['1st Value'] = df.iloc[0].values
    one_summary['2nd Value'] = df.iloc[1].values
    one_summary['3rd Value'] = df.iloc[2].values
    return one_summary
    

Now that the a_legit_resumetable function is defined, we can call it on the train_file variable.

In [None]:
a_legit_resumetable(train_file)

Running this cell first prints the shape of train_file and then displays the one_summary DataFrame that the function returns. It has 7 columns: Features, Data Type, Num of Null Value, Num of Unique Value, 1st Value, 2nd Value, and 3rd Value, with one row for each column of train_file.
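For intuition, the checks used in this section (duplicate rows, null counts, and unique counts) can be tried on a tiny frame. This is only a sketch: toy_df and its values are made up for illustration and are not competition data.

```python
import pandas as pd

# Hypothetical toy data (not from train.csv): the second row repeats the first.
toy_df = pd.DataFrame({
    'video_id': [0, 0, 1],
    'image_id': ['0-0', '0-0', None],
})

num_duplicates = int(toy_df.duplicated().sum())  # rows identical to an earlier row
null_counts = toy_df.isnull().sum().to_dict()    # missing values per column
unique_counts = toy_df.nunique().to_dict()       # distinct non-null values per column

print(num_duplicates, null_counts, unique_counts)
```

Note that nunique() ignores missing values by default, which is why the summary table reports nulls and unique values in separate columns.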

## Basic Engineering ##
Now that train_file is set up, let's do some basic engineering. First, we create a function that downcasts the numeric columns of train_file to smaller dtypes in order to save memory. Let's see how it goes.

In [None]:
def downcast_time(df, verbose=True):
    begin_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        dtype_name = df[col].dtype.name
        if dtype_name == 'object':
            pass
        elif dtype_name == 'bool':
            df[col] = df[col].astype('int8')
        elif dtype_name.startswith('int') or (df[col].round() == df[col]).all():
            df[col] = pd.to_numeric(df[col], downcast='integer')
        else:
            df[col] = pd.to_numeric(df[col], downcast='float')
    end_of_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('{:.1f}% Compressed'.format(100 * (begin_mem - end_of_mem) / begin_mem))
        
    return df

Great! Now we reassign train_file to the result of calling the downcast function on it!

In [None]:
train_file = downcast_time(train_file)

After reassigning train_file, the output shows that its memory usage has been compressed by 41.7%.
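To see what the downcast does to a single column, here is a minimal sketch using pd.to_numeric directly. The two Series below are hypothetical toy columns, not competition data.

```python
import pandas as pd

# Hypothetical toy columns, not competition data.
small_ints = pd.Series([0, 1, 2], dtype='int64')
small_floats = pd.Series([0.5, 1.5, 2.5], dtype='float64')

# downcast picks the smallest dtype that can still hold every value.
ints_down = pd.to_numeric(small_ints, downcast='integer')
floats_down = pd.to_numeric(small_floats, downcast='float')

print(small_ints.dtype, '->', ints_down.dtype)
print(small_floats.dtype, '->', floats_down.dtype)
```

The values themselves are unchanged; only the storage width shrinks, which is where the memory saving comes from.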

## Feature Engineering ##
Now that train_file has been compressed, we can start extracting bounding boxes from the annotations column! First of all, we import the ast module, then we convert each annotations string to a list of annotations. Finally, we get the number of bounding boxes for each image.

In [None]:
# First, import the 'ast' module
import ast

# Next, convert string to list type
train_file['annotations'] = train_file['annotations'].apply(ast.literal_eval)

# Finally, get the number of bounding boxes for each image
train_file['num_bboxes'] = train_file['annotations'].apply(lambda i: len(i))

## Checking the Number of Frames with Bounding Boxes ##
Now that train_file has a num_bboxes column, we can inspect the frames that contain bounding boxes by filtering train_file with the condition train_file['num_bboxes'] > 0.

In [None]:
train_file[train_file['num_bboxes'] > 0]

After running this code, we see only the rows that have at least one bounding box, and each of their annotations gives the location of a COTS in that image!
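The parse, count, and filter steps above can be sketched on a toy frame. The annotation strings here are invented examples written in the same list-of-dicts style; the variable names toy, with_boxes, and share_with_boxes are only for illustration.

```python
import ast
import pandas as pd

# Hypothetical annotation strings, not rows from train.csv.
toy = pd.DataFrame({
    'annotations': [
        "[]",
        "[{'x': 10, 'y': 20, 'width': 30, 'height': 40}]",
        "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}, {'x': 5, 'y': 6, 'width': 7, 'height': 8}]",
    ]
})

# Convert each string to a real Python list of dicts, then count the boxes.
toy['annotations'] = toy['annotations'].apply(ast.literal_eval)
toy['num_bboxes'] = toy['annotations'].apply(len)

# Keep only the frames that contain at least one box.
with_boxes = toy[toy['num_bboxes'] > 0]
share_with_boxes = len(with_boxes) / len(toy)

print(toy['num_bboxes'].tolist(), len(with_boxes))
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for strings read from a file.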

## Verification to Classify Whether There Is Corrupted Data ##
Before we plot frame images with bounding boxes that come from the annotations, we have to verify whether any of the images are corrupted. So, let's take a run!

In [None]:
from os import listdir
from PIL import Image

def verification(video_id):
    path_of_file = PATH_OF_DATA + f'train_images/video_{video_id}/'
    bad_files = 0 # Count corrupted files so that the summary message stays accurate
    for filename in listdir(path_of_file):
        if filename.endswith('.jpg'):
            try:
                image = Image.open(path_of_file + filename)
                image.verify() # We need this line of code so that we can verify that it is an image
            except (OSError, SyntaxError):
                bad_files += 1
                print("There's a bad file there:", filename) # This line prints out the names of corrupted files
    if bad_files == 0:
        print(f'Video {video_id} has all of the valid images. Results: Verified!')
    else:
        print(f'Video {video_id} has {bad_files} bad file(s).')
    
for video_id in range(3):
    verification(video_id)

Super! Videos 0, 1, and 2 contain only valid images, and no bad files were found in any of the folders. Now that we know there are no bad files, let's get started on plotting frame images with bounding boxes!
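To see how this kind of check reacts to a broken file without touching the dataset, here is a small in-memory sketch. The helper is_valid_image and both byte strings are hypothetical and exist only for this illustration.

```python
import io
from PIL import Image

def is_valid_image(data):
    """Return True if Pillow can open and verify the given bytes as an image."""
    try:
        with Image.open(io.BytesIO(data)) as image:
            image.verify()
        return True
    except (OSError, SyntaxError):
        return False

# Build a tiny valid PNG in memory, plus some bytes that are clearly not an image.
buffer = io.BytesIO()
Image.new('RGB', (4, 4)).save(buffer, format='PNG')
good_bytes = buffer.getvalue()
bad_bytes = b'this is not an image'

print(is_valid_image(good_bytes), is_valid_image(bad_bytes))
```

After verify() an image object should not be reused, so the file needs to be reopened before loading its pixels.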

## Plotting Frame Images with Bounding Boxes ##
Before we start plotting frame images with bounding boxes, we need to load a sequence of images with annotations. First, we import the numpy module as np and ImageDraw from the PIL (Python Imaging Library) module. Then, we create two functions: one that fetches a single image with its bounding boxes drawn on it, and one that fetches a list of such images.

In [None]:
import numpy as np
from PIL import ImageDraw

def fetch_image(df, video_id, frame_id):
    # Let's first get a frame!
    a_frame = df[(df['video_id'] == video_id) & (df['video_frame'] == frame_id)].iloc[0]
    # Now, we will get bounding_boxes!
    bounding_boxes = a_frame['annotations']
    # Finally, let's open images!
    img = Image.open(PATH_OF_DATA + f'train_images/video_{video_id}/{frame_id}.jpg')
    
    drawing = ImageDraw.Draw(img) # One drawing context is enough for all boxes
    for a_box in bounding_boxes:
        x0, y0, x1, y1 = (a_box['x'], a_box['y'], a_box['x'] + a_box['width'], a_box['y'] + a_box['height'])
        drawing.rectangle((x0, y0, x1, y1), outline=180, width=5)
    return img

def fetch_image_list(df, video_id, num_images, start_frame_idx):
    image_list = [np.array(fetch_image(df, video_id, start_frame_idx + index)) for index in range(num_images)]
    return image_list

Now, we build a list of images using the fetch_image_list function and check how many images it contains.

In [None]:
images = fetch_image_list(train_file, video_id=0, num_images=80, start_frame_idx=25)
print(f"The number of images is: {len(images)}")

As you can see, the fetch_image_list function returned 80 images in all. With everything set, let's plot the images with bounding boxes! First, we import two matplotlib modules, then we set up the grid and the figsize, and finally, we plot every fifth frame with its bounding boxes.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

the_grid = gridspec.GridSpec(4, 2)
plt.figure(figsize=(18, 20))

idx_list = [0, 5, 10, 15, 20, 25, 30, 35]

for i, idx in enumerate(idx_list):
    ax = plt.subplot(the_grid[i])
    plt.imshow(images[idx], interpolation='nearest')
    ax.set_title(f'frame index {idx}')
    plt.axis('off')
plt.show()

After we run this cell, we see every fifth frame with the bounding boxes from its annotations drawn on it. Now that the bounding boxes are drawn on the frames, we can finally animate the images!
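The boxes are stored as a top-left corner plus a width and a height, while ImageDraw's rectangle expects two corners. Here is a minimal sketch of that conversion on a blank canvas. The helper to_corners, the canvas size, and sample_box are all made up for illustration.

```python
from PIL import Image, ImageDraw

def to_corners(box):
    """Convert an {'x', 'y', 'width', 'height'} box to (x0, y0, x1, y1)."""
    return (box['x'], box['y'], box['x'] + box['width'], box['y'] + box['height'])

# Hypothetical blank canvas and box, not a competition frame.
canvas = Image.new('RGB', (100, 80), color=(0, 0, 0))
sample_box = {'x': 10, 'y': 20, 'width': 30, 'height': 40}

corners = to_corners(sample_box)
draw = ImageDraw.Draw(canvas)
draw.rectangle(corners, outline=(255, 0, 0), width=2)

print(corners)
```

Only the outline is drawn, so the inside of the box keeps the original pixels of the frame.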

## Image Animation ##
To tie all the frames together, we create an animated image, in other words a video. First, we import another matplotlib module and set up rc. Then, the create_an_animation function combines all the frames into one animation with the help of the animate function inside it. Lastly, we set the frame interval to 130 milliseconds and call the create_an_animation function.

In [None]:
from matplotlib import animation, rc
rc('animation', html='jshtml')

def create_an_animation(imgs, frame_interval=130):
    fig = plt.figure(figsize=(7, 4))
    plt.axis('off')
    img = plt.imshow(imgs[0])
    
    def animate(i):
        img.set_array(imgs[i])
        return [img]
    
    return animation.FuncAnimation(fig, animate, frames=len(imgs), interval=frame_interval)

In [None]:
frame_interval = 130 # Set a smaller number to play faster, or a bigger number to play slower
create_an_animation(images, frame_interval=frame_interval)

We have now made a video out of the frames, with the bounding boxes from the annotations drawn on the images. But that's not all, because we are going to submit predictions by running inference with a TensorFlow COTS model.

## Loading the TensorFlow COTS Model to Run Inference ##
Now that the animation of the frames with bounding boxes is done, let's load the TensorFlow COTS model to run inference. First, we import the tensorflow, os, time, and sys modules, along with greatbarrierreef. Then we set the model directory, clear the session, and load the saved model. We record the time before and after loading and print the difference to measure how long the TensorFlow COTS model takes to load.

In [None]:
import os
import sys
import tensorflow as tf
import time

# Import the library that is used to submit the prediction result.
INPUT_DIR = '../input/tensorflow-great-barrier-reef/'
sys.path.insert(0, INPUT_DIR)
import greatbarrierreef

MODEL_DIRECTORY = '../input/cots-detection-w-tensorflow-object-detection-api/cots_efficientdet_d0'
starting_time = time.time()
tf.keras.backend.clear_session()
detect_fn_tf_odt = tf.saved_model.load(os.path.join(MODEL_DIRECTORY, 'output', 'saved_model'))
ending_time = time.time()
total_time = ending_time - starting_time
print("The total time is: " + str(total_time) + "s")

Next, let's load images into a numpy array using a load_image_into_numpy_array function. We also set up detection by creating a detect function (see the code for further explanation).

In [None]:
import io  # Needed for io.BytesIO below

def load_image_into_numpy_array(path):
    """Load an image from a file into a numpy array.

    The function puts the image into a numpy array to feed into the TensorFlow graph.
    Note that by convention, the array has the shape (height, width, channels), where channels = 3 for RGB.

    Args:
        path: a file path (that can be local or on colossus)

    Returns:
        A uint8 numpy array with shape (img_height, img_width, 3).
    """
    img_data = tf.io.gfile.GFile(path, 'rb').read()
    image = Image.open(io.BytesIO(img_data))
    (im_width, im_height) = image.size

    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)

def detect(image_np):
    """This function can detect the infamous COTS from a given numpy image."""
    
    # Add a batch dimension, because the model expects a batch of images.
    input_tensor = np.expand_dims(image_np, 0)
    detections = detect_fn_tf_odt(input_tensor)
    return detections

## Using the Provided Python Time-Series API to Create the Submission File ##
To create the submission in csv format, we first initialize the environment by calling the make_env function of greatbarrierreef and storing the result in an env variable. Then we call the iter_test function and store the result in the iteration_test variable, an iterator that loops over the test set and sample submissions.

In [None]:
env = greatbarrierreef.make_env()  # First, initialize the environment
iteration_test = env.iter_test()  # Next, get an iterator which loops over the test set and sample submissions

After creating env, we set detection_threshold to 0.19 and set up submission_dict as a dictionary with the keys id and prediction_string. Then, we loop over iteration_test, which yields image_np and sample_prediction_df, and read detection_boxes, detection_scores, and num_detections from the detections. After that, we generate the submission data.

In [None]:
detection_threshold = 0.19 # Change it to whatever you want...

submission_dict = {
    'id': [],
    'prediction_string': [],
}

for (image_np, sample_prediction_df) in iteration_test:
    height, width, _ = image_np.shape
    
    # Run object detection using the TensorFlow model.
    detections = detect(image_np)
    
    # Parse the detection result and generate a prediction string.
    num_detections = detections['num_detections'][0].numpy().astype(np.int32)
    predictions = []
    for index in range(num_detections):
        score = detections['detection_scores'][0][index].numpy()
        if score < detection_threshold:
            continue

        bbox = detections['detection_boxes'][0][index].numpy()
        y_min = int(bbox[0] * height)
        x_min = int(bbox[1] * width)
        y_max = int(bbox[2] * height)
        x_max = int(bbox[3] * width)
        
        bbox_width = x_max - x_min
        bbox_height = y_max - y_min
        
        predictions.append('{:.2f} {} {} {} {}'.format(score, x_min, y_min, bbox_width, bbox_height))
    
    # Generate the submission data.
    prediction_str = ' '.join(predictions)
    sample_prediction_df['annotations'] = prediction_str
    env.predict(sample_prediction_df)

    print('Predictions: ', prediction_str)

As a result, we have some predictions to look at, and a submission.csv file has appeared! If we submit this file, I expect it to score around 0.420 or so...
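The box arithmetic in the loop above can be checked in isolation, without a model. In this sketch, format_prediction and build_prediction_string are hypothetical helpers, and the scores and boxes are made-up values. As in the loop, boxes are assumed to be normalized and ordered as [y_min, x_min, y_max, x_max].

```python
def format_prediction(score, box, height, width):
    """Turn one normalized [y_min, x_min, y_max, x_max] box into 'score x y w h'."""
    y_min = int(box[0] * height)
    x_min = int(box[1] * width)
    y_max = int(box[2] * height)
    x_max = int(box[3] * width)
    return '{:.2f} {} {} {} {}'.format(score, x_min, y_min, x_max - x_min, y_max - y_min)

def build_prediction_string(scores, boxes, height, width, threshold):
    """Join the formatted boxes whose score reaches the threshold."""
    parts = [
        format_prediction(score, box, height, width)
        for score, box in zip(scores, boxes)
        if score >= threshold
    ]
    return ' '.join(parts)

# Made-up detections: the second one falls below the threshold and is dropped.
scores = [0.87, 0.10]
boxes = [(0.25, 0.125, 0.5, 0.25), (0.0, 0.0, 1.0, 1.0)]
prediction = build_prediction_string(scores, boxes, height=720, width=1280, threshold=0.19)

print(prediction)
```

Raising the threshold gives fewer but more confident boxes, while lowering it keeps more candidates, so it is worth trying a few values.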

## Acknowledgements ##
Credit to Aryan Lala and Diego Gomez for the training and prediction part and the EDA part, respectively.
Training and Prediction Part: 
https://www.kaggle.com/aryanlala/object-detection-great-barrier-reef/notebook
EDA Part:
https://www.kaggle.com/diegoalejogm/great-barrier-reefs-eda-with-animations