## This notebook is inspired by Ruchi Bhatia's notebook [Link](https://www.kaggle.com/ruchi798/and-identification-eda-augmentation/notebook) and Andrada Olteanu's notebook [Link](https://www.kaggle.com/andradaolteanu/whales-dolphins-effnet-embedding-cos-distance#7.-Cosine-Distance)

# Import libraries

In [None]:
!pip install imagesize

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import os
import cv2

from termcolor import colored
import imagesize

from tqdm import tqdm

import warnings
warnings.filterwarnings("ignore")

In [None]:
train_data = pd.read_csv("../input/happy-whale-and-dolphin/train.csv")
train_data.head(5)

# Figure out unique data and fix repeating names

In [None]:
print("Number of unique species :", train_data["species"].nunique())
print("Unique Species names :", list(train_data["species"].unique()))

#### 1. Bottlenose_dolphin has been misspelled as bottlenose_dolpin
#### 2. Killer_whale has been misspelled as kiler_whale

#### Fixing...

#### pilot_whale and globis are both short_finned_pilot_whale, so the three can be merged (from the [discussions](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305468))

In [None]:
# Fix the two misspelled species names
train_data["species"] = train_data["species"].str.replace('bottlenose_dolpin', 'bottlenose_dolphin')
train_data["species"] = train_data["species"].str.replace('kiler_whale', 'killer_whale')
# Merge pilot_whale and globis into short_finned_pilot_whale
train_data["species"] = train_data["species"].str.replace('pilot_whale', 'short_finned_pilot_whale')
train_data["species"] = train_data["species"].str.replace('globis', 'short_finned_pilot_whale')

# str.replace matches substrings, so the 'pilot_whale' step above also rewrote
# the names that already ended in 'pilot_whale'; undo the doubled prefixes here
train_data["species"] = train_data["species"].str.replace('short_finned_short_finned_pilot_whale', 'short_finned_pilot_whale')
train_data["species"] = train_data["species"].str.replace('long_finned_short_finned_pilot_whale', 'long_finned_pilot_whale')


print(colored("Duplicates fixed...", 'red'))

print(colored("Number of unique species : {0}".format(train_data["species"].nunique()), "green"))
print("Unique Species names :", list(train_data["species"].unique()))
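
One subtlety in the renaming step: `pilot_whale` is also a substring of `short_finned_pilot_whale` and `long_finned_pilot_whale`, so a substring replacement touches those names too. The sketch below uses plain Python strings on a hypothetical list of labels to contrast substring replacement with an exact-value lookup. `RENAMES` and `rename_exact` are illustrative names, not part of the notebook.

```python
# Hypothetical sample of species labels (not read from train.csv)
names = ["pilot_whale", "globis", "short_finned_pilot_whale", "long_finned_pilot_whale"]

# Substring replacement also rewrites labels that merely contain "pilot_whale"
substring_result = [n.replace("pilot_whale", "short_finned_pilot_whale") for n in names]

# Exact-value lookup: only whole labels listed in the mapping are changed
RENAMES = {
    "pilot_whale": "short_finned_pilot_whale",
    "globis": "short_finned_pilot_whale",
}

def rename_exact(name):
    """Return the corrected label, or the label unchanged if it is not in RENAMES."""
    return RENAMES.get(name, name)

exact_result = [rename_exact(n) for n in names]

print(substring_result)
print(exact_result)
```

In pandas, passing a dictionary like this to `Series.replace` performs the same exact-value substitution, which would make the two clean-up calls unnecessary.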

# Train and Test data

In [None]:
test_img_dir = '../input/happy-whale-and-dolphin/test_images'
train_img_dir = '../input/happy-whale-and-dolphin/train_images'

def getCompleteImagePath(path):
    image_names = []
    for dirname, _, filenames in os.walk(path):
        for imageName in filenames:
            complete_path = os.path.join(dirname, imageName)
            image_names.append(complete_path)
    return image_names, len(image_names)

train_img_path, num_train_images = getCompleteImagePath(train_img_dir)
test_img_path, num_test_images = getCompleteImagePath(test_img_dir)

print(colored("Number of train images : {0}".format(num_train_images), 'blue'))
print(colored("Number of test images : {0}".format(num_test_images), 'green'))

In [None]:
def addCompleteImagePath(train_data, train_img_dir):
    # Build the full path column-wise; assigning to the row yielded by
    # iterrows() is not guaranteed to write back to the DataFrame
    train_data["complete_img_path"] = train_data["image"].map(
        lambda image_name: os.path.join(train_img_dir, image_name)
    )
    return train_data
    
train_data = addCompleteImagePath(train_data, train_img_dir)
train_data.head(5)

In [None]:
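def displayImages(image_path, rows, cols, title, figsize=(20, 8)):
    figure, ax = plt.subplots(nrows=rows, ncols=cols, figsize=figsize)
    plt.suptitle(title, fontsize=18)
    axes = np.atleast_1d(ax).ravel()
    # Hide every axis up front so unused grid cells stay blank
    for axis in axes:
        axis.set_axis_off()
    # zip stops at the shorter sequence, so paths beyond the grid size are ignored
    for axis, pathImage in zip(axes, image_path):
        img = cv2.imread(pathImage)
        if img is None:  # cv2.imread returns None for unreadable files
            continue
        axis.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    plt.tight_layout()
    plt.show()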

In [None]:
displayImages(train_img_path[:30], 6, 5, "Train images(30)")

In [None]:
displayImages(test_img_path[:30], 6, 5, "Test images(30)")

# Whales : Dolphins as Super Class

#### Let's group the species into a super_class column to understand the distribution of whales and dolphins

In [None]:
train_data["super_class"] = train_data.species.map(lambda x : 'dolphin' if 'dolphin' in x else 'whale')

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
fig.suptitle("Distribution of Whales and Dolphins", size=22)
explode = (0.05, 0.05)
labels = list(train_data.super_class.value_counts().index)
sizes = train_data.super_class.value_counts().values
ax.pie(sizes, explode=explode, startangle=45, labels=labels, autopct="%1.0f%%", pctdistance=0.8, colors=['#03fcdf', '#03a5fc'])
ax.add_artist(plt.Circle((0,0), 0.6, fc="white"))
plt.show()

* #### The data is imbalanced even at the super_class level: Whale:Dolphin is about 67:33.
* #### We may need to focus more on dolphins during preprocessing/data augmentation to bring the ratio to at least 55:45.
* #### This is just a general idea. The best ratio would be 50:50, but realistically 55:45 should be a good target if achievable. Again, this ratio needs to be tested at the model performance level.
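
To get a feel for what moving from 67:33 to 55:45 means, here is a rough back-of-the-envelope sketch. It assumes the whale count stays fixed and only dolphin images are added; `dolphins_needed` is a hypothetical helper, and the counts are the 67:33 split expressed per 100 images rather than the real dataset totals.

```python
import math

def dolphins_needed(n_whales, n_dolphins, target_dolphin_share):
    """Extra dolphin images needed so dolphins make up target_dolphin_share of the data.

    Solves d / (n_whales + d) = target_dolphin_share for d, keeping whales fixed.
    """
    required = target_dolphin_share * n_whales / (1.0 - target_dolphin_share)
    return max(0, math.ceil(required - n_dolphins))

# Hypothetical counts: the 67:33 split per 100 images
extra = dolphins_needed(n_whales=67, n_dolphins=33, target_dolphin_share=0.45)
print(extra)
```

Under these assumptions, roughly 22 extra dolphin images are needed per 100 original images, which is about two thirds more dolphin data than we have now.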

# Distribution within Whales and Dolphins

In [None]:
whales = train_data[train_data["super_class"]=="whale"]
dolphins = train_data[train_data["super_class"]!="whale"]

fig, ax = plt.subplots(nrows = 1,ncols = 2, figsize=(16, 8))

sns.countplot(y="species", data=whales, order=whales.iloc[0:]["species"].value_counts().index, ax=ax[0], color="#03fcdf")
ax[0].set_title("Distribution of Whales")
ax[0].set_ylabel(None)

sns.countplot(y="species", data=dolphins, order=dolphins.iloc[0:]["species"].value_counts().index, ax=ax[1], color="#03a5fc")
ax[1].set_title("Distribution of Dolphins")
ax[1].set_ylabel(None)

plt.tight_layout()
plt.show()

### **Whales :-**

* #### Mind = Blown xD. We have approx. 7k Beluga images and only around 10~100 for the pygmy killer whale. (According to a few searches, the pygmy killer whale is itself a very rare species of whale xD, so having less data is understandable.)

* #### Balancing such data would be very challenging. If this were a real-life scenario where you would very rarely see this whale, we would have omitted this class. This also means that adding data from other sources would most probably be challenging too.

* #### Even for Bryde's whale, we lack data.

* #### So for the last 8 whale species, we have fewer than 1k images.

### **Dolphins :-**

* #### 10k+ images for bottlenose dolphin, and nearly none for Fraser's dolphin (frasiers_dolphin in the data).

* #### Fraser's dolphins are mostly found in deep waters = hard to spot at the surface, hence fewer images?

* #### For the last 7 dolphin classes we have very few images.

### In this scenario, data augmentation might help, but to what extent? We can't turn 500 images into 5k using augmentations alone xD. Augmentations should be experimented with carefully. Might do wonders!
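
The "500 images to 5k" concern can be made concrete with a small planning sketch. It assumes every class count is positive and caps the number of augmented copies per original image, since repeating a handful of source images many times adds little new information. `augmentation_plan`, the class names, and the counts are all hypothetical.

```python
import math

def augmentation_plan(counts, target, max_copies_per_image):
    """For each class, return how many augmented copies per original image to generate.

    Classes already at or above the target get 0. The factor is capped at
    max_copies_per_image, so very small classes will still fall short.
    """
    plan = {}
    for name, n in counts.items():
        if n >= target:
            plan[name] = 0
            continue
        needed = math.ceil((target - n) / n)
        plan[name] = min(needed, max_copies_per_image)
    return plan

# Hypothetical class counts, chosen only for illustration
counts = {"class_a": 5000, "class_b": 500, "class_c": 50}
plan = augmentation_plan(counts, target=5000, max_copies_per_image=4)
print(plan)
```

With a cap of 4 copies per image, the 500-image class only reaches 2,500 images, so augmentation alone does not close the gap.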

In [None]:
plt.figure(figsize=(12, 12))
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
sns.countplot(y="species", data=train_data, order=train_data.iloc[0:]["species"].value_counts().index, palette="hls", linewidth=4)
plt.title("Entire Train set Distribution")
plt.show()

# Working with individual ids

In [None]:
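# value_counts() sorts by frequency (descending). Name the columns explicitly so the
# result is the same across pandas versions: "index" = ID, "individual_id" = count
all_indv = train_data["individual_id"].value_counts().rename_axis("index").reset_index(name="individual_id")
individuals = all_indv.head(1000)

# Read ID and count from the same row; DataFrame.max()/min() work column-wise and
# would pair the alphabetically last/first ID with the max/min count
most_frequent, least_frequent = all_indv.iloc[0], all_indv.iloc[-1]

print(colored("Total Unique IDs : {0}".format(all_indv.shape[0]),"blue"))
print(colored("ID : {1} occurs {0} times (Max)".format(most_frequent["individual_id"], most_frequent["index"]),"green"))
print(colored("ID : {1} occurs {0} times (Min)".format(least_frequent["individual_id"], least_frequent["index"]),"yellow"))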
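
The most and least frequent IDs can also be read straight off a frequency table. Below is a minimal sketch with `collections.Counter` on a hypothetical list of IDs standing in for `train_data["individual_id"]`. Note that each ID and its count come from the same entry. Taking a column-wise maximum of the ID column and the count column separately would pair an unrelated ID with the top count.

```python
from collections import Counter

# Hypothetical individual IDs, standing in for train_data["individual_id"]
ids = ["a1", "b2", "a1", "c3", "a1", "b2", "d4"]

freq = Counter(ids)
ranked = freq.most_common()  # list of (id, count) pairs, most frequent first
most_id, most_count = ranked[0]
least_id, least_count = ranked[-1]

# IDs that appear exactly once
singletons = sum(1 for c in freq.values() if c == 1)

print(most_id, most_count)
print(least_count, singletons)
```

Counting the IDs that appear only once is worth doing alongside the min/max, since those individuals have a single image to learn from.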

In [None]:
all_indv

In [None]:
plt.figure(figsize=(12, 12))
sns.barplot(data=all_indv.head(50), x="individual_id", y="index", palette="flare")
plt.title("Top 50 individual IDs")
plt.ylabel("Individual")
plt.xlabel("Frequency")
plt.show()

# Visualize each species

In [None]:
def display_individual_species(species_name):
    
    if species_name not in train_data["species"].unique():
        print(colored("Invalid species name...", "red"))
        return
    
    # Show at most 10 images per species
    indv_species = train_data[train_data["species"]==species_name].head(10)
    possible_num_images = indv_species.shape[0]
    
    # Two rows of five when there are more than five images, otherwise a single row
    rows = 2 if possible_num_images > 5 else 1
    displayImages(indv_species["complete_img_path"].tolist(), rows, 5, species_name, figsize=(10, 5))

In [None]:
unique_species = train_data["species"].unique().tolist()
for species in unique_species:
    display_individual_species(species)

# Take a look at the image sizes

In [None]:
width_list, height_list = [], []

for img_path in tqdm(train_data["complete_img_path"]):
    width, height = imagesize.get(img_path)
    width_list.append(width)
    height_list.append(height)

In [None]:
train_data["width"] = width_list
train_data["height"] = height_list
train_data["dimension"] = train_data["width"] * train_data["height"]

In [None]:
min_size = train_data[train_data["dimension"]==train_data["dimension"].min()]
max_size = train_data[train_data["dimension"]==train_data["dimension"].max()]

### Image with the smallest dimension (width × height)

In [None]:
min_size

### Image with the largest dimension (width × height)

In [None]:
max_size

In [None]:
train_data["dimension"].nunique()

### Out of 51k images, we have 24,233 unique values of dimension (width × height)
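
Since the `dimension` column is the product width × height, this count is the number of distinct pixel areas, and different shapes with the same area collapse into one value. A small sketch on hypothetical sizes (not taken from the dataset) shows the difference between counting areas and counting (width, height) pairs:

```python
# Hypothetical (width, height) pairs, not taken from the dataset
sizes = [(400, 300), (300, 400), (600, 200), (400, 300), (1024, 768)]

# Distinct pixel areas: what nunique() on a width * height column counts
unique_areas = {w * h for w, h in sizes}

# Distinct shapes: landscape, portrait and panoramic images stay separate
unique_shapes = set(sizes)

print(len(unique_areas))
print(len(unique_shapes))
```

If the goal is to choose a resize strategy, counting (width, height) pairs or aspect ratios may be more informative than counting areas.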

In [None]:
temp_data = train_data[["species", "dimension", "super_class"]]

In [None]:
plt.figure(figsize=(16, 16))
sns.violinplot(data=temp_data, x="species", y="dimension", hue="super_class", palette="magma")
plt.xlabel("Species")
plt.ylabel("Dimension")
plt.xticks(rotation=90)
plt.show()

From this graph we can see that images of the "frasiers_dolphin" species have very low resolution.
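
A visual impression like this can be backed with a per-species summary. The sketch below groups hypothetical (species, pixel area) records and takes the median area per species using only the standard library. The species names and values are made up for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (species, pixel area) records, for illustration only
records = [
    ("species_a", 100_000), ("species_a", 300_000), ("species_a", 200_000),
    ("species_b", 2_000_000), ("species_b", 4_000_000),
]

# Collect the areas belonging to each species
areas_by_species = defaultdict(list)
for species, area in records:
    areas_by_species[species].append(area)

# Median area per species, and the species with the smallest median
median_area = {s: median(a) for s, a in areas_by_species.items()}
smallest = min(median_area, key=median_area.get)

print(median_area, smallest)
```

On the notebook's data, `train_data.groupby("species")["dimension"].median().sort_values()` would be the pandas equivalent, assuming the `dimension` column built earlier.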

# Work in progress...