# Introduction


This Kernel explores the **Happywhale - Whale and Dolphin Identification competition** dataset.

<img src="https://images.unsplash.com/photo-1570913179118-f3d24be1d1f7?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2143&q=80" width=500></img>

# Analysis preparation

Let's load the data and take a preliminary look at it.

<img src="https://images.unsplash.com/photo-1568430328012-21ed450453ea?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1174&q=80" width=500></img>

In [None]:
!pip install imagesize

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import imagesize

In [None]:
print(f"Files and folders: {os.listdir('/kaggle/input/happy-whale-and-dolphin')}")

Let's load `train.csv` and `sample_submission.csv` first.

In [None]:
train_df = pd.read_csv('/kaggle/input/happy-whale-and-dolphin/train.csv')
submission_df = pd.read_csv('/kaggle/input/happy-whale-and-dolphin/sample_submission.csv')

In [None]:
train_df.head()

In [None]:
submission_df.head()

# Data exploration

<img src="https://images.unsplash.com/photo-1611890129309-31e797820019?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1170&q=80" width=500></img>

Let's get some more insight into the train index file and the train and test image folders.

In [None]:
print(f"Images in train index file: {train_df.image.nunique()}")
print(f"Species in train index file: {train_df.species.nunique()}")
print(f"Individual IDs in train index file: {train_df.individual_id.nunique()}")

print(f"Images in train images folder: {len(os.listdir('/kaggle/input/happy-whale-and-dolphin/train_images'))}")
print(f"Images in test images folder: {len(os.listdir('/kaggle/input/happy-whale-and-dolphin/test_images'))}")

Let's look at the complete list of species.

In [None]:
print(f"Species: {train_df.species.unique()}")

From the competition discussion board and other Notebooks (e.g. [Happywhale: Data Distribution](https://www.kaggle.com/awsaf49/happywhale-data-distribution/notebook)) we learn that:
* `beluga` and `globis` are whales;  
* most species can be identified as dolphins or whales from the suffix of their name; we will therefore rename `beluga` and `globis` so that they carry the `_whale` suffix as well.

We also observe that:
* `bottlenose_dolphin` is sometimes misspelled as `bottlenose_dolpin`;
* `killer_whale` also appears misspelled as `kiler_whale`.


In [None]:
train_df.loc[train_df.species.str.contains('beluga'), 'species'] = 'beluga_whale'
train_df.loc[train_df.species.str.contains('globis'), 'species'] = 'globis_whale'

In [None]:
train_df['class'] = train_df.species.map(lambda x: 'whale' if 'whale' in x else 'dolphin')

In [None]:
train_df['species'] = train_df['species'].str.replace('bottlenose_dolpin','bottlenose_dolphin')
train_df['species'] = train_df['species'].str.replace('kiler_whale','killer_whale')
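The same cleanup can also be written as a single lookup table, which keeps all renames in one place. The sketch below runs on a toy frame rather than on `train_df`; the `SPECIES_FIXES` dictionary and the `clean_species` helper are illustrative names that do not appear in this Kernel, and the sketch assumes the labels match the keys exactly (the cells above use substring matching for `beluga` and `globis`).

```python
import pandas as pd

# Hypothetical lookup table collecting the renames described above
SPECIES_FIXES = {
    "bottlenose_dolpin": "bottlenose_dolphin",
    "kiler_whale": "killer_whale",
    "beluga": "beluga_whale",
    "globis": "globis_whale",
}

def clean_species(species: pd.Series) -> pd.Series:
    """Replace exact matches found in SPECIES_FIXES; leave other labels unchanged."""
    return species.replace(SPECIES_FIXES)

# Toy data, not read from train.csv
toy_df = pd.DataFrame(
    {"species": ["beluga", "kiler_whale", "bottlenose_dolpin", "blue_whale", "globis"]}
)
toy_df["species"] = clean_species(toy_df["species"])
# Same rule as above: anything without the 'whale' suffix is treated as a dolphin
toy_df["class"] = toy_df["species"].map(lambda x: "whale" if "whale" in x else "dolphin")
print(toy_df)
```

Fixing the spelling before deriving `class` means the class rule only ever sees cleaned labels.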

Let's check how many dolphin species and how many whale species there are.

In [None]:
temp = train_df.groupby(["class"])["species"].nunique()
df = pd.DataFrame({'Classes': temp.index,
                   'Species': temp.values
                  })
df = df.sort_values(['Species'], ascending=False)
plt.figure(figsize = (6,6))
plt.title('Species distribution - grouped on Dolphins and Whales - train dataset')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Classes', y="Species", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

Let's look in more detail at how the values of the `individual_id` column in `train_df` are distributed.

In [None]:
print("Top 10 individual_id")
train_df.individual_id.value_counts().head(10)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(7, 7))
sns.kdeplot(np.log(train_df.individual_id.value_counts()))
plt.title("Logarithmic distribution of individual_id frequency in images")
plt.show()

Let's also look at dolphins and whales separately.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(7, 7))
sns.kdeplot(np.log(train_df.loc[train_df["class"]=='whale'].individual_id.value_counts()))
sns.kdeplot(np.log(train_df.loc[train_df["class"]=='dolphin'].individual_id.value_counts()))
ax.legend(labels=['whale', 'dolphin'])
plt.title("Logarithmic distribution of individual_id frequency in images")
plt.show()
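To help read the density plots above, the sketch below shows on toy identifiers what is being plotted: the number of images per individual, then its natural logarithm. An individual seen in a single image lands at log(1) = 0, so mass near 0 on the x axis corresponds to individuals with very few images. The identifiers and the `share_single` variable are illustrative assumptions, not values from the competition data.

```python
import numpy as np
import pandas as pd

# Toy identifiers, not real individual_id values
ids = pd.Series(["a", "a", "a", "a", "b", "b", "c", "d"])

counts = ids.value_counts()          # images per individual
log_counts = np.log(counts)          # a single-image individual maps to log(1) = 0
share_single = (counts == 1).mean()  # fraction of individuals seen only once
print(counts.to_dict())
print(f"Share of individuals with one image: {share_single:.2f}")
```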

Let's also check the frequency of each species in the train dataset.

In [None]:
df = train_df.groupby(["class", "species"])["image"].count().reset_index()
df.columns = ["Class", "Species", "Images"]
df = df.sort_values(['Images'], ascending=False)
plt.figure(figsize = (12,6))
plt.title('Species distribution - images per each species - train dataset')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Species', y="Images", hue='Class', data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

Let's now see how many individual IDs there are for each species.

In [None]:
df = train_df.groupby(["class", "species"])["individual_id"].nunique().reset_index()
df.columns = ["Class", "Species", "Unique ID Count"]
df = df.sort_values(["Unique ID Count"], ascending=False)
plt.figure(figsize = (12,6))
plt.title('Species distribution - Individual IDs per each species - train dataset')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Species', y="Unique ID Count", hue='Class', data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

Before looking at the image sizes, let's check whether the set of images listed in `train_df` is identical to the set of images in the `train_images` folder.

In [None]:
train_df_list = list(train_df.image.unique())
train_images_list = list(os.listdir('/kaggle/input/happy-whale-and-dolphin/train_images'))
delta = set(train_df_list) & set(train_images_list)
minus = set(train_df_list) - set(train_images_list)
print(f"Images in train dataset: {len(train_df_list)}\nImages in train folder: {len(train_images_list)}\nIntersection: {len(delta)}\nDifference: {len(minus)}")

All images indexed in `train_df` are present in the images folder and vice versa.
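The check above subtracts the sets in one direction only; the reverse direction follows from the matching counts. A minimal sketch that reports both directions explicitly is given below. The `compare_file_sets` helper and the file names are hypothetical and only serve as an illustration.

```python
def compare_file_sets(indexed, on_disk):
    """Return the names shared by both collections and the names unique to each."""
    indexed, on_disk = set(indexed), set(on_disk)
    return {
        "common": indexed & on_disk,
        "only_in_index": indexed - on_disk,
        "only_on_disk": on_disk - indexed,
    }

# Toy file names, not taken from the competition data
result = compare_file_sets(["a.jpg", "b.jpg", "c.jpg"], ["b.jpg", "c.jpg", "d.jpg"])
print({key: sorted(value) for key, value in result.items()})
```

The two collections are identical exactly when both `only_in_index` and `only_on_disk` are empty.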

# Images data exploration

First we test which function (based on `cv2` or on `imagesize`) runs faster.

In [None]:
# image size using cv2 imread shape; the shape order is (height, width, channels)
def read_image_sizes_cv2(file_name):
    image = cv2.imread('/kaggle/input/happy-whale-and-dolphin/train_images/' + file_name)
    return list(image.shape)

In [None]:
# image size using imagesize
def get_image_sizes_imagesize(file_name):
    width, height = imagesize.get('/kaggle/input/happy-whale-and-dolphin/train_images/' + file_name)
    return [width, height]
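The two functions report dimensions in different orders: an array returned by `cv2.imread` has shape (height, width, channels), while `imagesize.get` returns (width, height). The sketch below illustrates the array convention with a NumPy stand-in instead of a real file, so `fake_image` and its dimensions are assumptions made for the example.

```python
import numpy as np

# Stand-in for a decoded image that is 640 pixels wide and 480 pixels high;
# cv2.imread returns an array laid out like this one (rows, columns, channels)
fake_image = np.zeros((480, 640, 3), dtype=np.uint8)

height, width, channels = fake_image.shape
print(f"width={width}, height={height}, channels={channels}")

# Reorder to (width, height) so the result is comparable with imagesize.get
size_as_width_height = (fake_image.shape[1], fake_image.shape[0])
```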

In [None]:
import time
sample_size = 100
start_time = time.time()
train_sample_df = train_df.sample(sample_size)
m = np.stack(train_sample_df['image'].apply(read_image_sizes_cv2))
df = pd.DataFrame(m,columns=['h','w','c'])  # cv2 shape order is (height, width, channels)
print(f"Total processing time for {sample_size} images (using cv2): {round(time.time()-start_time, 2)} sec.")

In [None]:
import time
sample_size = 100
start_time = time.time()
train_sample_df = train_df.sample(sample_size)
m = np.stack(train_sample_df['image'].apply(get_image_sizes_imagesize))
df = pd.DataFrame(m,columns=['w','h'])
print(f"Total processing time for {sample_size} images (using imagesize): {round(time.time()-start_time, 2)} sec.")

We decide to use the `imagesize`-based function, since it is faster.
We will run it on 2500 samples.
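The timing cells above each draw their own random sample, so the comparison is approximate. A small reusable timer can make such measurements easier to repeat on the same inputs. The sketch below uses `time.perf_counter`, which is intended for measuring short durations; the `time_call` helper is a hypothetical name, and the toy workload stands in for a per-file size lookup.

```python
import time

def time_call(func, items):
    """Apply func to every item and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    elapsed = time.perf_counter() - start
    return results, elapsed

# Toy workload standing in for a per-file size lookup
results, elapsed = time_call(len, ["a.jpg", "bb.jpg", "ccc.jpg"])
print(f"Processed {len(results)} items in {elapsed:.6f} sec.")
```

Passing the same list of file names to both size functions would remove the sampling difference from the comparison.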

In [None]:
import time
sample_size = 2500
start_time = time.time()
train_sample_df = train_df.sample(sample_size)
m = np.stack(train_sample_df['image'].apply(get_image_sizes_imagesize))
df = pd.DataFrame(m,columns=['w','h'])
print(f"Total processing time for {sample_size} images (using imagesize): {round(time.time()-start_time, 2)} sec.")

All image sizes are extracted in a different Kernel, [Images Sizes Makes Whales (and Dolphins) Happy](https://www.kaggle.com/gpreda/images-sizes-makes-whales-and-dolphins-happy).

In [None]:
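# reset the index of the sample so that its rows line up with the rows of df
train_img_df = pd.concat([train_sample_df.reset_index(drop=True), df], axis=1, sort=False)
print(f"Number of different image sizes (in the sampled images): {train_img_df.groupby(['w','h']).count().shape[0]}")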

It appears that there are many different image sizes, even though we sampled less than 5% of the total number of images.
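A note on combining a sampled frame with a freshly built one: `DataFrame.sample` keeps the original row labels, while a new `DataFrame` is labelled 0..n-1, and `pd.concat(..., axis=1)` pairs rows by label rather than by position. The toy sketch below shows the difference; the frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a sampled frame: the row labels are not 0..n-1
sampled = pd.DataFrame({"image": ["x.jpg", "y.jpg"]}, index=[7, 3])
sizes = pd.DataFrame({"w": [100, 200], "h": [50, 80]})  # labelled 0 and 1

# Labels do not match, so the result has four partly empty rows
misaligned = pd.concat([sampled, sizes], axis=1)

# Resetting the labels pairs the rows by position instead
aligned = pd.concat([sampled.reset_index(drop=True), sizes], axis=1)
print(misaligned)
print(aligned)
```

When rows end up misaligned, a later `dropna()` keeps only the labels that happen to occur in both frames, which can silently shrink the sample.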

Let's visualize the distribution of width and height per species.

In [None]:
plt.figure(figsize = (12,6))
plt.title('Species distribution - width per each species - train dataset (5% random data sample)')
sns.set_color_codes("pastel")
s = sns.boxplot(x = 'species', y="w", data=train_img_df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

In [None]:
plt.figure(figsize = (12,6))
plt.title('Species distribution - height per each species - train dataset (5% random data sample)')
sns.set_color_codes("pastel")
s = sns.boxplot(x = 'species', y="h", data=train_img_df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

Let's show the distribution of width and height per species using a scatterplot.

In [None]:
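def plot_species_scatter(train_img_df):
    # One width-vs-height scatter panel per species, five panels per row
    sns.set_style('whitegrid')
    species = list(train_img_df.species.unique())
    ncols = 5
    nrows = int(np.ceil(len(species) / ncols))  # grow the grid with the species count
    fig, axes = plt.subplots(nrows, ncols, figsize=(15, 3 * nrows), squeeze=False)
    axes = axes.ravel()

    for ax, spec in zip(axes, species):
        df = train_img_df.loc[train_img_df.species==spec]
        ax.scatter(df['w'], df['h'], marker='+')
        ax.set_xlabel(spec, fontsize=9)
    # hide the panels left unused when the grid is larger than the species count
    for ax in axes[len(species):]:
        ax.axis('off')
    plt.tight_layout()
    plt.show()
plot_species_scatter(train_img_df.dropna())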

The number of color channels seems to always be 3 for the 5% random sample used.



Let's sample a few of the train images, grouped by species.  

We first create a plotting function.

In [None]:
def plot_image_samples(species):
    # Show up to 16 train images of the given species in a 4x4 grid
    root_path = "/kaggle/input/happy-whale-and-dolphin/"
    images_folder="train_images/"
    df = train_df[train_df['species']==species].copy()
    df.index = range(len(df.index))

    f, ax = plt.subplots(4, 4, figsize=(16,16))
    # adjust the spacing of this figure (not of a figure left over from an earlier cell)
    f.subplots_adjust(hspace = .1, wspace=.1)

    for i in range(16):
        cell = ax[i//4, i%4]
        cell.axis('off')
        if i >= len(df):
            continue  # fewer than 16 images available for this species
        file = df.loc[i, 'image']
        identifier = df.loc[i, 'individual_id']
        img = cv2.imread(root_path+images_folder+file)
        cell.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        cell.set_title(identifier+" ("+species+")")

In [None]:
plot_image_samples("bottlenose_dolphin")

In [None]:
plot_image_samples("beluga_whale")

In [None]:
plot_image_samples("humpback_whale")

In [None]:
plot_image_samples("blue_whale")

In [None]:
plot_image_samples("killer_whale")

In [None]:
plot_image_samples("spotted_dolphin")

Let's also look at a sample of test images.

In [None]:
def plot_image_samples_test():
    # Show the first 16 files listed in the test folder in a 4x4 grid
    root_path = "/kaggle/input/happy-whale-and-dolphin/"
    images_folder="test_images/"

    f, ax = plt.subplots(4, 4, figsize=(16,16))
    # adjust the spacing of this figure (not of a figure left over from an earlier cell)
    f.subplots_adjust(hspace = .1, wspace=.1)
    file_list = list(os.listdir(root_path+images_folder))
    for i in range(16):
        file = file_list[i]
        img = cv2.imread(root_path+images_folder+file)
        ax[i//4, i%4].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        ax[i//4, i%4].set_title("Test image: "+file)
        ax[i//4, i%4].axis('off')

In [None]:
plot_image_samples_test()

# Preliminary submission

<img src="https://images.unsplash.com/photo-1602264985195-52b338cb937b?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1170&q=80" width=500></img>

Let's rotate the identifiers so that `new_individual` becomes the first option.

In [None]:
def rotate_values(x):
    xcopy = x.split()
    temp = xcopy[4]
    xcopy[4] = xcopy[0]
    xcopy[0] = temp
    xcopy = " ".join(xcopy)
    return xcopy

In [None]:
submission_df["predictions"] = submission_df["predictions"].apply(rotate_values)

In [None]:
submission_df.head()
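`rotate_values` swaps the first and fifth identifiers, so it relies on each prediction string holding at least five space-separated IDs with the wanted label in the last position. A sketch of a variant that does not depend on the position is shown below. The `move_to_front` helper and the toy IDs are hypothetical, and the assumption that the number of predictions should stay the same is made for this example only.

```python
def move_to_front(predictions: str, label: str = "new_individual") -> str:
    """Place label first, keeping the other IDs in order and the total count unchanged."""
    ids = predictions.split()
    rest = [item for item in ids if item != label]
    # If the label was absent, drop the last ID to make room for it
    if len(rest) == len(ids):
        rest = rest[:-1]
    return " ".join([label] + rest)

# Toy IDs, not real individual_id values
with_label = move_to_front("id_a id_b id_c id_d new_individual")
without_label = move_to_front("id_a id_b id_c id_d id_e")
print(with_label)
print(without_label)
```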

We output the prepared submission file.

In [None]:
submission_df.to_csv('submission.csv', index=False)