# Happywhale - Whale and Dolphin Identification
### Identify whales and dolphins by unique characteristics

As a follow-up to the previous Happywhale competition, **Humpback Whale Identification**, where the objective was to identify individual whales from images of their tails, this competition has been made more challenging by adding images of dolphins. The objective remains the same: identify the whale/dolphin ID on the basis of the shapes, features and markings (some natural, some acquired) of dorsal fins, backs, heads and flanks.

In the data exploration below we'll see that some species of whales and dolphins have very distinctive features that can be identified with little effort, while other species have very similar features that are hard to distinguish.

This competition expands that task significantly: the data contains images of over 15,000 unique individual marine mammals from 30 different species, collected from 28 different research organizations. Individuals have been manually identified and given an individual_id by marine researchers, and your task is to correctly identify these individuals in images. It's a challenging task that has the potential to drive significant advancements in understanding and protecting marine mammals across the globe.

So now that our objective is clear we'll start exploring the provided information.

In [None]:
import os
os.listdir("/kaggle/input/happy-whale-and-dolphin")

In the dataset, the organiser has provided the following files/folders:
- train.csv
- sample_submission.csv
- train_images - contains images for the training dataset
- test_images - contains images for the test dataset

Let's begin by exploring the train.csv file.

#### Loading libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import plotly.express as px
from cv2 import imread
from random import sample

%matplotlib inline

In [None]:
train = pd.read_csv("/kaggle/input/happy-whale-and-dolphin/train.csv")
display(train)

From the output above we can see that the train.csv file contains three fields:
- image - The name of the image file, which can be found in the train_images directory.
- species - The name of the whale or dolphin species.
- individual_id - The ID of the individual whale/dolphin. This is the target variable in this competition.

In [None]:
print(f'Total number of Individual = {len(train.individual_id.unique())}.')

print(f'distinct count of each Individual are as follows :-')

count_df = train.individual_id.value_counts().rename_axis('unique_values').to_frame('counts')

print(f'Following are the first 5 individual IDs based on count :-')
display(count_df.head(5))
print(f'Following are the last 5 individual IDs based on count :-')
display(count_df.tail(5))


fig = px.histogram(train, x="individual_id", title="individual_id Distribution")
fig.show()


From the frequency table and plot above we can see that one individual, with ID **37c7aba965a5**, has the maximum number of images, while many individual IDs have just 1 image in the training dataset.

Since the individual ID is the target variable in this competition, this is an imbalanced data problem: for some individuals we have very few images. There are also more than 15K unique individuals, so the target has high cardinality as well, which we may try to mitigate by grouping the individuals by species.
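
The imbalance described above can be quantified by counting how many individuals have exactly one image, two images, and so on. The following is a minimal sketch on a toy frame; the `toy` data and the helper name `images_per_individual_summary` are illustrative assumptions, and only the `individual_id` column name comes from train.csv.

```python
import pandas as pd

# Toy stand-in for train.csv (assumption); only the column name matches the real file.
toy = pd.DataFrame({
    "individual_id": ["a", "a", "a", "b", "b", "c", "d", "e"],
})

def images_per_individual_summary(df):
    """Summarise how many individuals have exactly k images."""
    per_individual = df["individual_id"].value_counts()
    # Index: number of images; value: how many individuals have that many images.
    histogram = per_individual.value_counts().sort_index()
    # Fraction of individuals that appear only once.
    singleton_share = (per_individual == 1).mean()
    return histogram, singleton_share

histogram, singleton_share = images_per_individual_summary(toy)
print(histogram)
print(f"Share of individuals with a single image: {singleton_share:.0%}")
```

Calling the same helper on `train` should give the corresponding numbers for the real data.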

In [None]:
print(f'Total number of species = {len(train.species.unique())}.')

print(f'distinct count of each species are as follows :-')

count_df = train.species.value_counts().rename_axis('Species').to_frame('counts')
print(f'Following are the First 5 Species based on count :-')
display(count_df.head(5))
print(f'Following are the last 5 Species based on count :-')
display(count_df.tail(5))

fig = px.histogram(train, x="species",color = "species",title="Species Distribution")
fig.show()


While going through the species names we found that two of the species appear under two different spellings:
- Killer Whale is also given as Kiler Whale.
- Bottlenose Dolphin is also given as Bottlenose Dolpin.

So our next step is to correct these names and redo the species exploration.
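
Spelling variants like these can also be flagged programmatically rather than by eye. The sketch below compares every pair of labels with the standard library's `difflib.SequenceMatcher`; the helper name `near_duplicate_labels` and the 0.9 threshold are assumptions chosen for illustration, and any flagged pair still needs a manual check because genuinely different species can have similar names.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_labels(labels, threshold=0.9):
    """Return pairs of distinct labels whose similarity ratio is >= threshold."""
    pairs = []
    for a, b in combinations(sorted(set(labels)), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs

# Small illustrative subset of species labels (not the full list from train.csv).
species = ["killer_whale", "kiler_whale", "bottlenose_dolphin",
           "bottlenose_dolpin", "blue_whale", "beluga"]

for a, b, ratio in near_duplicate_labels(species):
    print(a, b, ratio)
```

On the real data the input would be `train.species.unique()`.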

In [None]:
train.species = train.species.replace("kiler_whale","killer_whale")
train.species = train.species.replace("bottlenose_dolpin","bottlenose_dolphin")

print(f'Total number of species = {len(train.species.unique())}.')

print(f'distinct count of each species are as follows :-')

count_df = train.species.value_counts().rename_axis('Species').to_frame('counts')
print(f'Following are the First 5 Species based on count :-')
display(count_df.head(5))
print(f'Following are the last 5 Species based on count :-')
display(count_df.tail(5))

fig = px.histogram(train, x="species",color = "species",title="Species Distribution")
fig.show()


- As the histogram above shows, the number of images differs widely between species, so when building a model to identify the species we'll have to use a data balancing technique.
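
One common balancing technique is to weight each class inversely to its frequency. The sketch below uses the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`; the toy label list and the helper name `inverse_frequency_weights` are assumptions, and this is only one of several possible balancing approaches (resampling is another).

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Toy label list (assumption); on the real data this would be train.species.
toy_species = ["bottlenose_dolphin"] * 6 + ["beluga"] * 3 + ["blue_whale"] * 1
weights = inverse_frequency_weights(toy_species)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Rare classes receive larger weights, so each class contributes equally to a weighted loss overall.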

In [None]:
train["Whale_dolphin_others"] = [x[-1] for x in train.species.str.split("_")]

print(f'distinct count of each whale/dolphin/others are as follows :-')

count_df = train.Whale_dolphin_others.value_counts().rename_axis('Whale_dolphin_others').to_frame('counts')
display(count_df)

fig = px.histogram(train, x="Whale_dolphin_others",color = "Whale_dolphin_others",title="Whale_dolphin_others Distribution")
fig.show()

After researching the species on Google and in other notebooks from the competition, one of which is [What about species](https://www.kaggle.com/kwentar/what-about-species), where Aleksey has done a very nice exploration of the species, we concluded the following:
- Beluga is a species of whale.
- Globis: we did not find much information on Google, but based on the notebook mentioned above we conclude that it is also a species of whale.

So our next step is to merge Beluga and Globis into the whale group, and then explore the proportion of whales and dolphins.

In [None]:
train["Whale_dolphin_others"] = ["whale" if x in ["beluga","globis"] else x for x in train.Whale_dolphin_others]

fig = px.pie(train.Whale_dolphin_others, values=train.Whale_dolphin_others.value_counts().values, names=train.Whale_dolphin_others.value_counts().index,title="Number Of Training Images Of Whales and Dolphins")
fig.show()

From the chart above we can see that roughly one third of the images are of dolphins and roughly two thirds are of whales (the small Globis share is included in the whale slice after the merge above). A likely reason for this imbalance is that, like the previous competition, this one was originally designed to identify individual whales, and dolphins may have been added later to increase the complexity of the competition.

In any case, the species information should help in first identifying whether an image shows a whale or a dolphin and then identifying the individual ID, much like identifying gender or race in a human face dataset before identifying the individual.

In [None]:
Species_DF = train[["species","Whale_dolphin_others"]].drop_duplicates()
Species_GP_DF=Species_DF.groupby("Whale_dolphin_others").agg("count").rename(columns = {"species": "Species_Count"})
                                                                  
fig = px.pie(Species_GP_DF, values=Species_GP_DF.Species_Count, names=Species_GP_DF.index,title='Number Of Unique Species In Whales and Dolphins')
fig.show()

From the chart above we can see that the number of species is imbalanced between dolphins and whales as well. **The number of dolphin species is almost 50% of the number of whale species.**

Next, let's explore the number of individual IDs in the dolphin and whale groups.

In [None]:
Individuals_DF = train[["individual_id","Whale_dolphin_others"]].drop_duplicates()
Individuals_GP_DF = Individuals_DF.groupby("Whale_dolphin_others").agg("count").rename(columns = {"individual_id": "Individual_ID_Count"}) 
# display(Individuals_DF.groupby("Whale_dolphin_others").agg("count").rename(columns = {"individual_id": "Individual_ID_Count"}))

fig = px.pie(Individuals_GP_DF, values=Individuals_GP_DF.Individual_ID_Count, names=Individuals_GP_DF.index,title='Number Of Individuals In Whales and Dolphins')
fig.show()

Looking at the individual counts for whales and dolphins, we noticed that **the number of unique individuals in the dolphin category is also almost 50% of the number of individuals in the whale category.**

Now that we have explored the number of individuals among dolphins and whales, we should explore the number of individuals in each species.

In [None]:
Individuals_DF = train[["individual_id","species"]].drop_duplicates()
Individuals_GP_DF = Individuals_DF.groupby("species").agg("count").rename(columns = {"individual_id": "Individual_ID_Count"}) 

c = dict(zip(Individuals_GP_DF.index.unique(), px.colors.qualitative.Alphabet))
fig = px.pie(Individuals_GP_DF, values=Individuals_GP_DF.Individual_ID_Count,names=Individuals_GP_DF.index,color=Individuals_GP_DF.index,color_discrete_map=c,title='Species Wise Number of Individuals')
fig.show()

count_df = train.species.value_counts().rename_axis('Species').to_frame('Image_counts')
fig = px.pie(count_df, values=count_df.Image_counts, names=count_df.index,color=count_df.index,color_discrete_map=c,title='Species Wise Number of Images in training data')
fig.show()


Comparing the number of individuals in each species with the number of images in the data, we found the following:

- By individual count, Dusky Dolphin has the most individuals, yet it is not even in the top 5 species by training image count.
- Beluga, which is 2nd by training image count, is 5th by individual count.
- Bottlenose Dolphin, which has the most training images, is not even in the top 5 species by individual count.
- Humpback Whale and Blue Whale hold almost the same positions within the top 5 in both charts.
- Melon-headed Whale, which is 4th by individual count, is 9th by training image count.
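
The comparison between the two charts can also be expressed as a single table with both counts, their ranks, and the average number of images per individual. This is a minimal sketch on toy data; the values in `toy` are invented for illustration and only the column names `species` and `individual_id` match train.csv.

```python
import pandas as pd

# Toy stand-in for train.csv (assumption): 3 species with different image/individual mixes.
toy = pd.DataFrame({
    "species": ["dusky_dolphin"] * 4 + ["beluga"] * 6 + ["blue_whale"] * 2,
    "individual_id": ["d1", "d2", "d3", "d4",
                      "b1", "b1", "b1", "b2", "b2", "b2",
                      "w1", "w1"],
})

# One row per species: number of images and number of distinct individuals.
summary = toy.groupby("species").agg(
    image_count=("individual_id", "size"),
    individual_count=("individual_id", "nunique"),
)
summary["images_per_individual"] = summary["image_count"] / summary["individual_count"]
# Rank 1 = largest count; ties share the smallest rank.
summary["image_rank"] = summary["image_count"].rank(ascending=False, method="min").astype(int)
summary["individual_rank"] = summary["individual_count"].rank(ascending=False, method="min").astype(int)

print(summary.sort_values("images_per_individual"))
```

A species with a low `images_per_individual` value has many individuals but few images of each, which is the pattern noted for Dusky Dolphin above.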

Now that we have explored the train.csv file, let's have a look at the sample_submission.csv file.

In [None]:
submission = pd.read_csv("/kaggle/input/happy-whale-and-dolphin/sample_submission.csv")
display(submission)

From the sample submission file above we can see that our objective is to submit the top 5 individual IDs that best match each test image. The organiser has mentioned that the test dataset may contain images of individuals that do not appear in the training data; for those images we'll have to predict a new category called **new_individual**.
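
One simple way to handle **new_individual** is to rank the known IDs by a model score and insert `new_individual` at the point where the scores drop below a threshold. The sketch below is only an illustration under several assumptions: the `scores` dictionary stands in for whatever similarity or probability a model would produce, the 0.5 threshold is arbitrary, the helper name is hypothetical, and the space-separated output format should be checked against sample_submission.csv.

```python
def top5_with_new_individual(scores, threshold=0.5, k=5):
    """Rank known IDs by score and slot in 'new_individual' at the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    predictions = []
    inserted = False
    for individual_id, score in ranked:
        # Place new_individual ahead of the first ID scoring below the threshold.
        if not inserted and score < threshold:
            predictions.append("new_individual")
            inserted = True
        predictions.append(individual_id)
    if not inserted:
        # Every known ID scored above the threshold; new_individual goes last.
        predictions.append("new_individual")
    return predictions[:k]

# Hypothetical scores for one test image.
scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.3, "e": 0.2, "f": 0.1}
print(" ".join(top5_with_new_individual(scores)))
```

The threshold would need tuning on a validation split that contains individuals held out from training.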

## Image Data Exploration

The organiser has given us more than 51K images of whales and dolphins combined. As a first step we will explore the images of whales and dolphins separately, and then explore the images of each species.

In [None]:
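def plot_grid(Images,species,n):
    # Plot an n x n grid of images that were loaded with cv2.imread.
    plt.figure(figsize=(15,15))
    for i in range(1,n*n+1):
        plt.subplot(n,n,i,frameon=False)
        # plt.tight_layout()
        # cv2.imread returns BGR; reverse the channel axis so matplotlib shows RGB.
        plt.imshow(Images[i-1][:, :, ::-1])
        plt.title(species)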

### Whale Images

In [None]:
train_image_path = "/kaggle/input/happy-whale-and-dolphin/train_images"
test_image_path = "/kaggle/input/happy-whale-and-dolphin/test_images"

Grid_size = 4

whale_Image_id = sample(list(train[train.Whale_dolphin_others=="whale"].image),Grid_size*Grid_size)
whale_images = [imread(os.path.join(train_image_path,x)) for x in whale_Image_id]
plot_grid(whale_images,"Whale",Grid_size)

In [None]:
whale_Image_id = sample(list(train[train.Whale_dolphin_others=="whale"].image),Grid_size*Grid_size)
whale_images = [imread(os.path.join(train_image_path,x)) for x in whale_Image_id]
plot_grid(whale_images,"Whale",Grid_size)

### Dolphin Images

In [None]:
dolphin_Image_id = sample(list(train[train.Whale_dolphin_others=="dolphin"].image),Grid_size*Grid_size)
dolphin_images = [imread(os.path.join(train_image_path,x)) for x in dolphin_Image_id]
plot_grid(dolphin_images,"Dolphin",Grid_size)

In [None]:
dolphin_Image_id = sample(list(train[train.Whale_dolphin_others=="dolphin"].image),Grid_size*Grid_size)
dolphin_images = [imread(os.path.join(train_image_path,x)) for x in dolphin_Image_id]
plot_grid(dolphin_images,"Dolphin",Grid_size)

### Image Samples for each Species 

In [None]:
for i in train.species.unique():
    species_image_id = sample(list(train[train.species==i].image),4)
    species_images = [imread(os.path.join(train_image_path,x)) for x in species_image_id]
    plot_grid(species_images,i,2)
    

Further points to work on:
- As the image exploration shows, the images do not all share the same dimensions, so it would be a good idea to explore the distribution of dimensions across dolphins and whales, as well as across species, to check for bias in the image dataset.
- Explore issues in the images such as blur, high brightness and low contrast.

This exploration of the image dataset will help in choosing the required image augmentations and the base NN architecture design.
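
The dimension and quality checks listed above could start from a few per-image statistics. The sketch below is an assumption-laden starting point rather than a finished analysis: the helper name `image_quality_stats` is hypothetical, the variance of a 4-neighbour Laplacian is used as a rough sharpness proxy (low values suggest blur), and synthetic arrays stand in for images loaded with `imread`.

```python
import numpy as np

def image_quality_stats(image):
    """Return size and simple brightness/contrast/sharpness stats for an HxWx3 uint8 array."""
    height, width = image.shape[:2]
    gray = image.astype(np.float64).mean(axis=2)
    # 4-neighbour Laplacian on interior pixels; its variance is a rough sharpness proxy.
    laplacian = (gray[:-2, 1:-1] + gray[2:, 1:-1]
                 + gray[1:-1, :-2] + gray[1:-1, 2:]
                 - 4.0 * gray[1:-1, 1:-1])
    return {
        "height": height,
        "width": width,
        "aspect_ratio": width / height,
        "brightness": gray.mean(),   # mean intensity, 0-255
        "contrast": gray.std(),      # standard deviation of intensity
        "sharpness": laplacian.var(),
    }

# Synthetic stand-ins for loaded images (assumption): one noisy, one perfectly flat.
rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, size=(60, 80, 3), dtype=np.uint8)
flat = np.full((60, 80, 3), 128, dtype=np.uint8)

stats_noisy = image_quality_stats(noisy)
stats_flat = image_quality_stats(flat)
print(stats_noisy)
print(stats_flat)
```

Collecting these dictionaries into a DataFrame alongside `species` would allow the per-species comparison suggested above.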

# To be continued..