# Competition Introduction
## Objective
In the training set, individuals have been manually identified and assigned an individual_id by marine researchers. Our task is to identify each whale and dolphin in the test set images from its unique characteristics, such as the shapes, features and markings (some natural, some acquired) of dorsal fins, backs, heads and flanks. This is similar to identifying individual humans by their faces and other visual features of the body.

## Evaluation
The evaluation metric is the Mean Average Precision @ 5 (MAP@5):
$$
MAP@5 = \frac{1}{U} \sum_{u=1}^{U} \sum_{k=1}^{\min(n,5)} P(k) \times rel(k)
$$
where U is the number of images, P(k) is the precision at cutoff k, n is the number of predictions per image, and rel(k) is an indicator function equaling 1 if the item at rank k is a relevant (correct) label, zero otherwise.
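
To make the metric concrete, here is a minimal sketch of MAP@5 in plain Python. It assumes each image has exactly one correct label, in which case the inner sum reduces to 1/rank of the first correct prediction within the top 5. The function names `average_precision_at_5` and `map_at_5` and the toy labels are illustrative, not part of the competition code.

```python
def average_precision_at_5(true_label, predictions):
    """Score one image: 1/rank of the first correct prediction in the top 5, else 0.

    Assumes a single correct label per image, so P(k) * rel(k) is 1/k at the
    rank k where the correct label first appears and 0 everywhere else.
    """
    for rank, predicted in enumerate(predictions[:5], start=1):
        if predicted == true_label:
            return 1.0 / rank
    return 0.0


def map_at_5(true_labels, predictions_per_image):
    """Average the per-image scores over all U images."""
    scores = [
        average_precision_at_5(true_label, predictions)
        for true_label, predictions in zip(true_labels, predictions_per_image)
    ]
    return sum(scores) / len(scores)


# Toy example with hypothetical labels: hit at rank 1, hit at rank 2, no hit
true_labels = ["a", "b", "c"]
predictions_per_image = [
    ["a", "x", "y", "z", "w"],  # contributes 1.0
    ["x", "b", "y", "z", "w"],  # contributes 0.5
    ["x", "y", "z", "w", "v"],  # contributes 0.0
]
score = map_at_5(true_labels, predictions_per_image)
print(score)
```

Under this assumption, placing the correct individual first earns full credit for that image, and each later rank earns progressively less, so the order of the five predictions matters.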

# Exploring the data

## Exploring train.csv

In [None]:
import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt
import plotly.express as px
import cv2

In [None]:
train_df = pd.read_csv('../input/happy-whale-and-dolphin/train.csv')
train_df.head()

In [None]:
train_df.info()

In [None]:
print(f'There are {train_df["individual_id"].count()} images corresponding to {train_df["individual_id"].nunique()} individuals')

In [None]:
sorted(train_df['species'].unique())

As we can see, there are some inconsistent names in the species column. Let's correct these using the code snippet shared in [this](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305574) discussion thread.

In [None]:
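# Assign the result back instead of calling replace(inplace=True) on a column
# accessed as an attribute, which may not update train_df in newer pandas versions
train_df["species"] = train_df["species"].replace({"globis": "short_finned_pilot_whale",
                                                   "pilot_whale": "short_finned_pilot_whale",
                                                   "kiler_whale": "killer_whale",
                                                   "bottlenose_dolpin": "bottlenose_dolphin"})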

### Distribution of train images by species

In [None]:
# Now let us plot the distribution of train images by species
fig = px.histogram(train_df, x="species")
fig.show()

### Distribution of train images at individual level
Now let us look at the distribution of images per individual, i.e. how many images an individual has on average and at most, shown separately for each species.

In [None]:
# Images per individual
df2 = train_df.groupby(['individual_id','species'])['image'].agg('count').reset_index()
fig2 = px.box(df2, y="image", color='species', log_y=True, labels={'image':'Number of images'})
fig2.show()
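
The box plot shows the spread visually. A small summary table gives the average and maximum number of images per individual for each species directly. Below is a minimal sketch, assuming a DataFrame with the same `image`, `species` and `individual_id` columns as `train_df`. The helper name `images_per_individual_summary` and the toy data are made up for illustration.

```python
import pandas as pd


def images_per_individual_summary(df):
    """Mean and max images per individual, plus individual count, by species."""
    # Count the images for each (species, individual) pair
    counts = (
        df.groupby(["species", "individual_id"])["image"]
        .count()
        .rename("n_images")
        .reset_index()
    )
    # Summarise those per-individual counts within each species
    return (
        counts.groupby("species")["n_images"]
        .agg(["mean", "max", "count"])
        .rename(columns={"count": "n_individuals"})
    )


# Toy data standing in for train_df (hypothetical values)
toy_df = pd.DataFrame({
    "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "species": ["beluga", "beluga", "beluga", "killer_whale", "killer_whale"],
    "individual_id": ["id1", "id1", "id2", "id3", "id3"],
})
summary = images_per_individual_summary(toy_df)
print(summary)
```

In the notebook, the same helper could be called as `images_per_individual_summary(train_df)` once the species names have been corrected.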

## Images
Let us see a few sample images from the dataset.

In [None]:
rows, cols = 10,2
sample = train_df.sample(rows*cols)
print(sample)
f, ax = plt.subplots(rows, cols, figsize=(20,50))
root_path = '../input/happy-whale-and-dolphin/train_images/'
for i,indx in enumerate(sample.index):
    file = sample.loc[indx, 'image']
    species = sample.loc[indx, 'species']
    individual_id = sample.loc[indx, 'individual_id']
    img = cv2.imread(root_path+file)
    title = f'\n individual_id: {individual_id} \n Species: {species}'
    row, col = i//cols, i%cols
    ax[row, col].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    ax[row, col].set_title(title)
    ax[row, col].axis('off')
plt.tight_layout()
plt.show()

The following points can be noted from these sample images:
1. Image sizes are not uniform; there are many different image sizes.
2. In addition to image size, the distance to the subject and the part of the body captured also vary significantly, on top of the usual image variables such as contrast and focus.

## Datasets of resized images
Kagglers have created and generously shared datasets of resized images, which may be useful for training initial prototypes of the models. Links to some of these are below:
1. 512 x 512 by phalanx: https://www.kaggle.com/phalanx/whale2-cropped-dataset
2. 256 x 256 by RDizzl3: https://www.kaggle.com/rdizzl3/jpeg-happywhale-256x256

## *Work in Progress*
This is a work in progress. You are welcome to share your feedback in the comments.