# Packages used

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Loading the training dataframe

In [None]:
train_df = pd.read_csv('../input/happy-whale-and-dolphin/train.csv')

In [None]:
train_df.head()

# General information on species and individuals

To start, let's retrieve the species names and how many of each there are in the dataset.

In [None]:
train_df.species.groupby(train_df.species).count()

In [None]:
# Count the distinct species names (not the distinct per-species counts)
num_species = train_df.species.nunique()
print('Number of species:', num_species)

As the output above shows, some species names are misspelled or appear under more than one label, so they need to be replaced with a single correct name.
The issue was also identified by another Kaggler, [Aleksey Alekssev](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305574).

In [None]:
# from Aleksey Alekssev
replacement_data = {'globis': 'short_finned_pilot_whale',
                    'pilot_whale': 'short_finned_pilot_whale',
                    'kiler_whale': 'killer_whale',
                    'bottlenose_dolpin': 'bottlenose_dolphin'}

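# Assign the result back instead of using inplace=True on a column accessor,
# which is not guaranteed to modify train_df itself
train_df['species'] = train_df['species'].replace(replacement_data)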

train_df.species.groupby(train_df.species).count()

To visualize how the species are distributed, we can use a horizontal bar chart.

In [None]:
plt.figure(figsize=(10, 6))
train_df.species.value_counts().sort_values(ascending=True).plot(kind='barh')

And finally, some basic descriptive statistics about the `species` and `individual_id` columns.

In [None]:
train_df[['species', 'individual_id']].describe()

From the information retrieved so far, we now know how many classes our CNN model should use:
* 15587 classes for the known unique individuals, plus 1 class for previously unseen individuals.
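
One way to turn this class count into model targets is to map each `individual_id` to an integer index and reserve one extra index for unseen individuals. The following is a minimal sketch on a toy dataframe standing in for `train_df`; the names `toy_df`, `NEW_INDIVIDUAL`, `id_to_class` and `encode` are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

# Toy stand-in for train_df; the real dataframe has far more rows.
toy_df = pd.DataFrame({'individual_id': ['a1', 'b2', 'a1', 'c3']})

# Hypothetical label reserved for individuals absent from the training set.
NEW_INDIVIDUAL = 'new_individual'

# Map each known individual to an integer class index (sorted for reproducibility).
known_ids = sorted(toy_df.individual_id.unique())
id_to_class = {ind: idx for idx, ind in enumerate(known_ids)}
id_to_class[NEW_INDIVIDUAL] = len(known_ids)  # the extra class

num_classes = len(id_to_class)

def encode(individual_id):
    """Return the class index, falling back to the extra class for unknown ids."""
    return id_to_class.get(individual_id, id_to_class[NEW_INDIVIDUAL])

print('Number of classes:', num_classes)
```

Applied to the real `train_df`, the same mapping would give the number of known individuals plus one.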

As complementary information, let's see which individual appears most often in the dataset.

In [None]:
train_df[train_df.individual_id == '37c7aba965a5'].head(1)

Who else appears many times in the dataset?

In [None]:
# value_counts() already sorts in descending order
train_df.individual_id.value_counts().head(10)

Since I am considering a CNN for this competition, and the `individual_id` counts show that the training data is actually unbalanced, I wonder about the following:
> When working with a CNN, will such unbalanced data be an issue for the final model?
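
One common way to soften class imbalance, though not necessarily the best one for this competition, is to weight each class inversely to its frequency when computing the loss. The sketch below computes such weights on a toy series standing in for `train_df.individual_id`; `toy_ids` and `class_weights` are illustrative names.

```python
import pandas as pd

# Toy stand-in for train_df.individual_id; the real column is far more skewed.
toy_ids = pd.Series(['a1', 'a1', 'a1', 'b2', 'b2', 'c3'])

counts = toy_ids.value_counts()

# Inverse-frequency weights, scaled so that the average weight per sample is 1:
# weight(c) = n_samples / (n_classes * count(c))
class_weights = len(toy_ids) / (len(counts) * counts)

print(class_weights.sort_index())
```

Rare individuals receive larger weights, so their few images contribute more to the loss than each image of a frequently photographed individual.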

And regarding the image sizes:
> What could be a good initial approach to dealing with different image sizes in order to feed them to the CNN?
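
A simple starting point, offered here only as a sketch, is to pad each image to a square and resize it to one fixed resolution so that all images can be stacked into a batch. The example uses NumPy only, with dummy arrays in place of real photos; the function name `pad_and_resize` and the target size of 224 are assumptions chosen for illustration.

```python
import numpy as np

def pad_and_resize(image, size=224):
    """Pad an H x W x C image to a square, then resize it to size x size
    with nearest-neighbour sampling."""
    h, w = image.shape[:2]
    side = max(h, w)
    # Centre the image on a black square canvas to preserve the aspect ratio.
    canvas = np.zeros((side, side) + image.shape[2:], dtype=image.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    # Nearest-neighbour resampling: pick one source pixel per target pixel.
    idx = np.arange(size) * side // size
    return canvas[idx][:, idx]

# Two dummy images of different sizes standing in for real photos.
images = [np.ones((300, 500, 3), dtype=np.uint8),
          np.ones((640, 480, 3), dtype=np.uint8)]
batch = np.stack([pad_and_resize(img) for img in images])
print(batch.shape)  # (2, 224, 224, 3)
```

In practice an image library with proper interpolation would likely give better quality than nearest-neighbour sampling; the point here is only that every image ends up with the same shape.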

Kaggler RDizzl3 shared a topic on [Reduced Resolution Image Data](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/304686) to improve computational performance for those working with computer vision for the first time.




