# <B><u>Preview of Whale and Dolphin Dataset with Plotly/Matplotlib</u></B>

# [1] Introduction

The dataset for the "Happywhale - Whale and Dolphin Identification" competition is visualized with <u>[Plotly](https://plotly.com/python/)</u> and <u>[Matplotlib](https://matplotlib.org/)</u> in this kernel. It may help us gain some insight into training strategy, data augmentation, etc.

The kernel shows the following figures and tables for the train images:
- Number of train images,
- Number of species,
- Graph/Table of number of images/individuals for each species (with Plotly),
- Graph/Table of number of images for each individual (with Plotly),
- Sample of train images for each species/individual (with Matplotlib).

The kernel shows the following figures and tables for the test images:
- Number of test images,
- Sample of test images (with Matplotlib).

NOTE1: The author is a beginner in Kaggle/MachineLearning/Python/English, so the kernel may contain bugs or mistakes. I am happy to get your comments. Thank you in advance for your kind advice to make the kernel NICE and to make me a NICE deep learning guy!

NOTE2: The following contents may be useful for beginners in Kaggle/MachineLearning/Python:
 - How to load competition data in Kaggle notebook
 - How to process images with Pillow/Numpy
 - How to process data with Pandas
 - How to visualize data with Plotly/Matplotlib

# [2] Preparation of dataset

The dataset for "Happywhale - Whale and Dolphin Identification" can be loaded by clicking the following buttons in the sidebar of kaggle notebook,

###  "+ Add data" -> "Competition Data" -> "Add (Happywhale - Whale and Dolphin Identification)".

If it succeeds, the data are loaded to the following path. 

In [None]:
path_to_inputs = "/kaggle/input/happy-whale-and-dolphin"
!ls {path_to_inputs}

The first lines of the CSV files are shown.

In [None]:
!head {path_to_inputs}/sample_submission.csv

In [None]:
!head {path_to_inputs}/train.csv

The first few train/test image file names are listed.

In [None]:
!ls {path_to_inputs}/train_images | head

In [None]:
!ls {path_to_inputs}/test_images | head

The numbers of train/test images are shown.

In [None]:
!echo "Number of train_images:"
!ls {path_to_inputs}/train_images | cat -n | tail -1 | cut -f1
!echo ""
!echo "Number of test_images:"
!ls {path_to_inputs}/test_images | cat -n | tail -1 | cut -f1
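
The same count can also be taken in Python, which is handy when the number is needed later in the notebook. This is only a minimal sketch: `count_files` is a hypothetical helper (not part of the original kernel), and it assumes the images are `.jpg` files stored directly in the directory.

```python
from pathlib import Path


def count_files(directory, suffix=".jpg"):
    """Return the number of regular files in `directory` whose name ends with `suffix`."""
    return sum(
        1
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.name.lower().endswith(suffix)
    )
```

In this notebook it could be called as `count_files("%s/train_images" % path_to_inputs)`.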

Metadata exists for the train images but not for the test images, so dummy metadata is generated for the test images. It uses the same format as the train metadata so that the test images can be loaded in the same manner.

In [None]:
# Generates metadata for test images. 
path_to_test_metadata = "/kaggle/working/test.csv"

!echo "image,species,individual_id" > {path_to_test_metadata}
!ls {path_to_inputs}/test_images | sed "s/\.jpg/.jpg,unknown,unknown/" >> {path_to_test_metadata}

# Shows contents of generated metadata.
!head {path_to_test_metadata}
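
The dummy metadata could also be built with Pandas instead of shell commands. The sketch below is an assumption-laden alternative, not the kernel's method: `build_dummy_metadata` is a hypothetical helper, and it assumes every test image has a `.jpg` extension.

```python
from pathlib import Path

import pandas as pd


def build_dummy_metadata(path_to_dir_images, placeholder="unknown"):
    """Build a DataFrame with the same columns as train.csv for unlabeled images."""
    file_names = sorted(
        entry.name
        for entry in Path(path_to_dir_images).iterdir()
        if entry.suffix.lower() == ".jpg"
    )
    # The scalar placeholder is broadcast to every row.
    return pd.DataFrame(
        {
            "image": file_names,
            "species": placeholder,
            "individual_id": placeholder,
        }
    )
```

The result can be written with `to_csv(path_to_test_metadata, index=False)` to get a file in the same format as the one generated above.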

# [3] Showing train images

The train images and their statistics are shown:
- Number of species,
- Graph/Table of number of images/individuals for each species (with Plotly),
- Graph/Table of number of images for each individual (with Plotly).

## [3-1] Loading dataset

In [None]:
# Installs required libraries.
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install Pillow
!pip install plotly

In [None]:
# Imports required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
import plotly.graph_objects as go
from plotly.subplots import make_subplots
#import IPython

In [None]:
# Defines the class to load metadata and images and to process those.
class WhaleAndDolphin():
    def __init__(self, path_to_metadata, path_to_dir_images):
        self._path_to_metadata = path_to_metadata
        self._path_to_dir_images = path_to_dir_images
        self._metadata = pd.read_csv(path_to_metadata)
        
    def getAllSpecies(self):
        return self._metadata["species"].unique()
    
    def sliceMetadata(self, query):
        return self._metadata.query(query).reset_index(drop=True)
    
    def getAllIndividualIDs(self, metadata):
        return metadata["individual_id"].unique()
    
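    def showImagesTile(self, metadata, num_cols=4):
        # Ceiling division avoids an empty extra row when the image count
        # is a multiple of num_cols; at least one row keeps the figure valid.
        num_rows = max(1, -(-len(metadata) // num_cols))
        fig = plt.figure(figsize=(6.4 * num_cols, 4.8 * num_rows))

        # Enumerates positions so the layout does not depend on the DataFrame index.
        for position, row in enumerate(metadata.itertuples(), start=1):
            ax = fig.add_subplot(num_rows, num_cols, position)

            title = self.getTitle(row)
            ax.set_title(title)

            image = self.getImage(row)
            ax.imshow(image)

        plt.show()
        plt.close(fig)
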
        
    def getTitle(self, metadata_row):
        #return self._title(metadata_row.individual_id, metadata_row.species)
        return metadata_row.image
    
    def getImage(self, metadata_row):
        path_to_image = self._pathToImage(self._path_to_dir_images, \
                                          metadata_row.image)
        return Image.open(path_to_image)
    
    def _title(self, individual_id, species):
        return "%s (%s)" % (individual_id, species)
    
    def _pathToImage(self, path_to_dir_images, file_name):
        return "%s/%s" % (path_to_dir_images, file_name)
    
    def getImageArray(self, metadata_row):
        image_pil = self.getImage(metadata_row)
        return np.array(image_pil)
    
    def showIndividualImagesTile(self, metadata, num_cols=3, \
                                 max_num_individual_images=3, \
                                 max_num_individuals=10):
        individual_ids = self.getAllIndividualIDs(metadata)
        for individual_id in individual_ids[:max_num_individuals]:
            print()
            print("Individual ID : %s" % individual_id)
            metadata_individual = \
                metadata.query("individual_id == @individual_id").reset_index(drop=True)
            self.showImagesTile(
                metadata=metadata_individual[:max_num_individual_images], \
                num_cols=num_cols \
            )
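
The tile layout in `showImagesTile` needs enough rows to hold every image. A minimal sketch of that calculation follows; `grid_shape` is a hypothetical helper written only for illustration and is not part of the class above.

```python
import math


def grid_shape(num_images, num_cols):
    """Return (num_rows, num_cols) for a tile grid that holds `num_images` images."""
    if num_cols < 1:
        raise ValueError("num_cols must be at least 1")
    # Ceiling division; keep at least one row so an empty selection still has a valid grid.
    num_rows = max(1, math.ceil(num_images / num_cols))
    return num_rows, num_cols
```

With 8 images and 4 columns this gives 2 rows, while 9 images need a third row.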

The dataset for train images is loaded. The dataset for test images will be loaded later using the same class.

In [None]:
# Loads metadata for train images.
path_to_metadata = "%s/train.csv" % path_to_inputs
path_to_dir_images = "%s/train_images" % path_to_inputs

whale_and_dolphin = WhaleAndDolphin(
    path_to_metadata=path_to_metadata,
    path_to_dir_images=path_to_dir_images
)

## [3-2] Statistics of train images

### [3-2-1] Number of species

The number of species and the species names are shown.

In [None]:
# Shows number of species and their names.
all_species = whale_and_dolphin.getAllSpecies()

print("Number of species:")
print(len(all_species))
print()

print("Name of species:")
print(all_species)
print()

### [3-2-2] Number of images/individuals for each species

The graph/table of the number of images/unique individuals for each species is shown with Plotly.

In [None]:
# Calculates numbers of images/individuals for each species.
metadata = {}
stats_species = pd.DataFrame(columns=["num_of_images", "num_of_individuals"], \
                             index=all_species)

for species in all_species:
    # Calculates number of images for each species.
    metadata[species] = whale_and_dolphin.sliceMetadata(query="species == @species")
    num_images = len(metadata[species])
    
    # Calculates number of individuals for each species. 
    individual_ids = whale_and_dolphin.getAllIndividualIDs(metadata[species])
    num_individuals = len(individual_ids)
    
    # Appends the result into summary table.
    stats_species.loc[species] = [num_images, num_individuals]

# Calculates total number of images/individuals and appends the result into summary table.
stats_species.loc["total"] = [stats_species["num_of_images"].sum(), stats_species["num_of_individuals"].sum()]
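
The per-species counts can also be computed in one pass with a Pandas `groupby`. This is a sketch of an alternative to the loop above, not a replacement for it: `summarize_species` and the toy data are hypothetical, and the column names are assumed to match `train.csv`.

```python
import pandas as pd


def summarize_species(metadata):
    """Count images and unique individuals per species in one pass."""
    return metadata.groupby("species").agg(
        num_of_images=("image", "count"),
        num_of_individuals=("individual_id", "nunique"),
    )


# Toy data with the same columns as train.csv (values are made up).
toy_metadata = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "species": ["beluga", "beluga", "blue_whale", "beluga"],
        "individual_id": ["id_1", "id_1", "id_2", "id_3"],
    }
)
toy_stats = summarize_species(toy_metadata)
print(toy_stats)
```

Note that `groupby` sorts the species alphabetically, whereas the loop keeps the order in which the species first appear in the metadata.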

In [None]:
# Shows the graph of numbers of images/individuals for each species with Plotly.
fig = go.Figure()
# Excludes the last row ("total"). DataFrame.iteritems() was removed in pandas 2.0, so items() is used.
for column_name, items in stats_species.iloc[:-1].items():
    trace = go.Bar(x=items.index.tolist(), y=items.tolist(), name=column_name)
    fig.add_trace(trace)
fig.update_layout(yaxis_title="Number of images/individuals for each species")
fig.show()

In [None]:
# Shows the table of numbers of images/individuals for each species.
print("Number of images/individuals for each species:")
stats_species

### [3-2-3] Number of images for each individual

The graph/table of the number of images for each individual is shown with Plotly. The total number of individuals is too large to show at once, so only the numbers of images for the species "false_killer_whale" are shown as an example. The species can be changed by editing the variable "species".

In [None]:
species = "false_killer_whale" # It can be changed to "melon_headed_whale", "humpback_whale", etc.
individual_ids = whale_and_dolphin.getAllIndividualIDs(metadata[species])

stats_individuals = pd.DataFrame(columns=["num_of_images"], index=individual_ids)

for individual_id in individual_ids:
    # Calculates number of images for each individual and appends the result into summary table.
    metadata_individual = metadata[species].query("individual_id == @individual_id").reset_index(drop=True)
    num_images = len(metadata_individual)
    
    stats_individuals.loc[individual_id] = [num_images]

# Calculates total number of images and appends the result into summary table.
stats_individuals.loc["total"] = [stats_individuals["num_of_images"].sum()]
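
The number of images per individual can likewise be obtained with `value_counts`. The sketch below is an alternative under the same assumptions about the column names; `count_images_per_individual` and the toy data are hypothetical.

```python
import pandas as pd


def count_images_per_individual(metadata, species):
    """Return a one-column table of image counts for each individual of `species`."""
    subset = metadata[metadata["species"] == species]
    counts = subset["individual_id"].value_counts()
    return counts.rename("num_of_images").to_frame()


# Toy data with the same columns as train.csv (values are made up).
toy_rows = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "species": ["beluga", "beluga", "blue_whale", "beluga"],
        "individual_id": ["id_1", "id_1", "id_2", "id_3"],
    }
)
toy_counts = count_images_per_individual(toy_rows, "beluga")
print(toy_counts)
```

`value_counts` returns the individuals sorted by count in descending order, which makes the most frequently photographed individuals easy to spot.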

In [None]:
# Shows the graph of the number of images for each individual with Plotly.
fig = go.Figure()
# Excludes the last row ("total"). DataFrame.iteritems() was removed in pandas 2.0, so items() is used.
for column_name, items in stats_individuals.iloc[:-1].items():
    trace = go.Bar(x=items.index.tolist(), y=items.tolist(), name=column_name)
    fig.add_trace(trace)
fig.update_layout(title_text="Species : %s" % species, yaxis_title="Number of images for each individual")
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

In [None]:
# Unsets limitation of display of rows.
pd.set_option("display.max_rows", None)

# Shows the table of the number of images for each individual.
print("Number of images for each individual of %s:" % species)
print("(Total number of individuals for %s: %d)" % (species, len(stats_individuals)-1))
stats_individuals

## [3-3] Sample of train images

The train images for the first 10 individuals of each species are shown with Matplotlib. The images for the same individual are shown in the same row. 
The number of images per row can be changed with the variable "num_cols", the number of images per individual with "max_num_individual_images", and the number of individuals per species with "max_num_individuals".

In [None]:
# Shows train images for the first 10 individuals for each species.
num_cols = 4
max_num_individual_images = 4
max_num_individuals = 10

for species in all_species:
    print()
    print("--------------------------------------------------")
    print()
    print("   Images for %s" % species)
    print()
    print("--------------------------------------------------")
    whale_and_dolphin.showIndividualImagesTile( \
        metadata=metadata[species], \
        num_cols=num_cols, \
        max_num_individual_images=max_num_individual_images, \
        max_num_individuals=max_num_individuals \
    )
    print()
    print()

# [4] Showing test images

The test images and their statistics are shown:
- Number of images.

## [4-1] Loading dataset

The dataset for test images is loaded using the same class as for the train images.

In [None]:
# Loads metadata for test images.
path_to_metadata = path_to_test_metadata
path_to_dir_images = "%s/test_images" % path_to_inputs

whale_and_dolphin = WhaleAndDolphin(
    path_to_metadata=path_to_metadata,
    path_to_dir_images=path_to_dir_images
)

## [4-2] Statistics of test images

The number of species and the species names are shown. Because the dummy metadata labels every test image as "unknown", only that one species appears.

In [None]:
# Shows number of species and their names.
all_species = whale_and_dolphin.getAllSpecies()

print("Number of species:")
print(len(all_species))
print()

print("All species:")
print(all_species)
print()

The number of images is shown.

In [None]:
print("Number of images:")
print()
print("species, num_of_images")

metadata = {}
for species in all_species:
    metadata[species] = whale_and_dolphin.sliceMetadata(query="species == @species")
    num_images = len(metadata[species])
    
    print("%s, %d" % (species, num_images))

## [4-3] Sample of test images

A sample of test images is shown with Matplotlib. 
The number of images can be changed with the variable "num_images", and the number of images per row with "num_cols".

In [None]:
# Shows first 100 test images.
num_images = 100
i_first = 0
i_end = i_first + num_images
num_cols = 4

for species in all_species:
    print()
    print("--------------------------------------------------")
    print()
    print("   Images for %s" % species)
    print()
    print("--------------------------------------------------")
    print()
    whale_and_dolphin.showImagesTile( \
        metadata=metadata[species][i_first:i_end], \
        num_cols=num_cols \
    )
    print()
    print()