# Overview

## Description

Researchers manually track marine life by the shape and markings on their tails, dorsal fins, heads and other body parts. Identification by natural markings via photographs, known as photo-ID, is a powerful tool for marine mammal science. It allows individual animals to be tracked over time and enables assessments of population status and trends. By automating whale and dolphin photo-ID, researchers can reduce image identification times by over 99%. More efficient identification could enable a scale of study previously unaffordable or impossible.

Thousands of hours go into manual matching, which involves staring at photos to compare one individual to another, finding matches, and identifying new individuals. Manual matching limits the scope and reach of these studies.

In this competition, you’ll develop a model to match individual whales and dolphins by unique characteristics of their natural markings. You'll pay particular attention to dorsal fins and lateral body views in image sets from a multi-species dataset. The best submissions will suggest photo-ID solutions that are fast and accurate.

## Evaluation

Submissions are evaluated according to the Mean Average Precision @ 5 (MAP@5):

$$
MAP@5 = \cfrac{1}{U} \sum_{u=1}^{U} \sum_{k=1}^{\min(n, 5)} P(k) \times rel(k)
$$

where $U$ is the number of images, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per image, and $rel(k)$ is an indicator function equal to 1 if the item at rank $k$ is a relevant (correct) label and 0 otherwise.

Once a correct label has been scored for an observation, that label is no longer considered relevant for that observation, and additional predictions of that label are skipped in the calculation. For example, if the correct label is A for an observation, any prediction list that ranks A first, such as `[A, B, C, D, E]` or `[A, A, A, A, A]`, scores an average precision of 1.0.
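
The metric can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the official scorer: the function names `average_precision_at_5` and `map_at_5` are hypothetical, and the code assumes exactly one correct label per image.

```python
def average_precision_at_5(true_label, predictions):
    """AP@5 for one image with a single correct label.

    Only the first occurrence of the correct label counts, so with one
    relevant label P(k) * rel(k) reduces to 1 / k at the rank of the hit.
    """
    for k, predicted in enumerate(predictions[:5], start=1):
        if predicted == true_label:
            return 1.0 / k
    return 0.0


def map_at_5(true_labels, predictions_per_image):
    """Mean of AP@5 over all images."""
    scores = [
        average_precision_at_5(true, preds)
        for true, preds in zip(true_labels, predictions_per_image)
    ]
    return sum(scores) / len(scores)


print(average_precision_at_5("A", ["A", "B", "C", "D", "E"]))  # 1.0
print(average_precision_at_5("A", ["B", "A", "C", "D", "E"]))  # 0.5
```

Because the score drops to 1/k when the correct label sits at rank k, the order of the five predictions matters as much as their content.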

# Imports

In [None]:
%%time
import os
import cv2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from tensorflow.keras import layers

tf.config.list_physical_devices()

# Constants

In [None]:
ROOT_PATH = "../input/happy-whale-and-dolphin/"

TRAIN_CSV = ROOT_PATH + "train.csv"
TRAIN_DIR = ROOT_PATH + "train_images/"
TEST_DIR = ROOT_PATH + "test_images/"

TRAIN_IMAGES = os.listdir(TRAIN_DIR)
TEST_IMAGES = os.listdir(TEST_DIR)

SAMPLE_SUBMITION_CSV = ROOT_PATH + "sample_submission.csv"

Whales and dolphins in this dataset can be identified by shapes, features and markings (some natural, some acquired) of dorsal fins, backs, heads and flanks. Individuals have been manually identified and given an __individual_id__ by marine researchers, and your task is to correctly identify these individuals in images.

An important note about data quality: bringing together this dataset from many different research organizations posed a number of practical challenges.

## Files
- __train.csv__ - provides the species and the individual_id for each of the training images
- __train_images/__ - a folder containing the training images
- __test_images/__ - a folder containing the test images; for each image, your task is to predict the __individual_id__; no species information is given for the test data; there are individuals in the test data that are not observed in the training data, which should be predicted as __new_individual__.
- __sample_submission.csv__ - a sample submission file in the correct format
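
One possible way to handle unseen individuals is to insert `new_individual` into the ranked list whenever the model is not confident. The sketch below assumes the model yields one score per known individual (higher means more likely). The function name `top5_with_new_individual` and the `threshold` value are hypothetical and would need tuning on validation data.

```python
def top5_with_new_individual(scores, threshold=0.5):
    """Return up to 5 ranked labels, placing "new_individual" right before
    the first candidate whose score falls below `threshold`.

    scores: dict mapping individual_id -> score (higher means more likely).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    result = []
    inserted = False
    for individual_id, score in ranked:
        if not inserted and score < threshold:
            result.append("new_individual")
            inserted = True
        result.append(individual_id)
    if not inserted:  # every candidate was confident enough
        result.append("new_individual")
    return result[:5]


# Hypothetical scores for a single test image.
print(top5_with_new_individual({"id_a": 0.9, "id_b": 0.4, "id_c": 0.3}))
# ['id_a', 'new_individual', 'id_b', 'id_c']
```

Assuming the submission expects the labels of one image joined by spaces, the returned list can be written out with `" ".join(...)`.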

# EDA

## Dataframe

In [None]:
train_df = pd.read_csv(TRAIN_CSV)
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.describe()

- All images are unique
- 30 species in total
- 15,587 individuals in total

In [None]:
51_033 / 15_587  # average number of images per individual

Each image needs a ranked list of up to 5 individual_id predictions. With roughly 3.3 images per individual on average, the same individual_id appears in several different images.
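
The average alone does not show how the images are spread across individuals. A minimal sketch of how that could be checked with `value_counts` follows; `toy_df` is a made-up stand-in that mirrors only the `image` and `individual_id` columns of `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values).
toy_df = pd.DataFrame({
    "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "individual_id": ["id1", "id1", "id1", "id2", "id3"],
})

# Number of images for every individual, most frequent first.
images_per_individual = toy_df["individual_id"].value_counts()

# Individuals that appear in exactly one image.
singletons = int((images_per_individual == 1).sum())

print(images_per_individual.mean())  # average images per individual
print(singletons)                    # individuals seen only once
```

Running the same two lines on `train_df` would show how many individuals have only a single image, which matters when splitting off a validation set.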

In [None]:
plt.figure(figsize=(14, 8))

plt.title("species distribution")
sns.countplot(data=train_df, y='species')

plt.xticks(np.linspace(0, 10_000, 11))
plt.grid(axis='x')

The species are clearly imbalanced:
- Rare species should either be dropped or oversampled (duplicated).
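
Duplication could be done by sampling rows with replacement. The sketch below is one possible approach rather than the method used later in this notebook; the helper `oversample_rare_species`, its `min_count` parameter and the toy data are assumptions.

```python
import pandas as pd


def oversample_rare_species(df, min_count, seed=0):
    """Duplicate rows of species with fewer than `min_count` images
    (sampling with replacement) until each has `min_count` rows."""
    parts = []
    for _, group in df.groupby("species"):
        missing = min_count - len(group)
        if missing > 0:
            extra = group.sample(n=missing, replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts, ignore_index=True)


# Toy stand-in for train_df (hypothetical values).
toy_df = pd.DataFrame({
    "image": [f"{i}.jpg" for i in range(6)],
    "species": ["species_a"] * 5 + ["species_b"],
})

balanced = oversample_rare_species(toy_df, min_count=3)
print(balanced["species"].value_counts())
```

Exact duplicates add no new information, so in practice they are usually combined with image augmentation.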

## Images

In [None]:
class PrepImgs:
    
    def __init__(self, root: str, file: str):
        """Loads an image and allows it to be manipulated
        Methods:
            info(): image description
            prep(): preprocess for this competition
            save(): image save
        Params:
            root: image folder
            file: image name with extension
        """
        self.root = root
        self.file = file
        self.path = os.path.join(root, file)
        
        self.array = cv2.imread(self.path)
        if self.array is None:  # cv2.imread returns None instead of raising
            raise FileNotFoundError(f"Cannot read image: {self.path}")
        self.array = self.array[:, :, ::-1]  # BGR (OpenCV) -> RGB
        
    def save(self, to: str = "./new_image/"):
        """Saves the image to a folder
        Params:
            to: folder to save the image in (must already exist)
        """
        array = self.array[:, :, ::-1]  # RGB -> BGR for OpenCV
        array = array if array.max() > 1 else array * 255  # undo 0-1 scaling
        array = array.astype(np.uint8)
        
        cv2.imwrite(os.path.join(to, self.file), array)
        
    def prep(self, size: int = 256):
        """Preprocess image for this competition
        Params:
            size: side length of the resized square image
        """
        self.array = layers.Resizing(size, size)(self.array)  # resize to size x size
        self.array = self.array.numpy()  # tf tensor -> numpy array
        self.array = self.array.astype(np.uint8)  # Resizing returns floats
        
        return self.array
        
    def info(self, show: bool = True):
        """Describes the image: displays it and prints its path, min, max and shape
        Params:
            show: whether to display the image
        """
        if show:
            plt.figure(figsize=(2, 2))
            plt.imshow(self.array)
            plt.axis("off")
            plt.show()
        
        print(f"path: {self.path}",
              f"min: {self.array.min()}",
              f"max: {self.array.max()}",
              f"shape: {self.array.shape}", sep="\n")

In [None]:
PrepImgs(TRAIN_DIR, TRAIN_IMAGES[0]).info()
PrepImgs(TEST_DIR, TEST_IMAGES[0]).info()

The images are encoded as RGB with values in 0 - 255:
- They need to be rescaled to 0 - 1
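
A minimal sketch of that rescaling with NumPy, assuming the image is a `uint8` array like `PrepImgs.array`. The helper name `rescale_to_unit` is hypothetical.

```python
import numpy as np


def rescale_to_unit(array):
    """Convert a uint8 image (0-255) to float32 in the range 0-1."""
    return array.astype(np.float32) / 255.0


pixels = np.array([[[0, 128, 255]]], dtype=np.uint8)  # a 1x1 RGB image
scaled = rescale_to_unit(pixels)
print(scaled.dtype, scaled.min(), scaled.max())
```

Inside a Keras model the same step can instead be expressed as a `layers.Rescaling(1.0 / 255)` layer, which keeps the stored images as `uint8`.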

In [None]:
def view_images(directory, files, titles_from=None, name="", rows=2, cols=3, figsize=(13, 7)):
    """Show a sample of images in a grid
    Params:
        directory:   path to the folder with images
        files:       file names with extensions
        titles_from: dataframe with "image" and "species" columns used for titles
        name:        title for the whole figure
        rows:        rows of grid
        cols:        cols of grid
        figsize:     size of the whole figure
    """

    # squeeze=False keeps ax two-dimensional even for a single row or column
    fig, ax = plt.subplots(rows, cols, figsize=figsize, squeeze=False)
    
    fig.suptitle(name, fontsize=16)
    
    for r in range(rows):
        for c in range(cols):
            
            idx = c + r * cols  # row-major index into files
            if idx >= len(files):  # fewer files than grid cells
                ax[r, c].axis("off")
                continue
            
            file = files[idx]
            img = PrepImgs(directory, file).array
            
            if titles_from is not None:
                title = titles_from.loc[titles_from["image"] == file, "species"]
                title = title.values[0]
                
                ax[r, c].set_title(title)
                
            ax[r, c].imshow(img)

In [None]:
view_images(TRAIN_DIR, TRAIN_IMAGES, train_df, name="TRAIN IMAGES")

In [None]:
view_images(TEST_DIR, TEST_IMAGES, name="TEST IMAGES")

The images have different shapes and are too large for a neural network:
- They need to be downsized
- And brought to a common size

Since up to 5 individual_id values must be predicted for each image, it is worth looking at different images of the same animal.

In [None]:
individual = "60008f293a2b"
example = train_df[train_df["individual_id"] == individual]
example = list(example["image"])

view_images(TRAIN_DIR, example, name=individual)

## Summary

The preprocessing step needs to:
1. Assign an integer label to each individual_id
2. Bring all images to the same size
3. Downsize the images
4. Rescale pixel values from 0 - 255 to 0 - 1
5. Decide what to do with the imbalanced species

# Preprocessing

## Labeling

In [None]:
labels = train_df["individual_id"].unique()
labels = {ind: idx for idx, ind in enumerate(labels)}

train_df["label"] = train_df["individual_id"].map(labels)
train_df["label"] = train_df['label'].astype(np.uint)
train_df.head()
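
Model outputs will be integer labels, so an inverse mapping is needed to turn them back into individual_id values. A minimal sketch, using a toy list in place of `train_df["individual_id"]`. The names `inverse_labels` and `predicted` are assumptions.

```python
# Toy stand-in for train_df["individual_id"] (hypothetical values).
individual_ids = ["id_b", "id_a", "id_b", "id_c"]

# Forward mapping in first-seen order, as unique() + enumerate produce it.
labels = {}
for ind in individual_ids:
    labels.setdefault(ind, len(labels))

# Inverse mapping: integer label -> individual_id.
inverse_labels = {idx: ind for ind, idx in labels.items()}

predicted = [2, 0]  # hypothetical model outputs
print([inverse_labels[idx] for idx in predicted])  # ['id_c', 'id_b']
```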

## Images

I resize the images once and save them, so that they load faster later.

In [None]:
def save_new_images(path, to="./new_images/", count=None):
    """Resizes images and saves them to a new folder
    Params:
        path:  folder to read files from
        to:    folder to save files to
        count: how many files to process (all if None)
    """
    os.makedirs(to, exist_ok=True)  # no error if the folder already exists
    
    files = os.listdir(path)
    if count:
        files = files[:count]
    
    for file in files:
        img = PrepImgs(path, file)
        img.prep()
        img.save(to)

# new image dirs
NEW_TRAIN_DIR = "./new_train_images/"
NEW_TEST_DIR = "./new_test_images/"
        
save_new_images(TRAIN_DIR, NEW_TRAIN_DIR, 100)
# save_new_images(TEST_DIR, NEW_TEST_DIR)

In [None]:
view_images(NEW_TRAIN_DIR, TRAIN_IMAGES, train_df, "TRAIN IMAGES", cols=4)