![](https://api.time.com/wp-content/uploads/2015/04/humpback-whale.jpg)
**A breaching humpback (image credit: TIME)**


**Whales are intelligent and emotional beings. Here is a recommended watch in which a humpback saves a biologist from a lurking tiger shark.**

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo(id='OXNCCdcBhcY',width=1000,height=600, allow_autoplay=False)


# About 

**We use fingerprints and facial recognition to identify people, but can we use similar approaches with animals? In fact, researchers manually track marine life by the shape and markings on their tails, dorsal fins, heads and other body parts. Identification by natural markings via photographs—known as photo-ID—is a powerful tool for marine mammal science. It allows individual animals to be tracked over time and enables assessments of population status and trends.**
![](https://storage.googleapis.com/kaggle-media/competitions/Happywhale/AU%20Kaggle%20Competition%20Description%20Image-03.jpg)

**Algorithms developed in this competition will be implemented in Happywhale, a research collaboration and citizen science web platform. Its mission is to increase global understanding and caring for marine environments through high quality conservation science and education.**


**In this competition, you’ll develop a model to match individual whales and dolphins by unique—but often subtle—characteristics of their natural markings. You'll pay particular attention to dorsal fins and lateral body views in image sets from a multi-species dataset built by 28 research institutions. The best submissions will suggest photo-ID solutions that are fast and accurate.**

> **To summarize, the aim of the competition is to develop an algorithm that identifies individual whales and dolphins from images of their dorsal fins and lateral body views.**

# References and Resources 
* Fantastic notebook by @ruchi798: https://www.kaggle.com/ruchi798/and-identification-eda-augmentation
* Fantastic notebook by @awsaf49: https://www.kaggle.com/awsaf49/happywhale-data-distribution


# Imports

In [None]:
import os 
import gc
import matplotlib.pyplot as plt 
import seaborn as sns
import numpy as np 
import pandas as pd 
from glob import glob
from tqdm import tqdm
#colored print
from termcolor import colored

#deep learning 
import tensorflow as tf 




try:  # if gpu is ON
    from cuml import TSNE, UMAP   # cuml is a gpu accelerated library 
except Exception:  # cuml missing or unusable: fall back to the cpu implementations
    from sklearn.manifold import TSNE # for cpu
    from umap import UMAP

**Setting a Configuration Object**
    

Keeping parameters such as the image height and width in one object makes them easy to change, track, and retrieve.


In [None]:
class config:
    def __init__(self,
                 seed=7,
                 img_size=(264,264),
                 batch=32):
        self.seed = seed
        self.img_size= img_size  #image size 
        self.batch_size = batch
    
    def set_seed(self):
        '''set seed for reproducibility'''

        tf.random.set_seed(self.seed)
        os.environ['PYTHONHASHSEED'] = str(self.seed)
        np.random.seed(self.seed)
        print(f'Setting Random Seed  to {self.seed}')

cfg = config()
cfg.set_seed()
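
The same idea can be sketched with the standard library's `dataclasses`. This is an illustrative alternative, not what the notebook uses: the name `SketchConfig` is hypothetical, and its fields simply mirror the defaults of `config` above. Freezing the object means a variant has to be derived explicitly rather than edited in place.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the `config` class above."""
    seed: int = 7
    img_size: Tuple[int, int] = (264, 264)
    batch_size: int = 32


base_cfg = SketchConfig()

# derive a variant without mutating the original object
small_cfg = replace(base_cfg, img_size=(128, 128), batch_size=64)

print(base_cfg)
print(small_cfg)
```

A frozen dataclass would not allow attaching new attributes later (as done with `cfg.train` further down), so the plain class is the more flexible choice for this notebook.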

# Helper Functions

In [None]:
def load_image(path,
               img_size=None,
               expand_dims=False,
               rescale=False):
    '''load the image at the given path as a float32 RGB tensor'''
    img = tf.io.read_file(path)             # read the raw file bytes
    # channels=3 makes single band (b and w) images come back as (n,n,3),
    # which also works inside Dataset.map where the static shape is unknown
    img = tf.image.decode_image(img, channels=3, expand_animations=False)
    img = tf.cast(img, tf.float32)
    
    if rescale:
        img = img / 255.0                   # convert img pixels to range [0,1]
    if img_size:
        img = tf.image.resize(img,
                              size=img_size)  # resize image 
    if expand_dims:
        img = tf.expand_dims(img,axis=0)    # add a leading batch axis
        
    return img


def plot_image_grid(image_list,
                    label_list,
                    sample_images=False,
                    num_images=12,
                    pre_title='class',
                    num_img_per_row=3,
                    cmap=None,
                    img_h_w=3):
    '''viz images from a list of images and labels
    INPUTS:
    image_list: a list of images to be plotted,
    label_list: a list of corresponding image labels'''

    if sample_images:
        #select random images (with replacement)
        ids = np.random.choice(len(image_list),size=num_images)
    else:
        #take the first num_images images in order
        ids = np.arange(min(num_images,len(image_list)))

    #number of img rows, rounded up so a partial last row still fits
    n_row = max(1,int(np.ceil(len(ids)/num_img_per_row)))

    plt.figure(figsize=(img_h_w*num_img_per_row,(img_h_w-2)*n_row))

    for i,idx in enumerate(ids):
        plt.subplot(n_row,num_img_per_row,i+1)
        plt.title(f'{pre_title} - {label_list[idx]}')  # label of the image actually shown
        plt.axis('off')
        plt.imshow(image_list[idx],cmap=cmap)

    #show
    plt.tight_layout()
    plt.show()

# Loading Data

**Train Csv Description**

    image: Image file name in the train directory.
    
    species: Species name of the whale or dolphin in the image.
    
    individual_id: Unique ID of the individual animal (something like a tag).
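
To make the three columns concrete, here is a small sketch on a made-up frame. The rows, file names, and IDs below are invented for illustration and are not taken from `train.csv`; only the column names come from the description above.

```python
import pandas as pd

# made-up rows that only mimic the three columns described above
toy = pd.DataFrame({
    "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "species": ["humpback_whale", "humpback_whale", "bottlenose_dolphin", "beluga"],
    "individual_id": ["id_1", "id_1", "id_2", "id_3"],
})

# how many images each individual has
images_per_individual = toy.groupby("individual_id")["image"].nunique()

# how many distinct individuals each species has
individuals_per_species = toy.groupby("species")["individual_id"].nunique()

print(images_per_individual)
print(individuals_per_species)
```

One individual can appear in several images, which is why `image` and `individual_id` are counted separately in the cells that follow.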

In [None]:
train_dir = '../input/happy-whale-and-dolphin/train_images/' #train directory
test_dir = '../input/happy-whale-and-dolphin/test_images/'   #test directory

#add data dirs to config for ease of access
cfg.train = train_dir
cfg.test  = test_dir


train = pd.read_csv('../input/happy-whale-and-dolphin/train.csv') #train csv
train.drop_duplicates(inplace=True)


#adding path to train 
train['path'] = train_dir + train['image']

#some species labels are misspelled or inconsistent, so the same species appears under two names

#fixing the misspelled labels; beluga and globis are whales, so renaming them as such
train['species'] = train['species'].replace({'bottlenose_dolpin': 'bottlenose_dolphin',
                                             'kiler_whale': 'killer_whale',
                                             'beluga': 'beluga_whale',
                                             'globis': 'globis_whale'})



sample_sub = pd.read_csv('../input/happy-whale-and-dolphin/sample_submission.csv') #sample submission 



print_color = 'green'

print(colored(f'Number of Images in train directory {train.image.nunique()}',print_color))
print(colored(f'Number of Images in test directory {len(os.listdir(test_dir))}',print_color))
print(colored(f'Number of Unique Individuals in train directory {train.individual_id.nunique()}',print_color))
print(colored(f'Number of Unique Species in train directory {train.species.nunique()}',print_color))

# Checking the Distribution of Dolphin and Whales Species 

**Whales**

In [None]:


whales = [x for x in train.species.unique() if 'whale' in x]


print(f'Number of Unique Whale Species in dataset {len(whales)}')

whls = train[train['species'].isin(whales)]
train.loc[train['species'].isin(whales),'type'] = 'whales'

print(colored(f'Number of Whales Images in train directory {whls.image.nunique()}',print_color))
print(colored(f'Number of Unique Whales in train directory {whls.individual_id.nunique()}',print_color))

In [None]:
plt.figure(figsize=(20,10))
plt.yticks(fontsize=16)
sns.countplot(y="species",
              data=whls,
              order=whls.iloc[0:]["species"].value_counts().index,
              palette="GnBu_r",
              linewidth=3)
plt.title("Whale Species Distribution",font="Serif", size=20,color='k')
plt.show()

**Dolphins**

In [None]:
dolphin = [x for x in train.species.unique() if 'dolphin' in x]

print(f'Number of Unique Dolphin Species in dataset {len(dolphin)}')

dlps = train[train['species'].isin(dolphin)]
train.loc[train['species'].isin(dolphin),'type'] = 'dolphins'

print(colored(f'Number of dolphin Images in train directory {dlps.image.nunique()}',print_color))
print(colored(f'Number of Unique Dolphins in train directory {dlps.individual_id.nunique()}',print_color))

In [None]:
plt.figure(figsize=(20,10))
plt.yticks(fontsize=16)
sns.countplot(y="species",
              data=dlps,
              order=dlps.iloc[0:]["species"].value_counts().index,
              palette="GnBu_r",
              linewidth=3)
plt.title("Dolphin Species Distribution",font="Serif", size=20,color='Green')
plt.show()

# Visualizing Whale Species

In [None]:
sample_ids = []
sample_sp  =[]

for species in whales:
    sample = whls[whls['species']==species].sample(3)
    sample_ids.extend(sample.path) # sample paths 
    sample_sp.extend(sample.species) # sample species 
    

whale_imgs = [load_image(path,rescale =True) for path in sample_ids] # load sample images to view 

plot_image_grid(whale_imgs[:12],
                sample_sp[:12],
                sample_images=False,
                num_images=12,
                pre_title='',
                num_img_per_row=3,
                cmap=None,
                img_h_w=5)

In [None]:

plot_image_grid(whale_imgs[12:24],
                sample_sp[12:24],
                sample_images=False,
                num_images=12,
                pre_title='',
                num_img_per_row=3,
                cmap=None,
                img_h_w=5)

In [None]:

plot_image_grid(whale_imgs[24:36],
                sample_sp[24:36],
                sample_images=False,
                num_images=12,
                pre_title='',
                num_img_per_row=3,
                cmap=None,
                img_h_w=5)

In [None]:

plot_image_grid(whale_imgs[36:54],
                sample_sp[36:54],
                sample_images=False,
                num_images=18,
                pre_title='',
                num_img_per_row=3,
                cmap=None,
                img_h_w=5)

#  Visualizing Dolphin Species

In [None]:
sample_ids = []
sample_sp  =[]

for species in dolphin:
    sample = dlps[dlps['species']==species].sample(3)
    sample_ids.extend(sample.path) # sample paths 
    sample_sp.extend(sample.species) # sample species 
    

dolp_imgs = [load_image(path,rescale =True) for path in sample_ids]

plot_image_grid(dolp_imgs[:15],
                sample_sp[:15],
                sample_images=False,
                num_images=15,
                pre_title='',
                num_img_per_row=3,
                cmap=None,
                img_h_w=5)

In [None]:

plot_image_grid(dolp_imgs[15:],
                sample_sp[15:],
                sample_images=False,
                num_images=15,
                pre_title='',
                num_img_per_row=3,
                cmap=None,
                img_h_w=5)

In [None]:
#delete imgs loaded in RAM
del whale_imgs,dolp_imgs,sample_ids,sample_sp,whls,dlps; gc.collect()

# Getting Image Embeddings

    What are these embeddings? We pass the images through a pretrained convolutional neural network with its final softmax layer removed, and keep its output as the representation the model has extracted from each image.
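
As a rough sketch of the last step, the backbone's output feature map is collapsed into one vector per image by averaging over the spatial grid (global average pooling). The array below is random data standing in for real backbone output; the 9x9 grid and batch of 4 are arbitrary assumptions, while 1280 matches the embedding width mentioned later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for a backbone output: 4 images, 9x9 spatial grid, 1280 channels
feature_maps = rng.normal(size=(4, 9, 9, 1280))

# global average pooling: mean over the two spatial axes
embeddings = feature_maps.mean(axis=(1, 2))

print(embeddings.shape)
```

Each image ends up as a single 1280-dimensional vector regardless of the spatial size of the feature map.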

In [None]:
#loading EfficientNet B0 pretrained on ImageNet
efnet = tf.keras.applications.EfficientNetB0(include_top=False, # exclude the final prediction layer 
                                            weights='imagenet')# trained on imagenet 
                                            

In [None]:
def build_model(backbone):
    '''Generate Embeddings from pretrained models
    
    Inputs:
    backbone : pretrained models'''
    inp = tf.keras.layers.Input(shape=(*cfg.img_size,3))
    x = backbone(inp)
    output = tf.keras.layers.GlobalAveragePooling2D()(x)
    
    return tf.keras.Model(inputs=inp,outputs=output)

model = build_model(backbone=efnet)

# Loading Data for Generating Embeddings

**Taking 5000 images each from the training and test sets to check their distribution**

In [None]:
#loading train data 
num_samples = 5000


#store the paths to train files in these lists
tr_files = glob(train_dir + '*.jpg')
train_files = tf.data.Dataset.list_files(file_pattern=tr_files, 
                                      shuffle=True,
                                      seed= cfg.seed).take(num_samples)

#store the paths to test files in these lists
ts_files = glob(test_dir + '*.jpg') 
test_files = tf.data.Dataset.list_files(ts_files,
                                      shuffle=True,
                                      seed= cfg.seed).take(num_samples)


**AUTOTUNE**

In [None]:
# AUTOTUNE is a constant that lets tf.data choose the number of parallel calls at runtime.
AUTOTUNE = tf.data.AUTOTUNE

**LOADING IMAGES**

In [None]:
train_ds = train_files.map(load_image,num_parallel_calls=AUTOTUNE)
test_ds = test_files.map(load_image,num_parallel_calls=AUTOTUNE)

**PREPROCESS IMAGES**

In [None]:
def preprocess_img(img,
                  img_size=cfg.img_size,
                  expand_dims =False):
    '''resize images to a common shape for the backbone'''
    
    img = tf.image.resize(images=img,size=img_size)
    # no / 255 here: tf.keras EfficientNet rescales internally and expects pixels in [0,255]
    
    if expand_dims:
        img = tf.expand_dims(img,
                             axis=0)  # add a leading batch axis
    
    return img

In [None]:
#apply preprocessing function to dataset

train_ds = train_ds.map(preprocess_img,num_parallel_calls=AUTOTUNE)
test_ds  = test_ds.map(preprocess_img,num_parallel_calls=AUTOTUNE)

In [None]:
def optimize_pipeline(tf_dataset,
                      batch_size = cfg.batch_size,
                      Autotune_fn = AUTOTUNE,
                      cache= False,
                      batch = True):
    if cache:
        tf_dataset = tf_dataset.cache()                        # store data in RAM  
        
    tf_dataset =  tf_dataset.shuffle(buffer_size=100)         # shuffle 
    
    if batch:
        tf_dataset = tf_dataset.padded_batch(batch_size)              #split the data in batches
    
    # prefetch(load the data with cpu,while gpu is training) the data in memory 
    tf_dataset = tf_dataset.prefetch(buffer_size=Autotune_fn)    
    
    return tf_dataset

In [None]:
# optimize for performance 
train_ds=optimize_pipeline(train_ds)
test_ds=optimize_pipeline(test_ds)

**Now Our data is ready to be passed into the model**

# Generate Embeddings

In [None]:
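def predict(dataset,model = model):
    '''run the model over every batch of `dataset` and return the stacked embeddings'''
    embeddings = []
    
    for batch in tqdm(dataset.as_numpy_iterator()):   #iterate over the batches of the dataset passed in
        
        embeds = model(batch,training=False)   # forward pass through the pretrained model 
        
        embeddings.append(embeds.numpy())        # collect the (batch,1280) array 
        
    return np.concatenate(embeddings,axis=0)    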

In [None]:
#get embeddings
train_embeddings = predict(train_ds)

In [None]:
test_embeddings = predict(test_ds)

In [None]:
print(f'train embeddings are in range {train_embeddings.min(),train_embeddings.max()}')
print(f'test embeddings are in range {test_embeddings.min(),test_embeddings.max()}')

# Checking the distribution of the data using UMAP and TSNE 

**TSNE**

* **t-SNE is a non-linear dimensionality reduction technique that preserves local clusters while reducing the dimensionality.**
* **It is useful for visualizing high-dimensional data by projecting it into lower dimensions that can be plotted.**
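
To make "preserves local clusters" concrete: t-SNE starts by turning pairwise distances into neighbour probabilities with a Gaussian kernel, so distant points receive almost no weight. The sketch below shows only that first step on toy data. It uses a single fixed bandwidth `sigma`, which is a simplifying assumption; t-SNE itself searches for a per-point bandwidth that matches the chosen perplexity.

```python
import numpy as np

rng = np.random.default_rng(0)

# two toy clusters in 10 dimensions, 5 points each
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(5, 10))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(5, 10))
points = np.vstack([cluster_a, cluster_b])

# squared Euclidean distance between every pair of points
diff = points[:, None, :] - points[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

sigma = 1.0  # fixed bandwidth (assumption; t-SNE tunes one per point)
logits = -sq_dist / (2.0 * sigma ** 2)
np.fill_diagonal(logits, -np.inf)  # a point is not its own neighbour

# row-wise softmax gives p(j | i): how strongly point i picks j as a neighbour
logits -= logits.max(axis=1, keepdims=True)
affinities = np.exp(logits)
affinities /= affinities.sum(axis=1, keepdims=True)

# probability mass the first point assigns to members of its own cluster
same_cluster_mass = affinities[0, :5].sum()
print(same_cluster_mass)
```

The low-dimensional layout is then optimized so that its own neighbour probabilities match these, which is why nearby points stay together while distances between far-apart clusters carry little meaning.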

In [None]:
%%time
#instantiate tsne object with following params
tsne = TSNE(n_components=3,perplexity=20,n_iter=2500,random_state=cfg.seed)

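#fit once on train and test stacked together, so both sets land in the same 3-d space
#(two separate fit_transform calls give two unrelated spaces that cannot be overlaid)
all_tsne = tsne.fit_transform(np.concatenate([train_embeddings,test_embeddings]))

#first rows: train projections, shape (5000,3)
train_tsne = all_tsne[:len(train_embeddings)]

In [None]:
#remaining rows: test projections, shape (5000,3)
test_tsne = all_tsne[len(train_embeddings):]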

In [None]:
df_tsne = pd.DataFrame(train_tsne,columns=['dim1','dim2','dim3'])
df_tsne['set'] = 'train'

test_tsne = pd.DataFrame(test_tsne,columns=['dim1','dim2','dim3'])
test_tsne['set'] = 'test'


#stack train and test df together (DataFrame.append was removed in pandas 2.0)
df_tsne = pd.concat([df_tsne,test_tsne],ignore_index=True)

#set the train color to green
df_tsne.loc[df_tsne['set']=='train','color'] = 'green'

#set test color to blue
df_tsne.loc[df_tsne['set']=='test','color'] = 'blue'

df_tsne.shape

In [None]:
plt.style.use('seaborn-white')
from mpl_toolkits import mplot3d 

fig = plt.figure(figsize =(15,15)) 
ax = plt.axes(projection ='3d') 

#one scatter call per set, so each set gets its own legend entry
for set_name,grp in df_tsne.groupby('set'):
    ax.scatter(xs=grp['dim2'].values,
               ys=grp['dim1'].values,
               zs=grp['dim3'].values,
               c=grp['color'].iloc[0],
               s=3,
               label=set_name)


ax.set_xlabel('dim 2')
ax.set_ylabel('dim 1')
ax.set_zlabel('dim 3')

ax.legend()
plt.title('TSNE reduced embeddings')
plt.show()

**UMAP**

* It is a similar dimensionality reduction algorithm.

In [None]:
%%time
umap =UMAP(n_neighbors=20,
          n_components=3,
          min_dist=0.3,
          metric='euclidean')

#project the (5000,1280) array to (5000,3)
train_umap = umap.fit_transform(train_embeddings)

In [None]:
#project the test (5000,1280) array to (5000,3) with the mapping already fitted on train,
#so both sets share the same space
test_umap = umap.transform(test_embeddings)

In [None]:
df_umap = pd.DataFrame(train_umap,columns=['dim1','dim2','dim3'])
df_umap['set'] = 'train'

test_umap = pd.DataFrame(test_umap,columns=['dim1','dim2','dim3'])
test_umap['set'] = 'test'

#stack train and test df together (DataFrame.append was removed in pandas 2.0)
df_umap = pd.concat([df_umap,test_umap],ignore_index=True)

#set the train color to blue
df_umap.loc[df_umap['set']=='train','color'] = 'blue'

#set test color to red
df_umap.loc[df_umap['set']=='test','color'] = 'red'

df_umap.shape

In [None]:

fig = plt.figure(figsize =(15,15)) 
ax = plt.axes(projection ='3d') 

#one scatter call per set, so each set gets its own legend entry
for set_name,grp in df_umap.groupby('set'):
    ax.scatter(xs=grp['dim1'].values,
               ys=grp['dim2'].values,
               zs=grp['dim3'].values,
               c=grp['color'].iloc[0],
               s=3,
               label=set_name)


ax.set_xlabel('dim 1')
ax.set_ylabel('dim 2')
ax.set_zlabel('dim 3')

ax.legend()
plt.title('UMAP reduced embeddings')
plt.show()