# Visualize Embedding using W&B Embedding Projector

W&B recently launched its own embedding projector, which makes it easy to visualize an embedding space in 2D using t-SNE, UMAP, or PCA.

Refer to the [documentation](https://docs.wandb.ai/ref/app/features/panels/weave/embedding-projector) for more information.

# Imports and Setup

In [None]:
import os
import gc
import numpy as np
import pandas as pd
from tqdm import tqdm
from functools import partial
from argparse import Namespace
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import tensorflow as tf

In [None]:
import wandb
wandb.login()

In [None]:
IMG_DIR = '../input/jpeg-happywhale-128x128/train_images-128-128/train_images-128-128/'
df = pd.read_csv('../input/happy-whale-and-dolphin/train.csv')

def source_path(row):
    return f'{IMG_DIR}/{row.image}'

df['img_path'] = df.apply(lambda row: source_path(row), axis=1)

# Fix wrong label issue
df = df.replace({'bottlenose_dolpin': 'bottlenose_dolphin',
                 'kiler_whale': 'killer_whale'})

df.head()

In [None]:
labels_list = list(df.species.unique())
label2ids = {label: idx for idx, label in enumerate(labels_list)}
id2labels = {val:key for key, val in label2ids.items()}

# Add a column of label ids
df['target'] = df['species'].map(label2ids)

# Config

In [None]:
args = Namespace(
    labels = label2ids,
    num_labels = len(label2ids),
    image_height = 128,
    image_width = 128,
    resize=False,
    batch_size = 128,
)

# Dataloader

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

@tf.function
def decode_image(img, resize=True):
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)
    # Normalize image
    img = tf.image.convert_image_dtype(img, dtype=tf.float32)
    # resize the image to the desired size
    if resize:
        img = tf.image.resize(img, [args.image_height, args.image_width], 
                              method='bicubic', preserve_aspect_ratio=False)
        img = tf.clip_by_value(img, 0.0, 1.0)

    return img

@tf.function
def parse_data(df_dict, resize=True):
    # Parse Image
    image = tf.io.read_file(df_dict['img_path'])
    image = decode_image(image, resize)
    # Parse Target
    label = df_dict['target']
    label = tf.one_hot(indices=label, depth=args.num_labels)
    
    return image, label

In [None]:
def get_dataloader(df):
    dataloader = tf.data.Dataset.from_tensor_slices(dict(df))

    # Training Dataset
    dataloader = (
        dataloader
        .map(partial(parse_data, resize=args.resize), num_parallel_calls=AUTOTUNE)
        .batch(args.batch_size)
        .prefetch(AUTOTUNE)
    )
    
    return dataloader

In [None]:
def get_label_name(one_hot_label):
    label = np.argmax(one_hot_label, axis=0)
    return id2labels[label]

def show_batch(image_batch, label_batch):
    plt.figure(figsize=(20, 20))
    for n in range(25):
        ax = plt.subplot(5, 5, n + 1)
        plt.imshow(image_batch[n])
        plt.title(get_label_name(label_batch[n].numpy()))
        plt.axis('off')

# Sanity Check
dataloader = get_dataloader(df)

sample_imgs, sample_labels = next(iter(dataloader))
    
show_batch(sample_imgs, sample_labels)
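
As a quick check on the label encoding used above, the sketch below mirrors `tf.one_hot` followed by `np.argmax` in plain NumPy. The three species names and the `one_hot` helper are illustrative stand-ins and are not part of the notebook.

```python
import numpy as np

# Illustrative stand-ins for the notebook's label maps.
labels_list = ["beluga", "killer_whale", "bottlenose_dolphin"]
label2ids = {label: idx for idx, label in enumerate(labels_list)}
id2labels = {idx: label for label, idx in label2ids.items()}


def one_hot(index, depth):
    """Return a float32 vector with a single 1.0 at position `index`."""
    vec = np.zeros(depth, dtype=np.float32)
    vec[index] = 1.0
    return vec


def get_label_name(one_hot_label):
    # argmax recovers the integer id that was one-hot encoded.
    return id2labels[int(np.argmax(one_hot_label, axis=0))]


encoded = one_hot(label2ids["killer_whale"], depth=len(label2ids))
print(encoded)                  # [0. 1. 0.]
print(get_label_name(encoded))  # killer_whale
```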

# 1. Log the Images as W&B Table [Optional]

Note that running the cell below might take some time because it uploads the images to W&B. Later on we will reference these uploaded images rather than upload them again, so keep reading. :D

If 128x128 images are good enough for your analysis, you can skip the cell below and go to section 1.1 to download the table I have already logged.

Here's the link to the logged table: https://wandb.ai/ayut/happywhale/artifacts/run_table/run-34086otd-data_table/d529d0d65d18ff406099/files/data_table.table.json

In [None]:
LOG_FULL_TABLE = False

In [None]:
if LOG_FULL_TABLE:
    # Initialize a W&B run
    run = wandb.init(project='happywhale')

    # Initialize an empty W&B Table
    data_table = wandb.Table(columns=['individual_id', 'image'])

    # for unique_id, tmp_df in tqdm(df.groupby('individual_id')):
    for i in tqdm(range(len(df))):
        row = df.loc[i]
        # Add data to the table row-wise
        data_table.add_data(row.individual_id,
                            wandb.Image(f'{IMG_DIR}/{row.image}'))

    # Log the table
    wandb.log({'data_table': data_table})

    # Finish the run
    wandb.finish()

# 1.1 Grab the Table and Get Image Index [Optional]



In [None]:
# Initialize a W&B run
run = wandb.init(project='happywhale')

# Use the already logged dataset
data_artifact = run.use_artifact('ayut/happywhale/run-34086otd-data_table:latest')

# Get the data_table to access the data
data_table = data_artifact.get("data_table")

wandb.finish()

# 1.2 Sample the data

We will not log embeddings for the entire dataset, only for a stratified sample (33% of the images, stratified by species).

In [None]:
_, sampled_df = train_test_split(df, test_size=0.33, random_state=42, stratify=df.species.values)
sampled_ids = sampled_df.index

print(sampled_df.shape)

In [None]:
sampledloader = get_dataloader(sampled_df)
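
To illustrate what "stratified" means here, the sketch below samples about 33% of each class separately so that the class shares in the sample roughly match those of the full set. It is a NumPy-only illustration of the idea, not how `train_test_split` is implemented, and the species counts and the `stratified_sample` helper are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy, imbalanced label column (hypothetical counts, not the real dataset).
species = np.array(["beluga"] * 60 + ["killer_whale"] * 30 + ["blue_whale"] * 10)


def stratified_sample(labels, fraction, rng):
    """Pick roughly `fraction` of the indices from every class separately."""
    picked = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        n_pick = max(1, round(fraction * len(cls_idx)))
        picked.append(rng.choice(cls_idx, size=n_pick, replace=False))
    return np.sort(np.concatenate(picked))


sampled_idx = stratified_sample(species, fraction=0.33, rng=rng)

# Compare each class's share in the full set and in the sample.
for cls in np.unique(species):
    full_share = np.mean(species == cls)
    sample_share = np.mean(species[sampled_idx] == cls)
    print(f"{cls:>12}: full={full_share:.2f} sample={sample_share:.2f}")
```

A plain random sample could under-represent or miss rare species entirely, which would leave gaps in the projection.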

# 2. Pretrained Weights

It would be interesting to see the 2D projection of the embeddings produced by pretrained weights. The model was trained on ImageNet, a dataset of natural images, and the images in this competition share some domain similarity with it.

In [None]:
EMBEDDING_DIM = 1280

In [None]:
def get_pretrained_model():
    base_model = tf.keras.applications.EfficientNetB0(
        input_shape=(args.image_height, args.image_width, 3),
        include_top=False, 
        weights='imagenet'
    )
    
    base_model.trainable = False
    feature = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)

    return tf.keras.models.Model(base_model.input, feature)

tf.keras.backend.clear_session()
pretrained_model = get_pretrained_model()
# pretrained_model.summary()
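
The NumPy sketch below shows what the global average pooling step does to the backbone output and why `EMBEDDING_DIM` is 1280. It assumes the backbone reduces a 128x128 input by a factor of 32 to a 4x4 grid with 1280 channels; the random array is only a stand-in for real feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the backbone output on a batch of 8 images.
# Assumption: a 128x128 input downsampled 32x gives a 4x4 grid with 1280 channels.
feature_maps = rng.normal(size=(8, 4, 4, 1280)).astype(np.float32)

# Global average pooling: average over the two spatial axes (height, width).
embeddings = feature_maps.mean(axis=(1, 2))

print(feature_maps.shape)  # (8, 4, 4, 1280)
print(embeddings.shape)    # (8, 1280)
```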

In [None]:
# Initialize a W&B run
run = wandb.init(project='happywhale')

# Initialize an empty W&B Table
embedding_cols = [f'e_{i}' for i in range(EMBEDDING_DIM)]
pretrained_embedding_table = wandb.Table(columns=['image', 'truth']+embedding_cols)

# Get embedding
print('Getting embedding...')
embedding = pretrained_model.predict(sampledloader)

# Get the row number associated with the logged table
for i, table_row_id in enumerate(tqdm(sampled_ids)):
    # Add data to the table row-wise
    pretrained_embedding_table.add_data(
        wandb.Image(data_table.data[table_row_id][1]), # Notice here: this reuses the image already logged in data_table
        sampled_df.loc[table_row_id].species,
        *embedding[i]
    )

# Log the table
wandb.log({'embedding_table': pretrained_embedding_table})

# Finish the run
wandb.finish()
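
The row layout used above can be checked without W&B. The sketch below builds the same column list and one row in plain Python; the image placeholder, label, and zero vector are hypothetical stand-ins.

```python
import numpy as np

EMBEDDING_DIM = 1280
columns = ["image", "truth"] + [f"e_{i}" for i in range(EMBEDDING_DIM)]

# Hypothetical stand-ins: an image placeholder, a label, and one embedding vector.
image_ref = "image_0.jpg"
truth = "beluga"
embedding = np.zeros(EMBEDDING_DIM, dtype=np.float32)

# `*embedding` unpacks the vector so each value fills one e_i column.
row = [image_ref, truth, *embedding]

print(len(columns))  # 1282
print(len(row))      # 1282
print(columns[:4])   # ['image', 'truth', 'e_0', 'e_1']
```

Each embedding dimension gets its own column, so a row must have exactly `2 + EMBEDDING_DIM` values.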

# 3. FineTuned Weights (Model)

Next, I trained EfficientNetB0 on this competition's dataset; the finetuned weights are attached as a Kaggle dataset. The model was trained for 30 epochs with a 5-fold stratified split of the dataset. No fancy augmentation or training techniques were used. `ReduceLROnPlateau` was used as the learning rate scheduler during training.

You can find the W&B run page for the training metrics here: https://wandb.ai/ayut/happywhale/groups/effnetb0/workspace

![img](https://i.imgur.com/RCqv3Zz.png)



In [None]:
!tar -xvf ../input/happywhale-supervised/model.tar
!ls

In [None]:
def get_finetuned_model(model_path):
    finetuned_model = tf.keras.models.load_model(model_path)
    feature_extractor = tf.keras.models.Model(
                            finetuned_model.input,
                            finetuned_model.get_layer('global_average_pooling2d').output
                        )
    return feature_extractor

In [None]:
tf.keras.backend.clear_session()
finetuned_model = get_finetuned_model('./model_0.h5')
finetuned_model.summary()

In [None]:
# Initialize a W&B run
run = wandb.init(project='happywhale')

# Initialize an empty W&B Table
embedding_cols = [f'e_{i}' for i in range(EMBEDDING_DIM)]
finetuned_embedding_table = wandb.Table(columns=['image', 'truth']+embedding_cols)

# Get embedding
print('Getting embedding...')
embeddings = []
for i in tqdm(range(5)):
    finetuned_model = get_finetuned_model(f'./model_{i}.h5')
    embedding = finetuned_model.predict(sampledloader)
    embeddings.append(embedding)
    
embeddings = np.mean(embeddings, axis=0)

# Get the row number associated with the logged table
for i, table_row_id in enumerate(tqdm(sampled_ids)):
    # Add data to the table row-wise
    finetuned_embedding_table.add_data(
        wandb.Image(data_table.data[table_row_id][1]), # Notice here: this reuses the image already logged in data_table
        sampled_df.loc[table_row_id].species,
        *embeddings[i]
    )

# Log the table
wandb.log({'embedding_table': finetuned_embedding_table})

# Finish the run
wandb.finish()
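
The fold averaging in the cell above relies on `np.mean` with `axis=0` over a list of equally shaped arrays. The sketch below shows the shapes involved, using random arrays as stand-ins for the per-fold `predict` outputs and a small, arbitrary `NUM_SAMPLES`.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_FOLDS, NUM_SAMPLES, EMBEDDING_DIM = 5, 4, 1280

# Stand-ins for the per-fold outputs of `predict`; each is (NUM_SAMPLES, EMBEDDING_DIM).
fold_embeddings = [
    rng.normal(size=(NUM_SAMPLES, EMBEDDING_DIM)).astype(np.float32)
    for _ in range(NUM_FOLDS)
]

# np.mean stacks the list to (NUM_FOLDS, NUM_SAMPLES, EMBEDDING_DIM) and
# averages over axis 0, the fold axis.
mean_embeddings = np.mean(fold_embeddings, axis=0)

print(mean_embeddings.shape)  # (4, 1280)
```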

# 4. Embeddings

Now that the embeddings are logged to W&B, we can easily visualize their 2D projection using PCA, t-SNE, or UMAP with just a few clicks.

Here's the [documentation](https://docs.wandb.ai/ref/app/features/panels/weave/embedding-projector) page. 

# 4.0 How to Easily Create Embeddings

Once you have logged the embeddings as shown above, open the W&B run page and follow these steps:
* Click on "Merge Tables: 2D Projection: Plot"
* It will automatically create the projection (using one of PCA, t-SNE, or UMAP).
* Click on the "gear" icon to open the controls. 
* Select the dimensionality reduction algorithm. W&B will automatically compute the 2D projection. 
* You can also play with the hyperparameters of the algorithms. 
* If you hover your mouse over the dots, you can see the logged images as well, which is a very handy way to visually validate the projection.

Check out the video below to follow along. 

![img](https://imgur.com/VmhtPkd.mp4)


# 4.1 PCA

Pretrained:

![img](https://i.imgur.com/81ixxTW.png)

Finetuned:

![img](https://i.imgur.com/OIbadx9.png)
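
For intuition about what a 2D PCA projection computes, here is a minimal NumPy sketch based on the SVD of the centered data. It illustrates textbook PCA and is not a description of how the W&B projector is implemented; the `pca_2d` helper and the random data are assumptions made for the example.

```python
import numpy as np


def pca_2d(embeddings):
    """Project rows of `embeddings` onto their top two principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of `vt` are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
# Toy data: 200 points in 1280 dimensions (random, for shape checking only).
toy_embeddings = rng.normal(size=(200, 1280))

projection = pca_2d(toy_embeddings)
print(projection.shape)  # (200, 2)
```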

# 4.2 TSNE

Pretrained:

![img](https://i.imgur.com/nyKynT0.png) 

Finetuned: 

![img](https://i.imgur.com/j21gN2m.png)

# 4.3 UMAP

Pretrained:

![img](https://i.imgur.com/959vjPK.png)

Finetuned:

![img](https://i.imgur.com/hJV0DtV.png)

Changing the `Neighbors` parameter to 5 makes UMAP show better clusters.

![img](https://i.imgur.com/Hl301zy.png)