# **TF Similarity**

# Notebook goal

<p>Today we are releasing the first version of TensorFlow Similarity, a Python package designed to make it easy and fast to train similarity models using TensorFlow. This notebook demonstrates how to use TensorFlow Similarity to train a SimilarityModel().</p>

# Introduction

<p>TensorFlow Similarity provides all the necessary components to make similarity training, evaluation, and querying intuitive and easy. In particular, TensorFlow Similarity introduces the SimilarityModel(), a new Keras model that natively supports embedding indexing and querying. This allows you to perform end-to-end training and evaluation quickly and efficiently.</p>

# Methodology

<p><ul><li>This notebook demonstrates how we can use TensorFlow Similarity to classify whales and dolphins.</li><li>This notebook is inspired by the blog post published by the TensorFlow team, which you can read <a href="https://blog.tensorflow.org/2021/09/introducing-tensorflow-similarity.html">here</a>.</li><li>This notebook uses code from the TensorFlow Similarity tutorial on <a href="https://github.com/tensorflow/similarity/blob/master/examples/supervised_hello_world.ipynb">GitHub</a> and from the notebook <a href="https://www.kaggle.com/nicapotato/keras-efficientnet#Preparation-for-modeling">Keras EfficientNet</a> from the Humpback Whale Identification Challenge.</li></ul></p>

# Loading Libraries & Packages

In [None]:
import os

# Suppress TensorFlow INFO messages.
# This must be set before TensorFlow is imported.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

# Ignore Python warnings.
import warnings
warnings.simplefilter("ignore")

import gc
import json
import datetime
from random import randint

import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from tabulate import tabulate

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import tensorflow as tf
from tensorflow.keras import models, layers
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

In [None]:
# install TF similarity if needed
try:
    import tensorflow_similarity as tfsim  # main package
except ModuleNotFoundError:
    !pip install tensorflow_similarity
    import tensorflow_similarity as tfsim

In [None]:
tfsim.utils.tf_cap_memory()

In [None]:
# Clear out any old model state.
gc.collect()
tf.keras.backend.clear_session()

In [None]:
print("TensorFlow:", tf.__version__)
print("TensorFlow Similarity:", tfsim.__version__)

# Dataset

In [None]:
# File Parameters
WORK_DIR = "../input/happy-whale-and-dolphin"
label_col = "individual_id"
img_col = "image"
train_folder = "train_images"
test_folder = "test_images"

os.listdir(WORK_DIR)

In [None]:
print('Train images: %d' %len(os.listdir(
    os.path.join(WORK_DIR, train_folder))))

In [None]:
train_labels = pd.read_csv(os.path.join(WORK_DIR, "train.csv"))
label_names = train_labels[label_col].value_counts().index
label_map = {name:i for (i,name) in enumerate(label_names)}
inv_label_map = {v: k for k, v in label_map.items()}

train_labels['label_name'] = train_labels[label_col].copy()
train_labels[label_col] = train_labels[label_col].map(label_map)
display(train_labels.head())

In [None]:
train_labels.head()
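
The label encoding above maps each individual ID to an integer, with the most frequent ID getting 0. A minimal sketch of the same idea on a toy list, using only the standard library (the IDs below are made up for illustration and are not from the dataset):

```python
from collections import Counter

# Hypothetical individual IDs, standing in for the `individual_id` column.
ids = ["w1", "d7", "w1", "w3", "w1", "d7"]

# Order names by frequency (most common first), then number them.
label_names = [name for name, _ in Counter(ids).most_common()]
label_map = {name: i for i, name in enumerate(label_names)}
inv_label_map = {i: name for name, i in label_map.items()}

# Encode the IDs as integers and decode them back to the original names.
encoded = [label_map[name] for name in ids]
decoded = [inv_label_map[i] for i in encoded]
print(encoded)
```

The inverse map is what later turns integer labels back into the ID strings needed for a submission file.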

# Notebook Configuration

In [None]:
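# Main parameters
BATCH_SIZE = 32
# Keras expects whole numbers of steps, so round up to integers.
# Note: the generators below use batch_size = 8, so these step counts
# cover roughly a quarter of each split per epoch.
STEPS_PER_EPOCH = int(np.ceil(len(train_labels) * 0.8 / BATCH_SIZE))
VALIDATION_STEPS = int(np.ceil(len(train_labels) * 0.2 / BATCH_SIZE))
EPOCHS = 4  # overridden in the Training section
TARGET_SIZE = 512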
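
The step counts above come from the dataset size, the 80/20 split, and the batch size. As a rough sketch of that arithmetic, assuming a hypothetical dataset of 1,000 images (the helper name `steps_per_pass` is illustrative and not part of the notebook):

```python
import math

def steps_per_pass(num_items: int, batch_size: int) -> int:
    """Number of batches needed to see every item once (last batch may be partial)."""
    return math.ceil(num_items / batch_size)

# Hypothetical sizes for illustration only.
num_samples = 1000
batch_size = 32
num_val = num_samples // 5         # 20% validation split -> 200 items
num_train = num_samples - num_val  # remaining 80% -> 800 items

train_steps = steps_per_pass(num_train, batch_size)
val_steps = steps_per_pass(num_val, batch_size)
print(train_steps, val_steps)
```

Rounding up means the final, partially filled batch is still counted. The batch size used here should match the one passed to the data generators if each epoch is meant to cover the whole split.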

# Data Generation

In [None]:
# Training data
train_datagen = ImageDataGenerator(validation_split = 0.2,
                                     preprocessing_function = None,
                                     rotation_range = 45,
                                     zoom_range = 0.2,
                                     horizontal_flip = True,
                                     vertical_flip = True,
                                     fill_mode = 'nearest',
                                     shear_range = 0.1,
                                     height_shift_range = 0.1,
                                     width_shift_range = 0.1)


train_generator = train_datagen.flow_from_dataframe(train_labels,
                         directory = os.path.join(WORK_DIR, train_folder),
                         subset = "training",
                         x_col = img_col,
                         y_col = label_col,
                         color_mode='grayscale',
                         target_size = (TARGET_SIZE, TARGET_SIZE),
                         batch_size = 8,
                         class_mode = "raw")

# Validation Data
validation_datagen = ImageDataGenerator(validation_split = 0.2)


validation_generator = validation_datagen.flow_from_dataframe(train_labels,
                         directory = os.path.join(WORK_DIR, train_folder),
                         subset = "validation",
                         color_mode='grayscale',
                         x_col = img_col,
                         y_col = label_col,
                         target_size = (TARGET_SIZE, TARGET_SIZE),
                         batch_size = 8,
                         class_mode = "raw")

# Model definition

<p><b>SimilarityModel()</b> models extend <b>tensorflow.keras.Model</b> with additional features and functionality that allow you to index and search for similar-looking examples.

As visible in the model definition below, similarity models output a 64-dimensional float embedding using the <b>MetricEmbedding()</b> layer. This layer is a Dense layer with L2 normalization. Thanks to the loss, the model learns to minimize the distance between similar examples and maximize the distance between dissimilar examples. As a result, the distance between examples in the embedding space is meaningful: the smaller the distance, the more similar the examples are.

Being able to use a distance as a meaningful proxy for how similar two examples are is what enables the fast ANN (approximate nearest neighbor) search. Using a sub-linear ANN search instead of a standard quadratic NN search is what allows deep similarity search to scale to millions of items.</p>
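
To make the idea of "distance as similarity" concrete, here is a minimal pure-Python sketch of cosine distance between L2-normalized vectors, followed by an exact (brute-force) nearest-neighbor search. The 3-dimensional vectors and their names are made up for illustration; they stand in for the model's 64-dimensional embeddings, and this is not how TensorFlow Similarity implements its index.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as an L2-normalized embedding would be."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors: 1 - dot(a, b)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings for three indexed individuals.
index = {
    "whale_a": l2_normalize([1.0, 0.1, 0.0]),
    "whale_b": l2_normalize([0.0, 1.0, 0.2]),
    "dolphin_c": l2_normalize([0.0, 0.1, 1.0]),
}
query = l2_normalize([0.9, 0.2, 0.1])

# Exact search compares the query against every indexed item.
ranked = sorted(index, key=lambda name: cosine_distance(query, index[name]))
nearest = ranked[0]
print(nearest)
```

An exact search like this touches every indexed item for each query, which is the cost that approximate nearest neighbor methods are designed to avoid on large collections.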

In [None]:
def get_model():
    inputs = tf.keras.layers.Input(shape=(TARGET_SIZE, TARGET_SIZE, 1))
    x = tf.keras.layers.experimental.preprocessing.Rescaling(1 / 255)(inputs)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
    x = tf.keras.layers.MaxPool2D()(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    # smaller embeddings will have faster lookup times while a larger embedding will improve the accuracy up to a point.
    outputs = tfsim.layers.MetricEmbedding(64)(x)
    return tfsim.models.SimilarityModel(inputs, outputs)


model = get_model()
model.summary()

# Loss definition

<p>Overall, what makes metric losses different from traditional losses is that:

**They expect different inputs**. Instead of having the prediction equal the true values, they expect embeddings as **y_preds** and the id (as an int32) of the class as **y_true**.

**They require a distance**. You need to specify which distance function to use to compute the distance between embeddings. Cosine is usually a great starting point and is the default.

In this example we are using the **MultiSimilarityLoss()**. This loss takes a weighted combination of all valid positive and negative pairs, making it one of the best losses that you can use for similarity training.</p>
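
The "positive and negative pairs" mentioned above are determined purely by the integer class IDs in a batch. The sketch below shows how such pairs can be enumerated. It only illustrates the pairing; it is not the library's implementation of MultiSimilarityLoss, and the helper name `pair_indices` and the batch labels are hypothetical.

```python
def pair_indices(labels):
    """Split all ordered pairs (i, j), i != j, into positives and negatives by class id."""
    positives, negatives = [], []
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if i == j:
                continue  # an example is never paired with itself
            if label_i == label_j:
                positives.append((i, j))
            else:
                negatives.append((i, j))
    return positives, negatives

# Hypothetical batch of integer class ids (y_true); the embeddings would be y_pred.
batch_labels = [0, 0, 1, 2, 1]
positives, negatives = pair_indices(batch_labels)
print(len(positives), len(negatives))
```

Class 2 appears only once in this batch, so it contributes no positive pairs. This is why batches for metric learning usually need several examples per class.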

In [None]:
distance = "cosine"  # @param ["cosine", "L2", "L1"]{allow-input: false}
loss = tfsim.losses.MultiSimilarityLoss(distance=distance)

# Compilation

<p>TensorFlow Similarity uses an extended <b>compile()</b> method that allows you to optionally specify distance_metrics (metrics that are computed over the distance between the embeddings), and the distance to use for the indexer.

By default, the **compile()** method tries to infer what type of distance you are using by looking at the first loss specified. If you use multiple losses, and the distance loss is not the first one, then you need to specify the distance function via the distance= parameter of the compile function.</p>

In [None]:
LR = 0.001  # @param {type:"number"}
model.compile(optimizer=tf.keras.optimizers.Adam(LR), loss=loss)

# Training

<p>Similarity models are trained like normal models.</p>

In [None]:
EPOCHS = 10  # @param {type:"integer"}
history = model.fit(train_generator,
    steps_per_epoch = STEPS_PER_EPOCH,
    epochs = EPOCHS,
    validation_data = validation_generator,
    validation_steps = VALIDATION_STEPS)

# Plotting

In [None]:
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.legend(["loss", "val_loss"])
plt.title(f"Loss: {loss.name} - LR: {LR}")
plt.show()

# Prediction

In [None]:
ss = pd.read_csv(os.path.join(WORK_DIR, "sample_submission.csv"))
ss

In [None]:
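# The model outputs 64-d embeddings, not class scores, so predictions come from
# nearest-neighbor lookups against an index of labeled reference images.
# Assumption: the un-augmented validation split is used as the reference set;
# indexing more labeled images would cover more individuals.
model.reset_index()
for i in range(len(validation_generator)):
    x_batch, y_batch = validation_generator[i]
    model.index(x_batch, y_batch, verbose=0)

preds = []
top_n = 5
for image_id in ss[img_col]:
    image = Image.open(os.path.join(WORK_DIR, test_folder, image_id)).convert('L')
    image = image.resize((TARGET_SIZE, TARGET_SIZE))
    image = np.expand_dims(np.asarray(image, dtype="float32"), axis=-1)  # (H, W, 1)
    neighbors = model.single_lookup(image, k=top_n)  # closest indexed examples first
    p = " ".join(inv_label_map[int(n.label)] for n in neighbors)
    preds.append(p)
ss[label_col] = preds
ss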

In [None]:
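# Drop the placeholder predictions that came with the sample submission.
ss = ss.drop(columns=['predictions'])

In [None]:
# Rename the new column to the name the submission format expects.
ss = ss.rename(columns={label_col: 'predictions'})

# Submission

In [None]:
sub = pd.read_csv(os.path.join(WORK_DIR, "sample_submission.csv"))

In [None]:
sub["predictions"] = ss["predictions"]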

In [None]:
sub

In [None]:
sub.to_csv('submission_whale_and_dolphin.csv', index = False)
print(ss.shape)

# Reference

As mentioned at the beginning of the notebook, this notebook is inspired by the <a href="https://blog.tensorflow.org/2021/09/introducing-tensorflow-similarity.html">TF Similarity</a> example; do check it out.

<h2> If you found it interesting & helpful, then <b>Upvote</b> the notebook!!</h2>