# Intro
The goal of this notebook is to explore embeddings generated by the [Hotel-ID starter - similarity- training](https://www.kaggle.com/code/michaln/hotel-id-starter-similarity-training) notebook and to try out different methods for identifying similar images.

## Embeddings
To compare images, we can use a model to generate embeddings as their representation and then calculate the distance or similarity between those embeddings to find the most similar image.

We can take a pretrained model, remove its last classification layer, and add two linear layers. Features from the pretrained CNN are used as input to the embedding layer, and the output of the embedding layer is used as input to the classification layer. The model then outputs both embeddings and a predicted class. We can use the class prediction to calculate the loss and train the model further, and use the embeddings to search for similar images. Because the embeddings must contain enough information to predict the correct class, they should be good representations of the image.

![Embedding model](https://github.com/michal-nahlik/kaggle-hotel-id-2022/raw/master/doc/img/embedding_model.png)

## Data
This notebook uses preprocessed images that were resized and padded to 256x256 pixels.

Used dataset: [Hotel-ID 2022 train images 256x256](https://www.kaggle.com/datasets/michaln/hotelid-2022-train-images-256x256) created by [Hotel-ID - image preprocessing - 256x256](https://www.kaggle.com/code/michaln/hotel-id-image-preprocessing-256x256) notebook.

# Imports

In [None]:
!pip install timm

In [None]:
import numpy as np
import pandas as pd
import random
import os

In [None]:
from PIL import Image as pil_image
from tqdm import tqdm

import matplotlib
import matplotlib.pyplot as plt

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import timm

from sklearn.metrics.pairwise import cosine_similarity

# Global

In [None]:
IMG_SIZE = 256
SEED = 42
N_MATCHES = 5

PROJECT_FOLDER = "../input/hotel-id-to-combat-human-trafficking-2022-fgvc9/"
DATA_FOLDER = "../input/hotelid-2022-train-images-256x256/"
IMAGE_FOLDER = DATA_FOLDER + "images/"
OUTPUT_FOLDER = ""

In [None]:
print(os.listdir(PROJECT_FOLDER))

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

# Dataset and transformations

Coarse dropout with fill_value=(255,0,0) (pure red) is used to simulate occlusions like the ones in the test dataset.
```python
A.CoarseDropout(p=1., max_holes=1, 
                min_height=IMG_SIZE//4, max_height=IMG_SIZE//2,
                min_width=IMG_SIZE//4,  max_width=IMG_SIZE//2, 
                fill_value=(255,0,0))
```
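
To make the effect concrete, here is a minimal NumPy sketch of what a single red rectangular occlusion does to an image. It is not the albumentations implementation; `add_red_occlusion` and its `min_frac`/`max_frac` arguments are hypothetical names used only for illustration, and a black image stands in for a real photo.

```python
import numpy as np

def add_red_occlusion(image, rng, min_frac=0.25, max_frac=0.5):
    """Return a copy of an HxWx3 uint8 image with one red rectangle drawn on it."""
    h, w = image.shape[:2]
    # sample the rectangle size between min_frac and max_frac of the image side
    box_h = int(rng.integers(int(h * min_frac), int(h * max_frac) + 1))
    box_w = int(rng.integers(int(w * min_frac), int(w * max_frac) + 1))
    # sample the top-left corner so the rectangle stays inside the image
    top = int(rng.integers(0, h - box_h + 1))
    left = int(rng.integers(0, w - box_w + 1))
    occluded = image.copy()
    occluded[top:top + box_h, left:left + box_w] = (255, 0, 0)
    return occluded

rng = np.random.default_rng(42)
image = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for a real image
occluded = add_red_occlusion(image, rng)
n_red_pixels = int((occluded[..., 0] == 255).sum())
print("occluded pixels:", n_red_pixels)
```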

In [None]:
import albumentations as A
import albumentations.pytorch as APT
import cv2 

# used for validation dataset - only occlusions
val_transform = A.Compose([
    A.CoarseDropout(p=1., max_holes=1, 
                    min_height=IMG_SIZE//4, max_height=IMG_SIZE//2,
                    min_width=IMG_SIZE//4,  max_width=IMG_SIZE//2, 
                    fill_value=(255,0,0)),# simulating occlusions
    A.ToFloat(),
    APT.transforms.ToTensorV2(),
])

# no augmentations
base_transform = A.Compose([
    A.ToFloat(),
    APT.transforms.ToTensorV2(),
])

In [None]:
class HotelTrainDataset:
    def __init__(self, data, transform=None, data_path="train_images/"):
        self.data = data
        self.data_path = data_path
        self.transform = transform

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        record = self.data.iloc[idx]
        image_path = self.data_path + record["image_id"]
        image = np.array(pil_image.open(image_path)).astype(np.uint8)

        if self.transform:
            transformed = self.transform(image=image)
            image = transformed["image"]
        
        return {
            "image" : image,
            "target" : record['hotel_id_code'],
        }

# Model
The model uses a pretrained CNN without its classification layer. Features from the CNN are used as input to a linear embedding layer, and the embeddings are used for the final classification. The model can output both embeddings and a class prediction, or just embeddings.

Input image -> [CNN] -> features -> [Embedding layer] -> embeddings -> [Classification layer] -> class prediction
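
The data flow above can be illustrated with random stand-ins for each stage. This is only a shape sketch in NumPy, not the model itself: the batch size, the feature size of 1280, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_features, embedding_size, n_classes = 4, 1280, 128, 100

# stand-in for the pooled CNN features of a batch of images
features = rng.normal(size=(batch_size, n_features))

# embedding layer: linear map from features to embeddings
w_embed = rng.normal(size=(n_features, embedding_size))
b_embed = np.zeros(embedding_size)
embeddings = features @ w_embed + b_embed

# classification layer: linear map from embeddings to class scores
w_cls = rng.normal(size=(embedding_size, n_classes))
b_cls = np.zeros(n_classes)
logits = embeddings @ w_cls + b_cls

print(features.shape, embeddings.shape, logits.shape)
```

Only the `embeddings` array is needed for similarity search; the class scores matter during training.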

In [None]:
class EmbeddingModel(nn.Module):
    def __init__(self, n_classes=100, embedding_size=64, backbone_name="efficientnet_b0"):
        super(EmbeddingModel, self).__init__()
        
        self.backbone = timm.create_model(backbone_name, num_classes=n_classes, pretrained=True)
        in_features = self.backbone.get_classifier().in_features
        
        self.backbone.classifier = nn.Identity()
        self.embedding = nn.Linear(in_features, embedding_size)
        self.classifier = nn.Linear(embedding_size, n_classes)

    def embed_and_classify(self, x):
        x = self.forward(x)
        return x, self.classifier(x)

    def forward(self, x):
        x = self.backbone(x)
        x = x.view(x.size(0), -1)
        x = self.embedding(x)
        return x

# Model helper functions

In [None]:
# method to iterate loader and generate embeddings of images
# returns embeddings and image class
def generate_embeddings(loader, model, bar_desc="Generating embeds"):
    targets_all = []
    outputs_all = []
    
    model.eval()
    with torch.no_grad():
        t = tqdm(loader, desc=bar_desc)
        for sample in t:
            # note: relies on the global `args` object defined later in the notebook
            images = sample['image'].to(args.device)
            output = model(images)
            
            targets_all.extend(sample['target'].numpy())
            outputs_all.extend(output.cpu().numpy())

    targets_all = np.array(targets_all).astype(np.float32)
    outputs_all = np.array(outputs_all).astype(np.float32)
            
    return outputs_all, targets_all

In [None]:
def get_train_valid_embeddings(args, data_df):
    model_name = f"embedding-model-{args.backbone_name}-{IMG_SIZE}x{IMG_SIZE}"
    print(model_name)

    seed_everything(seed=SEED)

    # split data into train and validation set
    hotel_image_count = data_df.groupby("hotel_id")["image_id"].count()
    # hotels that have more images than samples for validation
    valid_hotels = hotel_image_count[hotel_image_count > args.val_samples]
    # data that can be split into train and val set
    valid_data = data_df[data_df["hotel_id"].isin(valid_hotels.index)]
    # hotels with val_samples images or fewer stay only in the train set
    valid_df = valid_data.groupby("hotel_id").sample(args.val_samples, random_state=SEED).reset_index(drop=True)
    train_df = data_df[~data_df["image_id"].isin(valid_df["image_id"])].reset_index(drop=True)
    

    valid_dataset = HotelTrainDataset(valid_df, val_transform, data_path=IMAGE_FOLDER)
    valid_loader  = DataLoader(valid_dataset, num_workers=args.num_workers, batch_size=args.batch_size, shuffle=False)
    # base dataset for image similarity search
    base_dataset  = HotelTrainDataset(train_df, base_transform, data_path=IMAGE_FOLDER)
    base_loader   = DataLoader(base_dataset, num_workers=args.num_workers, batch_size=args.batch_size, shuffle=False)

    # use trained model from Hotel-ID starter - similarity- training
    # map_location lets the checkpoint load on CPU-only machines as well
    checkpoint = torch.load(args.checkpoint_path, map_location=args.device)
    model = EmbeddingModel(args.n_classes, args.embedding_size, args.backbone_name)
    model.load_state_dict(checkpoint["model"])
    model = model.to(args.device)
       
        
    train_embeds, _ = generate_embeddings(base_loader, model, "Generate embeddings for base images")
    train_df["embeddings"] = list(train_embeds)
    train_df.to_pickle(f"{OUTPUT_FOLDER}{model_name}_train-image-embeddings.pkl")
    
    valid_embeds, _ = generate_embeddings(valid_loader, model, "Generate embeddings for valid images")
    valid_df["embeddings"] = list(valid_embeds)
    valid_df.to_pickle(f"{OUTPUT_FOLDER}{model_name}_val-image-embeddings.pkl")
    
    return train_embeds, train_df, valid_embeds, valid_df

# Prepare data
We will generate embeddings for the training and validation datasets using the model trained in the training notebook. The validation dataset uses val_transform, so its images contain occlusions. The training dataset does not use any augmentations.

In [None]:
data_df = pd.read_csv(DATA_FOLDER + "train.csv")
# encode hotel ids
data_df["hotel_id_code"] = data_df["hotel_id"].astype('category').cat.codes.values.astype(np.int64)

In [None]:
# save hotel_id encoding for later decoding
hotel_id_code_df = data_df.drop(columns=["image_id"]).drop_duplicates().reset_index(drop=True)
hotel_id_code_df.to_csv(OUTPUT_FOLDER + 'hotel_id_code_mapping.csv', index=False)

In [None]:
class args:
    batch_size = 256
    num_workers = 2
    val_samples = 1
    embedding_size = 128
    backbone_name = "efficientnet_b0"
    checkpoint_path = "../input/hotel-id-starter-similarity-training/checkpoint-embedding-model-efficientnet_b0-256x256.pt"
    n_classes = data_df["hotel_id_code"].nunique()
    device = ('cuda' if torch.cuda.is_available() else 'cpu')
    
train_embeds, train_df, valid_embeds, valid_df = get_train_valid_embeddings(args, data_df)

# Visualization
We can project the embeddings into a lower-dimensional space using methods such as PCA, t-SNE, or UMAP and then plot the results. We will use only a subsample of images from 20 different hotels.
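
As a rough illustration of what the linear projection does, PCA can be written in a few lines of NumPy using the SVD of the centered data. This sketch uses random data in place of the real embeddings, and `pca_project` is a hypothetical helper; the cells in this section rely on scikit-learn instead.

```python
import numpy as np

def pca_project(x, n_components=2):
    """Project the rows of x onto the top principal components."""
    centered = x - x.mean(axis=0)
    # rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(42)
toy_embeds = rng.normal(size=(200, 128))  # stand-in for real embeddings
toy_2d = pca_project(toy_embeds, n_components=2)
print(toy_2d.shape)
```

t-SNE and UMAP are non-linear and have no equally short closed form, which is why they are only used through their libraries here.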

In [None]:
import umap

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [None]:
sample_hotel_ids = train_df["hotel_id"].unique()[:20]
sample_df = train_df[train_df["hotel_id"].isin(sample_hotel_ids)].reset_index(drop=True)
sample_embeds = np.vstack(sample_df["embeddings"].values)
sample_labels = sample_df["hotel_id"].astype(str)

In [None]:
print("Selected hotels:", sample_hotel_ids)
print("Number of samples:", len(sample_df))
print("Embeddings dimensions:", np.shape(sample_embeds))

In [None]:
fig = px.histogram(sample_df, x=sample_labels, histfunc='sum',barmode='group')
fig.update_layout(title={'text' : "Selected samples - Image count per hotel"}, 
                  xaxis_title="Hotel ID",
                  yaxis_title="Image count",                  
                  template="simple_white", height=400)

## PCA - 2 components

In [None]:
pca = PCA(n_components=2)
pca_embeds = pca.fit_transform(sample_embeds)

fig = px.scatter(x=pca_embeds[:,0], y=pca_embeds[:,1], color=sample_labels, custom_data =[sample_labels, sample_df["image_id"].values])
fig.update_traces(hovertemplate="Hotel ID: %{customdata[0]}<br>Image: %{customdata[1]}<extra></extra>")
fig.update_layout(title="Embeddings - 2d projection using PCA", legend=dict(title="Hotel ID"),
                 height=400,)
fig.show()

## TSNE - 2 components

In [None]:
tsne = TSNE(n_components=2, learning_rate='auto', init='random')
tsne_embeds = tsne.fit_transform(sample_embeds)

fig = px.scatter(x=tsne_embeds[:,0], y=tsne_embeds[:,1], 
                 color=sample_labels, custom_data =[sample_labels, sample_df["image_id"].values])
fig.update_traces(hovertemplate="Hotel ID: %{customdata[0]}<br>Image: %{customdata[1]}<extra></extra>")
fig.update_layout(title="Embeddings - 2d projection using TSNE", legend=dict(title="Hotel ID"),
                 height=400,)
fig.show()

## TSNE - 3 components

In [None]:
tsne = TSNE(n_components=3, learning_rate='auto', init='random')
tsne_embeds = tsne.fit_transform(sample_embeds)

fig = px.scatter_3d(x=tsne_embeds[:,0], y=tsne_embeds[:,1], z=tsne_embeds[:,2], 
                    color=sample_labels, custom_data =[sample_labels, sample_df["image_id"].values])
fig.update_traces(hovertemplate="Hotel ID: %{customdata[0]}<br>Image: %{customdata[1]}<extra></extra>")
fig.update_layout(title="Embeddings - 3d projection using TSNE", legend=dict(title="Hotel ID"),
                 height=400,)
fig.show()

## UMAP - 2 components

In [None]:
reducer = umap.UMAP(random_state=SEED)
umap_embeds = reducer.fit_transform(sample_embeds)

fig = px.scatter(x=umap_embeds[:,0], y=umap_embeds[:,1], 
                 color=sample_labels, custom_data =[sample_labels, sample_df["image_id"].values])
fig.update_traces(hovertemplate="Hotel ID: %{customdata[0]}<br>Image: %{customdata[1]}<extra></extra>")
fig.update_layout(title="Embeddings - 2d projection using UMAP", legend=dict(title="Hotel ID"),
                 height=400,)
fig.show()

## UMAP - interactive with image display 
I wanted to display the image on hover but could not find a way to do it in Plotly without Dash, so this section uses Bokeh.

In [None]:
# methods based on: https://www.kaggle.com/code/parulpandey/visualizing-kannada-mnist-with-t-sne/notebook

# Encoding all the images for inclusion in a dataframe.
from io import BytesIO
import base64

def embeddable_image(data):
    image = pil_image.fromarray(data, 'RGB').resize((128,128), pil_image.BICUBIC)
    buffer = BytesIO()
    image.save(buffer, format='png')
    for_encoding = buffer.getvalue()
    return 'data:image/png;base64,' + base64.b64encode(for_encoding).decode()

In [None]:
# loading up bokeh and other tools to generate a suitable interactive plot.

from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
from bokeh.palettes import Category20

output_notebook()

In [None]:
# prepare data
sample_image_data = []
for i in range(len(sample_df)):
    image_path = IMAGE_FOLDER + sample_df.loc[i, "image_id"]
    image = np.array(pil_image.open(image_path)).astype(np.uint8)
    sample_image_data.extend([image])
    
sample_df["image_data"] = list(map(embeddable_image, sample_image_data))
sample_df["x"] = umap_embeds[:, 0]
sample_df["y"] = umap_embeds[:, 1]

sample_df["hotel_id_code"] = sample_df["hotel_id"].astype('category').cat.codes.values.astype(np.int64)+1
sample_df["hotel_id_code"] = sample_df["hotel_id_code"].astype(str)

In [None]:
# Generating the plot itself with a custom hover tooltip 
datasource = ColumnDataSource(sample_df)
color_mapping = CategoricalColorMapper(factors=sample_df["hotel_id_code"].unique(), palette=Category20[20])

plot_figure = figure(
    title='Embeddings - 2d projection using UMAP',
    plot_width=700,
    plot_height=400,
    tools=('pan, wheel_zoom, reset')
)

plot_figure.add_tools(HoverTool(tooltips="""
<div>
    <div>
        <img src='@image_data' style='float: left; margin: 5px 5px 5px 5px'/>
    </div>
    <div>
        <span style='font-size: 16px'>Hotel: @hotel_id</span>
    </div>
    <div>
        <span style='font-size: 14px'>Image: @image_id</span>
    </div>
</div>
"""))

plot_figure.circle('x', 'y',
    source=datasource,
    color={'field': 'hotel_id_code', 'transform': color_mapping},
    line_alpha=0.6,
    fill_alpha=0.6,
    radius=0.05,
    legend_field='hotel_id',
    size=4
)

show(plot_figure)

# Similarity search

## Similarity
To find out whether images are similar, we can calculate the distance or similarity of their embeddings using measures such as [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) or [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
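
For reference, both measures can be computed directly with NumPy. The vectors below are made-up examples and the helper names are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Straight-line distance between a and b (0 = identical)."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 1.0])  # orthogonal to a

print(cosine_sim(a, b), euclidean_dist(a, b))
print(cosine_sim(a, c), euclidean_dist(a, c))
```

Cosine similarity ignores the length of the vectors, so `a` and `b` count as a perfect match, while Euclidean distance still separates them. A higher similarity means a closer match, whereas a lower distance means a closer match, so the sort order differs between the two.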

After training the model, we can compute embeddings for all images with a known class (hotel ID) and calculate their similarity to the test image we want to classify. We can then rank the train images by their similarity to the embedding of the test image and pick the 5 most similar train images that come from different hotels.
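
The ranking step can be sketched on toy data as follows. `top_k_hotels` is a hypothetical helper written for this illustration, and the embeddings and hotel ids are made up; the functions in the next section do the same job with pandas and scikit-learn.

```python
import numpy as np

def top_k_hotels(query, base_embeds, base_hotel_ids, k=5):
    """Return up to k distinct hotel ids, ordered by their best-matching image."""
    # cosine similarity of the query to every base embedding
    base_norm = base_embeds / np.linalg.norm(base_embeds, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    similarity = base_norm @ query_norm
    # walk the images from most to least similar, keeping each hotel once
    matches = []
    for idx in np.argsort(-similarity):
        hotel = base_hotel_ids[idx]
        if hotel not in matches:
            matches.append(hotel)
        if len(matches) == k:
            break
    return matches

base_embeds = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
base_hotel_ids = [10, 10, 20, 30]
query = np.array([1.0, 0.05])
print(top_k_hotels(query, base_embeds, base_hotel_ids, k=2))  # [10, 30]
```

The two most similar images both belong to hotel 10, so the second slot goes to the next hotel in the ranking rather than to a duplicate.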


## Find 5 most similar images from different hotels

In [None]:
# find the k most similar images from different hotels and return their hotel ids
def find_matches_cosine_similarity(query, base_embeds, base_targets, k=N_MATCHES):
    similarity_df = pd.DataFrame(index=np.arange(len(base_targets)), data={"hotel_id_code": base_targets})
    # calculate cosine similarity of query embeds to all base embeds
    similarity_df["similarity"] = cosine_similarity([query], base_embeds)[0]
    # sort by similarity (highest first) and hotel id
    similarity_df = similarity_df.sort_values(by=["similarity", "hotel_id_code"], ascending=False).reset_index(drop=True)
    # return the first k different hotel ids
    return similarity_df["hotel_id_code"].unique()[:k]


def test_similarity(base_embeds, base_df, test_embeds, test_df):
    base_targets = base_df["hotel_id"]
    test_targets = test_df["hotel_id"]
    
    preds = []
    
    for query_embeds in tqdm(test_embeds, desc="Similarity - match finding"):
        tmp = find_matches_cosine_similarity(query_embeds, base_embeds, base_targets)
        preds.extend([tmp])
        
    preds = np.array(preds)
    test_targets_N = np.repeat([test_targets], repeats=N_MATCHES, axis=0).T
    # check if any of top 5 predictions are correct and calculate mean accuracy
    acc_top_5 = (preds == test_targets_N).any(axis=1).mean()
    # calculate prediction accuracy
    acc_top_1 = np.mean(test_targets == preds[:, 0])
    print(f"Cosine similarity accuracy: {acc_top_1:0.4f}, top-{N_MATCHES} accuracy: {acc_top_5:0.4f}")

In [None]:
def find_matches_distance(query, base_embeds, base_targets, distance_fc, k=N_MATCHES):
    distance_df = pd.DataFrame(index=np.arange(len(base_targets)), data={"hotel_id_code": base_targets})
    # calculate distance of query embeds to all base embeds using distance_fc
    distance_df["distance"] = distance_fc([query], base_embeds)[0]
    # sort by distance (lowest first) and hotel id
    distance_df = distance_df.sort_values(by=["distance", "hotel_id_code"], ascending=True).reset_index(drop=True)
    # return the first k different hotel ids
    return distance_df["hotel_id_code"].unique()[:k]
    
    
def test_distance(base_embeds, base_df, test_embeds, test_df, distance_fc):
    base_targets = base_df["hotel_id"]
    test_targets = test_df["hotel_id"]
    
    preds = []
    
    for query_embeds in tqdm(test_embeds, desc="Distance - match finding"):
        tmp = find_matches_distance(query_embeds, base_embeds, base_targets, distance_fc)
        preds.extend([tmp])
        
    preds = np.array(preds)
    test_targets_N = np.repeat([test_targets], repeats=N_MATCHES, axis=0).T
    # check if any of top 5 predictions are correct and calculate mean accuracy
    acc_top_5 = (preds == test_targets_N).any(axis=1).mean()
    # calculate prediction accuracy
    acc_top_1 = np.mean(test_targets == preds[:, 0])
    print(f"Distance accuracy: {acc_top_1:0.4f}, top-{N_MATCHES} accuracy: {acc_top_5:0.4f}")

In [None]:
test_similarity(train_embeds, train_df, valid_embeds, valid_df)

In [None]:
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

print("Euclidean distance")
test_distance(train_embeds, train_df, valid_embeds, valid_df, euclidean_distances)

print("\nCosine distance")
test_distance(train_embeds, train_df, valid_embeds, valid_df, cosine_distances)

## KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
def test_knn(x_train, y_train, x_test, y_test, n_neighbors=5, metric="euclidean"):
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
    knn.fit(x_train, y_train)
    preds = knn.predict_proba(x_test)

    top_1_pred = knn.classes_[np.argmax(preds, axis=1)]
    top_5_pred = knn.classes_[np.argsort(-preds, axis=1)[:, :N_MATCHES]]

    test_targets_N = np.repeat([y_test], repeats=N_MATCHES, axis=0).T
    # check if any of top 5 predictions are correct and calculate mean accuracy
    acc_top_5 = (top_5_pred == test_targets_N).any(axis=1).mean()
    # calculate prediction accuracy
    acc_top_1 = np.mean(y_test == top_1_pred)
    print(f"KNN ({n_neighbors}, {metric}) accuracy: {acc_top_1:0.4f}, top-{N_MATCHES} accuracy: {acc_top_5:0.4f}")

In [None]:
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=1)
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=5)
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=10)
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=25)
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=50)

In [None]:
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=1, metric="cosine")
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=5, metric="cosine")
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=10, metric="cosine")
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=25, metric="cosine")
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=50, metric="cosine")

In [None]:
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=5, metric="minkowski")
test_knn(train_embeds, train_df["hotel_id"], valid_embeds, valid_df["hotel_id"], n_neighbors=5, metric="manhattan")

# Output
You can load the train and validation data with embeddings produced by this notebook and try your own methods to classify the images.
- training data: embedding-model-efficientnet_b0-256x256_train-image-embeddings.pkl
- validation data: embedding-model-efficientnet_b0-256x256_val-image-embeddings.pkl
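
A minimal sketch of reading such a file back, assuming the layout written above (a DataFrame with one embedding array per row in an `embeddings` column). The toy DataFrame and the temporary file name are made up so that the example runs on its own; with the real output you would pass one of the file names listed above to `pd.read_pickle`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# toy stand-in for one of the saved files: one embedding array per row
toy_df = pd.DataFrame({
    "image_id": ["a.jpg", "b.jpg", "c.jpg"],
    "hotel_id": [1, 1, 2],
    "embeddings": list(np.arange(12, dtype=np.float32).reshape(3, 4)),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-image-embeddings.pkl")
    toy_df.to_pickle(path)
    loaded_df = pd.read_pickle(path)

# stack the per-row arrays back into a (n_images, embedding_size) matrix
embeds = np.vstack(loaded_df["embeddings"].values)
labels = loaded_df["hotel_id"].values
print(embeds.shape, labels)
```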