This notebook shows the effect of **INB (Iterative Neighborhood Blending)**, a technique from the 1st place solution to the *Shopee - Price Match Guarantee* competition.

We will use the validation set embeddings from our model as the base embeddings and compare two visualizations:
1. TSNE visualization of the embedding **before** applying INB
2. TSNE visualization of the embedding **after** applying INB

Note that the validation F1 score increased from 0.9060 to 0.9256 when we applied INB.

You can find an explanation of the INB technique and the overall solution [here](https://www.kaggle.com/c/shopee-product-matching/discussion/238136).
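
This notebook only loads precomputed embeddings, so the blending step itself is not shown here. As a rough illustration of the general idea, here is a minimal NumPy sketch. It assumes that each embedding is replaced by a similarity-weighted sum of itself and its neighbors above a cosine-similarity threshold, and is then re-normalized. The function names, the `threshold` value, and the toy data are hypothetical; the linked write-up remains the authoritative description.

```python
import numpy as np

def l2_normalize(x):
    # Scale every row to unit length so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def neighborhood_blend(emb, threshold=0.5, n_iter=1):
    # Hypothetical helper: blend each embedding with its similar neighbors
    emb = l2_normalize(np.asarray(emb, dtype=np.float64))
    for _ in range(n_iter):
        sim = emb @ emb.T
        # Keep only neighbors above the threshold (each item keeps itself, sim = 1)
        weights = np.where(sim >= threshold, sim, 0.0)
        emb = l2_normalize(weights @ emb)
    return emb

# Toy data: two groups of three unit vectors, 10 degrees apart within a group
angles = np.deg2rad([0, 10, 20, 90, 100, 110])
toy = np.stack([np.cos(angles), np.sin(angles)], axis=1)
blended = neighborhood_blend(toy, threshold=0.5)

before = float(toy[0] @ toy[2])
after = float(blended[0] @ blended[2])
print(f'similarity within the first group: {before:.3f} -> {after:.3f}')
```

In this toy case the items of each group are pulled toward each other while the two groups stay apart, which is the kind of effect the visualizations in this notebook are meant to show.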

In [None]:
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import cuml
import pandas as pd
import numpy as np

In [None]:
def plot(emb, title, lg, ylim=None, xlim=None, target_lg=None):
    x, y = emb[:, 0], emb[:, 1]
    sns.scatterplot(x=x, y=y, hue=lg, legend=False)
    if xlim is not None: plt.xlim(xlim[0], xlim[1])
    if ylim is not None: plt.ylim(ylim[0], ylim[1])
    for i, txt in enumerate(lg):
        if target_lg is not None:
            if txt != target_lg:
                continue
        plt.annotate(txt, (x[i], y[i]))
    plt.title(title)
    
    
def plot_cropped(emb, cur_lg, lg, title, margin=5):
    # Bounding box of the target label group, padded by `margin`
    cur = emb[np.where(lg==cur_lg)[0]]
    min_x, max_x = cur[:, 0].min(), cur[:, 0].max()
    min_y, max_y = cur[:, 1].min(), cur[:, 1].max()
    xlim = (min_x-margin, max_x+margin)
    ylim = (min_y-margin, max_y+margin)
    # Keep only the points that fall inside the padded box
    index = np.where((emb[:, 0]>xlim[0])&(emb[:, 0]<xlim[1])&(emb[:, 1]>ylim[0])&(emb[:, 1]<ylim[1]))[0]
    plot(emb[index], title, lg[index], ylim=ylim, xlim=xlim, target_lg=cur_lg)

In [None]:
orig_emb = joblib.load('../input/shopee-visualization/orig_emb.jl')
inb_emb = joblib.load('../input/shopee-visualization/inb_emb.jl')
lg = joblib.load('../input/shopee-visualization/label_group.jl')
lg = pd.factorize(lg)[0]
lg_diff = pd.read_pickle('../input/shopee-visualization/lg_diff.pkl')

`lg_diff` contains, for each label group, the gap between these two values:
* mean F1 score of the label group before applying INB
* mean F1 score of the label group after applying INB
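
The notebook loads `lg_diff` precomputed, so its construction is not shown. The sketch below illustrates how such a series could be built on toy data. It assumes the gap is "after minus before", that the series is sorted with the most improved groups first, and that the per-item score is the F1 between the predicted and true match sets. All names and predictions here (`row_f1`, `pred_before`, `pred_after`, `toy_lg_diff`) are hypothetical.

```python
import numpy as np
import pandas as pd

def row_f1(pred, true):
    # F1 between the predicted and the true set of matching items
    pred, true = set(pred), set(true)
    return 2 * len(pred & true) / (len(pred) + len(true))

# Toy data: five items in two label groups (made-up predictions)
labels = np.array([0, 0, 0, 1, 1])
true_matches = [np.flatnonzero(labels == g) for g in labels]
pred_before = [[0], [1, 3], [2], [3, 1], [4]]
pred_after = [[0, 1, 2], [0, 1, 2], [0, 1, 2], [3, 4], [4, 3]]

f1_before = pd.Series([row_f1(p, t) for p, t in zip(pred_before, true_matches)])
f1_after = pd.Series([row_f1(p, t) for p, t in zip(pred_after, true_matches)])

# Per-group mean F1 gap (assumed: after minus before), most improved first
toy_lg_diff = (f1_after.groupby(labels).mean()
               - f1_before.groupby(labels).mean()).sort_values(ascending=False)

# Weighting each group's gap by its size recovers the overall F1 improvement
sizes = pd.Series(labels).value_counts()
weighted = (toy_lg_diff * sizes).sum() / len(labels)
overall = f1_after.mean() - f1_before.mean()
print(toy_lg_diff)
print(weighted, overall)
```

The last two values agree, which is why a size-weighted average of the per-group gaps can be read as the improvement over the whole validation set.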

In [None]:
lg_diff

In [None]:
print('Improvement in validation set f1 score after applying INB')
# Weight each group's mean F1 gap by its group size, then average over all items
print((lg_diff*lg_diff.index.map(pd.Series(lg).value_counts())).sum()/len(lg))

In [None]:
tsne = cuml.TSNE(random_state=0)
# Each call refits TSNE, so absolute coordinates are not comparable between the two results
orig_emb_tsne = tsne.fit_transform(orig_emb)
inb_emb_tsne = tsne.fit_transform(inb_emb)

# TSNE Visualization of the Whole Validation Set Embeddings

You can open the output images in another tab, or download them, and zoom in to see the label annotations.

### 1. Embeddings Before Applying INB

In [None]:
plt.figure(figsize=(100, 100))
plot(orig_emb_tsne, 'TSNE VISUALIZATION OF EMBEDDINGS BEFORE INB', lg)
plt.show()

### 2. Embeddings After Applying INB

In [None]:
plt.figure(figsize=(100, 100))
plot(inb_emb_tsne, 'TSNE VISUALIZATION OF EMBEDDINGS AFTER INB', lg)
plt.show()

We can see that the clusters became clearer after we applied INB.

# TSNE Visualizations of the Embeddings from Most Improved Label Groups

From the 100 most improved label groups, we'll pick those with more than 3 items. Then we'll visualize how the positions of each group's items changed after applying INB, zooming into the relevant areas of the visualizations above to show the effect more clearly.

In [None]:
for cur_lg in lg_diff.index[:100]:    
    if (lg==cur_lg).sum() < 4:
        continue 
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plot_cropped(orig_emb_tsne, cur_lg, lg, f'{cur_lg} (BEFORE INB)')
    plt.subplot(1, 2, 2)
    plot_cropped(inb_emb_tsne, cur_lg, lg, f'{cur_lg} (AFTER INB)')
    plt.show()

We can see that the items of the same label group tend to cluster better after we apply INB.
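
The visual impression can also be checked numerically. As a hedged sketch, not part of the original analysis, the hypothetical helper below measures how often an item's nearest neighbor in a 2-D layout belongs to the same label group. This measure does not depend on the scale of the layout. The toy points are made up for illustration.

```python
import numpy as np

def nn_label_agreement(points, labels):
    # Fraction of points whose nearest other point shares their label
    points = np.asarray(points, dtype=np.float64)
    labels = np.asarray(labels)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point must not pick itself
    nearest = dist.argmin(axis=1)
    return float((labels[nearest] == labels).mean())

toy_labels = np.array([0, 0, 1, 1])
# Layout where the two groups are interleaved
scattered = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
# Layout where each group sits together
clustered = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 0.0], [4.5, 0.0]])

print(nn_label_agreement(scattered, toy_labels))
print(nn_label_agreement(clustered, toy_labels))
```

A call such as `nn_label_agreement(inb_emb_tsne, lg)` would follow the same pattern, but the dense pairwise distance matrix grows quadratically with the number of items, so a nearest-neighbor library may be needed for the full validation set.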