This is a simple, unsupervised approach to clustering products in this competition. For each test record, [sentence embeddings](https://tfhub.dev/google/collections/universal-sentence-encoder) are generated from `title`. Universal sentence encoders are case insensitive and robust enough to accommodate small typos. Embeddings whose Euclidean distance is within a certain threshold (usually less than 1.0) are assumed to belong to the same cluster. Since matching on Euclidean distance is reflexive ($A \in Cluster(A)$) and symmetric ($B \in Cluster(A) \iff A \in Cluster(B)$), [all evaluation criteria](https://www.kaggle.com/c/shopee-product-matching/overview/evaluation) are met.
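
As a rough illustration of the thresholding idea, the sketch below builds a boolean match matrix from a few toy vectors and checks that it is reflexive and symmetric. The function name `threshold_matches` and the 2-d vectors are made up for this example; they are not real encoder output and not part of the notebook.

```python
import numpy as np

def threshold_matches(vecs, threshold):
    """Return a boolean matrix M with M[i, j] True when ||vecs[i] - vecs[j]|| <= threshold."""
    diffs = vecs[:, None, :] - vecs[None, :, :]   # all pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)        # pairwise Euclidean distances
    return dists <= threshold

# Toy 2-d "embeddings" (hypothetical values chosen for easy arithmetic).
toy_vecs = np.array([[0.0, 0.0], [0.3, 0.4], [3.0, 4.0]])
match_matrix = threshold_matches(toy_vecs, threshold=0.75)

# Reflexive: every row matches itself. Symmetric: the matrix equals its transpose.
is_reflexive = bool(np.all(np.diag(match_matrix)))
is_symmetric = bool(np.array_equal(match_matrix, match_matrix.T))
```

Note that this relation is not transitive in general: A can be within the threshold of B, and B of C, while A and C are farther apart, so the "clusters" here are really per-record neighbourhoods.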

Needless to say, it makes more sense to combine the embeddings with features generated from images and perceptual hashes for a better feature set.

**A few implementation notes:**
* This notebook does not restrict each prediction to the top 50 most similar products, but that can easily be done by altering `generate_predictions`.
* I load the entire title set (including repetitions) in this notebook. Alternatively, title strings can be batched, and repeated titles can be embedded only once.
* Submission takes approximately 1 hour.
* Set `IS_TEST = True` during commit to submit. When `IS_TEST = False`, this notebook evaluates on training data instead. 
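
A minimal sketch of the top-50 filter mentioned in the first note, assuming the closest matches should be kept when more than 50 fall inside the threshold. The helper name `closest_within_threshold` and the toy vectors are hypothetical; inside the notebook the same logic would go into `generate_predictions`.

```python
import numpy as np

MAX_MATCHES = 50  # cap taken from the first note above

def closest_within_threshold(vecs, ix, threshold, max_matches=MAX_MATCHES):
    """Indices of rows within `threshold` of row `ix`, nearest first, capped at `max_matches`."""
    dists = np.linalg.norm(vecs - vecs[ix], axis=1)
    candidates = np.where(dists <= threshold)[0]
    # Sort the surviving candidates by distance; the row itself (distance 0) comes first.
    order = np.argsort(dists[candidates], kind="stable")
    return candidates[order][:max_matches]

# Toy 1-d vectors spaced 0.1 apart (hypothetical data).
toy_vecs = np.arange(10, dtype=float).reshape(-1, 1) * 0.1
top3 = closest_within_threshold(toy_vecs, ix=0, threshold=0.75, max_matches=3)
```

Capping each row independently can break symmetry (A may keep B while B drops A), so this is a trade-off rather than a free improvement.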


**NOTE:** This is my first public notebook and any feedback is most welcome. Thanks. 

In [None]:
import pandas as pd, numpy as np
import random
import tensorflow_hub as hub
import tensorflow as tf
from sklearn.metrics import f1_score
from tqdm import tqdm
from tqdm.contrib.concurrent import process_map
tqdm.pandas()

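IS_TEST = False        # True: predict on test.csv and write a submission; False: evaluate on train.csv
DIST_THRESHOLD = .75   # maximum Euclidean distance for two titles to count as a match
fl = '../input/shopee-product-matching/train.csv' if not IS_TEST else '../input/shopee-product-matching/test.csv'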
data = pd.read_csv(fl)
tot_rows = data.shape[0]

sentence_embed = hub.load("../input/use-v4/use_v4")
# Embed every title in a single call; each embedding is 512-dimensional.
title_embeddings = sentence_embed(data['title'].values.tolist()).numpy()
feat_cols = [f'title_emb_{i}' for i in range(title_embeddings.shape[1])]
title_embeddings = pd.DataFrame(title_embeddings, columns=feat_cols)
data = pd.merge(data, title_embeddings, left_index=True, right_index=True)

# Dense matrix of embeddings, one row per posting.
all_vecs = data[feat_cols].values
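
The cell above embeds all titles in one call. A possible sketch of the batching and de-duplication alternative from the implementation notes is shown below. `embed_unique_in_batches` and its `embed_fn` parameter are hypothetical names; `embed_fn` is assumed to be any callable that maps a list of strings to a 2-d array-like, and the input is assumed to be non-empty.

```python
import numpy as np

def embed_unique_in_batches(titles, embed_fn, batch_size=1024):
    """Embed each distinct title once, in batches, then expand back to one row per input title."""
    unique_titles = list(dict.fromkeys(titles))           # de-duplicate, keeping first-seen order
    position = {title: i for i, title in enumerate(unique_titles)}
    chunks = [
        np.asarray(embed_fn(unique_titles[start:start + batch_size]))
        for start in range(0, len(unique_titles), batch_size)
    ]
    unique_vecs = np.concatenate(chunks, axis=0)
    # Map every original title (repeats included) to its embedding row.
    return unique_vecs[[position[title] for title in titles]]

# Stand-in embedder for illustration only: encodes a title as [length, 1.0].
def fake_embed(batch):
    return [[float(len(title)), 1.0] for title in batch]

toy_titles = ["aa", "b", "aa", "ccc"]
toy_embeddings = embed_unique_in_batches(toy_titles, fake_embed, batch_size=2)
```

In this notebook, `embed_fn` could plausibly be something like `lambda batch: sentence_embed(batch).numpy()`, though that wiring is an assumption and has not been run here.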

## Predictions using Sentence embeddings

In [None]:
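def generate_predictions(ix):
    # Euclidean distance from row `ix` to every embedding (the row itself has distance 0).
    dists = np.linalg.norm(all_vecs-all_vecs[ix], axis=1)
    # Keep every posting within the distance threshold.
    indices = np.where(dists<=DIST_THRESHOLD)[0]
    clusters = data.iloc[indices]['posting_id'].values.tolist()
    return ' '.join(clusters)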

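def generate_ground_truths(ix):
    # All postings that share this row's label_group form its ground-truth match set.
    lbl = data.iloc[ix]['label_group']
    # Boolean-mask selection avoids mixing index labels with positional .iloc lookups.
    gt_clusters = data.loc[data['label_group']==lbl, 'posting_id'].values.tolist()
    return ' '.join(gt_clusters)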
    
if __name__ == '__main__':
    # Compute the match string for every row in parallel worker processes.
    data['matches'] = process_map(generate_predictions, list(range(all_vecs.shape[0])), max_workers=4, chunksize=1000)
    if not IS_TEST:
        # Ground-truth groups are only available for the training data.
        data['gts'] = process_map(generate_ground_truths, list(range(all_vecs.shape[0])), max_workers=4, chunksize=1000)
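
`generate_predictions` computes one row of distances per call. One possible alternative, which is not what this notebook does, is to compute distances for a block of rows at a time using the identity $\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2a \cdot b$. The sketch below assumes the embeddings fit in memory; `matches_by_chunk` and `chunk_size` are hypothetical names.

```python
import numpy as np

def matches_by_chunk(vecs, threshold, chunk_size=1024):
    """For each row, return the indices of all rows within `threshold`, computed block by block."""
    sq_norms = np.einsum("ij,ij->i", vecs, vecs)          # squared norm of every row
    matches = []
    for start in range(0, len(vecs), chunk_size):
        block = vecs[start:start + chunk_size]
        # Squared distances between this block and all rows via the expansion above.
        sq_dists = sq_norms[start:start + chunk_size, None] + sq_norms[None, :] - 2.0 * block @ vecs.T
        np.maximum(sq_dists, 0.0, out=sq_dists)           # clip tiny negatives from rounding
        within = sq_dists <= threshold ** 2
        matches.extend(np.flatnonzero(row) for row in within)
    return matches

# Toy integer-valued vectors (hypothetical data) so the arithmetic is exact.
rng = np.random.default_rng(0)
toy_vecs = rng.integers(-3, 4, size=(50, 4)).astype(float)
toy_matches = matches_by_chunk(toy_vecs, threshold=2.5, chunk_size=16)
```

Comparing squared distances against `threshold ** 2` avoids a square root per pair; for real-valued embeddings, results right at the threshold may differ from the per-row version by floating-point rounding.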

## F1 Stats on training data

In [None]:
if IS_TEST:
    data[['posting_id', 'matches']].to_csv('submission.csv', index=False)
else:
    data['gts'] = data['gts'].progress_apply(lambda x: x.split())
    data['matches'] = data['matches'].progress_apply(lambda x: x.split())
    tp = data.progress_apply(lambda row: len(set(row['gts']).intersection(row['matches'])), axis=1)
    fp = data.progress_apply(lambda row: len(set(row['matches'])-set(row['gts'])), axis=1)
    fn = data.progress_apply(lambda row: len(set(row['gts'])-set(row['matches'])), axis=1)
    # Per-row F1 = TP / (TP + (FP + FN) / 2), then averaged over all rows.
    data['f1_score'] = tp/(tp + (fp+fn)/2)
    print('Train mean F1: ', data['f1_score'].mean())
    
    # Distribution of per-row TP/FP/FN counts at 5% percentile steps.
    stats = pd.DataFrame({'tp': tp, 'fp': fp, 'fn': fn})
    print(stats.describe(percentiles=[(ix+1)*.05 for ix in range(19)]))
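
To make the per-row score concrete, here is a small worked example of the same formula on plain Python sets. The function name `row_f1` and the posting ids are made up for illustration.

```python
def row_f1(gts, matches):
    """F1 for one row: TP / (TP + (FP + FN) / 2), using set overlap of posting ids."""
    gts, matches = set(gts), set(matches)
    tp = len(gts & matches)      # predicted and correct
    fp = len(matches - gts)      # predicted but not in the ground truth
    fn = len(gts - matches)      # in the ground truth but not predicted
    return tp / (tp + (fp + fn) / 2)

# Hypothetical ids: two correct matches, one spurious, one missed.
example_f1 = row_f1(["a", "b", "c"], ["a", "b", "d"])   # 2 / (2 + (1 + 1) / 2) = 2/3
perfect_f1 = row_f1(["a", "b"], ["b", "a"])             # identical sets give 1.0
```

The denominator is zero only when both sets are empty, which should not happen here because every posting belongs to its own `label_group` and matches itself at distance 0.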