<img src="https://www.videoandaudiocenter.com/v/vspfiles/assets/images/PriceMatchGuar.jpg" width = 250 height = 100>

# Description

Let me begin by thanking [Chris Deotte](https://www.kaggle.com/cdeotte) for his outstanding contributions to the data science community. I have been learning a ton from him, and for that, I am forever grateful. This notebook is my submission to the [Shopee - Price Match Guarantee](https://www.kaggle.com/c/shopee-product-matching) competition, and is mainly built upon Chris's great notebooks published [here](https://www.kaggle.com/cdeotte/rapids-cuml-tfidfvectorizer-and-knn) and [here](https://www.kaggle.com/cdeotte/part-2-rapids-tfidfvectorizer-cv-0-700). The goal of the competition is to build a model that predicts which listed items are the same product, so that customers can purchase their desired product at its lowest price. 

We will be using Keras' EfficientNetB0 as well as RAPIDS cuML's TfidfVectorizer and KNN, in order to find items with similar titles and/or images. First we use RAPIDS cuML TfidfVectorizer to extract text embeddings of each item's title and then compare the embeddings using cosine similarity. Next we extract image embeddings of each item with Keras' EfficientNetB0 and compare them using RAPIDS cuML KNN. 

#### With that, let the adventure begin! 

In [None]:
import numpy as np, pandas as pd, gc
import cv2, matplotlib.pyplot as plt
import cudf, cuml, cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0

Note that to avoid memory issues, we restrict TensorFlow to 1GB of GPU RAM, which leaves 15GB of the 16GB GPU for RAPIDS. According to Chris Deotte's comment in [this post](https://www.kaggle.com/cdeotte/rapids-cuml-tfidfvectorizer-and-knn#1244649), when TensorFlow sees a GPU, it reserves all of the GPU RAM for itself. We get around this by configuring a virtual GPU with a 1GB memory limit: TensorFlow then takes only that 1GB, and the remaining 15GB is left for RAPIDS to use.

In [None]:
LIMIT = 1
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024*LIMIT)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    #print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    print(e)

# Visualization

In this section, we will load the training data and create a target column of ground truths so that we can compute the CV score. *The variable `COMPUTE_CV` controls this: it stays `True` when we commit the notebook, where the visible `test.csv` has only 3 rows, and the next cell switches it to `False` automatically when we submit and the full hidden test set is loaded.*

In [None]:
COMPUTE_CV = True

test = pd.read_csv('../input/shopee-product-matching/test.csv')
if len(test)>3: COMPUTE_CV = False
else: print('The commit notebook will compute the CV score, but the submission notebook will not')

In [None]:
train = pd.read_csv('../input/shopee-product-matching/train.csv')
tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
train['target'] = train.label_group.map(tmp)
print('Train shape is', train.shape )
train.head()

To start, let's randomly display 20 images out of the training data: 

In [None]:
BASE = '../input/shopee-product-matching/train_images/'

def displayDF(train, random=False, COLS=5, ROWS=4, path=BASE):
    for k in range(ROWS):
        plt.figure(figsize=(20,5))
        for j in range(COLS):
            if random: row = np.random.randint(0,len(train))
            else: row = COLS*k + j
            name = train.iloc[row,1]
            title = train.iloc[row,3]
            title_with_return = ""
            for i,ch in enumerate(title):
                title_with_return += ch
                if (i!=0)&(i%20==0): title_with_return += '\n'
            img = cv2.cvtColor(cv2.imread(path+name), cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
            plt.subplot(1,COLS,j+1)
            plt.title(title_with_return)
            plt.axis('off')
            plt.imshow(img)
        plt.show()
        
displayDF(train,random=True)

Now let's display the top 5 duplicated items, using the column `label_group`, which represents the ground truth in the training data. For brevity, we will only show 5 images for each duplicated item:

In [None]:
groups = train.label_group.value_counts()

for k in range(5):
    
    print('-'*22)
    print('Top', k+1, 'Duplicated Item')  
    print('-'*22)
    top = train.loc[train.label_group==groups.index[k]]
    displayDF(top, random=False, ROWS=1, COLS=5)

Note that we can also find similar items using the title's text. To achieve this, we can first extract text embeddings using RAPIDS cuML's TfidfVectorizer. This will turn each title into a sparse TF-IDF vector, with one dimension per vocabulary word and a non-zero weight wherever that word appears in the title. We can then compare these vectors with RAPIDS cuML KNN or with cosine similarity in order to find the similar titles. I am going to skip plotting the similar titles in this section, but we will be using similar titles to create our ML model. 
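To make the idea concrete, here is a minimal NumPy sketch of a binary TF-IDF embedding on three made-up titles. It is only an illustration and not the cuML implementation: the helper `binary_tfidf`, the whitespace tokenizer, and the smoothed IDF formula are my assumptions, chosen to resemble the scikit-learn-style defaults.

```python
import numpy as np

def binary_tfidf(titles):
    """Toy binary TF-IDF: presence/absence terms, smoothed IDF, L2-normalized rows."""
    docs = [set(t.lower().split()) for t in titles]   # assumed tokenizer: whitespace
    vocab = sorted(set().union(*docs))
    col = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, words in enumerate(docs):
        for w in words:
            tf[r, col[w]] = 1.0                        # binary: a word counts at most once
    df = tf.sum(axis=0)                                # number of titles containing each word
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0     # assumed smoothed IDF
    X = tf * idf
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-length rows
    return X, vocab

# Hypothetical titles, not taken from the competition data
titles = ["red cotton shirt", "shirt cotton red", "blue phone case"]
X, vocab = binary_tfidf(titles)
sims = X @ X.T   # rows are unit length, so dot products are cosine similarities
```

The first two titles contain the same words, so their similarity is 1 regardless of word order, while the third title shares no words with them and gets a similarity of 0.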

# Modeling

We now ignore the ground truth of which items are similar and create a model that can identify duplicate items. Let's start by creating a baseline model, where we predict all products with the same `image_phash` as duplicate items.

In [None]:
tmp = train.groupby('image_phash').posting_id.agg('unique').to_dict()
train['oof'] = train.image_phash.map(tmp)

def getMetric(col):
    def f1score(row):
        n = len( np.intersect1d(row.target,row[col]) )
        return 2*n / (len(row.target)+len(row[col]))
    return f1score

train['f1'] = train.apply(getMetric('oof'),axis=1)
print('Baseline CV Score =',train.f1.mean())
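As a quick sanity check of the metric, the sketch below applies the same row-wise formula, `2*n / (len(target) + len(pred))`, to a few hand-made predictions. The function name `row_f1` and the posting ids are hypothetical.

```python
import numpy as np

def row_f1(target, pred):
    """F1 for one row: n is the number of ids shared by target and prediction."""
    n = len(np.intersect1d(target, pred))
    return 2 * n / (len(target) + len(pred))

target = np.array(["a", "b", "c"])                        # true duplicates of one item
perfect = row_f1(target, np.array(["a", "b", "c"]))       # every match found
partial = row_f1(target, np.array(["a", "b"]))            # one match missed
noisy = row_f1(target, np.array(["a", "x", "y", "z"]))    # mostly false positives
```

Missing a true match and adding a wrong one both lower the score, which is why the distance and similarity thresholds used later matter so much.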

Now let's use image embeddings, text embeddings, and phash all together in order to create a more accurate model. As mentioned earlier, `COMPUTE_CV` is `True` when we commit the notebook, in which case we score our predictions on the training data, and `False` when we submit, in which case we predict on the test data.

In [None]:
if COMPUTE_CV:
    test = pd.read_csv('../input/shopee-product-matching/train.csv')
    test_gf = cudf.DataFrame(test)
    print('This is a commit notebook! Test shape is', test_gf.shape)
else:
    test = pd.read_csv('../input/shopee-product-matching/test.csv')
    test_gf = cudf.read_csv('../input/shopee-product-matching/test.csv')
    print('This is a submission notebook! Test shape is', test_gf.shape)
test_gf.head()

## Method I: Image Embeddings

We will compute image embeddings in chunks in order to prevent memory errors, and will find similar images with RAPIDS cuML KNN in chunks.
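The chunking pattern used in the cells of this section boils down to walking over the rows in fixed-size blocks. Here is a small sketch of that bookkeeping, where `chunk_bounds` is a hypothetical helper and not part of the notebook:

```python
def chunk_bounds(n_rows, chunk):
    """Yield (start, stop) pairs covering range(n_rows) in blocks of at most `chunk` rows."""
    for start in range(0, n_rows, chunk):
        yield start, min(start + chunk, n_rows)

# 10,000 rows in chunks of 4,096 rows: the last chunk is shorter
bounds = list(chunk_bounds(10_000, 4096))
```

Only one block of rows needs to be in memory at a time, which is what keeps the peak memory usage bounded.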

In [None]:
class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, df, img_size=256, batch_size=32, path=''): 
        self.df = df
        self.img_size = img_size
        self.batch_size = batch_size
        self.path = path
        self.indexes = np.arange( len(self.df) )
        
    def __len__(self):
        'Denotes the number of batches per epoch'
        ct = len(self.df) // self.batch_size
        ct += int(( (len(self.df)) % self.batch_size)!=0)
        return ct

    def __getitem__(self, index):
        'Generate one batch of data'
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        X = self.__data_generation(indexes)
        return X
            
    def __data_generation(self, indexes):
        'Generates data containing batch_size samples' 
        X = np.zeros((len(indexes),self.img_size,self.img_size,3),dtype='float32')
        df = self.df.iloc[indexes]
        for i,(index,row) in enumerate(df.iterrows()):
            img = cv2.imread(self.path+row.image)
            X[i,] = cv2.resize(img,(self.img_size,self.img_size)) #/128.0 - 1.0
        return X

In [None]:
BASE = '../input/shopee-product-matching/test_images/'
if COMPUTE_CV: BASE = '../input/shopee-product-matching/train_images/'

WGT = '../input/effnetb0/efficientnetb0_notop.h5'
model = EfficientNetB0(weights=WGT,include_top=False, pooling='avg', input_shape=None)

embeds = []
CHUNK = 1024*4

print('Computing image embeddings ...')
CTS = len(test)//CHUNK
if len(test)%CHUNK!=0: CTS += 1
for i,j in enumerate( range( CTS ) ):
    
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b,len(test))
    print('chunk',a,'to',b)
    
    test_gen = DataGenerator(test.iloc[a:b], batch_size=32, path=BASE)
    image_embeddings = model.predict(test_gen,verbose=1,use_multiprocessing=True, workers=4)
    embeds.append(image_embeddings)

    #if i>=1: break
    
del model
_ = gc.collect()
image_embeddings = np.concatenate(embeds)
print('Image embeddings shape is',image_embeddings.shape)

In [None]:
KNN = 50
if len(test)==3: KNN = 2
model = NearestNeighbors(n_neighbors=KNN)
model.fit(image_embeddings)

In [None]:
preds = []
CHUNK = 1024*4

print('Finding similar images ...')
CTS = len(image_embeddings)//CHUNK
if len(image_embeddings)%CHUNK!=0: CTS += 1
for j in range( CTS ):
    
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b,len(image_embeddings))
    print('chunk',a,'to',b)
    distances, indices = model.kneighbors(image_embeddings[a:b,])
    
    for k in range(b-a):
        IDX = np.where(distances[k,]<6.0)[0]
        IDS = indices[k,IDX]
        o = test.iloc[IDS].posting_id.values
        preds.append(o)
        
_ = gc.collect()

In [None]:
test['preds2'] = preds
test.head()
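To illustrate what the KNN step with a distance cutoff does, here is a brute-force NumPy sketch on four made-up 2-D embeddings. It stands in for cuML's `NearestNeighbors` only conceptually: the function `neighbours_within`, the toy data, and the threshold of 1.0 are all assumptions for this example.

```python
import numpy as np

def neighbours_within(emb, k, threshold, chunk=2):
    """For each row, keep those of its k nearest rows that are closer than `threshold`."""
    out = []
    for a in range(0, len(emb), chunk):
        block = emb[a:a + chunk]
        # Euclidean distances between every row of the block and every row of emb
        d = np.linalg.norm(block[:, None, :] - emb[None, :, :], axis=2)
        order = np.argsort(d, axis=1)[:, :k]          # k nearest rows, nearest first
        for r in range(len(block)):
            out.append(order[r][d[r, order[r]] < threshold])
    return out

# Two tight pairs of points that are far away from each other
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]])
matches = neighbours_within(emb, k=3, threshold=1.0)
```

Each row always matches itself at distance 0, and the cutoff then decides which of its other nearest neighbors count as duplicates.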

Now that we have fitted the nearest-neighbors model, let's display 5 different items, each followed by its 4 closest images in the train data based on EffNetB0 image embeddings:

In [None]:
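# `indices` holds the neighbors of the last chunk only, and its entries index
# into `test`, which is the training data only when COMPUTE_CV is True
if COMPUTE_CV:
    for k in range(180,185):

        print('-'*9)
        print('Example', k-179)
        print('-'*9)
        cluster = train.loc[cupy.asnumpy(indices[k,:5])]
        displayDF(cluster, random=False, ROWS=1, COLS=5)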

## Method II:  Text Embeddings

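Similarly, we will find similar titles in chunks in order to prevent memory errors. To facilitate this, we will use cosine similarity between text embeddings instead of KNN. Since TfidfVectorizer L2-normalizes each row by default, a plain matrix product of the embeddings gives the cosine similarities directly.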

In [None]:
print('Computing text embeddings ...')
model = TfidfVectorizer(stop_words='english', binary=True, max_features=25_000)
text_embeddings = model.fit_transform(test_gf.title).toarray()
print('Text embeddings shape is',text_embeddings.shape)

In [None]:
preds = []
CHUNK = 1024*4

print('Finding similar titles ...')
CTS = len(test)//CHUNK
if len(test)%CHUNK!=0: CTS += 1
for j in range( CTS ):
    
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b,len(test))
    print('chunk',a,'to',b)
    
    # COSINE SIMILARITY (rows are L2-normalized, so the dot product is the cosine)
    cts = cupy.matmul( text_embeddings, text_embeddings[a:b].T).T
    
    for k in range(b-a):
        IDX = cupy.where(cts[k,]>0.7)[0]
        o = test.iloc[cupy.asnumpy(IDX)].posting_id.values
        preds.append(o)
        
del model, text_embeddings
_ = gc.collect()

In [None]:
test['preds'] = preds
test.head()
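Here is a small NumPy sketch of the same idea on made-up vectors: normalize the rows, multiply one chunk at a time against the full matrix, and keep every column above the threshold. The toy embeddings and the chunk size of 2 are assumptions for illustration, while the 0.7 cutoff mirrors the one used above.

```python
import numpy as np

# Hypothetical embeddings: rows 0 and 1 point the same way, rows 2 and 3 are close
emb = np.array([[1.0, 0.0, 0.0],
                [3.0, 0.0, 0.0],
                [0.0, 2.0, 0.0],
                [1.0, 2.0, 0.0]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize each row

THRESH, CHUNK = 0.7, 2
matches = []
for a in range(0, len(emb), CHUNK):
    sims = emb[a:a + CHUNK] @ emb.T                  # (chunk, n) cosine similarities
    for row in sims:
        matches.append(np.where(row > THRESH)[0])    # indices of similar rows
```

Because the rows have unit length, vector length no longer matters: rows 0 and 1 match each other even though one is three times longer than the other before normalization.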

## Method III: Phash Feature

Finally, we will use the phash feature and predict all items that share the same `image_phash` as duplicates.

In [None]:
tmp = test.groupby('image_phash').posting_id.agg('unique').to_dict()
test['preds3'] = test.image_phash.map(tmp)
test.head()
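The `groupby` and `map` combination above is just a lookup from each hash to all postings that carry it. A plain-Python sketch of the same logic, using hypothetical posting ids and hash strings:

```python
from collections import defaultdict

# (posting_id, image_phash) pairs; values are made up for illustration
postings = [("p1", "aaaa"), ("p2", "bbbb"), ("p3", "aaaa"), ("p4", "cccc")]

# Step 1: collect the postings that share each hash (the groupby step)
by_hash = defaultdict(list)
for pid, phash in postings:
    by_hash[phash].append(pid)

# Step 2: give every posting the full group of its hash (the map step)
preds3 = {pid: by_hash[phash] for pid, phash in postings}
```

Every posting appears in its own prediction, so an item with a unique hash simply predicts itself.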

## Method IV: Ensemble Learning

Let's combine the previous models and calculate the CV score:

In [None]:
def combine_for_sub(row):
    x = np.concatenate([row.preds,row.preds2, row.preds3])
    return ' '.join( np.unique(x) )

def combine_for_cv(row):
    x = np.concatenate([row.preds,row.preds2, row.preds3])
    return np.unique(x)

In [None]:
if COMPUTE_CV:
    tmp = test.groupby('label_group').posting_id.agg('unique').to_dict()
    test['target'] = test.label_group.map(tmp)
    test['oof'] = test.apply(combine_for_cv,axis=1)
    test['f1'] = test.apply(getMetric('oof'),axis=1)
    print('CV Score =', test.f1.mean() )

test['matches'] = test.apply(combine_for_sub,axis=1)
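To show what the combination does for a single row, here is a toy example with hypothetical posting ids. The ensemble is a simple union: an item is predicted as a match if any one of the three methods found it.

```python
import numpy as np

# Predictions for one item from the three methods (made-up ids)
text_preds = np.array(["p1", "p2"])
image_preds = np.array(["p1", "p3"])
phash_preds = np.array(["p1"])

# np.unique removes the repeated ids and sorts the result
union = np.unique(np.concatenate([text_preds, image_preds, phash_preds]))
matches = " ".join(union)   # space-separated string used in the submission file
```

A union favors recall: it can only add matches, so each individual method needs a fairly strict threshold to keep false positives out of the final prediction.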

Finally, we need to generate a submission file:

In [None]:
test[['posting_id','matches']].to_csv('submission.csv',index=False)
sub = pd.read_csv('submission.csv')
sub.head()

Hope you enjoyed this notebook. Make sure you also check Chris Deotte's [post](https://www.kaggle.com/c/shopee-product-matching/discussion/238033) on how to improve the CV score by using a better decision boundary and by removing false negatives and false positives, both of which raise the F1 score. Happy Kaggling :) 