We will exploit the transitive relationships in this kernel. If 'A' likes 'B' and 'C', then we know not only that 1) B likes A and 2) C likes A, but also that 3) B likes C (and vice versa).

Let us do a quick POC first, followed by an actual implementation. The POC does not need a GPU, but the actual implementation needs about 8-10 minutes of GPU time. I have also stored the output of the GPU processing, so one can turn the GPU off for post-processing and use the shared output file directly.

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd.set_option('display.max_colwidth', None)

train = pd.read_csv('../input/shopee-product-matching/train.csv')
train['titleUcase'] = train['title'].str.upper()

Before we do anything serious, let us take the output from https://www.kaggle.com/cdeotte/part-2-rapids-tfidfvectorizer-cv-0-700 and see if the idea even makes sense.

In [None]:
out = pd.read_csv('../input/chris-rapids/submission_Chris.csv')
out.head(4)

In [None]:
%timeit out.loc[out.posting_id == 'train_2288590299']

In [None]:
out = out.sort_values('posting_id')
out = out.set_index('posting_id')
out = out.sort_index()

%timeit out.loc['train_2288590299']

The cell below was taking well over an hour to execute on a CPU, hence the index optimizations above. CPU time is now under half a minute.
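A minimal sketch of why the index helps, on a made-up three-row frame (the IDs and matches are hypothetical, not from the competition data). The boolean mask compares every row on each lookup, whereas a label lookup on an indexed frame does not have to scan all rows.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the submission file.
toy = pd.DataFrame({
    'posting_id': ['train_3', 'train_1', 'train_2'],
    'matches': ['train_3', 'train_1 train_2', 'train_2 train_1'],
})

# Slow path: build a boolean mask over all rows for every lookup.
by_mask = toy.loc[toy.posting_id == 'train_1', 'matches'].iloc[0]

# Fast path: set a sorted index once, then look labels up directly.
indexed = toy.set_index('posting_id').sort_index()
by_index = indexed.loc['train_1', 'matches']

print(by_mask == by_index)
```

Both paths return the same string; only the cost per lookup differs, which matters when the lookup runs inside a loop over every posting.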

OK, what we do now is take each posting_id and build a superset of all possible transitive matches it can have. So if A is the same as B, C and D; B is the same as C, D and E; C is the same as A, B and D; D is the same as A; and E is the same as D, then ABCDE is a superset, and each posting ID in this superset should match every element in it.

In [None]:
%%time
ctr=0

def getcombined(l, combined):
    global ctr
    if len(combined) == 0: 
        ctr+=1
        print(ctr) if ctr%6000==0 else None
    
    if len(l) < 3 and len(combined)==0 :
        return l
    elif len(combined) >= 50:
        return combined
    
    local_combined = set()
    for item in l:
        matches = set(out.loc[item]['matches'].split(' '))
        local_combined.update(matches)
        
    remaining = (local_combined - set(l)) - combined
    combined.update(local_combined)

    if len(remaining) > 0:
        getcombined(remaining, combined)
        
    return list(combined)

out['combined'] = out.apply(lambda x: getcombined(x.matches.split(' '), set()), axis=1)

Phew, this was my first recursive function and I am sure there is a more elegant way of writing it. Anyway, it serves the purpose for now. The reason we make it recursive is that as we go through all the matches for a particular posting ID (one iteration), we may discover new matches, and we have to repeat the process for those. We keep doing that until we have iterated through all matching posting IDs or the number of posting IDs collected reaches 50.

Note - I later discovered a bug in the way I assign the output: I do it only for the row in question, whereas I should do it for all matching IDs that are returned. I correct this in the actual implementation below.
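For comparison, here is a minimal non-recursive sketch of the same idea: a breadth-first walk over a plain dict of matches. It is not the function used in this kernel; the name `transitive_matches`, the dict input and the soft cap handling are assumptions made for illustration. It reuses the A to E example from above.

```python
from collections import deque

def transitive_matches(start, matches, limit=50):
    """Collect everything reachable from `start` by following matches.

    `limit` is a soft cap: expansion stops once the set has reached it,
    so the result can overshoot slightly.
    """
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < limit:
        current = queue.popleft()
        for other in matches.get(current, []):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return sorted(seen)

# The example from the text: A~BCD, B~CDE, C~ABD, D~A, E~D.
matches = {
    'A': ['B', 'C', 'D'],
    'B': ['C', 'D', 'E'],
    'C': ['A', 'B', 'D'],
    'D': ['A'],
    'E': ['D'],
}
groups = {pid: transitive_matches(pid, matches) for pid in matches}
print(groups['E'])  # ['A', 'B', 'C', 'D', 'E']
```

Even E, which directly matches only D, ends up with the full superset, because D leads to A and A leads to the rest.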

Ok. Looks good. Let us check our score. Chris's original score was 72.X. Let us use the same function to compute CV

In [None]:
def getMetric(col):
    def f1score(row):
        n = len( np.intersect1d(row.target,row[col]) )
        return 2*n / (len(row.target)+len(row[col]))
    return f1score
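To make the metric concrete, here is the same formula applied to one hypothetical posting. The helper name `row_f1` and the IDs are made up for illustration; the arithmetic mirrors `f1score` above.

```python
import numpy as np

def row_f1(target, predicted):
    """F1 for one posting: 2 * |overlap| / (|target| + |predicted|)."""
    n = len(np.intersect1d(target, predicted))
    return 2 * n / (len(target) + len(predicted))

# Hypothetical posting whose true group is {a, b, c}.
target = ['a', 'b', 'c']
missed_one = row_f1(target, ['a', 'b'])              # 2*2 / (3+2)
one_false_pos = row_f1(target, ['a', 'b', 'c', 'd'])  # 2*3 / (3+4)
print(missed_one, one_false_pos)
```

A missed match and a false positive both cost something, which is why expanding the matches transitively can lower the score when the added IDs are wrong.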

The input to the scoring function will be the 'combined' col from the table we just created. Let us merge it back into 'train' and then use that to compute the score. But before that, there is a headache to take care of: the submission format and the CV format differ slightly. One is a string separated by spaces and the other is a list. We need a list for CV. Our 'combined' is already a list, but the original 'matches' isn't. So let us create a new col 'matches_orig' in list format from the submission output.

In [None]:
out['matches_orig'] = list(out.matches.str.split(' '))
out.head(1)

Now, let us use the new cols 'matches_orig' and 'combined' to get the original and the new scores. Don't forget to add the target (built from label_group) first.

In [None]:
tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
train['target'] = train.label_group.map(tmp)

In [None]:
train = pd.merge(train, out,  how='left', left_on=['posting_id'], right_on = ['posting_id'])

train['f1_orig'] = train.apply(getMetric('matches_orig'),axis=1)
print('Orig CV Score =', train.f1_orig.mean() )

train['f1'] = train.apply(getMetric('combined'),axis=1)
print('New CV Score =', train.f1.mean() )

##Drop these columns to avoid confusion later
if 'f1' in train.columns:
    train.drop('f1',axis='columns', inplace=True)
    
if 'matches' in train.columns:
    train.drop('matches',axis='columns', inplace=True)

Ok. The score deteriorated slightly from the original CV. This was kind of expected: the number of false positives generated would have increased dramatically. In fact, I was expecting a far worse score than this.

Now here is how we can actually utilize this bit of PP. First, in the regular ML program, we make the cosine similarity check far more stringent than before, so any matches are now pretty sure to be 'actual' matches. Then we add this additional bit of PP to build transitive relations and expand the 'matches' column. Hopefully this returns a better score.

To understand better, take a simple case where A is similar to B (by text comparison) and to C (by image comparison). This automatically implies that our algorithm will return that B is similar to A and C is similar to A. But what happens if the text for B is not very closely related to the text for C (maybe it falls just below the cut-off, or maybe it contains translated words so nothing matches) and the images of B and C also don't match (maybe the angles and lighting are different)? In that case B need not be equal to C as per our model. But we know this can't be right: if A is equal to both B and C (and we have high confidence because we set high cut-offs), then B and C must be equal, even though our model says no.
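The A, B, C case can be sketched with a small union-find (disjoint-set) structure. This is a swapped-in technique for illustration, not the recursive function used in this kernel, and the pair lists and function names are made up.

```python
def find(parent, x):
    """Return the representative of x's group, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_by_pairs(ids, pairs):
    """Merge ids connected by any high-confidence pair into one group."""
    parent = {i: i for i in ids}
    for a, b in pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for i in ids:
        groups.setdefault(find(parent, i), []).append(i)
    # Every member gets the full group, not just the id we started from.
    return {i: sorted(members) for members in groups.values() for i in members}

# Hypothetical strict matches: A~B found by text, A~C found by image.
text_pairs = [('A', 'B')]
image_pairs = [('A', 'C')]
groups = group_by_pairs(['A', 'B', 'C', 'D'], text_pairs + image_pairs)
print(groups['B'])  # ['A', 'B', 'C']
```

B and C land in the same group although neither the text nor the image model paired them directly, while D, which matched nothing, stays alone.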

Can we exploit this fact? Let us see..

Before that, let us load Rapids and calculate the cosine similarity. We will use Chris's code almost exactly as-is for this purpose, with only minimal changes to the cosine similarity calculation. I have also added a bit of my own documentation so that I can understand this code easily when I look at it later.

Note that we will have to turn on GPUs here. I wish Kaggle had an option for turning them on programmatically.

In [None]:
import gc
import cv2, matplotlib.pyplot as plt
import cudf, cuml, cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0

print('RAPIDS',cuml.__version__)
print('TF',tf.__version__)

In [None]:
gpus = tf.config.experimental.list_physical_devices('GPU')

##Give Rapids 15GB of the GPU and let TF take 1 GB (1024 MB)
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0],[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)
else:
    print('tough luck. Work with cpus')

Let us read the file again and start from scratch.

In [None]:
train = pd.read_csv('../input/shopee-product-matching/train.csv')

##create the target based on the label_group
##first element in target is self and then all the matches
tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
train['target'] = train.label_group.map(tmp)
train.head(2)

OK, now let us use the images, text and phash. We will predict using them individually and then ensemble. Note that concatenating the embeddings to create one giant embedding is NOT a good strategy for multiple reasons. Instead, let us try to use the best of the 3 groups with some intelligent ensembling.

We will also have to convert our datasets to RAPIDS format. This is test_rapids.

Keep in mind that we are calculating CV and we will use the train data instead of the test data.

In [None]:
test = pd.read_csv('../input/shopee-product-matching/train.csv')
test_rapids = cudf.DataFrame(test)
    
test_rapids.head(4)

Since the dataset is huge, we can't process all the records at the same time. They will not fit in memory together.

We need to create some sort of data generator that produces data in batches. We can leverage keras.utils.Sequence, which is the root class for data generators. The class below defines four methods: __init__, __len__, __getitem__ and a private helper that builds one batch.

Once we have this, it is fairly easy to generate the required batches of data and pass them to model.fit or, as in this kernel, model.predict.

We use OpenCV to read and resize images, which is much faster.

In [None]:
##Use image embeddings
class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, df, img_size=256, batch_size=32, path=''): 
        self.df = df
        self.img_size = img_size
        self.batch_size = batch_size
        self.path = path
        self.indexes = np.arange(len(self.df))
        
    def __len__(self):
        ##Returns the number of batches (the last one may be partial)
        ct = len(self.df) // self.batch_size
        ct += int(( (len(self.df)) % self.batch_size)!=0)
        return ct

    def __getitem__(self, index):
        ##Hands out one batch of data
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        X = self.__data_generation(indexes)
        return X
            
    def __data_generation(self, indexes):
        ##This is the actual data generator
        X = np.zeros((len(indexes),self.img_size,self.img_size,3),dtype='float32')
        df = self.df.iloc[indexes]
        for i,(index,row) in enumerate(df.iterrows()):
            img = cv2.imread(self.path+row.image)
            X[i,] = cv2.resize(img,(self.img_size,self.img_size))
        return X

Now we can do: data = DataGenerator(test, path=BASE)

More particularly, since we are going to process in chunks, we will call: data = DataGenerator(test.iloc[a:b], path=BASE)
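The chunk boundaries a and b can be produced by a tiny helper. This is only a sketch; the name `chunk_bounds` is made up, and each (a, b) pair it yields would be the slice handed to the generator.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (a, b) so that rows a..b-1 form one chunk; the last may be short."""
    for a in range(0, n_rows, chunk_size):
        yield a, min(a + chunk_size, n_rows)

print(list(chunk_bounds(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

Stepping the range by the chunk size handles the left-over records in the final chunk without a separate remainder check.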

First we will extract image embeddings using EffNetB0 in chunks. Later we will do a kNN check. We use EfficientNetB0 from tensorflow.keras.applications, for which we do not normalize the input ourselves (for other versions we should, unless we do fine-tuning). The model takes input images of shape (224,224,3) and input data from 0 to 255.

But first, a quick intro to Rapids: a collection of neat libraries which execute DIRECTLY on the GPUs. There is no data movement. Everything runs on the GPU.

cuDF is the Python GPU DataFrame library for manipulating data using a DataFrame style API.

cuML is their suite of ML libraries. We use cuML KNN to find images that are similar. Apparently, the RAPIDS cuML implementation of kNN search on GPU is based on Facebook's SOTA FAISS library and is BLAZINGLY fast. See https://medium.com/rapids-ai/accelerating-k-nearest-neighbors-600x-using-rapids-cuml-82725d56401e by our very own Chris

Most of the API (commands) for Pandas work with RAPIDS cuDF, and most of the API (commands) for Scikit-Learn work with RAPIDS cuML.

In [None]:
%%time
BASE = '../input/shopee-product-matching/train_images/'

model = EfficientNetB0(weights='imagenet', include_top=False, pooling='avg', input_shape=None)
##If submitting, then internet is off, so download the weights beforehand

##We predict in chunks to avoid OOM issues
CHUNK = 1024*4
CTS = len(test)//CHUNK
if len(test)%CHUNK!=0: 
    ##one last chunk for any left-over records
    CTS += 1
embeds = []

for i,j in enumerate(range( CTS )):
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b,len(test))
    print('chunk',a,'to',b)
    
    test_gen = DataGenerator(test.iloc[a:b], path=BASE)
    image_embeddings = model.predict(test_gen,verbose=1,use_multiprocessing=True, workers=4)
    embeds.append(image_embeddings)
    
del model
_ = gc.collect()
image_embeddings = np.concatenate(embeds)
print('image embeddings shape',image_embeddings.shape)

In [None]:
KNN = 50
model = NearestNeighbors(n_neighbors=KNN)
model.fit(image_embeddings)

Now for each posting_id, let us get the nearest neighbours. Chris uses a distance threshold of 6. We will use thresholds of 8 and 2 to get the liberal and stringent versions of nearest neighbours, respectively.

In [None]:
preds, preds_liberal, preds_stringent = [], [], []

CHUNK = 1024*4
CTS = len(image_embeddings)//CHUNK
if len(image_embeddings)%CHUNK!=0: 
    CTS += 1

print('Finding similar images...')
for j in range( CTS ):
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b,len(image_embeddings))
    print('chunk',a,'to',b)
    distances, indices = model.kneighbors(image_embeddings[a:b,])
    
    for k in range(b-a):
        IDX = np.where(distances[k,]<6.0)[0]
        IDX_liberal = np.where(distances[k,]<8.0)[0]
        IDX_stringent = np.where(distances[k,]<2.0)[0]
        
        IDS = indices[k,IDX]
        IDS_liberal = indices[k,IDX_liberal]
        IDS_stringent = indices[k,IDX_stringent]
        
        o = test.iloc[IDS].posting_id.values
        o_stringent = test.iloc[IDS_stringent].posting_id.values
        o_liberal = test.iloc[IDS_liberal].posting_id.values
        
        preds.append(o)
        preds_liberal.append(o_liberal)
        preds_stringent.append(o_stringent)

test['preds_image'],test['preds_image_lib'],test['preds_image_stri'] = preds, preds_liberal, preds_stringent
del model, distances, indices, image_embeddings, embeds
_ = gc.collect()

print('Fast and clean as can be. Rapids is good!')

The key item of relevance above is the choice of distance threshold (the distance is presumably Euclidean, the default metric for NearestNeighbors). Chris seems to have done some sort of analysis before deciding on 6. We use 2 and 8, and this is an area that can be experimented with.

We could also use cosine distance
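One way to try this without changing the search itself, assuming the embeddings are L2-normalised first: for unit vectors, squared Euclidean distance and cosine similarity are tied by the identity ||a - b||^2 = 2 - 2*cos(a, b), so a threshold on one translates directly into a threshold on the other. The sketch below checks the identity on random vectors that stand in for image embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))  # stand-in embeddings, not real image features

# L2-normalise each row so every vector has length 1.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

cosine_sim = unit @ unit.T
sq_dist = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
print(np.allclose(sq_dist, 2 - 2 * cosine_sim))
```

Note that the thresholds 2, 6 and 8 used above apply to the raw, unnormalised embeddings, so they would need to be re-tuned after normalising.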

Let us move on to TEXT now. We put it in a function so it can be re-used, if time permits, for the actual comp. Just note that the test_rapids version is needed for vectorizing; for all other operations we use the test dataframe.

In [None]:
def get_text_predictions(df_cu, df, max_features = 25_000):
    
    model = TfidfVectorizer(stop_words = 'english', binary = True, max_features = max_features)
    text_embeddings = model.fit_transform(df_cu.title).toarray()
    
    preds, preds_liberal, preds_stringent = [],[],[]
    CHUNK = 1024*4

    print('Finding similar titles...')
    CTS = len(df)//CHUNK
    if len(df)%CHUNK!=0: CTS += 1
        
    for j in range(CTS):
        a = j*CHUNK
        b = (j+1)*CHUNK
        b = min(b,len(df))
        print('chunk',a,'to',b)

        cts = cupy.matmul( text_embeddings, text_embeddings[a:b].T).T

        for k in range(b-a):
            IDX = cupy.where(cts[k,]>0.7)[0]
            IDX_liberal = cupy.where(cts[k,]>0.4)[0]
            IDX_stringent = cupy.where(cts[k,]>0.98)[0]
            
            ##Each prediction list uses the indices from its own threshold
            o = df.iloc[cupy.asnumpy(IDX)].posting_id.values
            o_liberal = df.iloc[cupy.asnumpy(IDX_liberal)].posting_id.values
            o_stringent = df.iloc[cupy.asnumpy(IDX_stringent)].posting_id.values
            
            preds.append(o)
            preds_liberal.append(o_liberal)
            preds_stringent.append(o_stringent)
    
    del model,text_embeddings
    gc.collect()
    return preds,preds_liberal,preds_stringent

test['preds_text'],test['preds_textlib'],test['preds_textstri'] = get_text_predictions(test_rapids, test, max_features = 25_000)
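A CPU-only sketch of the thresholding step inside the function, assuming the TF-IDF rows are L2-normalised so that a plain dot product is the cosine similarity. The three toy vectors and the helper name `neighbours` are made up; the cut-offs 0.4, 0.7 and 0.98 are the ones used above.

```python
import numpy as np

# Stand-in for TF-IDF rows: three toy title vectors, L2-normalised.
vecs = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 0.2],
                 [1.0, 0.0, 1.0]])
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities.
sims = vecs @ vecs.T

def neighbours(sim_row, threshold):
    """Indices whose similarity to this row exceeds the threshold."""
    return set(np.where(sim_row > threshold)[0].tolist())

row = sims[0]
stringent = neighbours(row, 0.98)
default = neighbours(row, 0.7)
liberal = neighbours(row, 0.4)
print(stringent <= default <= liberal)
```

Because the cut-offs are nested, the stringent matches are always a subset of the default ones, which in turn are a subset of the liberal ones.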

In [None]:
##Phash it up
tmp = test.groupby('image_phash').posting_id.agg('unique').to_dict()
test['preds_phash'] = test.image_phash.map(tmp)
test.head(2)

In [None]:
##I am sure there is a better way than to repeat
def combine_for_sub(row):
    x = np.concatenate([row.preds_text,row.preds_image, row.preds_phash])
    return ' '.join(np.unique(x))

def combine_for_sub_stri(row):
    x_stringent = np.concatenate([row.preds_textstri,row.preds_image_stri, row.preds_phash])
    return ' '.join(np.unique(x_stringent))

def combine_for_sub_lib(row):
    x_liberal = np.concatenate([row.preds_textlib,row.preds_image_lib, row.preds_phash])
    return ' '.join(np.unique(x_liberal))

def combine_for_cv(row):
    x = np.concatenate([row.preds_text,row.preds_image, row.preds_phash])
    return np.unique(x)

def combine_for_cv_stringent(row):
    x_stringent = np.concatenate([row.preds_textstri,row.preds_image_stri, row.preds_phash])
    return np.unique(x_stringent)

def combine_for_cv_liberal(row):
    x_liberal = np.concatenate([row.preds_textlib,row.preds_image_lib, row.preds_phash])
    return np.unique(x_liberal)
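One possible way to avoid the repetition is a small factory that takes the column names as parameters. This is a sketch only: `make_combiner` and its arguments are hypothetical names, and the one-row frame is made up to mirror the columns above.

```python
import numpy as np
import pandas as pd

def make_combiner(text_col, image_col, phash_col='preds_phash', as_string=False):
    """Build a row function that unions the three prediction columns."""
    def combine(row):
        merged = np.unique(np.concatenate([row[text_col], row[image_col], row[phash_col]]))
        return ' '.join(merged) if as_string else merged
    return combine

# Hypothetical one-row frame with the same column names as above.
toy = pd.DataFrame({
    'preds_textstri': [['p1', 'p2']],
    'preds_image_stri': [['p1', 'p3']],
    'preds_phash': [['p1']],
})
combine_stri = make_combiner('preds_textstri', 'preds_image_stri', as_string=True)
print(toy.apply(combine_stri, axis=1).iloc[0])  # p1 p2 p3
```

With this, the six functions reduce to six calls such as make_combiner('preds_textlib', 'preds_image_lib'), where as_string=True gives the submission format and the default gives the array format used for CV.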

In [None]:
tmp = test.groupby('label_group').posting_id.agg('unique').to_dict()
test['target'] = test.label_group.map(tmp)
test['oof']= test.apply(combine_for_cv,axis=1)
test['oof_liberal']= test.apply(combine_for_cv_liberal,axis=1)
test['oof_stringent']= test.apply(combine_for_cv_stringent,axis=1)

test['f1'] = test.apply(getMetric('oof'),axis=1)
print('CV Score =', test.f1.mean())
test['f1'] = test.apply(getMetric('oof_liberal'),axis=1)
print('CV Score liberal =', test.f1.mean())    
test['f1'] = test.apply(getMetric('oof_stringent'),axis=1)
print('CV Score stringent =', test.f1.mean())
    
test['matches'] = test.apply(combine_for_sub,axis=1)
test['matches_lib'] = test.apply(combine_for_sub_lib,axis=1)
test['matches_stri'] = test.apply(combine_for_sub_stri,axis=1)

The CV scores are: 
CV Score = 72.48

CV Score liberal = 69.93

CV Score stringent = 68.19

Now, let us apply getcombined() function for the stringent cases. We will save the output file and process it without GPU

In [None]:
test[['posting_id','matches', 'matches_lib', 'matches_stri']].to_csv('CV_difflevels.csv',index=False)

If we want, we can switch off the GPU here and jump directly to this cell after executing the first 10 cells of this kernel (up to where we start Rapids and the GPU).

In [None]:
out = pd.read_csv('../input/cvwithdifflevels/CV_difflevels (1).csv')
out.head(4)

In [None]:
out = out.sort_values('posting_id')
out = out.set_index('posting_id')
out = out.sort_index()

There were 2 bugs, and I correct them now in the actual function. The first is that once we get the output for a particular posting ID, we need to update all rows in that superset (not just the row being processed). Once we do this, we can also leverage the fact that if 'combined' has already been filled in for a particular row, there is no need to run the function for it again. The second mistake (which is the silliest one) was that I was using the original 'matches' column for the superset updates instead of the stringent 'matches_stri' column.

Lastly we will change the output columns to list instead of strings, so it will be a little easier to process for CV purposes. Let us do that first

In [None]:
out['matches_orig'] = list(out.matches.str.split(' '))
out['matches_stri_orig'] = list(out.matches_stri.str.split(' '))
out['matches_lib_orig'] = list(out.matches_lib.str.split(' '))
out.drop('matches_stri',axis='columns', inplace=True)
out.drop('matches_lib',axis='columns', inplace=True)
out.drop('matches',axis='columns', inplace=True)
    
out.head(1)

In [None]:
def getcombined(l, combined, mode):
    
    if len(l) < 3 and len(combined)==0 :
        return l
    elif len(combined) >= 50:
        return combined
    
    local_combined = set()
    for item in l:
        if mode == 'strict':
            matches = set(out.loc[item]['matches_stri_orig'])
        elif mode == 'liberal':
            matches = set(out.loc[item]['matches_lib_orig'])
        else:
            matches = set(out.loc[item]['matches_orig'])            
        local_combined.update(matches)
        
    remaining = (local_combined - set(l)) - combined
    combined.update(local_combined)

    if len(remaining) > 0:
        getcombined(remaining, combined, mode)
        
    return list(combined)

In [None]:
out['combined'] = None

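def updatetransitive(x_index, x):
    ##Skip rows already filled in as part of an earlier superset
    if out.at[x_index, 'combined'] is not None:
        return
        
    combined = getcombined(x, set(), 'strict')

    ##Write the superset to every member, not just the current row
    ##(.at sets a single cell, so the list is stored as one object)
    for item in combined:
        out.at[item, 'combined'] = combined

out.apply(lambda x: updatetransitive(x.name, x.matches_stri_orig), axis=1)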

In [None]:
##clean up past merges if any
for item in ['combined', 'matches_orig', 'matches_lib_orig', 'matches_stri_orig','combined_x', 'combined_y', 'matches_orig_x', 'matches_orig_y', 'matches_lib_orig_x','matches_lib_orig_y','matches_stri_orig_x','matches_stri_orig_y']:
    if item in train.columns:
        train.drop(item,axis='columns', inplace=True)

In [None]:
train = pd.merge(train, out[['combined', 'matches_orig', 'matches_stri_orig', 'matches_lib_orig']],  how='left', left_on=['posting_id'], right_on = ['posting_id'])
train[train.columns[5:]].head(4)

Now let us see if it has improved our stringent distance check. If not, this experiment is as good as done and we should wind up

In [None]:
train['f1_orig'] = train.apply(getMetric('matches_orig'),axis=1)
print('CV Score Orig', train.f1_orig.mean())

train['f1_lib'] = train.apply(getMetric('matches_lib_orig'),axis=1)
print('CV Score liberal', train.f1_lib.mean())

train['f1_stri'] = train.apply(getMetric('matches_stri_orig'),axis=1)
print('CV Score stringent', train.f1_stri.mean())

train['f1_stri_PP'] = train.apply(getMetric('combined'),axis=1)
print('CV Score stringent - WITH TRANSITIVE PP', train.f1_stri_PP.mean())

Ok. I would have expected more improvement, but it looks like there are still many false positives that contaminate the score when we do this transitive PP. Anyway, there is an improvement in CV from 68.19 to 68.84. Let us now try an ensemble. Basically, for those loners with no matches, we try to set them up with some dates from matches_orig or matches_lib_orig, in that order.

If the result goes higher than 72.48, then the whole approach pays off.

Pandas was just not made to store lists. So rather than doing ugly hacks, we will store the ensemble as a string and then convert it to a list. This process takes 3-4 minutes on a CPU and definitely needs optimization.

In [None]:
def ensemble(row):
    if len(row.combined) ==1:
        if len(row.matches_orig) > 1:
            train.loc[train.posting_id==row.posting_id, 'ensemble'] =  ' '.join(row.matches_orig)
        elif len(row.matches_lib_orig) > 1:
            train.loc[train.posting_id==row.posting_id, 'ensemble'] = ' '.join(row.matches_lib_orig)
        else:
            train.loc[train.posting_id==row.posting_id, 'ensemble'] = ' '.join(row.combined)
    else:
        train.loc[train.posting_id==row.posting_id, 'ensemble'] = ' '.join(row.combined)
    
train.apply(ensemble, axis=1)

In [None]:
train['ensemble'] = list(train.ensemble.str.split(' '))
train.head(1)[train.columns[6:]]
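As a possible optimization of the ensemble step, the fallback rule can be computed in one pass over the three columns, with no per-row `.loc` write and no round trip through strings. This is a sketch on a made-up frame; `pick_matches` is a hypothetical name and has not been timed on the real data.

```python
import pandas as pd

def pick_matches(combined, matches_orig, matches_lib):
    """Fall back to looser matches only when the strict group is a loner."""
    if len(combined) == 1:
        if len(matches_orig) > 1:
            return list(matches_orig)
        if len(matches_lib) > 1:
            return list(matches_lib)
    return list(combined)

# Hypothetical frame with the same column names as `train`.
toy = pd.DataFrame({
    'posting_id': ['p1', 'p2', 'p3'],
    'combined': [['p1'], ['p2'], ['p3', 'p9']],
    'matches_orig': [['p1', 'p4'], ['p2'], ['p3']],
    'matches_lib_orig': [['p1', 'p4', 'p5'], ['p2', 'p6', 'p7'], ['p3']],
})

# Build the whole column at once from the three source columns.
toy['ensemble'] = pd.Series(
    [pick_matches(c, o, l)
     for c, o, l in zip(toy['combined'], toy['matches_orig'], toy['matches_lib_orig'])],
    index=toy.index,
)
print(toy['ensemble'].tolist())
```

p1 falls back to matches_orig, p2 has to go one step further to the liberal matches, and p3 keeps its strict group because it is not a loner.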

In [None]:
train['f1_stri_PP_ensemble'] = train.apply(getMetric('ensemble'),axis=1)
print('CV Score stringent - WITH TRANSITIVE PP + ENSEMBLE', train.f1_stri_PP_ensemble.mean())

Nope. It does not go beyond 72.4, which was our original CV. A failed experiment then? Depends. We could see that the stringent CV score improves from 68.1 to 68.8 when we consider transitive relationships. The idea is sound but the implementation failed (reminds me of a dialog from the Ghost and the Darkness movie, from a young Michael Douglas). Anyway, the challenge is that if we reduce the stringency, we get lots of false positives and the score actually drops due to the PP. Increasing the stringency causes the CV to drop initially, and even after PP it does not recover to its original level.

For this PP to work, we need stringency, but the moment we make the cut-off stringent, the CV drops big time and the gains from PP are not enough to take it past its original mark. Can we find a sweet spot where the CV does not drop too much and the gains from PP carry it past the older score? Would having sharper differentiation between embeddings work? Theoretically it should, but unfortunately I couldn't make it work in the limited time I had, and it would be nice if someone could fix this. One option could be to go back to those records which have a huge number of matches, increase the stringency cut-offs only for those records, and then do the PP only for those. Another option is to do this PP only for those records where we are 100% sure of a match.

In any case, due to work constraints I will be taking a break of a couple of months from Kaggle: at least no new comps or kernels, though I hope to occasionally keep myself up to date on the discussions happening. All the best to all participants of Shopee, and I will be eagerly awaiting the results day discussions.