It often helps to think from a contrarian point of view. This could open up unexpected avenues of pursuit. So in the current competition, while the focus is on **MATCHING** products, we can try taking a step back to see the conditions under which products can be **UN-MATCHED**.

* The first criterion for ‘un-matching’ could be the **PRICE**. If the price is not within an acceptable range, we probably know that the products are different. We can’t match a Rp@6999 product with a Rp@999 product even though the rest of the title text and the images may be quite similar. A quick check of the training data shows that the price info seems to have been scrubbed from most of the records, barring very few.

* The second could be the **ORG NAME**. We just can’t have a Sony 32 inch TV HD color hi pixel whatsoever match with a Samsung 32 inch TV HD color hi pixel whatsoever. This could be a criterion for un-matching in the current dataset. However, the bulk of the items listed are small ones that don’t have a tangible org name associated with them, so success could be limited. Nonetheless this could work to some extent.

* We could have a whole range of ‘**CONTRASTS**’ lookup tables. We could have sex as one of the entries. An xxx perfume strong cologne fragrance for males could be different from an otherwise identical (text, image) product designed for females. We could have color contrasts: White Apple iPhone 10 256 GB versus Black Apple iPhone 10 256 GB. There could be several such interesting rows of contrasting criteria. A look-up table could hold these contrasting pairs, and then we could check that the two members of a pair do not appear separately in the two product titles being compared.

* Then, based on the product **CATEGORY** (yes, we may have to do some topic labelling as the train data does not have this), we could have some strong category-specific differentiators. For example, for edibles we could have ‘organic’ versus ‘natural’ versus ‘non-organic’, or just the absence of the word ‘organic’. So 100% grape juice Tropicana sweet real grapes should be different from 100% organic grape juice Tropicana sweet real grapes. Last I checked, the train data had about 250 organic products, so this could help to some extent. Based on the product category, other interesting contrasts can be identified and used to un-match products.

* We could also have **ALPHANUMERIC CODES**. For example, org1 LR220X-5A lighting lamp manita may not be equal to org1 LR220Y-5A lighting lamp manita. Or, to cover a simpler case, anti-wrinkle anti-uv cream SPF-250 is different from anti-wrinkle anti-uv cream SPF-50.

* Lastly we have the **QUANTITY**. This could be a very important indicator for un-matching. For sure - “L’Oreal Paris Total Repair 5 Shampoo Hair Care - 450 ml (Perawatan Rambut Rusak dan Mudah Patah)” is different from “L’Oreal Paris Total Repair 5 Shampoo Hair Care - 900 ml (Perawatan Rambut Rusak dan Mudah Patah)” even though the rest of the text and the entire image may be nearly the same.
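
To make the contrast look-up idea from the list above a bit more concrete, here is a minimal sketch. The groups in `CONTRAST_GROUPS` and the helper names `tokens` and `contrasting` are made up for illustration; a real table would need many more entries (and Indonesian terms too). Two titles are flagged only when each one contains a member of the same group and those members differ.

```python
import re

# Hypothetical contrast groups; each set holds mutually exclusive terms.
CONTRAST_GROUPS = [
    {'MEN', 'WOMEN'},
    {'WHITE', 'BLACK'},
]

def tokens(title):
    """Upper-case the title and split it into alphanumeric tokens."""
    return set(re.findall(r'[A-Z0-9]+', title.upper()))

def contrasting(title_a, title_b):
    """True if the titles carry different members of some contrast group."""
    ta, tb = tokens(title_a), tokens(title_b)
    for group in CONTRAST_GROUPS:
        a_hits, b_hits = ta & group, tb & group
        # Both titles mention the group, but with no term in common.
        if a_hits and b_hits and a_hits.isdisjoint(b_hits):
            return True
    return False

print(contrasting('White Apple iPhone 10 256 GB', 'Black Apple iPhone 10 256 GB'))
print(contrasting('Apple iPhone 10 256 GB', 'Black Apple iPhone 10 256 GB'))
```

The second call is not flagged because one title says nothing about color. That is a deliberate choice in this sketch: a missing term is not treated as a contrast.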

Let us see if we can leverage existing frameworks to quickly give out this kind of info

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd.set_option('display.max_colwidth', None)

train = pd.read_csv('../input/shopee-product-matching/train.csv')
train['titleUcase'] = train['title'].str.upper()

Let us use spaCy. Not sure if this can be called SOTA.

In [None]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

for item in train['title'].head(11):
    print('\n',item, '\n', [(x.text, x.label_) for x in nlp(item).ents if len(x.text)>0])

A bit disappointing, you would agree. Let us check if spaCy's multi-language model, which may cope better with the Indonesian text, does any better. We will have to download it first.

In [None]:
!python -m spacy download xx_ent_wiki_sm

import xx_ent_wiki_sm
nlp = xx_ent_wiki_sm.load()

In [None]:
for item in train['title'].head(11):
    print('\n',item, '\n', [(x.text, x.label_) for x in nlp(item).ents if len(x.text)>0])

No big difference.

Let us try to write some quick custom code to retrieve quantities (weight, volume, counts) from the title. Note that some rows could have more than one quantity.

In [None]:
unit = ['GR', 'GM', 'KG', 'KILO', 'MG', 'LITRE', 'ML', 'PC', 'INCH', 'YARD', 'CM', 'MM', 'METRE', 'MICRO', 'GB', 'MB', 'TB', 'KB', 'THN']

We could still add 'G' or 'L' and many more units. For a quick POC the above should suffice.

In [None]:
import re

def findunit(row):
    numunit=0
    for item in unit:
        quantity = re.findall(r'[0-9.-]+\s*' + item, row.titleUcase)

        if len(quantity) > 0:
            q1 = quantity[0].replace(item,"")
            ##if more than one same unit, consider just 1 more instance
            ##Also ensure no more than 4 entries against all units combined
            if ((len(quantity) > 1) and (numunit < 3)):
                q2 = quantity[1].replace(item,"")   
                train.loc[train.posting_id == row.posting_id,['quantity_'+str(numunit),'unit_'+str(numunit),'quantity_'+str(numunit+1),'unit_'+str(numunit+1)]]=q1,item,q2,item
                numunit+=2
            else:
                train.loc[train.posting_id == row.posting_id, ['quantity_' + str(numunit), 'unit_' + str(numunit)]] = q1, item   
                numunit+=1

        ##Return if 4 cols across all units is filled
        if numunit >= 4:
            return

##Not efficient but for 30K rows we should be ok. Takes about 2 mins
train.apply(findunit, axis = 1)

print('Number of records with units:',train['unit_0'].notna().sum())

A good 25% have units. This is sizeable
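
One thing to watch for: a plain substring unit such as 'GR' can also fire inside a longer word (for instance '2 GRATIS'). Below is a minimal sketch of a vectorized variant that uses `Series.str.extractall` with a trailing word boundary. The shortened `UNITS` list, the `PATTERN` name and the sample titles are assumptions for illustration only; this is not what produced the counts above.

```python
import pandas as pd

# Shortened, hypothetical unit list for the sketch.
UNITS = ['GR', 'KG', 'ML', 'GB']

# A number (digits, dots, dashes), optional spaces, then a unit that must end
# at a word boundary so that 'GR' does not match the start of 'GRATIS'.
PATTERN = r'(?P<quantity>[0-9][0-9.\-]*)\s*(?P<unit>' + '|'.join(UNITS) + r')\b'

titles = pd.Series(['SHAMPOO 450 ML', 'BELI 2 GRATIS 1', 'RAM 8GB SSD 256 GB'])

# One output row per (title index, match number).
found = titles.str.extractall(PATTERN)
print(found)
```

The trade-off is that the boundary also stops 'PC' from matching 'PCS', so plural forms would have to be listed explicitly.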

Let us aggregate these units into one column for a quick comparison later.

In [None]:
cols=['quantity_0','unit_0','quantity_1','unit_1','quantity_2','unit_2','quantity_3','unit_3']
train['quantam'] = train[cols].fillna('').agg(' '.join, axis=1)

##7 spaces means there is no unit. Replace string for readability
train['quantam'] = train['quantam'].replace('       ' , 'NO UNIT' )

In [None]:
print('Sample rows without unit:\n************************')
for row in train.head(20).itertuples():
    if row.quantam == 'NO UNIT':
        print(row.title)

In [None]:
print('Sample rows with unit:\n*********************')
for row in train.head(10).itertuples():
    if row.quantam != 'NO UNIT':
        print(row.title, '\n',row.quantam, '\n')

Looks ok. Now let us write some simple logic to un-match products based on the units. Let us say we are given a list containing the posting_ids.

In [None]:
##i/p = 1)base row 2)all units in that row 3)row to be matched against
##o/p = True if rows are matched else False
def compare(row1, row1units, row2):
    if str(row1.quantam) == str(row2.quantam):
        return True
    else:
        ##get all units in row2
        row2units=[]
        for i in range(4):
            if ~pd.isna(row2['unit_'+str(i)].values):
                row2units.append(str(row2['unit_'+str(i)].values))
                
        ##Proceed only if row1 & row2 dont have duplicate units
        if (len(set(row1units)) == len(row1units)) and (len(set(row2units)) == len(row2units)):
            ##No duplicate units. Iterate thru each unit of row 1
            ##if a match is found in row2, then quantities should be same
            for unit in row1units:
                for i in range(4):
                    if str(row2['unit_'+ str(i)].values) == unit:
                        ##units match. Get quantity from row1 (should have inputted a map)
                        for j in range(4):
                            if (str(row1['unit_'+ str(j)].values)) == unit:
                                row1quantity = row1['quantity_'+ str(j)].values[0]
                                row2quantity = row2['quantity_'+ str(i)].values[0]
                                ##add all cleanup and other processing here
                                ##str.replace returns a new string, so keep the result
                                row1quantity = row1quantity.replace(' ' , '')
                                row2quantity = row2quantity.replace(' ' , '')
                                if (len(re.findall('[0-9]', row1quantity)) == 0) | (len(re.findall('[0-9]', row2quantity))==0):
                                    continue
                                ##if quantity contains '-' do a string comparison else numeric
                                if ('-' in row1quantity) or ('-' in row2quantity):
                                    if row2quantity != row1quantity:
                                        return False
                                elif float(row2quantity) != float(row1quantity):
                                    return False
            ##no match in any unit. We can't say that quantities differ
            return True
        else:
            ##Duplicate units. Maybe this is a range. For now return True
            return True
        

##i/p = list of posting_ids with first being the ref ID
##o/p = updated list after removal of IDs that don't belong
def unmatch(ids):
    ##First ID is the reference row
    row1 = train.loc[train.posting_id == ids[0]]
    removeids = []
    if row1.quantam.values != 'NO UNIT':
        ##get all units in row1
        row1units=[]
        for i in range(4):
            if ~pd.isna(row1['unit_'+str(i)].values):
                row1units.append(str(row1['unit_'+str(i)].values))
        
        ##compare row1 with all the other rows
        for i in range(1,len(ids)):
            row2 = train.loc[train.posting_id == ids[i]]
            if not(compare(row1,row1units, row2)):
                removeids.append(i)
    return [v for i, v in enumerate(ids) if i not in removeids]
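
The comment inside compare() notes that a map would have been the better input. Here is a minimal sketch of that idea, assuming each row has already been turned into a dict of unit to quantity string. The name `quantities_conflict` is hypothetical, and note the flipped convention: it returns True when the rows should be un-matched, whereas compare() returns True when they still match.

```python
def quantities_conflict(q1, q2):
    """True if both rows share a unit but disagree on its quantity."""
    for unit in q1.keys() & q2.keys():
        a, b = q1[unit].strip(), q2[unit].strip()
        if '-' in a or '-' in b:
            # Ranges or codes such as '10-20': fall back to a string comparison.
            if a != b:
                return True
        else:
            try:
                if float(a) != float(b):
                    return True
            except ValueError:
                # Not a clean number (e.g. '.' or '1.2.3'); we cannot decide.
                continue
    # No shared unit disagrees, so we can't say the quantities differ.
    return False

print(quantities_conflict({'ML': '450 '}, {'ML': '900'}))
print(quantities_conflict({'ML': '450'}, {'GR': '10'}))
```

Unlike the version above, this sketch keeps only one quantity per unit, so titles that repeat a unit would need a separate rule.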

Time to check if what we wrote works. Let us take the output of Chris's starter kit where he uses Rapids to categorize labels using an unsupervised approach. It is a simple, effective notebook and achieves a good score of 70+. Chris has also introduced the Rapids framework here and it is blazingly fast. Thanks Chris/Nvidia for this nice framework. Do check it out in case you have not already done so.

In [None]:
out = pd.read_csv('../input/chris-rapids/submission_Chris.csv')
out.head(4)

The 'matches' look to be sorted. Let us merge the 2 columns. This way the 1st ID in the merged list is the reference ID. We will use set() to remove the duplicates.

In [None]:
out['ids'] = out.posting_id + ' ' + out.matches
out.head(4)

In [None]:
def removedup(inp):
    temp = list()
    op = [x for x in inp if not (x in temp or temp.append(x))]
    return op
    
out['postprocess'] = out.apply(lambda x: unmatch(removedup(x.ids.split(' '))), axis=1)

Unfortunately, it looks like set() does not maintain order, and we want our first item to be the reference posting_id. So we create a small function removedup() which preserves order: for each x, we add x to the output if x is not in the list 'temp'. Each time we also append x to 'temp', so if a duplicate turns up it won't get added. We rely on append always returning None. It is an ugly hack, but it suffices for our quick investigation here.
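
For what it is worth, a less hacky way to get the same order-preserving de-duplication is `dict.fromkeys`, since dicts keep insertion order from Python 3.7 onwards. This is a small sketch with a hypothetical helper name (`removedup_ordered`) and made-up IDs; it is not what was run above.

```python
def removedup_ordered(inp):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(inp))

ids = 'train_1 train_1 train_7 train_3 train_7'.split(' ')
print(removedup_ordered(ids))
```

The first element stays first, which is all unmatch() needs from this step.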

We feed this to our unmatch() function. It takes a couple of mins to run

Let us now check the new CV. We use below snippet from Chris's original code. The original CV score is 0.7248. The kernel is at: https://www.kaggle.com/cdeotte/part-2-rapids-tfidfvectorizer-cv-0-700

In [None]:
def getMetric(col):
    def f1score(row):
        n = len( np.intersect1d(row.target,row[col]) )
        return 2*n / (len(row.target)+len(row[col]))
    return f1score

tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
train['target'] = train.label_group.map(tmp)
train['matches'] = out['matches']
train['newmatches'] = out['postprocess']
train['f1'] = train.apply(getMetric('newmatches'),axis=1)

print('New CV Score =', train.f1.mean() )
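
To see how un-matching moves this per-row F1, here is a tiny worked example with made-up IDs. The helper `row_f1` is a hypothetical stand-alone version of the f1score function above: it uses the same formula, 2n / (len(target) + len(predicted)), where n is the size of the overlap.

```python
import numpy as np

def row_f1(target, predicted):
    """Per-row F1: 2 * |overlap| / (|target| + |predicted|)."""
    n = len(np.intersect1d(target, predicted))
    return 2 * n / (len(target) + len(predicted))

target = ['a', 'b', 'c']
before = ['a', 'b', 'd', 'e']   # two correct IDs, two wrong ones
after = ['a', 'b', 'd']         # un-matching removed the wrong ID 'e'

print(round(row_f1(target, before), 3))  # 0.571
print(round(row_f1(target, after), 3))   # 0.667
```

Removing a wrong ID shrinks the denominator and lifts the row's score, while removing a correct ID shrinks the overlap as well and pulls it down.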

The new score is 0.7255... a modest improvement. Should we be disappointed? I guess not. For one, we may see more success on the test data. Secondly, this feature could be beneficial in other ways, e.g. if we are short on processor time, it can be used for quick checks to derive smaller clusters. More importantly, when this AI is applied to the retail business domain, there is simply no way one can avoid using this important feature. As discussed, this is the **second most important signal in differentiating products (the first would be the price range)**.

For curiosity, let us check out the differences

In [None]:
train['differs'] = train.apply(lambda x: list(set(x.matches.split(' ')).difference(x['newmatches'])), axis=1)

In [None]:
differences = []
train.apply(lambda x: [differences.append(x.posting_id) if len(x.differs)>0 else None], axis=1)
len(differences)

Hmm. That is a fair bit of difference: about 1000 rows, with potentially more than one difference in each row. This is a sizeable chunk. Let us investigate. We compare all rows that are returned as un-matched by our code above. Let us print each pair of titles side by side along with the label groups. Ideally these should have different label groups, which increases our score. If they have the same label group, our CV decreases and we should investigate why they were un-matched.

In [None]:
cols_needed = ['quantam', 'matches', 'newmatches', 'differs']
print('First row: Base row Title and label group which has a wrong prediction against it')
print('Second row: This is mis-classified (mis-clustered?) row paired to base row above')
print('Label groups should be different for above 2 rows. If they are same, we investigate')
print('***********************************************************************************\n')

yay = 0   ##num instances where our code is correct. 
gosh = 0  ##num instances where our code is incorrect. Pl dont use such names in prod systems :)

for item in differences[:9]:
    diff_ids = train[(train.posting_id==item)]['differs'].to_numpy(copy=True)
    for ids in diff_ids[0]:
        row1 = train.loc[train.posting_id==item]
        row2 = train.loc[train.posting_id==ids]
        if row1.label_group.values[0] != row2.label_group.values[0]:
            yay+=1
        else:
            gosh+=1
        print(row1[['title', 'label_group']].values, '\n', row2[['title', 'label_group']].values, '\n')
        
print('yay=',yay,'gosh=',gosh)

Interesting. For the first 10 mismatches (out of 1008) we have 22 yays and 4 goshes. Let us analyze the 4 goshes to see what is happening.  
In the case of label group 2660605217, the text scraped by Shopee seems to have picked up some text from the next item, so it is noise. Then we have 3 cases where the quantities are clearly different though the rest of the text is the same. No way these can be the same product. More noise.

So our un-match logic has a 100% success rate as far as these 10 records are concerned. If so, our CV should have scored much better. If we go all the way and do it for the entire train data we get **1684 yays and 516 gosh's**. So there is a fair bit of correct post-processing, and we need to investigate why the CV is not going up further. Maybe this weekend I will do that. Also, our code is pretty basic and we haven't taken care of any edge conditions or looked at improving speed, elegance etc. Definitely, many things can be improved.

This idea (and a few others, in particular generating more training records from this limited data, especially for products with few samples) has been at the back of my mind for a while now, but work constraints didn't allow me to put pen to paper (or pressure to the keyboard keys) until now. I could see some discussions related to this as well. I believe it was Chris who referred to the importance of numerical values in one of the discussions a couple of weeks back, followed by another recent discussion by Sarthak. In fact un-matching is an entire post-processing (PP) piece by itself and covers a lot more than prices or numericals (see my discussion link on this). The results seem promising, with a slight improvement in CV, and ideally the test scores too should improve with PP. I don't have time to submit or participate in this comp, but if anyone gets a better test score, please do share the results and your observations.

Also, this PP exposes the noisiness of the train data. There are pairs that should not have been matched under one label group but have been. I hope the test data is much cleaner.

How else can un-matching be leveraged? Well, this is a time-constrained comp, so if we are comparing a row against every other row in the test DB, it could consume vast amounts of time. The ‘un-matching’ criteria could be used to generate clusters beyond which matching may not be required. This should reduce the time taken, which could then be used for other activities.
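
A rough sketch of that idea, assuming each posting already has a quantity signature (like the 'quantam' column) or None when no unit was found. The name `candidate_pairs` and the sample data are hypothetical. The cheap signature check runs first, so the expensive text or image comparison is only attempted on the surviving pairs.

```python
from itertools import combinations

def candidate_pairs(signatures):
    """Yield ID pairs that the quantity signature does not rule out."""
    for a, b in combinations(sorted(signatures), 2):
        sa, sb = signatures[a], signatures[b]
        # Keep the pair if either side has no unit, or the signatures agree.
        if sa is None or sb is None or sa == sb:
            yield a, b

signatures = {'p1': '450 ML', 'p2': '900 ML', 'p3': None, 'p4': '450 ML'}
pairs = list(candidate_pairs(signatures))
print(len(pairs), 'of 6 pairs survive:', pairs)
```

This is stricter than compare() above, which only un-matches when the same unit carries different quantities. It also still loops over every pair; grouping postings by signature up front would avoid that, at the cost of handling the no-unit rows separately.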

All said and done, even if it does not matter to the comp scores, this seems to me to be the right way to approach the business problem. This is one of those domains where the model needs to be confident in its predictions. If it is not very confident, it is ok not to predict (and leave the pair to be manually labelled as similar or dissimilar). Special efforts need to be taken to eliminate false positives, and hence the un-matching logic could hold the key to actual applications of this tech. In fact it is interesting that we get a few rows in the train data which are pretty obviously mis-grouped. The use of un-matching could have prevented that easily.

Shopee matching is an insanely interesting competition. If there is one competition that has it all - image, text, OCR, multiple approaches to tackle - it would be this. I am sure we will have an exciting finish. This is also a very relevant industry problem and I would like to thank the organisers and Kaggle staff for hosting this competition. The kernels by ragnar and chris and the discussion topics by rohitsingh & tensorgirl & knownothing and several others have been very useful and educational for me.