### A simple utility notebook for comparing submissions from different models on the test dataset

For instance, here we compare two BERT model submissions:
- <code>bert_47_submission.csv</code>: submission based on a fine-tuned pretrained base BERT model
- <code>bert_ling_submission.csv</code>: submission based on a fine-tuned pretrained multilingual BERT model (104 languages)

In [1]:
import pandas as pd

In [2]:
test = pd.read_csv('data/test.csv')

In [3]:
sumbission_bert = pd.read_csv('submissions/bert_47_submission.csv')
sumbission_bert.is_duplicate.value_counts()

0    212685
1       564
Name: is_duplicate, dtype: int64

In [4]:
bert_dup = sumbission_bert[sumbission_bert.is_duplicate == 1]

In [5]:
sumbission_langbert = pd.read_csv('submissions/bert_ling_submission.csv')
sumbission_langbert.is_duplicate.value_counts()

0    212606
1       643
Name: is_duplicate, dtype: int64

In [6]:
bert_langdup = sumbission_langbert[sumbission_langbert.is_duplicate == 1]

Build the combined set of pair IDs that at least one of the two models flagged as a duplicate, for comparison

In [7]:
pair_ids = list(set(bert_dup.pair_id) | set(bert_langdup.pair_id)) 
len(pair_ids)

847
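
The union can also be broken down into pairs flagged by both models and pairs flagged by only one of them. With the counts above (564 and 643 flagged, 847 in the union), inclusion-exclusion gives 564 + 643 - 847 = 360 pairs flagged by both, assuming each pair_id appears at most once per submission. A minimal sketch on toy data; the frames `toy_bert_dup` and `toy_lang_dup` and their pair_id values are hypothetical stand-ins for `bert_dup` and `bert_langdup`:

```python
import pandas as pd

# Hypothetical stand-ins for bert_dup and bert_langdup (made-up pair_id values).
toy_bert_dup = pd.DataFrame({"pair_id": [2, 5, 7], "is_duplicate": [1, 1, 1]})
toy_lang_dup = pd.DataFrame({"pair_id": [5, 7, 9, 11], "is_duplicate": [1, 1, 1, 1]})

ids_bert = set(toy_bert_dup.pair_id)
ids_lang = set(toy_lang_dup.pair_id)

both = ids_bert & ids_lang        # flagged by both models
only_bert = ids_bert - ids_lang   # flagged by the base model only
only_lang = ids_lang - ids_bert   # flagged by the multilingual model only
union = ids_bert | ids_lang       # same construction as pair_ids above

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert len(union) == len(ids_bert) + len(ids_lang) - len(both)

print(len(both), len(only_bert), len(only_lang), len(union))  # 2 1 2 5
```

The two "only" sets are exactly the pairs on which the models disagree.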

In [8]:
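# Select rows by pair_id value; iloc would treat the IDs as row positions,
# which are off by one here (row 19 holds pair_id 20)
test_dup = test[test.pair_id.isin(pair_ids)].sort_values('pair_id', ascending=True)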

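Outer-join the test dataset with both sets of flagged pairs on the pair_id key, then keep only the rows that at least one model flagged as a duplicate (missing flags are filled with 0)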

In [9]:
df = pd.merge(
    pd.merge(test, bert_dup, how='outer', on='pair_id'),
    bert_langdup,
    how='outer',
    on='pair_id',
    suffixes=('_bert', '_lang'),
)
test_diff = df[(df.is_duplicate_bert == 1) | (df.is_duplicate_lang == 1)].fillna(0)
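
To see what the double join produces, here is a small sketch on toy data. All names and values in `toy_test`, `toy_bert_dup`, and `toy_lang_dup` are hypothetical; only the merge pattern mirrors the cell above. The `suffixes` argument takes effect in the second merge, where both sides carry an `is_duplicate` column.

```python
import pandas as pd

# Hypothetical miniature of the test set and the two filtered submissions.
toy_test = pd.DataFrame({
    "pair_id": [1, 2, 3, 4],
    "name_1": ["A Ltd.", "B Inc.", "C GmbH", "D S.A."],
    "name_2": ["A Limited", "X Corp.", "C AG", "Y LLC"],
})
toy_bert_dup = pd.DataFrame({"pair_id": [1, 3], "is_duplicate": [1, 1]})
toy_lang_dup = pd.DataFrame({"pair_id": [1, 4], "is_duplicate": [1, 1]})

# First join adds the base model's flag; pairs it did not flag get NaN.
merged = pd.merge(toy_test, toy_bert_dup, how="outer", on="pair_id")
# Second join adds the multilingual model's flag; the clashing
# is_duplicate columns are renamed with the given suffixes.
merged = pd.merge(merged, toy_lang_dup, how="outer", on="pair_id",
                  suffixes=("_bert", "_lang"))

# Keep pairs flagged by at least one model and turn missing flags into 0.
flagged = merged[(merged.is_duplicate_bert == 1)
                 | (merged.is_duplicate_lang == 1)].fillna(0)
print(flagged)
```

In this toy case pair 2 is dropped because neither model flagged it, while pairs 3 and 4 each end up with one flag set to 1.0 and the other filled with 0.0.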

Select and display only the rows where the two models disagree. This helps to see where each model makes mistakes and to guide adjustments, e.g. to data preprocessing.

In [10]:
with pd.option_context('display.max_rows', 500):
    display(test_diff[test_diff.is_duplicate_bert != test_diff.is_duplicate_lang])

| | pair_id | name_1 | name_2 | is_duplicate_bert | is_duplicate_lang |
|---|---|---|---|---|---|
| 19 | 20 | OMV Aktiengesellschaft (WBAG:OMV) | SOPREMA CASTELLBISBAL | 1.0 | 0.0 |
| 285 | 286 | Ineos Styrolution India Ltd. | Styrolution America Llc | 0.0 | 1.0 |
| 981 | 982 | Sanyo Energy(Beijing) Co. Ltd. | Sanyo E & E S.A. De C.V. | 1.0 | 0.0 |
| 1123 | 1124 | Trelleborg Engineered Products | Pt Trelleborg Indonesia | 1.0 | 0.0 |
| 1717 | 1718 | Goodyear Canada Inc. | Goodyear Tire & Rubber Co. | 0.0 | 1.0 |
| 1910 | 1911 | Flexport Inc. | Alaska Flexo Private Ltd. | 0.0 | 1.0 |
| 1981 | 1982 | K Flex Usa | Flexseals Inc. | 0.0 | 1.0 |
| 2117 | 2118 | Sojitz (Shanghai) Co. L | Sojitz (Shanghai) Co. | 0.0 | 1.0 |
| 2859 | 2860 | Avik Polychem | Alliance Polychem | 0.0 | 1.0 |
| 3586 | 3587 | COLAS CZ, a.s. | COLAS POLSKA | 1.0 | 0.0 |
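
Before reading the disagreements row by row, it can help to count them by direction. The sketch below uses a hypothetical frame `toy_diff` with the same two flag columns as `test_diff`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for test_diff, with flags already filled with 0/1.
toy_diff = pd.DataFrame({
    "pair_id": [1, 3, 4, 6],
    "is_duplicate_bert": [1.0, 1.0, 0.0, 0.0],
    "is_duplicate_lang": [1.0, 0.0, 1.0, 1.0],
})

# Rows where the two models disagree, as in the cell above.
disagree = toy_diff[toy_diff.is_duplicate_bert != toy_diff.is_duplicate_lang]

# Split the disagreements by which model raised the flag.
n_only_bert = int((disagree.is_duplicate_bert == 1).sum())
n_only_lang = int((disagree.is_duplicate_lang == 1).sum())
print(n_only_bert, n_only_lang)  # 1 2

# The same information as a 2x2 contingency table over all flagged pairs.
table = pd.crosstab(toy_diff.is_duplicate_bert, toy_diff.is_duplicate_lang)
print(table)
```

The off-diagonal cells of the table are the two kinds of disagreement; the bottom-right cell counts pairs that both models flagged.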
