This notebook is an attempt to explain the mismatch between local CV, Public LB and (likely) Private LB scores. Any advice or suggestions are welcome.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
X_test = pd.read_csv('../input/test.csv')
X_test = X_test.drop(["ID"], axis=1)

In [None]:
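# First count: extra copies beyond the first occurrence (keep='first' is the default).
# Second count: every row that has at least one identical twin (keep=False).
print('Number of duplicates among test set: {}, {}'
      .format(X_test.duplicated().sum(), X_test.duplicated(keep=False).sum()))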

In [None]:
print ('Percentage of duplicates among test set:')
print ('{:.2%}'.format(X_test.duplicated().sum()/len(X_test)))
print ('{:.2%}'.format(X_test.duplicated(keep=False).sum()/len(X_test)))

The test set clearly contains duplicates. Comparing the two counts (without and with **keep=False**) indicates that a duplicated row may have not just one copy, but two, three or even more: if every duplicate were a simple pair, the second count would be exactly twice the first.
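To make the two counts concrete, here is a minimal sketch on a made-up frame. The column names `a` and `b` and all values are hypothetical and have nothing to do with the contest data; the point is only how the two `duplicated()` calls relate.

```python
import pandas as pd

# Toy frame (hypothetical values): one row appears three times, another twice.
toy = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 3],
    "b": [0, 0, 0, 5, 5, 7],
})

# Copies beyond the first occurrence of each row (default keep='first').
extra_copies = int(toy.duplicated().sum())
# Every row that belongs to a group of identical rows.
rows_in_groups = int(toy.duplicated(keep=False).sum())

# Number of distinct rows that occur more than once.
n_groups = rows_in_groups - extra_copies
# With pairs only, rows_in_groups would equal 2 * extra_copies.
has_larger_groups = rows_in_groups < 2 * extra_copies

print(extra_copies, rows_in_groups, n_groups, has_larger_groups)
```

The same arithmetic can be applied to the two numbers printed for `X_test` to estimate how many distinct rows are repeated.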

Let's consider some distribution with the same fraction of duplicates and two models which approximate this distribution pretty well.

In [None]:
valid = list(range(100, 136)) + [136, 136, 137, 137, 137]
model_1 = list(range(101, 137)) + [136.7, 136.7, 137.5, 137.5, 137.5]
model_2 = [i + 0.5 for i in range(100, 136)] + [138.5, 138.5, 139, 139, 139]

In [None]:
print ('Percentage of duplicates among valid set:')
print ('{:.2%}'.format((3/len(valid))))
print ('{:.2%}'.format((5/len(valid))))

In [None]:
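plt.plot(valid, 'b.', label='valid')
plt.plot(model_1, 'r.', label='model_1')
plt.plot(model_2, 'g.', label='model_2')
plt.ylim(95, 145)  # positional limits; the ymin/ymax keywords fail in newer matplotlib
plt.legend()
plt.show()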

Both models clearly approximate our "true" distribution well. However, **model_1** has the larger errors on the non-duplicated values, whereas **model_2** has the larger errors on the duplicated ones.
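The statement about which model errs where can be checked directly. The sketch below redefines the three lists so that it runs on its own; the helper `mae` is a name introduced here for illustration, and mean absolute error is just one convenient way to compare the two parts.

```python
valid = list(range(100, 136)) + [136, 136, 137, 137, 137]
model_1 = list(range(101, 137)) + [136.7, 136.7, 137.5, 137.5, 137.5]
model_2 = [i + 0.5 for i in range(100, 136)] + [138.5, 138.5, 139, 139, 139]


def mae(y_true, y_pred):
    """Mean absolute error of two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# The first 36 entries are unique values, the last five are the duplicated ones.
for name, pred in [("model_1", model_1), ("model_2", model_2)]:
    print("{}: unique part {:.2f}, duplicated part {:.2f}".format(
        name, mae(valid[:-5], pred[:-5]), mae(valid[-5:], pred[-5:])))
```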

Let's find out how our models perform on **whole valid set**.

In [None]:
from sklearn.metrics import r2_score

In [None]:
print ('Calculating r2 score for both models:')
print ('Model_1: {:.2%}'.format(r2_score(valid, model_1)))
print ('Model_2: {:.2%}'.format(r2_score(valid, model_2)))

Both models show a high and almost identical R² score on the whole set. Let's move further and look at how the models perform if we slice our data according to the contest rules.

First of all, let's consider two extreme cases: all duplicated values fall either within the Public LB subset or within the Private LB subset. Neither is likely to occur, but both are instructive. Here LB refers to the **Public LB score** and PB refers to the **Private LB score**.

I use only eight values for the Public subset to mimic the split we have in this contest (i.e. 8/41 ≈ 19.5%).

In [None]:
valid_LB1 = valid[0:8]
model_1_LB1 = model_1[0:8]
model_2_LB1 = model_2[0:8]

valid_PB1 = valid[8:]
model_1_PB1 = model_1[8:]
model_2_PB1 = model_2[8:]

In [None]:
print ('Public LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_LB1, model_1_LB1), r2_score(valid_LB1, model_2_LB1)))
print ('Private LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_PB1, model_1_PB1), r2_score(valid_PB1, model_2_PB1)))

Here we get the first interesting result. If there are no duplicated values in the Public LB subset, there is a heavy mismatch between the Public and Private LB scores. Would you choose **model_1** based on the Public LB? I bet you wouldn't, yet it is the model that scores higher on the Private subset.

In [None]:
valid_LB2 = valid[-8:]
model_1_LB2 = model_1[-8:]
model_2_LB2 = model_2[-8:]

valid_PB2 = valid[:-8]
model_1_PB2 = model_1[:-8]
model_2_PB2 = model_2[:-8]

In [None]:
print ('Public LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_LB2, model_1_LB2), r2_score(valid_LB2, model_2_LB2)))
print ('Private LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_PB2, model_1_PB2), r2_score(valid_PB2, model_2_PB2)))

Another interesting result. Judging by the Public LB, it would be better not to use **model_2** at all: it performs worse than simply predicting the average value. However, once the Private LB is revealed we might be very sad about not having chosen it.
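A score worse than the simple average shows up as a negative R². The sketch below recomputes the score by hand from the usual definition, R² = 1 − SS_res / SS_tot, for the last eight values; `r2_by_hand`, `truth` and `pred_2` are names introduced here for illustration. Because the true values in this subset only span 133 to 137, SS_tot is small, and the larger errors of **model_2** on the duplicated values push SS_res above it.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Last eight entries of the toy validation set and of model_2.
truth = [133, 134, 135, 136, 136, 137, 137, 137]
pred_2 = [133.5, 134.5, 135.5, 138.5, 138.5, 139, 139, 139]

# Predicting the subset mean everywhere gives exactly zero.
baseline = [sum(truth) / len(truth)] * len(truth)

print("model_2: {:.2%}".format(r2_by_hand(truth, pred_2)))
print("mean baseline: {:.2%}".format(r2_by_hand(truth, baseline)))
```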

These cases are unlikely to happen, since a "random" split of the test set will hardly ever look like either of them. Now let's look at more realistic splits of our data.

In [None]:
valid_LB3 = valid[:5] + valid[-3:]
model_1_LB3 = model_1[:5] + model_1[-3:]
model_2_LB3 = model_2[:5] + model_2[-3:]

# Private subset: everything that is not in the Public subset.
valid_PB3 = valid[5:-3]
model_1_PB3 = model_1[5:-3]
model_2_PB3 = model_2[5:-3]

In [None]:
print ('Public LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_LB3, model_1_LB3), r2_score(valid_LB3, model_2_LB3)))
print ('Private LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_PB3, model_1_PB3), r2_score(valid_PB3, model_2_PB3)))

In this case three duplicated values are in the Public LB subset. They all share the same ground truth (i.e. all three rows are identical). The other five rows come from the non-duplicated part. And again we get a mismatch: if we rely on the Public score, we will choose the worse model.

Now let's consider another split.

In [None]:
valid_LB4 = valid[:5] + valid[-5:-2]
model_1_LB4 = model_1[:5] + model_1[-5:-2]
model_2_LB4 = model_2[:5] + model_2[-5:-2]

valid_PB4 = valid[5:-5] + valid[-2:]
model_1_PB4 = model_1[5:-5] + model_1[-2:]
model_2_PB4 = model_2[5:-5] + model_2[-2:]

In [None]:
print ('Public LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_LB4, model_1_LB4), r2_score(valid_LB4, model_2_LB4)))
print ('Private LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_PB4, model_1_PB4), r2_score(valid_PB4, model_2_PB4)))

Again three duplicates are in the Public LB subset. This time two of them share one ground truth (i.e. they are identical) and the third one has its copies only in the Private LB subset. Again we might see a mismatch between the Public LB and Private LB scores.

In [None]:
valid_LB5 = valid[:5] + valid[-5:-4] + valid[-3:-1]
model_1_LB5 = model_1[:5] + model_1[-5:-4] + model_1[-3:-1]
model_2_LB5 = model_2[:5] + model_2[-5:-4] + model_2[-3:-1]

valid_PB5 = valid[5:-5] + valid[-4:-3] + valid[-1:]
model_1_PB5 = model_1[5:-5] + model_1[-4:-3] + model_1[-1:]
model_2_PB5 = model_2[5:-5] + model_2[-4:-3] + model_2[-1:]

In [None]:
print ('Public LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_LB5, model_1_LB5), r2_score(valid_LB5, model_2_LB5)))
print ('Private LB scores. Model_1: {:.2%}, Model_2: {:.2%}'.format(r2_score(valid_PB5, model_1_PB5), r2_score(valid_PB5, model_2_PB5)))

Finally, the last example: duplicates are distributed among the subsets as in a simple stratification (i.e. every group of duplicated values is present in both the Public and the Private subset). And here again some mismatch occurs.
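The hand-picked splits above can be complemented by random ones. The following is a rough sketch, not a faithful model of the contest: it draws eight Public indices at random many times and counts how often the model that wins on the Public subset loses on the Private one. The helper names, the seed and the number of trials are arbitrary choices made here.

```python
import random

valid = list(range(100, 136)) + [136, 136, 137, 137, 137]
model_1 = list(range(101, 137)) + [136.7, 136.7, 137.5, 137.5, 137.5]
model_2 = [i + 0.5 for i in range(100, 136)] + [138.5, 138.5, 139, 139, 139]


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def score(model, idx):
    """R^2 of a model restricted to the given row indices."""
    return r2([valid[i] for i in idx], [model[i] for i in idx])


def public_private_disagree(public_idx):
    """True if the Public winner is not the Private winner."""
    public = set(public_idx)
    private_idx = [i for i in range(len(valid)) if i not in public]
    pub_1_wins = score(model_1, public_idx) > score(model_2, public_idx)
    priv_1_wins = score(model_1, private_idx) > score(model_2, private_idx)
    return pub_1_wins != priv_1_wins


rng = random.Random(0)  # fixed seed so that the run is repeatable
n_trials = 2000
flips = sum(public_private_disagree(rng.sample(range(len(valid)), 8))
            for _ in range(n_trials))
print("Public and Private rankings disagree in {:.1%} of random splits"
      .format(flips / n_trials))
```

Calling `public_private_disagree(list(range(8)))` reproduces the first extreme case, where no duplicates land in the Public subset.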

Conclusion
----------
In this notebook I tried to explain why I believe this contest is more like a lottery: we don't know how the duplicated rows (or other observations with relatively high errors) are actually distributed between the Public and Private LB subsets of the test data. That's why some random model might turn out to be the winning one.

I know that the ideas presented here are quite naive. It is unlikely that errors are constant across all non-duplicated values, or that they are strictly higher or lower by a constant for each group of duplicated values. In the real dataset the relations between values will be much more complicated. However, I believe this example is enough to explain the mismatch many participants see between their local CV and Public LB scores, as well as a possible shake-up on the Private LB.

P.S. Stay tuned. I will probably (not sure yet) present some ideas on why you shouldn't rely too much on your local CV either.