&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src='https://github.com/Daniboy370/Uploads/blob/master/meme_0.png?raw=true' width=400 />

In [None]:
import os
import math
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.style as style
import matplotlib.pyplot as plt

PATH_root = '/kaggle/input/' # --> '/world-data-by-country-2020'
# os.chdir(PATH_root) # os.listdir()

df = pd.read_csv( os.path.join(PATH_root, 'competitions-shakeup', 'df_Elo.csv') )

### $\text{Objective}$
This kernel delivers a simple and intuitive analysis of the participants' re-ranking in the ***Quora Insincere Questions Classification*** [[**link**](https://www.kaggle.com/c/quora-insincere-questions-classification)] competition. It attempts to better explain the underlying trends of the **shaking** phenomenon we are all subjected to in any DS competition.

#### $\text{Shake-what ?!}$


**Shake** ( $\Delta LB$ ) occurs when the competition shifts between two different datasets :

$$
\text{Public test set} \ \Rightarrow \ \text{Private test set} \quad \Leftrightarrow \quad LB_{public} \ \Rightarrow \ LB_{private}
$$

The private test set, unavailable until then, becomes available, and the models' scores are recalculated. This re-evaluation triggers a corresponding **re-ranking** of the contestants in the competition. The shake lets participants assess how severely their models overfit the public dataset; since the private scores are revealed only once the competition has closed, the lesson serves the next competition rather than the current one.

Having found no conventional term for this mechanism, I define it as follows, where $LB$ denotes the leaderboard rank ( lower is better ) :

$$
\Delta LB := -( LB_{private} - LB_{public} ) \quad \Rightarrow \quad \begin{cases}
\Delta LB < 0 \quad : \quad \text{Shake-down} \\
\Delta LB = 0 \quad : \quad \text{No shake} \\
\Delta LB > 0 \quad : \quad \text{Shake-up}
\end{cases}
$$
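
The definition above can be sketched in a few lines. This is only an illustration: it assumes the column names `Rank_public` and `Rank_private` used later in this kernel, and the helper `add_shake` and the toy leaderboard are hypothetical, not part of the dataset.

```python
import numpy as np
import pandas as pd

def add_shake(ranks: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `ranks` with 'Shake' and 'Category' columns added."""
    out = ranks.copy()
    # Delta LB = -(private - public): a positive value means the rank improved
    out['Shake'] = out['Rank_public'] - out['Rank_private']
    # np.sign maps each shake to -1 / 0 / +1, which selects one of three labels
    labels = {-1: 'Shake-down', 0: 'No shake', 1: 'Shake-up'}
    out['Category'] = np.sign(out['Shake']).map(labels)
    return out

# Toy leaderboard (hypothetical values, not taken from the competition)
toy = pd.DataFrame({'Rank_public': [1, 2, 3, 4], 'Rank_private': [3, 2, 1, 4]})
toy = add_shake(toy)

# Share of each category in percent
shares = toy['Category'].value_counts(normalize=True) * 100
print(toy)
print(shares)
```

With ranks, "lower is better", so moving from public rank 3 to private rank 1 gives $\Delta LB = +2$, a shake-up.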

### $\text{ Shake-up vs. Shake-down }$

Let us first examine the ratio between the three possible categories :

In [None]:
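df_dn = df[df['Shake'] < 0]['Shake']    # shake-down only
df_up = df[df['Shake'] >= 0]['Shake']   # shake-up, including the "no shake" (zero) rows
df_stats = df['Shake'].describe()
# describe() is indexed by label; positional access such as df_stats[0] is deprecated in recent pandas
df_len, df_mean, df_std = df_stats['count'], df_stats['mean'], df_stats['std']
df_min, df_median, df_max = df_stats['min'], df_stats['50%'], df_stats['max']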
labels = ['Shake-up', 'No-Shake', 'Shake-down']

# --------- Mean and median extraction -------- #
df_up_g = df[df['Shake']>0]['Shake']
df_eq = df[df['Shake']==0]['Shake']
df_up_mean, df_up_median = np.mean( df_up ), np.median( df_up )
df_dn_mean, df_dn_median = np.mean( df_dn ), np.median( df_dn )

df_dn_len, df_eq_len, df_up_g_len = len(df_dn), len(df_eq), len(df_up_g)
sizes = pd.Series([df_up_g_len, df_eq_len, df_dn_len])*100/df_len

explode = (0.02, 0.2, 0.02)
colors = ['#66b3ff','#99ff99', '#ff9999']

fig, ax = plt.subplots( figsize=(8, 8) )
ax.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.2f%%', shadow=True, startangle=-110, 
        labeldistance=1.1, pctdistance=0.6, radius=1, textprops={'fontsize': 20})

# Equal aspect ratio ensures that pie is drawn as a circle
ax.axis('equal')
plt.tight_layout()
plt.show()

In this specific competition, the majority of the shakes turned out to be positive. A very small portion, however, managed to keep their ranking untouched.

In [None]:
print('Total participants : ', int(df_len))
print('Range : [', int(df_min),',', int(df_max),']')
print(f'Mean : {df_mean:.2f}')
print('Median : ', int(df_median))
print(f'STD : {df_std:.2f}')

Despite the **HUGE** range spanned by the **re-rankings**, both the mean and the median lie rather close to the "0" center.

### $\Delta LB \, \text{ Prevalence}$


Starting with a basic histogram over the distribution range :

In [None]:
rug_kws={"color": "g"}
kde_kws={"color": "k", "lw": 3, "label": "KDE"}
hist_kws={"linewidth": 4, "alpha": 0.75, "color": "g"}
sns.set(style="white", palette="muted", color_codes=True)

bins_num, df_len = 50, len(df)
plt.figure(figsize=(15, 6))

ax = plt.gca()
ax.set_facecolor('#fffccc')
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')
sns.distplot(df['Shake'], color="blue", kde=False, hist=True, bins=bins_num, label='Shake', kde_kws=kde_kws, rug_kws=rug_kws, hist_kws=hist_kws)

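# Bin width = covered range / number of bins (not number of participants / number of bins)
bin_width = (df['Shake'].max() - df['Shake'].min()) / bins_num
plt.title('Shake Histogram ( bin-width = '+str(int(bin_width))+' places )', fontsize = 20)
plt.grid(linestyle='-', linewidth=1.25)
plt.xlabel(r'$\Delta LB$', fontsize = 20)  # raw string avoids the invalid "\D" escape warning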
plt.ylabel('Frequency', fontsize = 20)
plt.xlim([-500, 500])
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.axvline(x=df_mean, label='Mean', c='k', linestyle='dashed', linewidth=3.5)
plt.axvline(x=df_median, label='Median', c='k', linestyle='-', linewidth=2.5)
plt.legend(fontsize = 18)
plt.show()

At a glance, it is already clear that the **shakes** are most densely concentrated in the range of : $ \, | \Delta LB | \leq 50 \, $ .

#### $\text{Probability Density Function ( in percentiles )}$

By normalizing the above histogram we get an estimate of its probability density function (**PDF**), which presents the relative likelihood of each value of the random variable $\Delta LB$ :

In [None]:
rug_kws={"color": "g"}
kde_kws={"color": "k", 'linestyle':'--', "lw": 4, "label": "KDE"}
hist_kws={"linewidth": 3.5, "alpha": 0.75, "color": "g"}
sns.set(style="white", palette="muted", color_codes=True)

bins_num = 100
plt.figure(figsize=(15, 7))

ax = plt.gca()
ax.set_facecolor('#fffccc')
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')
sns.distplot(df['Shake'], color="blue", kde=True, hist=True, bins=bins_num, label='Shake', kde_kws=kde_kws, rug_kws=rug_kws, hist_kws=hist_kws)


plt.title('Shake PDF ( bin-ratio = '+str( np.round( 100/bins_num, 3 ))+' [%] )', fontsize = 20) 
plt.grid(linestyle='-', linewidth=1.0)
plt.xlabel(r'$\Delta LB$', fontsize = 20)
plt.ylabel('Density (KDE)', fontsize = 20)
plt.xlim([-1000, 1100])
# plt.xlim([df_min*0.6, df_max*0.7])
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.legend(fontsize = 18)
plt.show()

$$
\begin{aligned}
\text{let :} \quad & F(x) = \sum_{x_j \leq x} f(x_j) \quad \text{where} \quad f(x_j) \equiv P(X=x_j) \\
\text{s.t. :} \quad & \Pr \, (a < X \le b ) = F(b) - F(a) = \sum_{a < x_j \leq b} f(x_j)
\end{aligned}
$$
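
As a rough numerical illustration of this relation, here is a minimal sketch of an empirical CDF. The interval is taken as half-open, $(a, b]$, so that the difference of two CDF values counts each sample exactly once. The helper names and the toy values are hypothetical.

```python
import numpy as np

def ecdf(samples, x):
    """Empirical CDF: fraction of samples that are <= x."""
    samples = np.asarray(samples)
    return np.mean(samples <= x)

def interval_prob(samples, a, b):
    """P(a < X <= b), computed as F(b) - F(a)."""
    return ecdf(samples, b) - ecdf(samples, a)

# Toy shake values (hypothetical, not taken from the competition)
toy_shake = np.array([-300, -40, -5, 0, 0, 10, 25, 120])

p = interval_prob(toy_shake, -50, 50)
print(f'P(-50 < X <= 50) = {100 * p:.1f} [%]')
```

The same two functions can be applied to `df['Shake']` to read off the probability of any range without choosing a bin count first.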

Using this definition let's examine the likelihood of certain arbitrary ranges :

In [None]:
df_hist = np.histogram( df['Shake'], bins=1000 )

Range = [[-29, 29], [-143, 143], [-369, 369]]

for x_min, x_max in Range:
    # Bin edges lying strictly inside ( x_min, x_max )
    a = df_hist[1] > x_min
    b = df_hist[1] < x_max
    prob_tot = df_hist[1][ a & b ]
    prob_min, prob_max = np.min(prob_tot), np.max(prob_tot)
    prob_min_s, prob_max_s = str(int(prob_min)), str(int(prob_max))
    # There is one more edge than bins, hence [:-1]; a bin is counted when its left edge is inside the range
    prob_CDF = str( np.round( 100*( df_hist[0][ (a & b)[:-1]] / df_len ).sum() , 3) )
    print('P('+ prob_min_s+ ' <= X <= '+ prob_max_s+ ') = '+ prob_CDF+ ' [%]')

* $\, \sim 25 [\%] \,$ of the population $\ \Rightarrow \ \pm 24 \ [\, \Delta LB \, ]$
* $\, \sim 50 [\%] \,$ of the population $\ \Rightarrow \ \pm 142 \ [\, \Delta LB \, ]$
* $\, \sim 75 [\%] \,$ of the population $\ \Rightarrow \ \pm 368 \ [\, \Delta LB \, ]$
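
The question can also be asked the other way round: which symmetric window $\pm w$ covers a given share of the population? A minimal sketch, assuming the window is centered at zero, takes a quantile of $| \Delta LB |$. The helper name and the toy values are hypothetical, and `np.quantile` interpolates linearly, so the coverage is approximate on small samples.

```python
import numpy as np

def symmetric_half_width(shake, fraction):
    """Half-width w such that roughly `fraction` of the samples satisfy |shake| <= w."""
    return np.quantile(np.abs(np.asarray(shake)), fraction)

# Toy shake values (hypothetical, not taken from the competition)
toy_shake = np.array([-300, -40, -5, 0, 0, 10, 25, 120])

widths = {p: symmetric_half_width(toy_shake, p) for p in (0.25, 0.50, 0.75)}
for p, w in widths.items():
    print(f'~{int(100 * p)} [%] of the population  =>  +/- {w:.2f}')
```

Applied to `df['Shake']`, this should give half-widths comparable to the ones listed above, up to the binning used in the histogram-based estimate.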

Generally speaking, such large numbers are not typical of a regular competition. In my **dataset** : [**Competitions Shake-up**](https://www.kaggle.com/daniboy370/competitions-shakeup), most of the competitions exhibited far lower numbers.

Let us compare the shake-up and shake-down trends :

In [None]:
plt.figure(figsize=(15, 7))
ax = plt.gca()
ax.set_facecolor('#fffef0')
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')
hist_b={"linewidth": 2, "alpha": 0.75, "color": "b"}
hist_r={"linewidth": 2, "alpha": 0.75, "color": "r"}

# ----------- Mean and median lines ----------- #
plt.axvline(x=df_up_median, label='SU-Median', c='b', linestyle='dashed', linewidth=3.5, alpha=0.8)
plt.axvline(x=df_dn_median, label='SD-Median', c='r', linestyle='dashed', linewidth=3.5, alpha=0.8)
plt.axvline(x=df_up_mean, label='SU-Mean', c='b', linestyle='-', linewidth=3, alpha=0.6)
plt.axvline(x=df_dn_mean, label='SD-Mean', c='r', linestyle='-', linewidth=3, alpha=0.6)
sns.distplot(df_up, color="blue", kde=True, bins=90, label='Shake-up', kde_kws=dict(linewidth=0), hist_kws=hist_b)
sns.distplot(df_dn, color="red", kde=True, bins=150, label='Shake-down', kde_kws=dict(linewidth=0), hist_kws=hist_r)
plt.legend(fontsize = 18)  # called after plotting so that the histogram labels are picked up as well

# --------------------------------------------- #
plt.title("Shake Histogram", fontsize = 25)
plt.xlabel(r'$\Delta LB$', fontsize = 20)
plt.ylabel('Density (KDE)', fontsize = 20)  # with kde=True the histogram is normalized to a density
plt.grid(linestyle='-', linewidth=1.0)
plt.xlim([-600, 600])
# plt.xlim([df_min*0.3, df_max*0.35])
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()

In [None]:
print( '# -------- Shake-up ------- #')
print( 'Median =  %d     Mean =  %.2f' %(int(df_up_median), df_up_mean))
print( '\n# -------- Shake-down ----- #')
print( 'Median = %d     Mean = %.2f' %(int(df_dn_median), df_dn_mean))

After having seen the $\, \Delta LB \,$ frequency, we shall examine the correlation between the **Public** rank and the **Private** rank. 

### $\text{Overfitting vs. Underfitting}$


Let us scatter the samples on a performance plane where $\, \Delta LB_i \,$ is denoted by the vertical distance of point $\, x_i \,$ from the diagonal. The dashed line [ **--** ] delineates the medal zone ( $LB \, \leq \text{ Top-}10 \%$ ) . 

In [None]:
plt.figure(figsize=(16, 8))
ax = plt.gca()
ax.set_facecolor('#fffef0')

Df_up = df[df['Shake']>=0]
Df_dn = df[df['Shake']<0]

plt.scatter( Df_up['Rank_public'], Df_up['Rank_private'], label='Shake-up', s=65, c='b', alpha=1, linewidths=1, edgecolors='w')
plt.scatter( Df_dn['Rank_public'], Df_dn['Rank_private'], label='Shake-down', s=65, c='r', alpha=0.8, linewidths=1, edgecolors='w')
plt.legend(fontsize = 22, loc='lower right')

h_medal = np.round( 0.1*df_len )
plt.hlines(y=h_medal, xmin=0, xmax=0.8*df_len, color='k', linestyle='dashed', linewidth=2.5, alpha=0.6)
plt.axvline(x=h_medal, label='Medals', c='k', alpha=0.5, linestyle='dashed', linewidth=2.5)

# --------------------------------------------- #
plt.title('Private rank vs. Public rank', fontsize = 25)
plt.ylabel('Private LB', fontsize = 20)
plt.xlabel('Public LB', fontsize = 20)
plt.grid(linestyle='-', linewidth=1.0)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()

Before the discussion, consider another view that demonstrates the **model fitting** concept :


In [None]:
plt.figure(figsize=(16, 8))
ax = plt.gca()
ax.set_facecolor('#fffef0')

Df_up = df[df['Shake']>=0]
Df_dn = df[df['Shake']<0]

plt.scatter( Df_up['Rank_public'], Df_up['Shake'], label='Shake-up', s=65, c='b', alpha=1, linewidths=1, edgecolors='w')
plt.scatter( Df_dn['Rank_public'], Df_dn['Shake'], label='Shake-down', s=65, c='r', alpha=0.8, linewidths=1, edgecolors='w')
plt.legend(fontsize = 18, loc='lower right')

h_medal = np.round( 0.1*df_len )
plt.axvline(x=h_medal, label='Medals', c='k', alpha=0.5, linestyle='dashed', linewidth=2.5)

# --------------------------------------------- #
plt.title(r'$\Delta LB$ vs. Public rank', fontsize = 25)
plt.xlabel('Public LB', fontsize = 20)
plt.ylabel(r'$\Delta LB$', fontsize = 20)
plt.grid(linestyle='-', linewidth=1.0)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.ylim([np.min(Df_dn['Shake'])*0.9, np.max(Df_up['Shake'])*1.25])
plt.show()

#### $\text{Discussion}$

These scatter plots shout one thing at me - **Never take the public leaderboard too seriously !**

The **density** of the red points in the medal zone hints at the earthquake that is about to come. Many high-ranked contestants are subjected to a dramatic **shake-down** ( $ \, | \Delta LB | > 1000 $ ) once the competition shifts to the private dataset. 

And as always, who is the usual suspect ??? None other than **model OVERFITTING** to the **public dataset**. 

### $\text{Analysis of medal zone}$

The medal zone as defined by the Kaggle Progression System [[**link**](https://www.kaggle.com/progression)] :

<img src='https://github.com/Daniboy370/Uploads/blob/master/Kaggle-medals.png?raw=true' width=800 />
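
For reference, the cut-offs used in this kernel ( top 10 %, top 5 % and top 10 + 0.2 % ) can be wrapped in a small helper. This is a sketch under the assumption that the competition is large enough for these percentage rules to apply; smaller competitions follow the other columns of the table above. The function names and the team count are hypothetical.

```python
import numpy as np

def medal_cutoffs(n_teams):
    """Inclusive rank cut-offs, using the same formulas as this kernel."""
    return {
        'gold':   int(np.round(10 + 0.002 * n_teams)),
        'silver': int(np.round(0.05 * n_teams)),
        'bronze': int(np.round(0.10 * n_teams)),
    }

def medal_for_rank(rank, cutoffs):
    """Return the best medal whose cut-off contains `rank`, or None."""
    for medal in ('gold', 'silver', 'bronze'):
        if rank <= cutoffs[medal]:
            return medal
    return None

cutoffs = medal_cutoffs(4000)   # hypothetical number of teams
print(cutoffs)
print(medal_for_rank(19, cutoffs), medal_for_rank(401, cutoffs))
```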

The question is : ***What is the re-ranking of Public LB medalists after shifting to the private phase ?***

In [None]:
h_bronze = h_medal
h_silver = np.round( 0.05*df_len )
h_gold   = np.round( 10 + 0.002*df_len )

# Boolean masks combined with "&" stay aligned with df's own index, so no helper is needed
df_bronze = df[ (df['Rank_public'] > h_silver) & (df['Rank_public'] <= h_bronze) ]
df_silver = df[ (df['Rank_public'] > h_gold) & (df['Rank_public'] <= h_silver) ]
df_gold   = df[ (df['Rank_public'] >= 0) & (df['Rank_public'] <= h_gold) ]

# ----------------- Visualization ----------------- #
plt.figure(figsize=(15, 7))
ax = plt.gca()
ax.set_facecolor('#f2f9fa')

plt.scatter( df_bronze['Rank_public'], df_bronze['Rank_private'], label='Bronze Medal', s=90, c='brown', alpha=1, linewidths=0.8, edgecolors='k')
plt.scatter( df_silver['Rank_public'], df_silver['Rank_private'], label='Silver Medal', s=90, c='silver', alpha=1, linewidths=1, edgecolors='k')
plt.scatter( df_gold['Rank_public'],   df_gold['Rank_private'],   label='Gold Medal', s=90, c='gold', alpha=1, linewidths=1, edgecolors='k')

# --------------------------------------------- #
plt.legend(fontsize = 20, loc='upper right')
plt.hlines(y=h_medal, xmin=0, xmax=np.max(df_bronze['Rank_public']), color='k', linestyle='dashed', linewidth=2.5, alpha=0.7)
plt.title('Re-ranking of Public LB medalists ', fontsize = 23)
plt.xlabel('Public LB', fontsize = 21)
plt.ylabel('Private LB', fontsize = 21)
plt.grid(linestyle='-', linewidth=1.0)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
# plt.ylim([-50, 600])
# plt.ylim([np.min(Df_dn['Shake'])*0.7, np.max(Df_up['Shake'])*1.25])
plt.show()

Almost half of the Public LB medalists experienced an enormous **shake** that washed them out of the medal zone ( the cost of **overfitting** ). The other half were lucky ( and smart ) enough to keep their medals.
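
The share of medalists who survive the shift can be measured rather than eyeballed. A minimal sketch, assuming the `Rank_public` / `Rank_private` columns of this kernel and a single bronze cut-off; the helper name and the toy leaderboard are hypothetical.

```python
import pandas as pd

def medal_zone_retention(ranks, cutoff):
    """Fraction of public-LB medalists ( rank <= cutoff ) still inside the zone on the private LB."""
    public_medalists = ranks[ranks['Rank_public'] <= cutoff]
    kept = public_medalists['Rank_private'] <= cutoff
    return kept.mean()

# Toy leaderboard (hypothetical values, not taken from the competition)
toy = pd.DataFrame({'Rank_public':  [1, 2, 3, 4, 5, 6],
                    'Rank_private': [2, 5, 1, 6, 3, 4]})

retention = medal_zone_retention(toy, cutoff=3)
print(f'Retained : {100 * retention:.1f} [%]')
```

On the real data one would call it as `medal_zone_retention(df, h_medal)`. Swapping the two column names answers the reverse question about the Private LB medalists.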

And my last question for today : ***What WAS the finalists' ranking PRIOR to the shifting ?***

In [None]:
h_bronze = h_medal
h_silver = np.round( 0.05*df_len )
h_gold   = np.round( 10 + 0.002*df_len )

# Boolean masks combined with "&" stay aligned with df's own index, so no helper is needed
df_bronze = df[ (df['Rank_private'] > h_silver) & (df['Rank_private'] <= h_bronze) ]
df_silver = df[ (df['Rank_private'] > h_gold) & (df['Rank_private'] <= h_silver) ]
df_gold   = df[ (df['Rank_private'] >= 0) & (df['Rank_private'] <= h_gold) ]

# ----------------- Visualization ----------------- #
plt.figure(figsize=(15, 7))
ax = plt.gca()
ax.set_facecolor('#f2f9fa')

plt.scatter( df_bronze['Rank_public'], df_bronze['Rank_private'], label='Bronze Medal', s=90, c='brown', alpha=1, linewidths=0.8, edgecolors='k')
plt.scatter( df_silver['Rank_public'], df_silver['Rank_private'], label='Silver Medal', s=90, c='silver', alpha=1, linewidths=1, edgecolors='k')
plt.scatter( df_gold['Rank_public'],   df_gold['Rank_private'],   label='Gold Medal', s=90, c='gold', alpha=1, linewidths=1, edgecolors='k')

# --------------------------------------------- #
plt.legend(fontsize = 20, loc='lower right')
plt.axvline(x=h_medal, label='Medals', c='k', alpha=0.7, linestyle='dashed', linewidth=2.5)
plt.title('Re-ranking of Private LB medalists', fontsize = 23)
plt.xlabel('Public LB', fontsize = 21)
plt.ylabel('Private LB', fontsize = 21)
plt.grid(linestyle='-', linewidth=1.0)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
# plt.ylim([np.min(Df_dn['Shake'])*0.7, np.max(Df_up['Shake'])*1.25])
plt.show()

Amazingly, a third of the Private LB medalists were ranked outside the medal zone during the **public** phase. However, they were strong enough to believe in their models, and at the moment of truth they caught the Tsunami successfully.

Furthermore, both scatter plots show an interesting trend which in some engineering fields is known as convergence-divergence :

* The **higher** the PUBLIC rank $\rightarrow$ the **higher** the PRIVATE rank. 

* The **lower** the PUBLIC rank $\rightarrow$ the **lower** the PRIVATE rank.
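
How strongly the two rankings follow each other can be summarized by a single number. A minimal sketch using Spearman's rank correlation, computed here as the Pearson correlation of the ranked columns so that nothing beyond pandas is required; the helper name and the toy leaderboard are hypothetical.

```python
import pandas as pd

def rank_correlation(ranks):
    """Spearman correlation between public and private rank ( Pearson on the ranked values )."""
    pub = ranks['Rank_public'].rank()
    prv = ranks['Rank_private'].rank()
    return pub.corr(prv)

# Toy leaderboard (hypothetical values, not taken from the competition)
toy = pd.DataFrame({'Rank_public':  [1, 2, 3, 4, 5, 6],
                    'Rank_private': [2, 5, 1, 6, 3, 4]})

rho = rank_correlation(toy)
print(f'Spearman rho = {rho:.3f}')
```

A value close to 1 means that the public leaderboard was a faithful preview of the private one, whereas a value close to 0 means that it carried little information about the final standings.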

#### $\text{Summary}$

Shake-ups and shake-downs occur quite often; however, they vary between competitions for many reasons (organizers, data, contestants, models etc.). Usually, the big surprises hit less than 10 [%] of the population ! It is as true of Kaggle competitions as it is of life. My advice : more competitions, more mistakes, more questions, more work. And medals will follow !

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src='https://github.com/Daniboy370/Uploads/blob/master/meme_1.jpg?raw=true' width=500 />

$$
\text{Thank you !}
$$