# Generate a Shake-up Scatter Plot for each Kaggle Competition

This notebook plots public LB rank against private LB rank for every team, producing one plot for each Kaggle competition.

Each competition's plot is saved as an image file, and all are zipped into one output for easy download (see output files section).

Highly scattered plots are harder to compress and result in larger image files, so file size is another way to rank competitions by shake-up. The number of teams is mixed into this metric too: more teams means more data points, so larger competitions tend to land near the top.

**New**: you can try different image codecs. By default (lossless) PNG is used but JPEG would work and create a different ranking.

A preview of the 40 largest plots is shown below.


## Contents

 * [Generate Plots](#Generate-Plots)
 * [Correlations to Competition Features](#Correlations-to-Competition-Features)
 * [Compare All Competitions](#Compare-All-Competitions)
 * [Competitions: Top 40 Highest Shake-up](#Competitions:-Top-40-Highest-Shake-up)
 * [Competitions: Most Recent](#Competitions:-Most-Recent)
 * [Inspect One Competition](#Inspect-One-Competition)
 * [Related Links and Info](#Related-Links-and-Info)

## Background

Why do this?

The plots are *rank* vs *rank*, so every team in a competition has its own "address" in the plot and no two points overlap completely.

However, teams that move from neighbouring positions on the public LB, to neighbouring positions on the *private* LB, have points *together* in the plot.

Image compression uses correlations of *neighbouring* pixels to reduce the amount of data required to store an image.

For example [PNG][1] uses a 2-stage compression process:

 - pre-compression: filtering (prediction)
 - compression: DEFLATE

In the filtering process, each pixel is *predicted* &mdash; it can use as input: the pixel to the left, the pixel above, or the pixel above and to the left, or a combination.

If there were zero shake-up, the teams would form a diagonal line and the image would be easy to compress: lots of blank background space and a thin line of highly predictable points for the teams.

With high shake-up, teams move around a lot and each pixel is harder to predict based on its neighbours.

(Another way to think of it is edges: every transition from background to scatter-plot marker is an *edge* and it is really the *edges* that are costly in compression. An edge is high frequency information, a space without edges is low frequency.)
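
To make the compressibility argument concrete, here is a minimal sketch that skips plotting altogether. It lights one pixel per team in a blank byte grid and applies only the DEFLATE stage through Python's `zlib`; PNG's filtering stage is left out. The helper name `deflated_size` and the grid size are illustrative assumptions, not part of this notebook.

```python
import random
import zlib


def deflated_size(public_ranks, private_ranks):
    """DEFLATE size in bytes of an n x n raster with one lit pixel per team."""
    n = len(public_ranks)
    grid = bytearray(n * n)  # all-zero background
    for x, y in zip(public_ranks, private_ranks):
        grid[y * n + x] = 255  # row = private rank, column = public rank
    return len(zlib.compress(bytes(grid), 9))


n_teams = 256
public = list(range(n_teams))  # 0-based ranks, for easy indexing

# Zero shake-up: private rank equals public rank, giving a diagonal line.
no_shake = deflated_size(public, public)

# Extreme shake-up: private ranks are a random permutation of the public ranks.
shuffled = public[:]
random.Random(0).shuffle(shuffled)
high_shake = deflated_size(public, shuffled)

print(f"diagonal: {no_shake} bytes, shuffled: {high_shake} bytes")
```

Both rasters contain exactly the same number of lit pixels, so any difference in size comes from how predictable their positions are.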

Using file size as a ranking system is not perfect, for example long competition titles add a bit to the file size, and it's sensitive to the size of the markers. But it's quick to implement, it tells us *something* about competitions, and the resulting plots can reveal insights too :)

A similar idea, *compression distance*, can be highly useful in ML competitions. Here $A$ and $B$ are typically strings and $A+B$ is their concatenation:

$$\mathrm{CompressionDistance}(A, B) = \frac{|\mathrm{compress}(A+B)|}{|\mathrm{compress}(A)|+|\mathrm{compress}(B)|}$$

If the compressed representation of $A+B$ is much smaller than the combined size of $A$ and $B$ compressed separately, there is redundancy between $A$ and $B$.
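
As a rough illustration of the ratio above, the sketch below uses `zlib` as the compressor (an assumption; any general-purpose compressor could stand in). The function names and the sample inputs are made up for the example.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def compression_distance(a: bytes, b: bytes) -> float:
    """|compress(A+B)| / (|compress(A)| + |compress(B)|)."""
    return compressed_len(a + b) / (compressed_len(a) + compressed_len(b))


text = b"public and private leaderboard ranks " * 30
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

redundant = compression_distance(text, text)   # B repeats A exactly
unrelated = compression_distance(text, noise)  # B shares nothing with A

print(f"text vs text:  {redundant:.2f}")
print(f"text vs noise: {unrelated:.2f}")
```

A value well below 1 signals shared structure between the two inputs; a value close to 1 means compressing them together saved almost nothing.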

More generally, from the [Data Compression][2] article on Wikipedia:

<blockquote>
There is a close connection between machine learning and compression.
A system that predicts the posterior probabilities of a sequence given
its entire history can be used for optimal data compression (by using
arithmetic coding on the output distribution). An optimal compressor
can be used for prediction (by finding the symbol that compresses
best, given the previous history). This equivalence has been used as a
justification for using data compression as a benchmark for "general
intelligence".
</blockquote>

<!--
    Portable Network Graphics (PNG, officially pronounced /pɪŋ/[2][3] PING, more commonly pronounced /ˌpiːɛnˈdʒiː/[4] PEE-en-JEE) is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF).
-->

[1]: https://en.wikipedia.org/wiki/Portable_Network_Graphics
[2]: https://en.wikipedia.org/wiki/Data_compression
[3]: https://www.kaggle.com/jtrotman/meta-kaggle-competition-shake-up

In [1]:
import gc, os, sys, subprocess
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML, Image, display
import plotly.express as px
from tqdm.notebook import tqdm

CSV_DIR = '../input'
OUTPUT_ZIP = 'shakeup_scatter_plots.zip'
X, Y = 'PublicLeaderboardRank', 'PrivateLeaderboardRank'
CODEC = 'png'  # Fork & try 'jpg'!
N_SHOW = 40
MIN_TEAM_COUNT = 20  # Do not plot comps with few teams

def read_csv_filtered(csv, col, values, **kwargs):
    dfs = [df.loc[df[col].isin(values)]
           for df in pd.read_csv(csv, chunksize=100000, low_memory=False, **kwargs)]
    return pd.concat(dfs, axis=0)

def do_read_csv(name, **kwargs):
    df = pd.read_csv(name, low_memory=True, **kwargs)
    print(df.shape, name)
    return df

In [2]:
teams = do_read_csv(f'{CSV_DIR}/Teams.csv')
comps = do_read_csv(f'{CSV_DIR}/Competitions.csv').set_index('Id')
comps['Deadline'] = comps.DeadlineDate.str.split().str[0]
comps['DeadlineDate'] = pd.to_datetime(comps.DeadlineDate)
# InClass are most common, followed by Featured
comps.HostSegmentTitle.value_counts()

Select competitions that have public and private leaderboard ranks.

Also filter out "M5 Forecasting - Accuracy": the public LB was reset one month before the end.
Only the second public LB is available in Meta Kaggle, and its ground truth was publicly available.
Over 700 teams could not resist showboating and had a perfect score of 0!
About 2000 had unrealistically good scores, creating artificially high shake-up.
The crazy-looking plot is [saved here][1] for posterity :)

 [1]: https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163999


In [3]:
exclude_slugs = {'m5-forecasting-accuracy'}
teams = teams.dropna(subset=[X, Y])
n_priv = teams.groupby('CompetitionId').PrivateLeaderboardRank.count()
with_private = n_priv[n_priv > 0].index
comps_idx = ((comps.LeaderboardPercentage > 0)
             & (comps.index.isin(with_private))
             & (~comps.Slug.isin(exclude_slugs)))
selected_comps = comps.loc[comps_idx].copy()
selected_comps.shape

Find the teams for the selected competitions.

In [4]:
teams = teams.loc[teams.CompetitionId.isin(selected_comps.index)]
teams = teams.assign(Medal=teams.Medal.fillna(0).astype(int))
teams.shape

In [5]:
IMAGE_DIR = '/kaggle/plots'
os.makedirs(IMAGE_DIR, exist_ok=True)

# Generate Plots

Color each point by the medal its team won on the private LB.

In [6]:
plt.rc('font', size=14)
plt.rc('figure', figsize=(15, 15))

In [7]:
shakes = {}
image_files = {}
COLOR_DICT = {0: 'deepskyblue', 1: 'gold', 2: 'silver', 3: 'chocolate'}
savefig_opts = dict(bbox_inches='tight')
total = teams.CompetitionId.nunique()

for i, df in tqdm(teams.groupby('CompetitionId', sort=False), total=total):
    if len(df) < MIN_TEAM_COUNT:
        continue
    fname = comps.Slug[i]
    row = comps.loc[i]
    shakeup = df.eval('abs(PrivateLeaderboardRank-PublicLeaderboardRank)').mean() / len(df)
    title = (f'{row.Title} — {row.TotalTeams} teams — '
             f'{shakeup:.3f} shake-up — {row.Deadline}')
    image_file = f'{IMAGE_DIR}/{fname}.{CODEC}'
    image_files[i] = image_file
    shakes[i] = shakeup
    l = [0, df.PrivateLeaderboardRank.max()]
    df = df.sort_values('PrivateLeaderboardRank', ascending=False)  # plot gold last
    ax = df.plot.scatter(X, Y, c=df.Medal.map(COLOR_DICT))
    ax.plot(l, l, linestyle='--', linewidth=1, color='k', alpha=0.5)
    ax.set_title(title)
    ax.set_xlabel(X)
    ax.set_ylabel(Y)
    plt.tight_layout()
    plt.savefig(image_file, **savefig_opts)
    plt.close()
    gc.collect()
plt.close()    

Add new fields and save. Try adding this notebook as a *kernel source* and using this file to find more insights about shake-up!
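
The `Shakeup` column saved in this file is the mean absolute difference between private and public rank, divided by the number of teams, as computed in the plotting loop. Here is a plain-Python sketch of that calculation on a toy four-team leaderboard; the function name `shakeup` and the rank lists are illustrative, not taken from Meta Kaggle.

```python
def shakeup(public_ranks, private_ranks):
    """Mean absolute rank change, divided by the number of teams."""
    n = len(public_ranks)
    total_move = sum(abs(prv - pub)
                     for pub, prv in zip(public_ranks, private_ranks))
    return total_move / n / n


public = [1, 2, 3, 4]
print(shakeup(public, [1, 2, 3, 4]))  # nobody moves       -> 0.0
print(shakeup(public, [2, 1, 4, 3]))  # neighbours swap    -> 0.25
print(shakeup(public, [4, 3, 2, 1]))  # leaderboard flips  -> 0.5
```

Dividing by the team count puts competitions of different sizes on a comparable scale, which the raw image file size does not do.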

In [8]:
comps['FileSize'] = pd.Series(image_files).apply(os.path.getsize)
comps['Shakeup'] = pd.Series(shakes)
comps.to_csv('CompetitionsWithShakeup.csv')
comps['Image'] = pd.Series(image_files)

**Mercedes-Benz Greener Manufacturing** and **LANL Earthquake Prediction** are nearly tied at the top by this metric.

In [9]:
def fmt_link(row):
    url = 'https://www.kaggle.com/c/{Slug}'.format(**row)
    txt = ('{Deadline} [{EvaluationAlgorithmName}]\n\n'
           '{LeaderboardPercentage}% public\n\n'
           '{Subtitle}').format(**row)
    title = row.Title
    return f'<a href="{url}" title="{txt}">{title}</a>'


show = [
    'Title', 'HostSegmentTitle', 'TotalTeams', 'Deadline', 'Shakeup',
    'FileSize'
]
bars = ['TotalTeams', 'Shakeup', 'FileSize']

tmp = comps.assign(Title=comps.apply(fmt_link, axis=1))
tmp = tmp.sort_values('FileSize', ascending=False)[show]
tmp = tmp.set_index('Title').head(50)
tmp.style.bar(subset=bars, color='#20beff')

# Correlations to Competition Features

The FileSize is much more correlated (0.9) to these:
 - TotalTeams
 - TotalCompetitors
 - TotalSubmissions

...than Shakeup (0.4)
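
To read individual pairs off the Spearman matrix rather than squinting at a heatmap, something like the following may help. The tiny frame `toy` holds made-up numbers so that the snippet runs on its own; in this notebook the same indexing would apply to `comps[show].corr(method='spearman')`.

```python
import pandas as pd

# Made-up values: FileSize rises with TotalTeams, Shakeup does not.
toy = pd.DataFrame({
    'TotalTeams': [10, 20, 30, 40, 50],
    'FileSize': [100, 220, 290, 450, 520],
    'Shakeup': [0.30, 0.10, 0.20, 0.05, 0.25],
})

# Spearman works on ranks, so any monotonic relationship scores +/-1.
corr = toy.corr(method='spearman')
print(corr.loc['TotalTeams', ['FileSize', 'Shakeup']])
```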

In [10]:
show = [
    'HasKernels', 'OnlyAllowKernelSubmissions', 'LeaderboardPercentage',
    'MaxDailySubmissions', 'NumScoredSubmissions', 'MaxTeamSize',
    'BanTeamMergers', 'EnableTeamModels', 'EnableSubmissionModelHashes',
    'EnableSubmissionModelAttachments', 'RewardQuantity', 'NumPrizes',
    'UserRankMultiplier', 'CanQualifyTiers', 'TotalTeams', 'TotalCompetitors',
    'TotalSubmissions', 'FileSize', 'Shakeup'
]

plt.rc('font', size=12)
plt.figure(figsize=(14, 12))
sns.heatmap(comps[show].corr(method='spearman'),
            vmin=-1,
            cmap='RdBu',
            annot=True,
            fmt='.1f',
            linewidths=1)
plt.title('Kaggle Competition Attributes - Spearman Correlation');

# Compare All Competitions

Visualise the relationship across competitions, comparing *Shakeup* on the x-axis against the file *size* of the generated scatter plot for that competition.
Notice how you need higher shake-up *and* more teams to generate a larger PNG.

The low shake-up corner (bottom left) is crowded (as it *should* be! Kaggle competitions would not be so popular if high shake-up was the norm!) but you can zoom in for better detail (thanks plot.ly!)

<!--
See [the top comment by beluga in this thread in the recent Data Science Bowl 2019](https://www.kaggle.com/c/data-science-bowl-2019/discussion/127203#726747) for proof that the prospect of high shake-up discourages top competitors. On the other hand, high shake-up attracts some people with the prospect of *free lotto tickets* - entering with a shallow *low effort* approach and hoping to get lucky. There's nothing wrong with either case, and both apply to me for different competitions. Confidence that shake-up will be low certainly inspires more effort.
-->

Without shake-up competitions become too much like open-loop validation set over-fitting.
Too much shake-up means results look "random".
Where in this graph should Kaggle be *aiming*?!
Is the dense region from a conscious effort to make competitions "low shake"?


In [11]:
color_discrete_map = {
    'Featured': 'blue',
    'Research': 'green',
    'Recruitment': 'red',
    'GE Quests': 'slateblue',
    'Getting Started': 'slateblue',
    'Playground': 'slateblue',
    'Prospect': 'slateblue',
}
fig = px.scatter(comps.loc[comps_idx & (comps.HostSegmentTitle != 'InClass')],
                 title='Competition Shake-up',
                 x='Shakeup',
                 y='FileSize',
                 log_y=True,
                 hover_name='Title',
                 hover_data=[
                     'EvaluationAlgorithmAbbreviation', 'TotalTeams',
                     'TotalSubmissions', 'Deadline'
                 ],
                 color='HostSegmentTitle',
                 color_discrete_map=color_discrete_map)
fig.update_layout(showlegend=False)

This is a better view.
Plotting the file size against the number of teams makes the trend very clear.
Now it is easy to see the range of file sizes for a given number of teams.
The vertical variation within a given number of teams is due mainly to the shake-up.
(Points along the top have high shake-up and points along the bottom have low shake-up.
Most of the lowest points actually have Shakeup=0.)

In [12]:
fig = px.scatter(comps.loc[comps_idx & (comps.HostSegmentTitle != 'InClass')],
                 title='Competition Shake-up',
                 x='TotalTeams',
                 y='FileSize',
                 log_x=True,
                 log_y=True,
                 hover_name='Title',
                 hover_data=[
                     'EvaluationAlgorithmAbbreviation', 'Shakeup',
                     'TotalSubmissions', 'Deadline'
                 ],
                 color='HostSegmentTitle',
                 color_discrete_map=color_discrete_map)
fig.update_layout(showlegend=False)

# Competitions: Top 40 Highest Shake-up

Here are the scatter plots for the top 40 competitions sorted by image file size...


In [13]:
COMMENT = {}

COMMENT["data-science-bowl-2019"] = """

As pointed out by narsil in this <a href="https://www.kaggle.com/c/data-science-bowl-2019/discussion/127203">thread</a>
there is a very interesting <em>inverse</em> correlation between public and private scores for the bulk of the teams,
perhaps a quirk of the QWK metric combined with a small test set size.

"""

COMMENT["santander-customer-satisfaction"] = """

Many teams high up on the public LB were using manual overrides that set the submission prediction to 0 in specific cases.
As time went on these rules targeted smaller and smaller sets of rows in a non-statistically sound way :)
Overrides that improved public LB AUC were published in notebooks, whilst those that did not were silently forgotten.
This gradually overfit the public LB and many of these post processing tricks did not hold up on the private part of the test set.

"""

COMMENT["ieee-fraud-detection"] = """

Somebody shared a <a href="https://www.kaggle.com/c/ieee-fraud-detection/discussion/111108">poisoned blend</a>.
Those who missed the warning post are clear in the plot, dropping from the top 1000 on the public LB to outside the top 5000 on the private LB.

"""

COMMENT["mercedes-benz-greener-manufacturing"] = """

This competition featured one of the smallest test sets in a featured competition with just 4209 rows (19% public, 81% private).
The metric was the R^2 value and somebody set up a website to co-ordinate a massive crowd-sourced leaderboard-polling operation, in this 
<a href="https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/35271">How to get the exact y-values of all data points in the public LB</a>
thread. This did not go exactly as they planned but contributed to heavy public LB overfitting.

"""

In [14]:
# uses a row from Competitions dataframe
def plot(i, row, tag=''):
    pre = 'https://www.kaggle.com'
    comment = COMMENT.get(row.Slug, '')
    html = ('<h1 id="{tag}{Slug}">[#{i}] {Title}</h1>'
            '<h3>{Subtitle}</h3>'
            '<p>[ <a href="{pre}/c/{Slug}">home</a>'
            '   | <a href="{pre}/c/{Slug}/discussion">discussion</a>'
            '   | <a href="{pre}/c/{Slug}/leaderboard">leaderboard</a> ]'
            '<br/>'
            '{comment}').format(pre=pre, tag=tag, i=i, comment=comment, **row)
    display(HTML(html))
    display(Image(row.Image))

In [15]:
src = comps.sort_values('FileSize', ascending=False)
for i, (row_id, row) in enumerate(src.head(N_SHOW).iterrows(), 1):
    plot(i, row)

# Competitions: Most Recent

For sharing on the forums, here are the most recent competitions:

In [16]:
by_date = comps.query("HostSegmentTitle!='InClass'")
by_date = by_date.sort_values("DeadlineDate", ascending=False)

for i, (row_id, row) in enumerate(by_date.head(5).iterrows(), 1):
    plot(i, row, 'most-recent-')

# Inspect One Competition

Note how hovering over the teams gives you their scores - in "full precision" - the score data in Meta Kaggle appears to be the raw output of their metric calculation code, no truncation :-)

Fork this notebook and put any competition slug here:

In [17]:
Slug = 'mercari-price-suggestion-challenge'

In [18]:
# Read Submissions.csv in filtered chunks; loading it whole triggers
# "Your notebook tried to allocate more memory than is available."
def read_subs(ids):
    subs_cols = ['Id', 'PublicScoreFullPrecision', 'PrivateScoreFullPrecision']
    subs = read_csv_filtered(f'{CSV_DIR}/Submissions.csv', 'Id', ids, usecols=subs_cols)
    subs.set_index('Id', inplace=True)
    return subs

In [19]:
MEDAL_NAMES = np.asarray(["None", "Gold", "Silver", "Bronze"])
MEDAL_COLORS = dict(
    zip(
        MEDAL_NAMES,  # depends on Python 3 dict order
        COLOR_DICT.values()))

selected = comps.query(f"Slug=='{Slug}'")

assert len(selected) == 1, f"Slug {Slug} not found"

chosen = selected.squeeze()

df = teams.query(f"CompetitionId=={chosen.name}").fillna("")
df['Medal'] = MEDAL_NAMES[df.Medal]

ids = (set(df.PublicLeaderboardSubmissionId.astype(int)) |
       set(df.PrivateLeaderboardSubmissionId.astype(int)))
subs = read_subs(ids)
df['PublicScore'] = df.PublicLeaderboardSubmissionId.map(subs.PublicScoreFullPrecision)
df['PrivateScore'] = df.PrivateLeaderboardSubmissionId.map(subs.PrivateScoreFullPrecision)

fig = px.scatter(df,
                 title='Shake-up ' + chosen.Title,
                 x=X,
                 y=Y,
                 hover_name='TeamName',
                 hover_data=[
                     'ScoreFirstSubmittedDate',
                     'LastSubmissionDate',
                     'PublicScore',
                     'PrivateScore',
                     'Medal',
                 ],
                 color='Medal',
                 color_discrete_map=MEDAL_COLORS)
fig.update_layout(showlegend=False)

## Package Files

Archive files for download.

    -bd : disable progress indicator
    -mmt[N] : set number of CPU threads
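
If `7z` is not available in your environment, a rough standard-library alternative is sketched below. The helper name `zip_images` is hypothetical and not part of this notebook. Unlike the `7z` command, it stores files under their base names only and does not recompress them.

```python
import glob
import os
import zipfile


def zip_images(image_dir, codec, output_zip):
    """Write every *.<codec> file in image_dir into output_zip; return the names."""
    paths = sorted(glob.glob(os.path.join(image_dir, f'*.{codec}')))
    # PNG and JPEG are already compressed, so store them as-is.
    with zipfile.ZipFile(output_zip, 'w', compression=zipfile.ZIP_STORED) as zf:
        for path in paths:
            zf.write(path, arcname=os.path.basename(path))
    return [os.path.basename(p) for p in paths]
```

In this notebook it would be called as `zip_images(IMAGE_DIR, CODEC, OUTPUT_ZIP)`.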

In [20]:
!7z a -bd -mmt4 {OUTPUT_ZIP} {IMAGE_DIR}/*.{CODEC}

# Related Links and Info

[Here is my first attempt at a shake-up notebook][1] that generates the stats manually for each competition, using [BreakfastPirate][2]'s definition of "top 10% shakeup" and also a similar "gold medal shake".

Some of the plots from this notebook were posted in the related competition forums just after the deadline, drawing interesting comments and new insights from top Kagglers:

 - https://www.kaggle.com/c/lish-moa/discussion/200530
 - https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163999
 - https://www.kaggle.com/c/data-science-bowl-2019/discussion/127203
 - https://www.kaggle.com/c/ieee-fraud-detection/discussion/111670
 - https://www.kaggle.com/c/understanding_cloud_organization/discussion/117948
 - https://www.kaggle.com/c/severstal-steel-defect-detection/discussion/114208
 - https://www.kaggle.com/c/pku-autonomous-driving/discussion/127141
 - https://www.kaggle.com/c/cat-in-the-dat/discussion/120927
 - https://www.kaggle.com/c/bengaliai-cv19/discussion/137831 (Special guest post by @greatgamedota)

Simulation competitions have leaderboards that are constantly updated as new episodes are run, so perhaps new metrics are needed to quantify volatility over time.
Until then, here are a couple of animated shake-up plots from snapshots of the leaderboards:

 - https://www.kaggle.com/jtrotman/santa-2020-animated-shake-up-plot
 - https://www.kaggle.com/jtrotman/rock-paper-scissors-animated-shake-up-plot

This has turned out to be quite a cumbersome notebook, taking 20 minutes to run.
If you want something more focussed, to zoom in on one competition, you can check out the plot.ly 
[Shakeup interactive scatterplot maker](https://www.kaggle.com/carlmcbrideellis/shakeup-interactive-scatterplot-maker) by @carlmcbrideellis (and me, technically!) Also by Carl, for a more detailed narrative breakdown of the features in the plots, there is:
[Shakeup scatterplots: Boxes, strings and things...](https://www.kaggle.com/carlmcbrideellis/shakeup-scatterplots-boxes-strings-and-things)

Want to see every single mention of the phrase "shakeup" or "shake-up" on Kaggle? [Shakeup: The Story So Far][4].

Lastly, I should credit BreakfastPirate again - he came up with the metric and his [discussion][3] tab has loads of posts on shake-up, with histograms, scatterplots and even animations >:-D

[1]: https://www.kaggle.com/jtrotman/meta-kaggle-competition-shake-up
[2]: https://www.kaggle.com/breakfastpirate
[3]: https://www.kaggle.com/breakfastpirate/discussion
[4]: https://www.kaggle.com/jtrotman/shakeup-the-story-so-far

In [21]:
_ = """
Re-run to include recent competitions:

    2021-06-23 | Slug:iwildcam2021-fgvc8
    2021-06-28 | Slug:coleridgeinitiative-show-us-the-data
    2021-07-05 | Slug:tabular-playground-series-jun-2021
    2021-08-10 | Slug:google-smartphone-decimeter-challenge
    2021-08-11 | Slug:commonlitreadabilityprize


"""