# Kaggle Rankings History

This notebook uses snapshots of the `UserAchievements` table from [Meta Kaggle][1]. Note: only users with Points>=0 are stored...

It is mainly a demo of reading and querying the data.

## Questions

 - How does the points requirement to make the top N in each tier change over time?

## TODO

- colors. Out of time!
- join with Users.csv from Meta Kaggle to inspect rises / falls!
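
For the second TODO item, here is a minimal sketch of how the join might look, run on invented toy data. It assumes `Users.csv` exposes `Id` and `UserName` columns (worth checking against the actual file), and `rank_moves` is a hypothetical helper, not something defined in this notebook.

```python
import pandas as pd

# Toy stand-ins for two snapshots and Users.csv; all values are invented.
snap = pd.DataFrame({
    'UserId': [1, 2, 3, 1, 2, 3],
    'AchievementType': ['Competitions'] * 6,
    'CurrentRanking': [1.0, 2.0, 3.0, 3.0, 1.0, 2.0],
    'date': pd.to_datetime(['2019-01-01'] * 3 + ['2019-02-01'] * 3),
})
users_toy = pd.DataFrame({'Id': [1, 2, 3], 'UserName': ['ann', 'bob', 'cat']})


def rank_moves(snap, users, atype, start, end):
    """Rank change per user between two snapshot dates (positive = rose)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    sub = snap.loc[snap.AchievementType == atype]
    a = sub.loc[sub.date == start].set_index('UserId').CurrentRanking
    b = sub.loc[sub.date == end].set_index('UserId').CurrentRanking
    # Users present in only one snapshot produce NaN and are dropped.
    moves = (a - b).dropna().rename('places_gained').reset_index()
    # Attach user names; users missing from the users table keep NaN names.
    out = moves.merge(users, left_on='UserId', right_on='Id', how='left')
    return out.sort_values('places_gained', ascending=False)


moves = rank_moves(snap, users_toy, 'Competitions', '2019-01-01', '2019-02-01')
print(moves[['UserName', 'places_gained']])
```

With the notebook's own `users` frame, which is read with `index_col=0`, the merge would presumably use `right_index=True` rather than `right_on='Id'`.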

[1]: https://www.kaggle.com/kaggle/meta-kaggle "Meta Kaggle"

In [1]:
%matplotlib inline
import gc, os, re, sys, time
import calendar
import pandas as pd, numpy as np
from pathlib import Path
from IPython.display import display, HTML

MK = Path('../input/meta-kaggle')
DS = Path('../input/user-achievements-snapshots')
ID = 'Id'
ATYPES = ['Competitions', 'Scripts', 'Discussion']

pd.options.display.max_rows = 200

TIERNAMES = [
    "novice", "contributor", "expert", "master", "grandmaster", "overlord"
]

DTYPE = {
    'UserId': ('int32'),
    'AchievementType': ('O'),
    'Tier': ('int8'),
    'Points': ('int32'),
    'CurrentRanking': ('float32'),
    'HighestRanking': ('float32'),
    'TotalGold': ('int32'),
    'TotalSilver': ('int32'),
    'TotalBronze': ('int32'),
}

In [2]:
def read_all():
    dfs = []
    for f in sorted(os.listdir(DS)):
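        # File names look like UserAchievements_YYMMDD.csv
        m = re.match(r'UserAchievements_(\d\d)(\d\d)(\d\d)\.csv$', f)
        if m:
            # Explicit format avoids ambiguous two-digit date parsing
            date = pd.to_datetime(''.join(m.groups()), format='%y%m%d')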
            df = pd.read_csv(DS / f,
                             index_col=0,
                             dtype=DTYPE,
                             parse_dates=['TierAchievementDate'])
            dfs.append(df.assign(date=date))
    return pd.concat(dfs)

In [3]:
df = read_all()
df.shape

In [4]:
users = pd.read_csv(MK / 'Users.csv', index_col=0)
users.shape

In [5]:
# Add 1 to the codes - the number now refers to the number of dots in the UI
df['TierName'] = df.Tier.apply(lambda x: f't{x+1}_{TIERNAMES[x]}')

In [6]:
df.head()

Time lag: new users start at novice and take time to move up the tiers, so there should be more novices in the *recent* past.
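
Raw counts grow with the user base, so the time-lag effect may be easier to see as a share of each snapshot. A small sketch on invented rows, using the same column names as `df`:

```python
import pandas as pd

# Toy snapshot rows; the counts are invented for illustration.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01'] * 4 + ['2019-02-01'] * 5),
    'AchievementType': ['Competitions'] * 9,
    'TierName': ['t1_novice', 't1_novice', 't2_contributor', 't3_expert',
                 't1_novice', 't1_novice', 't1_novice', 't2_contributor',
                 't3_expert'],
})

counts = (toy.loc[toy.AchievementType == 'Competitions']
             .groupby(['date', 'TierName']).size().unstack(fill_value=0))
# Divide each row by its total so tiers are comparable across snapshots.
share = counts.div(counts.sum(axis=1), axis=0)
print(share.round(2))
```

In this toy data the novice share rises from one snapshot to the next, which is the pattern the time-lag argument would predict.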

In [7]:
df.loc[df.AchievementType=="Competitions"].groupby(['date', 'TierName']).size().unstack().style.background_gradient(axis=1)

However, scripts and discussions require **contribution of content** for users to make it into the rankings.

I did not realise discussion master (or master & GM) is actually the rarest of the upper tiers.
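
One way to check which upper tier is rarest is a cross-tabulation of the latest snapshot. This is only a sketch on made-up rows; tier codes 3 and 4 are taken to mean master and grandmaster, following `TIERNAMES`.

```python
import pandas as pd

# Invented rows: seven from the latest snapshot, one from an older one.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-02-01'] * 7 + ['2019-01-01']),
    'AchievementType': ['Competitions', 'Competitions', 'Competitions',
                        'Scripts', 'Scripts',
                        'Discussion', 'Discussion', 'Discussion'],
    'Tier': [3, 3, 4, 3, 4, 3, 4, 3],
})

latest = toy.loc[toy.date == toy.date.max()]
upper = latest.loc[latest.Tier >= 3]  # tier codes 3 and 4: master, grandmaster
# Rows are achievement types, columns are tier codes, cells are user counts.
table = pd.crosstab(upper.AchievementType, upper.Tier)
print(table)
```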

In [8]:
df.loc[df.AchievementType=="Scripts"].groupby(['date', 'TierName']).size().unstack().style.background_gradient(axis=1)

In [9]:
df.loc[df.AchievementType=="Discussion"].groupby(['date', 'TierName']).size().unstack().style.background_gradient(axis=1)

In [10]:
TCOLS = [
    't1_novice', 't2_contributor', 't3_expert', 't4_master', 't5_grandmaster'
]

In [11]:
import matplotlib as mpl
import matplotlib.pyplot as plt

FIGSIZE = (15, 9)
plt.style.use('ggplot')
plt.rc('figure', figsize=FIGSIZE)  # works locally, not on Kaggle
plt.rc('font', size=14)

# Tier Counts Over Time

In [12]:
for t in ATYPES:
    display(HTML(f"<h2>{t} Tier Counts Over Time</h2>"))
    tmp = pd.get_dummies(df.loc[df.AchievementType==t], columns=['TierName'], prefix='', prefix_sep='')
    tmp.groupby('date')[TCOLS].sum().plot(figsize=FIGSIZE)
    plt.show()

In [13]:
MIN_TIER = 2
for t in ATYPES:
    display(HTML(f"<h2>{t} Tier Counts Over Time</h2>"))
    tmp = df.loc[(df.AchievementType==t) & (df.Tier>=MIN_TIER)]
    tmp = pd.get_dummies(tmp, columns=['TierName'], prefix='', prefix_sep='')
    tmp.groupby('date')[TCOLS[MIN_TIER:]].sum().plot(figsize=FIGSIZE)
    plt.show()

In [14]:
MIN_TIER = 3
for t in ATYPES:
    display(HTML(f"<h2>{t} Tier Counts Over Time</h2>"))
    tmp = df.loc[(df.AchievementType==t) & (df.Tier>=MIN_TIER)]
    tmp = pd.get_dummies(tmp, columns=['TierName'], prefix='', prefix_sep='')
    tmp.groupby('date')[TCOLS[MIN_TIER:]].sum().plot(figsize=FIGSIZE)
    plt.show()

In [15]:
MIN_TIER = 4
for t in ATYPES:
    display(HTML(f"<h2>{t} Tier Counts Over Time</h2>"))
    tmp = df.loc[(df.AchievementType==t) & (df.Tier>=MIN_TIER)]
    tmp = pd.get_dummies(tmp, columns=['TierName'], prefix='', prefix_sep='')
    tmp.groupby('date')[TCOLS[MIN_TIER:]].sum().plot(figsize=FIGSIZE)
    plt.show()

# Points at Top Spots

Strictly, the points required to be #1 are anything above the points of #2 :)

But approximately, the points held by #100 are the points required to make the top 100.

Note how the competitions points requirement is fairly constant over time. The mix of competitions has moved towards image processing and larger datasets, so computer vision experts move into the top ranks while others are unable to compete and drop out.
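
The "points of #N" idea can also be computed straight from `Points`, without relying on a row having exactly that `CurrentRanking`. A sketch on toy values; `points_at_rank` is a hypothetical helper, not part of this notebook:

```python
import pandas as pd

# Invented points for four users in each of two snapshots.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01'] * 4 + ['2019-02-01'] * 4),
    'AchievementType': ['Scripts'] * 8,
    'Points': [90, 50, 70, 10, 120, 80, 95, 20],
})


def points_at_rank(frame, atype, n):
    """Points held by the n-th highest scorer in each snapshot."""
    sub = frame.loc[frame.AchievementType == atype]
    # nlargest(n) keeps the top n values; the smallest of those is rank n.
    # If a snapshot has fewer than n users, this returns its lowest score.
    return sub.groupby('date').Points.apply(lambda s: s.nlargest(n).min())


top3 = points_at_rank(toy, 'Scripts', 3)
print(top3)
```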

Scripts and discussions are a rising tide-line: content is added in ever greater quantities and the requirements for top ranks increase monotonically.
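
Whether a threshold really rises monotonically can be checked rather than eyeballed. A sketch on a hypothetical table of points at some fixed rank per snapshot (the numbers are invented, not taken from the data):

```python
import pandas as pd

# Hypothetical threshold series: points at a fixed rank, one row per snapshot.
thresholds = pd.DataFrame(
    {'Competitions': [1000, 990, 1010, 1005],
     'Scripts': [200, 240, 300, 360],
     'Discussion': [150, 150, 180, 220]},
    index=pd.to_datetime(['2019-01-01', '2019-02-01',
                          '2019-03-01', '2019-04-01']),
)

# True where a series never decreases from one snapshot to the next
# (equal consecutive values are allowed).
never_falls = thresholds.apply(lambda s: s.is_monotonic_increasing)
# Relative change from the first to the last snapshot.
growth = thresholds.iloc[-1] / thresholds.iloc[0] - 1
print(pd.DataFrame({'never_falls': never_falls, 'growth': growth.round(3)}))
```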

In [16]:
def show_top(ranks):
    """Plot, for each achievement type, the points held at the given ranks over time."""
    for t in ATYPES:
        title = f"{t} Points at Top {ranks} Ranks"
        display(HTML(f"<h2>{t}</h2>"))
        for r in ranks:
            tmp = df.loc[(df.AchievementType == t) & (df.CurrentRanking == r)].set_index('date')
            tmp.Points.plot(figsize=FIGSIZE, label=f'Top {r}', legend=True)
        plt.ylim(0)
        plt.title(title)
        plt.show()

def show_top_sum(fields, func):
    """Plot, for each achievement type, `func` of `fields` aggregated per snapshot date."""
    for t in ATYPES:
        title = f"{t} {func} {fields}"
        display(HTML(f"<h2>{t}</h2>"))
        tmp = df.loc[(df.AchievementType == t)]
        res = tmp.groupby('date')[fields].agg(func)
        res.plot(figsize=FIGSIZE, label=f'{func} {fields}', legend=True)
        plt.ylim(0)
        plt.title(title)
        plt.show()

In [17]:
show_top([100])

In [18]:
show_top([1, 5, 10])

In [19]:
show_top([200, 400])

# Medal Counts

Competition Silver and Bronze are so close because they're based on top 5% and 10%, whilst gold is traditionally top 10, or up to top ~30 depending on # of teams.

Discussion bronze medals are *way* higher than silver and gold - because they include many 1:1 "*saw this - thanks!*" type votes?



In [20]:
show_top_sum(['TotalGold', 'TotalSilver', 'TotalBronze'], 'sum')

In [21]:
show_top_sum(['TotalGold', 'TotalSilver'], 'sum')

In [22]:
show_top_sum('TotalGold', 'sum')

It seems discussion gold (& silver) medals hit a new rate around March 2019 - the date of the [largest competition to date][1] which has 865 topics!!!

[1]: https://www.kaggle.com/c/santander-customer-transaction-prediction
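
A change in rate is easier to see on the increase between snapshots than on the cumulative totals. A sketch with invented totals, assuming they come from something like `groupby('date')['TotalGold'].sum()`:

```python
import pandas as pd

# Invented cumulative medal totals, one value per snapshot date.
totals = pd.Series(
    [1000, 1060, 1120, 1300],
    index=pd.to_datetime(['2019-01-01', '2019-01-31',
                          '2019-03-02', '2019-04-01']),
    name='TotalGold',
)

# Snapshots may be unevenly spaced, so normalise the increase by elapsed days.
days = totals.index.to_series().diff().dt.days
per_day = (totals.diff() / days).dropna()
print(per_day)
```

Note that the real totals only cover users present in each snapshot, so a drop between snapshots is possible and would show up as a negative rate.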