# How To Compute Notebook Rankings

This notebook attempts to recompute the notebook ranking points for all users.

Kaggle is pretty open about how the voting system works; see this extract from the [progression](https://www.kaggle.com/progression) guide:

___________________________


<div style="background:#f0f0f0; padding: 1em;">

<h3>Points</h3>

<p>While tiers and medals are permanent representations of a data scientist’s achievements, points are designed to decay over time. This keeps Kaggle’s rankings contemporary and competitive. All points awarded decay in a consistent way using the formula below:</p>

$$e^{-t/500}$$

In this formula, t is the number of days elapsed since the point was awarded. 


<h3>Notebook Medals</h3>

Notebook Medals are awarded to popular notebooks, as measured by the number of upvotes a notebook receives. Not all upvotes count towards medals: self-votes, votes by novices, and votes on old posts are excluded from medal calculation.

<ul>
<li>Bronze : 5 Votes
<li>Silver : 20 Votes
<li>Gold : 50 Votes
</ul>

</div>

___________________________


We will see in the notebook that the votes excluded from medals ("self-votes, votes by novices") do not count towards ranking points either.
There is a 0.97 correlation between my computation of points and Kaggle's own published figures. Most users match very well but, intriguingly, some users are WAY off.
Kaggle seems to have boosted the points for some users and taken points away from others.


***Why?***


## Contents

 * [Vote Counts](#Vote-Counts)
 * [V1: Sum of Votes](#V1:-Sum-of-Votes)
 * [V2: Exclude Self Votes and Novices](#V2:-Exclude-Self-Votes-and-Novices)
 * [V3: Add Time Decay](#V3:-Add-Time-Decay)
 * [Comparing Estimate to True Points: Plotly](#Comparing-Estimate-to-True-Points:-Plotly)
 * [Comparing Ranks of Estimated Points: Plotly](#Comparing-Ranks-of-Estimated-Points:-Plotly)
 * [Users With Fewer Points Than Expected](#Users-With-Fewer-Points-Than-Expected)
 * [Users With More Points Than Expected](#Users-With-More-Points-Than-Expected)
 * [Conclusions](#Conclusions)


In [1]:
from jt_mk_utils import *

In [2]:
import os, sys, re, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from IPython.display import HTML, Image, display
from sklearn import linear_model

TIER_COLORS = np.asarray(["green", "blue", "purple", "orange", "gold", "black"])
TIERS = np.asarray(["Novice", "Contributor", "Expert", "Master", "GrandMaster", "Staff"])

In [3]:
plt.rc('figure', figsize=(15, 9))
plt.rc('font', size=14)
plt.style.use('bmh')

In [4]:
plt.rc('figure', figsize=(15, 9))
plt.rc('font', size=14)

def plt_log_scales():
    plt.yscale('symlog')
    plt.xscale('symlog')
    plt.ylim(10)
    plt.xlim(10)

In [5]:
kernels = read_kernels(index_col=0)
kernels.shape

In [6]:
votes = read_kernel_votes(index_col=0)
votes.shape

In [7]:
ver = read_kernel_versions(
    usecols=['Id', 'ScriptId', 'AuthorUserId']).set_index('Id')
ver.shape

In [8]:
votes = votes.join(ver, on='KernelVersionId', how='inner')
votes.count()

In [9]:
UIDS = set(votes.AuthorUserId) | set(votes.UserId)
users = read_users(filter=('Id', UIDS)).set_index('Id')
users.shape

In [10]:
votes = votes.join(users[['PerformanceTier']], on='UserId') # UserId is ID of voter

# Vote Counts

Since 2020, votes have averaged around 2000 per day.

In [11]:
votes['VoteDate'].value_counts().sort_index().plot()
plt.title('Notebook Votes');


***Unfortunately***: Novices (Tier 0) are the only tier whose votes *do not count towards the points/rankings*, yet they are by far the largest class of voters!

In [12]:
votes['PerformanceTier'].dropna().astype(int).hist()
plt.title('Voter Tiers');


### Notebook rankings and points are supposed to be in UserAchievements.csv, however:

[Bug: Users missing from UserAchievements](https://www.kaggle.com/kaggle/meta-kaggle/discussion/181048)

... so I am using a [Kaggle Notebook User Rankings dataset](https://www.kaggle.com/jtrotman/kaggle-notebook-user-rankings) with a saved version of the rankings from the website.

In [13]:
path = '/kaggle/input/kaggle-notebook-user-rankings/NotebookRankings.csv'

user_achievements = pd.read_csv(
    path,
    index_col='UserId',
    dtype={'Points': 'int'},
    parse_dates=['RegisterDate'])
user_achievements['NotebookCount'] = kernels.AuthorUserId.value_counts()
user_achievements.count()

# V1: Sum of Votes

In [14]:
votes['weight'] = 1

In [15]:
user_achievements['VoteCount'] = votes.groupby('AuthorUserId')['weight'].sum()

In [16]:
user_achievements[['Points', 'VoteCount']].corr()

In [17]:
title = 'Top Kaggle Notebook Authors'
user_achievements.plot.scatter('VoteCount', 'Points', title=title);
plt_log_scales();

# V2: Exclude Self Votes and Novices

In [18]:
invalid_vote_index = ((votes.UserId == votes.AuthorUserId) |
                      (votes.PerformanceTier == 0))
votes.loc[invalid_vote_index, 'weight'] = 0

In [19]:
user_achievements['VoteCount2'] = votes.groupby('AuthorUserId')['weight'].sum()

In [20]:
user_achievements[['Points', 'VoteCount', 'VoteCount2']].corr()

In [21]:
user_achievements.plot.scatter('VoteCount2', 'Points', title=title);
plt_log_scales();

# V3: Add Time Decay
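
As a quick sanity check on the decay formula quoted in the intro, the sketch below evaluates $e^{-t/500}$ for a few vote ages. The helper name `decay_weight` and the chosen ages are made up for illustration; they are not part of this notebook's pipeline or of Kaggle's code.

```python
import numpy as np

def decay_weight(days_elapsed, scale=500.0):
    """Weight of one point awarded `days_elapsed` days ago: exp(-t / scale)."""
    return np.exp(-np.asarray(days_elapsed, dtype=float) / scale)

# Half-life: the age at which a vote is worth half a point.
half_life_days = 500.0 * np.log(2)

for t in [0, 100, 365, 500, 1000]:
    print(f"{t:5d} days -> weight {float(decay_weight(t)):.3f}")
print(f"half-life = {half_life_days:.1f} days")
```

Under this formula a vote loses half its value after roughly 347 days, and a vote that is 500 days old is worth about 0.37 of a fresh one.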

In [22]:
# "age" is <= 0: days relative to the most recent vote in the data (more negative = older)
age = votes.VoteDate - votes.VoteDate.max()
age.describe()

In [23]:
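# age is negative, so exp(age / 500) is the decay factor e^{-t/500} with t = -age
votes['weight'] = votes['weight'].astype(float)  # avoid writing floats into an int column
votes.loc[~invalid_vote_index, 'weight'] = np.exp(age.dt.days / 500)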

In [24]:
votes['weight'].describe()  # max weight is 1, many still 0

In [25]:
votes['weight'].hist(bins=50, color='g')
plt.title('Distribution of Notebook Vote "Weights"');

In [26]:
user_achievements['PointsEstimate'] = votes.groupby('AuthorUserId')['weight'].sum()
user_achievements['PointsEstimate'] = user_achievements['PointsEstimate'].fillna(1).round(0).astype(int)

In [27]:
user_achievements[['Points', 'VoteCount', 'VoteCount2', 'PointsEstimate']].corr().style

In [28]:
user_achievements.plot.scatter('PointsEstimate', 'Points', title=title);
plt_log_scales();

In [29]:
df = user_achievements.copy() # join(users.drop(['DisplayName', 'UserName'], 1))

In [30]:
# %-d (day of month without a leading zero) is a platform-specific extension; it works on Linux
df['RegisterDateText'] = df.RegisterDate.dt.strftime('%a %-d %B %Y')

In [31]:
df['TierName'] = TIERS[df.Tier]

# Comparing Estimate to True Points: Plotly

Who are the outliers above? Hover to see.

Top tip: click or double-click on the TierName values in the legend to turn points on or off.

In [32]:
px.scatter(df,
           'PointsEstimate',
           'Points',
           color='TierName',
           log_x=True,
           log_y=True,
           hover_name='DisplayName',
           hover_data=[
               'CurrentRanking', 'NotebookCount', 'TotalGold', 'TotalSilver',
               'TotalBronze', 'UserName', 'RegisterDateText'
           ],
           title=title)

# Comparing Ranks of Estimated Points: Plotly

Points are only used to determine a ranking, so the effect of a mismatch is best shown on a rank vs rank plot.

Those under the diagonal (estimated rank better than published rank) have been demoted; those over it have been promoted/boosted.
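
To make the sign convention concrete, here is a small sketch with made-up numbers for three hypothetical users. Only the column names mirror the ones used in this notebook; the values are invented.

```python
import pandas as pd

# Made-up numbers for three hypothetical users; only the column names
# mirror the ones used in this notebook.
toy = pd.DataFrame(
    {
        "CurrentRanking": [1, 2, 3],
        "PointsEstimate": [50, 300, 100],
    },
    index=["A", "B", "C"],
)

# Rank 1 goes to the largest estimate.
toy["EstimatedRank"] = toy["PointsEstimate"].rank(ascending=False)
toy["EstimatedRankDiff"] = toy["CurrentRanking"] - toy["EstimatedRank"]

# Negative diff: published rank is better than the estimate suggests (boosted).
# Positive diff: published rank is worse than the estimate suggests (demoted).
print(toy)
```

In this toy data, user A is published at rank 1 but estimated at rank 3, so A would sit above the diagonal on the plot (boosted), while B and C would sit below it (demoted).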


In [33]:
df['PointsEstimateRatio'] = df.eval('Points/PointsEstimate')
df['PointsEstimateDiff'] = df.eval('Points-PointsEstimate')
df['EstimatedRank'] = df.PointsEstimate.rank(ascending=False)
df['EstimatedRankDiff'] = df['CurrentRanking'] - df['EstimatedRank']

px.scatter(df,
           'CurrentRanking',
           'EstimatedRank',
           color='TierName',
           hover_name='DisplayName',
           hover_data=[
               'UserName', 'RegisterDateText',
               'CurrentRanking', #'HighestRanking',
               'NotebookCount',
               'TotalGold', 'TotalSilver', 'TotalBronze',
               'Points', 'PointsEstimate'
           ],
           title=title)

In [34]:
SHOW = ['User', 'Points', 'PointsEstimate', 'PointsEstimateRatio', 'PointsEstimateDiff']
BC = '#c0c0c0'

# f'HighestRanking: {r.HighestRanking}\n'
def user_name_link(r):
    return (f'<a href="https://www.kaggle.com/{r.UserName}" '
            f' title="UserName: {r.UserName}\n'
            f'Tier: {TIERS[r.Tier]}\n'
            f'RegisterDate: {r.RegisterDate.date()}\n'
            f'CurrentRanking: {r.CurrentRanking:.0f}\n'
            f'EstimatedRank: {r.EstimatedRank:.0f}\n'
            f'EstimatedRankDiff: {r.EstimatedRankDiff:.0f}\n'
            f'NotebookCount: {r.NotebookCount:.0f}\n'
            f'TotalGold: {r.TotalGold}\n'
            f'TotalSilver: {r.TotalSilver}\n'
            f'TotalBronze: {r.TotalBronze}\n'
            f'VoteCount: {r.VoteCount:.0f}">'
            f'{r.DisplayName}</a>')

def fmt_df(df):
    df = df.assign(User=df.apply(user_name_link, axis=1))
    return df

def show_df(df):
    return df[SHOW].set_index('User').style.bar(color=BC, width=85).format({'PointsEstimateRatio': lambda v: f'{v:.3f}'})

# Users With Fewer Points Than Expected

In [35]:
show_df(fmt_df(df.sort_values('PointsEstimateDiff', ascending=True).head(100)))

# Users With More Points Than Expected

Possible reasons?
 - The top 3-4 are ex-Kaggle staff; maybe the time decay on their votes was reset when they left?
 - A few have a lot of collaborators on their notebooks(?)


In [36]:
show_df(fmt_df(df.sort_values('PointsEstimateDiff', ascending=False).head(100)))

In [37]:
# save a snapshot of the source dataset with our extra computations
user_achievements.to_csv('NotebookRankings.csv', float_format='%.0f')

# Conclusions

As I said in the intro, Kaggle is pretty open about how the voting system works.
However, there is clearly an adjustment layer that means some votes count for less.

I will not try to reverse engineer that system, but we can see the results of it above.
Informally, I have noticed some effects:

 - ***Too much reciprocation*** *A votes for B, then B votes for A*
 - ***Too fast*** *A votes for B say 30 times in just a few minutes (check the KernelVotes.csv for evidence that this happens a lot!)*
 - ***Lack of voter diversity*** *A has 7 votes on 5 Notebooks but always the same 7 users &rarr; No Bronze!*
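
For anyone who wants to look for the first and third patterns themselves, here is a rough sketch on a toy table. It does not reproduce Kaggle's adjustment, which is unknown; the ids are invented, and the column names `UserId` (voter) and `AuthorUserId` (notebook author) follow the joined `votes` frame used earlier in this notebook.

```python
import pandas as pd

# Toy votes with made-up ids: UserId is the voter, AuthorUserId the notebook author.
toy_votes = pd.DataFrame(
    {
        "UserId":       [1, 2, 1, 3, 2, 3],
        "AuthorUserId": [2, 1, 2, 1, 3, 4],
    }
)

# Count votes per (voter, author) pair.
pair_counts = (
    toy_votes.groupby(["UserId", "AuthorUserId"]).size().rename("Votes").reset_index()
)

# A pair is reciprocal when the reversed pair also exists.
reversed_pairs = pair_counts.rename(
    columns={"UserId": "AuthorUserId", "AuthorUserId": "UserId", "Votes": "VotesBack"}
)
reciprocal = pair_counts.merge(reversed_pairs, on=["UserId", "AuthorUserId"])

# Voter diversity: distinct voters per author next to the raw vote count.
diversity = toy_votes.groupby("AuthorUserId")["UserId"].agg(
    Votes="count", DistinctVoters="nunique"
)

print(reciprocal)
print(diversity)
```

Here users 1 and 2 vote for each other, so both directions show up in `reciprocal`, and author 2 has two votes from a single distinct voter. Any thresholds for what counts as "too much" would be guesswork.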
 
