# From Dabbler to Kaggler

## Reflections on the first "30 Days of Machine Learning Challenge"

The [first "30 Days of Machine Learning Challenge"](https://www.kaggle.com/c/30-days-of-ml) was a tremendous learning experience. I learned a lot from all the participants but especially from [Alexis Cook](https://www.kaggle.com/alexisbcook), [Abhishek Thakur](https://www.kaggle.com/abhishek) (I highly commend his [videos](https://www.youtube.com/watch?v=_55G24aghPY&list=PL98nY_tJQXZnP-k3qCDd1hljVSciDV9_N), [notebooks](https://www.kaggle.com/abhishek/competition-part-1-baseline), and [his book](https://www.amazon.com/Approaching-Almost-Machine-Learning-Problem-ebook/dp/B089P13QHT) which is essential reading for all Kagglers), and [Luca Massaron](https://www.kaggle.com/lucamassaron) (himself [a prolific writer on data science](https://www.amazon.com/Luca-Massaron/e/B00RW7GV02%3Fref=dbs_a_mng_rwt_scns_share)).

## Top 2%: Ranking #139

I finished at #139 on the private leaderboard (LB), a result I am happy with. My final notebook is [Experiment 11B: Blending to Stacking](https://www.kaggle.com/gauden/experiment-11b-blending-to-stacking). The notebook holds no major insight: I gathered experience from the community and built my own version of the common ensemble. Thanks to all!

## Souvenir: Leaderboard Dataset

As a souvenir of the event, I downloaded the leaderboards in [both HTML and CSV](https://www.kaggle.com/gauden/30dmlleaderboards) formats. In [this notebook](https://www.kaggle.com/gauden/top-2-30dml-dabbler-to-kaggler), I provide a script to scrape the data into a dataset and add some Plotly graphics, in the hope that the participants in the challenge will enjoy hovering over the plots and seeking their own Team.

The dataset is open for exploration: interact with the plots below, or use the raw data and the scraping script to build your own graphics.

---

## Would you consider upvoting?

#### Do you have ideas for more plots? Do comment below...

##### And now, on to the [next challenge](https://www.kaggle.com/c/tabular-playground-series-sep-2021)!


## `:-)`



In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

# Data scraping tools
from bs4 import BeautifulSoup

# Display tools
from IPython.display import display, Markdown  # public import path; IPython.core.display is deprecated for this
import plotly.express as px

In [None]:
PUB = Path('../input/30dmlleaderboards/public_lb.html')
PRIV = Path('../input/30dmlleaderboards/private_lb.html')
CSV_PUB = Path('../input/30dmlleaderboards/30-days-of-ml-publicleaderboard/30-days-of-ml-publicleaderboard.csv')

# Script: Scrape both Leaderboards into a single DataFrame

The scraping script is lengthy, so it is hidden in the cell below. Feel free to use it if this dataset interests you. I chose to share the raw data and the script rather than the final DataFrame, since that seemed more in the spirit of the challenge and respects Kaggle's ownership of the data. (Someone please tell me in the comments if this is not the done thing.)
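
For readers who only want the gist, here is a minimal sketch of the core idea: each leaderboard row is a `<tr>` whose cells carry a `data-th` label, and the script looks cells up by that label. The HTML snippet, the team names, and the helper `scrape_rows` are made up for illustration and are far simpler than the real page. The sketch also uses the standard-library `html.parser` backend rather than the `lxml-xml` parser used in the full script, purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified leaderboard markup (not the real Kaggle page)
SAMPLE_HTML = """
<table>
  <tr class="competition-leaderboard__row">
    <td data-th="Rank">1</td>
    <td data-th="Team Name">  Team   Alpha </td>
    <td data-th="Score">0.71533</td>
  </tr>
  <tr class="competition-leaderboard__row">
    <td data-th="Rank">2</td>
    <td data-th="Team Name">Team Beta</td>
    <td data-th="Score">0.71540</td>
  </tr>
</table>
"""


def scrape_rows(html):
    """Return one dict per leaderboard row, keyed by each cell's data-th label."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="competition-leaderboard__row"):
        record = {}
        for key in ("Rank", "Team Name", "Score"):
            cell = tr.find("td", {"data-th": key})
            # Collapse runs of whitespace, much as _strip_all_spaces does for a Series
            record[key] = " ".join(cell.text.split())
        rows.append(record)
    return rows


sample_rows = scrape_rows(SAMPLE_HTML)
print(sample_rows)
```

Every value comes back as a string at this stage; in the full script the conversion to numeric and datetime dtypes happens later, in `load_data`.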

In [None]:
def _strip_all_spaces(series):
    # clean up internal and external spaces from a pandas series
    return series.replace(r"\s+", " ", regex=True).str.strip()

def _extract_kernel(element):
    # scrape title and href attributes from a single BeautifulSoup anchor element
    try:
        anchor = element.find('a')
        title = anchor.get('title')
        href = anchor.get('href')
    except AttributeError:
        title, href = "", ""
    return title, href

def _load_pub_dataframe(pathname=PUB):
    """
    Scrape the key data items from the HTML version of the public leaderboard.
    
    Return a pandas DataFrame
    """
    with pathname.open() as f:
        soup = BeautifulSoup(f,'lxml-xml')

    recs = soup.find_all('tr', class_=["competition-leaderboard__row",
                                       "competition-leaderboard__row competition-leaderboard__row--user-scored"
                                      ])
    rows = []
    for rec in recs:
        record = dict()

        keys = ["Rank", "Team Name", "Kernel", "Score", "Number of Entries"]
        for key in keys:
            if key != "Kernel":
                record[key] = rec.find('td', {"data-th" : key}).text
            else:
                element = rec.find('td', {"data-th" : key})
                record['KernelTitle'], record['KernelHref'] = _extract_kernel(element)

        record["Last Entry"] = rec.find_all('span')[-1]['title']
        rows.append(record)

    df = pd.DataFrame(rows)
    df["Last Entry"] = pd.to_datetime(df["Last Entry"].str.split().str[0:6].str.join(" "))
    df["Team Name"] = _strip_all_spaces( df["Team Name"] )
    df = df[df["Team Name"].notna()]
    df.columns = ['Rank', 'TeamName', 'KernelTitle', 'KernelHref', 'Score', 'Entries', 'Latest']
    df = df.drop(['Score', 'Latest'], axis='columns')    
    return df

def _load_priv_dataframe(pathname=PRIV):
    """
    Scrape the key data items from the HTML version of the private leaderboard.
    
    Return a pandas DataFrame
    """
    with pathname.open() as f:
        soup = BeautifulSoup(f,'lxml-xml')

    recs = soup.find_all('tr', class_=["competition-leaderboard__row",
                                       "competition-leaderboard__row competition-leaderboard__row--user-scored"
                                      ])
    rows = []
    for rec in recs:
        record = dict()
        keys = ["Rank", "Team Name", "Score"]
        for key in keys:
            record[key] = rec.find('td', {"data-th" : key}).text
            
        change_span = rec.find('td', {"data-th" : "Change"}).find('span')
        if change_span.find('span', class_="position-change__none"):
            change = (0, 0)
        elif change_span.find('span', class_="position-change__risen"):
            change = (1, int(change_span.find('span', class_="position-change__risen").text))
        else:
            change = (-1, int(change_span.find('span', class_="position-change__fallen").text))
            
        record["ChangeDirection"], record["ChangeNo"] = change
        rows.append(record)
        
    df = pd.DataFrame(rows)
    df["Team Name"] = _strip_all_spaces( df["Team Name"] )
    df.columns = ['PrivRank', 'TeamName', 'PrivScore', 'ChangeDirection', 'ChangeNo']
    return df

def _load_csv_dataframe(pathname=CSV_PUB):
    """
    Read the CSV version of the public leaderboard as downloaded from Kaggle.
    (This will be merged with the scraped version).
    
    Return a pandas DataFrame
    """
    df = pd.read_csv(pathname)
    df.TeamName = _strip_all_spaces(df.TeamName)  # reuse the shared whitespace helper
    
    return df

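def _load_dataframe(pathname):
    # entry point for loading either version of the public leaderboard
    if pathname == PUB:
        return _load_pub_dataframe()
    if pathname == CSV_PUB:
        return _load_csv_dataframe()
    # fail loudly instead of returning an unbound `df` for an unexpected path
    raise ValueError(f"Unexpected leaderboard path: {pathname}")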

def _load_and_merge_public_lb():
    """
    Load and clean the two versions (HTML and CSV) of the public leaderboard.
    Then merge the two on TeamName.
    
    Return a merged pandas DataFrame.
    """
    df = _load_dataframe(PUB)
    df_pub = _load_dataframe(CSV_PUB)
    
    df = df[df["TeamName"].isin(df_pub["TeamName"])]
    df_pub = df_pub[df_pub["TeamName"].isin(df["TeamName"])]
    
    final = df.merge(df_pub, on="TeamName", how="left")
    final.columns = ['PubRank', 'TeamName', 'KernelTitle', 'KernelHref', 
                     'Entries', 'TeamId', 'SubmissionDate', 'PubScore']
    return final

def load_data():
    """
    Load all data (two versions of public and scraped version of private leaderboard).
    Merge into one DataFrame and set dtypes correctly.
    
    Return a pandas DataFrame
    """
    pub_df = _load_and_merge_public_lb()
    priv_df = _load_priv_dataframe()
    df = pub_df.merge(priv_df, on="TeamName", how="left")
    
    # set dtypes
    type_dict = {
        'PubRank': 'int32',
        'Entries': 'int32',
        'SubmissionDate': 'datetime64[ns]',  # pandas 2.x rejects the unit-less 'datetime64'
        'PubScore': 'float64',
        'PrivRank': 'int32',
        'PrivScore': 'float64',
        'ChangeDirection': 'category',
        'ChangeNo': 'int32',
    }
    for key, value in type_dict.items():
        df[key] = df[key].astype(value)
    
    # reorder columns
    new_order = ['TeamId', 'TeamName', 'PubRank', 'PubScore', 'PrivRank', 
                 'PrivScore', 'ChangeDirection', 'ChangeNo', 'SubmissionDate', 
                 'Entries', 'KernelTitle', 'KernelHref'
                ]
    df = df[new_order]
    return df

In [None]:
DF = load_data()
DF.sample(5)

# Limit the number of teams to include in the plots

Change the value of `RANK` to taste.
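
As a quick illustration of what the cutoff does, the sketch below applies the same `<=` filter to a toy frame. The team names, ranks, and the name `rank_cutoff` are hypothetical; the point is only that the cutoff is inclusive, so a team ranked exactly at the cutoff is kept.

```python
import pandas as pd

# Hypothetical teams and ranks, for illustration only
toy = pd.DataFrame({
    "TeamName": ["A", "B", "C", "D"],
    "PrivRank": [1, 250, 500, 501],
})

rank_cutoff = 500  # plays the role of RANK in the cell below

# Keep teams whose private rank is at or above the cutoff position
subset = toy[toy["PrivRank"] <= rank_cutoff].copy()
print(subset["TeamName"].tolist())
```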

In [None]:
RANK = 500  # Include teams ranked 1 to RANK (inclusive) on the PRIVATE leaderboard

# Build plots on a subset of the DataFrame, defined by the value of RANK
SOURCE = DF[DF['PrivRank'] <= RANK].copy()

# Scatterplot: Private Rank by Number of Entries

In [None]:
x_data = SOURCE["Entries"]
y_data = SOURCE["PrivRank"]
color_data = SOURCE["ChangeDirection"]
size_data = SOURCE["Entries"]


fig = px.scatter(SOURCE, x=x_data, y=y_data, color=color_data, 
                 size=size_data, opacity=0.75, hover_data=['TeamName'])
fig.show()

# Scatterplot: Number of Entries by Submission Date

In [None]:
x_data = SOURCE["SubmissionDate"]
y_data = SOURCE["Entries"]
color_data = SOURCE["PrivRank"]
size_data = SOURCE["Entries"]


fig = px.scatter(SOURCE, x=x_data, y=y_data, color=color_data, size=size_data, opacity=0.75, hover_data=['TeamName'])
fig.show()

# Scatterplot: Private Rank by Private Score

In [None]:
x_data = SOURCE["PrivRank"]
y_data = SOURCE["PrivScore"]
color_data = SOURCE["ChangeDirection"]
size_data = SOURCE["Entries"]


fig = px.scatter(SOURCE, x=x_data, y=y_data, color=color_data, size=size_data, opacity=0.75, hover_data=['TeamName'])
fig.show()

# Scatterplot: Public Rank by Public Score

In [None]:
x_data = SOURCE["PubRank"]
y_data = SOURCE["PubScore"]
color_data = SOURCE["ChangeDirection"]
size_data = SOURCE["Entries"]


fig = px.scatter(SOURCE, x=x_data, y=y_data, color=color_data, size=size_data, opacity=0.75, hover_data=['TeamName'])
fig.show()

# List of Ranked Notebooks

In [None]:
kernels = DF[DF.KernelHref.str.len() > 0].sort_values(by="PrivRank")
kernels = kernels[['PrivRank', 'PrivScore', 'KernelTitle', 'KernelHref']]

markdown = """
| Private Rank | Private Score | Notebook |
|--------------|--------------:|---------:|
"""
for _, rec in kernels.iterrows():  # unpack (index, row) and ignore the index
    priv_rank = rec['PrivRank']
    priv_score = rec['PrivScore']
    title = " ".join(rec['KernelTitle'].split())
    title = title.replace("|", "/")
    url = rec['KernelHref']
    anchor = f"[{ title }](https://kaggle.com{ url })"
    line = f"| {priv_rank} | {priv_score}  | {anchor} |\n"
    markdown += line
    
display(Markdown(markdown))