# Introduction
This notebook contains the steps to complete day 1 of the Dashboarding with Notebooks tutorial by Rachael Tatman at Kaggle (https://www.kaggle.com/rtatman/dashboarding-with-notebooks-day-1).

# Dataset selection
This tutorial will be completed using data from the Meta Kaggle dataset (described and available for download here: https://www.kaggle.com/kaggle/meta-kaggle). This dataset is a nice combination of interesting and easy to work with - perhaps unsurprisingly, Kaggle maintains their datasets well.

# Data preparation
The Meta Kaggle dataset is downloaded and its files are read into pandas DataFrames.

## Download
Download the data using the Kaggle API and unzip.

In [1]:
# Download and unzip data.
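
The download cell above is left empty in the original. As a rough sketch only, the step could be done by calling the Kaggle command-line tool from Python and then extracting the archive with the standard library. The function names below are hypothetical, the dataset slug `kaggle/meta-kaggle` is taken from the URL above, and the target directory `../input` is assumed to match the path used in the Read section.

```python
import subprocess
import zipfile
from pathlib import Path


def build_download_command(dataset, target_dir):
    """Return the Kaggle CLI command that downloads a dataset archive."""
    return ['kaggle', 'datasets', 'download', '-d', dataset, '-p', str(target_dir)]


def unzip_all(target_dir):
    """Extract every .zip archive in target_dir; return the extracted file names."""
    extracted = []
    for archive in sorted(Path(target_dir).glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target_dir)
            extracted.extend(zf.namelist())
    return extracted


def download_meta_kaggle(target_dir=Path('../input')):
    """Hypothetical helper: download Meta Kaggle, then unzip it in place."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumes the `kaggle` CLI is installed and API credentials are configured.
    subprocess.run(build_download_command('kaggle/meta-kaggle', target_dir), check=True)
    return unzip_all(target_dir)
```

Calling `download_meta_kaggle()` in a cell would run both steps. It needs network access and a Kaggle API token, so it is not executed here.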

## Read
Read the data files.

In [1]:
from pathlib import Path  # Easy-to-use, cross-platform path-to-file.

import numpy as np
import pandas as pd

In [2]:
path_to_data = Path('../input')

# Load users data.
users = pd.read_csv(path_to_data / 'Users.csv', parse_dates=['RegisterDate'], dayfirst=False)
users.head()

In [3]:
# Load competitions data (takes a little while).
competitions = pd.read_csv(path_to_data / 'Competitions.csv', 
                           parse_dates=['EnabledDate', 'DeadlineDate', 'ProhibitNewEntrantsDeadlineDate'], 
                           dayfirst=False)
competitions.head()

# Visualisation

In [4]:
import matplotlib.pyplot as plt  # Quick plotting.
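import seaborn as sns  # Nicer default plot styling.
sns.set()  # Since seaborn 0.8 the style must be applied explicitly; importing alone does nothing.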
%matplotlib inline

The following plot shows the number of new users on each day as a time-series.

In [5]:
# Group by RegisterDate and count unique Id values.
new_users_per_day = users.groupby('RegisterDate').agg({'Id': 'nunique'}).rename({'Id': 'NewUsers'}, axis=1)

new_users_per_day.plot(title='New users per day', figsize=(15, 7), grid=True, legend=False)

The rate of new user sign-ups has been accelerating since Kaggle's inception in 2010. In the later years there is evidence of seasonal effects, with a noticeable peak in each half of the year. Kaggle could use this information to anticipate future demand on their service and plan appropriately.
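
One way to look more closely at the trend and the seasonal pattern is to smooth the daily counts and to average them by calendar month. The sketch below is not part of the original tutorial: the helper names are hypothetical, the data is a made-up toy series, and it assumes a Series of daily counts with a `DatetimeIndex` and one row per day.

```python
import numpy as np
import pandas as pd


def smooth_daily_counts(daily_counts, window=30):
    """Rolling mean over `window` rows to damp day-to-day noise."""
    return daily_counts.rolling(window=window, min_periods=1).mean()


def average_by_calendar_month(daily_counts):
    """Mean daily count for each calendar month (1-12), pooled over all years."""
    return daily_counts.groupby(daily_counts.index.month).mean()


# Toy data (invented): 100 sign-ups per day, 200 per day in March and October.
dates = pd.date_range('2017-01-01', '2018-12-31', freq='D')
toy = pd.Series(np.where(dates.month.isin([3, 10]), 200, 100), index=dates)

smoothed = smooth_daily_counts(toy)
by_month = average_by_calendar_month(toy)
```

With the notebook's data, the same helpers could be applied to `new_users_per_day['NewUsers']`. Because the real series has a strong upward trend, the pooled monthly averages should be read as a rough check rather than a proper seasonal decomposition.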

The next figure shows a bar chart of the median number of competitors for different reward types. Note that an average is used rather than the sum to control for the differing frequencies of competitions with each reward type (the vast majority of competitions offer cash rewards, which drives up their total number of competitors); the median is used rather than the mean to control for certain competitions (such as the introductory/tutorial competitions) having extremely large numbers of competitors.

In [6]:
# Replace USD & EUR --> Cash (in the RewardType column only, so other columns are untouched).
competitions['RewardType'] = competitions['RewardType'].replace(['USD', 'EUR'], 'Cash')

# Median number of competitors for each reward type.
rewardtype_vs_competitors = competitions.groupby('RewardType').agg({'TotalCompetitors': 'median'})
rewardtype_vs_competitors.sort_values(by='TotalCompetitors', ascending=False, inplace=True)

ax = rewardtype_vs_competitors.plot(title='Competitors by reward type', 
                                    figsize=(15, 7), grid=True, legend=False, kind='bar')
ax.set_xlabel('Reward type')
ax.set_ylabel('Median number of competitors')

Apparently, competitions offering jobs attract far more competitors (by median) than those offering other reward types. Sadly, it would appear that the worst reward for attracting competitors is knowledge. This information could help competition organizers understand how best to motivate people to participate.
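
The reasoning for choosing the median over the sum or the mean can be checked on a small example. The numbers below are invented purely for illustration and are not taken from Meta Kaggle; only the column names mirror the real data.

```python
import pandas as pd

# Made-up competitions: many cash competitions, few jobs competitions,
# and one knowledge competition with an extreme number of competitors.
toy = pd.DataFrame({
    'RewardType': ['Cash'] * 5 + ['Jobs'] * 2 + ['Knowledge'] * 3,
    'TotalCompetitors': [100, 200, 300, 400, 500, 600, 800, 50, 60, 10000],
})

# Compare the three candidate statistics side by side.
summary = toy.groupby('RewardType')['TotalCompetitors'].agg(['sum', 'mean', 'median'])
```

In this toy table the sum ranks Cash (1500) above Jobs (1400) only because there are more cash competitions, and the mean for Knowledge (3370) is dominated by a single outlier. The median (Jobs 700, Cash 300, Knowledge 60) is affected by neither.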