<h2>Individual Project Proposal</h2>

**(1) Data Description**

Dataset: players.csv

Number of observations: 196

Number of variables: 9

Variables include:
- experience (object): whether a player is Pro, Veteran, Amateur, Regular or Beginner
- subscribe (bool): whether the player is subscribed to the game-related newsletter
- hashedEmail (object): hashed email of the player
- played_hours (float64): number of played hours
- name (object): name of the player
- gender (object): gender of the player 
- age (int64): age of the player
- individualId (float64): id of the player
- organizationName (float64): player's organization name

Issues in the dataset:
- all individualId and organizationName are null
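
A quick way to confirm this kind of issue is to list the columns in which every value is missing. The sketch below uses a small made-up frame (`toy_players` is hypothetical, not the real players.csv) and only illustrates the check:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players.csv; the values are made up for illustration.
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing.
all_null_columns = toy_players.columns[toy_players.isna().all()].tolist()

# Drop them so they cannot be used as predictors by accident.
toy_players_clean = toy_players.drop(columns=all_null_columns)
print(all_null_columns)
```

Running the same two lines on the real `players` frame would show whether individualId and organizationName are the only fully empty columns.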


**(2) Question**

Broad question picked: *We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.*

Specific question: *How well can age, gender, subscription status, and experience level predict a player's played hours?*

To answer this question, all relevant variables are in players.csv. The response variable will be played_hours, and the explanatory variables will be age, gender, subscribe, and experience.

**(3) Exploratory Data Analysis and Visualization**

In [14]:
import pandas as pd
import altair as alt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.model_selection import train_test_split

In [15]:
players = pd.read_csv("https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz")
players.head(10)

| Unnamed: 0 | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |
| 5 | Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee... | 0.0 | Adrian | Female | 17 | | |
| 6 | Regular | True | 8e594b8953193b26f498db95a508b03c6fe1c24bb5251d... | 0.0 | Luna | Female | 19 | | |
| 7 | Amateur | False | 1d2371d8a35c8831034b25bda8764539ab7db0f6393869... | 0.0 | Emerson | Male | 21 | | |
| 8 | Amateur | True | 8b71f4d66a38389b7528bb38ba6eb71157733df7d17403... | 0.1 | Natalie | Male | 17 | | |
| 9 | Veteran | True | bbe2d83de678f519c4b3daa7265e683b4fe2d814077f90... | 0.0 | Nyla | Female | 22 | | |


We will first split the data into a training set and a testing set, so that we perform EDA only on the training set.

In [16]:
training_df, testing_df = train_test_split(
    players,
    test_size=0.25,
    random_state=2000,
)

First, let's get a sense of the distribution of each predictor variable and of the response variable.

In [17]:
age_distribution = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("age:Q", title="Age"),
    y=alt.Y('count()', title="Number of Players")
).properties(
    title="Distribution of Age"
)

played_hours_distribution = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("played_hours:Q", title="Played Hours"),
    y=alt.Y('count()', title="Number of Players")
).properties(
    title="Distribution of Played Hours"
)

experience_distribution = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("experience:N", title="Experience"),
    y=alt.Y('count()', title="Number of Players")
).properties(
    title="Distribution of Experience"
)

gender_distribution = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("gender:N", title="Gender"),
    y=alt.Y('count()', title="Number of Players")
).properties(
    title="Distribution of Gender"
)

subscribe_distribution = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("subscribe:N", title="Subscribe"),
    y=alt.Y('count()', title="Number of Players")
).properties(
    title="Distribution of Subscribe"
)
age_distribution | played_hours_distribution | experience_distribution | gender_distribution | subscribe_distribution

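For our numerical predictor variable, age, we can create a scatterplot against played_hours to see whether there is any linear pattern. Because played_hours is heavily skewed, the y-axis shows log(1 + played_hours) rather than the raw values.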

In [18]:
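# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the split
training_df = training_df.assign(log_played_hours=np.log1p(training_df["played_hours"]))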

alt.Chart(training_df).mark_point(opacity=0.5).encode(
    x=alt.X('age:Q').title("Age"),
    y=alt.Y('log_played_hours:Q').title("log(1 + Played Hours)"),
).properties(title='Played Hours vs Age')

For each categorical predictor variable, we can create bar plots to visualize how average played hours vary across categories.

In [19]:
average_played_hours_by_experience = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("experience:N", title="Experience"),
    y=alt.Y('mean(played_hours):Q', title="Average Played Hours"),
    color=alt.Color("experience:N", legend=None)
).properties(
    title="Average Played Hours vs Experience"
)

average_played_hours_by_subscribe = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("subscribe:N", title="Subscribe"),
    y=alt.Y('mean(played_hours):Q', title="Average Played Hours"),
    color=alt.Color("subscribe:N", legend=None)
).properties(
    title="Average Played Hours vs Subscribe"
)

average_played_hours_by_gender = alt.Chart(training_df).mark_bar().encode(
    x=alt.X("gender:N", title="Gender"),
    y=alt.Y('mean(played_hours):Q', title="Average Played Hours"),
    color=alt.Color("gender:N", legend=None)
).properties(
    title="Average Played Hours vs Gender"
)
average_played_hours_by_experience | average_played_hours_by_subscribe | average_played_hours_by_gender

**Insights:**
- The distribution of age is heavily centered around 15- to 30-year-olds. There are some outliers (age 90-100) that may affect our model if we choose to perform linear regression.
- The distribution of played_hours is heavily skewed towards small values.
- There are far more male players than players of other genders.
- While amateur players are the largest group, regular players play the most hours on average.
- Subscribed players spend more hours playing on average than non-subscribed players.
- Agender players have the highest average played hours. However, given the small number of agender players, this high average could be driven by outlier(s). Female players also have higher average played hours than male players.
- There is no obvious linear relationship between age and played hours. This suggests that using KNN Regression, which does not assume underlying patterns in the data, may perform better than linear regression. However, linear regression is more interpretable if we want to find out how age is associated with played hours.

**(4) Method and Plan**

*Selecting type of model to use:*

To assess our question of whether age, gender, subscription status, and experience level are good predictors of played hours, a regression model is appropriate because played hours is a numerical variable. In our EDA, we did not observe a clear linear relationship between age and played hours, so we will use a KNN regression model, which makes no assumption about the shape of the relationship between the predictors and the response. However, we are aware that a limitation of KNN regression is that its predictions are less interpretable than linear regression coefficients.

*Data preprocessing and wrangling:*

Before doing any modelling, we will generate a 75%-25% train-test split. Then, we will create X_train, y_train, X_test, and y_test by selecting the predictor columns (age, gender, subscribe, experience) and the response column (played_hours). We also have to transform all of our categorical predictor variables to numerical values. Experience has an inherent hierarchy, so we will use ordinal encoding. Subscribe is a boolean variable, so we will encode it as 0 (false) or 1 (true). Gender has more than two values and no natural ordering, so we will use one-hot encoding.
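
The encoding plan above could be sketched roughly as follows. This is only an illustration on made-up rows (`toy_X` is hypothetical), and two things in it are assumptions rather than part of the plan: the ordering in `experience_order`, and the standardization of age, which is added here because KNN relies on distances and `StandardScaler` is already imported in the notebook.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical rows in the shape of players.csv; the values are made up.
toy_X = pd.DataFrame({
    "age": [17, 21, 19, 22],
    "gender": ["Male", "Female", "Male", "Agender"],
    "subscribe": [True, False, True, True],
    "experience": ["Beginner", "Pro", "Amateur", "Veteran"],
})

# Assumed ordering of experience levels, lowest to highest.
experience_order = ["Beginner", "Amateur", "Regular", "Veteran", "Pro"]

# subscribe: boolean -> 0 (false) / 1 (true)
toy_X["subscribe"] = toy_X["subscribe"].astype(int)

preprocessor = make_column_transformer(
    # experience: ordinal encoding that respects the assumed hierarchy
    (OrdinalEncoder(categories=[experience_order]), ["experience"]),
    # gender: one indicator column per category, no implied order
    (OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    # age: standardized (an assumption, see the note above)
    (StandardScaler(), ["age"]),
    # subscribe is already 0/1, so it passes through unchanged
    remainder="passthrough",
)

X_encoded = preprocessor.fit_transform(toy_X)
print(X_encoded.shape)
```

Fitting the transformer on the training predictors only, and then reusing it to transform the test predictors, keeps information from the test set out of the preprocessing step.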

*Modelling:*

To search for the best model, we will tune the hyperparameter n_neighbors with GridSearchCV using 5-fold cross-validation, searching over values of k from 1 to 50. After finding the best k and refitting on the entire training set, we will predict on the test set. Finally, we will compute the RMSPE (root mean squared prediction error) between our predictions and y_test, which tells us how well the model uses age, gender, subscribe, and experience to predict played_hours.
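
A minimal sketch of this tuning and evaluation loop is shown below. It runs on synthetic numeric data (`X` and `y` are made up and stand in for the encoded predictors and played_hours), and the scoring choice of negative RMSE is an assumption, so it illustrates the workflow rather than any result.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the encoded predictors and played_hours.
rng = np.random.default_rng(2000)
X = rng.normal(size=(196, 4))
y = np.abs(X[:, 0]) + rng.normal(scale=0.1, size=196)

# Same 75%-25% split proportions as in the plan.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2000
)

# 5-fold cross-validation over k = 1..50, scored by (negative) RMSE.
knn_grid = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 51))},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
knn_grid.fit(X_train, y_train)  # refits the best k on the full training set

# RMSPE: root mean squared error of the predictions on the held-out test set.
y_pred = knn_grid.predict(X_test)
rmspe = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(knn_grid.best_params_, rmspe)
```

In the real analysis, the preprocessing steps and the regressor would typically be combined in a single pipeline so that encoding and scaling are refit inside each cross-validation fold.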