# Data Description

### Chosen Data Set
* players.csv 

### Number of Variables and Observations
* 9 variables
* 196 observations

### Names of Variables, Type and Description: 
* experience(string): level of experience of players, can be amateur, regular, pro, and veteran 
* subscribe(boolean): whether or not participants are subscribed, can be true or false
* hashedEmail(string): hashed version of the player's email address
* played_hours(float): number of hours played by participants, represented by a number to one decimal 
* name(string): name of the participant, represented by a string 
* gender(string): gender of participants, values include male, female, non-binary, two-spirited, and prefer not to say
* age(integer): age of participants as integers 
* individualId(empty): column contains no values
* organizationName(empty): column contains no values
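
The types listed above can be checked directly in pandas. The following is a minimal sketch, not the notebook's own code: `sample` is a hypothetical stand-in for `data/players.csv`, built from three rows of the output displayed in this notebook, with the two empty columns filled with `None`.

```python
import pandas as pd

# Illustrative stand-in for data/players.csv: three rows copied from the
# notebook's displayed output. hashedEmail is left out because it is
# truncated in that output.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "subscribe": [True, True, False],
    "played_hours": [30.3, 3.8, 2.3],
    "name": ["Morgan", "Christian", "Harlow"],
    "gender": ["Male", "Male", "Male"],
    "age": [9, 17, 17],
    "individualId": [None, None, None],
    "organizationName": [None, None, None],
})

# Column types as pandas infers them
print(sample.dtypes)

# Names of columns in which every value is missing
empty_columns = sample.columns[sample.isna().all()].tolist()
print(empty_columns)
```

Running the same two checks on the full data frame would confirm which columns are safe to drop.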

### Issues in Data
* Inaccurate player data may have been caused by players inputting an incorrect age, experience level, etc. 
* Due to the server's player cap, some players who would have played at a certain time may not have been able to, leading to incomplete data
* Gender and age imbalance can skew the data. The data frame includes more males than any other gender, and more 17-year-olds than any other age
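
The imbalance can be quantified with `value_counts`. This is a sketch under a stated assumption: `sample` holds only the nine rows visible in this notebook's output, so the printed shares illustrate the method and do not describe the full data set.

```python
import pandas as pd

# Illustrative stand-in: gender and age of the nine rows visible in the
# notebook's displayed output (not the full data set).
sample = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Male",
               "Female", "Male", "Prefer not to say", "Male"],
    "age": [9, 17, 17, 21, 21, 17, 22, 17, 17],
})

# Proportion of players in each gender category
gender_share = sample["gender"].value_counts(normalize=True)
print(gender_share)

# Number of players at each age, most common first
age_counts = sample["age"].value_counts()
print(age_counts)
```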

### Data Collection
* Data is collected when participants sign up with their name, age, gender, and experience level.
* Once they join the server to play the game, the time they spend on the server is recorded


# Question 

### Chosen Question
* We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts

### How Data will Help Address the Question
* Data provides information on the number of hours played by each participant, which represents the amount of data provided by each player
* The "age" and "experience" variables help define characteristics of players

### Wrangling Steps
* Drop unwanted columns (i.e. subscribe, hashedEmail, name, individualId, organizationName) and remove outliers
* Standardize numerical variables 
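
The wrangling steps above could look roughly like the sketch below. Two things here are assumptions rather than part of the plan: the outlier rule (1.5 × IQR on `played_hours`), which the plan does not specify, and the hypothetical `sample` frame, which contains only the nine rows visible in this notebook's output.

```python
import pandas as pd

# Illustrative stand-in: the nine rows visible in the notebook's output.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular",
                   "Amateur", "Veteran", "Amateur", "Amateur"],
    "subscribe": [True, True, False, True, True, True, False, False, False],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 2.3],
    "name": ["Morgan", "Christian", "Blake", "Flora", "Kylie",
             "Bailey", "Pascal", "Dylan", "Harlow"],
    "gender": ["Male", "Male", "Male", "Female", "Male",
               "Female", "Male", "Prefer not to say", "Male"],
    "age": [9, 17, 17, 21, 21, 17, 22, 17, 17],
})

# Step 1: drop the columns not used in the analysis.
# errors="ignore" because this small sample lacks some of them.
unwanted = ["subscribe", "hashedEmail", "name", "individualId", "organizationName"]
tidy = sample.drop(columns=unwanted, errors="ignore")

# Step 2 (ASSUMED rule): keep rows whose played_hours lies within
# 1.5 * IQR of the first and third quartiles.
q1, q3 = tidy["played_hours"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = tidy["played_hours"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
tidy = tidy[in_range].copy()

# Step 3: standardize the numeric columns to mean 0, standard deviation 1.
numeric_cols = ["age", "played_hours"]
tidy[numeric_cols] = (
    tidy[numeric_cols] - tidy[numeric_cols].mean()
) / tidy[numeric_cols].std()

print(tidy)
```

For modeling, the scaling parameters would normally be computed on the training split only, so that the test data does not influence them.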

# Exploratory Data Analysis and Visualization 

In [2]:
import pandas as pd
import altair as alt

In [3]:
# loading the data and doing minimal wrangling (dropping empty columns)
players_data = pd.read_csv("data/players.csv").drop(columns = ['individualId', 'organizationName'])
players_data

| Unnamed: 0 | experience | subscribe | hashedEmail | played_hours | name | gender | age |
|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | b6e9e593b9ec51c5e335457341c324c34a2239531e1890... | 0.0 | Bailey | Female | 17 |
| 192 | Veteran | False | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778... | 0.3 | Pascal | Male | 22 |
| 193 | Amateur | False | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29... | 0.0 | Dylan | Prefer not to say | 17 |
| 194 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 |


In [4]:
# exploratory visualizations
planning_viz = alt.Chart(players_data).mark_circle().encode(
    x = alt.X("age").title("Age"), 
    y = alt.Y("played_hours").title("Hours Played"),
    color = alt.Color("experience").title("Experience Level")
)
planning_viz

### Preliminary Insights from Visualizations 
* Data is very concentrated near the bottom of the graph, indicating that most players log similarly low numbers of hours
* Experience level does not seem to affect played hours significantly
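
These impressions can be backed with a numeric summary of hours by experience level. The sketch below uses a hypothetical `sample` frame containing only the nine rows visible in this notebook's output, so its numbers show the approach and should not be read as results for the full data set.

```python
import pandas as pd

# Illustrative stand-in: experience and played_hours for the nine rows
# visible in the notebook's displayed output.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular",
                   "Amateur", "Veteran", "Amateur", "Amateur"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 2.3],
})

# Count, median and maximum of played hours within each experience level
hours_by_experience = (
    sample.groupby("experience")["played_hours"]
    .agg(["count", "median", "max"])
)
print(hours_by_experience)

# Share of players who logged one hour or less
share_under_one_hour = (sample["played_hours"] <= 1).mean()
print(share_under_one_hour)
```

The median is used here because a few very large values would pull the mean upward.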

# Methods and Plan
Will use two methods to effectively answer the question 

### First Method: Visualization
##### Description
* Scatter plot
* X-axis: age
* Y-axis: hours played 
* Color: experience level

##### Explanation
* Allows us to visualize the types of players which contribute the most data, where "most data" is most hours played
* Graph shows how age and experience level affect hours played

##### Assumptions 
* Assume the data represents each player accurately, i.e. that the data entered by the players is correct

##### Limitations
* Limited range of ages and genders in the data can skew conclusions

### Second Method: KNN regression 
##### Description
* Response variable: experience level (stored as text, so it needs a numeric encoding before KNN regression and RMSE can be applied)
* Predictor variables: age and played hours
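
KNN regression and RMSE need a numeric response, while experience level is stored as text. One option is an ordinal encoding, sketched below. The ordering in `levels` is an assumption taken from the order in which the levels are listed in the data description; the plan itself does not state an ordering.

```python
import pandas as pd

# ASSUMED ordering of experience levels, lowest to highest
levels = ["Amateur", "Regular", "Pro", "Veteran"]

# A few experience values copied from the notebook's displayed output
experience = pd.Series(["Pro", "Veteran", "Veteran", "Amateur", "Regular"])

# Map each level to its position in `levels` (0 for Amateur, 3 for Veteran)
experience_code = pd.Series(
    pd.Categorical(experience, categories=levels, ordered=True).codes,
    index=experience.index,
)
print(experience_code.tolist())
```

An encoding like this treats the gaps between adjacent levels as equal, which is a modeling choice worth stating alongside the results.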

##### Explanation
* Allows us to predict the experience level of players based on their age and the number of hours they have played
* Allows us to further visualize the relationship between the three variables and conclude what kind of player contributes the most data

##### Assumptions 
* Assume the data represents each player accurately, i.e. that the data entered by the players is correct

##### Limitations 
* Limited range of ages and genders in the data can skew conclusions

##### Comparing the Model
* Use cross-validation to compare models fitted with different values of k on our dataset
* Select the value of k that gives the lowest cross-validated RMSE

##### Processing and Applying the model 
* Data will be split into training and testing sets before creating visualizations, with 75% used as training data
* Will use 5-fold cross-validation on the training data to determine the right k value
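
The split and the tuning of k could be sketched with scikit-learn as follows. Several parts are assumptions: the data are synthetic random values standing in for `players.csv`, `experience_code` is a hypothetical ordinal encoding of experience level, and the candidate range for k (1 to 30) is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SYNTHETIC stand-in for players.csv: random values, NOT the real data.
rng = np.random.default_rng(1)
n = 196
players = pd.DataFrame({
    "age": rng.integers(9, 51, size=n),
    "played_hours": rng.exponential(5.0, size=n).round(1),
    "experience_code": rng.integers(0, 4, size=n),  # assumed ordinal encoding
})

X = players[["age", "played_hours"]]
y = players["experience_code"]

# 75% of the rows for training, the remaining 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=1
)

# Scaling sits inside the pipeline so that, within each cross-validation
# fold, it is fitted only on that fold's training portion.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# 5-fold cross-validation over candidate values of k, scored by RMSE
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
cv_rmse = -search.best_score_  # the scorer is negated, so flip the sign
print(best_k, cv_rmse)

# Evaluate the refitted best model once on the held-out test set
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
print(test_rmse)
```

Because the values are random, the chosen k and the RMSE carry no meaning here; only the workflow transfers to the real data.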