***Individual Project Planning Stage***

**Author: Megan Zhang**

**Date: 11/05/2025**

**Question 2:** We would like to know which "kinds" of players are most likely to contribute a large amount of data
so that we can target those players in our recruiting efforts.

In [1]:
# Import the necessary libraries
import altair as alt
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [2]:
# Read in the player data

url = "https://raw.githubusercontent.com/sydlpeters/dsci-group-2025w1-group-101-1/refs/heads/main/data/players.csv"
players_data = pd.read_csv(url)
players_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


***Data Description***

The players dataset contains 196 observations (rows) and 9 variables (columns) before any cleaning or wrangling. The 9 variables are:
- Experience (categorical) - the player's experience level (e.g., Pro, Veteran, Amateur, Regular)
- Subscription status (categorical)
- Hashed email (categorical)
- Played hours (numerical)
- Name (categorical)
- Gender (categorical)
- Age (numerical)
- Individual ID (numerical)
- Organization name (categorical)
   
These variables describe each player who played on the server. They let us examine whether any player characteristic is associated with whether a player plays on the server at all, or plays for an extended duration.
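The row, column, and type information described above can be checked directly from the loaded table. The sketch below is a minimal example, not part of the original analysis: `sample_players` is a hypothetical stand-in with made-up rows so the code runs without network access, and `summarize` is an illustrative helper name. In the notebook, the same helper could be called on `players_data`.

```python
import pandas as pd

# Hypothetical stand-in for players_data (made-up rows), so the sketch runs offline.
sample_players = pd.DataFrame(
    {
        "experience": ["Pro", "Veteran", "Amateur"],
        "subscribe": [True, True, False],
        "played_hours": [30.3, 3.8, 0.0],
        "gender": ["Male", "Male", "Female"],
        "age": [9, 17, None],
    }
)


def summarize(df):
    """Return row/column counts plus each column's dtype and missing-value count."""
    n_rows, n_cols = df.shape
    per_column = pd.DataFrame(
        {"dtype": df.dtypes.astype(str), "n_missing": df.isna().sum()}
    )
    return n_rows, n_cols, per_column


n_rows, n_cols, per_column = summarize(sample_players)
print(f"{n_rows} observations x {n_cols} variables")
print(per_column)
```

Counting missing values per column also shows which variables (for example, columns that are entirely empty) carry no usable information before any wrangling.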

The selected main question is: 
*Question 2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.*

*Specific Question focus: Can we predict a player's played hours based on their experience, gender, and age?*

For this question, we will focus on the players dataset. The response variable of interest is played hours, and the explanatory variables are experience, age, and gender. We will examine each explanatory variable to see which types of players tend to have more played hours.
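One way to start comparing player types is to summarize played hours within each level of an explanatory variable. The sketch below assumes a small hypothetical frame (`sample_players`, with made-up values) in place of the real players table, and shows the idea for experience only; the same grouping could be repeated for gender or for binned ages.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up values); in the notebook this would be players_data.
sample_players = pd.DataFrame(
    {
        "experience": ["Pro", "Pro", "Amateur", "Amateur", "Veteran"],
        "gender": ["Male", "Female", "Female", "Male", "Male"],
        "age": [9, 21, 21, 17, 17],
        "played_hours": [30.0, 2.0, 1.0, 0.0, 4.0],
    }
)

# Keep only the response and the three explanatory variables.
model_data = sample_players[["experience", "gender", "age", "played_hours"]]

# Mean and count of played hours within each experience level, largest mean first.
hours_by_experience = (
    model_data.groupby("experience")["played_hours"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(hours_by_experience)
```

Reporting the count next to the mean helps flag groups where a high average rests on only one or two players.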

In [4]:
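# Merge and clean the datasets
# (sessions_data is assumed to have been read in by an earlier cell not shown here)
game_data = players_data.merge(sessions_data, on="hashedEmail").drop(columns=["individualId", "organizationName"])
# dropna() returns a new DataFrame, so assign the result back to keep the cleaned rows
game_data = game_data.dropna()
game_data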

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,start_time,end_time,original_start_time,original_end_time
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,08/08/2024 00:21,08/08/2024 01:35,1.723080e+12,1.723080e+12
1,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,09/09/2024 22:30,09/09/2024 22:37,1.725920e+12,1.725920e+12
2,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,08/08/2024 02:41,08/08/2024 03:25,1.723080e+12,1.723090e+12
3,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,10/09/2024 15:07,10/09/2024 15:29,1.725980e+12,1.725980e+12
4,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,05/05/2024 22:21,05/05/2024 23:17,1.714950e+12,1.714950e+12
...,...,...,...,...,...,...,...,...,...,...,...
1530,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,23/08/2024 21:59,23/08/2024 22:06,1.724450e+12,1.724450e+12
1531,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,09/09/2024 02:17,09/09/2024 02:45,1.725850e+12,1.725850e+12
1532,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,23/08/2024 21:39,23/08/2024 21:53,1.724450e+12,1.724450e+12
1533,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,08/09/2024 19:40,08/09/2024 19:45,1.725820e+12,1.725820e+12


In [5]:
# Plot the data for age vs played hours
game_data_plot = alt.Chart(game_data, title="Age vs Hours Played").mark_point().encode(
    x=alt.X("age").title("Age").scale(zero=False),
    y=alt.Y("played_hours").title("Played Hours").scale(zero=False)
)
game_data_plot

***Methods and Plan***
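One approach to answering this question is regression, which is well-suited to this problem because the response variable, played hours, is numerical rather than a class label. Regression allows us to use one or more variables to predict the numerical value of another. In this case, we can use predictor variables such as experience, subscription status, gender, and age to predict a player's played hours. The libraries imported at the top of the notebook (`KNeighborsRegressor`, `GridSearchCV`, `mean_squared_error`) point to K-nearest neighbours regression as the intended method.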
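Because played hours is a numerical response, one way to carry out this prediction is K-nearest neighbours regression, which matches the `KNeighborsRegressor` and `GridSearchCV` imports at the top of the notebook. The sketch below is an assumption about how such a plan could look, not the author's final analysis. It uses a hypothetical stand-in frame (`sample`, with made-up values), adds `make_column_transformer` and `OneHotEncoder` (not imported in the original) to handle the categorical predictors, and picks an arbitrary 75/25 split and a grid of 1 to 10 neighbours for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data (made-up values); replace with the cleaned players table.
sample = pd.DataFrame(
    {
        "experience": ["Pro", "Veteran", "Amateur", "Regular"] * 10,
        "gender": ["Male", "Female"] * 20,
        "age": [9 + (i % 15) for i in range(40)],
        "played_hours": [(i % 7) * 1.5 for i in range(40)],
    }
)

X = sample[["experience", "gender", "age"]]
y = sample["played_hours"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# One-hot encode the categorical predictors and standardize age.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["experience", "gender"]),
    (StandardScaler(), ["age"]),
)
pipeline = make_pipeline(preprocessor, KNeighborsRegressor())

# Tune the number of neighbours with 5-fold cross-validation, scored by RMSE.
grid = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": range(1, 11)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)

# Evaluate the tuned model once on the held-out test set.
best_k = grid.best_params_["kneighborsregressor__n_neighbors"]
test_rmse = ((grid.predict(X_test) - y_test) ** 2).mean() ** 0.5
print(f"best k = {best_k}, test RMSE = {test_rmse:.2f}")
```

Putting the preprocessing inside the pipeline means the scaler and encoder are refit on each cross-validation training fold, so no information from the validation folds or the test set leaks into tuning.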