# DSCI 100 Project: Final Report

**Authors**: 

**Predicting Usage of a Video Game Research Server**
A research group in the Department of Computer Science at UBC (PLAI) aims to advance the limits of artificial intelligence. They host a Minecraft server that records players' actions as they navigate the world. Because running the project is complex, they want to recruit players who will contribute large amounts of gameplay data, and they want to make sure they have the resources to handle the number of players they attract. To optimize players' contributions, they want to understand which player demographics have the greatest gameplay time. This project addresses their question of how player demographics relate to total gameplay time.

## The Question
**Question 2**: Which kinds of players are most likely to contribute a large amount of data, and how can we identify them for targeted recruitment?

**Predictive Question**: Can we predict the total `played_hours` in the `players.csv` dataset based on `experience`?

**Response variable**:

`played_hours` - total play time and data contribution

**Explanatory variables**:

`experience` - player experience level

`subscribe` -  subscription status of player

The dataset allows us to examine whether `experience` or `subscribe` relates to total playtime. If players with certain experience levels or subscription statuses tend to play more, those groups may contribute more gameplay data. This helps identify the player demographics most likely to provide substantial data for the research team.


## Data Description: The Players
The dataset used in this analysis is a player demographics and engagement dataset that contains 196 observations (rows) and 9 variables (columns). The dataset records information about individual players, their gaming experience level, subscription status, demographic details, and the amount of time they have spent playing.

- Number of observations: 196 (each row represents one unique player)
- Number of variables: 9
- Observational unit: Player-level data
- Purpose: Identify which demographic characteristics correspond to higher gameplay time (and therefore more data contributed)
  
**Variables**
- `played_hours` (numeric) - Total hours each player spent on the server
- `experience` (categorical) - Self-reported experience level in Minecraft
- `gender` (categorical) - Player’s gender
- `subscribe` (Boolean) - Subscription status of player
- `hashedEmail` (string) - The player’s email, hashed for privacy
- `name` (string) - Name of player
- `age` (numeric) - Age of player
- `individualId` (string) - Unique ID of player
- `organizationName` (string) - Organization name

**Issues to consider**
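- `individualId` and `organizationName` contain only missing values (NaN), so they cannot be used as identifiers or predictors.
- `hashedEmail` is hashed for privacy, so it identifies a player only as an opaque string.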
- `played_hours` may include idle time, which could overestimate actual engagement.
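
One way to check the first issue is to count missing values per column. The sketch below uses a small hand-made frame (the name `toy` is hypothetical, and the values are copied from a few of the displayed rows rather than loaded from the real file), so it only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-in for players.csv (a few illustrative rows, not the real file)
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Count missing values in each column
missing_counts = toy.isna().sum()

# Columns where every value is missing carry no usable information
all_missing = missing_counts[missing_counts == len(toy)].index.tolist()
print(all_missing)
```

Running the same two lines on the real `players` frame would show which columns are safe to drop before modelling.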

Source of the original dataset:
https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit


In [1]:
# Import libraries
import pandas as pd
import altair as alt

# Import model selection, preprocessing, and pipeline utilities
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config

# import the K-NN regression model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# URL of the players dataset
url = "https://raw.githubusercontent.com/sydlpeters/dsci-group-2025w1-group-101-1/refs/heads/main/data/players.csv"
# Load the dataset
players = pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [100]:
# Keep only the predictor (experience) and the response (played_hours):
# drop the entirely missing (NaN) ID columns and the other unused variables
players_tidy = players.drop(columns=["individualId", "organizationName", "name", "gender", "hashedEmail", "age", "subscribe"])
# Preview the tidy data
players_tidy.head()

Unnamed: 0,experience,played_hours
0,Pro,30.3
1,Veteran,3.8
2,Veteran,0.0
3,Amateur,0.7
4,Regular,0.1


In [101]:
players_tidy.describe()

Unnamed: 0,played_hours
count,196.0
mean,5.845918
std,28.357343
min,0.0
25%,0.0
50%,0.1
75%,0.6
max,223.1


**Summary Statistics**

Most players recorded very little playtime: three-quarters spent 0.6 hours or less in total (Q3 = 0.6 hours). A small number logged substantially more time (maximum = 223.1 hours), which pulls the mean (5.85 hours) far above the median (0.1 hours) and produces a large standard deviation (28.36 hours). **In short, most players played very little (close to 0 hours), but a few played far more, making the distribution of playtime highly right-skewed.**
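
The effect of a few heavy players on the mean can be seen in a small worked example. The values below are made up for illustration (only the largest one mirrors the reported maximum), and the name `hours` is hypothetical.

```python
import pandas as pd

# Hypothetical play times: mostly near zero, plus one very heavy player
hours = pd.Series([0.0, 0.0, 0.1, 0.1, 0.6, 223.1])

mean_hours = hours.mean()      # pulled upward by the single large value
median_hours = hours.median()  # stays near the typical player

# Share of players with less than one hour of total play time
share_under_one_hour = (hours < 1).mean()

print(mean_hours, median_hours, share_under_one_hour)
```

When the mean sits far above the median like this, the median is usually the better description of a typical player.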

In [102]:
# Experience level compared with average hours played
# Note: ideally, summary statistics and plots use only the training data;
# here the full dataset is plotted.

plot1 = alt.Chart(players, title="Fig. 1 Average Hours Played per Experience Group").mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color=alt.Color("experience:N").title("Experience Level")
)
plot1


In [103]:
# Experience level compared with total hours played
plot2 = alt.Chart(players, title="Fig. 2 Total Hours Played per Experience Group").mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played (hr)"),
    color=alt.Color("experience:N").title("Experience Level")
)
plot2

In [104]:
# Subscription status compared with average hours played
plot3 = alt.Chart(players, title="Fig. 3 Average Hours Played by Subscription Status").mark_bar().encode(
    x=alt.X("subscribe:N").title("Subscription Status"),
    y=alt.Y("mean(played_hours)").title("Average Hours Played(hr)"),
    color = alt.Color("subscribe:N").title("Subscription Status")
)
plot3

In [105]:

plot4= alt.Chart(players, title="Fig. 4 Average Hours Played by Gender").mark_bar().encode(
    x=alt.X("gender:N").title("Gender"),
    y=alt.Y("mean(played_hours)").title("Average Hours Played(hr)"),
    color = alt.Color("gender:N", legend=None).title("Gender")
)
plot4

In [106]:

plot5 = alt.Chart(players, title="Fig. 5 Average Hours Played by Age").mark_circle(opacity=0.8).encode(
    x=alt.X("age:N", scale=alt.Scale(domain=[i for i in range(2, 102, 2)])).title("Age"),
    y=alt.Y("mean(played_hours)").title("Average Hours Played(hr)"),
    color = alt.Color("age:N", legend=None).title("Age")
)
plot5

In [107]:
experience_map = {
    "Beginner" : 0,
    "Regular": 1,
    "Amateur": 2,
    "Veteran": 3,
    "Pro": 4
}
# Encode experience as an integer so K-NN can compute distances between levels
players_tidy["experience"] = players_tidy["experience"].map(experience_map)
players_tidy

Unnamed: 0,experience,played_hours
0,4,30.3
1,3,3.8
2,3,0.0
3,2,0.7
4,1,0.1
...,...,...
191,2,0.0
192,3,0.3
193,2,0.0
194,2,2.3
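
Because `Series.map` returns NaN for any label that is missing from the dictionary, it can be worth checking that every experience level was encoded. This is a minimal sketch on toy labels; `"Expert"` is a deliberately unknown, hypothetical label used only to trigger the check.

```python
import pandas as pd

experience_map = {"Beginner": 0, "Regular": 1, "Amateur": 2, "Veteran": 3, "Pro": 4}

# Toy labels; "Expert" is not in the mapping on purpose
labels = pd.Series(["Pro", "Veteran", "Expert", "Amateur"])
encoded = labels.map(experience_map)

# Labels that were not in the mapping become NaN after map()
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean that every label was converted to a number.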


In [108]:
# Split the data into training (80%) and testing (20%) sets
players_training, players_testing = train_test_split(
    players_tidy, test_size=0.2, random_state=1234
)

# set target and predictors
players_x_train = players_training[["experience"]]
players_y_train = players_training["played_hours"]

players_x_test = players_testing[["experience"]]
players_y_test = players_testing["played_hours"]

players_training

Unnamed: 0,experience,played_hours
114,0,1.0
187,2,0.0
39,2,0.0
25,1,0.6
131,2,0.0
...,...,...
152,0,0.2
116,3,0.0
53,2,0.2
38,3,0.0


In [109]:
# Experience level compared with total hours played (training set only)
plot6 = alt.Chart(players_training, title="Fig. 6 Total Hours Played per Experience Group (Training Set)").mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played (hr)"),
    color=alt.Color("experience:N").title("Experience Level")
)
plot6

In [110]:
# preprocess the data, make the pipeline
players_preprocessor = make_column_transformer(
    (
        StandardScaler(),
        [
            "experience",
            ],
    )
)
players_pipeline = make_pipeline(players_preprocessor, KNeighborsRegressor())

players_cv = pd.DataFrame(
    cross_validate(
        players_pipeline,
        players_x_train,
        players_y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)
players_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.004266,0.002295,-19.160247,-27.106263
1,0.003216,0.002147,-5.8032,-21.794819
2,0.003107,0.001941,-29.223031,-16.37408
3,0.003083,0.001929,-3.798582,-21.907992
4,0.003085,0.001921,-31.885128,-15.327369


In [111]:
# create the 5-fold GridSearchCV object
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 50, 1),
}

players_tuned = GridSearchCV(
    estimator=players_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# fit the GridSearchCV object
players_result = pd.DataFrame(players_tuned.fit(players_x_train, players_y_train).cv_results_)

players_result

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003697,0.000778,0.002243,0.000578,1,{'kneighborsregressor__n_neighbors': 1},-9.122226,-8.72342,-30.005672,-12.052734,-29.268655,-17.834541,9.707952,44
1,0.00311,8.4e-05,0.001947,5.3e-05,2,{'kneighborsregressor__n_neighbors': 2},-9.143509,-5.170949,-29.994092,-6.304415,-30.493426,-16.221278,11.52331,19
2,0.003247,0.000234,0.002005,0.000134,3,{'kneighborsregressor__n_neighbors': 3},-31.37984,-4.345652,-30.140029,-4.925939,-32.622795,-20.682851,13.127151,49
3,0.00305,5.2e-05,0.001928,8e-05,4,{'kneighborsregressor__n_neighbors': 4},-23.631453,-3.934581,-30.069712,-4.316454,-32.123221,-18.815084,12.31755,48
4,0.003008,3.5e-05,0.001925,6.8e-05,5,{'kneighborsregressor__n_neighbors': 5},-19.160247,-5.8032,-29.223031,-3.798582,-31.885128,-17.974038,11.580485,46
5,0.003,8e-06,0.001883,6e-06,6,{'kneighborsregressor__n_neighbors': 6},-16.380491,-4.982276,-29.339701,-3.162105,-31.815267,-17.135968,11.899284,39
6,0.003001,2e-05,0.001926,8.8e-05,7,{'kneighborsregressor__n_neighbors': 7},-14.73491,-4.539553,-28.152575,-5.217285,-31.784133,-16.885692,11.332351,37
7,0.003042,9.4e-05,0.001912,4.8e-05,8,{'kneighborsregressor__n_neighbors': 8},-13.336478,-14.027496,-28.313692,-4.558332,-31.774824,-18.402164,10.134269,47
8,0.002995,8e-06,0.001915,6.5e-05,9,{'kneighborsregressor__n_neighbors': 9},-12.357332,-12.422075,-28.4438,-4.079793,-31.773035,-17.815207,10.538727,43
9,0.002987,1.5e-05,0.001913,6.4e-05,10,{'kneighborsregressor__n_neighbors': 10},-13.827831,-11.185561,-28.565624,-3.665939,-31.778249,-17.804641,10.682608,42


In [112]:
# Add Visualization to compare the ks
players_result["rmse"] = -players_result["mean_test_score"]
players_k_plot= alt.Chart(players_result).mark_line().encode(
    x=alt.X("param_kneighborsregressor__n_neighbors:Q", title="k (Number of Neighbours)"),
    y=alt.Y("rmse:Q", title="5-fold CV RMSE")
)
players_k_plot

In [113]:
# Get the best parameter value (k) found by the grid search
players_min = players_tuned.best_params_
players_min

{'kneighborsregressor__n_neighbors': 39}

In [114]:
players_best_RMSPE = -players_tuned.best_score_
players_best_RMSPE

np.float64(16.028531937252072)

In [115]:
knn = KNeighborsRegressor(n_neighbors=39)
player_pipe = make_pipeline(players_preprocessor, knn)
player_pipe.fit(players_x_train, players_y_train)

# Predict on the training set
players_train_predictions = player_pipe.predict(players_x_train)

players_predictions = players_training.assign(prediction=players_train_predictions)
players_predictions

Unnamed: 0,experience,played_hours,prediction
114,0,1.0,1.410256
187,2,0.0,7.261538
39,2,0.0,7.261538
25,1,0.6,6.494872
131,2,0.0,7.261538
...,...,...,...
152,0,0.2,1.410256
116,3,0.0,0.464103
53,2,0.2,7.261538
38,3,0.0,0.464103


In [116]:
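# Collect actual and predicted values side by side for inspection
rmse_df = pd.DataFrame({
    "actual": players_y_train,
    "predicted": players_train_predictions
})
# Training RMSE: square root of the mean squared error on the training set
train_rmse = mean_squared_error(players_y_train, players_train_predictions) ** 0.5
train_rmse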

np.float64(19.51054219122817)
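
For reference, the RMSE calculation can be written out directly as the square root of the mean squared difference between actual and predicted values. The helper `rmse` below is a hypothetical name, and the inputs are only the nine rows visible in the predictions table, not the full training set.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Nine rows copied from the displayed predictions table (a subset only)
actual = [1.0, 0.0, 0.0, 0.6, 0.0, 0.2, 0.0, 0.2, 0.0]
predicted = [1.410256, 7.261538, 7.261538, 6.494872, 7.261538,
             1.410256, 0.464103, 7.261538, 0.464103]

print(rmse(actual, predicted))
```

The value for this subset will not match the full training RMSE, which is likely driven by the few players with very large `played_hours`.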

In [117]:

players_plot = (alt.Chart(players_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("played_hours").scale(zero=False)
)+
alt.Chart(
    players_predictions,
    title= "K=39"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction"
))
players_plot



In [122]:
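# Best pipeline (k = 39), refit by GridSearchCV on the full training set
best_model = players_tuned.best_estimator_
# Predict for every player in the dataset (training and testing rows)
all_pred = best_model.predict(players_tidy[["experience"]])
players_tidy_predictions = players_tidy.assign(prediction=all_pred)
players_tidy_predictions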

Unnamed: 0,experience,played_hours,prediction
0,4,30.3,1.305128
1,3,3.8,0.464103
2,3,0.0,0.464103
3,2,0.7,7.261538
4,1,0.1,6.494872
...,...,...,...
191,2,0.0,7.261538
192,3,0.3,0.464103
193,2,0.0,7.261538
194,2,2.3,7.261538


In [125]:
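# RMSE over all players (training and testing rows combined),
# so this is not a held-out test error
all_rmse = mean_squared_error(
    players_tidy_predictions["played_hours"],
    players_tidy_predictions["prediction"]
) ** 0.5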

all_rmse

np.float64(28.04716413446685)
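
The RMSE above includes rows the model was trained on. A held-out estimate (RMSPE) would instead score only the test rows. The sketch below shows that pattern on synthetic data, so the frame `toy`, its values, and the resulting number are all hypothetical and say nothing about the players dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the players dataset)
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "experience": rng.integers(0, 5, size=100),
    "played_hours": rng.exponential(scale=5.0, size=100),
})

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1234)

# Fit on the training rows only
toy_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=39))
toy_pipe.fit(toy_train[["experience"]], toy_train["played_hours"])

# Score on rows the model has never seen
test_pred = toy_pipe.predict(toy_test[["experience"]])
test_rmspe = float(np.sqrt(np.mean((toy_test["played_hours"] - test_pred) ** 2)))
print(test_rmspe)
```

In this notebook, the analogous step would be `best_model.predict(players_x_test)` scored against `players_y_test`.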

In [128]:

players_plot = (alt.Chart(players_tidy_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("Hours Played (hr)").scale(zero=False)
)+
alt.Chart(
    players_tidy_predictions,
    title= "K=39"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction"
))
players_plot