# DSCI 100 Project: Final Report

**Authors**: 

**Predicting Usage of a Video Game Research Server**
A research group in the Department of Computer Science at UBC (PLAI) aims to advance the limits of artificial intelligence. They host a Minecraft server that records players' actions as they navigate through the world. Because running the project is so complex, they want to recruit players who will contribute large amounts of gameplay data, and to make sure they have the resources to handle the number of players they attract. To optimize their players' contributions, they want to understand which player demographics have the greatest gameplay time. This project addresses their question of how player demographics relate to total gameplay time.

## The Question
**Question 2**: Which kinds of players are most likely to contribute a large amount of data, and how can we identify them for targeted recruitment?

**Predictive Question**: Can we predict the total `played_hours` in the `players.csv` dataset from two predictors, `experience` and `subscribe`?

**Response variable**:

`played_hours` - total play time and data contribution

**Explanatory variables**:

`experience` - player experience level

`subscribe` -  subscription status of player

The dataset allows us to examine whether `experience` or `subscribe` relates to total playtime. If players with certain experience levels or subscription statuses tend to play more, those groups may contribute more gameplay data. This helps identify the player demographics most likely to provide substantial data for the research team.
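
As a rough illustration of the comparison described above, the sketch below computes mean playtime per group on a small made-up table. The rows are invented for illustration and are not taken from `players.csv`; only the column names match the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values, not from players.csv)
toy = pd.DataFrame({
    "experience": ["Pro", "Pro", "Amateur", "Amateur", "Veteran"],
    "subscribe": [True, False, True, True, False],
    "played_hours": [30.0, 2.0, 1.0, 3.0, 0.0],
})

# Mean playtime for each experience level and for each subscription status
mean_by_experience = toy.groupby("experience")["played_hours"].mean()
mean_by_subscribe = toy.groupby("subscribe")["played_hours"].mean()

print(mean_by_experience)
print(mean_by_subscribe)
```

A group whose mean is clearly higher than the others would be a candidate for targeted recruitment, although with skewed data a single heavy player can dominate a group mean.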


## Data Description: The Players
The dataset used in this analysis is a player demographics and engagement dataset that contains 196 observations (rows) and 9 variables (columns). The dataset records information about individual players, their gaming experience level, subscription status, demographic details, and the amount of time they have spent playing.

- Number of observations: 196 (each row represents one unique player)
- Number of variables: 9
- Observational unit: Player-level data
- Purpose: Identify which demographic characteristics correspond to higher gameplay time (and therefore more data contributed)
  
**Variables**
- `played_hours` (numeric) - Total hours each player spent on the server
- `experience` (categorical) - Self-reported experience level in Minecraft
- `gender` (categorical) - Player’s gender
- `subscribe` (boolean) - Subscription status of player
- `hashedEmail` (string) - The player's email, hashed for privacy
- `name` (string) - Name of player
- `age` (numeric) - Age of player
- `individualId` (string) - Unique ID of player
- `organizationName` (string) - Organization name

**Issues to consider**
- `individualId` and `organizationName` contain no values (every entry is missing).
- `hashedEmail` is anonymized (hashed) for privacy.
- `played_hours` may include idle time, which could overestimate actual engagement.
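
One way to confirm which columns are entirely missing is to compute the share of missing values per column. The following is a minimal sketch on a toy table; the values are invented, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical); individualId and organizationName are left empty
toy = pd.DataFrame({
    "played_hours": [30.3, 0.0, 0.7],
    "age": [9, 17, 21],
    "individualId": [None, None, None],
    "organizationName": [None, None, None],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Columns in which every value is missing
all_missing = missing_share[missing_share == 1.0].index.tolist()
print(all_missing)
```

Columns that appear in `all_missing` carry no information and can be dropped before modelling.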

Source of the original dataset:
https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit


In [2]:
# Import libraries
import pandas as pd
import altair as alt
# Read the dataset from the project repository
url = "https://raw.githubusercontent.com/sydlpeters/dsci-group-2025w1-group-101-1/refs/heads/main/data/players.csv"
players = pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
# Drop columns that are entirely missing (NaN)
players_tidy = players.drop(columns=["individualId", "organizationName"])
# Preview the tidied data
players_tidy.head()

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21


In [4]:
players_tidy.describe()

Unnamed: 0,played_hours,age
count,196.0,196.0
mean,5.845918,21.280612
std,28.357343,9.706346
min,0.0,8.0
25%,0.0,17.0
50%,0.1,19.0
75%,0.6,22.0
max,223.1,99.0


**Summary Statistics**

Most players recorded very little playtime: three-quarters spent under an hour in total (Q3 = 0.6 hours). A small number logged substantially more (maximum = 223.1 hours), which pulls the mean well above the median (mean = 5.85 hours; median = 0.1 hours) and inflates the spread (standard deviation = 28.36 hours). **Most players played very little (close to 0 hours), but a few played far more, so the distribution of playtime is strongly right-skewed.**
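
The gap between the mean and the median of `played_hours` is what a handful of very large values would produce. A minimal sketch with invented numbers (not the real data) shows the effect:

```python
from statistics import mean, median

# Hypothetical playtimes in hours: most are near zero, one is very large
hours = [0.0, 0.0, 0.1, 0.1, 0.2, 0.5, 0.6, 1.0, 2.5, 95.0]

print(mean(hours))    # pulled upward by the single 95-hour player
print(median(hours))  # stays near the typical player

# Dropping the largest value shows how much one player moves the mean
trimmed = sorted(hours)[:-1]
print(mean(trimmed))
```

In this toy example the mean falls from about ten hours to under one hour once the heaviest player is removed, while the median barely moves.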

The player population is primarily young adults, concentrated between the late teens and early twenties (mean = 21.28; Q1–Q3 = 17–22 years). 

In [5]:
# Experience level compared with average hours played
plot1 = alt.Chart(players_tidy).mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color = alt.Color("experience:N").title("Experience Level")
)
plot1

In [17]:
# Experience level compared with total hours played
plot1 = alt.Chart(players_tidy).mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played (hr)"),
    color = alt.Color("experience:N").title("Experience Level")
)
plot1

In [6]:
# Subscription status compared with average hours played
plot2 = alt.Chart(players_tidy).mark_bar().encode(
    x=alt.X("subscribe:N").title("Subscription Status"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color = alt.Color("subscribe:N").title("Subscription Status")
)
plot2

In [7]:
experience_map = {
    "Beginner" : 0,
    "Regular": 1,
    "Amateur": 2,
    "Veteran": 3,
    "Pro": 4
}

players["experience"] = players["experience"].map(experience_map)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,4,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,3,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,3,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,2,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,1,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,2,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,3,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,2,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,2,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,
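
One thing worth checking after an ordinal encoding like the one above: `Series.map` with a dictionary returns `NaN` for any label that is not a key, so a misspelled or unexpected level would silently become missing. A minimal sketch, using a toy series rather than the real column:

```python
import pandas as pd

experience_map = {"Beginner": 0, "Regular": 1, "Amateur": 2, "Veteran": 3, "Pro": 4}

# Toy labels (hypothetical); "Expert" is deliberately not in the mapping
labels = pd.Series(["Pro", "Amateur", "Expert", "Beginner"])

encoded = labels.map(experience_map)

# Labels that failed to map show up as NaN
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

Re-running the mapping cell on an already-encoded column has the same effect, since the integers 0 to 4 are not keys of the dictionary.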


In [8]:
subscribe_map = {
    False : 0,
    True : 1,
}

players["subscribe"] = players["subscribe"].map(subscribe_map)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,4,1,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,3,1,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,3,0,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,2,1,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,1,1,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,2,1,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,3,0,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,2,0,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,2,0,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [9]:
# import model selection, preprocessing, and pipeline tools
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config

# import the K-NN regression model and the error metric
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# split the data into training (80%) and testing (20%) sets
players_training, players_testing = train_test_split(
    players, test_size=0.2, random_state=1234
)

# set the target and the predictor (this model uses `experience` only)
players_x_train = players_training[["experience"]]
players_y_train = players_training["played_hours"]

players_x_test = players_testing[["experience"]]
players_y_test = players_testing["played_hours"]

players_training

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
114,0,1,ae8d8a9dcf80b38466f89466201e6d594eb94d6994eda1...,1.0,Ella,Female,17,,
187,2,1,e3f0ad9aadd27f3d1d9197e58546d045018daa76767503...,0.0,Jasper,Male,17,,
39,2,1,6ef84a204d64edc3b3779127b13f298d0c70f96568486f...,0.0,Vivian,Male,17,,
25,1,1,5baba1651a0b92788bc0d6dcdf00be64af1cf9f0015bbe...,0.6,Kendall,Female,28,,
131,2,0,11bf6125c4264b3a8f3bffa57b33bd598e2ea1ecd6331a...,0.0,Olivia,Female,23,,
...,...,...,...,...,...,...,...,...,...
152,0,1,aea049eaa7cb10db386a62990220d205ceb2a4c473cae3...,0.2,Aarav,Prefer not to say,17,,
116,3,0,58893f3187db90bf0690c88f06e6aa5f8ea9ff9691ca8e...,0.0,Noah,Prefer not to say,20,,
53,2,1,fc0224c81384770e93ca717f32713960144bf0b52ff676...,0.2,Gemna,Male,27,,
38,3,1,d782933acd14c834e53dea816005a3583cb87710f7347a...,0.0,Ishaan,Male,17,,


In [10]:
# preprocess the data (standardize the predictor) and make the pipeline
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["experience"]),
)
players_pipeline = make_pipeline(players_preprocessor, KNeighborsRegressor())

players_cv = pd.DataFrame(
    cross_validate(
        players_pipeline,
        players_x_train,
        players_y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)
players_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.004783,0.002978,-9.099486,-21.719026
1,0.001159,0.0,-5.920221,-22.051516
2,0.0,0.007003,-28.129594,-16.957817
3,0.001758,0.001875,-1.713133,-22.352437
4,0.002027,0.001726,-32.491334,-15.254916


In [11]:
# create the 5-fold GridSearchCV object
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 50, 1),
}

players_tuned = GridSearchCV(
    estimator=players_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# fit the GridSearchCV object
players_result = pd.DataFrame(players_tuned.fit(players_x_train, players_y_train).cv_results_)

players_result

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.00343,0.00336,0.000495,0.000616,1,{'kneighborsregressor__n_neighbors': 1},-14.875798,-3.258537,-29.966987,-0.857604,-31.752688,-16.142323,12.930766,28
1,0.001599,0.002038,0.001629,0.003257,2,{'kneighborsregressor__n_neighbors': 2},-10.80781,-5.038417,-29.981369,-0.967871,-31.83224,-15.725541,12.797202,15
2,0.00374,0.004854,0.0,0.0,3,{'kneighborsregressor__n_neighbors': 3},-9.610375,-4.176646,-29.977078,-2.690472,-31.893679,-15.66965,12.690033,14
3,0.001484,0.002112,0.002314,0.002981,4,{'kneighborsregressor__n_neighbors': 4},-9.241531,-7.076399,-27.883119,-2.066642,-32.804014,-15.814341,12.188988,16
4,0.001788,0.002422,0.001806,0.003611,5,{'kneighborsregressor__n_neighbors': 5},-9.099486,-5.920221,-28.129594,-1.713133,-32.491334,-15.470754,12.417924,13
5,0.003259,0.005277,0.000101,0.000203,6,{'kneighborsregressor__n_neighbors': 6},-9.034982,-5.276932,-28.353955,-2.274299,-31.797819,-15.347597,12.26339,11
6,0.000308,0.000615,0.003038,0.004601,7,{'kneighborsregressor__n_neighbors': 7},-8.996115,-4.69561,-28.52769,-1.949443,-31.768886,-15.187549,12.462441,5
7,0.000201,0.000402,0.003132,0.004529,8,{'kneighborsregressor__n_neighbors': 8},-8.994845,-4.275145,-28.674122,-1.739292,-31.771527,-15.090986,12.61075,2
8,0.002347,0.004694,0.001008,0.002015,9,{'kneighborsregressor__n_neighbors': 9},-9.042175,-3.967608,-28.787206,-1.566809,-31.77146,-15.027052,12.720261,1
9,0.002214,0.004428,0.001122,0.002243,10,{'kneighborsregressor__n_neighbors': 10},-9.008174,-3.750653,-28.888777,-3.389936,-31.778284,-15.363165,12.417672,12


In [12]:
# get the best parameter value
players_min = players_tuned.best_params_
players_min

{'kneighborsregressor__n_neighbors': 9}

In [13]:
players_best_RMSPE = -players_tuned.best_score_
players_best_RMSPE

np.float64(15.027051604673355)
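
The value above is a root mean squared error estimated by cross-validation: scikit-learn's `neg_root_mean_squared_error` scorer returns the negative of the error so that larger is better, which is why the sign is flipped. As a worked sketch of the quantity itself, with invented numbers rather than the real predictions:

```python
import math

# Hypothetical observed and predicted playtimes (hours)
observed = [0.0, 0.5, 2.0, 30.0]
predicted = [1.0, 1.0, 1.0, 1.0]

# Root mean squared error: square root of the average squared difference
squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)
```

Because the errors are squared, one badly missed heavy player dominates the result, which is consistent with the large fold-to-fold variation in the cross-validation scores.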

In [14]:
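# NOTE: the grid search above selected n_neighbors = 9, while this cell fits
# n_neighbors = 10; the predictions shown below come from the k = 10 model
knn = KNeighborsRegressor(n_neighbors = 10)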
player_pipe = make_pipeline(players_preprocessor, knn)
player_pipe.fit(players_x_train, players_y_train)

players_predictions = players_testing.assign(prediction = player_pipe.predict(players_x_test))
players_predictions

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName,prediction
101,2,1,25879aecc205544bc6505f9faf768356e0a3b712605730...,0.0,Eli,Female,17,,,6.56
51,1,1,b622593d2ef8b337dc554acb307d04a88114f2bf453b18...,218.1,Akio,Non-binary,20,,,2.1
146,4,1,5669c0f4b50dbfe3e2f851e4bb89fa43c55cd1f71ba362...,0.0,Padma,Non-binary,25,,,3.44
153,0,1,0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce02...,0.1,Osiris,Male,17,,,2.84
106,1,0,63139b524e44d156daf265875fb8078bb2b24d6b7520a3...,0.0,Elias,Male,23,,,2.1
59,3,1,ca20f724571080b997e0efa874b9611e9f280c1af5f68f...,0.2,Edward,Male,38,,,0.22
161,3,0,f174555ae3ff613d5d0b98f83f26f689619d92ef272f28...,0.0,Finley,Non-binary,17,,,0.22
167,0,0,42eafe96ed5c1684e3b5cc614d1b01a117173d3ec6898a...,0.3,Ariana,Female,17,,,2.84
193,2,0,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,,6.56
88,0,1,a1094a440899e69e804f888f44d1154ec0b7675d05d977...,0.0,Maya,Female,17,,,2.84
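
For intuition about what the fitted model does, K-NN regression predicts the mean response of the k training points closest to the query. The sketch below is a from-scratch illustration with made-up data, not scikit-learn's implementation; the helper name `knn_predict` is hypothetical, and scaling is omitted because standardizing a single predictor does not change which neighbours are nearest.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the responses of the k training points nearest to query."""
    # Sort training indices by absolute distance to the query (ties keep input order)
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical encoded experience levels and playtimes (hours)
train_x = [0, 0, 1, 1, 2, 2, 3, 4]
train_y = [1.0, 0.0, 0.5, 20.0, 0.0, 2.0, 0.2, 30.0]

print(knn_predict(train_x, train_y, query=1, k=2))
print(knn_predict(train_x, train_y, query=4, k=3))
```

When the predictor takes only a few distinct values, many training points sit at exactly the same distance from a query, so the choice among tied neighbours can affect the prediction.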


In [15]:

players_plot = (alt.Chart(players_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("played_hours").scale(zero=False)
)+
alt.Chart(
    players_predictions,
    title= "K=10"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction"
))
players_plot

In [None]:
players_predictions = players_predictions.rename(columns={"prediction":"prediction_80"})

players_cv = pd.DataFrame(
    cross_validate(
        players_pipeline,
        players_x_train,
        players_y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)

# create the 5-fold GridSearchCV object
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 50, 1),
}

players_tuned = GridSearchCV(
    estimator=players_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)



# fit the grid search on the full dataset (no train/test split)
players_result = pd.DataFrame(players_tuned.fit(players[["experience"]], players[["played_hours"]]).cv_results_)

players_result

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002222,0.00299,0.000811,0.001051,1,{'kneighborsregressor__n_neighbors': 1},-9.026669,-50.68261,-24.887578,-12.22919,-28.983811,-25.161971,14.789115,27
1,0.002604,0.003052,0.00382,0.005448,2,{'kneighborsregressor__n_neighbors': 2},-9.022752,-50.735247,-24.151193,-10.710419,-29.001063,-24.724135,15.087186,3
2,0.001646,0.001408,0.00256,0.004346,3,{'kneighborsregressor__n_neighbors': 3},-30.141515,-50.764899,-24.051864,-10.43795,-28.958721,-28.87099,12.992045,49
3,0.001826,0.003652,0.001066,0.001328,4,{'kneighborsregressor__n_neighbors': 4},-23.32223,-50.762395,-24.00896,-10.337021,-29.113796,-27.508881,13.175371,48
4,0.005662,0.005685,0.000789,0.00097,5,{'kneighborsregressor__n_neighbors': 5},-19.39905,-50.746695,-23.13664,-10.296795,-29.049374,-26.525711,13.554909,46
5,0.002686,0.004481,0.000366,0.000733,6,{'kneighborsregressor__n_neighbors': 6},-17.012101,-50.760374,-23.180837,-10.267327,-29.024346,-26.048997,13.845272,43
6,0.002885,0.004464,0.003337,0.004505,7,{'kneighborsregressor__n_neighbors': 7},-16.034618,-50.605497,-23.236425,-10.067064,-28.912939,-25.771309,13.960364,40
7,0.002628,0.002766,0.000596,0.001192,8,{'kneighborsregressor__n_neighbors': 8},-14.60813,-50.620424,-23.292973,-9.982779,-28.890012,-25.478864,14.190233,36
8,0.000204,0.000407,0.002823,0.00432,9,{'kneighborsregressor__n_neighbors': 9},-13.589229,-50.640167,-24.905163,-9.94716,-26.908216,-25.197987,14.266507,29
9,0.005559,0.00607,0.000399,0.000797,10,{'kneighborsregressor__n_neighbors': 10},-12.763487,-50.559835,-24.653011,-9.930753,-26.920004,-24.965418,14.379989,21


In [21]:
players_min = players_tuned.best_params_
players_min

{'kneighborsregressor__n_neighbors': 12}

In [22]:
players_best_RMSPE = -players_tuned.best_score_
players_best_RMSPE

np.float64(24.622432143663325)

In [30]:
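# refit on the full dataset with the k selected by the second grid search
knn = KNeighborsRegressor(n_neighbors = 12)
player_pipe = make_pipeline(players_preprocessor, knn)
player_pipe.fit(players[["experience"]], players[["played_hours"]])

# show the earlier test-set predictions before they are overwritten
print(players_predictions)

# predict for every player; these are in-sample predictions,
# since the model was fit on the same rows
players_predictions = players.assign(prediction_all = player_pipe.predict(players[["experience"]]))
players_predictions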

     experience  subscribe                                        hashedEmail  \
101           2          1  25879aecc205544bc6505f9faf768356e0a3b712605730...   
51            1          1  b622593d2ef8b337dc554acb307d04a88114f2bf453b18...   
146           4          1  5669c0f4b50dbfe3e2f851e4bb89fa43c55cd1f71ba362...   
153           0          1  0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce02...   
106           1          0  63139b524e44d156daf265875fb8078bb2b24d6b7520a3...   
59            3          1  ca20f724571080b997e0efa874b9611e9f280c1af5f68f...   
161           3          0  f174555ae3ff613d5d0b98f83f26f689619d92ef272f28...   
167           0          0  42eafe96ed5c1684e3b5cc614d1b01a117173d3ec6898a...   
193           2          0  d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...   
88            0          1  a1094a440899e69e804f888f44d1154ec0b7675d05d977...   
29            3          0  951e54f7376e2b2f0915e9e3646c701af4a2fe839385b1...   
130           2          1  

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName,prediction_all
0,4,1,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,,3.016667
1,3,1,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,,0.625000
2,3,0,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,,0.625000
3,2,1,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,,4.233333
4,1,1,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,,19.075000
...,...,...,...,...,...,...,...,...,...,...
191,2,1,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,,4.233333
192,3,0,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,,0.625000
193,2,0,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,,4.233333
194,2,0,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,,4.233333


In [31]:

players_plot = (alt.Chart(players_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("played_hours").scale(zero=False)
)+
alt.Chart(
    players_predictions,
    title= "K=12"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction_all"
))
players_plot