# DSCI 100 Project: Final Report

**Authors**: Rodrigo Moreno González, Trevor Lau, Sydney Peters, Megan Zhang


**Predicting Usage of a Video Game Research Server**
A research group in the Department of Computer Science at UBC, the Pacific Laboratory for Artificial Intelligence (PLAI), aims to advance artificial intelligence research. The group hosts a Minecraft server that records players' actions as they navigate the world. Because running the project is complex, the group wants to recruit players who will contribute large amounts of gameplay data and to make sure it has the resources to handle the number of players it attracts. To that end, it wants to understand which kinds of players have the greatest gameplay time. This project examines how player demographics relate to total gameplay time.

## The Question
**Question 2**: Which kinds of players are most likely to contribute a large amount of data, and how can we identify them for targeted recruitment?

**Predictive Question**: Can we predict total `played_hours` in the `players.csv` dataset from the predictor `experience`?

**Response variable**:

`played_hours` - total play time and data contribution

**Explanatory variable**:

`experience` - player experience level

The dataset allows us to examine if `experience` relates to total playtime. If certain experience levels tend to play more, those groups may contribute more gameplay data. This helps identify the player demographics most likely to provide substantial data for the research team.


## Description: The Players
The dataset used in this analysis is a player demographics and engagement dataset that contains 196 observations (rows) and 9 variables (columns). The dataset records information about individual players, their gaming experience level, subscription status, demographic details, and the amount of time they have spent playing.

- Number of observations: 196 (each row represents one unique player)
- Number of variables: 9
- Observational unit: Player-level data
- Purpose: Identify which demographic characteristics correspond to higher gameplay time (and therefore more data contributed)
  
**Variables**
- `played_hours` (numeric) - Total hours each player spent on the server
- `experience` (categorical) -  Self-reported experience level in Minecraft
- `gender` (categorical) - Player’s gender
- `subscribe` (Boolean) - Player’s subscription status
- `hashedEmail` (String) - Player’s email address, hashed for privacy protection
- `name` (String) - Player’s name
- `age` (Numeric) - Player’s age
- `individualId` (String) - Unique player identifier
- `organizationName` (String) - Name of the organization associated with the player

**Issues to consider**
- Missing identifiers (`individualId`, `organizationName`).
- `hashedEmail` is anonymous for privacy.
- `played_hours` may include idle time, which could overestimate actual engagement.
- `experience` is self-reported, so players may apply inconsistent standards.
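
The missing-identifier issue can be checked by counting missing values per column. The sketch below uses only the first five rows of the data preview shown in this report (hashes and names omitted), so the small `preview` frame is an illustrative stand-in and not the full `players` data.

```python
import pandas as pd

# Illustrative stand-in: first five rows of the data preview in this report.
preview = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "age": [9, 17, 17, 21, 21],
    "individualId": [None] * 5,
    "organizationName": [None] * 5,
})

# Count missing values per column.
missing_counts = preview.isna().sum()

# A count equal to the number of rows means the column is entirely empty.
empty_columns = missing_counts[missing_counts == len(preview)].index.tolist()

print(missing_counts)
print(empty_columns)
```

Running the same two lines on the full `players` frame would show whether the identifier columns are empty for every player, which is the justification for dropping them later.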

Source of the original dataset:
https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit


In [1]:
# Import libraries
import pandas as pd
import altair as alt

# Import model selection, preprocessing, and pipeline tools
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config

# Import the K-NN regression model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Read the players dataset from the project repository
url = "https://raw.githubusercontent.com/sydlpeters/dsci-group-2025w1-group-101-1/refs/heads/main/data/players.csv"
players = pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [2]:
# Experience compared with total hours played
plot1 = alt.Chart(players, title="Fig. 1 Total Hours Played per Experience Group").mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played (hr)"),
    color = alt.Color("experience:N").title("Experience Level")
)
plot1

In [3]:
# Experience compared with average hours played
plot2 = alt.Chart(players, title="Fig. 2 Average Hours Played per Experience Group").mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color = alt.Color("experience:N").title("Experience Level")
)
plot2

In [4]:
# Subscription status compared with average hours played
plot3 = alt.Chart(players, title="Fig. 3 Average Hours by Subscribed/Unsubscribed").mark_bar().encode(
    x=alt.X("subscribe:N").title("Subscription Status"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color = alt.Color("subscribe:N").title("Subscription Status")
)
plot3

In [5]:
# Gender compared with average hours played
plot4 = alt.Chart(players, title="Fig. 4 Average Hours Played by Gender").mark_bar().encode(
    x=alt.X("gender:N").title("Gender"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color = alt.Color("gender:N").title("Gender")
)
plot4

In [6]:
# Age compared with average hours played
plot5 = alt.Chart(players, title="Fig. 5 Average Hours Played by Age").mark_circle(opacity=0.8).encode(
    x=alt.X("age:Q").title("Age"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color=alt.Color("age:Q").title("Age")
)
plot5

**Visualizations Summary**

Across all visualizations, the data show that total gameplay hours are heavily right-skewed, with most players recording very little playtime and a small number accounting for the largest values.

Experience level (Figures 1 & 2) displays the clearest relationship with gameplay time: average hours differ visibly between experience categories, although, as the model results later show, they do not increase steadily from Beginner to Pro. Variability is high within each group, but the differences between groups suggest that self-reported experience is related to engagement.

Subscription status (Figure 3) shows some difference between subscribed and unsubscribed players, with subscribed players tending to average slightly more gameplay hours. However, the distributions overlap substantially, meaning subscription alone does not strongly separate players by playtime.

Gender (Figure 4) reveals no consistent pattern across groups; average hours between genders appear similar, with any observed differences likely driven by small sample sizes or extreme outliers rather than true group-level effects. This suggests gender provides little predictive value for modeling gameplay time.

Age (Figure 5) shows no strong monotonic trend. A few ages exhibit higher average values, but these spikes are inconsistent and likely reflect sparse data and outliers rather than a genuine relationship. Overall, age does not demonstrate a stable or reliable association with total played hours.

Self-reported experience is the best predictor for KNN regression in this dataset, as it exhibits the strongest and most consistent relationship to total gameplay time. Subscription status may provide minor additional information, while age and gender contribute little meaningful predictive value and should not be heavily weighted in the model.


In [7]:
# Drop unused and non-predictive columns
players_tidy = players.drop(columns=["individualId", "organizationName", "name", "gender", "hashedEmail", "age", "subscribe"])
# Preview the tidied data
players_tidy

Unnamed: 0,experience,played_hours
0,Pro,30.3
1,Veteran,3.8
2,Veteran,0.0
3,Amateur,0.7
4,Regular,0.1
...,...,...
191,Amateur,0.0
192,Veteran,0.3
193,Amateur,0.0
194,Amateur,2.3


In [8]:
# View full summary (numeric + categorical)
players_tidy.describe(include="all")

Unnamed: 0,experience,played_hours
count,196,196.0
unique,5,
top,Amateur,
freq,63,
mean,,5.845918
std,,28.357343
min,,0.0
25%,,0.0
50%,,0.1
75%,,0.6


**Statistics Summary**

Most players recorded very little playtime, with three-quarters spending under one hour in total (Q3 = 0.6 hours). Only a small number logged substantially more time, which raises the overall average (mean = 5.85 hours; median = 0.1 hours) and contributes to a large spread in the data (SD = 28.36 hours). This indicates that most players played very little (close to 0 hours), while a few played a great deal, making the distribution highly uneven and right-skewed. The experience variable includes several levels that are not evenly represented, with some experience groups appearing more frequently than others. Overall, the dataset shows substantial variability in both gameplay time and experience level, supporting the use of KNN regression to explore patterns between player experience and total playtime.
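
The gap between mean and median described above is the usual signature of right skew, and it can be checked per experience group as well as overall. The sketch below is only an illustration: `sample` contains the nine player rows visible in the tidied preview in this report, not the full dataset, so its numbers will differ from the summary table.

```python
import pandas as pd

# Illustrative sample: the nine rows visible in the tidied preview.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular",
                   "Amateur", "Veteran", "Amateur", "Amateur"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 2.3],
})

# A mean well above the median indicates a long right tail.
overall_mean = sample["played_hours"].mean()
overall_median = sample["played_hours"].median()

# Group size, mean, and median for each experience level.
by_experience = (
    sample.groupby("experience")["played_hours"]
    .agg(["count", "mean", "median"])
)

print(overall_mean, overall_median)
print(by_experience)
```

Reporting the group counts next to the means is useful here because a mean computed from only a handful of players can be driven by a single extreme value.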

In [9]:
# Create a numeric mapping for ordered experience levels
experience_map = {
    "Beginner" : 0,
    "Regular": 1,
    "Amateur": 2,
    "Veteran": 3,
    "Pro": 4
}
# Convert the categorical experience column to numeric values
players_tidy["experience"] = players_tidy["experience"].map(experience_map)
players_tidy

Unnamed: 0,experience,played_hours
0,4,30.3
1,3,3.8
2,3,0.0
3,2,0.7
4,1,0.1
...,...,...
191,2,0.0
192,3,0.3
193,2,0.0
194,2,2.3
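
One risk with a dictionary mapping is that any label missing from `experience_map` silently becomes `NaN`. A minimal sketch of a check is shown below; the `labels` series is hypothetical, and the `"Expert"` label is invented purely to trigger the failure case.

```python
import pandas as pd

# Same mapping as in the cell above.
experience_map = {"Beginner": 0, "Regular": 1, "Amateur": 2, "Veteran": 3, "Pro": 4}

# Hypothetical labels; "Expert" is deliberately not a key of the mapping.
labels = pd.Series(["Pro", "Veteran", "Amateur", "Regular", "Beginner", "Expert"])
mapped = labels.map(experience_map)

# Labels that map() could not translate come back as NaN.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list on the real column would confirm that all five experience levels were converted. Note also that the numeric codes impose an ordering (Beginner < Regular < Amateur < Veteran < Pro) that KNN treats as distances, so the chosen order is itself a modeling assumption.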


In [10]:
# Split the data
players_training, players_testing = train_test_split(
    players_tidy, 
    test_size=0.2, 
    random_state=1234
)

# Set target and predictors
X_train = players_training[["experience"]]
y_train = players_training["played_hours"]

X_test = players_testing[["experience"]]
y_test = players_testing["played_hours"]

In [11]:
# Create preprocessor
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["experience"]),
    remainder="drop",
    verbose_feature_names_out=False,
)

# Create pipeline
players_pipeline = make_pipeline(
    players_preprocessor,
    KNeighborsRegressor()
)

# Evaluate the performance of players_pipeline using 5-fold cross-validation
players_cv = pd.DataFrame(
    cross_validate(
        players_pipeline,
        X_train,
        y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)

players_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.004293,0.002584,-19.160247,-27.106263
1,0.003353,0.002203,-5.8032,-21.794819
2,0.003131,0.001952,-29.223031,-16.37408
3,0.003076,0.001908,-3.798582,-21.907992
4,0.003044,0.001887,-31.885128,-15.327369


In [12]:
# Create the 5-fold GridSearchCV object
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 51),
}

players_tuned = GridSearchCV(
    estimator=players_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# fit the GridSearchCV object
players_result = pd.DataFrame(players_tuned.fit(X_train, y_train).cv_results_)

players_result

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003446,0.00033,0.002164,7.8e-05,1,{'kneighborsregressor__n_neighbors': 1},-9.122226,-8.72342,-30.005672,-12.052734,-29.268655,-17.834541,9.707952,45
1,0.003052,3.4e-05,0.018451,0.033085,2,{'kneighborsregressor__n_neighbors': 2},-9.143509,-5.170949,-29.994092,-6.304415,-30.493426,-16.221278,11.52331,20
2,0.003101,0.000125,0.001915,3.7e-05,3,{'kneighborsregressor__n_neighbors': 3},-31.37984,-4.345652,-30.140029,-4.925939,-32.622795,-20.682851,13.127151,50
3,0.003057,6.7e-05,0.00401,0.004209,4,{'kneighborsregressor__n_neighbors': 4},-23.631453,-3.934581,-30.069712,-4.316454,-32.123221,-18.815084,12.31755,49
4,0.003049,6e-05,0.001905,1.8e-05,5,{'kneighborsregressor__n_neighbors': 5},-19.160247,-5.8032,-29.223031,-3.798582,-31.885128,-17.974038,11.580485,47
5,0.002994,1.1e-05,0.001898,3.4e-05,6,{'kneighborsregressor__n_neighbors': 6},-16.380491,-4.982276,-29.339701,-3.162105,-31.815267,-17.135968,11.899284,40
6,0.003019,5.2e-05,0.003396,0.003047,7,{'kneighborsregressor__n_neighbors': 7},-14.73491,-4.539553,-28.152575,-5.217285,-31.784133,-16.885692,11.332351,38
7,0.003003,5.2e-05,0.001876,5e-06,8,{'kneighborsregressor__n_neighbors': 8},-13.336478,-14.027496,-28.313692,-4.558332,-31.774824,-18.402164,10.134269,48
8,0.003002,5.4e-05,0.001868,1.2e-05,9,{'kneighborsregressor__n_neighbors': 9},-12.357332,-12.422075,-28.4438,-4.079793,-31.773035,-17.815207,10.538727,44
9,0.00297,6e-06,0.001873,6e-06,10,{'kneighborsregressor__n_neighbors': 10},-13.827831,-11.185561,-28.565624,-3.665939,-31.778249,-17.804641,10.682608,43


In [13]:
# Get the best parameter values
players_min = players_tuned.best_params_
players_min

{'kneighborsregressor__n_neighbors': 39}

In [14]:
# Extract the best (lowest) cross-validation error from the tuned KNN model
players_best_RMSPE = -players_tuned.best_score_
players_best_RMSPE

np.float64(16.028531937252072)

**Cross-Validation Summary**

Grid search cross-validation identified the optimal number of neighbors for the KNN regression model as K = 39. This means the model achieved its best predictive performance when each prediction was calculated by averaging the outcomes of the 39 most similar players. Using a relatively large value of K helps smooth predictions and reduce the influence of extreme outliers in the dataset, which is particularly beneficial given the highly skewed distribution of gameplay hours. 
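
Because scikit-learn maximizes scores, `neg_root_mean_squared_error` reports RMSE with its sign flipped, and the values must be negated to read them as errors in hours. The sketch below does this for the ten rows of the grid-search output visible in this report (K = 1 to 10); the remaining forty rows are not shown, so this is not the full search.

```python
import pandas as pd

# mean_test_score for K = 1..10, copied from the grid-search output above.
first_ten = pd.DataFrame({
    "n_neighbors": range(1, 11),
    "mean_test_score": [-17.834541, -16.221278, -20.682851, -18.815084, -17.974038,
                        -17.135968, -16.885692, -18.402164, -17.815207, -17.804641],
})

# Flip the sign to obtain RMSPE in hours (lower is better).
first_ten["rmspe"] = -first_ten["mean_test_score"]

# Row with the smallest RMSPE among the ten values shown.
best_row = first_ten.loc[first_ten["rmspe"].idxmin()]
print(int(best_row["n_neighbors"]), round(float(best_row["rmspe"]), 2))
```

Among these ten rows the lowest RMSPE is at K = 2 (about 16.22 hours), only slightly above the 16.03 hours reported for K = 39. The fold-to-fold standard deviation for K = 2 in the same output is about 11.5, so the differences between values of K are small relative to the variation between folds.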

The tuned KNN regression model achieved a cross-validated Root Mean Squared Prediction Error (RMSPE) of approximately 16.03 hours. Because RMSPE squares each error before averaging, it is not a simple average miss: it is dominated by the few players with very large playtime, and predictions for typical low-playtime players can be off by far less. Given the highly right-skewed distribution of playtime, where most players logged very little activity and a small number recorded extremely high usage, this level of error is expected and highlights the difficulty of accurately predicting individual play behavior. Overall, the model captures general differences related to player experience but remains limited in precision when forecasting exact playtime for each player.
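
For reference, the standard definition of RMSPE is given below, where $y_i$ is the observed `played_hours` for held-out player $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of held-out players in a fold (the symbols are introduced here for illustration):

$$
\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

With `cv=5`, the value returned by `best_score_` is the average of this quantity over the five validation folds, with the sign flipped.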

In [15]:
# Set K, make pipeline, and fit the training data
knn = KNeighborsRegressor(n_neighbors = 39)
player_pipe = make_pipeline(players_preprocessor, knn)
player_pipe.fit(X_train, y_train)

# Predict total hours based on experience level for the testing set
players_predictions = players_testing.assign(prediction = player_pipe.predict(X_test))
players_predictions

Unnamed: 0,experience,played_hours,prediction
101,2,0.0,7.261538
51,1,218.1,6.494872
146,4,0.0,1.305128
153,0,0.1,1.410256
106,1,0.0,6.494872
59,3,0.2,0.464103
161,3,0.0,0.464103
167,0,0.3,1.410256
193,2,0.0,7.261538
88,0,0.0,1.410256


In [16]:
# Predicted vs. actual played hours by experience (testing set)
players_plot = (alt.Chart(players_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("Total Hours Played (hr)").scale(zero=False)
)+
alt.Chart(
    players_predictions,
    title= "Fig. 6 Predicted vs Total Gameplay Hours by Experience (K = 39)"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction"
))
players_plot

**KNN Model and Visualization Analysis**

The KNN regression model was constructed using the optimal hyperparameter value of K = 39 neighbors, as determined through cross-validated grid search. A preprocessing pipeline standardized the predictor variable (`experience`) prior to modeling. Standardization is standard practice for distance-based algorithms such as KNN, although with a single predictor it does not change which neighbors are selected. The model was trained on 80% of the tidied dataset and then used to generate predictions for the remaining 20% held-out test set. Each predicted playtime is the average observed gameplay hours of the 39 nearest players in the training set based on experience level; these predictions were appended to the testing dataset for evaluation and visualization.
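
Because `experience` takes only five distinct values, every training player at the queried level is at distance zero, so the nearest neighbors are drawn from that level first. The sketch below illustrates this on a hypothetical toy dataset (three levels, three players each, K = 3; none of these values come from `players.csv`): the KNN prediction for each level equals that level's training mean.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: three experience levels with three players each.
toy = pd.DataFrame({
    "experience": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "played_hours": [0.0, 0.2, 0.4, 1.0, 2.0, 30.0, 0.1, 0.1, 0.4],
})

# K equals the group size, so each query's neighbors are exactly its own level.
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(toy[["experience"]], toy["played_hours"])

levels = pd.DataFrame({"experience": [0, 1, 2]})
knn_predictions = knn.predict(levels)

# Per-level training means for comparison.
group_means = toy.groupby("experience")["played_hours"].mean().to_numpy()

print(knn_predictions)
print(group_means)
```

With K = 39 on the real data the match need not be exact: a level with fewer than 39 training players borrows neighbors from adjacent levels, and a level with more than 39 averages only a subset of its players, chosen by the library's handling of ties.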

The scatterplot reveals a right-skewed distribution of playtime, with most players recording very low gameplay hours and a small number recording extremely high usage. This produces wide vertical dispersion and indicates substantial individual variability that limits prediction accuracy. The prediction line is not monotonic: it rises from Beginner (about 1.4 hours) through Regular (about 6.5 hours) to Amateur (about 7.3 hours), then drops for Veteran (about 0.5 hours) and Pro (about 1.3 hours). Regular and Amateur players are therefore associated with the highest predicted gameplay hours, which is consistent with the earlier observation that playtime differs across experience levels. Because the model uses a relatively large neighborhood size (K = 39), each prediction is an average over many training players, which reduces overfitting to extreme values and gives more stable estimates. Consequently, the model tends to predict conservative values:

- Under-predicting extreme high-usage players, and
- Slightly over-predicting very low-usage players.

This averaging behavior improves generalization but limits individual-level prediction precision, which aligns with the validated performance metrics.
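
The effect of a single extreme player on the error can be seen by computing RMSPE by hand. The sketch below uses only the ten test-set rows displayed in the prediction table above, so the result is illustrative and is not the RMSPE of the full test set.

```python
import numpy as np
import pandas as pd

# The ten test-set rows shown in the prediction table above.
shown = pd.DataFrame({
    "played_hours": [0.0, 218.1, 0.0, 0.1, 0.0, 0.2, 0.0, 0.3, 0.0, 0.0],
    "prediction": [7.261538, 6.494872, 1.305128, 1.410256, 6.494872,
                   0.464103, 0.464103, 1.410256, 7.261538, 1.410256],
})

# Prediction error for each player (observed minus predicted).
errors = shown["played_hours"] - shown["prediction"]

# RMSPE over the ten rows.
rmspe_shown = float(np.sqrt((errors ** 2).mean()))

# Same calculation after removing the single largest error.
largest = errors.abs().idxmax()
rmspe_without_largest = float(np.sqrt((errors.drop(largest) ** 2).mean()))

print(round(rmspe_shown, 2), round(rmspe_without_largest, 2))
```

On these rows the one player with 218.1 hours accounts for nearly all of the error: removing that row cuts the RMSPE from roughly 67 hours to roughly 4 hours. Applying the same calculation to the full `players_predictions` frame would give the test-set RMSPE, which this report does not state.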

In [17]:
# Predict total hours based on experience level for the whole dataset
players_all_predictions = players_tidy.assign(
    prediction_all=player_pipe.predict(
        players_tidy[["experience"]]
    )
)
players_all_predictions

Unnamed: 0,experience,played_hours,prediction_all
0,4,30.3,1.305128
1,3,3.8,0.464103
2,3,0.0,0.464103
3,2,0.7,7.261538
4,1,0.1,6.494872
...,...,...,...
191,2,0.0,7.261538
192,3,0.3,0.464103
193,2,0.0,7.261538
194,2,2.3,7.261538


In [19]:
# Predicted vs. actual played hours by experience (all players)
players_plot = (alt.Chart(players_all_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("Total Hours Played (hr)").scale(zero=False)
)+
alt.Chart(
    players_all_predictions,
    title= "Fig. 7 All Predicted vs. Actual Gameplay Hours by Experience (K = 39)"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction_all"
))
players_plot

**KNN Model and Visualization Analysis - Prediction All**

The model tends to underestimate the most extreme high-playtime outliers and to slightly overestimate some of the very low-playtime players. This behavior is expected for a KNN model with a large K: predictions are essentially local averages, so they pull extreme values back toward typical levels. When plotted across the full dataset, this makes the line a good representation of the average engagement pattern rather than individual outcomes. Because the full dataset includes the 80% of players used to fit the model, this plot describes how the model fits the data rather than how it performs on unseen players.

The scatter layer using all players reinforces earlier findings. The distribution of `played_hours` is highly right-skewed: most players record very low total hours, while a relatively small number log extremely high values. There is substantial vertical spread, highlighting large individual differences in engagement even among players with similar experience. Despite this noise, predicted playtime differs by experience level, with Regular and Amateur players tending to have higher total playtime on average than Beginner, Veteran, and Pro players. Because all observations are included, the concentration of points at low playtime values becomes more evident, emphasizing how many players are near zero hours.

## Discussion 
**Summary of Findings**

This analysis found that total gameplay time differs between experience categories, but not in a strictly increasing way. The Amateur experience group (coded 2) had the highest average playtime among all categories, exceeding even Veteran and Pro players. This suggests that engagement does not increase monotonically with experience level; instead, players in the Amateur stage may represent a peak period of gameplay activity. Across all groups, gameplay hours were highly right-skewed, with most players recording minimal playtime and a small number logging extremely high values. This variability contributed to a cross-validated Root Mean Squared Prediction Error of approximately 16.03 hours, even for the best-performing KNN model (K = 39), indicating that individual playtime remains difficult to predict precisely.

**Comparison to Expectations**

This result was partially expected but also somewhat surprising. We anticipated that engagement would increase with experience; however, the finding that Amateur players exhibited the greatest playtime suggests that engagement may peak during the middle stages of skill development rather than at the highest levels. This would be consistent with players being most active during skill-building phases and reducing playtime once they reach advanced mastery or specific personal goals, although this dataset cannot confirm that explanation.

**Potential Impacts of These Findings**

The identification of Amateur players as the most engaged group has several practical implications:
- Retention strategies could focus on supporting players during the transition from Regular to Amateur to sustain peak engagement.
- Game content and challenges could be tailored to target Amateur-level skill development, where players may be most motivated to improve.
- Recruitment for gameplay data collection could prioritize Amateur players, as they appear likely to generate the largest gameplay volumes.
- Predictive models can use experience as a key feature while acknowledging that peak engagement may occur at intermediate rather than maximum experience levels.

**Future Directions**

These findings lead to several future research questions:
- Why does engagement peak at the Amateur level rather than at Veteran or Pro levels?
- Are Amateur players experiencing higher motivation, progression goals, or community involvement compared to other groups?
- Would adding measures such as session frequency, achievement completion, or social gameplay features better explain this pattern?
- Does the engagement peak persist across time, or does it differ for newer player cohorts?

Exploring these questions could deepen understanding of player engagement cycles and further enhance predictive modeling accuracy.

## Works Cited
Pacific Laboratory for Artificial Intelligence (PLAI). “Blocked.” Plaicraft.ai, 2025, www.plaicraft.ai. Accessed 3 Dec. 2025.