# DSCI 100 Project: Final Report

**Authors**: Rodrigo Moreno González, Trevor Lau, Sydney Peters, Megan Zhang

**Predicting Gameplay Contribution on a Research Minecraft Server**

A research group in the Department of Computer Science at UBC, the Pacific Laboratory for Artificial Intelligence (PLAI), aims to advance the limits of artificial intelligence. They host a Minecraft server that records players' actions as they navigate the world. Because running the project is complex, they want to recruit players who will contribute large amounts of gameplay data and to make sure they have the resources to handle the number of players they attract. To make the most of player contributions, they want to understand which player demographics have the greatest gameplay time. This project addresses their question of how player demographics relate to total gameplay time.

## The Question
**Question 2**: Which kinds of players are most likely to contribute a large amount of data, and how can we identify them for targeted recruitment?

**Predictive Question**: Can we predict the total `played_hours` in the `players.csv` dataset from the predictor `experience`?

**Response variable**:

`played_hours` - total play time and data contribution

**Explanatory variable**:

`experience` - player experience level

The dataset allows us to examine if `experience` relates to total playtime. If certain experience levels tend to play more, those groups may contribute more gameplay data. This helps identify the player demographics most likely to provide substantial data for the research team.


## Description: The Players
The dataset used in this analysis is a player demographics and engagement dataset that contains 196 observations (rows) and 9 variables (columns). The dataset records information about individual players, their gaming experience level, subscription status, demographic details, and the amount of time they have spent playing.

- Number of observations: 196 (each row represents one unique player)
- Number of variables: 9
- Observational unit: Player-level data
- Purpose: Identify which demographic characteristics correspond to higher gameplay time (and therefore more data contributed)
  
**Variables**
- `played_hours` (numeric) - Total hours each player spent on the server
- `experience` (categorical) -  Self-reported experience level in Minecraft
- `gender` (categorical) - Player’s gender
- `subscribe` (Boolean) - Player’s subscription status
- `hashedEmail` (String) - Player’s email address, hashed for privacy protection
- `name` (String) - Player’s name
- `age` (Numeric) - Player’s age
- `individualId` (String) - Unique player identifier
- `organizationName` (String) - Name of the organization associated with the player

**Issues to consider**
- Missing identifiers (`individualId`, `organizationName`).
- `hashedEmail` is anonymous for privacy.
- `played_hours` may include idle time, which could overestimate actual engagement.
- `experience` is self-reported, which could lead to inconsistent standards between players.
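
The missing-identifier issue can be checked directly. The following is a minimal sketch on a small made-up frame (not the real `players.csv`) that reuses a few of the column names; `toy_players`, `missing_share`, and `empty_columns` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for players.csv: three made-up rows with matching column names.
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [None, None, None],
    "organizationName": [None, None, None],
})

# Fraction of missing values in each column (1.0 means the column is entirely empty).
missing_share = toy_players.isna().mean()

# Columns that carry no information at all and can be dropped before modelling.
empty_columns = missing_share[missing_share == 1.0].index.tolist()
print(empty_columns)
```

Running the same two lines on the real `players` frame would show whether `individualId` and `organizationName` are empty for every player or only for some.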

Source of the original dataset:
https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit


In [1]:
# Import libraries
import pandas as pd
import altair as alt

# Import model selection, preprocessing, and pipeline tools
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config

# Import the K-NN regression model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# URL of the players dataset
url = "https://raw.githubusercontent.com/sydlpeters/dsci-group-2025w1-group-101-1/refs/heads/main/data/players.csv"
# Load the dataset
players = pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [2]:
# Fig 1. Total Hours by Experience (bar): summed played_hours within each experience category.
plot1 = alt.Chart(players, title="Fig 1. Total Hours Played per Experience Group").mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played (hr)"),
    color = alt.Color("experience:N").title("Experience Level")
)
plot1

In [3]:
# Fig 2. Average Hours by Experience (bar) — mean played_hours across experience categories.
plot2= alt.Chart(players, title="Fig. 2 Average Hours Played per Experience Group").mark_bar().encode(
    x=alt.X("experience:N").title("Experience Level"),
    y=alt.Y("mean(played_hours):Q").title("Average hours Played(hr)"),
    color = alt.Color("experience:N").title("Experience Level")
)
plot2

In [4]:
# Fig 3. Average Hours by Subscription Status (bar) — compare subscribed vs unsubscribed means.
plot3= alt.Chart(players, title="Fig. 3 Average Hours by Subscribed/Unsubscribed").mark_bar().encode(
    x=alt.X("subscribe:N").title("Subscription Status"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played(hr)"),
    color = alt.Color("subscribe:N").title("Subscription Status")
)
plot3

In [5]:
# Fig 4. Average Hours by Gender (bar) — mean played_hours across gender categories.
plot4= alt.Chart(players, title="Fig. 4 Average Hours Played by Gender").mark_bar().encode(
    x=alt.X("gender:N").title("Gender"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played(hr)"),
    color = alt.Color("gender:N").title("Gender")
)
plot4

In [6]:
# Fig 5. Average Hours by Age (scatter): mean played_hours at each age.
plot5 = alt.Chart(players, title="Fig 5. Average Hours Played by Age").mark_circle(opacity=0.8).encode(
    x=alt.X("age:Q").title("Age"),
    y=alt.Y("mean(played_hours):Q").title("Average Hours Played (hr)"),
    color=alt.Color("age").title("Age")
)
plot5

**Visualizations Summary**

Across all visualizations, the data show that total gameplay hours are heavily right-skewed, with most players recording very little playtime and a small number accounting for the largest values.

*Experience level* (Figures 1 & 2) displays the clearest relationship with gameplay time: average hours differ noticeably between experience categories, suggesting an association between self-reported experience and total playtime. Although variability is high within each group, these between-group differences indicate that experience is meaningfully related to engagement.

*Subscription status* (Figure 3) shows some difference between subscribed and unsubscribed players, with subscribed players tending to average slightly more gameplay hours. However, the distributions overlap substantially, meaning subscription alone does not strongly separate players by playtime.

*Gender* (Figure 4) reveals no consistent pattern across groups; average hours between genders appear similar, with any observed differences likely driven by small sample sizes or extreme outliers rather than true group-level effects. This suggests gender provides little predictive value for modeling gameplay time.

*Age* (Figure 5) shows no strong monotonic trend. A few ages exhibit higher average values, but these spikes are inconsistent and likely reflect sparse data and outliers rather than a genuine relationship. Overall, age does not demonstrate a stable or reliable association with total played hours.

Self-reported experience is the best predictor for KNN regression in this dataset, as it exhibits the strongest and most consistent relationship to total gameplay time. Subscription status may provide minor additional information, while age and gender contribute little meaningful predictive value and should not be heavily weighted in the model.
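
Because bar heights based on means can be driven by one or two heavy players, it may help to look at group counts and medians alongside the means. The sketch below uses made-up rows, not the real data; in the notebook the same `groupby` could be applied to `players`.

```python
import pandas as pd

# Made-up rows for illustration only; the real analysis would use the `players` frame.
toy = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Amateur", "Pro", "Pro", "Veteran"],
    "played_hours": [0.0, 0.5, 100.0, 1.0, 3.0, 0.2],
})

# Count, mean, and median per group: a mean far above the median signals
# that a few heavy players are driving the bar height.
group_summary = (
    toy.groupby("experience")["played_hours"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(group_summary)
```

In this toy example the Amateur mean is pulled up by a single 100-hour player while its median stays at 0.5 hours, which is the kind of gap the summary above attributes to outliers.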


In [7]:
# Drop unused and non-predictive columns
players_tidy = players.drop(columns=["individualId", "organizationName", "name", "gender", "hashedEmail", "age", "subscribe"])
# Preview tidy players_tidy
players_tidy

Unnamed: 0,experience,played_hours
0,Pro,30.3
1,Veteran,3.8
2,Veteran,0.0
3,Amateur,0.7
4,Regular,0.1
...,...,...
191,Amateur,0.0
192,Veteran,0.3
193,Amateur,0.0
194,Amateur,2.3


In [8]:
# View full summary (numeric + categorical)
players_tidy.describe(include="all")

Unnamed: 0,experience,played_hours
count,196,196.0
unique,5,
top,Amateur,
freq,63,
mean,,5.845918
std,,28.357343
min,,0.0
25%,,0.0
50%,,0.1
75%,,0.6


**Statistics Summary**

The distribution of `played_hours` is strongly right-skewed. Most players recorded very low playtime: three-quarters played 0.6 hours or less, and the median is only 0.1 hours, while the mean of 5.85 hours reflects a small number of high-usage players. The standard deviation of 28.36 hours, nearly five times the mean, confirms that most players contributed minimal gameplay while a few contributed disproportionately.

Experience levels are unevenly represented across the five categories; Amateur is the most common, with 63 of the 196 players. This imbalance contributes to variability in gameplay hours but still allows examination of how self-reported experience relates to total playtime. The combination of skewed playtime and varied experience makes KNN regression a reasonable choice, since it relies on local similarity and makes no strict assumptions about the shape of the distribution.

Overall, the statistics reveal substantial variation in both gameplay and experience, supporting flexible modeling approaches. They highlight experience as a key predictor and emphasize the challenges of predicting individual behavior, while motivating exploration of experience-based patterns to identify highly engaged players.
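
To make the idea of a few players contributing disproportionately concrete, the following sketch computes the mean, the median, and the share of all hours that comes from the top 20% of players. The values are made up to mimic a right-skewed sample and are not drawn from `players.csv`; the 20% cut-off is an arbitrary choice for illustration.

```python
import numpy as np

# Made-up right-skewed sample (not the real data): mostly near-zero values plus two heavy players.
hours = np.array([0.0, 0.0, 0.0, 0.1, 0.1, 0.2, 0.5, 1.0, 40.0, 160.0])

mean_hours = hours.mean()
median_hours = np.median(hours)

# Share of all recorded hours contributed by the top 20% of players.
top_n = max(1, int(0.2 * len(hours)))
top_share = np.sort(hours)[-top_n:].sum() / hours.sum()

print(f"mean={mean_hours:.2f}, median={median_hours:.2f}, top-20% share={top_share:.3f}")
```

Applied to the real `played_hours` column, the same calculation would give a single number for how concentrated the data contribution is.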

In [9]:
# Create a numeric mapping for ordered experience levels
experience_map = {
    "Beginner" : 0,
    "Regular": 1,
    "Amateur": 2,
    "Veteran": 3,
    "Pro": 4
}
# Convert the categorical experience column to numeric values
players_tidy["experience"] = players_tidy["experience"].map(experience_map)
players_tidy

Unnamed: 0,experience,played_hours
0,4,30.3
1,3,3.8
2,3,0.0
3,2,0.7
4,1,0.1
...,...,...
191,2,0.0
192,3,0.3
193,2,0.0
194,2,2.3


In [10]:
# Split the data
players_training, players_testing = train_test_split(
    players_tidy, 
    test_size=0.2, 
    random_state=1234
)

# Set target and predictors
X_train = players_training[["experience"]]
y_train = players_training["played_hours"]

X_test = players_testing[["experience"]]
y_test = players_testing["played_hours"]

In [11]:
# Create preprocessor
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["experience"]),
    remainder="drop",
    verbose_feature_names_out=False,
)

# Create pipeline
players_pipeline = make_pipeline(
    players_preprocessor,
    KNeighborsRegressor()
)

# Evaluate the performance of players_pipeline using 5-fold cross-validation
players_cv = pd.DataFrame(
    cross_validate(
        players_pipeline,
        X_train,
        y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)

players_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.003967,0.002549,-19.160247,-27.106263
1,0.003282,0.002085,-5.8032,-21.794819
2,0.003095,0.001924,-29.223031,-16.37408
3,0.00318,0.00191,-3.798582,-21.907992
4,0.003042,0.001884,-31.885128,-15.327369


In [12]:
# Create the 5-fold GridSearchCV object
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 51),
}

players_tuned = GridSearchCV(
    estimator=players_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# fit the GridSearchCV object
players_result = pd.DataFrame(players_tuned.fit(X_train, y_train).cv_results_)

players_result

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.006572,0.006619,0.002085,0.000174,1,{'kneighborsregressor__n_neighbors': 1},-9.122226,-8.72342,-30.005672,-12.052734,-29.268655,-17.834541,9.707952,45
1,0.020499,0.03498,0.00196,0.000139,2,{'kneighborsregressor__n_neighbors': 2},-9.143509,-5.170949,-29.994092,-6.304415,-30.493426,-16.221278,11.52331,20
2,0.003002,3.5e-05,0.00188,6e-06,3,{'kneighborsregressor__n_neighbors': 3},-31.37984,-4.345652,-30.140029,-4.925939,-32.622795,-20.682851,13.127151,50
3,0.003001,6.6e-05,0.00186,7e-06,4,{'kneighborsregressor__n_neighbors': 4},-23.631453,-3.934581,-30.069712,-4.316454,-32.123221,-18.815084,12.31755,49
4,0.002991,5e-05,0.001866,5e-06,5,{'kneighborsregressor__n_neighbors': 5},-19.160247,-5.8032,-29.223031,-3.798582,-31.885128,-17.974038,11.580485,47
5,0.002958,7e-06,0.001868,2.7e-05,6,{'kneighborsregressor__n_neighbors': 6},-16.380491,-4.982276,-29.339701,-3.162105,-31.815267,-17.135968,11.899284,40
6,0.002975,6e-05,0.001857,2e-05,7,{'kneighborsregressor__n_neighbors': 7},-14.73491,-4.539553,-28.152575,-5.217285,-31.784133,-16.885692,11.332351,38
7,0.002969,5.3e-05,0.001869,4.3e-05,8,{'kneighborsregressor__n_neighbors': 8},-13.336478,-14.027496,-28.313692,-4.558332,-31.774824,-18.402164,10.134269,48
8,0.00298,8e-05,0.001849,6e-06,9,{'kneighborsregressor__n_neighbors': 9},-12.357332,-12.422075,-28.4438,-4.079793,-31.773035,-17.815207,10.538727,44
9,0.002941,6e-06,0.001858,1e-05,10,{'kneighborsregressor__n_neighbors': 10},-13.827831,-11.185561,-28.565624,-3.665939,-31.778249,-17.804641,10.682608,43


In [13]:
# Get the best parameter values
players_min = players_tuned.best_params_
players_min

{'kneighborsregressor__n_neighbors': 39}

In [14]:
players_result["rmse"] = -players_result["mean_test_score"]

players_k_plot = (
    alt.Chart(players_result, title="KNN: RMSE vs Number of Neighbours (k)")
    .mark_line()
    .encode(
        x=alt.X("param_kneighborsregressor__n_neighbors:Q",
                title="k (Number of Neighbours)"),
        y=alt.Y("rmse:Q",
                title="5-fold CV RMSE"),
    )
)
players_k_plot

In [15]:
# Extract the best (lowest) cross-validation error from the tuned KNN model
players_best_RMSPE = -players_tuned.best_score_
players_best_RMSPE

np.float64(16.028531937252072)

In [16]:
# predict on the training set
best_model = players_tuned.best_estimator_
train_pred = best_model.predict(X_train)

# calculate RMSE
train_rmse = (mean_squared_error(y_train, train_pred)) ** 0.5
train_rmse

np.float64(19.51054219122817)

**Cross-Validation Summary**

Grid search cross-validation identified K = 39 as the optimal number of neighbors for the KNN regression model. This indicates that the model performed best when each prediction was based on the average gameplay time of the 39 most similar players. Choosing a relatively large K helps stabilize predictions by reducing the influence of extreme outliers—an important consideration given the highly right-skewed distribution of total gameplay hours in the dataset.

The tuned model achieved a cross-validated RMSPE of about 16.03 hours. Because RMSPE squares each error before averaging, a few very large misses dominate it, so this figure is better read as a typical error size weighted toward the worst predictions than as a simple average difference. While this error may appear large, it is expected given the extreme variability in gameplay behavior: most players recorded close to zero hours, while a small number logged disproportionately high totals.
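
As a rough illustration of how a root-mean-squared error reacts to a single heavy player, the sketch below compares it with the mean absolute error (MAE) on made-up values. MAE is not used elsewhere in this report; it is introduced here only as a point of comparison, and the constant prediction of 5 hours is an arbitrary assumption.

```python
import numpy as np

# Made-up values for illustration: nine low-hour players and one heavy player.
actual = np.array([0.0, 0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 150.0])
predicted = np.full_like(actual, 5.0)  # a constant, smoothed prediction

errors = actual - predicted
rmspe = np.sqrt(np.mean(errors ** 2))  # squares each error before averaging
mae = np.mean(np.abs(errors))          # plain average of absolute errors

print(f"RMSPE={rmspe:.2f}, MAE={mae:.2f}")
```

The squared term lets the one 145-hour miss dominate the RMSPE, so it ends up well above the MAE even though nine of the ten errors are 5 hours or less.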

Overall, the cross-validation results show that the model captures broad patterns—particularly those associated with player experience—but cannot precisely predict individual gameplay time due to the dataset’s skewness and high variance.
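
One way to judge whether a cross-validated error is large is to compare it with a baseline that ignores the predictor entirely. The sketch below does this with scikit-learn's `DummyRegressor` on synthetic stand-in data; it is not part of the original analysis, and the data, the helper `cv_rmse`, and the random seed are all assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not players.csv): an integer-coded
# predictor with five levels and a right-skewed response.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(150, 1)).astype(float)
y = rng.exponential(scale=5.0, size=150)

def cv_rmse(model):
    """Mean 5-fold cross-validated RMSE for a regressor."""
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    return -scores.mean()

baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))  # always predicts the training mean
knn_rmse = cv_rmse(KNeighborsRegressor(n_neighbors=39))

print(f"baseline={baseline_rmse:.2f}, knn={knn_rmse:.2f}")
```

If the tuned KNN model on the real data did not beat such a baseline by a clear margin, that would suggest `experience` adds little beyond the overall mean.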

In [17]:
# Set K, make pipeline, and fit the training data
knn = KNeighborsRegressor(n_neighbors = 39)
player_pipe = make_pipeline(players_preprocessor, knn)
player_pipe.fit(X_train, y_train)

# Predict total hours based on experience level from the testing set
players_predictions = players_testing.assign(prediction = player_pipe.predict(X_test))
players_predictions

Unnamed: 0,experience,played_hours,prediction
101,2,0.0,7.261538
51,1,218.1,6.494872
146,4,0.0,1.305128
153,0,0.1,1.410256
106,1,0.0,6.494872
59,3,0.2,0.464103
161,3,0.0,0.464103
167,0,0.3,1.410256
193,2,0.0,7.261538
88,0,0.0,1.410256


In [18]:
# predict on the test set
test_pred = player_pipe.predict(X_test)

# calculate RMSE (no numpy)
test_rmse = mean_squared_error(y_test, test_pred) ** 0.5
test_rmse

np.float64(48.68237707432575)

In [19]:
# Predicted vs total played hours by experience 
players_plot = (alt.Chart(players_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("played_hours").scale(zero=False)
)+
alt.Chart(
    players_predictions,
    title= "Fig. 6 Predicted vs Total Gameplay Hours by Experience (K = 39)"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction"
))
players_plot

**KNN Model and Visualization Analysis**

The KNN regression model used the tuned hyperparameter K = 39, selected via cross-validated grid search. Because KNN relies on distance-based similarity, the predictor variable (experience) was standardized to ensure meaningful comparisons. The model was trained on 80% of the cleaned dataset, and predictions were generated for the remaining 20%. Each predicted value represents the average total gameplay hours of the 39 most similar players based on experience, and these predictions were added to the test set for evaluation and visualization.

The scatterplot shows a strongly right-skewed gameplay distribution: most players recorded very low hours, while a few logged extremely high values, creating substantial vertical spread and limiting precise individual predictions. The fitted prediction line is not monotonic. In the predictions shown above, the Regular and Amateur levels (codes 1 and 2) receive the highest estimates, about 6.5 and 7.3 hours, while the Beginner, Veteran, and Pro levels receive about 1.4, 0.5, and 1.3 hours. Experience therefore separates players into higher- and lower-playtime groups rather than tracing a steady increase.

The prediction line reflects the large neighborhood size (K = 39). With many neighbors contributing, each prediction is an average that is less sensitive to any single extreme value. Consequently, the model under-predicts high-usage players (for example, 6.5 hours predicted for a player with 218.1 recorded hours) and over-predicts near-zero players. The test-set RMSPE of about 48.68 hours is roughly three times the cross-validated RMSPE of 16.03 hours, which shows how strongly a few heavy players in the test split drive the error; the model captures group-level differences but not individual values.
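
The averaging behaviour described above can be illustrated with a small hand-written version of KNN regression on one predictor. This is a sketch on made-up data, not the scikit-learn implementation used in the analysis; `knn_predict` is a hypothetical helper, and its tie-breaking rule is an assumption.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Predict by averaging the responses of the k training points nearest to query_x."""
    distances = np.abs(train_x - query_x)
    # A stable sort breaks distance ties by original row order; scikit-learn's
    # tie-breaking may differ, so treat this as an illustration only.
    nearest = np.argsort(distances, kind="stable")[:k]
    return train_y[nearest].mean()

# Made-up training data: experience codes and hours (not the real players).
train_x = np.array([0, 0, 1, 1, 2, 2, 3, 3], dtype=float)
train_y = np.array([0.0, 2.0, 0.0, 100.0, 1.0, 3.0, 0.0, 0.4])

small_k = knn_predict(train_x, train_y, query_x=1.0, k=2)  # only the two level-1 players
large_k = knn_predict(train_x, train_y, query_x=1.0, k=6)  # also pulls in neighbouring levels
print(small_k, large_k)
```

With a predictor that takes only five distinct values, many training points sit at exactly the same distance, so a large K both dilutes outliers and blends adjacent experience levels into each prediction.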

In [27]:
# Predict total hours based on experience level from whole data set
players_all_predictions = players_tidy.assign(
    prediction_all=player_pipe.predict(
        players_tidy[["experience"]]
    )
)
players_all_predictions

Unnamed: 0,experience,played_hours,prediction_all
0,4,30.3,1.305128
1,3,3.8,0.464103
2,3,0.0,0.464103
3,2,0.7,7.261538
4,1,0.1,6.494872
...,...,...,...
191,2,0.0,7.261538
192,3,0.3,0.464103
193,2,0.0,7.261538
194,2,2.3,7.261538


In [24]:
# predict on ALL players
all_pred = best_model.predict(players_tidy[["experience"]])

# RMSE on the entire dataset
final_rmse = mean_squared_error(players_tidy["played_hours"], all_pred) ** 0.5
final_rmse

np.float64(28.04716413446685)

In [28]:
# All predicted vs total played hours by experience 
players_plot = (alt.Chart(players_all_predictions).mark_circle(opacity=0.7).encode(
    x=alt.X("experience").title("Experience").scale(zero=False),
    y=alt.Y("played_hours").title("played_hours").scale(zero=False)
)+
alt.Chart(
    players_all_predictions,
    title= "Fig. 7 All Predicted vs. Actual Gameplay Hours by Experience (K = 39)"
).mark_line(
    color="black"
).encode(
    x="experience",
    y="prediction_all"
))
players_plot

**KNN Prediction on Full Dataset**

The KNN model was applied to predict total gameplay hours for all players in the dataset using the previously trained pipeline and tuned hyperparameter K = 39. Predictions were generated based solely on the experience variable, with each player’s estimated total hours calculated as the average of the 39 most similar players.

To evaluate model performance across the entire dataset, predictions were compared to actual total playtime, giving an overall RMSE of about 28.05 hours. Because most of these rows were also used to train the model, this figure is not an independent estimate of prediction error; the test-set RMSPE remains the fairer measure.

The resulting visualization plots actual gameplay hours against experience for all players, with predicted values overlaid as a black line. The scatterplot highlights the right-skewed distribution of playtime: most players logged minimal hours, while a few logged very high totals. As on the test set, the prediction line is not monotonic: estimates are highest for the Regular and Amateur levels (about 6.5 and 7.3 hours) and lowest for Veteran (about 0.5 hours).

The use of K = 39 neighbors averages over many players, moderating the influence of extreme outliers. Consequently, the model generates conservative estimates, under-predicting high-usage players and over-predicting low-usage players. Overall, these results support experience as the most informative of the predictors examined and show that the model captures group-level patterns across the full dataset rather than individual playtime.
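
The full-dataset RMSE can be related to the training and test errors reported earlier, since the pooled mean squared error is a size-weighted average of the two. The sketch below works through that arithmetic; the split sizes are an assumption derived from 196 rows and `test_size=0.2`, not values printed in the notebook.

```python
import math

# RMSE values reported in the notebook outputs above.
train_rmse = 19.51054219122817
test_rmse = 48.68237707432575

# Assumed split sizes: 196 rows with test_size=0.2 (scikit-learn rounds the test set up).
n_test = math.ceil(0.2 * 196)
n_train = 196 - n_test

# Pooled MSE is the size-weighted average of the two subsets' MSEs.
pooled_mse = (n_train * train_rmse**2 + n_test * test_rmse**2) / (n_train + n_test)
pooled_rmse = math.sqrt(pooled_mse)
print(round(pooled_rmse, 2))
```

Under these assumptions the pooled value comes out at roughly 28 hours, in line with the full-dataset RMSE reported above, which suggests that figure mostly reflects the training rows rather than new information about generalization.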

## Discussion
**Summary of Findings**

This analysis found that player experience level is the most informative of the available predictors of total gameplay time, with notable differences across experience categories. Playtime did not rise steadily with experience: the Amateur experience group recorded the highest average hours, surpassing even the Veteran and Pro groups. This suggests that engagement does not follow a simple linear progression. Instead, players in the Amateur stage may represent a peak in activity, possibly reflecting a period where they are still improving rapidly, exploring game mechanics, and remaining highly motivated to engage.

The distribution of gameplay hours across the dataset was heavily right-skewed. Most players logged very low total hours, while a small subset contributed exceptionally high totals, creating substantial variability. This variability influenced the KNN model’s performance: even with the optimal K = 39, the cross-validated RMSPE remained relatively high at approximately 16.03 hours, and the test-set RMSPE was higher still at about 48.68 hours. The model captured general differences between experience levels but struggled with precise individual-level predictions due to extreme differences in player behavior.

**Comparison to Expectations**

These results partially aligned with expectations. It was reasonable to predict that more experienced players would play more, but the finding that Amateur players exhibited the highest playtime challenges the assumption that engagement increases steadily with experience. Instead, it suggests that players may be most active during intermediate stages of skill development. This aligns with observed patterns in gaming communities, where motivation peaks during periods of improvement, achievement unlocking, or competition, and may plateau or decline once mastery is reached.

**Implications**

- Retention strategies could focus on supporting players as they transition into the Amateur stage, reinforcing engagement during peak activity.
- Game design—including tutorials, progression systems, or difficulty balancing—might emphasize content tailored to Amateur players, who contribute disproportionately to total playtime.
- Data collection efforts may prioritize Amateur players due to their high engagement levels.
- Predictive modeling should continue to incorporate experience level as a key feature while accounting for peak intermediate engagement.

**Future Directions**

- Investigate why engagement peaks at the Amateur level: motivation, progression goals, or social interaction.
- Identify which factors—such as session frequency, achievements, or multiplayer interactions—most influence peak playtime.
- Assess whether the engagement pattern is stable across cohorts or time periods.
- Explore incorporating additional predictors (behavioral indicators, in-game performance metrics) to improve model accuracy.
- Examine longitudinal data to understand how player engagement evolves over time and how it relates to experience development.

## Works Cited
Pacific Laboratory for Artificial Intelligence (PLAI). “Blocked.” Plaicraft.ai, 2025, www.plaicraft.ai. Accessed 3 Dec. 2025.