# Predicting whether a player is likely to contribute a large amount of data #

In [4]:
import pandas as pd
import altair as alt
import numpy as np

from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_squared_error

# Output dataframes instead of arrays
set_config(transform_output="pandas")

## Introduction ##

PLAICraft is a data collection project that gathers gameplay data from Minecraft players. It is run by UBC’s Pacific Laboratory for Artificial Intelligence (PLAI), whose goal is to advance artificial intelligence. The project focuses on creating an embodied AI, that is, an AI that can understand and learn from its environment (here, Minecraft). It relies on recordings of players' speech and key presses. To help with data collection, the team is interested in targeted recruiting of the demographics that contribute the most data. To that end, PLAI has collaborated with students from DSCI 100 to identify the "kinds" of players most likely to provide a significant amount of data.

The provided dataset from `players.csv` has 9 variables: `experience`, `subscribe`, `hashedEmail`, `played_hours`, `name`, `gender`, `age`, `individualId`, and `organizationName`. Among these, `name`, `hashedEmail`, `individualId`, and `organizationName` are identifiers that do not relate to any measured property relevant to this analysis, so they are dropped. `gender` is kept for exploratory plots but is not used as a predictor. The data is tidy because every row represents an observation, every column represents a variable, and every entry is a value.

To answer which demographic contributes the most data, a variant of `played_hours` is our target variable. To support downstream classification, players with at least 4 hours of playtime are categorized as "high", while those with less than 4 hours are categorized as "low" in the new column `play_time`. Although 4 hours is not a large amount of playtime, the overall play hours in the dataset are low; therefore, this threshold was set to create proportionate categories. Players with "high" `play_time` are the "kinds" of players who are likely to contribute a large amount of data. A K-Nearest Neighbors (KNN) classifier will be used to classify whether or not someone will have high play time based on their `experience`, `subscribe`, and `age`. In addition, KNN regression models will be used to check whether there are any trends between `experience` or `age` and play hours.

## Dataset & Preprocessing ##

In [6]:
player_data = pd.read_csv("data/players.csv").drop(columns = ['individualId','organizationName','hashedEmail','name'])
#Dropped columns don't contribute to our question
player_data.head()

| | experience | subscribe | played_hours | gender | age |
|---|---|---|---|---|---|
| 0 | Pro | True | 30.3 | Male | 9 |
| 1 | Veteran | True | 3.8 | Male | 17 |
| 2 | Veteran | False | 0.0 | Male | 17 |
| 3 | Amateur | True | 0.7 | Female | 21 |
| 4 | Regular | True | 0.1 | Male | 21 |


In [7]:
# create a new column to show the classification of playtime
player_data["play_time"] = ['high' if played_hours >= 4 else 'low' for played_hours in player_data['played_hours']]
player_data.head()

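| | experience | subscribe | played_hours | gender | age | play_time |
|---|---|---|---|---|---|---|
| 0 | Pro | True | 30.3 | Male | 9 | high |
| 1 | Veteran | True | 3.8 | Male | 17 | low |
| 2 | Veteran | False | 0.0 | Male | 17 | low |
| 3 | Amateur | True | 0.7 | Female | 21 | low |
| 4 | Regular | True | 0.1 | Male | 21 | low |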


In [8]:
player_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   experience    196 non-null    object 
 1   subscribe     196 non-null    bool   
 2   played_hours  196 non-null    float64
 3   gender        196 non-null    object 
 4   age           196 non-null    int64  
 5   play_time     196 non-null    object 
dtypes: bool(1), float64(1), int64(1), object(3)
memory usage: 8.0+ KB


I loaded the player dataset and removed identifier-related columns that do not contribute to the analysis. I then created a binary target variable, `play_time`, to reframe the problem as a classification task distinguishing between high and low playtime users. Finally, I inspected data types and missing values to ensure the dataset was suitable for downstream analysis.
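
As a rough illustration of how a cutoff for `play_time` could be checked for balance, the sketch below compares the share of "high" and "low" labels under a few candidate cutoffs. The values in `toy` are made up for illustration and are not taken from `players.csv`; the helper `class_shares` and the name `shares_by_cutoff` are hypothetical as well.

```python
import pandas as pd

# Hypothetical playtime values (hours); NOT taken from players.csv
toy = pd.DataFrame(
    {"played_hours": [0.0, 0.0, 0.1, 0.3, 0.7, 1.5, 3.8, 4.0, 12.5, 30.3]}
)


def class_shares(hours: pd.Series, cutoff: float) -> pd.Series:
    """Share of 'high' and 'low' labels when 'high' means hours >= cutoff."""
    labels = hours.ge(cutoff).map({True: "high", False: "low"})
    shares = labels.value_counts(normalize=True)
    # Keep both labels in a fixed order, even if one of them is empty
    return shares.reindex(["high", "low"], fill_value=0.0)


# One column per candidate cutoff, one row per label
shares_by_cutoff = pd.DataFrame(
    {cutoff: class_shares(toy["played_hours"], cutoff) for cutoff in (1, 2, 4)}
)
print(shares_by_cutoff)
```

Running the same helper on the real `played_hours` column would show how unbalanced the two classes are at each cutoff, which matters because KNN classification tends to favour the majority class.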


## EDA ##

### Target Distribution (played_hours) ###

In [16]:
alt.Chart(player_data).mark_bar().encode(
    x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=40), title='Played Hours'),
    y=alt.Y('count()', title='Count')
).properties(
    width=500,
    height=300,
    title='Distribution of Played Hours'
)

The distribution of played hours is highly right-skewed, with the vast
majority of players exhibiting very short playtime and a small number of
extreme values extending the scale.
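
One common way to look at a right-skewed variable like this is a `log1p` transform, which compresses the long tail while staying defined at 0 hours. The sketch below uses made-up values (the array `hours` is hypothetical, not the real data) and is only meant to show the effect on the mean and median.

```python
import numpy as np

# Hypothetical right-skewed playtime values (hours); not from players.csv
hours = np.array([0.0, 0.0, 0.1, 0.1, 0.2, 0.5, 1.0, 3.8, 30.3, 56.1])

# log1p(x) = log(1 + x) is defined at 0, which matters when many players have 0 hours
log_hours = np.log1p(hours)

# On the raw scale the mean sits far above the median because of the extreme values
print("raw mean / median:", hours.mean(), np.median(hours))
print("log1p mean / median:", log_hours.mean(), np.median(log_hours))

# expm1 undoes the transform, so values can be mapped back to hours
recovered = np.expm1(log_hours)
```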

### Overview of Age, Gender, and Play Time ###

In [18]:
plot = alt.Chart(player_data).mark_circle().encode(
    x=alt.X("age")
        .title(["Age"]),
    y=alt.Y("gender")
        .title(["Gender"]),
    color=alt.Color("play_time")
        .title("play time")
).configure_axis(titleFontSize=19).properties(height=400,width=300)
plot

This overview plot shows the distribution of age and gender across high
and low playtime groups. No strong clustering of high playtime users by
age or gender is immediately apparent.

### Age vs played_hours ###

In [25]:
alt.Chart(player_data).mark_circle(opacity=0.4).encode(
    x=alt.X('age:Q', title='Age'),
    y=alt.Y('played_hours:Q', title='Played Hours')
).properties(
    width=500,
    height=300,
    title='Age vs Played Hours'
)

This graph shows a highly skewed distribution, with a small number of
players exhibiting extremely high playtime across various ages. These
extreme values obscure any clear age–playtime relationship.

### Experience vs played_hours ###

#### Preliminary Graph ####

In [14]:
#Plot preliminary graph
experience_plot = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('played_hours:Q').title("Played Hours"),
    y=alt.Y("experience:N").title("Experience")
)
experience_plot

Played hours vary widely across experience levels, with extreme values
present in multiple categories. No clear monotonic or linear relationship
is observed between experience and playtime.

### Remove Outliers ###

Because a small number of extreme playtime values dominated the scale,
I restricted the analysis to players with 0–5 hours of playtime to better
observe patterns among the majority of users. This filtered dataset is
used for subsequent modeling.

In [15]:
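# Remove outliers for played_hours
# .copy() makes an independent DataFrame so later column assignments do not trigger SettingWithCopyWarning
filtered_player_data = player_data[player_data['played_hours'] <= 5].copy()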

#Plot filter preliminary graph
experience_plot_filtered = alt.Chart(filtered_player_data).mark_point(opacity=0.4).encode(
    x=alt.X('played_hours:Q').title("Played Hours").scale(zero=False),
    y=alt.Y("experience:N").title("Experience")
)
experience_plot_filtered

**Scatter plot of** `played_hours` (x-axis) vs. `experience` (y-axis) **after filtering extreme values.** The majority of players exhibit short playtime,
with no clear separation across experience levels.

In [None]:
#Mean, median, max, min and standard deviation of played_hours in filtered_player_data
filtered_player_data_info= filtered_player_data['played_hours'].agg(['mean', 'median', 'max', 'min', 'std']).reset_index()
print("Info:",filtered_player_data_info)

#Mode of filtered_player_data
filtered_player_data_mode=filtered_player_data['played_hours'].mode()
print("Mode",filtered_player_data_mode)

Info:     index  played_hours
0    mean      1.979474
1  median      0.100000
2     max     56.100000
3     min      0.000000
4     std      7.685742
Mode 0    0.0
Name: played_hours, dtype: float64


## Modeling ##

In [26]:
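#change experience value to numeric value
#Note: the codes follow alphabetical order, not skill level, so distances between codes are arbitrary
#.map avoids the downcasting warning that .replace can raise in recent pandas versions
filtered_player_data['experience_numeric'] = filtered_player_data['experience'].map({
    'Amateur': 1,
    'Beginner': 2,
    'Pro': 3,
    'Regular': 4,
    'Veteran': 5
})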

X = filtered_player_data[['experience_numeric']]
y = filtered_player_data['played_hours']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)



In [None]:
experience_param_grid = {'n_neighbors': range(1, 21)}

grid_search = GridSearchCV(
    KNeighborsRegressor(),
    experience_param_grid,
    cv=5,
    scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

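best_k = grid_search.best_params_['n_neighbors']
# With scoring='neg_mean_squared_error', best_score_ is a negated MSE (hours squared), not an RMSPE
best_cv_mse = -grid_search.best_score_

print(best_k)
print(best_cv_mse)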

15
0.5971404115226336


In [None]:
knn_regressor = KNeighborsRegressor(n_neighbors=best_k)
knn_regressor.fit(X_train, y_train)

In [None]:
#Evaluating RMSPE on the test set
experience_prediction = grid_search.predict(X_test)
experience_summary = mean_squared_error(
    y_true=y_test,
    y_pred=experience_prediction
)**(1/2)
print("RMSPE of test set:", experience_summary)

RMSPE of test set: 0.7924596333766161
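
For reference, the RMSPE is the square root of the mean squared prediction error, so it is expressed in hours rather than hours squared. In scikit-learn, `scoring='neg_mean_squared_error'` returns a negated MSE, while `scoring='neg_root_mean_squared_error'` returns a negated RMSE. The sketch below checks the relationship on made-up numbers; `y_true_toy` and `y_pred_toy` are hypothetical and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted hours; not results from this notebook
y_true_toy = np.array([0.0, 0.1, 0.7, 3.8])
y_pred_toy = np.array([0.5, 0.5, 0.5, 0.5])

# Mean squared error: the average squared difference, in hours squared
mse = mean_squared_error(y_true_toy, y_pred_toy)

# Root mean squared (prediction) error: back on the original scale, in hours
rmspe = np.sqrt(mse)

# Same quantity written out directly from its definition
rmspe_by_hand = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))

print("MSE:", mse)
print("RMSPE:", rmspe)
```

Because the two quantities are on different scales, a cross-validated MSE and a test-set RMSPE should not be compared directly.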


In [None]:
np.random.seed(33)
#Predict the hours played for experience
experience_preds = filtered_player_data.assign(
    predictions= grid_search.predict(filtered_player_data[['experience_numeric']])
)
#Plot all players
experience_plot = alt.Chart(experience_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('experience_numeric').title('Experience').scale(zero=False),
    y=alt.Y('played_hours').title('Hours Played')
)
#Add prediction line
experience_plot = experience_plot + alt.Chart(experience_preds, title= "K=15").mark_line(color="Black").encode(
    x="experience_numeric",
    y="predictions",
)
experience_plot

**Experience level does not exhibit a clear relationship with hours played.**
The KNN model (K=15) produces a relatively flat prediction across experience
levels, indicating that higher experience does not consistently correspond
to increased playtime. Most observations remain concentrated below 0.5 hours,
with only minor variation across experience categories.

### Comparison: KNN Regression Using Age ###

In [None]:
#Split data into training and testing dataframes
player_training, player_testing = train_test_split(
    filtered_player_data,
    test_size=0.25,
    random_state=33,
)
X_train = player_training[['age']]
y_train = player_training['played_hours']

X_test = player_testing[['age']]
y_test = player_testing['played_hours']

In [None]:
# Preprocess the data, make the pipeline
age_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)

In [None]:
# create the 5-fold GridSearchCV object
param_grid = {
    'kneighborsregressor__n_neighbors': range(1, 111, 1) #neighbors ranging from 1 to 110
}
age_tuned = GridSearchCV(
    age_pipe,
    param_grid,
    cv=5,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)

# Fit the GridSearchCV object and retrieve the CV scores
age_results = pd.DataFrame(age_tuned.fit(X_train, y_train).cv_results_)
age_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.009230,0.011443,0.002135,0.000126,1,{'kneighborsregressor__n_neighbors': 1},-0.722066,-12.073896,-11.456517,-6.681023,-9.002519,-7.987204,4.105031,107
1,0.002815,0.000038,0.001951,0.000037,2,{'kneighborsregressor__n_neighbors': 2},-7.639823,-15.316804,-11.227851,-6.806057,-6.596516,-9.517410,3.345653,110
2,0.002750,0.000013,0.001890,0.000005,3,{'kneighborsregressor__n_neighbors': 3},-5.155806,-13.604928,-11.103494,-6.162283,-6.629261,-8.531154,3.255206,109
3,0.002933,0.000324,0.001931,0.000065,4,{'kneighborsregressor__n_neighbors': 4},-4.238870,-13.077508,-11.234713,-5.857979,-5.912204,-8.064255,3.444331,108
4,0.004435,0.003369,0.001887,0.000006,5,{'kneighborsregressor__n_neighbors': 5},-3.410022,-12.747023,-11.284856,-6.075152,-5.287915,-7.760994,3.610187,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,0.003003,0.000591,0.002129,0.000160,106,{'kneighborsregressor__n_neighbors': 106},-2.242995,-12.079505,-10.663243,-5.657835,-4.618947,-7.052505,3.723086,68
106,0.002752,0.000097,0.002456,0.000734,107,{'kneighborsregressor__n_neighbors': 107},-2.300962,-12.056163,-10.660851,-5.665565,-4.619813,-7.060671,3.700680,73
107,0.002683,0.000059,0.002009,0.000015,108,{'kneighborsregressor__n_neighbors': 108},-2.299375,-12.061981,-10.657457,-5.650980,-4.617129,-7.057384,3.703457,72
108,0.002736,0.000103,0.002012,0.000031,109,{'kneighborsregressor__n_neighbors': 109},-2.284726,-12.049145,-10.665866,-5.653414,-4.616823,-7.053995,3.705253,70


In [None]:
#Find the best K and its RMSPE value
age_min = age_tuned.best_params_
age_best_RMSPE = -age_tuned.best_score_
print("Best Parameters (age_min):", age_min)
print("Best RMSPE (age_best_RMSPE):", age_best_RMSPE)

Best Parameters (age_min): {'kneighborsregressor__n_neighbors': 70}
Best RMSPE (age_best_RMSPE): 6.946063563917676


In [None]:
#Evaluating RMSPE on the test set
age_prediction = age_tuned.predict(X_test)
age_summary = mean_squared_error(
    y_true=y_test,
    y_pred=age_prediction
)**(1/2)
print("RMSPE of test set:", age_summary)

RMSPE of test set: 6.9597827719687615


In [None]:
np.random.seed(33)
#Predict the hours played for age
age_preds = filtered_player_data.assign(
    predictions= age_tuned.predict(filtered_player_data[['age']])
)
#Plot all players
age_plot = alt.Chart(age_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('age').title('Age').scale(zero=False),
    y=alt.Y('played_hours').title('Hours Played')
)
#Add prediction line
age_plot = age_plot + alt.Chart(age_preds, title= "K=70").mark_line(color="Black").encode(
    x="age",
    y="predictions",
)
age_plot

**No relationship between age and hours played.** Scatter plot of the quantitative variables in `filtered_player_data`: `age` (x-axis) is plotted against `played_hours` (y-axis), with **n=190** players represented. The black line shows the predicted hours played from the K-NN regression model (K=70).
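
Part of the flatness of the prediction line may come from the size of K itself: as K approaches the number of training points, KNN regression averages over almost everyone and its predictions approach the overall mean. The sketch below illustrates this on synthetic data in which the outcome is unrelated to the predictor by construction; the arrays `age_toy` and `hours_toy` and the dictionary `spread` are hypothetical and not derived from the PLAICraft data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data for illustration only; not from players.csv
age_toy = rng.integers(9, 50, size=100).reshape(-1, 1).astype(float)
hours_toy = rng.exponential(scale=1.0, size=100)

# Evenly spaced ages at which to evaluate each fitted model
grid = np.linspace(9, 49, 50).reshape(-1, 1)

# For each K, record how much the predictions vary across the age grid
spread = {}
for k in (1, 10, 70, 100):
    model = KNeighborsRegressor(n_neighbors=k).fit(age_toy, hours_toy)
    grid_preds = model.predict(grid)
    spread[k] = float(grid_preds.max() - grid_preds.min())

print(spread)
```

When K equals the number of training points, every prediction is the training mean, so the line is completely flat regardless of any relationship in the data.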

## Discussion ##

### Experience ###

KNN regression using experience level as the predictor produces relatively
flat predictions across categories, indicating limited association between
experience and playtime. While minor variation is observed across experience
levels, the overall pattern suggests that experience alone does not strongly
influence playtime behavior.

### Age ###

KNN regression results indicate no meaningful relationship between age and
PLAICraft playtime, as reflected by the near-horizontal prediction line.
Although the model's cross-validation RMSPE (about 6.95 hours) and test
RMSPE (about 6.96 hours) are similar, the error magnitude is large relative
to typical playtime values, limiting its practical usefulness. This suggests
that age alone does not provide sufficient explanatory power for predicting
playtime behavior.
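
One way to put an RMSPE in context is to compare it with a baseline that ignores every predictor and always predicts the training mean. The sketch below shows that comparison on made-up values; `y_train_toy`, `y_test_toy`, and the helper `rmse` are hypothetical names, and the numbers are not from this analysis.

```python
import numpy as np

# Hypothetical skewed playtime values (hours); not from players.csv
y_train_toy = np.array([0.0, 0.0, 0.1, 0.2, 0.5, 1.0, 30.3])
y_test_toy = np.array([0.0, 0.1, 0.7, 3.8, 56.1])


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Baseline: ignore every predictor and always predict the training mean
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())
baseline_rmse = rmse(y_test_toy, baseline_pred)

print("baseline RMSPE:", baseline_rmse)
print("median test playtime:", np.median(y_test_toy))
```

If a fitted model's test RMSPE is not clearly below this baseline, the predictor is adding little beyond the overall average.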

### Overall ###

Across all analyses, neither age nor experience demonstrates meaningful
predictive power for explaining playtime behavior. Both visual inspection
of the scatter plots and non-linear KNN modeling indicate that playtime
appears largely independent of the available demographic variables,
suggesting that other behavioral or contextual factors likely play a more
substantial role.