## **Understanding Player Engagement: Modelling Game Playtime Using Age and Experience Level**

## **Introduction**

The Pacific Laboratory for Artificial Intelligence (PLAI) is a research group at the University of British Columbia interested in collecting data on how people play and interact in video games. PLAI has publicly released plaicraft.ai, a free Minecraft server described as “free Minecraft in the cloud”, to collect data for its generative AI research. However, a project of this size demands considerable effort and resources, since the actions of each player are recorded throughout their playthrough. Thus, project lead Frank Wood has offered the undergraduate DSCI100 students an opportunity to help answer one of three questions about the data accumulated so far.

Our group has chosen to address question 2: “We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.” Based on this overarching question, we created our group question of interest: “What experience levels and ages contribute to greater playtime in hours?” Playtime in hours was chosen as the response variable, as we believe it best captures the amount of data players contribute. Thus, predicting higher amounts of playtime from explanatory variables may yield insight into what player characteristics are most significantly associated with larger amounts of data generated. 

To narrow down the overarching inquiry, player age and experience level (Amateur, Beginner, Regular, Pro, Veteran) were chosen as explanatory variables. Age is heavily tied to life stages, influencing factors such as time availability, patterns in cognitive engagement, and gaming habits, which have the potential to affect total playtime. Experience level reflects player commitment and implies dedication towards the game before participation, possibly indicating the amount of time and sessions players are willing to dedicate to gameplay. Thus, we believed these variables would be influential in contributing the most data and subsequent targeted recruitment by narrowing down the scope of age groups and experience levels most likely to have the highest playtime. 

The dataset consists of 196 observations and 9 variables. This includes the columns “experience” (categorical string) representing player Minecraft experience level, “subscribe” (categorical boolean) indicating player subscription to a game-related newsletter, “hashedEmail” (categorical string) as the player’s hashed email, “played_hours” (numerical float) as the number of played hours, “name” (categorical string), “gender” (categorical string), “age” (numerical integer) in years, and two empty fields, “individualId” and “organizationName” (all values missing). Components of the data were sourced from player activity, actions, and interactions on plaicraft.ai, while other aspects such as player names and emails were self-reported. It is not declared whether player age, experience level, and gender were sourced from a self-report measure or player activity.

## **Methods and Results**

In [18]:
### Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [19]:
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


Table 1. The entire players dataset. 

In [20]:
players_filtered = players[["age", "played_hours", "experience"]]
players_filtered

Unnamed: 0,age,played_hours,experience
0,9,30.3,Pro
1,17,3.8,Veteran
2,17,0.0,Veteran
3,21,0.7,Amateur
4,21,0.1,Regular
...,...,...,...
191,17,0.0,Amateur
192,22,0.3,Veteran
193,17,0.0,Amateur
194,17,2.3,Amateur


Table 2. Players filtered to just include the variables we are interested in (age, played_hours, experience).

In [21]:
experience_map = {
    "Amateur": 1,
    "Beginner": 2,
    "Regular": 3,
    "Pro": 4,
    "Veteran": 5
}

players_filtered = players_filtered.assign(
    experience_num = players_filtered['experience'].map(experience_map)
)

players_filtered

Unnamed: 0,age,played_hours,experience,experience_num
0,9,30.3,Pro,4
1,17,3.8,Veteran,5
2,17,0.0,Veteran,5
3,21,0.7,Amateur,1
4,21,0.1,Regular,3
...,...,...,...,...
191,17,0.0,Amateur,1
192,22,0.3,Veteran,5
193,17,0.0,Amateur,1
194,17,2.3,Amateur,1


Table 3. players_filtered with an additional column assigning each experience level a number.

The data was loaded from a URL to make our code reproducible. To quantify experience level, we assigned each level a number ranging from 1 to 5 in increasing order of experience. Amateur was 1, Beginner was 2, Regular was 3, Pro was 4, and Veteran was 5.
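As an alternative sketch of the same encoding, an ordered categorical in pandas can carry the level ordering directly instead of a hand-written dictionary. The `toy` dataframe below is made up for illustration, and the level order is assumed to follow `experience_map` from the cell above.

```python
import pandas as pd

# Level order, lowest to highest, following experience_map in the cell above
levels = ["Amateur", "Beginner", "Regular", "Pro", "Veteran"]

# Hypothetical toy rows, only for illustration
toy = pd.DataFrame({"experience": ["Pro", "Veteran", "Amateur", "Regular", "Beginner"]})

# An ordered categorical keeps the ranking; its codes start at 0, so add 1
ordered = pd.Categorical(toy["experience"], categories=levels, ordered=True)
toy["experience_num"] = ordered.codes + 1

print(toy)
```

With this approach, a level that is misspelled or missing from `levels` gets the code -1 (so 0 after adding 1), which makes bad labels easy to spot.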

In [22]:
#split filtered data into training and testing with 70/30 split using random_state = 2000
training_data, testing_data = train_test_split(
    players_filtered, test_size = 0.3, random_state = 2000
)

We initially split our data into training and testing sets with a 70/30 split. These proportions were chosen because we believed 70% of the data would be enough to train the models while leaving enough data to test them effectively. Initial visualizations and model training were done on the training set, while the final evaluation was done on the testing set. Holding out the testing set meant that the models were evaluated on data they had not seen, so overfitting to the training data would show up as worse test performance and we could assess each model as accurately as possible.
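The arithmetic behind the split sizes can be sketched as follows. This assumes the test count is rounded up and the remaining rows go to training, which is consistent with the 137 training rows reported in the summaries below.

```python
import math

n_rows = 196        # observations reported for the full dataset
test_size = 0.3     # proportion passed to train_test_split

# Assumption: the test count is rounded up and the remainder is used for training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 137 59
```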

In [23]:
training_data.describe() 

Unnamed: 0,age,played_hours,experience_num
count,137.0,137.0,137.0
mean,21.218978,4.830657,2.656934
std,8.907792,25.240958,1.573858
min,8.0,0.0,1.0
25%,17.0,0.0,1.0
50%,18.0,0.1,2.0
75%,23.0,0.6,4.0
max,91.0,223.1,5.0


Table 4. Descriptive statistics for each variable of interest in the training data, across all played hours.

In [24]:
training_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 137 entries, 120 to 182
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             137 non-null    int64  
 1   played_hours    137 non-null    float64
 2   experience      137 non-null    object 
 3   experience_num  137 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 5.4+ KB


Table 5. A structural summary of the training data (column types and non-null counts) to gain a better understanding of the dataframe.

The results showed that in the training dataset, “age” had a mean of 21.2 years (SD = 8.9), “playtime” had a mean of 4.8 hours (SD = 25.2), and “experience level” had a mean of 2.7 (SD = 1.6), all rounded to one decimal place. The median playtime was only 0.1 hours, so the mean is pulled upward by a small number of very high values. We summarized only the training dataset because the testing set is reserved for the final evaluation.

In [25]:
players_filtered_0 = training_data[(training_data["played_hours"] <= 160)]
players_filtered_0.describe()
# With playtimes above 160 hours removed, mean playtime is about 1.9 hours (SD about 7.6)

Unnamed: 0,age,played_hours,experience_num
count,135.0,135.0,135.0
mean,21.266667,1.92963,2.651852
std,8.964441,7.635259,1.584997
min,8.0,0.0,1.0
25%,17.0,0.0,1.0
50%,18.0,0.1,2.0
75%,23.0,0.6,4.0
max,91.0,56.1,5.0


Table 6. A table of descriptive statistics for the variables we are interested in, excluding those with over 160 played hours. 

In [26]:
# Bin playtime for colour-coding; include_lowest=True keeps 0-hour players in the first bin
playtime_labels = ["<1", "1–5", "5–10", "10–15", "15–20", "20–50", "50–100", "100–200", ">200"]

training_data = training_data.assign(
    playtime_grouped=pd.cut(
        training_data["played_hours"],
        bins=[0, 1, 5, 10, 15, 20, 50, 100, 200, 223.1],
        labels=playtime_labels,
        include_lowest=True
    )
)

players_plot = alt.Chart(training_data, title="Figure 1. Playtime in Hours: Player Age and Experience Level").mark_circle(size=40).encode(
    x=alt.X("age:Q").title("Age (years)"),
    y=alt.Y("experience_num:O",
            title="Experience (numbered low to high)",
            scale=alt.Scale(reverse=True)),
    color=alt.Color("playtime_grouped:N", sort=playtime_labels).scale(scheme="paired").title("Playtime (hours, grouped)")
).properties(width=850, height=200)
players_plot

The main purpose of Fig. 1 was to show the general trend in the data. We converted the continuous variable “playtime” into categorical bins by slicing the numeric column into intervals. This allowed the playtime ranges to be color-coded and easily identified. Encoding playtime through opacity alone would be much harder to read, because points overlap and the outliers stretch the scale so that smaller differences become indistinguishable. “Age” was plotted on the x-axis and “experience” on the y-axis, while the grouped playtime was mapped to color.
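One detail worth checking with this kind of binning is how values exactly at the lowest bin edge are treated, since many players have 0 played hours. The sketch below uses made-up playtimes and a shortened set of bins (both hypothetical) to show the difference `include_lowest` makes in `pd.cut`.

```python
import pandas as pd

hours = pd.Series([0.0, 0.5, 3.0, 223.1])  # hypothetical playtimes
bins = [0, 1, 5, 223.1]
labels = ["<1", "1-5", "5>"]

# Default: intervals are (0, 1], (1, 5], (5, 223.1], so 0.0 is left unbinned
default_cut = pd.cut(hours, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0.0 lands in "<1"
closed_cut = pd.cut(hours, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())  # [True, False, False, False]
print(closed_cut.tolist())          # ['<1', '<1', '1-5', '5>']
```

Without `include_lowest=True`, zero-hour players would receive a missing group and would not be assigned one of the playtime colors.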

Since we predicted a continuous variable (playtime in hours) from quantitative variables, we used two regression models and compared their performance to find the better one.  Experience level was an ordinal variable with 5 ordered categories, so we quantified the categories. Age was already a quantitative variable, so we used it as is. After fitting and cross-validation, we tested the models with our testing data and used the RMSPE to determine which was better.

In [27]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline


# Standardize age and experience_num 
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "experience_num"]),
    remainder="drop"
)

players_pipeline = make_pipeline(
    players_preprocessor,
    KNeighborsRegressor()
)


Our first model was k-nearest neighbours (KNN) regression. For each test player, KNN found the k most similar players in the training set (based on age and experience) and predicted the average playtime of those neighbours. We standardized the numeric predictors, then used 5-fold cross-validation on the training set to choose the number of neighbours k that minimized the root mean squared error.
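To make the mechanics concrete, here is a minimal by-hand sketch of a single KNN regression prediction. The five training players and the new player are hypothetical, and k = 3 is chosen only for illustration; the actual analysis uses the scikit-learn pipeline above.

```python
import numpy as np

# Hypothetical training players: columns are age and experience_num
X_train = np.array([[17, 1], [18, 2], [21, 3], [23, 4], [40, 5]], dtype=float)
y_train = np.array([0.0, 0.5, 1.0, 2.0, 10.0])  # played hours
x_new = np.array([20, 3], dtype=float)           # player to predict

# Standardize with the training mean and standard deviation only
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Z_train = (X_train - mu) / sigma
z_new = (x_new - mu) / sigma

# Euclidean distance from the new player to every training player
dists = np.sqrt(((Z_train - z_new) ** 2).sum(axis=1))

# Average the playtime of the k closest players
k = 3
nearest = np.argsort(dists)[:k]
prediction = y_train[nearest].mean()
print(nearest, prediction)
```

Here the three closest players are the 18-, 21-, and 23-year-olds, so the prediction is the mean of their playtimes (about 1.17 hours) and the 40-year-old with 10 hours has no influence.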

In [28]:
# GRID SEARCH (1–75 neighbors, step = 3) with 5-fold CV

param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 76, 3)
}

players_gridsearch = GridSearchCV(
    estimator=players_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=False
)
print(players_gridsearch)
# Fit on training data

players_gridsearch.fit(
    training_data[["age", "experience_num"]],   
    training_data["played_hours"]              
)
print(players_gridsearch)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('standardscaler',
                                                                         StandardScaler(),
                                                                         ['age',
                                                                          'experience_num'])])),
                                       ('kneighborsregressor',
                                        KNeighborsRegressor())]),
             param_grid={'kneighborsregressor__n_neighbors': range(1, 76, 3)},
             scoring='neg_root_mean_squared_error')

We first set up a pipeline that standardized the "age" and "experience_num" features using StandardScaler(), then applied a KNeighborsRegressor() model. Scaling is necessary because KNN is distance-based and would otherwise be dominated by variables with larger ranges. Including scaling inside the pipeline ensured proper preprocessing during both cross-validation and testing.

Next, we converted the cross-validation results from GridSearchCV into a DataFrame and calculated the standard error of the mean test score using the standard deviation across the 5 folds. 
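The standard error calculation can be sketched on its own as follows. The five fold scores are hypothetical (not taken from our results), and the sketch assumes the reported standard deviation is the population standard deviation of the fold scores, which is NumPy's default.

```python
import numpy as np

# Hypothetical RMSE values from 5 folds for a single k (not taken from our results)
fold_rmse = np.array([10.0, 12.0, 30.0, 15.0, 45.0])

mean_rmse = fold_rmse.mean()
std_rmse = fold_rmse.std()                      # NumPy's default population std (ddof=0)
sem_rmse = std_rmse / np.sqrt(len(fold_rmse))   # standard error of the mean

print(round(mean_rmse, 2), round(sem_rmse, 2))
```

A large standard error relative to the differences between candidate values of k is a sign that the folds disagree, which can happen when one fold contains an extreme playtime.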

In [29]:
# Gather CV results 

cv_results = pd.DataFrame(players_gridsearch.cv_results_)

cv_results["sem_test_score"] = cv_results["std_test_score"] / (5**0.5)

cv_results = (
    cv_results[
        ["param_kneighborsregressor__n_neighbors",
         "mean_test_score",
         "sem_test_score"]
    ]
    .rename(columns={
        "param_kneighborsregressor__n_neighbors": "n_neighbors"
    })
)

cv_results["mean_test_score"] = -cv_results["mean_test_score"]

cv_results

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,28.452765,6.107046
1,4,27.058662,5.228597
2,7,22.474854,4.61685
3,10,23.987155,4.516612
4,13,23.67393,4.750189
5,16,22.51523,5.272415
6,19,22.360799,5.416053
7,22,22.070326,5.623141
8,25,21.705234,5.802995
9,28,21.739077,5.856564


Table 7. Values of n_neighbors and their mean cross-validated RMSE (hours) with standard errors. Only the first 10 of the 25 values in the grid are shown.

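For each value of k, the mean_test_score represented the average cross-validated RMSE in hours, where lower values indicated better performance. As k increased from 1 to around 70, the RMSE decreased overall from roughly 28.5 hours to about 21.4 hours, although not at every step (Table 7 shows an increase between k = 7 and k = 10). This suggests that very small k overfit the data, while larger k yielded more stable predictions. The best performance occurred at k = 73 with an RMSE of approximately 21.28 hours. The standard errors in Table 7 (roughly 4.5 to 6 hours) are larger than most of the differences between values of k, so the advantage of k = 73 over other large values is small.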

Next, we asked the fitted GridSearchCV for the hyperparameters that gave the best score. 

In [30]:
players_gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 73}

The grid search selected 73 neighbours as the best value of k based on the 5-fold cross-validation results from the training data. 

Finally, we used the tuned KNN model to generate predicted playtime values and calculated the resulting RMSPE.

In [31]:
# Make prediction
testing_data["predicted"] = players_gridsearch.predict(
    testing_data[["age", "experience_num"]]
)

# Compute RMSPE
RMSPE = mean_squared_error(
    y_true=testing_data["played_hours"],
    y_pred=testing_data["predicted"]
)**0.5

RMSPE

np.float64(34.4122421618046)

The tuned KNN model produced an RMSPE of about 34.4 hours.
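For reference, the RMSPE is the square root of the mean squared difference between observed and predicted values on the testing set. A minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical observed and predicted playtimes (hours) for four test players
y_true = np.array([0.0, 1.0, 3.0, 10.0])
y_pred = np.array([2.0, 2.0, 2.0, 6.0])

# RMSPE: square the errors, average them, then take the square root
rmspe = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmspe)
```

Because the errors are squared before averaging, the result is in hours and is most sensitive to the largest misses.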

In [32]:
#linear regression
X_train = training_data[["age", "experience_num"]]
y_train = training_data["played_hours"]
X_test = testing_data[["age", "experience_num"]]
y_test = testing_data["played_hours"]

lm = LinearRegression()
players_train_fit = lm.fit(X_train, y_train)
print(players_train_fit.coef_)
print(players_train_fit.intercept_)

[-0.08066138 -0.14859094]
6.937005449675615


A multivariable linear regression model was also used to predict playing time based on experience level and age. We used the unstandardized training data for the linear regression model, as the calculations did not rely on Euclidean distance. Keeping the unstandardized data allowed us to keep the units associated with each variable, making it more meaningful when interpreting the coefficients. 
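The sketch below illustrates why skipping standardization is safe for ordinary least squares: rescaling the predictors changes the coefficients' units but not the fitted values. The data are randomly generated and purely hypothetical, and `ols_predict` is a helper written for this illustration rather than part of our analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predictors (age, experience_num) and response, for illustration only
X = np.column_stack([rng.integers(8, 60, size=30), rng.integers(1, 6, size=30)]).astype(float)
y = rng.exponential(scale=5.0, size=30)

def ols_predict(features, target):
    """Fit least squares with an intercept and return the fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

raw_fit = ols_predict(X, y)
scaled_fit = ols_predict((X - X.mean(axis=0)) / X.std(axis=0), y)

# The fitted values agree; only the coefficients' units change
print(np.allclose(raw_fit, scaled_fit))
```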

In [33]:
#finding rmse
train_predictions = players_train_fit.predict(X_train)
lm_rmse = mean_squared_error(y_true=y_train, y_pred=train_predictions)**0.5
print(lm_rmse)

25.137087062308986


In [34]:
#finding rmspe
test_predictions = players_train_fit.predict(X_test)
lm_rmspe = mean_squared_error(y_true=y_test, y_pred=test_predictions)**0.5
print(lm_rmspe)

34.45033724343876


Based on the model, played hours (hrs) = -0.08066 * age (yrs) - 0.1486 * experience level + 6.937. The model had an RMSE of about 25.14 hours on the training data and an RMSPE of about 34.45 hours on the testing data. We expected the RMSPE to be larger than the RMSE because the RMSE is computed on the same data the model was fitted to, while the RMSPE is computed on data the model has not seen. We used the RMSPE to compare the fit of the linear regression against the KNN regression and to decide which model better predicts playtime from age and experience level. The RMSPE for the linear regression (34.45 hours) was higher than that of the KNN regression (34.41 hours), making the KNN regression our model of choice. However, the two values differ by only about 0.04 hours, so the difference is negligible in practice. Since we could not easily visualize the two-predictor models themselves, we plotted the data that was used to create and test them.
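As a worked example of reading the fitted equation, the sketch below plugs in one hypothetical player. The coefficients and intercept are copied from the printed output above; `predict_hours` is a helper name introduced only for this illustration.

```python
# Coefficients and intercept as printed by the fitted LinearRegression above
coef_age = -0.08066138
coef_experience = -0.14859094
intercept = 6.937005449675615

def predict_hours(age, experience_num):
    """Predicted playtime in hours from the fitted linear equation."""
    return coef_age * age + coef_experience * experience_num + intercept

# Hypothetical player: 17 years old, Veteran (experience_num = 5)
example = predict_hours(17, 5)
print(round(example, 2))  # 4.82
```

Each additional year of age lowers the prediction by about 0.08 hours and each additional experience level by about 0.15 hours, both small compared with an RMSPE of roughly 34 hours.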

**Visualizations**

Three different visualizations that illustrate the relationship between player age, experience, and playtime were plotted twice: one set represents all the values of the dataframe, while the other uses the testing set with outliers in playtime and age excluded for improved interpretability. Side-by-side visualizations were used for ease of comparison between the variables “age” and “experience level” in relation to the response variable, as well as to illustrate the relation between the two explanatory variables. Each plot was given an appropriate title, axis labels, and opacity for readability, while the widths and heights were adjusted to decrease overlap. The overlap of points was further reduced by excluding the outliers that stretched the plots’ axes.

In [35]:
#age_prediction_grid = players_filtered[["age"]].agg(["min", "max"])
#age_prediction_grid["predicted"] = lm.predict(age_prediction_grid)

age_scatter1 = alt.Chart(players_filtered, title="Player age and playtime").mark_point(opacity=0.5).encode(
    x=alt.X("age").scale(zero=False).title("Age (years)"),
    y=alt.Y("played_hours").title("Playtime (hours)")
).properties(width=250, height=500)

experience_scatter1 = alt.Chart(players_filtered, title="Player experience level and playtime").mark_point(opacity=0.5).encode(
    x=alt.X("experience_num:O").scale(zero=False).title("Experience level (lowest to highest)"),
    y=alt.Y("played_hours").title("Playtime (hours)")
).properties(width=250, height=500)

age_exp_scatter = alt.Chart(players_filtered, title="Player age and experience level").mark_point(opacity=0.5).encode(
    x=alt.X("experience_num:O").scale(zero=False).title("Experience level (lowest to highest)"),
    y=alt.Y("age").title("Age (years)")
).properties(width=250, height=500)

display(experience_scatter1 | age_scatter1 | age_exp_scatter)

Figure 2. Three side-by-side scatterplots of all the data, showing playtime vs. experience level, playtime vs. age, and age vs. experience level. For experience level, 1 is Amateur, 2 is Beginner, 3 is Regular, 4 is Pro, and 5 is Veteran.

In [36]:
# Keep testing-set players with at most 5 played hours and an age of at most 60
players_filtered_1 = testing_data[(testing_data["played_hours"] <= 5) & (testing_data["age"] <= 60)]

age_scatter2 = alt.Chart(players_filtered_1, title="Player age and playtime, excluding outliers").mark_point(opacity=0.5).encode(
    x=alt.X("age").scale(zero=False).title("Age (years)"),
    y=alt.Y("played_hours").title("Playtime (hours)")
).properties(width=250, height=500)


experience_scatter2 = alt.Chart(players_filtered_1, title="Player experience level and playtime, excluding outliers").mark_point(opacity=0.5).encode(
    x=alt.X("experience_num:O").scale(zero=False).title("Experience level (lowest to highest)"),
    y=alt.Y("played_hours").title("Playtime (hours)")
).properties(width=250, height=500)

age_exp_scatter = alt.Chart(players_filtered_1, title="Player age and experience level, excluding outliers").mark_point(opacity=0.5).encode(
    x=alt.X("experience_num:O").scale(zero=False).title("Experience level (lowest to highest)"),
    y=alt.Y("age").title("Age (years)")
).properties(width=250, height=500)

display(experience_scatter2 | age_scatter2 | age_exp_scatter)

Figure 3. Three side-by-side scatterplots of testing-set players with at most 5 played hours and an age of at most 60, showing playtime vs. experience level, playtime vs. age, and age vs. experience level. For experience level, 1 is Amateur, 2 is Beginner, 3 is Regular, 4 is Pro, and 5 is Veteran.

## **Discussion**

Our linear regression model returned played hours (hrs) = -0.08066 * age (yrs) - 0.1486 * experience level + 6.937.
We found a negative relationship between played hours and age, as well as a negative relationship between played hours and experience level. However, the RMSPE for the model was 34.45 hours, which is high relative to typical playtimes, so the model predicts poorly and the relationship is weak. Our KNN regression model had an RMSPE of 34.41 hours, making it a marginally better fit than the linear regression model, but this is still quite high. The high RMSPE values were likely driven by outliers in hours played: the vast majority of players had a playtime of less than 5 hours (the 75th percentile in the training set was 0.6 hours), but a few had over 100 played hours (the maximum was 223.1 hours).
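The effect of a single heavy player on the error can be sketched with made-up numbers. The playtimes and the constant 5-hour prediction below are hypothetical and chosen only to mimic the shape of our data (many near-zero values and one very large one).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Hypothetical test set: nine low-playtime players and one very heavy player
y_true = np.array([0.0, 0.0, 0.1, 0.1, 0.3, 0.6, 1.0, 2.0, 4.0, 200.0])
y_pred = np.full(y_true.shape, 5.0)   # a model that predicts about 5 hours for everyone

with_outlier = rmse(y_true, y_pred)
without_outlier = rmse(y_true[:-1], y_pred[:-1])
print(round(with_outlier, 1), round(without_outlier, 1))
```

In this toy case the error with the heavy player included is more than ten times the error without them, which is the same pattern we suspect in our RMSPE values.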

We expected the negative relationship between hours played and age because we thought that video games tend to have a younger audience. However, we were surprised by the negative relationship between hours played and experience level because we thought that better players would play more. 
These findings suggest that basic demographic and self-reported experience information can provide a rough indication of expected engagement, but cannot fully predict individual playtime. For game designers, this means that age and experience could be used as part of a personalization strategy (for example, setting default difficulty or tutorial length), but they should not be relied on for high-stakes decisions about players. 

These results lead to several future questions:

* Are different groups of players (such as beginners vs. experts) motivated differently to play, and do factors like motivation or social interaction have a stronger impact on engagement than age or experience?
* Do session-based factors like number of sessions or session length predict playtime better than demographics?
* Would adding categorical variables like gender or organization improve prediction performance?
* How much do social features (playing with friends, in-game communities) increase playtime?


## **References**

Smith, Andrew. “Plaicraft.ai Launch - Pacific Laboratory for Artificial Intelligence.” Pacific Laboratory for Artificial Intelligence, 28 Sept. 2023, plai.cs.ubc.ca/2023/09/27/plaicraft/.