In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


# Data Science Project: Can Age and Played Hours predict if a Player is Subscribed?

by Millie Sohn, Markus Chu, Mhad Khan Sherwani and Sai Manas Pandrangi

## Introduction:

The gaming industry is expanding every year, and efforts are being made to study players' actions for more targeted recruitment. To achieve this, researchers must predict which types of players are most likely to contribute to a game, determine which player characteristics and behaviours best predict subscribing to a game-related newsletter, and understand how these features differ between player types. This project asks a specific question: do played hours and player age predict whether a player will subscribe to the Minecraft newsletter? To answer it, we used data from a Minecraft server that records player information and whether each player subscribes to the newsletter. The dataset, named "players", lists players and their data (7 variables, 196 observations): experience level (amateur, beginner, pro, regular, veteran), subscription status (TRUE: subscribed to the newsletter, FALSE: not subscribed), a hashed email that uniquely identifies each player, hours played, player name, player gender, and player age.
The dataset reveals that 144 players are subscribed and 52 are not. Played hours range from 0 to 223.1 hours, and ages range from 8 to 50 years.


## Methods & Results:

We answer our question using K-nearest neighbors (KNN) classification, tuning the model to obtain the highest prediction accuracy.

The predictors are strictly numerical: the total time the player spent playing (hours) and the player's age (years). The categorical variable we predict is "subscribe" (TRUE: subscribed, FALSE: not subscribed).

Summary of the wrangled dataset:
- Players subscribed: 144; not subscribed: 52
- Played hours range: 0 to 223.1 hours
- Age range: 8 to 50 years. The most common age is 17 (75 players).

The players dataset was wrangled to address some potential issues, including the removal of NA values. The dataset also contains outliers: some players are much older or younger than the mean age (20.5 years), and some played far more or fewer hours than the mean (5.9 hours). However, these outliers were kept to preserve the sample size and retain information for the model to train on.

In [2]:
set.seed(1)

url_pl <- "https://raw.githubusercontent.com/takemil8088/ind-porject/refs/heads/main/players.csv"
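# Keep only the outcome and the two predictors, drop rows with missing values,
# and convert the outcome to a factor for classification.
players <- read_csv(url_pl) |>
    select(subscribe, played_hours, Age) |>
    filter(!is.na(subscribe), !is.na(played_hours), !is.na(Age)) |>
    mutate(subscribe = as_factor(subscribe)) |>
    rename(age = Age)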

show_players <- players |> head(5)
show_players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| subscribe | played_hours | age |
|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` |
| True | 30.3 | 9 |
| True | 3.8 | 17 |
| False | 0.0 | 17 |
| True | 0.7 | 21 |
| True | 0.1 | 21 |


Visualizations reveal key information about how each predictor relates to subscription status.

In [None]:
relation_played_hours_age <- players |>
    select(age, played_hours, subscribe) |>
    ggplot(aes(x = age, y = played_hours, colour = subscribe)) +
    geom_point() +
    labs(x = "Age (Years)",
         y = "Playing Time (Hours)",
         colour = "Subscription",
         title = "Graph 1: Playing Time vs. Age") +
    theme(text = element_text(size = 15))

relation_played_hours_age

age_subscribe_plot <- players |>
    select(age, subscribe) |>
    ggplot(aes(x = age, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Age (Years)",
         y = "Ratio of Subscribed and Not Subscribed (0-1)",
         fill = "Subscribed",
         title = "Graph 2: Relationship of Age and Game Subscription") +
    theme(text = element_text(size = 15))

age_subscribe_plot

Graph 1 shows that players with more playing hours tend to subscribe to the newsletter, and most of them are roughly 15 to 25 years old. Players who do not subscribe have fewer played hours (~0-2 hours) and are all aged 17 and up.

Graph 2 reveals that most players aged ~10 to 30 subscribe to the newsletter. Older players, 30 and up, tend not to subscribe.

#### KNN Classification Model:
To train our KNN model and then evaluate its accuracy on data it has not seen, we split our data into 75% for training and 25% for testing, stratified by subscription status.

In [None]:
split <- initial_split(players, prop = 0.75, strata = subscribe)
train <- training(split)
test <- testing(split)
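
The idea behind a stratified 75/25 split can be sketched in base R. This is only an illustration of the concept, not a reproduction of how `initial_split()` works internally: the `toy` data frame and its values are made up, and the names `train_idx`, `toy_train` and `toy_test` are hypothetical.

```r
# Hypothetical toy data standing in for `players` (all values are made up).
set.seed(1)
toy <- data.frame(
  subscribe = factor(rep(c(TRUE, FALSE), times = c(30, 10))),
  played_hours = runif(40, min = 0, max = 10),
  age = sample(10:40, size = 40, replace = TRUE)
)

# Stratification: sample 75% of the row indices separately within each class,
# so both pieces keep roughly the same subscribed / not-subscribed ratio.
rows_by_class <- split(seq_len(nrow(toy)), toy$subscribe)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Class proportions in each piece
prop.table(table(toy_train$subscribe))
prop.table(table(toy_test$subscribe))
```

Sampling within each class matters here because non-subscribers are the minority; an unstratified split could leave very few of them in the test set by chance.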

To choose the K that maximizes accuracy, we performed 10-fold cross-validation on the training set. The k_vals data frame contains the candidate values of K, from 1 to 100 in steps of 1. A step size of 1 gives the finest possible search grid, allowing a more precise evaluation of the model's performance.

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

recipe <- recipe(subscribe ~ played_hours + age, data = train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 1))

p_vfold <- vfold_cv(train, v = 10, strata = subscribe)

knn_results <- workflow() |>
  add_recipe(recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = p_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies
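
To make the tuning step more concrete, here is a minimal base-R sketch of what cross-validating K involves: an unweighted majority-vote KNN, evaluated fold by fold for a few candidate values of K. It is an assumption-laden illustration rather than what `tune_grid()` does internally: the data are simulated, the folds are not stratified, standardization is applied once to all rows for brevity, and `knn_predict` and `cv_accuracy` are hypothetical helper names.

```r
# Unweighted (majority-vote) KNN on numeric matrices.
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(row) {
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))   # Euclidean distance to every training row
    votes <- table(train_y[order(d)[seq_len(k)]])  # class counts among the k nearest rows
    names(votes)[which.max(votes)]                 # majority class (ties go to the first level)
  })
}

# Simulated two-class data (made up for illustration).
set.seed(1)
n <- 60
x <- cbind(played_hours = c(rnorm(30, mean = 0), rnorm(30, mean = 3)),
           age          = c(rnorm(30, mean = 0), rnorm(30, mean = 3)))
y <- factor(rep(c("FALSE", "TRUE"), each = 30))
x <- scale(x)  # center and scale both predictors

# Assign every row to one of 5 folds at random.
folds <- sample(rep(1:5, length.out = n))

# Mean accuracy and its standard error across the folds for one value of k.
cv_accuracy <- function(k) {
  fold_acc <- sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn_predict(x[!held_out, , drop = FALSE], y[!held_out],
                        x[held_out, , drop = FALSE], k)
    mean(pred == y[held_out])
  })
  c(neighbors = k, mean = mean(fold_acc),
    std_err = sd(fold_acc) / sqrt(length(fold_acc)))
}

results <- as.data.frame(t(sapply(c(1, 5, 15), cv_accuracy)))
best <- results$neighbors[which.max(results$mean)]
results
best
```

Each candidate K is scored only on rows that were held out of training, which is why the cross-validated mean is a fairer guide for choosing K than accuracy on the training rows themselves.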

In [4]:
best_k_plot <- accuracies |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    xlab("Number of K") +
    ylab("Accuracy") +
    ggtitle("Graph 3: Plot of estimated Accuracy versus the Number of Neighbors") +
    theme(text = element_text(size = 15))

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)

best_k_plot
best_k

ERROR: Error in eval(expr, envir, enclos): object 'accuracies' not found


The best K was identified by plotting estimated accuracy against the number of neighbors and by extracting the value with the highest mean accuracy. In this case, Graph 3 shows that K = 25 is optimal.

We also compared cross-validation with different numbers of folds. With K fixed at the best value, 10-fold cross-validation resulted in higher accuracy and a lower standard error than 5-fold cross-validation. Thus, 10-fold cross-validation gives a less variable and more reliable estimate of the model's accuracy.

In [None]:
knn_spec_new <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

vfold_5 <- vfold_cv(train, v = 5, strata = subscribe)

vfold_5_fit <- workflow() |>
    add_recipe(recipe) |>
    add_model(knn_spec_new) |>
    fit_resamples(resamples = vfold_5)

vfold_5_metrics <- vfold_5_fit |>
    collect_metrics() |>
    filter(.metric == "accuracy")

vfold_5_metrics

vfold_10_fit <- workflow() |>
    add_recipe(recipe) |>
    add_model(knn_spec_new) |>
    fit_resamples(resamples = p_vfold)

vfold_10_metrics <- vfold_10_fit |>
    collect_metrics() |>
    filter(.metric == "accuracy")

vfold_10_metrics

The quality metrics for the final tuned model were computed on the test set: accuracy was 73.5%, precision was also 73.5%, and recall was 100%. These metrics look relatively high, especially the recall of 100%, which shows that the model identifies every subscribed player. However, the lower accuracy and precision show that the model misclassifies non-subscribers.

In [None]:
knn_fit <- workflow() |>
    add_recipe(recipe) |>
    add_model(knn_spec_new) |>
    fit(data = train)

p_predictions <- predict(knn_fit, test) |>
  bind_cols(test)

prediction_accuracy <- p_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

prediction_precision <- p_predictions|>
    precision(truth = subscribe, estimate = .pred_class, event_level = "second")

prediction_recall <- p_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "second")

prediction_accuracy
prediction_precision
prediction_recall

The confusion matrix shows that 0 FALSE observations (not subscribed) were correctly classified as FALSE, and 36 TRUE observations (subscribed) were correctly classified as TRUE. However, the model misclassified 13 FALSE observations as TRUE.

In [None]:
confusion <- p_predictions |>
             conf_mat(truth = subscribe, estimate = .pred_class)
confusion
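
As a cross-check, the reported metrics can be recomputed by hand from the confusion-matrix counts given in the text, treating TRUE (subscribed) as the positive class. The count of subscribers predicted as FALSE is not stated directly; it is taken to be 0 here because the reported recall is 100%. The variable names are illustrative.

```r
# Test-set counts from the text (TRUE = subscribed is the positive class).
tp <- 36  # subscribed, predicted subscribed
fp <- 13  # not subscribed, predicted subscribed
tn <- 0   # not subscribed, predicted not subscribed
fn <- 0   # subscribed, predicted not subscribed (assumed from the 100% recall)

accuracy  <- (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that are correct
precision <- tp / (tp + fp)                   # share of predicted subscribers who really subscribe
recall    <- tp / (tp + fn)                   # share of real subscribers who are found

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy = 0.735, precision = 0.735, recall = 1
```

Accuracy and precision coincide because every observation is predicted TRUE, so both reduce to 36 / 49.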

Plotting the classifier's decision regions over the data shows that the classifier always predicts the subscription status "TRUE". In Graph 4 below, the model predicts that every player will subscribe to the newsletter.

In [2]:
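# Prepare the scaling recipe on the training data (the data knn_fit was fitted on),
# so the plotted coordinates match the standardization used by the model.
recipes <- recipe(subscribe ~ played_hours + age, data = train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()

scaled_players <- bake(recipes, players)

# Build the prediction grid in the original units: the knn_fit workflow
# applies its own scaling, so it must receive unscaled values.
are_grid <- seq(min(players$age),
                max(players$age),
                length.out = 100)
smo_grid <- seq(min(players$played_hours),
                max(players$played_hours),
                length.out = 100)
asgrid <- as_tibble(expand.grid(age = are_grid,
                                played_hours = smo_grid))

knnPredGrid <- predict(knn_fit, asgrid)

# Standardize the grid only for plotting on the same axes as scaled_players.
prediction_table <- bind_cols(knnPredGrid, bake(recipes, asgrid)) |>
  rename(subscribe = .pred_class)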

wkflw_plot <-
  ggplot() +
  geom_point(data = scaled_players,
             mapping = aes(x = age,
                           y = played_hours,
                           color = subscribe),
             alpha = 0.75) +
  geom_point(data = prediction_table,
             mapping = aes(x = age,
                           y = played_hours,
                           color = subscribe),
             alpha = 0.02,
             size = 5) +
  labs(color = "Subscription status (T/F)",
       x = "Standardized Player age (years)",
       y = "Standardized Game play time (in hours)", 
       title = "Graph 4: Data with background color indicating the decision of classifier") +
  scale_color_manual(values = c("darkorange", "steelblue")) +
  theme(text = element_text(size = 12))

wkflw_plot

ERROR: Error in prep(step_center(step_scale(recipe(subscribe ~ played_hours + : could not find function "prep"


## Discussion:

Data analysis using our K-nearest neighbors (KNN) classifier found that hours played and player age cannot accurately predict whether a player will subscribe to the newsletter. On the test data, despite seemingly high accuracy and precision and a recall of 100%, the model classified all players as likely to subscribe, regardless of their actual subscription status. This shows that while the model identifies every subscribed player (hence the perfect recall), it cannot differentiate between subscribers and non-subscribers, which limits its predictive power.

This outcome was not expected. Graph 1 shows that players with higher playing time tend to subscribe, and Graph 2 shows that older players are less likely to subscribe. This suggested that played hours and age would be impactful predictors, but they were not. There seemed to be a relationship between the variables and subscription, but the KNN model's overclassification of players as subscribed indicates otherwise. This bias could be due to class imbalance: the dataset has more subscribed players (144) than non-subscribers (52), which leaves non-subscribers underrepresented and causes the model to lean toward the majority class.
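
One way to put the class imbalance in perspective is to compute the accuracy of a trivial rule that labels every player as subscribed. The short sketch below uses only the class counts reported for the full dataset; the variable names are illustrative.

```r
# Class counts reported for the full dataset.
n_subscribed <- 144
n_not_subscribed <- 52

# Accuracy of always predicting the majority class.
baseline_accuracy <- max(n_subscribed, n_not_subscribed) /
  (n_subscribed + n_not_subscribed)

round(baseline_accuracy, 3)
# 0.735
```

This baseline of about 73.5% is the same value as the reported test accuracy (36 / 49), which is what one would expect from a classifier that predicts TRUE for every observation.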

These findings show that played hours and age are not reliable for predicting subscription. For companies behind games like Minecraft, this suggests that marketers should look at other possible predictors, such as in-game preferences. Despite its perfect recall, the model is biased towards the majority class (subscribed players, who outnumber non-subscribers). Overall, this shows that more diverse and balanced data are necessary to build more accurate models of player subscription behaviour.

Besides asking which other factors can help explain player subscription, it is important to ask whether the dataset itself is the problem. Played hours and age might have been reliable indicators with stronger data. Future research could explore the impact of minimizing the bias in our model, and whether alternative datasets would change the outcome. Additionally, exploring more player characteristics, such as spending habits and social engagement, could contribute to predictions, as incorporating more predictors could further improve the model's ability to distinguish subscribers from non-subscribers. Another question worth exploring is whether these findings are specific to Minecraft or hold for other video games.