**Predicting Newsletter Subscription Among Minecraft Server Players Using Gameplay Activity**

**Background:**
Minecraft is a popular sandbox video game that allows players to explore, build, and interact within a block-based virtual world. Because it attracts a wide range of players, researchers use it as a platform for studying player behavior.

In this project, a research team at UBC has deployed a custom Minecraft server to collect detailed data on player activity. The server logs each player's sessions. Using variables from this behavioral dataset, such as player age and hours played, we explore whether these characteristics predict whether a player subscribes to a newsletter related to the game. By analyzing these patterns, we aim to build a predictive model that can help identify the most engaged users.

**Question: Is a player’s age or played hours more predictive of whether they subscribe to the game-related newsletter?**

**Data Description:**

In [1]:
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(forcats)
library(repr)
library(themis)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [2]:
players <- read_csv("players.csv")

players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,17
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,



The players dataset used for this analysis consists of 196 observations and 7 variables. It contains information on players of a Minecraft research server, including demographic details, gameplay statistics, and subscription status.

The key variables are summarized below:

- Variable 1 (experience): character. The player's experience level in Minecraft.
- Variable 2 (subscribe): logical. Whether the player is subscribed to the newsletter (TRUE is subscribed, FALSE is not subscribed).
- Variable 3 (hashedEmail): character. The player's email address, hashed to protect privacy.
- Variable 4 (played_hours): numeric (decimal number). The number of hours the player has played.
- Variable 5 (name): character. The name of the player.
- Variable 6 (gender): character. The gender of the player.
- Variable 7 (Age): numeric (decimal number). The age of the player in years.

The data includes both numeric and categorical variables. The response variable for this analysis is subscribe, which indicates whether a player subscribed to the game-related newsletter.

Some considerations about the data:

- The dataset contains missing values (for example, the last row shown above has no Age) and may contain inconsistent values; these will be addressed during data cleaning.
- Certain variables, such as hashedEmail and name, serve as identifiers and will not be used as predictors.
- The data were collected through player activity logs and subscription records from the Minecraft research server.
- Potential limitations include the small sample size and possible self-reporting bias in demographic variables such as gender and age.
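A quick way to see how much data is missing is to count the NA values in each column. The sketch below uses a small made-up data frame (`toy_players` is a hypothetical stand-in, not the real data) and base R only, so it runs without extra packages; the same `colSums(is.na(...))` call could be applied to `players`.

```r
# Hypothetical toy data that reuses three of the players column names
toy_players <- data.frame(
  played_hours = c(30.3, 3.8, 0.0, 0.2),
  Age = c(9, 17, 17, NA),
  subscribe = c(TRUE, TRUE, FALSE, TRUE)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy_players))
print(na_counts)
```

Columns with a nonzero count are the ones that need attention during cleaning.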

**Methods Description**

In [3]:
players_clean <- players |>
  select(Age, played_hours, subscribe) |>
  # drop players with no recorded playtime
  filter(played_hours != 0) |>
  # drop rows with a missing age; is.na() is the reliable test for NA
  filter(!is.na(Age)) |>
  mutate(subscribe = as.factor(subscribe))
players_clean


Age,played_hours,subscribe
<dbl>,<dbl>,<fct>
9,30.3,TRUE
17,3.8,TRUE
21,0.7,TRUE
⋮,⋮,⋮
44,0.1,TRUE
22,0.3,FALSE
17,2.3,FALSE


Here, I keep only the three variables I need to answer my question. Then, I filter out rows where Age is missing or played_hours is 0, since these values are not useful for this analysis. Lastly, I convert subscribe to a factor so that it can be used as the response variable in a classification model.

In [4]:
# Count subscribers and non-subscribers separately
subscribe_true <- players_clean |>
  filter(subscribe == "TRUE") |>
  count(subscribe)
subscribe_true

subscribe_false <- players_clean |>
  filter(subscribe == "FALSE") |>
  count(subscribe)
subscribe_false

subscribe,n
<fct>,<int>
True,84


subscribe,n
<fct>,<int>
False,25


I first count the observations in each of the two classes to check for class imbalance. Subscribers (84) outnumber non-subscribers (25) by more than three to one, so I will oversample the subscribe = FALSE class in my recipe to balance the two.
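The size of the imbalance can also be expressed as class proportions. Here is a minimal sketch in base R that rebuilds the response from the counts printed above (84 TRUE and 25 FALSE); `subscribe_toy` is a stand-in name, not an object from the analysis.

```r
# Rebuild a factor with the class counts reported above
subscribe_toy <- factor(c(rep("TRUE", 84), rep("FALSE", 25)))

# Counts and proportions per class
class_counts <- table(subscribe_toy)
class_props <- round(prop.table(class_counts), 3)

print(class_counts)
print(class_props)
```

With these counts, roughly 77% of the remaining players are subscribers, so a model that always predicted TRUE would already look fairly accurate.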

In [12]:
# Upsample the minority class until both classes are the same size (over_ratio = 1)
players_recipe <- recipe(subscribe ~ ., data = players_clean) |>
  step_upsample(subscribe, over_ratio = 1, skip = FALSE) |>
  prep()
players_clean <- bake(players_recipe, players_clean)
players_clean |>
  group_by(subscribe) |>
  summarize(n = n())

subscribe,n
<fct>,<int>
False,84
True,84


Here, I rebalanced the response variable so that the class imbalance in the data set does not affect my analysis later on.
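Conceptually, upsampling with a ratio of 1 resamples the minority class with replacement until it is as large as the majority class. The sketch below illustrates that idea in base R on made-up data. It shows the concept only and is not a description of how `step_upsample()` works internally; `toy`, `minority_rows`, and `extra_rows` are hypothetical names.

```r
set.seed(1)  # resampling is random, so fix the seed for reproducibility

# Hypothetical data with the same class counts as the cleaned players data
toy <- data.frame(
  played_hours = runif(109, min = 0.1, max = 50),
  subscribe = factor(c(rep("TRUE", 84), rep("FALSE", 25)))
)

counts <- table(toy$subscribe)
minority <- names(counts)[which.min(counts)]
n_extra <- max(counts) - min(counts)  # rows needed to match the majority class

# Resample minority-class row indices with replacement
minority_rows <- which(toy$subscribe == minority)
extra_rows <- sample(minority_rows, size = n_extra, replace = TRUE)

# Append the resampled rows to the original data
toy_balanced <- rbind(toy, toy[extra_rows, ])
print(table(toy_balanced$subscribe))
```

Because the added rows are copies of existing non-subscribers, the balanced data contains no new information; it only changes how much weight each class carries.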