In [4]:
library(tidyverse)
library(repr)
library(infer)
library(janitor)
library(tidymodels)
library(cowplot)
library(rsample)
library(recipes)
source('cleanup.R')


Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.1.2
✔ modeldata    1.4.0     ✔ workflows    1.1.4
✔ parsnip      1.2.1     ✔ workflowsets 1.0.1
✔ recipes      1.1.0     ✔ yardstick    1.3.1

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr

# Predicting Newsletter Subscription on a UBC Minecraft Server

## Introduction

### Background  
Frank Wood’s research group at UBC runs a Minecraft server to study player behavior.  
The group wants to aim newsletter recruitment at the players most likely to subscribe and to make sure the server has enough resources for active players.

### Research Question  
**Can a player’s total play time (`total_minutes`) and average session length (`mean_session`) predict whether they subscribe to the game’s newsletter?**

- **Response variable:** `newsletter_subscribed` (factor: “no” / “yes”)  
- **Explanatory variables:**  
  - `total_minutes` (numeric; derived from `played_hours`)  
  - `mean_session` (numeric; derived from session timestamps)

In [3]:
players  <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

print("Players Data")
head(players)
dim(players)

print("Sessions Data")
head(sessions)
dim(sessions)


Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


[1] "Players Data"


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


[1] "Sessions Data"


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


## Data Description

**players.csv** (196 × 7)  
- `hashedEmail` (character): player identifier  
- `played_hours` (numeric): total play time in hours  
- `subscribe` (logical): whether the player subscribed to the newsletter  
  - Values: `TRUE` / `FALSE`
- `experience`, `name`, `gender` (character) and `Age` (numeric): not used as predictors here  

**sessions.csv** (1535 × 5)  
- `hashedEmail` (character): player identifier  
- `start_time`, `end_time` (character): session timestamps in `dd/mm/yyyy hh:mm` format  
- `original_start_time`, `original_end_time` (numeric): not used here  

**Notes & potential issues**  
- Timestamps are stored as text and require parsing; the time zone is unknown.  
- Players with no recorded sessions have no `mean_session` value, so they will be dropped before modeling.  
- We’ll convert `subscribe` into a factor with levels “no” and “yes.”
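
To make the time zone caveat concrete, here is a minimal sketch using the first row of `sessions.csv`. `dmy_hm()` assumes UTC unless `tz` is supplied; the `"America/Vancouver"` zone below is purely an assumption for illustration, since the data does not state a zone. For this row the session length is the same either way, although a session that spans a daylight-saving change could differ between the two.

```r
library(lubridate)

# Sample values copied from the first row of sessions.csv
start_raw <- "30/06/2024 18:12"
end_raw   <- "30/06/2024 18:24"

# Default behaviour: timestamps are interpreted as UTC
start_utc <- dmy_hm(start_raw)
end_utc   <- dmy_hm(end_raw)

# Assumption for illustration only: the server clock runs on Vancouver time
start_local <- dmy_hm(start_raw, tz = "America/Vancouver")
end_local   <- dmy_hm(end_raw, tz = "America/Vancouver")

# Session length in minutes under each interpretation
dur_utc   <- as.numeric(difftime(end_utc, start_utc, units = "mins"))
dur_local <- as.numeric(difftime(end_local, start_local, units = "mins"))
```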

In [5]:
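# Parse the dd/mm/yyyy hh:mm strings (dmy_hm() assumes UTC) and
# compute each session's length in minutes
sessions_parsed <- sessions |>
    mutate(
        start_ts = dmy_hm(start_time),
        end_ts = dmy_hm(end_time),
        session_minutes = as.numeric(difftime(end_ts, start_ts, units = "mins"))
    )
head(sessions_parsed$session_minutes)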
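
Before aggregating, it may be worth checking how many sessions failed to parse or have an end time earlier than the start time. The sketch below runs that check on a small made-up table; `toy_sessions`, `toy_checked`, and `quality` are hypothetical names, and the rows are invented to trigger each case rather than taken from the real data.

```r
library(dplyr)
library(lubridate)

# Made-up rows: one valid session, one unparsable start, one end-before-start
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "not a date", "25/07/2024 03:58"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 03:22")
)

# Same parsing steps as above; quiet = TRUE suppresses the parse-failure warning
toy_checked <- toy_sessions |>
    mutate(
        start_ts = dmy_hm(start_time, quiet = TRUE),
        end_ts = dmy_hm(end_time, quiet = TRUE),
        session_minutes = as.numeric(difftime(end_ts, start_ts, units = "mins"))
    )

# Count sessions with a missing duration and sessions with a negative duration
quality <- toy_checked |>
    summarize(
        n_unparsed = sum(is.na(session_minutes)),
        n_negative = sum(session_minutes < 0, na.rm = TRUE)
    )
```

Running the same `summarize()` on `sessions_parsed` would show whether `mean(session_minutes, na.rm = TRUE)` is silently skipping any rows.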

In [6]:
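# One row per player: average session length and number of recorded sessions
# (n() counts every row, including sessions whose duration is NA)
session_agg <- sessions_parsed |>
    group_by(hashedEmail) |>
    summarize(
        mean_session = mean(session_minutes, na.rm = TRUE),
        n_sessions = n()
    )
head(session_agg)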

hashedEmail,mean_session,n_sessions
<chr>,<dbl>,<int>
0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,53.0,2
060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967,30.0,1
0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387,11.0,1
0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,32.15385,13
0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab9c1ff1a0e7ca200b3a,35.0,2
11006065e9412650e99eea4a4aaaf0399bc338006f85e80cc82d18b49f0e2aa4,10.0,1


In [7]:
# Standardize column names, convert hours to minutes, and recode the
# logical subscribe flag as a factor with levels "no" / "yes"
players_clean <- players |>
    clean_names() |>
    rename(player_id = hashed_email) |>
    mutate(
        total_minutes = played_hours * 60,
        newsletter_subscribed = factor(subscribe, levels = c(FALSE, TRUE), labels = c("no", "yes"))
    ) |>
    select(player_id, total_minutes, newsletter_subscribed)

head(players_clean)

player_id,total_minutes,newsletter_subscribed
<chr>,<dbl>,<fct>
f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,1818,yes
f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,228,yes
b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0,no
23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,42,yes
7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,6,yes
f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0,yes
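
The two tables now use different key names (`player_id` in `players_clean`, `hashedEmail` in `session_agg`), so combining them needs an explicit `by` mapping. The sketch below shows one way to do that with `inner_join()`, which also drops players who have no recorded sessions. The `toy_*` tables and their short IDs are hypothetical stand-ins, not the real data.

```r
library(dplyr)

# Hypothetical stand-in for players_clean
toy_players <- tibble(
    player_id = c("a1", "b2", "c3"),
    total_minutes = c(1818, 228, 0),
    newsletter_subscribed = factor(c("yes", "yes", "no"), levels = c("no", "yes"))
)

# Hypothetical stand-in for session_agg; player "c3" has no sessions
toy_session_agg <- tibble(
    hashedEmail = c("a1", "b2"),
    mean_session = c(53, 30),
    n_sessions = c(2L, 1L)
)

# inner_join keeps only players present in both tables
toy_model_data <- toy_players |>
    inner_join(toy_session_agg, by = c("player_id" = "hashedEmail"))
```

Using `left_join()` instead would keep every player and leave `mean_session` as `NA` for those without sessions, which would then need handling before modeling.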
