# Predicting Newsletter Subscription Based on Player Behaviour
## Introduction
**Background**: UBC's Pacific Laboratory for Artificial Intelligence (PLAI) research group runs a Minecraft server called PLAICraft to study player behaviour. The group wants to know which player traits and behaviours are associated with subscribing to a newsletter.

**Research Question**: Can we predict whether a player will subscribe to a newsletter based on their demographics and gameplay behaviour?

In [4]:
# Load libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [24]:
# Load data
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Data Cleaning

In [38]:
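# Add a session duration column (in minutes).
# original_start_time and original_end_time are treated as milliseconds since the Unix epoch,
# so the difference is converted ms -> s -> min with plain arithmetic. This avoids relying on
# the automatically chosen unit of a date-time subtraction.
sessions <- sessions |>
    mutate(duration_minutes = (original_end_time - original_start_time) / 1000 / 60)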
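
A note on units: subtracting two date-times in R returns a `difftime` whose unit (seconds, minutes, hours, or days) is picked automatically from the smallest gap in the vector, so dividing the result by 60 yields minutes only when that unit happens to be seconds. The sketch below illustrates this with made-up timestamps (the values are hypothetical, not taken from `sessions.csv`) and shows two ways to get minutes regardless of the data, assuming the `original_*` columns hold milliseconds since the Unix epoch.

```r
# Hypothetical timestamps in milliseconds since the Unix epoch (not from the data)
start_ms <- c(1.7e12, 1.7e12)
end_ms   <- c(1.7e12 + 7.2e6, 1.7e12 + 1.08e7)  # 2 hours and 3 hours later

start <- as.POSIXct(start_ms / 1000, origin = "1970-01-01", tz = "UTC")
end   <- as.POSIXct(end_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Plain subtraction: R picks the unit itself. Here the shortest gap is
# over an hour, so the result is expressed in hours, not seconds.
auto_diff <- end - start
auto_unit <- units(auto_diff)

# Option 1: request minutes explicitly
explicit_minutes <- as.numeric(difftime(end, start, units = "mins"))

# Option 2: skip date-times and convert the millisecond difference directly
arithmetic_minutes <- (end_ms - start_ms) / 1000 / 60
```

Both options return the same values in minutes; the arithmetic version has the small advantage of not depending on any date-time conversion at all.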

In [42]:
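# Summarize sessions per player: number of sessions, total minutes played,
# and mean session duration (missing durations are ignored via na.rm = TRUE)
session_summary <- sessions |>
    group_by(hashedEmail) |>
    summarize(total_sessions = n(),
              total_minutes_played = sum(duration_minutes, na.rm = TRUE),
              avg_session_duration = mean(duration_minutes, na.rm = TRUE))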

| hashedEmail | total_sessions | total_minutes_played | avg_session_duration |
|---|---|---|---|
| `<chr>` | `<int>` | `<dbl>` | `<dbl>` |
| 0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832 | 2 | 166.6667 | 83.33333 |
| 060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967 | 1 | 0.0000 | 0.00000 |
| 0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387 | 1 | 0.0000 | 0.00000 |
| 0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3 | 13 | 666.6667 | 51.28205 |
| 0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab9c1ff1a0e7ca200b3a | 2 | 166.6667 | 83.33333 |
| 11006065e9412650e99eea4a4aaaf0399bc338006f85e80cc82d18b49f0e2aa4 | 1 | 0.0000 | 0.00000 |
| 119f01b9877fc5ea0073d05602a353b91c4b48e4cf02f42bb8d661b46a34b760 | 1 | 0.0000 | 0.00000 |
| 18936844e06b6c7871dce06384e2d142dd86756941641ef39cf40a9967ea14e3 | 41 | 1000.0000 | 24.39024 |
| 1a2b92f18f36b0b59b41d648d10a9b8b20a2adff550ddbcb8cec2f47d4d881d0 | 1 | 0.0000 | 0.00000 |
| 1d2371d8a35c8831034b25bda8764539ab7db0f63938696917c447128a2540dd | 1 | 0.0000 | 0.00000 |
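
To make the behaviour of the per-player summary concrete, here is a small sketch on a made-up table (the `toy_sessions` values and single-letter IDs are hypothetical, not drawn from the data). It shows that `n()` counts every session row, including rows with a missing duration, while `na.rm = TRUE` drops missing durations from the sum and the mean.

```r
library(dplyr)

# Hypothetical sessions: player "b" has one missing duration,
# player "c" has only a missing duration
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b", "b", "c"),
  duration_minutes = c(30, 90, 10, NA, NA)
)

# Same summary as above, applied to the toy table
toy_summary <- toy_sessions |>
  group_by(hashedEmail) |>
  summarize(total_sessions = n(),
            total_minutes_played = sum(duration_minutes, na.rm = TRUE),
            avg_session_duration = mean(duration_minutes, na.rm = TRUE))

toy_summary
```

In this toy table, player `b` is counted as having two sessions even though only one duration is known, and player `c` ends up with a total of 0 and an average of `NaN`, because no durations remain once the missing value is dropped. Players like `c` may need separate handling before any modelling step.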
