# DSCI 100 Project Final Report (Group 40) 
> Github: [dsci-100-project](https://github.com/wizexplorer/dsci-100-project)

## Predicting Newsletter Subscription from Player Behaviour

In [1]:
# Load required packages
library(tidyverse)
library(repr)
library(tidymodels)

# Set seed for reproducibility
set.seed(0)

options(repr.plot.width = 12, repr.plot.height = 8)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

### Introduction

#### Background:
Game companies often use newsletters to keep players engaged with
new content, events, and promotions. However, sending messages
to players who are unlikely to subscribe or interact can waste
marketing effort. If we can predict which players are more likely
to subscribe, the company can target those users more efficiently
and potentially improve engagement and revenue.

#### Research questions:
##### Broad question:
  What player characteristics and behaviours are most predictive
  of subscribing to a game-related newsletter?

##### Specific question:
  Can we predict whether a player subscribes to the newsletter
  based on their characteristics and in-game behaviour?   
  (*__TODO__: improve specific question*)


#### Data description:
We use two datasets, `players` and `sessions`:


__Players dataset__   
The players dataset contains 196 observations of players with 7 attributes:  
  - `experience` (character): Self-reported experience level of the player
  - `subscribe` (logical): Whether the player subscribed to the newsletter
  - `hashedEmail` (character): Hashed unique player identifier
  - `played_hours` (double): Total number of hours played by the player
  - `name` (character): Player's chosen in-game name
  - `gender` (character): Player's gender
  - `Age` (double): Player's age in years

At first glance the data look tidy, but to make them usable for our purposes (classification), we will need to convert the categorical character variables to factors.
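
The conversion described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the report's actual wrangling code: `players_toy` is a hypothetical three-row tibble that only mimics the column layout of `players`. It also converts the logical `subscribe` outcome, since tidymodels classification models expect a factor outcome; putting `TRUE` first makes it the event level that yardstick uses by default when computing recall.

```r
library(dplyr)
library(tibble)

# Hypothetical rows that mimic the structure of `players`; not real data.
players_toy <- tibble(
  experience   = c("Pro", "Amateur", "Veteran"),
  subscribe    = c(TRUE, FALSE, TRUE),
  gender       = c("Male", "Female", "Non-binary"),
  played_hours = c(30.3, 0.1, 3.8),
  Age          = c(21, 17, 25)
)

players_toy_fct <- players_toy |>
  mutate(
    # Categorical character columns become factors.
    across(c(experience, gender), as.factor),
    # Logical outcome becomes a factor with "TRUE" as the first level.
    subscribe = factor(subscribe, levels = c(TRUE, FALSE))
  )
```

Identifier-like columns such as `hashedEmail` and `name` are left out of the conversion here because they label players rather than describe them.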

__Sessions dataset__   
The sessions dataset contains 1535 observations of players' sessions with 5 attributes:  
  - `hashedEmail` (character): Hashed unique player identifier
  - `start_time` (character): Human-readable session start time (format: DD/MM/YYYY HH:MM)
  - `end_time` (character): Human-readable session end time (format: DD/MM/YYYY HH:MM)
  - `original_start_time` (double): Session start time as Unix timestamp (milliseconds)
  - `original_end_time` (double): Session end time as Unix timestamp (milliseconds)

This data is almost tidy! Although `start_time` and `end_time` could each be separated into a date variable and a time variable, we choose not to do so, because the Unix timestamps already give a combined (absolute) version of each time. Instead, we convert `start_time` and `end_time` into datetime objects to compute statistics on the dataset. These variables may also be helpful later for translating the illegible Unix timestamps into human-readable times.
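
A minimal sketch of the datetime conversion, assuming the `DD/MM/YYYY HH:MM` strings can be parsed with `lubridate::dmy_hm()` and that both representations are in UTC (the source does not state the time zone). The single row in `sessions_toy` is hypothetical and was constructed so that the two representations agree.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical session row; values are illustrative, not taken from the data.
sessions_toy <- tibble(
  start_time          = "30/06/2024 18:12",
  end_time            = "30/06/2024 18:24",
  original_start_time = 1719771120000  # milliseconds since the Unix epoch
)

sessions_toy_dt <- sessions_toy |>
  mutate(
    # Parse the human-readable strings into datetime objects (UTC by default).
    start_time = dmy_hm(start_time),
    end_time   = dmy_hm(end_time),
    # Session length in minutes.
    duration_mins = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Unix milliseconds -> seconds -> datetime, for comparison with start_time.
    start_from_unix = as_datetime(original_start_time / 1000)
  )
```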

From the sessions data, we will derive per-player engagement metrics:
  - `mean_session_duration` (double): average session length in minutes
  - `total_session_duration` (double): total minutes played across sessions
  - `num_sessions` (integer): number of recorded sessions

#### Aim:
Our aim is to build a k-nearest neighbours (k-NN) classifier that
predicts `subscribe` using
  `Age`, `experience`, `gender`, `played_hours`,
  `mean_session_duration`, `total_session_duration`, `num_sessions`,
and missingness indicators for the session variables. We will use
5-fold cross-validation to tune k, with recall as the metric we
focus on.
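
The tuning procedure described in this aim might look roughly like the sketch below. It is a hedged illustration only: `toy` is randomly generated synthetic data, the grid of odd k values from 1 to 15 is an arbitrary choice, only three numeric predictors are used, and the categorical predictors and missingness indicators are omitted for brevity. It assumes the `kknn` package is installed as the k-NN engine.

```r
library(tidymodels)

set.seed(0)

# Synthetic stand-in for the joined player table; values are random, not real.
n <- 100
toy <- tibble(
  subscribe = factor(
    sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("TRUE", "FALSE")  # "TRUE" first so recall refers to subscribers
  ),
  Age          = round(runif(n, 10, 50)),
  played_hours = rexp(n, rate = 0.2),
  num_sessions = rpois(n, 5)
)

# Standardise predictors so no single variable dominates the distance.
knn_recipe <- recipe(subscribe ~ ., data = toy) |>
  step_scale(all_numeric_predictors()) |>
  step_center(all_numeric_predictors())

# k-NN specification with the number of neighbours left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation, stratified on the outcome.
folds <- vfold_cv(toy, v = 5, strata = subscribe)

# Candidate values of k (an illustrative grid).
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid, metrics = metric_set(recall)) |>
  collect_metrics()

# The k with the highest mean cross-validated recall.
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
```

In a full analysis the folds would be drawn from a training split only, so that a held-out test set remains untouched until the final evaluation.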


### Methodology

In [2]:
# Load the data
players <- read_csv("https://raw.githubusercontent.com/wizexplorer/dsci-100-project/main/data/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/wizexplorer/dsci-100-project/main/data/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
