# **DSCI 100 Project:** *Are the Chronically Online More Likely to Subscribe?*

## Introduction
#### Background
This project, along with others from fellow DSCI 100 students, will be used to aid a UBC computer science research group led by Frank Wood, whose goal is to determine how people play video games. The team has set up a Minecraft server to log players' actions in the game.
#### Questions
*Broad Question:* What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

*Specific Question:* Are experience level and number of sessions played good predictors of subscription to a game-related newsletter?
#### Data description
This project uses two datasets: one containing player identification and traits, and one containing the players' logged sessions.

The first file is named "players.csv":
* CSV file, delimited by ",".
* 196 observations (players)
* 7 variables
* Experience, subscription status, hashed E-mail, name, and gender are categorical.
* Hours played and age are quantitative.
* Experience is split into Beginner, Amateur, Regular, Veteran, and Pro, but the data do not say how these levels rank against each other. I have ordered them from least to most experience as listed here, based on the names alone.
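
The ordering described above can be made explicit by storing experience as an ordered factor. The following is a minimal sketch on a small made-up tibble (`toy_players` is hypothetical, not the real data). It assumes the levels are spelled as they appear in `players.csv` and ranked in the order chosen above; the true ranking is not documented.

```r
library(tidyverse)

# Hypothetical toy data standing in for the experience column of players.csv
toy_players <- tibble(
    experience = c("Pro", "Veteran", "Amateur", "Regular", "Beginner")
)

# ASSUMPTION: ranking from least to most experienced, based on the names only
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_players <- toy_players |>
    mutate(
        # Store experience as an ordered factor with the assumed ranking
        experience = factor(experience, levels = experience_levels, ordered = TRUE),
        # Integer rank (1 = Beginner, 5 = Pro) for models that need numbers
        experience_rank = as.integer(experience)
    )

toy_players
```

Treating the rank as a number also assumes the gaps between levels are equal, which may not hold.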

The second file is named "sessions.csv":
* CSV file, delimited by ",".
* 1535 observations (sessions)
* 5 variables
* Hashed E-mail is categorical.
* Start time and end time are date-times (read in as character strings); original start time and original end time are numeric.
* The data is not tidy because each start and end time cell holds both a date and a time of day.
* The original start and original end times are large numbers on the order of 10^12. Their unit is not documented, but the magnitude is consistent with Unix timestamps in milliseconds. They are harder to read, but may be easier to use in analysis than the start and end times because each cell holds a single number.
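
Both time representations can be converted into date-time values with `lubridate`. This sketch uses two rows copied from the start of `sessions.csv` (`toy_sessions` is a hypothetical name). It assumes that the start and end times are in day/month/year hour:minute format and that the original times are Unix timestamps in milliseconds; neither is confirmed by any documentation of the data. The time zone is left as UTC, which is also an assumption.

```r
library(tidyverse)
library(lubridate)

# Two rows copied from sessions.csv (hashed e-mail omitted for brevity)
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
    original_start_time = c(1719770000000, 1718670000000)
)

toy_sessions <- toy_sessions |>
    mutate(
        # Parse the day/month/year strings into date-times (UTC assumed)
        start_time = dmy_hm(start_time),
        end_time = dmy_hm(end_time),
        # Session length in minutes
        session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")),
        # ASSUMPTION: original times are milliseconds since 1970-01-01 UTC
        original_start = as_datetime(original_start_time / 1000)
    )

toy_sessions
```

Under the millisecond assumption, the first original start time converts to 2024-06-30 17:53:20 UTC, roughly matching the recorded 18:12 start. The original values look rounded in the file preview (start and end are identical in each row), so the parsed start and end times are probably the safer basis for session lengths.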

## Methods & Results
To begin the data analysis, we must first load the relevant packages and datasets.

In [1]:
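# Load the packages used throughout the analysis
library(tidyverse)
library(repr)
library(tidymodels)
library(janitor)  # provides clean_names()

# Read both datasets and standardize the column names to snake_case
players <- read_csv("data/players.csv") |>
    clean_names()
sessions <- read_csv("data/sessions.csv") |>
    clean_names()

# Preview the first six rows of each dataset
head(players)
head(sessions)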


| experience | subscribe | hashed_email | played_hours | name | gender | age |
|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


| hashed_email | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 23/06/2024 15:08 | 23/06/2024 17:10 | 1719160000000.0 | 1719160000000.0 |


In [2]:
# Count how many sessions each player (identified by hashed e-mail) logged
n_sessions <- sessions |>
    group_by(hashed_email) |>
    summarise(number_of_sessions = n())
head(n_sessions)

| hashed_email | number_of_sessions |
|---|---|
| `<chr>` | `<int>` |
| 0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832 | 2 |
| 060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967 | 1 |
| 0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387 | 1 |
| 0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3 | 13 |
| 0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab9c1ff1a0e7ca200b3a | 2 |
| 11006065e9412650e99eea4a4aaaf0399bc338006f85e80cc82d18b49f0e2aa4 | 1 |
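
To use the session counts as a predictor alongside experience, they would need to be attached to the player table. One possible way is a left join on `hashed_email`, sketched here on made-up data (`toy_players` and `toy_counts` are hypothetical stand-ins for `players` and `n_sessions`). Players who do not appear in the session counts are given zero sessions; whether that is the right treatment is an assumption.

```r
library(tidyverse)

# Hypothetical stand-in for players (shortened e-mail hashes)
toy_players <- tibble(
    hashed_email = c("aaa", "bbb", "ccc"),
    experience = c("Pro", "Amateur", "Veteran"),
    subscribe = c(TRUE, TRUE, FALSE)
)

# Hypothetical stand-in for n_sessions; player "ccc" has no logged sessions
toy_counts <- tibble(
    hashed_email = c("aaa", "bbb"),
    number_of_sessions = c(13L, 2L)
)

players_with_sessions <- toy_players |>
    # Keep every player, even those without a matching session count
    left_join(toy_counts, by = "hashed_email") |>
    # ASSUMPTION: a missing count means the player logged zero sessions
    mutate(number_of_sessions = replace_na(number_of_sessions, 0L))

players_with_sessions
```

A left join keeps one row per player, so the result can be passed to a model with `subscribe` as the response.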


## Discussion

## References