# Individual Project Planning Stage

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
library(lubridate)
options(repr.matrix.max.rows = 10)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

## Data 

#### players.csv 
This data set contains information about players who used the Minecraft server, PLAICraft. There are 196 observations and 7 variables as follows: 
- `experience` - character: player's experience level (`Beginner`, `Amateur`, `Regular`, `Veteran`, or `Pro`)
- `subscribe` - logical: player's subscription status to a game-related newsletter (`TRUE` or `FALSE`)
- `hashedEmail` - character: hashed player's email
- `played_hours` - double: total hours played by each user 
- `name` - character: player's first name
- `gender` - character: player's gender
- `Age` - double: player's age 
<br>
<br>
A potential issue with this data is that there are two `NA`s in the `Age` column, which should be noted. 

#### sessions.csv
This data set contains information about sessions played on the Minecraft server, PLAICraft. There are 1535 observations and 5 variables as follows:
- `hashedEmail` - character: hashed player's email
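- `start_time` - character: session start date and time, stored as a `DD/MM/YYYY HH:MM` string
- `end_time` - character: session end date and time, stored as a `DD/MM/YYYY HH:MM` string
- `original_start_time` - double: session start time as a numeric timestamp (apparently milliseconds since the UNIX epoch)
- `original_end_time` - double: session end time as a numeric timestamp (apparently milliseconds since the UNIX epoch)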
<br>
<br>
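Some potential issues with this data are that `start_time` and `end_time` are stored as character strings rather than date-times, and that the `original_start_time` and `original_end_time` values shown in the output below are identical within each session, so they cannot be used to compute session length.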
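
As a rough check on the numeric timestamp columns, the sketch below converts one `original_start_time` value (copied from the printed output further down) into a date-time. It assumes the values count milliseconds since the UNIX epoch, which is an assumption and not something the data set documents; `original_start_ms` and `original_start` are illustrative names.

```r
library(lubridate)

# Value copied from the first printed row of sessions.csv in this notebook
original_start_ms <- 1719770000000

# ASSUMPTION: the value is milliseconds since 1970-01-01 UTC,
# so dividing by 1000 gives the seconds that as_datetime() expects
original_start <- as_datetime(original_start_ms / 1000)
original_start
```

If the assumption holds, the result falls on 30 June 2024 (UTC), which would place the sessions in mid-2024.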

In [2]:
#Loading Data Sets
players_url <- "https://raw.githubusercontent.com/skylv777/Data_Science_Project/refs/heads/main/data/players.csv"
sessions_url <- "https://raw.githubusercontent.com/skylv777/Data_Science_Project/refs/heads/main/data/sessions.csv"
players_data <- read_csv(players_url)
sessions_data <- read_csv(sessions_url)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [3]:
#Parsing start_time and end_time into date-times
#The strings are day/month/year hour:minute, so dmy_hm() is needed; ymd_hms() misreads them (e.g. as the year 2030)
sessions_data_tidy <- sessions_data |>
         mutate(start_time = dmy_hm(start_time)) |>
         mutate(end_time = dmy_hm(end_time))
head(sessions_data_tidy)

hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<dttm>,<dttm>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-06-30 18:12:00,2024-06-30 18:24:00,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-06-17 23:33:00,2024-06-17 23:46:00,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,2024-07-25 17:34:00,2024-07-25 17:57:00,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-07-25 03:22:00,2024-07-25 03:58:00,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-05-25 16:01:00,2024-05-25 16:12:00,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-06-23 15:08:00,2024-06-23 17:10:00,1719160000000.0,1719160000000.0


In [4]:
#Combining the two provided data sets for computing summary statistics 
combined_data <- inner_join(players_data, sessions_data, by = "hashedEmail")

#### Summary Statistics for players_data

In [5]:
#Summary Statistics on Quantitative Values of players_data 
players_data_select <- select(players_data, Age, played_hours)

players_data_stats <- 
        bind_rows(map_df(players_data_select, mean, na.rm = TRUE), 
                  map_df(players_data_select, median, na.rm = TRUE),
                  map_df(players_data_select, ~{
    x <- na.omit(.x)
    if (length(x) == 0) return(NA_real_)
    tibble(val = x) |>
      count(val, sort = TRUE) |>
      filter(n == max(n)) |>
      slice_head(n = 1) |>   
      pull(val)}),
                  map_df(players_data_select, min, na.rm = TRUE),
                  map_df(players_data_select, max, na.rm = TRUE), 
                  map_df(players_data_select, sd, na.rm = TRUE), #large for played_hours because a few players have very high totals (heavy right skew)
                  map_df(players_data_select, ~ quantile(.x, probs = 0.25, na.rm = TRUE)[[1]]),
                  map_df(players_data_select, ~ quantile(.x, probs = 0.5, na.rm = TRUE)[[1]]),
                  map_df(players_data_select, ~ quantile(.x, probs = 0.75, na.rm = TRUE)[[1]])) |>
        mutate(Summary = c("Mean", "Median", "Mode", "Minimum", "Maximum", "Standard Deviation", "1st Quartile", "2nd Quartile", "3rd Quartile")) |>
        relocate(Summary) |>
        mutate(across(Age:played_hours, \(x) round(x, 2)))

players_data_stats

Summary,Age,played_hours
<chr>,<dbl>,<dbl>
Mean,21.14,5.85
Median,19.0,0.1
Mode,17.0,0.0
Minimum,9.0,0.0
Maximum,58.0,223.1
Standard Deviation,7.39,28.36
1st Quartile,17.0,0.0
2nd Quartile,19.0,0.1
3rd Quartile,22.75,0.6


In [6]:
player_gender_total <- players_data |> 
        summarize(total = n())|>
        pull()

player_gender_count <- players_data |>
        group_by(gender) |>
        summarize(count=n()) |>
        arrange(desc(count)) |>
        mutate(percent = count / player_gender_total * 100)

player_gender_stats <- player_gender_count |>
        mutate(percent = round(percent, digits = 1))

player_gender_stats

gender,count,percent
<chr>,<int>,<dbl>
Male,124,63.3
Female,37,18.9
Non-binary,15,7.7
Prefer not to say,11,5.6
Two-Spirited,6,3.1
Agender,2,1.0
Other,1,0.5


In [7]:
player_experience_total <- players_data |>
        summarize(total = n()) |>
        pull()

player_experience_count <- players_data |>
        group_by(experience) |>
        summarize(count = n()) |>
        arrange(desc(count)) |>
        mutate(percent = count / player_experience_total * 100)

player_experience_stats <- player_experience_count |>
        mutate(percent = round(percent, digits = 1))
            
player_experience_stats

experience,count,percent
<chr>,<int>,<dbl>
Amateur,63,32.1
Veteran,48,24.5
Regular,36,18.4
Beginner,35,17.9
Pro,14,7.1


In [8]:
player_subscribe_total <- players_data |> 
        summarize(total = n()) |>
        pull()

player_subscribe_count <- players_data |>
        group_by(subscribe) |>
        summarize(count = n()) |>
        arrange(desc(count))|>
        mutate(percent = count / player_subscribe_total * 100)

player_subscribe_stats <- player_subscribe_count |>
        mutate(percent = round(percent, digits = 1))

player_subscribe_stats 

subscribe,count,percent
<lgl>,<int>,<dbl>
TRUE,144,73.5
FALSE,52,26.5


#### Summary Statistics for sessions_data

In [9]:
#Session length in minutes; difftime() subtracts the date-times and fixes the unit
sessions_data_difference <- sessions_data_tidy |>
        mutate(session_length = as.numeric(difftime(end_time, start_time, units = "mins")))

sessions_data_select <- select(sessions_data_difference, original_start_time, original_end_time, session_length)

#Same summary statistics as for players_data, one row per statistic
sessions_data_stats <- 
        bind_rows(map_df(sessions_data_select, mean, na.rm = TRUE), 
                  map_df(sessions_data_select, median, na.rm = TRUE),
                  map_df(sessions_data_select, ~{
    x <- na.omit(.x)
    if (length(x) == 0) return(NA_real_)
    tibble(val = x) |>
      count(val, sort = TRUE) |>
      filter(n == max(n)) |>
      slice_head(n = 1) |>   
      pull(val)}),
                  map_df(sessions_data_select, min, na.rm = TRUE),
                  map_df(sessions_data_select, max, na.rm = TRUE), 
                  map_df(sessions_data_select, sd, na.rm = TRUE), 
                  map_df(sessions_data_select, ~ quantile(.x, probs = 0.25, na.rm = TRUE)[[1]]),
                  map_df(sessions_data_select, ~ quantile(.x, probs = 0.5, na.rm = TRUE)[[1]]),
                  map_df(sessions_data_select, ~ quantile(.x, probs = 0.75, na.rm = TRUE)[[1]])) |>
        mutate(Summary = c("Mean", "Median", "Mode", "Minimum", "Maximum", "Standard Deviation", "1st Quartile", "2nd Quartile", "3rd Quartile")) |>
        relocate(Summary) |>
        mutate(across(where(is.numeric), \(x) round(x, 2)))

sessions_data_stats


### Visualizations
Explain any insights you gain from these plots that are relevant to address your question

In [None]:
experience_bar <- players_data |>
        ggplot(aes(x = experience)) +
        geom_bar(aes(fill = subscribe)) +
        ggtitle("Subscription Status by Experience") +
        labs(x = "Experience Level",
             y = "Number of Players",
             fill = "Subscription Status")
experience_bar

This plot shows the distribution of experience levels and, within each level, how many players are and are not subscribed to a game-related newsletter. 
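
Because the experience groups differ in size, stacked counts can make it hard to compare subscription rates across levels. As a minimal sketch, the variant below uses `position = "fill"` so every bar has height 1 and the fill shows proportions. The `toy_players` data frame is hypothetical and only stands in for `players_data`.

```r
library(ggplot2)

# Hypothetical toy data standing in for players_data
toy_players <- data.frame(
  experience = c("Amateur", "Amateur", "Amateur", "Pro", "Pro",
                 "Veteran", "Veteran", "Veteran"),
  subscribe  = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# position = "fill" rescales each bar to height 1,
# so the coloured segments are proportions rather than counts
experience_prop_bar <- ggplot(toy_players, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(x = "Experience Level",
       y = "Proportion of Players",
       fill = "Subscription Status")
experience_prop_bar
```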

## Question
#### Broad Question: 
Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
<br>
<br>
#### Specific Question:
Can hours played, age, and average session length predict whether a user is subscribed to a game-related newsletter? Additionally, does this differ by experience level?
<br> 
<br>
The data contains the age, hours played, subscription status, and experience level of each user. By combining players.csv and sessions.csv, I can subtract each session's start time from its end time to find its length, then `group_by` `hashedEmail` and average the lengths to get the mean session length for each player. With this information, I will be able to train a classification model to answer my question. 
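
The wrangling described above could look roughly like the sketch below. The `toy_sessions` and `toy_players` tables are hypothetical stand-ins for the real files, the time strings are assumed to be in day/month/year hour:minute form, and `avg_session_length` is an illustrative column name.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical toy rows mimicking sessions.csv and players.csv
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 10:00", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "01/07/2024 10:30", "17/06/2024 23:46")
)
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  subscribe    = c(TRUE, FALSE, TRUE),
  played_hours = c(0.7, 0.2, 0),
  Age          = c(17, 21, 19)
)

# Session length in minutes, then the mean length per player
avg_session <- toy_sessions |>
  mutate(start_time = dmy_hm(start_time),
         end_time = dmy_hm(end_time),
         session_length = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
  group_by(hashedEmail) |>
  summarize(avg_session_length = mean(session_length))

# left_join keeps players with no recorded sessions (their average is NA)
players_with_sessions <- left_join(toy_players, avg_session, by = "hashedEmail")
players_with_sessions
```

Using a left join here is a design choice: players without any sessions stay in the table with a missing average, so the decision to drop or impute them can be made explicitly before modelling.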

### Methods and Plan

To answer my specific question with the given data, a K-nearest neighbours (K-NN) classification model is the most appropriate choice. Of the methods introduced in DSCI 100, classification is the only one that predicts a categorical class, so it is better suited than regression to predicting subscription status. 


Which assumptions are required, if any, to apply the method selected?
What are the potential limitations or weaknesses of the method selected?
How are you going to compare and select the model?
How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
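
One possible way to handle the splitting and model-selection questions above is sketched below with tidymodels. This is not a decision the plan has already made: the 75/25 split, the 5 folds, and the grid of K values are all assumptions, the `toy` data is simulated and only illustrates the column layout, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(2024)

# Hypothetical simulated players; the real analysis would use the joined data
toy <- tibble(
  played_hours = rexp(120, rate = 0.2),
  Age = round(runif(120, 10, 50)),
  avg_session_length = rexp(120, rate = 0.05)
) |>
  mutate(subscribe = factor(if_else(played_hours + rnorm(120) > 3, "TRUE", "FALSE")))

# ASSUMPTION: 75/25 split, stratified so both sets keep a similar share of subscribers
toy_split <- initial_split(toy, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# K-NN is distance-based, so predictors are scaled and centred
knn_recipe <- recipe(subscribe ~ played_hours + Age + avg_session_length,
                     data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# ASSUMPTION: 5-fold cross-validation on the training set only, used to choose K
toy_vfold <- vfold_cv(toy_train, v = 5, strata = subscribe)

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_vfold,
            grid = tibble(neighbors = seq(1, 15, by = 2))) |>
  collect_metrics() |>
  filter(.metric == "accuracy") |>
  arrange(desc(mean))

knn_results
```

In this sketch the test set is left untouched until a value of K has been chosen from the cross-validation results, so the final accuracy estimate is not influenced by the tuning step.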