### Individual Project Planning Stage

In [39]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 10)

### Question
Broad Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
<br>
<br>
Specific Question: Does total hours played or average session length better predict whether a player is subscribed to a game-related newsletter, and does this differ by gender and experience level?
<br> 
<br>
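Average session length is not a column in either data set, so it would have to be derived from the session start and end times and then attached to each player. The sketch below shows one possible way to do that. It runs on small made-up tibbles (`players_toy`, `sessions_toy`) that only mimic the column names of the real files, and the names `session_minutes`, `avg_session_minutes`, and `n_sessions` are hypothetical. It also assumes the timestamps are in day/month/year hour:minute format.

```r
library(tidyverse)
library(lubridate)

# Toy stand-ins for players.csv and sessions.csv (hypothetical values).
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe = c(TRUE, FALSE, TRUE),
  played_hours = c(1.5, 0.4, 0.0)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("23/05/2024 00:22", "22/05/2024 23:12", "28/06/2024 04:28"),
  end_time = c("23/05/2024 01:07", "23/05/2024 00:13", "28/06/2024 04:58")
)

# Parse the character timestamps and compute each session's length in minutes.
session_lengths <- sessions_toy |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# Collapse to one row per player: mean session length and number of sessions.
session_summary <- session_lengths |>
  group_by(hashedEmail) |>
  summarize(avg_session_minutes = mean(session_minutes), n_sessions = n())

# left_join keeps players with no recorded sessions; their summary columns are NA.
players_with_sessions <- left_join(players_toy, session_summary, by = "hashedEmail")
players_with_sessions
```

Keeping one row per player means `played_hours` and `avg_session_minutes` sit side by side as candidate predictors of `subscribe`. Players without sessions would need to be dropped or handled explicitly before modelling.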

### Data 
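`players.csv` (196 rows, 7 columns):
- `experience` - character, user experience level (`Amateur`, `Beginner`, `Regular`, `Veteran`, or `Pro`)
- `subscribe` - logical, newsletter subscription status (`TRUE` or `FALSE`)
- `hashedEmail` - character, encrypted user emails for the purpose of 
- `played_hours` - double, total hours played by each user
- `name` - character, user name
- `gender` - character, user gender
- `Age` - double, user age
<br>
  
`sessions.csv` (1535 rows, 5 columns):
- `hashedEmail` - character, links each session to a player in `players.csv`
- `start_time` - character, session start in `DD/MM/YYYY HH:MM` format
- `end_time` - character, session end in `DD/MM/YYYY HH:MM` format
- `original_start_time` - double, numeric session start timestamp
- `original_end_time` - double, numeric session end timestamp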
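Several of the variables listed above are categorical but are read in as character or logical columns. A classification model typically expects the response (and often categorical predictors) to be factors. The following is a minimal sketch of that conversion on a made-up tibble (`players_toy`, hypothetical values). The factor levels are simply taken in order of appearance, which is an assumption and not a statement about how the experience levels should be ordered.

```r
library(tidyverse)

# Toy stand-in for players.csv (hypothetical values).
players_toy <- tibble(
  experience = c("Pro", "Beginner", "Regular"),
  subscribe = c(TRUE, FALSE, TRUE),
  gender = c("Male", "Female", "Male"),
  played_hours = c(0.4, 0.1, 1.5),
  Age = c(21, 17, 20)
)

# Convert the response and the categorical variables to factors.
players_typed <- players_toy |>
  mutate(
    subscribe = factor(subscribe, levels = c(TRUE, FALSE)),
    experience = as_factor(experience),  # levels in order of first appearance
    gender = as_factor(gender)
  )

glimpse(players_typed)
```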


In [54]:
players_url <- "https://raw.githubusercontent.com/skylv777/Data_Science_Project/refs/heads/main/data/players.csv"
sessions_url <- "https://raw.githubusercontent.com/skylv777/Data_Science_Project/refs/heads/main/data/sessions.csv"
players_data <- read_csv(players_url)
sessions_data <- read_csv(sessions_url)

# Preview the first rows of each data set
head(players_data)
head(sessions_data)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


In [41]:
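# Combine the two data sets on their shared key (hashedEmail) to compute summary statistics.
# The result has one row per session, and players with no sessions are dropped.
combined_data <- merge(players_data, sessions_data, by = "hashedEmail")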
head(combined_data)

,hashedEmail,experience,subscribe,played_hours,name,gender,Age,start_time,end_time,original_start_time,original_end_time
,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
1,0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,Regular,True,1.5,Isaac,Male,20,23/05/2024 00:22,23/05/2024 01:07,1716420000000.0,1716430000000.0
2,0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,Regular,True,1.5,Isaac,Male,20,22/05/2024 23:12,23/05/2024 00:13,1716420000000.0,1716420000000.0
3,060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967,Pro,False,0.4,Lyra,Male,21,28/06/2024 04:28,28/06/2024 04:58,1719550000000.0,1719550000000.0
4,0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387,Beginner,True,0.1,Osiris,Male,17,19/09/2024 21:01,19/09/2024 21:12,1726780000000.0,1726780000000.0
5,0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,True,5.6,Winslow,Male,17,30/08/2024 03:40,30/08/2024 04:04,1724990000000.0,1724990000000.0
6,0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,True,5.6,Winslow,Male,17,27/08/2024 19:18,27/08/2024 19:52,1724790000000.0,1724790000000.0
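With its default arguments, `merge()` keeps only rows whose key appears in both tables and returns one row per matching session, so a player with many sessions is repeated (as with the duplicated rows in the output above) and a player with no sessions disappears. Player-level statistics computed on the merged table are therefore weighted by session count. The sketch below uses made-up tibbles (`players_toy`, `sessions_toy`, hypothetical values) to show one way to check how much of the player table survives the join.

```r
library(tidyverse)

# Toy stand-ins for players.csv and sessions.csv (hypothetical values).
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  played_hours = c(1.5, 0.4, 0.0)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("23/05/2024 00:22", "22/05/2024 23:12", "28/06/2024 04:28")
)

# Inner join: one row per session, only for players that have sessions.
combined_toy <- inner_join(players_toy, sessions_toy, by = "hashedEmail")

# Players that would be lost in the combined table.
players_without_sessions <- anti_join(players_toy, sessions_toy, by = "hashedEmail")

nrow(combined_toy)                     # number of session rows
n_distinct(combined_toy$hashedEmail)   # number of distinct players kept
nrow(players_without_sessions)         # number of players dropped
```

If the dropped count is large in the real data, summary statistics for `Age` and `played_hours` are probably better computed on the player table directly than on the combined one.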


In [51]:
players_data_select <- select(players_data, Age, played_hours)

# One row per statistic; the number of labels must match the number of rows bound below.
players_data_stats <- 
        bind_cols(Summary = c("Mean", "Median", "Minimum", "Maximum", "Standard Deviation", "1st Quartile", "2nd Quartile", "3rd Quartile"), 
        bind_rows(map_df(players_data_select, mean, na.rm = TRUE), 
                  map_df(players_data_select, median, na.rm = TRUE),
                  map_df(players_data_select, min, na.rm = TRUE),
                  map_df(players_data_select, max, na.rm = TRUE),
                  map_df(players_data_select, sd, na.rm = TRUE), #?seems too high
                  # quantile() replaces IQR() so that each quartile gets its own row
                  map_df(players_data_select, quantile, probs = 0.25, na.rm = TRUE, names = FALSE),
                  map_df(players_data_select, quantile, probs = 0.50, na.rm = TRUE, names = FALSE),
                  map_df(players_data_select, quantile, probs = 0.75, na.rm = TRUE, names = FALSE))) |>
        mutate(across(Age:played_hours, \(x) round(x, 2)))

players_data_stats


In [46]:
player_gender_total <- players_data |> 
        summarize(total = n()) |>
        pull()

player_gender_count <- players_data |>
        group_by(gender) |>
        summarize(count = n()) |>
        arrange(desc(count)) |>
        mutate(percent = count / player_gender_total * 100)

player_gender_stats <- player_gender_count |>
        mutate(percent = round(percent, digits = 1))

player_gender_stats

gender,count,percent
<chr>,<int>,<dbl>
Male,124,63.3
Female,37,18.9
Non-binary,15,7.7
Prefer not to say,11,5.6
Two-Spirited,6,3.1
Agender,2,1.0
Other,1,0.5


In [47]:
player_experience_total <- players_data |>
        summarize(total = n()) |>
        pull()

player_experience_count <- players_data |>
        group_by(experience) |>
        summarize(count = n()) |>
        arrange(desc(count)) |>
        mutate(percent = count / player_experience_total * 100)

player_experience_stats <- player_experience_count |>
        mutate(percent = round(percent, digits = 1))
            
player_experience_stats

experience,count,percent
<chr>,<int>,<dbl>
Amateur,63,32.1
Veteran,48,24.5
Regular,36,18.4
Beginner,35,17.9
Pro,14,7.1


In [48]:
player_subscribe_total <- players_data |> 
        summarize(total = n()) |>
        pull()

player_subscribe_count <- players_data |>
        group_by(subscribe) |>
        summarize(count = n()) |>
        arrange(desc(count)) |>
        mutate(percent = count / player_subscribe_total * 100)

player_subscribe_stats <- player_subscribe_count |>
        mutate(percent = round(percent, digits = 1))

player_subscribe_stats 

subscribe,count,percent
<lgl>,<int>,<dbl>
True,144,73.5
False,52,26.5
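The gender, experience, and subscription tables above all follow the same pattern: count the rows per group, sort by count, and add a rounded percentage. One way to avoid repeating that code is a small helper function. The sketch below is only an illustration: `count_percent` is a hypothetical name, and it is run here on a made-up tibble (`players_toy`) in place of `players_data`.

```r
library(tidyverse)

# Hypothetical helper: count rows per level of one column, sort by count
# (largest first), and add the share of all rows as a rounded percentage.
count_percent <- function(data, column) {
  data |>
    count({{ column }}, name = "count", sort = TRUE) |>
    mutate(percent = round(count / sum(count) * 100, digits = 1))
}

# Toy stand-in for players.csv (hypothetical values).
players_toy <- tibble(
  gender = c("Male", "Male", "Female", "Male"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE)
)

gender_stats <- count_percent(players_toy, gender)
subscribe_stats <- count_percent(players_toy, subscribe)

gender_stats
subscribe_stats
```

On the real data, the same call would be `count_percent(players_data, gender)`, and likewise for `experience` and `subscribe`.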


### Visualizations

### Methods and Plan