# Summary Statistics for Final Project


In [3]:
#Load appropriate libraries and datasets
library(tidyverse)
library(tidymodels)

players_url <- "https://raw.githubusercontent.com/ishirGhatpande/Individual-Project-Planning/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/ishirGhatpande/Individual-Project-Planning/refs/heads/main/sessions.csv"

players <- read_csv(players_url, show_col_types = FALSE)
# sessions <- read_csv(sessions_url, show_col_types = FALSE)

head(players)
# head(sessions)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample     

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


In [4]:
#First, wrangle the data: rename Age, keep the variables of interest, and recode subscribe as a labelled factor
players <- players |>
    rename(age = Age) |>
    select(subscribe, played_hours, age) |>
    mutate(subscribe = as_factor(subscribe)) |>
    mutate(subscribe = fct_recode(subscribe, "Subscribed" = "TRUE", "Not Subscribed" = "FALSE"))
head(players)

subscribe,played_hours,age
<fct>,<dbl>,<dbl>
Subscribed,30.3,9
Subscribed,3.8,17
Not Subscribed,0.0,17
Subscribed,0.7,21
Subscribed,0.1,21
Subscribed,0.0,17


In [5]:
#Second, summarize the tidy dataset to create our summary statistics.
summary(players)

          subscribe    played_hours          age       
 Not Subscribed: 52   Min.   :  0.000   Min.   : 9.00  
 Subscribed    :144   1st Qu.:  0.000   1st Qu.:17.00  
                      Median :  0.100   Median :19.00  
                      Mean   :  5.846   Mean   :21.14  
                      3rd Qu.:  0.600   3rd Qu.:22.75  
                      Max.   :223.100   Max.   :58.00  
                                        NA's   :2      
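
The `summary()` output above reports 2 `NA's` for age, so the age statistics are computed from the non-missing values only. A minimal sketch of making that handling explicit is shown below. It assumes a small made-up tibble (`toy_players`, hypothetical) rather than the real players data, and the names `na_counts` and `age_stats` are introduced here only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the tidy players tibble;
# the values are made up for illustration only.
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.6, 30.3, 3.8),
  age = c(17, 21, NA, 9, 17)
)

# Count the missing values in every column.
na_counts <- toy_players |>
  summarize(across(everything(), \(x) sum(is.na(x))))

# Mean and median of age, dropping missing values explicitly.
age_stats <- toy_players |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE)
  )

na_counts
age_stats
```

Without `na.rm = TRUE`, `mean()` and `median()` return `NA` as soon as one value is missing, which is why the argument is spelled out here.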

In [6]:
#Find the count and percentage of subscribed and not subscribed players
num_obs <- nrow(players)
players |>
    group_by(subscribe) |>
    summarize(
        count = n(),
        percentage = n() / num_obs * 100
    )
num_obs

subscribe,count,percentage
<fct>,<int>,<dbl>
Not Subscribed,52,26.53061
Subscribed,144,73.46939
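
The same count and percentage table could plausibly be built without a separate `num_obs` variable by dividing each group count by the sum of all counts. The sketch below assumes a small made-up tibble (`toy_players`, hypothetical), not the real players data, and `subscribe_summary` is a name chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative only.
toy_players <- tibble(
  subscribe = factor(
    c("Subscribed", "Subscribed", "Not Subscribed", "Subscribed")
  )
)

# count() tallies the rows per level; the percentage is each count
# divided by the total number of rows.
subscribe_summary <- toy_players |>
  count(subscribe, name = "count") |>
  mutate(percentage = count / sum(count) * 100)

subscribe_summary
```

Because the denominator is computed from the counts themselves, the percentages always sum to 100 for whatever rows are passed in.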


### Table 1: Summary Statistics of Tidy players.csv Dataset (196 Observations and 3 Variables)
| Variable | Variable Type | Description | # of Subscribed | # of Not Subscribed | Percentage of Subscribed | Percentage of Not Subscribed | Minimum | Median | Mean | Maximum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| subscribe | Factor | Player's subscription status | 144 | 52 | 73.47% | 26.53% | N/A | N/A | N/A | N/A |
| played_hours | Double | Player's total hours played | N/A | N/A | N/A | N/A | 0.00 | 0.10 | 5.85 | 223.10 |
| age | Double | Player's age (years) | N/A | N/A | N/A | N/A | 9.00 | 19.00 | 21.14 | 58.00 |




The tidy dataset contains 196 observations across three variables. The subscribe variable is a factor indicating a player's subscription status: the majority of players (73.47%, 144 players) are subscribed, while 26.53% (52 players) are not. The two double variables are played_hours and age. The played_hours variable is the total number of hours a player has played; it ranges from a minimum of 0.00 to a maximum of 223.10 hours, with a mean of 5.85 hours and a median of 0.10 hours. Because the mean is far above the median, the distribution is strongly right-skewed: most players logged very little time, while a few logged a great deal. The age variable is a player's age in years, ranging from 9.00 to 58.00, with a mean of 21.14 and a median of 19.00; two observations are missing age, so these statistics are based on the 194 players with a recorded age.
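
The mean-versus-median comparison behind the skewness remark can be illustrated with a short sketch. The vector `toy_hours` below is hypothetical (made-up values that are mostly near zero with one very large entry), not the real played_hours column, and the other names are chosen only for this example.

```r
# Hypothetical played-hours values, made up for illustration:
# mostly near zero, with one very large value.
toy_hours <- c(0, 0, 0.1, 0.1, 0.6, 223.1)

hours_mean <- mean(toy_hours)
hours_median <- median(toy_hours)

# A mean well above the median is a common sign of right skew.
mean_exceeds_median <- hours_mean > hours_median

c(mean = hours_mean, median = hours_median)
mean_exceeds_median
```

A single large value is enough to pull the mean far above the median, while the median stays close to where most of the values sit.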
