In [1]:
library(tidyverse)   # attaches dplyr, tidyr, readr, ggplot2, and other core packages
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)  # print at most 6 rows of a data frame

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

# (1) Data Description:

In [2]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## players.csv:
- there are 196 observations
- 7 variables
    - experience: character, the player's experience level (Pro, Veteran, Amateur, etc.)
    - subscribe: logical, whether the player is subscribed to the game (TRUE or FALSE)
    - hashedEmail: character, an anonymized unique player identifier
    - played_hours: double, the total number of hours the player has played
    - name: character, the player's name
    - gender: character, the player's gender (Male, Female, Other, Prefer not to say, etc.)
    - Age: double, the player's age

## sessions.csv: 
- there are 1,535 observations  
- 5 variables
    - hashedEmail: character, an anonymized unique player identifier
    - start_time: character, the date and time the player started their gaming session
    - end_time: character, the date and time the player ended their gaming session
    - original_start_time: double, the session start as a raw numeric timestamp (potentially in milliseconds)
    - original_end_time: double, the session end as a raw numeric timestamp (potentially in milliseconds)

## issues + potential issues in data: 
- ### players
    - many players have played_hours equal to 0 
    - some observations have missing values (NA)
    - the age, name, and gender of the players are most likely self-reported, so players can enter or change them freely and they may be inaccurate
    - it is unclear whether a player's experience is a ranking assigned by the game or is also self-reported
        - if it is self-reported, it may also be inaccurate
    - played_hours has the same potential issue
- ### sessions
    - some sessions have a missing end time (the player may not have ended the session before the data was collected)
    - the original start and end times are extremely large numbers
        - the values barely change from start to end, so they do not give accurate session times
    - in some sessions the player may be AFK (idle) while the session keeps running, which makes the recorded time longer than the time actually played
    - players can start a session on one day and end it on another
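
Because the original timestamps barely change between start and end, session length is probably better derived from the start_time and end_time strings. The sketch below shows one way to do that with lubridate. The toy rows and the `dd/mm/yyyy HH:MM` format are assumptions for illustration; the real format in sessions.csv should be checked before picking the parsing function.

```r
library(dplyr)
library(lubridate)

# Toy rows standing in for sessions.csv (hypothetical values).
# ASSUMPTION: timestamps look like "dd/mm/yyyy HH:MM".
sessions_demo <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "18/06/2024 00:04", NA)
)

session_lengths <- sessions_demo |>
  mutate(
    start = dmy_hm(start_time),   # parse day/month/year hour:minute
    end   = dmy_hm(end_time),     # a missing end time stays NA
    session_minutes = as.numeric(difftime(end, start, units = "mins"))
  )

session_lengths
```

Parsing to date-times also handles sessions that cross midnight, and sessions with a missing end time simply get an NA length instead of a misleading value.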

In [3]:
# summary statistics
# players: mean played hours and mean age
players_stats_table <- players |>
    select(played_hours, Age) |> 
    summarize(played_hours = round(mean(played_hours, na.rm = TRUE), 2),
              Age = round(mean(Age, na.rm = TRUE), 2)) |>
    pivot_longer(cols = played_hours:Age, names_to = "Variable", values_to = "Mean")
players_stats_table

# sessions: mean session length from the original start and end times
# (the result is in the raw timestamp units, potentially milliseconds)
sessions_stats_table <- sessions |>
    select(original_start_time, original_end_time) |> 
    mutate(session_time = original_end_time - original_start_time) |>
    summarize(session_time = round(mean(session_time, na.rm = TRUE), 2)) |>
    pivot_longer(cols = session_time, names_to = "Variable", values_to = "Mean")
sessions_stats_table

Variable,Mean
<chr>,<dbl>
played_hours,5.85
Age,21.14


Variable,Mean
<chr>,<dbl>
session_time,2909328


# (2) Questions:

### broad question: 
We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

### specific question: 
Within each experience level, do male players spend more hours playing the game, on average, than female players, based on the players.csv dataset? 

This question addresses which "kinds" of players are most likely to contribute a large amount of data, where the amount of data is measured by a player's played hours. Relating a player's experience level and gender to the time they spend playing can direct recruiting efforts toward the groups that play the most. 

In players.csv, I will first group the players by experience level and then, within each level, by gender. I will then average the played hours in each group to determine which gender and experience level show the most hours played. Those groups would be the most likely to contribute larger amounts of data. 

# (3) Exploratory Data Analysis and Visualization

In [11]:
# load the datasets into R
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

# clean players: drop rows with missing Age and players with zero played hours
players_tidy <- players |>
    filter(!is.na(Age), played_hours > 0)
#head(players_tidy)

# tidy sessions: split date and time of day into separate columns, and
# convert the original timestamps to hours (assuming they are in milliseconds)
sessions_tidy <- sessions |>
    separate(col = start_time, into = c("start_date", "start_time_of_day"), sep = " ") |>
    separate(col = end_time, into = c("end_date", "end_time_of_day"), sep = " ") |> 
    mutate(original_start_time_hour = original_start_time / 3600000,
           original_end_time_hour = original_end_time / 3600000) |>
    select(hashedEmail:end_time_of_day, original_start_time_hour, original_end_time_hour)
#head(sessions_tidy)

#mean table
players_means_tidy <- players_tidy |>
    select(played_hours, Age) |> 
    summarize(played_hours = round(mean(played_hours, na.rm = TRUE), 2), Age = round(mean(Age, na.rm = TRUE), 2)) |>
    pivot_longer(cols = played_hours:Age, names_to = "Variable", values_to = "Mean")
players_means_tidy


Variable,Mean
<chr>,<dbl>
played_hours,10.51
Age,21.3


The players data was cleaned by removing every row with a missing Age value and keeping only players with more than zero played hours. 

This raised the average played hours from 5.85 to 10.51, a difference of 4.66 hours. The filtered mean better describes players who actually played, because the unfiltered mean included many rows with a played_hours value of 0. Those players either did not play during the period the data was collected, or their hours were not recorded. 

The average age changed very little (21.14 to 21.30). Only 2 rows were removed for missing Age, and because the original mean already ignored NA values, the small shift comes from dropping the players with zero played hours. 
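
To see how much each filter contributes, it can help to count the affected rows before filtering. The following is a minimal sketch on a made-up stand-in for players.csv; the values and the `players_demo` name are hypothetical, and only the column names match the real data.

```r
library(dplyr)

# Toy stand-in for players.csv (invented values, for illustration only).
players_demo <- tibble(
  played_hours = c(0, 0, 1.5, 30.3, 0, 0.7),
  Age          = c(17, NA, 21, 9, 22, NA)
)

# Count the rows affected by each filter and the rows that survive both.
filter_summary <- players_demo |>
  summarize(
    n_total       = n(),
    n_missing_age = sum(is.na(Age)),
    n_zero_hours  = sum(played_hours == 0, na.rm = TRUE),
    n_kept        = sum(!is.na(Age) & played_hours > 0)
  )

filter_summary
```

Running the same `summarize()` call on `players` would show how many of the 196 observations are dropped by each condition.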

# NEED TO DO THIS: 
Make a few exploratory visualizations of the data to help you understand it.
Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
Explain any insights you gain from these plots that are relevant to address your question
Have a visual and explain what the trends are, then add another visual: scatter, histogram, bar (multiple visuals that give you a predictive analysis)


In [10]:
# keep male and female players, group by experience and gender, then average the hours
experience <- players_tidy |>
    mutate(gender = tolower(gender)) |>
    filter(gender %in% c("male", "female")) |>
    group_by(experience, gender) |> 
    summarise(avg_played_hours = mean(played_hours), .groups = "drop")
experience


experience,gender,avg_played_hours
<chr>,<chr>,<dbl>
Amateur,female,25.5375000
Amateur,male,6.9280000
Beginner,female,0.7857143
⋮,⋮,⋮
Regular,male,18.4214286
Veteran,female,1.4666667
Veteran,male,0.5235294
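
A group mean can be driven by one or two heavy players, so it may be worth reporting the group size and the median next to the mean. The sketch below uses invented rows (`players_demo` is hypothetical) and only illustrates the idea; it is not the result for the real data.

```r
library(dplyr)

# Toy stand-in for players_tidy (invented values, for illustration only).
players_demo <- tibble(
  experience   = c("Amateur", "Amateur", "Amateur", "Veteran", "Veteran", "Veteran"),
  gender       = c("Female", "Male", "Male", "female", "Male", "Male"),
  played_hours = c(48.4, 0.5, 1.1, 0.3, 0.2, 2.0)
)

group_summary <- players_demo |>
  mutate(gender = tolower(gender)) |>
  group_by(experience, gender) |>
  summarize(
    n_players           = n(),                   # how many players are behind each mean
    avg_played_hours    = mean(played_hours),
    median_played_hours = median(played_hours),  # less sensitive to a few heavy players
    .groups = "drop"
  )

group_summary
```

A large gap between the mean and the median in a group, or a very small `n_players`, suggests that the group average should be interpreted with caution.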


In [None]:
#create bar plot
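
One possible shape for this bar plot is a dodged bar chart of average played hours by experience level and gender. The sketch below is built on a small data frame that copies four rounded rows from the summary table above; `experience_demo` is a hypothetical name, and in the notebook the `experience` table would be used instead.

```r
library(ggplot2)

# Four rows taken (rounded) from the summary table; the remaining rows are omitted here.
experience_demo <- data.frame(
  experience       = c("Amateur", "Amateur", "Veteran", "Veteran"),
  gender           = c("female", "male", "female", "male"),
  avg_played_hours = c(25.54, 6.93, 1.47, 0.52)
)

# Side-by-side bars so the two genders can be compared within each experience level.
experience_bar_plot <- ggplot(experience_demo,
                              aes(x = experience, y = avg_played_hours, fill = gender)) +
  geom_col(position = "dodge") +
  labs(
    x = "Experience level",
    y = "Average played hours (hours)",
    fill = "Gender",
    title = "Average played hours by experience level and gender"
  )

experience_bar_plot
```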

In [None]:
#create histogram? 
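
A histogram of played_hours could show how skewed the distribution is. The sketch below uses invented values (`players_demo` is hypothetical) and a log-scaled x-axis, which is an assumption that only works because players with zero hours were already filtered out.

```r
library(ggplot2)

# Toy stand-in for players_tidy$played_hours (invented values, all above zero).
players_demo <- data.frame(
  played_hours = c(0.1, 0.3, 0.5, 0.7, 1.2, 1.5, 3.8, 30.3, 48.4, 106.1)
)

# A log10 x-axis spreads out the many small values so the long right tail stays visible.
played_hours_histogram <- ggplot(players_demo, aes(x = played_hours)) +
  geom_histogram(bins = 10) +
  scale_x_log10() +
  labs(
    x = "Played hours (hours, log scale)",
    y = "Number of players",
    title = "Distribution of played hours"
  )

played_hours_histogram
```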

# (4) Methods and Plan

### Proposed Method and Why it is Chosen

I propose to use a linear regression model to analyze how a player's characteristics (experience level, gender) relate to the average number of played hours.

This method allows us to quantify the relationship between one continuous dependent variable (played_hours) and categorical or numerical predictors.

EXPLAIN: Why is this method appropriate?
Which assumptions are required, if any, to apply the method selected?
What are the potential limitations or weaknesses of the method selected?
How are you going to compare and select the model?
How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
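
As a starting point for answering these questions, the sketch below shows how the proposed model could be set up with tidymodels. It runs on simulated data, and the 75/25 split, the 5 folds, and every name ending in `_demo` are assumptions for illustration rather than decisions made in this report.

```r
library(tidymodels)

set.seed(2024)  # make the random split and folds reproducible

# Simulated stand-in for players_tidy (invented values, for illustration only).
players_demo <- tibble(
  experience   = factor(sample(c("Amateur", "Beginner", "Pro", "Regular", "Veteran"),
                               120, replace = TRUE)),
  gender       = factor(sample(c("female", "male"), 120, replace = TRUE)),
  played_hours = rexp(120, rate = 0.1)
)

# Split once, before any modelling, so the test set stays untouched.
# ASSUMPTION: 75% training, 25% testing.
players_split <- initial_split(players_demo, prop = 0.75, strata = played_hours)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Linear regression with categorical predictors (dummy variables are created from the formula).
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

lm_workflow <- workflow() |>
  add_formula(played_hours ~ experience + gender) |>
  add_model(lm_spec)

# ASSUMPTION: 5-fold cross-validation on the training set to estimate RMSE.
players_folds <- vfold_cv(players_train, v = 5)
cv_metrics <- lm_workflow |>
  fit_resamples(resamples = players_folds) |>
  collect_metrics()

# Final fit on the training data, evaluated once on the held-out test set.
lm_fit <- fit(lm_workflow, data = players_train)
test_rmse <- lm_fit |>
  predict(players_test) |>
  bind_cols(players_test) |>
  rmse(truth = played_hours, estimate = .pred)

cv_metrics
test_rmse
```

Competing models, for example one with and one without gender, could be compared on their cross-validated RMSE, with the test set used only once for the model that is finally selected.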

# (5) GitHub Repository

https://github.com/viola-t/indiv_planning_report_9.git