In [None]:
library(tidyverse)   # loads dplyr, tidyr, readr, ggplot2, and related packages
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)   # print at most 6 rows of any table

# (1) Data Description:

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

## players.csv:
- there are 196 observations
- 7 variables
    - experience: character, the player's experience level (Pro, Veteran, Amateur, etc.)
    - subscribe: logical, whether the player is subscribed to the game (TRUE or FALSE)
    - hashedEmail: character, an anonymized unique player identifier
    - played_hours: double numeric, the total number of hours the player has played
    - name: character, the player's name
    - gender: character, the player's gender (Male, Female, Other, Prefer not to say, etc.)
    - Age: double numeric, the player's age

## sessions.csv:
- there are 1,535 observations
- 5 variables
    - hashedEmail: character, an anonymized unique player identifier
    - start_time: character, the date and time at which the player started their gaming session
    - end_time: character, the date and time at which the player ended their gaming session
    - original_start_time: double numeric, the session's start time as a number (possibly in milliseconds)
    - original_end_time: double numeric, the session's end time as a number (possibly in milliseconds)

## issues + potential issues in data: 
- ### players
    - many players have a played_hours value of 0
    - some observations have missing values (NA)
    - the age, name and gender of each player are most likely self-reported, so players can enter whatever age, name and gender they choose
    - it is unclear whether a player's experience is a ranking assigned by the game or is also self-reported
        - if it is self-reported, it may not be accurate either
    - played_hours may have the same issue if it is self-reported
- ### sessions
    - some sessions have a missing end time (the player may not have ended the session before the data was collected)
    - the original times are incredibly large numbers
        - the values barely change between start and end time, so they do not give accurate session times
    - in some sessions the player may be AFK (idle) while the session keeps running, which makes some of the times inaccurate
    - players can start and end a session on different days
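
The first two players issues above (zero play time and missing values) can be counted directly. The sketch below uses a small made-up table, `players_demo`, as a stand-in for `players`; the values in it are hypothetical and only illustrate the checks.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; every value here is made up.
players_demo <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Amateur", "Regular"),
  played_hours = c(0, 3.5, 0, 12.1, 0.4),
  Age          = c(21, NA, 17, 25, 30)
)

# Number of missing values in each column.
na_counts <- players_demo |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Proportion of players with zero recorded play time.
zero_share <- mean(players_demo$played_hours == 0)

na_counts
zero_share
```

Running the same two steps on `players` would show how many rows a later filter on `Age` and `played_hours` removes.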

In [None]:
# summary statistics
# players: mean played hours and mean age
players_stats_table <- players |>
    select(played_hours, Age) |>
    summarize(played_hours = round(mean(played_hours, na.rm = TRUE), 2),
              Age = round(mean(Age, na.rm = TRUE), 2)) |>
    pivot_longer(cols = played_hours:Age, names_to = "Variable", values_to = "Mean")
players_stats_table

# sessions: mean session length, computed from the original start and end times
# (the result is in the same units as those columns)
sessions_stats_table <- sessions |>
    select(original_start_time, original_end_time) |>
    mutate(session_time = original_end_time - original_start_time) |>
    summarize(session_time = round(mean(session_time, na.rm = TRUE), 2)) |>
    pivot_longer(cols = session_time, names_to = "Variable", values_to = "Mean")
sessions_stats_table
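
Because the original time columns may be too coarse to give reliable session lengths, one alternative is to compute the length from the `start_time` and `end_time` strings instead. The sketch below is only an illustration: `sessions_demo` is a made-up table, and the day/month/year hour:minute format passed to `dmy_hm()` is an assumption that should be checked against the real file.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical rows; the "dd/mm/yyyy hh:mm" format is an assumption.
sessions_demo <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "18/06/2024 00:03", NA)
)

session_lengths <- sessions_demo |>
  mutate(
    start_dt = dmy_hm(start_time),
    end_dt   = dmy_hm(end_time),
    # Session length in minutes; NA when the end time is missing.
    session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

# Mean session length, ignoring sessions without an end time.
mean(session_lengths$session_minutes, na.rm = TRUE)
```

Parsing full date-times also handles sessions that start on one day and end on the next, as in the second made-up row.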

# (2) Questions:

### broad question: 
(2) We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

### specific question: 
Can a player's experience level predict the total number of hours they spend playing the game, based on the players.csv dataset?

This data will help address the question of which "kinds" of players are most likely to contribute a large amount of data, where the amount of data is measured by a player's played hours. Being able to correlate a player's experience level with the amount of time they spend playing can help focus recruiting efforts on players at that experience level.

In players.csv, I will first group the players by experience level using the experience column, then average the played hours within each group.
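
The grouping and averaging step described above could look like the following minimal sketch. It runs on a small made-up table, `players_demo`, rather than the real data, and the column names `n_players` and `mean_hours` are chosen here for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; every value here is made up.
players_demo <- tibble(
  experience   = c("Pro", "Pro", "Amateur", "Amateur", "Veteran"),
  played_hours = c(2.0, 4.0, 10.0, 0.0, 1.5)
)

mean_hours_by_experience <- players_demo |>
  group_by(experience) |>
  summarize(
    n_players  = n(),                                        # players per level
    mean_hours = round(mean(played_hours, na.rm = TRUE), 2)  # mean hours per level
  ) |>
  arrange(desc(mean_hours))

mean_hours_by_experience
```

Keeping the group size next to the mean helps when judging how much weight a group's average deserves, since a level with very few players can have an unstable mean.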

# (3) Exploratory Data Analysis and Visualization

In [None]:
# load the datasets into R
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

# tidy the players data: drop rows with a missing Age and players with zero played hours
players_tidy <- players |>
    filter(!is.na(Age)) |>
    filter(played_hours > 0)
players_tidy

# tidy the sessions data: split each date-time into a date and a time of day,
# and convert the original times to hours (assuming they are in milliseconds)
sessions_tidy <- sessions |>
    separate(col = start_time, into = c("start_date", "start_time_of_day"), sep = " ") |>
    separate(col = end_time, into = c("end_date", "end_time_of_day"), sep = " ") |>
    mutate(original_start_time_hour = original_start_time / 3600000,
           original_end_time_hour = original_end_time / 3600000) |>
    select(hashedEmail:end_time_of_day, original_start_time_hour, original_end_time_hour)
sessions_tidy

# mean played hours and mean age after tidying
players_means_tidy <- players_tidy |>
    select(played_hours, Age) |>
    summarize(played_hours = round(mean(played_hours, na.rm = TRUE), 2),
              Age = round(mean(Age, na.rm = TRUE), 2)) |>
    pivot_longer(cols = played_hours:Age, names_to = "Variable", values_to = "Mean")
players_means_tidy

The players data was tidied by removing all rows with an unknown age and keeping only players with more than zero played hours. This increased the mean played hours by about 5 hours. The tidied mean better reflects the players who actually played, because the untidy mean included many players with 0 hours.

| Variable (chr) | Mean (dbl) |
|---|---|
| played_hours | 5.85 |
| Age | 21.14 |

# (4) Methods and Plan

Plan: create a visualization and explain the trends it shows, then add further visualizations (scatter plot, histogram, bar chart) so that multiple visuals together support a predictive analysis.
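
As a starting point for the bar chart mentioned above, the sketch below plots mean played hours per experience level. It uses a small made-up table, `players_demo`, in place of the tidied players data, and the labels and title are placeholders.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical stand-in for the tidied players data; every value is made up.
players_demo <- tibble(
  experience   = c("Pro", "Pro", "Amateur", "Amateur", "Veteran"),
  played_hours = c(2.0, 4.0, 10.0, 0.5, 1.5)
)

experience_plot <- players_demo |>
  group_by(experience) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  ggplot(aes(x = experience, y = mean_hours)) +
  geom_col() +   # one bar per experience level, bar height = mean hours
  labs(x = "Experience level",
       y = "Mean played hours",
       title = "Mean played hours by experience level")

experience_plot
```

A histogram of `played_hours` (with `geom_histogram()`) or a scatter plot of `Age` against `played_hours` (with `geom_point()`) could follow the same pattern.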

# (5) GitHub Repository