In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
library(dplyr)
library(tidyr)
library(ggplot2)

# (1) Data Description:

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

## players.csv:
- there are 196 observations
- 7 variables
    - experience: character, the player's experience level (Pro, Veteran, Amateur, etc.)
    - subscribe: logical, whether the player is subscribed to the game (TRUE or FALSE)
    - hashedEmail: character, an anonymized unique player identifier
    - played_hours: double (numeric), the total number of hours the player has played
    - name: character, the player's name
    - gender: character, the player's gender (Male, Female, Other, Prefer not to say, etc.)
    - Age: double (numeric), the player's age

## sessions.csv: 
- there are 1,535 observations  
- 5 variables
    - hashedEmail: character, an anonymized unique player identifier
    - start_time: character, the date and time that the player started their gaming session
    - end_time: character, the date and time that the player ended their gaming session
    - original_start_time: double (numeric), the session start time as a raw number (potentially in milliseconds)
    - original_end_time: double (numeric), the session end time as a raw number (potentially in milliseconds)
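
One way to check the "potentially in milliseconds" guess is to treat a raw value as milliseconds since the Unix epoch and convert it to a date-time. The sketch below uses a made-up value (`example_ms` is hypothetical, not taken from sessions.csv) and assumes the Unix-epoch interpretation. If real values convert to dates close to those in start_time, the guess is plausible.

```r
# Hypothetical raw value, assumed to be milliseconds since 1970-01-01 UTC
example_ms <- 1719770000000

# as.POSIXct() expects seconds, so divide the milliseconds by 1000 first
example_datetime <- as.POSIXct(example_ms / 1000, origin = "1970-01-01", tz = "UTC")
format(example_datetime, "%Y-%m-%d %H:%M:%S")
```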

## issues + potential issues in data: 
- ### players
    - many players have a played_hours value of 0
    - some observations have missing values (NA)
    - the age, name and gender of each player are most likely self-reported, so players can enter whatever age, name and gender they like
    - it is unclear whether a player's experience level is a ranking assigned by the game or is also self-reported
        - if it is self-reported, it may not be accurate either
    - the same potential issue applies to played_hours
- ### sessions
    - some sessions have a missing end time (the player may not have ended their session before the data was collected)
    - the original times are incredibly large numbers
        - the values barely change between start and end time, so they are too imprecise to give accurate session times
    - in some sessions, players may be AFK (idle) while the session keeps running, which makes some of the session times inaccurate
    - players could potentially start and end a session on different days
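
The issues listed above can be counted directly. This is a minimal sketch on made-up rows: `toy_players` and `toy_sessions` are hypothetical stand-ins that only mirror a few column names from players.csv and sessions.csv, and the timestamp format in the toy rows is an assumption. The same `summarize()` calls should work on the real `players` and `sessions` tables.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows mirroring a few players.csv columns
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Regular", "Veteran"),
  played_hours = c(0, 3.5, 0, 12.1),
  Age          = c(17, NA, 22, 25)
)

# Hypothetical toy rows mirroring the sessions.csv time columns
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", NA)
)

# Count players with zero played hours and players with a missing age
player_issues <- toy_players |>
  summarize(
    n_players     = n(),
    n_zero_hours  = sum(played_hours == 0, na.rm = TRUE),
    n_missing_age = sum(is.na(Age))
  )
player_issues

# Count sessions that have no recorded end time
session_issues <- toy_sessions |>
  summarize(n_missing_end = sum(is.na(end_time)))
session_issues
```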

In [None]:
#summary statistics
#players (shows the players' average played hours and age)
players_stats_table <- players |>
    select(played_hours, Age) |> 
    summarize(played_hours = round(mean(played_hours, na.rm = TRUE), 2), Age = round(mean(Age, na.rm = TRUE), 2)) |>
    pivot_longer(cols = played_hours:Age, names_to = "Variable", values_to = "Mean")
players_stats_table
#sessions (shows the average session time based on the original start and end times)
sessions_stats_table <- sessions |>
    select(original_start_time, original_end_time) |> 
    mutate(session_time = original_end_time - original_start_time)|>
    summarize(session_time = round(mean(session_time, na.rm = TRUE), 2)) |>
    pivot_longer(cols = session_time, names_to = "Variable", values_to = "Mean")
sessions_stats_table

# (2) Questions:

### broad question: 
(2) We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

### specific question: 
Within each experience level, do male players spend more hours playing the game than female players, based on the players.csv dataset? 

This data will help address the question of which "kinds" of players are most likely to contribute a large amount of data, where the amount of data is measured by a player's played hours. Being able to correlate a player's experience level with the amount of time they spend playing can help focus recruiting efforts on players at that experience level. 

In players.csv, I will first group the players by experience level and then, within each experience level, by gender. I will then average the hours played and the age within each group, to see how gender, age and experience level relate to the average hours played. Groups with higher average played hours are more likely to contribute larger amounts of data. 

# (3) Exploratory Data Analysis and Visualization

In [None]:
#load datasets into R
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

#turn into tidy datasets
players_tidy <- players |>
    filter(!is.na(Age)) |> 
    filter(played_hours > 0)
# head(players_tidy)

sessions_tidy <- sessions |>
    separate(col = start_time, into = c("start_date", "start_time_of_day"), sep = " ") |>
    separate(col = end_time, into = c("end_date", "end_time_of_day"), sep = " ") |> 
    mutate(original_start_time_hour = original_start_time/3600000) |> 
    mutate(original_end_time_hour = original_end_time/3600000) |>
    select(hashedEmail:end_time_of_day, original_start_time_hour, original_end_time_hour)
#head(sessions_tidy)

#mean table
players_means_tidy <- players_tidy |>
    select(played_hours, Age) |> 
    summarize(played_hours = round(mean(played_hours, na.rm = TRUE), 2), Age = round(mean(Age, na.rm = TRUE), 2)) |>
    pivot_longer(cols = played_hours:Age, names_to = "Variable", values_to = "Mean")
players_means_tidy

The players data was tidied by removing the rows with a missing age and keeping only the players with more than zero played hours. 

This increased the average played hours from 5.85 to 10.51, a difference of 4.66 hours. The filtered mean better reflects the players who actually played, because the unfiltered data had many rows storing the value 0 for played_hours. A value of 0 means that the player either did not play at all while the data was being collected, or that their hours were not reported. 

The average age of the players did not change much because only 2 rows were removed for having a missing age. 
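
To illustrate why dropping the zero-hour rows raises the mean, here is a small sketch on made-up numbers. `toy_players` is hypothetical and its values are not from players.csv; only the filtering steps match the ones used above.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: two zero-hour players and one missing age
toy_players <- tibble(
  played_hours = c(0, 0, 2, 10),
  Age          = c(20, 21, NA, 25)
)

# Mean over every row, zeros included
mean_all <- toy_players |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  pull(mean_hours)

# Mean after the same filters used for players_tidy
mean_filtered <- toy_players |>
  filter(!is.na(Age)) |>
  filter(played_hours > 0) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  pull(mean_hours)

# Side-by-side comparison of the two means
tibble(Data = c("all rows", "filtered rows"), Mean = c(mean_all, mean_filtered))
```

Because the zeros pull the unfiltered mean down, the filtered mean can only stay the same or increase when they are removed.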

In [None]:
#isolating diff categories with gender and experience, then averaging the hours and age
experience <- players_tidy |>
    select(experience, gender, played_hours, Age) |>
    mutate(gender = tolower(gender)) |>
    filter(gender %in% c("male", "female")) |>
    group_by(experience, gender) |> 
    # one row per experience/gender group; drop the grouping so experience can be re-leveled
    summarise(avg_played_hours = mean(played_hours), age = mean(Age), .groups = "drop") |>
    mutate(experience = as_factor(experience), 
           experience = fct_relevel(experience, "Beginner", "Amateur", "Regular", "Pro", "Veteran"))
experience

In [None]:
options(repr.plot.width=7, repr.plot.height=7)
# scatter plot
experience_scatter <- experience |>
    ggplot(aes(x = experience, y = age, color = gender)) +
    geom_point(alpha = 0.7, size = 5)+
    labs(title = "Average Age of Players by Experience Level and Gender", x = "Experience Level", y = "Average Age (Years)", color = "Players Gender")
experience_scatter

In [None]:
#histogram
experience_histo <- players_tidy |>
    # use the per-player rows so that each player's age is counted
    mutate(experience = as_factor(experience),
           experience = fct_relevel(experience, "Beginner", "Amateur", "Regular", "Pro", "Veteran")) |>
    ggplot(aes(x=Age, fill=experience))+
    geom_histogram(alpha=0.7, position="identity")+
    facet_grid(rows = vars(experience)) +
    geom_vline(xintercept=21.30, linetype = "dashed") + #the dashed line is the average "tidy" age 
    labs(x="Age (Years)", y="Number of Players", title="Age of Players by Experience Level", fill ="Experience Level")
experience_histo

In [None]:
# bar plot
experience_bar <- experience |>
    ggplot(aes(x=experience, y=avg_played_hours, fill=gender)) +
    geom_bar(stat = "identity", position = "dodge")+
    labs(title = "Average Hours Played by Experience Level and Gender", x = "Experience Level", y = "Average Played Hours (Hours)", fill = "Players Gender")
experience_bar

On average, Regular players spend the most time playing. The graphs also show that the only players with the "Pro" experience level are male, and that their average age is below the overall average of 21.3. Most groups average between 0 and 20 played hours; the groups that surpass an average of 20 hours are females in the Amateur and Regular categories, with the highest average coming from females with the Regular experience level. Most of the players are in their late teens to mid twenties, with the widest variety of ages among the "Regular" players. 

# (4) Methods and Plan

### Proposed Method and Why it is Chosen

The method I would use is a linear regression model to analyze how a player's characteristics (experience level, gender, and age) relate to their played hours, which in turn determine the amount of data they contribute to the research server.  

Linear regression is suitable because it predicts a numerical value from the relationship between a continuous response variable (played_hours) and a set of predictors, here two categorical (experience level and gender) and one numeric (age). 

This model assumes that there are no missing Age values and no players with played_hours of 0 (this is where the tidy data is useful). 
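
A minimal sketch of how such a model could be fit with tidymodels is shown below. The `toy_players` table is made up for illustration and only mirrors the column names in players.csv; the formula and the `"lm"` engine are one reasonable setup, not a final analysis plan. In practice the model would be fit on `players_tidy`, ideally after a train/test split.

```r
library(tidymodels)

# Hypothetical toy data with the same column names as players.csv
toy_players <- tibble(
  experience   = factor(rep(c("Beginner", "Amateur", "Regular"), each = 4)),
  gender       = factor(rep(c("Male", "Female"), times = 6)),
  Age          = c(17, 19, 21, 23, 18, 20, 22, 24, 19, 21, 25, 27),
  played_hours = c(1.2, 0.8, 2.5, 3.1, 4.0, 6.3, 5.1, 7.7, 9.4, 12.0, 10.2, 15.6)
)

# Model specification: ordinary least squares linear regression
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Fit played_hours on the two categorical predictors and the numeric age
lm_fit <- lm_spec |>
  fit(played_hours ~ experience + gender + Age, data = toy_players)

# One row per coefficient; factor levels appear as dummy variables
lm_coefs <- tidy(lm_fit)
lm_coefs
```

The categorical predictors are expanded into dummy variables automatically, so each coefficient for experience or gender is read relative to that variable's baseline level.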

# (5) GitHub Repository

https://github.com/viola-t/indiv_planning_report_9.git