# PlaiCraft DSCI 100 Individual Planning

In this individual planning stage, I will analyze and visualize data submitted by Frank Wood's Computer Science research group to prepare for the group portion of the project, where we will answer a predictive question. First, I will load the tidyverse, ggplot2, and lubridate R packages for later use.

In [None]:
library(tidyverse)
library(ggplot2)
library(lubridate)

### 1. Inspecting the Datasets

Let's read in our two datasets and inspect what data we have to work with.

In [None]:
players_data <- read_csv("data/players.csv")
players_data

#### Description of the set:

The data in the "players.csv" file provided for PlaiCraft has 7 variables and 196 observations as listed at the top of the data table (196x7). The variables, their data type, and description of what they represent are listed below:
1. `experience`: *chr (character)* - determines the "level" of the players' gaming experience as a Beginner, Amateur, Veteran, or Pro (the most experienced)
2. `subscribe`: *lgl (logical)* - indicated with TRUE for "yes" or FALSE for "no", determining whether or not the player is subscribed to PlaiCraft
3. `hashedemail`: *chr (character)* - email address of player, but in hashed form with a sequence of letters and numbers
4. `played_hours`: *dbl (decimal)* - how many hours the player played PlaiCraft for (assuming over a few days)
5. `name`: *chr (character)* - first name of player
6. `gender`: *chr (character)* - gender of player
7. `Age`: *dbl (decimal)* - age of player in years


One issue is that it is unclear over how many days the data were collected. There is no indication, and no metadata, of the time period that the `played_hours` values cover. Another issue is the `subscribe` column. I assume it records subscription to the game itself, that is, who signed up with their email to play PlaiCraft versus who is playing as a guest. However, it is unclear whether this is the case or whether the subscription is for something else related to the game.

Now let's read in the next dataset:

In [None]:
sessions_data <- read_csv("data/sessions.csv")
sessions_data

#### Description of the set:
The data in the "sessions.csv" file provided for PlaiCraft has 5 variables and 1535 observations as indicated by the top of the data table (1535x5). The variables, their data type, and description of what they represent are listed below: 
1. `hashedemail`: *chr (character)* - the players' email, but in hashed form
2. `start_time`: *chr (character)* - start time of players' gametime in "mm/dd/yyyy" and "time (in 24 hour interval)"
3. `end_time`: *chr (character)* - end time of players' gametime in "mm/dd/yyyy" and "time (in 24 hour interval)"
4. `original_start_time`: *dbl (decimal)* - players' start time in numerical form
5. `original_end_time`: *dbl (decimal)* - players' end time in numerical form


One issue is that it is difficult at first to determine what `original_start_time` and `original_end_time` mean. The values are very large, and no extra information is given about what they specify or how they connect to `start_time` and `end_time`. In addition, the `start_time` and `end_time` columns are stored as character strings that combine a date and a time, a format that we won't be able to use effectively for analysis. Thus, we will have to perform some wrangling.
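One plausible reading of the large values, which is an assumption and not something the dataset confirms, is that they are Unix timestamps in milliseconds. The sketch below shows how such a value could be converted to a date-time with lubridate. The number used is an illustrative placeholder, not a row from "sessions.csv".

```r
library(lubridate)

# ASSUMPTION: the value counts milliseconds since 1970-01-01 00:00:00 UTC.
# This number is a made-up placeholder, not a value copied from the data.
example_ms <- 1.71977e12

# as_datetime() expects seconds, so divide by 1000 first.
example_datetime <- as_datetime(example_ms / 1000, tz = "UTC")
example_datetime
```

If the converted values line up with the dates in `start_time` and `end_time`, that would support this interpretation. If they do not, the columns need a different explanation.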

### 2. The Question
Now that we have taken a look at our data, let's state the question.

The broad question I will be addressing for this project is question #3: We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability. 

With this, my specific predictive question will be: 
##### Can players' start times predict how many simultaneous players will be online, so that we can determine whether we have enough licenses available at a given hour of the day?
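To make "simultaneous players" concrete, here is a minimal sketch of one way to count overlapping sessions at fixed check times. The sessions in `toy_sessions` are hypothetical, and the rule "online if start <= time < end" is my own working definition, not one given with the data.

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy sessions (not taken from sessions.csv).
toy_sessions <- tibble(
  start = ymd_hm(c("2024-06-30 18:05", "2024-06-30 18:20", "2024-06-30 19:10")),
  end   = ymd_hm(c("2024-06-30 19:30", "2024-06-30 19:15", "2024-06-30 20:00"))
)

# Times at which we check how many players are online (every 30 minutes).
check_times <- seq(ymd_hm("2024-06-30 18:00"), ymd_hm("2024-06-30 20:00"), by = "30 min")

# A session counts as online at time t if start <= t < end.
count_online <- function(t, starts, ends) {
  sum(starts <= t & t < ends)
}

simultaneous <- tibble(
  time = check_times,
  players_online = vapply(
    seq_along(check_times),
    function(i) count_online(check_times[i], toy_sessions$start, toy_sessions$end),
    integer(1)
  )
)
simultaneous
```

The maximum of `players_online` over a time window is the number of licenses that window would have needed.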

### 3. Exploratory Data Analysis and Visualization
It was demonstrated above that the data can be loaded into R, so now we will tidy up the data. Tidy data follows the format of one variable per column, one observation per row, and one value in each cell. The "players.csv" follows these rules, so let's tidy up the "sessions.csv".

In [None]:
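# Split each timestamp into a date part and a clock-time part. The date columns
# get distinct names so the second separate() does not clash with the first.
sessions_data_tidy <- sessions_data |>
                        separate(end_time,
                                 into = c("end_date", "end_time"),
                                 sep = " ") |>
                        separate(start_time,
                                 into = c("start_date", "start_time"),
                                 sep = " ") |>
                        # Convert "HH:MM" to decimal hours since midnight
                        mutate(start_time = as.numeric(hm(start_time)) / 3600,
                               end_time = as.numeric(hm(end_time)) / 3600,
                               # Truncate to the whole hour of the day (0-23)
                               start_time_hr = as.integer(start_time),
                               end_time_hr = as.integer(end_time))
sessions_data_tidy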
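As an alternative sketch, the full date-time strings could be parsed in one step, which keeps the date attached to the time and handles sessions that run past midnight. This assumes the strings follow the "mm/dd/yyyy HH:MM" layout described above; if the dates turn out to be day-first, `dmy_hm()` would be the counterpart. The rows in `toy_sessions` are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Hypothetical rows in the assumed "mm/dd/yyyy HH:MM" layout.
toy_sessions <- tibble(
  start_time = c("06/30/2024 18:12", "06/30/2024 23:33"),
  end_time   = c("06/30/2024 18:24", "07/01/2024 00:05")
)

toy_parsed <- toy_sessions |>
  mutate(
    start_dt = mdy_hm(start_time),   # parse month/day/year hour:minute
    end_dt = mdy_hm(end_time),
    start_time_hr = hour(start_dt),  # whole hour of the day (0-23)
    # Session length in minutes, correct even across midnight
    session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )
toy_parsed
```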

Now let's compute the mean for each quantitative variable in the "players.csv" data set.

In [None]:
players_data_tidy <- players_data |>
                        select(played_hours, Age) |>
                        summarize(across(played_hours:Age, ~ mean(.x, na.rm = TRUE)))
players_data_tidy                        

Next, let's create some exploratory visualizations.

#### Visualization #1

In [None]:
options(repr.plot.width = 9, repr.plot.height = 6)
played_hours_age_plot <- ggplot(players_data, aes(x = Age, y = played_hours, colour = gender)) +
                            geom_point() +
                            labs(x = "Age (Years)", y = "Hours of PlaiCraft Played", colour = "Gender") +
                            ggtitle("The Effect of Age on Hours Played of PlaiCraft Players") +
                            theme(text = element_text(size = 12))
played_hours_age_plot

#### Visualization #2

In [None]:
options(repr.plot.width = 9, repr.plot.height = 6)
age_histogram <- ggplot(players_data, aes(x = Age)) +
                    geom_histogram() +
                    labs(x = "Age (Years)", y = "Count") +
                    ggtitle("Distribution of PlaiCraft's Players' Ages") +
                    scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75)) +
                    theme(text = element_text(size = 15))
age_histogram

In this visualization, we can see that the most common age among PlaiCraft players in the dataset is approximately 16 years old. Most players' ages are concentrated in the teenage to young adult range of 15 to 25. There are, however, a few players beyond this range at adult and senior ages.

#### Visualization #3

In [None]:
distinct(players_data, gender)

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)
gender_data <- players_data |>
                group_by(gender) |>
                summarize(total_gender = n())
gender_bar <- ggplot(gender_data, aes(x = gender, y = total_gender)) +
                geom_bar(stat = "identity") +
                labs(x = "Gender", y = "Count") +
                ggtitle("PlaiCraft Players by Gender") +
                scale_y_continuous(breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130)) +
                theme(text = element_text(size = 12))
gender_bar

In this visualization, we can see that the majority of PlaiCraft players identify as male, with female coming in second, and non-binary coming in third.

#### Visualization #4

In [None]:
options(repr.plot.width = 10, repr.plot.height = 5)
experience_subscribe_bar <- ggplot(players_data, aes(x = experience, fill = subscribe)) +
                                geom_bar(position = "fill") +
                                labs(x = "Gaming Experience", y = "Ratio of Players", fill = "Are they Subscribed?") +
                                ggtitle("Subscription of Players Based on Gaming Experience") +
                                theme(text= element_text(size = 12))
experience_subscribe_bar

In this visualization, it seems that players with "Regular" gaming experience are the most likely to subscribe to PlaiCraft compared to the other experience levels. However, the proportions of subscribed and unsubscribed players are similar across all experience levels.
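The proportions shown in the bar chart can also be computed as numbers, which makes small differences between experience levels easier to compare. The sketch below uses a hypothetical `toy_players` tibble with the same two column names as "players.csv"; the values are made up for illustration.

```r
library(tidyverse)

# Hypothetical players (not taken from players.csv).
toy_players <- tibble(
  experience = c("Pro", "Pro", "Regular", "Regular", "Regular", "Beginner"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# mean() of a logical column gives the share of TRUE values.
subscribe_props <- toy_players |>
  group_by(experience) |>
  summarize(n_players = n(),
            prop_subscribed = mean(subscribe))
subscribe_props
```

Including `n_players` alongside the proportion helps flag levels where the ratio rests on only a handful of players.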

#### Visualization #5

In [None]:
options(repr.plot.width = 15, repr.plot.height = 6)
start_time_players_total <- group_by(sessions_data_tidy, start_time_hr) |>
                                summarize(total_players = n())

start_time_players_line <- ggplot(start_time_players_total, aes(x = start_time_hr, y = total_players)) +
                                geom_line(alpha= 0.6) +
                                labs(x = "Hour of Day", y = "Total Players Starting") +
                                scale_x_continuous(breaks = seq(0, 24, by = 5))
start_time_players_line

### 4. Methods and Plans

The method I would use to address my question of interest is a predictive method, specifically regression. This method is appropriate because if I had a certain number of licenses and wanted to know at what time of day I should release them so that all of them are used, a predictive method would be able to do just that. By basing my model on the explanatory variables of number of players online and time of day, I will be able to 