# PlaiCraft DSCI 100 Individual Planning

In this individual planning report, I will analyze and visualize data collected by Frank Wood's Computer Science research group to prepare for the group portion of the project, where we will answer an agreed-upon predictive question. First, I will load the tidyverse, ggplot2, and lubridate R packages for later use.

In [None]:
library(tidyverse)
library(ggplot2)
library(lubridate)
set.seed(9999)

### 1. Inspecting the Datasets and Summaries

Firstly, let's read in the two datasets and inspect what data we have to work with.

In [None]:
players_data <- read_csv("data/players.csv")
players_data

#### Description of the set:

The data in the "players.csv" file provided for PlaiCraft has 7 variables and 196 observations as listed at the top of the data table (196x7). The variables, their data type, and description of what they represent are listed below:
1. `experience`: *chr (character)* - describes the "level" of the player's gaming experience as a Beginner, Amateur, Veteran, or Pro (the most experienced)
2. `subscribe`: *lgl (logical)* - TRUE for "yes" or FALSE for "no", indicating whether or not the player is subscribed to PlaiCraft
3. `hashedEmail`: *chr (character)* - email address of the player, in hashed form as a sequence of letters and numbers
4. `played_hours`: *dbl (decimal)* - how many hours the player played PlaiCraft
5. `name`: *chr (character)* - first name of the player
6. `gender`: *chr (character)* - gender of the player
7. `Age`: *dbl (decimal)* - age of the player in years


One issue is that it is unclear over how many days the data were collected: there is no indication or metadata describing the period over which these hours were recorded. Another issue is the `subscribe` column. I assume it records subscription to the game, that is, who signed up with their email to play PlaiCraft and who is playing as a guest. However, it is unclear whether this is the case or whether the subscription is for something else related to the game.

Before moving on to the next dataset, let's summarize the columns to get a better understanding of the data.

#### Summary #1

In [None]:
experience_count <- players_data |>
                        group_by(experience) |>
                        summarize(count = n())

experience_summary <- experience_count |>
                        ungroup() |>
                        mutate(percent = count/sum(count) * 100) |>
                        mutate(percent = round(percent, 2))
experience_summary

The largest portion of players who contributed to the data are labelled as amateurs in terms of experience, making up 32.14%. The smallest portion is pros, making up 7.14%.

#### Summary #2

In [None]:
played_hours_summary <- players_data |>
                            summarize(mean = mean(played_hours, na.rm = TRUE),
                                      median = median(played_hours, na.rm = TRUE),
                                      sum = sum(played_hours, na.rm = TRUE),
                                      max = max(played_hours, na.rm = TRUE), # ignore missing values, as for the other statistics
                                      min = min(played_hours, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
played_hours_summary

The mean number of hours played is 5.85 hours, the median is 0.1 hours, the sum is 1145.80 hours, the maximum is 223.10 hours, and the minimum is 0 hours.

#### Summary #3

In [None]:
subscribe_count <- players_data |>
                        group_by(subscribe) |>
                        summarize(count = n())

subscribe_summary <- subscribe_count |>
                        ungroup() |>
                        mutate(percent = count/sum(count) * 100) |>
                        mutate(percent = round(percent, 2))
subscribe_summary

There are 144 subscribed players, accounting for 73.47% of the dataset, and 52 non-subscribed players, accounting for 26.53%.

#### Summary #4

In [None]:
gender_count <- players_data |>
                    group_by(gender) |>
                    summarize(count = n())

gender_summary <- gender_count |>
                        ungroup() |>
                        mutate(percent = count/sum(count) * 100) |>
                        mutate(percent = round(percent, 2))
gender_summary

The largest portion of the dataset is male, making up 63.27%, and the smallest is other, making up 0.51%.

#### Summary #5

In [None]:
age_summary <- players_data |>
                    summarize(mean = mean(Age, na.rm = TRUE),
                              median = median(Age, na.rm = TRUE),
                              max = max(Age, na.rm = TRUE),
                              min = min(Age, na.rm = TRUE)) |>
                    mutate(across(mean:min, ~ round(.x, 2)))    
age_summary   

The mean age of PlaiCraft players in the dataset is 21.14 years, the median is 19 years, the oldest age (max) is 58 years, and the youngest age (min) is 9 years.

Now let's read in the next dataset:

In [None]:
sessions_data <- read_csv("data/sessions.csv")
sessions_data

#### Description of the set:
The data in the "sessions.csv" file provided for PlaiCraft has 5 variables and 1535 observations as indicated by the top of the data table (1535x5). The variables, their data type, and description of what they represent are listed below:
1. `hashedEmail`: *chr (character)* - the players' email, but in hashed form
2. `start_time`: *chr (character)* - start time of players' gametime session in "mm/dd/yyyy" and "time (in 24 hour interval)"
3. `end_time`: *chr (character)* - end time of players' gametime session in "mm/dd/yyyy" and "time (in 24 hour interval)"
4. `original_start_time`: *dbl (decimal)* - players' start time in UNIX format (milliseconds)
5. `original_end_time`: *dbl (decimal)* - players' end time in UNIX format (milliseconds)


One issue seen in the data is the formatting of the start_time, end_time, original_start_time, and original_end_time columns. The start_time and end_time cells each hold more than one value (a date and a time), which is not tidy and makes it harder to apply functions to the dataset later. The original_start_time and original_end_time columns are also in an uncommonly used format that is difficult to interpret.

Before we summarize the data, I will wrangle this dataset to become tidy so that we can make the most out of it. Tidy data follows the format of one variable per column, one observation per row, and one value in each cell. The "players.csv" dataset already follows the tidy rules, so we don't have to tidy it, but "sessions.csv" needs tidying. I will split the date from the time in the start_time and end_time columns and convert the times from character to decimal type (hours). I will also add an extra column holding the start_time hour as an integer for later analysis. All of this wrangling up front will make summarizing and creating visualizations later more efficient.
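
As a side note on the UNIX columns described above, here is a minimal sketch of how millisecond values like these could be turned into readable date-times. The two timestamps are made-up values rather than rows from sessions.csv, and treating them as UTC is an assumption, since the dataset does not state a time zone.

```r
library(tidyverse)
library(lubridate)

# Hypothetical UNIX timestamps in milliseconds (not taken from sessions.csv)
toy_times <- tibble(original_start_time = c(1719792000000, 1719795600000))

toy_times_readable <- toy_times |>
  # as_datetime() expects seconds since 1970-01-01, so divide milliseconds by 1000
  mutate(start_datetime = as_datetime(original_start_time / 1000, tz = "UTC"))
toy_times_readable
```

I do not rely on these columns in my analysis, but this kind of conversion could be a way to cross-check the start_time and end_time columns if needed.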

In [None]:
sessions_data_tidy <- sessions_data |>
                        separate(end_time,
                                 into = c("end_date", "end_time"), # own name so it does not clash with the start date column
                                 sep = " ") |>
                        separate(start_time,
                                 into = c("date", "start_time"), # "date" is the date the session started
                                 sep = " ") |>
                        mutate(start_time = as.numeric(hm(start_time))/3600) |> # hm() from lubridate parses "hour:minute" text; as.numeric() gives seconds
                        mutate(end_time = as.numeric(hm(end_time))/3600) |> # we divide by 3600 to get the time in hours
                        mutate(start_time_hr = as.integer(start_time)) # whole hour in which the session started
sessions_data_tidy

#### Summary #6

In [None]:
date_total <- sessions_data_tidy |>
                    group_by(date) |>
                    summarize(count = n())

date_summary <- bind_rows(slice_max(date_total, count), # date(s) with the most sessions
                          slice_min(date_total, count)) # date(s) with the fewest sessions

date_summary

#### Summary #7

In [None]:
start_time_summary <- sessions_data_tidy |>
                            summarize(mean = mean(start_time, na.rm = TRUE),
                                      median = median(start_time, na.rm = TRUE),
                                      max = max(start_time, na.rm = TRUE),
                                      min = min(start_time, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
start_time_summary

The mean start time is 10.69, corresponding to 10:41am and the median start time is 6.53, corresponding to 6:31am. The latest start time (max) is 23.98, corresponding to 11:58pm and the earliest start time (min) is 0, corresponding to 12:00am.
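
The conversions from decimal hours to clock times above can be reproduced with a small helper. This is only a sketch: `decimal_to_clock` is a hypothetical name of my own, and it drops any leftover fraction of a minute rather than rounding.

```r
# Hypothetical helper: convert decimal hours (e.g. 10.69) to "HH:MM" on a 24-hour clock
decimal_to_clock <- function(hours) {
  total_minutes <- as.integer(floor(hours * 60)) # whole minutes since midnight
  sprintf("%02d:%02d", (total_minutes %/% 60L) %% 24L, total_minutes %% 60L)
}

decimal_to_clock(c(10.69, 6.53, 23.98, 0))
# "10:41" "06:31" "23:58" "00:00"
```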

In [None]:
end_time_summary <- sessions_data_tidy |>
                            summarize(mean = mean(end_time, na.rm = TRUE),
                                      median = median(end_time, na.rm = TRUE),
                                      max = max(end_time, na.rm = TRUE),
                                      min = min(end_time, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
end_time_summary

The mean end time is 10.09, corresponding to 10:05am and the median end time is 6.25, corresponding to 6:15am. The latest end time (max) is 23.98, corresponding to 11:58pm and the earliest end time (min) is 0, corresponding to 12:00am.

### 2. The Question
Now that we have taken a look at our data, let's state the question.

The broad question I will be addressing for this planning is question #3: "We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability." 

With this, the specific predictive question I formulated is: 
##### Can the time of day at which PlaiCraft sessions start predict how many players will be online simultaneously, so that we can determine whether we have enough licenses available at any time of the day?

The data provided will help me address the question of interest because I can convert the time values in the "start_time" column of the sessions.csv dataset into a numerical (decimal) data type, group them into intervals of, for example, 15 minutes, and summarize the data to create a column counting how many players were playing in each interval. I will then do the necessary training and validation to create an optimal K-nearest neighbours (KNN) regression model.
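
The interval idea can be sketched on made-up numbers. The tibble `toy_sessions` and its start times are hypothetical, not values from sessions.csv, and the sketch counts sessions that start in each 15-minute interval, which is only a proxy for how many players are online.

```r
library(tidyverse)

# Hypothetical decimal-hour start times (not taken from sessions.csv)
toy_sessions <- tibble(start_time = c(0.10, 0.20, 0.30, 13.50, 13.70, 23.98))

interval_counts <- toy_sessions |>
  # 15 minutes is 0.25 hours, so round each start time down to a multiple of 0.25
  mutate(start_interval = floor(start_time * 4) / 4) |>
  count(start_interval, name = "sessions_started")
interval_counts
```

Here the intervals starting at 0 and 13.5 each receive two sessions, while 0.25 and 23.75 receive one each.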

### 3. Exploratory Data Analysis and Visualization


Next, let's create some exploratory visualizations to understand the data more and possibly seek out any helpful relationships. Visualizations make it much easier to interpret datasets and make large datasets into concise figures.

#### Visualization #1

In [None]:
options(repr.plot.width = 11, repr.plot.height = 9)
played_hours_age_plot <- ggplot(players_data, aes(x = Age, y = played_hours + 1)) + #add +1 so that when we log our y axis, the 0 values won't be infinity
                            geom_point(alpha = 0.5) +
                            labs(x = "Age (Years)", y = "Hours of PlaiCraft Played (Scaled)") +
                            ggtitle("The Effect of Age on Hours Played of PlaiCraft Players") +
                            scale_y_log10() +
                            theme(text = element_text(size = 12))
played_hours_age_plot

In this visualization, there doesn't seem to be a clear overall relationship between the age of the players and the hours of PlaiCraft played. The points are concentrated near the bottom of the graph, with some outliers in the top half. The graph does suggest that teens and young adults, specifically those around 18 - 20 years old, tend to play for more hours. This is seen not only in the outliers, but also near the concentrated area.

#### Visualization #2

In [None]:
options(repr.plot.width = 16, repr.plot.height = 6)
experience_age_histogram <- ggplot(players_data, aes(x = Age)) + # geom_histogram counts the players itself, so no pre-summarizing is needed
    geom_histogram(bins = 20) +
    facet_grid(rows = vars(experience)) + # one panel per experience level
    labs(x = "Age (Years)", y = "Total Players") +
    ggtitle("Distribution of Players' Ages by Gaming Experience") +
    theme(text = element_text(size = 15))

experience_age_histogram

#### Visualization #3

In [None]:
options(repr.plot.width = 9, repr.plot.height = 6)
age_histogram <- ggplot(players_data, aes(x = Age)) +
                    geom_histogram(bins = 20) +
                    labs(x = "Age (Years)", y = "Count") +
                    ggtitle("Distribution of PlaiCraft's Players' Ages") +
                    scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75)) +
                    theme(text = element_text(size = 15))
age_histogram

In this visualization, we can see that the most common age among PlaiCraft players in the dataset is approximately 16 years old. Many of the players' ages are concentrated in the teenage to young adult range of 16 - 25. There are, however, a few players beyond this range at the young child, adult, and senior ages.

#### Visualization #4

In [None]:
options(repr.plot.width = 15, repr.plot.height = 6)
start_time_players_total <- group_by(sessions_data_tidy, start_time_hr) |>
                                summarize(total_players = n())

start_time_players_line <- ggplot(start_time_players_total, aes(x = start_time_hr, y = total_players)) +
                                geom_line(alpha= 0.6) +
                                scale_x_continuous(breaks = seq(0, 24)) +
                                labs(x = "Hour of Day", y = "Total Players Online") +
                                ggtitle("Total Players Online at Each Hour of the Day") +
                                theme(text = element_text(size = 15))
start_time_players_line

This visualization is the most helpful for our specific question. It is only a rough estimate of how many players are online at a given hour, because I truncated the start times to whole hours and counted the sessions starting in each hour. So, it may not be the exact number of players, but it should be relatively close. We can see that the most players are online on PlaiCraft from 2 - 5am. The number of players online is lowest between 8am and 2pm, and increases again from 3pm to the end of the day. I can connect these observations to the age histogram above: I assume that young adults are most likely in class or working during the day. It makes sense that the count in this visualization rises from 3pm through to 5am the next day, because many workers and students who are young adults get their free time late at night.
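
To get closer to simultaneous players, one possible refinement is to count every hour a session touches between its start and end, instead of only the hour in which it starts. The sketch below uses made-up sessions in decimal hours. The helper `hours_touched` is a hypothetical name, and the code assumes that an end hour earlier than the start hour means the session ran past midnight, so sessions lasting close to a full day or longer are not handled.

```r
library(tidyverse)

# Hypothetical sessions in decimal hours (not taken from sessions.csv);
# the last one starts before midnight and ends after it
toy_sessions <- tibble(
  start_time = c(1.5, 2.2, 2.9, 23.5),
  end_time   = c(3.1, 2.8, 4.0, 0.4)
)

# Every whole hour a session touches, wrapping past midnight when needed
hours_touched <- function(start_hr, end_hr) {
  if (end_hr >= start_hr) start_hr:end_hr else c(start_hr:23, 0:end_hr)
}

active_per_hour <- toy_sessions |>
  mutate(hour = map2(as.integer(start_time), as.integer(end_time), hours_touched)) |>
  unnest(hour) |> # one row per session per hour touched
  count(hour, name = "active_sessions")
active_per_hour
```

In this toy data, hour 2 has three active sessions even though only two sessions start in that hour, which is the kind of difference the start-time count can miss.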

#### Visualization #5

The following graphs provide information that is not as important for my specific question, but it is worthwhile to create these visualizations to further understand and explore both datasets as a whole. They may also be needed when I come together with my group.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)
gender_data <- players_data |>
                group_by(gender) |>
                summarize(total_gender = n())
gender_bar <- ggplot(gender_data, aes(x = gender, y = total_gender)) +
                geom_bar(stat = "identity") +
                labs(x = "Gender", y = "Count") +
                ggtitle("PlaiCraft Players by Gender") +
                scale_y_continuous(breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130)) +
                theme(text = element_text(size = 12))
gender_bar

In this visualization, we can see that the majority of PlaiCraft players identify as male, with female coming in second, and non-binary coming in third.

#### Visualization #6

In [None]:
options(repr.plot.width = 10, repr.plot.height = 5)
experience_subscribe_bar <- ggplot(players_data, aes(x = experience, fill = subscribe)) +
                                geom_bar(position = "fill") +
                                labs(x = "Gaming Experience", y = "Ratio of Players", fill = "Are they Subscribed?") +
                                ggtitle("Subscription of Players Based on Gaming Experience") +
                                theme(text= element_text(size = 12))
experience_subscribe_bar

In this visualization, it seems like regular players are the most likely to subscribe to PlaiCraft compared to the other gaming experience levels. However, the proportions of subscribed and non-subscribed players are similar across all gaming experience levels.

### 4. Methods and Plans

The method I would use to address my question of interest is K-nearest neighbours (KNN) regression, a non-linear regression method. I will be working on the sessions.csv dataset as it contains all the variables I need in order to analyze and answer my question. I can use the start_time values and wrangle the dataset to obtain the total number of players online at intervals of the day, similar to the hourly line plot above (Visualization #4). My question depends on the time of sessions and how many players are playing. Thus, the players.csv dataset does not contain any variables that will effectively aid my analysis.

The model chosen is appropriate for two reasons. First, I am looking to predict a numerical value, which a classification model would not be able to do. Second, by basing my regression model on the explanatory variable "time of day" and the response variable "number of players online", I would be able to input any hour of the day and obtain the average number of players online among the *k* nearest observations in the previous data, for the *k* I choose. I can then decide whether the predicted number of players online exceeds the number of licenses I currently have on hand for that time of day, and decide what the best course of action is for the licenses.

Fortunately, the KNN regression model does not require many assumptions compared to linear regression, which assumes that the relationship in the data is linear. It needs so few assumptions because its predictions come directly from the nearest neighbours in the data.

A weakness of the method selected is that KNN regression becomes slow as the training set grows, because each prediction requires computing distances to the training observations. Fortunately, this dataset is not too large so far, so it will not take long to compute an average outcome. However, it can become a problem when I want to add a lot more data to get a more accurate outcome.

To process the data, I will first split the dataset into a training and a testing set using the function *initial_split*, followed by *training* and *testing*. The proportions will be 75% training and 25% testing.
Next, I will use cross-validation on the training set to choose an appropriate *k* value. I will use the *nearest_neighbor* function with the arguments of "rectangular" for weight_func and "tune()" for neighbors. The engine and mode will be "kknn" and "regression" respectively. Furthermore, I will create a recipe using the *recipe* function. My response variable will be the count of players online and the explanatory variable will be the time of day. The data argument will be the training set. I will scale and center all predictors so that the variables are on a comparable scale and none dominates another. This is especially important for a KNN model, which relies on distances between observations.
The cross-validation comes next. It is recommended to use 5 - 10 folds, so I will go with 5. I will use the *vfold_cv* function with the arguments of 5 folds and the count of players online as the strata. I will then create a workflow with the recipe and model I created.
I will then create a tibble that will contain the range of *k* values from 1 to () by (). Additionally, I will use *tune_grid* with my cross-validation set as the resamples argument and my tibble as the grid argument. Then, I will use *collect_metrics*, filter for "rmse", and use *slice_min* on the mean to get the optimal *k*.
Testing comes next. I will repeat the steps to make the model, but replace "tune()" with the optimal *k*. I will create a workflow with the same recipe and the new model, and fit it to the training set. Then, I can use *predict* on the testing data that was split off beforehand. I will use *bind_cols* to attach the prediction column to the testing data and then use *metrics* with the truth argument being the total count of players online and the estimate argument being ".pred".
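
The steps above can be sketched as follows. This is a rough outline under stated assumptions rather than my final analysis: `toy_counts` is simulated stand-in data (one row per hour over ten made-up days), and the grid of *k* values from 1 to 41 by 5 is a placeholder range I picked only so the sketch runs.

```r
library(tidyverse)
library(tidymodels)
set.seed(9999)

# Simulated stand-in for the wrangled data (NOT from sessions.csv)
toy_counts <- tibble(hour = rep(0:23, times = 10)) |>
  mutate(players_online = rpois(n(), lambda = 3 + 2 * cos((hour - 3) * pi / 12)))

# 75% training, 25% testing
counts_split <- initial_split(toy_counts, prop = 0.75, strata = players_online)
counts_train <- training(counts_split)
counts_test <- testing(counts_split)

# Model specification with k left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Recipe: scale and center the predictor
knn_recipe <- recipe(players_online ~ hour, data = counts_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# 5-fold cross-validation over a placeholder grid of k values
counts_vfold <- vfold_cv(counts_train, v = 5, strata = players_online)
k_grid <- tibble(neighbors = seq(from = 1, to = 41, by = 5))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = counts_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# k with the smallest cross-validated RMSE
best_k <- knn_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit with the chosen k and evaluate on the testing set
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_spec) |>
  fit(data = counts_train)

test_metrics <- final_fit |>
  predict(counts_test) |>
  bind_cols(counts_test) |>
  metrics(truth = players_online, estimate = .pred)
test_metrics
```

With the real data, `toy_counts` would be replaced by the wrangled sessions data, and the grid of *k* values would be chosen to suit the size of the training set.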



### References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly. https://r4ds.had.co.nz/.