## Project Planning Stage (Individual)
#### Peter Wojnicki | 78625613 | Group 9 | Section 008

### <u>(1) Data Description:</u>

<center><b>Table 1.</b> Overview of Data with Number of Columns and Observations</center>

| Source File Name | Number of Columns | Total Number of Observations |
|:-:|:-:|:-:|
|players.csv| 7 | 196 |
|sessions.csv| 5 | 1535 | 

#### **Players Dataset**

**experience (chr):** Self-declared experience level: "Amateur", "Beginner", "Regular", "Pro", or "Veteran".

**subscribe (lgl):** TRUE if the player is subscribed, FALSE otherwise.

**hashedEmail (chr):** Hashed Email of player.

**played_hours (dbl):** Number of hours played.

**name (chr):** Name of player.

**gender (chr):** "Female" or "Male" for gender of player.

**Age (dbl):** Age of player in years.

#### Potential Issues:
- experience is self-declared and might be biased
- joining players with sessions might pose issues; for example, a left join from sessions drops players who have no recorded sessions
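
The join concern can be checked before combining the tables. The following is a minimal base R sketch on two hypothetical miniature tables (all values are made up, and the names `players` and `sessions` are illustrative); it assumes `hashedEmail` is the join key, as in the datasets described here.

```r
# Hypothetical miniature versions of the two tables (made-up values)
players <- data.frame(hashedEmail = c("a1", "b2", "c3"),
                      Age = c(17, 21, 25))
sessions <- data.frame(hashedEmail = c("a1", "a1", "c3", "d4"),
                       start_time = c("06/04/2024 09:27", "07/04/2024 21:10",
                                      "08/04/2024 00:15", "09/04/2024 13:40"))

# The key must be unique in players, otherwise the join duplicates session rows
key_is_unique <- anyDuplicated(players$hashedEmail) == 0

# Players with no recorded session (absent from a sessions-first left join)
players_without_sessions <- setdiff(players$hashedEmail, sessions$hashedEmail)

# Sessions whose player is missing (player columns would be NA after the join)
sessions_without_player <- setdiff(sessions$hashedEmail, players$hashedEmail)

# Base R equivalent of a left join from sessions to players
joined <- merge(sessions, players, by = "hashedEmail", all.x = TRUE)
```

If either `setdiff()` result is non-empty on the real data, it would be worth deciding how to treat those rows before computing any statistics on the joined table.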

#### **Sessions Dataset**

**hashedEmail (chr):** Hashed Email of player.

**start_time (chr):** Start time of session in format of DAY/MONTH/YEAR HOUR:MINUTE.

**end_time (chr):** End time of session in format of DAY/MONTH/YEAR HOUR:MINUTE.

**original_start_time (dbl):** Start time recorded in UNIX time (milliseconds).

**original_end_time (dbl):** End time recorded in UNIX time (milliseconds).

#### Potential Issues:
- start_time and end_time are stored as character strings and need to be parsed into date-time objects before use
- original_start_time and original_end_time are in UNIX milliseconds and must be converted before they can be read as dates and times
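
Both issues can be handled with base R date-time functions. The sketch below uses hypothetical example values written in the formats described above (the specific timestamps and the UTC time zone are assumptions, not values taken from sessions.csv).

```r
# Hypothetical example values in the formats described above (not real data)
start_chr <- "30/06/2024 18:12"
end_chr <- "30/06/2024 18:24"
start_ms <- 1.71977e12  # UNIX time in milliseconds

# Parse DAY/MONTH/YEAR HOUR:MINUTE strings into date-time objects
# (the time zone is assumed to be UTC for this illustration)
start_time <- as.POSIXct(start_chr, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_time <- as.POSIXct(end_chr, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Convert UNIX milliseconds to seconds, then to a date-time object
start_from_ms <- as.POSIXct(start_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Elapsed time with explicit units, so the result is always in minutes
elapsed_min <- as.numeric(difftime(end_time, start_time, units = "mins"))
```

Stating `units = "mins"` explicitly matters because subtracting two date-times lets R choose the units automatically, which can change depending on the data.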

### <u>(2) Questions:</u>

I hope to address demand forecasting by using this data to predict the periods of highest demand for licences and servers. Since we have the start and end times of each session, we can identify the days and times at which players start playing the game. Studying these dates and times could offer vital clues about when most players will log in and when resources should be allocated to avoid outages and service disruptions.
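
One way to turn session start times into a demand measure is to count sessions per hour of day or per calendar day. This is a minimal base R sketch on hypothetical start times (the timestamps and the UTC time zone are assumptions made for illustration); counting session starts is only one possible definition of demand.

```r
# Hypothetical session start times in the sessions.csv string format
start_chr <- c("06/04/2024 00:15", "06/04/2024 00:40", "06/04/2024 02:05",
               "07/04/2024 00:30", "07/04/2024 21:10")
starts <- as.POSIXct(start_chr, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Hour of day (0-23) for each session start
start_hour <- as.integer(format(starts, "%H"))

# Sessions started per hour of day; the factor levels keep hours with zero sessions
sessions_per_hour <- table(factor(start_hour, levels = 0:23))

# Sessions started per calendar day
sessions_per_day <- table(format(starts, "%Y-%m-%d"))

# Hour of day with the most session starts
peak_hour <- as.integer(names(which.max(sessions_per_hour)))
```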

### <u>(3) Exploratory Data Analysis and Visualization</u>

<center><b>Table 2.</b> Summary statistics for variables of interest</center>

| Variable Name | Mean | Median | Minimum | Maximum | Number of Observations |
|:-:|:-:|:-:|:-:|:-:|:-:|
|Player Age (years) | 21.14 | 19 | 9 | 58 | 196 |
|Played Hours (hours) | 5.85 | 0.1 | 0 | 223.1 | 196 |
|Elapsed Session Time (minutes) | 50.86 | 30 | 3 | 259 | 1535 |
|Start Date (Month Day)| June 24 | June 24 | April 06 | Sept 26 | 1535 |
|End Date (Month Day)| June 24 | June 23 | April 06 | Sept 26 | 1535 |


#### **Histograms for Quantitative Data**

<p>The following histograms reveal some patterns within the data that might help to shed light on demand forecasting. From the age histogram (Histogram 1.0) we can see that most players are around their early 20s. The mean played hours is 5.85 hours, but the median is only 0.1 hours, so the distribution is strongly right-skewed, with outliers as high as 223.1 hours. Session times have a mean of 50.86 minutes and reach a maximum of 259 minutes. The data spans from April 6 to September 26, with the mean session start date falling around June 24. In the start time histogram (Histogram 1.5), we can see that most sessions start between midnight and 6 am, while barely any start between around 9 am and 3 pm.</p>

<center><div><img src = "https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/refs/heads/main/age_plot.png" alt = "Plot for Distribution of Age" width = "400" height = "400">
<img src = "https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/refs/heads/main/played_hours_plot.png" alt = "Plot for Distribution of Hours Played" width = "400" height = "400"></div>
<div><img src = "https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/refs/heads/main/session_elapsed_plot.png" alt = "Plot for Distribution of Session Time" width = "400" height = "400">
<img src = "https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/refs/heads/main/session_time_plot.png" alt = "Plot for Distribution of Start Date" width = "400" height = "400"></div>
<div><img src = "https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/refs/heads/main/session_time_plot1.png" alt = "Plot for Distribution of End Date" width = "400" height = "400">
<img src = "https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/refs/heads/main/session_time_plot2.png" alt = "Plot for Distribution of Start Time Regardless of Date" width = "400" height = "400"></div></center>

### <u>(4) Methods and Plan</u>

I will apply a k-nearest neighbor (kNN) regression to the joined players and sessions data to predict future demand from past demand. kNN is a non-parametric method, so it requires few assumptions about the shape of the relationship; this suits session start times, which can vary nonlinearly with the time of day, holidays, and weekends. A linear regression might pose an issue since we cannot guarantee a linear relationship. One important assumption remains: that future demand resembles the demand observed in this data. A major weakness is that the data only spans around 5 months, so we might not be able to account for months outside of April to September. kNN can also be slow for larger data sets, and we might have to optimize the model for this data.

To compare and select the model, we will tune the k-value by minimizing RMSE under 5-fold or 10-fold cross-validation. After tidying and wrangling the data, we will split it into 75% for training and 25% for testing. The training data will be used to tune the k-value, and after choosing the optimal k we will calculate the RMSPE on the test data. The biggest issue I foresee is choosing what time period to predict (the hourly, daily, or weekly demand) and wrangling the data to avoid biases or errors.
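
The plan above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final analysis: the hourly demand data is simulated (made up), the helper names `knn_predict` and `rmse` are hypothetical, and kNN is written by hand with a single predictor so the example runs without extra packages. Hour of day is treated as a plain number here, which ignores that 23:00 and 00:00 are adjacent.

```r
set.seed(2024)  # reproducible simulated data

# Simulated (made-up) demand: sessions started in each hour of the day over
# 60 days, with a nonlinear daily pattern plus noise
demand <- data.frame(hour = rep(0:23, times = 60))
demand$sessions <- 10 + 8 * cos(2 * pi * demand$hour / 24) +
  rnorm(nrow(demand), sd = 2)

# 75% / 25% train-test split
train_idx <- sample(nrow(demand), size = floor(0.75 * nrow(demand)))
train <- demand[train_idx, ]
test <- demand[-train_idx, ]

# kNN regression: predict each new point as the mean response of its
# k nearest training points (one predictor, absolute distance)
knn_predict <- function(train_x, train_y, new_x, k) {
  sapply(new_x, function(x0) {
    nearest <- order(abs(train_x - x0))[seq_len(k)]
    mean(train_y[nearest])
  })
}

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 5-fold cross-validation on the training set to choose k
folds <- sample(rep(1:5, length.out = nrow(train)))
k_grid <- c(1, 5, 10, 25, 50)
cv_rmse <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    fit <- train[folds != f, ]
    val <- train[folds == f, ]
    rmse(val$sessions, knn_predict(fit$hour, fit$sessions, val$hour, k))
  }))
})
best_k <- k_grid[which.min(cv_rmse)]

# RMSPE on the held-out test set, using all training data and the chosen k
test_rmspe <- rmse(test$sessions,
                   knn_predict(train$hour, train$sessions, test$hour, best_k))
```

The test set is touched only once, after k has been chosen, so `test_rmspe` estimates the error on unseen data rather than the error on the data used for tuning.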


In [None]:
### Run this cell before continuing.
library(tidyverse)

#### Read the Datasets from URLs

In [None]:
players_data <- read_csv("https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/85fbc982690237a92b10d714a1b540644c562325/data/players.csv")
sessions_data <- read_csv("https://raw.githubusercontent.com/wojpc/wojpc-dsci100-project-008-09/85fbc982690237a92b10d714a1b540644c562325/data/sessions.csv")

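#### Number of Rows and Columns and Preview Data

In [None]:
# Number of observations (rows) in each dataset
nrow(players_data)
nrow(sessions_data)

# Number of columns in each dataset
ncol(players_data)
ncol(sessions_data)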

head(players_data)
head(sessions_data)

#### Check for Duplicated Names or Emails in Players Data

In [None]:
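# Flag repeated names and hashed emails, then count them
dups_name <- duplicated(players_data$name)
dups_hashed <- duplicated(players_data$hashedEmail)
c(names = sum(dups_name), hashed_emails = sum(dups_hashed))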

#### Summary Statistics for Quantitative Variables in Players Dataset

In [None]:
players_played_mean <- mean(players_data$played_hours, na.rm = TRUE)
players_played_min <- min(players_data$played_hours, na.rm = TRUE)
players_played_max <- max(players_data$played_hours, na.rm = TRUE)
players_played_med <- median(players_data$played_hours, na.rm = TRUE)

players_age_mean <- mean(players_data$Age, na.rm = TRUE)
players_age_min <- min(players_data$Age, na.rm = TRUE)
players_age_max <- max(players_data$Age, na.rm = TRUE)
players_age_med <- median(players_data$Age, na.rm = TRUE)

players_played_mean
players_played_med
players_played_min
players_played_max

players_age_mean
players_age_med
players_age_min
players_age_max

#### Grouping Qualitative Variables

In [None]:
# Aggregate by experience level
players_exp <- players_data |>
    group_by(experience) |>
    summarize(total_exp = n())

# Aggregate by subscription status
players_subbed <- players_data |>
    group_by(subscribe) |>
    summarize(total_subbed = n())

# Aggregate by gender
players_gender <- players_data |>
    group_by(gender) |>
    summarize(total_gender = n())

#### Left Join Data on Hashed Email and Wrangle Time
A new column called session_time_elapsed is added to show the elapsed time of each session in minutes, and start_time and end_time are converted from strings to date-time objects. The sessions data is now combined with the player information. original_start_time and original_end_time duplicate the information in start_time and end_time, so I removed them to make the table tidier and less redundant; hashedEmail is also dropped after the join.

In [None]:
sessions_players_joined <- sessions_data |>
  left_join(players_data, by = "hashedEmail")

sessions_players_elapsed <- sessions_players_joined |>
    mutate(end_time = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M"),
           start_time =  as.POSIXct(start_time, format = "%d/%m/%Y %H:%M")) |>
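    # Elapsed session time in minutes (explicit units avoid difftime's automatic unit choice)
    mutate(session_time_elapsed = as.numeric(difftime(end_time, start_time, units = "mins"))) |>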
    select(-hashedEmail, -original_start_time, -original_end_time)

head(sessions_players_elapsed)
nrow(sessions_players_elapsed)

#### Summary Statistics for Elapsed Session Time and Session Dates in Joined Dataset

In [None]:
sessions_players_elapsed_mean <- mean(sessions_players_elapsed$session_time_elapsed, na.rm = TRUE)
sessions_players_elapsed_min <- min(sessions_players_elapsed$session_time_elapsed, na.rm = TRUE)
sessions_players_elapsed_max <- max(sessions_players_elapsed$session_time_elapsed, na.rm = TRUE)
sessions_players_elapsed_med <- median(sessions_players_elapsed$session_time_elapsed, na.rm = TRUE)

sessions_players_sdate_mean <- mean(sessions_players_elapsed$start_time, na.rm = TRUE)
sessions_players_sdate_min <- min(sessions_players_elapsed$start_time, na.rm = TRUE)
sessions_players_sdate_max <- max(sessions_players_elapsed$start_time, na.rm = TRUE)
sessions_players_sdate_med <- median(sessions_players_elapsed$start_time, na.rm = TRUE)

sessions_players_edate_mean <- mean(sessions_players_elapsed$end_time, na.rm = TRUE)
sessions_players_edate_min <- min(sessions_players_elapsed$end_time, na.rm = TRUE)
sessions_players_edate_max <- max(sessions_players_elapsed$end_time, na.rm = TRUE)
sessions_players_edate_med <- median(sessions_players_elapsed$end_time, na.rm = TRUE)

sessions_players_elapsed_mean
sessions_players_elapsed_med 
sessions_players_elapsed_min 
sessions_players_elapsed_max

sessions_players_sdate_mean 
sessions_players_sdate_med
sessions_players_sdate_min 
sessions_players_sdate_max 

sessions_players_edate_mean 
sessions_players_edate_med
sessions_players_edate_min 
sessions_players_edate_max 

#### Histograms for Quantitative Data

In [None]:
options(repr.plot.width = 9, repr.plot.height = 9)

age_plot <- players_data |> 
    ggplot(aes(x = Age)) +
    geom_histogram() +
    labs(x = "Age (years)",
        y = "Count",
        title = "Histogram 1.0. Distribution of Age of Players") +
    scale_x_continuous(n.breaks = 20) +
    theme_bw() + 
    theme(text = element_text(size = 15),
         plot.title = element_text(hjust = 0.5))
    
played_hours_plot <- players_data |>
    ggplot(aes(x = played_hours)) +
    geom_histogram(binwidth = 10) +
    labs(x = "Hours Played (hours)",
        y = "Count",
        title = "Histogram 1.1. Distribution of Hours Played Per Player") +
    scale_x_continuous(n.breaks = 10) +
    theme_bw() +
    theme(text = element_text(size = 15),
         plot.title = element_text(hjust = 0.5))
    
session_elapsed_plot <- sessions_players_elapsed |> 
 ggplot(aes(x =  session_time_elapsed)) +
    geom_histogram(binwidth = 10) +
    labs(x = "Session Time (minutes)",
        y = "Count",
        title = "Histogram 1.2. Distribution of Session Time in Minutes") +
    scale_x_continuous(n.breaks = 20) +
    theme_bw() +
    theme(text = element_text(size = 15),
         plot.title = element_text(hjust = 0.5))

# Axis limits for the start date histogram (ignore any missing times)
start_date <- min(sessions_players_elapsed$start_time, na.rm = TRUE)
end_date <- max(sessions_players_elapsed$start_time, na.rm = TRUE)
session_time_plot <- sessions_players_elapsed |> 
    ggplot(aes(x = start_time)) +
    geom_histogram() +
    labs(x = "Start Date (Month Day)",
        y = "Count",
        title = "Histogram 1.3. Distribution of Starting Date for Sessions") +
    scale_x_datetime(date_labels = "%b %d",
                    date_breaks = "14 days",
                    limits =  as.POSIXct(c(start_date, end_date))) +
    theme_bw() +
    theme(text = element_text(size = 15),
         plot.title = element_text(hjust = 0.5))

# Axis limits taken from end_time itself (ignore any missing times)
start_date1 <- min(sessions_players_elapsed$end_time, na.rm = TRUE)
end_date1 <- max(sessions_players_elapsed$end_time, na.rm = TRUE)
session_time_plot1 <- sessions_players_elapsed |> 
    ggplot(aes(x = end_time)) +
    geom_histogram() +
    labs(x = "End Date (Month Day)",
        y = "Count",
        title = "Histogram 1.4. Distribution of Ending Date for Sessions") +
    scale_x_datetime(date_labels = "%b %d",
                    date_breaks = "14 days",
                    limits =  as.POSIXct(c(start_date1, end_date1))) +
    theme_bw() +
    theme(text = element_text(size = 15),
         plot.title = element_text(hjust = 0.5))

# Isolate the times when playing starts and ignore dates
starting_times <- sessions_players_elapsed |>
    mutate(start_times = as.POSIXct(format(start_time, "%H:%M"),
      format = "%H:%M", tz = "UTC"))

session_time_plot2 <- starting_times |> 
    ggplot(aes(x = start_times)) +
    geom_histogram() +
    labs(x = "Session Start Time (HH:MM)",
        y = "Count",
        title = "Histogram 1.5. Distribution of Starting Times for Sessions") +
    scale_x_datetime(date_labels = "%H:%M",
                    date_breaks = "1 hour") +
    theme_bw() +
    theme(text = element_text(size = 15),
         plot.title = element_text(hjust = 0.5),
         axis.text.x = element_text(angle = -45,
                                  vjust = 0.1))



ggsave("age_plot.png", plot = age_plot, width = 9, height = 9, dpi = 300)
ggsave("played_hours_plot.png", plot = played_hours_plot, width = 9, height = 9, dpi = 300)
ggsave("session_elapsed_plot.png", plot = session_elapsed_plot, width = 9, height = 9, dpi = 300)
ggsave("session_time_plot.png", plot = session_time_plot, width = 9, height = 9, dpi = 300)
ggsave("session_time_plot1.png", plot = session_time_plot1, width = 9, height = 9, dpi = 300)
ggsave("session_time_plot2.png", plot = session_time_plot2, width = 9, height = 9, dpi = 300)

age_plot
played_hours_plot
session_elapsed_plot
session_time_plot
session_time_plot1
session_time_plot2
