Research Question: We are interested in demand forecasting: which time windows are most likely to have a large number of simultaneous players? We need this to ensure that the number of licenses on hand is large enough to accommodate all concurrent players with high probability.

In [None]:
#Loading the necessary libraries for our data analysis
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
# lubridate is installed with the tidyverse; run install.packages("lubridate") once only if it is missing
library(lubridate)

(1) Data Description

In [None]:
#Loading Data into local files from Github
url_Players <- "https://raw.githubusercontent.com/rmackean/DSCI100_Individual_Project_RobM/refs/heads/main/players.csv"
url_Sessions <- "https://raw.githubusercontent.com/rmackean/DSCI100_Individual_Project_RobM/refs/heads/main/sessions.csv"
destination_Players <- "./Players.csv"
destination_Sessions <- "./Sessions.csv"
download.file(url_Players,destination_Players)
download.file(url_Sessions,destination_Sessions)
#Reading Datasets and Collecting summary statistics
Players <- read_csv("Players.csv")
Sessions <- read_csv("Sessions.csv")
#Summary Stats for Players Dataset


Provide a full descriptive summary of the dataset, including information such as the number of observations, summary statistics (report values to 2 decimal places), number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset(s) will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

Dataset Variables


Players Dataset 
Players: 196 Unique Players
Variables: 5, 2 quantitative, 3 qualitative
Collection Problems: If the identity variables were collected by survey, respondents may have given inaccurate answers.

Age: Player ages range from 9 to 58, with the vast majority around age 19. This is a quantitative variable in our dataset. It may be problematic because the distribution is extremely dense between 17 and 22, with far fewer observations at the youngest and oldest ages. There is also an extreme spike at 17, which may also affect the analysis.

In [None]:
Players_Age_Summary<- summary(Players$Age)
Players_Age_Summary


In [None]:
Age_Plot <- ggplot(Players, aes(x=Age))+geom_histogram(bins = 50)+labs(x="Age of Players in Years",y="Number of Players",title = "Age distribution of Players")+
  theme(plot.title = element_text(size = 20))+
  theme(axis.title = element_text(size = 15)) +
  theme(axis.text = element_text(size = 15))
Age_Plot

Experience: The players had a wide variety of experience with a majority falling in either Amateur or Veteran. This distribution may be problematic to work with as there are very few entries in the Pro section compared to the other 4. Experience in this dataset is represented as a qualitative variable with 5 levels.

In [None]:
Players_Experience_Summary <- table(Players$experience)
Players_Experience_Summary

Gender: As with our other qualitative variable, the large gap in the number of entries between Male and every other option may be problematic for classification or kknn regression. The variable has 7 levels.

In [None]:
Players_gender_Summary<- table(Players$gender)
Players_gender_Summary

Played Hours: A quantitative variable ranging from 0 to 223.1 hours played. Like our other quantitative variable, it is problematic because its distribution is heavily right-skewed: the mean is about 10 times greater than the 3rd quartile. This may cause problems when using the data for regression or classification.

In [None]:
Players_played_hours_Summary<- summary(Players$played_hours)
Players_played_hours_Summary


In [None]:
Played_Hours_Plot <- ggplot(Players, aes(x=played_hours))+geom_dotplot(binwidth = 5)+labs(x="Played Hours",y="Number of Players",title = "Distribution of Players by Hours Played")+
  theme(plot.title = element_text(size = 20))+
  theme(axis.title = element_text(size = 15)) +
  theme(axis.text = element_text(size = 15))
Played_Hours_Plot
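
The effect of this kind of right skew can be illustrated with a small sketch. The values below are made up for illustration (only the maximum of 223.1 comes from the description above), and log1p() is just one possible transformation, not necessarily the one this project will use.

```r
# Made-up played-hours values with a long right tail (NOT the real data)
toy_hours <- c(0, 0, 0.1, 0.3, 0.6, 1.2, 48, 223.1)

# On the raw scale the mean sits far above the median
raw_stats <- c(mean = mean(toy_hours), median = median(toy_hours))
round(raw_stats, 2)

# log1p(x) = log(1 + x) handles the zeros and compresses the long tail
log_hours <- log1p(toy_hours)
log_stats <- c(mean = mean(log_hours), median = median(log_hours))
round(log_stats, 2)
```

After the transformation the mean and median are much closer together, which is one way to check whether a few heavy players dominate the summary statistics.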

Sessions Dataset 
Sessions: 1535 Unique Sessions
Variables: 4 quantitative
Collection Problems: If the identity variables were collected by survey, respondents may have given inaccurate answers.

In [None]:
#Summary Stats for Sessions Dataset
# Parse the day/month/year hour:minute strings into date-times
Sessions <- mutate(Sessions, start_time = dmy_hm(start_time), end_time = dmy_hm(end_time))


start_time: The session start time is stored in our CSV as a character string containing the date and time. Although it is qualitative in its formatting, it represents a quantitative variable. This presents a challenge because we must convert it to a quantitative (date-time) variable before we can use it.

end_time: The session end time, like the start time, is stored as a character string containing the date and time. Although it is qualitative in its formatting, it represents a quantitative variable, so it requires the same conversion.

original_start_time: The original start time variable solves this quantitative vs. qualitative problem by setting a base time and counting up from there. The drawback is that it is not intuitive what the resulting number means.

original_end_time: The original end time variable solves the same quantitative vs. qualitative problem that original start time does. The drawback, as with original start time, is that it is not intuitive what the resulting number means.
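
If the original_start_time and original_end_time values are Unix timestamps in milliseconds (an assumption; nothing in the data confirms the base time or the unit), they could be turned into readable date-times roughly as follows. The example value is made up for illustration.

```r
library(lubridate)

# Hypothetical raw value, ASSUMED to be milliseconds since 1970-01-01 UTC
example_ms <- 1.71977e12

# as_datetime() expects seconds, so divide by 1000 first
example_time <- as_datetime(example_ms / 1000, tz = "UTC")
example_time
```

Comparing the converted values with the parsed start_time and end_time columns would be one way to check whether the assumption holds.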

(2,4) Questions and Methods

Can we use previous start time, end time, and date data to predict the maximum number of active users at any given time?

To answer this question, we first need to wrangle the data to determine how many users were online on a given day at a given time. We then use those counts to identify trends and group the data, for example by checking whether day of the week or month matters. Next, we would regress player count on date and time and produce a confidence interval around the regression. The upper limit of that interval tells us how many licenses to have ready for any given day and time. Because our dataset is limited, we would likely need to assume that demand does not change from month to month and group only by day of the week and time of day. We would also use a kknn regression model rather than a linear one, because small changes in day or time might produce very large spikes or drops in player count (e.g. 4-5 pm, when people get off work, or Thursday vs. Friday, when people have no work to do on Friday night). A linear regression would be too smooth to capture these sharp differences.
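
The wrangling step described above could look something like the sketch below. It uses a made-up toy table rather than sessions.csv, and the names toy_sessions, hour_slots, and hourly_counts are hypothetical. It counts the players who were online at any point during each clock hour, which is an upper bound on the number online at the same instant.

```r
library(tidyverse)
library(lubridate)

# Toy sessions (made-up values, NOT taken from sessions.csv)
toy_sessions <- tibble(
  player     = c("a", "b", "c"),
  start_time = ymd_hm(c("2024-05-03 16:10", "2024-05-03 16:40", "2024-05-03 18:05")),
  end_time   = ymd_hm(c("2024-05-03 17:20", "2024-05-03 16:55", "2024-05-03 18:30"))
)

# Every clock hour between the first start and the last end
hour_slots <- tibble(
  hour_slot = seq(floor_date(min(toy_sessions$start_time), "hour"),
                  floor_date(max(toy_sessions$end_time), "hour"),
                  by = "1 hour")
)

# Pair every hour with every session, then keep the pairs that overlap
hourly_counts <- crossing(hour_slots, toy_sessions) |>
  filter(start_time < hour_slot + hours(1), end_time > hour_slot) |>
  group_by(hour_slot) |>
  summarise(players_online = n_distinct(player), .groups = "drop") |>
  mutate(weekday = wday(hour_slot, label = TRUE),
         hour    = hour(hour_slot))

hourly_counts
```

The weekday and hour columns are the grouping variables a later regression of player count on day of the week and time of day could use.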

(3) Exploratory Data Analysis and Visualization

In this assignment, you will:

Make a few exploratory visualizations of the data to help you understand it.
Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
Explain any insights you gain from these plots that are relevant to address your question


In [None]:
#Sessions Playtime Analysis
# Derive the start/end hour and date of each session, plus its length in minutes
Sessions <- mutate(Sessions,
                   Start_hour = hour(start_time),
                   End_hour = hour(end_time),
                   Start_date = date(start_time),
                   end_date = date(end_time),
                   Play_Time_Min = time_length(start_time %--% end_time, unit = "minute"))
Sessions_Play_Time_Min <- summary(Sessions$Play_Time_Min)
Sessions_Play_Time_Min
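
Session length alone does not show how many sessions overlap, which is what the research question depends on. One possible way to get the exact peak number of simultaneous players is to treat each start as +1 and each end as -1 and take a running total. The sketch below uses made-up toy sessions, and the names toy_sessions, events, and peak_online are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Toy sessions (made-up values, NOT taken from sessions.csv)
toy_sessions <- tibble(
  start_time = ymd_hm(c("2024-05-03 16:10", "2024-05-03 16:40", "2024-05-03 16:50")),
  end_time   = ymd_hm(c("2024-05-03 17:20", "2024-05-03 16:55", "2024-05-03 18:30"))
)

# Every start adds one online player and every end removes one
events <- bind_rows(
  tibble(time = toy_sessions$start_time, change = 1),
  tibble(time = toy_sessions$end_time, change = -1)
) |>
  arrange(time, change) |>  # at equal times, ends (-1) are processed before starts (+1)
  mutate(players_online = cumsum(change))

# The largest running total is the peak number of simultaneous players
peak_online <- max(events$players_online)
peak_online
```

Grouping the events by date before taking the maximum would give a daily peak, which could then be related to day of the week.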


(5) GitHub Repository