# Data Science Project: Planning Stage (Individual)
**Course: DSCI 100-008**  
**Student: Roger Zane (42644237)**


In [None]:
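# Load the tidyverse for data reading and wrangling
library(tidyverse)

# Read both datasets directly from the project repository
players  <- read_csv("https://raw.githubusercontent.com/rogerzch/dsci100-planning/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/rogerzch/dsci100-planning/main/sessions.csv")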


In [None]:
players
sessions

## Data Description

The `players.csv` dataset contains 196 unique players and 7 variables:
- the player’s name (`name`)
- game experience (`experience`)
- subscription status to the newsletter (`subscribe`, TRUE/FALSE)
- the player’s hashed email (`hashedEmail`)
- total hours the player has spent on the server (`played_hours`, numeric)
- the player’s age in years (`Age`, numeric)
- the player’s gender (`gender`)

The `sessions.csv` dataset contains 1,535 play sessions and 5 variables:
- the player’s hashed email (`hashedEmail`)
- login time of the session (`start_time`)
- logout time of the session (`end_time`)
- original record for the session start (`original_start_time`)
- original record for the session end (`original_end_time`)

These data were collected from the PLAICraft Minecraft research server: player characteristics were self-reported, and each login–logout event was automatically logged by the server as a play session.  

There are some data quality issues to consider. First, some variables contain missing values (for example `Age`). Second, many players have `played_hours` equal to zero, which may correspond to people who signed up but never or barely played.

These issues could bias estimates of player behaviour and peak usage, and may lead to misleading conclusions in the later prediction stage.
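
The two issues above can be quantified before any modelling. The sketch below shows one way to count missing values per column and the share of players with zero recorded hours. It runs on a small made-up tibble (`toy_players` is hypothetical and its values are invented for illustration), so the numbers it prints say nothing about the real data; the same two `summarize` calls could be applied to `players` instead.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv; values are invented for illustration
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Veteran", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0, 0, 1.2),
  Age          = c(9, NA, 21, 17)
)

# Number of missing values in each column
missing_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Number and proportion of players with zero recorded play time
zero_hours <- toy_players |>
  summarize(
    n_zero    = sum(played_hours == 0),
    prop_zero = mean(played_hours == 0)
  )

missing_counts
zero_hours
```
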


## Broad and Specific Questions

I focus on the demand forecasting question: which time windows are most likely to have large numbers of simultaneous players, so that the research team can ensure adequate server capacity and licenses.

My specific question is: Can time-of-day information, such as the hour of day and whether the time falls on a weekday or a weekend, be used to predict the average number of concurrent players on the server in a given one-hour window?

To address this question, I will use the `sessions.csv` data. First, I will convert the `start_time` variable to a date-time object and use `mutate` to extract the calendar date and the hour of day (0–23) for each session, along with a weekday/weekend indicator derived from the date. Then, following the group-by and summarize patterns we used in class, I will `group_by` date and hour and `summarize` the number of unique `hashedEmail` values to obtain the number of players who started a session in each hour. Because a session is counted only in the hour in which it starts, this count is a proxy for the number of concurrent players and will undercount hours that long sessions span. This will give an hourly-level dataset with one row per date-hour combination that has at least one session, and a numerical response variable equal to the number of distinct players in that hour. In later stages of the project, this hourly-level dataset can be used to fit a predictive model, such as a k-nearest neighbours regression, with the time-of-day variables as predictors.
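
The wrangling plan above can be sketched as follows. This is a minimal illustration on a made-up tibble (`toy_sessions` is hypothetical), and it assumes that `start_time` is stored as text in day/month/year hour:minute form, which is why `dmy_hm` is used; if the real column uses a different format, a different lubridate parser would be needed. The names `start_dt`, `is_weekend`, and `n_players` are my own choices. Note that the count is the number of distinct players who started a session in each hour, so it only approximates concurrency.

```r
library(tidyverse)
library(lubridate)

# Hypothetical stand-in for sessions.csv; the timestamp format is an assumption
toy_sessions <- tibble(
  hashedEmail = c("a", "b", "a", "c", "b"),
  start_time  = c("29/06/2024 18:12", "29/06/2024 18:40",
                  "29/06/2024 18:55", "29/06/2024 21:05",
                  "01/07/2024 09:30")
)

hourly <- toy_sessions |>
  mutate(
    start_dt   = dmy_hm(start_time),              # parse text into a date-time
    date       = as_date(start_dt),               # calendar date of the session
    hour       = hour(start_dt),                  # hour of day, 0 to 23
    is_weekend = wday(date, week_start = 1) >= 6  # Saturday = 6, Sunday = 7
  ) |>
  group_by(date, hour, is_weekend) |>
  summarize(n_players = n_distinct(hashedEmail), .groups = "drop")

hourly
```

Grouping by `is_weekend` alongside `date` and `hour` does not change the number of groups, since the indicator is determined by the date; it simply carries the predictor through to the summarized table.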

