# Data Science Project: Planning Stage (Individual)
**Course: DSCI 100-008**  
**Student: Roger Zane (42644237)**


In [None]:
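# Load the tidyverse (readr for read_csv, dplyr for the summaries below).
library(tidyverse)

# Read both datasets directly from the project repository.
players  <- read_csv("https://raw.githubusercontent.com/rogerzch/dsci100-planning/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/rogerzch/dsci100-planning/main/sessions.csv")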


In [None]:
players
sessions

## 1. Data Description

The `players.csv` dataset contains 196 unique players and 7 variables:
- the player’s name (`name`)
- game experience (`experience`)
- subscription status to the newsletter (`subscribe`, TRUE/FALSE)
- the player’s hashed email (`hashedEmail`)
- total hours the player has spent on the server (`played_hours`, numeric)
- the player’s age in years (`Age`, numeric)
- the player’s gender (`gender`)

The `sessions.csv` dataset contains 1,535 play sessions and 5 variables:
- the player’s hashed email (`hashedEmail`)
- login time of the session (`start_time`)
- logout time of the session (`end_time`)
- original record for the session start (`original_start_time`)
- original record for the session end (`original_end_time`)

These data were collected from the PLAICraft Minecraft research server: player characteristics were self-reported, and each login–logout event was automatically logged by the server as a play session.  

There are some data quality issues to consider. First, some variables contain missing values (for example, `Age`). Second, many players have `played_hours` equal to zero, which may correspond to people who signed up but played very little or not at all.

These issues could bias estimates of player behaviour and peak usage, and may lead to misleading conclusions in the later prediction stage.
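
Both issues can be quantified by counting missing values per column and counting players with zero recorded hours. The sketch below runs on a small made-up tibble (`toy_players`, hypothetical rows rather than real records) so that it is self-contained; the same pipeline should apply to `players`, assuming the column names listed above.

```r
library(tidyverse)

# Hypothetical toy rows standing in for players.csv (not real records).
toy_players <- tibble(
  played_hours = c(0, 0, 1.5, 30.3),
  Age          = c(17, NA, 21, 9)
)

# Number of missing values in every column.
missing_counts <- toy_players |>
  summarize(across(everything(), \(x) sum(is.na(x))))

# Count and proportion of players with zero recorded hours.
zero_hours <- toy_players |>
  summarize(
    n_zero    = sum(played_hours == 0, na.rm = TRUE),
    prop_zero = mean(played_hours == 0, na.rm = TRUE)
  )

missing_counts
zero_hours
```

Replacing `toy_players` with `players` would report the actual counts for this dataset.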


## 2. Broad and Specific Questions

I focus on the demand forecasting question: which time windows are most likely to have large numbers of simultaneous players, so that the research team can ensure adequate server capacity and licenses.

My specific question is: Can the hour of day (explanatory variable) be used to predict the average number of concurrent players on the server (response variable) in a given one-hour window?

To answer this, I will use the `sessions.csv` dataset: convert the `start_time` and `end_time` variables to date–time objects, extract the date and hour of day, and count how many distinct `hashedEmail` values are online in each hour of each date. Averaging these hourly counts across dates for each hour of day gives the response variable.
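
The steps above can be sketched as follows. This is a minimal illustration on a made-up tibble (`toy_sessions`), not the final analysis code. It assumes the timestamps are stored as `dd/mm/yyyy HH:MM` strings, which is why `dmy_hm()` is used; if the real columns are in a different format, a different parser is needed. The helper columns `first_hour`, `last_hour`, `n_hours`, `k`, `hour_start`, and `hour_of_day` are names introduced here for illustration.

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy sessions standing in for sessions.csv (not real records).
# Assumption: timestamps are "dd/mm/yyyy HH:MM" strings.
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b", "c"),
  start_time  = c("01/05/2024 18:10", "01/05/2024 21:05",
                  "01/05/2024 18:40", "02/05/2024 18:15"),
  end_time    = c("01/05/2024 19:20", "01/05/2024 21:30",
                  "01/05/2024 18:55", "02/05/2024 18:45")
)

hourly_counts <- toy_sessions |>
  # Parse the strings into date-time objects (UTC by default).
  mutate(start_time = dmy_hm(start_time), end_time = dmy_hm(end_time)) |>
  # Drop sessions with missing or inconsistent timestamps.
  filter(!is.na(start_time), !is.na(end_time), end_time >= start_time) |>
  mutate(
    first_hour = floor_date(start_time, "hour"),
    last_hour  = floor_date(end_time, "hour"),
    # Number of clock hours the session touches.
    n_hours = as.integer(round(as.numeric(
      difftime(last_hour, first_hour, units = "hours")
    ))) + 1L
  ) |>
  # One row per (session, clock hour); k indexes the hour within the session.
  uncount(n_hours, .id = "k") |>
  mutate(
    hour_start  = first_hour + (k - 1) * 3600,
    date        = as_date(hour_start),
    hour_of_day = hour(hour_start)
  ) |>
  # Distinct players online in each hour of each date.
  group_by(date, hour_of_day) |>
  summarize(n_players = n_distinct(hashedEmail), .groups = "drop")

# Average the hourly counts across dates for each hour of day.
avg_by_hour <- hourly_counts |>
  group_by(hour_of_day) |>
  summarize(mean_players = mean(n_players))

avg_by_hour
```

One design caveat: hours in which nobody was online on a given date do not appear in `hourly_counts` at all, so `mean_players` averages only over dates with at least one player in that hour. If zero-player hours should count toward the average, they would need to be filled in first, for example with `tidyr::complete()`.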

## 3. Exploratory Data Analysis and Visualization


In [None]:
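# Mean of played_hours, ignoring missing values.
players_mean_played <- players |>
  summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))

# Mean of Age, ignoring missing values.
players_mean_age <- players |>
  summarize(mean_Age = mean(Age, na.rm = TRUE))

# Combine the two one-row summaries into a single table.
players_means <- players_mean_played |>
  bind_cols(players_mean_age)

players_means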

The table above reports the mean of each quantitative variable in the `players.csv` dataset (`played_hours` and `Age`), with missing values excluded from the calculation.