# Data Science Project: Planning Stage (Individual)
**Course: DSCI 100-008**  
**Student: Roger Zane (42644237)**


In [None]:
library(tidyverse)

players  <- read_csv("https://raw.githubusercontent.com/rogerzch/dsci100-planning/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/rogerzch/dsci100-planning/main/sessions.csv")


In [None]:
players
sessions

## 1. Data Description

The `players.csv` dataset contains 196 unique players and 7 variables:
- the player’s name (`name`)
- game experience (`experience`)
- subscription status to the newsletter (`subscribe`, TRUE/FALSE)
- the player’s hashed email (`hashedEmail`)
- total hours played (`played_hours`, numeric, mean = 5.85)
- the player’s age in years (`Age`, numeric, mean = 21.14)
- the player’s gender (`gender`)

The `sessions.csv` dataset contains 1,535 play sessions and 5 variables:
- the player’s hashed email (`hashedEmail`)
- login time of the session (`start_time`)
- logout time of the session (`end_time`)
- original record for the session start (`original_start_time`)
- original record for the session end (`original_end_time`)

These data were collected from the PLAICraft Minecraft research server: player characteristics were self-reported, and each login–logout event was automatically logged by the server as a play session.  

There are some data quality issues to consider. First, some variables contain missing values (for example `Age`). Second, many players have `played_hours` equal to zero, which may correspond to people who signed up but barely played.   

These issues could bias estimates of player behaviour and peak usage, and may lead to misleading conclusions in the later prediction stage.
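A minimal sketch of how these two issues could be quantified before modelling. The table `toy_players` and its values are made up for illustration and are not taken from `players.csv`; only the column names `Age` and `played_hours` come from the dataset.

```r
library(tidyverse)

# Hypothetical stand-in for `players` (illustrative values only)
toy_players <- tibble(
  Age          = c(17, NA, 22, 30),
  played_hours = c(0, 0, 1.5, 30.3)
)

# Count missing ages and players with zero recorded play time
quality_summary <- toy_players |>
  summarize(
    n_missing_age   = sum(is.na(Age)),
    n_zero_hours    = sum(played_hours == 0, na.rm = TRUE),
    prop_zero_hours = mean(played_hours == 0, na.rm = TRUE)
  )

quality_summary
```

Running the same `summarize()` call on `players` instead of `toy_players` would report the actual counts for this project.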


## 2. Broad and Specific Questions

I focus on the demand forecasting question: which time windows are most likely to have large numbers of simultaneous players, so that the research team can ensure adequate server capacity and licenses.

My specific question is: Can the hour of day (explanatory variable) be used to predict the number of unique players who start a session in that hour on the server (response variable)?

Specifically, I will use the `sessions.csv` dataset, extract the hour of day from each session's `start_time`, and then count how many distinct `hashedEmail` values start a session in each hour.
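The hour extraction and distinct-player count described above can be sketched on a small made-up table. This version parses the timestamp with `lubridate` rather than slicing the string, and it assumes the timestamps are stored as `dd/mm/yyyy hh:mm` text. The table `toy_sessions` and its values are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Hypothetical stand-in for `sessions` (illustrative values only).
# Assumption: start_time is a "dd/mm/yyyy hh:mm" string.
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b", "c"),
  start_time  = c("30/06/2024 18:12", "30/06/2024 18:40",
                  "01/07/2024 18:05", "01/07/2024 02:30")
)

toy_hourly <- toy_sessions |>
  mutate(hour = hour(dmy_hm(start_time))) |>  # parse the string, keep the hour (0-23)
  distinct(hour, hashedEmail) |>              # one row per player per hour
  count(hour, name = "n_players")             # unique players per hour

toy_hourly
```

Parsing the timestamp yields a numeric `hour`, which is convenient if the hour is later used as a quantitative predictor; player `a` starts two sessions in hour 18 but is counted once.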

## 3. Exploratory Data Analysis and Visualization


In [None]:
# Mean of each quantitative variable, ignoring missing values
players_means <- players |>
  summarize(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_Age = mean(Age, na.rm = TRUE)
  )

players_means

The table above reports the mean values of the two quantitative variables in the `players.csv` dataset.    


In [None]:
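# Count the distinct players who started at least one session in each hour.
# substr() takes characters 12-13 of start_time, which hold the hour
# when the string is formatted as dd/mm/yyyy hh:mm.
sessions_hourly <- sessions |>
  mutate(hour = substr(start_time, 12, 13)) |>
  distinct(hour, hashedEmail) |>
  count(hour, name = "n_players")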

sessions_hourly

The table above summarizes how many unique players started at least one session in each hour of the day, pooled across all dates in the dataset. I used the minimum necessary wrangling steps to put the dataset into a tidy format for my question. The values in the `hour` column represent the start of each one-hour window (for example, `00` is 00:00–00:59). No player started a session in the hour beginning at 11:00, so there is no row for `11`.    
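Because an hour with no session starts produces no row at all, a later plot or model would treat it as missing rather than as a count of zero. One possible way to make the zeros explicit is `tidyr::complete()`. The sketch below uses a made-up table `toy_hourly` with a gap at `02`, and only four hours to keep it short.

```r
library(tidyverse)

# Hypothetical hourly counts with no row for hour "02" (illustrative values only)
toy_hourly <- tibble(
  hour      = c("00", "01", "03"),
  n_players = c(4L, 2L, 5L)
)

# Every hour label that should appear; for a full day this would be 0:23
all_hours <- sprintf("%02d", 0:3)

# Add the absent hours and fill their counts with zero
toy_hourly_full <- toy_hourly |>
  complete(hour = all_hours, fill = list(n_players = 0L))

toy_hourly_full
```

Applied to the real table, this would add a row for `11` with `n_players` equal to 0.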


In [None]:
# geom_col() draws bars whose heights are the values already in n_players
hourly_bar_chart <- ggplot(sessions_hourly, aes(x = hour, y = n_players)) +
  geom_col() +
  labs(
    x = "Start of hour",
    y = "Number of unique players starting a session"
  )

hourly_bar_chart

The bar chart above shows how many unique players started a session in each hour of the day.    


In [None]:
age_hours_plot <- ggplot(players, aes(x = Age, y = played_hours)) +
  geom_point() +
  labs(
    x = "Age (years)",
    y = "Total hours played on the server"
  )

age_hours_plot

This scatter plot shows the relationship between player age and total hours played on the server.

## 4. Methods and Plan