# Predicting Player Login Patterns: Forecasting Hourly Login Counts on a Minecraft Research Server

## Introduction

For any online game server, it’s useful to know when players are most likely to log in. This helps with planning maintenance, managing server load, and understanding general player activity patterns. In many games, logins tend to follow clear time-of-day or day-of-week patterns, so it’s reasonable to ask whether we can predict player activity ahead of time.

In this project, we focus on the question:

**Can we predict how many unique players will log into the server during a given hour?**

To explore this, we use an anonymized play session log from a Minecraft research server hosted by the University of British Columbia. Each row in the dataset represents one gameplay session, including when a player logged in and when they logged out. The session log file (`sessions.csv`) contains 1,535 rows and the following variables:

- **hashedEmail** — anonymized player ID  
- **start_time** — session start timestamp (DD/MM/YYYY HH:MM)  
- **end_time** — session end timestamp (DD/MM/YYYY HH:MM)  
- **original_start_time** — raw start timestamp before adjustment  
- **original_end_time** — raw end timestamp before adjustment  

Using these timestamps, we transform the data into an hourly-level dataset. We derive the first four variables below from each session's start time, then aggregate by hour to obtain the fifth:

- **date** — calendar date of the login  
- **hour** — hour of day (0–23)  
- **weekday** — day of week (Mon–Sun)  
- **is_weekend** — TRUE/FALSE indicator for weekends  
- **n_logins** — number of unique players starting a session during each hour (our target variable)

This processed dataset is what we use for our exploratory analysis and for building a model to forecast hourly login counts.



## Methods

This section describes the steps used to prepare the dataset, explore basic login patterns, and build predictive models to forecast hourly login counts.

## 1. Loading libraries and importing the data

We begin by loading the packages required for data wrangling, working with timestamps, visualization, and modeling. We then import the play-session log (`sessions.csv`), where each row represents one gameplay session on the Minecraft research server.


In [None]:
# Load packages and set seed
set.seed(18)
library(tidyverse)
library(lubridate)
library(tidymodels)
options(repr.matrix.max.rows = 6)

# Read the session log dataset
sessions <- read_csv("data/sessions.csv")
sessions


## 2. Cleaning and preparing the timestamp variables

The session timestamps are stored as character strings, so we convert them into proper datetime objects. We also remove rows whose start time is missing or cannot be parsed, so that every remaining session has a valid start time.
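
As a minimal sketch of this parsing step, the toy strings below (hypothetical values, assuming a day/month/year hour:minute layout, which is what `dmy_hm()` expects) show that `lubridate` returns `NA` for a value it cannot parse. Rows like the third one are what a later `drop_na(start_time)` call would remove.

```r
library(lubridate)

# Hypothetical raw values: two well-formed timestamps and one malformed entry
raw_times <- c("30/06/2024 18:12", "01/07/2024 09:05", "not a timestamp")

# quiet = TRUE suppresses the "failed to parse" warning for the bad entry
parsed_times <- dmy_hm(raw_times, quiet = TRUE)

parsed_times              # datetimes in UTC, with NA for the malformed entry
sum(is.na(parsed_times))  # number of values that failed to parse
```

Checking the number of `NA` values after parsing is a quick way to confirm that the chosen parser matches the file's actual timestamp layout.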


In [None]:
# Parse timestamp strings as datetimes and drop rows without a valid start time
sessions_clean <- sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time   = dmy_hm(end_time)
  ) |>
  drop_na(start_time)


## 3. Creating hourly features and aggregating login counts

To study login patterns and build our prediction model, we transform the session-level data into an hourly dataset.  
For each session, we extract:

- calendar date  
- hour of the day  
- weekday label  
- weekend indicator  

We then compute the number of **unique players** who started a session in each hour. This becomes our target variable `n_logins`.
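
To illustrate why the count uses distinct players rather than session rows, here is a small sketch on hypothetical toy data (the player IDs `"a"`, `"b"`, `"c"` and the timestamps are made up for illustration). Player `"a"` starts two sessions in the same hour but should be counted once.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy sessions: player "a" logs in twice during the 18:00 hour
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b", "c"),
  start_time  = ymd_hm(c(
    "2024-06-30 18:05", "2024-06-30 18:40",
    "2024-06-30 18:55", "2024-06-30 20:10"
  ))
)

toy_counts <- toy_sessions |>
  mutate(date = as_date(start_time), hour = hour(start_time)) |>
  group_by(date, hour) |>
  summarise(
    n_sessions = n(),                      # counts every session start
    n_logins   = n_distinct(hashedEmail),  # counts each player once per hour
    .groups = "drop"
  )

toy_counts
```

In this toy output the 18:00 hour has three sessions but two unique logins. Note also that the 19:00 hour, which has no session starts, does not appear as a row at all: grouping and summarising only produces rows for hours that contain at least one session.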


In [None]:
# Create hourly features and compute hourly unique login counts
login_counts <- sessions_clean |>
  mutate(
    date       = as_date(start_time),
    hour       = hour(start_time),
    # Abbreviated day labels ("Sat", "Sun") assume an English locale
    weekday    = wday(start_time, label = TRUE, abbr = TRUE),
    is_weekend = weekday %in% c("Sat", "Sun")
  ) |>
  group_by(date, hour, weekday, is_weekend) |>
  summarise(
    n_logins = n_distinct(hashedEmail),
    .groups = "drop"
  )

login_counts


## 4. Splitting the data into training and testing sets

Because the data form a time series of hourly counts, we split the dataset chronologically rather than at random: the first 75% of the hourly observations form the training set, and the remaining 25% are used for testing. This lets us evaluate how well the model predicts login behavior in a period it has not seen.
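
The following sketch, on a hypothetical eight-row data frame, shows what a chronological split looks like with `rsample::initial_time_split()`: rows are taken in their existing order, so the data must already be sorted by time, and every training row precedes every testing row.

```r
library(rsample)

# Hypothetical hourly observations, already sorted by the time index t
toy_hours <- data.frame(
  t        = 1:8,
  n_logins = c(1, 2, 1, 3, 2, 1, 1, 2)
)

# The first 75% of rows (in order) go to training, the rest to testing
toy_split <- initial_time_split(toy_hours, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

nrow(toy_train)                      # 6 rows
nrow(toy_test)                       # 2 rows
max(toy_train$t) < min(toy_test$t)   # TRUE: no future rows leak into training
```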


In [None]:
# Order data by time before splitting
login_counts <- login_counts |> arrange(date, hour)

# Time-based split
data_split <- initial_time_split(login_counts, prop = 0.75)
train_data <- training(data_split)
test_data  <- testing(data_split)

train_data |> head()


## 5. Exploratory visualization

To get a sense of when players tend to log in, we plot the average number of logins for each hour of the day. This helps us see basic daily patterns before fitting any models.


In [None]:
# Use training data for EDA
eda_plot <- train_data |>
  group_by(hour) |>
  summarise(mean_logins = mean(n_logins), .groups = "drop") |>
  ggplot(aes(x = hour, y = mean_logins)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Figure 1. Average Number of Logins by Hour of Day",
    x = "Hour of Day (0–23)",
    y = "Average Login Count"
  )

eda_plot


## 6. Building the Prediction Model

To forecast hourly login counts, we fit a k-nearest neighbours (kNN) regression model with k = 5.  
The model uses three time-based features:

- hour of day  
- weekday  
- weekend indicator  

These features capture the main temporal patterns in the dataset.  
We train the model on the first 75% of the observations and evaluate it on the remaining 25%.
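
To make the idea behind kNN regression concrete, here is a simplified base-R sketch on hypothetical data with a single numeric feature. It predicts by taking the plain average of the targets of the k closest training points. This is an illustration of the general technique only: the `kknn` engine may weight neighbours by distance, so its predictions can differ from an unweighted average, and the helper `knn_predict()` is a made-up name, not part of any package.

```r
# Hypothetical training data: one feature (hour) and the target (logins)
train_hour   <- c(1, 2, 3, 10, 11, 12)
train_logins <- c(1, 1, 2, 4, 5, 3)

# Illustrative helper: unweighted kNN regression with absolute distance
knn_predict <- function(new_hour, k = 3) {
  dists   <- abs(train_hour - new_hour)   # distance to every training point
  nearest <- order(dists)[seq_len(k)]     # indices of the k closest points
  mean(train_logins[nearest])             # average their targets
}

knn_predict(11)  # averages the targets at hours 10, 11, 12 -> 4
knn_predict(2)   # averages the targets at hours 1, 2, 3
```

Because predictions depend on distances, features on larger numeric scales (such as an hour value from 0 to 23) can dominate features coded as 0/1 unless the predictors are standardized.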


In [None]:
# Build recipe: encode nominal predictors (here, weekday) as numeric columns
login_recipe <- recipe(n_logins ~ hour + weekday + is_weekend, data = train_data) |>
  step_dummy(all_nominal_predictors())

# Specify kNN model
knn_spec <- nearest_neighbor(
  mode = "regression",
  neighbors = 5
) |>
  set_engine("kknn")

# Workflow
knn_workflow <- workflow() |>
  add_recipe(login_recipe) |>
  add_model(knn_spec)

# Fit the model
knn_fit <- knn_workflow |>
  fit(data = train_data)

# Predict on the test set
knn_predictions <- predict(knn_fit, test_data) |>
  bind_cols(test_data)

# Compute RMSE for the kNN model
knn_rmse <- knn_predictions |>
  rmse(truth = n_logins, estimate = .pred)

knn_rmse


## 7. Model Evaluation

To evaluate how well the k-nearest neighbours (kNN) regression model predicts hourly login counts,  
we compute the Root Mean Squared Error (RMSE) on the test dataset.  
A lower RMSE indicates better predictive performance.
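
As a quick illustration of the metric, the sketch below computes RMSE by hand on hypothetical values (the `truth` and `pred` vectors are made up): square each prediction error, average the squares, and take the square root.

```r
# Hypothetical observed and predicted hourly login counts
truth <- c(1, 2, 3, 1)
pred  <- c(1, 1, 2, 3)

errors   <- truth - pred             # 0, 1, 1, -2
toy_rmse <- sqrt(mean(errors^2))     # sqrt((0 + 1 + 1 + 4) / 4) = sqrt(1.5)

toy_rmse
```

Because errors are squared before averaging, a single large miss (here the error of 2) contributes more to RMSE than several small ones. The result is expressed in the same units as the target, in this case logins per hour.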

We plot the RMSE value for the kNN model to summarize its accuracy.


In [None]:
# Prepare data for plotting
combined_metrics <- tibble(
  model = "kNN",
  estimate = knn_rmse$.estimate
)

combined_metrics


In [None]:
# Plot RMSE for the kNN model
ggplot(combined_metrics, aes(x = model, y = estimate, fill = model)) +
  geom_col() +
  labs(
    title = "Figure 2. Model RMSE",
    x = "Model",
    y = "RMSE"
  ) +
  theme_minimal()


### Summary of Model Evaluation

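The kNN model achieves an RMSE of about 0.85 on the test set, meaning its hourly predictions are off by a little under one login in the root-mean-square sense. This suggests that the model captures general login patterns, but because it relies on only three time-based features, its predictive accuracy is likely limited.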
