# Predicting Player Login Patterns: Forecasting Hourly Login Counts on a Minecraft Research Server

## Introduction

For any online game server, it’s useful to know when players are most likely to log in. This helps with planning maintenance, managing server load, and understanding general player activity patterns. In many games, logins tend to follow clear time-of-day or day-of-week patterns, so it’s reasonable to ask whether we can predict player activity ahead of time.

In this project, we focus on the question:

**Can we predict how many unique players will log into the server during a given hour?**

To explore this, we use an anonymized play session log from a Minecraft research server hosted by the University of British Columbia. Each row in the dataset represents one gameplay session, including when a player logged in and when they logged out. The session log file (`sessions.csv`) contains 1,535 rows and the following variables:

- **hashedEmail**: anonymized player ID
- **start_time**: session start timestamp (DD/MM/YYYY HH:MM, the format parsed by `dmy_hm()` in the Methods section)
- **end_time**: session end timestamp (same format as `start_time`)
- **original_start_time**: raw start timestamp before adjustment
- **original_end_time**: raw end timestamp before adjustment

From each session's start time, we derive four time features and then aggregate the sessions into an hourly-level dataset with one target variable:

- **date**: calendar date of the login
- **hour**: hour of day (0–23)
- **weekday**: day of week (Mon–Sun)
- **is_weekend**: TRUE/FALSE indicator for weekends
- **n_logins**: number of unique players starting a session during each hour (our target variable)

This processed dataset is what we use for our exploratory analysis and for building a model to forecast hourly login counts.



## Methods

This section describes the steps used to prepare the dataset, explore basic login patterns, and build predictive models to forecast hourly login counts.

### 1. Loading libraries and importing the data

We begin by loading the packages required for data wrangling, working with timestamps, visualization, and modeling. We then import the play-session log (`sessions.csv`), where each row represents one gameplay session on the Minecraft research server.


In [None]:
# Load packages and set seed
set.seed(18)
library(tidyverse)
library(lubridate)
library(tidymodels)
options(repr.matrix.max.rows = 6)

# Read the session log dataset
sessions <- read_csv("data/sessions.csv")
sessions


### 2. Cleaning and preparing the timestamp variables

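The session timestamps are stored as character strings. We convert `start_time` and `end_time` to datetime objects with `dmy_hm()`, which expects day-month-year order followed by hours and minutes and returns times in UTC by default. We then drop rows whose `start_time` is missing or could not be parsed, so that every remaining session has a valid start time.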



In [None]:
# Convert timestamp strings to datetime and remove missing values
sessions_clean <- sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time   = dmy_hm(end_time)
  ) |>
  drop_na(start_time)
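
As a quick sanity check on the parsing step, the sketch below applies `dmy_hm()` to a few made-up timestamp strings and counts how many fail to parse. The strings are illustrative assumptions, not rows from `sessions.csv`; the same check could be run on the real `start_time` column before calling `drop_na()`.

```r
library(lubridate)

# Toy strings for illustration only; they are not rows from sessions.csv
raw_times <- c("30/06/2024 18:12", "01/07/2024 09:05", "not a timestamp", NA)

# quiet = TRUE suppresses the parse warning so we can count failures ourselves
parsed <- dmy_hm(raw_times, quiet = TRUE)

# Entries that were present in the raw data but could not be parsed
n_failed <- sum(is.na(parsed) & !is.na(raw_times))

parsed
n_failed
tz(parsed)
```

A non-zero `n_failed` on the real data would suggest that some rows use a different timestamp format and are being dropped silently.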


### 3. Creating hourly features and aggregating login counts

To prepare the data for analysis, we convert the session-level records into an hourly dataset.  
For each session start time, we extract:

- the date
- the hour of the day
- the weekday
- whether the day is a weekend

We then count the number of **unique players** who started a session in each hour. The resulting variable `n_logins` is the target for prediction.

After this step, the dataset contains one row per hour in which at least one session started, instead of one row per session. Hours with no logins do not appear as rows, so `n_logins` is always at least 1.


In [None]:
# Create hourly features and compute hourly unique login counts
login_counts <- sessions_clean |>
  mutate(
    date = as_date(start_time),
    hour = hour(start_time),
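    weekday = wday(start_time, label = TRUE, abbr = TRUE),
    # Numeric weekday (Mon = 1, ..., Sun = 7) so the check does not depend on locale-specific day labels
    is_weekend = wday(start_time, week_start = 1) >= 6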
  ) |>
  group_by(date, hour, weekday, is_weekend) |>
  summarise(n_logins = n_distinct(hashedEmail), .groups = "drop")

login_counts
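
If hours without any login should count as zeros rather than be absent, one option is to expand the table to every date-hour combination with `tidyr::complete()`. The sketch below does this on a small toy table with hypothetical values; whether every hour between the first and last observed date should be treated as a valid zero (for example, if the server was offline at times) is an assumption that would need checking against the real data.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy hourly counts (hypothetical values): two hours on one day, one hour on the next
toy_counts <- tibble(
  date = as_date(c("2024-06-30", "2024-06-30", "2024-07-01")),
  hour = c(18L, 19L, 9L),
  n_logins = c(2L, 1L, 3L)
)

# Build every date-hour combination between the first and last observed date,
# filling hours without a login with a count of zero
full_counts <- toy_counts |>
  complete(
    date = seq(min(date), max(date), by = "day"),
    hour = 0:23,
    fill = list(n_logins = 0L)
  ) |>
  mutate(
    weekday = wday(date, label = TRUE, abbr = TRUE),
    is_weekend = wday(date, week_start = 1) >= 6
  )

nrow(full_counts)
sum(full_counts$n_logins)
```

The completed table has 24 rows per day, and the total number of logins is unchanged because only zero-count rows are added.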


### 4. Splitting the data into training and testing sets

Since the data form a time series of hourly login counts, we split the dataset in time order. The first 75% of the rows are used as the training set, and the last 25% are kept as the test set. This allows us to evaluate how well the model can predict future login behaviour.


In [None]:
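# Sort chronologically so the split respects time order
login_counts <- login_counts |>
  arrange(date, hour)

# First 75% of rows for training, last 25% for testing
data_split <- initial_time_split(login_counts, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)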

head(train_data)
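
To make the idea of a time-ordered split concrete, the sketch below reproduces it by hand on a toy table with hypothetical values: sort by date and hour, take the leading 75% of rows as training data, and confirm that every training hour precedes every test hour. It illustrates the concept only; the exact row-count rounding used inside `initial_time_split()` is not shown here.

```r
library(dplyr)

# Toy hourly table (hypothetical values), deliberately out of order
toy <- tibble(
  date = as.Date(c("2024-07-02", "2024-07-01", "2024-07-01", "2024-07-02",
                   "2024-07-03", "2024-07-03", "2024-07-04", "2024-07-04")),
  hour = c(20L, 9L, 18L, 8L, 10L, 21L, 7L, 19L),
  n_logins = c(1L, 2L, 3L, 1L, 2L, 1L, 4L, 2L)
) |>
  arrange(date, hour)

# Take the first 75% of rows as training data and the rest as test data
n_train <- floor(0.75 * nrow(toy))
toy_train <- slice_head(toy, n = n_train)
toy_test <- slice_tail(toy, n = nrow(toy) - n_train)

# Combine date and hour into a single timestamp to compare the two sets
stamp <- function(df) as.POSIXct(df$date, tz = "UTC") + df$hour * 3600

# TRUE when no training hour falls after the start of the test period
max(stamp(toy_train)) < min(stamp(toy_test))
```

The same comparison can be applied to `train_data` and `test_data` to confirm that the test set lies strictly in the future relative to the training set.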


### 5. Exploratory visualization

To see when players tend to log in, we use the training data to plot the average number of logins for each hour of the day. This gives a simple view of daily patterns before fitting any models.

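Figure 1 shows the mean hourly login count across all days in the training set. Because `login_counts` only has rows for hours with at least one login, each mean is taken over the days on which that hour had a login.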


In [None]:
# Mean login count by hour of day, computed on the training set only
eda_plot <- train_data |>
  group_by(hour) |>
  summarise(mean_logins = mean(n_logins)) |>
  ggplot(aes(x = hour, y = mean_logins)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Figure 1. Average Number of Logins by Hour of Day",
    x = "Hour of Day (0–23)",
    y = "Average Login Count"
  )

eda_plot
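
Since the model also uses the weekend indicator, the same hourly summary can be split by `is_weekend` to see whether weekday and weekend profiles differ. The sketch below shows the grouping on a toy table with hypothetical values; `n_hours` is an illustrative helper column recording how many rows contribute to each mean.

```r
library(dplyr)

# Toy training rows (hypothetical values) with the same columns used in the report
toy_train <- tibble(
  hour = c(9L, 9L, 18L, 18L, 18L, 18L),
  is_weekend = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE),
  n_logins = c(1L, 2L, 2L, 4L, 3L, 5L)
)

# Mean login count for each hour, separately for weekdays and weekends
hourly_profile <- toy_train |>
  group_by(is_weekend, hour) |>
  summarise(
    mean_logins = mean(n_logins),
    n_hours = n(),
    .groups = "drop"
  )

hourly_profile
```

Means based on very few rows (small `n_hours`) are noisy, which is worth keeping in mind when reading any hourly profile from a dataset of this size.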


### 6. Building the prediction model

To forecast hourly login counts, we fit a k-nearest neighbours (kNN) regression model with a fixed k = 5 (no tuning is performed).  
The model uses three time-based features:

- hour of day  
- weekday  
- weekend indicator  

These features describe when each hour occurs and capture the main temporal patterns in the data. We train the model on the first 75% of the observations (the training set) and then use it to predict the login counts in the remaining 25% (the test set).
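
To illustrate the idea behind kNN regression, the sketch below predicts a login count by hand as the plain average of the k nearest training points, using a single feature (hour) and toy values that are not from the dataset. This is a simplified illustration: the `kknn` engine works on all features and may weight neighbours by distance, so its predictions need not match this unweighted version. The helper name `knn_predict` is hypothetical.

```r
# Toy training data (hypothetical values): hour of day and login count
train_hour <- c(8, 9, 10, 17, 18, 19, 20)
train_logins <- c(1, 1, 2, 3, 4, 3, 2)

# Predict the count for a new hour as the mean count of its k nearest training hours
knn_predict <- function(new_hour, k = 3) {
  distances <- abs(train_hour - new_hour)
  nearest <- order(distances)[seq_len(k)]
  mean(train_logins[nearest])
}

knn_predict(18)
knn_predict(9)
```

For an evening hour the prediction is pulled towards the higher evening counts, and for a morning hour towards the lower morning counts, which is the behaviour the model relies on.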


In [None]:
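# Preprocessing: encode nominal predictors (here, weekday) as numeric columns
login_recipe <- recipe(n_logins ~ hour + weekday + is_weekend, data = train_data) |>
  step_dummy(all_nominal_predictors())

# kNN regression with a fixed k = 5 (not tuned)
knn_spec <- nearest_neighbor(
  mode = "regression",
  neighbors = 5
) |>
  set_engine("kknn")

# Bundle the recipe and the model specification
knn_workflow <- workflow() |>
  add_recipe(login_recipe) |>
  add_model(knn_spec)

# Fit on the training set only
knn_fit <- knn_workflow |>
  fit(data = train_data)

# Predict the held-out test hours and attach the true counts
knn_predictions <- predict(knn_fit, test_data) |>
  bind_cols(test_data)

# Test-set RMSE, in logins per hour
knn_rmse <- knn_predictions |>
  rmse(truth = n_logins, estimate = .pred)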

knn_rmse
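
An RMSE is easier to interpret next to a simple baseline. The sketch below computes the RMSE, $\sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$, for a naive model that predicts the training-set mean for every test hour. The numbers are toy values, not results from this dataset, and `rmse_manual` is a hypothetical helper written for illustration.

```r
# Toy training and test counts (hypothetical values, not from the dataset)
train_logins <- c(1, 2, 1, 3, 1, 2)
test_logins <- c(1, 3, 2, 2)

# Baseline: predict the training-set mean for every test hour
baseline_pred <- rep(mean(train_logins), length(test_logins))

# RMSE = square root of the mean squared prediction error
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

baseline_rmse <- rmse_manual(test_logins, baseline_pred)
baseline_rmse
```

On the real data, the baseline prediction would be `mean(train_data$n_logins)`, and the resulting value could be compared with `knn_rmse`: a kNN model that does not beat this baseline is not learning anything useful from the time features.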


### 7. Model evaluation

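We evaluate the kNN regression model with the RMSE on the test set (`knn_rmse`, computed in the previous step), which summarizes the typical size of the prediction error in logins per hour. As a coarse additional check, we compare the average actual login count to the average predicted login count in the test data. Figure 2 shows both values side by side. This comparison only reveals overall over- or under-prediction; it does not measure accuracy in individual hours.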


In [None]:
# Mean actual and mean predicted login count over the test set
mean_values <- tibble(
  type = c("Actual", "Predicted"),
  value = c(mean(knn_predictions$n_logins), mean(knn_predictions$.pred))
)

ggplot(mean_values, aes(x = type, y = value, fill = type)) +
  geom_col() +
  labs(
    title = "Figure 2. Mean Actual vs Predicted Login Count",
    x = "",
    y = "Mean login count"
  ) +
  theme_minimal()

### Summary of model evaluation

On average, the kNN model's predicted login counts are close to the mean actual login count in the test set (Figure 2), so the model captures the overall level of activity. The RMSE, however, shows that predictions for individual hours are approximate rather than exact, which is expected given that the model uses only three time-based features.
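
A per-hour view can complement the comparison of means. The sketch below plots predicted against actual counts for a toy data frame with the same two columns as `knn_predictions` (`n_logins` and `.pred`); the values are hypothetical and the plot is an illustration, not a figure from this analysis.

```r
library(ggplot2)

# Toy predictions (hypothetical values) in the same shape as knn_predictions
toy_predictions <- data.frame(
  n_logins = c(1, 1, 2, 3, 2, 1),
  .pred = c(1.2, 1.4, 1.6, 2.0, 1.8, 1.2)
)

# Points on the dashed line are predicted exactly; points below it are under-predicted
pred_plot <- ggplot(toy_predictions, aes(x = n_logins, y = .pred)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_jitter(width = 0.1, height = 0, alpha = 0.6) +
  labs(x = "Actual login count", y = "Predicted login count")

pred_plot
```

Replacing `toy_predictions` with `knn_predictions` would show whether the model's errors are concentrated in the busier hours, something a comparison of means cannot reveal.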
