# Predicting Usage of a Video Game Research Server


## Introduction

**`Background`** 
Video games have become a widely popular form of entertainment and a valuable source of behavioral data for researchers. Understanding how players interact with games can inform a variety of domains, from improving game design and user engagement to studying cognitive processes and social behavior in virtual environments. A research group in UBC’s Computer Science department is studying how people play video games by collecting data from a custom Minecraft server. Players' demographics, skill levels, and in-game behaviors are recorded to help answer questions about engagement and resource needs.

Running the server requires careful planning, especially in recruiting active participants and managing server capacity. One key question is whether it’s possible to predict which players will subscribe to the project’s newsletter, as a sign of ongoing interest and engagement. This project explores that question using real gameplay and demographic data.



**`Question:`** 
Can player demographics and gameplay behavior, such as age, gender, skill level, and total hours played, predict whether a player will subscribe to the game-related newsletter in the UBC Minecraft server dataset?



**`Data Description`**

To address the predictive question above, we used two datasets collected by a research group in the UBC Computer Science department. The datasets come from a Minecraft server that logs player sessions and stores basic demographic information.

### 1. Player Information (`players.csv`)

This dataset includes one row per player with demographic and experience-level data.

**Summary:**
- **Number of observations:** 196 players
- **Key issues:** 2 missing values in the `Age` column; some players have no playtime recorded.

| Variable             | Type                   | Description                                               |
|----------------------|------------------------|-----------------------------------------------------------|
| `hashedEmail`        | Identifier             | Anonymized player ID used to link datasets                |
| `experience`         | Categorical (string)   | Player’s skill level (e.g., Pro, Veteran, Amateur)        |
| `subscribe`          | Logical (TRUE/FALSE)   | Whether the player subscribed to the newsletter           |
| `played_hours`       | Numeric (float)        | Originally reported total hours played                    |
| `name`               | String                 | First name (not used in modeling)                         |
| `gender`             | Categorical (string)   | Player’s gender                                           |
| `Age`                | Numeric (float)        | Age of the player in years                                |
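
The data-quality issues noted in the summary (missing `Age` values and players with no recorded playtime) could be checked with a short sketch like the one below. The `toy_players` tibble is a hypothetical stand-in for `players.csv`: its values are invented for illustration, and only the column names follow the table above.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; values are invented for illustration
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0, 0.7, 0),
  Age          = c(9, 17, NA, 21)
)

# Count missing values in every column
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Count players whose reported playtime is zero
zero_playtime <- toy_players |>
  filter(played_hours == 0) |>
  nrow()
```

Running the same two steps on the real `players` table would be one way to confirm the counts reported in the summary.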

### 2. Session Logs (`sessions.csv`)

This dataset contains gameplay session data, with multiple entries per player.

**Summary:**
- **Number of observations:** 1,535 sessions
- **Key issues:** 2 missing `end_time` values (incomplete sessions)

| Variable               | Type                 | Description                                              |
|------------------------|----------------------|----------------------------------------------------------|
| `hashedEmail`          | Identifier           | Links session to a player                                |
| `start_time`           | String (datetime)    | Start of the session                                     |
| `end_time`             | String (datetime)    | End of the session (may be missing)                      |
| `original_start_time`  | Numeric (Unix time)  | The exact time the session started, stored as a Unix timestamp (not used) |
| `original_end_time`    | Numeric (Unix time)  | The exact time the session ended, stored as a Unix timestamp (not used)   |
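
Because the session log has several rows per player and some rows lack an `end_time`, a quick check of both properties can be useful before any durations are computed. The sketch below uses a hypothetical `toy_sessions` tibble with invented values; the day/month/year string format is an assumption.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for sessions.csv; values are invented for illustration
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time    = c("30/06/2024 18:24", NA, "02/07/2024 21:00")
)

# Sessions that never recorded an end time (incomplete sessions)
incomplete_sessions <- toy_sessions |>
  filter(is.na(end_time))

# Number of logged sessions per player
sessions_per_player <- toy_sessions |>
  count(hashedEmail, name = "n_sessions")
```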


### Potential Issues

- Subscription behavior may be influenced by external factors not captured in the dataset (e.g., incentives, prior interest).
- Some players have very little gameplay data or only appeared once.
- Small sample size (n = 196) may limit generalizability.

## Methods & Results

To answer whether player demographics and gameplay behavior can predict newsletter subscription, we built a k-nearest neighbors (k-NN) classification model using the `tidymodels` framework in R. The full workflow is described below.
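
As a rough orientation, a k-NN classifier in `tidymodels` is typically assembled from a recipe, a model specification, and a workflow. The sketch below is a minimal illustration under stated assumptions, not the model fitted in this report: the data are invented, the predictors (`Age`, `played_hours`) and the choice of k = 3 are arbitrary, and the `kknn` engine is assumed to be installed.

```r
library(tidymodels)

set.seed(2024)  # make the invented data reproducible

# Hypothetical stand-in for the player table; values are invented
toy_players <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), each = 10)),
  Age          = c(rnorm(10, mean = 20, sd = 3), rnorm(10, mean = 30, sd = 3)),
  played_hours = c(rexp(10, rate = 0.1), rexp(10, rate = 1))
)

# Standardize predictors so no single variable dominates the distance
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = toy_players) |>
  step_normalize(all_predictors())

# k = 3 is an arbitrary illustrative choice, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# Bundle preprocessing and model, then fit on the toy data
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_players)

# Predicted class for each row (here, the same rows used for fitting)
toy_predictions <- predict(knn_fit, toy_players)
```

In a real analysis the number of neighbors would normally be chosen by cross-validation, and accuracy would be assessed on data held out from fitting.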

### 1. Loading and Wrangling the Data

We started by loading the two datasets: `players.csv` and `sessions.csv`. Each session's duration was calculated and summed per player to compute total playtime.



```r
# Load libraries
library(tidyverse)
library(lubridate)
library(tidymodels)

# Load datasets
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

# Calculate session length and total playtime per player
sessions <- sessions |>
  mutate(start_time = dmy_hm(start_time),
         end_time = dmy_hm(end_time),
         session_length = as.numeric(difftime(end_time, start_time, units = "hours")))

playtime_summary <- sessions |>
  group_by(hashedEmail) |>
  summarize(calculated_played_hours = sum(session_length, na.rm = TRUE))
```

```
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

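The per-player totals in `playtime_summary` are computed separately from the `players` table. One way they could be attached to it is a left join on `hashedEmail`, sketched below with invented toy data. Treating players with no logged sessions as having zero calculated hours is an assumption made for illustration, not something stated in this report.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical stand-ins for players.csv and sessions.csv; values are invented
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:00", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time    = c("30/06/2024 19:30", NA, "02/07/2024 21:00")
)

# Same duration and per-player total logic as the cell above
toy_summary <- toy_sessions |>
  mutate(start_time = dmy_hm(start_time),
         end_time = dmy_hm(end_time),
         session_length = as.numeric(difftime(end_time, start_time, units = "hours"))) |>
  group_by(hashedEmail) |>
  summarize(calculated_played_hours = sum(session_length, na.rm = TRUE))

# Keep every player; players without sessions get 0 hours (an assumption)
toy_joined <- toy_players |>
  left_join(toy_summary, by = "hashedEmail") |>
  mutate(calculated_played_hours = coalesce(calculated_played_hours, 0))
```

A left join keeps all players, including those who never logged a session, which matters here because subscription status is recorded per player rather than per session.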