# Predicting Minecraft Server Subscription from Player Behaviour

## Introduction

Many online games use a hybrid business model where users can play for free but may optionally pay for premium features or subscriptions. Understanding which players are most likely to subscribe is valuable for both server administrators and game developers: it can inform marketing strategies, server design, and allocation of limited development resources.

In this project, we use data from a Minecraft research server to investigate which kinds of players are more likely to subscribe to the server. The main focus is on how demographic and play-related characteristics are associated with the probability of subscription.

### Research Question

**Can we predict whether a player will subscribe to the Minecraft research server from their demographic characteristics and in-game behaviour (such as experience level, play time, and number of sessions)?**

We treat subscription status as the outcome (response) and use a set of player-level features as predictors. We then build and evaluate a k-nearest neighbours (KNN) classification model to answer this question.

### Dataset

We use two datasets collected from a Minecraft research server:

1. **`players.csv` (player-level data)**  
   Each row represents a unique player. Key variables include:
   - `experience` (categorical): self-reported Minecraft experience level (`"Pro"`, `"Veteran"`, `"Regular"`, `"Amateur"`).
   - `subscribe` (logical): whether the player subscribed to the server (`TRUE` / `FALSE`). This is our **response variable**.
   - `hashedEmail` (character): anonymized player identifier used to link with the sessions dataset.
   - `played_hours` (numeric): total hours the player spent on the server.
   - `name` (character): player’s first name (not used in the analysis).
   - `gender` (categorical): player’s gender.
   - `Age` (numeric): player’s age in years.

   The dataset contains **196 observations** and **7 variables**.

2. **`sessions.csv` (session-level data)**  
   Each row contains information about a single play session on the server. Important variables include:
   - `hashedEmail` (character): player ID, linking sessions to `players.csv`.
   - Additional session fields (such as timestamps or durations), which we use indirectly by aggregating the number of sessions per player.

By joining these two datasets on `hashedEmail`, we obtain a player-level dataset with both demographic and behavioural information (e.g., number of sessions).
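
As a minimal sketch of this aggregation and join, the example below uses small made-up tibbles in place of the real files. The names `players_toy`, `sessions_toy`, `session_counts`, and `n_sessions` are hypothetical and chosen only for illustration; the values are not taken from the actual data. It also assumes that a player with no recorded sessions should get a count of 0 rather than being dropped.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for players.csv and sessions.csv (all values are made up)
players_toy <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  experience   = c("Pro", "Amateur", "Regular"),
  subscribe    = c(TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0.0, 1.2)
)

# One row per play session; player "b2" has no sessions
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "c3")
)

# Count the number of sessions for each player
session_counts <- sessions_toy %>%
  count(hashedEmail, name = "n_sessions")

# A left join keeps every player; players without sessions get NA,
# which we replace with 0 (an assumption about how to treat them)
players_joined <- players_toy %>%
  left_join(session_counts, by = "hashedEmail") %>%
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_joined
```

A left join is used here instead of an inner join so that players who never started a session remain in the player-level dataset; whether that is the right choice depends on how such players should be handled in the analysis.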

In the following sections, we describe our methods, present exploratory summaries and visualizations, fit a KNN classifier, and interpret its performance in the context of our research question.


In [None]:

library(tidyverse)
library(tidymodels)

# URLs of the raw CSV files hosted on GitHub
players_url <- "https://raw.githubusercontent.com/ishirGhatpande/Individual-Project-Planning/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/ishirGhatpande/Individual-Project-Planning/refs/heads/main/sessions.csv"

# Read both datasets; show_col_types = FALSE suppresses the column-type message
players <- read_csv(players_url, show_col_types = FALSE)
sessions <- read_csv(sessions_url, show_col_types = FALSE)

# Preview the first rows of each dataset
head(players)
head(sessions)
