# Data Science Project: Planning Stage - **UBC Minecraft Research Server**

**Students:** Shaurya V. Shastri, Catherine Harris, Jessica Wang                                              
**Date:** 07-12-2025         
**Course:** DSCI100-009

---
GitHub Repository: https://github.com/symkk79/dsci_100_project.git

## 1. Introduction
Over the past decade, online gaming communities have grown sharply, which has made understanding player engagement an essential part of managing servers and designing outreach strategies. Knowing which kinds of players are more likely to stay involved can help developers plan resources and tailor recruitment and communication campaigns.

The broad question we are focusing on is: 
> What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

The specific question we aim to answer is:
> Can a player's age predict whether they subscribe to a game-related newsletter, based on the `players.csv` dataset?

This analysis explores the relationship between player activity and continued interest in the game community.

### Dataset Description
The dataset we use comes from a UBC Computer Science Minecraft research server, which records player activity for the purpose of studying engagement patterns. The `players.csv` dataset describes the characteristics of each player and whether they chose to subscribe to a game-related newsletter.

**Variables**
| Variable   | Type        | Meaning |
| ---------- | ----------- | ------- |
|experience  | categorical | The experience category the player falls into|
|subscribe   | categorical | Whether or not the player is subscribed to a game-related newsletter|
|hashedEmail | categorical | The hashed email address of the player|
|played_hours| quantitative| The number of hours played|
|name        | categorical | The name of the player|
|gender      | categorical | The gender of the player|
|Age         | quantitative| The age of the player|


<center>Figure 1.1: Variable Explanation </center>

## 2. Method
The methods section will outline the full analysis workflow, including:
- Data Processing
- Exploratory Data Visualization
- Data Analysis

We began by importing the `players` dataset directly from its GitHub URL. Next, we generated basic summary statistics, including the number of observations, mean age, and mean hours played. These summaries allowed us to understand the distribution of the data and identify potential anomalies, such as extreme values.



We then conducted exploratory data visualization to examine the relationship between players' ages and their subscription status by plotting age against subscription.

### 2.1 Data Processing

The data processing stage includes:
1. Loading the dataset from the URL
2. Producing summary statistics
3. Wrangling the data
4. Splitting the data into training and testing sets

#### 0. Setting Up ####

In [None]:
library(tidyverse)
library(purrr)
library(ggplot2)
library(tidymodels)

#### 1. Loading Dataset from URLs ####

In [None]:
players_URL <- "https://raw.githubusercontent.com/symkk79/dsci-100-project-planning-dataset/main/players.csv"
players_data <- read_csv(players_URL)

players_data

#### 2. Generating Summary Statistics  ####
We generate basic summary statistics for the players dataset, including counts and mean values for key variables such as age and hours played. This helps us understand the overall distribution of the data and identify any potential anomalies before modelling.

In [None]:
summary_data <- players_data |>
                summary(digits = 3)

summary_data

In [None]:
observation_count <- players_data |>
                    count()
observation_count

In [None]:
players_mean <- players_data |>
                select(played_hours, Age) |>
                map_dfr(mean, na.rm = TRUE)
players_mean

| Variable     | Mean |
| ------------ | ---- |
| Hours Played | 6    |
| Age          | 21   |

#### 3. Data Wrangling
For data analysis, we selected `Age` and `subscribe` variables, removed missing values, and converted `subscribe` into a factor. This generates a clean dataset with one predictor and one categorical outcome.

In [None]:
wrangled_player_data <- players_data |>
    select(Age, subscribe) |>
    drop_na() |>
    mutate(subscribe = as_factor(subscribe)) 
wrangled_player_data

#### 4. Splitting the data into training and testing data
We first set a seed to ensure that the results are reproducible.
Then, using `initial_split()`, we divided the cleaned dataset into two parts:
- 75% training data used to build and tune the model
- 25% testing data used to evaluate model performance

We also used stratified sampling based on the `subscribe` variable. This keeps the proportion of subscribers and non-subscribers in both the training and testing sets approximately the same as in the original dataset, which reduces the risk of a biased model evaluation.

In [None]:
set.seed(1234)

player_split <- initial_split(wrangled_player_data, prop = 0.75, strata = subscribe)  
player_train <- training(player_split)   
player_test <- testing(player_split)
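
The effect of stratification can be checked by comparing class proportions before and after the split. The sketch below does this on a small made-up dataset rather than the real data; `toy_players` and `class_proportions()` are hypothetical names introduced only for illustration.

```r
library(dplyr)
library(rsample)

set.seed(1)

# Hypothetical toy data: 30 subscribers and 10 non-subscribers
toy_players <- tibble(
  Age = sample(10:40, size = 40, replace = TRUE),
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(30, 10)))
)

# Same split settings as in the analysis above
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)

# Helper (illustrative): share of each class in a data frame
class_proportions <- function(df) {
  df |>
    count(subscribe) |>
    mutate(proportion = n / sum(n))
}

class_proportions(toy_players)
class_proportions(training(toy_split))
class_proportions(testing(toy_split))
```

If the split is stratified as intended, the three tables should show similar proportions. The same helper could be applied to `wrangled_player_data`, `player_train`, and `player_test`.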

### 2.2 Data Visualization

The data visualization stage examines the relationship between age and subscription status.


#### Age vs. Subscription

In [None]:
age_vs_subscription_graph <- players_data |>
                            ggplot(aes(x = Age, fill = subscribe)) +
                            geom_bar() +
                            labs(x = "Player's Age", y = "Number of Players", fill = "If they subscribed") +
                            ggtitle("Age of player and whether they subscribed")
age_vs_subscription_graph

<center>Figure 2.1: Visualization of the age of players and whether they subscribed </center>

This bar chart displays how subscription status varies across ages. Younger players appear slightly more likely to subscribe, while older players subscribe less frequently, though the relationship is weak.

### 2.3 Data Analysis
In this section, we analyze whether a player's age can be used to predict their newsletter subscription status using a K-Nearest Neighbours (KNN) classification model. The analysis includes two steps:
1. Training and tuning the KNN model using 10-fold cross-validation to identify the optimal number of neighbours
2. Evaluating the final model on the test dataset using accuracy, precision, recall, and a confusion matrix to assess predictive performance.
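
To make the idea behind KNN concrete, the following base R sketch classifies a new player by majority vote among the k closest ages. It is a hand-written illustration on made-up data, not the `kknn` engine used in the analysis, which may handle ties and weighting differently; `train_age`, `train_subscribe`, and `knn_predict_one()` are hypothetical names.

```r
# Hypothetical training data, for illustration only
train_age <- c(12, 15, 17, 20, 22, 30, 35, 41)
train_subscribe <- c("TRUE", "TRUE", "TRUE", "FALSE",
                     "TRUE", "FALSE", "FALSE", "FALSE")

# Classify one new age by majority vote among its k nearest neighbours
knn_predict_one <- function(new_age, k) {
  distances <- abs(train_age - new_age)      # distance with a single predictor
  nearest <- order(distances)[seq_len(k)]    # indices of the k closest players
  votes <- table(train_subscribe[nearest])   # count the labels among neighbours
  names(votes)[which.max(votes)]             # return the most common label
}

knn_predict_one(14, k = 3)
knn_predict_one(33, k = 3)
```

With these made-up values, the three players closest to age 14 are all subscribers and the three closest to age 33 are all non-subscribers, so the two calls return different classes.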

#### 1. Training and Tuning the KNN Model with Cross-Validation

We build a full modelling pipeline and use cross-validation to choose the best value of k for our KNN classifier.

First, we fix the random seed so the results are reproducible, then create a recipe that uses `Age` to predict `subscribe` and standardizes the predictor (centering and scaling) so distance calculations in KNN are meaningful.
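
As a small worked example of what centering and scaling do, the base R sketch below standardizes a handful of made-up ages by hand. The `ages` vector is hypothetical, and the recipe steps in the analysis estimate the mean and standard deviation from the training data rather than from values like these.

```r
# Hypothetical ages, used only for illustration
ages <- c(9, 17, 21, 25, 43)

# Standardize by hand: subtract the mean, then divide by the standard deviation
standardized_ages <- (ages - mean(ages)) / sd(ages)

standardized_ages
mean(standardized_ages)  # approximately 0
sd(standardized_ages)    # 1

# Base R's scale() performs the same computation
all.equal(standardized_ages, as.numeric(scale(ages)))
```

After standardization the values have mean 0 and standard deviation 1, so every standardized predictor is on a comparable scale when distances are computed.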

Next, we specify a KNN classification model with a tunable number of neighbours, and set up 10-fold cross-validation stratified by `subscribe` to keep the class balance consistent across folds.
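
The mechanics of v-fold cross-validation can be sketched in base R as follows. This is a simplified, unstratified illustration with made-up sizes (`n_rows`, `v`, and `fold_id` are hypothetical names); in the analysis itself, `vfold_cv()` creates the folds and also handles stratification.

```r
set.seed(2025)

n_rows <- 20   # hypothetical number of training rows
v <- 5         # hypothetical number of folds

# Assign each row to one of v folds at random, in equal numbers
fold_id <- sample(rep(seq_len(v), length.out = n_rows))

# Each fold takes one turn as the validation set; the rest is used for fitting
for (fold in seq_len(v)) {
  validation_rows <- which(fold_id == fold)
  analysis_rows <- which(fold_id != fold)
  cat("Fold", fold, ": fit on", length(analysis_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Every row is used for validation exactly once, and the accuracy reported for a given k is the average over the folds.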

We define a grid of candidate k-values from 5 to 15 and use a workflow to combine the recipe and model. The `tune_grid()` function fits the model for each k on each fold, and `collect_metrics()` returns the cross-validated performance metrics, including mean accuracy, for every value of k.

Then we filter to the accuracy metric and plot accuracy against the number of neighbours to visualize performance. Finally, we extract the k that achieves the highest mean accuracy (`best_k`). This tuned k will be used for the final model.

In [None]:
set.seed(453)

training_recipe <- recipe(subscribe ~ Age, data = player_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

training_model <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

training_v_fold <- vfold_cv(player_train, v = 10, strata = subscribe)

k_values <- tibble(neighbors = seq(from = 5, to = 15, by = 1))

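# Fit the model for every k on every fold and collect the cross-validated metrics
knn_tuning_metrics <- workflow() |>
                    add_recipe(training_recipe) |>
                    add_model(training_model) |>
                    tune_grid(resamples = training_v_fold, grid = k_values) |>
                    collect_metrics()

# Keep only the mean accuracy for each k
accuracies <- knn_tuning_metrics |>
                filter(.metric == "accuracy")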

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(x = "Number of Neighbours", y = "Accuracy") +
                  ggtitle("Accuracy of Model With Different K Values") +
                  scale_x_continuous(breaks = 5:15)
cross_val_plot
                  

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)



In [None]:
set.seed(1122) 

player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

player_fit <- workflow() |>
    add_recipe(training_recipe) |>
    add_model(player_spec) |>
    fit(data = player_train)
player_fit

<center>Figure 3.1: Accuracy of the KNN model across different k-values </center>

#### 2. Predicting on the Test Data and Evaluating Model Performance

In this step, we use the tuned KNN model (stored in `player_fit`) to make predictions on the held-out test set and then quantify how well the model performs.

We first call `predict()` on `player_test` and bind the predictions back to the original test data, creating a single tibble that contains both the true subscription status and the predicted class.

Using this object, we compute several evaluation metrics: overall accuracy, precision, and recall.

Finally, we construct a confusion matrix to show the counts of true positives, true negatives, false positives, and false negatives, giving a detailed view of where the model is making mistakes.

In [None]:
player_test_predictions <- predict(player_fit, player_test) |>
  bind_cols(player_test)


test_accuracies <- player_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy") 


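# event_level = "first" treats the first factor level of `subscribe` as the
# positive class; check levels(player_test$subscribe) to confirm which one that is
test_precision <- player_test_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "first")

test_recall <- player_test_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "first")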

test_accuracies
test_precision
test_recall

# Store the result under its own name so it does not shadow yardstick's conf_mat()
player_conf_mat <- player_test_predictions |>
                conf_mat(truth = subscribe, estimate = .pred_class)

player_conf_mat
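
To show how accuracy, precision, and recall relate to the cells of a confusion matrix, the base R sketch below computes them by hand from made-up labels. The vectors `truth` and `estimate` and the choice of `"TRUE"` as the positive class are assumptions for illustration only, not results from the model.

```r
# Hypothetical true and predicted labels, for illustration only
truth    <- c("TRUE", "TRUE", "TRUE",  "TRUE", "FALSE", "FALSE", "TRUE", "FALSE")
estimate <- c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE",  "FALSE", "TRUE", "FALSE")

positive <- "TRUE"  # assumed positive class (subscribed)

# Cells of the confusion matrix
tp <- sum(estimate == positive & truth == positive)  # true positives
fp <- sum(estimate == positive & truth != positive)  # false positives
fn <- sum(estimate != positive & truth == positive)  # false negatives
tn <- sum(estimate != positive & truth != positive)  # true negatives

accuracy  <- (tp + tn) / length(truth)  # share of all predictions that are correct
precision <- tp / (tp + fp)             # share of predicted positives that are correct
recall    <- tp / (tp + fn)             # share of actual positives that are found

c(accuracy = accuracy, precision = precision, recall = recall)
```

For these made-up labels there are 4 true positives, 1 false positive, 1 false negative, and 2 true negatives, giving an accuracy of 0.75 and a precision and recall of 0.8 each.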