# Data Science Project: Planning Stage - **UBC Minecraft Research Server**

**Students:** Shaurya V. Shastri, Catherine Harris, Jessica Wang                                              
**Date:** 07-12-2025         
**Course:** DSCI100-009

---
GitHub Repository: https://github.com/symkk79/dsci_100_project.git

## 1. Introduction
Over the past decade, online gaming communities have grown sharply, which has made understanding player engagement an essential part of managing servers and designing outreach strategies. Knowing which kinds of players are more likely to stay involved can help developers plan resources and tailor recruitment and communication campaigns.

The broad question we are focusing on is: 
> What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specifically, we aim to answer the following question:
> Can a player's age predict whether the player subscribes to a game-related newsletter, using the data in players.csv?

This analysis explores the relationship between player characteristics and continued interest in the game community.

### Dataset Description
The dataset we use comes from a UBC Computer Science Minecraft research server, which records player activity for the purpose of studying engagement patterns. The `players.csv` dataset describes the characteristics of each player and whether they chose to subscribe to a game-related newsletter.

**Variables**
| Variable   | Type        | Meaning |
| ---------- | ----------- | ------- |
|experience  | categorical | What category of experience the player falls into|
|subscribe   | categorical | Whether or not the player is subscribed to a game-related newsletter|
|hashedEmail | categorical | The hashed email of the player|
|played_hours| quantitative| The number of hours played|
|name        | categorical | The name of the player|
|gender      | categorical | The gender of the player|
|Age         | quantitative| The age of the player |


<center>Figure 1.1: Variable Explanation </center>

## 2. Method
The methods section will outline the full analysis workflow, including:
- Data Import and Wrangling
- Exploratory Data Visualization
- Data Analysis

First, we began by importing the `players` dataset directly from the provided GitHub URL. After loading the data, we checked variable types and cleaned the dataset by mutating the `experience` variable from text categories into ordered numeric values, making it easier to use in summary statistics and visualizations.

Next, we generated basic summary statistics, including the number of observations, mean age, and mean hours played. These summaries allowed us to understand the distribution of the data and identify any potential anomalies, such as extreme values.

We then conducted exploratory data visualization to examine potential relationships between player characteristics and their subscription status. This included plotting:
- Experience level vs. subscription
- Gender vs. subscription
- Age vs. subscription
- Hours played vs. subscription (with extreme values filtered out)

### 2.1 Data Import & Wrangling

The data import stage includes:
1. Loading the dataset from the URL
2. Mutating the variables in the dataset into a more workable format
3. Producing summary statistics

#### 0. Setting Up ####

In [None]:
library(tidyverse)
library(purrr)
library(ggplot2)
library(tidymodels)

#### 1. Loading Dataset from URLs ####

In [None]:
players_URL <- "https://raw.githubusercontent.com/symkk79/dsci-100-project-planning-dataset/main/players.csv"
players_data <- read_csv(players_URL)

players_data

#### 2. Mutating variables to proper value ####

We recoded the `experience` variable from text categories into ordered numeric values (1–5). This makes the variable easier to work with during visualization and modeling because numeric values allow for comparisons and distance-based methods, such as KNN.

In [None]:
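# Recode experience as ordered numeric values (1 = Beginner, ..., 5 = Pro).
# Any category not listed below becomes NA.
players_recoded <- players_data |>
    mutate(
        experience = case_when(
            experience == "Beginner" ~ 1,
            experience == "Amateur"  ~ 2,
            experience == "Regular"  ~ 3,
            experience == "Veteran"  ~ 4,
            experience == "Pro"      ~ 5
        )
    )

players_recoded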
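
As a complement to the recoding above, the same 1–5 ordering can be expressed with an ordered factor. This is only a minimal sketch in base R: the vector `toy_experience` is made-up illustration data, not taken from `players.csv`, and the level order is assumed to match the mapping used in the cell above.

```r
# Level order assumed to match the 1-5 mapping used above
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical example values (not real player data)
toy_experience <- c("Pro", "Beginner", "Regular", "Amateur", "Veteran", "Regular")

# An ordered factor keeps the labels but also records their ranking
experience_factor <- factor(toy_experience, levels = experience_levels, ordered = TRUE)

# Integer codes follow the level order, so "Beginner" is 1 and "Pro" is 5
experience_code <- as.integer(experience_factor)
experience_code

# Ordered factors support comparisons between categories
experience_factor[1] > experience_factor[2]
```

An ordered factor keeps the readable labels for plotting, while the integer codes can be used wherever a numeric value is needed.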

#### 3. Generating Summary Statistics  ####
We generate basic summary statistics for the players dataset, including counts and mean values for key variables such as age and hours played. This helps us understand the overall distribution of the data and identify any potential anomalies before modelling.

In [None]:
summary_data <- players_data |>
                summary(digits = 3)

summary_data

In [None]:
observation_count <- players_data |>
                    count()
observation_count

In [None]:
players_mean <- players_data |>
                select(played_hours, Age) |>
                map_dfr(mean, na.rm = TRUE)
players_mean

| Variable     | Mean |
| ------------ | ---- |
| Hours Played | 6    |
| Age          | 21   |

### 2.2 Data Visualization

The data visualization stage includes:
1. Visualizing the relationship between experience and subscription status
2. Visualizing the relationship between gender and subscription status
3. Visualizing the relationship between age and subscription status
4. Visualizing the relationship between hours played and subscription status


In [None]:
experience_vs_subscription_graph <- players_data |>
                            ggplot(aes(x = subscribe, fill = experience)) +
                            geom_bar() +
                            labs(x = "If the player subscribed", y = "Number of Players", fill = "Level of Experience") +
                            ggtitle("How the experience of player influences if they subscribed")
experience_vs_subscription_graph

<center>Figure 2.1: Visualization of how the experience of players influences whether they subscribe </center>
This stacked bar plot shows the number of subscribed and non-subscribed players, coloured by experience level. The mix of experience levels looks similar in both groups, suggesting that experience has little relationship with whether players subscribe.

In [None]:
gender_vs_subscription_graph <- players_data |>
                            filter(gender != "Prefer not to say") |>
                            ggplot(aes(x = subscribe, fill = gender)) +
                            geom_bar() +
                            labs(x = "If the player subscribed", y = "Number of Players", fill = "Gender") +
                            ggtitle("Gender of player and whether they subscribed")
gender_vs_subscription_graph

<center>Figure 2.2: Visualization of the gender of players and whether they subscribe </center>
This plot shows the number of subscribers by gender. The proportions across gender groups are also nearly the same, indicating that gender has little relationship with newsletter subscription.
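
Because stacked counts can make it hard to compare rates between groups of different sizes, the proportions discussed above can also be tabulated directly. The following is a minimal sketch in base R on a made-up data frame `toy_players` (hypothetical values, not the real dataset). The same idea would apply to `players_data` with `gender` or `experience` as the grouping column.

```r
# Hypothetical toy data standing in for players_data
toy_players <- data.frame(
  gender    = c("Male", "Male", "Male", "Female", "Female", "Non-binary"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of subscribers within each gender group
subscription_rate <- tapply(toy_players$subscribe, toy_players$gender, mean)
subscription_rate

# Group sizes, to judge how reliable each proportion is
group_size <- table(toy_players$gender)
group_size
```

Reporting group sizes alongside the rates matters because a proportion based on only a handful of players can look extreme by chance.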

In [None]:
age_vs_subscription_graph <- players_data |>
                            ggplot(aes(x = Age, fill = subscribe)) +
                            geom_bar() +
                            labs(x = "Player's Age", y = "Number of Players", fill = "If they subscribed") +
                            ggtitle("Age of player and whether they subscribed")
age_vs_subscription_graph

<center>Figure 2.3: Visualization of the age of players and whether they subscribed </center>
This bar plot displays how subscription status varies across ages. Younger players appear slightly more likely to subscribe, while older players subscribe less frequently, though the relationship is weak.

In [None]:
hours_vs_subscription_graph <- players_data |>
                            filter(played_hours < 3) |>
                            ggplot(aes(x = played_hours, fill = subscribe)) +
                            geom_bar() +
                            labs(x = "Playing hours", y = "Number of Players", fill = "If they subscribed") +
                            ggtitle("Hours spent playing and if players subscribed")
hours_vs_subscription_graph

<center>Figure 2.4: Visualization of hours played by the players and whether they subscribed</center>
After filtering out extreme values, this plot shows subscription patterns for players with fewer than 3 hours of play. The distributions for subscribers and non-subscribers are very similar, suggesting hours played is not a strong predictor.
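
When a cutoff such as 3 hours is used, it can help to check how much of the data the filter keeps. The sketch below uses base R on a made-up vector `toy_hours` (hypothetical values, not the real `played_hours` column); the cutoff of 3 is the one used in the plot above.

```r
# Hypothetical playing hours, skewed by a few very large values
toy_hours <- c(0, 0, 0.1, 0.3, 0.5, 1.2, 2.4, 3.8, 48.4, 223.1)

# Cutoff used for the plot above
cutoff <- 3

# Share of players retained by the filter played_hours < cutoff
share_kept <- mean(toy_hours < cutoff)
share_kept

# Quantiles give a sense of how far the largest values sit from the bulk
quantile(toy_hours, probs = c(0.5, 0.9))
```

If the retained share is small, the plot describes only a subset of players, and any conclusion about hours played should be limited to that subset.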

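### 2.3 Data Analysis

The data analysis stage includes:
1. Splitting the data into training and test sets
2. Tuning the number of neighbours for a KNN classifier with 10-fold cross-validation
3. Fitting the final model on the training set
4. Evaluating the model on the test set

In [None]: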
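# Keep only the predictor and the response, drop rows with missing values,
# and convert the response to a factor as required for classification
wrangled_player_data <- players_data |>
    select(Age, subscribe) |>
    drop_na() |>
    mutate(subscribe = as_factor(subscribe))

set.seed(1234)

# 75/25 train/test split, stratified by subscription status
player_split <- initial_split(wrangled_player_data, prop = 0.75, strata = subscribe)
player_train <- training(player_split)
player_test <- testing(player_split)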


In [None]:

set.seed(453)

# Standardize Age so that distances are computed on a common scale
training_recipe <- recipe(subscribe ~ Age, data = player_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# KNN specification with the number of neighbours left to be tuned
training_model <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 10-fold cross-validation, stratified by subscription status
training_v_fold <- vfold_cv(player_train, v = 10, strata = subscribe)

# Candidate values of K
k_values <- tibble(neighbors = seq(from = 5, to = 15, by = 1))

# Cross-validated metrics for each candidate K
tuning_metrics <- workflow() |>
    add_recipe(training_recipe) |>
    add_model(training_model) |>
    tune_grid(resamples = training_v_fold, grid = k_values) |>
    collect_metrics()

accuracies <- tuning_metrics |>
    filter(.metric == "accuracy")

# K with the highest mean cross-validated accuracy
best_k <- accuracies |>
    arrange(desc(mean)) |>
    head(1) |>
    pull(neighbors)
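
To illustrate what the classifier tuned above does with a single predictor, here is a minimal base R sketch of an unweighted ("rectangular") KNN majority vote. It is not the `kknn` implementation: the vectors `train_age` and `train_sub` and the helper `knn_predict_one` are hypothetical and exist only for this illustration.

```r
# Hypothetical training data (not from players.csv)
train_age <- c(15, 17, 18, 21, 22, 25, 30, 41)
train_sub <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Predict the label for one new age by majority vote among the k nearest ages
knn_predict_one <- function(new_age, k) {
  distances <- abs(train_age - new_age)      # distance on the single predictor
  nearest <- order(distances)[seq_len(k)]    # indices of the k closest players
  mean(train_sub[nearest]) > 0.5             # TRUE if most neighbours subscribed
}

prediction_young <- knn_predict_one(19, k = 3)
prediction_older <- knn_predict_one(35, k = 3)
prediction_young
prediction_older
```

With only one predictor, centering and scaling do not change which neighbours are nearest, since they shift and stretch all ages by the same amounts. The standardization steps in the recipe become important once more predictors are added.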

In [None]:

set.seed(1122) 

player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

player_fit <- workflow() |>
    add_recipe(training_recipe) |>
    add_model(player_spec) |>
    fit(data = player_train)
player_fit
     

In [None]:

player_test_predictions <- predict(player_fit, player_test) |>
  bind_cols(player_test)

test_accuracies <- player_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy") 
test_accuracies

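# Optional: precision and recall on the test set, computed as separate calls
# test_precision <- player_test_predictions |>
#     precision(truth = subscribe, estimate = .pred_class, event_level = "first")
# test_recall <- player_test_predictions |>
#     recall(truth = subscribe, estimate = .pred_class, event_level = "first")
# test_precision
# test_recall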
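
For reference, accuracy, precision, and recall can all be derived from the four cells of a confusion matrix. The sketch below computes them by hand in base R on made-up vectors `truth` and `predicted` (hypothetical values, not model output), treating `TRUE` (subscribed) as the positive class, which is an assumption made for this illustration.

```r
# Hypothetical true labels and predictions (not real model output)
truth     <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Confusion matrix cells, with TRUE treated as the positive class
true_pos  <- sum(truth & predicted)
false_pos <- sum(!truth & predicted)
false_neg <- sum(truth & !predicted)
true_neg  <- sum(!truth & !predicted)

# Share of all predictions that are correct
accuracy <- (true_pos + true_neg) / length(truth)

# Of the players predicted to subscribe, the share who actually did
precision <- true_pos / (true_pos + false_pos)

# Of the players who actually subscribed, the share the model found
recall <- true_pos / (true_pos + false_neg)

c(accuracy = accuracy, precision = precision, recall = recall)
```

When one class is much more common than the other, accuracy alone can be misleading, so it is worth comparing it against the share of the majority class in the test set.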
     