### Data Science 100 Project

# Classification: Predict a Top 10 Player Based on Characteristics

Author: Rhodelle Lavarias | 77150522

## Introduction

Background: Minecraft is an open-world sandbox game released in 2011 by Mojang Studios that allows players to explore and create, either by themselves or with other players. With a global player base in the millions, it has become one of the most recognisable and influential video games of all time. A research group led by Frank Wood in UBC's Computer Science program has created a Minecraft server that records players' actions and characteristics as they play. The group wants to study player behaviour, but running a large server comes with resource challenges: limited resources such as server hardware must be allocated efficiently to support the number of players online. The group therefore wants to target its recruitment efforts towards valuable users, meaning those who play a significant amount and can contribute a large quantity of data from which meaningful patterns can be drawn. To support this effort, the group wants to use data to find out how different types of players engage with the game.

My project will focus on the second of three broad questions posed by the group:

> We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

To explore this topic, I label the top 10% of players by total hours played and build a classification model that predicts whether a player belongs to that top 10% from demographic and experience data. This could reveal patterns in the characteristics of the most engaged players and help the group point its outreach efforts in the right direction.

## Research Question

The question I will investigate is:

**Can we predict whether a player is in the top 10% of total hours played based on their age, gender, and experience level?**

This question will help us find patterns in player engagement. To answer it, I will build a k-nearest neighbours (k-NN) classifier with `tidymodels` that predicts whether a player falls in the top 10% of played hours.


I chose k-nearest neighbours classification to answer the research question because it is a simple and effective non-parametric method. The model compares a new player to the k most similar players in the training data and assigns the majority class label among those neighbours. It also requires few assumptions about how the predictors relate to the class.
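To make the majority-vote idea concrete, here is a minimal base R sketch of k-NN on a toy data set. The values, the column names `age` and `experience`, and the helper `knn_predict()` are all hypothetical and only illustrate the mechanism; the actual analysis uses `tidymodels` with the `kknn` engine.

```r
# Toy training data (hypothetical values, not taken from players.csv)
train <- data.frame(
  age        = c(17, 19, 21, 22, 35, 40),
  experience = c(1, 2, 2, 3, 4, 4),
  label      = c("top", "top", "not_top", "top", "not_top", "not_top")
)
new_player <- c(age = 20, experience = 2)

# Hypothetical helper: classify one new point by majority vote of its k nearest rows
knn_predict <- function(train, new_point, k) {
  # Standardize each predictor using the training mean and standard deviation
  predictors   <- as.matrix(train[, names(new_point)])
  centers      <- colMeans(predictors)
  scales       <- apply(predictors, 2, sd)
  scaled_train <- scale(predictors, center = centers, scale = scales)
  scaled_new   <- (new_point - centers) / scales

  # Euclidean distance from the new point to every training row
  distances <- sqrt(rowSums(sweep(scaled_train, 2, scaled_new)^2))

  # Majority vote among the k closest rows
  nearest <- order(distances)[seq_len(k)]
  votes   <- table(train$label[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(train, new_player, k = 3)
```

With `k = 3`, the three closest toy players are labelled `top`, `not_top`, and `top`, so the vote returns `"top"`. Predictors are standardized first so that age, which has a larger numeric range, does not dominate the distance.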

## Data Description

In this project, I will be using the `players.csv` dataset collected from the Minecraft server hosted by the UBC group.

**Summary of Dataset:**
- **File name**: `players.csv`
- **Number of observations**: 196
- **Number of variables**: 7
- **Unit of observation**: One row per unique player
- **Goal**: Predict whether a player belongs to the **top 10%** in terms of total playtime

In [None]:
players <- read.csv("data/players.csv")

| Variable Name   | R Type      | Variable Type       | Level of Measurement | Description |
|-----------------|-------------|----------------------|-----------------------|-------------|
| `experience`    | `<chr>`     | Qualitative          | Ordinal               | Player’s skill level |
| `subscribe`     | `<lgl>`     | Qualitative          | Nominal               | TRUE/FALSE – whether the player subscribed to the newsletter |
| `hashedEmail`   | `<chr>`     | Qualitative          | Nominal               | Player ID (used to match the same players in other data) |
| `played_hours`  | `<dbl>`     | Quantitative         | Ratio                 | Total hours spent in-game  |
| `name`          | `<chr>`     | Qualitative          | Nominal               | Player’s name|
| `gender`        | `<chr>`     | Qualitative          | Nominal               | Gender |
| `Age`           | `<dbl>`     | Quantitative         | Ratio                 | Age in years  |

An immediate issue visible in the data is that `experience` is stored as text (`<chr>`) even though its levels are ranked, so we must convert it into an ordered factor.
Potential unseen issues include self-reporting bias, if players enter the information themselves, and sampling bias, since players who join this research server may behave differently than they would on a regular server.

Notes on data collection:
- All data was collected from **a custom Minecraft server** meant to log player behaviour and characteristics
- `played_hours` is likely tracked through server session logs
- `experience`, `gender`, and `Age` were likely provided by the players themselves, probably at registration
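Because several fields are self-reported, it can help to check for missing values before modelling. The following is a minimal sketch on a toy data frame; `toy_players` and its values are hypothetical stand-ins, not rows from `players.csv`.

```r
# Toy stand-in for the player table (hypothetical values)
toy_players <- data.frame(
  experience   = c("Pro", "Amateur", NA, "Veteran"),
  gender       = c("Male", "Female", "Male", NA),
  Age          = c(17, NA, 21, 30),
  played_hours = c(30.3, 0.0, 1.2, 0.5)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_players))
missing_per_column

# Number of rows that are complete for the three predictors
complete_rows <- complete.cases(toy_players[, c("experience", "gender", "Age")])
sum(complete_rows)
```

The same two calls could be run on the real table to see how many players would be lost when rows with missing predictors are dropped.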

## Method

First we load our libraries, `tidyverse` and `tidymodels`, then read the data from `players.csv` and assign it to an object named `playerdata`:

In [None]:
library(tidyverse)
library(tidymodels)
playerdata <- read_csv("data/players.csv")
glimpse(playerdata)

We must then clean and prepare the data. Players are separated by the number of hours they have played, and those in the top 10% are labelled `top`.


Clean and prepare:

In this step we use the `quantile` function to find the 90th percentile of `played_hours`, which serves as the cutoff for the top 10%. We then create a new column, `top_player`, that labels each player as `top` or `not_top` depending on whether their hours reach that cutoff.

In addition, we convert `gender`, `experience`, and `top_player` to factor variables, and store a numeric copy of `experience` as `experience_num` for use as a predictor.
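As a small worked example of the cutoff, the sketch below applies the same idea to ten hypothetical playtimes (the `hours` vector is made up for illustration).

```r
# Hypothetical playtimes in hours (not from players.csv)
hours <- c(0, 0, 0.1, 0.3, 0.5, 1, 2, 5, 20, 150)

# 90th percentile: R interpolates between the 9th and 10th sorted values
threshold <- quantile(hours, 0.9, na.rm = TRUE)
threshold

# Players at or above the cutoff are labelled "top"
label <- ifelse(hours >= threshold, "top", "not_top")
table(label)
```

With R's default quantile method the cutoff here is 20 + 0.1 × (150 − 20) = 33 hours, so only the 150-hour player is labelled `top`. Note that with heavily skewed playtimes the cutoff can sit far above the typical player.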

In [None]:
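# 90th percentile of total hours played; players at or above it are labelled "top"
threshold <- quantile(playerdata$played_hours, 0.9, na.rm = TRUE)

# Experience levels in the order used for this analysis, lowest to highest.
# Any value not listed here becomes NA and is later removed by drop_na().
exp_levels <- c("Amateur", "Regular", "Veteran", "Pro")

players_clean <- playerdata |>
  mutate(
    experience     = factor(experience, levels = exp_levels, ordered = TRUE),
    experience_num = as.integer(experience),  # 1 = Amateur, ..., 4 = Pro
    top_player     = as.factor(if_else(played_hours >= threshold, "top", "not_top")),
    gender         = as.factor(gender)
  )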

In [None]:

players_clean <- players_clean |>
  mutate(
    top_player = factor(
      top_player,
      levels = c("top", "not_top")   # ensure “top” is the first level
    )
  )


Exploratory data summary:

In [None]:
players_clean |>
  summarise(
    n_players    = n(),
    min_age      = min(Age, na.rm = TRUE),
    mean_age     = mean(Age, na.rm = TRUE),
    max_age      = max(Age, na.rm = TRUE),
    sd_age       = sd(Age, na.rm = TRUE),
    min_hours    = min(played_hours, na.rm = TRUE),
    mean_hours   = mean(played_hours, na.rm = TRUE),
    max_hours    = max(played_hours, na.rm = TRUE),
    sd_hours     = sd(played_hours, na.rm = TRUE)
  ) |>
  print()

Bar chart of average hours played by experience level:

In [None]:
avg_hours <- players_clean |>
  group_by(experience) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))


figure_1 <- ggplot(avg_hours, aes(x = experience, y = mean_hours)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(
    title = "Figure 1. Average Played Hours by Experience Level",
    x = "Experience Level",
    y = "Mean Total Hours Played"
  ) +
  theme_minimal()
figure_1

Bar chart of average hours played by gender:

In [None]:
avg_hours_gender <- players_clean |>
  group_by(gender) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))

In [None]:
figure_2 <- ggplot(avg_hours_gender, aes(x = gender, y = mean_hours)) +
  geom_bar(stat = "identity", fill = "forestgreen") +
  labs(
    title = "Figure 2. Average Played Hours by Gender",
    x     = "Gender",
    y     = "Mean Total Hours Played"
  ) +
  theme_minimal()
figure_2

In [None]:
figure_3 <- ggplot(players_clean, aes(x = played_hours)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ gender) +
  labs(
    title = "Figure 3. Distribution of Played Hours by Gender",
    x     = "Total Hours Played",
    y     = "Count"
  ) +
  theme_minimal()
figure_3

Bar graph of the mean of total hours played by age group:

In [None]:
# 1. Bin ages into 5-year groups; ages outside 10–50 become NA
players_binned <- players_clean |>
  mutate(age_group = cut(Age, breaks = seq(10, 50, by = 5), right = FALSE,
                         include.lowest = TRUE,  # keep age 50 in the last bin
                         labels = c("10–14","15–19","20–24","25–29","30–34","35–39","40–44","45–50")))

# 2. Summarize: mean hours per bin
bin_summary <- players_binned |>
  group_by(age_group) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))

# 3. Plot
figure_4 <- ggplot(bin_summary, aes(x = age_group, y = mean_hours)) +
  geom_bar(stat = "identity", fill = "orchid") +
  labs(
    title = "Figure 4. Mean Total Hours Played by Age Group",
    x     = "Age Group",
    y     = "Mean Total Hours Played"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
figure_4
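The age binning above relies on `cut()`, whose behaviour at the bin edges is easy to misread. The sketch below uses a handful of hypothetical ages placed on or outside the edges to show what happens with `right = FALSE`, with and without `include.lowest = TRUE`.

```r
# Hypothetical ages chosen to sit on or outside the bin edges
ages   <- c(9, 10, 14, 15, 49, 50, 51)
breaks <- seq(10, 50, by = 5)

# Left-closed bins [a, b): the top edge (50) is excluded by default
bins_open <- cut(ages, breaks = breaks, right = FALSE)

# include.lowest = TRUE also closes the last bin, giving [45, 50]
bins_closed <- cut(ages, breaks = breaks, right = FALSE, include.lowest = TRUE)

data.frame(ages, bins_open, bins_closed)
```

Ages 9 and 51 fall outside the breaks and become `NA` in both versions, while age 50 is `NA` only when the last bin is left open. Players with `NA` age groups appear as a separate `NA` bar in the plots unless they are filtered out.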

In [None]:
figure_5 <- ggplot(players_binned, aes(x = age_group)) +
  geom_bar(fill = "coral") +
  labs(
    title = "Figure 5. Number of Players by Age Group",
    x     = "Age Group",
    y     = "Number of Players"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
figure_5

Modeling: k-NN

In [None]:
# Remove any rows with missing predictors or missing target
players_model <- players_clean |>
  drop_na(Age, experience_num, gender, top_player)


In [None]:
set.seed(999)
split   <- initial_split(players_model, prop = 0.8, strata = top_player)
trainingset   <- training(split)
testingset <- testing(split)

player_recipe <- recipe(top_player ~ Age + gender + experience_num, data = trainingset) |>
  step_normalize(all_numeric_predictors())

knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_spec)

folds  <- vfold_cv(trainingset, v = 5, strata = top_player)
k_grid <- tibble(neighbors = 1:20)
tuned  <- tune_grid(wf, resamples = folds, grid = k_grid)
best_k <- select_best(tuned, metric = "accuracy")
best_k

In [None]:
# 1. Finalize the workflow with the best k
final_wf <- finalize_workflow(wf, best_k)

# 2. Fit the final model on the entire training set
final_fit <- fit(final_wf, data = trainingset)

# 3. Make class predictions on the test set
test_results <- predict(final_fit, testingset, type = "class") |>
  bind_cols(testingset)

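# 4. Evaluate performance; metrics() alone does not return precision or recall,
#    so request them explicitly ("top" is the first level, i.e. the positive class)
class_metrics <- metric_set(accuracy, precision, recall)
test_results |>
  class_metrics(truth = top_player, estimate = .pred_class)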

#  4a. Confusion matrix
test_results |>
  conf_mat(truth = top_player, estimate = .pred_class)

On the test set, the accuracy-tuned model correctly classified 29 non-top players.

It misclassified 4 top players as non-top.

It therefore failed to identify any of the actual top players (0 true positives, 0 predicted tops).
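The counts above translate directly into the three metrics. The sketch below works through the arithmetic, assuming the 4 misclassified players were actual top players predicted as `not_top`, which is what "0 predicted tops" implies; the variable names are illustrative.

```r
# Counts as described above, with "top" as the positive class
true_neg  <- 29  # non-top players predicted as not_top
false_neg <- 4   # top players predicted as not_top (assumed reading)
true_pos  <- 0   # top players predicted as top
false_pos <- 0   # non-top players predicted as top

accuracy  <- (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
recall    <- true_pos / (true_pos + false_neg)
precision <- true_pos / (true_pos + false_pos)  # 0 / 0, so undefined (NaN)

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
```

Accuracy is 29 / 33, about 0.879, even though recall is 0 and precision is undefined. A rule that labels every player `not_top` would reach the same accuracy, which is why accuracy alone is a poor guide for this imbalanced problem.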

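Tune for recall:

Because top players make up only about 10% of the data, a model can score high accuracy by predicting `not_top` for everyone. We therefore tune k again, this time to maximize recall for the `top` class.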


In [None]:
# 1) Use recall as the only tuning metric
recall_metrics <- metric_set(recall)

# 2) Tune k to maximize recall
set.seed(999)
tuned_recall <- tune_grid(
  wf,
  resamples = folds,
  grid      = k_grid,
  metrics   = recall_metrics
)

# 2a. Inspect best k by recall
best_k_recall <- tuned_recall |>
  select_best(metric = "recall")
best_k_recall

# 3) Finalize and fit the recall‐optimized workflow
final_wf_rec <- finalize_workflow(wf, best_k_recall)

final_fit_rec <- fit(
  final_wf_rec,
  data = trainingset  
)

# 4) Predict and evaluate on the test set

test_results_rec <- predict(final_fit_rec, testingset, type = "class") |>
  bind_cols(testingset)

# 4a. Compute accuracy, precision, and recall ("top" is the positive class)
class_metrics <- metric_set(accuracy, precision, recall)
test_results_rec |>
  class_metrics(truth = top_player, estimate = .pred_class)

# 4b. Show the confusion matrix
test_results_rec |>
  conf_mat(truth = top_player, estimate = .pred_class) |>
  autoplot(type = "heatmap") +
  labs(title = "Recall-Tuned k-NN Confusion Matrix") +
  theme_minimal()