### Data Science 100 Project

# Classification: Predict a Top 10% Player Based on Characteristics

Author: Rhodelle Lavarias | 77150522

## Introduction

Background: Minecraft is an open-world sandbox game created in 2011 by Mojang Studios, which allows players to explore and build, either alone or with other players. With a global player base in the millions, it has become one of the most recognisable and influential video games of all time. A research group led by Frank Wood in UBC's Computer Science program has created a Minecraft server that records players' actions and characteristics as they play. The group wants to study player behaviour, but running a large server comes with resource challenges: limited resources such as server hardware must be allocated efficiently to support the number of players online. The group therefore needs to target its recruitment efforts towards valuable users. Here, we define valuable users as those who play a significant amount and would accordingly contribute a large quantity of data, which helps the group find meaningful patterns. To support this effort, the group wants to learn from the data how different types of players engage with the game.

My project will focus on addressing the second of three broad questions by the group:
We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

To explore this topic, I will label the top 10% of players by hours played and build a classification model that predicts whether a player belongs to the top 10% based on demographic and experience data. This could reveal patterns in the characteristics of the most engaged players and help the group aim its outreach in the right direction.
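As a minimal sketch of the labelling idea, the snippet below marks players at or above the 90th percentile of hours as `top`. The `hours` vector is hypothetical and not taken from `players.csv`; with many tied values (for example, many players at 0 hours) the share labelled `top` will not be exactly 10%.

```r
# Hypothetical playtimes (hours) for ten players; not taken from players.csv
hours <- c(0, 0.1, 0.3, 0.5, 1, 1.5, 2, 4, 10, 150)

# 90th percentile of played hours (R's default quantile type)
threshold <- quantile(hours, probs = 0.9, na.rm = TRUE)

# Players at or above the threshold are labelled "top"
label <- factor(ifelse(hours >= threshold, "top", "not_top"),
                levels = c("top", "not_top"))

table(label)
```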

## Research Question

The question I will investigate is:

**Can we predict whether a player is in the top 10% of total hours played based on their age, gender, and experience level?**

This question will help us find patterns in player engagement and build a model that predicts top-10% membership (derived from played hours) using k-NN classification within tidymodels.


I chose k-nearest neighbours classification to answer the research question because it is a simple and effective non-parametric method. The model compares a new player to the k most similar players in the training data and assigns the majority class label among those neighbours. It also makes no strong assumptions about the shape of the relationship between the predictors and the class.
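The majority-vote idea can be sketched in base R. This is an illustration only, with hypothetical, already-scaled points and a helper `knn_vote()` that is not part of tidymodels; the `kknn` engine used later may additionally weight neighbours by distance, depending on its kernel setting.

```r
# Hypothetical, already-scaled training points (two numeric predictors)
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.1,  0.0,
                     1.2,  0.9,
                     1.0,  1.1),
                  ncol = 2, byrow = TRUE)
train_y <- c("not_top", "not_top", "not_top", "top", "top")

# Classify one new point by unweighted majority vote among its k nearest neighbours
knn_vote <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Most frequent label among those neighbours
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

knn_vote(c(0.9, 0.8), train_x, train_y, k = 3)
```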

## Data Description

In this project, I will be using the `players.csv` dataset collected from the Minecraft server hosted by the UBC group.

**Summary of Dataset:**
- **File name**: `players.csv`
- **Number of observations**: 196
- **Number of variables**: 7
- **Unit of observation**: One row per unique player
- **Goal**: Predict whether a player belongs to the **top 10%** in terms of total playtime

In [None]:
players <- read.csv("data/players.csv")

| Variable Name   | R Type      | Variable Type       | Level of Measurement | Description |
|-----------------|-------------|----------------------|-----------------------|-------------|
| `experience`    | `<chr>`     | Qualitative          | Ordinal               | Player’s skill level |
| `subscribe`     | `<lgl>`     | Qualitative          | Nominal               | TRUE/FALSE – whether the player subscribed to the newsletter |
| `hashedEmail`   | `<chr>`     | Qualitative          | Nominal               | Player ID (used to match the same players in other data) |
| `played_hours`  | `<dbl>`     | Quantitative         | Ratio                 | Total hours spent in-game  |
| `name`          | `<chr>`     | Qualitative          | Nominal               | Player’s name|
| `gender`        | `<chr>`     | Qualitative          | Nominal               | Gender |
| `Age`           | `<dbl>`     | Quantitative         | Ratio                 | Age in years  |

An immediate issue visible in the data is that `experience` is a categorical variable stored as text. We can convert it to a factor for data analysis.
Some potential unseen issues are self-reporting bias, if players enter the information themselves, and sampling bias, as players who join this server may behave differently than they would on a regular server.
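A small sketch of the factor conversion mentioned above, using hypothetical values. It shows how an ordered factor maps to integer codes, and that any label missing from the level list (here the made-up value `"Unknown"`) silently becomes `NA`, so it is worth checking the unique values in the data before fixing the levels.

```r
# Hypothetical experience labels; "Unknown" is deliberately not in the level list
experience <- c("Pro", "Amateur", "Regular", "Veteran", "Unknown")
exp_levels <- c("Amateur", "Regular", "Veteran", "Pro")

experience_f <- factor(experience, levels = exp_levels, ordered = TRUE)

# Integer codes follow the order of exp_levels; unlisted values become NA
as.integer(experience_f)
```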

Notes on data collection:
- All data was collected from **a custom Minecraft server** meant to log player behaviour and characteristics
- `played_hours` is likely tracked through server session logs
- `experience`, `gender`, and `Age` were likely provided by the players themselves, probably at registration

## Method

To answer our research question, "Can we predict whether a player is in the top 10% of total hours played based on their age, gender, and experience level?", we will create visualizations of the player characteristics to look for patterns and then build a model that predicts top players from those characteristics. Specifically, we will use k-Nearest Neighbours (k-NN) classification to test whether players in the top 10% of total hours (`played_hours`) can be detected from `Age`, `gender`, and `experience`.

Steps:
1. First, we will load the libraries and the data. After loading the data, we must clean and prepare it to ensure all variables can be used correctly. This includes creating a new column named `top_player` which classifies players based on their `played_hours`, where the players within the top 10% of hours are considered a top player (`top`) and the remaining 90% is not (`not_top`).
2. Second, we will summarize the data by finding the maximum, minimum, and the mean of the numerical variables in a table. Then, we will use a variety of visualization plots to explore relationships in our variables and use this to guide our modelling.
3. Third, we will split the data into training and testing sets using an 80/20 split stratified on `top_player`. Then we will create a recipe that normalizes the numerical predictors (`Age` and `experience_num`) and keeps the categorical predictor `gender` as a factor for the `kknn` engine.
4. Fourth, we will build a k-NN workflow with `neighbors = tune()` and run 5-fold cross-validation on the training set over k = 1:20, optimizing for accuracy first. Then we will select the best `k` by accuracy, finalize the workflow, and fit it on the full training set.
5. Finally, we predict on the test-set and report the accuracy as well as the confusion matrix.
6. (additional) We will then re-tune the model to maximize `recall`, refit the workflow, and evaluate the new metrics and confusion matrix on the test set to see whether the model improves.
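Step 3 normalizes the numeric predictors because k-NN relies on distances, and a predictor with a wide range can dominate them. The sketch below uses three hypothetical players and base R's `scale()`, which centres and scales each column in the same way `step_normalize()` is intended to.

```r
# Two hypothetical predictors on different scales: age in years, experience score 1-4
age <- c(17, 19, 40)
experience_num <- c(1, 4, 1)
raw <- cbind(age, experience_num)

# Unscaled distances from player 1: the age difference dominates
dist_raw <- as.matrix(dist(raw))[1, ]

# After centring and scaling each column, both predictors contribute comparably
dist_scaled <- as.matrix(dist(scale(raw)))[1, ]

round(dist_raw, 2)
round(dist_scaled, 2)
```

Before scaling, player 3 looks far from player 1 purely because of age; after scaling, players 2 and 3 are at a similar distance from player 1.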

## Results

### Step 1: Load, clean, prepare data

In [None]:
#Loading in libraries and data
library(tidyverse)
library(tidymodels)
playerdata <- read_csv("data/players.csv")
glimpse(playerdata) 

In [None]:
#Find the 90th percentile of played hours; players at or above it are treated as the top 10%.
#Then, create a new column `top_player` indicating whether each player is `top` or `not_top`.
threshold <- quantile(playerdata$played_hours, 0.9, na.rm = TRUE)

#In addition, convert `gender`, `experience` and `top_player` to factors and derive a numeric `experience_num` score.
#Note: any `experience` value not listed in `exp_levels` becomes NA and is removed later by drop_na().
exp_levels <- c("Amateur", "Regular", "Veteran", "Pro")

#Keep `experience` in the selection, since the Step 2 summaries group by it.
players_clean <- playerdata |>
  mutate(experience = factor(experience, levels = exp_levels, ordered = TRUE),
         experience_num = as.integer(experience),
         top_player  = as.factor(if_else(played_hours >= threshold, "top", "not_top")),
         gender      = as.factor(gender)) |>
  select(Age, gender, experience, experience_num, played_hours, top_player)

In [None]:

#Put "top" first so that yardstick treats it as the positive class for precision and recall
players_clean <- players_clean |>
  mutate(top_player = factor(top_player, levels = c("top", "not_top")))


### Step 2: Exploratory data Summary and Visualizations

In [None]:
#Compute a summary table (min, mean, max, SD) for `Age` and `played_hours`
players_clean |>
  summarise(
    n_players    = n(),
    min_age      = min(Age, na.rm = TRUE),
    mean_age     = mean(Age, na.rm = TRUE),
    max_age      = max(Age, na.rm = TRUE),
    sd_age       = sd(Age, na.rm = TRUE),
    min_hours    = min(played_hours, na.rm = TRUE),
    mean_hours   = mean(played_hours, na.rm = TRUE),
    max_hours    = max(played_hours, na.rm = TRUE),
    sd_hours     = sd(played_hours, na.rm = TRUE)
  ) |>
  print()

#### 2A: Exploring the relationship between played hours and experience level

In [None]:
avg_hours <- players_clean |>
  group_by(experience) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))


figure_1 <- ggplot(avg_hours, aes(x = experience, y = mean_hours)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(
    title = "Figure 1. Average Played Hours by Experience Level",
    x = "Experience Level",
    y = "Mean Total Hours Played"
  ) +
  theme_minimal()
figure_1

**Figure 1: Average Played Hours by Experience Level - The graph above shows that "regulars" have spent the most hours in the game on average. There is a large gap between first and second place, with "regulars" followed by "amateurs" and then "pros".**

#### 2B: Exploring the relationship between played hours and gender

In [None]:
avg_hours_gender <- players_clean |>
  group_by(gender) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))

In [None]:
figure_2a <- ggplot(avg_hours_gender, aes(x = gender, y = mean_hours)) +
  geom_bar(stat = "identity", fill = "forestgreen") +
  labs(
    title = "Figure 2a. Average Played Hours by Gender",
    x     = "Gender",
    y     = "Mean Total Hours Played"
  ) +
  theme_minimal()
figure_2a

**Figure 2a: Average Played Hours by Gender - In the graph above, players who identify as non-binary have the most hours played on average, with females second and agender players third.**

In [None]:
figure_2b <- ggplot(players_clean, aes(x = played_hours)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ gender) +
  labs(
    title = "Figure 2b. Distribution of Played Hours by Gender",
    x     = "Total Hours Played",
    y     = "Count"
  ) +
  theme_minimal()
figure_2b

**Figure 2b: Distribution of Played Hours by Gender - The plots above help make sense of Figure 2a. Although males have a low number of hours on average, the male group is much larger than any other gender group on the server, and the vast majority of its members play few hours. The non-binary group, by contrast, is very small, so the one observation with over 200 hours has a large effect on its mean. The same applies to the female group, but to a lesser extent.**
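The effect of a single heavy player on a small group's mean can be checked with a quick calculation. The values below are hypothetical and only illustrate why the median is a useful companion to the mean here.

```r
# Hypothetical hours for a small group with one very heavy player
small_group <- c(0.2, 0.5, 1, 2, 220)

mean(small_group)    # pulled upward by the single large value
median(small_group)  # closer to what a typical player in the group logs
```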

#### 2C: Exploring the relationship between played hours and age

In [None]:
figure_3a <- ggplot(players_clean, aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.6, size = 2, color = "steelblue") +
  labs(
    title = "Figure 3a. Scatter Plot of Age vs. Total Hours Played",
    x     = "Age (years)",
    y     = "Total Hours Played"
  )
figure_3a

**Figure 3a: Scatter Plot of Age vs. Total Hours Played - In the plot above, it is hard to determine a relationship between age and total hours played. The points are mostly scattered, although they are concentrated in the 15-28 age range. Other than that, no relationship is apparent.**

In [None]:
#bin players into different age groupings:10–14,15–19,20–24,25–29,30–34,35–39,40–44,45–50
#include.lowest = TRUE keeps players aged exactly 50 in the last bin instead of turning them into NA
players_binned <- players_clean |> 
  mutate(age_group = cut(Age, breaks = seq(10, 50, by = 5), right = FALSE, include.lowest = TRUE,
                         labels = c("10–14","15–19","20–24","25–29","30–34","35–39","40–44","45–50")))

bin_summary <- players_binned |>
  group_by(age_group) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))

figure_3b <- ggplot(bin_summary, aes(x = age_group, y = mean_hours)) +
  geom_bar(stat = "identity", fill = "orchid") +
  labs(
    title = "Figure 3b. Mean Total Hours Played by Age Group",
    x     = "Age Group",
    y     = "Mean Total Hours Played" ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
figure_3b

**Figure 3b. Mean Total Hours Played by Age Group - The graph above shows that players aged 15-19 play the most hours on average, followed by the 45-50 group and then the 20-24 group. This suggests that teenagers aged 15-19 are the most engaged group. The high mean for the 45-50 group may instead reflect a small sample size.**
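One way to judge whether a bin's mean rests on a small sample is to count the players in each bin. The sketch below uses hypothetical ages with the same breaks as above; `include.lowest = TRUE` keeps an age of exactly 50 in the last bin.

```r
# Hypothetical ages, including the boundary values 15 and 50
ages <- c(12, 15, 17, 19, 22, 49, 50)

# Half-open bins [10,15), [15,20), ..., with the last bin closed at 50
age_group <- cut(ages, breaks = seq(10, 50, by = 5), right = FALSE,
                 include.lowest = TRUE)

# Number of players per bin; bins with one or two players give unstable means
table(age_group)
```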

### Step 3: Split into training and testing set and preprocess data

In [None]:
# Remove any rows with missing predictors or missing target
players_model <- players_clean |>
  drop_na(Age, experience_num, gender, top_player)


In [None]:
set.seed(999)
#split data into separate sets for training and testing
split   <- initial_split(players_model, prop = 0.8, strata = top_player) #80/20 stratified split 
trainingset   <- training(split)
testingset <- testing(split)
#Define a recipe using `top_player` as the response and Age + gender + experience_num as predictors 
player_recipe <- recipe(top_player ~ Age + gender + experience_num, data = trainingset) |>
  step_normalize(all_numeric_predictors()) #scale `Age` and `experience_num`

### Step 4 and 5: Build a k-NN workflow, fit on training set, and predict on testing set

In [None]:
set.seed(999)
#Build a k-NN workflow with `neighbors = tune()` for classification
knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_spec)

#5-fold cross-validation over `k = 1:20`, optimized for accuracy
folds  <- vfold_cv(trainingset, v = 5, strata = top_player) 
k_grid <- tibble(neighbors = 1:20)
tuned  <- tune_grid(wf, resamples = folds, grid = k_grid)
#Name the `metric` argument explicitly; recent versions of tune expect it to be named
best_k <- select_best(tuned, metric = "accuracy")
best_k



In [None]:
#Finalize the workflow with the best k
final_wf <- finalize_workflow(wf, best_k)

#Fit the final model on the entire training set
final_fit <- fit(final_wf, data = trainingset)

#Make class predictions on the test set
test_results <- predict(final_fit, testingset, type = "class") |>
  bind_cols(testingset)

#Compute accuracy, precision, recall
my_metrics <- metric_set(accuracy, precision, recall)

test_results |>
  my_metrics(truth = top_player, estimate = .pred_class)

#Confusion matrix
test_results |>
  conf_mat(truth = top_player, estimate = .pred_class)

The model correctly classified all 29 non-top players.

It misclassified all 4 actual top players as non-top.

It therefore failed to identify any of the actual top players (0 true positives, 0 predicted tops).
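The metrics can be worked out by hand from these counts. The sketch below assumes the reading consistent with "0 predicted tops": 29 non-top players predicted correctly, 4 top players predicted as `not_top`, and `top` as the positive class. Under that assumption, accuracy looks high simply because the majority class dominates, which is why recall is the more informative metric here.

```r
# Counts read from the test-set results (assumed layout, positive class = "top")
tp <- 0   # top players predicted as top
fn <- 4   # top players predicted as not_top
fp <- 0   # non-top players predicted as top
tn <- 29  # non-top players predicted as not_top

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)   # 0 / 0 gives NaN: no player was predicted as top

c(accuracy = accuracy, recall = recall, precision = precision)
```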

### Step 6 (additional): Tune for recall


In [None]:
recall_metrics <- metric_set(recall)
# Re-ran `tune_grid()` to pick `k` that maximizes recall.
set.seed(999)
tuned_recall <- tune_grid(
  wf,                
  resamples = folds, 
  grid      = k_grid,
  metrics   = recall_metrics
)

#Inspect best k by recall
best_k_recall <- tuned_recall |>
  select_best(metric = "recall")
best_k_recall

#Finalize and fit the recall‐optimized workflow
final_wf_rec <- finalize_workflow(wf, best_k_recall)

final_fit_rec <- fit(final_wf_rec, data = trainingset)

#Predict and evaluate on the test set

test_results_rec <- predict(final_fit_rec, testingset, type = "class") |>
  bind_cols(testingset)

#Compute accuracy, precision, and recall
test_results_rec |>
  my_metrics(truth = top_player, estimate = .pred_class)

#Confusion matrix
test_results_rec |>
  conf_mat(truth = top_player, estimate = .pred_class) |>
  autoplot(type = "heatmap") +
  labs(title = "Recall-Tuned k-NN Confusion Matrix") +
  theme_minimal()

In [None]:
test_results_rec |>
  recall(truth = top_player, estimate = .pred_class)