# **DSCI 100 Final Project**

**Section**: 004\
**Group**: 30\
**Members**: Victoria Chen (66263492), Saije Hans (73101313), Zhian Zhou (77230522), Charlotte Chen (60779865)

# <u>Introduction</u>

A research group in Computer Science at UBC has set up a Minecraft server to tackle the problem of predicting usage of a video game research server. Using the information in the given datasets, `players.csv` and `sessions.csv`, the following broad question can be explored: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

In this study, we narrow the broad question above to a specific one: can played hours and age predict whether a player subscribes to the newsletter? Our dataset of choice is the `players.csv` table, which links player characteristics to subscription status and therefore contains the information required to answer our question. We use three variables to reach a conclusion: `played_hours`, `Age`, and `subscribe`.

1. `played_hours`: number of hours each player has played on the server
2. `Age`: age of each player in years
3. `subscribe`: whether or not the player is subscribed to a game-related newsletter

The aim of our project is to build a model that tests whether the number of hours played on this server and the age of the player can be used to predict if a player will subscribe to the newsletter. We excluded variables in the dataset that were not relevant to the characteristics we wanted to evaluate. In addition to the variables listed above, the dataset includes the following:

- hashed email
- player name
- player experience
- player gender

These were all excluded from our analysis. We believe that played hours and age are the predictors best suited to our question, with subscription status as the response variable.

## Background

## Question

### Broad Question

> What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Specific Question

> Can `played_hours` and `Age` predict the value of `subscribe` in `players.csv`?

## Dataset Description

# <u>Methods & Results</u>

## 1. Loading Data

In [None]:
# Load tidyverse package
library(tidyverse)

In [None]:
# Download and read dataset
download.file("https://raw.githubusercontent.com/vichen15/dsci100-004-30-final-project/refs/heads/main/players.csv", destfile = "players.csv")
players <- read_csv("players.csv")
head(players)

## 2. Wrangling Data

In [None]:
# Select relevant variables
players <- players |>
  select(Age, played_hours, subscribe)

In [None]:
# Factorize response variable
players <- players |>
  mutate(
    subscribe = factor(subscribe)
  )

In [None]:
# Remove NA values from data
players <- players |> 
  filter(!is.na(Age), !is.na(played_hours))

head(players)

## 3. Data Summary

In [None]:
# Calculate summary statistics for each variable
summary(players)

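Using the `summary` command, we notice that:
- The most prevalent experience level of players is "Amateur", followed by "Veteran". (`experience` is not among the columns kept in the wrangling step, so this observation refers to the full `players.csv` table.)
- The proportion of individuals subscribed to the newsletter is 144 / 196 = 73.47%.
- The mean number of hours played is 5.846, yet the median is only 0.1, so the distribution of `played_hours` is strongly right-skewed.
- The mean (21.14) and median (19.00) ages are both approximately 20 years.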
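
The class proportion and the mean-versus-median comparison above can also be computed directly instead of being read off the `summary` output. The following is a minimal sketch, assuming a small made-up table called `toy_players` (a hypothetical stand-in whose values are illustrative only and are not taken from `players.csv`):

```r
library(dplyr)

# Hypothetical stand-in for the wrangled `players` table (illustrative values only)
toy_players <- tibble(
  Age = c(17, 21, 19, 25, 30),
  played_hours = c(0, 0.1, 3.5, 150, 0.4),
  subscribe = factor(c("FALSE", "TRUE", "TRUE", "TRUE", "FALSE"))
)

# Proportion of rows in each subscription class
subscribe_props <- toy_players |>
  count(subscribe) |>
  mutate(proportion = n / sum(n))

# Mean vs. median hours: a mean far above the median suggests right skew
hours_summary <- toy_players |>
  summarise(
    mean_hours = mean(played_hours),
    median_hours = median(played_hours)
  )

subscribe_props
hours_summary
```

Swapping `toy_players` for the real `players` table would give the corresponding figures for the project data.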

## 4. Exploratory Visualization

In [None]:
# Set-up
options(repr.plot.width = 15, repr.plot.height = 5)
library(patchwork)

# Graph the relationship between time played, player age, and subscription status
time_age_1 <- players |>
  ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point() +
  labs(title = "Time played vs. Player age",
       x = "Player's age (years)", 
       y = "Time played on server (hours)",
       color = "Subscribed to newsletter") +
  scale_color_manual(values = c("red", "green"))

# Remove extreme outliers (over 100 hours) and players with zero recorded hours
played_hours_trimmed <- players |>
  filter(played_hours > 0, played_hours <= 100)

# Graph the relationship again with the trimmed data
time_age_2 <- played_hours_trimmed |>
  ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point() +
  labs(title = "Time played vs. Player age (outliers removed)",
       x = "Player's age (years)", 
       y = "Time played on server (hours)",
       color = "Subscribed to newsletter") +
  scale_color_manual(values = c("red", "green"))

# Print graphs side-by-side
time_age_1 + time_age_2

**Time played vs. Player age**
- We first created a scatterplot of `played_hours` against `Age`, colouring the points red or green based on the value of `subscribe`.
- Almost all of the points were near zero on the y-axis, and most were concentrated towards the younger end of the x-axis as well.
- A few high outliers of `played_hours` stretched the y-axis and made the rest of the values difficult to see.
- We were unable to draw any conclusions from this graph.

**Time played vs. Player age (outliers removed)**
- To create this graph, we removed all points where `played_hours` exceeded 100 to get rid of the outliers, and also dropped players with zero recorded hours.
- After zooming in, we found that the majority of the data actually sat below 10 hours on the y-axis.
- There was a cluster of points between 0 and 10 hours on the y-axis and between 10 and 30 years on the x-axis.
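
To see how much data the trimming rule in the cell above discards (`played_hours > 0` and `played_hours <= 100`), each row can be labelled by the reason it is kept or dropped. This is a minimal sketch, assuming a made-up table `toy_players` (hypothetical, with illustrative values only) rather than the real `players` data:

```r
library(dplyr)

# Hypothetical stand-in for the wrangled `players` table (illustrative values only)
toy_players <- tibble(
  Age = c(17, 21, 19, 25, 30, 22),
  played_hours = c(0, 0.1, 3.5, 150, 0, 48.2)
)

# Label each row by how the trimming rule treats it, then count each group
trim_summary <- toy_players |>
  mutate(status = case_when(
    played_hours == 0 ~ "dropped: zero hours",
    played_hours > 100 ~ "dropped: over 100 hours",
    TRUE ~ "kept"
  )) |>
  count(status)

trim_summary
```

Running the same labelling on the real table would show whether most of the removed rows are zero-hour players or high outliers, which matters when interpreting the trimmed plot.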

In [None]:
# Re-read the full table and drop identifying columns
table_1 <- read_csv("players.csv")
table_1_tidy <- table_1 |>
  select(-hashedEmail, -name)

# Bar chart of the mean played hours at each age
figure_1 <- ggplot(table_1_tidy, aes(x = Age, y = played_hours, fill = Age)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(
    x = "Player's Age (Years)",
    y = "Average Played Hours",
    title = "Figure 1: Average Played Hours by Age"
  ) +
  theme_minimal()
figure_1

## 5. Data Analysis

In [None]:
# Load modelling packages (tidymodels includes rsample) and set the seed for reproducibility
library(tidyverse)
library(tidymodels)
set.seed(2025)


In [None]:
# Split the data into 75% training and 25% testing sets, stratified by subscription status
players_split <- initial_split(players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)


In [None]:
# Find the best k using 5-fold cross-validation on the training set only,
# so that the test set plays no part in preprocessing or tuning
players_recipe <- recipe(subscribe ~ ., data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Candidate values of k to try
k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validation accuracy
best_k <- accuracies |>
  arrange(desc(mean)) |>
  head(1) |>
  pull(neighbors)

best_k


In [None]:
# Train a new model with the best k found above
knn_new_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_new_spec) |>
  fit(data = players_train)

# Predict on the test set and attach the true labels
players_test_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

# Accuracy on the test set
accuracy <- players_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

accuracy


In [None]:
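# Precision and recall: first check the order of the factor levels of the
# response, which determines which level is treated as the "positive" class
level <- players_test_predictions |>
  pull(subscribe) |>
  levels()

level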
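
Once the level order is known, precision and recall can be computed with the `precision()` and `recall()` functions from `yardstick` (loaded with `tidymodels`). The following is a minimal sketch, not the project's final analysis: it assumes that `TRUE` (subscribed) is the class of interest and that it is the second factor level, and it uses a made-up table `toy_predictions` (hypothetical, illustrative values only) in place of `players_test_predictions`.

```r
library(dplyr)
library(yardstick)

# Hypothetical predictions standing in for `players_test_predictions`
toy_predictions <- tibble(
  subscribe = factor(
    c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE"),
    levels = c("FALSE", "TRUE")
  ),
  .pred_class = factor(
    c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
    levels = c("FALSE", "TRUE")
  )
)

# Assumption: "TRUE" is the positive class and is the second level,
# so event_level = "second" is passed explicitly
toy_precision <- toy_predictions |>
  precision(truth = subscribe, estimate = .pred_class, event_level = "second")

toy_recall <- toy_predictions |>
  recall(truth = subscribe, estimate = .pred_class, event_level = "second")

# Confusion matrix showing the counts behind both metrics
toy_conf_mat <- toy_predictions |>
  conf_mat(truth = subscribe, estimate = .pred_class)

toy_precision
toy_recall
toy_conf_mat
```

Because roughly three quarters of players are subscribed, accuracy alone can look good for a classifier that predicts "subscribed" for everyone, so reporting precision and recall alongside it gives a fuller picture.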


## 6. Analysis Visualization

# <u>Discussion</u>

## Summary

## Expectations vs. Results

## Impact

## Future Questions

# <u>References</u>