In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [3]:
set.seed(1)

url_pl <- "https://raw.githubusercontent.com/takemil8088/ind-porject/refs/heads/main/players.csv"
# Keep the columns of interest, drop rows with a missing value in any of them,
# and turn the logical response into a factor for classification.
players <- read_csv(url_pl) |>
  select(experience, subscribe, played_hours, gender, Age) |>
  drop_na() |>
  mutate(subscribe = as_factor(subscribe))
players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,played_hours,gender,Age
<chr>,<fct>,<dbl>,<chr>,<dbl>
Pro,TRUE,30.3,Male,9
Veteran,TRUE,3.8,Male,17
Veteran,FALSE,0.0,Male,17
Amateur,TRUE,0.7,Female,21
Regular,TRUE,0.1,Male,21
Amateur,TRUE,0.0,Female,17
Regular,TRUE,0.0,Female,19
Amateur,FALSE,0.0,Male,21
Amateur,TRUE,0.1,Male,17
Veteran,TRUE,0.0,Female,22


In [13]:
split <- initial_split(players, prop = 0.75, strata = subscribe)
train <- training(split)
test <- testing(split)

In [14]:
vfold <- vfold_cv(train, v = 5, strata = subscribe)

recipe <- recipe(subscribe ~ played_hours + Age, data = train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = vfold) |>
  collect_metrics()

knn_fit

p_vfold <- vfold_cv(train, v = 10, strata = subscribe)

vfold_metrics <- workflow() |>
  add_recipe(recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = p_vfold) |>
  collect_metrics()

vfold_metrics

.metric,.estimator,mean,n,std_err,.config
<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
accuracy,binary,0.5586371,5,0.04645523,Preprocessor1_Model1
roc_auc,binary,0.4940476,5,0.04679344,Preprocessor1_Model1


.metric,.estimator,mean,n,std_err,.config
<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
accuracy,binary,0.5862271,10,0.02813311,Preprocessor1_Model1
roc_auc,binary,0.5298864,10,0.04608404,Preprocessor1_Model1


In [17]:
knn_spe <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() |>
  add_recipe(recipe) |>
  add_model(knn_spe) |>
  tune_grid(resamples = p_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies

neighbors,.metric,.estimator,mean,n,std_err,.config
<dbl>,<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
1,accuracy,binary,0.4551282,10,0.041352745,Preprocessor1_Model01
6,accuracy,binary,0.6338462,10,0.021128487,Preprocessor1_Model02
11,accuracy,binary,0.580989,10,0.041527343,Preprocessor1_Model03
16,accuracy,binary,0.669707,10,0.020215006,Preprocessor1_Model04
21,accuracy,binary,0.7312088,10,0.005090213,Preprocessor1_Model05
26,accuracy,binary,0.7312088,10,0.005090213,Preprocessor1_Model06
31,accuracy,binary,0.7312088,10,0.005090213,Preprocessor1_Model07
36,accuracy,binary,0.7312088,10,0.005090213,Preprocessor1_Model08
41,accuracy,binary,0.7312088,10,0.005090213,Preprocessor1_Model09
46,accuracy,binary,0.7312088,10,0.005090213,Preprocessor1_Model10


In [18]:
best_k <- accuracies |>
  arrange(desc(mean)) |>
  head(1) |>
  pull(neighbors)
best_k
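
Several values of k share the highest mean accuracy in the tuning table above (0.7312088 for every displayed k from 21 to 46), so the selected value depends on how ties are resolved. The following is a minimal sketch, assuming the smallest tied k is the preferred choice; `accuracies_demo` and `best_k_demo` are hypothetical names, and the tibble only copies the ten rows displayed above rather than the full tuning grid.

```r
library(dplyr)

# Accuracy rows copied from the displayed tuning output (first ten grid values).
accuracies_demo <- tibble(
  neighbors = c(1, 6, 11, 16, 21, 26, 31, 36, 41, 46),
  mean = c(0.4551282, 0.6338462, 0.580989, 0.669707, 0.7312088,
           0.7312088, 0.7312088, 0.7312088, 0.7312088, 0.7312088)
)

# Sort by accuracy (highest first), then by k (smallest first), so that a tie
# in accuracy is always broken in favour of the smallest k.
best_k_demo <- accuracies_demo |>
  arrange(desc(mean), neighbors) |>
  slice(1) |>
  pull(neighbors)

best_k_demo
```

Sorting on `neighbors` as a second key makes the tie-break explicit instead of relying on the row order of the tuning results.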

In [21]:
precipe <- recipe(subscribe ~ played_hours + Age, data = train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_sp <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fitp <- workflow() |>
  add_recipe(precipe) |>
  add_model(knn_sp) |>
  fit(data = train)

knn_fitp

══ Workflow [trained] ══════════════════════════════════════════════════════════
[3mPreprocessor:[23m Recipe
[3mModel:[23m nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_scale()
• step_center()

── Model ───────────────────────────────────────────────────────────────────────

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(21,     data, 5), kernel = ~"rectangular")

Type of response variable: nominal
Minimal misclassification: 0.2689655
Best kernel: rectangular
Best k: 21

In [25]:
p_predictions <- predict(knn_fitp, test) |>
  bind_cols(test)

p_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

p_predictions |> pull(subscribe) |> levels()

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.7346939


In [26]:
p_predictions |>
  precision(truth = subscribe, estimate = .pred_class, event_level = "second")

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
precision,binary,0.7346939


In [27]:
p_predictions |>
  recall(truth = subscribe, estimate = .pred_class, event_level = "second")

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
recall,binary,1
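
The three test-set metrics above appear consistent with a classifier that predicts TRUE for every test player. A recall of 1 means there are no false negatives, and a precision equal to the accuracy is then only possible if there are no true negatives either. The sketch below reproduces the reported values with base R under assumed counts: 49 test rows, of which 36 are subscribers, inferred from 36/49 ≈ 0.7346939. These counts are an inference from the printed metrics, not figures taken from the notebook.

```r
# Assumed counts, inferred from the reported metrics: 49 test rows,
# 36 subscribers, 13 non-subscribers, and every prediction equal to TRUE.
truth    <- factor(c(rep(TRUE, 36), rep(FALSE, 13)), levels = c("FALSE", "TRUE"))
estimate <- factor(rep(TRUE, 49), levels = c("FALSE", "TRUE"))

# Rows are predictions, columns are true classes.
conf <- table(estimate = estimate, truth = truth)

tp <- conf["TRUE", "TRUE"]    # predicted TRUE, actually TRUE
fp <- conf["TRUE", "FALSE"]   # predicted TRUE, actually FALSE
fn <- conf["FALSE", "TRUE"]   # predicted FALSE, actually TRUE
tn <- conf["FALSE", "FALSE"]  # predicted FALSE, actually FALSE

accuracy  <- (tp + tn) / sum(conf)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

c(accuracy = accuracy, precision = precision, recall = recall)
```

If this reading is right, the tuned model performs no better on the test set than always guessing the majority class, which is worth checking with a confusion matrix on the real predictions.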


Introduction:


The gaming industry grows every year, and there is increasing interest in studying player behaviour to make recruitment more targeted. To do this, researchers need to predict which kinds of players are most likely to contribute to a game, determine which player characteristics and behaviours best predict subscribing to a game-related newsletter, and understand how these features differ between player types. This project asks a specific question: can a player's play hours and age predict whether that player subscribes to the Minecraft newsletter? To answer it, we used data collected from a Minecraft server, which contains information about individual players and records whether each one subscribes to the newsletter. The dataset, named "players", has 196 observations and 7 variables: experience level (experience: Amateur, Beginner, Pro, Regular, Veteran), subscription status (subscribe: TRUE if subscribed to the newsletter, FALSE if not), a hashed email that uniquely identifies each player (hashedEmail), hours played (played_hours), player name (name), gender (gender), and age (Age).
Of the 196 players, 144 are subscribed and 52 are not. Play hours range from 0 to 223.1, and ages range from 8 to 50 years.
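
The class counts and ranges quoted above could be computed along the following lines. This is a minimal sketch rather than the code used for the report: `players_demo` is a hypothetical five-row stand-in built from rows shown in the data preview, so its output will not match the full-dataset figures.

```r
library(dplyr)

# Hypothetical stand-in for the players data (five rows from the preview).
players_demo <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.0),
  Age          = c(9, 17, 17, 21, 21)
)

# Number of players in each subscription class.
subscribe_counts <- players_demo |>
  count(subscribe)

# Minimum and maximum of the two numeric predictors.
ranges <- players_demo |>
  summarise(
    min_hours = min(played_hours, na.rm = TRUE),
    max_hours = max(played_hours, na.rm = TRUE),
    min_age   = min(Age, na.rm = TRUE),
    max_age   = max(Age, na.rm = TRUE)
  )

subscribe_counts
ranges
```

Running the same two pipelines on the unfiltered `players` data would be one way to check the 144/52 split and the reported ranges.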
