
Index for Ordering Project
1. Load Dataset
2. Tuning to get optimal K (highest accuracy)
3. Present final statistics with K
4. Showing why 10-fold (higher accuracy) is better than 5-fold
5. Precision, Recall, etc.

In [8]:
# 1. Load Dataset
library(tidyverse)
library(repr)
library(tidymodels)

set.seed(1)

url_pl <- "https://raw.githubusercontent.com/takemil8088/ind-porject/refs/heads/main/players.csv"
players <- read_csv(url_pl) |>
    select(experience, subscribe, played_hours, gender, Age) |>
    drop_na() |>  # drop rows with a missing value in any selected column
    mutate(subscribe = as_factor(subscribe)) |>
    rename(age = Age)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [23]:
# 2. Tuning to get optimal K (highest accuracy)
set.seed(1)  # reseed in this cell so reruns reproduce the same split and folds
split <- initial_split(players, prop = 0.80, strata = subscribe)
train <- training(split)
test <- testing(split)

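# Standardize both predictors so neither dominates the distance calculation
recipe <- recipe(subscribe ~ played_hours + age, data = train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN classifier with unweighted votes; neighbors is left for tune_grid() to fill in
knn_spe <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 10-fold cross-validation on the training set, stratified by the outcome
p_vfold <- vfold_cv(train, v = 10, strata = subscribe)

# Candidate values of K: every integer from 1 to 100
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 1))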

knn_results <- workflow() |>
  add_recipe(recipe) |>
  add_model(knn_spe) |>
  tune_grid(resamples = p_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)

best_k
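
Note that `arrange(desc(mean)) |> head(1)` returns only the first row when several K values share the top mean accuracy. A minimal sketch of listing every tied K instead, assuming `dplyr` and `tibble` are installed. The table `toy_accuracies` and its values are made up for illustration and only mimic a subset of the columns that `collect_metrics()` returns; they are not results from this project.

```r
library(dplyr)
library(tibble)

# Toy accuracy table with made-up values (hypothetical, for illustration only),
# shaped like collect_metrics() output filtered to .metric == "accuracy".
toy_accuracies <- tibble(
  neighbors = c(17, 18, 19, 20, 21),
  mean      = c(0.730, 0.742, 0.741, 0.742, 0.742),
  std_err   = c(0.020, 0.021, 0.019, 0.022, 0.020)
)

# Keep every K tied for the highest mean accuracy, not just the first row.
tied_best <- toy_accuracies |>
  slice_max(mean, n = 1, with_ties = TRUE)

sort(tied_best$neighbors)  # 18 20 21
```

If several K values tie or sit within one standard error of each other, the single "best" K is sensitive to small changes in the folds, so it may be worth reporting the tied set alongside `best_k`.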

In [21]:
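# Why does seq(from = 1, to = 50, by = 1) give K = 18 instead of K = 21?
# Likely cause: initial_split() and vfold_cv() draw rows at random, so unless
# set.seed() runs immediately before them, each rerun of the tuning cell uses a
# different split and different folds, and the K with the top mean accuracy can shift.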
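
One possible way to probe the question above, sketched with base R only. Here `sample()` stands in for the random row assignment that `initial_split()` and `vfold_cv()` perform, and the helper `draw_split()` is hypothetical, not part of tidymodels. The row count 196 is taken from the `read_csv()` output, before any rows with missing values are dropped.

```r
# Hypothetical helper: randomly pick a proportion of n row indices for training,
# standing in for the random assignment inside initial_split().
draw_split <- function(n, prop = 0.80) {
  sort(sample(seq_len(n), size = floor(prop * n)))
}

n_rows <- 196  # rows in players.csv per the read_csv() output (an assumption for this sketch)

# Reseeding before each draw reproduces the same training rows.
set.seed(1)
first_run <- draw_split(n_rows)
set.seed(1)
rerun_with_seed <- draw_split(n_rows)

# Drawing again without reseeding continues the random stream,
# so the training rows (and any K tuned on them) can change.
rerun_without_seed <- draw_split(n_rows)

identical(first_run, rerun_with_seed)     # TRUE
identical(first_run, rerun_without_seed)  # expected to differ from the first run
```

If the same pattern holds in the notebook, rerunning the tuning cell without reseeding would change the split and folds even when only the K grid was edited, which could explain K moving between 18 and 21.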