In [None]:
# Load the required packages and the data
library(tidyverse)
library(repr)
library(tidymodels)

players <- read_csv("data/players.csv")
summary(players)
sessions <- read_csv("data/sessions.csv")
summary(sessions)


**Data Description**


**Players:**
- There are 196 observations and 7 variables
- Summary statistics:

| Variable | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum | NA's |
| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| played_hours | 0.00 | 0.00 | 0.10 | 5.85 | 0.60 | 223.10 | 0 |
| Age | 9.00 | 17.00 | 19.00 | 21.14 | 22.75 | 58.00 | 2 |

|  Variable  |    Type    |  Meaning   |
| ---------- | ---------- | ---------- |
| experience | Character  | Gaming experience <br> (beginner, amateur, pro, veteran) |
| subscribe | Logical | Whether the player is subscribed to the newsletter |
| hashedEmail | Character | Anonymous unique user id used to link players to their sessions |
| played_hours | Numeric | Total number of hours the player has spent on the server |
| name | Character | First name of player |
| gender | Character | Gender of player |
| Age | Numeric | Age of player |

Issues:
- The Age variable has two missing values
- Most players recorded almost no playtime, while a few accumulated very high totals
     - This skew may influence model performance
- Information is self-reported, so players may have provided inaccurate ages or experience levels
- The response variable subscribe is imbalanced: there are more TRUE than FALSE values
- hashedEmail is anonymized, which can make it harder to verify duplicate players
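
The first and fourth issues can be checked directly. The sketch below is a minimal illustration in base R on a small made-up data frame (`toy_players` and its values are hypothetical, not rows from players.csv); the same two calls should work on the real `players` data.

```r
# Hypothetical toy data standing in for the players dataset
toy_players <- data.frame(
  subscribe    = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
  played_hours = c(0, 0.1, 30.3, 0, 3.8, 0.7),
  Age          = c(17, NA, 21, 19, NA, 22)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_players))
na_counts

# Share of each class in the response variable
class_share <- prop.table(table(toy_players$subscribe))
class_share
```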
      

**Question**

Broad Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific: Can a player's experience level and total hours played predict whether they subscribe to the newsletter, using the players dataset?

- The players dataset contains the response variable subscribe and the explanatory variables played_hours and experience
     - the goal is to predict whether a player subscribes to the newsletter based on the number of hours played and their skill level
- subscribe is categorical, so this is a classification problem
    - k-nearest neighbors (k-NN) classification is appropriate

Plan:
- tidy the players dataset by making sure that experience is stored as a categorical variable and that played_hours is numeric
- check for missing values and remove them
- only the players dataset is needed, since it contains all the variables required
- apply k-NN classification to answer the question after wrangling the data

**Exploratory Data Analysis and Visualization**

| Variable |   Mean  |
| -------- | --------|
|   Age    |   21.14 |
| played_hours | 5.85 |



In [None]:
# Tidy the data: store the response and experience as factors
players <- players |>
  mutate(
    subscribe  = as_factor(subscribe),
    experience = as_factor(experience)
  )
players

In [None]:
# Mean value for Age and played_hours
mean_players <- players |>
                summarize(played_hours_mean = round(mean(played_hours, na.rm = TRUE), 2),
                          Age_mean = round(mean(Age, na.rm = TRUE), 2))
mean_players

In [None]:
# Distribution of Total Hours Played

total_hours_played_plot <- players |>
                        ggplot(aes(x = played_hours)) +
                        geom_histogram(bins = 30, fill = "blue") +
                        labs(x = "Total Hours Played", y = "Count") +
                        ggtitle("Distribution of Total Hours Played")
total_hours_played_plot

- The majority of players have very low playtimes
    - very few players record extremely high values
        - these extreme values can dominate the distance calculations in k-NN, so played_hours needs to be standardized before modelling
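
As a rough illustration of what standardizing does, the sketch below rescales a small vector of hours to mean 0 and standard deviation 1 in base R. The values in `hours` are hypothetical and only mimic the skew seen above. Note that standardizing recentres and rescales the variable but does not change the shape of its distribution, so the skew itself remains.

```r
# Hypothetical playtimes with one extreme value, mimicking the skew in played_hours
hours <- c(0, 0.1, 0.6, 5, 223.1)

# Standardize: subtract the mean, then divide by the standard deviation
hours_z <- (hours - mean(hours)) / sd(hours)
hours_z

# After standardizing, the mean is (numerically) 0 and the standard deviation is 1
mean(hours_z)
sd(hours_z)
```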

In [None]:
# Distribution of Experience
experience_plot <- players |>
                ggplot(aes(x = experience)) +
                geom_bar(fill = "blue") +
                labs(x = "Experience", y = "Number of Players") +
                ggtitle("Distribution of Experience")
experience_plot

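- Experience levels are unevenly distributed
- Some categories, such as pro, have far fewer players
    - this may affect how k-NN measures distance across the different experience groups, since players in small groups will often have nearest neighbours from other groups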
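
To make the idea of distance between experience groups concrete, the sketch below one-hot encodes a hypothetical experience column in base R and computes Euclidean distances between rows. This is only an illustration under the assumption of indicator (dummy) encoding; the way the kknn engine handles factors internally may differ, and in a recipe an explicit alternative would be `step_dummy()`.

```r
# Hypothetical experience values (not taken from players.csv)
toy <- data.frame(experience = factor(c("Amateur", "Pro", "Amateur", "Veteran")))

# One indicator column per experience level
dummies <- model.matrix(~ experience - 1, data = toy)
dummies

# Euclidean distance between two players in the same group
d_same <- sqrt(sum((dummies[1, ] - dummies[3, ])^2))

# Euclidean distance between two players in different groups
d_diff <- sqrt(sum((dummies[1, ] - dummies[2, ])^2))

c(same_group = d_same, different_group = d_diff)
```

Under this encoding every pair of different groups is the same distance apart, so the ordering of the experience levels plays no role.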

In [None]:
# Relationship between played_hours and subscription status
played_hours_vs_subscribe_plot <- players |>
                                ggplot(aes(x = subscribe, y = played_hours)) +
                                geom_boxplot(fill = "blue") +
                                labs(x = "Subscription Status", y = "Total Hours Played") +
                                ggtitle("Total Hours Played vs Subscription Status")
played_hours_vs_subscribe_plot

- Subscribed players tend to show higher playtime values than players who are not subscribed
    - suggests that time spent playing may be related to subscription behaviour
        - supports our choice of predictors

**Methods and Plan**

To answer my question:
- I plan to use k-nearest neighbors classification
- k-NN assumes that players who are close to each other in experience and total hours played are likely to behave similarly, that is, to share the same subscription status
- However, there are a few downsides to k-NN:
    - the method depends heavily on how variables are scaled
        - played_hours will need to be standardized
    - the skew in hours played and the imbalance in subscription status can affect the results of the k-NN classification
    - the experience variable needs to be prepared before it can be included in the model
- To choose the best k value, I will use cross-validation
    - the data will first be split into a training set and a test set using prop = 0.75, which gives a 75/25 split
    - on the training set, 5-fold cross-validation will be run to compare different k values
    - the k with the lowest classification error (highest accuracy) will then be used on the testing data
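
The assumption behind k-NN can be sketched in a few lines of base R. The example below is a simplified illustration with one standardized predictor and made-up data (`train` and `knn_predict` are hypothetical names, not part of tidymodels); the actual analysis uses the kknn engine through a workflow.

```r
# Hypothetical training data: standardized hours and a subscription label
train <- data.frame(
  hours_z   = c(-0.5, -0.4, -0.3, 1.2, 1.5, 1.8),
  subscribe = factor(c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))
)

# Classify one new player by majority vote among the k nearest neighbours
knn_predict <- function(train, new_hours_z, k) {
  distances <- abs(train$hours_z - new_hours_z)  # Euclidean distance in one dimension
  nearest   <- order(distances)[seq_len(k)]      # row indices of the k closest players
  votes     <- table(train$subscribe[nearest])   # count the labels among those neighbours
  names(votes)[which.max(votes)]                 # label with the most votes
}

pred_high <- knn_predict(train, new_hours_z = 1.4, k = 3)
pred_low  <- knn_predict(train, new_hours_z = -0.45, k = 3)
c(high_playtime = pred_high, low_playtime = pred_low)
```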

Before doing classification:
- tidy the data
    - converting experience into a factor
    - standardizing the numeric variables
    - removing missing values such as in Age
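
A minimal sketch of the first and last of these steps, on a small made-up tibble (`toy_players` and its values are hypothetical). Standardizing is left out here because it is handled inside the recipe in the modelling code.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical toy data with one missing Age
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Veteran", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0, 3.8, 0.1),
  Age          = c(21, NA, 17, 19)
)

toy_clean <- toy_players |>
  drop_na(Age) |>                              # remove rows where Age is missing
  mutate(experience = as_factor(experience),   # store experience as a factor
         subscribe  = as_factor(subscribe))    # store the response as a factor
toy_clean
```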

In [None]:

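# Set a seed so the random split and folds are reproducible (the value itself is arbitrary)
set.seed(2024)

# Split into 75% training and 25% testing data, stratified by the response
players_split <- initial_split(players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5-fold cross-validation on the training set only
player_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Standardize the numeric predictor (played_hours)
player_recipe <- recipe(subscribe ~ played_hours + experience, data = players_train) |>
                    step_scale(all_numeric_predictors()) |>
                    step_center(all_numeric_predictors())

# k-NN specification with the number of neighbors left to be tuned
player_knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                set_engine("kknn") |>
                set_mode("classification")

# Candidate k values: 1, 6, 11, ..., 96
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

players_results <- workflow() |>
                    add_recipe(player_recipe) |>
                    add_model(player_knn_tune) |>
                    tune_grid(resamples = player_vfold, grid = k_vals) |>
                    collect_metrics()

players_results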

In [None]:
# Keep only the cross-validated accuracy estimates
accuracies <- players_results |>
            filter(.metric == "accuracy")

accuracies_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) + 
                    geom_point() + 
                    geom_line() + 
                    labs(x = "Neighbors", y = "Accuracy Estimate")


accuracies_versus_k



In [None]:
best_k <- accuracies |>
            arrange(desc(mean)) |>
            head(1) |>
            pull(neighbors)
best_k

In [None]:
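# Use the k selected by cross-validation rather than a hard-coded value
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
                set_engine("kknn") |>
                set_mode("classification")

# Re-estimate performance of the chosen k across the cross-validation folds
players_workflow <- workflow() |>
                    add_recipe(player_recipe) |>
                    add_model(player_spec) |>
                    fit_resamples(resamples = player_vfold)

players_metrics <- collect_metrics(players_workflow)
players_metrics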
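
The plan above also calls for using the chosen k on the testing data. One way this could look is sketched below on simulated data, so it runs without the data files. Everything prefixed with `toy_` is hypothetical, the simulated labels are random (so the accuracy itself is meaningless), and `neighbors = 5` is a placeholder for the k selected by cross-validation.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Simulated stand-in for the players data (not real observations)
toy_players <- tibble(
  played_hours = rexp(120, rate = 0.2),
  experience   = factor(sample(c("Amateur", "Pro", "Veteran"), 120, replace = TRUE)),
  subscribe    = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE, prob = c(0.7, 0.3)))
)

toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

toy_recipe <- recipe(subscribe ~ played_hours + experience, data = toy_train) |>
  step_scale(all_numeric_predictors()) |>
  step_center(all_numeric_predictors())

# Placeholder k; in the real analysis this would be best_k
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit once on the full training set
toy_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_spec) |>
  fit(data = toy_train)

# Predict on the held-out test set and compute classification metrics
toy_metrics <- predict(toy_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = subscribe, estimate = .pred_class)
toy_metrics
```

The test set is touched only once, after k has been fixed, so the reported accuracy is not biased by the tuning step.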