# Data Science project

This project explores the Minecraft server and player data collected by Frank Wood's research group in Computer Science at UBC.

https://plai.cs.ubc.ca/
https://www.cs.ubc.ca/~fwood/



# Predictive question: Is there a relationship between hours played and age that predicts whether a player has a newsletter subscription?

The `players.csv` data set contains the following variables:

- experience: a rank ascending from beginner, amateur, regular, pro to veteran
- subscribe: whether the individual is subscribed to the newsletter
- hashedEmail: the individual's hashed email address
- played_hours: how many hours the individual played within XXXXX
- name: the individual's first name
- gender: the individual's gender
- Age: the individual's age



In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

# Data Exploration and Visualization

In [None]:
set.seed(1)
player_data <- read_csv("data/players.csv") |>
    mutate(subscribe = as_factor(subscribe))
player_data

In [None]:
select_player_data <- player_data |> 
    select(subscribe, Age, played_hours)
select_player_data

summary(select_player_data)

In [None]:
options(repr.plot.height = 6, repr.plot.width = 9)

player_data_plot <- select_player_data |>
    ggplot(aes(x = played_hours, y = Age, color = subscribe)) +
    geom_point(alpha = 0.4) +
    labs(x = "Number of hours played",
         y = "Age",
         title = "Figure 1: Age vs. hours played, by newsletter subscription",
         color = "Subscribed") +
    theme(text = element_text(size = 13))
player_data_plot 

Figure 1 shows that every individual with more than 25 hours played has a newsletter subscription. Individuals with 0 hours played, however, are a mix of subscribers and non-subscribers.

# Data Analysis
To determine whether there is a relationship between hours played and age that predicts whether a player has a subscription, we approached this question as a classification problem.

`subscribe` has two categories, `TRUE` (subscribed) and `FALSE` (unsubscribed). Because it is the outcome variable, we converted it to a factor when loading the data.
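
The metric calls later in this notebook pass `event_level = "first"`, so the order of the factor levels matters. The sketch below uses base R `factor()` on a made-up logical vector to show how the levels are ordered. It assumes that `as_factor()` orders a logical column the same way, which is worth confirming with `levels()` on the real data.

```r
# Toy logical vector standing in for the `subscribe` column (hypothetical values).
subscribed <- c(TRUE, FALSE, TRUE, TRUE)

# Base R sorts the levels, so FALSE comes before TRUE.
subscribed_fct <- factor(subscribed)
level_order <- levels(subscribed_fct)
print(level_order)

# The first level is the one that `event_level = "first"` would treat as the event.
first_level <- level_order[1]
print(first_level)
```

If the first level turns out to be `FALSE`, precision and recall computed with `event_level = "first"` describe non-subscribers rather than subscribers.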

Split data and evaluate proportions 

In [None]:
player_split <- initial_split(player_data, prop = 0.75, strata = subscribe)
player_train <- training(player_split)
player_test <- testing(player_split)

player_train <- player_train |>
    drop_na()
player_test <- player_test |>
    drop_na()

player_train_proportions <- player_train |>
                      group_by(subscribe) |>
                      summarize(n = n()) |>
                      mutate(percent = 100*n/nrow(player_train))

player_train_proportions

The proportions in the training data show that subscribed and unsubscribed players are not equally represented, so accuracy alone can be misleading. We therefore also calculate recall (to catch as many actual subscribers as possible) and precision (to be correct most of the time when we predict a subscriber).
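
As a rough illustration of what the two metrics measure, the sketch below computes them by hand from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this data set.

```r
# Hypothetical confusion-matrix counts, for illustration only.
true_pos  <- 30  # predicted subscribed, actually subscribed
false_pos <- 10  # predicted subscribed, actually not subscribed
false_neg <- 5   # predicted not subscribed, actually subscribed

# Precision: of the players predicted as subscribers, how many really are?
precision_value <- true_pos / (true_pos + false_pos)

# Recall: of the actual subscribers, how many did the classifier find?
recall_value <- true_pos / (true_pos + false_neg)

print(precision_value)
print(recall_value)
```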

Preprocess data and train the classifier 

We standardize the predictors using the training data, specify a K-nearest neighbors model with k = 3 (just a guess from looking at Figure 1), combine the two in a workflow, and then fit the workflow to build the classifier.
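
Standardizing with the training data means that the mean and standard deviation are estimated on the training set only and then reused for any new data. A minimal base R sketch of that idea, using made-up ages rather than the real data:

```r
# Hypothetical ages, used only to illustrate the arithmetic.
train_age <- c(10, 20, 30)
test_age <- c(20, 40)

# The statistics come from the training values only.
train_mean <- mean(train_age)
train_sd <- sd(train_age)

# Both sets are centered and scaled with the training statistics.
train_scaled <- (train_age - train_mean) / train_sd
test_scaled <- (test_age - train_mean) / train_sd

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.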

In [None]:
player_recipe <- recipe(subscribe ~ Age + played_hours, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")


knn_fit <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_spec) |>
  fit(data = player_train)

knn_fit


Next we predict labels in the test set and evaluate the classifier's performance.

In [None]:
set.seed(1)
player_test_predictions <- predict(knn_fit, player_test) |>
  bind_cols(player_test)

player_test_predictions


player_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

player_test_predictions |> pull(subscribe) |> levels()

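# event_level = "first" treats the first factor level printed above as the
# positive class. If that level is FALSE, these metrics describe
# non-subscribers, and event_level = "second" would target subscribers.
player_test_predictions |>
  precision(truth = subscribe, estimate = .pred_class, event_level = "first")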

player_test_predictions |>
  recall(truth = subscribe, estimate = .pred_class, event_level = "first")

confusion <- player_test_predictions |>
             conf_mat(truth = subscribe, estimate = .pred_class)
confusion




With an arbitrarily chosen k of 3, the classifier reaches 51% accuracy, 21% precision and 30% recall, which is not very good.

We will try to improve these percentages by tuning the classifier with cross-validation (multiple splits of the training data).

Cross-validation
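
To make the idea of multiple splits concrete, the sketch below assigns rows to folds by hand in base R. The row and fold counts are hypothetical. Unlike `vfold_cv()` with `strata`, this simplified version does not stratify by the outcome.

```r
set.seed(1)

n_rows <- 20   # hypothetical number of training rows
n_folds <- 5   # hypothetical number of folds

# Assign each row to one fold, shuffled so the folds are random but equal in size.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

# Each fold takes one turn as the validation set; the rest is used for fitting.
for (fold in seq_len(n_folds)) {
  validation_rows <- which(fold_id == fold)
  analysis_rows <- which(fold_id != fold)
  cat("Fold", fold, ":", length(analysis_rows), "rows to fit,",
      length(validation_rows), "rows to validate\n")
}
```

Averaging the accuracy over the folds gives a more stable estimate for each candidate k than a single split would.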

In [None]:
set.seed(1)

player_vfold <- vfold_cv(player_train, v = 10, strata = subscribe)

player_recipe <- recipe(subscribe ~ Age + played_hours,
                        data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

knn_results <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = player_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies


In [None]:
set.seed(1)

accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors",
       y = "Accuracy Estimate",
       title = "Figure 2: Accuracy estimate vs. number of neighbors (k)") +
  theme(text = element_text(size = 12))

accuracy_vs_k

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k

In this case the best number of neighbors among the values tested (1 to 10) was k = 7, with around 67% accuracy.

Now we will refit the classifier with this k and evaluate it on the test set.

In [None]:
set.seed(1)

player_recipe <- recipe(subscribe ~ Age + played_hours, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec2 <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit2 <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_spec2) |>
  fit(data = player_train)

knn_fit2

Now we will look at accuracy, recall and precision to see if there is any improvement.

In [None]:
set.seed(1)

player_test_predictions2 <- predict(knn_fit2, player_test) |>
  bind_cols(player_test)

player_test_predictions2 |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

table(player_train$subscribe)
table(player_test$subscribe)


player_test_predictions2 |>
    precision(truth = subscribe, estimate = .pred_class, event_level="first")

player_test_predictions2 |>
    recall(truth = subscribe, estimate = .pred_class, event_level="first")

confusion <- player_test_predictions2 |>
             conf_mat(truth = subscribe, estimate = .pred_class)
confusion

Tuning the k value from k = 3 to k = 7 improved accuracy, recall and precision by around 8%.

Plot accuracy vs. k values with the tuning results.