DSCI 100 Project

name: Ryan Cheng 

student ID: 53355756

Question to answer: 

    Question 2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

In [None]:
# libraries / plot setting 
library(tidyverse)
library(tidymodels)
library(repr)
library(GGally)
options(repr.plot.width = 11, repr.plot.height = 8)
options(repr.matrix.max.rows = 6)

# Introduction

This is a 

# Methods

In [None]:
# Read the data from the data folder
players <- read_csv("data/players.csv") |>
    as.data.frame() |>
    mutate(subscribe = as.factor(subscribe)) |> # make the subscribe column a factor instead of "lgl"
    mutate(experience = case_when(
        experience == "Beginner" ~ 1,
        experience == "Amateur"  ~ 2,
        experience == "Regular"  ~ 3,
        experience == "Veteran"  ~ 4,
        experience == "Pro"      ~ 5
    )) # encode the experience levels as numbers from 1 to 5
head(players) # show only the first rows, since the full table is long


In [None]:
# session <- read_csv("data/session.csv")|>
#     as.data.frame()
# head(session)#not useful for this question

To find out which "kinds" of players are most likely to contribute a large amount of data, we need a model that predicts each player's played hours. Because played hours is a numeric quantity, this is a regression problem.

Looking at the players data, there are four variables that can be used for the prediction: experience, subscribe, gender, and Age.

First, we make some scatter plots to show the relationship between played hours and the numeric variables. The numeric predictors are scaled later, inside the recipe, before the model is fitted.

In [None]:
players_age_plot <- players |>
    ggplot(aes(x = played_hours, y = Age)) +
    geom_point(alpha = 0.4) +
    xlab("Played hours (hours)") +
    ylab("Age (years)") +
    ggtitle("Age Versus Played Hours") +
    theme(text = element_text(size = 20))
players_age_plot

In [None]:
players_experience_plot <- players |>
    ggplot(aes(x = played_hours, y = experience)) +
    geom_point(alpha = 0.4) +
    xlab("Played hours (hours)") +
    ylab("Experience of the player (1 = Beginner, 5 = Pro)") +
    ggtitle("Experience of the Player Versus Played Hours") +
    theme(text = element_text(size = 20))
players_experience_plot

Neither plot shows a strong linear relationship between the variables. Thus, k-nearest neighbors regression, which does not assume a linear relationship, will be used to solve this problem.
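
As a rough illustration of the idea behind k-nearest neighbors regression, the sketch below predicts a value as the plain average of the responses of the k closest training points. The data frame `toy` and the helper `knn_predict` are hypothetical and made up for this example; the actual analysis uses the `kknn` engine through tidymodels, not this function.

```r
# Toy data (made up for illustration; not from players.csv)
toy <- data.frame(
  x = c(1, 2, 3, 10, 11),
  y = c(0.5, 1.0, 1.5, 20, 22)
)

# Hypothetical helper: predict y at new_x as the mean y of the k closest x values
knn_predict <- function(train, new_x, k) {
  dists <- abs(train$x - new_x)        # distance from new_x to every training point
  nearest <- order(dists)[seq_len(k)]  # indices of the k smallest distances
  mean(train$y[nearest])               # unweighted ("rectangular") average
}

# The three nearest points to 2.5 are x = 2, 3, and 1, so the prediction is 1
knn_predict(toy, new_x = 2.5, k = 3)
```

Because the prediction is a local average, no straight-line relationship between predictor and response is needed.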

# Model

First, we split the data into a training set and a testing set. The training set will contain 75% of the data and the testing set the remaining 25%.

In [None]:
players_split <- initial_split(players, prop = 0.75, strata = played_hours) # split the players data
players_training <- training(players_split)
players_testing <- testing(players_split)
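
The split above is random, so the rows in each set, and therefore the tuning results, can change from run to run. A minimal sketch of making a split repeatable is shown here, assuming the built-in `mtcars` data as a stand-in for `players`; the seed value 1234 is arbitrary and not taken from the project.

```r
library(rsample)

# The seed value is arbitrary; any fixed integer makes the split repeatable
set.seed(1234)
split_a <- initial_split(mtcars, prop = 0.75)

set.seed(1234)
split_b <- initial_split(mtcars, prop = 0.75)

# Same seed, same rows in the training set
identical(training(split_a), training(split_b))
```

Calling `set.seed()` once near the top of the notebook would have a similar effect on every random step that follows.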

In [None]:
# create a recipe that predicts played hours from the four variables and standardizes the numeric ones
player_recipe <- recipe(played_hours ~ experience + Age + gender + subscribe, data = players_training)|>
    step_scale(all_of(c("experience", "Age")))|> #we cannot scale gender and subscribe
    step_center(all_of(c("experience", "Age")))
#create a model specification for k-nearest neighbors regression
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune())|>
    set_engine("kknn")|>
    set_mode("regression")
# use 5-fold cross-validation to find the best k value
player_vfold <- vfold_cv(players_training, v = 5, strata = played_hours)

player_workflow <- workflow()|>
    add_recipe(player_recipe)|>
    add_model(player_spec)
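
To show what the scaling and centering steps in the recipe amount to, here is a small sketch in base R. The `age` vector is made up for illustration. It checks that scaling first and then centering, the order used in the recipe, gives the same result as the usual "subtract the mean, divide by the standard deviation" formula.

```r
# Toy ages (made up for illustration; not from players.csv)
age <- c(17, 21, 25, 30, 47)

# Standardize by hand: subtract the mean, then divide by the standard deviation
age_std <- (age - mean(age)) / sd(age)

# Reverse order, as in the recipe: scale first, then center the scaled values
age_scaled <- age / sd(age)
age_alt <- age_scaled - mean(age_scaled)

all.equal(age_std, age_alt)  # the two orders agree
mean(age_std)                # approximately 0
sd(age_std)                  # 1
```

Standardizing matters for k-nearest neighbors because distances would otherwise be dominated by whichever predictor has the largest numeric range.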

We will run cross-validation over a grid of k values (numbers of neighbors) from 1 to 30.

In [None]:
k_vals <- tibble(neighbors = seq(from = 1, to = 30, by = 1))

player_result <- player_workflow|>
    tune_grid(resamples = player_vfold, grid = k_vals)|>
    collect_metrics()|>
    filter(.metric == "rmse")
player_result

In [None]:
player_min <- player_result|>
    filter(mean == min(mean))
player_min

In this run, k = 25 gives the lowest cross-validated RMSE.
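
Rather than copying the best k by hand, it can be pulled out of the results table in code, so the value stays in sync if the split or folds change. The sketch below assumes a table shaped like the `collect_metrics()` output, with `neighbors` and `mean` columns; the numbers in `toy_result` are made up for illustration and are not the project's results.

```r
library(dplyr)

# Hypothetical cross-validation summary (values made up for illustration)
toy_result <- tibble(
  neighbors = c(1, 5, 10, 25, 30),
  mean      = c(9.1, 8.4, 8.0, 7.6, 7.7)
)

# Keep the row with the smallest mean RMSE and extract its k
best_k <- toy_result |>
  slice_min(mean, n = 1) |>
  pull(neighbors)

best_k
```

With the real results table in place of `toy_result`, `best_k` could then be passed to `nearest_neighbor(neighbors = best_k)` instead of a hard-coded number.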

# Result

In [None]:
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 25)|>
    set_engine("kknn")|>
    set_mode("regression")

player_fit <- workflow()|>
    add_recipe(player_recipe)|>
    add_model(players_spec)|>
    fit(data = players_training)



# player_summary <- player_fit|>
#     predict(players_testing)|>
#     bind_cols(players_testing)|>
#     metrics(truth = played_hours, estimate = .pred)|>
#     filter(.metric == "rmse")
# player_summary