## Title for now ##

### Introduction

A research group has set up a Minecraft server to collect data on how people play video games. Players' actions are recorded as they navigate through the world. The data consist of a players data set and a sessions data set. In this report, we use the players data set to address the question: can a player's age and hours played predict whether that player is subscribed?

The players data set has 196 observations and 7 variables:

- `experience` (the player's experience level, type chr)
- `subscribe` (whether the player is subscribed, type lgl)
- `hashedEmail` (the player's hashed email address, type chr)
- `played_hours` (number of hours the player has played, type dbl)
- `name` (the player's name, type chr)
- `gender` (the player's gender, type chr)
- `Age` (the player's age, type dbl)

### Methods 

#### Preprocessing and exploratory data analysis
1. Imported the relevant libraries.
2. Wrangled and cleaned the `players.csv` data by converting `experience`, `gender`, and `subscribe` to factors.
3. Calculated the means of `Age` and `played_hours`.
4. Created simple visualizations of `Age` and `played_hours` in relation to `subscribe`.

In [9]:
library(tidyverse)
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.1.2
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.0.1
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()

In [27]:
# Set the seed so that results are reproducible
set.seed(1)

# Load the players data from the project repository
players_url <- "https://raw.githubusercontent.com/wenqin07/toy_ds_project/7ab5fe995d0e438443ebe9e80bd91a2363680d8f/players.csv"
players_data <- read_csv(players_url)

# Convert the categorical columns to factors
players_data_tidy <- players_data |>
  mutate(
    experience = as.factor(experience),
    gender = as.factor(gender),
    subscribe = as.factor(subscribe)
  )

players_data_tidy

# Mean hours played and mean age, ignoring missing values
players_mean <- players_data_tidy |>
  select(played_hours, Age) |>
  summarize(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE)
  )
players_mean

# Scatter plot of hours played against age, colored by subscription status
explore1 <- players_data_tidy |>
  ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point() +
  labs(x = "Age in years",
       y = "Time played, in hours",
       color = "Subscribed",
       title = "Age vs number of hours played vs subscribed")

# Histogram of age, one panel per subscription status
explore2 <- players_data_tidy |>
  ggplot(aes(x = Age, fill = subscribe)) +
  geom_histogram(position = "identity") +
  facet_grid(rows = vars(subscribe)) +
  labs(x = "Age in years",
       fill = "Subscribed",
       title = "Age vs subscribed")

# Histogram of hours played, one panel per subscription status
explore3 <- players_data_tidy |>
  ggplot(aes(x = played_hours, fill = subscribe)) +
  geom_histogram(position = "identity") +
  facet_grid(rows = vars(subscribe)) +
  labs(x = "Time played in hours",
       fill = "Subscribed",
       title = "Time played in hours vs subscribed")

explore1
explore2
explore3

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


ERROR: Error in map_df(select(mutate(mutate(mutate(players_data, experience = as.factor(experience)), : argument ".f" is missing, with no default


In [22]:
# Split the data into training (75%) and testing (25%) sets, stratified by subscribe
players_split <- initial_split(players_data_tidy, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)
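
A quick way to confirm that a stratified split keeps the class balance is to compare class proportions before and after splitting. The sketch below is illustrative only: `toy_players` is a hypothetical tibble invented for the example (it is not the real `players.csv` data), and only the `initial_split()`, `training()`, and `testing()` calls mirror the cell above.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)

# Hypothetical toy data: 40 players, 30 subscribed and 10 not subscribed
toy_players <- tibble(
  Age = rep(c(17, 21, 25, 30), times = 10),
  played_hours = rep(c(0.5, 2, 10, 0, 1), times = 8),
  subscribe = factor(rep(c(TRUE, TRUE, TRUE, FALSE), times = 10))
)

# Same kind of split as above: 75% training, stratified by the response
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of each class in the training set
train_props <- toy_train |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

train_props
```

If the proportions in `train_props` are close to those in the full data, the stratification is doing its job. The same `count()` and `mutate()` steps could be applied to `players_train` and `players_test`.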

In [28]:
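# Standardize the predictors; the recipe is built from the training set only
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-nearest neighbors classifier with K = 2 and unweighted votes
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 2) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set, stratified by subscribe
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Fit the workflow on each fold
knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = players_vfold)

knn_fit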

→ A | error:   Assigned data `orig_rows` must be compatible with existing data.
               ✖ Existing data has 29 rows.
               ✖ Assigned data has 30 rows.
               ℹ Only vectors of size 1 are recycled.
               Caused by error in `vectbl_recycle_rhs_rows()`:
               ! Can't recycle input of size 30 to size 29.

There were issues with some computations   A: x1

→ B | error:   Assigned data `orig_rows` must be compatible with existing data.
               ✖ Existing data has 28 rows.
               ✖ Assigned data has 29 rows.
               ℹ Only vectors of size 1 are recycled.
               Caused by error in `vectbl_recycle_rhs_rows()`:
               ! Can't recycle input of size 29 to size 28.

There were issues with some computations   A: 
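
The row counts in these errors differ by exactly one row per fold (30 vs. 29, 29 vs. 28), which is consistent with rows being dropped at prediction time because a predictor is missing. This is an assumption rather than a confirmed diagnosis; the earlier use of `na.rm = TRUE` when averaging `Age` suggests that column may contain `NA` values. A minimal sketch of how one might check for and remove such rows, using a hypothetical `toy_players` tibble in place of the real data:

```r
library(tidyverse)

# Hypothetical toy data standing in for players_data_tidy; two Age values are missing
toy_players <- tibble(
  Age = c(17, NA, 21, 25, NA, 19),
  played_hours = c(0.5, 1.2, 30, 0, 2.1, 5),
  subscribe = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Count the missing values in each column
missing_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))
missing_counts

# Keep only the rows where both predictors are present
toy_complete <- toy_players |>
  drop_na(Age, played_hours)
toy_complete
```

Applied to the real data, the same `drop_na(Age, played_hours)` step would go before `initial_split()`, so that the training set, the testing set, and every cross-validation fold contain complete rows only. Imputing the missing values in the recipe is an alternative when dropping rows would discard too much data.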