# Simon Littlewood DSCI 100 Project Planning Stage Individual

# Data Description

This dataset was created by researchers at UBC and contains information about the activity of players on a public Minecraft server. The dataset contains two files: "players.csv" and "sessions.csv". Only "players.csv" is needed for this analysis, so it is the only file described below.

### players.csv

Each observation/row in this dataframe corresponds to a unique player that has played on the server before.

There are 196 observations in this dataframe (197 rows in the file, minus 1 header row).

This dataframe contains 7 variables:

"experience" (chr/str); how experienced the player is at Minecraft, ranging from Beginner to Veteran

"subscribe" (logical/bool); whether or not the player is subscribed to the gaming newsletter

"hashedEmail" (chr/str); a unique identifier for each player

"played_hours" (numeric); how many total hours the player played on the server

"name" (chr/str); the first name of the player

"gender" (chr/str); the gender of the player

"Age" (numeric); the age of the player

In [None]:
# Run this cell to load the relevant libraries
# (tidyverse already attaches tidyr and dplyr, so they are not loaded separately)
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)

In [None]:
# Load and assign the raw players dataframe, demonstrating that it can be loaded into R

players <- read_csv("players.csv")

In [None]:
# This cell presents summary statistics for the players dataframe: the mean for quantitative
# variables and the mode for qualitative variables

# A reusable function that finds the mode of a qualitative column, so the same code
# is not repeated for each variable
find_mode <- function(dataframe, var){
    dataframe |>
        group_by({{var}}) |>
        summarise(count = n()) |>
        arrange(desc(count)) |>
        slice(1) |>
        pull({{var}})
    }

#creating a function we can use multiple times to find the mean of a quantitative column
find_mean <- function(dataframe, var){
    dataframe |>
        summarise(mean = mean({{var}}, na.rm = TRUE)) |>
        pull(mean) |>
        round(2)
    }

players_gender_mode <- find_mode(players, gender)
players_experience_mode <- find_mode(players, experience)
players_subscribed_mode <- find_mode(players, subscribe)
players_played_hours_mean <- find_mean(players, played_hours)
players_age_mean <- find_mean(players, Age)

print(paste("The most common gender of players is", players_gender_mode))
print(paste("The most common experience level of players is", players_experience_mode))
print(paste("The most common subscription status of players is", players_subscribed_mode))
print(paste("The mean value for total hours played among all players is", players_played_hours_mean))
print(paste("The mean value of age among all players is", players_age_mean))

# Questions

### Broad Question

Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Specific Question

Can experience level, age, gender, and total hours played predict whether or not someone is subscribed to a game-related newsletter?

### How will the data help to answer this question?

By using the variables "experience", "Age", "gender" and "played_hours" from the players.csv dataframe as predictors, we can use exploratory visualizations and a K-nearest neighbors (KNN) classifier to assess whether the predictors are associated with, and can predict, the response variable "subscribe".

# Exploratory Data Analysis and Visualization

In [None]:
### players.csv is the only dataset needed for this analysis, and it is already in tidy format.

players_tidy <- players

In [None]:
### this cell will be used to present the mean value of the quantitative variables in players.csv in a table

players_mean_value_table <- tibble(
    Variable = c("Total Hours Played", "Age"), 
    Mean = c(players_played_hours_mean, players_age_mean)
)

players_mean_value_table

In [None]:
### this cell will be used to create some exploratory visualizations of the data
options(repr.plot.width = 5, repr.plot.height = 5)


experience_vs_hours_played <- players_tidy |>
    group_by(experience) |>
    summarise(total_hours = sum(played_hours, na.rm = TRUE))

experience_vs_hours_played_plot <- ggplot(experience_vs_hours_played, aes(x = reorder(experience, total_hours), y = total_hours))+
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = "Experience Level", y = "Total Hours Played", title = "Experience Level vs. Total Hours Played")


gender_vs_hours_played <- players_tidy |>
    group_by(gender) |>
    summarise(total_hours = sum(played_hours, na.rm = TRUE))

gender_vs_hours_played_plot <- ggplot(gender_vs_hours_played, aes(x = reorder(gender, total_hours), y = total_hours))+
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = "Gender", y = "Total Hours Played", title = "Gender vs. Total Hours Played")


age_vs_hours_played <- players_tidy |>
    group_by(Age) |>
    summarise(total_hours = sum(played_hours, na.rm = TRUE))

age_vs_hours_played_plot <- ggplot(age_vs_hours_played, aes(x = Age, y = total_hours)) +
    geom_line() +
    labs(x = "Age", y = "Total Hours Played", title = "Age vs. Total Hours Played")


subscribe_vs_played_hours_plot <- ggplot(players_tidy, aes(x = subscribe, y = played_hours)) +
    geom_bar(stat = "identity") +
    labs(x = "Subscribed or Not", y = "Total Hours Played", 
         title = "Total Hours Played for Subscribers and Non-subscribers")

experience_vs_hours_played_plot
gender_vs_hours_played_plot
age_vs_hours_played_plot
subscribe_vs_played_hours_plot

# Methods and Plan

The method I will use to address my question of interest is K-nearest neighbors (KNN) classification, with the classifier tuned by cross-validation to find the value of k that gives the highest accuracy.

This method is appropriate because I'm attempting to predict a qualitative variable, "subscribe", and KNN classification is designed to predict categorical responses.

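To apply this method effectively, I'll have to scale and center the numeric predictors ("Age" and "played_hours"), since KNN relies on distances between observations. The response variable, "subscribe", is not scaled; it is instead converted to a factor. The categorical predictors ("experience" and "gender") must also be given a numeric encoding before distances can be computed.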
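A minimal sketch of what this preprocessing step could look like as a tidymodels recipe. The data frame `toy_players` and all of its values are made up for illustration, and `gender` is left out only to keep the example short. In this sketch only the numeric predictors are standardized, `subscribe` is stored as a factor, and the dummy-variable encoding of `experience` is one possible choice rather than a settled part of the plan.

```r
library(tidymodels)

# Toy stand-in for players.csv: every value below is made up for illustration
toy_players <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  experience = c("Beginner", "Veteran", "Beginner", "Veteran", "Veteran", "Beginner"),
  played_hours = c(0.5, 12.0, 3.2, 0.0, 48.1, 1.4),
  Age = c(17, 22, 19, 21, 30, 18)
) |>
  # classification models in tidymodels expect the response to be a factor
  mutate(subscribe = as.factor(subscribe))

players_recipe <- recipe(subscribe ~ experience + played_hours + Age, data = toy_players) |>
  # center and scale the numeric predictors only
  step_normalize(all_numeric_predictors()) |>
  # encode the categorical predictor as 0/1 indicator columns (an assumed choice)
  step_dummy(all_nominal_predictors())

# Estimate the means and standard deviations, then apply them to the same data
players_preprocessed <- players_recipe |>
  prep() |>
  bake(new_data = NULL)

players_preprocessed
```

After baking, `played_hours` and `Age` have mean 0 and standard deviation 1, while `subscribe` is left untouched as a factor.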

With only 196 observations, individual outliers may have a disproportionate effect on the model. For example, a player with a very large playing time who didn't subscribe could noticeably shift the model's predictions, because the sample is too small to dilute that one observation.
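One way an outlier can matter for a distance-based method is through standardization. The sketch below uses made-up playing times (not values from players.csv) to show how a single extreme value inflates the standard deviation, so that the typical values end up nearly indistinguishable after scaling.

```r
# Made-up playing times in hours; the last value is a deliberate outlier
hours <- c(0.1, 0.3, 0.5, 1.0, 2.0, 150)

# Standardize: subtract the mean and divide by the standard deviation
hours_scaled <- as.numeric(scale(hours))

# Range covered by the five typical values after scaling
typical_spread <- diff(range(hours_scaled[1:5]))

# Gap between the outlier and the largest typical value after scaling
outlier_gap <- hours_scaled[6] - max(hours_scaled[1:5])

round(hours_scaled, 2)
typical_spread
outlier_gap
```

Here the five typical values are squeezed into a very narrow range, so differences in playing time among ordinary players would contribute little to the distances KNN relies on.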

To compare and select models, I'm going to use cross-validation to choose the number of neighbors that leads to the highest accuracy.

The data will initially be split 70/30 into training and testing sets. The training data will then be split into 5 folds for cross-validation and parameter selection, both of which will occur before the model is evaluated on the testing data.
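The plan above can be sketched with tidymodels as follows. This is an illustration under stated assumptions, not the final analysis: `toy_players` is randomly generated stand-in data (so the resulting accuracies are meaningless), the seed and the grid of candidate k values are arbitrary, only two numeric predictors are used, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, only so the sketch is reproducible

# Randomly generated stand-in for players.csv (not real data)
toy_players <- tibble(
  subscribe = factor(rep(c("TRUE", "FALSE"), times = 50)),
  played_hours = runif(100, min = 0, max = 50),
  Age = sample(15:40, size = 100, replace = TRUE)
)

# 70/30 training/testing split, stratified by the response
players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5-fold cross-validation on the training data only
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# KNN specification with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_normalize(all_numeric_predictors())

# Candidate values of k (an arbitrary grid chosen for illustration)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Value of k with the highest cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

The testing set is not touched during tuning; it would only be used once, after `best_k` has been chosen and the model refit on the full training set.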