In [None]:
#loading necessary libraries
install.packages(c("tidyverse", "recipes"))
library(tidyverse)
library(recipes)
# GitHub link: https://github.com/tahsansamin/project_planning_stage_individual.git

# Dataset Descriptions

This dataset was collected by a computer science research team at the University of British Columbia. It describes how people play video games, with play recorded through a Minecraft server.

#### players.csv
This original file contains 196 observations and 7 variables, which are explained below:

1. **experience**: player's experience level, in 5 categories: Pro, Veteran, Amateur, Regular and Beginner (character data type).
2. **subscribe**: TRUE if the player has subscribed to the newsletter and FALSE otherwise (logical data type).
3. **hashedEmail**: hashed email address of the player (character data type).
4. **played_hours**: number of hours the player spent on the game (double data type).
5. **name**: name of the player (character data type).
6. **gender**: gender of the player, with unique values male, female, non-binary, prefer not to say, agender, two-spirited or other (character data type).
7. **Age**: age of the player (double data type).



### Potential issues with players.csv 

1. The "experience" and "gender" variables should be converted from the character data type to the factor data type.

2. The "hashedEmail" and "name" variables should be removed because they contain personal information.
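
Both clean-up steps can be sketched with dplyr. The tibble below is a made-up stand-in for players.csv (its values are invented, not taken from the dataset), and `players_toy` and `players_clean` are hypothetical names used only for this illustration.

```r
library(dplyr)

# Toy stand-in for players.csv; every value here is invented for illustration.
players_toy <- tibble(
  experience   = c("Pro", "Amateur", "Veteran"),
  subscribe    = c(TRUE, FALSE, TRUE),
  hashedEmail  = c("a1b2", "c3d4", "e5f6"),
  played_hours = c(30.3, 0.7, 3.8),
  name         = c("Alex", "Sam", "Kai"),
  gender       = c("Male", "Female", "Non-binary"),
  Age          = c(9, 21, 17)
)

players_clean <- players_toy |>
  select(-hashedEmail, -name) |>                    # drop identifying columns
  mutate(across(c(experience, gender), as.factor))  # character -> factor

players_clean
```

The same two verbs should apply unchanged to the real `players` data frame once it has been loaded.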

        

In [None]:
players <- read_csv("https://raw.githubusercontent.com/tahsansamin/project_planning_stage_individual/refs/heads/main/dataset/players.csv", show_col_types = FALSE)
head(players)

#### sessions.csv
This original file consists of 1535 observations and the following 5 variables:  

1. **hashedEmail**: hashed email address of the player (character data type).
2. **start_time**: date and time when the player started the gaming session (character data type).
3. **end_time**: date and time when the player stopped the gaming session (character data type).
4. **original_start_time**: Unix timestamp of when the player started the session (double data type).
5. **original_end_time**: Unix timestamp of when the player ended the session (double data type).

### Potential issues with sessions.csv 

1. The hashedEmail variable should be removed to not personally identify players.
2. start_time and end_time should each be split into two columns because they contain both date and time. The newly split columns should each be changed to either date or time format.
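
A minimal base R sketch of the split is shown here on invented rows. The `dd/mm/yyyy HH:MM` layout, the UTC time zone, and all object and column names (`sessions_toy`, `sessions_split`, `start_clock`, `duration_min`, and so on) are assumptions made for illustration; the format string would need to match whatever the real file contains.

```r
# Toy stand-in for sessions.csv; the "dd/mm/yyyy HH:MM" layout is an assumption.
sessions_toy <- data.frame(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse the character columns into date-times (UTC chosen arbitrarily here).
start_dt <- as.POSIXct(sessions_toy$start_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_dt   <- as.POSIXct(sessions_toy$end_time,   format = "%d/%m/%Y %H:%M", tz = "UTC")

sessions_split <- data.frame(
  start_date   = as.Date(start_dt),           # Date class
  start_clock  = format(start_dt, "%H:%M"),   # time of day, kept as text
  end_date     = as.Date(end_dt),
  end_clock    = format(end_dt, "%H:%M"),
  duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
)

sessions_split
```

Base R has no dedicated time-of-day class, so the clock columns stay as text in this sketch; keeping the parsed date-times around also makes it easy to derive a session duration.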

In [None]:
sessions <- read_csv("https://raw.githubusercontent.com/tahsansamin/project_planning_stage_individual/refs/heads/main/dataset/sessions.csv", show_col_types = FALSE)
head(sessions)

### Summary Statistics

#### players.csv

| Variable | Minimum | Maximum | Mean |
| -------- | ------- | -------- | -----|
| Age | 9.00 | 58.00 | 21.14 |
| played_hours | 0.00 | 223.10 | 5.85 |

#### sessions.csv

| Variable | Minimum | Maximum | Mean |
| -------- | ------- | -------- | -----|
| original_start_time | 1.71e+12 | 1.73e+12 | 1.72e+12 |
| original_end_time | 1.71e+12 | 1.73e+12 | 1.72e+12 |


In [None]:
# Minimum, maximum and mean of Age and played_hours, ignoring missing values
summary_stats_players <- players |> 
                summarise(min_Age = min(Age, na.rm = TRUE), max_Age = max(Age, na.rm = TRUE), mean_age = mean(Age, na.rm = TRUE),
                          min_played_hours = min(played_hours, na.rm = TRUE), max_played_hours = max(played_hours, na.rm = TRUE),
                          mean_played_hours = mean(played_hours, na.rm = TRUE))
summary_stats_players
summary_stats_sessions <- sessions |> 
                summarise(min_original_start_time = min(original_start_time, na.rm = TRUE), max_original_start_time = max(original_start_time, na.rm = TRUE), mean_original_start_time = mean(original_start_time, na.rm = TRUE),
                         min_original_end_time = min(original_end_time, na.rm = TRUE), max_original_end_time = max(original_end_time, na.rm = TRUE), mean_original_end_time = mean(original_end_time, na.rm = TRUE))

summary_stats_sessions

There are 2 NA values in players.csv and 4 NA values in sessions.csv.
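
A total count does not show where the missing values sit. As a small sketch on invented data (the `toy` data frame and its values are hypothetical), the count can be broken down per column and per row:

```r
# Toy data frame with missing values; values are invented for illustration.
toy <- data.frame(
  Age          = c(21, NA, 17, NA),
  played_hours = c(0.5, 1.2, NA, 3.0),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE)
)

na_total      <- sum(is.na(toy))            # total number of missing cells
na_per_column <- colSums(is.na(toy))        # missing cells in each column
na_rows       <- sum(!complete.cases(toy))  # rows with at least one missing value

na_total
na_per_column
na_rows
```

The row count is the number of observations that a call such as `drop_na()` would discard.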

In [None]:
na_players <- sum(is.na(players))
na_sessions <- sum(is.na(sessions))
na_players
na_sessions

# Questions
**Broad question:** *What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?*

**Specific question:** *Can player age and number of hours played be used to predict their gaming newsletter subscription status?*

The players.csv file will be used because it contains more relevant variables than sessions.csv. The names and hashed emails of all players will be removed for confidentiality. A subset of the dataset will be created with the age, hours played and subscription variables. Rows with missing values will be removed; because there are few of them, removing them is very unlikely to cause a loss of valuable data. Afterwards, the hours played and age variables will be standardized so that neither variable carries more weight in prediction than the other.
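
The standardization step could look like the following sketch, which uses the recipes package already loaded in this notebook. The data frame is an invented stand-in for the cleaned subset, and `players_subset` and `scaled` are hypothetical names.

```r
library(recipes)

# Toy stand-in for the cleaned subset; values are invented for illustration.
players_subset <- data.frame(
  Age          = c(9, 17, 21, 22, 47),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1),
  subscribe    = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE))
)

scaled <- recipe(subscribe ~ Age + played_hours, data = players_subset) |>
  step_center(all_predictors()) |>  # subtract each predictor's mean
  step_scale(all_predictors()) |>   # divide by each predictor's standard deviation
  prep() |>                         # estimate the means and standard deviations
  bake(new_data = NULL)             # apply them to the data used in prep()

scaled
```

After this step both predictors have mean 0 and standard deviation 1. In the actual analysis the recipe would normally be estimated on the training data only, so that the test set does not influence the scaling.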

# Exploratory Data Analysis and Visualizations

In [None]:
tidy_players <- drop_na(players)
head(tidy_players)

#### Table summarizing the mean of the quantitative variables
| Variable | Mean |
| -------- | ---- |
| played_hours | 5.85 |
| Age | 21.14 |



### Visualization 1: Number of Hours played against player age

In [None]:
figure_1 <- tidy_players |>
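            ggplot(aes(x = Age, y = log1p(played_hours), color = subscribe)) +
            geom_point() +
            # y is plotted as log1p(played_hours), so the axis label states the transform
            labs(x = "Age of Player", y = "Hours Played (log1p scale)", color = "Subscribed", title = "Number of Hours played against Age of Player") +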
            theme(text = element_text(size = 14)) 

figure_1

There seems to be a data imbalance, with fewer observations at higher ages and fewer observations where the subscription status is FALSE.

### Visualization 2: Mean number of hours played for each subscription group

In [None]:
means_for_subscription_status <- tidy_players |>
            group_by(subscribe) |>
            summarise(mean_hours  = mean(played_hours)) |> 
            ggplot(aes(x = subscribe, y = mean_hours)) +
            geom_bar(stat = "identity", fill = "#5B86AE") +
            # label each bar with its mean, always showing two decimal places
            geom_text(aes(label = format(round(mean_hours, 2), nsmall = 2)), vjust = 0, size = 6)+
            labs(x = "Subscription status", y = "Mean number of hours played", title = "Mean number of hours played for each subscription group")+
            theme(text = element_text(size = 20))

means_for_subscription_status


### Visualization 3: Mean age for each subscription status

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)
means_for_subscription_status0 <- tidy_players |>
            group_by(subscribe) |>
            summarise(mean_age  = mean(Age)) |> 
            ggplot(aes(x = subscribe, y = mean_age)) +
            geom_bar(stat = "identity", fill = "#5B86AE") +
            # label each bar with its mean, always showing two decimal places
            geom_text(aes(label = format(round(mean_age, 2), nsmall = 2)), vjust = 0, size = 6)+
            labs(x = "Subscription status", y = "Mean age", title = "Mean age for each subscription group")+
            theme(text = element_text(size = 20))

means_for_subscription_status0

### Visualization 4: Mean number of hours played per age group

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)
hoursperage <- tidy_players |>
                   mutate(age_grp = cut(Age, breaks = 8)) |>
                    group_by(age_grp) |>
                    summarise(mean_hours = mean(played_hours)) |>
                    ggplot(aes(x = age_grp, y = mean_hours)) +
                    geom_bar(stat = "identity", fill = "#5B86AE") +
                    labs(x = "Age group", y = "Average number of hours played", title = "Average number of hours per age group")+
                    theme(text = element_text(size = 15))
hoursperage

The number of hours played generally decreases with age with a few exceptions.

# Methods and Planning

KNN classification will be used because the two variables of interest (hours played and age of player) are numerical and predicting subscription status is a classification problem.

##### Assumptions of the model
KNN makes few assumptions. However, it assumes that the closer two data points are, the more similar they are to each other.
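
To make the distance assumption concrete, here is a hand-rolled sketch of a single KNN prediction in base R. The points are invented and assumed to be already standardized; `train`, `new_player` and the choice of `k = 3` are hypothetical and only serve the illustration.

```r
# Toy, already-standardized training points; values are invented for illustration.
train <- data.frame(
  Age          = c(-1.0, -0.8, -0.5,  0.9,  1.1,  1.3),
  played_hours = c( 1.2,  0.9,  1.0, -0.7, -1.0, -0.9),
  subscribe    = c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE")
)
new_player <- c(Age = -0.7, played_hours = 1.1)

# Euclidean distance from the new player to every training point.
dists <- sqrt((train$Age - new_player[["Age"]])^2 +
              (train$played_hours - new_player[["played_hours"]])^2)

k <- 3
nearest    <- order(dists)[1:k]                # indices of the k closest points
votes      <- table(train$subscribe[nearest])  # label counts among the neighbours
prediction <- names(votes)[which.max(votes)]   # majority vote

prediction
```

The new player is labelled by its closest neighbours alone, which is why predictors on very different scales need to be standardized before distances are computed.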

##### Limitations of KNN classification
KNN is computationally inefficient for large datasets. Furthermore, KNN classification may not perform well if the classes are imbalanced, because the algorithm may be biased towards the class with the higher number of occurrences.
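
One way to gauge the effect of imbalance is to compare the model's accuracy against a majority-class baseline. The sketch below uses invented labels (the 9-to-3 ratio is hypothetical, not the ratio in players.csv):

```r
# Toy labels with an imbalance; the 9-to-3 ratio is invented for illustration.
subscribe <- factor(c(rep("TRUE", 9), rep("FALSE", 3)))

class_props       <- prop.table(table(subscribe))  # share of each class
majority_class    <- names(which.max(class_props))
baseline_accuracy <- max(class_props)              # accuracy of always guessing the majority

class_props
majority_class
baseline_accuracy
```

A classifier whose accuracy does not exceed this baseline is doing no better than always predicting the more common class.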

##### How to compare and select the model?
5-fold cross-validation over a range of values of K will be used to select the model, choosing the K with the highest cross-validation accuracy.

##### Splitting the data
The data will be split into 75% training data and 25% testing data after converting it to a tidy format. 5-fold cross-validation will be applied to the training data.
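
The plan above could be sketched as follows. This is one possible implementation, not a finished analysis: it assumes the tidymodels meta-package and the kknn engine package are installed (neither is loaded elsewhere in this notebook), it runs on randomly generated stand-in data, and the seed, the grid of K values and all object names are arbitrary choices made for illustration.

```r
library(tidymodels)  # assumed to be installed, along with the kknn package

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Synthetic stand-in for the cleaned players data; values are randomly generated.
toy_players <- tibble(
  Age          = round(runif(120, 9, 58)),
  played_hours = rexp(120, rate = 0.2),
  subscribe    = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE, prob = c(0.7, 0.3)))
)

# 75% training / 25% testing, stratified on the outcome.
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize inside the recipe so scaling is estimated within each training fold.
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# KNN classifier with the number of neighbours left open for tuning.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

folds  <- vfold_cv(players_train, v = 5, strata = subscribe)  # 5-fold cross-validation
k_grid <- tibble(neighbors = seq(1, 15, by = 2))              # candidate K values

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy") |>
  arrange(desc(mean))  # highest cross-validation accuracy first

best_k <- cv_results$neighbors[1]
best_k
```

The test set is left untouched here; it would be used once, after K is chosen, to estimate how the final model performs on unseen players.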