In [None]:
#loading necessary libraries
library(tidyverse)
library(recipes)


# Dataset Descriptions
## What is the dataset about?
This dataset was collected by the Pacific Laboratory for Artificial Intelligence (PLAI), led by Frank Wood at the University of British Columbia. The data describe how people play video games and were recorded through a Minecraft server, set up by the research team, that logs players' actions as they navigate a Minecraft world.<br></br>
## Dataset composition
The dataset consists of two comma-separated values (CSV) files: players.csv and sessions.csv.
#### players.csv
This original file contains 196 observations and 7 variables, which are explained below: <br></br>
        1. experience: classifies a player's experience into 5 distinct categories: Pro, Veteran, Amateur, Regular and Beginner. This column is of the character type. <br></br>
        2. subscribe: TRUE if the player has subscribed to the newsletter and FALSE otherwise. This column is of the logical type. <br></br>
        3. hashedEmail: hashed email address of the player. This column is of the character type. <br></br>
        4. played_hours: number of hours the player has spent on the game. This column is of the double type. <br></br>
        5. name: name of the player. This column is of the character type. <br></br>
        6. gender: gender of the player, with unique values Male, Female, Non-binary, Prefer not to say, Agender, Two-Spirited and Other. This column is of the character type. <br></br>
        7. Age: age of the player. This column is of the double type.



### Potential issues with players.csv 

1. The "experience" variable is stored as character data; it would make more sense to convert it to a factor. <br></br>
2. The "hashedEmail" variable should be deleted because it does not provide any useful information for this project. <br></br>
3. The "gender" variable is stored as character data; it should also be converted to a factor. <br></br>
4. The "name" variable should be deleted because it personally identifies players.
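
The four changes listed above could be applied in a single pipeline. The following is a minimal sketch on a small made-up tibble: the rows in `toy_players` and the name `clean_players` are invented for illustration and are not taken from players.csv; only the column names come from the description above.

```r
library(dplyr)
library(tibble)

# Made-up rows that mimic the column layout of players.csv (illustration only)
toy_players <- tibble(
  experience   = c("Pro", "Beginner", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE),
  hashedEmail  = c("a1b2", "c3d4", "e5f6"),
  played_hours = c(30.3, 0.1, 1.5),
  name         = c("Player A", "Player B", "Player C"),
  gender       = c("Male", "Female", "Non-binary"),
  Age          = c(17, 21, 19)
)

clean_players <- toy_players |>
  select(-hashedEmail, -name) |>                     # drop identifying columns
  mutate(across(c(experience, gender), as.factor))   # character -> factor

glimpse(clean_players)
```

The same two steps should work on the real `players` tibble once it has been read in, assuming the column names match those listed above.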
        

In [None]:
download.file("https://raw.githubusercontent.com/tahsansamin/project_planning_stage_individual/refs/heads/main/dataset/sessions.csv", "sessions.csv")
download.file("https://raw.githubusercontent.com/tahsansamin/project_planning_stage_individual/refs/heads/main/dataset/players.csv", "players.csv")

In [None]:
players <- read_csv("players.csv", show_col_types = FALSE)
players

#### sessions.csv
This original file consists of 1535 observations and contains the following variables: <br></br>
            1. hashedEmail: hashed email address of the player, stored as character data. <br></br>
            2. start_time: date and time when the player started the gaming session. It is currently stored as character data and may need to be converted to a date-time format before it can be analysed. <br></br>
            3. end_time: date and time when the player stopped the gaming session. It is currently stored as character data and may need to be converted to a date-time format before it can be analysed. <br></br>
            4. original_start_time: Unix timestamp (in seconds) of when the player started the session. This column is of the double type. <br></br>
            5. original_end_time: Unix timestamp (in seconds) of when the player ended the gaming session. This column is of the double type. <br></br>
            The difference between original_end_time and original_start_time gives the duration of a playing session more easily and more reliably than start_time and end_time, since the latter two could be in different time zones.
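
The duration calculation described above can be sketched as follows. This is only an illustration on made-up values: `toy_sessions`, `duration_mins` and `start_utc` are hypothetical names, and the timestamps are assumed to be in seconds, as stated in the variable descriptions. If the real values turn out to use a different unit, the divisor would need to change accordingly.

```r
library(dplyr)
library(tibble)

# Made-up Unix timestamps (illustration only), assumed to be in seconds
toy_sessions <- tibble(
  original_start_time = c(1700000000, 1700003600),
  original_end_time   = c(1700001800, 1700010800)
)

toy_sessions <- toy_sessions |>
  mutate(
    # difference in seconds, converted to minutes
    duration_mins = (original_end_time - original_start_time) / 60,
    # seconds since 1970-01-01 converted to a date-time in UTC
    start_utc = as.POSIXct(original_start_time, origin = "1970-01-01", tz = "UTC")
  )

toy_sessions
```

Because both timestamps count from the same reference point, the difference does not depend on the time zone in which start_time and end_time were recorded.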

<b>Note: </b>The hashedEmail variable should be removed because it contains sensitive personal information (players' emails).

In [None]:
sessions <- read_csv("sessions.csv", show_col_types = FALSE)
sessions

# Questions
The <b>broad question</b> that will be explored is:
        <i>What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?</i> More <b>specifically</b>: <i>can a player's age and number of hours played be used to predict their newsletter subscription status?</i>

To answer this question, the players.csv file will be used. First, the names and hashed emails of all players will be removed to keep the data confidential. The player experience variable will also be converted to a factor. Next, rows with NA or missing values for player age, number of hours played or newsletter subscription status will be removed; there are few such rows, so removing them is very unlikely to cause a loss of valuable data. After that, the hours played and age variables will be standardized so that neither variable carries more weight in prediction than the other. Finally, the K-nearest neighbors algorithm will be applied to predict a player's newsletter subscription status from their age and the number of hours they have played.

# Exploratory Data Analysis and Visualizations

##### Loading data

In [None]:
players <- read_csv("players.csv", show_col_types = FALSE)
players

In [None]:
na_vals <- colSums(is.na(players))
na_vals

<p>Based on the results above, only two observations have NA values for Age. Because so few observations are affected, they can be removed.</p>

In [None]:
tidy_players <- players |>
            drop_na()
tidy_players

In [None]:
mean_vals <- tidy_players |>
            select(played_hours, Age) |>
            summarize(across(played_hours:Age, mean))
mean_vals

#### Table Summarizing mean of quantitative variables
| Variable | Mean |
|----------|------|
| played_hours | 5.90|
| Age | 21.14|

In [None]:
# as.numeric() because scale() returns a one-column matrix, not a plain vector
tidy_players2 <- tidy_players |>
            mutate(Age_scaled = as.numeric(scale(Age)), Hours_scaled = as.numeric(scale(played_hours)))

summary(tidy_players2)

In [None]:
tidy_players2

### Visualization 1: Number of Hours played plotted against age

In [None]:
figure_1 <- tidy_players2 |>
            ggplot(aes(x = Age, y= log1p(played_hours), color = subscribe)) +
            geom_point() +
            labs(x = "Age of Player", y = "Hours Played (log(1 + hours))", title = "Hours Played (log scale) against Age of Player") +
            theme(text = element_text(size = 14)) 

figure_1

The visualization above shows a general trend that younger players tend to play more hours of the game, and that the more hours a player has played, the more likely they are to be subscribed.

In [None]:
figure_2 <- tidy_players2 |>
            ggplot(aes(x = Age_scaled, y= Hours_scaled, color = subscribe)) +
            geom_point() +
            labs(x = "Age of Player (standardized)", y = "Number of Hours Played (standardized)", title = "Standardized Hours Played against Standardized Age") +
            theme(text = element_text(size = 14)) 

figure_2

In [None]:



# Histograms of hours played, split by subscription status
hist_1 <- filter(tidy_players, subscribe == TRUE) |>
            ggplot(aes(x = played_hours)) +
            geom_histogram() +
            labs(x = "Number of hours played", title = "Histogram of hours played for subscribers")

hist_2 <- filter(tidy_players, subscribe == FALSE) |>
            ggplot(aes(x = played_hours)) +
            geom_histogram() +
            labs(x = "Number of hours played", title = "Histogram of hours played for non-subscribers")

hist_1
hist_2

According to the histograms above, players who are subscribers typically have a higher number of hours played, as is evident from the scales of the histograms. Among subscribers there are individuals who played over 200 hours of the game, whereas the maximum number of hours played by someone who is not subscribed is around 7 hours.

In [None]:
# Histograms of age, split by subscription status
hist_3 <- filter(tidy_players, subscribe == TRUE) |>
            ggplot(aes(x = Age)) +
            geom_histogram() +
            labs(x = "Age of players", title = "Histogram of age for subscribers")

hist_4 <- filter(tidy_players, subscribe == FALSE) |>
            ggplot(aes(x = Age)) +
            geom_histogram() +
            labs(x = "Age of players", title = "Histogram of age for non-subscribers")

hist_3
hist_4

The histograms above show a general trend that players who are subscribers tend to be younger than players who are not subscribers.

In [None]:
##do we need other visualizations?

# Methods and Planning

For this project, the K-nearest neighbors (KNN) algorithm will be used for classification. This is appropriate because the two variables of interest (hours played and age of player) are numerical, and predicting a player's subscription status is a classification problem. Furthermore, KNN is based on the similarity of nearby points, so general trends in player behaviour can be better understood. Another crucial aspect is that KNN requires few assumptions about what the data must look like.

##### Assumptions of the model
The KNN algorithm has few assumptions. Notably, it assumes that the closer two data points are, the more related and similar they are to each other.
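
To make the notion of closeness concrete, here is a small sketch of the straight-line (Euclidean) distance between two players. The two players and the helper `euclid()` are made up for illustration, and Euclidean distance is an assumption here, since the text does not name a specific distance measure.

```r
# Two made-up players (illustration only), described by hours played and age
player_a <- c(played_hours = 1,  Age = 18)
player_b <- c(played_hours = 50, Age = 19)

# Hypothetical helper: Euclidean distance between two numeric vectors
euclid <- function(p, q) sqrt(sum((p - q)^2))

# On the raw scale the 49-hour gap dominates the 1-year age gap
raw_distance <- euclid(player_a, player_b)
raw_distance
```

On unstandardized data the variable with the larger spread drives the distance almost entirely, which is one reason to standardize both predictors before applying KNN.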

##### Limitations of KNN classification
First, KNN is computationally expensive for large datasets: every prediction requires computing the distance to all training points, so predictions become slower as the dataset grows. Furthermore, KNN classification may not perform well if the classes are imbalanced, for example if the dataset contains many more subscribers than non-subscribers. In that case the algorithm would often predict the most frequent class for a data point, which may not always be right.
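
One way to gauge the class-imbalance concern is to tabulate the class proportions and note the accuracy a classifier would get by always predicting the most frequent class. The sketch below uses a made-up `subscribe` column; `toy_players`, `class_props` and `majority_baseline` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Made-up subscription labels (illustration only)
toy_players <- tibble(subscribe = c(TRUE, TRUE, TRUE, FALSE))

# Count each class and convert the counts to proportions
class_props <- toy_players |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

# Accuracy of always predicting the most frequent class
majority_baseline <- max(class_props$prop)

class_props
majority_baseline
```

A KNN model is only informative if its cross-validation accuracy clearly exceeds this baseline.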

##### How to compare and select the model?
Five-fold cross-validation over a range of values of K will be used, and the K with the highest cross-validation accuracy will be selected.

##### Data preprocessing
The players.csv dataset will be split into a training set containing 75% of the data and a testing set containing the remaining 25%. The split will be performed after removing confidential information, removing NA values and converting some columns to more appropriate data types. Cross-validation with 5 folds and a range of K values will then be performed on the training set to determine which K is best.
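
The plan above could look roughly like the following sketch. It assumes the tidymodels and kknn packages are installed (the notebook itself only loads tidyverse and recipes), runs on simulated data rather than players.csv, and makes several choices that the plan does not specify: the seed, the stratified split, the rectangular weight function and the range of K values are all assumptions for illustration.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, used only to make the sketch reproducible

# Simulated stand-in for the cleaned players data (illustration only)
toy_players <- tibble(
  Age          = round(runif(120, 10, 50)),
  played_hours = rexp(120, rate = 0.2),
  subscribe    = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE, prob = c(0.7, 0.3)))
)

# 75% / 25% split; stratifying on the outcome is an assumption of this sketch
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize the predictors using statistics estimated from the training data
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over a grid of candidate K values (arbitrary range)
folds  <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest mean cross-validation accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Putting the standardization inside the recipe means it is re-estimated within each fold, so the held-out fold does not influence the scaling. The test set is left untouched until a final K has been chosen.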