In [None]:
library(tidyverse)
set.seed(123)

players_url <- "https://raw.githubusercontent.com/jw0220/individual_project/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/jw0220/individual_project/refs/heads/main/sessions.csv"

In [None]:
players <- read_csv(players_url)
head(players)

In [None]:
sessions <- read_csv(sessions_url)
head(sessions)

**1) Sessions Dataset**
- 5 variables, 1535 observations

Variables:
- hashedEmail: Character, anonymized email address.
- start_time: Character, event start time in DD/MM/YYYY HH:MM format.
- end_time: Character, event end time in DD/MM/YYYY HH:MM format.
- original_start_time: Double, start time as Unix timestamp (seconds since 01/01/1970).
- original_end_time: Double, end time as Unix timestamp.

Data Issues:
- start_time and end_time are stored as character strings, so they must be parsed into date-time values before session durations can be computed.
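
One way to handle this issue is sketched below, assuming the DD/MM/YYYY HH:MM layout described above. The `sessions_toy` rows and the `duration_min` column are hypothetical and only illustrate the parsing step; they are not taken from sessions.csv.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rows in the same DD/MM/YYYY HH:MM layout as sessions.csv
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

sessions_parsed <- sessions_toy |>
  mutate(
    start_time = dmy_hm(start_time),  # character -> POSIXct (UTC by default)
    end_time = dmy_hm(end_time),
    # Session length in minutes (illustrative derived column)
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

sessions_parsed
```

In the notebook, the same `mutate()` call could be applied to `sessions` in place of `sessions_toy`.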




**2) Players Dataset**
- 7 variables, 196 observations

Variables:
- experience: Character, user experience level.
- subscribe: Logical, indicates whether the user is subscribed to the game-related newsletter (TRUE/FALSE).
- hashedEmail: Character, anonymized user email.
- played_hours: Double, total hours played by the user.
- name: Character, user's first name.
- gender: Character, user's gender.
- Age: Double, user's age in years.

Data Issues:
- Age is stored as a double but is always a whole number, so it should be stored as an integer.
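
Before casting, it may be worth confirming that every non-missing age really is a whole number, since `as.integer()` silently truncates fractions. A minimal sketch, using a hypothetical `age_toy` vector rather than the real Age column:

```r
# Hypothetical ages; NA stands in for a missing value
age_toy <- c(9, 17, 21, NA, 22)

# TRUE only if every non-missing value has no fractional part
is_whole <- all(age_toy == round(age_toy), na.rm = TRUE)

# Cast only when it is lossless; otherwise keep the doubles
age_int <- if (is_whole) as.integer(age_toy) else age_toy
```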


**Questions**

Broad Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question: Can the user's amount of playtime and age predict whether they are subscribed to the game-related newsletter?

**Exploratory Data Analysis and Visualization**

In [None]:
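# Convert columns to the types used in the analysis
players_tidy <- players |>
  mutate(
    Age = as.integer(Age),                   # ages are whole numbers
    played_hours = as.numeric(played_hours)  # already a double; kept as an explicit safeguard
  )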

# Compute mean values for quantitative variables
mean_values <- players_tidy |>
  summarise(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_Age = mean(Age, na.rm = TRUE)
  )

mean_values

In [None]:
# ggplot2 and dplyr are already attached by library(tidyverse) in the first cell

# 1. Histogram of played hours
played_hours_distribution <- ggplot(players, aes(x = played_hours)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Played Hours", x = "Played Hours", y = "Number of Users")
played_hours_distribution

# 2. Boxplot of played hours by subscription status
played_hours_subscribe <- ggplot(players, aes(x = as.factor(subscribe), y = played_hours, fill = as.factor(subscribe))) +
  geom_boxplot() +
  labs(title = "Played Hours by Subscription Status", x = "Subscribed", y = "Played Hours") +
  scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "green"), name = "Subscribed")
played_hours_subscribe

# 3. Jittered scatter plot of played hours against subscription status
subscription_scattered <- ggplot(players, aes(x = played_hours, y = subscribe)) +
  geom_jitter(aes(color = as.factor(subscribe)), height = 0.2, alpha = 0.7) +
  labs(title = "Scatter Plot of Played Hours vs Subscription", x = "Played Hours", y = "Subscription Status") +
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "green"), name = "Subscribed")
subscription_scattered


- Histogram of Played Hours
    - The distribution is strongly right-skewed: most users have very few played hours, while a few have very high values.

- Boxplot of Played Hours by Subscription Status
    - Subscribed users tend to have more played hours than unsubscribed users.
    - Played hours are more spread out among users who are subscribed to the newsletter.
    - Outliers in played_hours likely reflect highly engaged players. Because KNN relies on distances between points, these extreme values could distort the model.

- Scatter Plot (Jittered) with Subscription
    - Both groups are concentrated in the 0-10 hour range, so most subscribed and unsubscribed users have similar played hours. The subscribed group, however, includes a few large outliers beyond 150 hours.
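
The visual comparisons above could be backed by a per-group numeric summary. The sketch below uses a hypothetical `players_toy` table with the same column names as players.csv; its values are made up for illustration and are not results from the real data.

```r
library(dplyr)

# Hypothetical toy data with the same column names as players.csv
players_toy <- tibble(
  subscribe = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  played_hours = c(0.1, 0.5, 3.8, 160.0, 0.0, 0.2, 1.4)
)

# Count, centre, and extreme of played hours within each subscription group
hours_by_subscribe <- players_toy |>
  group_by(subscribe) |>
  summarise(
    n = n(),
    median_hours = median(played_hours, na.rm = TRUE),
    mean_hours = mean(played_hours, na.rm = TRUE),
    max_hours = max(played_hours, na.rm = TRUE)
  )

hours_by_subscribe
```

With skewed data like this, the median is usually a more representative centre than the mean, which a single very active player can pull upward.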


**Methods and Plan**

Prediction model: K-nearest neighbours (KNN)

Why is this method appropriate?
- KNN does not assume a specific distribution for the data, making it flexible for various relationships.
- KNN can capture non-linear relationships between the predictors (played_hours and Age) and subscribe.

Assumptions Required for KNN
- Choosing the right k: If k is too small, the model can be overly sensitive to noise. If k is too large, it may over-smooth the decision boundary. We can tune k using cross-validation.
- Balanced classes: If one class (subscribed or not subscribed) is much more common than the other, KNN's majority vote tends to favour the larger class, and accuracy alone can be misleading.
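
Class balance is easy to check before modelling. A minimal sketch, assuming a logical `subscribe` column as in players.csv; the `subscribe_toy` values are hypothetical:

```r
library(dplyr)

# Hypothetical labels: 6 subscribed, 2 not subscribed
subscribe_toy <- tibble(
  subscribe = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Count and proportion of each class
class_balance <- subscribe_toy |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

class_balance
```

The larger proportion is also the accuracy a classifier would reach by always predicting the majority class, which gives a useful baseline for judging the KNN model.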

Limitations of KNN
- Computationally expensive: KNN requires storing the entire dataset and computing distances for every prediction. If the dataset is large, this can be slow.
- Curse of dimensionality: If too many features are used, distance calculations become less informative. Since we are using only played_hours and Age, this is less of a concern.

Model Selection and Evaluation Strategy
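- Split the data into training and testing sets.
- Standardize played_hours and Age to mean = 0 and standard deviation = 1, using statistics computed from the training set only.
- Choose the best k using cross-validation on the training set.
- Train the KNN classifier with the chosen k and evaluate its performance on the testing set.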
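
The plan above could be sketched as follows. This is only an illustration under stated assumptions, not the final analysis: it uses the `class` package (shipped with standard R installations) rather than any particular modelling framework, simulated data in place of players.csv, a 75/25 split, leave-one-out cross-validation via `knn.cv()`, and a hypothetical candidate grid of odd k from 1 to 15. The two predictors are those named in the specific question.

```r
library(class)  # provides knn() and knn.cv()

set.seed(123)

# Hypothetical simulated data; column names mirror players.csv
n <- 120
toy <- data.frame(
  played_hours = rexp(n, rate = 0.2),
  Age = sample(10:40, n, replace = TRUE)
)
toy$subscribe <- factor(toy$played_hours + rnorm(n, sd = 3) > 4)

# 1. Split into training (75%) and testing (25%) sets
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# 2. Standardize predictors using the training-set mean and sd only
predictors <- c("played_hours", "Age")
train_x <- scale(train[, predictors])
test_x <- scale(test[, predictors],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

# 3. Choose k by leave-one-out cross-validation on the training set
k_values <- seq(1, 15, by = 2)
cv_accuracy <- sapply(k_values, function(k) {
  mean(knn.cv(train_x, cl = train$subscribe, k = k) == train$subscribe)
})
best_k <- k_values[which.max(cv_accuracy)]

# 4. Classify the test set with the chosen k and evaluate
test_pred <- knn(train_x, test_x, cl = train$subscribe, k = best_k)
test_accuracy <- mean(test_pred == test$subscribe)
table(predicted = test_pred, actual = test$subscribe)
```

Scaling the test set with the training-set centre and scale keeps information from the test set out of model fitting, which is why the split comes before standardization.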
