In [None]:
library(tidyverse)

In [None]:
download.file(
  url = 'https://raw.githubusercontent.com/sunnyshang12/dsci100-individual-planning-stage/main/players.csv',
  destfile = 'players.csv'
)

players <- read_csv('players.csv')

In [None]:
glimpse(players)
nrow(players)
ncol(players)
colSums(is.na(players))
summary(players)

**<h3>(1) Data Description</h3>**





**<h5>Data Overview</h5>**

The ```players.csv``` dataset will be used in this project. It contains demographic and behavioral information on individual players who participated in a Minecraft research server.

**<h5>Summary of Observations</h5>**
- Number of observations (players): 196
- Number of variables: 7

#### Variable Summary

| Variable | Type | Description | Notes / Potential Issues |
|-----------|------|----------------|--------------------------|
| `experience` | Categorical | Self-reported Minecraft skill level | Uneven class sizes |
| `subscribe` | Logical (TRUE/FALSE) | Whether player subscribed to the newsletter |  |
| `hashedEmail` | Identifier | Player's unique identification |  |
| `played_hours` | Numeric | Total hours played on the research server | Highly right-skewed, many near-zero values |
| `name` | Categorical | Player's in-game name |  |
| `gender` | Categorical | Self-identified gender | Category imbalance |
| `Age` | Numeric | Age in years | 2 missing values |


#### Summary Statistics
Quantitative Variables
| Variable     | Mean  | SD    | Min  | Max    | Median |
| ------------ | ----- | ----- | ---- | ------ | ------- |
| played_hours | 5.85  | 28.36 | 0.00 | 300.00 | 0.10       |
| Age          | 21.14 | 7.39  | 9.00 | 58.00  | 19.00       |
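
The statistics in the table above could be computed along the following lines. This is a minimal sketch, not the code used to produce the table: `toy_players` is an invented stand-in so the block runs on its own, and `quant_summary` is a hypothetical name. Replacing `toy_players` with `players` should give the real values.

```r
library(tidyverse)

# Toy stand-in for `players`; these values are invented for illustration only
toy_players <- tibble(
  played_hours = c(0.0, 0.1, 0.5, 2.3, 30.0),
  Age = c(17, 19, 21, NA, 25)
)

# Reshape to long form so each variable is summarized the same way,
# giving one row per variable and one column per statistic
quant_summary <- toy_players |>
  pivot_longer(c(played_hours, Age), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarize(
    mean   = round(mean(value, na.rm = TRUE), 2),
    sd     = round(sd(value, na.rm = TRUE), 2),
    min    = min(value, na.rm = TRUE),
    max    = max(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE)
  )

quant_summary
```

Using `na.rm = TRUE` matters here because `Age` contains missing values; without it, every `Age` statistic would be `NA`.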


Non-quantitative Variables
| Variable    | Type      | Notes                                                         |
| ----------- | --------- | ------------------------------------------------------------- |
| experience  | character | majority = Amateur |
| subscribe   | logical   | majority = TRUE (FALSE 52; TRUE 144)                       |
| hashedEmail | character | all unique                |
| name        | character | all unique        |
| gender      | character | majority = male (124/196)                         |


#### Data Collection
- Automated logging with tracking software during gameplay (```played_hours```)
- Self-reported survey data, user profiles, or registration forms (```Age```, ```gender```, etc.)
- Emails anonymized via hashing


**<h3>(2) Question</h3>**

I will be addressing the question: **What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

More specifically, **can demographic and gameplay-related variables such as ```Age```, ```gender```, ```experience```, and ```played_hours``` be used to predict whether a player subscribes to the UBC Minecraft research newsletter in the ```players.csv``` dataset?**

#### Variables
- Response variable: ```subscribe```, indicating whether the player is subscribed to the game-related newsletter
- Explanatory variables: ```Age```, ```gender```, ```experience```, and ```played_hours```.

The dataset provides both player-level characteristics and behavioral data, which allows comparison between subscribed and non-subscribed players. For modelling, ```hashedEmail``` and ```name``` will be excluded because they are identifiers and do not contribute to this analysis. The missing values in ```Age``` will be handled, and the skew in ```played_hours``` will be reduced with a logarithmic transformation. The dataset is already tidy; it will be split into training and testing sets, and a K-NN classification model will be applied to evaluate predictive performance.
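
The preprocessing described above might look like the following sketch. It rests on assumptions that are not stated in the plan: missing ```Age``` values are handled by dropping those rows (imputation would be an alternative), and the log transformation is `log10(x + 1)` to cope with zero hours. `toy_players`, `players_clean`, and `log_played_hours` are hypothetical names, and the toy rows are invented.

```r
library(tidyverse)

# Toy stand-in with the same columns as players.csv; values are invented
toy_players <- tibble(
  experience   = c("Amateur", "Beginner", "Regular", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(0, 9, 99, 0.5),
  name         = c("Ann", "Bo", "Cy", "Di"),
  gender       = c("Female", "Male", "Male", "Agender"),
  Age          = c(17, NA, 22, 30)
)

players_clean <- toy_players |>
  select(-hashedEmail, -name) |>   # identifiers carry no predictive information
  drop_na(Age) |>                  # assumption: drop rows with missing Age
  mutate(
    log_played_hours = log10(played_hours + 1),  # +1 keeps zero hours defined
    subscribe = as.factor(subscribe)             # classification needs a factor response
  )

players_clean
```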

**<h3>(3) Data Analysis and Visualization</h3>**

In [None]:
# mean of the quantitative explanatory variables
players_means <- players |>
    summarize(
        mean_age = round(mean(Age, na.rm = TRUE), 2),
        mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2)
    )

players_means 

| Quantitative Variable     | Mean  | 
| ------------ | ----- | 
| played_hours | 5.85  | 
| Age          | 21.14 |

In [None]:
#converting relevant variables to factors 
players <- players |>
    mutate(
        experience = as.factor(experience),
        gender     = as.factor(gender),
        subscribe  = as.factor(subscribe))

In [None]:
# box plot of played hours against subscription status
# played_hours is log-transformed (log10(x + 1)) because it is highly skewed;
# the transformation reduces the impact of large outliers
ggplot(players, aes(x = subscribe, y = log10(played_hours + 1), fill = subscribe)) +
  geom_boxplot() +
  labs(
    title = 'Played Hours by Subscription Status',
    x = 'Subscribed to Newsletter',
    y = 'Total Hours Played (Log10)'
  ) + theme(legend.position = 'right')


Players who subscribed to the newsletter generally have a higher total number of hours played even after log transformation, indicating that ```played_hours``` could predict ```subscribe```.

In [None]:
#boxplot of age against subscribed 
ggplot(players, aes(x = subscribe, y = Age, fill = subscribe)) +
  geom_boxplot(alpha = 0.8) +
  labs(
    title = "Age Distribution by Newsletter Subscription",
    x = "Subscribed",
    y = "Age (years)") +
    theme(legend.position = 'right')


The slightly older median among non-subscribers suggests that players who subscribed to the newsletter are marginally younger than those who did not. However, the similar age distributions for subscribed and non-subscribed players indicate that ```Age``` does not have a major influence on ```subscribe```.
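
The box plot comparisons above can be backed with numbers by computing group medians. The sketch below is illustrative only: `toy_players` is invented data and `group_medians` is a hypothetical name, so the output says nothing about the real dataset until `players` is substituted.

```r
library(tidyverse)

# Toy stand-in for `players`; values are invented for illustration
toy_players <- tibble(
  subscribe    = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  played_hours = c(0.5, 9, 99, 0, 0.2),
  Age          = c(17, 19, NA, 22, 24)
)

# Median of each numeric variable within each subscription group,
# matching the centre lines drawn by geom_boxplot()
group_medians <- toy_players |>
  group_by(subscribe) |>
  summarize(
    n = n(),
    median_age = median(Age, na.rm = TRUE),
    median_log_hours = median(log10(played_hours + 1))
  )

group_medians
```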

In [None]:
ggplot(players, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Subscribed Players by Experience Level",
    x = "Experience Level",
    y = "Proportion of Players"
  ) +
  theme_minimal()

Players with beginner and regular experience have relatively higher proportions of subscribers, indicating that ```experience``` could predict ```subscribe```.

In [None]:
ggplot(players, aes(x = gender, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Subscribed Players by Gender",
    x = "Gender",
    y = "Proportion of Players"
  ) +
  theme_minimal()

Although agender and other players have nearly 100% subscription, there are far fewer observations for those players than for male and female players. Females do seem to have a slightly greater proportion of subscribers compared to males, but overall ```gender``` does not seem to strongly predict ```subscribe```.
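
Because a proportional bar chart hides group sizes, it can help to print the counts next to the proportions. This is a minimal sketch on invented data (`toy_players` and `gender_summary` are hypothetical names). The comparison `subscribe == "TRUE"` is used so the same code works whether ```subscribe``` is logical or has already been converted to a factor.

```r
library(tidyverse)

# Toy stand-in for `players`; values are invented for illustration
toy_players <- tibble(
  gender    = c("Male", "Male", "Male", "Female", "Female", "Agender"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Group size and subscription rate per gender category, largest groups first
gender_summary <- toy_players |>
  group_by(gender) |>
  summarize(
    n = n(),
    prop_subscribed = mean(subscribe == "TRUE")
  ) |>
  arrange(desc(n))

gender_summary
```

A category with a very small `n` can show a proportion near 0 or 1 by chance alone, which is why the counts belong beside the proportions.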

**<h3>(4) Methods and Plan</h3>**

- **Method**: K-Nearest Neighbors (KNN) classification is appropriate because the response variable ```subscribe``` is binary and KNN makes no assumptions about the data distribution or linearity between variables.
- **Assumptions/Requirements**: Quantitative variables such as ```Age``` and ```played_hours``` need to be standardized so that no variable dominates the distance calculation. The choice of ```k``` (number of neighbors) should balance overfitting (small ```k```) against underfitting (large ```k```).
- **Limitations**: KNN provides predictions but no direct insight into how each variable is related to subscription. KNN is also sensitive to outliers and may favour the majority class (TRUE) of ```subscribe```.
- **Model comparison & selection**: Different ```k``` values will be tested, and 5-fold cross-validation will be used to find the value with the highest prediction accuracy. Model performance will be evaluated using accuracy, precision, and recall and compared with simple linear regression.
- **Data processing**: Numeric variables will be standardized, and the data will be split into training (70%) and testing (30%) sets using a fixed seed. 5-fold cross-validation will then be performed on the training set to tune ```k```, and the ```k``` with the highest accuracy will be used to predict ```subscribe``` on the test set.
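
The plan above could be sketched roughly as follows. Several things here are assumptions rather than part of the plan: the use of the `tidymodels` framework with the `kknn` engine, the seed value, the candidate grid of ```k``` values, and the restriction to the two numeric predictors for brevity. The data are simulated so the block runs on its own; `toy_players` and the other object names are hypothetical.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # fixed seed; the value itself is an arbitrary choice

# Simulated stand-in for players.csv; values are random, not real data
toy_players <- tibble(
  Age = sample(9:58, 200, replace = TRUE),
  played_hours = round(rexp(200, rate = 0.2), 1),
  subscribe = factor(sample(c(TRUE, FALSE), 200, replace = TRUE, prob = c(0.7, 0.3)))
)

# 70/30 train/test split, stratified on the response
players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Log-transform played_hours, then standardize all predictors
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_log(played_hours, offset = 1, base = 10) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN specification with k left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set over a grid of odd k values
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 21, by = 2))

knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid,
            metrics = metric_set(accuracy)) |>
  collect_metrics()

# k with the highest mean cross-validated accuracy
best_k <- knn_results |>
  filter(.metric == "accuracy") |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k and predict on the test set
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(final_spec) |>
  fit(data = players_train)

test_predictions <- predict(final_fit, players_test) |>
  bind_cols(players_test)

# "TRUE" is the second factor level, so it is set as the positive class
test_predictions |> accuracy(truth = subscribe, estimate = .pred_class)
test_predictions |> precision(truth = subscribe, estimate = .pred_class, event_level = "second")
test_predictions |> recall(truth = subscribe, estimate = .pred_class, event_level = "second")
```

Putting the standardization inside the recipe means the scaling parameters are estimated from the training data only, both within each cross-validation fold and in the final fit, so no information from the test set leaks into tuning.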