In [None]:
library(tidyverse)

In [None]:
download.file(
  url = 'https://raw.githubusercontent.com/sunnyshang12/dsci100-individual-planning-stage/main/players.csv',
  destfile = 'players.csv'
)

players <- read_csv('players.csv')
players

In [None]:
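glimpse(players)           # column names, types, and a preview of values
nrow(players)              # number of observations
ncol(players)              # number of variables
colSums(is.na(players))    # count of missing values in each column
summary(players)           # per-column summary statistics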

**<h3>(1) Data Description</h3>**





**<h5>Data Overview</h5>**

This project uses the dataset ```players.csv```, which contains demographic and behavioral information on individual players who participated in a Minecraft research server.

**<h5>Summary of Observations</h5>**
- Number of observations (players): 196
- Number of variables: 7

#### Variable Summary

| Variable | Type | Description | Notes / Potential Issues |
|-----------|------|----------------|--------------------------|
| `experience` | Categorical | Self-reported Minecraft skill level | Uneven class sizes |
| `subscribe` | Logical (TRUE/FALSE) | Whether player subscribed to the newsletter |  |
| `hashedEmail` | Identifier | Player's unique identification |  |
| `played_hours` | Numeric | Total hours played on the research server | Highly right-skewed, many near-zero values |
| `name` | Categorical | Player's in-game name |  |
| `gender` | Categorical | Self-identified gender | Category imbalance |
| `Age` | Numeric | Age in years | 2 missing values |


#### Summary Statistics
Quantitative Variables

| Variable     | Mean  | SD    | Min  | Max    | Median |
| ------------ | ----- | ----- | ---- | ------ | ------ |
| played_hours | 5.85  | 28.36 | 0.00 | 300.00 | 0.10   |
| Age          | 21.14 | 7.39  | 9.00 | 58.00  | 19.00  |
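
Figures like those in the quantitative table can be computed with a single `summarize()` call. The sketch below is one possible way to do it, not necessarily how the table was produced. `toy_players` is a made-up stand-in so the block runs on its own, and the helper name `numeric_summary` is hypothetical.

```r
library(tidyverse)

# Hypothetical stand-in for `players`; the values are made up for illustration.
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.5, 2, 30),
  Age = c(17, 19, 21, NA, 25)
)

# Mean, SD, min, max, and median for every numeric column, one row per variable.
# Missing values are ignored in each statistic.
numeric_summary <- function(df) {
  df |>
    select(where(is.numeric)) |>
    pivot_longer(everything(), names_to = "variable", values_to = "value") |>
    group_by(variable) |>
    summarize(
      Mean = round(mean(value, na.rm = TRUE), 2),
      SD = round(sd(value, na.rm = TRUE), 2),
      Min = min(value, na.rm = TRUE),
      Max = max(value, na.rm = TRUE),
      Median = median(value, na.rm = TRUE)
    )
}

toy_summary <- numeric_summary(toy_players)
toy_summary
```

In the notebook, `numeric_summary(players)` would apply the same calculation to the real data.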


Non-quantitative Variables

| Variable    | Type      | Notes                                |
| ----------- | --------- | ------------------------------------ |
| experience  | character | majority = Amateur                   |
| subscribe   | logical   | majority = TRUE (FALSE 52; TRUE 144) |
| hashedEmail | character | all unique                           |
| name        | character | all unique                           |
| gender      | character | majority = male (124/196)            |
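
The notes in the non-quantitative table (majority class, uniqueness) can be checked with `count()` and `n_distinct()`. This is a minimal sketch under the assumption that the columns keep the names shown above; `toy_players` is a made-up stand-in so the block runs on its own.

```r
library(tidyverse)

# Hypothetical stand-in for `players`; the values are made up for illustration.
toy_players <- tibble(
  experience = c("Amateur", "Pro", "Amateur", "Veteran"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE),
  hashedEmail = c("a1", "b2", "c3", "d4"),
  gender = c("Male", "Female", "Male", "Male")
)

# Count players per category, largest group first, to find the majority class.
experience_counts <- toy_players |> count(experience, sort = TRUE)
subscribe_counts <- toy_players |> count(subscribe, sort = TRUE)

# An identifier column is "all unique" when its distinct count equals the row count.
emails_unique <- n_distinct(toy_players$hashedEmail) == nrow(toy_players)

experience_counts
subscribe_counts
emails_unique
```

Replacing `toy_players` with `players` runs the same checks on the real data.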


#### Data Collection
- Automated logging with tracking software during gameplay (```played_hours```)
- Self-reported survey data, user profiles, or registration forms (```Age```, ```gender```, etc.)
- Emails anonymized via hashing


**<h3>(2) Question</h3>**

I will be addressing the question: **What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

More specifically, **can demographic and gameplay-related variables such as ```Age```, ```gender```, ```experience```, and ```played_hours``` be used to predict whether a player subscribes to the UBC Minecraft research newsletter in the ```players.csv``` dataset?**

#### Variables
- Response variable: ```subscribe```, indicating whether a player is subscribed to the game-related newsletter
- Explanatory variables: ```Age```, ```gender```, ```experience```, and ```played_hours```.

The dataset provides both player-level characteristics and behavioral data, which allows comparison between subscribed and non-subscribed players. Before modelling, ```hashedEmail``` and ```name``` will be removed because they are identifiers that do not contribute to this analysis. Rows with missing values in ```Age``` will be removed, and the skew in ```played_hours``` will be handled with a logarithmic transformation. The dataset is already tidy, so it will be split into training and testing sets, and a K-NN classification model will be applied to evaluate predictive performance.
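
The plan above could be sketched roughly as follows with `tidymodels`. Several choices here are assumptions rather than decisions stated in this proposal: `log1p()` (that is, log(1 + x)) is used because `played_hours` contains zeros, the split is 75/25, `neighbors = 5` is a placeholder that would normally be tuned, and the `kknn` package is assumed to be installed. `toy_players` is randomly generated so the block runs on its own.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed so the toy data and the split are reproducible

# Hypothetical stand-in for `players`; every value is randomly generated.
toy_players <- tibble(
  Age = sample(c(15:40, NA), 80, replace = TRUE),
  played_hours = round(rexp(80, rate = 0.2), 1),
  gender = sample(c("Male", "Female"), 80, replace = TRUE),
  experience = sample(c("Amateur", "Pro", "Veteran"), 80, replace = TRUE),
  subscribe = sample(c(TRUE, FALSE), 80, replace = TRUE, prob = c(0.7, 0.3))
)

# Drop rows with a missing Age, log-transform hours, and make the response a factor.
players_clean <- toy_players |>
  filter(!is.na(Age)) |>
  mutate(
    log_hours = log1p(played_hours),  # assumption: log(1 + x) stays finite at 0 hours
    subscribe = factor(subscribe)
  ) |>
  select(subscribe, Age, gender, experience, log_hours)

# Assumed 75/25 split, stratified on the response.
players_split <- initial_split(players_clean, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Dummy-code the categorical predictors, then standardize all predictors
# so that no single variable dominates the distance calculation.
knn_recipe <- recipe(subscribe ~ ., data = players_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

# neighbors = 5 is a placeholder, not a tuned value.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# Accuracy on the held-out test set.
test_accuracy <- predict(knn_fit, players_test) |>
  bind_cols(players_test) |>
  accuracy(truth = subscribe, estimate = .pred_class)
test_accuracy
```

Because the toy responses are random, the accuracy printed here says nothing about the real data; the block only illustrates the order of the steps.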

**<h3>(3) Data Analysis and Visualization</h3>**

In [None]:
# Compute the mean of each quantitative explanatory variable, ignoring missing values
players_means <- players |>
    summarize(
        mean_age = round(mean(Age, na.rm = TRUE), 2),
        mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2)
    )

players_means