## Basis for project
### (1) Data Description

The data come from a Minecraft research server operated by a research group at UBC, which studied how people play video games. Two datasets were produced:

- **players.csv**: 196 observations and 7 variables  
  - **experience** (character): Player experience (e.g., beginner, intermediate, expert)  
  - **subscribe** (logical): Whether the player is subscribed to the project's newsletter  
  - **hashedEmail** (character): Unique anonymous ID for each player  
  - **played_hours** (numeric): Total hours played on the server  
  - **name** (character): Player name  
  - **gender** (character): Player's reported gender  
  - **Age** (numeric): Player's age (2 missing values)

  Mean values of numeric variables:  
  - *played_hours*: **5.85 hours**  
  - *Age*: **21.14 years**

- **sessions.csv**: 1535 observations and 5 variables  
  - **hashedEmail** (character): Used to match players  
  - **start_time**, **end_time** (character): Start and end of a play session  
  - **original_start_time**, **original_end_time** (numeric): Alternative time encoding

  Mean values of numeric variables:  
  - *original_start_time*: **1.719201e+12**  
  - *original_end_time*: **1.719196e+12**
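
The magnitude of these means (about 1.7e12) is consistent with Unix time in milliseconds, although the source does not state the unit. The following is a minimal sketch under that assumption, written in Python for illustration (the project itself works in R); `epoch_ms_to_utc` is a hypothetical helper name, not part of the project code.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(ms: float) -> datetime:
    """Convert milliseconds since 1970-01-01 UTC to a timezone-aware datetime.

    Assumption: the column stores Unix time in milliseconds.
    """
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Mean of original_start_time as reported in the summary above.
mean_start = epoch_ms_to_utc(1.719201e12)
print(mean_start.isoformat())
```

If the assumption holds, the mean start time falls in June 2024, which would make these columns a numeric counterpart of `start_time` and `end_time`.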

Issues: There are only two quantitative variables, which limits the available predictors. There may also be biases and inconsistencies in how the data were collected.



### (2) Question

**Broad question:** What player characteristics and behaviours are most predictive of newsletter subscription?

**Specific question:** Can player experience level, age, gender, and total hours played predict whether a player subscribes to the newsletter?

The response variable is `subscribe`; the predictors are `experience`, `played_hours`, `gender`, and `Age`. This question matters because identifying which types of players are more likely to stay engaged can help the research group target recruitment and plan server resources.



### (3) Exploratory Data Analysis and Visualization

The dataset loads into R. The only numeric variables are `played_hours` and `Age`; both have realistic ranges, though `played_hours` shows some right skew.  
A histogram of `played_hours` shows most players spent under 10 hours in-game, with a few high outliers. Boxplots comparing `played_hours` by `subscribe` suggest subscribers generally played longer.
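
The boxplot comparison can also be checked numerically with per-group medians, which are less affected by the high outliers than means. Below is a minimal sketch in Python (for illustration only; the project itself uses R). The function name `median_hours_by_subscription` is hypothetical, and the rows are made-up values, not data from `players.csv`.

```python
from collections import defaultdict
from statistics import median


def median_hours_by_subscription(rows):
    """Return {subscribe_value: median played_hours} for a list of dict rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["subscribe"]].append(float(row["played_hours"]))
    return {key: median(values) for key, values in groups.items()}


# Illustrative rows only; these are NOT values from players.csv.
toy_rows = [
    {"subscribe": True, "played_hours": 0.5},
    {"subscribe": True, "played_hours": 12.0},
    {"subscribe": True, "played_hours": 3.0},
    {"subscribe": False, "played_hours": 0.0},
    {"subscribe": False, "played_hours": 1.0},
]
medians = median_hours_by_subscription(toy_rows)
print(medians)
```

With real data, the rows could come from `csv.DictReader`, after converting the `subscribe` column from text to a boolean.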

These findings suggest that playtime, and possibly experience, may be useful predictors of newsletter subscription.



### (4) Methods and Plan

**Proposed method:** *k-nearest neighbours (KNN) classification.*

- **Why appropriate:** Non-parametric, easy to interpret, and usable with mixed numeric and categorical predictors once the categorical ones are encoded numerically.
- **Assumptions:** Distance metrics are meaningful once predictors are standardized and categorical variables encoded.
- **Limitations:** Sensitive to outliers and class imbalance, requires scaling.
- **Plan:**
  1. Merge `players.csv` and `sessions.csv` using `hashedEmail`.  
  2. Standardize numeric predictors (`played_hours`, `Age`) and encode categoricals (`experience`, `gender`).  
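  3. Split data into 70% training / 30% testing sets.  
  4. Choose `k` using the training set only (for example, by cross-validation) and fit the classifier.  
  5. Evaluate model accuracy and the confusion matrix on the test set.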
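
The steps above can be sketched end to end in plain Python (for illustration only; the project itself uses R). Everything here is an assumption made for the sketch: the helper names are hypothetical, the toy rows are made up rather than taken from `players.csv`, `k = 3` is arbitrary, and only the two numeric predictors are used, so categorical encoding is left out. The scaler is fitted on the training rows only, one common way to keep test data out of preprocessing.

```python
import math
import random
from collections import Counter


def train_test_split(rows, train_frac=0.7, seed=1):
    """Shuffle rows reproducibly and split them into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train_X):
    """Compute per-column mean and sample standard deviation from training data."""
    cols = list(zip(*train_X))
    means = [sum(c) / len(c) for c in cols]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
           for c, m in zip(cols, means)]
    return means, sds


def scale(X, means, sds):
    """Standardize each column with the supplied means and standard deviations."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]


def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest in Euclidean distance."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


# Made-up rows: ([played_hours, Age], subscribe). NOT values from players.csv.
toy = [([0.0, 17.0], False), ([0.5, 18.0], False), ([1.0, 19.0], False),
       ([1.5, 20.0], False), ([2.0, 21.0], False),
       ([20.0, 30.0], True), ([22.0, 31.0], True), ([24.0, 32.0], True),
       ([26.0, 33.0], True), ([28.0, 34.0], True)]

train, test = train_test_split(toy)
train_X, train_y = [x for x, _ in train], [y for _, y in train]
test_X, test_y = [x for x, _ in test], [y for _, y in test]

means, sds = fit_scaler(train_X)  # statistics come from training rows only
train_Z, test_Z = scale(train_X, means, sds), scale(test_X, means, sds)

predicted = [knn_predict(train_Z, train_y, z, k=3) for z in test_Z]
accuracy = sum(p == a for p, a in zip(predicted, test_y)) / len(test_y)
print(accuracy, dict(confusion_matrix(test_y, predicted)))
```

The toy classes are deliberately well separated, so the printed accuracy says nothing about how the method will perform on the real data.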



### (5) GitHub Repository

All code and progress for this project are tracked in:  
`https://github.com/vincccss/dsci100-project-planning.git`