Project Planning Stage  
Group-009-36 

**Data Description — players.csv**

The `players.csv` dataset contains activity data for each participant in the Minecraft research project. Each row represents one player, identified by a hashed email. The dataset includes **196 observations** and **7 variables**.

**Key Variables**

| Variable | Type | Description |
|-----------|------|-------------|
| `experience` | string | Player’s experience level (e.g., Pro, Veteran, Regular, Amateur). |
| `subscribe` | boolean | Whether the player subscribed to the newsletter. |
| `hashedEmail` | string | Unique player identifier. |
| `played_hours` | float | Number of hours played. |
| `name` | string | Player’s name (non-identifiable). |
| `gender` | string | Player’s gender. |
| `Age` | float | Player’s age in years. |

**Summary Statistics (rounded to 2 decimals)**

| Statistic | `played_hours` | `Age` |
|------------|----------------|--------|
| Mean | 5.85 | 21.14 |
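The means in the table can be computed with `summarise()`. The sketch below is a minimal illustration, not the exact code used for the report: `toy_players` is a made-up stand-in with five invented rows, and `na.rm = TRUE` is an assumption about how missing ages are handled.

```r
library(tidyverse)

# Hypothetical stand-in rows (NOT the real data); swap in the real `players` tibble.
toy_players <- tibble(
  played_hours = c(0.0, 0.4, 1.2, 30.3, 223.1),
  Age          = c(17, 21, NA, 19, 25)
)

# Mean of each numeric variable, rounded to 2 decimals.
# na.rm = TRUE drops missing values before averaging.
toy_summary <- toy_players |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age          = round(mean(Age, na.rm = TRUE), 2)
  )

toy_summary
```

Without `na.rm = TRUE`, `mean()` returns `NA` for any column that contains a missing value.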


**Observations and Issues**

Most players recorded little playtime, but a few logged more than 200 hours, so the distribution of `played_hours` is strongly **right-skewed**: most values are low and a few are extreme outliers. The `Age` variable has **2 missing values**, and `experience` is a categorical variable with several levels that must be encoded (for example, as a factor) before analysis. The `name` field is a free-text label with no analytical value and can be ignored.
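One way to check these issues is to count missing ages and convert `experience` to a factor. This is a sketch only: `toy_players` holds invented rows, and the choice of `as_factor()` (levels in order of first appearance) is an assumption, not a decision recorded in this report.

```r
library(tidyverse)

# Hypothetical stand-in rows (NOT the real data).
toy_players <- tibble(
  experience = c("Pro", "Veteran", "Regular", "Amateur", "Amateur"),
  Age        = c(17, NA, 21, 19, NA)
)

# Encode the categorical variable as a factor.
toy_clean <- toy_players |>
  mutate(experience = as_factor(experience))

# How many players fall in each experience level?
experience_counts <- toy_clean |> count(experience)

# How many ages are missing?
n_missing_age <- sum(is.na(toy_clean$Age))

experience_counts
n_missing_age
```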

**Collection Notes**

The data were collected by a **UBC research group in Computer Science**, led by **Professor Frank Wood**.







**Data Description — sessions.csv**

The `sessions.csv` dataset logs individual play sessions from the UBC Minecraft research server.  
Each row is **one session** linked to a player in `players.csv` via a hashed email.  
The dataset contains **1,535 observations** and **5 variables**.

**Variables**

| Variable | Type | Description |
|-----------|------|-------------|
| `hashedEmail` | string | Player identifier linking to `players.csv`. |
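| `start_time` | string | Start time of the session, stored as text. |
| `end_time` | string | End time of the session, stored as text. |
| `original_start_time` | float | Session start as a numeric timestamp in **milliseconds** (the magnitude is consistent with Unix epoch time). |
| `original_end_time` | float | Session end as a numeric timestamp in **milliseconds** (the magnitude is consistent with Unix epoch time). |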
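The millisecond columns can be turned into readable date-times and session lengths. The sketch below assumes the values are Unix epoch milliseconds in UTC, which this report does not confirm; `toy_sessions` and the derived column names (`start_utc`, `end_utc`, `duration_min`) are hypothetical.

```r
library(tidyverse)

# Hypothetical stand-in rows (NOT the real data); the second session has no end time.
toy_sessions <- tibble(
  original_start_time = c(1.7192e12, 1.7193e12),
  original_end_time   = c(1.7192e12 + 30 * 60 * 1000, NA)
)

toy_sessions <- toy_sessions |>
  mutate(
    # ASSUMPTION: timestamps are milliseconds since 1970-01-01 UTC.
    start_utc    = as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC"),
    end_utc      = as.POSIXct(original_end_time / 1000, origin = "1970-01-01", tz = "UTC"),
    # Session length in minutes; NA when the end time is missing.
    duration_min = (original_end_time - original_start_time) / 60000
  )

toy_sessions
```

Rows with a missing end time keep `NA` for `duration_min`, so they need to be dropped or handled before any duration summary.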

**One-number summaries (rounded to 2 decimals)**  
- **Mean `original_start_time`:** 1,719,200,000,000.00 ms  
- **Mean `original_end_time`:** 1,719,200,000,000.00 ms  


**Observations & issues**  
- Most sessions cluster in mid-2024; a **few much larger timestamps** make the distribution **uneven**.
- `end_time` / `original_end_time` have **2 missing values**.

**Data source**  
The data were collected by a **UBC research group in Computer Science**, led by **Professor Frank Wood**.


**Research Question**  

Can a player's total play time (`played_hours`) predict their gender in the `players.csv` dataset?






**Exploratory Data Analysis and Visualization**


```r
## (3) Exploratory Data Analysis and Visualization

# 3.1 Loading the dataset

# Load required packages
library(tidyverse)

# Read the players dataset (change the path if needed)
players <- read_csv("players.csv")

players
```


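ERROR: Error: '/home/jovyan/work/.history/players.csv' does not exist.

The read failed because the notebook's working directory was `/home/jovyan/work/.history`, which does not contain `players.csv`. The path passed to `read_csv()` must point to the folder that actually holds the file.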
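A path check before reading gives a clearer message when the file is missing. This is a sketch under stated assumptions: `data_dir` is a hypothetical location (a temporary folder here), and the tiny file written into it is made up so the example runs end to end. In the real notebook, `data_dir` would be the folder that contains `players.csv`.

```r
library(tidyverse)

# ASSUMPTION: `data_dir` is a placeholder; point it at the real project folder.
data_dir     <- tempdir()
players_path <- file.path(data_dir, "players.csv")

# Write a tiny made-up file so this sketch is runnable (NOT the real data).
write_csv(
  tibble(played_hours = c(0.5, 12.0), gender = c("Male", "Female")),
  players_path
)

# Fail early, and report where R was looking.
if (!file.exists(players_path)) {
  stop("players.csv not found at ", players_path,
       " (working directory: ", getwd(), ")")
}

players_demo <- read_csv(players_path, show_col_types = FALSE)
players_demo
```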


## Methods and Plan

**Method chosen:** *K-nearest neighbors (KNN) classification*

### Why is this method appropriate?
KNN classifies a new player by comparing them to players with similar total play time, so it is a natural choice when using one numeric predictor (*played_hours*) to predict a categorical label (*gender*).
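To make the idea concrete, the sketch below hand-codes a one-predictor KNN vote in base R. It is an illustration of the technique only, not the modelling code planned for this project: the helper `knn_predict_1d()` and the six toy players are hypothetical.

```r
# Hypothetical toy data (NOT the real data): hours played and a gender label.
train_hours  <- c(0.1, 0.5, 1.0, 20, 35, 50)
train_gender <- c("Female", "Female", "Male", "Male", "Male", "Female")

# Classify one new player by majority vote among the k closest players.
knn_predict_1d <- function(new_hours, hours, labels, k = 3) {
  distances <- abs(hours - new_hours)        # distance along the single predictor
  nearest   <- order(distances)[seq_len(k)]  # indices of the k nearest neighbours
  votes     <- table(labels[nearest])        # count labels among those neighbours
  names(votes)[which.max(votes)]             # label with the most votes
}

knn_predict_1d(0.3, train_hours, train_gender, k = 3)  # neighbours: 0.1, 0.5, 1.0
knn_predict_1d(30,  train_hours, train_gender, k = 3)  # neighbours: 35, 20, 50
```

With an odd `k` and two classes there are no tied votes. With more than two gender categories ties can occur, and `which.max()` then picks the first tied label alphabetically.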

### Assumptions
KNN assumes that players with similar play time are likely to have similar gender labels, allowing distance-based comparisons to capture meaningful patterns.

### Potential limitations
Some gender categories have very few observations, making them harder to predict, and KNN is sensitive to noise and outliers, so unusual players may introduce noticeable error.

### Model comparison and selection
I will try several values of *k* (e.g., 3, 5, 7, 9), compare them using classification accuracy, and select the value of *k* that performs best.

### Data processing and splitting
Before modeling, I will split the data into training and test sets (about 80/20) using a **stratified split** on `gender` so that the gender categories keep roughly the same proportions in both sets. Cross-validation will be performed on the training set to choose the best *k*.
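The split and the search over *k* could look roughly like the following `tidymodels` sketch. Several things here are assumptions rather than decisions recorded in this plan: the `tidymodels` and `kknn` packages, 5-fold cross-validation, the seed, standardising the predictor, and the simulated `toy_players` data with only two gender labels.

```r
library(tidymodels)

set.seed(2024)  # ASSUMPTION: any fixed seed, for reproducibility

# Simulated stand-in data (NOT the real data); replace with the real `players` tibble.
toy_players <- tibble(
  played_hours = c(rexp(60, rate = 1 / 5), rexp(40, rate = 1 / 8)),
  gender       = factor(rep(c("Male", "Female"), times = c(60, 40)))
)

# Stratified 80/20 split on the outcome.
player_split <- initial_split(toy_players, prop = 0.8, strata = gender)
player_train <- training(player_split)
player_test  <- testing(player_split)

# Standardise the predictor using training data only.
player_recipe <- recipe(gender ~ played_hours, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier with k left open for tuning.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validate each candidate k on the training set.
player_folds <- vfold_cv(player_train, v = 5, strata = gender)
k_grid       <- tibble(neighbors = c(3, 5, 7, 9))

knn_results <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = player_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# k with the highest mean cross-validated accuracy.
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

The test set is left untouched until *k* has been chosen, so the final accuracy estimate is not influenced by the tuning step.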








**GitHub Repository**

https://github.com/uiuikmn/dsci-100-project.git