# DSCI 100 Project: Planning Stage (Individual)

**Author**: Sydney Peters

**Predicting Usage of a Video Game Research Server**

A research group at UBC (PLAI) has collected data on players who signed up to use their Minecraft research server. Their goal is to recruit players who will contribute large amounts of gameplay data, and they want to understand which types of players tend to play the most. This project explores how player demographics relate to total gameplay time.

## Data Description: The Players

- Number of observations: 196 (each row represents one unique player)
- Number of variables: 7 (after cleaning, with individualId and organizationName removed)
- Observational unit: Player-level data
- Purpose: Identify which demographic characteristics correspond to higher gameplay time (and therefore more data contributed)

**Variables used in the analysis**

- `played_hours` (numeric) - Total hours each player spent on the server
- `experience` (categorical) - Self-reported experience level in Minecraft
- `gender` (categorical) - Player’s gender
- `subscribe` (Boolean) - Whether the player wants to subscribe to the game-related newsletter
- `hashedEmail` (string) - The player’s email, hashed for privacy
- `name` (string) - Name of the player
- `age` (numeric) - Age of the player

**Issues to consider**

- The identifier columns `individualId` and `organizationName` contain no values and were removed.
- `hashedEmail` is anonymized for privacy.
- `played_hours` may include idle time, which could slightly overestimate actual engagement.
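
The first issue above can be checked in code rather than by eye. The following is a minimal sketch, assuming a small hand-made table named `sample` (hypothetical, built only to mirror the shape of `players.csv`); it finds columns where every value is missing and drops them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table in the shape of players.csv (illustration only).
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# A column whose values are all missing carries no information.
empty_columns = [col for col in sample.columns if sample[col].isna().all()]

# Drop those columns, then count any missing values that remain.
cleaned = sample.drop(columns=empty_columns)
remaining_missing = int(cleaned.isna().sum().sum())

print(empty_columns)
print(remaining_missing)
```

Running the same check on the full `players` table would also show whether any other column, such as `age`, has scattered missing values that need handling before modelling.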

Source of the original dataset:
https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit

## The Question
**Question 2**: Which kinds of players are most likely to contribute a large amount of data, and how can we identify them for targeted recruitment?

For this project, we focus on predicting total `played_hours` from the `players.csv` dataset using two predictors: `experience` and `gender`.

**Response variable**:

`played_hours` — total play time and data contribution

**Explanatory variables**:

`experience` — player experience level

`gender` — player gender

The dataset allows us to examine whether `experience` or `gender` relates to total playtime. If certain experience levels or genders tend to play more, those groups may contribute more gameplay data. This helps identify the player demographics most likely to provide substantial data for the research team.

# Exploratory Data Analysis and Visualization 

In [1]:
# Import the necessary libraries
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [2]:
# Read in the players data
url = "https://raw.githubusercontent.com/sydlpeters/dsci-group-2025w1-group-101-1/refs/heads/main/data/players.csv"
players = pd.read_csv(url)
players

| | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | b6e9e593b9ec51c5e335457341c324c34a2239531e1890... | 0.0 | Bailey | Female | 17 | | |
| 192 | Veteran | False | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778... | 0.3 | Pascal | Male | 22 | | |
| 193 | Amateur | False | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29... | 0.0 | Dylan | Prefer not to say | 17 | | |
| 194 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | | |


In [3]:
# Tidy the players data
clean_players = players.drop(columns=["hashedEmail", "individualId", "organizationName", "name"])
clean_players

| | experience | subscribe | played_hours | gender | age |
|---|---|---|---|---|---|
| 0 | Pro | True | 30.3 | Male | 9 |
| 1 | Veteran | True | 3.8 | Male | 17 |
| 2 | Veteran | False | 0.0 | Male | 17 |
| 3 | Amateur | True | 0.7 | Female | 21 |
| 4 | Regular | True | 0.1 | Male | 21 |
| ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | 0.0 | Female | 17 |
| 192 | Veteran | False | 0.3 | Male | 22 |
| 193 | Amateur | False | 0.0 | Prefer not to say | 17 |
| 194 | Amateur | False | 2.3 | Male | 17 |


In [4]:
# What are the total session hours played based on player experience?
clean_players_exp = alt.Chart(clean_players).mark_bar().encode(
    x=alt.X("experience").title("Experience"),
    # Sum explicitly so each bar is the total hours for that experience level
    y=alt.Y("sum(played_hours)").title("Play Time (hrs)"),
    color=alt.Color("experience").title("Experience")
)
clean_players_exp

**Experience vs. Play Time (hrs)**:
This plot suggests that total play time differs by experience level: Amateur and Regular players account for more total play time than players at the other experience levels. Because each bar is a total, a level with more players can show a taller bar even if its individual players do not play longer.

In [6]:
# Does gender have an impact on player session time?
clean_players_gender = alt.Chart(clean_players).mark_bar().encode(
    x=alt.X("gender").title("Gender"),
    # Sum explicitly so each bar is the total hours for that gender
    y=alt.Y("sum(played_hours)").title("Play Time (hrs)"),
    color=alt.Color("gender").title("Gender")
)
clean_players_gender

**Gender vs. Play Time (hrs)**: Male, female, and non-binary players make up most of the player base, and male players account for more total play time than players of any other gender.

**Relevance to our question:** These figures show how `played_hours` varies across `experience` levels and `gender` categories. They reveal general trends and motivate using these predictors to predict gameplay time. For example, some experience groups appear to play substantially more than others, which suggests the predictors carry useful signal.
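
Bar charts of totals mix two effects: how many players are in a group and how much each one plays. A per-group summary can help separate them. The sketch below assumes a hypothetical helper `summarize_by` and a tiny invented table `toy_players` (not the real data) in the shape of `clean_players`.

```python
import pandas as pd

# Hypothetical rows in the shape of clean_players (invented values).
toy_players = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Amateur", "Pro"],
    "gender": ["Male", "Female", "Male", "Male"],
    "played_hours": [1.0, 2.0, 3.0, 5.0],
})


def summarize_by(df, column):
    """Hypothetical helper: count, total, mean, and median hours per category."""
    return (
        df.groupby(column)["played_hours"]
        .agg(
            n_players="count",
            total_hours="sum",
            mean_hours="mean",
            median_hours="median",
        )
        .reset_index()
    )


by_experience = summarize_by(toy_players, "experience")
by_gender = summarize_by(toy_players, "gender")
print(by_experience)
print(by_gender)
```

In this toy table the Amateur group has the larger total (6.0 hours across three players) while the single Pro player has the higher mean (5.0 hours), which is the kind of difference a totals-only chart can hide. Calling `summarize_by(clean_players, "experience")` would apply the same summary to the real data.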

# Methods and Plan

**Proposed Method:** k-Nearest Neighbors (KNN) Regression

**Why KNN Regression Is Appropriate**

The goal is to predict the numeric outcome `played_hours`, which makes KNN regression an appropriate choice. KNN regression works by finding the k most similar players (based on their demographic features) and averaging their playtime to predict a new player's playtime.
KNN is especially well-suited to this dataset because:

- Player behavior and engagement patterns are unlikely to follow a perfectly linear relationship. KNN can capture complex or nonlinear patterns better than linear regression.
- KNN measures similarity with distances between numeric features, so `experience` and `gender` must first be encoded numerically (for example, with one-hot encoding).
- If players with similar experience and gender tend to have similar playtime, KNN will detect and use this pattern.
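
To make the "find the k most similar players and average their playtime" idea concrete, here is a minimal sketch on invented data. It assumes one-hot encoding via `pd.get_dummies` and Euclidean distance (both choices are assumptions for illustration, not settled parts of the plan), and compares a by-hand prediction with `KNeighborsRegressor`.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training players (invented values, not the real data).
train = pd.DataFrame({
    "experience": ["Pro", "Pro", "Amateur", "Amateur", "Veteran"],
    "gender": ["Male", "Female", "Male", "Male", "Female"],
    "played_hours": [10.0, 6.0, 1.0, 3.0, 0.5],
})
new_player = pd.DataFrame({"experience": ["Amateur"], "gender": ["Male"]})

# One-hot encode both tables together so they share the same columns.
combined = pd.concat(
    [train[["experience", "gender"]], new_player], ignore_index=True
)
encoded = pd.get_dummies(combined).to_numpy(dtype=float)
X_train, x_new = encoded[:-1], encoded[-1]
y_train = train["played_hours"].to_numpy()

# By hand: Euclidean distance to each training player, then average the k closest.
k = 2
distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
nearest = np.argsort(distances)[:k]
manual_prediction = y_train[nearest].mean()

# The same calculation with scikit-learn.
knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
sklearn_prediction = knn.predict(x_new.reshape(1, -1))[0]

print(manual_prediction, sklearn_prediction)
```

Here the two nearest neighbours are the two Amateur male players, so both predictions equal their average of 2.0 hours.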

KNN classification predicts categories, not numerical values, so it cannot be used to predict `played_hours`. Linear regression assumes a linear relationship between predictors and the outcome, which may not hold with human gameplay behavior. Using linear regression might underfit the data. 

**Assumptions**:

- Features are standardized so that each one contributes comparably to the distance calculation.
- Categorical variables are encoded the same way for every observation.
- Observations are reasonably balanced across `experience` and `gender` categories.

**Potential Limitations**:

- KNN can be slow for larger datasets.
- Model performance depends on the selected k value.
  
**Model Selection**:

- Split the data into 70% training and 30% testing.
- Use 5-fold cross-validation on the training set to find the optimal number of neighbors, k.
- Train the final model using the chosen k.
- Evaluate on the test set to estimate real-world predictive performance.
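
The four steps above could be sketched as follows. This is a rough outline under stated assumptions, not the final analysis: it runs on a synthetic stand-in for `clean_players` so that it works offline, the grid of k values (odd numbers from 1 to 29) and the RMSE metric are illustrative choices, and the categorical predictors are one-hot encoded with `OneHotEncoder`. Because the one-hot columns are already on a 0/1 scale, the sketch leaves out `StandardScaler`; it would matter if a numeric predictor such as `age` were added.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for clean_players (random values, not the real data).
rng = np.random.default_rng(2025)
n = 196
synthetic = pd.DataFrame({
    "experience": rng.choice(["Amateur", "Regular", "Veteran", "Pro"], size=n),
    "gender": rng.choice(["Male", "Female", "Prefer not to say"], size=n),
    "played_hours": rng.exponential(scale=5.0, size=n).round(1),
})
X = synthetic[["experience", "gender"]]
y = synthetic["played_hours"]

# Step 1: 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2025
)

# Encode the categorical predictors inside the pipeline so that each
# cross-validation fold is encoded using only its own training portion.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["experience", "gender"]),
)
pipeline = make_pipeline(preprocessor, KNeighborsRegressor())

# Step 2: 5-fold cross-validation over candidate k values (illustrative grid).
param_grid = {"kneighborsregressor__n_neighbors": list(range(1, 31, 2))}
grid = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)

# Step 3: fitting refits the pipeline on the full training set with the best k.
grid.fit(X_train, y_train)
best_k = grid.best_params_["kneighborsregressor__n_neighbors"]
cv_rmse = -grid.best_score_

# Step 4: evaluate once on the held-out test set.
test_predictions = grid.predict(X_test)
test_rmse = float(np.sqrt(np.mean((y_test - test_predictions) ** 2)))

print(best_k, cv_rmse, test_rmse)
```

Swapping `synthetic` for `clean_players` would run the same procedure on the real data; the resulting k and error values will differ from anything this synthetic example prints.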

**GitHub**: https://github.com/sydlpeters/dsci-group-2025w1-group-101-1.git