# DSCI 100 – Project Planning Stage  
## Saba Alipour – Group 7  

### Broad Question 2: Recruiting Players Who Are Most Likely to Contribute a Large Amount of Data

In [None]:
library(tidyverse)
library(janitor)
library(knitr)

options(readr.show_col_types = FALSE)

players <- read_csv("data/players.csv") |> clean_names()
sessions <- read_csv("data/sessions.csv") |> clean_names()

dim(players)
dim(sessions)

head(players)
head(sessions)

# 1. Data Description

This project uses two datasets collected from a UBC research Minecraft server. The **players.csv** dataset contains one row per player with demographic and experience information. The **sessions.csv** dataset contains one row per play session, including start and end times for each session. The datasets are linked by a hashed email identifier. Together, these data combine survey responses with automatically logged behaviour from the server.

In [None]:
players_vars <- tibble(variable = names(players),
                       type = sapply(players, class),
                       description = c("Self-reported Minecraft experience level.",
                                       "Whether the player subscribed to the project newsletter.",
                                       "Anonymized player identifier.",
                                       "Total number of hours played on the server.",
                                       "Player's chosen in-game name.",
                                       "Self-reported gender.",
                                       "Self-reported age."))

kable(players_vars, caption = "Variables in players.csv")

In [None]:
sessions_vars <- tibble(variable = names(sessions),
                        type = sapply(sessions, class),
                        description = c("Anonymized player identifier linked to players.csv.",
                                        "Start time of the session (character timestamp).",
                                        "End time of the session (character timestamp).",
                                        "Original large numeric start timestamp.",
                                        "Original large numeric end timestamp."))

kable(sessions_vars, caption = "Variables in sessions.csv")

In [None]:
players_num_summary <- players |>
  summarise(across(where(is.numeric),
                   list(mean = ~ round(mean(.x, na.rm = TRUE), 2),
                        sd   = ~ round(sd(.x, na.rm = TRUE), 2),
                        min  = ~ round(min(.x, na.rm = TRUE), 2),
                        max  = ~ round(max(.x, na.rm = TRUE), 2))))

kable(players_num_summary, caption = "Summary statistics for numeric variables in players.csv")

The `players` dataset has 196 rows and 7 variables, both qualitative and quantitative. The qualitative variables include experience level, gender, and newsletter subscription; the quantitative variables are age and total hours played. The `sessions` dataset has 1535 rows and 5 variables, and each player may have multiple sessions, linked by the hashed email identifier. Several features of the data affect the analysis. Some individuals who completed the sign-up survey have no recorded sessions. A small group of players have extremely high total play hours, which makes the distribution right-skewed. Some demographic variables contain missing values. Session timestamps are stored in both character and large numeric formats. Because of these features, the data must be cleaned, aggregated, and joined before analysis.
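As a minimal sketch of the timestamp cleaning mentioned above, the character timestamps could be parsed and turned into session durations roughly as follows. The toy rows and the day/month/year hour:minute format are assumptions for illustration, not values taken from `sessions.csv`; the real format should be checked with `head(sessions)` before choosing a parser.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for sessions.csv (hypothetical rows, assumed "dd/mm/yyyy HH:MM" format)
toy_sessions <- tibble(
  hashed_email = c("a", "a", "b"),
  start_time   = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time     = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

toy_durations <- toy_sessions |>
  mutate(start = dmy_hm(start_time),   # parse character timestamps into date-times
         end = dmy_hm(end_time),
         duration_min = as.numeric(difftime(end, start, units = "mins")))

toy_durations
```

Rows where either timestamp fails to parse would come back as `NA`, so it is worth counting missing durations before aggregating.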

# 2. Questions

**Broad Question:**  
Which kinds of players are most likely to contribute a large amount of data to the Minecraft research server?

**Specific Question:**  
Can early player characteristics (experience level, age, gender, and newsletter subscription) predict whether a player will become a **high data contributor**, defined as being in the top 25% of players based on total play time aggregated from all their sessions?

To answer this question, I will join the player-level data (`players.csv`) with aggregated session-level summaries from `sessions.csv` using the hashed email identifier. This will create one row per player with the response variable (high vs. not-high contributor) and explanatory variables describing demographics and experience. This dataset will then be ready for a classification model in the final project stage.
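The join and the response variable described above could be sketched as follows. The toy tables, the `duration_min` column, and the choice to count players with no sessions as zero minutes are all assumptions made for illustration; whether zero-session players belong in the percentile calculation is a design decision still to be made.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical values)
toy_players <- tibble(hashed_email = c("p1", "p2", "p3", "p4"),
                      age = c(17, 21, 19, 25))
toy_sessions <- tibble(hashed_email = c("p1", "p1", "p2", "p4"),
                       duration_min = c(90, 30, 30, 10))

# One row per player: total minutes across all of that player's sessions
session_totals <- toy_sessions |>
  group_by(hashed_email) |>
  summarise(total_min = sum(duration_min), .groups = "drop")

toy_joined <- toy_players |>
  left_join(session_totals, by = "hashed_email") |>
  mutate(total_min = coalesce(total_min, 0),   # assumption: no sessions means 0 minutes
         high_contributor = factor(
           if_else(total_min >= quantile(total_min, 0.75), "high", "not_high"),
           levels = c("high", "not_high")))

toy_joined
```

In this toy table only `p1` is at or above the 75th percentile of total minutes, so it is the only player labelled `high`.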

# 3. Exploratory Data Analysis and Visualization

In this section, I compute simple summary statistics, perform minimal wrangling to prepare the data for exploration, and create visualizations to better understand patterns related to player behaviour and potential predictors of high data contribution.

In [None]:
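# Count the number of sessions recorded for each player
sessions_player <- sessions |>
  group_by(hashed_email) |>
  summarise(n_sessions = n(),
            .groups = "drop")

# Keep every player; those with no sessions get NA for n_sessions
players_eda <- players |>
  left_join(sessions_player, by = "hashed_email")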

In [None]:
players_eda |>
  ggplot(aes(x = played_hours)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Total Play Hours",
       x = "Total Play Hours",
       y = "Number of Players")

In [None]:
players_eda |>
  ggplot(aes(x = experience, y = played_hours)) +
  geom_boxplot(fill = "lightgreen") +
  # Zoom the y-axis to the 95th percentile so extreme outliers do not flatten the boxes
  coord_cartesian(ylim = c(0, quantile(players_eda$played_hours, 0.95, na.rm = TRUE))) +
  labs(title = "Total Play Hours by Experience Level",
       x = "Experience Level",
       y = "Total Play Hours")

In [None]:
players_eda |>
  ggplot(aes(x = n_sessions, y = played_hours)) +
  geom_point(alpha = 0.6, color = "purple") +
  labs(title = "Number of Sessions vs Total Play Hours",
       x = "Number of Sessions",
       y = "Total Play Hours")

Players with no recorded sessions have a missing `n_sessions` value after the left join, so ggplot2 automatically drops them from this plot.

# 4. Methods and Planning

I will treat this problem as a classification task: predicting whether a player falls in the top 25% of total play time (a high data contributor) or contributes less to the dataset. I plan to use a K-nearest neighbours (KNN) classifier with predictors such as gender, age, experience level, newsletter subscription, and some early session data.

KNN seems appropriate because it does not assume any specific functional form for the relationship between the predictors and the response. Instead, it measures how “close” players are to each other based on their attributes and predicts that nearby players will behave in a similar way. It does have some weaknesses: irrelevant or highly similar predictors can distort the distance calculation, numeric variables must be scaled so that no single variable dominates, and it may struggle if the high-contributor group is much smaller than the rest.
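The scaling concern can be illustrated with a small sketch in base R. The three players and the `early_minutes` variable are hypothetical, chosen only to show how a variable measured on a larger scale dominates Euclidean distance until the columns are standardized.

```r
# Hypothetical players: age in years, early play time in minutes
toy <- matrix(c(20, 300,
                21,  10,
                45, 290),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("age", "early_minutes")))

# Euclidean distances on the raw values: minutes dominate because of their scale
raw_dist <- as.matrix(dist(toy))

# Standardize each column (mean 0, sd 1), then recompute the distances
scaled_dist <- as.matrix(dist(scale(toy)))

round(raw_dist, 2)
round(scaled_dist, 2)
```

On the raw values, A looks far closer to C than to B even though A and B are almost the same age, because the minutes column spans hundreds of units. After standardizing, the age gap and the minutes gap contribute on comparable terms.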

To prepare the dataset, I will sum the session durations for each player to calculate total play time, then create a variable marking which players fall into the top 25%. Next, I will split the data into a training set (75%) and a test set (25%). All preprocessing (scaling numeric predictors and encoding categorical ones) will be estimated from the training set only, and the value of k will be chosen by cross-validation on the training set. Once the best model is chosen, I will evaluate it once on the held-out 25% to see how well it performs.
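The plan above could look roughly like the following tidymodels sketch. It runs on simulated data, not the real player table: the column values, the five folds, the grid of k values, and the use of accuracy as the tuning metric are all assumptions for illustration, and the `kknn` package must be installed for the engine to work.

```r
library(tidymodels)

set.seed(2025)

# Simulated stand-in for the joined player table (NOT the real data)
n <- 200
toy_players <- tibble(
  age = sample(10:50, n, replace = TRUE),
  experience = factor(sample(c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                             n, replace = TRUE)),
  subscribe = factor(sample(c("yes", "no"), n, replace = TRUE)),
  high_contributor = factor(sample(c("high", "not_high"), n, replace = TRUE,
                                   prob = c(0.25, 0.75)))
)

# 75/25 split, stratified so both sets keep a similar share of high contributors
player_split <- initial_split(toy_players, prop = 0.75, strata = high_contributor)
player_train <- training(player_split)
player_test <- testing(player_split)

# Preprocessing is defined on the training set only
knn_recipe <- recipe(high_contributor ~ ., data = player_train) |>
  step_normalize(all_numeric_predictors()) |>   # centre and scale numeric predictors
  step_dummy(all_nominal_predictors())          # encode categorical predictors

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# Choose k by 5-fold cross-validation on the training set
folds <- vfold_cv(player_train, v = 5, strata = high_contributor)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- knn_workflow |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k, then evaluate once on the test set
final_fit <- knn_workflow |>
  finalize_workflow(tibble(neighbors = best_k)) |>
  last_fit(player_split)

collect_metrics(final_fit)
```

Because the simulated predictors are unrelated to the label, the test accuracy here says nothing about the real data. With roughly 25% of players in the high group, accuracy alone can also look good for a model that rarely predicts `high`, so precision and recall for that class are worth reporting alongside it.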

# 5. GitHub Repository

My project repository can be found at:

**https://github.com/saba02716-ops/dsci-100-2025w1-group-7-saba**

This repository contains my planning report notebook and the work completed for this stage of the project.
