1. Data Description

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

In [None]:
# Join on the shared ID column explicitly (one row per session)
combined_data <- inner_join(sessions, players, by = "hashedEmail")
combined_data

In [None]:
# Mean of played_hours over the combined data (one row per session)
average_hr_played <- combined_data |>
  summarize(average_hr_played = round(mean(played_hours, na.rm = TRUE), 2))
average_hr_played

- 1535 observations in total.
- 11 variables in total: 7 from players.csv and 5 from sessions.csv, with hashedEmail appearing in both.
- players.csv variables:
    - experience: categorical; self-reported experience level
    - subscribe: logical; whether subscribed to newsletter
    - hashedEmail: character ID; used for merging
    - played_hours: numeric; self-reported past hours played
    - name: character; player name, not used for modeling
    - gender: categorical; gender identity
    - Age: numeric; age of player
- sessions.csv variables:
    - hashedEmail: character ID linking sessions to players
    - start_time: datetime string; session start
    - end_time: datetime string; session end
    - original_start_time: numeric; rounded Unix timestamp
    - original_end_time: numeric; rounded Unix timestamp
- created variables:
    - duration: numeric; minutes per session
    - total_session_minutes: numeric; total minutes per player
    - mean_session_minutes: numeric; average session length
    - number_of_sessions: integer; count of sessions per player
- Issues observed in the data:
    - original_start_time and original_end_time are rounded too coarsely, causing inaccurate durations
    - Some session durations are extremely long (possible AFK / idle time)
    - Certain experience levels have very small counts (imbalanced categories)
- Potential hidden issues (not directly observable):
    - Sampling bias (players who join a study server ≠ general population)
    - Possible time zone discrepancies in timestamps
    - Missing or incomplete session logs for some players
- The data were collected from volunteers through the plaicraft.ai program, launched by the Pacific Laboratory for Artificial Intelligence.
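
The created variables listed above (duration, total_session_minutes, mean_session_minutes, number_of_sessions) could be derived from the session timestamps roughly as follows. This is a minimal sketch on a made-up toy table, not the real data: the IDs and times in `toy_sessions` are hypothetical, and parsing with `tz = "UTC"` is an assumption, since the time zone of the recorded timestamps is not known.

```r
library(dplyr)
library(tibble)

# Hypothetical toy sessions; IDs and times are made up for illustration.
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("01/05/2024 10:00", "02/05/2024 18:30", "03/05/2024 09:15"),
  end_time    = c("01/05/2024 10:45", "02/05/2024 19:00", "03/05/2024 11:15")
)

player_summary <- toy_sessions |>
  mutate(
    # Assumption: timestamps are treated as UTC so results do not depend on the machine
    start = as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    end   = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    # Session length in minutes
    duration = as.numeric(difftime(end, start, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarize(
    total_session_minutes = sum(duration, na.rm = TRUE),
    mean_session_minutes  = mean(duration, na.rm = TRUE),
    number_of_sessions    = n()
  )
player_summary
```

Working from the datetime strings rather than the rounded numeric columns avoids the rounding issue noted in the list above, at the cost of minute-level resolution.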
  


2. Questions
Broad question:
Which kinds of players contribute the most gameplay data?

Specific question:
Can player characteristics (experience level, age, gender, total played hours) predict whether a player is a high or low data contributor based on each player’s total time spent in all recorded sessions?

To define the response variable, I will summarize the sessions data for each player:
- First, compute the duration of every session in sessions.csv: duration = end_time − start_time (in minutes).
- Then, sum the durations for each player to get total_session_minutes.
- Finally, classify players using the median of total_session_minutes as the threshold:
    - HighData = 1 if the player's total is above the median
    - LowData = 0 otherwise

This creates a roughly balanced binary classification target (exactly balanced only if no totals tie at the median), which is appropriate for KNN.
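
The median split described above can be sketched as follows. The per-player totals in `toy_totals` are hypothetical, and the label column `contributor` (a factor with levels "Low" and "High" standing in for 0 and 1) is a naming assumption, chosen because tidymodels classifiers expect a factor outcome.

```r
library(dplyr)
library(tibble)

# Hypothetical per-player totals in minutes; values are made up.
toy_totals <- tibble(
  hashedEmail = c("a1", "b2", "c3", "d4", "e5"),
  total_session_minutes = c(75, 120, 10, 300, 120)
)

# Threshold: median of the per-player totals
threshold <- median(toy_totals$total_session_minutes)

labelled <- toy_totals |>
  mutate(contributor = factor(
    if_else(total_session_minutes > threshold, "High", "Low"),
    levels = c("Low", "High")
  ))

# Class counts show how balanced the resulting target is
count(labelled, contributor)
```

In this toy table two players tie at the median, so the strict "above median" rule places only one player in the High class. It is worth checking the class counts on the real data before assuming an even split.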

3. Exploratory Data Analysis and Visualization

In [None]:
# combined_data has one row per session, so these counts are sessions per gender
average_gender <- combined_data |>
  group_by(gender) |>
  summarize(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2))
average_gender

In [None]:
# Likewise, counts here are sessions per experience level, largest first
average_experience <- combined_data |>
  group_by(experience) |>
  summarize(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2)) |>
  arrange(desc(count))
average_experience

In [None]:
average_across <- combined_data |>
  summarize(across(where(is.numeric), ~round(mean(.x, na.rm = TRUE), 2)))
average_across 

In [None]:
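# Rebuild the timestamps (in seconds) from the datetime strings, because the
# provided numeric columns are rounded too coarsely for accurate durations.
# No tz is given, so as.POSIXct() interprets the strings in the local time zone.
sessions <- sessions |>
  mutate(original_end_time = as.numeric(as.POSIXct(end_time, format = "%d/%m/%Y %H:%M")),
         original_start_time = as.numeric(as.POSIXct(start_time, format = "%d/%m/%Y %H:%M")))
sessions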

In [None]:
# Timestamps are in seconds, so divide by 60 to get duration in minutes
session_duration <- sessions |>
    mutate(duration = (original_end_time - original_start_time) / 60) |>
    select(duration)
# Five-number summary plus mean; the 0.75 quantile is the third quartile (Q3)
sessions_summary <- session_duration |>
    summarize(min_duration = min(duration, na.rm = TRUE),
              Q1_duration = quantile(duration, 0.25, na.rm = TRUE),
              median_duration = median(duration, na.rm = TRUE),
              Q3_duration = quantile(duration, 0.75, na.rm = TRUE),
              max_duration = max(duration, na.rm = TRUE),
              mean_duration = mean(duration, na.rm = TRUE))
sessions_summary

|Variable|Mean|
|--------|----|
|played_hours (hours)|98.57|
|Age (years)|19.43|

In [None]:
# Use players (one row per player) so bar heights are player counts;
# combined_data would count each player once per session.
age_distribution <- players |>
  ggplot(aes(x = Age)) +
  geom_histogram(bins = 30) +
  xlab("Age (years)") +
  ylab("Number of Players") +
  ggtitle("Age Distribution of Players") +
  theme(text = element_text(size = 20))
age_distribution

In [None]:
# Use players (one row per player) so bar heights are player counts;
# combined_data would count each player once per session.
played_hours_distribution <- players |>
  ggplot(aes(x = played_hours)) +
  geom_histogram(bins = 30) +
  xlab("Played hours") +
  ylab("Number of Players") +
  ggtitle("Distribution of Played Hours") +
  theme(text = element_text(size = 20))
played_hours_distribution

4. Methods and Plans
To address my research question, whether player characteristics can predict whether a player is a high-data or low-data contributor, I propose using a K-nearest neighbors (KNN) classification model.

To evaluate and select the final model, I will first split the dataset into training and testing sets using an 80/20 split, before any preprocessing or model fitting. The preprocessing steps (dummy encoding of the categorical predictors and standardization of the numeric ones) will be estimated on the training set only and then applied unchanged to the test set, so that no information from the test set leaks into the model. Within the training set, I will compare different values of k, using a validation set or cross-validation, and choose the one that minimizes classification error. The test set will be used only once, at the end, to estimate how well the selected model generalizes.
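
One way this plan could look in tidymodels is sketched below. It is an illustration on synthetic data, not the final analysis: the table `player_data`, the outcome name `contributor`, the seed, the 5 folds, and the grid of odd k values from 1 to 15 are all assumptions, and the `kknn` package is assumed to be installed as the KNN engine.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Synthetic stand-in for the per-player table; all values are made up.
player_data <- tibble(
  experience   = factor(sample(c("Amateur", "Pro", "Veteran"), 120, replace = TRUE)),
  gender       = factor(sample(c("Female", "Male"), 120, replace = TRUE)),
  Age          = sample(15:40, 120, replace = TRUE),
  played_hours = round(rexp(120, rate = 0.1), 1),
  contributor  = factor(sample(c("High", "Low"), 120, replace = TRUE))
)

# Split first, so the test set plays no part in preprocessing or tuning.
data_split <- initial_split(player_data, prop = 0.8, strata = contributor)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Preprocessing is estimated from training data when the workflow is fitted.
knn_recipe <- recipe(contributor ~ ., data = train_data) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# Cross-validation within the training set to compare values of k
folds  <- vfold_cv(train_data, v = 5, strata = contributor)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- tune_grid(knn_workflow, resamples = folds, grid = k_grid,
                         metrics = metric_set(accuracy))
best_k <- select_best(knn_results, metric = "accuracy")

# Refit on the full training set with the chosen k, then score the test set once
final_fit <- knn_workflow |>
  finalize_workflow(best_k) |>
  fit(data = train_data)

test_accuracy <- predict(final_fit, test_data) |>
  bind_cols(test_data) |>
  accuracy(truth = contributor, estimate = .pred_class)
test_accuracy
```

Because the labels here are random, the test accuracy should hover around chance; the point of the sketch is the order of operations (split, then preprocess and tune on training data, then evaluate once), not the score.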

Overall, KNN is an appropriate and flexible choice for this question, provided that careful preprocessing and model selection procedures are followed.