### Project Report
# Predicting Usage of a Video Game Research Server

By Victoria Zhou

## Introduction

Video games are a major part of modern entertainment and social interaction. As games grow more complex, understanding player behaviour helps improve design and infrastructure. Using data collected by the UBC Computer Science research group, I analyzed whether certain player traits can predict newsletter subscription.

My guiding question is: Can age, gender, experience, and played hours predict whether a player subscribes to the game-related newsletter?

I used two datasets: one with individual player information and another with records of their play sessions.

### Players Dataset Summary

| Variable Name | Variable Type | Description |
| --- | ----------- | -----------|
| experience | factor | Level of gameplay experience (Beginner, Amateur, Regular, Pro, Veteran).|
| subscribe | logical | Whether the player subscribed to the newsletter (TRUE or FALSE).|
| hashedEmail | character | Hashed email identifier (used to anonymize individual players).|
| played_hours | numeric | Total number of hours the player has spent on the server.|
| name | character | Player’s in-game display name (not used in analysis).|
| gender | factor | Player’s self-reported gender (Male, Female, Non-binary, Two-Spirited, Agender, Prefer not to say, Other).|
| Age | integer | Player’s age in years.|

Summary Statistics & Key Insights
- Number of observations: 196 players
- Number of variables: 7
- Source: Minecraft research server
- Collection Method: Player info collected at account registration, and gameplay time was recorded during server use.
- Subscription: 144 players subscribed (73%)
- Age: Mean = 20.5, Median = 19, Range = 8–50 
- Playtime: Mean = 5.85 hours
- Most common experience level: Amateur (63 players)
- Most common gender: Male (124 players)

Observations & Issues
- Missing values: Only in Age (2 cases)
- Limited behaviour tracking: No session or in-game activity metrics
- Sampling bias: Likely overrepresents younger users (median age = 19)
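
One way to confirm where the missing values sit is to count `NA` entries per column. The sketch below runs on a small invented data frame (`toy_players` is a hypothetical stand-in with a few of the same column names, not the real data):

```r
# Toy stand-in for the players data; all values are invented for illustration.
toy_players <- data.frame(
  experience   = c("Pro", "Veteran", "Amateur", "Regular"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.1),
  Age          = c(9, 17, NA, NA)
)

# Count missing values in every column.
na_counts <- colSums(is.na(toy_players))
na_counts

# Number of rows with no missing values at all.
complete_rows <- sum(complete.cases(toy_players))
complete_rows
```

Running the same two calls on the real `players` data frame would show whether `Age` is the only column with gaps.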

### Sessions Dataset Summary

| Variable Name | Variable Type | Description |
| --- | ----------- | -----------|
| hashedEmail | character | Player identifier to link with user-level data.|
| start_time | Date | Date the session started (DD/MM/YYYY).|
| end_time | Date | Date the session ended (DD/MM/YYYY).|
| original_start_time | numeric | Unix timestamp of start time (milliseconds).|
| original_end_time | numeric | Unix timestamp of end time (milliseconds).|

Summary Statistics & Key Insights
- Number of observations: 1,535 session records
- Number of variables: 5
- Source: Minecraft research server 
- Collection Method: Collected automatically by the server each time a player logged in and out
- Most active player: Logged 310 sessions
- Session date range: From approximately March 2024 to August 2024
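
A figure such as the most active player's session count can be obtained by counting rows per `hashedEmail`. This is a minimal sketch on invented identifiers (`toy_sessions` and the `id_*` values are hypothetical), not necessarily how the number above was produced:

```r
library(dplyr)

# Toy stand-in for the sessions data; the IDs are invented placeholders.
toy_sessions <- tibble(
  hashedEmail = c("id_a", "id_b", "id_a", "id_c", "id_a", "id_b")
)

# One row per player with the number of sessions, most active first.
sessions_per_player <- toy_sessions |>
  count(hashedEmail, name = "n_sessions", sort = TRUE)

sessions_per_player
```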

Observations & Issues
- Missing data: Two sessions are missing end_time.
- Session duration: Not provided directly, must be calculated from start and end timestamps.
- Time of day: Available in string form but not separated; it must be parsed to analyze peak usage hours.
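
The two derived quantities mentioned above, session duration and hour of day, could be computed roughly as follows. The sketch assumes the raw strings look like `"DD/MM/YYYY HH:MM"`, which is a guess about the file layout; the rows in `toy_sessions` are invented.

```r
library(dplyr)
library(lubridate)

# Toy sessions; the "DD/MM/YYYY HH:MM" layout is an assumption about the raw strings.
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "18/06/2024 00:03")
)

toy_sessions <- toy_sessions |>
  mutate(
    start = dmy_hm(start_time),  # parse day/month/year hour:minute (UTC by default)
    end   = dmy_hm(end_time),
    duration_mins = as.numeric(difftime(end, start, units = "mins")),
    start_hour    = hour(start)  # hour of day, 0 to 23
  )

toy_sessions
```

Sessions with a missing `end_time` would produce `NA` durations under this approach and would need to be dropped or handled separately.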

## Methods & Results

### Preliminary Exploratory Data Analysis

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


### 1. Read and Tidy Data
For this project, I decided to use the variables age, experience, gender, played hours, and subscribe from the Minecraft player dataset.

To tidy the data, I used the mutate and as_factor functions to convert the character and logical columns to factors so that I could use them as categorical variables in the analysis. Then, I used the select function to drop the identifier columns (hashedEmail and name), leaving a data frame with only the columns I wanted to analyze.

In [18]:
# The Minecraft player dataset provided by the UBC research group was downloaded
# and saved in the local working directory for easy access.
# read_csv then imports the raw file into R, showing it is readable locally.
players <- read_csv("players.csv")
head(players)


Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
| --- | --- | --- | --- | --- | --- | --- |
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


In [19]:
# Convert the categorical columns to factors and drop the identifier columns
players <- players |>
  mutate(experience = as_factor(experience)) |>
  mutate(subscribe = as_factor(subscribe)) |>
  mutate(gender = as_factor(gender)) |>
  select(-hashedEmail, -name)
players

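| experience | subscribe | played_hours | gender | Age |
| --- | --- | --- | --- | --- |
| `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` |
| Pro | TRUE | 30.3 | Male | 9 |
| Veteran | TRUE | 3.8 | Male | 17 |
| Veteran | FALSE | 0.0 | Male | 17 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Amateur | FALSE | 0.0 | Prefer not to say | 17 |
| Amateur | FALSE | 2.3 | Male | 17 |
| Pro | TRUE | 0.2 | Other | NA |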


### 2. Summarize Data
First, the set.seed function is used to ensure that results are reproducible.

Then, the data is split into 75% for training and 25% for testing using the initial_split function, stratified by subscribe so that both sets keep a similar share of subscribers.

To summarize the training data, I calculated the number and percentage of players who subscribed to the game-related newsletter using the group_by and summarize functions. I also used summarize, together with a small custom mode function, to compare the mean of each numeric predictor and the most common level of each categorical predictor between subscribers and non-subscribers.


In [20]:
set.seed(123) # ensures replicability
players_split <- initial_split(players, prop = 0.75, strata = subscribe) 
players_training <- training(players_split)
players_testing <- testing(players_split)

In [21]:
glimpse(players_training)
glimpse(players_testing)

Rows: 147
Columns: 5
$ experience   <fct> Amateur, Amateur, Veteran, Amateur, Veteran, Beginner, Re…
$ subscribe    <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ played_hours <dbl> 0.0, 0.1, 0.0, 0.0, 1.4, 0.0, 0.0, 0.9, 0.0, 0.1, 0.2, 0.…
$ gender       <fct> Male, Female, Male, Prefer not to say, Prefer not to say,…
$ Age          <dbl> 22, 17, 23, 33, 25, 24, 23, 18, 42, 22, 37, 28, 23, 17, 1…
Rows: 49
Columns: 5
$ experience   <fct> Veteran, Veteran, Amateur, Amateur, Pro, Amateur, Regular…
$ subscribe    <fct> TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, …
$ played_hours <dbl> 3.8, 0.0, 0.7, 0.0, 0.0, 48.4, 0.3, 0.1, 0.6, 0.4, 5.6, 2…
$ gender       <fct> Male, Male, Female, Male, Male, Female, Male, Male, Male,…
$ Age          <dbl> 17, 17, 21, 21, 17, 17, 8, 1

In [22]:
cat("Table 1: Number and Percentage of Players that Subscribe to the Newsletter\n")
players_proportions <- players_training |>
                          group_by(subscribe) |>
                          summarize(count = n()) |>
                          mutate(percent = 100*count/nrow(players_training))
players_proportions

Table 1: Number and Percentage of Players that Subscribe to the Newsletter


| subscribe | count | percent |
| --- | --- | --- |
| `<fct>` | `<int>` | `<dbl>` |
| False | 39 | 26.53061 |
| True | 108 | 73.46939 |
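
The training-set balance in Table 1 is close to the overall 73% subscription rate because the split was stratified on `subscribe`. A minimal sketch of that effect, using a synthetic outcome column with the same class counts as the full dataset (144 subscribers out of 196 players) rather than the real data; `toy`, `toy_split`, `toy_train`, and `toy_test` are illustrative names:

```r
library(rsample)

# Synthetic outcome with the same class counts as the full players data
# (144 subscribers, 52 non-subscribers); no real player records are used.
toy <- data.frame(
  subscribe = factor(c(rep("TRUE", 144), rep("FALSE", 52)))
)

set.seed(123)
toy_split <- initial_split(toy, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of subscribers in each piece; stratification keeps these close.
prop_train <- mean(toy_train$subscribe == "TRUE")
prop_test  <- mean(toy_test$subscribe == "TRUE")
c(train = prop_train, test = prop_test)
```

Without `strata`, a small dataset like this one could by chance end up with noticeably different subscriber shares in the two sets.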


In [23]:
# Define a simple custom mode function
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

cat("Table 2: Average Predictor Values for Subscribers and Non-subscribers\n")
comparison <- players_training |>
                group_by(subscribe) |>
                summarize(avg_age = mean(Age, na.rm=TRUE),
                          avg_played_hours = mean(played_hours, na.rm=TRUE),
                          mode_experience = get_mode(experience),
                          mode_gender = get_mode(gender))
comparison

Table 2: Average Predictor Values for Subscribers and Non-subscribers

| subscribe | avg_age | avg_played_hours | mode_experience | mode_gender |
| --- | --- | --- | --- | --- |
| `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` |
| False | 23.87179 | 0.4641026 | Amateur | Male |
| True | 19.93396 | 9.725 | Amateur | Male |
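
The `get_mode` helper used for Table 2 can be checked on small vectors. The example below repeats the function definition so it runs on its own; the input vectors are invented:

```r
# Same helper as in the report: returns the most frequent value of x.
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

# "Amateur" appears twice, every other value once.
mode_clear <- get_mode(c("Amateur", "Pro", "Amateur", "Veteran"))
mode_clear  # "Amateur"

# With a tie, which.max picks the first maximum, so the value that
# appears first in x wins.
mode_tie <- get_mode(c("Pro", "Amateur", "Amateur", "Pro"))
mode_tie    # "Pro"
```

The tie behaviour matters when two experience levels or genders are equally common within a group: the reported mode then depends on row order.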


### 3. Exploratory Data Visualization
I created separate visualizations for each predictor to explore its relationship with newsletter subscriptions. For age and played_hours, I used geom_boxplot to compare the distribution and central tendency of these continuous variables between subscribers and non-subscribers. Boxplots make it easy to identify differences in medians, variability, and the presence of outliers. For experience level and gender, I used geom_bar with position = "fill" to show the proportion of players in each category who subscribed. This approach is well-suited for visualizing how categorical variables relate to a binary outcome.
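
The two plot types described above can be sketched as follows. This is an illustration on a small invented data frame (`toy_players` and its values are hypothetical), not the report's actual plotting code; axis labels are also my own choices.

```r
library(ggplot2)

# Invented toy data; values are for illustration only.
toy_players <- data.frame(
  subscribe  = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE")),
  Age        = c(17, 21, 19, 25, 33, 9),
  experience = factor(c("Pro", "Amateur", "Amateur", "Veteran", "Amateur", "Pro"))
)

# Boxplot: distribution of a continuous predictor within each outcome group.
age_boxplot <- ggplot(toy_players, aes(x = subscribe, y = Age)) +
  geom_boxplot() +
  labs(x = "Subscribed to newsletter", y = "Age (years)")

# Filled bar chart: each bar is rescaled to 1, so the fill colours show
# the proportion of subscribers within each experience level.
experience_bars <- ggplot(toy_players, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(x = "Experience level", y = "Proportion of players", fill = "Subscribed")

age_boxplot
experience_bars
```

Because `position = "fill"` hides the group sizes, a level with very few players can look as informative as a large one, so it helps to read these bars alongside the raw counts.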