# Predicting Usage of a Video Game Research Server

## Project Final Report

##### Group 13 - Section 005

## Introduction

The Pacific Laboratory for Artificial Intelligence (PLAI), a research group in the Department of Computer Science at UBC, works on making AI safer and more reliable so that it can be trusted. One of their projects is a research data collection project that focuses on generative AI (The Pacific Laboratory for Artificial Intelligence, 2023). PLAI has created an online server called "plaicraft.ai", a free version of Minecraft that anyone can sign up for and play. They collect data about the players and how they interact with the server, with the goal of using the data to create AI characters that respond to aspects of the video game in a way that's "smarter" than current non-player characters (Smith, 2023).

Using this collected data, the main goal of our project is to answer the broad question "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?" through the more specific question "Can age, gender, experience, and average play time per session predict whether a player subscribes?"

#### Description of Datasets

The first dataset we used to answer our question is the `players.csv` dataset.

This dataset describes 196 players on the Minecraft server, with the following variables:
- `experience` - (Amateur, Beginner, Regular, Pro, Veteran)
- `subscribe` - Subscription to the newsletter (TRUE/FALSE)
- `hashedEmail` - String of letters and numbers to identify player
- `played_hours`
- `name`
- `gender` - (Agender, Female, Male, Non-Binary, Two-Spirited, Other, Prefer not to say)
- `Age`


The second dataset we used to answer our question is the `sessions.csv` dataset.

This dataset describes 1535 play sessions, with the following variables:
- `hashedEmail`
- `start_time` - Including date and time
- `end_time` - Including date and time
- `original_start_time` 
- `original_end_time`

From the `players.csv` dataset, we used `subscribe` as our response variable and `Age`, `gender`, and `experience` as three of our predictor variables. We used the `hashedEmail` variable, which appears in both `players.csv` and `sessions.csv`, to combine the two datasets. From the `sessions.csv` dataset, we used `start_time` and `end_time` to find the average play time per session for each player.

## Methods and Results

In [1]:
# Loading all necessary libraries for data analysis
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

The first step in our data analysis is to load the `players.csv` and `sessions.csv` datasets.

In [5]:
players_url <- "https://raw.githubusercontent.com/sarahmontgomery04/project-data/refs/heads/main/players.csv"
players <- read_csv(players_url)
players

sessions_url <- "https://raw.githubusercontent.com/sarahmontgomery04/project-data/refs/heads/main/sessions.csv"
sessions <- read_csv(sessions_url)
sessions

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,17
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,


Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1.71977e+12,1.71977e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1.71867e+12,1.71867e+12
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1.72193e+12,1.72193e+12
⋮,⋮,⋮,⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,28/07/2024 15:36,28/07/2024 15:57,1.72218e+12,1.72218e+12
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,25/07/2024 06:15,25/07/2024 06:22,1.72189e+12,1.72189e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,20/05/2024 02:26,20/05/2024 02:45,1.71617e+12,1.71617e+12
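
The `original_start_time` and `original_end_time` values have the magnitude of Unix timestamps in milliseconds. Assuming that interpretation (the dataset does not state the unit), the sketch below converts the first printed value to a date. Because the printout rounds these values to six significant digits, the start and end of a session look identical here, so the readable `start_time` and `end_time` strings are easier to work with.

```r
# First printed value of original_start_time,
# assumed to be milliseconds since 1970-01-01 UTC
original_start_ms <- 1.71977e+12

# as.POSIXct() expects seconds, so divide by 1000 first
start_datetime <- as.POSIXct(original_start_ms / 1000, origin = "1970-01-01", tz = "UTC")
start_date <- format(start_datetime, "%d/%m/%Y")
start_date  # "30/06/2024", the same date as the start_time string in that row
```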


Next, we compute the mean and median of the numeric variables from the players dataset that we're using, which is just `Age`.

In [6]:
players_age <- players |>
    summarize(mean_age = mean(Age), median_age = median(Age)) |>
    select(mean_age, median_age)
players_age

mean_age,median_age
<dbl>,<dbl>
NA,NA


We're getting "NA" values for both summary statistics, which indicates that some of the observations have missing values for age. Therefore, we will repeat the process while removing these missing values.

In [8]:
players_age <- players |>
    summarize(mean_age = mean(Age, na.rm = TRUE), median_age = median(Age, na.rm = TRUE)) |>
    select(mean_age, median_age)
players_age

mean_age,median_age
<dbl>,<dbl>
20.52062,19
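
To see how many observations are affected, the missing values can also be counted directly. The following is a minimal sketch on a made-up vector of ages (the values are illustrative and are not taken from `players.csv`):

```r
# Made-up ages with one missing value
toy_ages <- c(9, 17, 17, NA, 21)

# Count the missing entries
n_missing <- sum(is.na(toy_ages))

# A single NA makes the plain mean NA; na.rm = TRUE drops it first
mean_with_na <- mean(toy_ages)
mean_without_na <- mean(toy_ages, na.rm = TRUE)

n_missing        # 1
mean_with_na     # NA
mean_without_na  # 16
```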


The mean age (about 20.5) is slightly higher than the median (19), which suggests at most a mild right skew, and the earlier NA results show that some age values are missing. For now, we will keep the NA values and tidy the data by removing `name` and `played_hours`, since they're not involved in our analysis.

In [23]:
players_tidy <- players |>
    select(-name, -played_hours)
players_tidy

experience,subscribe,hashedEmail,gender,Age
<chr>,<lgl>,<chr>,<chr>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,Male,17
⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,Prefer not to say,17
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,Other,


Now, we'll look at the sessions dataset. From this dataset, we keep only `hashedEmail` and compute each player's average play time per session, which we'll call `mean_session_time`, from `start_time` and `end_time`.

In [22]:
sessions_tidy <- sessions |>
    select(hashedEmail, start_time, end_time) |>
    # Split each "date time" string so that only the clock time (HH:MM) is compared
    separate(start_time, into = c("start_date", "start_time"), sep = " ") |>
    separate(end_time, into = c("end_date", "end_time"), sep = " ") |>
    mutate(start_time = as.POSIXct(start_time, format = "%H:%M")) |>
    mutate(end_time = as.POSIXct(end_time, format = "%H:%M")) |>
    # Length of each session as the absolute difference between the two clock times
    mutate(total_time = abs(start_time - end_time)) |>
    select(hashedEmail, total_time) |>
    # Average session length for each player
    group_by(hashedEmail) |>
    summarize(mean_session_time = mean(total_time))
sessions_tidy

hashedEmail,mean_session_time
<chr>,<drtn>
0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,712 mins
060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967,30 mins
0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387,11 mins
⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,33.73871 mins
fe218a05c6c3fc6326f4f151e8cb75a2a9fa29e22b110d4c311fb58fb211f471,9.00000 mins
fef4e1bed8c3f6dcd7bcd39ab21bd402386155b2ff8c8e53683e1d2793bf1ed1,72.00000 mins
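
Because the code above compares only the clock times, a session that runs past midnight is measured as the gap between the two clock readings rather than its true length. The sketch below illustrates this on one made-up session and shows one possible alternative that parses the full date and time. The example values and the `UTC` time zone are assumptions for illustration and are not part of the original analysis.

```r
# One made-up session that starts before midnight and ends after it
start_raw <- "30/06/2024 23:50"
end_raw   <- "01/07/2024 00:10"

# Clock-time-only comparison: both times are placed on the same (current) date
start_clock <- as.POSIXct(sub(".* ", "", start_raw), format = "%H:%M", tz = "UTC")
end_clock   <- as.POSIXct(sub(".* ", "", end_raw), format = "%H:%M", tz = "UTC")
clock_only  <- as.numeric(abs(difftime(end_clock, start_clock, units = "mins")))

# Full date-time comparison using the day/month/year format seen in sessions.csv
start_full <- as.POSIXct(start_raw, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_full   <- as.POSIXct(end_raw, format = "%d/%m/%Y %H:%M", tz = "UTC")
full_duration <- as.numeric(difftime(end_full, start_full, units = "mins"))

clock_only     # 1420 minutes
full_duration  # 20 minutes
```

Passing `units = "mins"` to `difftime()` also fixes the unit, rather than letting R choose it automatically.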


Since we want to use variables from both datasets, we will merge the data using the `hashedEmail` variable and then remove that variable, since it's unnecessary for the rest of the data analysis.

In [25]:
plaicraft_data <- merge(sessions_tidy, players_tidy, by = "hashedEmail")
plaicraft_data

hashedEmail,mean_session_time,experience,subscribe,gender,Age
<chr>,<drtn>,<chr>,<lgl>,<chr>,<dbl>
0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,712 mins,Regular,TRUE,Male,20
060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967,30 mins,Pro,FALSE,Male,21
0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387,11 mins,Beginner,TRUE,Male,17
⋮,⋮,⋮,⋮,⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,33.73871 mins,Amateur,TRUE,Male,23
fe218a05c6c3fc6326f4f151e8cb75a2a9fa29e22b110d4c311fb58fb211f471,9.00000 mins,Amateur,TRUE,Male,17
fef4e1bed8c3f6dcd7bcd39ab21bd402386155b2ff8c8e53683e1d2793bf1ed1,72.00000 mins,Beginner,TRUE,Male,20
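
By default, `merge()` keeps only the rows whose key appears in both tables, so any player without a recorded session would not appear in the merged data. The sketch below shows this behaviour on two made-up tables (the names `toy_sessions` and `toy_players` and their values are hypothetical):

```r
# Made-up stand-ins for the two tidy tables
toy_sessions <- data.frame(
    hashedEmail = c("a", "b"),
    mean_session_time = c(12, 30)
)
toy_players <- data.frame(
    hashedEmail = c("a", "b", "c"),   # player "c" never started a session
    subscribe = c(TRUE, FALSE, TRUE)
)

# Default merge: only keys present in both tables survive
inner <- merge(toy_sessions, toy_players, by = "hashedEmail")
nrow(inner)  # 2, player "c" is dropped

# all.y = TRUE keeps every player; missing session times become NA
kept <- merge(toy_sessions, toy_players, by = "hashedEmail", all.y = TRUE)
nrow(kept)   # 3
```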


In [None]:
plaicraft_data <- plaicraft_data |>
    select(-hashedEmail)

plaicraft_data

Now that we have our data cleaned and formatted, we can run a classification model to predict if a player is subscribed based on their `mean_session_time`, `experience`, `gender`, and `Age`.

In [None]:
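# subscribe must be a factor for classification, and the recipe steps need plain numbers
plaicraft_data <- plaicraft_data |>
    mutate(subscribe = as.factor(subscribe),
           mean_session_time = as.numeric(mean_session_time, units = "mins"))

plaicraft_split <- initial_split(plaicraft_data, prop = 0.60, strata = subscribe)
plaicraft_training <- training(plaicraft_split)
plaicraft_testing <- testing(plaicraft_split)

# Scale and center only the numeric predictors; experience and gender are categorical
plaicraft_recipe <- recipe(subscribe ~ ., data = plaicraft_training) |>
    step_scale(all_numeric_predictors()) |>
    step_center(all_numeric_predictors())

# TODO: cross-validation to find the best k value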
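
The cross-validation step noted above could look roughly like the following. This is only a sketch under several assumptions: a K-nearest neighbours classifier with the `kknn` engine (which requires the kknn package to be installed), 5 folds, a grid of odd `k` values from 1 to 15, an arbitrary seed, and a synthetic `toy_training` table standing in for the real training data.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Synthetic stand-in for the training data (not the real PLAIcraft data)
toy_training <- tibble(
    subscribe = factor(rep(c("TRUE", "FALSE"), each = 30)),
    Age = c(rnorm(30, mean = 18, sd = 3), rnorm(30, mean = 25, sd = 3)),
    mean_session_time = c(rnorm(30, mean = 40, sd = 10), rnorm(30, mean = 15, sd = 10))
)

toy_recipe <- recipe(subscribe ~ ., data = toy_training) |>
    step_scale(all_numeric_predictors()) |>
    step_center(all_numeric_predictors())

# K-nearest neighbours with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

toy_folds <- vfold_cv(toy_training, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Estimate the cross-validated accuracy for each candidate k
knn_results <- workflow() |>
    add_recipe(toy_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = toy_folds, grid = k_grid, metrics = metric_set(accuracy)) |>
    collect_metrics()

# Pick the k with the highest mean accuracy across folds
best_k <- knn_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
best_k
```

Stratifying the folds by `subscribe` keeps the class proportions similar across folds, in the same way as the stratified train/test split.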

### References

(1) Smith, A. (2023, September 28). plaicraft.ai launch - Pacific Laboratory for Artificial Intelligence. Pacific Laboratory for Artificial Intelligence. https://plai.cs.ubc.ca/2023/09/27/plaicraft/

(2) The Pacific Laboratory for Artificial Intelligence. (2023, September 28). Home Page - Pacific Laboratory for Artificial Intelligence. Pacific Laboratory for Artificial Intelligence. https://plai.cs.ubc.ca/