# DSCI 100 Project Final Report – Group Component

Created by Chrissy Ding, Kaylee Hogeboom, Rhett Cotton, and Trinity Chan

### 1. Introduction

#### The addressed broad question: "We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts"

#### Specific Question: Can age and experience predict the total hours played so we can target similar players for large-data recruitment?
Using the provided datasets, we'll investigate this question for Frank Wood's CS research group's game, PLAICraft. The main dataset used will be players.csv.


#### Pre-Inspection Details

The dataset that we will use for our analysis comes from the Pacific Laboratory for Artificial Intelligence (PLAI), which is part of the Department of Computer Science at UBC; the Minecraft server they run is aptly named PLAICraft. Their mission is to create ethical generative AI systems that will have a lasting positive impact on society. Through the collection of data from this Minecraft server, PLAI aims to create an advanced embodied AI which can interact with real people. This is done by training an AI model on thousands of hours of behavioural data from the server's players.
\
To ensure the data collected proves useful and is able to be easily analyzed, PLAICraft has certain rules that must be followed. The main rules are as follows:
- If you are away from keyboard (AFK) for more than 5 minutes you will be removed from the server
- All voice chat must be spoken in English
- Please remember to play in a quiet environment without background noise
- Please be sure to enter gameplay within 10 minutes of requesting a session

As well, players must be generally respectful, must not use any hacks or cheats while playing, and must refrain from destroying other players' works. If players do not follow the above rules, their data is discarded and they are removed from the server. Finally, to ensure resources are not wasted, players who request a session and then do not show up (log in to the server) within 10 minutes receive a time penalty during which they cannot join the server. This time penalty increases as players continue to miss requested sessions, up to a maximum penalty of 125 hours before they can next join the server.

\
Furthermore, PLAI encourages players to interact with each other via in-game voice chat and to invite their friends. Both are incentivized with additional playtime. Otherwise, players earn 3 minutes of playtime for every hour they're away, up to a maximum of 30 minutes. The datasets we use for this analysis were collected between May 1st, 2024 and September 1st, 2024. Sessions were tracked from the time the game was opened until it was closed by the player. There are two NA values in *players.csv*; we kept those rows because their other variables may still be valuable for exploratory visualizations and summaries.

\
As for the ethics involved, PLAI has received permission to run this study from the office of research ethics at UBC, and they require a consent form (or parental consent) for players to access the server. Emails and phone numbers are collected so that PLAI can send players the links they need to access the server, and so that a player's data can be identified if it needs to be deleted for any reason. PLAI says they track gameplay, speech, and key presses in the PLAICraft Minecraft browser window; if for any reason a player wishes for part of their data to be excluded from collection, they are able to email support@plaicraft.ai for assistance. This is also the email to which players should write if they wish to report technical problems or abuse from other players. Now we will begin our analysis of the *players.csv* dataset.


In [None]:
# Now, let's load in some R packages!
library(tidyverse)
library(ggplot2)
library(RColorBrewer)
library(tidymodels)
library(gridExtra)

#### i) Data Descriptions and Inspecting the Datasets with R Functions and Summaries

In [None]:
#load the dataset we'll be working with
url_players <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/players.csv"
players <- read_csv(url_players)

head(players)

#### Description of the set (players.csv):

- 196 observations, 7 variables about the players:
1. `experience`: *character* - gaming "level" - order: Beginner, Amateur, Regular, Veteran, Pro (most experienced)
2. `subscribe`: *logical* - PlaiCraft newsletter subscription: TRUE="yes", FALSE="no"
3. `hashedEmail`: *character* - email formatted in numbers and letters
4. `played_hours`: *double* - total hours played
5. `name`: *character* - first name
6. `gender`: *character* - gender
7. `Age`: *double* - age (years)

##### Issues: 
- `subscribe` is ambiguous - likely indicates newsletter subscription
- `experience` "level" order is unclear
    - The assumed order is stated above, but there is no metadata to verify this
 
Tidy data has one variable per column, one observation per row, and one value per cell. No wrangling is needed, as the dataset already follows this structure.
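
A quick structural check can support this kind of claim. The following is a minimal sketch that uses a small made-up data frame (`toy_players`, a hypothetical stand-in with invented values) rather than the real *players.csv*; it counts missing values per column and confirms that each column holds a single type.

```r
# Hypothetical stand-in for players.csv (values are made up for illustration)
toy_players <- data.frame(
  experience   = c("Pro", "Veteran", "Amateur", "Regular"),
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  Age          = c(9, 17, NA, 21)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy_players))
na_counts

# Check the type of each column (one variable, one type per column)
col_types <- sapply(toy_players, class)
col_types
```

Running the same two calls on `players` would show where the NA values sit before deciding how to handle them.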

In [None]:
# Now, let's read in sessions.csv!
url_sessions <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/sessions.csv"
sessions <- read_csv(url_sessions)
head(sessions)

#### Description of the set (sessions.csv):

- 1535 observations, 5 variables:
1. `hashedEmail`: *character* - email formatted in numbers and letters
2. `start_time`: *character* - session start time ("DD/MM/YYYY", "time (24-hour-clock)")
3. `end_time`: *character* - session end time ("DD/MM/YYYY", "time (24-hour-clock)")
4. `original_start_time`: *double* - session start time in UNIX (milliseconds)
5. `original_end_time`: *double* - session end time in UNIX (milliseconds)

This dataset isn't the focus of our analysis, but it is useful for further exploratory visualizations. It is already tidy as well.
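
Because the time columns are stored as text and as UNIX milliseconds, they would need converting before any duration could be computed. The sketch below uses made-up values and assumes the date and time are separated by a single space and recorded in UTC; neither assumption is confirmed by the dataset description.

```r
# Hypothetical session in the format described above (values are made up)
start_time <- "30/06/2024 18:12"
end_time   <- "30/06/2024 18:24"

# Parse the "DD/MM/YYYY HH:MM" strings into date-times (UTC is an assumption)
start_dt <- as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_dt   <- as.POSIXct(end_time,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes
session_minutes <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
session_minutes

# UNIX time in milliseconds: divide by 1000 to get seconds before converting
original_start_ms <- 1719770000000
as.POSIXct(original_start_ms / 1000, origin = "1970-01-01", tz = "UTC")
```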

### 2. Methods & Results

This section will include summaries and exploratory visualizations of the dataset, the creation of the model, and visualizations based on the final data analysis.

#### Summary #1 - Experience

In [None]:
experience_count <- players |>
                        group_by(experience) |>
                        summarize(count = n())

experience_summary <- experience_count |>
                        ungroup() |>
                        mutate(percent_of_overall_dataset = count/sum(count) * 100) |> 
                        mutate(percent_of_overall_dataset = round(percent_of_overall_dataset, 2)) #round to 2 decimal places
experience_summary

#### Summary #2 - Played Hours

In [None]:
played_hours_summary <- players |>
                            summarize(mean = mean(played_hours, na.rm = TRUE),
                                      sum = sum(played_hours, na.rm = TRUE),
                                      most = max(played_hours),
                                      least = min(played_hours)) |>
                            mutate(across(mean:least, ~ round(.x, 2)))    
played_hours_summary       

#### Summary #3 - Age (Years)

In [None]:
age_summary <- players |>
                    summarize(mean = mean(Age, na.rm = TRUE),
                              oldest = max(Age, na.rm = TRUE),
                              youngest = min(Age, na.rm = TRUE)) |>
                    mutate(across(mean:youngest, ~ round(.x, 2)))    
age_summary   

#### Summary #4 - Total Sessions

In [None]:
# Because this dataset is not the focus, we'll just summarize the variable we will be using in a visualization later.
total_sessions_count <- sessions |>
                                group_by(hashedEmail) |>
                                summarize(count = n())

total_sessions_summary <- total_sessions_count |>
                                summarize(mean = mean(count, na.rm = TRUE),
                                      median = median(count, na.rm = TRUE),
                                      most = max(count),
                                      least = min(count)) |>
                                mutate(across(mean:least, ~ round(.x, 2)))  
total_sessions_summary

In [None]:
# We will merge the datasets together too to simplify future explorations:
sessions_players_merged <- merge(players, sessions, by = "hashedEmail", all = TRUE)
head(sessions_players_merged)
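
One detail of `merge(..., all = TRUE)` is worth keeping in mind: a player with no recorded sessions still gets one row, with the session columns filled with NA. The sketch below shows this on two tiny made-up tables (`toy_players` and `toy_sessions` are hypothetical) and contrasts counting rows with counting non-missing start times.

```r
# Hypothetical miniature versions of the two tables (values are made up)
toy_players <- data.frame(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(2.5, 0.0, 1.0)
)
toy_sessions <- data.frame(
  hashedEmail = c("a1", "a1", "c3"),
  start_time  = c("01/05/2024 10:00", "02/05/2024 11:00", "03/05/2024 12:00")
)

# all = TRUE keeps players with no sessions, filling session columns with NA
toy_merged <- merge(toy_players, toy_sessions, by = "hashedEmail", all = TRUE)
toy_merged

# Rows per player after the merge: "b2" has one row despite zero sessions
rows_per_player <- table(toy_merged$hashedEmail)
rows_per_player

# Counting only non-missing start times recovers the true session count
sessions_per_player <- tapply(!is.na(toy_merged$start_time),
                              toy_merged$hashedEmail, sum)
sessions_per_player
```

Any later summary that counts rows per player on the merged table would therefore report at least one session for every player.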

#### ii) Exploratory Data Analysis and Visualization

Let's create visualizations to seek out relationships and overlooked issues.

#### Visualization #1

In [None]:
options(repr.plot.width = 16, repr.plot.height = 6)

age_histogram <- ggplot(players, aes(x = Age)) +
                    geom_histogram(bins = 12) +
                    labs(x = "Age (Years)", y = "Total Players") +
                    ggtitle("Distribution of Ages Across Different Gaming Experiences") +
                    scale_y_continuous(breaks = seq(0, 32, by = 2)) +
                    scale_x_continuous(breaks = seq(0, 60, by = 10)) +
                    facet_grid(cols = vars(experience)) +
                    theme(text = element_text(size = 15))
age_histogram

The majority of players are approximately 17 years old; numerous teenagers and young adults participated in the study, though relatively few children and older adults. Based on the number of observations in each experience class, there are many amateurs and veterans, and from these plots we can tell that age does not equate to gaming experience.

#### Visualization #2

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

played_hours_age_plot <- ggplot(players, aes(x = Age, y = played_hours + 1)) + #add +1 so that when we log our y-axis, the 0 values won't be infinity
                            geom_point(alpha = 0.4) +
                            labs(x = "Age (Years)", y = "Total Hours of PlaiCraft Played (Scaled)") +
                            ggtitle("Total Hours Played vs Age of PlaiCraft Players") +
                            scale_y_log10() +
                            theme(text = element_text(size = 14))
played_hours_age_plot

From this plot, there is no clear relationship or trend, as the points are widely spread and seemingly unordered. There are, however, some points concentrated near the bottom of the graph. The plot suggests that teenagers and young adults play more, though given the spread of the points, this likely also depends on the individual player.
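
The `+ 1` shift in the plotting code matters because a log axis cannot display zero. A small illustration with made-up hour values:

```r
# Hypothetical played-hours values, including a player with zero hours
hours <- c(0, 0.1, 1, 9, 99)

# log10(0) is -Inf, so zero-hour players cannot be placed on a log axis
log10(hours)

# Shifting by 1 first maps 0 hours to 0 on the log scale
shifted <- log10(hours + 1)
shifted
```

The shift compresses small values slightly (0.1 hours and 0 hours end up close together), which is an acceptable trade-off for an exploratory plot.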

#### Visualization #3

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

sessions_hours_total <- sessions_players_merged |>
                        select(hashedEmail, played_hours, experience) |>
                        filter(played_hours != 0) |>
                        group_by(hashedEmail, played_hours, experience) |>
                        summarize(count = n(), .groups = "drop") #summarizing how many sessions each player played; .groups = "drop" returns an ungrouped tibble

experience_sessions_plot <- sessions_hours_total |>
                        ggplot(aes(x = count, y = played_hours, color = experience)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Total Number of Sessions (Scaled)", y = "Total Hours Played (Scaled)", colour = "Gaming Experience") +
                            ggtitle("Total Hours Played Based on Total Number of Sessions Played") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            scale_x_log10() +
                            theme(text = element_text(size = 13))
experience_sessions_plot                        

There is a weak to moderate positive relationship in this plot, as the two variables increase together. Players who play the same number of sessions do not necessarily play the same number of hours. As well, the experience levels are scattered across the chart, suggesting a weaker relationship between experience and hours or sessions played. Notably, some regulars and amateurs played the greatest number of sessions and the most hours.
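
If we wanted to put a number on the strength of this relationship, a rank correlation would be one option, since it does not change when the axes are log-scaled. The sketch below uses made-up per-player totals, so its output says nothing about the real data.

```r
# Hypothetical per-player totals (values are made up for illustration)
n_sessions   <- c(1, 1, 2, 3, 5, 8, 20, 40)
played_hours <- c(0.1, 0.4, 0.3, 1.2, 0.9, 4.0, 6.5, 30.0)

# Spearman's rank correlation depends only on the ordering of the values,
# so it is the same on the raw scale and on the log scale
rho_raw <- cor(n_sessions, played_hours, method = "spearman")
rho_log <- cor(log10(n_sessions), log10(played_hours), method = "spearman")
c(raw = rho_raw, log = rho_log)
```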

#### iii) KNN Regression Model and Data Analysis

Now, let's make the regression model! The steps we will follow are outlined below:

1. Mutate **experience** into numerical values
2. Inspect and clean data (e.g., handle NA values)
3. Split dataset
      - 75% training, 25% testing
4. Tune training set
      - Scale predictors
5. Cross-validate training set
      - 5 folds
      - Tested neighbors differ by 3; neighbors = 1, 4, ..., 100
6. Find optimal **K-value**, refit model, assess on testing data for RMSPE
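
To make the idea behind these steps concrete, here is a hand-written sketch of unweighted KNN regression in base R. It is an illustration only, not how the `kknn` engine is implemented; the data values and the helper `knn_predict` are hypothetical.

```r
# Hypothetical training data (values are made up for illustration)
train <- data.frame(
  Age            = c(15, 17, 17, 20, 22, 25, 40),
  experience_num = c(1, 2, 4, 3, 5, 2, 1),
  played_hours   = c(3.0, 12.0, 0.5, 1.0, 30.0, 0.2, 0.0)
)
new_player <- data.frame(Age = 18, experience_num = 3)

knn_predict <- function(train, new_obs, k) {
  predictors <- c("Age", "experience_num")
  # Standardize with the training means and standard deviations so that
  # Age (large range) does not dominate experience_num (range 1 to 5)
  centers <- colMeans(train[predictors])
  scales  <- sapply(train[predictors], sd)
  train_std <- scale(train[predictors], center = centers, scale = scales)
  new_std   <- scale(new_obs[predictors], center = centers, scale = scales)
  # Euclidean distance from the new observation to every training row
  dists <- sqrt(rowSums(sweep(train_std, 2, as.numeric(new_std))^2))
  # Unweighted ("rectangular") average of the k nearest responses
  nearest <- order(dists)[seq_len(k)]
  mean(train$played_hours[nearest])
}

knn_predict(train, new_player, k = 3)
# With k equal to the number of rows, the prediction is just the overall mean
knn_predict(train, new_player, k = nrow(train))
```

The last call shows why very large values of *k* tend to underfit: every new player receives a prediction close to the average of the whole training set.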

In [None]:
#Step 1 and 2
players_tidy <- players |>
                    mutate(experience = factor(experience,
                                               levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                                               ordered = TRUE),
                           experience_num = as.numeric(experience)) |>
                    select(played_hours, experience_num, experience, Age) |>
                    drop_na() #drop rows with a missing value in any of the selected columns
head(players_tidy)
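
The explicit `levels` argument is what makes the numeric codes follow the assumed skill order. A short illustration with the five labels in arbitrary order:

```r
# The five experience labels in arbitrary order
experience <- c("Pro", "Beginner", "Veteran", "Amateur", "Regular")

# Without explicit levels, factor() sorts the labels alphabetically,
# so the codes do not reflect skill
as.numeric(factor(experience))

# With explicit levels, the codes follow the assumed ordering (1 = Beginner, 5 = Pro)
level_order <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
experience_num <- as.numeric(factor(experience, levels = level_order, ordered = TRUE))
experience_num
```

Note that treating these codes as numbers assumes the gaps between adjacent levels are equal, which the dataset gives us no way to verify.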

In [None]:
#Step 3 and 4
set.seed(1234)
players_split <- initial_split(players_tidy, prop = 0.75, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)

players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                    set_engine("kknn") |>
                    set_mode("regression")

players_recipe <- recipe(played_hours ~ experience_num + Age, data = players_training) |>
                    step_scale(all_predictors()) |>
                    step_center(all_predictors())

players_workflow <- workflow() |>
                        add_recipe(players_recipe) |>
                        add_model(players_spec)
players_workflow

In [None]:
# Step 5
set.seed(1234)
gridvals <- tibble(neighbors = seq(from = 1, to = 100, by = 3))

players_vfold <- vfold_cv(players_training, v = 5, strata = played_hours)

players_results <- players_workflow |>
                        tune_grid(resamples = players_vfold, grid = gridvals) |>
                        collect_metrics() |>
                        filter(.metric == "rmse")
head(players_results)

In [None]:
set.seed(1234)
players_min <- players_results |>
                    slice_min(mean, n = 1, with_ties = FALSE) #this shows the best "k" to use; with_ties = FALSE guarantees a single row
players_min
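
For readers who want to see what the cross-validation step is doing, the following base R sketch tunes *k* by 5-fold cross-validation on simulated data. It is a simplified illustration under several assumptions: one predictor instead of two, random rather than stratified folds, and a hypothetical helper `knn_1d`. It does not reproduce `tune_grid()`.

```r
set.seed(1234)
# Hypothetical data: one standardized predictor and a noisy response
n <- 60
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Unweighted k-nearest-neighbour prediction for a single new point
knn_1d <- function(x_train, y_train, x_new, k) {
  nearest <- order(abs(x_train - x_new))[seq_len(k)]
  mean(y_train[nearest])
}

# Assign each row to one of 5 folds at random
folds <- sample(rep(1:5, length.out = n))

cv_rmse <- function(k) {
  fold_rmse <- sapply(1:5, function(f) {
    held_out <- folds == f
    # Predict each held-out point from the other four folds
    preds <- sapply(x[held_out], function(x_new) {
      knn_1d(x[!held_out], y[!held_out], x_new, k)
    })
    sqrt(mean((y[held_out] - preds)^2))
  })
  mean(fold_rmse)  # average RMSE across the 5 folds
}

# Grid of candidate k values, spaced by 3 like the grid above
k_grid  <- seq(from = 1, to = 46, by = 3)
results <- data.frame(neighbors = k_grid, mean = sapply(k_grid, cv_rmse))
results[which.min(results$mean), ]
```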

In [None]:
# Step 6
set.seed(1234)
k_min <- players_min |>
          pull(neighbors)

players_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
          set_engine("kknn") |>
          set_mode("regression")

players_best_fit <- workflow() |>
          add_recipe(players_recipe) |>
          add_model(players_best_spec) |>
          fit(data = players_training)

players_summary <- players_best_fit |>
           predict(players_testing) |>
           bind_cols(players_testing) |>
           metrics(truth = played_hours, estimate = .pred)
players_summary

In [None]:
# Final model: reuse the best k found in Step 6 (k = 100 in our run) instead of hard-coding it
players_spec_final <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
                        set_engine("kknn") |>
                        set_mode("regression")

players_recipe_final <- recipe(played_hours ~ experience_num + Age, data = players_tidy) |>
                            step_center(all_predictors()) |>
                            step_scale(all_predictors())

players_fit_final <- workflow() |>
                        add_recipe(players_recipe_final) |>
                        add_model(players_spec_final) |>
                        fit(players_tidy)

#### Trying the Model With New Observations

In [None]:
set.seed(1234)
new_obs <- tibble(Age = sample(5:58, size = 196, replace = TRUE),
                  experience_num = sample(1:5, size = 196, replace = TRUE))

predict_new_obs <- players_fit_final |>
                    predict(new_obs) |>
                    bind_cols(new_obs) 

head(predict_new_obs)

#### iv) Visualizations of Model and Analysis

The following visualizations will be further discussed and compared in the "Discussion" section.

#### Visualization #4

In [None]:
options(repr.plot.width = 9, repr.plot.height = 8)
k_rmspe_plot <- players_results |>
                    ggplot(aes(x = neighbors, y = mean)) +
                    geom_point() +
                    geom_line() +
                    labs(x = "Neighbours", y = "RMSPE") +
                    ggtitle("Neighbours vs RMSPE") +
                    theme(text = element_text(size = 15))
k_rmspe_plot

#### Visualization #5

In [None]:
options(repr.plot.width = 11, repr.plot.height = 12)
testing_data_plot <- players_testing |>
                            ggplot(aes(x = Age, y = played_hours + 1, colour = experience)) +
                            geom_point(alpha = 0.4) + 
                            labs(x = "Age (Years)", y = "Total Hours Played (Scaled)", colour = "Gaming Experience") +
                            ggtitle("Total Hours Played vs Age of Plaicraft Players (Testing Data)") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            theme(text = element_text(size = 15))

testing_preds <- players_best_fit |>
  predict(players_testing) |>
  bind_cols(players_testing)

predicted_testing_plot <- ggplot(testing_preds, aes(x = Age, y = .pred + 1, colour = experience)) + #add +1 so the axis matches the plot of the actual values
                              geom_point(alpha = 0.4) +
                              labs(x = "Age (Years)", y = "Total Hours Played (Scaled)", colour = "Gaming Experience") +
                              ggtitle("Predicted Total Hours Played vs Age of Plaicraft Players (Testing Data)") +
                              scale_color_brewer(palette = "Dark2") +
                              scale_y_log10() +
                              theme(text = element_text(size = 15))

grid.arrange(testing_data_plot, predicted_testing_plot)

#### Visualization #6

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)
new_obs_plot <- predict_new_obs |>
                    ggplot(aes(x = Age, y = .pred + 1, colour = factor(experience_num))) +
                    geom_point(alpha = 0.4) +
                    labs(x = "Age (Years)", y = "Total Played Hours (Scaled)", colour = "Gaming Experience (1 = Beginner, 5 = Pro)") +
                    ggtitle("Predicted Total Hours Played vs Age of Plaicraft Players") +
                    scale_color_brewer(palette = "Dark2") +
                    scale_y_log10() + #log scale to match the "+ 1" shift and the "(Scaled)" axis label
                    theme(text = element_text(size = 15))

new_obs_plot

### 3. Discussion

Before discussing our results, let's revisit our predictions and expectations. In our exploratory data analysis, visualization #3 showed that hours played and number of sessions increase together but do not appear to depend on gaming experience. This was expected, because players at the same experience level can have very different hours and sessions played. Based on visualization #2 and the data provided, we expected that younger players would play the game more and produce higher total hours played, as they generally spend more time on technology. We also anticipated that newer players (Amateurs) and seasoned, competitive players (Veterans) would log the most total hours, since these are the largest groups in visualization #1. We speculate that new players log more hours because of the release of dopamine that comes with anticipating a new game, while for many veterans the game is almost like a job: professionals and content creators play many hours to train their skills before tournaments and to stream to audiences in order to monetize their gaming skills. Knowing which age ranges and experience levels play the game the most will let us target those demographics in recruitment efforts.

After creating our model, the best *k* we found is 100, and its RMSPE value is 11.30, as shown in Step 6 of making our model. This suggests that our model's predictions are typically off by about 11.30 hours. In the context of our question, this is not a small error. An error of 11 hours can be a deciding factor, considering that we are looking for types of players that could contribute large amounts of data. For instance, if a player actually had 30 hours of playtime but was predicted to have 19 hours, we might decide against recruiting that type of player when in reality they contributed a decent number of hours. On the other hand, if a player had 10 hours of playtime but was predicted to have 21 hours, we might consider them for recruitment when they actually didn't contribute as much as we thought.
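
For reference, RMSPE is the square root of the mean squared difference between observed and predicted values on the test set. The sketch below defines a small helper (`rmspe`, a hypothetical name) and applies it to the two illustrative players described above.

```r
# Root mean squared prediction error
rmspe <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# The two illustrative players from the paragraph above
observed  <- c(30, 10)
predicted <- c(19, 21)
rmspe(observed, predicted)  # both predictions are off by 11 hours, so this is 11
```

Because the errors are squared before averaging, a few very large misses (for example on players with hundreds of hours) can raise the RMSPE substantially even if most predictions are close.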

Based on our model's data analysis, visualization #5 compares the actual testing data with the model's predictions for the testing data. As we can see, the predictions were fairly off.

After creating a tibble with randomized `Age` and `experience_num` values to test our model on, we may conclude that older players are most likely to play fewer hours of PLAICraft and therefore contribute less data than younger players. This is shown visually in visualization #6. Therefore, we may consider recruiting younger players, which is what we expected to find, as stated before.

The limitations of our model, and thus the reasons for this RMSPE value, arise partly from the dataset being real data. There is high variability among players, and there are few strong, consistent patterns or trends. In addition, outliers in our dataset may affect our model. The highest total hours played, as seen in summary #2, was 223 hours, while the mean was 5 hours. Strong outliers, even values in the hundreds of hours, can affect the predictions because each prediction averages the 100 closest data points, which is a large neighbourhood. Along the same lines, *k* = 100 is approximately 68% of our training data (and roughly half of the full dataset). This could cause underfitting, where the model is not influenced enough by the local structure of the data. Moreover, data imbalance could also be a problem. As seen in summary #1, the distribution of experience levels is uneven: pros make up only approximately 7% of the data, while the other experience levels each make up 10-30%.
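
The arithmetic behind these points can be sketched as follows. The training-set size is approximate, since a stratified split may not divide the rows exactly, and the sketch assumes the two NA values fall in two separate rows; the neighbourhood values are made up.

```r
# Sizes taken from the report; n_training is an approximation
n_total    <- 196
n_complete <- n_total - 2              # assumes two rows are dropped for NA values
n_training <- round(0.75 * n_complete)
k <- 100

round(100 * k / n_training)  # percent of the training set averaged per prediction
round(100 * k / n_complete)  # percent of the cleaned dataset used by the final model

# Hypothetical neighbourhood: 99 players with 1 hour and one with 223 hours
neighbourhood <- c(rep(1, 99), 223)
mean(neighbourhood)    # the single outlier lifts the average to 3.22
median(neighbourhood)  # the median stays at 1
```

The last two lines show how one extreme player can shift a KNN prediction, which is an average of the neighbours, while leaving the typical value unchanged.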

One suggestion to improve our model would be to try a smaller range of *k* values to avoid underfitting. Another would be to collect more data to train our model, so that there are more pros, a more even distribution of experience levels, and more data overall on which to base our predictions, as the original dataset may be considered on the smaller side. These future changes would hopefully make our model more accurate and lower its RMSPE value.


### GitHub Repository

https://github.com/tchan0717/dsci-100-2025w1-group-36.git

### References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly. https://r4ds.had.co.nz/.
\
\
The Pacific Laboratory of Artificial Intelligence. FAQ. Plaicraft. https://plaicraft.ai/faq. 