# DSCI 100 Project Final Report – Group Component

Created by Chrissy Ding, Kaylee Hogeboom, Rhett Cotton, and Trinity Chan

### 1. Introduction

#### The addressed broad question: "We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts"

#### Specific Question: Can age and experience predict the total hours played so we can target similar players for large-data recruitment?
Using the provided datasets, we investigate this question for PlaiCraft, the game run by Frank Wood's CS research group. The main dataset is players.csv.


#### Pre-Inspection Details

 - Collection period: May 1-September 1, 2024
 - Session tracking: from when the game was opened until it was closed
 - Two NA values (players.csv)
     - We kept these rows because their other variables may still be valuable for visualizations and summaries

<mark>Add more about background info, how data was collected, etc.</mark>

In [None]:
# Now, let's load in some R packages!
library(tidyverse)
library(ggplot2)
library(RColorBrewer)
library(tidymodels)

### 2. Methods & Results

#### i) Data Descriptions and Inspecting the Datasets with R Functions and Summaries

In [None]:
#load the dataset we'll be working with
url_players <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/players.csv"
players <- read_csv(url_players)

head(players)

#### Description of the set (players.csv):

- 196 observations, 7 variables about the players:
1. `experience`: *character* - gaming "level" - order: Beginner, Amateur, Regular, Veteran, Pro (most experienced)
2. `subscribe`: *logical* - PlaiCraft newsletter subscription: TRUE="yes", FALSE="no"
3. `hashedEmail`: *character* - hashed email address (a string of numbers and letters)
4. `played_hours`: *double* - total hours played
5. `name`: *character* - first name
6. `gender`: *character* - gender
7. `Age`: *double* - age (years)

##### Issues: 
- `subscribe` is ambiguous - likely indicates newsletter subscription
- `experience` "level" order is unclear
    - Assumed order stated above, but no metadata to verify
 
Tidy data has one variable per column, one observation per row, and one value per cell. The dataset already follows this format, so no wrangling is needed.
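As a quick check on the two missing values noted in the pre-inspection, a per-column NA count can be sketched as follows. The `toy_players` tibble is made-up stand-in data (an assumption for illustration only); with the real data, `players` would take its place.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for players.csv (values are illustrative only)
toy_players <- tibble(
  played_hours = c(0, 1.2, 30.3, 0.5),
  Age          = c(17, NA, 21, NA)
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts
```

Knowing which column holds the missing values helps decide whether `na.rm = TRUE` in a summary is enough or whether rows must be dropped before modelling.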

#### Summary #1 - Experience

In [None]:
experience_count <- players |>
                        group_by(experience) |>
                        summarize(count = n())

experience_summary <- experience_count |>
                        ungroup() |>
                        mutate(percent_of_overall_dataset = count/sum(count) * 100) |> 
                        mutate(percent_of_overall_dataset = round(percent_of_overall_dataset, 2)) #round to 2 decimal places
experience_summary

#### Summary #2 - Played Hours

In [None]:
played_hours_summary <- players |>
                            summarize(mean = mean(played_hours, na.rm = TRUE),
                                      sum = sum(played_hours, na.rm = TRUE),
                                      most = max(played_hours, na.rm = TRUE),
                                      least = min(played_hours, na.rm = TRUE)) |>
                            mutate(across(mean:least, ~ round(.x, 2)))    
played_hours_summary       

#### Summary #3 - Gender

In [None]:
gender_count <- players |>
                    group_by(gender) |>
                    summarize(count = n())

gender_summary <- gender_count |>
                        ungroup() |>
                        mutate(percent_of_overall_dataset = count/sum(count) * 100) |>
                        mutate(percent_of_overall_dataset = round(percent_of_overall_dataset, 2))
gender_summary

#### Summary #4 - Age (Years)

In [None]:
age_summary <- players |>
                    summarize(mean = mean(Age, na.rm = TRUE),
                              oldest = max(Age, na.rm = TRUE),
                              youngest = min(Age, na.rm = TRUE)) |>
                    mutate(across(mean:youngest, ~ round(.x, 2)))    
age_summary   

In [None]:
# Now, let's read in sessions.csv!
url_sessions <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/sessions.csv"
sessions <- read_csv(url_sessions)
head(sessions)

#### Description of the set (sessions.csv):

- 1535 observations, 5 variables:
1. `hashedEmail`: *character* - hashed email address (a string of numbers and letters)
2. `start_time`: *character* - session start time ("DD/MM/YYYY", "time (24-hour-clock)")
3. `end_time`: *character* - session end time ("DD/MM/YYYY", "time (24-hour-clock)")
4. `original_start_time`: *double* - session start time in UNIX (milliseconds)
5. `original_end_time`: *double* - session end time in UNIX (milliseconds)

This dataset isn't the focus of the analysis, but it is useful for further exploratory visualizations. It is already tidy as well.
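Because `start_time` and `end_time` are stored as character strings, they need to be parsed before session lengths can be computed. The sketch below uses made-up timestamps and assumes the strings look like `"DD/MM/YYYY HH:MM"` and can be treated as UTC; both the exact string layout and the time zone are assumptions, not facts taken from the dataset.

```r
library(dplyr)
library(tibble)

# Made-up sessions in the assumed "DD/MM/YYYY HH:MM" layout
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse the strings into date-times, then take the difference in minutes
toy_durations <- toy_sessions |>
  mutate(
    start = as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    end   = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    duration_mins = as.numeric(difftime(end, start, units = "mins"))
  )
toy_durations
```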

#### Summary #5 - Total Sessions

In [None]:
# Because this dataset is not the focus, we'll only summarize the variable used in a later visualization.
total_sessions_count <- sessions |>
                                group_by(hashedEmail) |>
                                summarize(count = n())

total_sessions_summary <- total_sessions_count |>
                                summarize(mean = mean(count, na.rm = TRUE),
                                      median = median(count, na.rm = TRUE),
                                      most = max(count),
                                      least = min(count)) |>
                                mutate(across(mean:least, ~ round(.x, 2)))  
total_sessions_summary

In [None]:
# We also merge the two datasets to simplify later exploration:
sessions_players_merged <- merge(players, sessions, by = "hashedEmail", all = TRUE)
head(sessions_players_merged)

#### ii) Exploratory Data Analysis and Visualization

Let's create visualizations to seek out relationships and overlooked issues.

#### Visualization #1

In [None]:
options(repr.plot.width = 16, repr.plot.height = 6)

age_histogram <- ggplot(players, aes(x = Age)) +
                    geom_histogram(bins = 12) +
                    labs(x = "Age (Years)", y = "Total Players") +
                    ggtitle("Distribution of Ages Across Different Gaming Experiences") +
                    scale_y_continuous(breaks = seq(0, 32, by = 2)) +
                    scale_x_continuous(breaks = seq(0, 60, by = 10)) +
                    facet_grid(cols = vars(experience)) +
                    theme(text = element_text(size = 15))
age_histogram

- Most players are around 17 years old
- Many teenagers and young adults
   - Few young children or older adults
- Many players are Amateurs or Veterans
- Age does not track gaming experience

#### Visualization #2

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

played_hours_age_plot <- ggplot(players, aes(x = Age, y = played_hours + 1)) + #add 1 so zero-hour players stay finite on the log axis (log10(0) is -Inf)
                            geom_point(alpha = 0.4) +
                            labs(x = "Age (Years)", y = "Total Hours of PlaiCraft Played + 1 (log scale)") +
                            ggtitle("Total Hours Played vs Age of PlaiCraft Players") +
                            scale_y_log10() +
                            theme(text = element_text(size = 14))
played_hours_age_plot

- No clear relationship or trend between age and hours played
    - Points are widely spread
- Most points are condensed near the bottom of the graph (few hours played)
- Suggests teens and young adults play more
    - Likely depends on the individual player, though
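The `+ 1` offset used in the plot above matters because a log axis cannot display zero. A minimal sketch with made-up hour totals (not values from players.csv) shows the effect:

```r
# Made-up hour totals, including a player who never played
hours <- c(0, 0.5, 9, 99)

raw_log    <- log10(hours)      # the zero becomes -Inf and would drop off the plot
offset_log <- log10(hours + 1)  # the zero maps to 0 and stays visible

raw_log
offset_log
```

The offset slightly compresses small values, so the axis shows hours plus one rather than raw hours.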

#### Visualization #3

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

sessions_hours_total <- sessions_players_merged |>
                        select(hashedEmail, played_hours, experience) |>
                        filter(played_hours != 0) |>
                        group_by(hashedEmail, played_hours, experience) |>
                        summarize(count = n(), .groups = "drop") #number of merged rows per player, used as the session count

experience_sessions_plot <- sessions_hours_total |>
                        ggplot(aes(x = count, y = played_hours, color = experience)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Total Number of Sessions (log scale)", y = "Total Hours Played (log scale)", colour = "Gaming Experience") +
                            ggtitle("Total Hours Played Based on Total Number of Sessions Played") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            scale_x_log10() +
                            theme(text = element_text(size = 13))
experience_sessions_plot                        

- Strong, positive relationship
    - Hours played increase with the number of sessions
- Players with identical session counts can still differ in hours played
- Gaming experience is fairly scattered
    - Some Regulars and Amateurs played the most sessions and hours
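One caveat about the session counts: the merged table was built with `merge(..., all = TRUE)`, which keeps players who have no recorded sessions as a single row of NAs, so counting rows would give them one session. A possible alternative, sketched here on made-up data (the toy tibbles and the `session_count` name are assumptions for illustration), is to count sessions first and then join:

```r
library(dplyr)
library(tibble)

# Made-up players and sessions; player "c" has no sessions
toy_players  <- tibble(hashedEmail = c("a", "b", "c"),
                       played_hours = c(2.0, 0.5, 0.1))
toy_sessions <- tibble(hashedEmail = c("a", "a", "b"),
                       start_time  = c("t1", "t2", "t3"))

# Counting rows after a full outer merge: "c" wrongly gets a count of 1
naive_counts <- merge(toy_players, toy_sessions, by = "hashedEmail", all = TRUE) |>
  count(hashedEmail)

# Counting sessions first, then joining: "c" correctly gets 0
session_counts <- toy_sessions |>
  count(hashedEmail, name = "session_count")

toy_joined <- toy_players |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(session_count = coalesce(session_count, 0L))
toy_joined
```

Players with zero sessions would still need to be filtered out or offset before plotting on a log axis.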

#### Visualization #4

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

sessions_hours_total_2 <- sessions_players_merged |>
                        select(hashedEmail, played_hours, gender) |>
                        filter(played_hours != 0) |>
                        group_by(hashedEmail, played_hours, gender) |>
                        summarize(count = n(), .groups = "drop") #number of merged rows per player, used as the session count

gender_sessions_plot <- sessions_hours_total_2 |>
                        ggplot(aes(x = count, y = played_hours, color = gender)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Total Number of Sessions (log scale)", y = "Total Hours Played (log scale)", colour = "Gender") +
                            ggtitle("Total Hours Played Based on Total Number of Sessions Played") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            scale_x_log10() +
                            theme(text = element_text(size = 13))
gender_sessions_plot

Visualization #3, coloured by gender instead of experience:
- Gender is scattered across the plot
- Male players dominate the upper range
   - They play more hours and sessions

#### iii) KNN Regression Model and Data Analysis

Now, let's make the regression model! The steps we will follow are outlined below:

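1. Convert `experience` to an ordered factor, then to a numeric score
2. Inspect and clean the data (e.g., handle NA values)
3. Split the dataset
      - 75% training, 25% testing
4. Build the recipe and model specification on the training set
      - Center and scale the predictors
5. Cross-validate on the training set
      - 5 folds
      - Candidate neighbors increase in steps of 2: neighbors = 1, 3, ..., 99
6. Find the optimal *k*, refit the model, and assess it on the testing data using RMSPE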
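To make the idea behind these steps concrete, the sketch below works through a single KNN regression prediction by hand in base R: standardize the predictors, find the *k* closest training players, and average their hours. The data are made up, and this is only an illustration of the technique under those assumptions, not the `kknn` implementation used in the analysis.

```r
# Made-up training data (not taken from players.csv)
train <- data.frame(
  experience_num = c(1, 2, 2, 4, 5),
  Age            = c(17, 18, 20, 30, 40),
  played_hours   = c(1, 2, 3, 10, 20)
)
new_player <- data.frame(experience_num = 2, Age = 19)

predictors <- c("experience_num", "Age")

# Center and scale using the training set's means and standard deviations
centers <- colMeans(train[predictors])
scales  <- apply(train[predictors], 2, sd)
standardize <- function(df) {
  scale(as.matrix(df[predictors]), center = centers, scale = scales)
}
train_z <- standardize(train)
new_z   <- standardize(new_player)

# Euclidean distance from the new player to every training player
distances <- sqrt(rowSums(sweep(train_z, 2, as.numeric(new_z))^2))

# Average the response over the k nearest neighbours
k <- 3
nearest <- order(distances)[1:k]
knn_prediction <- mean(train$played_hours[nearest])
knn_prediction
```

Standardizing first keeps `Age`, which spans tens of years, from dominating the distance over the 1 to 5 experience score.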

In [None]:
#Step 1 and 2
players_tidy <- players |>
                    mutate(experience = factor(experience,
                                               levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                                               ordered = TRUE),
                           experience_num = as.numeric(experience)) |> #Beginner = 1, ..., Pro = 5
                    select(played_hours, experience_num, Age) |>
                    drop_na() #drop rows with a missing value in any selected column
head(players_tidy)

In [None]:
#Step 3 and 4
set.seed(1234)
players_split <- initial_split(players_tidy, prop = 0.75, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)

players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                    set_engine("kknn") |>
                    set_mode("regression")

players_recipe <- recipe(played_hours ~., data = players_training) |>
                    step_scale(all_predictors()) |>
                    step_center(all_predictors())

players_workflow <- workflow() |>
                        add_recipe(players_recipe) |>
                        add_model(players_spec)
players_workflow

In [None]:
# Step 5
set.seed(1234)
gridvals <- tibble(neighbors = seq(from = 1, to = 100, by = 2))

players_vfold <- vfold_cv(players_training, v = 5, strata = played_hours)

players_results <- players_workflow |>
                        tune_grid(resamples = players_vfold, grid = gridvals) |>
                        collect_metrics() |>
                        filter(.metric == "rmse")
head(players_results)

In [None]:
set.seed(1234)
players_min <- players_results |>
                    slice_min(mean, n = 1, with_ties = FALSE) #this shows the best "k" to use (a single row, even if several k tie)
players_min

In [None]:
# Step 6
set.seed(1234)
k_min <- players_min |>
          pull(neighbors)

players_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
          set_engine("kknn") |>
          set_mode("regression")

players_best_fit <- workflow() |>
          add_recipe(players_recipe) |>
          add_model(players_best_spec) |>
          fit(data = players_training)

players_summary <- players_best_fit |>
           predict(players_testing) |>
           bind_cols(players_testing) |>
           metrics(truth = played_hours, estimate = .pred)
players_summary

In [None]:
# Final model: refit on the full dataset with the k selected by cross-validation
players_spec_final <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
                        set_engine("kknn") |>
                        set_mode("regression")

players_recipe_final <- recipe(played_hours ~., data = players_tidy) |>
                            step_center(all_predictors()) |>
                            step_scale(all_predictors())

players_fit_final <- workflow() |>
                        add_recipe(players_recipe_final) |>
                        add_model(players_spec_final) |>
                        fit(players_tidy)
players_fit_final

#### Description of Above Methods 

#### iv) Visualizations of Analysis

In [None]:
options(repr.plot.width = 9, repr.plot.height = 8)
k_rmspe_plot <- players_results |>
                    ggplot(aes(x = neighbors, y = mean)) +
                    geom_point() +
                    geom_line() +
                    labs(x = "Neighbours", y = "RMSPE") +
                    ggtitle("Neighbours vs RMSPE") +
                    theme(text = element_text(size = 15))
k_rmspe_plot
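For reference, the RMSPE on the y-axis is the square root of the mean squared difference between observed and predicted values; `collect_metrics()` and `metrics()` report it under the name `rmse`. A minimal sketch with made-up numbers (not results from this analysis):

```r
# Made-up observed and predicted hours
truth <- c(1, 2, 3, 10)
pred  <- c(2, 2, 5, 6)

# Root mean squared prediction error
rmspe <- sqrt(mean((truth - pred)^2))
rmspe
```

Because the errors are squared before averaging, a few players with very large hour totals can dominate this metric.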

In [None]:
#Make visualizations based on the fitted regression

### 3. Discussion

- <mark>Summarize findings</mark>
- <mark>Discuss whether this is what was expected</mark>
- <mark>What impact do the findings have?</mark>
- <mark>Are there any future questions of interest?</mark>

### GitHub Repository

https://github.com/tchan0717/dsci-100-2025w1-group-36.git

### References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly. https://r4ds.had.co.nz/.

The Pacific Laboratory of Artificial Intelligence. FAQ. Plaicraft. https://plaicraft.ai/faq.