# PlaiCraft DSCI 100 Individual Planning

In this individual planning report, I will analyze and visualize data provided by Frank Wood's Computer Science research group from their vanilla survival Minecraft server, PlaiCraft. The datasets provided are players.csv and sessions.csv.

### The Question
Broad question: "We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts". 

Specific predictive question: 
##### Can a player's age and experience predict the total hours they will play so that we can target those "kinds" of players for large data recruitment?

### 1. Data Description and Pre-Inspection

Before analyzing, here are a couple of details about how the data was collected:
 - Collected between May 1 and September 1, 2024
 - Each session's data was collected from when the game browser was opened until it was closed

After previewing both datasets, there are two NA values and two rows missing data. However, the other variables in those rows may be valuable, so I'll make note of them, but not drop them.
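One way to double-check this kind of note is to count missing values per column and list the incomplete rows. Here is a minimal base R sketch on a small made-up data frame (`toy_players` is hypothetical and only mirrors a few columns of players.csv); the same calls could be applied to the real data once it is loaded.

```r
# Hypothetical toy data standing in for a few columns of players.csv
toy_players <- data.frame(
  experience   = c("Pro", "Amateur", "Veteran", "Regular"),
  played_hours = c(30.3, 0.0, 3.8, 0.1),
  Age          = c(9, NA, 17, NA)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy_players))
na_per_column

# Rows with at least one missing value (noted here, not dropped)
incomplete_rows <- toy_players[!complete.cases(toy_players), ]
incomplete_rows
```

Keeping the incomplete rows means later summaries need `na.rm = TRUE` wherever the affected column is used.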

Now, let's load some R packages!

In [None]:
library(tidyverse)
library(ggplot2)
library(lubridate)
library(RColorBrewer)

### 2. Inspecting the Datasets with R Functions and Summaries

Let's read in and inspect the datasets!

In [None]:
url_players <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/players.csv"
players_data <- read_csv(url_players)
players_data

#### Description of the set:

The "players.csv" has 196 observations and 7 variables about the players:
1. `experience`: *character* - "level" of gaming experience from Beginner, Amateur, Regular, Veteran, or Pro (most experienced)
2. `subscribe`: *logical* - subscription to PlaiCraft's newsletter: TRUE = "yes", FALSE = "no"
3. `hashedEmail`: *character* - hashed email
4. `played_hours`: *double* - total played hours
5. `name`: *character* - first name
6. `gender`: *character* - gender
7. `Age`: *double* - age in years


The subscribe column is ambiguous, but likely indicates PlaiCraft's newsletter subscription. Furthermore, the order of the experience column is unclear. Usually the order is the one stated above. However, there's no metadata to verify these.
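If the assumed order is adopted, it can be made explicit with an ordered factor. The sketch below is only an illustration: the level order is my assumption (there's no metadata to confirm it), and `toy_experience` is a made-up vector standing in for the `experience` column.

```r
# Assumed ordering, least to most experienced (not confirmed by any metadata)
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical values standing in for players_data$experience
toy_experience <- c("Pro", "Veteran", "Amateur", "Beginner", "Regular", "Amateur")

# Ordered factor so that summaries and plots follow the assumed order
experience_ordered <- factor(toy_experience, levels = experience_levels, ordered = TRUE)

# Integer codes: 1 = Beginner, ..., 5 = Pro
experience_numeric <- as.integer(experience_ordered)
experience_numeric
```

The integer codes are what a distance-based model would see, so a different assumed order would change which players count as "close" to each other.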

#### Summary #1 - Experience

In [None]:
experience_count <- players_data |>
                        group_by(experience) |>
                        summarize(count = n())

experience_summary <- experience_count |>
                        ungroup() |>
                        mutate(percent_of_overall = count/sum(count) * 100) |> 
                        mutate(percent_of_overall = round(percent_of_overall, 2))
experience_summary

#### Summary #2 - Played Hours

In [None]:
played_hours_summary <- players_data |>
                            # use na.rm = TRUE for every statistic so one NA cannot turn max/min into NA
                            summarize(mean = mean(played_hours, na.rm = TRUE),
                                      sum = sum(played_hours, na.rm = TRUE),
                                      max = max(played_hours, na.rm = TRUE),
                                      min = min(played_hours, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
played_hours_summary

#### Summary #3 - Subscribed

In [None]:
subscribe_count <- players_data |>
                        group_by(subscribe) |>
                        summarize(count = n())

subscribe_summary <- subscribe_count |>
                        ungroup() |>
                        mutate(percent_of_overall = count/sum(count) * 100) |>
                        mutate(percent_of_overall = round(percent_of_overall, 2))
subscribe_summary

#### Summary #4 - Gender

In [None]:
gender_count <- players_data |>
                    group_by(gender) |>
                    summarize(count = n())

gender_summary <- gender_count |>
                        ungroup() |>
                        mutate(percent_of_overall = count/sum(count) * 100) |>
                        mutate(percent_of_overall = round(percent_of_overall, 2))
gender_summary

#### Summary #5 - Age (Years)

In [None]:
age_summary <- players_data |>
                    summarize(mean = mean(Age, na.rm = TRUE),
                              max = max(Age, na.rm = TRUE),
                              min = min(Age, na.rm = TRUE)) |>
                    mutate(across(mean:min, ~ round(.x, 2)))    
age_summary   

In [None]:
url_sessions <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/sessions.csv"
sessions_data <- read_csv(url_sessions)
sessions_data

#### Description of the set:
The "sessions.csv" has 1535 observations and 5 variables:
1. `hashedEmail`: *character* - player's hashed email
2. `start_time`: *character* - session start time ("DD/MM/YYYY" date, then time on a 24-hour clock)
3. `end_time`: *character* - session end time ("DD/MM/YYYY" date, then time on a 24-hour clock)
4. `original_start_time`: *double* - session start time in UNIX (milliseconds)
5. `original_end_time`: *double* - session end time in UNIX (milliseconds)
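The summaries that follow use a tidied table, `sessions_data_tidy`, with a `date_start` column and numeric `start_time` and `end_time` columns. Here is a hedged sketch of one way such a table could be built. `tidy_sessions()` is a hypothetical helper, the "DD/MM/YYYY HH:MM" layout is taken from the description above, and storing clock times as decimal hours of the day is my assumption. The toy rows are made up.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: one possible way to build a tidy sessions table.
# Column names date_start, start_time, end_time follow the later cells;
# storing times as decimal hours of the day is an assumption.
tidy_sessions <- function(sessions) {
  sessions |>
    mutate(start_dt = dmy_hm(start_time),
           end_dt   = dmy_hm(end_time)) |>
    mutate(date_start = as_date(start_dt),
           start_time = hour(start_dt) + minute(start_dt) / 60,
           end_time   = hour(end_dt) + minute(end_dt) / 60) |>
    select(hashedEmail, date_start, start_time, end_time)
}

# Toy rows in the same "DD/MM/YYYY HH:MM" layout (made-up values)
toy_sessions <- tibble(
  hashedEmail = c("a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46")
)

toy_sessions_tidy <- tidy_sessions(toy_sessions)
toy_sessions_tidy
```

Under these assumptions, the analogous call on the real data would be `sessions_data_tidy <- tidy_sessions(sessions_data)`. A mean of decimal hours treats 23:59 and 00:01 as almost a full day apart, so mean clock times should be read with care.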



#### Summary #6 - Dates of Sessions

In [None]:
date_total <- sessions_data_tidy |>
                    group_by(date_start) |>
                    summarize(count = n()) |>
                    arrange(desc(count))

pull(head(date_total, 1))
pull(tail(date_total, 1))

The highest and lowest numbers of sessions in a day were 38 and 1, respectively. Let's find the day(s) with these counts.

In [None]:
date_summary <- date_total |>
                    filter(count %in% c(38, 1))

date_summary

- Days with the least activity (1 session) fall mostly in July and September, with one each in April and June
- The most active day was July 25, 2024 (38 sessions)

#### Summary #7 - Session Start Times

In [None]:
start_time_summary <- sessions_data_tidy |>
                            summarize(mean = mean(start_time, na.rm = TRUE),
                                      max = max(start_time, na.rm = TRUE),
                                      min = min(start_time, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
start_time_summary

- Mean = 10:41am
- Latest = 11:58pm
- Earliest = 12:00am

#### Summary #8 - Session End Times

In [None]:
end_time_summary <- sessions_data_tidy |>
                            summarize(mean = mean(end_time, na.rm = TRUE),
                                      max = max(end_time, na.rm = TRUE),
                                      min = min(end_time, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
end_time_summary

- Mean = 10:05am
- Latest = 11:58pm
- Earliest = 2:00am

##### I will also merge the datasets to simplify future explorations:

In [None]:
sessions_players_merged <- merge(players_data, sessions_data, by = "hashedEmail", all = TRUE)
sessions_players_merged

### 3. Exploratory Data Analysis and Visualization


Let's make visualizations to understand and seek out helpful relationships or issues we didn't catch!

#### Visualization #1

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
played_hours_age_plot <- ggplot(players_data, aes(x = Age, y = played_hours + 1)) + # add 1 so that 0-hour players are not mapped to -Inf on the log-scaled y-axis
                            geom_point(alpha = 0.5) +
                            labs(x = "Age (Years)", y = "Hours of PlaiCraft Played (Scaled)") +
                            ggtitle("Age of PlaiCraft Players vs Hours Played") +
                            scale_y_log10() +
                            theme(text = element_text(size = 14))
played_hours_age_plot

- No clear relationship or trend
    - Widespread points
- Points are concentrated near the bottom of the graph, with outliers in the top half
- Suggests that teens and young adults (18 - 20) play more hours
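As a quick check of why the plot uses `played_hours + 1`, here is a small sketch with made-up hour values (`hours` is hypothetical, not taken from the dataset):

```r
# log10(0) is -Inf, so players with 0 hours could not be drawn on a log-scaled axis
log10(0)

# Shifting by 1 maps 0 hours to log10(1) = 0, which can be drawn
hours <- c(0, 0.1, 1, 9, 99)   # made-up values
shifted_log <- log10(hours + 1)
shifted_log
```

The shift compresses differences among very small values, so positions near the bottom of the axis should be read as "close to zero hours" rather than as exact amounts.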

#### Visualization #2

In [None]:
options(repr.plot.width = 16, repr.plot.height = 6)
age_histogram <- ggplot(players_data, aes(x = Age)) +
                    geom_histogram(bins = 20) +
                    labs(x = "Age (Years)", y = "Total Players") +
                    ggtitle("Distribution of Ages Across Different Gaming Experiences") +
                    scale_y_continuous(breaks = seq(0, 28, by = 2)) +
                    scale_x_continuous(breaks = seq(0, 60, by = 10)) +
                    facet_grid(cols = vars(experience)) +
                    theme(text = element_text(size = 15))
age_histogram

- Most players are around 17 years old
- Lots of teenagers and young adults
   - Very few young children or older adults
- Age doesn't appear to correlate with gaming experience
    - Any age can still be a pro, etc.

#### Visualization #3

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
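sessions_hours_total <- sessions_players_merged |>
                        select(hashedEmail, played_hours, experience, start_time) |>
                        group_by(hashedEmail, played_hours, experience) |>
                        # count real sessions only: after the full merge, a player with no sessions
                        # still has one row (with NA start_time), which n() would count as 1
                        summarize(count = sum(!is.na(start_time)), .groups = "drop")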

age_sessions_plot <- sessions_hours_total |>
                        ggplot(aes(x = count + 1, y = played_hours + 1, color = experience)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Total Number of Sessions (Scaled)", y = "Total Hours Played (Scaled)", colour = "Gaming Experience") +
                            ggtitle("Total Hours Played Based on Total Number of Sessions Played") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            scale_x_log10() +
                            theme(text = element_text(size = 13))
age_sessions_plot                        

- Strong, positive relationship
    - Clear upwards trend
    - Total sessions and hours played increase together
- The same number of sessions does not imply the same total hours played
- Gaming experience is fairly scattered
    - Some Regulars and Amateurs played the most

### 4. Methods and Plans

Choosing an appropriate prediction model is important for accurate predictions. I would approach my question using KNN regression. I would work with the players.csv dataset as it contains played_hours, Age, and experience: all the variables I need. The model takes an age and an experience level as input and returns a prediction for the total hours such a player may contribute. Then, we can decide whether to "target" those players.

KNN regression is appropriate because it's flexible and I want to predict a numerical value, played_hours. In contrast, KNN classification is for categorical predictions. Judging by the visualizations above, the variables don't have a linear relationship, so linear regression is not preferred.

I chose the predictor variables experience and Age because they describe "kinds" of players. They provide the most meaningful and varied data to explore and predict from, and I expect them to be related to total playtime. Gender was considered but felt unreliable because there are "prefer not to say" values, which aren't helpful in narrowing down "kinds" of players. Similarly, each player's session times appear random; they are not a stable characteristic of a player the way age and experience are.

##### *To process the data, I will follow the general KNN regression model steps with noted adjustments:*
1. Mutate "experience" to numerical levels (1 = Beginner, ..., 5 = Pro)
2. Inspect and clean data
3. Split dataset
    - 75% training and 25% testing
4. Prepare the recipe and model specification using the training set
    - Recipe: response variable = played_hours, predictor variables = Age + experience_levels
        - Scale predictors
5. Cross-validate training set
    - 5 folds
    - Make tibble with range of neighbors: 1 to 101
        - Differ by 4; neighbors = 1, 5, ..., 101 (*k* cannot exceed the number of rows in each cross-validation analysis set, about 117 after a 75% split of 196 players, so 196 is too large)
    - Create workflow, use *tune_grid* and *collect_metrics*
6. Filter for optimal *k*
7. Make new KNN model and workflow with the optimal *k*, fit on the training data, assess on testing data using *predict* and *bind_cols*, and collect metrics
8. Compare RMSPE to cross-validation RMSE
9. Use optimal *k*
    - Refit on the full original dataset for the final model
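The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is simulated (`toy_players` is made up, not players.csv), the experience ordering is assumed, the seed is arbitrary, and the grid stops at 101 so that *k* stays below the number of rows in each cross-validation analysis set. It needs the tidymodels and kknn packages installed.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, for reproducibility only

# Simulated stand-in for players.csv (all values are made up)
toy_players <- tibble(
  Age = sample(9:50, 196, replace = TRUE),
  experience = sample(c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                      196, replace = TRUE),
  played_hours = round(rexp(196, rate = 0.2), 1)
) |>
  # Step 1: assumed ordering, 1 = Beginner, ..., 5 = Pro
  mutate(experience_level = as.integer(factor(
    experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")))) |>
  select(played_hours, Age, experience_level)

# Step 3: 75% training, 25% testing
players_split <- initial_split(toy_players, prop = 0.75, strata = played_hours)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Step 4: recipe with scaled and centred predictors, plus a tunable KNN spec
knn_recipe <- recipe(played_hours ~ Age + experience_level, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Step 5: 5-fold cross-validation over a grid of k values
players_vfold <- vfold_cv(players_train, v = 5, strata = played_hours)
k_grid <- tibble(neighbors = seq(from = 1, to = 101, by = 4))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Step 6: k with the smallest cross-validation RMSE
best_k <- cv_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Step 7: refit on the training set with best_k, then evaluate on the test set
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_best_spec) |>
  fit(data = players_train)

test_rmspe <- knn_fit |>
  predict(players_test) |>
  bind_cols(players_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == "rmse") |>
  pull(.estimate)

test_rmspe
```

For step 8, `test_rmspe` would be compared with the `mean` column of `cv_results` at `best_k`. With the real data, rows with a missing Age would also need to be handled before fitting, which this sketch does not cover.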

KNN regression requires few assumptions compared to linear regression. It does assume that new observations are similar to its training data, so one drawback is inaccurate predictions for observations outside the training data range. Moreover, a larger dataset means longer computation. Luckily, this dataset isn't too large, so computing should not be an issue, although it could become one with additional data. KNN is also sensitive to noisy data, and because it predicts by distance, the predictors need to be scaled so that they are on comparable scales.
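To illustrate why scaling matters for a distance-based method, here is a small base R sketch with four made-up players (the values and the row labels A, X, Y, Z are hypothetical). On the raw scale, a Beginner-to-Pro gap of 4 levels counts the same as a 4-year age gap, so the nearest neighbour of player A changes once both columns are standardised.

```r
# Four made-up players: Age in years, experience coded 1 (Beginner) to 5 (Pro)
toy <- data.frame(
  Age = c(20, 21, 25, 50),
  experience_level = c(1, 5, 1, 3),
  row.names = c("A", "X", "Y", "Z")
)

# Euclidean distances from player A on the raw scale
raw_dist <- as.matrix(dist(toy))["A", c("X", "Y", "Z")]

# Distances from player A after standardising each column (mean 0, sd 1)
scaled_dist <- as.matrix(dist(scale(toy)))["A", c("X", "Y", "Z")]

# Nearest neighbour of A before and after scaling
names(which.min(raw_dist))
names(which.min(scaled_dist))
```

Before scaling, A's nearest neighbour is X (one year older but a Pro). After scaling, it is Y (five years older but also a Beginner).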

### GitHub Repository

https://github.com/tchan0717/dsci-100-2025w1-group-36.git

### References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly. https://r4ds.had.co.nz/.

The Pacific Laboratory of Artificial Intelligence. FAQ. Plaicraft. https://plaicraft.ai/faq.