# PlaiCraft DSCI 100 Individual Planning

In this individual planning report, I will analyze and visualize data provided by Frank Wood's Computer Science research group for their vanilla survival Minecraft server, PlaiCraft. The datasets provided were players.csv and sessions.csv.

### Data Description and Pre-Inspection

Before we start analyzing, here are a couple of details about how the data was collected:
 - Collected between May 1 and September 1, 2024
 - Each session's data was collected from when the game browser was opened until it was closed

After previewing both datasets, I found two NA values and two rows with missing data. However, the other variables in those rows may be valuable, so I won't drop them; I will just make note of them.

So, let's begin! First, let's load some R packages.

In [None]:
library(tidyverse)
library(ggplot2)
library(lubridate)
library(RColorBrewer)

### 1. Inspecting the Datasets with R Functions and Summaries

Let's read in the datasets and inspect what's in store.

In [None]:
url_players <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/players.csv"
players_data <- read_csv(url_players)
players_data

#### Description of the set:

The "players.csv" has 196 observations and 7 variables:
1. `experience`: *character* - the "level" of players' gaming experience from Beginner, Amateur, Regular, Veteran, or Pro (most experienced)
2. `subscribe`: *logical* - subscription to PlaiCraft's newsletter, indicated with TRUE for "yes" or FALSE for "no"
3. `hashedEmail`: *character* - player's hashed email
4. `played_hours`: *double* - player's total played hours
5. `name`: *character* - player's first name
6. `gender`: *character* - player's gender
7. `Age`: *double* - player's age in years


The `subscribe` column is ambiguous, but it likely indicates subscription to PlaiCraft's newsletter. Furthermore, the order of the `experience` categories is unclear. Usually the order is the one stated above. However, there's no metadata to verify either assumption.
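
Since the order of `experience` is not documented, one way to make the assumed order explicit is an ordered factor. The following is a minimal sketch in base R, assuming the order Beginner < Amateur < Regular < Veteran < Pro stated above. `toy_experience` is made-up illustrative data, not values from players.csv.

```r
# Assumed order (not confirmed by any metadata): least to most experienced
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical values standing in for the experience column
toy_experience <- c("Pro", "Amateur", "Veteran", "Beginner", "Regular", "Amateur")

experience_ordered <- factor(toy_experience, levels = experience_levels, ordered = TRUE)

# Integer codes follow the level order, so Beginner = 1 and Pro = 5
experience_numeric <- as.integer(experience_ordered)
experience_numeric
```

If the true order turns out to be different, only `experience_levels` would need to change.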

#### Summary #1 - Experience

In [None]:
experience_count <- players_data |>
                        group_by(experience) |>
                        summarize(count = n())

experience_summary <- experience_count |>
                        ungroup() |>
                        mutate(percent_of_overall = count/sum(count) * 100) |> 
                        mutate(percent_of_overall = round(percent_of_overall, 2))
experience_summary

#### Summary #2 - Played Hours

In [None]:
played_hours_summary <- players_data |>
                            summarize(mean = mean(played_hours, na.rm = TRUE),
                                      sum = sum(played_hours, na.rm = TRUE),
                                      max = max(played_hours, na.rm = TRUE),
                                      min = min(played_hours, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))    
played_hours_summary                      

#### Summary #3 - Subscribed

In [None]:
subscribe_count <- players_data |>
                        group_by(subscribe) |>
                        summarize(count = n())

subscribe_summary <- subscribe_count |>
                        ungroup() |>
                        mutate(percent_of_overall = count/sum(count) * 100) |>
                        mutate(percent_of_overall = round(percent_of_overall, 2))
subscribe_summary

#### Summary #4 - Gender

In [None]:
gender_count <- players_data |>
                    group_by(gender) |>
                    summarize(count = n())

gender_summary <- gender_count |>
                        ungroup() |>
                        mutate(percent_of_overall = count/sum(count) * 100) |>
                        mutate(percent_of_overall = round(percent_of_overall, 2))
gender_summary

#### Summary #5 - Age (Years)

In [None]:
age_summary <- players_data |>
                    summarize(mean = mean(Age, na.rm = TRUE),
                              max = max(Age, na.rm = TRUE),
                              min = min(Age, na.rm = TRUE)) |>
                    mutate(across(mean:min, ~ round(.x, 2)))    
age_summary   

**Now, let's read in the sessions dataset!**

In [None]:
url_sessions <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/sessions.csv"
sessions_data <- read_csv(url_sessions)
sessions_data

#### Description of the set:
The "sessions.csv" has 1535 observations and 5 variables: 
1. `hashedEmail`: *character* - player's hashed email
2. `start_time`: *character* - player's session start time in "dd/mm/yyyy" and "time (24-hour-interval)"
3. `end_time`: *character* - player's session end time in "dd/mm/yyyy" and "time (24-hour-interval)"
4. `original_start_time`: *double* - player's start time in UNIX (milliseconds)
5. `original_end_time`: *double* - player's end time in UNIX (milliseconds)


The start_time and end_time columns are character types, which makes it difficult to perform numerical functions on them. Although the same times also exist in UNIX format (original_start_time and original_end_time), those columns are difficult to mutate correctly.

The data is already tidy, but I will wrangle it to get start_time and end_time in a numeric format. This will make them easier to use later.
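
As a point of comparison, both time formats can also be converted to date-times in base R. This is only a sketch under stated assumptions: the values below are made up rather than taken from sessions.csv, and the UTC time zone is an assumption because the data does not state one.

```r
# Hypothetical session, written in the same shapes as the sessions.csv columns
toy_start_text <- "30/06/2024 18:12"   # "dd/mm/yyyy HH:MM" text
toy_end_text   <- "30/06/2024 18:24"
toy_start_ms   <- 1719771120000        # UNIX time in milliseconds (made-up value)

# Parse the text columns; the time zone is an assumption
start_parsed <- as.POSIXct(toy_start_text, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_parsed   <- as.POSIXct(toy_end_text,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes
session_minutes <- as.numeric(difftime(end_parsed, start_parsed, units = "mins"))

# Hour of day (0 to 23) as an integer
start_hour <- as.integer(format(start_parsed, "%H"))

# UNIX milliseconds -> date-time: divide by 1000 to get seconds since 1970-01-01
start_from_unix <- as.POSIXct(toy_start_ms / 1000, origin = "1970-01-01", tz = "UTC")
```

Working with full date-times keeps the date and the clock time together, so a session that crosses midnight still has a positive length.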

In [None]:
sessions_data_tidy <- sessions_data |>
                        separate(end_time, 
                                into = c("date_end", "end_time"),
                                sep = " ") |>
                        separate(start_time,
                                into = c("date_start", "start_time"),
                                sep = " ") |>
                        mutate(start_time = as.numeric(hm(start_time))/3600) |> # hm() parses "HH:MM" into a period; as.numeric() gives seconds, and dividing by 3600 converts to hours
                        mutate(end_time = as.numeric(hm(end_time))/3600) |> # same conversion for the end time
                        mutate(start_time_hr = as.integer(start_time)) |> # as.integer() truncates to the whole hour
                        mutate(start_time_hr = start_time_hr %% 24) # %% returns the remainder after dividing by 24, so any value of 24 wraps to 0 on the 24-hour clock
sessions_data_tidy

#### Summary #6 - Dates of Sessions

In [None]:
date_total <- sessions_data_tidy |>
                    group_by(date_start) |>
                    summarize(count = n()) |>
                    arrange(desc(count))

pull(head(date_total, 1))
pull(tail(date_total, 1))

The highest and lowest numbers of sessions in one day were 38 and 1. Let's see the day(s) with these counts.

In [None]:
date_summary <- date_total |>
                    filter(count %in% c(38, 1))

date_summary

- Least activity in July and September, one in April and June 
- Most activity was July 25, 2024

#### Summary #7 - Session Start Times

In [None]:
start_time_summary <- sessions_data_tidy |>
                            summarize(mean = mean(start_time, na.rm = TRUE),
                                      max = max(start_time, na.rm = TRUE),
                                      min = min(start_time, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
start_time_summary

- Mean = 10:41am
- Latest = 11:58pm
- Earliest = 12:00am
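
One caveat when reading a mean clock time: hours wrap around at midnight, so an arithmetic mean can land far from where the sessions actually cluster. The sketch below uses two made-up start times (not values from the data) and compares the arithmetic mean with a circular mean, which is offered here only as one possible alternative.

```r
# Hypothetical start times in hours on a 24-hour clock (10pm and 1am)
toy_start_hours <- c(22, 1)

# The arithmetic mean treats 22 and 1 as far apart
arithmetic_mean <- mean(toy_start_hours)

# Circular mean: map hours to angles, average the unit vectors, map back to hours
angles <- toy_start_hours / 24 * 2 * pi
mean_angle <- atan2(mean(sin(angles)), mean(cos(angles)))
circular_mean <- (mean_angle / (2 * pi) * 24) %% 24
```

Here the arithmetic mean is 11.5 (late morning), while the circular mean is 23.5 (11:30pm), which sits between the two late-night sessions.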

#### Summary #8 - Session End Times

In [None]:
end_time_summary <- sessions_data_tidy |>
                            summarize(mean = mean(end_time, na.rm = TRUE),
                                      max = max(end_time, na.rm = TRUE),
                                      min = min(end_time, na.rm = TRUE)) |>
                            mutate(across(mean:min, ~ round(.x, 2)))
end_time_summary

- Mean = 10:05am
- Latest = 11:58pm
- Earliest = 2:00am

##### I will merge the datasets together to simplify future explorations.

In [None]:
sessions_players_merged <- merge(players_data, sessions_data_tidy, by = "hashedEmail", all = TRUE)
sessions_players_merged
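
A note on `all = TRUE`: it keeps players who have no sessions, and their session columns are filled with NA. If sessions are later counted by counting rows per player, such a player is counted as having one session. The sketch below illustrates this with made-up miniature tables (`toy_players` and `toy_sessions` are hypothetical, not the real data).

```r
# Hypothetical miniature versions of the two tables
toy_players <- data.frame(hashedEmail = c("a", "b", "c"),
                          played_hours = c(2.5, 0, 0.4))
toy_sessions <- data.frame(hashedEmail = c("a", "a", "c"),
                           start_time_hr = c(18L, 20L, 3L))

# all = TRUE keeps player "b", who has no sessions; the session column becomes NA
toy_merged <- merge(toy_players, toy_sessions, by = "hashedEmail", all = TRUE)

# Counting rows per player gives player "b" one "session"
rows_per_player <- table(toy_merged$hashedEmail)

# Counting non-missing session values gives the intended zero
sessions_per_player <- tapply(!is.na(toy_merged$start_time_hr),
                              toy_merged$hashedEmail, sum)
```

Counting non-missing values of a session column is one way to keep zero-session players at zero.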

### 2. The Question
After exploring the data, let's state the question.

Broad question: "We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts". 

Specific predictive question: 
##### Can player age and experience predict the total hours of PlaiCraft a player will play so that we can target those "kinds" of players for recruiting efforts to collect large amounts of data?

### 3. Exploratory Data Analysis and Visualization


Some graphs may be irrelevant to the question, but beneficial to understand relationships and explore both datasets.

#### Visualization #1

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
played_hours_age_plot <- ggplot(players_data, aes(x = Age, y = played_hours + 1)) + #add +1 so that when we log our y-axis, the 0 values won't be infinity
                            geom_point(alpha = 0.5) +
                            labs(x = "Age (Years)", y = "Hours of PlaiCraft Played (Scaled)") +
                            ggtitle("Age of PlaiCraft Players vs Hours Played") +
                            scale_y_log10() +
                            theme(text = element_text(size = 14))
played_hours_age_plot

- No clear relationship: the points are widespread with no obvious trend
- Most points are condensed near the bottom of the graph, with outliers in the top half
- Suggests that teens and young adults (18 - 20) tend to play more hours
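
The plot above adds 1 to `played_hours` before applying the log scale. A small sketch with made-up values (not from players.csv) shows why the shift is needed:

```r
# Hypothetical hours played, including a player with zero hours
toy_hours <- c(0, 0.5, 9, 99)

raw_log <- log10(toy_hours)          # log10(0) is -Inf, so that point cannot be drawn
shifted_log <- log10(toy_hours + 1)  # zero hours maps to log10(1) = 0
```

The trade-off is that the axis then shows hours + 1 rather than hours, which matters most for players with very small totals.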

#### Visualization #2

In [None]:
options(repr.plot.width = 9, repr.plot.height = 6)
age_histogram <- ggplot(players_data, aes(x = Age)) +
                    geom_histogram(bins = 20) +
                    labs(x = "Age (Years)", y = "Total Players") +
                    ggtitle("Distribution of PlaiCraft's Players' Ages") +
                    scale_x_continuous(breaks = seq(0, 65, by = 5)) +
                    scale_y_continuous(breaks = seq(0, 90, by = 10)) +
                    theme(text = element_text(size = 15))
age_histogram

- The most common age is ~17 years old
- Lots around teenager to young adult ages (16 - 25)
   - Few beyond range (young children and adults)

#### Visualization #3

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
sessions_hours_total <- sessions_players_merged |>
                        select(hashedEmail, played_hours, experience) |>
                        group_by(hashedEmail, played_hours, experience) |>
                        summarize(count = n())

age_sessions_plot <- sessions_hours_total |>
                        ggplot(aes(x = count + 1, y = played_hours + 1, color = experience)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Total Number of Sessions (Scaled)", y = "Total Hours Played (Scaled)", colour = "Gaming Experience") +
                            ggtitle("Total Hours Played Based on Total Number of Sessions Played") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            scale_x_log10() +
                            theme(text = element_text(size = 13))
age_sessions_plot                        

- Positive relationship
    - Total sessions and hours played increase together
    - Pretty strong; clear upwards trend
- Same number of sessions does not directly indicate same total hours played
- Gaming experience is fairly scattered
    - Some Regular and Amateurs near top right
    - Mixed around origin

#### Visualization #4

In [None]:
options(repr.plot.width = 16, repr.plot.height = 6)

experience_age_histogram <- age_histogram +
                                ggtitle("Distribution of Ages Across Different Gaming Experiences") +
                                scale_y_continuous(breaks = seq(0, 28, by = 2)) +
                                scale_x_continuous(breaks = seq(0, 60, by = 10)) +
                                facet_grid(cols = vars(experience))

experience_age_histogram

- Age doesn't appear to correlate with gaming experience
    - Any age can still be a pro, etc.
- From previous observations, we may consider "targeting" amateurs and younger players

#### Visualization #5

In [None]:
options(repr.plot.width = 15, repr.plot.height = 6)
start_time_players_total <- group_by(sessions_data_tidy, start_time_hr) |>
                                summarize(total_players = n())

start_time_players_line <- ggplot(start_time_players_total, aes(x = start_time_hr, y = total_players)) +
                                geom_line() +
                                scale_x_continuous(breaks = seq(0, 23)) +
                                labs(x = "Hour of Day", y = "Total Players Online") +
                                ggtitle("Total Players Online at Each Hour of the Day") +
                                theme(text = element_text(size = 15))
start_time_players_line

This visualization is a rough estimate of total players online since I converted the values to integers:
- Player activity:
    - Highest: 2 - 5am
    - Lowest: 8am - 2pm
    - High: 3pm onwards
- From previous observations, this makes sense as...
    - Teenagers and young adults have class or work midday
    - Free time later in the day

#### Visualization #6

In [None]:
options(repr.plot.width = 13, repr.plot.height = 6)
experience_subscribe_bar <- ggplot(players_data, aes(x = experience, fill = subscribe)) +
                                geom_bar(position = "stack") +
                                labs(x = "Gaming Experience", y = "Total Number of Players", fill = "Are they Subscribed?") +
                                ggtitle("Subscription of Players Based on Gaming Experience") +
                                scale_fill_brewer(palette = "Set2") +
                                theme(text= element_text(size = 14))
experience_subscribe_bar

- Majority subscribe to PlaiCraft newsletter
- From previous observations, amateurs make up largest portion of data
     - We may consider "targeting" this "kind"

### 4. Methods and Plans

Choosing an appropriate prediction model is important for accurate performance. I would approach my question using KNN regression, which can capture non-linear relationships. Briefly, I would work with the players.csv dataset because it contains the variables I need: played_hours, Age, and experience. I would mutate the experience column, putting each category on a numerical level. The goal is to input any age and experience level and receive a prediction for the total hours that player may contribute. Then, we can decide whether or not to "target" those players.

KNN regression is appropriate because I want to predict a numerical value, played_hours. In contrast, KNN classification is for categorical predictions. Judging by the visualizations above, the variables don't have a linear relationship, so linear regression is not preferred.

I chose these explanatory variables because they describe "kinds" of players. They provide the most meaningful and varied data to explore and predict from, and they typically contribute significantly to determining total game time. Gender was considered, but it felt unreliable because there are "prefer not to say" values, which would not be helpful in narrowing down a "kind". Furthermore, time variables don't correlate to "kinds" of players: each player's session times start and end seemingly at random, so they are not a characteristic of a player the way age and experience are.

##### *To process the data, I will follow the general KNN regression model steps and note specific changes where necessary:*
1. Mutate experience to numerical levels (1 = Beginner, ..., 5 = Pro)
2. Inspect and clean data
3. Split dataset
    - 75% training and 25% testing.
4. Tune training set
    - Recipe: response variable = played_hours, explanatory variables = Age + experience_levels
        - Scale with *step_scale* and *step_center*
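5. Cross validation on training set
    - 5 folds
    - Make tibble with range of neighbors, starting at 1
        - Differ by 4; neighbors = 1, 5, 9, ...
        - The upper limit must stay below the number of rows each fold trains on (about 117 of the ~147 training rows), so the full 196 is not reachable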
    - Create workflow and use *tune_grid* and *collect_metrics*
6. Filter for optimal *k*
7. Assess on testing data by making new knn model and workflow, then use *predict* and *bind_cols*, and collect metrics
8. Compare RMSPE to cross-validation RMSE
9. Use optimal *k*
    - Fit into original dataset for final model
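
The steps above can be sketched in a few lines of base R. This is an illustration only: it swaps in a hand-written KNN function for the tidymodels functions named in the plan (`tune_grid`, `collect_metrics`), and it runs on simulated data, not players.csv. The names `toy_players`, `knn_predict`, and `rmse`, and the relationship used to simulate `played_hours`, are all assumptions made for the example.

```r
set.seed(2024)  # reproducible toy data and splits

# Simulated stand-in for players.csv (NOT the real data)
n <- 120
toy_players <- data.frame(
  Age = sample(10:50, n, replace = TRUE),
  experience_levels = sample(1:5, n, replace = TRUE)
)
toy_players$played_hours <- pmax(0, 30 - 0.5 * toy_players$Age +
                                   2 * toy_players$experience_levels + rnorm(n, sd = 3))

# Step 3: 75% training, 25% testing
train_rows <- sample(n, size = round(0.75 * n))
train <- toy_players[train_rows, ]
test  <- toy_players[-train_rows, ]
predictors <- c("Age", "experience_levels")

# Predict each new row as the mean response of its k nearest training rows.
# Predictors are centred and scaled using the training data only.
knn_predict <- function(train_x, train_y, new_x, k) {
  centre <- colMeans(train_x)
  spread <- apply(train_x, 2, sd)
  train_s <- scale(train_x, center = centre, scale = spread)
  new_s   <- scale(new_x,   center = centre, scale = spread)
  apply(new_s, 1, function(row) {
    distances <- sqrt(colSums((t(train_s) - row)^2))
    mean(train_y[order(distances)[seq_len(k)]])
  })
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Step 5: 5-fold cross-validation over a grid of k values
k_grid <- seq(1, 41, by = 4)
fold_id <- sample(rep(1:5, length.out = nrow(train)))
cv_rmse <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(fold) {
    analysis   <- train[fold_id != fold, ]
    assessment <- train[fold_id == fold, ]
    preds <- knn_predict(analysis[predictors], analysis$played_hours,
                         assessment[predictors], k)
    rmse(assessment$played_hours, preds)
  }))
})

# Step 6: k with the smallest cross-validation RMSE
best_k <- k_grid[which.min(cv_rmse)]

# Steps 7 and 8: predict the test set from the full training set, then compute RMSPE
test_preds <- knn_predict(train[predictors], train$played_hours, test[predictors], best_k)
rmspe <- rmse(test$played_hours, test_preds)
```

Comparing `rmspe` with `min(cv_rmse)` corresponds to step 8. The grid here stops at 41 because each fold trains on only 72 of the 90 simulated training rows, and k cannot exceed that.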

KNN regression requires few assumptions compared to linear regression. It assumes that new observations are similar to its training data. Thus, one drawback is inaccurate predictions for observations outside the training data range. Moreover, a larger dataset means longer computation. Luckily, this dataset is not too large, so computing should not be an issue, although it could become a problem with additional data. KNN is also sensitive to noisy data and predicts by distance, so scaling is needed to ensure comparable variable scales.
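
To illustrate why scaling matters for a distance-based method, here is a small sketch with four made-up players (hypothetical values, with experience already on an assumed 1 to 5 scale). Age spans a much wider numeric range than experience, so it dominates unscaled distances.

```r
# Hypothetical players: player 2 differs from player 1 only in experience (1 vs 5),
# player 3 differs from player 1 only in age (17 vs 22)
toy <- data.frame(Age = c(17, 17, 22, 50),
                  experience_levels = c(1, 5, 1, 3))

# Euclidean distances between all pairs, before and after standardizing
raw_distances <- as.matrix(dist(toy))
scaled_distances <- as.matrix(dist(scale(toy)))

# Distances from player 1 to players 2 and 3
raw_distances[1, 2:3]
scaled_distances[1, 2:3]
```

On the raw values, player 2 is the closer neighbour of player 1 even though they sit at opposite ends of the experience scale. After standardizing, player 3 becomes the closer one, because a 5-year age gap is small relative to the spread of ages in this toy data.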

### GitHub Repository

https://github.com/tchan0717/dsci-100-2025w1-group-36.git

### References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. https://r4ds.had.co.nz/.
\
\
The Pacific Laboratory of Artificial Intelligence. FAQ. Plaicraft. https://plaicraft.ai/faq. 