# PlaiCraft DSCI 100 Individual Planning


#### Can age and experience predict the total hours played so we can target similar players for large-data recruitment?
Using provided datasets, I'll investigate this question for Frank Wood's CS research group's game, PlaiCraft. The main dataset used will be players.csv.

This addresses the broader question: "We would like to know which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts."

### 1. Pre-Inspection Details

 - Collection period: May 1-September 1, 2024
 - Session tracking duration: from when the game was opened until it was closed
 - Overall, two NA values, two rows missing data
     - I won't drop them as other values in those rows may be valuable
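
As a minimal sketch of how the NA counts above could be checked, the snippet below counts missing values per column and the number of incomplete rows. The `toy_players` data frame is a hypothetical stand-in with made-up values, not the real players.csv, and base R is used so the sketch runs without extra packages.

```r
# Hypothetical toy stand-in for players.csv (values are made up for illustration)
toy_players <- data.frame(
  experience   = c("Pro", "Amateur", "Veteran", "Beginner"),
  played_hours = c(30.3, 0.0, 1.2, 0.5),
  Age          = c(9, NA, 17, NA)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy_players))
na_per_column

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(toy_players))
rows_with_na
```

Keeping the incomplete rows is fine for summaries that use `na.rm = TRUE`, but a distance-based model would likely need those rows handled before fitting.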

In [None]:
# Now, let's load in some R packages!
library(tidyverse)
library(ggplot2)
library(RColorBrewer)

### 2. Data Descriptions and Inspecting the Datasets with R Functions and Summaries

In [None]:
# Let's read in and inspect the datasets!
url_players <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/players.csv"
players_data <- read_csv(url_players)
players_data

#### Description of the set (players.csv):

- 196 observations, 7 variables about the players:
1. `experience`: *character* -gaming "level" - Beginner, Amateur, Regular, Veteran, Pro (most experienced)
2. `subscribe`: *logical* -PlaiCraft newsletter subscription: TRUE="yes", FALSE="no"
3. `hashedEmail`: *character* -email in form of numbers and letters
4. `played_hours`: *double* -total hours played
5. `name`: *character* -first name
6. `gender`: *character* -gender
7. `Age`: *double* -age in years

##### Issues: 
- `subscribe` is ambiguous - likely indicates newsletter subscription
- `experience` "level" order is unclear
    - Assumed order stated above, but no metadata to verify
 
Tidy data is described as one variable per column, one observation per row, and one value per cell (Wickham and Grolemund 2016). Wrangling is not needed as the dataset is already tidy.

#### Summary #1 - Experience

In [None]:
experience_count <- players_data |>
                        group_by(experience) |>
                        summarize(count = n())

experience_summary <- experience_count |>
                        ungroup() |>
                        mutate(percent_of_overall_dataset = count/sum(count) * 100) |> 
                        mutate(percent_of_overall_dataset = round(percent_of_overall_dataset, 2)) #round to 2 decimal places
experience_summary

#### Summary #2 - Played Hours

In [None]:
played_hours_summary <- players_data |>
                            summarize(mean = mean(played_hours, na.rm = TRUE),
                                      sum = sum(played_hours, na.rm = TRUE),
                                      most = max(played_hours, na.rm = TRUE),
                                      least = min(played_hours, na.rm = TRUE)) |>
                            mutate(across(mean:least, ~ round(.x, 2)))    
played_hours_summary                      

#### Summary #3 - Subscribed

In [None]:
subscribe_count <- players_data |>
                        group_by(subscribe) |>
                        summarize(count = n())

subscribe_summary <- subscribe_count |>
                        ungroup() |>
                        mutate(percent_of_overall_dataset = count/sum(count) * 100) |>
                        mutate(percent_of_overall_dataset = round(percent_of_overall_dataset, 2))
subscribe_summary

#### Summary #4 - Gender

In [None]:
gender_count <- players_data |>
                    group_by(gender) |>
                    summarize(count = n())

gender_summary <- gender_count |>
                        ungroup() |>
                        mutate(percent_of_overall_dataset = count/sum(count) * 100) |>
                        mutate(percent_of_overall_dataset = round(percent_of_overall_dataset, 2))
gender_summary

#### Summary #5 - Age (Years)

In [None]:
age_summary <- players_data |>
                    summarize(mean = mean(Age, na.rm = TRUE),
                              oldest = max(Age, na.rm = TRUE),
                              youngest = min(Age, na.rm = TRUE)) |>
                    mutate(across(mean:youngest, ~ round(.x, 2)))    
age_summary   

In [None]:
# Now, let's read in sessions.csv!
url_sessions <- "https://raw.githubusercontent.com/tchan0717/dsci-100-2025w1-group-36/refs/heads/main/data/sessions.csv"
sessions_data <- read_csv(url_sessions)
sessions_data

#### Description of the set (sessions.csv):

- 1535 observations, 5 variables:
1. `hashedEmail`: *character* -email in form of numbers and letters
2. `start_time`: *character* -session start time ("DD/MM/YYYY", time (24-hour clock))
3. `end_time`: *character* -session end time ("DD/MM/YYYY", time (24-hour clock))
4. `original_start_time`: *double* -session start time in UNIX (milliseconds)
5. `original_end_time`: *double* -session end time in UNIX (milliseconds)

Although this dataset isn't the focus, it's beneficial to understand.
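
Because `start_time` and `end_time` are stored as characters, they would need parsing before session lengths can be computed. The sketch below assumes the strings look like "DD/MM/YYYY HH:MM"; that exact layout, the `toy_sessions` rows, and the UTC time zone are all assumptions for illustration.

```r
# Hypothetical session rows; the string layout "DD/MM/YYYY HH:MM" is an assumption
toy_sessions <- data.frame(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse the character columns into date-times (UTC chosen arbitrarily for the sketch)
fmt   <- "%d/%m/%Y %H:%M"
start <- as.POSIXct(toy_sessions$start_time, format = fmt, tz = "UTC")
end   <- as.POSIXct(toy_sessions$end_time,   format = fmt, tz = "UTC")

# Session length in minutes
toy_sessions$duration_min <- as.numeric(difftime(end, start, units = "mins"))
toy_sessions
```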

In [None]:
# I will merge the datasets together too to simplify future explorations:
sessions_players_merged <- merge(players_data, sessions_data, by = "hashedEmail", all = TRUE)
sessions_players_merged
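
One thing worth checking after an outer merge (`all = TRUE`) is how players without any sessions are represented. The sketch below uses two hypothetical toy tables to show that such a player is kept as a single row of `NA` session values, so counting rows and counting sessions give different answers.

```r
# Hypothetical toy tables: player "c" has no recorded sessions
toy_players  <- data.frame(hashedEmail  = c("a", "b", "c"),
                           played_hours = c(2.5, 0.4, 0))
toy_sessions <- data.frame(hashedEmail = c("a", "a", "b"),
                           start_time  = c("01/05/2024 10:00",
                                           "02/05/2024 11:00",
                                           "03/05/2024 12:00"))

toy_merged <- merge(toy_players, toy_sessions, by = "hashedEmail", all = TRUE)

# Counting rows credits player "c" with one row, because the outer merge
# keeps them as a single row with NA in the session columns
rows_per_player <- table(toy_merged$hashedEmail)

# Counting non-missing start times gives the actual number of sessions
sessions_per_player <- tapply(!is.na(toy_merged$start_time),
                              toy_merged$hashedEmail, sum)
rows_per_player
sessions_per_player
```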

### 3. Exploratory Data Analysis and Visualization


Let's create visualizations to seek out relationships and overlooked issues.

#### Visualization #1

In [None]:
options(repr.plot.width = 16, repr.plot.height = 6)

age_histogram <- ggplot(players_data, aes(x = Age)) +
                    geom_histogram(bins = 12) +
                    labs(x = "Age (Years)", y = "Total Players") +
                    ggtitle("Distribution of Ages Across Different Gaming Experiences") +
                    scale_y_continuous(breaks = seq(0, 32, by = 2)) +
                    scale_x_continuous(breaks = seq(0, 60, by = 10)) +
                    facet_grid(cols = vars(experience)) +
                    theme(text = element_text(size = 15))
age_histogram

- Majority ~17 years old
- Numerous teenagers and young adults
   - Few young children and older adults
- Many Amateurs and Veterans
- Age does not determine gaming experience (any age can be a *"pro"*)

#### Visualization #2

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

played_hours_age_plot <- ggplot(players_data, aes(x = Age, y = played_hours + 1)) + # add 1 so that 0-hour players are not mapped to -Inf on the log10 y-axis
                            geom_point(alpha = 0.4) +
                            labs(x = "Age (Years)", y = "Total Hours of PlaiCraft Played (Scaled)") +
                            ggtitle("Age of PlaiCraft Players vs Total Hours Played") +
                            scale_y_log10() +
                            theme(text = element_text(size = 14))
played_hours_age_plot

- No clear relationship or trend
    - Widespread points
- Points concentrated near the bottom of the graph
- Suggests teens and young adults may play more
    - Likely player-dependent though

#### Visualization #3

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

sessions_hours_total <- sessions_players_merged |>
                        select(hashedEmail, played_hours, experience, start_time) |>
                        group_by(hashedEmail, played_hours, experience) |>
                        summarize(count = sum(!is.na(start_time)), .groups = "drop") # sessions per player; players with no sessions (NA rows from the outer merge) get 0

experience_sessions_plot <- sessions_hours_total |>
                        ggplot(aes(x = count + 1, y = played_hours + 1, color = experience)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Total Number of Sessions (Scaled)", y = "Total Hours Played (Scaled)", colour = "Gaming Experience") +
                            ggtitle("Total Hours Played Based on Total Number of Sessions Played") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            scale_x_log10() +
                            theme(text = element_text(size = 13))
experience_sessions_plot                        

- Strong, positive relationship
    - Variables increase together
- Identical session counts do not imply identical hours played
- Gaming experience fairly scattered
    - Some Regulars and Amateurs played the most sessions and hours

#### Visualization #4

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

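sessions_hours_total_2 <- sessions_players_merged |>
                        select(hashedEmail, played_hours, gender, start_time) |>
                        group_by(hashedEmail, played_hours, gender) |>
                        summarize(count = sum(!is.na(start_time)), .groups = "drop") # sessions per player; players with no sessions get 0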

gender_sessions_plot <- sessions_hours_total_2 |>
                        ggplot(aes(x = count + 1, y = played_hours + 1, color = gender)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Total Number of Sessions (Scaled)", y = "Total Hours Played (Scaled)", colour = "Gender") +
                            ggtitle("Total Hours Played Based on Total Number of Sessions Played") +
                            scale_color_brewer(palette = "Dark2") +
                            scale_y_log10() +
                            scale_x_log10() +
                            theme(text = element_text(size = 13))
gender_sessions_plot        

- Gender is scattered
- Males dominate the outer range
   - Play more hours and sessions

### 4. Methods and Plans

The model objective is to predict *played_hours* from *Age* and *experience* to help decide which players to "target". I'd use KNN regression with players.csv as it contains all necessary variables and mutate the *experience* categories to numerical levels (1=beginner,...5=pro). 
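
A minimal sketch of the experience-to-number conversion is shown below. The level order is the one assumed earlier in this report (there is no metadata to confirm it), and `toy_experience` is a hypothetical vector standing in for the real column.

```r
# Assumed ordering from least to most experienced (not confirmed by any metadata)
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical values standing in for the experience column
toy_experience <- c("Pro", "Amateur", "Veteran", "Beginner", "Regular")

# Convert to a factor with the assumed order, then to integer codes 1..5
experience_num <- as.integer(factor(toy_experience, levels = experience_levels))
experience_num
```

Treating the codes as numbers also assumes the gaps between adjacent levels are equal, which is a simplification.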

KNN regression is appropriate because it's flexible, doesn't assume a particular relationship shape, and predicts numerical values based on similarity to the training data. In contrast, KNN classification predicts categories. Visualizations such as #2 show no linear relationship, so linear regression isn't preferred. To confirm which model is better, KNN or linear, I'd assess both on the testing data; the model with the smaller RMSPE is the better one.
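
For reference, RMSPE is the square root of the mean squared difference between observed and predicted values on held-out data. The sketch below defines a small helper, `rmspe()`, and applies it to made-up numbers; the function name and both sets of predictions are hypothetical and only illustrate the comparison.

```r
# Root mean squared prediction error: sqrt of the mean squared difference
rmspe <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Hypothetical held-out hours and two made-up sets of predictions
observed    <- c(0, 1, 4, 10)
pred_knn    <- c(1, 1, 3, 8)
pred_linear <- c(3, 3, 3, 3)

rmspe(observed, pred_knn)
rmspe(observed, pred_linear)
```

Whichever model yields the smaller value on the same testing data would be preferred.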

I chose experience and Age as predictors because they describe player "kinds" and provide meaningful differences in the data from which to predict played_hours. I expect both to contribute to total gametime; for example, pros and younger players may play longer hours. Gender was excluded because of ambiguous values, "prefer not to say" and "other", which don't help distinguish player "kinds". Session times were also excluded: they appear random rather than reflecting a distinctive player behaviour.

##### I'll process the data using the general KNN regression model steps with noted adjustments:
1. Mutate "experience"
2. Inspect and clean data
3. Split dataset
    - 75% training, 25% testing
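4. Preprocess training set
    - Scale predictors
5. Cross-validate training set
    - 5 folds
    - Tested neighbors differ by 4: neighbors = 1, 5, 9, ..., capped below the number of rows in one cross-validation training fold, since *k* cannot exceed the available training points (196 is too large once the data is split)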
6. Find optimal *k*, refit model, assess on testing data for RMSPE
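
The steps above can be sketched end to end as follows. This is a hand-rolled base R illustration on simulated data, not the final analysis: the `toy` data frame, the seed, the helper names `knn_predict()` and `rmspe()`, and the upper bound of the `k_grid` are all assumptions made for the example.

```r
set.seed(2024)  # arbitrary seed so the random split is reproducible

# Hypothetical simulated players (NOT the PlaiCraft data)
n <- 120
toy <- data.frame(
  Age            = sample(10:50, n, replace = TRUE),
  experience_num = sample(1:5, n, replace = TRUE)
)
toy$played_hours <- pmax(0, 20 - 0.3 * toy$Age + toy$experience_num + rnorm(n, sd = 2))

# Step 3: 75% training, 25% testing
train_idx <- sample(n, size = floor(0.75 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

predictors <- c("Age", "experience_num")

# KNN regression: average the response of the k nearest training points.
# Step 4: predictors are standardized using training means and sds only.
knn_predict <- function(train_df, new_df, k) {
  centre  <- colMeans(train_df[, predictors])
  spread  <- apply(train_df[, predictors], 2, sd)
  x_train <- scale(train_df[, predictors], center = centre, scale = spread)
  x_new   <- scale(new_df[, predictors],   center = centre, scale = spread)
  apply(x_new, 1, function(row) {
    d <- sqrt(colSums((t(x_train) - row)^2))  # Euclidean distance to each training point
    mean(train_df$played_hours[order(d)[seq_len(k)]])
  })
}

rmspe <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Step 5: 5-fold cross-validation on the training set
folds  <- sample(rep(1:5, length.out = nrow(train)))
k_grid <- seq(1, 61, by = 4)  # candidate k values spaced 4 apart; upper bound is illustrative

cv_rmspe <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    fit_part <- train[folds != f, ]
    val_part <- train[folds == f, ]
    rmspe(val_part$played_hours, knn_predict(fit_part, val_part, k))
  }))
})

# Step 6: pick the k with the lowest cross-validated error, then assess on the test set
best_k     <- k_grid[which.min(cv_rmspe)]
test_rmspe <- rmspe(test$played_hours, knn_predict(train, test, best_k))
best_k
test_rmspe
```

Note that the largest candidate *k* (61) stays below the 72 rows available in each cross-validation training fold here; the same constraint applies to the real data.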

KNN regression requires few assumptions but has drawbacks. It assumes new observations resemble the training data, so it's prone to inaccurate predictions for observations outside the training data range. Computation time grows with dataset size, and it's sensitive to noisy data. Because predictions depend on distances, scaling is needed so that variables are on comparable scales.
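
To illustrate why scaling matters for a distance-based method, the sketch below compares Euclidean distances between a few hypothetical players before and after standardizing. The four players and their values are made up; `experience_num` stands for the assumed 1 to 5 encoding of experience.

```r
# Hypothetical players: (Age, experience level on the assumed 1-5 scale)
players <- rbind(
  a = c(Age = 17, experience_num = 1),
  b = c(Age = 19, experience_num = 5),  # close in age, far apart in experience
  c = c(Age = 25, experience_num = 1),  # same experience, 8 years older
  d = c(Age = 40, experience_num = 3)
)

# Raw Euclidean distances: Age dominates because its range is much larger
raw_dist <- as.matrix(dist(players))

# After standardizing each column, both predictors contribute comparably
scaled_dist <- as.matrix(dist(scale(players)))

raw_dist["a", c("b", "c")]
scaled_dist["a", c("b", "c")]
```

On the raw values player `b` is the nearer neighbour of `a`, while after scaling player `c` is nearer, so the choice of neighbours (and hence the prediction) can change with the units of the predictors.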

### GitHub Repository

https://github.com/tchan0717/dsci-100-2025w1-group-36.git

### References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. https://r4ds.had.co.nz/.
\
\
The Pacific Laboratory of Artificial Intelligence. FAQ. Plaicraft. https://plaicraft.ai/faq. 