
# Can Age and Experience predict played hours in the players Dataset?
**Group 36** : Kaiyu Zhong 51312593, Wesley Mah 93296408, Cindy Chang 35504851, Danya Elkhidir 94711322


## (1) Introduction

It's no secret that video games have exploded in popularity, capturing the interest of people across various ages and experiences. Minecraft, in particular, attracts a diverse range of players—from casual gamers logging just a few hours to enthusiasts investing significant playtime. For research projects that depend on collecting large amounts of player data, understanding which players contribute the most can dramatically improve recruitment strategies.

To manage this project successfully, it’s important to recruit players who spend more time on the server. This helps collect more data and ensures resources (like software licenses and servers) are used efficiently. We want to find out if player characteristics, such as age and gaming experience, can help predict how many hours someone will play.

Specifically:
- We used KNN regression to explore the relationship between age and played hours.
- We also used KNN classification to explore the relationship between experience and played hours.

Our data set includes columns such as:

- experience (gaming experience)

- played_hours (total hours played)

- Age (player age)

- gender (player gender)

- subscribe (subscription status)

To explore whether age and experience can predict playtime, we analyze these variables and visualize our findings with scatter plots, bar charts, and fitted models. Insights from this investigation can inform future recruitment strategies and guide decisions about how to allocate resources.



## (2) Methods & Results:

In summary, this is what we primarily did:

- **We used KNN classification to explore the relationship between experience and played hours.**
- **We also used KNN regression to explore the relationship between age and played hours.**

In [2]:
library(tidyverse)
library(repr)
# If tidymodels or kknn is not installed, uncomment and run the two lines
# below once, then re-run this cell:
# install.packages("kknn")
# install.packages("tidymodels")
library(tidymodels)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


ERROR: Error in library(tidymodels): there is no package called ‘tidymodels’


In [None]:
players <- read_csv("https://raw.githubusercontent.com/KaiyuZhong/Individual-Project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/KaiyuZhong/Individual-Project/refs/heads/main/sessions.csv")

In [None]:
head(players)

- **Below, we compute summary statistics for the variables we are interested in.**

In [None]:
## Performing summary for the players dataset
Summary_players <- list(
    num_columns = ncol(players),
    observations_num_rows = nrow(players),
    column_names = names(players),
    AGE = list(
        Quartiles = quantile(players$Age, na.rm = TRUE),
        NAs = sum(is.na(players$Age)),
        Max = max(players$Age, na.rm = TRUE),
        Min = min(players$Age, na.rm = TRUE),
        Mean = mean(players$Age, na.rm = TRUE),
        Median = median(players$Age, na.rm = TRUE),
        Standard_Deviation = sd(players$Age, na.rm = TRUE),
        Range = range(players$Age, na.rm = TRUE)
    ),
    Played_Hours = list(
        Quartiles = quantile(players$played_hours),
        NAs = sum(is.na(players$played_hours)),
        Max = max(players$played_hours),
        Min = min(players$played_hours),
        Mean = mean(players$played_hours),
        Median = median(players$played_hours),
        Standard_Deviation = sd(players$played_hours),
        Range = range(players$played_hours)
    )
)

experience_table <- table(players$experience)
experience_percentages <- prop.table(experience_table) * 100

Experience <- list(
    Experience_Frequency = experience_table,
    Experience_Percentages = experience_percentages
)


Summary_players
Experience

**Summary Statistics:**
- **Age:** The dataset includes ages ranging from 8 to 50 years, with a mean age of 20.52 years, a median of 19 years, and a standard deviation of 6.17; the age quartiles are 17 (Q1), 19 (Q2), and 22 (Q3), with 2 missing values.

- **Played Hours:** The total hours played vary widely from 0 to 223.1 hours, with a mean of 5.85 hours, a median of 0.1 hours, and a high standard deviation of 28.35.

- **Experience:** Most users are Amateurs (32.14%) or Veterans (24.49%), while fewer are Regulars (18.37%) or Beginners (17.86%).

- **Hashed email:** Each player is identified by a hashed email (`hashedEmail`), which serves as a unique identifier.

In [None]:
# Minimal wrangling for players.csv

players_tidy <- players |>
  select(Age, hashedEmail, experience, played_hours) |>
  mutate(
    Age = as.integer(Age),
    played_hours = as.numeric(played_hours),
    experience = as_factor(experience))

head(players_tidy)


### (2.1) Using a player's age to predict total hours played
In this section, we ask "Can we use a player's age to predict the total hours they played?" We use KNN regression to answer it.
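
As a rough illustration of what KNN regression does with a single predictor, here is a minimal base-R sketch. It is not the tidymodels/kknn implementation used in this notebook: the helper `knn_regress()` and the toy ages and hours are hypothetical, made up only to show the idea of averaging the responses of the K closest training points.

```r
# Minimal KNN regression with one predictor (base R only).
# All values below are invented toy data, NOT taken from players.csv.
knn_regress <- function(train_x, train_y, new_x, k) {
  stopifnot(length(train_x) == length(train_y), k >= 1, k <= length(train_x))
  sapply(new_x, function(x0) {
    # Distance from the query point to every training point
    d <- abs(train_x - x0)
    # Indices of the k closest training points
    nearest <- order(d)[seq_len(k)]
    # "rectangular" weighting: a plain average of the neighbours' responses
    mean(train_y[nearest])
  })
}

toy_age   <- c(15, 17, 19, 21, 30, 45)
toy_hours <- c(2.0, 4.0, 1.0, 0.5, 0.2, 0.1)

# With k = 1 the prediction copies the single closest player
pred_k1 <- knn_regress(toy_age, toy_hours, new_x = 16.5, k = 1)
# With k = 3 the prediction averages the three closest players
pred_k3 <- knn_regress(toy_age, toy_hours, new_x = 20, k = 3)
```

A small K follows individual players closely, while a large K smooths the prediction toward the overall mean, which is why K has to be tuned.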

**A visualization of Age and Hours Played**

In [None]:
players_age <- ggplot(players, aes(x = Age, y = played_hours)) +
 geom_point(alpha = 0.4) +
 xlab("Player Age (years)") +
 ylab("Hours Played") +
 ggtitle("Figure 1: Age and Hours Played") +
 scale_y_continuous(labels = comma_format()) +
 theme(text = element_text(size = 12))
players_age

**1. Create Workflow**

In [None]:
set.seed(7)
players_split <- initial_split(players, prop = 0.75, strata = played_hours)
players_train <- training(players_split)
players_test <- testing(players_split)
players_recipe <- recipe(played_hours ~ Age, data = players_train) |>
 step_scale(all_predictors()) |>
 step_center(all_predictors())
players_spec <- nearest_neighbor(weight_func = "rectangular",
 neighbors = tune()) |>
 set_engine("kknn") |>
 set_mode("regression")
players_vfold <- vfold_cv(players_train, v = 5, strata = played_hours)
players_wkflw <- workflow() |>
 add_recipe(players_recipe) |>
 add_model(players_spec)
players_wkflw

**2. Finding the best value of K**

In [None]:
gridvals <- tibble(neighbors = seq(from = 1, to = 109, by = 3))
players_results <- players_wkflw |>
 tune_grid(resamples = players_vfold, grid = gridvals) |>
 collect_metrics() |>
 filter(.metric == "rmse")
head(players_results)

In [None]:
players_min <- players_results |>
 filter(mean == min(mean))
players_min

**We find that the RMSPE estimated by cross-validation is smallest when K is 94.**
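
For reference, the error reported under the `rmse` metric in these cells can be written as follows. The symbols are our own notation rather than anything defined in the notebook: $y_i$ is the observed played hours for player $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of players being evaluated.

$$
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

The same formula is usually called RMSE when evaluated on data the model was trained on and RMSPE when evaluated on held-out data. Because the differences are squared, a few very large values of played hours can dominate the result.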

**3. Evaluating the RMSPE on the test set:**

In [None]:
kmin <- players_min |> pull(neighbors)
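# KNN regression spec using the K selected by cross-validation
players_spec <- nearest_neighbor(weight_func = "rectangular",
 neighbors = kmin) |>
 set_engine("kknn") |>
 set_mode("regression")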
players_fit <- workflow() |>
 add_recipe(players_recipe) |>
 add_model(players_spec) |>
 fit(data = players_train)
players_summary <- players_fit |>
 predict(players_test) |>
 bind_cols(players_test) |>
 metrics(truth = played_hours, estimate = .pred) |>
 filter(.metric == "rmse")
players_summary

**4. Graph with k = 94**

In [None]:
age_prediction_grid <- tibble(
 Age = seq(
 from = min(players$Age, na.rm = TRUE),
 to = max(players$Age, na.rm = TRUE),
 by = 1
 )
)
players_preds <- players_fit |>
 predict(age_prediction_grid) |>
 bind_cols(age_prediction_grid)
plot_players <- ggplot(players, aes(x = Age, y = played_hours)) +
 geom_point(alpha = 0.4) +
 geom_line(data = players_preds,
 mapping = aes(x = Age, y = .pred),
 color = "steelblue",
 linewidth = 1) +
 xlab("Player Age (years)") +
 ylab("Hours Played") +
 scale_y_continuous(labels = comma_format()) +
 ggtitle(paste0("k-Nearest Neighbors Prediction (K = ", kmin, ")")) +
 theme(text = element_text(size = 12))
plot_players

**5. Bar charts of total and average played hours by age**

In [None]:
total_hours_by_age <- players_tidy |>
 group_by(Age) |>
 summarize(total_hours = sum(played_hours, na.rm = TRUE))
played_hours_age <- total_hours_by_age |>
 ggplot(aes(x = Age, y = total_hours)) +
 geom_bar(stat = "identity", fill = "steelblue") +
 labs(title = "Total Hours Played by Players of Different Ages",
 x = "Age",
 y = "Total Hours Played")
played_hours_age
average_hours_by_age <- players_tidy |>
 group_by(Age) |>
 summarize(avg_played_hrs = mean(played_hours, na.rm = TRUE))
avg_played_hours_age <- average_hours_by_age |>
 ggplot(aes(x = Age, y = avg_played_hrs)) +
 geom_bar(stat = "identity", fill = "steelblue") +
 labs(title = "Average Hours Played by Players of Different Ages",
 x = "Age",
 y = "Average Hours Played")
avg_played_hours_age


### (2.2) Using a player's total hours played to predict their experience level
In this section, we ask "Can we use a player's hours played to predict their experience level?" We use KNN classification, but first we show why linear regression does not work well for this variable.

**1. Bar graphs to visualize how played hours are distributed across experience levels**

In [None]:
experience_played_hours_summarize <- players_tidy |>
            select(experience, played_hours) |>
            group_by(experience) |>
            summarize(avg_played_hrs = mean (played_hours, na.rm = TRUE),
                      total_hours = sum(played_hours, na.rm = TRUE))|>
            mutate(experience = fct_reorder(experience, total_hours, .desc = TRUE))
experience_played_hours_summarize

played_hours_experience <- experience_played_hours_summarize |>
          ggplot(aes(x = experience, y = total_hours)) +
          geom_bar(stat="identity", fill = "purple") +
          labs(title = "Figure 2: Total hours played by players with different experience level",
               x = "Experience Level",
               y = "Total hours played")

played_hours_experience

players_average_experience <- experience_played_hours_summarize |>
  mutate(experience = fct_reorder(experience, avg_played_hrs, .desc = TRUE))

avg_played_hours_experience <- players_average_experience |>
          ggplot(aes(x = experience, y = avg_played_hrs)) +
          geom_bar(stat="identity", fill = "steelblue") +
          labs(title = "Figure 3: Average hours played by players with different experience level",
               x = "Experience Level",
               y = "Average hours played" )

avg_played_hours_experience

In [None]:
set.seed(1)

player_split <- initial_split(players_tidy, prop = 0.75, strata = played_hours)
player_train <- training(player_split)
player_test <- testing(player_split)

**2. The following is a trial of a linear regression model with experience predicting played hours (the situation is similar if we swap the two variables). It shows that linear regression works poorly here.**

In [None]:
lm_recipe <- recipe(played_hours ~ experience, data = player_train) |>
            step_impute_mode(experience)

lm_spec <- linear_reg() |>
            set_engine("lm") |>
            set_mode("regression")

lm_fit <- workflow() |>
            add_recipe(lm_recipe) |>
            add_model(lm_spec) |>
            fit(data = player_train)

lm_rmse <- lm_fit |>
  predict(player_train) |>
  bind_cols(player_train) |>
  metrics(truth = played_hours, estimate = .pred)

lm_rmse

We can see that linear regression performed poorly. This is because extreme outliers (shown in the plot below) strongly affect linear regression and its RMSE. For example, the plot below shows an extreme outlier (about 223 hours) at the Regular experience level.
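
To make the effect of a single extreme value concrete, here is a small base-R sketch. The numbers are invented for illustration (only the 223-hour figure echoes the outlier mentioned above), and `rmse()` is a helper defined here, not the yardstick function. It assumes each player is predicted with their group mean, which is what a linear model with one categorical predictor amounts to.

```r
# Toy illustration with made-up numbers: one extreme value inflates RMSE.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

hours_typical <- c(0.1, 0.2, 0.5, 1.0, 2.0)
hours_outlier <- c(hours_typical, 223)   # add a single extreme player

# Predict every player in the group with the group mean
rmse_typical <- rmse(hours_typical, mean(hours_typical))
rmse_outlier <- rmse(hours_outlier, mean(hours_outlier))

# Compare the two errors side by side
c(without_outlier = rmse_typical, with_outlier = rmse_outlier)
```

The outlier pulls the group mean far from the typical players and also contributes a very large squared error of its own, so the RMSE grows by orders of magnitude.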

In [None]:
Plot <- players |>
  ggplot(aes(x=Age, y=played_hours, color=experience)) +
  geom_point() +
  labs(title = "Figure 4: Played hours versus age, colored by experience level",
        color = "Experience Levels",
        x = "Age (Years)",
        y = "Played Hours" )
Plot

**3. KNN classification (used instead of linear regression, for the reasons shown above)**

So we investigate the following question: how well do played hours determine experience level?

This gives us a similar interpretation of how experience relates to played hours, and therefore tells us which experience level we should focus on.
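
The idea behind KNN classification can be sketched in a few lines of base R. This is only an illustration, not the kknn engine used in the cells below: `knn_classify()` and the toy hours and labels are hypothetical. Each new player receives the most common experience level among the K training players with the closest played hours.

```r
# Minimal KNN classification with one predictor (base R only).
# All values below are invented toy data, NOT taken from players.csv.
knn_classify <- function(train_x, train_label, new_x, k) {
  stopifnot(length(train_x) == length(train_label), k >= 1, k <= length(train_x))
  vapply(new_x, function(x0) {
    # Indices of the k training points closest to the query point
    nearest <- order(abs(train_x - x0))[seq_len(k)]
    # Majority vote among the neighbours' labels
    # (which.max() keeps the first label alphabetically if votes are tied)
    votes <- table(train_label[nearest])
    names(votes)[which.max(votes)]
  }, character(1))
}

toy_hours <- c(0.1, 0.3, 0.5, 40, 55, 60, 180)
toy_level <- c("Beginner", "Beginner", "Regular", "Amateur",
               "Amateur", "Regular", "Amateur")

pred_low  <- knn_classify(toy_hours, toy_level, new_x = 0.2, k = 3)
pred_high <- knn_classify(toy_hours, toy_level, new_x = 50,  k = 3)
```

Because most players have very few hours, a query at a high number of hours has few close neighbours, so its prediction depends heavily on a handful of players.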

**3.1 Workflow and tuning**

In [None]:
knn_recipe <- recipe(experience ~ played_hours, data = player_train) |>
 step_scale(all_predictors()) |>
 step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_vfold <- vfold_cv(player_train, v = 5, strata = experience)

In [None]:
gridvals <- tibble(neighbors = seq(from = 1, to = 25, by = 2))

knn_player_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = knn_vfold, grid = gridvals) |>
  collect_metrics()

head(knn_player_results)

**3.2 Finding K by Accuracy**

In [None]:
accuracies <- knn_player_results |>
 filter(.metric == "accuracy")

accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
   geom_point() +
   geom_line() +
   labs(x = "Number of Neighbors (K)", y = "Accuracy Estimate") +
   ggtitle("Figure 5: Neighbors used for Experience Prediction")
accuracy_versus_k

Based on the cross-validated accuracy estimates in Figure 5 (computed on folds of the training set, not on the test set), the K we chose is K=17.
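
The best K can also be read off programmatically instead of by eye from the plot. The sketch below uses a hypothetical data frame with the same `neighbors` and `mean` columns as `accuracies`. The accuracy values are invented and are not this notebook's results.

```r
# Hypothetical cross-validation results; the accuracy values are invented.
cv_accuracy <- data.frame(
  neighbors = c(1, 3, 5, 7, 9),
  mean      = c(0.21, 0.25, 0.31, 0.29, 0.31)
)

# which.max() returns the first maximum, so a tie goes to the smaller K
best_k <- cv_accuracy$neighbors[which.max(cv_accuracy$mean)]
best_k
```

Breaking ties toward the smaller K is just one possible convention; when several values of K give nearly the same accuracy, choosing one in a stable region of the curve is also reasonable.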

**4. Using K=17, we predict the experience level for higher played hours of 200, 150, and 50, respectively.**

In [None]:
knn_predict <- nearest_neighbor(weight_func = "rectangular", neighbors = 17) |>
 set_engine("kknn") |>
 set_mode("classification")
knn_fit <- knn_predict |>
     fit(experience ~ played_hours, data = players_tidy)

# When played hours = 200
new_obs_1 <- tibble(played_hours = 200)
experience_prediction_1 <- predict(knn_fit, new_obs_1)
experience_prediction_1

# When played hours = 150
new_obs_2 <- tibble(played_hours = 150)
experience_prediction_2 <- predict(knn_fit, new_obs_2)
experience_prediction_2

# When played hours = 50
new_obs_3 <- tibble(played_hours = 50)
experience_prediction_3 <- predict(knn_fit, new_obs_3)
experience_prediction_3

## (3) Discussion:

From the analysis of age and playtime, we observe that younger players tend to log more hours. Therefore, our recruitment should target **teenagers or individuals aged 20–30.**

From the analysis of playtime by experience level, we find that longer playtimes are typically associated with **Amateur players.** Thus, we should focus more on Amateur players. This conclusion is supported by our KNN classification, which predicted "Amateur" for players with 200, 150, and 50 hours of gameplay.

The result for the age outcome aligns with our expectations: younger players tend to spend more time playing. However, the findings regarding experience level are contrary to what we anticipated. We expected professional players to dedicate more time to playing, but instead, amateur players were found to spend the most time in-game.

These insights are valuable for the project, as they help identify the key target audience, namely younger and Amateur players, who are most engaged in terms of time spent playing. This allows us to tailor strategies, content, and engagement efforts more effectively toward these groups.

In future research, it would be beneficial to expand the analysis by incorporating additional variables beyond age and experience level. For example, investigating the impact of gender, time of day, or geographic location could further refine our understanding of player behavior and help develop more targeted marketing or gameplay strategies.