# Project Final Report
#### project-002-21  
Group members: Natalie Huang,  
Student Numbers: 93033579,   
DSCI 100-002  

**Data Description** 

This data set describes players on a Minecraft server, including their demographics and session details. The data was collected by a Computer Science research group at UBC, led by Frank Wood. They collected data on the following variables in two separate data sets: 

**`players.csv`**


`experience`: Level of the player's experience. (Character)

`subscribe`: Whether the player is subscribed to the server. (Logical)

`hashedEmail`: Hashed version of the email address each player registered with. (Character)

`played_hours`: Hours played. (Double) 

`name`: Name of players. (Character)

`gender`: The gender of each individual. (Character)

`Age`: Age of players. (Double)


Number of observations: 196


An issue in this data is that the `gender` variable allows the option `Prefer not to say`, which limits how the gender data can be interpreted. 
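To gauge how much this option limits interpretation, one could check what share of players chose it. The sketch below uses base R on a small made-up vector; the values are hypothetical and are not taken from `players.csv`.

```r
# Hypothetical gender responses standing in for players$gender
gender <- c("Male", "Female", "Prefer not to say", "Male",
            "Non-binary", "Prefer not to say")

# Count of each response
gender_counts <- table(gender)
gender_counts

# Proportion of players who did not disclose a gender
share_undisclosed <- mean(gender == "Prefer not to say")
share_undisclosed
```

With the real data, the same two lines could be run on `players$gender`.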


**`session.csv`**


`hashedEmail`: Hashed version of the email address each player registered with. (Character)

`start_time`: Date and time (in 24-hour format) when the players' sessions started. (Character)

`end_time`: Date and time (in 24-hour format) when the players' sessions ended. (Character)

`original_start_time`: Time when the players' sessions started, in numerical format. (Double)

`original_end_time`: Time when the players' sessions ended, in numerical format. (Double)


Number of observations: 1535


An issue is that the variables `start_time` and `end_time` each store two pieces of information (a date and a time) in a single cell, and that they are stored as character strings in dd/mm/yyyy format rather than as date-time values. 
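One possible way to tidy these columns is to parse the strings into date-time values and then derive the date, the time, and the session length. This is a minimal base R sketch that assumes the strings look like `dd/mm/yyyy HH:MM`; the example values are hypothetical and the exact format in the file should be checked first.

```r
# Hypothetical raw values in the assumed "dd/mm/yyyy HH:MM" format
start_raw <- c("30/06/2024 18:12", "17/06/2024 23:33")
end_raw   <- c("30/06/2024 18:24", "17/06/2024 23:46")

# Parse the character strings into date-time objects (UTC assumed)
start_dt <- as.POSIXct(start_raw, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_dt   <- as.POSIXct(end_raw,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Separate the two pieces of information held in each cell
start_date <- as.Date(start_dt)
start_clock <- format(start_dt, "%H:%M")

# Session length in minutes
session_minutes <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
session_minutes
```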


**Questions** 

The broad question I will be investigating is: 

Which "kinds" of players are most likely to contribute a large amount of data, so that we can target those players in our recruiting efforts?

I narrowed this question down to: 

**How does the age of a player determine the number of hours they play?**

This data will help me address the question because relating identifying factors of the players, such as age, to the amount of data they provide to the server (hours played) will allow recruiters to effectively target the "kinds" of players they want. For this investigation, the `session.csv` file is not necessary, as all of the required data is contained in the `players.csv` file. 

In [None]:
# Load libraries

library(tidyverse)
library(tidymodels)
library(tidyclust)
library(forcats)
library(repr)
library(GGally)
options(repr.matrix.max.rows = 10)
source("cleanup.R")
set.seed(2000)

In [None]:
players_url <- "https://raw.githubusercontent.com/Norah-supercoder/dsci-100-project-individual/refs/heads/main/players.csv"

players <- read_csv(players_url)

players 

In [None]:
players <- rename(players, 
                        hashed_email = hashedEmail, 
                        age = Age)


players <- na.omit(players)

In [None]:
# Summarize average hours played, average age, and player count per experience level
players_mean <- players |>
    group_by(experience) |>
    summarize(average_hours = mean(played_hours, na.rm = TRUE),
              average_age = mean(age, na.rm = TRUE),
              count = n()) |>
    mutate(experience = as_factor(experience))


players_mean

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)

players_time_plot <- ggplot(players, aes(x = age, y = played_hours, color = experience))+
        geom_point()+
        labs(color = "Level of Player Experience", x = "Age (in Years)", y = "Hours Played", title = "Hours Played by Individuals of Various Ages")+
        theme(text = element_text(size = 12))

players_time_plot 

This plot shows that the players with the most hours played are `Regular` and `Amateur` players aged around 20.  

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)
players_count_plot <- ggplot(players_mean, aes(x = experience, y = count))+
        geom_bar(stat = "identity")+
        labs(x = "Level of Player Experience", y = "Number of Players", title = "Number of Players at Each Level of Experience")+
        theme(text = element_text(size = 12))

players_count_plot

This plot shows that most players are `Amateur` or `Veteran`, whereas the fewest players are `Pro`. 

**Methods and Plans**

**Note: this plan is being revised; the original version is kept here for reference.**

Add synthetic entries for the experience levels that do not have as many data points (e.g. `Pro`), since this imbalance is a limitation in the model. 
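As a rough illustration of rebalancing, the sketch below upsamples the smaller groups by resampling their existing rows with replacement. Note that this is plain random oversampling in base R, which duplicates rows rather than generating genuinely synthetic entries, and the toy data frame is hypothetical.

```r
set.seed(1)

# Hypothetical, imbalanced toy data (not from players.csv)
toy <- data.frame(
  experience = c(rep("Amateur", 6), rep("Veteran", 5), rep("Pro", 2)),
  played_hours = c(1.2, 0.5, 3.0, 0.1, 7.4, 2.2,
                   0.3, 1.1, 0.0, 4.5, 0.9,
                   12.0, 0.7)
)

# Size of the largest group; every group is topped up to this size
target <- max(table(toy$experience))

# Keep all original rows and add resampled rows to the smaller groups
balanced <- do.call(rbind, lapply(split(toy, toy$experience), function(g) {
  extra <- g[sample(nrow(g), target - nrow(g), replace = TRUE), , drop = FALSE]
  rbind(g, extra)
}))

table(balanced$experience)
```

If this approach were used, it would normally be applied to the training set only, so that duplicated rows do not leak into the testing set.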

The data then needs to be scaled, since variables measured on larger scales would otherwise carry more weight and could distort the classifications. 
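A small base R sketch of why scaling matters for distance-based methods is shown here. The three rows are hypothetical; `scale()` centers each column to mean 0 and rescales it to standard deviation 1, which is the same idea as the `step_scale()` and `step_center()` recipe steps.

```r
# Hypothetical players: ages are close in range, hours vary widely
toy <- data.frame(
  age = c(17, 21, 45),
  played_hours = c(0.5, 150, 1.0)
)

# Raw Euclidean distances are dominated by played_hours
raw_dist <- dist(toy)
raw_dist

# Standardize each column, then recompute the distances
toy_scaled <- scale(toy)
scaled_dist <- dist(toy_scaled)
scaled_dist

# Each column now has mean 0 and standard deviation 1
colMeans(toy_scaled)
apply(toy_scaled, 2, sd)
```

Before scaling, the 17-year-old and the 45-year-old look like near neighbors because their hours are similar; after scaling, the age gap contributes on a comparable footing.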

Next, split the data into an 80% training set and a 20% testing set. A large training set gives the model more observations to learn from.

Cross-validation will be used to help determine the best number of neighbors, k. 

Because the data set is small, cross-validation will use 3-5 folds. 

A classification model can then be applied to determine the "type" of player that provides the most data to the server. 

Then, the accuracy of the classification can be analysed. 
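The steps in this plan could be sketched with `tidymodels` roughly as follows. This is only an illustration under stated assumptions: the data is a simulated toy set (not `players.csv`), `experience` is assumed to be the class being predicted, the K-nearest neighbors engine is `kknn` (which must be installed), and all object names are hypothetical.

```r
library(tidymodels)

set.seed(1)

# Simulated toy data standing in for the real players data
toy <- tibble(
  age = c(rnorm(30, 18, 2), rnorm(30, 30, 3)),
  played_hours = c(rexp(30, 1 / 20), rexp(30, 1 / 2)),
  experience = factor(rep(c("Regular", "Veteran"), each = 30))
)

# 80/20 split, stratified by the class label
toy_split <- initial_split(toy, prop = 0.8, strata = experience)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Scale and center the predictors using the training data only
toy_recipe <- recipe(experience ~ age + played_hours, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-nearest neighbors with the number of neighbors left to tune
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over k = 1..10
toy_folds <- vfold_cv(toy_train, v = 5, strata = experience)

knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = toy_folds, grid = tibble(neighbors = 1:10)) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit with the chosen k and evaluate on the held-out test set
knn_final <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_final) |>
  fit(data = toy_train)

test_accuracy <- predict(final_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = experience, estimate = .pred_class) |>
  filter(.metric == "accuracy")

test_accuracy
```

The recipe is built from the training set so that the testing set plays no part in choosing the scaling or the value of k.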

In [None]:
players_split <- initial_split(players, prop = 0.75, strata = experience)  
players_train <- training(players_split)   
players_test <- testing(players_split)


players_train
players_test

In [None]:
players_recipe <- recipe(~ played_hours + age, data = players) |>
   step_scale(all_predictors()) |>
   step_center(all_predictors())

Tuning K-means clustering to find a suitable value of k: 

In [None]:
players_ks <- tibble(num_clusters = 1:10)

players_spec_tune <- k_means(num_clusters = tune()) |>
       set_engine("stats", nstart = 10)


players_tuning_stats <- workflow() |>
       add_recipe(players_recipe) |>
       add_model(players_spec_tune) |>
       tune_cluster(resamples = apparent(players), grid = players_ks) |>
       collect_metrics()

elbow_stats <- players_tuning_stats |>
       mutate(total_WSSD = mean) |>
       filter(.metric == "sse_within_total") |>
       select(num_clusters, total_WSSD)


elbow_stats

In [None]:
choose_players_k <- ggplot(elbow_stats, aes(x = num_clusters, y = total_WSSD))+
    geom_point()+
    geom_line()+
    labs(x = "Number of Clusters (K)", y = "Total Within-Cluster Sum of Squares")

choose_players_k

In [None]:
set.seed(2019) 
players_spec <- k_means(num_clusters = 4)|>
       set_engine("stats")

players_clustering <- workflow() |> 
    add_recipe(players_recipe) |> 
    add_model(players_spec) |> 
    fit(data = players)
players_clustering

In [None]:
clustered_players <- augment(players_clustering, players)
players_clustering_plot <- ggplot(clustered_players, aes(x = age, y = played_hours, color = .pred_cluster)) +
    geom_point(alpha = 0.7, size = 3) +  
    labs(
        title = "K-Means Clustering of Players (Age vs Hours Played)",
        x = "Age (years)",
        y = "Hours Played",
        color = "Cluster"
    ) +
    theme_minimal(base_size = 15) 
players_clustering_plot