# DSCI 100 Final Project: Group 10  

Members: 

Liam Woo (45648557), Daniel Shibary (36650380), Miranda Currie (75065128), Anya Jones (86102779)

# Introduction #

### Background Information:

The data was collected from a study done by the Pacific Laboratory for Artificial Intelligence (PLAI), a computer science research group led by Dr. Frank Wood. Their current project involves the creation of an embodied AI model that can receive, understand, and respond to a complex environment, similar to how humans interact with the world around them.

To do so, the popular game "Minecraft" was chosen as the "complex environment". Behavioural data, such as locations visited, what players said in chat, and in-game actions, were taken from players of "PLAIcraft", a free Minecraft server run by PLAI, in order to study the trends and patterns in how players interact with the virtual world. This data will then be used to train and develop the AI model.

Information in the players.csv dataset was retrieved primarily during each player's signup process, with the rest coming from their actual play on the server.

### Question:

We wanted to determine which "kinds" of players are most likely to contribute a large amount of data so that these players can be targeted during recruiting efforts. This led to the question: **can a player's age, experience, subscription status, and gender be used to determine a player's total number of hours?**

### Dataset Description:
##### Below is a summary of the players.csv dataset, including the number of observations, the number of variables, and each variable's name, data type, possible values, and description.

<br>

| Observations(rows): | Variables(columns): |
|---------------------------|-----------|
|    196    |    7    |

<br>

|Variable:|Data type:|Possible Values:|Description:|
|:----|:----|:----|:----|
| experience | character (chr) | Beginner, Amateur, Regular, Veteran, Pro | The self-declared skill level of the player |
| subscribe | logical (lgl) | TRUE, FALSE | Whether or not the player is subscribed |
| hashedEmail | character (chr) | N/A | A unique sequence of numbers and letters providing a secure way to represent a player's email |
| played_hours | double (dbl) | N/A | The number of hours a player has put into the server |
| name | character (chr) | N/A | The name of the player |
| gender | character (chr) | Male, Female, Non-binary, Two-Spirited, Prefer not to say, Other, Agender | Gender of the player |
| Age | double (dbl) | N/A | Age of the player |

<br>

##### Below are the summary statistics of the players.csv dataset (mean, median, maximum, and minimum values of the numerical variables, and counts for the categorical variables), which were computed using the `summarize` function (corresponding code in the "Methods" section, Step 2).

<br>

| Category                  | Value     |
|---------------------------|-----------|
| Total Players             | 196       |
| Minimum Age               | 9         |
| Maximum Age               | 58        |
| Average Age               | 21.14     |
| Median Age                | 19        |
| Minimum Played Hours      | 0         |
| Maximum Played Hours      | 223.1     |
| Median Played Hours       | 0.1       |
| Average Played Hours      | 5.85      |
| Subscribed Players        | 144       |
| Regular Players           | 36        |
| Veteran Players           | 48        |
| Pro Players               | 14        |
| Male Players              | 124       |
| Female Players            | 37        |
| Non-binary Players        | 15        |
| Two-Spirited Players      | 6         |
| Other Gender Players      | 1         |
| Agender Players           | 2         |
| Prefer Not to Say         | 11        |

<br>

### Possible Issues:

Within the players.csv dataset there are a few issues. These include `NA` values in the `Age` column, as well as `gender` and `experience` being categorical variables. The former can be fixed with simple data wrangling that removes the rows with `NA` in `Age`. The latter requires a bit more work: "dummy variables" seem like a valid option for predicting numerical values from categorical variables. Additionally, there are a lot of players who did not play at all, which may skew the data towards values close to 0.
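The `NA` and zero-hours concerns above can be tried out on a small made-up table before touching the real data. The sketch below uses a hypothetical `toy_players` tibble (its values are invented, not taken from players.csv) to show one way of dropping rows with a missing `Age` and measuring the share of players with zero recorded hours.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT values from players.csv)
toy_players <- tibble(
    Age = c(17, NA, 21, 30, NA, 19),
    played_hours = c(0, 1.5, 0, 0, 12.3, 0.2)
)

# Drop every row where Age is missing
toy_no_na <- toy_players |>
    drop_na(Age)

# Proportion of the remaining players with zero recorded hours
zero_share <- toy_no_na |>
    summarize(prop_zero = mean(played_hours == 0)) |>
    pull(prop_zero)
```

A large `zero_share` on the real data would confirm that the distribution of `played_hours` is concentrated near 0.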

## Methods

In order to determine if the player's age, experience level, subscription status, and gender can be used to predict the number of hours they play, and thus the amount of data they will provide to the research, we will first explore the relationship of each variable with `played_hours` via visualisation, to see which characteristics are correlated with high playtimes and how relevant each one is. In addition, forward selection with a knn regression model will be done to quantify that relevance via the rmse value. The rmse value will then be compared to the mean `played_hours` value to determine whether or not these variables can predict `played_hours` well.
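The forward selection described above could be sketched roughly as follows. This is a minimal illustration on invented data, not the project's actual analysis. The names `toy_train`, `candidates`, `k_grid`, and `cv_rmse` are hypothetical, and the seed and the grid of K values are arbitrary. The sketch also makes three choices of its own: predictors are normalized (knn is distance-based), `subscribe` is stored as a factor so that `step_dummy` can encode it, and `gender` is left out for brevity.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical toy training data (NOT values from players.csv)
toy_train <- tibble(
    played_hours = round(rexp(120, rate = 0.2), 1),
    Age = sample(9:58, 120, replace = TRUE),
    subscribe = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE)),
    experience = factor(sample(c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                               120, replace = TRUE))
)

candidates <- c("Age", "subscribe", "experience")
toy_folds <- vfold_cv(toy_train, v = 5)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

# Lowest cross-validated rmse (over the K grid) for a given set of predictors
cv_rmse <- function(predictors) {
    model_formula <- as.formula(paste("played_hours ~", paste(predictors, collapse = " + ")))
    toy_recipe <- recipe(model_formula, data = toy_train) |>
        step_dummy(all_nominal_predictors()) |>
        step_zv(all_predictors()) |>
        step_normalize(all_predictors())
    workflow() |>
        add_recipe(toy_recipe) |>
        add_model(knn_spec) |>
        tune_grid(resamples = toy_folds, grid = k_grid) |>
        collect_metrics() |>
        filter(.metric == "rmse") |>
        summarize(best = min(mean)) |>
        pull(best)
}

# Forward selection: each round adds the predictor that gives the lowest rmse
selected <- c()
results <- tibble(size = integer(), predictors = character(), rmse = numeric())
for (i in seq_along(candidates)) {
    remaining <- setdiff(candidates, selected)
    scores <- map_dbl(remaining, \(p) cv_rmse(c(selected, p)))
    selected <- c(selected, remaining[which.min(scores)])
    results <- results |>
        add_row(size = i, predictors = paste(selected, collapse = " + "), rmse = min(scores))
}

results
```

Each row of `results` holds the best predictor set of that size and its estimated rmse, which can then be compared against the mean of `played_hours` to judge whether the predictions are useful.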

**Why is this method appropriate?**

This method is appropriate because this is an exploratory type of question, a form of question that investigates possible relationships within a dataset. An example would be determining whether the size, shape, colour, etc. of a tumor is relevant to the type of cancer it came from. For the question stated previously, exploratory visualisation allows us to find trends and patterns within the data and shows relevance visually, while forward selection tests the validity of the patterns seen in those visualisations. The two can then be compared to form a final conclusion.

**What are the potential limitations or weaknesses of the method selected?**

The main limitation of this method is that the small size of the dataset may cause forward selection to give poor results: repeatedly training models on the same dataset increases the likelihood of finding an overly optimistic cross-validation estimate (high accuracy or low rmse) that does not hold up as the true accuracy/rmse on the test data.

**How are you going to compare and select the model?**

So far, we have learned knn regression and linear regression as regression techniques/models. Linear regression is used to predict values when the relationship is linear, whereas knn regression can predict values when the relationship is non-linear. In this case, `played_hours` and the associated predictor variables lack a linear relationship, so knn regression will be selected for use in the forward selection.
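For reference, the two candidate models could be specified in tidymodels roughly as shown below. This is only a sketch: the object names `lm_spec` and `knn_spec` are illustrative, and leaving `neighbors = tune()` assumes K will be chosen later by cross-validation.

```r
library(tidymodels)

# Linear regression specification (suited to linear relationships)
lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

# knn regression specification; K is left to be tuned by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")
```

Only the knn specification is carried forward, since the predictors do not show a linear relationship with `played_hours`.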

**How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?**

The data will first be split into training and testing sets with a 75-25 split, as the dataset only contains 196 observations. The model will be trained using the training set. Within the training set, cross-validation with 5 folds will be used, as the dataset is small, to lower the variance in the rmse estimates.
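A minimal sketch of this splitting scheme is shown below, using an invented `toy_players` tibble in place of the wrangled data. The seed is arbitrary, and stratifying on `played_hours` is an assumption made here to keep the skewed response balanced across the splits; it is not something stated above.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the wrangled data (NOT real values)
toy_players <- tibble(
    Age = sample(9:58, 100, replace = TRUE),
    played_hours = round(rexp(100, rate = 0.2), 1)
)

# 75/25 train/test split, stratified on the response (an assumption)
toy_split <- initial_split(toy_players, prop = 0.75, strata = played_hours)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# 5-fold cross-validation built from the training set only
toy_folds <- vfold_cv(toy_train, v = 5, strata = played_hours)
```

The test set is held back until the final model is chosen, so all tuning and forward selection would use `toy_folds` only.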

### Step 1: Loading in data

To begin, the relevant libraries and the .csv file `players.csv` must be loaded using `library` and `read_csv`.
The contents of `players.csv` will be assigned to `players_csv`.

In [None]:
#------------------------------ STEP 1 ------------------------------#

# Needed libraries:
library(tidyverse)
library(dplyr)
library(tidyr)
library(recipes)
library(tidymodels)
library(RColorBrewer)

# Making it so that only 6 rows of a table are printed for clarity
options(repr.matrix.max.rows = 6)

# Reading in players.csv
players_csv <- read_csv('https://raw.githubusercontent.com/wolfgirl43/DSCI-Group-10-Final-Project-/refs/heads/main/players.csv')


#players_csv
# uncomment the above line to print players_csv

### Step 2: Summary Statistics

The following cell corresponds to the code used to find the summary statistics seen in the **Introduction** section.
<br>
For numerical variables, the minimum, median, maximum, and mean values were found. For categorical variables, the counts (e.g. the number of subscribed players in the dataset) were found.
<br>
These were all placed in a table (seen in the **Introduction** section).
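As a side note, counts for a categorical variable can also be obtained with `count`, which returns one row per category. The sketch below uses a made-up `toy_players` tibble and an illustrative column name `num_players`; it is an alternative to the `sum(... == ...)` approach, not the code used for the table.

```r
library(dplyr)

# Hypothetical toy data (NOT values from players.csv)
toy_players <- tibble(
    experience = c("Pro", "Veteran", "Veteran", "Amateur", "Pro", "Veteran")
)

# One row per experience level, with the number of players in each
experience_counts <- toy_players |>
    count(experience, name = "num_players")
```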

In [None]:
#------------------------------ STEP 2 ------------------------------#

summary_table <- players_csv |>
    summarize(
        Rows = n(),
        Columns = ncol(players_csv),
        total_players = n(),
        
# Age variable
        min_age = min(Age, na.rm = TRUE),
        max_age = max(Age, na.rm = TRUE),
        mean_age = mean(Age, na.rm = TRUE),
        med_age = median(Age, na.rm = TRUE),
        
# played_hours variable
        min_pt = min(played_hours, na.rm = TRUE),
        max_pt = max(played_hours, na.rm = TRUE),
        med_pt = median(played_hours, na.rm = TRUE),
        mean_pt = mean(played_hours, na.rm = TRUE),

# subscribe variable
        num_subscribed = sum(subscribe == TRUE, na.rm = TRUE),
        num_unsubscribed = sum(subscribe == FALSE, na.rm = TRUE),

# experience variable
        num_Beginner = sum(experience == "Beginner", na.rm = TRUE),
        num_Amateur = sum(experience == "Amateur", na.rm = TRUE),
        num_Regular = sum(experience == "Regular", na.rm = TRUE),
        num_Veteran = sum(experience == "Veteran", na.rm = TRUE),
        num_Pro = sum(experience == "Pro", na.rm = TRUE),

# gender variable
        num_Male = sum(gender == "Male", na.rm = TRUE),
        num_Female = sum(gender == "Female", na.rm = TRUE),
        num_NonBinary = sum(gender == "Non-binary", na.rm = TRUE),
        num_TS = sum(gender == "Two-Spirited", na.rm = TRUE),
        num_Other = sum(gender == "Other", na.rm = TRUE),
        num_Ag = sum(gender == "Agender", na.rm = TRUE),
        num_pns = sum(gender == "Prefer not to say", na.rm = TRUE))


#summary_table
# uncomment the above line to print the table.

### Step 2.1: Visualisation of Summary Statistics

The following cell contains the code to produce the following visualisations:

Using `players_csv`:
* Distribution of Age (fig. 2.1.1)
* Distribution of played_hours (fig. 2.1.2)
* Distribution of subscribe (fig. 2.1.3)
* Distribution of experience (fig. 2.1.4)
* Distribution of gender (fig. 2.1.5)

In [None]:
#----------------------------- STEP 2.1 -----------------------------#

#fig. 2.1.1 (Distribution of player's age)
age_vis <- players_csv |>
    select(Age) |>
    ggplot(aes(x = Age)) +
        geom_histogram(na.rm = TRUE, binwidth = 0.5) +
        theme(text = element_text(size = 15),
              plot.title = element_text(size = 15, hjust = 0.5),
              axis.title.x = element_text(size = 12, vjust = -0.5),
              axis.title.y = element_text(size = 12, vjust = 0.7)) +
        labs(x = 'Age (Years)',
             y = 'Number of Players',
             title = "Fig. 2.1.1 Distribution of Players' Ages")

#fig 2.1.2 (Distribution of total hours players played)
pt_vis <- players_csv |>
    select(played_hours) |>
    filter(played_hours > 0) |>
    ggplot(aes(x = played_hours)) +
        geom_histogram(binwidth = 0.3) +
        theme(text = element_text(size = 15),
              plot.title = element_text(size = 15, hjust = 0.5),
              axis.title.x = element_text(size = 12, vjust = -0.5),
              axis.title.y = element_text(size = 12, vjust = 0.7)) +
        labs(x = 'Total Hours Played',
             y = 'Number of Players',
             title = "Fig. 2.1.2 Distribution of Total Hours Players Played") +
        scale_x_log10()

#fig 2.1.3 (Distribution of player's subscription status)
sub_vis <- players_csv |>
    select(subscribe) |>
    ggplot(aes(x = subscribe,
               fill = subscribe)) +
        geom_bar() +
        scale_fill_brewer(palette = 'Dark2') +
        theme(legend.position = 'none',
              text = element_text(size = 15),
              plot.title = element_text(size = 15, hjust = 0.5),
              axis.title.x = element_text(size = 12, vjust = -0.5),
              axis.title.y = element_text(size = 12, vjust = 0.7)) +
        labs(x = 'Player Subscription Status',
             y = 'Number of Players',
             title = "Fig. 2.1.3 Distribution of Players' Subscription Status")

#fig 2.1.4 (Distribution of the experience levels of players)
exp_vis <- players_csv |>
    select(experience) |>
    ggplot(aes(x = experience,
               fill = experience)) +
        geom_bar() +
        scale_fill_brewer(palette = 'Dark2') +
        theme(legend.position = 'none',
              text = element_text(size = 15),
              plot.title = element_text(size = 15, hjust = 0.5),
              axis.title.x = element_text(size = 12, vjust = -0.5),
              axis.title.y = element_text(size = 12, vjust = 0.7)) +
        labs(x = 'Player Experience Level',
             y = 'Number of Players',
             title = 'Fig. 2.1.4 Distribution of Player Experience Level')

#fig 2.1.5 (Distribution of player's gender)
gen_vis <- players_csv |>
    select(gender) |>
    mutate(gender = fct_recode(gender,
                               'PNTS' = 'Prefer not to say')) |>
    ggplot(aes(x = gender,
               fill = gender)) +
        geom_bar() +
        scale_fill_brewer(palette = 'Dark2') +
        theme(legend.position = 'none',
              text = element_text(size = 15),
              plot.title = element_text(size = 15, hjust = 0.5),
              axis.title.x = element_text(size = 12, vjust = -0.5),
              axis.title.y = element_text(size = 12, vjust = 0.7)) +
        labs(x = "Player's Gender",
             y = 'Number of Players',
             title = "Fig. 2.1.5 Distribution of Players' Gender")


#age_vis
# uncomment the above line to print Fig. 2.1.1
#pt_vis
# uncomment the above line to print Fig. 2.1.2
#sub_vis
# uncomment the above line to print Fig. 2.1.3
#exp_vis
# uncomment the above line to print Fig. 2.1.4
#gen_vis
# uncomment the above line to print Fig. 2.1.5

### Step 3: Wrangling Data

The following cell wrangles the `players_csv` dataset in order to tidy it for further analysis.
<br>
As stated in the **Possible Issues** section of the **Introduction**, the `NA` values in the `Age` column need to be dealt with, so one dataset will be made from `players_csv`:
1. `players_tidy`
<br>

`players_tidy` will be `players_csv` with the rows containing `NA` values in the `Age` column removed.

Dummy variables will be created in the recipe via `step_dummy`, as this makes the code more streamlined than inputting them manually.

Additionally, variables irrelevant to this analysis (`name` and `hashedEmail`) will be removed, and the `gender` value "Prefer not to say" will be replaced with "PNTS" so that it fits more cleanly on visualisations.
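To illustrate what `step_dummy` does inside a recipe, the sketch below applies it to a small invented table. The `toy_players` values and the object names are hypothetical, and `experience` is stored as a factor here for clarity.

```r
library(tidymodels)

# Hypothetical toy data (NOT values from players.csv)
toy_players <- tibble(
    played_hours = c(0, 1.2, 0.5, 3.4, 0, 7.8),
    Age = c(17, 21, 19, 25, 30, 22),
    experience = factor(c("Beginner", "Amateur", "Regular", "Veteran", "Pro", "Amateur"))
)

# Recipe that converts every categorical predictor into 0/1 dummy columns
toy_recipe <- recipe(played_hours ~ Age + experience, data = toy_players) |>
    step_dummy(all_nominal_predictors())

# prep() estimates the recipe; bake(new_data = NULL) returns the processed training data
toy_baked <- toy_recipe |>
    prep() |>
    bake(new_data = NULL)

names(toy_baked)
```

With five experience levels, the default settings produce four indicator columns, with the first factor level acting as the reference category.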

In [None]:
#------------------------------ STEP 3 ------------------------------#

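# create players_tidy:
#   - drop the columns that are irrelevant to this analysis
#   - shorten "Prefer not to say" to "PNTS" so it fits on plots
#   - remove rows where Age is missing
players_tidy <- players_csv |>
    select(-name, -hashedEmail) |>
    mutate(gender = fct_recode(gender, 'PNTS' = 'Prefer not to say')) |>
    filter(!is.na(Age))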


#players_tidy
# uncomment the above line to print players_tidy