# Data Science Project: Planning Stage (Individual)

**Name:** Victoria Chen  
**Student Number:** 66263492  
**Section:** 004  
**Group:** 30  

## 1. Data Description

In [1]:
# Load tidyverse package
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# Load dataset
players <- read_csv("players.csv")
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
| --- | --- | --- | --- | --- | --- | --- |
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


### Basic Information

The `players.csv` data was collected by the Pacific Laboratory for Artificial Intelligence (PLAI), a research group based in the Department of Computer Science at the University of British Columbia. By running a Minecraft server called PLAICraft, the group collected data about players and how they interacted with the game. This dataset contains player-level information, with each observation representing one unique player.

From reading in the dataset, I observe the following: 
- Number of observations (rows): 196  
- Number of variables (columns): 7

| Variable | Type | Description | Possible values (if categorical) | Notes |
| --- | --- | --- | --- | --- |
| experience | chr | Player's experience level | Beginner, Amateur, Regular, Pro, Veteran | Levels may be inconsistent because they are subjective |
| subscribe | lgl | Whether the player is subscribed to the newsletter | TRUE, FALSE | Will serve as the response variable later |
| hashedEmail | chr | Player's email (hashed) | | Can be used as a unique identifier |
| played_hours | dbl | Number of hours the player has played on the server | | Strongly right-skewed: many zeros (inactive players) and a few very large values |
| name | chr | Player's name | | May not be unique |
| gender | chr | Player's gender | Male, Female, Non-binary, Two-Spirited, Agender, Other, Prefer not to say | Few observations for some categories |
| Age | dbl | Player's age in years | | Can be grouped in increments |
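
As a rough illustration of the note on `Age`, ages could be binned with base R's `cut()`. This is only a sketch: the 10-year bin width and the `age_group` name are my own assumptions, and the ages are the six values printed by `head(players)` above.

```r
# Ages of the six players shown in the head(players) output
age <- c(9, 17, 17, 21, 21, 17)

# Assumed bin width of 10 years; 0 to 60 covers the ages 9 to 58 reported by summary()
breaks <- seq(0, 60, by = 10)

# right = FALSE makes each interval closed on the left, e.g. [10,20)
age_group <- cut(age, breaks = breaks, right = FALSE)

# Count how many of the six players fall into each bin
table(age_group)
```

A different bin width only requires changing the `by` argument of `seq()`.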

### Summary Statistics

In [3]:
# Factorize categorical variables
players <- players |>
  mutate(
    experience = factor(experience,
                        levels = c("Beginner", "Amateur", "Regular", "Pro", "Veteran"),
                        ordered = TRUE),
    gender = factor(gender)
  )

# Calculate summary statistics for each variable
summary(players)

    experience subscribe       hashedEmail         played_hours    
 Beginner:35   Mode :logical   Length:196         Min.   :  0.000  
 Amateur :63   FALSE:52        Class :character   1st Qu.:  0.000  
 Regular :36   TRUE :144       Mode  :character   Median :  0.100  
 Pro     :14                                      Mean   :  5.846  
 Veteran :48                                      3rd Qu.:  0.600  
                                                  Max.   :223.100  
                                                                   
     name                         gender         Age       
 Length:196         Agender          :  2   Min.   : 9.00  
 Class :character   Female           : 37   1st Qu.:17.00  
 Mode  :character   Male             :124   Median :19.00  
                    Non-binary       : 15   Mean   :21.14  
                    Other            :  1   3rd Qu.:22.75  
                    Prefer not to say: 11   Max.   :58.00  
                    Two-Spirited    

From the `summary()` output, I notice that:
- The most prevalent experience level is "Amateur" (63 players), followed by "Veteran" (48 players)
- The proportion of players subscribed to the newsletter is 144 / 196 ≈ 73.47%
- The mean number of hours played is 5.846, yet the median is only 0.1, which indicates a strongly right-skewed distribution (the maximum is 223.1)
- The mean (21.14) and median (19.00) ages are both close to 20 years
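
The gap between the mean and median of `played_hours` can be reproduced on a small scale. The sketch below uses only the six rows printed by `head(players)`, so its numbers will not match the full-data summary; the names `players_sample`, `sample_stats`, and `full_rate` are my own. The last lines recheck the subscription rate from the `summary()` counts.

```r
library(dplyr)

# Six rows copied from the head(players) output (other columns omitted)
players_sample <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  Age          = c(9, 17, 17, 21, 21, 17)
)

# Mean vs. median of played_hours: a large gap suggests right skew
sample_stats <- players_sample |>
  summarise(
    mean_hours     = mean(played_hours),
    median_hours   = median(played_hours),
    subscribe_rate = mean(subscribe)  # mean of a logical = proportion TRUE
  )
sample_stats

# Subscription rate in the full data, from the summary() counts (144 TRUE, 52 FALSE)
full_rate <- 144 / (144 + 52)
round(100 * full_rate, 2)  # 73.47
```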

## 2. Questions

### Broad Question

> What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Specific Question

> Can `played_hours` and `Age` predict the value of `subscribe` in `players.csv`?

Variables:
- Response/dependent variable: `subscribe`
- Explanatory/independent variables: `played_hours`, `Age`

I want to know whether the two explanatory variables can predict the response variable. Since `subscribe` is binary, this is a binary classification problem. In this class, we learned the K-nearest neighbors algorithm, which classifies a new observation by majority vote among the K closest observations, with closeness measured as distance on numeric predictors. `played_hours` and `Age` are both already `dbl` variables, so they can be used as predictors without conversion. However, because the algorithm is distance-based and the two variables are on different scales (`played_hours` ranges from 0 to 223.1, `Age` from 9 to 58), they should be standardized before fitting the model.
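
To make the idea concrete, here is a minimal K-nearest neighbors sketch. It rests on several assumptions that are not part of the project: it uses `knn()` from the `class` package (the course workflow may use different tooling), it trains only on the six rows printed by `head(players)`, it sets K = 3 arbitrarily, and `new_player` is a hypothetical observation. The predictors are standardized first because the algorithm relies on distances.

```r
library(class)  # provides knn(); an assumed choice of implementation

# Six rows copied from the head(players) output
train <- data.frame(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  Age          = c(9, 17, 17, 21, 21, 17)
)
subscribe <- factor(c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE))

# Standardize predictors so neither variable dominates the distance
train_scaled <- scale(train)

# Hypothetical new player, scaled with the training means and standard deviations
new_player <- data.frame(played_hours = 1.5, Age = 20)
new_scaled <- scale(new_player,
                    center = attr(train_scaled, "scaled:center"),
                    scale  = attr(train_scaled, "scaled:scale"))

# Majority vote among the 3 nearest training points (K = 3 is arbitrary here)
prediction <- knn(train = train_scaled, test = new_scaled, cl = subscribe, k = 3)
prediction
```

With only six training rows, five of which are subscribers, the prediction itself is not meaningful; the sketch only shows the mechanics of scaling and voting. In a full analysis, K would normally be tuned rather than fixed in advance.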

## 3. Exploratory Data Analysis and Visualization

## 4. Methods and Plan