# Minecraft DSCI100 Group Project

## Description of Data Sets ##
### players.csv ###
This data set has 196 observations and 7 variables.

| Variable Name | Type   | Meaning |
|----------------|-------------------|----------|
| experience     | Categorical (factor) | Player's experience level (Beginner, Amateur, Regular, Pro, Veteran) |
| subscribe      | Logical (TRUE/FALSE)  | TRUE = subscribed; FALSE = not subscribed |
| hashedEmail    | Character (chr)    | Hashed email of the player |
| played_hours   | numeric (float)  | Number of hours played |
| name           | Character (chr)  | Player's name |
| gender         | Categorical (factor)  | Gender of the player |
| Age            | numeric (int)  | Age of Player |

Figure 1. Description of players.csv dataset

### sessions.csv ###
This data set has 1535 observations and 5 variables.

| Variable Name | Type   | Meaning |
|----------------|-------------------|----------|
| hashedEmail    | Character | Anonymized unique player identifier |
| start_time     | Datetime (string)  | Timestamp for when session started in DD/MM/YYYY HH:MM format|
| end_time       | Datetime (string)  | Timestamp for when session ended in DD/MM/YYYY HH:MM format |
| original_start_time | numeric (float)  | Start time in milliseconds since 01/01/1970 |
| original_end_time   | numeric (float)  | End time in milliseconds since 01/01/1970 |

Figure 2. Description of sessions.csv dataset
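
The two time formats described above can be converted to proper date-time values before any analysis. The following is a minimal sketch, assuming the formats listed in the table; the two rows in `sessions_example` are invented for illustration and are not taken from sessions.csv.

```r
library(tidyverse)
library(lubridate)  # already attached by tidyverse 2.0.0; loaded explicitly for clarity

# Hypothetical stand-in for sessions.csv (values are invented)
sessions_example <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

sessions_parsed <- sessions_example |>
  mutate(
    # Parse the DD/MM/YYYY HH:MM strings into date-times (UTC by default)
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")),
    # as_datetime() expects seconds since 01/01/1970, so divide milliseconds by 1000
    original_start = as_datetime(original_start_time / 1000)
  )

sessions_parsed
```

Parsing the strings first means session lengths can be computed with ordinary date-time arithmetic instead of string manipulation.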

### Potential Issues ###
- In players.csv, the column names are not standardized: Age is capitalized while the other variable names start with a lowercase letter, and hashedEmail uses camelCase while played_hours uses snake_case.
- In players.csv, experience may be self-reported, so it might not accurately represent a player's actual skill.


### How Data Was Collected ###
- Player information collected through self-reporting
- Each player is identified by a unique hashedEmail, a hashed form of their email address that keeps the identifier anonymous
- played_hours obtained by recording the player's total playtime

## Question ##
The broad question we aim to address is:

**“Can player characteristics be used to predict behavioural outcomes in the dataset?”**

The specific question we focus on is:

**“From the players.csv dataset, can a player’s age and playtime hours predict whether they will subscribe to a game-related newsletter?”**

To answer this question, we will build a k-nearest neighbours (k-NN) classification model, where:

Response variable:
- subscribe (logical: TRUE/FALSE)

Explanatory variables:
- age (numeric; stored as Age in the raw file)
- played_hours (numeric)

The players.csv dataset contains all three of these variables, which allows us to construct a predictive model. Because k-NN classifies an observation by computing its distance to other observations, it needs numerical predictors; we selected age and played hours because both are quantitative and suitable for distance-based methods.
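
As a rough illustration of the modelling approach described above, the sketch below fits a k-NN classifier with the tidymodels framework. It assumes the tidymodels and kknn packages are installed; the data in `toy_players` are randomly generated stand-ins rather than real player records, and the seed, the 75/25 split, and `neighbors = 5` are arbitrary choices made for the example, not tuned values.

```r
library(tidyverse)
library(tidymodels)  # assumption: tidymodels and the kknn engine are installed

set.seed(2024)  # arbitrary seed so the example is reproducible

# Invented stand-in for players.csv; these are not real observations
toy_players <- tibble(
  age = sample(10:50, 60, replace = TRUE),
  played_hours = round(runif(60, 0, 40), 1),
  subscribe = factor(sample(c("TRUE", "FALSE"), 60, replace = TRUE))
)

# Hold out 25% of the rows for evaluation, keeping class proportions similar
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Centre and scale the predictors so neither dominates the distance calculation
knn_recipe <- recipe(subscribe ~ age + played_hours, data = toy_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# k-NN specification; neighbors = 5 is a placeholder, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the held-out rows and report accuracy
toy_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

toy_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

Because the toy labels are random, the reported accuracy carries no meaning here; the point is only the order of the steps (split, standardize, specify, fit, evaluate). In a real analysis the number of neighbours would normally be chosen by cross-validation on the training set.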

In [1]:
library(tidyverse)  # also attaches dplyr, readr, and ggplot2

players <- read_csv("https://raw.githubusercontent.com/sophiaymeng/dsci_100_008_7_minecraft/refs/heads/main/data/players%20(1).csv")


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────


To answer our question, we will only be using the players.csv dataset. It can be loaded into R using the following line of code:

In [3]:
# Load players.csv and rename Age to age for consistency with the other columns
players_data <- read_csv("https://raw.githubusercontent.com/sophiaymeng/dsci_100_008_7_minecraft/refs/heads/main/data/players%20(1).csv", show_col_types = FALSE) |>
    rename(age = Age)
head(players_data)

| experience | subscribe | hashedEmail | played_hours | name | gender | age |
|------------|-----------|-------------|--------------|------|--------|-----|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


Figure 3. First six rows of the players.csv dataset

## Wrangling and Cleaning the Data
Next, we minimally wrangled the data: the categorical columns were converted to factors, and rows with a missing age were removed.

In [4]:
players_data <- players_data |>
    # Convert the character and logical columns to factors
    mutate(experience = as_factor(experience),
           subscribe = as_factor(subscribe),
           name = as_factor(name),
           gender = as_factor(gender),
           hashedEmail = as_factor(hashedEmail)) |>
    # Remove rows where age is missing
    drop_na(age)
head(players_data)

| experience | subscribe | hashedEmail | played_hours | name | gender | age |
|------------|-----------|-------------|--------------|------|--------|-----|
| `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |
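
Before dropping rows, it can help to check how many values are missing in each column, since `drop_na(age)` only removes rows where age is missing while `drop_na()` with no arguments removes rows with a missing value in any column. The sketch below shows the difference on a small invented tibble; `toy_players` and its values are hypothetical and are not drawn from players.csv.

```r
library(tidyverse)

# Invented rows for illustration only
toy_players <- tibble(
  name = c("A", "B", "C", "D"),
  played_hours = c(1.5, NA, 0.0, 2.0),
  age = c(17, 21, NA, 19)
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Drop only the rows where age is missing (removes C)
age_only <- drop_na(toy_players, age)

# Drop rows with a missing value in any column (removes B and C)
any_column <- drop_na(toy_players)

na_counts
nrow(age_only)
nrow(any_column)
```

Restricting `drop_na()` to the columns used in the model avoids discarding rows that are only incomplete in variables the analysis never touches.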
