# Title

# Introduction

The data we will be working with was collected by the Pacific Laboratory for Artificial Intelligence (PLAI), which aims to design an AI that can respond and interact like a human. PLAI created a Minecraft server to observe and collect data on human behaviour, which it will use to train the AI. The server records and stores information about the behaviour of players as they move through the Minecraft world and interact with others.

Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

We will use the `players` data set because it contains the information relevant to our question: the `subscribe` variable and the players' characteristics and behaviours.  
Part of this data set was collected through a questionnaire that players complete before they begin playing on a Minecraft server designed specifically for the research. The questionnaire records each player's gender, email, experience, and age, so these variables are all self-reported. Other data, such as start and end times, are recorded by the server as the player participates in the game.

- Observations: 196
- Variables: 7

| Variable     | Type            | Meaning                                                |
|--------------|-----------------|--------------------------------------------------------|
| experience   | character (chr) | Self-reported experience level of the player           |
| subscribe    | logical (lgl)   | Whether the player is subscribed to the newsletter     |
| hashedEmail  | character (chr) | Email of the player in a hashed, privacy-safe form     |
| played_hours | numeric (dbl)   | Number of hours the player played the game             |
| name         | character (chr) | Name of the player                                     |
| gender       | character (chr) | Self-reported gender of the player                     |
| Age          | numeric (dbl)   | Self-reported age of the player                        |

Summary Statistics of Quantitative Data:

| Variable             | Mean  | Min | Max   |
|----------------------|-------|-----|-------|
| played_hours (hours) | 5.85  | 0   | 223.1 |
| Age (years)          | 21.14 | 9   | 58    |

Summary Statistics of Qualitative Data:

| experience | Count | Percentage (%) |
|------------|-------|----------------|
| Amateur    | 63    | 32.14          |
| Beginner   | 35    | 17.86          |
| Pro        | 14    | 7.14           |
| Regular    | 36    | 18.37          |
| Veteran    | 48    | 24.49          |

| subscribe | Count | Percentage (%) |
|-----------|-------|----------------|
| FALSE     | 52    | 26.53          |
| TRUE      | 144   | 73.47          |

| gender            | Count | Percentage (%) |
|-------------------|-------|----------------|
| Agender           | 2     | 1.02           |
| Female            | 37    | 18.88          |
| Male              | 124   | 63.27          |
| Non-binary        | 15    | 7.65           |
| Other             | 1     | 0.51           |
| Prefer not to say | 11    | 5.61           |
| Two-Spirited      | 6     | 3.06           |


# Method and Results

In [17]:
library(tidyverse)   # data import and wrangling (attaches readr, dplyr, tidyr, ...)
library(tidymodels)  # modelling framework

In [3]:
players <- read_csv("https://raw.githubusercontent.com/sjhillen/DSCI-Group-Project/refs/heads/main/data/players.csv")
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|------------|-----------|-------------|--------------|------|--------|-----|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


Each column is a variable, each row is an observation, and each cell holds a single value, so this data is already in a tidy format. To prepare the data for our model, we select the variables we will be using (`Age`, `played_hours`, and `subscribe`) and remove any rows that contain missing (`NA`) values.

In [14]:
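# Keep only the modelling variables and drop rows with a missing value in any of them.
# drop_na() tests for real NA values; comparing against the string "NA" does not.
clean_players <- players |>
  select(Age, played_hours, subscribe) |>
  drop_na(Age, played_hours, subscribe)
head(clean_players)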

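| Age | played_hours | subscribe |
|-----|--------------|-----------|
| `<dbl>` | `<dbl>` | `<lgl>` |
| 9   | 30.3 | TRUE  |
| 17  | 3.8  | TRUE  |
| 17  | 0.0  | FALSE |
| 21  | 0.7  | TRUE  |
| 21  | 0.1  | TRUE  |
| 17  | 0.0  | TRUE  |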
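
R represents missing values with the special value `NA`, not with the string `"NA"`, so they are usually detected with `is.na()` or removed with `tidyr::drop_na()`. The sketch below illustrates this on a small made-up tibble; `toy_players` and its values are assumptions for illustration only and are not taken from `players.csv`.

```r
library(dplyr)
library(tidyr)

# Made-up toy data (NOT from players.csv) with one missing Age
toy_players <- tibble(
  Age          = c(17, NA, 21),
  played_hours = c(3.8, 0.5, 0.7),
  subscribe    = c(TRUE, FALSE, TRUE)
)

# is.na() flags the missing entry
is.na(toy_players$Age)
# FALSE  TRUE FALSE

# drop_na() keeps only rows that are complete in the listed columns
toy_clean <- toy_players |>
  drop_na(Age, played_hours, subscribe)

nrow(toy_clean)
# 2
```

Because rows are removed, summary statistics computed on the cleaned data can differ from those computed on the full data set.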


In [16]:
# Quantitative summary statistics (computed on the cleaned data)
players_summary_quantitative <- clean_players |>
  summarise(avg_played_hours = mean(played_hours, na.rm = TRUE),
            avg_age = mean(Age, na.rm = TRUE),
            min_played_hours = min(played_hours, na.rm = TRUE),
            max_played_hours = max(played_hours, na.rm = TRUE),
            min_age = min(Age, na.rm = TRUE),
            max_age = max(Age, na.rm = TRUE)) |>
  round(2)
players_summary_quantitative

# Qualitative summary statistics (computed on the full data, before cleaning)
Total <- nrow(players)

# Count of players per subscription status, with each count as a percentage of all rows
subscribe_summary <- count(players, subscribe) |>
  mutate(percentage = round(n / Total * 100, 2))
subscribe_summary


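| avg_played_hours | avg_age | min_played_hours | max_played_hours | min_age | max_age |
|------------------|---------|------------------|------------------|---------|---------|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 5.9 | 21.14 | 0 | 223.1 | 9 | 58 |

| subscribe | n | percentage |
|-----------|---|------------|
| `<lgl>` | `<int>` | `<dbl>` |
| FALSE | 52  | 26.53 |
| TRUE  | 144 | 73.47 |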
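
The `experience` and `gender` tables in the Introduction could be produced with the same `count()` and `mutate()` pattern used for `subscribe`. The following is a minimal sketch of one way to do this, not necessarily how those tables were generated; `summarise_category()` is a hypothetical helper and `toy_players` is made-up data used so that the example runs on its own.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): counts each level
# of a categorical column and expresses it as a percentage of all rows.
summarise_category <- function(data, column) {
  data |>
    count({{ column }}) |>
    mutate(percentage = round(n / sum(n) * 100, 2))
}

# Made-up toy data standing in for the players data
toy_players <- tibble(
  experience = c("Amateur", "Amateur", "Pro", "Veteran"),
  gender     = c("Male", "Female", "Male", "Male")
)

experience_summary <- summarise_category(toy_players, experience)
gender_summary     <- summarise_category(toy_players, gender)
experience_summary
```

Dividing by `sum(n)` is equivalent to dividing by the total number of rows, since the counts across all levels add up to the row count. With the real data, the call would be along the lines of `summarise_category(players, experience)`.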


# Discussion