# Individual Project Planning

- DSCI_V100-009
- Rachel Hovestad
- Group 42

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)

### Exploratory Data Analysis and Evaluation

First, let's load both of our datasets into R. For my analysis, I'll only be using the players dataset, but let's load both anyway so we can look at them. 

In [None]:
players <- read_csv("players.csv")

sessions <- read_csv("sessions.csv")

head(players)

Now, let's find the mean value for each quantitative variable in the players dataset.

In [None]:
players_mean <- players |>
  summarise(mean_age = mean(Age, na.rm = TRUE), 
            mean_playtime = mean(played_hours, na.rm = TRUE))

players_mean

Next, let's do some minimal tidying of our players dataset (the dataset I'll be using for my analysis). We're going to remove all identifier variables (name and hashedEmail) and ensure that characters are turned into factors so we can use them for analysis later. 

In [None]:
tidy_players <- players |>
  select(-hashedEmail, -name) |>
  mutate(experience = as.factor(experience),
         gender = as.factor(gender))

head(tidy_players)

Now let's make some visualizations to further understand our data! We're going to do so with the original players dataset. I want to look at the distribution of each variable while keeping every graph readable and engaging.

In [None]:
gender <- players |>
  ggplot(aes(y = gender, fill = gender)) +
  geom_bar() +
  labs(title = "Figure 1: Gender Distribution of Players", x = "Count", y = "Gender") +
  theme(text = element_text(size = 16))
gender

play_distribution <- players |>
  ggplot(aes(x = played_hours)) +
  geom_histogram(binwidth = 0.5, fill = "blue") +
  labs(x = "Number of hours played", y = "Count") +
  ggtitle("Figure 2: Distribution of Number of Hours Played by Gamers") +
  theme(text = element_text(size = 14))
play_distribution

subscribe <- players |>
  ggplot(aes(x = subscribe, fill = subscribe)) +
  geom_bar() +
  labs(x = "Subscribed?", y = "Count") +
  ggtitle("Figure 3: Are Players Subscribed to the Gaming Newsletter?") +
  theme(text = element_text(size = 14))
subscribe

experience <- players |>
  ggplot(aes(x = experience, fill = experience)) +
  geom_bar() +
  labs(title = "Figure 4: Experience Distribution of Players", x = "Experience Level", y = "Count") +
  theme(text = element_text(size = 14))
experience

age_distribution <- players |>
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "pink", color = "red") +
  labs(x = "Age of player", y = "Count") +
  ggtitle("Figure 5: Distribution of Ages of Players") +
  theme(text = element_text(size = 12))
age_distribution

#### Data Visualization Analysis

Looking at **Figure 1**, we can see that the majority of players identified as male, so other gender identities are underrepresented. **Figure 2** is strongly skewed to the right: most players only played for a few minutes. Because Minecraft is a long, open-ended game with many different ways to play, players who only played for such brief amounts of time may not reflect what committed Minecraft players look like. **Figure 3** shows that more than half of the players chose to subscribe to the newsletter. **Figure 4** shows that most players self-identified as Amateur. Finally, **Figure 5** shows that the majority of players were just under 20, so many of them are probably university students.

### Data Description

Let's look at the original players dataset and see what we can note about the data. We already learned a lot about the variables through data visualization, but let's look at statistics now!

In [None]:
players

In [None]:
stats_summary <- players |>
  summarise(mean_age = mean(Age, na.rm = TRUE),
            sd_age = sd(Age, na.rm = TRUE),
            min_age = min(Age, na.rm = TRUE),
            max_age = max(Age, na.rm = TRUE),
            median_age = median(Age, na.rm = TRUE),
            mean_played_hours = mean(played_hours, na.rm = TRUE),
            sd_played_hours = sd(played_hours, na.rm = TRUE),
            min_played_hours = min(played_hours, na.rm = TRUE),
            max_played_hours = max(played_hours, na.rm = TRUE),
            median_played_hours = median(played_hours, na.rm = TRUE))
stats_summary

newsletter <- players |>
    count(subscribe)
newsletter

nrow(players) 
ncol(players)

# Named gender_counts so it does not overwrite the Figure 1 plot object
gender_counts <- players |>
  count(gender)
gender_counts

#### How data was collected
These data were collected by a UBC Computer Science research group led by Frank Wood. According to their website, linked in the Project Description on Canvas, players chose to join a custom Minecraft server used for behavioral research so that the group could train AI NPCs. The server recorded how long each person played, along with basic demographic information (experience, gender, age). Newsletter subscription status was also collected, but it is unclear what exactly this newsletter contains.


#### Variable Descriptions
**experience**: a factor, either Beginner, Amateur, Pro or Veteran

**subscribe**: a boolean, either true or false, indicating whether the player is subscribed to the newsletter

**played_hours**: a double, the number of hours a player contributed to the project through playtime

**gender**: a string, the self-identified gender of a player

**Age**: an integer, the age of a player

#### Summary Statistics

- Number of observations (rows): 196
- Number of variables (columns): 7


- Our *maximum* **played_hours** is (hours): 223.1
- Our *minimum* **played_hours** is (hours): 0


- Number of newsletter **subscribers**: 144
- Number of **non-subscribers**: 52


- *Mean* of **Age**: 21.14
- *Standard Deviation* of **Age**: 7.39
- *Median* of **Age**: 19
- *Mean* of **played_hours**: 5.85
- *Standard Deviation* of **played_hours**: 28.36
- *Median* of **played_hours**: 0.1



#### Gender Identity Breakdown

- Agender: 2
- Female: 37
- Male: 124
- Non-binary: 15
- Other: 1
- Prefer not to say: 11
- Two-spirited: 6
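
To put these counts on a common scale, the short sketch below converts them to percentages. The counts are copied from the list above; the object names `gender_counts` and `gender_share` are only illustrative.

```r
library(dplyr)

# Counts copied from the Gender Identity Breakdown list above
gender_counts <- tibble(
  gender = c("Agender", "Female", "Male", "Non-binary",
             "Other", "Prefer not to say", "Two-spirited"),
  n = c(2, 37, 124, 15, 1, 11, 6)
)

# Share of each group as a percentage of all players, largest group first
gender_share <- gender_counts |>
  mutate(percent = round(100 * n / sum(n), 1)) |>
  arrange(desc(n))

gender_share
```

With these counts, male players make up about 63% of the 196 observations and female players about 19%.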


#### Potential Issues
Sampling bias: about 63% of players (124 of 196) are male, so other gender identities are underrepresented and results may not generalize to them.

Newsletter subscription definition: It’s unclear what the newsletter content was, which limits interpretability of “subscription” as an engagement metric.


Self-reported data: Experience level and age may be inaccurate or inconsistent because players reported these themselves.

Engagement bias: Players who chose to join the UBC Minecraft server may differ from the general gaming population.

Limited behavioral detail: Only playtime is available; there is no data on play style (e.g. building, exploring).

Skewness: played_hours is heavily skewed right (most players play for very little time). 

### Questions

I wanted to complete the Exploratory Data Analysis and the Data Description before formulating a question to analyze. Now that those are done, what I noticed has helped me choose the specific question I want to analyze. Given the strong skew in some variables and the overrepresentation of men (about 63%) in the dataset, I have decided to further examine the research group's second broad question of interest: "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

Reflecting on this broad question, I have narrowed it down to a more specific question I will be exploring: **Can we use age, play hours, and experience to predict subscription to the newsletter?**

- My **response** variable is: *subscribe*
- My **explanatory** variables are: *Age*, *played_hours*, *experience*

The variables recorded in the players dataset let us test whether demographics and behaviour can predict whether or not an individual subscribes to the newsletter. Before modeling, the dataset needs some wrangling so that variable types are consistent and the predictors are usable.

I would use the tidy dataset that I produced earlier in the Exploratory Data Analysis. Next, because played_hours is heavily right-skewed (Figure 2), I might omit any players who played less than a specific number of minutes, and then upsample so that we still have enough data points. I will split the dataset into training and testing subsets (for example, 70% for training and 30% for testing) so that the predictive model can be trained on one portion of the data and evaluated on unseen observations. These wrangling steps will produce a clean, well-structured dataset ready for modeling.
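
The filtering and splitting steps described above could look roughly like the sketch below. It runs on simulated toy data rather than the real players dataset, and the cutoff `min_hours`, the seed, and the choice to stratify by *subscribe* are all assumptions made for illustration, not decisions already fixed in this plan. Upsampling is left out here.

```r
library(dplyr)
library(rsample)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Simulated stand-in for tidy_players (NOT the real data)
toy_players <- tibble(
  Age = sample(15:40, 100, replace = TRUE),
  played_hours = round(rexp(100, rate = 0.5), 1),
  subscribe = factor(sample(c("TRUE", "FALSE"), 100,
                            replace = TRUE, prob = c(0.7, 0.3)))
)

# Hypothetical cutoff: keep players with at least 0.1 hours (6 minutes)
min_hours <- 0.1
filtered_players <- toy_players |>
  filter(played_hours >= min_hours)

# 70/30 split, stratified so both subsets keep a similar subscriber share
players_split <- initial_split(filtered_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

nrow(players_train)
nrow(players_test)
```

Stratifying by the response is a design choice that seems sensible here because subscribers outnumber non-subscribers, so a purely random split could leave the test set with very few non-subscribers.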

### Methods and Plans

To answer the question **“Can we use age, play hours, and experience to predict subscription to the newsletter?”**, I propose using the **k-nearest neighbors (k-NN) classification algorithm**. 

#### Why is this method appropriate

Because the response variable *subscribe* is binary (subscribed vs. not subscribed) and we have a small set of predictors (*Age*, *played_hours*, *experience*), k-NN offers a straightforward and interpretable classification method that fits our analysis. k-NN classifies a new observation by finding its k closest training observations, then assigning the class most common among those neighbors.
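
To make the neighbor-voting idea concrete, here is a minimal from-scratch sketch in base R. The training points are made up and assumed to be already standardized, and `knn_predict` is a hypothetical helper written for illustration only; the actual analysis would use tidymodels rather than this function.

```r
# Toy training data (made up), two predictors already on comparable scales
train_x <- matrix(
  c(-1.0, -0.5,
    -0.8,  0.2,
    -0.6, -0.9,
     0.9,  1.0,
     1.1,  0.4,
     0.7,  0.8),
  ncol = 2, byrow = TRUE
)
train_y <- c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE")

# Classify one new point by majority vote among its k nearest neighbors
knn_predict <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from new_x to every training row
  distances <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training observations
  nearest <- order(distances)[seq_len(k)]
  # Most common class among those neighbors
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(c(0.8, 0.7), train_x, train_y, k = 3)    # "TRUE"
knn_predict(c(-0.9, -0.3), train_x, train_y, k = 3)  # "FALSE"
```

Each new point takes the label of the cluster it sits in, because all three of its nearest neighbors share that label.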

#### Which assumptions are required, if any, to apply the method selected?

k-NN makes few assumptions about the shape of the data, but it does rely on distances between observations. If one variable (e.g. *played_hours*) has a much larger scale than another (e.g. *Age*), then that variable will dominate the distance calculation. So we must center and scale the numerical variables before applying k-NN.
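
One way the standardization step might be written is with a recipe from the recipes package (part of tidymodels). This is a sketch on made-up toy values, not the real players data, and it leaves out *experience* to keep the example to numeric predictors.

```r
library(dplyr)
library(recipes)

# Toy data (made up): played_hours spans a far wider range than Age
toy_players <- tibble(
  Age = c(17, 19, 21, 23, 30, 45),
  played_hours = c(0.1, 0.5, 2.0, 10.0, 50.0, 200.0),
  subscribe = factor(c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"))
)

# Center each numeric predictor to mean 0, then scale it to standard deviation 1
scale_recipe <- recipe(subscribe ~ Age + played_hours, data = toy_players) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

# prep() estimates the means and standard deviations; bake() applies them
scaled_players <- scale_recipe |>
  prep() |>
  bake(new_data = NULL)

scaled_players |>
  summarise(across(c(Age, played_hours), list(mean = mean, sd = sd)))
```

After this step both predictors have mean 0 and standard deviation 1, so neither dominates the distance. In the real analysis the recipe would be prepared on the training set only, so that the test set does not influence the scaling.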



