# Individual Project Planning

- DSCI_V100-009
- Rachel Hovestad
- Group 42

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)

### Exploratory Data Analysis and Evaluation

First, let's load both of our datasets into R.

In [None]:
players <- read_csv("players.csv")

sessions <- read_csv("sessions.csv")

head(players)

Now, let's find the mean value for each quantitative variable in the players dataset.

In [None]:
players_mean <- players |>
  summarise(mean_age = mean(Age, na.rm = TRUE), 
            mean_playtime = mean(played_hours, na.rm = TRUE))

players_mean

Next, let's do some minimal tidying of our players dataset (the dataset I'll be using for my analysis). We're going to remove all identifier variables (name and hashedEmail) and convert the character variables to factors so we can use them for analysis later.

In [None]:
tidy_players <- players |>
  select(-hashedEmail, -name) |>
  mutate(experience = as.factor(experience),
            gender = as.factor(gender))

tidy_players

### Data Description

Let's look at the players dataset and see what we can note about the data. 

In [None]:
players

In [None]:
stats_summary <- players |>
  summarise(mean_age = mean(Age, na.rm = TRUE),
            sd_age = sd(Age, na.rm = TRUE),
            min_age = min(Age, na.rm = TRUE),
            max_age = max(Age, na.rm = TRUE),
            median_age = median(Age, na.rm = TRUE),
            mean_played_hours = mean(played_hours, na.rm = TRUE),
            sd_played_hours = sd(played_hours, na.rm = TRUE),
            min_played_hours = min(played_hours, na.rm = TRUE),
            max_played_hours = max(played_hours, na.rm = TRUE),
            median_played_hours = median(played_hours, na.rm = TRUE))
stats_summary

newsletter <- players |>
    count(subscribe)
newsletter

nrow(players) 
ncol(players)

gender <- players |>
   count(gender)
gender

#### How data was collected
These data were collected by a UBC Computer Science research group (Frank Wood’s lab). Players voluntarily joined a custom Minecraft server used for behavioral research, so they could build NPCs in Minecraft with AI. Gameplay metrics and basic demographic information (experience, gender, age) were recorded automatically or via pre-game questionnaires. Newsletter subscription was collected through an opt-in option provided to players.

#### Variable Descriptions
**experience**: a factor, either Beginner, Amateur, Pro or Veteran

**subscribe**: a boolean, either true or false, indicating whether the player is subscribed to the newsletter

**played_hours**: a double, the number of hours a player contributed to the project through playtime

**gender**: a string, the self-identified gender of a player

**Age**: an integer, the age of a player

#### Summary Statistics

- Number of observations (rows): 196
- Number of variables (columns): 7


- Our *maximum* **played_hours** is (hours): 223.1
- Our *minimum* **played_hours** is (hours): 0


- Number of newsletter **subscribers**: 144
- Number of **non-subscribers**: 52


- *Mean* of **Age**: 21.14
- *Standard Deviation* of **Age**: 7.39
- *Median* of **Age**: 19
- *Mean* of **played_hours**: 5.85
- *Standard Deviation* of **played_hours**: 28.36
- *Median* of **played_hours**: 0.1



#### Gender Identity Breakdown

- Agender: 2
- Female: 37
- Male: 124
- Non-binary: 15
- Other: 1
- Prefer not to say: 11
- Two-spirited: 6
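
The counts above can be turned into shares of the sample with a short sketch like the one below. The counts are re-entered by hand from the list above rather than recomputed from `players.csv`, and the names `gender_counts` and `gender_share` are only illustrative.

```r
library(dplyr)

# Counts copied from the gender breakdown above (they sum to 196 players).
gender_counts <- tibble::tibble(
  gender = c("Agender", "Female", "Male", "Non-binary",
             "Other", "Prefer not to say", "Two-spirited"),
  n      = c(2, 37, 124, 15, 1, 11, 6)
)

# Convert each count to a share of all players and sort from largest to smallest.
gender_share <- gender_counts |>
  mutate(share = n / sum(n)) |>
  arrange(desc(share))

gender_share
```

In the notebook itself, the same `mutate()` and `arrange()` steps could be applied directly to the output of `count(gender)`.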


Each row represents: One individual Minecraft player on the UBC research server.

#### Potential Issues
Sampling bias: about 63% of players (124 of 196) are male, so the sample may not represent players of other genders well.

Newsletter subscription definition: It’s unclear what the newsletter content was, which limits interpretability of “subscription” as an engagement metric.


Self-reported data: Experience level and age may be inaccurate or inconsistent.

Engagement bias: Players who chose to join the UBC Minecraft server may differ from the general gaming population.

Limited behavioral detail: Only playtime is available; no data on play style (e.g., building, exploring).

Skewness: played_hours is heavily right-skewed (the median is 0.1 hours while the mean is 5.85 hours), so most players play for very little time.

### Questions

After exploring our dataset, summarizing notable statistics, and looking at potential issues, I have decided to further examine the research group's second broad question of interest: "We would like to know which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts."

Reflecting on this broad question, I have narrowed it down into a more specific question I will be exploring: **Can self-described experience predict length of playtime?**

- My **response** variable is: *played_hours*
- My **explanatory** variable is: *experience*

I hypothesize that players who describe themselves as more experienced (i.e., Veteran or Pro) will have more played hours. If so, the research team could focus recruiting on players who report higher experience.

The dataset contains 196 individual player records, each with information about self-reported experience level, total hours played, age, gender, and subscription status. My question can be addressed directly using these variables because *experience* represents how players perceive their skill or familiarity with the game (the explanatory variable), and *played_hours* provides a quantitative measure of actual engagement (the response variable).

To prepare my dataset, I will convert *experience* to a factor so it can be used as a categorical predictor, and I will remove the variables I do not need (i.e., subscribe, hashedEmail, name, gender, Age).
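
A minimal sketch of that preparation step is shown below. To keep it self-contained it runs on a few made-up rows (`toy_players`) that only mimic the column names of `players`; the values are not real data, and `model_data` and `playtime_by_experience` are hypothetical names.

```r
library(dplyr)

# Made-up rows with the same columns as `players`; values are illustrative only.
toy_players <- tibble::tibble(
  experience   = c("Beginner", "Amateur", "Pro", "Veteran", "Amateur", "Pro"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  hashedEmail  = c("a1", "b2", "c3", "d4", "e5", "f6"),
  played_hours = c(0.1, 2.5, 30.3, 0.0, 1.2, 4.0),
  name         = c("P1", "P2", "P3", "P4", "P5", "P6"),
  gender       = c("Male", "Female", "Male", "Non-binary", "Male", "Female"),
  Age          = c(17, 21, 19, 23, 18, 20)
)

# Keep only the explanatory and response variables, with experience as a factor.
model_data <- toy_players |>
  select(experience, played_hours) |>
  mutate(experience = as.factor(experience))

# Quick per-group check of playtime before any modelling.
playtime_by_experience <- model_data |>
  group_by(experience) |>
  summarise(n = n(),
            mean_hours = mean(played_hours),
            median_hours = median(played_hours))

playtime_by_experience
```

In the notebook, `toy_players` would be replaced by the `players` tibble loaded earlier.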

### Methods and Plans

To address the question **"Can players' self-described experience level predict the amount of time they spend playing?"**, I will use a **linear regression model**.

This method is appropriate because my response variable, *played_hours*, is continuous, and my explanatory variable, *experience*, is categorical. 

Linear regression will allow me to estimate the average playtime associated with each experience level and test whether differences among these groups are statistically significant. The model is also interpretable, providing clear coefficients that show how playtime changes across experience categories.
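
As a rough sketch of what this plan could look like with tidymodels, the example below fits a linear regression of *played_hours* on *experience*. It uses a small made-up tibble (`toy_model_data`) rather than the real data, and the object names are hypothetical. With a categorical predictor, `lm` picks the first factor level as the reference: the intercept is that group's mean playtime, and each remaining coefficient is the difference between another level's mean and the reference mean.

```r
library(tidymodels)

# Made-up rows (not the real data): two players per experience level.
toy_model_data <- tibble::tibble(
  experience   = factor(rep(c("Amateur", "Beginner", "Pro", "Veteran"), each = 2)),
  played_hours = c(1.0, 3.0, 0.5, 1.5, 4.0, 6.0, 0.0, 2.0)
)

# Model specification: ordinary least squares via the "lm" engine.
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Fit played_hours as a function of experience level.
lm_fit <- lm_spec |>
  fit(played_hours ~ experience, data = toy_model_data)

# One row per coefficient, with estimates, standard errors, and p-values.
lm_coefs <- tidy(lm_fit)
lm_coefs
```

On the real data, the heavy right skew in *played_hours* noted earlier is worth keeping in mind when reading the standard errors and p-values in this table.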