In [2]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [5]:
players <- read_csv("https://raw.githubusercontent.com/sstephaniewu/video_game_project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/sstephaniewu/video_game_project/refs/heads/main/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# 1) Data Description
This project uses two datasets stored as CSV files: "players.csv" and "sessions.csv". Both come from a research study run by a computer science group at UBC on a Minecraft server. "players.csv" contains one row per player, and "sessions.csv" contains one row per play session.

### "players.csv"
* 196 total observations, each representing a different player
* 7 variables:
    * experience (chr): a categorical variable showing a player's self-reported experience level ("Beginner", "Amateur", "Regular", "Veteran", "Pro").
    * subscribe (lgl): a true/false (boolean) variable showing whether a player is subscribed to a gaming newsletter.
    * hashedEmail (chr): a categorical variable used as a unique identifier for each individual player.
    * played_hours (dbl): a numerical variable recording the total number of hours a player has played.
    * name (chr): a categorical variable for the player's name in the server.
    * gender (chr): a categorical variable for the player's gender.
    * Age (dbl): a numerical variable for the player's age.
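
Since `experience` is read in as a character column, it may be convenient to convert it to a factor before summarizing or modelling. The following is a minimal sketch, not part of the original analysis: `players_toy` and `experience_levels` are hypothetical names, the rows are made up, and the level order simply follows the order in which the levels are listed above.

```r
library(dplyr)

# Level order taken from the variable description above (an assumption about ordering)
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical stand-in for the real players data; values are invented
players_toy <- tibble(experience = c("Pro", "Amateur", "Beginner", "Amateur"))

# Convert the character column to a factor with an explicit level order
players_factored <- players_toy |>
    mutate(experience = factor(experience, levels = experience_levels))

# Counts are now reported in the chosen level order rather than alphabetically
players_factored |>
    count(experience)
```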

### "sessions.csv"
* 1535 total observations, each representing one play session
* 5 variables:
    * hashedEmail (chr): a categorical variable used as a unique identifier for each individual player.
    * start_time (chr): the start time of a play session, stored as a character string.
    * end_time (chr): the end time of a play session, stored as a character string.
    * original_start_time (dbl): a numerical variable storing start time as a Unix timestamp.
    * original_end_time (dbl): a numerical variable storing end time as a Unix timestamp.

### Potential Issues with Data:
* There are several NA values in the "players.csv" and "sessions.csv" datasets, specifically in the columns "Age", "end_time", and "original_end_time". Since these are minor issues, I will remove the affected rows.
* Players are unevenly distributed across some variables. For example, only 14 players self-report as "Pro", compared with 63 who self-report as "Amateur". Furthermore, the median hours played (0.1) is much lower than the mean hours played (5.85), meaning most players have very few hours while a few have many.
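
One way to check for missing values and then drop the affected rows is sketched below. This is an illustration under assumptions rather than the code used in this project: `players_toy` is a hypothetical tibble with invented values, standing in for the real `players` data.

```r
library(dplyr)

# Hypothetical stand-in for the real players data; values are invented
players_toy <- tibble(
    experience   = c("Pro", "Amateur", "Veteran", "Regular"),
    played_hours = c(30.3, 0.1, 0.0, 1.5),
    Age          = c(17, NA, 21, 19)
)

# Count the NA values in every column
na_counts <- players_toy |>
    summarize(across(everything(), ~ sum(is.na(.x))))

# Keep only the rows where Age is present
players_complete <- players_toy |>
    filter(!is.na(Age))

na_counts
players_complete
```

The same pattern could be applied to `end_time` and `original_end_time` in the sessions data.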

### Summary Statistics:

In [7]:
# Count the number of players at each self-reported experience level
avg_num_players <- players |>
    group_by(experience) |>
    summarize(count = n())

# Summarize total hours played across all players, ignoring missing values
avg_hours <- players |>
    summarize(mean_hours = round(mean(played_hours, na.rm = TRUE), 2),
              median_hours = median(played_hours, na.rm = TRUE),
              max_hours = max(played_hours, na.rm = TRUE),
              min_hours = min(played_hours, na.rm = TRUE))

avg_num_players
avg_hours

| experience (chr) | count (int) |
|---|---|
| Amateur | 63 |
| Beginner | 35 |
| Pro | 14 |
| Regular | 36 |
| Veteran | 48 |


| mean_hours (dbl) | median_hours (dbl) | max_hours (dbl) | min_hours (dbl) |
|---|---|---|---|
| 5.85 | 0.1 | 223.1 | 0 |


# 2) Questions
This project addresses the broad question: "What player characteristics and behaviours are most predictive of subscribing to the game-related newsletter?" My more specific question is: "Can a player's total play time, total number of sessions, and self-reported experience level predict whether they subscribe to the newsletter?"

To do this, the "sessions.csv" dataset will be aggregated to one row per player, then joined to the "players.csv" dataset on the "hashedEmail" identifier, producing a single tidy dataset where each row represents a player.
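
A minimal sketch of this aggregate-then-join step follows. It is an illustration under assumptions rather than the final wrangling code: `players_toy`, `sessions_toy`, `session_counts`, `player_data`, and `total_sessions` are hypothetical names, all values are invented, and players with no recorded sessions are assumed to have a session count of zero.

```r
library(dplyr)

# Hypothetical stand-ins for the real data; values are invented
players_toy <- tibble(
    hashedEmail  = c("a1", "b2", "c3"),
    experience   = c("Pro", "Amateur", "Veteran"),
    subscribe    = c(TRUE, FALSE, TRUE),
    played_hours = c(30.3, 0.1, 0.0)
)

sessions_toy <- tibble(
    hashedEmail = c("a1", "a1", "b2")
)

# Aggregate sessions to one row per player
session_counts <- sessions_toy |>
    count(hashedEmail, name = "total_sessions")

# Join onto the player table, keeping every player;
# players with no sessions get a count of 0 (an assumption)
player_data <- players_toy |>
    left_join(session_counts, by = "hashedEmail") |>
    mutate(total_sessions = coalesce(total_sessions, 0L))

player_data
```

A left join from the player table keeps players who never appear in the sessions data, which an inner join would silently drop.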