# TITLE

## Introduction

A research group in Computer Science at UBC, led by Frank Wood, has set up a Minecraft server to explore how people play and develop interest in video games. To plan for the financial and technical needs of this project, the researchers must know which kinds of players are likely to join the server and how many resources these users will consume. The following analysis addresses the research group’s broad question: “What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?” More specifically, our analysis explores whether age, total hours played, and mean session length can be used to predict if a player will subscribe to a game-related newsletter, and whether this relationship varies with experience level.

### Question 
#### Broad Question: 
Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
<br>
<br>
#### Specific Question:
Can hours played, age, and average session length predict whether a user is subscribed to a game-related newsletter? Additionally, does this relationship differ by experience level?

### Data
To analyze our question, we will use the following two data sets: 
<br>
<br>


`players.csv` contains information about the users of the Minecraft server, PLAICraft. There are 196 observations and 7 variables as follows: 
- `experience` - character: player's self-reported experience level (`Beginner`, `Amateur`, `Regular`, `Veteran`, or `Pro`)
- `subscribe` - logical: player's subscription status to a game-related newsletter (`TRUE` or `FALSE`)
- `hashedEmail` - character: player's hashed email address
- `played_hours` - double: total hours played by each user 
- `name` - character: player's first name
- `gender` - character: player's gender
- `Age` - double: player's age
<br>
<br>

`sessions.csv` contains information about sessions played on PLAICraft. There are 1535 observations and 5 variables as follows:
- `hashedEmail` - character: player's hashed email address
- `start_time` - character: session start time in `dd/mm/yyyy hh:mm` format
- `end_time` - character: session end time in `dd/mm/yyyy hh:mm` format
- `original_start_time` - double: session start time in milliseconds since January 1st, 1970 in Coordinated Universal Time (UNIX time)
- `original_end_time` - double: session end time in milliseconds since January 1st, 1970 in Coordinated Universal Time (UNIX time)
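
As a minimal sketch of how a UNIX timestamp in milliseconds maps to a readable datetime, the example below converts a few made-up values with `lubridate::as_datetime()`. The vector `unix_ms` is hypothetical and is not taken from `sessions.csv`; the analysis itself does not rely on this conversion.

```r
library(lubridate)

# Toy UNIX timestamps in milliseconds (made-up values, not from sessions.csv)
unix_ms <- c(0, 86400000, 1700000000000)

# as_datetime() interprets numbers as seconds since 1970-01-01 UTC,
# so the millisecond values are divided by 1000 first
as_utc <- as_datetime(unix_ms / 1000, tz = "UTC")
as_utc
```

The division by 1000 is the step that is easiest to forget: without it, the values would be read as seconds and land far in the future.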

## Methods and Results 

The code below loads the necessary libraries and uses the `options` function to limit the display of data frames to the first 10 rows. It then reads the players and sessions data into the project from a GitHub repository.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(lubridate)
library(ggplot2)
options(repr.matrix.max.rows = 10) 

players_data <- read_csv("https://raw.githubusercontent.com/skylv777/dsci100_group_project/refs/heads/main/players.csv")
head(players_data)

sessions_data <- read_csv("https://raw.githubusercontent.com/skylv777/dsci100_group_project/refs/heads/main/sessions.csv")
head(sessions_data)

The code below converts the `start_time` and `end_time` variables of the sessions data frame to the datetime (`dttm`) type. This tidies the data and allows the two times to be subtracted to create the new column `session_length`. The session lengths are then averaged per player and joined to the players data by `hashedEmail`, producing the `combined_data` data frame used in the rest of the analysis.

In [None]:
# Convert start_time and end_time from character to datetime (dttm) so they can be subtracted
sessions_data_tidy <- sessions_data |>
        mutate(start_time = dmy_hm(start_time),
               end_time = dmy_hm(end_time))
# Create the session_length column in minutes
# (units are set explicitly because `end_time - start_time` lets difftime choose them automatically)
sessions_data_difference <- sessions_data_tidy |>
        mutate(session_length = as.double(difftime(end_time, start_time, units = "mins")))
# Determine the average session length (in minutes) per player
average_sessions_data <- sessions_data_difference |>
        group_by(hashedEmail) |>
        summarize(average_session_length = mean(session_length))
# Combine the data sets by hashedEmail; merge() keeps only players with at least one recorded session
combined_data <- merge(players_data, average_sessions_data, by = "hashedEmail")
head(combined_data)
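
Because `merge()` performs an inner join by default, players with no recorded sessions do not appear in `combined_data`. The sketch below, built on made-up toy tables (`toy_players` and `toy_sessions` are hypothetical stand-ins, not the project data), shows one way to check how many players such a join drops, using `dplyr::anti_join()`.

```r
library(dplyr)

# Toy stand-ins for players_data and average_sessions_data (made-up values)
toy_players <- tibble(hashedEmail = c("a", "b", "c"),
                      played_hours = c(1.5, 0, 3.2))
toy_sessions <- tibble(hashedEmail = c("a", "c"),
                       average_session_length = c(30, 45))

# An inner join keeps only players that appear in both tables
toy_combined <- inner_join(toy_players, toy_sessions, by = "hashedEmail")

# anti_join() lists the players that the inner join drops
toy_dropped <- anti_join(toy_players, toy_sessions, by = "hashedEmail")

nrow(toy_combined)
toy_dropped
```

Running the same kind of check on the real tables would show whether the combined data set still represents all player types, or mostly those who played at least once.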

In [None]:
# Summary statistics for the quantitative variables in combined_data
combined_data_select <- select(combined_data, Age, played_hours, average_session_length)

# Helper: most frequent value of a vector (first one if there are ties), ignoring NAs
get_mode <- function(x) {
    x <- na.omit(x)
    if (length(x) == 0) return(NA_real_)
    tibble(val = x) |>
        count(val, sort = TRUE) |>
        filter(n == max(n)) |>
        slice_head(n = 1) |>
        pull(val)
}

# One row per statistic, one column per variable
players_data_stats <-
        bind_rows(map_df(combined_data_select, mean, na.rm = TRUE),
                  map_df(combined_data_select, median, na.rm = TRUE),
                  map_df(combined_data_select, get_mode),
                  map_df(combined_data_select, min, na.rm = TRUE),
                  map_df(combined_data_select, max, na.rm = TRUE),
                  map_df(combined_data_select, sd, na.rm = TRUE),
                  map_df(combined_data_select, ~ quantile(.x, probs = 0.25, na.rm = TRUE)[[1]]),
                  map_df(combined_data_select, ~ quantile(.x, probs = 0.5, na.rm = TRUE)[[1]]),
                  map_df(combined_data_select, ~ quantile(.x, probs = 0.75, na.rm = TRUE)[[1]])) |>
        mutate(Summary = c("Mean", "Median", "Mode", "Minimum", "Maximum", "Standard Deviation", "1st Quartile", "2nd Quartile", "3rd Quartile")) |>
        relocate(Summary) |>
        mutate(across(Age:average_session_length, \(x) round(x, digits = 2)))

players_data_stats
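
As an alternative sketch, a similar table can be built with `summarize(across(...))` and reshaped to one row per variable and statistic. The data frame `toy` below holds made-up values, not the project data, and includes one very large `played_hours` value to illustrate how a single heavy player pulls the mean and standard deviation well above the median.

```r
library(dplyr)
library(tidyr)

# Toy data with one extreme played_hours value (made-up, for illustration only)
toy <- tibble(Age = c(17, 21, 21, 30),
              played_hours = c(0, 0.5, 2, 150))

toy_stats <- toy |>
  # Compute each statistic for every column; names look like "Age__mean"
  summarize(across(everything(),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        median = \(x) median(x, na.rm = TRUE),
                        sd = \(x) sd(x, na.rm = TRUE)),
                   .names = "{.col}__{.fn}")) |>
  # Split the names back into a variable column and a statistic column
  pivot_longer(everything(),
               names_to = c("variable", "statistic"),
               names_sep = "__",
               values_to = "value")

toy_stats
```

A large gap between the mean and the median of a variable is one sign of right skew, in which case the median and quartiles usually describe a typical player better than the mean and standard deviation do.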

In [None]:
# Scatter plot of total hours played against age, coloured by subscription status
options(repr.plot.height = 10, repr.plot.width = 15)
hours_age_subscribe_plot <- players_data |>
        ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
        geom_point() +
        labs(x = "Age",
             y = "Total hours played",
             colour = "Subscription Status to Game-Related Newsletter") +
        ggtitle("Total hours played versus Age")
hours_age_subscribe_plot
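
The specific question also asks whether subscription differs by experience level. One simple way to look at this, sketched here on made-up data (`toy_players` is a hypothetical stand-in for `players_data`), is to compute the proportion of subscribers within each experience group. Because `subscribe` is logical, its mean is the share of `TRUE` values.

```r
library(dplyr)

# Toy stand-in for players_data (made-up values)
toy_players <- tibble(
  experience = c("Beginner", "Beginner", "Pro", "Pro", "Veteran"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Number of players and proportion subscribed in each experience group
toy_rates <- toy_players |>
  group_by(experience) |>
  summarize(n_players = n(),
            subscribe_rate = mean(subscribe))

toy_rates
```

Reporting `n_players` alongside the rate helps avoid over-reading differences that come from very small groups.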

## Discussion 