In [2]:
library(tidyverse)
library(janitor)
library(knitr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


In [3]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
dim(players)
dim(sessions)
colnames(players)
colnames(sessions)

The dataset consists of two CSV files: players.csv and sessions.csv, which contain information about player demographics, experience, and gameplay sessions recorded from a Minecraft research server.
General Information:
players.csv: 196 observations and 7 variables
sessions.csv: 1535 observations and 5 variables

Broad Question:
What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question:
Can player engagement features such as total played hours, number of sessions, and average session duration predict whether a player subscribes to the newsletter?

Description of Variables:
To explore this question, I will use the following variables as explanatory variables:

1. played_hours and experience from players.csv
2. total_sessions and avg_session_duration, which will be computed from sessions.csv

The response variable is subscribe, which indicates whether a player has subscribed to the game-related newsletter (TRUE or FALSE).
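
The two derived variables could be computed along the following lines. This is only a sketch on a made-up stand-in for sessions.csv: the "dd/mm/yyyy HH:MM" timestamp format, the toy rows, and the helper column names start_dt, end_dt, and duration_min are assumptions, and measuring duration in minutes is one possible choice.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for sessions.csv (hypothetical rows).
# Assumption: timestamps are stored as "dd/mm/yyyy HH:MM" strings.
sessions_toy <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time    = c("30/06/2024 18:42", "01/07/2024 10:00", "02/07/2024 20:45")
)

session_features <- sessions_toy |>
  mutate(
    start_dt     = dmy_hm(start_time),   # parse the character timestamps
    end_dt       = dmy_hm(end_time),
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarise(
    total_sessions       = n(),                             # sessions per player
    avg_session_duration = mean(duration_min, na.rm = TRUE), # mean length in minutes
    .groups = "drop"
  )

session_features
```

Players who appear in players.csv but never in sessions.csv would have no row in this summary, so they need explicit handling when the two tables are combined.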

In [6]:
# Mean of every numeric column in players.csv, reshaped to one row per variable
mean_table <- players |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) |>
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Mean_Value") |>
  mutate(Mean_Value = round(Mean_Value, 2))

# Render as a table: variable names left-aligned, means right-aligned
kable(mean_table,
      caption = "Mean values of quantitative variables in players.csv",
      align = c("l", "r"))




Table: Mean values of quantitative variables in players.csv

|Variable     | Mean_Value|
|:------------|----------:|
|played_hours |       5.85|
|Age          |      21.14|

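The mean age of 21.14 suggests that the player base skews toward young adults, and the mean of played_hours shows that players have logged about 5.85 hours of total playtime on average. Note that played_hours is a cumulative total, not hours per day, and that a mean can be pulled upward by a few very active players, so it does not describe what most players do. This pattern could still be important for understanding newsletter subscriptions: players who play longer or more frequently might be more likely to stay engaged with the game community and therefore subscribe to the newsletter.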

To address the question in the final part of the project, I plan to use linear regression as the main predictive method.

Why is this method appropriate?
Linear regression is a simple and widely used method for exploring the relationship between a response variable and a set of explanatory variables. With subscribe coded as 0 (FALSE) or 1 (TRUE), each coefficient can be read as the change in the estimated probability of subscribing for a one-unit change in that feature, holding the others fixed. It can also be fitted on small samples and accepts a mix of numeric and categorical predictors.
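
As a rough illustration of what such a fit looks like, the sketch below runs lm() on a small made-up data frame. The rows and the two experience levels shown are hypothetical, and subscribe is converted to 0/1 so that the coefficients are on the probability scale.

```r
# Toy data (hypothetical values, not taken from players.csv)
toy <- data.frame(
  played_hours = c(0.1, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 20.0),
  experience   = c("Amateur", "Pro", "Amateur", "Pro",
                   "Amateur", "Pro", "Amateur", "Pro"),
  subscribe    = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Linear regression on a 0/1 response; the character column `experience`
# is converted to a factor automatically and enters as a dummy variable.
fit <- lm(as.numeric(subscribe) ~ played_hours + experience, data = toy)

# One coefficient per numeric predictor, plus one per non-baseline factor level
coef(fit)
```

The played_hours coefficient is the estimated change in subscription probability per additional hour played, and experiencePro is the estimated difference between the two experience levels at the same playtime.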

Which assumptions are required, if any, to apply the method selected?
1. Observations are independent of each other.
2. The explanatory variables are not highly correlated with each other.
3. The relationship between the response variable and each explanatory variable is approximately linear.
4. The variance of the errors is constant.
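
Some of these assumptions can be checked informally once the data are assembled. The sketch below uses simulated values (the seed, sample size, and distributions are arbitrary assumptions) to show a correlation check between predictors and a crude comparison of residual spread across low and high fitted values.

```r
set.seed(1)  # arbitrary seed, only to make the toy data reproducible

# Simulated stand-in for the modelling data (all values made up)
n <- 50
played_hours   <- rexp(n, rate = 1 / 5)
total_sessions <- rpois(n, lambda = 1 + played_hours)
subscribe      <- rbinom(n, size = 1, prob = 0.5)
toy <- data.frame(played_hours, total_sessions, subscribe)

# Assumption 2: pairwise correlation between explanatory variables
pred_cor <- cor(toy[, c("played_hours", "total_sessions")])
pred_cor

fit <- lm(subscribe ~ played_hours + total_sessions, data = toy)

# Assumptions 3 and 4: a residuals-vs-fitted plot is the usual visual check
# plot(fitted(fit), resid(fit))

# Crude numeric version of assumption 4: mean squared residual in the
# lower vs upper half of the fitted values
high_fitted  <- fitted(fit) > median(fitted(fit))
resid_spread <- tapply(resid(fit)^2, high_fitted, mean)
resid_spread
```

A correlation close to 1 or -1 between two predictors, or very different residual spread between the two halves, would be a reason to revisit the model specification.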

What are the potential limitations or weaknesses of the method selected?
1. Because the response variable is binary, linear regression can predict values below 0 or above 1, which cannot be interpreted directly as probabilities.
2. It assumes linearity and constant variance, which may not hold perfectly in this dataset.
3. It can be sensitive to outliers, especially when a few players have extremely high playtime or session counts.
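
The first limitation is easy to reproduce on a tiny made-up example: a straight line fitted to a 0/1 response keeps rising with playtime, so a sufficiently extreme player gets a fitted value above 1. The six rows and the 10-hour player below are hypothetical.

```r
# Toy data (hypothetical): subscription coded as 0/1
toy <- data.frame(
  played_hours = c(0, 1, 2, 3, 4, 5),
  subscribe    = c(0, 0, 1, 0, 1, 1)
)

fit <- lm(subscribe ~ played_hours, data = toy)

# A hypothetical heavy player, well beyond the range seen in the toy data
heavy_player <- data.frame(played_hours = 10)
pred <- predict(fit, newdata = heavy_player)

pred       # fitted value for the heavy player
pred > 1   # TRUE: the prediction is not a valid probability
```

For these toy numbers the fitted line has intercept 0 and slope 0.2, so the prediction at 10 hours is about 2.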

Data processing and validation plan:
1. Merge players.csv and sessions.csv using hashedEmail to combine player demographics and activity data.
2. Create the derived explanatory variables total_sessions and avg_session_duration from the session records.
3. Handle missing values and check for outliers.
4. Standardize numerical variables such as "played_hours" and "Age" so that their coefficients are on a comparable scale.
5. Split the data into training and testing sets to evaluate how well the model generalizes.
6. Compare model performance using metrics such as R-squared and Root Mean Squared Error.
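
The plan above can be sketched end to end as follows. Everything here is illustrative: the data are simulated, the 75/25 split and the seed are arbitrary assumptions, and the object names (players_toy, session_features_toy, model_data) are hypothetical.

```r
library(dplyr)

set.seed(2024)  # arbitrary seed so the toy data and split are reproducible

# Simulated stand-ins for players.csv and a per-player session summary
players_toy <- tibble(
  hashedEmail  = paste0("p", 1:40),
  played_hours = round(rexp(40, rate = 1 / 5), 1),
  Age          = sample(15:40, 40, replace = TRUE),
  subscribe    = sample(c(TRUE, FALSE), 40, replace = TRUE)
)
session_features_toy <- tibble(
  hashedEmail    = paste0("p", 1:30),   # 10 players have no recorded sessions
  total_sessions = sample(1:20, 30, replace = TRUE)
)

model_data <- players_toy |>
  left_join(session_features_toy, by = "hashedEmail") |>   # step 1: merge
  mutate(
    total_sessions = coalesce(total_sessions, 0L),         # step 3: no sessions -> 0
    subscribe_num  = as.numeric(subscribe),                # 0/1 response
    played_hours_z = as.numeric(scale(played_hours)),      # step 4: standardize
    age_z          = as.numeric(scale(Age))
  )

# Step 5: random 75/25 train/test split
train_idx <- sample(nrow(model_data), size = floor(0.75 * nrow(model_data)))
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]

fit <- lm(subscribe_num ~ played_hours_z + age_z + total_sessions, data = train)

# Step 6: R-squared on the training set, RMSE on the held-out test set
r_squared <- summary(fit)$r.squared
test_pred <- predict(fit, newdata = test)
rmse      <- sqrt(mean((test$subscribe_num - test_pred)^2))
```

For brevity the standardization here uses the full data set; computing the means and standard deviations on the training rows only would avoid leaking information from the test set.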