In [None]:
library(tidyverse)
library(janitor)
library(knitr)
library(skimr)       

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

In [None]:
dim(players)
dim(sessions)
colnames(players)
colnames(sessions)

The dataset consists of two CSV files: players.csv and sessions.csv, which contain information about player demographics, experience, and gameplay sessions recorded from a Minecraft research server.
General Information:
players.csv: 196 observations and 7 variables
sessions.csv: 1535 observations and 5 variables

Broad Question:
What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question:
Can player engagement features such as total played hours, number of sessions, and average session duration predict whether a player subscribes to the newsletter?

Description of Variables:
To explore this question, I will use the following explanatory variables:

played_hours and experience from players.csv
total_sessions and avg_session_duration, which will be computed from sessions.csv

The response variable is subscribe, which indicates whether a player has subscribed to the game-related newsletter (TRUE or FALSE).
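A minimal sketch of how total_sessions and avg_session_duration might be computed per player. The toy data and the column names player_id, start_time, and end_time are assumptions for illustration only; the actual names should be taken from colnames(sessions).

```r
library(dplyr)

# Toy stand-in for sessions.csv. The column names player_id, start_time and
# end_time are hypothetical; replace them with the real column names.
toy_sessions <- tibble(
  player_id  = c("a", "a", "b"),
  start_time = as.POSIXct(c("2024-05-01 10:00:00", "2024-05-02 18:00:00",
                            "2024-05-01 09:30:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("2024-05-01 10:30:00", "2024-05-02 19:30:00",
                            "2024-05-01 09:45:00"), tz = "UTC")
)

session_features <- toy_sessions |>
  # Length of each session in minutes
  mutate(duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
  group_by(player_id) |>
  # One row per player: session count and mean session length
  summarise(
    total_sessions       = n(),
    avg_session_duration = mean(duration_min),
    .groups = "drop"
  )

session_features
```

The resulting per-player table could then be joined to players with left_join() on whichever identifier the two files share.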

In [None]:
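# Compute the mean of every numeric column in players.csv, ignoring missing values
mean_table <- players |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) |>
  # Reshape to one row per variable for display
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Mean_Value") |>
  mutate(Mean_Value = round(Mean_Value, 2))

# Render the summary as a formatted table
kable(mean_table,
      caption = "Mean values of quantitative variables in players.csv",
      align = c("l", "r"))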


In [None]:
The table above reports the mean of each numeric variable in players.csv, rounded to two decimal places.

Finally, to address the specific question, I plan to use linear regression as the main predictive method.

Why is this method appropriate?
Linear regression is a simple and widely used method for exploring the relationship between a response variable and one or more explanatory variables. With subscribe coded as 0/1, each coefficient can be read as the estimated change in the predicted probability of subscribing for a one-unit increase in that feature, holding the other features fixed. It can also be fit on relatively small samples and accommodates both numeric and categorical predictors.
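As a rough illustration of the planned model, the sketch below fits a linear regression to a 0/1 response with lm(). The data frame toy_players and all of its values are made up; it only stands in for the joined player table.

```r
# Toy data standing in for the joined player table; all values are invented.
toy_players <- data.frame(
  subscribe            = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  played_hours         = c(12.5, 0.2, 3.1, 30.0, 0.0, 1.4, 0.5, 8.0),
  total_sessions       = c(20, 1, 6, 45, 1, 3, 2, 15),
  avg_session_duration = c(40, 10, 35, 55, 5, 25, 12, 30)
)

# Code the logical response explicitly as 1 (subscribed) / 0 (not subscribed).
toy_players$subscribe_num <- as.numeric(toy_players$subscribe)

fit <- lm(subscribe_num ~ played_hours + total_sessions + avg_session_duration,
          data = toy_players)

# Intercept plus one slope per explanatory variable.
coef(fit)
```

On real data, summary(fit) would additionally report standard errors and R-squared for the same model.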

Which assumptions are required, if any, to apply the method selected?
1. Observations are independent of each other.
2. The explanatory variables are not highly correlated with each other.
3. The relationship between the response variable and the explanatory variables is approximately linear.
4. The variance of the errors is constant.
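Some of these assumptions can be checked informally in code. The sketch below uses simulated data (not the study data) to show one possible approach: a correlation matrix for assumption 2 and a residuals-versus-fitted plot for assumptions 3 and 4. Independence (assumption 1) depends on how the data were collected and is not checked here.

```r
set.seed(1)  # simulated data, for illustration only

n <- 60
played_hours         <- rexp(n, rate = 0.2)
total_sessions       <- rpois(n, lambda = 2 + played_hours)
avg_session_duration <- runif(n, min = 5, max = 90)
subscribe_num        <- rbinom(n, size = 1, prob = 0.6)

toy <- data.frame(subscribe_num, played_hours, total_sessions, avg_session_duration)

# Assumption 2: pairwise correlations among the explanatory variables.
pred_cor <- cor(toy[, c("played_hours", "total_sessions", "avg_session_duration")])
round(pred_cor, 2)

fit <- lm(subscribe_num ~ played_hours + total_sessions + avg_session_duration,
          data = toy)

# Assumptions 3 and 4: look for curvature or a changing spread in the residuals.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With a binary response the residuals fall into two bands, one for each observed value, so this plot is harder to read than it would be for a continuous response.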

What are the potential limitations or weaknesses of the method selected?
1. Because the response variable is binary, linear regression can produce predicted values below 0 or above 1, which cannot be interpreted directly as probabilities.
2. It assumes linearity and constant variance, which may not hold perfectly in this dataset.
3. It can be sensitive to outliers, especially when a few players have extremely high playtime or session counts.
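The first limitation can be seen in a small made-up example. In the sketch below, a straight line fitted to six invented 0/1 outcomes gives predictions below 0 for the lowest playtime and above 1 for the highest.

```r
# Six invented players: the three with the lowest playtime did not subscribe.
toy <- data.frame(
  played_hours  = 1:6,
  subscribe_num = c(0, 0, 0, 1, 1, 1)
)

fit <- lm(subscribe_num ~ played_hours, data = toy)

# Predict at the observed range and beyond it.
new_players <- data.frame(played_hours = c(0, 1, 6, 10))
pred <- predict(fit, newdata = new_players)

# The smallest prediction is negative and the largest exceeds 1,
# so neither can be read as a probability.
range(pred)
```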

How are you going to compare and select the model?


How are you going to process the data to apply the model? 
