In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

To start, I load the packages this project needs in the cell above.

Next, I will read the two CSV data files, 'players.csv' and 'sessions.csv', into this project so we can see what we're working with.

In [None]:
players <- read_csv('players.csv')
sessions <- read_csv('sessions.csv')

players
sessions


After loading both data files, we can summarize their contents. The first dataset, players.csv, lists unique players recorded during the data collection period, while the second dataset, sessions.csv, logs every game session initiated by these players. For the players dataset, there are 196 rows, each representing one of the 196 unique players. 

The dataset has 7 columns.
 - The 'experience' column classifies each player as 'Beginner', 'Amateur', 'Regular', 'Pro', or 'Veteran'.

 - The 'subscribe' column records whether the player is subscribed to a game-related newsletter (TRUE/FALSE).

 - The 'hashedEmail' column uniquely identifies each player.

 - The 'played_hours' column records total playtime in hours.

 - The 'name' column labels the player.

 - The 'gender' column records the player's gender.

 - The 'Age' column records the player's age as a number.



The second dataset, 'sessions.csv', has 1535 rows, each representing one session.

There are also 5 columns within this dataset. 
 - The 'hashedEmail' column provides the hashed email of each player, allowing us to know which player is associated with that session.
   
 - The 'start_time' and 'end_time' columns provide the date and time the player started and ended the session.

 - The 'original_start_time' and 'original_end_time' columns provide the start and end time of the session in Unix timestamp format (milliseconds since January 1, 1970).
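
Since the 'original_start_time' and 'original_end_time' columns are described as milliseconds since January 1, 1970, the following is a minimal sketch of how such values could be turned into readable date-times. The helper name `ms_to_datetime` and the example values are made up for illustration, and the UTC time zone is an assumption; the sessions may have been recorded in a different zone.

```r
# Hypothetical helper: convert a Unix timestamp in milliseconds to a date-time.
# as.POSIXct() expects seconds, so divide by 1000 first. UTC is an assumption.
ms_to_datetime <- function(ms) {
  as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
}

# Made-up example values: the epoch itself and exactly one day later
example_ms <- c(0, 86400000)
ms_to_datetime(example_ms)
```

Applied to the real data, this could be used inside `mutate()` on the sessions data frame to compare the converted values against 'start_time' and 'end_time'.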



Overall, the 'players' dataset appears mostly clean but has a few issues. The Age column has missing values that need to be addressed before modelling. The name variable is unlikely to provide analytical value beyond labelling. Less visible concerns include sampling bias, as the dataset may not represent the entire player population, and the reliability of self-reported measures: if played_hours was not automatically logged, it could be inaccurate. Despite these limitations, the dataset offers a solid foundation for examining how experience, age, gender, and subscription status influence player engagement.

The 'sessions' dataset likewise provides a strong foundation for exploring the frequency and duration of play. When combined with players.csv, it can reveal how factors like experience level, age, or subscription status influence overall session behavior and time investment.
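
The missing Age values mentioned above can be counted before deciding how to handle them. The sketch below uses a small made-up tibble (`toy_players`) in place of the real data so that it runs on its own; dropping incomplete rows with `drop_na()` is only one possible choice, not necessarily the one this project needs.

```r
library(tidyverse)

# Toy stand-in for the players data; these values are invented for illustration.
toy_players <- tibble(
  played_hours = c(30.3, 0.0, 1.5, 0.7),
  Age          = c(9, 17, NA, 21)
)

# Count missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# One option (an assumption, not the only approach): drop rows with no Age
toy_complete <- toy_players |>
  drop_na(Age)
toy_complete
```

Running the same `summarise()` call on `players` would show how many rows are affected in the real data.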

We selected Question 1, which asks: Which player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? More specifically, we ask whether a player’s experience level, age, and total playtime can predict subscription status. To explore this, we used players.csv, focusing on subscribe as the response variable and experience, age, and played_hours as explanatory variables.

I will now wrangle the data to prepare it for the analysis, as well as make a few exploratory visualizations of the data to help me understand it.

In [None]:


# Treat the categorical columns as factors
players <- players |>
  mutate(
    experience = as.factor(experience),
    gender = as.factor(gender)
  )

# Rename columns to consistent snake_case
players_wrangled <- players |>
  rename(
    age = Age,
    hashed_email = hashedEmail
  )

# Mean playtime and mean age, ignoring missing values
players_mean <- players_wrangled |>
  summarise(across(c(played_hours, age), ~ mean(.x, na.rm = TRUE)))

players_wrangled
players_mean

In [None]:
# Bar chart of how many players fall into each experience level
experience_plot <- players_wrangled |>
  ggplot(aes(x = experience, fill = experience)) +
  geom_bar() +
  labs(
    title = "Distribution of Player Experience Levels",
    x = "Experience Level",
    y = "Number of Players"
  )
experience_plot

In [None]:
# Mean playtime for subscribers and non-subscribers
subscribed_summarised <- players_wrangled |>
  group_by(subscribe) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))

# Bar chart of the two means, labelled with their rounded values
subscribed_plot <- subscribed_summarised |>
  ggplot(aes(x = subscribe, y = mean_hours, fill = subscribe)) +
  geom_col(width = 0.6, alpha = 0.8) +
  geom_text(aes(label = round(mean_hours, 2)), vjust = -0.5, size = 4) +
  labs(
    title = "Average Playtime by Subscription Status",
    x = "Subscription Status",
    y = "Average Played Hours"
  )
subscribed_plot

In [None]:
# Scatter plot of age against total playtime
age_plot <- players_wrangled |>
  ggplot(aes(x = age, y = played_hours)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Relationship Between Age and Playtime",
    x = "Age (years)",
    y = "Total Played Hours"
  )
age_plot

The exploratory plots revealed several trends. The Distribution of Player Experience Levels plot shows that most players are Amateur or Regular, suggesting the platform attracts mid-level users. The Average Playtime by Subscription Status plot shows that subscribers spend more time playing on average than non-subscribers, indicating that engagement may be related to subscription. The Relationship Between Age and Playtime plot shows no strong correlation, suggesting age is not a major factor in playtime variation.
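
The visual impression of the age and playtime relationship can be backed by a number. The sketch below computes a Pearson correlation on a small made-up tibble (`toy_age_hours`, invented for illustration) so that it runs on its own; on the real data the same calls would be applied to `players_wrangled`. Pearson correlation is sensitive to extreme playtimes, so `method = "spearman"` may be worth comparing.

```r
library(tidyverse)

# Toy stand-in with invented values; not taken from the real dataset.
toy_age_hours <- tibble(
  age          = c(17, 21, NA, 19, 25, 30, 22),
  played_hours = c(30.3, 0.0, 1.5, 0.7, 3.8, 0.1, 12.4)
)

# Drop incomplete rows, then compute the Pearson correlation coefficient
age_hours_cor <- toy_age_hours |>
  drop_na(age, played_hours) |>
  summarise(r = cor(age, played_hours)) |>
  pull(r)
age_hours_cor
```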

To address our question, we will apply a k-Nearest Neighbours (k-NN) classification model. This method is suitable because subscribe is binary and k-NN classifies observations based on similarity across multiple predictors. It makes few assumptions about data distribution and can capture non-linear relationships. Categorical variables such as experience will be numerically encoded (e.g., values 0–4), and all numeric variables will be standardized so features on different scales contribute equally to distance calculations. While k-NN can be sensitive to irrelevant features and the choice of k, cross-validation on the training data will be used to select the optimal value. The dataset will be split into 70% training and 30% testing subsets, and performance will be assessed using accuracy and precision. This approach will enable a reliable exploration of how player traits relate to the likelihood of newsletter subscription.
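
The plan above could be sketched with tidymodels roughly as follows. This is an illustration under stated assumptions, not the final analysis: it runs on simulated data (`toy_players`) that only mimics the column names of `players_wrangled`, it assumes the kknn package is installed, and the seed, the 5 folds, the grid of odd k values from 1 to 15, and dropping rows with missing age are all arbitrary choices made for the example.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed, only so the sketch is reproducible

# Simulated stand-in for players_wrangled (invented values, same column names)
exp_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")
toy_players <- tibble(
  experience   = sample(exp_levels, 200, replace = TRUE),
  age          = sample(c(15:45, NA), 200, replace = TRUE),
  played_hours = round(rexp(200, rate = 0.2), 1),
  subscribe    = sample(c(TRUE, FALSE), 200, replace = TRUE, prob = c(0.7, 0.3))
)

# Drop missing ages, make the response a factor, encode experience as 0-4
knn_data <- toy_players |>
  drop_na(age) |>
  mutate(
    subscribe = factor(subscribe, levels = c(TRUE, FALSE)),
    experience_num = as.integer(factor(experience, levels = exp_levels)) - 1L
  ) |>
  select(subscribe, experience_num, age, played_hours)

# 70% training / 30% testing split, stratified by the response
players_split <- initial_split(knn_data, prop = 0.70, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize all predictors so each contributes equally to distances
knn_recipe <- recipe(subscribe ~ ., data = players_train) |>
  step_normalize(all_predictors())

# k-NN specification with k left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation on the training data only
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_folds,
            grid = tibble(neighbors = seq(1, 15, by = 2))) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_spec) |>
  fit(data = players_train)

# Evaluate on the held-out test set
test_predictions <- predict(final_fit, players_test) |>
  bind_cols(players_test)

test_predictions |> accuracy(truth = subscribe, estimate = .pred_class)
test_predictions |> precision(truth = subscribe, estimate = .pred_class,
                              event_level = "first")
```

Because `TRUE` is the first factor level, `event_level = "first"` makes precision describe predicted subscribers. Since the simulated response is random, the metrics printed by this sketch carry no meaning; only the structure of the workflow is intended to transfer to the real data.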