# Data Science Project: Planning Stage - **UBC Minecraft Research Server**

**Students:** Shaurya V. Shastri, Catherine Harris, Jessica Wang                                              
**Date:** 07-12-2025         
**Course:** DSCI100-009

---
GitHub Repository: https://github.com/symkk79/dsci_100_project.git

## 1. Introduction
With the rise of online gaming communities, understanding player engagement has become an essential part of managing servers and designing outreach strategies. Knowing which kinds of players are more likely to stay involved can help developers plan resources and tailor recruitment or communication campaigns.

### Question
The broad question we are focusing on is: 
> What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

We aim to answer the following specific question:
> Can a player's age predict whether they subscribe to a game-related newsletter, using the data in players.csv?

This analysis explores the relationship between player activity and continued interest in the game community.

### Dataset Description
We will be using the players dataset from a UBC Computer Science Minecraft research server, which records player activity and session behaviour in order to study engagement patterns. Two datasets were provided: players.csv describes each player, including whether they chose to subscribe to a game-related newsletter, and sessions.csv describes how individual players interact with the game environment.

**Variables**
| Variable   | Type        | Meaning |
| ---------- | ----------- | ------- |
|experience  | categorical | What category of experience the player falls into|
|subscribe   | categorical | Whether or not the player is subscribed to a game-related newsletter|
|hashedEmail | categorical | The hashed email of the player|
|played_hours| quantitative| The number of hours played|
|name        | categorical | The name of the player|
|gender      | categorical | The gender of the player|
|Age         | quantitative| The age of the player |

## 2. Method

In [None]:
library(tidyverse)
library(purrr)
library(ggplot2)

In [None]:
Players_URL <- "https://raw.githubusercontent.com/symkk79/dsci-100-project-planning-dataset/main/players.csv"
Sessions_URL <- "https://raw.githubusercontent.com/symkk79/dsci-100-project-planning-dataset-1/main/sessions.csv"
players_data <- read_csv(Players_URL)
sessions <- read_csv(Sessions_URL)

# Recode experience from a text label to a numeric level (1 = Beginner, 5 = Pro)
players_recoded <- players_data |>
    mutate(
        experience = case_when(
            experience == "Beginner" ~ 1,
            experience == "Amateur"  ~ 2,
            experience == "Regular"  ~ 3,
            experience == "Veteran"  ~ 4,
            experience == "Pro"      ~ 5
        )
    )

players_recoded

In [None]:
summary_data <- players_data |>
                summary(digits = 3)

summary_data

In [None]:
observation_count <- players_data |>
                    count()
observation_count

Data Description:

- Number of variables: 7
- Number of observations: 196
- How data was collected: The information is collected from players on a Minecraft server as they play.
- Potential issues: Using the mean may be misleading for played_hours, because its mean and median are very different. This suggests that one very large value may be influencing the played_hours mean.
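
The mean-versus-median concern above can be illustrated with a minimal sketch. The `toy_hours` values below are made up for illustration (they are not taken from players.csv); they only show how a single very large value pulls the mean far above the median.

```r
library(dplyr)

# Hypothetical played_hours values (NOT the real data):
# mostly small values plus one very large one.
toy_hours <- tibble(played_hours = c(0, 0.1, 0.2, 0.5, 1, 1.5, 2, 150))

# Compare the mean and the median of the same column
hours_summary <- toy_hours |>
  summarise(
    mean_hours   = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )

hours_summary
```

When the two summaries differ this much, the median is usually the more representative "typical" value.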

In [None]:
players_mean <- players_data |>
                select(played_hours, Age) |>
                map_dfr(mean, na.rm = TRUE)
players_mean

| Variable     | Mean |
| ------------ | ---- |
| Hours Played | 6    |
| Age          | 21   |

In [None]:
experience_vs_subscription_graph <- players_data |>
                            ggplot(aes(x = subscribe, fill = experience)) +
                            geom_bar() +
                            labs(x = "If the player subscribed", y = "Number of Players", fill = "Level of Experience") +
                            ggtitle("Experience level of player and whether they subscribed")
experience_vs_subscription_graph

In [None]:
gender_vs_subscription_graph <- players_data |>
                            filter(gender != "Prefer not to say") |>
                            ggplot(aes(x = subscribe, fill = gender)) +
                            geom_bar() +
                            labs(x = "If the player subscribed", y = "Number of Players", fill = "Gender") +
                            ggtitle("Gender of player and whether they subscribed")
gender_vs_subscription_graph

In [None]:
age_vs_subscription_graph <- players_data |>
                            ggplot(aes(x = Age, fill = subscribe)) +
                            geom_bar() +
                            labs(x = "Player's Age", y = "Number of Players", fill = "If they subscribed") +
                            ggtitle("Age of player and whether they subscribed")
age_vs_subscription_graph

In [None]:
# Hours played (restricted to under 3 hours) against subscription status
hours_vs_subscription_graph <- players_data |>
                            filter(played_hours < 3) |>
                            ggplot(aes(x = played_hours, fill = subscribe)) +
                            geom_bar() +
                            labs(x = "Playing hours", y = "Number of Players", fill = "If they subscribed") +
                            ggtitle("Hours spent playing and if players subscribed")
hours_vs_subscription_graph

Insights from the graphs: 

From these graphs, the players' level of experience, gender, and hours spent playing do not appear to be related to whether or not they subscribed. Age, however, does seem to be related: players aged about 15 and younger appear more likely to subscribe than players over 35. The relationship is weak, though.
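
One way to check an age pattern like this numerically, rather than by eye, is to compute the proportion of subscribers within age bands. The sketch below is only illustrative: `toy_players` is made-up data, and the cut-offs at 15 and 35 are assumptions taken from the ages mentioned in the paragraph above.

```r
library(dplyr)

# Hypothetical players (made-up values, NOT the real data)
toy_players <- tibble(
  Age = c(12, 14, 15, 17, 19, 21, 22, 25, 36, 40, 45, NA),
  subscribe = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
                TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Drop players with a missing age, bin the rest, and compute the
# share of subscribers in each bin (intervals are closed on the right,
# so a 15-year-old falls into "15 and under").
subscribe_by_age <- toy_players |>
  filter(!is.na(Age)) |>
  mutate(age_group = cut(Age,
                         breaks = c(0, 15, 35, Inf),
                         labels = c("15 and under", "16 to 35", "over 35"))) |>
  group_by(age_group) |>
  summarise(n_players = n(), prop_subscribed = mean(subscribe))

subscribe_by_age
```

Reporting `n_players` next to each proportion matters here, since a band with only a few players can show an extreme proportion by chance.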

Method and Plan: 

To address our question, we would use K-nearest neighbours (KNN) classification. This is appropriate because we are trying to predict which category a player falls into (subscribed or not) based on their age, and subscription is a categorical variable. One limitation is that the model looks at only one variable and ignores other variables that may also influence the prediction. The approach also assumes that age can predict whether or not players will subscribe, which may be difficult because the graph seems to show only a weak relationship between the two.

To do KNN, the data should be split after wrangling but before building the model. 75% of the data should go into the training set and the rest into the testing set, with the split shuffled and stratified. After splitting, only the training set is used to build and tune the model. Cross-validation further splits the training set into folds, each of which serves in turn as a validation set, to find the number of neighbours (k) that gives the most accurate predictions. Tuning k this way also helps prevent underfitting or overfitting the model. After tuning, the model is refit with the optimal k and the prediction is run.
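
The plan above could be sketched roughly as follows with tidymodels. This is a sketch under assumptions, not the final analysis: `toy_players` is simulated data standing in for players.csv, the seed, the 5 folds, and the grid of k values are arbitrary choices, the standardisation step is an addition that is customary for KNN, and the `tidymodels` and `kknn` packages are assumed to be installed.

```r
library(tidymodels)  # assumes tidymodels and the kknn engine package are installed

set.seed(2025)  # arbitrary seed so the split and folds are reproducible

# Simulated stand-in for players.csv (values are NOT real)
toy_players <- tibble(
  Age = sample(10:50, 200, replace = TRUE),
  subscribe = factor(sample(c("TRUE", "FALSE"), 200,
                            replace = TRUE, prob = c(0.7, 0.3)))
)

# 75/25 split, shuffled and stratified on the outcome
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardise the predictor (an added step; KNN is distance-based)
knn_recipe <- recipe(subscribe ~ Age, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the number of neighbours left to tune
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set only
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k
knn_final_spec <- nearest_neighbor(weight_func = "rectangular",
                                   neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = players_train)

# Evaluate once on the held-out testing set
test_accuracy <- predict(knn_fit, players_test) |>
  bind_cols(players_test) |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

test_accuracy
```

Because the simulated `subscribe` labels are generated independently of `Age`, the accuracy from this sketch says nothing about the real data; it only shows the order of the steps, with the testing set touched once at the very end.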

