# Individual Planning Report

## Project: Predicting Newsletter Subscription Based on Player Demographics

This report is a planning proposal for a data science project using data collected from UBC's Minecraft server. The goal is to investigate whether the collected player data can predict whether a player will subscribe to a game-related newsletter.

## 1. Data Description
This project uses two datasets collected from UBC's Minecraft server:
- **players.csv**: Contains one row per player. Each row holds demographic information and whether the player is subscribed to the newsletter.
- **sessions.csv**: Contains one row per play session. Each row holds the player's hashed email and the start and end times of the session.

If necessary, players can be linked to their individual sessions through the hashed email, a variable present in both datasets.
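The link between the two tables can be sketched as a join on `hashedEmail`. The example below is only an illustration: it uses small made-up tibbles (`players_toy` and `sessions_toy`, both hypothetical) instead of the real files, and it assumes that a per-player session count is the quantity of interest.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for players.csv and sessions.csv (all values are made up)
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("01/05/2024 10:00", "02/05/2024 18:30", "03/05/2024 09:15")
)

# Count sessions per player using the shared key
session_counts <- sessions_toy |>
  count(hashedEmail, name = "n_sessions")

# left_join keeps every player; players with no sessions get NA, replaced by 0
players_with_sessions <- players_toy |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_with_sessions
```

A left join is used here so that players without any recorded session are not dropped from the player table.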

To understand the data, each dataset will be previewed and its variables explained before summary statistics are calculated. The first dataset is players.csv, which contains the player data.


In [None]:
library(tidyverse)

players <- read_csv("Data/players.csv")

head(players, n = 10)

As shown above, there are **196** observations across **7** variables.
- **experience**: A character variable with five possible values describing the player's level of experience with Minecraft
- **subscribe**: A logical variable indicating whether the player has subscribed to the newsletter
- **hashedEmail**: A character variable that gives an encoded version of the player's email
- **played_hours**: A double variable that gives the total number of hours the player has spent playing the game
- **name**: A character variable that gives the player's name
- **gender**: A character variable that gives the player's gender
- **Age**: A double variable that gives the player's age

Now, summary statistics will be calculated for the dataframe:

In [None]:
# Convert non-numerical variables to factors so percentages can be calculated
players_factor <- players |>
  mutate(
    experience = factor(experience),
    gender = factor(gender),
    subscribe = factor(subscribe)
  )

# Summary statistics for age
age_summary <- players_factor |>
  summarise(
    mean_age = round(mean(Age, na.rm = TRUE), 2),
    median_age = median(Age, na.rm = TRUE),
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE)
  )

age_summary

# Summary statistics for hours played
hours_summary <- players_factor |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    median_played_hours = median(played_hours, na.rm = TRUE),
    min_played_hours = min(played_hours, na.rm = TRUE),
    max_played_hours = max(played_hours, na.rm = TRUE)
  )

hours_summary

# Counts and percentages for gender, experience and subscription status
gender_pct <- players_factor |>
  count(gender) |>
  mutate(percentage = round(n / sum(n) * 100, 2)) |>
  rename(number = n)

experience_pct <- players_factor |>
  count(experience) |>
  mutate(percentage = round(n / sum(n) * 100, 2)) |>
  rename(number = n)

subscribe_pct <- players_factor |>
  count(subscribe) |>
  mutate(percentage = round(n / sum(n) * 100, 2)) |>
  rename(number = n)

gender_pct
experience_pct
subscribe_pct

Now, the same process will be applied to the sessions.csv dataset, which records each individual play session on the Minecraft server.

In [None]:
sessions <- read_csv("Data/sessions.csv")

head(sessions, n = 10)

There are **1535** observations and **5** variables in this dataset.
- **hashedEmail**: As in the players dataset, a character variable that gives an encoded version of the player's email
- **start_time**: A character representation of the date (dd/mm/yy) and time (hh:mm) at which the player started the play session
- **end_time**: A character representation of the date (dd/mm/yy) and time (hh:mm) at which the player ended the play session
- **original_start_time**: A double giving the number of milliseconds between January 1st, 1970 (the Unix epoch) and the start of the session
- **original_end_time**: A double giving the number of milliseconds between January 1st, 1970 (the Unix epoch) and the end of the session
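As a rough illustration of how a milliseconds-since-epoch value maps back to a readable date-time, the sketch below converts a single number with lubridate. The value `ms_since_epoch` is hypothetical and not taken from sessions.csv, and UTC is assumed as the time zone.

```r
library(lubridate)

# Hypothetical epoch timestamp in milliseconds (not from the real data)
ms_since_epoch <- 1.71e12

# as_datetime() interprets numbers as seconds since 1970-01-01,
# so the millisecond value is divided by 1000 first
converted <- as_datetime(ms_since_epoch / 1000, tz = "UTC")

converted
```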

Once again, summary statistics will be calculated:

In [None]:
library(lubridate)

#advice on how to convert using lubridate taken from here:
#https://www.r-bloggers.com/2024/09/mastering-date-and-time-data-in-r-with-lubridate/
sessions_summary <- sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Request minutes explicitly: plain subtraction lets R pick the difftime units automatically
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  summarise(
    mean_minutes = round(mean(session_minutes, na.rm = TRUE), 2),
    median_minutes = round(median(session_minutes, na.rm = TRUE), 2),
    min_minutes = round(min(session_minutes, na.rm = TRUE), 2),
    max_minutes = round(max(session_minutes, na.rm = TRUE), 2)
  )

sessions_summary


**Limitations**

Although the two datasets are useful, they have some limitations. First, both contain missing (NA) values, so methods such as mean imputation will be needed before values can be predicted accurately. Second, the data collection may have been biased: recruitment appears to have been targeted at university students, as the mean and median ages suggest. Results from this study may therefore hold for UBC students and their tendencies, but not for the general population, which spans a wider range of ages. Nevertheless, this study should provide useful insight into whether a person will subscribe to a game-related newsletter based on their demographics.
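Mean imputation, mentioned above as one way to handle missing values, can be sketched as follows. This is a minimal example on a made-up tibble (`players_toy`, hypothetical), not the project's final preprocessing; it assumes that only the numeric columns `Age` and `played_hours` are imputed.

```r
library(dplyr)
library(tibble)

# Toy data with missing values (all numbers are made up)
players_toy <- tibble(
  Age          = c(17, 21, NA, 19),
  played_hours = c(0.5, NA, 3.0, 1.5)
)

# Count the missing values in each column
na_counts <- players_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts

# Replace each NA with the mean of the observed values in the same column
players_imputed <- players_toy |>
  mutate(across(
    c(Age, played_hours),
    ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)
  ))

players_imputed
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, so it may be worth comparing against simply dropping incomplete rows.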