In [None]:
#load libraries:
library(tidyverse)

#data reading libraries
library(dbplyr)
library(rvest)

#data wrangling and visualization libraries
library(lubridate)
library(scales)
library(RColorBrewer)

#data classification libraries
library(tidymodels)

## 1. Data Description:

This semester, our DSCI-100 course has the opportunity to work with two real data sets provided by Frank Wood’s computer science research team. The data sets, `players_data` and `sessions_data`, describe individual player characteristics and gaming habits on the research team’s custom Minecraft server. `players_data` was collected when each user signed up, and its `played_hours` column was updated as users logged more hours on the server. `sessions_data` was also updated throughout the server’s lifespan and records the login and logout times of each gaming session.

Please see a glimpse of each data set in the cell below:

In [None]:
#reading in raw data urls
url_players <- 'https://raw.githubusercontent.com/rjmc2006/individual_project_portion_dsci100/refs/heads/main/players.csv'
url_sessions <- 'https://raw.githubusercontent.com/rjmc2006/individual_project_portion_dsci100/refs/heads/main/sessions.csv'

players_data <- read_csv(url_players)
sessions_data <- read_csv(url_sessions)

head(players_data)
head(sessions_data)

`players_data` catalogs player characteristics in the following seven variables:

1.	`experience`: The player’s level of experience with Minecraft. One of Pro, Veteran, Regular, Amateur, or Beginner. Data type: character.
2.	`subscribe`: Whether or not the user is subscribed to a game-related newsletter. Data type: logical.
3.	`hashedEmail`: Identifier for an individual participant in the server. Each user has a unique hashedEmail that crosses over to `sessions_data`. Data type: character.
4.	`played_hours`: Total number of hours a user has played on the server. Data type: double.
5.	`name`: First name of the user. Data type: character.
6.	`gender`: The gender of the user. Either Male, Female, Non-binary, Two-spirited, Agender, Other, or Prefer not to say. Data type: character.
7.	`Age`: The age of the user. Data type: double.

Please find key summary statistics for `players_data` in the cell below:

In [None]:
players_data_stats <- players_data |>
    summarize(played_hours_max = max(played_hours, na.rm = TRUE),
              played_hours_min = min(played_hours, na.rm = TRUE),
              played_hours_mean = mean(played_hours, na.rm = TRUE),
              Age_max = max(Age, na.rm = TRUE), Age_min = min(Age, na.rm = TRUE),
              Age_mean = mean(Age, na.rm = TRUE),
              number_of_observations = n(), # counts every row, including rows with NA values
              total_hours_played = sum(played_hours, na.rm = TRUE))

players_data_stats

`sessions_data` logs each gaming session on the server in the following five variables:

1.	`hashedEmail`: Same as the `hashedEmail` variable in `players_data`. Used to identify players. Data type: character.
2.	`start_time`: The start of a gaming session, formatted as day/month/year followed by the time of day (24-hour clock). Data type: character.
3.	`end_time`: The end of a gaming session, formatted as day/month/year followed by the time of day (24-hour clock). Data type: character.
4.	`original_start_time`: Data type: double.
5.	`original_end_time`: Data type: double.

Please find key summary statistics for `sessions_data` below:

In [None]:
players_sessions_stats <- sessions_data |>
    # split each timestamp into its date and time of day, then into hours and minutes
    separate(col = start_time, into = c('start_date', 'start_time'), sep = ' ') |>
    separate(col = start_time, into = c('start_hour', 'start_min'), sep = ':') |>
    separate(col = end_time, into = c('end_date', 'end_time'), sep = ' ') |>
    separate(col = end_time, into = c('end_hour', 'end_min'), sep = ':') |>
    mutate(start_hour = as.numeric(start_hour), start_min = as.numeric(start_min),
           end_hour = as.numeric(end_hour), end_min = as.numeric(end_min),
           # %% (24 * 60) keeps sessions that cross midnight positive (assumes no session lasts over 24 hours)
           session_length_in_mins = ((end_hour * 60 + end_min) - (start_hour * 60 + start_min)) %% (24 * 60)) |>
    summarize(avg_start_hour = mean(start_hour, na.rm = TRUE), avg_end_hour = mean(end_hour, na.rm = TRUE),
              avg_session_length_in_mins = mean(session_length_in_mins, na.rm = TRUE),
              number_of_sessions = n())

players_sessions_stats

There are some initial problems with the data sets that I should point out. `players_data` contains NA values in some cells, which cause issues when computing summary statistics and creating key visualizations. Additionally, `start_time` and `end_time` are untidy: each cell holds both a date and a time of day, so these columns need to be separated into date and time columns. If they aren’t, we won’t be able to manipulate them effectively.
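As a rough sketch of one way to handle the combined date and time values, the strings can be parsed into proper date-time objects with `lubridate`, which also handles sessions that run past midnight. The `toy_sessions` table and its rows are made up for illustration and are not taken from `sessions_data`; the sketch assumes the day/month/year hour:minute layout described above.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Made-up example rows (NOT real session data) in "day/month/year hour:minute" form
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "18/06/2024 00:03")
)

toy_sessions <- toy_sessions |>
    mutate(start_dt = dmy_hm(start_time),   # parse character strings into date-times
           end_dt = dmy_hm(end_time),
           # difference in minutes; the date is part of the value, so midnight is not a problem
           session_length_in_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")))

toy_sessions
```

Working with parsed date-times instead of separate hour and minute columns keeps the date attached to each time, at the cost of one extra parsing step.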

Further along in my project …

## 2. Questions:
For the individual portion of my project, I’ve chosen the broad question: What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? 

With this in mind, I aim to answer the following specific question: Can the total number of hours played by a user, along with their age, predict whether or not they’re likely to subscribe to a Minecraft-related newsletter?

To answer this question, I can build a K-nearest neighbors classification model that uses the `Age` and `played_hours` variables from `players_data` as predictors and the `subscribe` variable as the class label. These columns won’t require much wrangling before going into my `workflow()`: apart from converting `subscribe` to a factor, the data only need to be split into training and testing sets before building the model.
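The plan above might look something like the following minimal sketch. It runs on a simulated stand-in table (`toy_players`, not the real `players_data`), and the seed, the 75/25 split, and `neighbors = 5` are arbitrary placeholders rather than tuned choices. It also assumes the `kknn` package is installed and that rows with missing `Age` values would be dropped from the real data first.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Simulated stand-in for players_data (NOT the real data), with the three columns the model uses
toy_players <- tibble(
    Age = sample(10:50, 100, replace = TRUE),
    played_hours = round(rexp(100, rate = 0.2), 1),
    subscribe = sample(c(TRUE, FALSE), 100, replace = TRUE, prob = c(0.7, 0.3))
) |>
    mutate(subscribe = factor(subscribe))  # classification needs a factor outcome

# split into training and testing sets, keeping class proportions similar
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# standardize predictors so neither variable dominates the distance calculation
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    fit(data = players_train)

# predict on the held-out test set and compute accuracy
knn_predictions <- predict(knn_fit, players_test) |>
    bind_cols(players_test)

knn_predictions |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy")
```

Because the stand-in labels are random, the accuracy printed here says nothing about the real players; in the actual analysis the number of neighbors would be chosen by cross-validation on the training set.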

## 3. Exploratory Data Analysis and Visualization: