# DSCI 100 Final Project - Group 05 - [Title]
**Group Members:** Caitlyn Woods, Amy Zhang, Ziyang Shen


## Introduction

In this report, we analyze data collected by a UBC Computer Science research group, using strategies taught in DSCI 100, to answer a research question. Before discussing the research question and datasets, we briefly introduce the tools and strategies used throughout the report. We use summarizing, visualizing, and modelling to better understand the data we have been provided and to derive useful information from it. These strategies are explained in the "Methods" section. All code in this report is written in R and uses functions from several libraries, notably the tidyverse (which includes ggplot2) and tidymodels. When we refer to a "dataset", we mean a specific table of data; an "observation" is a row of that table, and a "variable" is a column.

In this report, we will explore the broad question: "We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players." (DSCI 100, "Project Planning Stage Instructions"). To do so, we will answer the specific research question, "Can the total number of hours a player has accumulated (from players.csv) and the duration of their previous sessions (from sessions.csv) predict whether they will start a new session in the next 24 hours?". By answering the research question, we will learn whether our method for predicting when players will be online (within 24 hours) is successful, thus providing a starting point for predicting demand for more precise time frames. 

To answer the research question, we will use two datasets, "players.csv" and "sessions.csv". The first dataset, which we will call "players" in this report, describes individual players: name, gender, age, hours played, experience level, email (hashed), and whether they are subscribed to the newsletter. The second dataset, which we will call "sessions", records every session played by every player, including the start time, end time, and email (hashed). The start and end times are given both in DD/MM/YYYY HH:MM format and as an "original time", a numeric timestamp that our code treats as a count of milliseconds. To answer our research question, we will combine the two datasets by player email (hashed) so that each player's sessions can be examined alongside their total hours played. To do so, we first group the sessions dataset by email (hashed) and summarize each player's sessions into a single row.

## Methods

To address our research question, whether a player's total accumulated hours (from players) and the duration of their previous sessions (from sessions) can predict whether they will start a new session within the next 24 hours, we performed a complete data-analysis workflow in R. This section describes the full sequence of methods used, from loading the data to building and evaluating the predictive model. All analysis was completed in R, using functions from the tidyverse and tidymodels libraries, with the kknn engine for K-nearest neighbors classification.

**1. Loading the Data**

We begin by importing the two datasets, players.csv and sessions.csv, into R. Each dataset is loaded as a tibble to support tidyverse workflows.
A random seed is set before the modelling steps so that the data split, cross-validation folds, and model results are reproducible.
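
As a small illustration of why the seed matters, the sketch below (not part of the analysis; the seed value and the vector being shuffled are arbitrary) resets the same seed before two random draws and checks that the draws match.

```r
# Setting the same seed before a random operation reproduces its result
set.seed(123)
first_draw <- sample(1:10)

set.seed(123)
second_draw <- sample(1:10)

# TRUE: both shuffles are identical because the seed was reset in between
identical(first_draw, second_draw)
```

Without the second `set.seed()` call, the two shuffles would generally differ, which is why the seed is fixed before any random splitting.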


In [None]:
# Load package + Read data
library(tidyverse)

players  <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

In [None]:
# Calculate the duration of each session (in hours),
# treating the original start and end times as milliseconds

ms_per_hour <- 1000 * 60 * 60 
ms_per_hour

sessions_time <- sessions |>
  mutate(duration_hours =
         (original_end_time - original_start_time) / ms_per_hour)
head(sessions_time)
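
The conversion above can be checked by hand on a single made-up session. This is a minimal sketch that assumes, as the analysis does, that the original times are millisecond timestamps; `start_ms` and `end_ms` are hypothetical values, not rows from sessions.csv.

```r
ms_per_hour <- 1000 * 60 * 60

# Hypothetical session: starts at some millisecond timestamp, ends 90 minutes later
start_ms <- 1719770000000
end_ms   <- start_ms + 90 * 60 * 1000

# Same formula as in the analysis: difference in milliseconds, divided by ms per hour
duration_hours <- (end_ms - start_ms) / ms_per_hour
duration_hours

# If the values are Unix-epoch milliseconds (an assumption), dividing by 1000
# gives seconds, which base R can turn into a date-time
start_time <- as.POSIXct(start_ms / 1000, origin = "1970-01-01", tz = "UTC")
start_time
```

A 90-minute session gives a duration of 1.5 hours, which confirms that the division uses the intended unit.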

In [None]:
# Summarise the sessions of each player
sessions_summary <- sessions_time |>
  group_by(hashedEmail) |>
  summarise(
    n_sessions = n(),
    avg_session_duration = mean(duration_hours),
    # The most recent session start time
    last_start = max(original_start_time),
    # Number of sessions (including the last one) that started in the
    # 24 hours up to and including the most recent start
    n_in_last_24h = sum(original_start_time >= last_start - 24 * ms_per_hour),
    # 1 if more than one session started in that 24-hour window, otherwise 0
    start_within_24h = as.integer(n_in_last_24h > 1))

sessions_summary
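
To see how the `start_within_24h` label behaves, the sketch below applies the same logic to two made-up players. The emails and start times are hypothetical, and start times are written in hours and converted to milliseconds only for readability.

```r
library(dplyr)

ms_per_hour <- 1000 * 60 * 60

# Hypothetical start times: player "a" plays at hours 0, 30 and 40;
# player "b" plays at hours 0 and 50
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b", "b"),
  original_start_time = c(0, 30, 40, 0, 50) * ms_per_hour)

toy_labels <- toy_sessions |>
  group_by(hashedEmail) |>
  summarise(
    last_start = max(original_start_time),
    n_in_last_24h = sum(original_start_time >= last_start - 24 * ms_per_hour),
    start_within_24h = as.integer(n_in_last_24h > 1))

toy_labels
```

Player "a" has two starts in the 24 hours ending at their last session (hours 30 and 40), so the label is 1. Player "b" has only the last session in that window, so the label is 0. The label therefore describes whether a player's final session followed another session by at most 24 hours.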

In [None]:
# Combine "players" and "sessions_summary" together
data_full <- players |>
  inner_join(sessions_summary, by = "hashedEmail")
# inner_join() matches rows of the two tables on hashedEmail and keeps only
# players that appear in both, so players with no recorded sessions are dropped
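
Because an inner join silently drops unmatched rows, it can be useful to check which players are lost. The following is a small sketch on made-up tables (`toy_players` and `toy_summary` are hypothetical); `anti_join()` returns the rows of the first table that have no match in the second.

```r
library(dplyr)

# Hypothetical players table: player "c" has never played a session
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  played_hours = c(10, 2, 0))

# Hypothetical per-player session summary: no row for player "c"
toy_summary <- tibble(
  hashedEmail = c("a", "b"),
  n_sessions  = c(3L, 1L))

# Players present in both tables
joined <- inner_join(toy_players, toy_summary, by = "hashedEmail")

# Players that the inner join drops
dropped <- anti_join(toy_players, toy_summary, by = "hashedEmail")

nrow(joined)
dropped
```

Running the same `anti_join()` on the real `players` and `sessions_summary` tables would show how many players are excluded from the model because they have no sessions.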

In [None]:
# Summary
data_full |>
  summarise(
    n_players = n(),
    prop_start_24h = mean(start_within_24h),
    median_hours = median(played_hours),
    median_sessions = median(n_sessions),
    median_duration = median(avg_session_duration))

## Visualization

In [None]:
# Figure 1

ggplot(data_full, aes(x = played_hours)) +
  geom_histogram(bins = 20) +
  labs(
    title = "Figure 1. Distribution of total hours played",
    x = "Total hours played",
    y = "Count")

In [None]:
#Figure 2

duration_summary <- data_full |>
  mutate(start_within_24h = factor(start_within_24h,
                                   levels = c(0, 1),
                                   labels = c("No", "Yes"))) |>
  group_by(start_within_24h) |>
  summarise(mean_duration = mean(avg_session_duration, na.rm = TRUE))

duration_summary


ggplot(duration_summary,
       aes(x = start_within_24h, y = mean_duration)) +
  geom_col() +
  labs(
    title = "Figure 2. Mean session duration by return status",
    x = "Returned within 24 hours?",
    y = "Mean session duration (hours)") 


In [None]:
# KNN classification
library(tidymodels)
set.seed(123)

In [None]:
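# Classification needs a factor outcome, so convert the 0/1 label first.
# Rows with missing predictors are dropped so that every player can be predicted.
data_model <- data_full |>
  drop_na(played_hours, n_sessions, avg_session_duration, Age) |>
  mutate(start_within_24h = factor(start_within_24h, levels = c(0, 1)))

player_split <- initial_split(data_model, prop = 0.8,
                              strata = start_within_24h)

player_train <- training(player_split)
player_test  <- testing(player_split)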

In [None]:
player_recipe <- recipe(
  start_within_24h ~ played_hours + n_sessions + avg_session_duration + Age,
  data = player_train) |>
  step_normalize(all_numeric_predictors())
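
The normalization step matters because K-nearest neighbors relies on distances, and a predictor measured on a large scale would otherwise dominate. As a rough sketch of what centring and scaling does to one column, using made-up values rather than the project data:

```r
# Hypothetical total hours played for five players
played_hours <- c(0, 1, 2, 10, 50)

# Subtract the mean and divide by the standard deviation
normalized <- (played_hours - mean(played_hours)) / sd(played_hours)

# After normalization the column has mean 0 and standard deviation 1
mean(normalized)
sd(normalized)
```

In the recipe, the mean and standard deviation are estimated from the training data and then reused for new data, which this one-vector sketch does not show.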

In [None]:
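# Mark the number of neighbours as a value to be tuned
knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

In [None]:
knn_workflow <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_spec)

In [None]:
player_folds <- vfold_cv(player_train,
                         v = 10,
                         strata = start_within_24h)

In [None]:
# Tune the number of neighbours with cross-validation.
# The candidate values 1 to 15 are an assumption; the original code does not give a grid.
k_grid <- tibble(neighbors = 1:15)

knn_tuned <- knn_workflow |>
  tune_grid(resamples = player_folds, grid = k_grid)

collect_metrics(knn_tuned)

In [None]:
# Keep the number of neighbours with the highest cross-validated accuracy
best_k <- select_best(knn_tuned, metric = "accuracy")

final_knn <- knn_workflow |>
  finalize_workflow(best_k) |>
  fit(data = player_train)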
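
To show what the number of neighbours means for a single prediction, here is a minimal base R sketch of K-nearest neighbors on made-up, already-scaled data. It is a plain, unweighted majority vote and is only an illustration; the kknn engine used above may weight neighbours by distance, so this is not a reimplementation of the fitted model.

```r
# Hypothetical training points (two scaled predictors) and their labels
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5),
                  ncol = 2, byrow = TRUE)
train_y <- c("No", "No", "No", "Yes", "Yes", "Yes")

# Hypothetical new player to classify
new_x <- c(4.5, 5)
k <- 3

# Euclidean distance from the new point to every training point
dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))

# Labels of the k closest training points
nearest <- order(dists)[1:k]
votes <- table(train_y[nearest])

# Predicted class: the most common label among the k neighbours
prediction <- names(votes)[which.max(votes)]
prediction
```

Here the three closest points all carry the label "Yes", so the new point is classified as "Yes". A larger k would pull in more distant points, which is the trade-off that cross-validation is used to settle.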

In [None]:
player_test_complete <- player_test |>
  drop_na(played_hours, n_sessions, avg_session_duration, Age)


pred_test <- predict(final_knn, new_data = player_test_complete) |>
  bind_cols(player_test_complete |> select(start_within_24h))


metrics(pred_test,
        truth = start_within_24h,
        estimate = .pred_class)


conf_mat(pred_test,
         truth = start_within_24h,
         estimate = .pred_class)

* The final KNN model reached an accuracy of 80%. It predicts non-returning players quite well, but it has more difficulty identifying players who return within 24 hours.
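
The difference between overall accuracy and per-class performance can be made concrete with a small base R sketch. The counts below are entirely hypothetical and are not the project's confusion matrix; they were chosen only so that the overall accuracy comes out at 80%.

```r
# Hypothetical test set: 15 players did not return, 5 returned
truth <- factor(c(rep("No", 15), rep("Yes", 5)), levels = c("No", "Yes"))

# Hypothetical predictions: 14 of the 15 "No" players and 2 of the 5 "Yes" players are correct
pred <- factor(c(rep("No", 14), "Yes", rep("No", 3), rep("Yes", 2)),
               levels = c("No", "Yes"))

cm <- table(Prediction = pred, Truth = truth)

# Overall accuracy: correct predictions divided by all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Recall per class: share of each true class that was predicted correctly
recall_no  <- cm["No", "No"]   / sum(cm[, "No"])
recall_yes <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Baseline: accuracy of always predicting the most common class
baseline_accuracy <- max(table(truth)) / length(truth)

c(accuracy = accuracy, recall_no = recall_no,
  recall_yes = recall_yes, baseline = baseline_accuracy)
```

In this made-up case the accuracy is 0.8, but only 40% of returning players are identified, and always predicting "No" would already reach 0.75. Comparing the real confusion matrix against such a baseline would show how much the model adds beyond the class imbalance.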