# DSCI 100 Final Project - Group 05 - [Title]
**Group Members:** Caitlyn Woods, Amy Zhang


## Introduction

In this report, we analyze data collected by a UBC Computer Science research group, using strategies taught in DSCI 100, to answer a research question. Before discussing the research question and datasets, we briefly introduce the tools and terminology used throughout. We summarize, visualize, and model the data to derive useful information from it; these strategies are explained in the "Methods" section. All code in this report is written in R and uses functions from several libraries, notably the tidyverse (which includes ggplot2), lubridate, and class. Throughout, a "dataset" is a specific table of data, an "observation" is a row of that table, and a "variable" is a column.

The broad question motivating this report comes from the project instructions: "We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players." (DSCI 100, "Project Planning Stage Instructions"). To address it, we answer the specific research question, "Can the total number of hours a player has accumulated (from players.csv) and the duration of their previous sessions (from sessions.csv) predict whether they will start a new session in the next 24 hours?" Answering this question tells us whether these two variables are useful for predicting if a player will be online within a 24-hour window, which provides a starting point for forecasting demand over more precise time frames.

To answer the research question, we use two datasets, "players.csv" and "sessions.csv". The first dataset, which we call "players" in this report, describes individual players: each player's name, gender, age, hours played, experience level, hashed email, and whether they are subscribed to the newsletter. The second dataset, which we call "sessions", is a record of every session played by every player, including its start time, end time, and the player's hashed email. The start and end times are given both in DD/MM/YYYY HH:MM format and as an "original time", a standardized timestamp format often used in computer science. To answer our research question, we first group the sessions dataset by hashed email to summarize each player's sessions, and then combine the two datasets by hashed email so that each player's session history and total hours played appear together.

## Methods

To address our research question, whether a player's total accumulated hours (from players) and the duration of their previous sessions (from sessions) can predict whether they will start a new session within the next 24 hours, we carry out a complete data-analysis workflow in R. This section describes the full sequence of methods, from loading the data to building and evaluating the predictive model. The analysis uses functions from the tidyverse, lubridate, and class libraries.

**1. Loading the Data**

We begin by importing the two datasets, players.csv and sessions.csv, into R. Each dataset is loaded as a tibble to support tidyverse workflows.
A random seed is set at the beginning of the analysis so that steps involving randomness (such as the train-test split and cross-validation) give the same results on every run.
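A minimal sketch of this loading step is shown below. The `data/` folder, the seed value, and the object names are assumptions for illustration, not part of the project instructions; the `file.exists()` guard only keeps the sketch runnable when the files are stored elsewhere.

```r
library(tidyverse)

# Fix the random seed so that later random steps give the same result
# on every run. The value 2024 is arbitrary (an assumption).
set.seed(2024)

# Assumed file locations: both CSV files sit in a "data" folder.
players_path  <- "data/players.csv"
sessions_path <- "data/sessions.csv"

if (file.exists(players_path) && file.exists(sessions_path)) {
  # read_csv() returns a tibble, which works directly with tidyverse verbs.
  players  <- read_csv(players_path, show_col_types = FALSE)
  sessions <- read_csv(sessions_path, show_col_types = FALSE)

  # Quick look at column names and types before any wrangling.
  glimpse(players)
  glimpse(sessions)
} else {
  message("Data files not found; adjust players_path and sessions_path.")
}
```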


**2. Data Wrangling and Cleaning**


2.1 Preparing the sessions dataset

The sessions dataset contains multiple rows per player, one per game session. To construct features meaningful for prediction, we:

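- Convert the start and end times into date-time objects using lubridate.
- Compute the duration of each session in hours.
- Group sessions by hashed email to calculate:
  - the average session duration per player,
  - the number of historical sessions per player,
  - the timestamp of the player's most recent session.
- Create a new binary outcome variable indicating whether a player starts a new session within 24 hours of the end of a reference session. Because the data contain no session after a player's latest recorded one, the reference session must be an earlier one (for example, the second most recent), with predictors computed only from sessions up to that point.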
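The wrangling steps above can be sketched on a small made-up table. The column names (`hashedEmail`, `start_time`, `end_time`) and all values are assumptions standing in for sessions.csv. Since nothing is recorded after a player's final session, this sketch labels each session by whether the same player's next recorded session starts within 24 hours of its end; that is one possible reading of the outcome, not necessarily the definition used in the final analysis.

```r
library(tidyverse)
library(lubridate)

# Toy stand-in for sessions.csv (hypothetical values and column names).
sessions <- tibble(
  hashedEmail = c("a1", "a1", "a1", "b2", "b2"),
  start_time  = c("01/05/2024 10:00", "01/05/2024 20:00", "03/05/2024 09:00",
                  "02/05/2024 12:00", "02/05/2024 18:30"),
  end_time    = c("01/05/2024 11:30", "01/05/2024 20:45", "03/05/2024 10:00",
                  "02/05/2024 13:00", "02/05/2024 19:00")
)

# Parse DD/MM/YYYY HH:MM strings and compute each session's length in hours.
sessions_clean <- sessions |>
  mutate(
    start_time     = dmy_hm(start_time),
    end_time       = dmy_hm(end_time),
    duration_hours = as.numeric(difftime(end_time, start_time, units = "hours"))
  )

# One row per player: average duration, session count, latest session start.
session_summary <- sessions_clean |>
  group_by(hashedEmail) |>
  summarize(
    avg_session_hours  = mean(duration_hours),
    n_sessions         = n(),
    last_session_start = max(start_time),
    .groups = "drop"
  )

# Label each session: does the player's next session start within 24 hours
# of this session's end? A player's final session has no successor in the
# data, so it is labelled FALSE here (an assumption; it could be dropped).
sessions_labelled <- sessions_clean |>
  group_by(hashedEmail) |>
  arrange(start_time, .by_group = TRUE) |>
  mutate(
    gap_hours = as.numeric(difftime(lead(start_time), end_time, units = "hours")),
    returns_within_24h = !is.na(gap_hours) & gap_hours <= 24
  ) |>
  ungroup()

session_summary
sessions_labelled
```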


2.2 Preparing the players dataset

The players dataset contains one row per player. We:

- Select relevant player-level variables (age, gender, hours played, experience level), along with the hashed email needed for merging.
- Ensure each column has an appropriate data type (e.g., factors for categorical variables, numeric values for continuous variables).
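As a sketch of these two steps, the example below uses a made-up players table. The column names (`hashedEmail`, `Age`, `played_hours`, `experience`, `subscribe`) and the values are assumptions and may differ from the real file.

```r
library(tidyverse)

# Toy stand-in for players.csv (hypothetical values and column names).
players <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  name         = c("Ada", "Ben", "Cy"),
  gender       = c("Female", "Male", "Non-binary"),
  Age          = c(19, 24, 31),
  played_hours = c(12.5, 0.3, 48.0),
  experience   = c("Amateur", "Beginner", "Pro"),
  subscribe    = c(TRUE, FALSE, TRUE)
)

players_clean <- players |>
  # Keep the join key plus the player-level variables of interest.
  select(hashedEmail, age = Age, gender, played_hours, experience) |>
  mutate(
    # Categorical variables become factors.
    gender     = as_factor(gender),
    experience = as_factor(experience),
    # Continuous variables are stored as numeric values.
    age          = as.numeric(age),
    played_hours = as.numeric(played_hours)
  )

glimpse(players_clean)
```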


2.3 Merging datasets

We join the cleaned players and sessions-summary datasets by the hashed email field. This produces a tidy, one-row-per-player dataset containing both player-level and aggregate session-level features.
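A minimal sketch of the join, using two small made-up tables with an assumed key column `hashedEmail`:

```r
library(tidyverse)

# Hypothetical cleaned inputs; names and values are for illustration only.
players_clean <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(12.5, 0.3, 48.0)
)
session_summary <- tibble(
  hashedEmail       = c("a1", "b2"),
  avg_session_hours = c(1.08, 0.75),
  n_sessions        = c(3L, 2L)
)

# inner_join() keeps only players who appear in both tables, so every row
# has both player-level and session-level features.
player_data <- inner_join(players_clean, session_summary, by = "hashedEmail")

# Players with no recorded sessions are dropped by the inner join;
# anti_join() lists them so the loss is visible.
players_without_sessions <- anti_join(players_clean, session_summary,
                                      by = "hashedEmail")

player_data
players_without_sessions
```

Using `left_join()` instead would keep players without sessions, with missing values in the session columns; which behaviour is preferable depends on how such players should be treated in the model.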


**3. Exploratory Data Analysis (EDA)**

To better understand the data before modelling, we produce both numerical summaries and visualizations:


3.1 Numerical Summaries

We compute summary statistics (mean, median, range, standard deviation) for the numeric variables relevant to prediction:

- total hours played,
- number of sessions,
- average session duration,
- age.

For experience level, which is categorical, we report the number of players in each category instead.

These statistics allow us to inspect variable distributions and identify outliers or unusual patterns.
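One way to produce such a summary table is sketched below on made-up data; the variable names are assumptions carried over for illustration.

```r
library(tidyverse)

# Hypothetical merged data (one row per player).
player_data <- tibble(
  played_hours      = c(12.5, 0.3, 48.0, 5.1, 2.2),
  n_sessions        = c(3, 2, 10, 4, 1),
  avg_session_hours = c(1.1, 0.75, 2.0, 0.9, 0.5),
  age               = c(19, 24, 31, 22, 17),
  experience        = factor(c("Amateur", "Beginner", "Pro", "Amateur", "Beginner"))
)

# Mean, median, range (min and max), and standard deviation of each
# numeric variable, with one row per variable.
numeric_summary <- player_data |>
  select(where(is.numeric)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarize(
    mean   = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    min    = min(value, na.rm = TRUE),
    max    = max(value, na.rm = TRUE),
    sd     = sd(value, na.rm = TRUE),
    .groups = "drop"
  )

# Categorical variables are summarized by counts rather than means.
experience_counts <- count(player_data, experience)

numeric_summary
experience_counts
```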


3.2 Exploratory Visualizations

To visualize relationships between predictors and the outcome, we create at least one exploratory figure (e.g., Figure 1), such as:

- a histogram or density plot of total hours played,
- a boxplot comparing average session duration for players who did vs. did not start a session within 24 hours.

Each figure is labeled with a figure number and legend.
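The two example figures could be drafted roughly as follows. The data are made up, and the variable names, titles, and bin count are assumptions to be adjusted for the real data.

```r
library(tidyverse)

# Hypothetical merged data with the binary outcome as a factor.
player_data <- tibble(
  played_hours       = c(12.5, 0.3, 48.0, 5.1, 2.2, 7.8, 20.4, 1.0),
  avg_session_hours  = c(1.1, 0.75, 2.0, 0.9, 0.5, 1.4, 1.8, 0.4),
  returns_within_24h = factor(c("yes", "no", "yes", "no", "no", "yes", "yes", "no"))
)

# Histogram of total hours played.
hours_hist <- ggplot(player_data, aes(x = played_hours)) +
  geom_histogram(bins = 10) +
  labs(
    x = "Total hours played",
    y = "Number of players",
    title = "Figure 1: Distribution of total hours played"
  )

# Boxplot of average session duration, split by the outcome.
duration_box <- ggplot(
  player_data,
  aes(x = returns_within_24h, y = avg_session_hours, fill = returns_within_24h)
) +
  geom_boxplot() +
  labs(
    x = "Started a new session within 24 hours",
    y = "Average session duration (hours)",
    fill = "Returned within 24 h",
    title = "Average session duration by outcome"
  )

hours_hist
duration_box
```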


**4. Modeling Approach: K-Nearest Neighbors (KNN) Classification**

4.1 Rationale for KNN

We use a K-Nearest Neighbors (KNN) classifier to predict whether a player will start a session in the next 24 hours. This method is appropriate because:

- The outcome is binary.
- KNN is taught in the course and suitable for classification tasks.
- It makes few distributional assumptions.
- With proper preprocessing, it can handle both numeric and categorical predictors.

4.2 Assumptions

KNN relies on several assumptions and requirements:

- Similarity assumption: players close in feature space should have similar outcomes.
- Meaningful distance metric: distances between players (typically Euclidean) must reflect genuine similarity, so predictors need to be on comparable scales.
- Feature standardization: numeric predictors are centered and scaled so that no variable dominates the distance because of its units.
- Adequate local data: sufficient nearby neighbors must exist for reliable predictions.
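To illustrate why scaling matters for the distance metric, the small example below compares Euclidean distances between three made-up players before and after standardization. The values and variable names are hypothetical.

```r
# Three hypothetical players: total hours played and average session length.
players <- data.frame(
  played_hours      = c(10, 15, 40),
  avg_session_hours = c(0.5, 3.0, 0.6),
  row.names         = c("A", "B", "C")
)

# Euclidean distances on the raw scale: played_hours has a much larger
# spread, so it dominates the distance.
raw_dist <- as.matrix(dist(players))

# Euclidean distances after centering and scaling each column to
# mean 0 and standard deviation 1.
scaled_dist <- as.matrix(dist(scale(players)))

# How much farther is C than B from player A, before and after scaling?
raw_ratio    <- raw_dist["A", "C"] / raw_dist["A", "B"]
scaled_ratio <- scaled_dist["A", "C"] / scaled_dist["A", "B"]

round(raw_dist, 2)
round(scaled_dist, 2)
c(raw = raw_ratio, scaled = scaled_ratio)
```

On the raw scale, player C looks far more distant from A than B does, almost entirely because of the difference in total hours. After standardization, both variables contribute on a similar scale, so the large difference in session length between A and B counts as well.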


**5. Data Processing for Modeling**

5.1 Train–Test Split

Before any preprocessing, we split the data into:

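- 80% training set: used for preprocessing, tuning, and cross-validation,
- 20% test set: used only for final evaluation.

Because the test set plays no part in preprocessing or tuning, information from it cannot leak into the model, and the final evaluation reflects performance on unseen data.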
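A simple way to make such a split with base R is sketched below on simulated data. The column names, sample size, and seed are assumptions; the tidymodels function `initial_split()` is an alternative for the same task.

```r
# Reproducible random numbers (seed value is arbitrary).
set.seed(2024)

# Simulated stand-in for the merged player data (50 hypothetical players).
player_data <- data.frame(
  player_id          = 1:50,
  played_hours       = rexp(50, rate = 0.1),
  avg_session_hours  = rexp(50, rate = 1),
  returns_within_24h = factor(sample(c("no", "yes"), 50, replace = TRUE))
)

# Randomly choose 80% of the row numbers for training.
n          <- nrow(player_data)
train_rows <- sample(n, size = round(0.8 * n))

# The remaining 20% of rows form the test set and are set aside.
player_train <- player_data[train_rows, ]
player_test  <- player_data[-train_rows, ]

nrow(player_train)
nrow(player_test)
```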


5.2 Feature Preprocessing

To prepare predictors for KNN:

- Numeric features (hours played, average session duration, number of sessions, age) are standardized.
- Categorical features are converted into numerical indicators as taught in the course.
- Preprocessing steps are learned from the training set and applied identically to the test set.
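The sketch below shows one way to learn the preprocessing from the training set and reuse it on the test set, using base R. The data, the variable names, and the helper `prepare_features()` are hypothetical.

```r
# Hypothetical training and test sets; experience uses the same factor
# levels in both so the indicator columns line up.
experience_levels <- c("Beginner", "Amateur", "Pro")
player_train <- data.frame(
  played_hours = c(12.5, 0.3, 48.0, 5.1),
  age          = c(19, 24, 31, 22),
  experience   = factor(c("Amateur", "Beginner", "Pro", "Amateur"),
                        levels = experience_levels)
)
player_test <- data.frame(
  played_hours = c(7.0, 30.0),
  age          = c(20, 28),
  experience   = factor(c("Pro", "Beginner"), levels = experience_levels)
)

# Learn the centering and scaling values from the training set only.
numeric_cols <- c("played_hours", "age")
train_center <- colMeans(player_train[numeric_cols])
train_scale  <- sapply(player_train[numeric_cols], sd)

# Hypothetical helper: standardize numeric columns with the training
# statistics and turn the factor into 0/1 indicator columns.
prepare_features <- function(df) {
  numeric_part <- scale(as.matrix(df[numeric_cols]),
                        center = train_center, scale = train_scale)
  indicator_part <- model.matrix(~ experience - 1, data = df)
  cbind(numeric_part, indicator_part)
}

x_train <- prepare_features(player_train)
x_test  <- prepare_features(player_test)

x_train
x_test
```

The test set is transformed with the training means and standard deviations, so its columns will generally not have mean 0 and standard deviation 1 themselves; that is expected.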


**6. Model Training and Cross-Validation**

We evaluate several values of k (e.g., 3, 5, 7, 9, …) using 10-fold cross-validation on the training data.
For each value of k, we record the classification accuracy.

The optimal value of k is chosen as the one with the highest cross-validated accuracy, and that value is used to fit the final KNN model.

We visualize the cross-validation results (e.g., Figure 2), showing accuracy as a function of k.
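A hand-written version of 10-fold cross-validation with `knn()` from the class package is sketched below on simulated data. The data, the candidate values of k, and the seed are assumptions. For brevity, the predictors are standardized once before folding; a stricter version would recompute the scaling inside each fold.

```r
library(class)
library(ggplot2)

set.seed(2024)

# Simulated, standardized training predictors for 100 hypothetical players.
n <- 100
x_train <- scale(cbind(
  played_hours      = rexp(n, rate = 0.1),
  avg_session_hours = rexp(n, rate = 1)
))
# Simulated outcome loosely related to session length.
y_train <- factor(
  ifelse(x_train[, "avg_session_hours"] + rnorm(n, sd = 0.5) > 0, "yes", "no"),
  levels = c("no", "yes")
)

# Assign each training row to one of 10 folds at random.
folds    <- sample(rep(1:10, length.out = n))
k_values <- c(3, 5, 7, 9)

# For each k, hold out each fold in turn, predict it from the other nine,
# and average the ten accuracies.
cv_accuracy <- sapply(k_values, function(k) {
  fold_accuracy <- sapply(1:10, function(f) {
    held_out <- folds == f
    predicted <- knn(
      train = x_train[!held_out, , drop = FALSE],
      test  = x_train[held_out, , drop = FALSE],
      cl    = y_train[!held_out],
      k     = k
    )
    mean(predicted == y_train[held_out])
  })
  mean(fold_accuracy)
})

cv_results <- data.frame(k = k_values, accuracy = cv_accuracy)

# which.max() returns the first maximum, so ties go to the smaller k.
best_k <- cv_results$k[which.max(cv_results$accuracy)]

# Accuracy as a function of k.
cv_plot <- ggplot(cv_results, aes(x = k, y = accuracy)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbors (k)", y = "Cross-validated accuracy")

cv_results
best_k
cv_plot
```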


**7. Final Model Evaluation**

The selected model is then applied to the test set.
We report standard classification results taught in the course:

- overall test accuracy,
- a confusion matrix (optional),
- a qualitative interpretation of model performance.

A final visualization (e.g., Figure 3) may be included to illustrate predicted vs. actual outcomes or to display decision boundaries if appropriate.
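The final evaluation step could look roughly like the sketch below, again on simulated data. The helper `simulate_players()`, the value of `best_k`, and all variable names are hypothetical; in the real analysis, `best_k` would come from cross-validation.

```r
library(class)

set.seed(2024)

# Hypothetical helper that simulates predictors and a binary outcome.
simulate_players <- function(n) {
  x <- cbind(
    played_hours      = rexp(n, rate = 0.1),
    avg_session_hours = rexp(n, rate = 1)
  )
  y <- factor(
    ifelse(x[, "avg_session_hours"] + rnorm(n, sd = 0.3) > 1, "yes", "no"),
    levels = c("no", "yes")
  )
  list(x = x, y = y)
}

train <- simulate_players(80)
test  <- simulate_players(20)

# Standardize with training statistics, then apply the same to the test set.
x_train <- scale(train$x)
x_test  <- scale(test$x,
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# Placeholder for the k selected by cross-validation (an assumption here).
best_k <- 5

# Predict the held-out test set with the chosen k.
test_pred <- knn(train = x_train, test = x_test, cl = train$y, k = best_k)

# Overall accuracy: proportion of test players classified correctly.
test_accuracy <- mean(test_pred == test$y)

# Confusion matrix: rows are true labels, columns are predicted labels.
confusion <- table(actual = test$y, predicted = test_pred)

test_accuracy
confusion
```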