# DSCI 100 Project

**Problem**: Predicting Usage of a Video Game Research Server

"A research group in Computer Science at UBC is collecting data about how people play video games. They have set up a MineCraft server, and players' actions are recorded as they navigate through the world. But running this project is not simple: they need to target their recruitment efforts, and make sure they have enough resources (e.g., software licenses, server hardware) to handle the number of players they attract." (CANVAS)


## The Question

"Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Question 2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Question 3: We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability. 

In your project, you will select one of these broad questions and use it to formulate a specific question using some of the variables in the dataset.  Your project should answer your specific question." (CANVAS)

For this project, we focus on Question 3. More specifically, using the `sessions.csv` dataset, we will use each session's `start_time` and `end_time`, and the difference between them, to find when activity on the server peaks. We can also use the `players.csv` dataset to look at which types of players play the most.
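
As a rough illustration of how "simultaneous players" could be measured from start and end times, the sketch below counts how many sessions overlap at each moment. The four toy sessions are hypothetical (they are not rows of `sessions.csv`), and the sweep over start and end events is only one possible approach, not necessarily the method used later in this project.

```python
import pandas as pd

# Hypothetical toy sessions (NOT taken from sessions.csv), written in the
# same DD/MM/YYYY HH:MM format as the start_time and end_time columns.
toy = pd.DataFrame({
    "start_time": ["01/06/2024 18:00", "01/06/2024 18:10",
                   "01/06/2024 18:20", "01/06/2024 19:00"],
    "end_time":   ["01/06/2024 18:30", "01/06/2024 18:40",
                   "01/06/2024 18:25", "01/06/2024 19:15"],
})

fmt = "%d/%m/%Y %H:%M"
starts = pd.to_datetime(toy["start_time"], format=fmt)
ends = pd.to_datetime(toy["end_time"], format=fmt)

# Each session adds one player at its start and removes one at its end.
events = pd.concat([
    pd.DataFrame({"time": starts, "change": 1}),
    pd.DataFrame({"time": ends, "change": -1}),
])

# Sort by time; on ties, ends (-1) come before starts (+1), so sessions
# that merely touch are not counted as overlapping.
events = events.sort_values(["time", "change"]).reset_index(drop=True)
events["concurrent"] = events["change"].cumsum()

# Row where the running count of players is highest.
peak = events.loc[events["concurrent"].idxmax()]
peak_players = int(peak["concurrent"])
peak_time = peak["time"]
print(peak_players, peak_time)
```

The largest value of `concurrent` is the number of licenses that would have been needed to serve every player in the toy data.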

## Data Description: The Sessions

The research group has recorded when each player was on the server and for how long. To accommodate a large number of simultaneous players, we will use the `sessions.csv` dataset to identify the peak play hours.
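
A simple first look at peak play hours is to count how many sessions begin in each hour of the day. The sketch below does this for the nine `start_time` values visible in this notebook's preview of `sessions.csv`; the sample is far too small to support conclusions and only illustrates the mechanics. The names `preview`, `sessions_per_hour`, and `busiest_hour` are introduced here for illustration.

```python
import pandas as pd

# The nine start_time values visible in the preview of sessions.csv.
preview = pd.DataFrame({"start_time": [
    "30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34",
    "25/07/2024 03:22", "25/05/2024 16:01", "10/05/2024 23:01",
    "01/07/2024 04:08", "28/07/2024 15:36", "25/07/2024 06:15",
]})

# Dates are day-first, so the format is given explicitly.
start = pd.to_datetime(preview["start_time"], format="%d/%m/%Y %H:%M")

# Number of sessions that began in each hour of the day (0 to 23),
# keeping hours in which no session began.
sessions_per_hour = (
    start.dt.hour.value_counts().reindex(range(24), fill_value=0).sort_index()
)
busiest_hour = sessions_per_hour.idxmax()
print(sessions_per_hour[sessions_per_hour > 0])
```

Counting by start hour ignores how long each session lasts, so it is a starting point rather than a measure of simultaneous players.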


The columns in the dataset are:
 
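- `hashedEmail` - The player's email, hashed for privacy
- `start_time` - Date and time the session started (`DD/MM/YYYY HH:MM`)
- `end_time` - Date and time the session ended (`DD/MM/YYYY HH:MM`)
- `original_start_time` - Session start as a raw numeric timestamp (the magnitude, about 1.7e12, suggests milliseconds since the Unix epoch)
- `original_end_time` - Session end as a raw numeric timestamp (same apparent unit as `original_start_time`)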
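
The `original_start_time` and `original_end_time` values in the data preview are around 1.7e12, which is consistent with milliseconds since the Unix epoch. Assuming that interpretation (the dataset itself does not state it), they could be converted to timestamps as sketched below. The two values are copied from the preview, where they are rounded, so the converted times are only approximate.

```python
import pandas as pd

# Two original_start_time values as displayed in the preview of
# sessions.csv (rounded in the display, so precision is limited).
raw = pd.Series([1.719770e12, 1.718670e12], name="original_start_time")

# ASSUMPTION: values are milliseconds since 1970-01-01 00:00 UTC.
converted = pd.to_datetime(raw, unit="ms")
print(converted)
```

The first value lands on 30 June 2024, the same day as that row's `start_time` of `30/06/2024 18:12`, which lends some support to the assumption.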

The dataset contains 1535 rows.

The original dataset is here: https://drive.google.com/file/d/14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB

## Exploratory Data Analysis and Visualization

In [3]:
# Import the necessary libraries
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [7]:
# Read in the players data
players = pd.read_csv("data/players.csv")
players

| Unnamed: 0 | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | b6e9e593b9ec51c5e335457341c324c34a2239531e1890... | 0.0 | Bailey | Female | 17 | | |
| 192 | Veteran | False | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778... | 0.3 | Pascal | Male | 22 | | |
| 193 | Amateur | False | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29... | 0.0 | Dylan | Prefer not to say | 17 | | |
| 194 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | | |


In [5]:
# Read in the sessions data
sessions = pd.read_csv("data/sessions.csv")
sessions

| Unnamed: 0 | hashedEmail | start_time | end_time | original_start_time | original_end_time |
| --- | --- | --- | --- | --- | --- |
| 0 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 30/06/2024 18:12 | 30/06/2024 18:24 | 1.719770e+12 | 1.719770e+12 |
| 1 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 17/06/2024 23:33 | 17/06/2024 23:46 | 1.718670e+12 | 1.718670e+12 |
| 2 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3... | 25/07/2024 17:34 | 25/07/2024 17:57 | 1.721930e+12 | 1.721930e+12 |
| 3 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 25/07/2024 03:22 | 25/07/2024 03:58 | 1.721880e+12 | 1.721880e+12 |
| 4 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 25/05/2024 16:01 | 25/05/2024 16:12 | 1.716650e+12 | 1.716650e+12 |
| ... | ... | ... | ... | ... | ... |
| 1530 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 10/05/2024 23:01 | 10/05/2024 23:07 | 1.715380e+12 | 1.715380e+12 |
| 1531 | 7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e... | 01/07/2024 04:08 | 01/07/2024 04:19 | 1.719810e+12 | 1.719810e+12 |
| 1532 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | 28/07/2024 15:36 | 28/07/2024 15:57 | 1.722180e+12 | 1.722180e+12 |
| 1533 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | 25/07/2024 06:15 | 25/07/2024 06:22 | 1.721890e+12 | 1.721890e+12 |


In [8]:
# Tidy the players data
clean_players = players.drop(columns=["hashedEmail", "individualId", "organizationName", "name"])
clean_players

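| Unnamed: 0 | experience | subscribe | played_hours | gender | age |
| --- | --- | --- | --- | --- | --- |
| 0 | Pro | True | 30.3 | Male | 9 |
| 1 | Veteran | True | 3.8 | Male | 17 |
| 2 | Veteran | False | 0.0 | Male | 17 |
| 3 | Amateur | True | 0.7 | Female | 21 |
| 4 | Regular | True | 0.1 | Male | 21 |
| ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | 0.0 | Female | 17 |
| 192 | Veteran | False | 0.3 | Male | 22 |
| 193 | Amateur | False | 0.0 | Prefer not to say | 17 |
| 194 | Amateur | False | 2.3 | Male | 17 |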
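
One way to compare play time across player types is to group the tidied players data by `experience` and summarize `played_hours`. The sketch below uses only the nine rows visible in the preview above, so the numbers are illustrative and should not be read as results for the full dataset. The names `preview` and `hours_by_experience` are introduced here for illustration.

```python
import pandas as pd

# The nine (experience, played_hours) pairs visible in the preview.
preview = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular",
                   "Amateur", "Veteran", "Amateur", "Amateur"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 2.3],
})

# Count, mean, and total hours played per experience level,
# ordered from the highest mean to the lowest.
hours_by_experience = (
    preview.groupby("experience")["played_hours"]
    .agg(["count", "mean", "sum"])
    .sort_values("mean", ascending=False)
)
print(hours_by_experience)
```

On the full data, the same pattern would be applied to `clean_players` instead of the small `preview` frame.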


In [10]:
# Tidy the sessions data
clean_sessions = sessions.drop(columns=["hashedEmail"])
clean_sessions

| Unnamed: 0 | start_time | end_time | original_start_time | original_end_time |
| --- | --- | --- | --- | --- |
| 0 | 30/06/2024 18:12 | 30/06/2024 18:24 | 1.719770e+12 | 1.719770e+12 |
| 1 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1.718670e+12 | 1.718670e+12 |
| 2 | 25/07/2024 17:34 | 25/07/2024 17:57 | 1.721930e+12 | 1.721930e+12 |
| 3 | 25/07/2024 03:22 | 25/07/2024 03:58 | 1.721880e+12 | 1.721880e+12 |
| 4 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1.716650e+12 | 1.716650e+12 |
| ... | ... | ... | ... | ... |
| 1530 | 10/05/2024 23:01 | 10/05/2024 23:07 | 1.715380e+12 | 1.715380e+12 |
| 1531 | 01/07/2024 04:08 | 01/07/2024 04:19 | 1.719810e+12 | 1.719810e+12 |
| 1532 | 28/07/2024 15:36 | 28/07/2024 15:57 | 1.722180e+12 | 1.722180e+12 |
| 1533 | 25/07/2024 06:15 | 25/07/2024 06:22 | 1.721890e+12 | 1.721890e+12 |
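
The `start_time` and `end_time` columns are read in as text, so a likely next tidying step is to parse them as datetimes before computing session lengths. The sketch below does this for the first three rows of the preview above; the `duration_min` column and the `preview` and `parsed` names are introduced here for illustration and are not part of the dataset.

```python
import pandas as pd

# First three sessions from the preview of sessions.csv.
preview = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
    "end_time":   ["30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"],
})

# Dates are day-first (DD/MM/YYYY), so the format is stated explicitly
# rather than left for pandas to infer.
fmt = "%d/%m/%Y %H:%M"
parsed = preview.assign(
    start_time=pd.to_datetime(preview["start_time"], format=fmt),
    end_time=pd.to_datetime(preview["end_time"], format=fmt),
)

# Session length in minutes: end minus start.
parsed["duration_min"] = (
    (parsed["end_time"] - parsed["start_time"]).dt.total_seconds() / 60
)
print(parsed)
```

Giving the format explicitly avoids a date such as `01/07/2024` being read as 7 January instead of 1 July.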
