# DSCI 100 Project

**Problem**: Predicting Usage of a Video Game Research Server

"A research group in Computer Science at UBC is collecting data about how people play video games. They have set up a MineCraft server, and players actions are recorded as they navigate through the world. But running this project is not simple: they need to target their recruitment efforts, and make sure they have enough resources (e.g., software licenses, server hardware) to handle the number of players they attract." (CANVAS)


## The Question

"Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Question 2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Question 3: We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability. 

In your project, you will select one of these broad questions and use it to formulate a specific question using some of the variables in the dataset.  Your project should answer your specific question." (CANVAS)

For this project, we focus on **Question 2**. Specifically, using the `players.csv` dataset, we ask whether a player's level of `experience` is related to their total `played_hours`, and therefore which experience levels contribute the most data to the research.

## Data Description: The Players

The research group has collected data on the people who signed up to play on the MineCraft server. To help target their recruitment efforts, they recorded the variables below, which we use to find which demographics tend to play the most.

- Number of observations (rows): 196 (each row represents one unique player)  
- Number of variables (columns): 7 (the raw file also contains `individualId` and `organizationName`, which are entirely empty and are removed during cleaning)  
- Observational unit: Player-level data (each row = one player)
- Purpose: Used to explore how player characteristics relate to their total data contribution.


The columns in the dataset are:
 
- `experience` - The player's self-reported experience level (e.g., `Amateur`, `Regular`, `Veteran`, `Pro`)
- `subscribe` - Whether the player wants to subscribe to the game-related newsletter
- `hashedEmail` - The player's email, hashed for privacy
- `played_hours` - Total number of hours played
- `name` - Name of player
- `gender` - Gender of player
- `age` - Age of player


Issues:

- Missing data: the variables `individualId` and `organizationName` contain no values at all.
- Identifiers: `hashedEmail` is anonymized, so it identifies a player only within these datasets.
- Measurement bias: `played_hours` could include idle time, which would overestimate engagement.
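The missing-data issue above can be checked programmatically. The following is a minimal sketch, not the notebook's own cleaning step: it builds a small stand-in frame (the name `toy` is hypothetical, and the values are copied from the first rows of the `players` preview shown in the analysis section) and finds the columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Small stand-in for players.csv (hypothetical name `toy`); the three rows
# reuse values from the players preview shown later in this notebook.
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran"],
    "played_hours": [30.3, 3.8, 0.0],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# isna().all() is True for a column only when every value in it is missing.
all_missing = toy.columns[toy.isna().all()].tolist()

# Drop those columns; columns with at least one value are kept.
toy_clean = toy.drop(columns=all_missing)
print(all_missing)
```

Running the same two lines on the full `players` frame would confirm whether `individualId` and `organizationName` are the only entirely empty columns.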

The original dataset is here: https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit 

## Data Description: The Sessions

The research group has also collected data on when, and for how long, players are on the server. To accommodate a large number of simultaneous players, the `sessions.csv` dataset can be used to identify peak play hours.


The columns in the dataset are:
 
- `hashedEmail` - The player's email, hashed for privacy
- `start_time` - Start time of the session, in human-readable form
- `end_time` - End time of the session, in human-readable form
- `original_start_time` - The same start time represented as a UNIX-style timestamp in milliseconds (e.g., 1.719770e+12). This is machine-readable but not human-readable.
- `original_end_time` - The same end time recorded as a UNIX-style timestamp in milliseconds. Also unreadable without conversion.
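The millisecond timestamps described above can be converted with `pd.to_datetime(..., unit="ms")`. This is a minimal sketch under stated assumptions: the frame `toy_sessions` is hypothetical, its first value is the example `1.719770e+12` quoted above, and its second value is an invented timestamp one hour later. UNIX timestamps are conventionally in UTC, so the hours below are UTC hours, not necessarily the players' local time.

```python
import pandas as pd

# Hypothetical stand-in for sessions.csv. The first value is the example
# quoted in the column description; the second is invented (one hour later).
toy_sessions = pd.DataFrame({
    "original_start_time": [1.719770e12, 1.719770e12 + 3_600_000],
})

# Cast to integers first so the conversion does not depend on float handling,
# then interpret the numbers as milliseconds since the UNIX epoch.
start = pd.to_datetime(
    toy_sessions["original_start_time"].astype("int64"), unit="ms"
)

# Hour of day (0-23) for each session start, a possible basis for finding peak hours.
start_hour = start.dt.hour
print(start)
print(start_hour.value_counts().sort_index())
```

Counting session starts per hour is only a rough proxy for demand; counting sessions that overlap in time would need both the start and end columns.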

The dataset contains 1535 rows.

The original dataset is here: https://drive.google.com/file/d/14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB

# Exploratory Data Analysis and Visualization 

In [1]:
# Import the necessary libraries
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [2]:
# Read in the players data
players = pd.read_csv("data/players.csv")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
# Tidy the players data
clean_players = players.drop(columns=["hashedEmail", "individualId", "organizationName", "name"])
clean_players

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17
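Because the question is about total `played_hours` per `experience` level, a numeric summary can complement the charts that follow. The sketch below is illustrative only: the frame `preview` (a hypothetical name) holds just the nine rows visible in the output above, so its numbers do not describe the full dataset. The same `groupby` call could be applied to `clean_players`.

```python
import pandas as pd

# The nine rows visible in the preview above (not the full dataset).
preview = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular",
                   "Amateur", "Veteran", "Amateur", "Amateur"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 2.3],
})

# Player count, total hours, and mean hours for each experience level,
# ordered from the largest total contribution to the smallest.
summary = (
    preview.groupby("experience")["played_hours"]
    .agg(n_players="count", total_hours="sum", mean_hours="mean")
    .sort_values("total_hours", ascending=False)
)
print(summary)
```

Reporting the player count next to the total matters here: a level can have a large total simply because many players chose it, or because one player logged many hours.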


In [30]:
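# Total hours played for each experience level (summed explicitly rather than relying on bar stacking)
clean_players_exp = alt.Chart(clean_players).mark_bar().encode(
    x=alt.X("experience:N").title("Experience"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played"),
    color=alt.Color("experience:N").title("Experience")
)
clean_players_exp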

In [34]:
# Total hours played by newsletter subscription status, one panel per experience level
clean_players_sub = alt.Chart(clean_players).mark_bar().encode(
    x=alt.X("subscribe:N").title("Subscription"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played"),
    color=alt.Color("experience:N").title("Experience")
).facet(
    column="experience:N"
).properties(
    title="Hours Played by Subscription Status and Experience"
)
clean_players_sub

In [35]:
# Total hours played by gender, one panel per experience level
clean_players_gender = alt.Chart(clean_players).mark_bar().encode(
    x=alt.X("gender:N").title("Gender"),
    y=alt.Y("sum(played_hours):Q").title("Total Hours Played"),
    color=alt.Color("experience:N").title("Experience")
).facet(
    column="experience:N"
).properties(
    title="Hours Played by Gender and Experience"
)

clean_players_gender

In [45]:
# Hours played against age, one panel per experience level
clean_players_age = alt.Chart(clean_players).mark_bar(size=10).encode(
    x=alt.X("age").title("Age").scale(zero=False),
    y=alt.Y("played_hours").title("Hours Played"),
    color=alt.Color("experience").title("Experience")
).facet(
    column="experience:N"
).properties(
    title="Hours Played by Age and Experience"
)

clean_players_age