### **Data Science Project: Planning Stage (Individual)**

**Problem**: Predicting Usage of a Video Game Research Server

A research group in UBC’s Department of Computer Science is studying how people play Minecraft by collecting data from a research server. Each player’s in-game actions are recorded to analyze gameplay behaviour and engagement patterns. The team needs to understand which players contribute the most data so they can manage server resources efficiently and improve their recruitment strategy.

### **(1) Data Description:**
There are two datasets used in this project: `players.csv` and `sessions.csv`.  
- `players.csv` contains information about each player, such as their experience level, total hours played, and age.  
- `sessions.csv` includes details for each gameplay session, such as session length and player ID.  
Both datasets come from a Minecraft research server operated by UBC’s Department of Computer Science.

In [24]:
# Import libraries
import pandas as pd

# Read the datasets
players = pd.read_csv("data/players.csv")
sessions = pd.read_csv("data/sessions.csv")

In [29]:
# Drop columns that are entirely missing
players_1 = players.drop(columns=["individualId", "organizationName"])
# Preview the cleaned data frame (pandas shows the first and last rows)
players_1

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17


**Overview of `players.csv`**

- **Number of observations (rows):** 196 (each row represents one unique player)  
- **Number of variables (columns):** 7 after cleaning (`individualId` and `organizationName` removed)  
- **Observational unit:** Player-level data (each row = one player)  
- **Purpose:** Used to explore how player characteristics relate to their total data contribution.

**Variables**
`experience` — Categorical  
Describes the player’s Minecraft experience level (e.g., Amateur, Regular, Veteran, Pro). It shows how skilled or active the player is.

`subscribe` — Boolean  
Indicates whether the player subscribed to research or server updates. A value of True means the player is subscribed.

`hashedEmail` — String (Identifier)  
An anonymized player ID used to link this dataset with `sessions.csv`.

`played_hours` — Numeric  
Total number of hours the player spent on the server. Serves as a proxy for how active the player was.

`name` — String (Identifier)  
The player’s in-game name.

`gender` — Categorical  
The player’s gender (e.g., Male, Female, Other, Prefer not to say).

`age` — Numeric (Integer)  
Player’s age in years.

This dataset is **tidy**, following the three principles of tidy data:
1. **Each variable forms one column** — all attributes are stored in distinct columns.  
2. **Each observation forms one row** — each player is represented once.  
3. **Each type of observational unit forms one table** — player data are separated from session data (`sessions.csv`).
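
The second principle (one row per player) can be checked rather than assumed. The following is a minimal sketch, assuming `hashedEmail` is the player key; `players_demo` and its rows are invented stand-ins for `players.csv`, not real data.

```python
import pandas as pd

# Toy stand-in for players.csv; these rows are invented for illustration.
players_demo = pd.DataFrame({
    "hashedEmail": ["aaa", "bbb", "ccc"],
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.0],
})

# Principle 2: each player should appear in exactly one row.
one_row_per_player = players_demo["hashedEmail"].is_unique

# Count rows that repeat an earlier hashedEmail (0 means no duplicates).
n_duplicates = int(players_demo["hashedEmail"].duplicated().sum())

print(one_row_per_player, n_duplicates)
```

Running the same two checks on the real `players_1` frame would confirm or refute the claim that each player is represented once.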

The columns `individualId` and `organizationName` were removed because they contained only `NaN` values.  
The columns `name` and `hashedEmail` remain as identifiers, not analytical variables.

**Issues and Limitations**

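- **Missing data:** `individualId` and `organizationName` were entirely missing and had to be dropped.  
- **Identifiers:** `hashedEmail` is anonymized, so it is useful only as a key for linking to `sessions.csv`, not as an analytical variable.  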
- **Measurement bias:** `played_hours` may include idle time, which could overestimate engagement.  
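
Instead of hard-coding the names of the empty columns, they can be detected programmatically. This is a sketch under the assumption that "entirely missing" means every value is `NaN`; `players_demo` is an invented toy frame, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for players.csv; these rows are invented for illustration.
players_demo = pd.DataFrame({
    "experience": ["Pro", "Veteran"],
    "played_hours": [30.3, 3.8],
    "individualId": [np.nan, np.nan],
    "organizationName": [np.nan, np.nan],
})

# List the columns in which every value is missing.
all_missing = players_demo.columns[players_demo.isna().all()].tolist()

# Drop exactly those columns; partially filled columns are kept.
players_clean = players_demo.dropna(axis="columns", how="all")

print(all_missing)
print(players_clean.columns.tolist())
```

Using `dropna(axis="columns", how="all")` keeps the cleaning step valid even if the set of empty columns changes in a later data export.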

In [31]:
# Looking at the Data Frame
sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1.719770e+12,1.719770e+12
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1.718670e+12,1.718670e+12
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1.721930e+12,1.721930e+12
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1.721880e+12,1.721880e+12
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1.716650e+12,1.716650e+12
...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07,1.715380e+12,1.715380e+12
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19,1.719810e+12,1.719810e+12
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57,1.722180e+12,1.722180e+12
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22,1.721890e+12,1.721890e+12


**Overview of `sessions.csv`**

- **Number of observations (rows):** 1535 (each row represents one gameplay session)  
- **Number of variables (columns):** 5  
- **Observational unit:** Session-level data (each row = one gameplay session)  
- **Purpose:** Used to analyze how frequently and for how long each player played.

The `sessions.csv` dataset is **not tidy**. While each row correctly represents one play session, some columns violate the first rule of tidy data:

**Each variable forms one column:**  
The time-related variables are represented twice:  
- `start_time` and `original_start_time` record the same information in two different formats (a date-time string and a numeric timestamp).  
- `end_time` and `original_end_time` do the same.  

This duplication means one variable is stored across multiple columns.
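
The duplication can be made visible by converting both representations to datetimes and comparing them. This is a hedged sketch: it assumes the numeric columns hold milliseconds since the Unix epoch and that the string columns use day/month/year order, neither of which the dataset documents. `sessions_demo` is a one-row toy frame built from the values shown in the preview above.

```python
import pandas as pd

# One-row toy frame using values copied from the preview of sessions.csv.
sessions_demo = pd.DataFrame({
    "start_time": ["30/06/2024 18:12"],
    "original_start_time": [1.71977e12],
})

# Assumption: the numeric column is milliseconds since 1970-01-01.
start_from_epoch = pd.to_datetime(sessions_demo["original_start_time"], unit="ms")

# Assumption: the string column is day/month/year hour:minute.
start_from_text = pd.to_datetime(sessions_demo["start_time"], format="%d/%m/%Y %H:%M")

# Absolute difference between the two representations of the same moment.
gap = (start_from_text - start_from_epoch).abs()

print(start_from_epoch.iloc[0], start_from_text.iloc[0], gap.iloc[0])
```

The two values land on the same date but need not match to the minute, since the preview shows the numeric timestamp rounded and the time zone of the string column is not stated. If the full-precision values agree, one pair of columns can be dropped.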

**Issues and Limitations**

- **Unreadable timestamps:** The numeric timestamp columns (e.g., `1.719770e+12`) are displayed in scientific notation and are not human-readable until converted to datetime format.  
- **Possible missing or inconsistent entries:** Some timestamps may be missing, duplicated, or formatted inconsistently, which could affect calculated session lengths.  
- **Limited information:** The dataset only includes session start and end times — no direct measure of in-game activities or player engagement quality.  
- **Linking dependency:** This dataset alone lacks player demographics or characteristics; it must be merged with `players.csv` using `hashedEmail` for complete analysis.
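
The session-length and linking points above can be sketched as follows. This is an illustration, not the project's final pipeline: the toy frames, the `session_minutes`, `n_sessions`, and `total_minutes` column names, and the day/month/year format string are all assumptions introduced here.

```python
import pandas as pd

# Toy stand-ins for players.csv and sessions.csv; the IDs are invented.
players_demo = pd.DataFrame({
    "hashedEmail": ["aaa", "bbb", "ccc"],
    "experience": ["Pro", "Amateur", "Veteran"],
})
sessions_demo = pd.DataFrame({
    "hashedEmail": ["aaa", "aaa", "bbb"],
    "start_time": ["30/06/2024 18:12", "25/07/2024 03:22", "17/06/2024 23:33"],
    "end_time": ["30/06/2024 18:24", "25/07/2024 03:58", "17/06/2024 23:46"],
})

# Parse the strings; entries that do not match the format become NaT.
fmt = "%d/%m/%Y %H:%M"
start = pd.to_datetime(sessions_demo["start_time"], format=fmt, errors="coerce")
end = pd.to_datetime(sessions_demo["end_time"], format=fmt, errors="coerce")

# Session length in minutes (NaN wherever a timestamp failed to parse).
sessions_demo["session_minutes"] = (end - start).dt.total_seconds() / 60

# Summarize to one row per player: session count and total minutes.
per_player = sessions_demo.groupby("hashedEmail", as_index=False).agg(
    n_sessions=("start_time", "size"),
    total_minutes=("session_minutes", "sum"),
)

# Left join keeps players with no recorded sessions (their summary is NaN).
merged = players_demo.merge(
    per_player, on="hashedEmail", how="left", validate="one_to_one"
)
print(merged)
```

A left join from the player table is used so that players who never started a session remain visible; `validate="one_to_one"` raises an error if `hashedEmail` turns out not to be unique on either side.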