### **Data Science Project: Planning Stage (Individual)**

**Problem**: Predicting Usage of a Video Game Research Server

A research group in UBC’s Department of Computer Science is studying how people play Minecraft by collecting data from a research server. Each player’s in-game actions are recorded to analyze gameplay behaviour and engagement patterns. The team needs to understand which players contribute the most data so they can manage server resources efficiently and improve their recruitment strategy.

### **1) Data Description:**
There are two datasets used in this project: `players.csv` and `sessions.csv`.  
- `players.csv` contains information about each player, such as their experience level, age, and total playtime.  
- `sessions.csv` includes details for each gameplay session, such as its start and end times and the player’s ID.  
Both datasets come from a Minecraft research server operated by UBC’s Department of Computer Science.

**Overview of `players.csv`**

- **Number of observations (rows):** 196 (each row represents one unique player)  
- **Number of variables (columns):** 7 after cleaning (`individualId` and `organizationName` removed)  
- **Observational unit:** Player-level data (each row = one player)  
- **Purpose:** Used to explore how player characteristics relate to their total data contribution.

**Variables**

`experience` — Categorical  
Describes the player’s Minecraft experience level (e.g., Amateur, Regular, Veteran, Pro). It shows how skilled or active the player is.

`subscribe` — Boolean  
Indicates whether the player subscribed to research or server updates. A value of True means the player is subscribed.

`hashedEmail` — String (Identifier)  
An anonymized player ID used to link this dataset with `sessions.csv`.

`played_hours` — Numeric  
Total number of hours a player spent on the server. Measures activity and time.

`name` — String (Identifier)  
The player’s in-game name.

`gender` — Categorical  
The player’s gender (e.g., Male, Female, Other, Prefer not to say).

`age` — Numeric (Integer)  
Player’s age in years.

This dataset is **tidy**, following the three principles of tidy data:
1. **Each variable forms one column** — all attributes are stored in distinct columns.  
2. **Each observation forms one row** — each player is represented once.  
3. **Each type of observational unit forms one table** — the table contains only player-level data.

The columns `individualId` and `organizationName` were removed because they contained only `NaN` values.  
The columns `name` and `hashedEmail` remain as identifiers, not analytical variables.
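
Before dropping the two columns, it can help to confirm that they really are empty. The sketch below shows one way to do that. The small DataFrame is made-up stand-in data, not the real `players.csv`; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for players.csv (values are illustrative only)
players_demo = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# List the columns in which every value is missing
all_missing = players_demo.columns[players_demo.isna().all()].tolist()
print(all_missing)

# Drop only the columns that are entirely NaN
players_demo_tidy = players_demo.dropna(axis=1, how="all")
print(players_demo_tidy.columns.tolist())
```

Using `dropna(axis=1, how="all")` removes a column only when every value is missing, so partially filled columns are kept.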

**Issues and Limitations**

- **Missing data:** The variables `individualId` and `organizationName` were entirely missing and were removed.  
- **Identifiers:** `hashedEmail` is anonymized, so it can only be used to link records, not as an analytical variable.
- **Measurement bias:** `played_hours` may include idle time, which could overestimate true player engagement.

**Overview of `sessions.csv`**

- **Number of observations (rows):** 1535 (each row represents one gameplay session)  
- **Number of variables (columns):** 5
- **Observational unit:** Session-level data (each row = one gameplay session)  
- **Purpose:** Used to analyze how frequently and for how long each player played.

The `sessions.csv` dataset is **not tidy**. While each row correctly represents one play session, some columns violate the rules of tidy data:

**Each variable forms one column** — 
   There are duplicate representations of time-related variables:  
   - `start_time` and `original_start_time` record the same information in two different formats (string and numeric timestamp).  
   - `end_time` and `original_end_time` do the same.  
   This duplication means one variable is stored across multiple columns.

**Variables**

`hashedEmail` — String (Identifier)  
A unique anonymized player ID that links each session to a player in `players.csv`.

`start_time` — String  
The combined date and time when the gameplay session began (e.g., 30/06/2024 18:12).

`end_time` — String  
The combined date and time when the gameplay session ended (e.g., 30/06/2024 18:24).

`original_start_time` — Numeric (Float)  
The same start time represented as a UNIX-style timestamp in milliseconds (e.g., 1.719770e+12). This is machine-readable but not human-readable.

`original_end_time` — Numeric (Float)  
The same end time recorded as a UNIX-style timestamp in milliseconds. Like the start timestamp, it is not human-readable without conversion.

**Issues and Limitations**

- **Unreadable timestamps:** The numeric timestamp columns (e.g., `1.719770e+12`) are displayed in scientific notation and are not human-readable without conversion.
- **Possible missing or inconsistent entries:** Some timestamps may be missing, duplicated, or formatted inconsistently, which could affect calculated session lengths.  
- **Limited information:** The dataset only includes session start and end times — no direct measure of in-game activities or player engagement quality.  
- **Linking dependency:** This dataset alone lacks player demographics or characteristics; it must be merged with `players.csv` using `hashedEmail` for complete analysis.
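
The timestamp issues above can be checked with a short sketch like the one below. It assumes the numeric columns hold milliseconds since the UNIX epoch, as described in the variable list. The two-row DataFrame is a made-up stand-in: the first row reuses the example values quoted above, and the second row is an invented missing entry.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for sessions.csv (second row is an invented missing entry)
sessions_demo = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", None],
    "original_start_time": [1.719770e12, np.nan],
})

# Interpret the float column as milliseconds since the UNIX epoch
sessions_demo["start_from_ms"] = pd.to_datetime(
    sessions_demo["original_start_time"], unit="ms"
)

# Parse the string column using its day/month/year format
sessions_demo["start_parsed"] = pd.to_datetime(
    sessions_demo["start_time"], format="%d/%m/%Y %H:%M"
)

# Count rows whose start time could not be parsed (missing values become NaT)
n_missing = int(sessions_demo["start_parsed"].isna().sum())
print(sessions_demo[["start_parsed", "start_from_ms"]])
print("Rows with a missing start time:", n_missing)
```

Comparing the two converted columns side by side is one way to decide which representation to keep when tidying the data.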

### **2) Question:**

For this project, I will focus on predicting which **player characteristics** and **in-game behaviours** are most associated with higher total data contribution. Specifically,

` We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.`

**Response and Explanatory Variables**

- **Response variable:** total data contribution (measured by total playtime or total session duration per player).  
- **Explanatory variables:** average session length, number of sessions, player experience level, subscription status, and age.

**How the Data Will Be Used**

- The `players.csv` dataset provides player-level characteristics such as `experience`, `subscribe`, `gender`, `age`, and total `played_hours`.  
- The `sessions.csv` dataset provides session-level details such as `start_time` and `end_time`, which are used to create `session_duration` for each play session.

These two datasets will be merged into a single, larger dataset using the common key `hashedEmail` so that each player’s information can be linked to their session activity. This process will be done using the `merge` function introduced in Chapter 3.
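
A minimal sketch of this merge is shown below, using made-up stand-in data. The short IDs and the `session_minutes` column are hypothetical. The `validate` and `indicator` arguments are optional checks added for illustration, not part of the planned analysis.

```python
import pandas as pd

# Made-up stand-in tables (IDs and values are illustrative only)
players_demo = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "experience": ["Pro", "Veteran", "Amateur"],
})
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "session_minutes": [74.0, 7.0, 12.0],
})

# One player can have many sessions, so the merge is one-to-many.
# validate raises an error if a player appears twice in players_demo;
# indicator adds a _merge column showing where each row came from.
merged_demo = players_demo.merge(
    sessions_demo,
    on="hashedEmail",
    how="left",
    validate="one_to_many",
    indicator=True,
)

# Players that have no recorded sessions
no_sessions = merged_demo.loc[
    merged_demo["_merge"] == "left_only", "hashedEmail"
].tolist()
print(merged_demo)
print(no_sessions)
```

A left merge keeps players without any sessions, whereas the default inner merge would silently drop them. Which behaviour is appropriate depends on whether zero-session players should count as contributing no data.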

**Planned Data Wrangling Steps**

1. **Tidy the session data** by separating date and time into distinct variables and calculating `session_duration` for each record.  
2. **Summarize session data** by player to compute:
   - total number of sessions per player  
   - average session duration  
   - total playtime across all sessions  
3. **Merge** the tidied session data with the `players.csv` dataset using `hashedEmail`.  
4. **Prepare** the final dataset for modeling by selecting relevant features and encoding categorical variables (e.g., experience level, gender).  
5. **Apply** a predictive method (K-Nearest Neighbors Regression) to estimate how much data each player is likely to contribute based on their characteristics and in-game behaviour.
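
Steps 2 to 5 could be sketched roughly as follows. This is an illustration under several assumptions, not the final analysis: the data are made up, the column name `session_minutes` and the choice of `n_neighbors=3` are placeholders, and scikit-learn is assumed as the modeling library. A real analysis would also split the data into training and test sets and tune the number of neighbors.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in data (IDs and values are illustrative only)
players_demo = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "experience": ["Pro", "Veteran", "Amateur", "Regular", "Amateur", "Pro"],
    "subscribe": [True, True, False, True, False, True],
    "age": [21, 17, 17, 25, 19, 30],
})
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2", "c3", "d4", "d4", "e5", "f6"],
    "session_minutes": [74.0, 7.0, 12.0, 5.0, 40.0, 20.0, 9.0, 60.0],
})

# Step 2: summarize sessions so that there is one row per player
per_player = sessions_demo.groupby("hashedEmail", as_index=False).agg(
    n_sessions=("session_minutes", "count"),
    mean_session_minutes=("session_minutes", "mean"),
    total_minutes=("session_minutes", "sum"),
)

# Step 3: attach the per-player summary to the player characteristics
model_df = players_demo.merge(per_player, on="hashedEmail", how="inner")

# Step 4: select features; booleans are cast to 0/1 so they can be scaled
X = model_df[["experience", "subscribe", "age"]].assign(
    subscribe=lambda d: d["subscribe"].astype(int)
)
y = model_df["total_minutes"]
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["experience"]),
    (StandardScaler(), ["age", "subscribe"]),
)

# Step 5: K-nearest neighbors regression (n_neighbors=3 is an arbitrary placeholder)
knn = make_pipeline(preprocess, KNeighborsRegressor(n_neighbors=3))
knn.fit(X, y)
predictions = knn.predict(X)
print(per_player)
print(predictions)
```

The sketch uses only player characteristics as predictors. Total session time equals the number of sessions multiplied by the average session length, so including both of those as predictors would determine the response almost exactly.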


**Goal**

This analysis will help identify the types of players who generate the most valuable gameplay data, allowing the research group to focus on recruiting similar participants in future studies.


### **3) Exploratory Data Analysis and Visualization**

In [2]:
# Import libraries
import pandas as pd
import altair as alt
# Read the datasets
players = pd.read_csv("data/players.csv")
sessions = pd.read_csv("data/sessions.csv")

In [3]:
# Drop columns that are entirely missing (NaN)
players_tidy = players.drop(columns=["individualId", "organizationName"])
# Preview tidy players df
players_tidy.head()

| | experience | subscribe | hashedEmail | played_hours | name | gender | age |
|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 |


In [4]:
# Convert start_time and end_time to datetime
sessions['start_time'] = pd.to_datetime(sessions['start_time'], format='%d/%m/%Y %H:%M')
sessions['end_time'] = pd.to_datetime(sessions['end_time'], format='%d/%m/%Y %H:%M')
# Calculate session duration in minutes
sessions["session_duration(minutes)"] = (sessions["end_time"] - sessions["start_time"]).dt.total_seconds()/60
# Drop unneeded columns
sessions_tidy = sessions.drop(columns=["start_time","end_time", "original_start_time", "original_end_time"])
# Drop rows where the session duration is missing (NaN)
sessions_tidy = sessions_tidy.dropna(subset=["session_duration(minutes)"])
# Preview tidy sessions df
sessions_tidy.head()

| | hashedEmail | session_duration(minutes) |
|---|---|---|
| 0 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 12.0 |
| 1 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 13.0 |
| 2 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3... | 23.0 |
| 3 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 36.0 |
| 4 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 11.0 |


In [34]:
# Merged_df
merged_df = players_tidy.merge(sessions_tidy, on="hashedEmail")
merged_df

| | experience | subscribe | hashedEmail | played_hours | name | gender | age | session_duration(minutes) |
|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 74.0 |
| 1 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 7.0 |
| 2 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 44.0 |
| 3 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 22.0 |
| 4 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 56.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1528 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 7.0 |
| 1529 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 28.0 |
| 1530 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 14.0 |
| 1531 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 5.0 |


In [45]:
# Do older or younger players tend to play longer sessions?
age_duration = alt.Chart(merged_df).mark_point(opacity=0.6).encode(
    x=alt.X("age:Q").title("Players age")
    .scale(zero=False),
    y=alt.Y("session_duration(minutes)")
    .title("Session Duration (Minute)")
    .scale(zero=False),
    color=alt.Color("experience:N").title("Experience Level")
)
age_duration

In [43]:
# Do session durations differ by gender and experience?
gender_duration = alt.Chart(merged_df).mark_bar(opacity=0.8).encode(
    x=alt.X("gender").title("Gender"),
    y=alt.Y("session_duration(minutes)")
    .title("Session Duration (Minute)"),
    color=alt.Color("experience:N").title("Experience Level")
)
gender_duration

**Scatter Plot: Age vs. Session Duration (Minutes)**
- Question answered: How does player age relate to session length?
- Most players are between 10 and 30 years old, and their session durations vary widely. Older players appear less common and generally have shorter sessions.
- There is no consistent pattern in how experience level affects session duration.

**Bar Chart: Gender vs. Session Duration (Minutes)**
- Shows total session duration for each gender, with colors representing experience levels.
- Question answered: Do session durations differ by gender and experience level?
- There are noticeably more Pro, Regular, and Beginner players among males than among other genders.
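
Because the bar heights reflect summed session minutes rather than the number of players, a summary table can help check the observations above. The sketch below uses made-up stand-in rows with the same column names as `merged_df`; the output column names (`n_players`, `n_sessions`, `total_minutes`) are hypothetical.

```python
import pandas as pd

# Made-up stand-in for merged_df (one row per session; values are illustrative)
merged_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "gender": ["Male", "Male", "Male", "Female", "Male", "Male"],
    "experience": ["Pro", "Pro", "Pro", "Veteran", "Amateur", "Amateur"],
    "session_duration(minutes)": [74.0, 7.0, 44.0, 12.0, 28.0, 14.0],
})

# Separate the number of distinct players from the number of sessions
# and the total minutes within each gender and experience group
by_group = merged_demo.groupby(["gender", "experience"], as_index=False).agg(
    n_players=("hashedEmail", "nunique"),
    n_sessions=("hashedEmail", "count"),
    total_minutes=("session_duration(minutes)", "sum"),
)
print(by_group)
```

In this toy example the Male/Pro group has a large total only because a single player logged three sessions, which is the kind of pattern the stacked bars alone cannot reveal.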

**Relevance to our question:**
Together, these two visualizations help identify which kinds of players are most engaged and contribute the most gameplay data. The scatter plot shows that players between the ages of 10 and 30 are the most active and engaged, as they account for the majority of sessions and display the widest range of play durations. The bar chart complements this finding by showing that male players make up the largest share of total playtime and are represented across multiple experience levels, including Pro, Regular, and Beginner.

This pattern suggests that player engagement is highest among male players aged 10 to 30, who also display a wide range of experience levels. However, more data wrangling and further investigation are needed to understand what other player characteristics may contribute to larger amounts of data and higher engagement levels.

### **4) Methods and Plan**

