## (1) Data Description 
The project will use two datasets, **`players.csv`** and **`sessions.csv`**, both describing player demographics, in-game behaviour, and engagement metrics from an online game.

- **`players.csv`** contains player-level information such as a hashed email identifier, demographics (e.g., age, gender), experience level, total hours played, and whether the player has **subscribed to a game-related newsletter**.
- **`sessions.csv`** records individual gameplay sessions: a hashed email identifying the player, plus the start and end time of each session.

Since the question focuses on the potential relationship between player characteristics and subscribing to a game-related newsletter, we will mainly use **`players.csv`**, which contains all the important information. It has 7 valid variables and 2 variables without any data, with 196 observations in total. Here is the detailed description:

- **experience**: categorical variable describing the player's experience with the game. It takes the values "beginner", "regular", "veteran", "amateur" and "pro".
- **subscribe**: categorical variable indicating whether the player subscribes to a game-related newsletter.
- **hashedEmail**: nominal variable recording the player's hashed email, which serves as an identifier.
- **played_hours**: quantitative variable recording how many hours the player has played.
- **name**: nominal variable recording the player's name.
- **gender**: categorical variable recording the player's gender.
- **age**: quantitative variable recording the player's age.
- **individualId**: no data.
- **organizationName**: no data.
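The variable list above can be checked programmatically. The following is a minimal sketch, not the project's actual code: the three rows are hypothetical stand-ins (in the project they would come from `pd.read_csv("players.csv")`), and only the column names are taken from the description above.

```python
import io
import pandas as pd

# Hypothetical stand-in rows for illustration only;
# in the project this would be pd.read_csv("players.csv").
toy_csv = io.StringIO(
    "experience,subscribe,hashedEmail,played_hours,name,gender,age,"
    "individualId,organizationName\n"
    "pro,True,hash_a,30.3,Alex,Male,9,,\n"
    "veteran,False,hash_b,0.0,Sam,Female,17,,\n"
    "amateur,True,hash_c,0.7,Kim,Female,21,,\n"
)
players = pd.read_csv(toy_csv)

# Columns in which every value is missing carry no information.
empty_cols = [c for c in players.columns if players[c].isna().all()]
players = players.drop(columns=empty_cols)

# Store the categorical variables with pandas' category dtype.
for col in ["experience", "gender"]:
    players[col] = players[col].astype("category")

print(empty_cols)
print(players.dtypes)
```

Dropping the all-missing columns first keeps later summaries (for example `players.describe()`) limited to the 7 variables that actually contain data.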

**`sessions.csv`** has 5 variables and 1535 observations:

- **hashedEmail**: nominal variable recording the player's hashed email.
- **start_time**: time variable recording when the player began the session (formatted as `dd/mm/yyyy hh:mm`).
- **end_time**: time variable recording when the player ended the session (same format).
- **original_start_time**: time variable recording the original session start time, stored as a number (values near 1.7e12, which looks like a Unix timestamp in milliseconds).
- **original_end_time**: time variable recording the original session end time, stored the same way.
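Because `start_time` and `end_time` are read in as text, they would need to be parsed before any session-level metric can be derived. The sketch below is one possible approach under stated assumptions: the timestamps are copied from the first rows of the displayed output, the short player IDs are hypothetical placeholders for the truncated hashes, and `duration_min`, `n_sessions`, and `total_min` are illustrative names that do not exist in the data.

```python
import pandas as pd

# Timestamps taken from the first displayed rows of sessions.csv;
# the IDs are hypothetical placeholders for the truncated hashes.
sessions = pd.DataFrame({
    "hashedEmail": ["player_a", "player_b", "player_c", "player_a"],
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33",
                   "25/07/2024 17:34", "25/07/2024 03:22"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46",
                 "25/07/2024 17:57", "25/07/2024 03:58"],
})

# The strings are day-first, so state the format explicitly.
fmt = "%d/%m/%Y %H:%M"
sessions["start_time"] = pd.to_datetime(sessions["start_time"], format=fmt)
sessions["end_time"] = pd.to_datetime(sessions["end_time"], format=fmt)

# Session length in minutes (illustrative derived column).
sessions["duration_min"] = (
    (sessions["end_time"] - sessions["start_time"]).dt.total_seconds() / 60
)

# Per-player summary that could later be joined to players.csv on hashedEmail.
per_player = sessions.groupby("hashedEmail")["duration_min"].agg(
    n_sessions="count", total_min="sum"
)
print(per_player)
```

Giving an explicit `format` avoids the day/month ambiguity that automatic parsing can introduce for dates such as `10/05/2024`.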

**Potential issues and considerations:**
- Skewed distributions in engagement variables such as `played_hours`.
- Potential class imbalance in the response variable (`subscribe`), since one class (likely non-subscribers) may dominate.
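Both concerns can be quantified before modelling. The following is a rough sketch using made-up values (not the real data) to show the two checks; the column names `subscribe` and `played_hours` follow the variable list for `players.csv`.

```python
import pandas as pd

# Made-up values for illustration; the real check would use players.csv.
players = pd.DataFrame({
    "subscribe": [True, True, True, False, True, False],
    "played_hours": [0.0, 0.1, 0.3, 0.0, 1.5, 48.0],
})

# Class balance: proportion of each response class.
class_share = players["subscribe"].value_counts(normalize=True)
subscribed_share = players["subscribe"].mean()

# Sample skewness: values well above 0 indicate a long right tail.
skewness = players["played_hours"].skew()

print(class_share)
print(subscribed_share, skewness)
```

If the real data shows a strong imbalance, plain accuracy would be a misleading metric, and a heavily skewed `played_hours` may need scaling or a transformation before use in a distance-based classifier.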

These datasets may have been collected from in-game logs and user profiles, which could introduce selection bias toward active players.

---

In [3]:
import altair as alt
import numpy as np
import pandas as pd

In [5]:

# Load the session-level data and display it
df = pd.read_csv("sessions.csv")
df

| Unnamed: 0 | hashedEmail | start_time | end_time | original_start_time | original_end_time |
| --- | --- | --- | --- | --- | --- |
| 0 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 30/06/2024 18:12 | 30/06/2024 18:24 | 1.719770e+12 | 1.719770e+12 |
| 1 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 17/06/2024 23:33 | 17/06/2024 23:46 | 1.718670e+12 | 1.718670e+12 |
| 2 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3... | 25/07/2024 17:34 | 25/07/2024 17:57 | 1.721930e+12 | 1.721930e+12 |
| 3 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 25/07/2024 03:22 | 25/07/2024 03:58 | 1.721880e+12 | 1.721880e+12 |
| 4 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 25/05/2024 16:01 | 25/05/2024 16:12 | 1.716650e+12 | 1.716650e+12 |
| ... | ... | ... | ... | ... | ... |
| 1530 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 10/05/2024 23:01 | 10/05/2024 23:07 | 1.715380e+12 | 1.715380e+12 |
| 1531 | 7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e... | 01/07/2024 04:08 | 01/07/2024 04:19 | 1.719810e+12 | 1.719810e+12 |
| 1532 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | 28/07/2024 15:36 | 28/07/2024 15:57 | 1.722180e+12 | 1.722180e+12 |
| 1533 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | 25/07/2024 06:15 | 25/07/2024 06:22 | 1.721890e+12 | 1.721890e+12 |


## (2) Question

**Research Question:**  
*What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?*

The **response variable** is `subscribe` (True or False).  
**Explanatory variables** may include:
- `age`
- `experience`
- `played_hours`
- `gender`

This analysis can identify which types of players are most likely to engage beyond gameplay (via newsletter subscription), providing insights for marketing and retention strategies. We plan to first clean the data so that it contains only the response variable and the explanatory variables, making it ready for classification.
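The cleaning step described above could look roughly like the sketch below. It is an illustration under assumptions, not the final pipeline: the rows are hypothetical, the response column is assumed to be named `subscribe` as in the variable list for `players.csv`, and dropping incomplete rows and one-hot encoding the categorical variables are just one reasonable choice.

```python
import pandas as pd

# Hypothetical rows for illustration; the project would load players.csv.
players = pd.DataFrame({
    "experience": ["pro", "veteran", "amateur", "regular"],
    "subscribe": [True, False, True, True],
    "hashedEmail": ["hash_a", "hash_b", "hash_c", "hash_d"],
    "played_hours": [30.3, 0.0, 0.7, 0.1],
    "name": ["A", "B", "C", "D"],
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [9, 17, None, 21],  # one missing age on purpose
})

response = "subscribe"
explanatory = ["age", "experience", "played_hours", "gender"]

# Keep only the modelling columns and drop rows with missing values.
model_df = players[[response] + explanatory].dropna().reset_index(drop=True)

# One-hot encode the categorical predictors so a classifier can use them.
X = pd.get_dummies(model_df[explanatory], columns=["experience", "gender"])
y = model_df[response]

print(model_df)
print(X.columns.tolist())
```

Identifier columns such as `hashedEmail` and `name` are left out because they are unique per player and carry no generalisable signal.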

---