## (1) Data Description 
The project will use two datasets, **`players.csv`** and **`sessions.csv`**, which together describe player demographics and in-game activity from an online game.

- **`players.csv`** contains player-level information such as a hashed email identifier, demographics (e.g., age, gender), experience level, total hours played, and whether the player has **subscribed to a game-related newsletter**.
- **`sessions.csv`** records individual gameplay sessions: the player's hashed email and the start and end time of each session.

Since the question focuses on the potential relationship between player characteristics and subscribing to a game-related newsletter, we will mainly use **`players.csv`**, which contains all the relevant information. It has 196 observations and 9 variables: 7 with valid data and 2 without any data. Here is the detailed description:

- **experience**: categorical variable describing the player's experience level in the game. Possible values are "Beginner", "Regular", "Veteran", "Amateur", and "Pro".
- **subscribe**: categorical (boolean) variable indicating whether the player subscribed to a game-related newsletter.
- **hashedEmail**: nominal variable recording the player's hashed email address.
- **played_hours**: quantitative variable recording how many hours the player has played.
- **name**: nominal variable recording the player's name.
- **gender**: categorical variable recording the player's gender.
- **age**: quantitative variable recording the player's age.
- **individualId**: no data.
- **organizationName**: no data.
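
The variable summary above can be checked programmatically. The following is a minimal sketch, assuming a small hand-made stand-in for `players.csv`; the helper `summarize_columns` and the toy values are illustrative and not part of the original analysis.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-missing count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_missing": df.notna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Hypothetical three-row stand-in for players.csv (values are illustrative only).
toy_players = pd.DataFrame({
    "experience": ["Beginner", "Veteran", "Amateur"],
    "subscribe": [True, False, True],
    "played_hours": [0.1, 0.0, 17.2],
    "age": [24, 17, 14],
    "individualId": [None, None, None],
})

summary = summarize_columns(toy_players)
print(summary)

# Columns with no data at all are candidates for dropping.
empty_columns = summary.index[summary["non_missing"] == 0].tolist()
print(empty_columns)
```

Running the same helper on the real `players` data frame should flag `individualId` and `organizationName` if they are indeed empty.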

The **`sessions.csv`** dataset has 5 variables and 1535 observations:

- **hashedEmail**: nominal variable recording the player's hashed email address.
- **start_time**: time variable recording when the player began the session.
- **end_time**: time variable recording when the player ended the session.
- **original_start_time**: time variable recording the original start time of the session.
- **original_end_time**: time variable recording the original end time of the session.
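
If session-level behaviour is needed later, the start and end times can be turned into per-player engagement measures. This is only a sketch: the timestamp layout in `TIME_FORMAT`, the toy rows, and the short placeholder emails are assumptions, and the real file may store times differently.

```python
import pandas as pd

# Hypothetical rows; the real timestamp format in sessions.csv may differ.
toy_sessions = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"],
})

TIME_FORMAT = "%d/%m/%Y %H:%M"  # assumed day/month/year layout

# Parse the text columns into timestamps.
start = pd.to_datetime(toy_sessions["start_time"], format=TIME_FORMAT)
end = pd.to_datetime(toy_sessions["end_time"], format=TIME_FORMAT)

# Session length in minutes.
toy_sessions["duration_min"] = (end - start).dt.total_seconds() / 60

# Aggregate to one row per player so the result can be joined to players.csv.
per_player = (
    toy_sessions.groupby("hashedEmail")["duration_min"]
    .agg(n_sessions="count", total_minutes="sum")
    .reset_index()
)
print(per_player)
```

The resulting table has one row per `hashedEmail`, which is the key shared with `players.csv`.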

**Potential issues and considerations:**
- Skewed distributions in engagement variables such as `played_hours`.
- Potential class imbalance in the response variable (`subscribe`), as one class may dominate (the random samples in Section (3) show mostly subscribers).
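
Both concerns can be quantified before modelling. The sketch below uses a small made-up data frame (not the real data) to show one way of measuring class balance and skewness with pandas.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the checks.
toy = pd.DataFrame({
    "subscribe": [True, True, True, False, True, False, True, True],
    "played_hours": [0.0, 0.1, 0.0, 0.0, 1.6, 0.1, 17.2, 56.1],
})

# Class balance: counts per class and the share of subscribers.
print(toy["subscribe"].value_counts())
subscriber_share = toy["subscribe"].mean()  # mean of booleans = proportion True
print(subscriber_share)

# Skewness: a positive value and mean > median both point to a long right tail.
skewness = toy["played_hours"].skew()
median_hours = toy["played_hours"].median()
mean_hours = toy["played_hours"].mean()
print(skewness, median_hours, mean_hours)
```

A subscriber share far from 0.5 would suggest reporting more than plain accuracy when the classifier is evaluated.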

These datasets may have been collected from in-game logs and user profiles, which could introduce selection bias toward active players.

---

## (2) Question

**Research Question:**  
*What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?* 

The **response variable** is `subscribe` (True or False).  
**Explanatory variables** may include:
- age
- experience
- played hours
- gender

So the specific question is: **can we predict `subscribe` from `age`, `experience`, `played_hours`, and `gender`?**

Answering this lets us identify which types of players are most likely to engage beyond gameplay (via newsletter subscription), providing insights for marketing and retention strategies. I plan to first clean the data so that it contains only the response variable and the explanatory variables, making it ready for classification.
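
One possible shape for that cleanup step is sketched below. It assumes that rows with missing values are dropped and that categorical variables are one-hot encoded; neither choice is fixed by this proposal, and the helper `prepare_for_classification` and the toy rows are hypothetical.

```python
import pandas as pd

FEATURES = ["age", "experience", "played_hours", "gender"]
TARGET = "subscribe"


def prepare_for_classification(df: pd.DataFrame):
    """Keep the modelling columns, drop incomplete rows, encode categoricals."""
    tidy = df[[TARGET] + FEATURES].dropna()
    # One-hot encode the categorical predictors (an assumed encoding choice).
    X = pd.get_dummies(tidy[FEATURES], columns=["experience", "gender"])
    y = tidy[TARGET].astype(bool)
    return X, y


# Hypothetical stand-in for players.csv, including one row with a missing age.
toy_players = pd.DataFrame({
    "subscribe": [True, False, True, True],
    "age": [24, 17, None, 21],
    "experience": ["Beginner", "Veteran", "Amateur", "Pro"],
    "played_hours": [0.1, 0.0, 17.2, 0.0],
    "gender": ["Male", "Male", "Female", "Non-binary"],
    "name": ["A", "B", "C", "D"],
})

X, y = prepare_for_classification(toy_players)
print(X.columns.tolist())
print(y.tolist())
```

The returned `X` and `y` could then be passed to whichever classifier is chosen for the analysis.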

---

## (3) Exploratory Data Analysis and Visualization
The `players.csv` data can be loaded as follows:

In [6]:
import altair as alt
import numpy as np
import pandas as pd
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)
players.sample(10)

| | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 121 | Beginner | True | 9c2a5275ada716508a7e66629dfbeaa4c03ca9eb5f30d2... | 0.1 | Bella | Male | 24 | | |
| 96 | Amateur | True | bd10e7eab5531d5e82c9eebb929e2703f05367a90b0533... | 0.1 | Gabriel | Male | 19 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 127 | Amateur | True | 56bae029998e9fef1cf18f614dc0a5b300640494787c10... | 0.0 | Peyton | Male | 21 | | |
| 22 | Beginner | True | f9ac013b2f0bc2bd4928a6a0fc8a0aae8b5c4f2670cf9e... | 1.0 | Leah | Male | 17 | | |
| 67 | Amateur | True | 18936844e06b6c7871dce06384e2d142dd86756941641e... | 17.2 | Kyrie | Male | 14 | | |
| 45 | Beginner | False | fa7d496b2f74c51ec70395bd8397b49f97a3ce8d7ba7e0... | 0.0 | Umar | Male | 24 | | |
| 138 | Veteran | True | c7bc394038bc433736fb1ecec22e6490582a5c9c6fb312... | 0.0 | Jason | Male | 17 | | |
| 10 | Veteran | True | 7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e... | 1.6 | Lane | Female | 23 | | |
| 65 | Veteran | True | dc73467f73263dd4a07838330dd1cc115aa3f8b0353891... | 0.1 | Felix | Male | 21 | | |


Then we can wrangle the data so that only the relevant variables remain:

In [7]:
players = players[['subscribe', 'age', 'gender', 'experience', 'played_hours']]
players.sample(10)

| | subscribe | age | gender | experience | played_hours |
|---|---|---|---|---|---|
| 183 | True | 22 | Male | Amateur | 32.0 |
| 151 | True | 17 | Male | Amateur | 0.0 |
| 72 | True | 17 | Male | Veteran | 0.0 |
| 11 | True | 17 | Male | Pro | 0.0 |
| 105 | True | 17 | Male | Regular | 0.0 |
| 133 | True | 17 | Two-Spirited | Beginner | 0.0 |
| 78 | True | 22 | Non-binary | Regular | 0.0 |
| 100 | True | 20 | Male | Amateur | 0.0 |
| 130 | True | 23 | Male | Amateur | 56.1 |
| 187 | True | 17 | Male | Amateur | 0.0 |


The following are some simple plots of the response variable against individual explanatory variables.

Subscription distribution by age:

In [12]:
alt.Chart(players).mark_point().encode(
    y='subscribe:N',
    x='age:Q',
    color='subscribe:N'
).properties(title='Age Distribution by Subscription Status')


Subscription distribution by experience:

In [19]:
count_chart = (
    alt.Chart(players)
    .mark_bar()
    .encode(
        x=alt.X('experience:N', title='Experience Level'),
        y=alt.Y('count()', title='Count'),
        color=alt.Color('subscribe:N', title='Subscribed')
    )
    .properties(height=200, title='Count of Players by Experience and Subscription')
)
count_chart
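
Because the experience groups differ in size, raw counts can be hard to compare. As a complement to the count chart, the sketch below computes the share of subscribers within each experience level; the toy rows are made up for illustration and do not reflect the real data.

```python
import pandas as pd

# Hypothetical rows standing in for the wrangled players data frame.
toy_players = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Veteran", "Veteran", "Pro", "Beginner"],
    "subscribe": [True, True, True, False, True, False],
})

# Group size and proportion of subscribers per experience level.
by_experience = (
    toy_players.groupby("experience")["subscribe"]
    .agg(n_players="size", subscriber_share="mean")
)
print(by_experience)
```

Applying the same grouping to `players` would give a table that can be read alongside the stacked bars.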