https://github.com/sherisezhang/project_individual.git

# Load package and import data

In [1]:
import pandas as pd
import altair as alt

file_id_1 = '1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz'
url_1 = f'https://drive.google.com/uc?export=download&id={file_id_1}'
players = pd.read_csv(url_1)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [2]:
players[['experience']].drop_duplicates()

Unnamed: 0,experience
0,Pro
1,Veteran
3,Amateur
4,Regular
12,Beginner


In [3]:
players[['played_hours', 'age']].agg(['max', 'min', 'mean'])

Unnamed: 0,played_hours,age
max,223.1,99.0
min,0.0,8.0
mean,5.845918,21.280612


In [4]:
file_id_2 = '14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB'
url_2 = f'https://drive.google.com/uc?export=download&id={file_id_2}'
sessions = pd.read_csv(url_2)
sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1.719770e+12,1.719770e+12
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1.718670e+12,1.718670e+12
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1.721930e+12,1.721930e+12
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1.721880e+12,1.721880e+12
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1.716650e+12,1.716650e+12
...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07,1.715380e+12,1.715380e+12
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19,1.719810e+12,1.719810e+12
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57,1.722180e+12,1.722180e+12
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22,1.721890e+12,1.721890e+12


# Data Description

players.csv contains player demographic and behavioural information. sessions.csv records individual play sessions (when and how long each player played), which reflects activity frequency.
**Only players.csv is used for analysis.**
### "players.csv" dataset (196 rows, 9 columns) includes:
+ 'experience' (categorical): player experience level
+ 'subscribe' (boolean): whether the player subscribed to the newsletter
+ 'played_hours' (numeric): from 0 to 223.1
+ 'gender' (categorical)
+ 'age' (numeric): from 8 to 99
+ 'hashedEmail' is not human-readable, and 'individualId' and 'organizationName' contain only missing values. These three, along with 'name', are identifier columns and will be removed.
### "sessions.csv" includes:
+ start time and end time
+ original start and end time

In [5]:
players['gender'].value_counts()

gender
Male                 124
Female                37
Non-binary            15
Prefer not to say     11
Two-Spirited           6
Agender                2
Other                  1
Name: count, dtype: int64

In [6]:
players['subscribe'].value_counts()

subscribe
True     144
False     52
Name: count, dtype: int64

# Potential issues
1. Identifier columns and columns containing only missing values are non-informative, so they are dropped.
2. The target variable ('subscribe': 144 True, 52 False) is imbalanced, so precision and recall will be reported alongside accuracy.
3. As the gender counts above show, several gender levels are rare, so they are merged into 'Other'.
4. 'experience' and 'gender' are categorical, so they must be one-hot encoded before KNN can use them.
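
To make the imbalance in item 2 concrete, the sketch below rebuilds the class counts reported above (144 True, 52 False) and computes the accuracy a model would reach by always predicting the majority class. The Series is reconstructed from those counts for illustration; it is not read from players.csv.

```python
import pandas as pd

# Rebuild the target from the counts reported by value_counts() above.
subscribe = pd.Series([True] * 144 + [False] * 52, name='subscribe')

# Share of each class in the data.
proportions = subscribe.value_counts(normalize=True)

# Accuracy of a "model" that always predicts the most common class.
baseline_accuracy = proportions.max()

print(proportions.round(3))
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

Any classifier should be compared against this baseline, which is why accuracy alone is not enough here.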

In [7]:
players = players.drop(columns = ['individualId', 'organizationName', 'name', 'hashedEmail'], errors='ignore')
players

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


In [8]:
# Merge the rarest gender levels into a single 'Other' category.
players['gender'] = players['gender'].replace(
    ['Agender', 'Two-Spirited', 'Prefer not to say'], 'Other')
players

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Other,17
194,Amateur,False,2.3,Male,17


# Question 

### I will investigate Question 1: **"What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"**
Potential predictors: 'experience', 'played_hours', 'age', 'gender'. Since 'subscribe' is categorical, this is a classification problem. **KNN** is an appropriate predictive method because it compares observations across several features at once and does not assume a linear relationship in the data.

# Exploratory Data Analysis and Visualization

#### (Note: I completed the required minimal wrangling earlier, before writing the question, to keep the cleaned dataset consistent throughout the notebook.)

The plots below let me check whether each potential predictor is related to subscription status.

### Experience vs Subscribe

In [9]:
import altair as alt

experience_plot = alt.Chart(players).mark_bar().encode(
    x=alt.X('experience').title('player experience level'),
    y='count()',
    color='subscribe'
).properties(title='Player subscription count by Experience level'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
experience_plot

Pro and Regular players show the highest subscription proportions (around 80–90%), and Veterans show the lowest. Experience level therefore appears to be related to subscription behaviour, so it is kept as a predictor.
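
Proportions are hard to read off stacked count bars, so a per-level subscription rate can be computed directly. The sketch below uses a small made-up table (the `toy` rows are hypothetical, not taken from players.csv); the same `groupby` call could be applied to `players`.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'experience': ['Pro', 'Pro', 'Veteran', 'Veteran',
                   'Amateur', 'Amateur', 'Amateur', 'Regular'],
    'subscribe': [True, False, True, False, True, True, False, True],
})

# The mean of a boolean column is the share of True values,
# so this gives the subscription rate within each experience level.
rates = toy.groupby('experience')['subscribe'].mean()
print(rates.round(2))
```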

### Gender vs Subscribe

In [10]:
gender_plot = alt.Chart(players).mark_bar().encode(
    x=alt.X('gender').title('Gender'),
    y=alt.Y('count()'),
    color='subscribe'
).properties(title='Player subscription count by Gender'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
gender_plot

The subscription proportion is similar across genders (around 75%), so gender does not appear to be associated with subscription.

### Played Hours vs Subscribe

In [11]:
played_hours_plot = alt.Chart(players).mark_boxplot(size=70).encode(
    x=alt.X('subscribe:N').title('subscribe'),
    y=alt.Y('played_hours:Q').title('Played Hours'),
    color='subscribe:N'
).properties(title='Distribution of Played Hours by Subscription'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
played_hours_plot

Reading from the boxplot, the median played hours is around 20 for subscribers, compared with only about 5 for non-subscribers, so played hours is kept as a predictor.
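
Because played_hours is strongly right-skewed (the overall mean is about 5.8 hours while the maximum is 223.1), values read off a boxplot are approximate, and it is worth computing the group medians directly. The sketch below shows one way to do that on a small made-up table; the `toy` values are hypothetical and only illustrate how a single large value pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'subscribe': [True, True, True, True, True,
                  False, False, False, False],
    'played_hours': [0.0, 0.5, 1.0, 2.0, 150.0,
                     0.0, 0.1, 0.3, 1.5],
})

# Median and mean of played hours within each subscription group.
summary = toy.groupby('subscribe')['played_hours'].agg(['median', 'mean'])
print(summary.round(2))

# The same medians as plain scalars, selected with a boolean mask.
sub_median = toy.loc[toy['subscribe'], 'played_hours'].median()
non_median = toy.loc[~toy['subscribe'], 'played_hours'].median()
```

Running the same `groupby` on `players` would give the exact medians for the real data.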

### Age vs Subscribe

In [12]:
age_plot = alt.Chart(players).mark_boxplot(size=70).encode(
    x=alt.X('subscribe:N').title('subscribe'),
    y=alt.Y('age:Q').title('Age'),
    color='subscribe:N'
).properties(title='Distribution of Age by Subscription'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
age_plot 

The median age and the interquartile range are nearly the same for both subscription statuses, indicating that the relationship between age and subscription is weak.

# Methods and Plan

After visualization, I plan to build a KNN classification model in which the response variable is subscribe and the predictors are **played_hours and experience.**

### Why KNN?

I use KNN because it makes predictions based on the similarity between observations and can naturally capture non-linear relationships.
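
As a rough illustration of what "similarity between observations" means, here is a minimal from-scratch sketch of a KNN prediction: find the k closest training points and take a majority vote. The function name `knn_predict` and the toy data are hypothetical; the actual analysis would use scikit-learn rather than this helper.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(bool(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature training set (e.g. a standardized played_hours).
X_train = np.array([[0.0], [0.2], [0.4], [2.0], [2.5], [3.0]])
y_train = np.array([False, False, False, True, True, True])

pred_low = knn_predict(X_train, y_train, np.array([0.3]))
pred_high = knn_predict(X_train, y_train, np.array([2.4]))
print(pred_low, pred_high)
```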

### Assumptions and preprocessing required

KNN requires numeric, scaled features. Therefore, experience will be converted to numeric indicator columns using one-hot encoding, and the numeric predictor played_hours will be standardized using StandardScaler.
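
For readers unfamiliar with one-hot encoding, the sketch below shows what it does to a categorical column, using `pd.get_dummies` on a small made-up table. This is only an illustration: the `toy` rows are hypothetical, and inside a scikit-learn pipeline the same step would more likely be done with `OneHotEncoder`.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'experience': ['Pro', 'Veteran', 'Amateur', 'Regular', 'Beginner', 'Pro'],
    'played_hours': [30.3, 3.8, 0.7, 0.1, 1.0, 12.0],
})

# Each experience level becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=['experience'])
indicator_cols = [c for c in encoded.columns if c.startswith('experience_')]

print(encoded)
```

Each row has exactly one indicator switched on, so no ordering between levels is implied.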

### Limitations

+ KNN can perform poorly when the response classes are imbalanced. To account for this, I will evaluate model performance using precision and recall in addition to accuracy.
+ One-hot encoding loses the natural order among categories (e.g. Beginner < Regular), but it allows KNN to handle categorical variables correctly.
+ Since KNN relies on distances, standardization will be applied to prevent one feature from dominating the distance.
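
The last point can be seen in a small worked example. The sketch below compares Euclidean distances before and after standardization for three made-up rows with one wide-range feature and one 0/1 indicator. The rows and the `is_pro` column are hypothetical; the manual z-score uses the population standard deviation, which is also what StandardScaler uses.

```python
import numpy as np

# Hypothetical rows: [played_hours, is_pro]; values are illustrative only.
X = np.array([
    [0.0, 0.0],
    [5.0, 1.0],
    [200.0, 0.0],
])


def distance_from_first(points):
    """Euclidean distance from the first row to each of the other rows."""
    return np.linalg.norm(points[1:] - points[0], axis=1)


# Raw distances are driven almost entirely by played_hours.
raw = distance_from_first(X)

# Standardize each column to mean 0 and standard deviation 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = distance_from_first(X_scaled)

# How much farther the third row is than the second, before and after scaling.
raw_ratio = raw[1] / raw[0]
scaled_ratio = scaled[1] / scaled[0]
print(f'raw ratio: {raw_ratio:.1f}, scaled ratio: {scaled_ratio:.2f}')
```

Before scaling, the difference in experience between the first two rows barely registers; after scaling, both features contribute on a comparable footing.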

### Data Processing

1. Split data into training (80%) and test (20%) sets.
2. Experience is one-hot encoded, and the numeric predictor (played_hours) is standardized, both inside a pipeline.
3. Use GridSearchCV with 5-fold cross-validation on the training set to pick the best K value.
4. Fit the best model on the training set and evaluate on test set.
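
The four steps above could be sketched with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the final analysis: the data are synthetic (generated with a fixed seed), and the stratified split, the `random_state` values, and the K range of 1 to 15 are choices made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned players table (illustrative only).
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    'experience': rng.choice(
        ['Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro'], size=n),
    'played_hours': rng.exponential(scale=5.0, size=n),
})
toy['subscribe'] = (toy['played_hours'] + rng.normal(0, 3, size=n)) > 2

X = toy[['experience', 'played_hours']]
y = toy['subscribe']

# Step 1: 80/20 split (stratifying on the target is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Step 2: one-hot encode experience, standardize played_hours.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['experience']),
    (StandardScaler(), ['played_hours']),
)
pipe = make_pipeline(preprocessor, KNeighborsClassifier())

# Step 3: 5-fold cross-validation over K on the training set only.
grid = GridSearchCV(
    pipe,
    param_grid={'kneighborsclassifier__n_neighbors': list(range(1, 16))},
    cv=5,
    scoring='accuracy',
)
grid.fit(X_train, y_train)

# Step 4: GridSearchCV refits the best pipeline on the full training set;
# score() then evaluates it on the held-out test set.
best_k = grid.best_params_['kneighborsclassifier__n_neighbors']
test_accuracy = grid.score(X_test, y_test)
print(f'best K: {best_k}, test accuracy: {test_accuracy:.2f}')
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training folds only, so no information from validation or test data leaks into them.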

### Model Evaluation Plan

Model performance will be assessed using accuracy, precision, and recall to address class imbalance. The confusion matrix will be used to check for bias toward one class, and the model with the highest cross-validated accuracy will be chosen as the final version.
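
As a small worked example of these metrics, the sketch below computes them with `sklearn.metrics` for ten made-up predictions. The `y_true` and `y_pred` vectors are hypothetical and do not come from any fitted model; subscribing (True) is treated as the positive class.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 6 actual subscribers, 4 actual non-subscribers.
y_true = [True, True, True, True, True, True, False, False, False, False]
y_pred = [True, True, True, True, True, False, True, True, False, False]

# Rows are actual classes, columns are predicted classes,
# both in the order [False, True].
cm = confusion_matrix(y_true, y_pred, labels=[False, True])

accuracy = accuracy_score(y_true, y_pred)     # correct / total
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)

print(cm)
print(f'accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}')
```

Here 5 of the 7 predicted subscribers are real subscribers (precision 5/7) and 5 of the 6 real subscribers are found (recall 5/6), which shows how the two metrics can differ even when accuracy looks reasonable.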