https://github.com/sherisezhang/project_individual.git

# Load package and import data

In [3]:
import pandas as pd
import altair as alt

players = pd.read_csv("players.csv")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [4]:
players[['experience']].drop_duplicates()

Unnamed: 0,experience
0,Pro
1,Veteran
3,Amateur
4,Regular
12,Beginner


In [5]:
players[['played_hours', 'age']].agg(['max', 'min', 'mean'])

Unnamed: 0,played_hours,age
max,223.1,99.0
min,0.0,8.0
mean,5.845918,21.280612


# Data Description

### I will use the "players.csv" dataset (196 rows, 9 columns).
### Key variables are:
+ 'experience'(categorical): player experience level, one of {Pro, Veteran, Amateur, Regular, Beginner}.
+ 'subscribe'(boolean): whether or not the player has subscribed to the game newsletter. This will be the response variable.
+ 'hashedEmail'(identifier): the hashed email address of the player.
+ 'played_hours'(numeric): total number of hours the player has spent in the game.
+ 'name'(identifier): player name. Not useful for modelling.
+ 'gender'(categorical): gender of the player.
+ 'age'(numeric): age of the player in years.

In [6]:
players['gender'].value_counts()

gender
Male                 124
Female                37
Non-binary            15
Prefer not to say     11
Two-Spirited           6
Agender                2
Other                  1
Name: count, dtype: int64

In [7]:
players['subscribe'].value_counts()

subscribe
True     144
False     52
Name: count, dtype: int64

# Potential issues
1. The 'hashedEmail' column contains anonymized identifiers, which can't be interpreted as a meaningful predictor, so it will be removed.
2. The columns 'individualId' and 'organizationName' contain only missing values, so they will be removed as well.
3. Class imbalance is present in the target variable ('subscribe': 144 True, 52 False), so accuracy alone isn't enough to evaluate the model later. Alternative measures such as precision and recall should also be considered.
4. From the table above, 'Agender', 'Other', and 'Two-Spirited' have very few observations compared to 'Male' and 'Female'. To avoid instability when fitting a KNN model, these levels, together with 'Prefer not to say', will be merged into a single level called 'Other'.
5. If we want to use 'experience' as a predictor, we need to process it first because it is not numerical. One-hot encoding is an option.
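
The checks behind issues 2 to 4 can be sketched as below. This is a minimal illustration on a hypothetical toy table (the values are invented, not taken from players.csv), and the rarity threshold is an arbitrary assumption. On the real data, the reported counts give a True share of 144/196, or about 73.5%.

```python
import pandas as pd

# Hypothetical toy stand-in for players.csv (values invented for illustration).
toy = pd.DataFrame({
    "subscribe": [True, True, True, False],
    "gender": ["Male", "Female", "Agender", "Male"],
    "individualId": [None, None, None, None],
    "organizationName": [None, None, None, None],
})

# Columns in which every value is missing (candidates for removal).
all_missing = toy.columns[toy.isna().all()].tolist()

# The mean of a boolean column is the share of True values,
# which quantifies the class imbalance in the response.
share_true = toy["subscribe"].mean()

# Levels with very few rows (a threshold of 1 is arbitrary for this toy data).
level_counts = toy["gender"].value_counts()
rare_levels = sorted(level_counts[level_counts <= 1].index)

print(all_missing)
print(share_true)
print(rare_levels)
```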

In [8]:
players = players.drop(columns = ['individualId', 'organizationName', 'name', 'hashedEmail'], errors='ignore')
players

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


In [9]:
# Merge the low-count gender levels into the existing 'Other' level.
players['gender'] = players['gender'].replace(
    ['Agender', 'Two-Spirited', 'Prefer not to say'], 'Other')
players

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Other,17
194,Amateur,False,2.3,Male,17


# Question 

### In this project, I will investigate Question 1: **"What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"**
#### The goal is to determine which types of players are more likely to subscribe so that game teams can improve their newsletter positioning strategies.
In this dataset, the response variable is 'subscribe' (True/False). I will explore the following player characteristics as potential predictors: 'experience', 'played_hours', 'age', and 'gender'. Since 'subscribe' is categorical, I will use **KNN classification** as the predictive method. KNN is appropriate here because it compares observations by their similarity across several features and does not assume a linear relationship in the data. In addition, 'experience' and 'gender' are categorical, so I will use one-hot encoding to convert them to numeric features.


# Exploratory Data Analysis and Visualization

#### (Note: I completed the required minimal wrangling earlier, before writing the question, to keep the cleaned dataset consistent throughout the notebook.)

Next, I use visualizations to explore the relationship between the selected predictor variables and the response ('subscribe'). These plots allow me to check whether the predictors are related to subscription status and, if so, how subscription differs across their levels or distributions. This exploratory step will help determine which features are likely to be useful for modelling subscription behaviour.

### Experience vs Subscribe

In [10]:
import altair as alt

experience_plot = alt.Chart(players).mark_bar().encode(
    x=alt.X('experience').title('player experience level'),
    y='count()',
    color='subscribe'
).properties(title='Player subscription count by Experience level'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
experience_plot

From the plot, players with higher experience appear to subscribe more often, suggesting that experience level may influence subscription behaviour. As a result, experience level can be used as a predictor in the model.

### Gender vs Subscribe

In [11]:
gender_plot = alt.Chart(players).mark_bar().encode(
    x=alt.X('gender').title('Gender'),
    y=alt.Y('count()'),
    color='subscribe'
).properties(title='Player subscription count by Gender'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
gender_plot

This plot suggests that the proportion of subscribers is similar across genders, which means gender does not appear to be associated with subscription. Therefore, it will be excluded as a predictor.
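
Stacked counts make proportions hard to compare when group sizes differ (for example, 124 Male vs. 37 Female players), so a table of subscription rates per level can back up the visual impression. The sketch below uses a hypothetical toy table with invented values. On the notebook's data, the same `groupby` call would be applied to `players`.

```python
import pandas as pd

# Hypothetical toy data (invented values, not the real players table).
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Other", "Other"],
    "subscribe": [True, True, True, False, True, False, True, False],
})

# The mean of a boolean column is the share of True values, so grouping by
# gender gives the subscription rate within each level.
rate_by_gender = toy.groupby("gender")["subscribe"].mean()

# Report group sizes next to the rates, since rates from tiny groups are noisy.
summary = toy.groupby("gender")["subscribe"].agg(rate="mean", n="size")
print(summary)
```

The same pattern works for 'experience' by swapping the grouping column.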

### Played Hours vs Subscribe

In [12]:
played_hours_plot = alt.Chart(players).mark_boxplot(size=70).encode(
    x=alt.X('subscribe:N').title('subscribe'),
    y=alt.Y('played_hours:Q').title('Played Hours'),
    color='subscribe:N'
).properties(title='Distribution of Played Hours by Subscription'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
played_hours_plot

The boxplot shows that played hours are generally higher for subscribed users, which means players who subscribed tended to spend more hours playing than those who did not. So played hours can be used as a predictor.

### Age vs Subscribe

In [13]:
age_plot = alt.Chart(players).mark_boxplot(size=70).encode(
    x=alt.X('subscribe:N').title('subscribe'),
    y=alt.Y('age:Q').title('Age'),
    color='subscribe:N'
).properties(title='Distribution of Age by Subscription'
).configure_axis(labelFontSize=14, titleFontSize=14).configure_title(fontSize=16)
age_plot 

The plot shows that the median age and the interquartile range for subscribed and non-subscribed players are nearly the same, indicating that the relationship between age and subscription is weak. Thus, age alone is unlikely to be a useful predictor.

#### Based on the exploratory analysis, I will keep 'experience' and 'played_hours' as predictors for modelling subscription, and exclude 'gender' and 'age' due to weak or no association.

# Methods and Plan

To answer the question "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?", I plan to build a K-Nearest Neighbours classification model, where the response variable is subscribe and the predictors are played_hours and experience.

### Why KNN?

I use KNN because this method doesn't need to assume linearity or any other functional form, and it makes predictions based on the similarity between observations, which fits behavioural data. In addition, it can naturally capture non-linear relationships between player features and subscription behaviour.

### Assumptions and preprocessing required

KNN requires a meaningful distance measure, so the predictors must be numeric and on similar scales. Therefore, experience will be converted from categorical to numeric using one-hot encoding. The predictors will then be standardized using StandardScaler so that no single feature dominates the distance. Scaling is necessary before fitting the model.
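
As a rough sketch of this preprocessing, the snippet below one-hot encodes experience and standardizes played_hours with a scikit-learn column transformer. The toy rows are hypothetical, and scaling only the numeric column (leaving the 0/1 indicator columns as they are) is my assumption rather than a settled choice. `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with the two chosen predictors.
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Regular", "Beginner", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7, 0.1, 0.0, 2.3],
})

preprocessor = make_column_transformer(
    # One indicator column per experience level.
    (OneHotEncoder(handle_unknown="ignore"), ["experience"]),
    # Centre played_hours at 0 with unit variance.
    (StandardScaler(), ["played_hours"]),
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(toy)
print(X.shape)  # 5 indicator columns + 1 scaled column
```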

### Limitations

+ KNN can perform poorly when the response classes are imbalanced, which is the case here because 'True' is more frequent than 'False' in subscribe.
+ One-hot encoding 'experience' loses the natural order among categories (e.g. Beginner < Regular), increases the number of features, and produces sparse data. As a result, the model cannot recognize the ordinal relationship and may become less efficient.
+ Since features are on different scales, KNN requires careful scaling. Otherwise, one feature can dominate the distance.
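
For comparison with the one-hot limitation above, an ordinal encoding would keep the order of the levels in a single numeric column. The sketch below is only an illustration of that alternative, not the method planned here, and the ranking in `assumed_order` is a guess: the dataset does not document how the five levels rank against each other.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# ASSUMPTION: this ordering is a guess for illustration only.
assumed_order = ["Beginner", "Amateur", "Regular", "Veteran", "Pro"]

# Hypothetical toy rows.
toy = pd.DataFrame({"experience": ["Pro", "Beginner", "Regular", "Amateur"]})

# Each level is mapped to its position in assumed_order (0 = lowest).
encoder = OrdinalEncoder(categories=[assumed_order])
codes = encoder.fit_transform(toy[["experience"]])
print(codes.ravel())
```

The trade-off is that an ordinal code also imposes equal spacing between adjacent levels, which the data may not support.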

### Data Processing

1. Split data into training (80%) and test (20%) sets.
2. One-hot encode experience so it can be used in KNN.
3. Standardize the numeric predictor (played_hours) to make distances meaningful.
4. Use a scikit-learn Pipeline to combine preprocessing and modeling steps.
5. Use GridSearchCV with 5-fold cross-validation on the training set to test different values of k and pick the one with the best score.
6. Fit the best model on the training set, then check performance on the separate test set.
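
The six steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the data are randomly generated stand-ins for the cleaned players table (so the resulting scores carry no meaning), the k grid of odd values from 1 to 21 is an arbitrary choice, and the stratified split is my addition to keep the class ratio similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data standing in for the cleaned players table.
rng = np.random.default_rng(0)
n = 200
levels = ["Pro", "Veteran", "Amateur", "Regular", "Beginner"]
toy = pd.DataFrame({
    "experience": rng.choice(levels, size=n),
    "played_hours": rng.exponential(scale=5.0, size=n),
    "subscribe": rng.random(n) < 0.73,
})
X = toy[["experience", "played_hours"]]
y = toy["subscribe"]

# Step 1: 80/20 split (stratified: an assumption, not in the plan above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Steps 2-4: preprocessing and KNN combined in one pipeline.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["experience"]),
    (StandardScaler(), ["played_hours"]),
)
knn_pipeline = make_pipeline(preprocessor, KNeighborsClassifier())

# Step 5: 5-fold cross-validation over candidate values of k.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 22, 2))}
grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Step 6: GridSearchCV refits the best pipeline on the full training set,
# so it can be scored directly on the held-out test set.
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```

Putting the encoder and scaler inside the pipeline means they are fit only on the training folds during cross-validation, which avoids leaking information from the validation folds.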

### Model Evaluation Plan

1. Set a train-test split to evaluate model performance and use 5-fold cross-validation on the training set to select the best value of k.
2. Choose the k with the highest average validation accuracy.
3. Train the final model using the full training set, and evaluate the final model on the test set.
4. Main metric: accuracy, reported together with precision and recall because the classes are imbalanced (144 True vs. 52 False).
5. Review the confusion matrix to check for bias toward the majority class.
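
To illustrate why accuracy alone can mislead here, the sketch below computes the planned metrics on a set of hypothetical labels and predictions (invented for the example, not model output). The predictions lean heavily toward 'True', which is the kind of majority-class bias the confusion matrix is meant to reveal.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels and predictions, invented to show the calculation.
y_true = [True, True, True, True, True, True, False, False, False, False]
y_pred = [True, True, True, True, True, True, True, True, True, False]

# Rows are true classes, columns are predicted classes, ordered [False, True].
cm = confusion_matrix(y_true, y_pred, labels=[False, True])

accuracy = accuracy_score(y_true, y_pred)                    # 7 of 10 correct
precision = precision_score(y_true, y_pred, pos_label=True)  # 6 of 9 predicted True
recall = recall_score(y_true, y_pred, pos_label=True)        # 6 of 6 actual True
# Recall on the minority class: only 1 of 4 non-subscribers is identified.
recall_false = recall_score(y_true, y_pred, pos_label=False)

print(cm)
print(accuracy, precision, recall, recall_false)
```

In this toy case the accuracy of 0.7 looks acceptable, while recall on the 'False' class is only 0.25.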