# Methods and Results

### Methods

We build a KNN classification model in which the response variable is subscribe and the predictors are **played_hours** and **age**. Because subscription status is a categorical variable, regression is not appropriate. We also do not know in advance what shape the relationship between the predictors and the response takes, so we use KNN, which makes few assumptions about the data and can capture non-linear relationships. KNN requires numeric features and is sensitive to their scale, so both predictors are standardized with StandardScaler.
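
To illustrate why scaling matters for a distance-based method, here is a minimal sketch using only the standard library. The four players are hypothetical (they are not rows from players.csv), and the helper names `euclidean` and `standardize` are introduced here for illustration. In raw units the played_hours gap dominates the distance; after standardization a large age gap and a large played_hours gap count about equally.

```python
import math
from statistics import mean, pstdev

# Hypothetical players as (age in years, played_hours).
# Illustrative values only, not taken from players.csv.
demo_players = [(17, 0.0), (21, 0.7), (22, 150.0), (60, 1.0)]


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1.

    Uses the population standard deviation, which is what
    scikit-learn's StandardScaler computes.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


scaled_players = standardize(demo_players)

# Player 0 vs player 2: similar age, very different played_hours.
# Player 0 vs player 3: similar played_hours, very different age.
raw_hours_gap = euclidean(demo_players[0], demo_players[2])
raw_age_gap = euclidean(demo_players[0], demo_players[3])
scaled_hours_gap = euclidean(scaled_players[0], scaled_players[2])
scaled_age_gap = euclidean(scaled_players[0], scaled_players[3])

print(f"raw:    hours gap {raw_hours_gap:.2f}, age gap {raw_age_gap:.2f}")
print(f"scaled: hours gap {scaled_hours_gap:.2f}, age gap {scaled_age_gap:.2f}")
```

Without scaling, the nearest neighbors of a player would be decided almost entirely by played_hours, simply because that column spans a wider numeric range.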

### Load package and import data

In [1]:
import pandas as pd
import altair as alt
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [2]:
players[['played_hours', 'age']].agg(['max', 'min', 'mean']) 

Unnamed: 0,played_hours,age
max,223.1,99.0
min,0.0,8.0
mean,5.845918,21.280612


In [3]:
players['subscribe'].value_counts() 

subscribe
True     144
False     52
Name: count, dtype: int64
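
The classes are imbalanced, so it helps to know how well a trivial classifier would do. The sketch below copies the two counts from the output above and computes the accuracy of always predicting the majority class; the variable names are introduced here for illustration.

```python
# Class counts copied from the value_counts() output above.
class_counts = {True: 144, False: 52}

total_players = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total_players

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Any model accuracy reported later is easier to interpret when compared against this baseline of roughly 0.735.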

### Data Description
players.csv contains player demographic and behavioural information.
Only players.csv is used for the analysis. Our question concerns subscription status, age and played_hours, so we keep those columns. We also keep the 'name' column because it makes the data frame easier to read.
### "players.csv" dataset (196 rows, 9 columns) includes:
+ 'experience' (categorical): player experience level
+ 'subscribe' (boolean)
+ 'played_hours' (numeric): from 0 to 223.1
+ 'gender' (categorical)
+ 'age' (numeric): from 8 to 99
+ 'hashedEmail' is not human readable; 'individualId' and 'organizationName' contain only missing values

In [4]:
players_clean = players[["played_hours", "subscribe", "name","age"]]
players_clean

Unnamed: 0,played_hours,subscribe,name,age
0,30.3,True,Morgan,9
1,3.8,True,Christian,17
2,0.0,False,Blake,17
3,0.7,True,Flora,21
4,0.1,True,Kylie,21
...,...,...,...,...
191,0.0,True,Bailey,17
192,0.3,False,Pascal,22
193,0.0,False,Dylan,17
194,2.3,False,Harlow,17


### Exploratory Data Analysis and Visualization

#### (Note: I completed the required minimal wrangling earlier, before writing the question, to keep the cleaned dataset consistent throughout the notebook.)

Ideally the data would first be split into training and test sets, with all EDA performed only on the training set to avoid information leakage. The plots below, however, are drawn from players_clean, the full cleaned data frame, because the split happens later in the modelling step.
The plots allow me to check whether the predictors are related to subscription status.
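
One way to restrict exploratory summaries to training rows is to split before plotting or summarizing. The sketch below is an assumption-laden illustration, not the notebook's own workflow: `toy` is a hypothetical stand-in for players_clean with made-up values, and the `stratify` argument is an optional addition that the notebook's later split does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for players_clean; the values are made up.
toy = pd.DataFrame({
    "played_hours": [0.0, 0.1, 0.7, 2.3, 3.8, 30.3, 0.0, 0.3, 1.2, 5.0],
    "age": [17, 21, 21, 17, 17, 9, 17, 22, 30, 45],
    "subscribe": [False, True, True, False, True,
                  True, True, False, True, False],
})

# Split first. stratify keeps the class proportions similar in both parts
# (an optional choice, not something the original notebook does).
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["subscribe"]
)

# Exploratory summaries then use the training rows only.
train_medians = toy_train.groupby("subscribe")["played_hours"].median()
print(train_medians)
```

Charts built from `toy_train` instead of the full data frame would then never show test rows.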

#### Played Hour vs Subscribe

In [5]:
main_plot = alt.Chart(players_clean).mark_boxplot(size=70).encode(
    x=alt.X('subscribe:N').title('subscribe'),
    y=alt.Y('played_hours:Q').title('Played Hours'),
    color='subscribe:N'
).properties(title='Distribution of Played Hours by Subscription'
)
caption = alt.Chart().mark_text(
    align='center',
    dy=230,          
    fontSize=14
).encode(
    text=alt.value("Figure 1")
)
played_hours_plot = alt.layer(main_plot, caption).configure_axis(
    labelFontSize=14,
    titleFontSize=14
).configure_title(
    fontSize=16
)

played_hours_plot

Figure 1 shows that the median played hours for subscribers is around 20, compared with only about 5 for non-subscribers. Players with very high played hours (roughly 150-230) are almost all subscribers, while non-subscribers are concentrated at very low played hours (around 0-10). This suggests that higher played hours are strongly associated with having a subscription, so played_hours is a useful predictor, even though the two groups still overlap at lower play times.

#### Age vs Subscribe

In [6]:
main_plot = alt.Chart(players_clean).mark_bar(size=50).encode(
    x=alt.X('age:Q', bin=alt.Bin(step=10), title="Age"),
    y=alt.Y('count()'),
    color='subscribe:N'
).properties(title='Distribution of subscription by Age', width=600, height=350
)
caption = alt.Chart().mark_text(
    align='center',
    dy=230,          
    fontSize=14
).encode(
    text=alt.value("Figure 2")
)
age_plot = alt.layer(main_plot, caption).configure_axis(
    labelFontSize=14,
    titleFontSize=14
).configure_title(
    fontSize=16
)
age_plot 

In Figure 2, younger players tend to subscribe more often, especially between ages 0 and 30, where subscribed users are much more common than non-subscribed users. Between ages 30 and 50, the proportion of non-subscribed users becomes higher than the proportion of subscribed users. Among players over 50, most are non-subscribers, although the small group in the 90-100 age range is mostly subscribed. This pattern shows that age does play a role in subscription status, even though the strength of the relationship varies across age groups.
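
Because Figure 2 stacks counts, proportions within each age bin have to be read off by eye. A small table of subscription rates per 10-year bin makes the same comparison explicit. This is a sketch on a hypothetical data frame `toy_ages` with made-up values; the column names `subscribe_rate` and `n_players` are introduced here.

```python
import pandas as pd

# Hypothetical ages and subscription flags, for illustration only.
toy_ages = pd.DataFrame({
    "age": [9, 17, 17, 21, 22, 35, 41, 58, 99],
    "subscribe": [True, True, False, True, False, False, True, False, True],
})

# Left-closed 10-year bins: [0, 10), [10, 20), ..., [90, 100).
edges = list(range(0, 101, 10))
toy_ages["age_bin"] = pd.cut(toy_ages["age"], bins=edges, right=False)

# Share of subscribers and number of players in each non-empty bin.
rate_by_bin = (
    toy_ages.groupby("age_bin", observed=True)["subscribe"]
    .agg(subscribe_rate="mean", n_players="count")
)
print(rate_by_bin)
```

Reporting `n_players` next to the rate is a deliberate choice: a rate computed from one or two players, such as a sparse high-age bin, should be read with caution.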

### Process

1. We used two numerical predictors, age and played_hours, to model whether a player subscribes. First, we extracted these variables into X and stored the subscription outcome in y. The dataset was then split into 80% training data and 20% test data using train_test_split with a fixed random state for reproducibility. This ensured that the model was trained and evaluated on separate observations.
2. Because KNN is sensitive to the scale of the predictors, we standardized both numerical features. We created a ColumnTransformer that applies StandardScaler to the two predictor columns while dropping anything else. This preprocessing step was combined with a KNeighborsClassifier in a pipeline constructed with make_pipeline, ensuring that scaling is always applied before fitting or predicting.
3. To determine an appropriate number of neighbors, we performed a grid search over values of k from 1 to 9. The grid search used 5-fold cross-validation on the training set, fitting the entire pipeline each time so that both scaling and model fitting were included in the cross-validation process. The value of k with the highest mean cross-validated accuracy was selected, which here was k = 7.
4. After choosing the best model, we used it to predict the subscription outcome for the test set and added these predicted values next to the actual outcomes. To assess performance, we calculated accuracy, precision, and recall, summarizing the model’s overall correctness as well as how well it identified the positive class. We also produced a confusion matrix to show the distribution of true positives, true negatives, false positives, and false negatives, providing a more detailed view of the model’s classification behavior.

In [7]:
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predictors and response
X = players_clean[['age', 'played_hours']]
y = players_clean['subscribe']

# Hold out 20% of the rows as a test set; fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Standardize both predictors and drop any other columns
preprocessor = make_column_transformer(
    (StandardScaler(), ['age', 'played_hours']),
    remainder='drop'
)
knn = KNeighborsClassifier()
knn_pipeline = make_pipeline(preprocessor, knn)

# 5-fold cross-validated grid search over k = 1, ..., 9
param_grid = {'kneighborsclassifier__n_neighbors': range(1, 10)}
grid = GridSearchCV(knn_pipeline, param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
best_k = grid.best_params_['kneighborsclassifier__n_neighbors']
best_k

7
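
The best k alone does not show how close the other candidates were. A fitted GridSearchCV stores the per-candidate scores in its `cv_results_` attribute, which can be turned into a table. The sketch below runs on synthetic data (not players.csv); `X_demo`, `y_demo`, `demo_grid` and `cv_table` are hypothetical names, and the resulting scores say nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric features and a boolean label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(8, 60, size=120),
    "played_hours": rng.exponential(5.0, size=120),
})
y_demo = (X_demo["played_hours"] + rng.normal(0, 2, size=120)) > 3

# Same search settings as the notebook: k = 1..9, 5 folds, accuracy.
demo_grid = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 10)},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate k, best mean accuracy first.
cv_table = (
    pd.DataFrame(demo_grid.cv_results_)[
        ["param_kneighborsclassifier__n_neighbors",
         "mean_test_score",
         "std_test_score"]
    ]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "k"})
    .sort_values("mean_test_score", ascending=False)
)
print(cv_table.head())
```

If several values of k score within one standard deviation of each other, the choice of k is not strongly determined by the data, which is worth knowing on a training set of this size.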

In [8]:
player_test = players_clean.loc[X_test.index].copy()
player_test["predicted"] = grid.predict(X_test)
player_test[["name", "subscribe", "predicted"]]

Unnamed: 0,name,subscribe,predicted
136,Jamal,True,True
4,Kylie,True,True
81,Suki,True,True
181,Hunter,True,True
161,Finley,False,True
154,Mikael,True,True
62,Knox,True,True
187,Jasper,True,True
122,Bodhi,False,False
185,Sam,False,True


In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
metric_df = pd.DataFrame({
    "metric": ["accuracy", "precision", "recall"],
    "value": [accuracy_score(y_test, y_pred), precision_score(y_test, y_pred), recall_score(y_test, y_pred)]})
metric_df 

Unnamed: 0,metric,value
0,accuracy,0.8
1,precision,0.783784
2,recall,1.0


In [10]:
pd.crosstab(
    y_test,
    grid.predict(X_test),
    rownames=["Actual"],
    colnames=["Predicted"]
)

Predicted,False,True
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
False,3,8
True,0,29
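
As a cross-check, the three reported metrics can be recomputed from the four confusion-matrix counts, treating True (subscribed) as the positive class. The sketch below also computes two quantities that are not part of the original output and are added here only for context: specificity, and the accuracy of always predicting True on this test set.

```python
# Counts copied from the confusion matrix above (rows = actual).
tn, fp = 3, 8    # actual False: predicted False, predicted True
fn, tp = 0, 29   # actual True:  predicted False, predicted True

n_test = tn + fp + fn + tp

accuracy = (tp + tn) / n_test      # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted subscribers that are real
recall = tp / (tp + fn)            # share of real subscribers that are found

# Added for context, not in the original output:
specificity = tn / (tn + fp)       # share of real non-subscribers that are found
always_true_accuracy = (tp + fn) / n_test  # accuracy of predicting True for everyone

print(f"accuracy={accuracy:.3f} precision={precision:.6f} recall={recall:.1f}")
print(f"specificity={specificity:.3f} always-True accuracy={always_true_accuracy:.3f}")
```

The recomputed values match the metric table. The extra two numbers show that the model labels most test players as subscribers: it finds every actual subscriber but only 3 of the 11 non-subscribers.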


In [12]:
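# Predict for every player (training and test rows alike) for visualization.
# assign() returns a new data frame, which avoids pandas' chained-assignment warning.
players_clean = players_clean.assign(
    pred_subscribe=best_model.predict(players_clean[["age", "played_hours"]])
)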
scatter_plot = alt.Chart(players_clean).mark_circle(size=60).encode(
    x=alt.X("played_hours:Q").title("Played Hours"),
    y=alt.Y("age:Q").title("Age"),
    color=alt.Color("pred_subscribe:N").title("Predicted Subscribe"),
    tooltip=["played_hours", "age", "pred_subscribe"]
).properties(
    title="Predicted Subscription by Played Hours and Age",
    height=400,
    width=500
)
caption = alt.Chart().mark_text(
    align='center',
    dy=250,          
    fontSize=14
).encode(
    text=alt.value("Figure 3")
)
scatter_pred = alt.layer(scatter_plot, caption).configure_axis(
    labelFontSize=14,
    titleFontSize=14
).configure_title(
    fontSize=16
)
scatter_pred

In Figure 3, although neither age nor played_hours alone perfectly separates subscribers from non-subscribers, together they give KNN some structure to learn from. played_hours shows clear patterns at both very low and very high values, while age helps separate younger from older players. These two predictors therefore form distinguishable neighborhoods, which lets the model reach a test accuracy of 0.80 and make interpretable predictions.