# NOTE

This is the 3rd part of the series where I explore and model the stroke prediction dataset. <br>
Here are the notebooks in the series in the correct order:

1. **[Exploratory Data Analysis (EDA)](https://www.kaggle.com/ansonnn/stroke-prediction-eda)** <br> - Exploring the dataset to derive insights about distributions and relationships between features.<br><br>
2. **[Statistical Analysis](https://www.kaggle.com/ansonnn/stroke-prediction-statistical-analysis)** <br> - Analyzing the normality of the features and their correlations. <br><br>
3. **Feature Engineering and Modelling** - current notebook <br> - Preprocessing the features and building a model for evaluation

I spent quite a long time on the modelling, as this is my first time trying out so many techniques while building a model... <br>
So please have a look, and thanks for checking it out! <br>

I tried numerous methods to improve the model performance. Most of them did not help much, but I decided to include them under the APPENDIX section anyway.

Some of the notable problems in the dataset are:
1. Imbalanced `stroke` classes - I tried SMOTE + Tomek links to overcome this, but it did not seem to help much (see the APPENDIX section).
2. Many outliers in the `avg_glucose_level` and `bmi` features - I tried removing them using IsolationForest or the IQR rule (see the APPENDIX section).
3. Many missing values in the `bmi` feature - I tried median imputation, DecisionTreeRegressor imputation, and k-nearest-neighbours imputation, and settled on `KNNImputer`.
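
As a rough illustration of the IQR rule mentioned in point 2, here is a minimal sketch. The helper name `iqr_outlier_mask`, the multiplier `k=1.5` and the toy readings are my own assumptions for illustration, not values taken from the dataset or from the APPENDIX code.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy glucose-like readings (made-up numbers, not from the dataset)
glucose = pd.Series([80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 250.0])
mask = iqr_outlier_mask(glucose)
print(glucose[mask].tolist())  # only the extreme reading is flagged
```

Rows flagged this way can either be dropped or kept and handled by a robust scaler, which is the route taken in the main part of this notebook.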

# Setup

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install lazypredict

In [None]:
!pip install -U pandas

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.display import display, clear_output

In [None]:
%config Completer.use_jedi = False

In [None]:
sns.set()

In [None]:
pd.set_option('display.max_columns', 30)

In [None]:
def load_preprocess_df(drop_missing=False):
#     df = pd.read_csv('stroke_det_cat.csv')
#     df.drop(columns='bmi_range', inplace=True)
    
    df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
    df.drop(columns='id', inplace=True)
    
    if drop_missing:
        df = df.dropna().reset_index(drop=True)
    
    cats = list(df.select_dtypes(include=['object', 'category']).columns)
    nums = list(df.select_dtypes(exclude=['object', 'category']).columns)
    
    features_to_conv = ['hypertension', 'heart_disease', 'stroke']
    cats.extend(features_to_conv)
    for feature in features_to_conv:
        if feature in nums:
            nums.remove(feature)
    print(f'Categorical variables:  {cats}')
    print(f'Numerical variables:  {nums}')
    
    df = df.astype({i: 'object' for i in cats})
    df = df.astype({i: 'int64' for i in features_to_conv})
    
    df = pd.concat([df[cats], df[nums]], axis=1)
    return df

In [None]:
df = load_preprocess_df(drop_missing=False)
df.head()

In [None]:
df.dtypes

In [None]:
cats = list(df.select_dtypes(exclude=['float64']).columns)
nums = list(df.select_dtypes(include=['float64']).columns)
print(f'Categorical variables:  {cats}')
print(f'Numerical variables:  {nums}')

In [None]:
# Check for the number of unique values in each column
pd.DataFrame([df.nunique(), df.dtypes], index=['nunique', 'dtype'])

In [None]:
for col in cats:
    print(df[col].value_counts())
    print()

In [None]:
df = df.drop(df[df['gender'] == 'Other'].index).reset_index(drop=True)
df.shape

In [None]:
df.gender.value_counts()

In [None]:
# Nothing useful for predicting stroke, consider joining with `children` category
df.loc[df['work_type'] == 'Never_worked']

In [None]:
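# Assign the result back instead of a chained inplace replace, which newer pandas versions may not apply to `df`
df['work_type'] = df['work_type'].replace('children', 'Never_worked')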

In [None]:
df.work_type.value_counts()

In [None]:
df.shape

# Preprocessing

## Missing BMI values

In [None]:
df_copy = df.copy()

In [None]:
missing_df = df_copy[df_copy.isna().any(axis=1)]
missing_df.head()

In [None]:
missing_df['stroke'].value_counts()

In [None]:
df.stroke.value_counts()

In [None]:
len(missing_df[missing_df.stroke == 1]) / len(df[df.stroke == 1])

- Rows with a missing BMI value account for a whopping **16%** of all stroke cases. Dropping them would discard a sizeable share of the already rare positive class, so the missing values have to be dealt with appropriately.
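
The ratio computed above divides the stroke cases with a missing BMI by all stroke cases. It is easy to confuse with the share of missing-BMI rows that are stroke cases, so here is a small sketch on made-up rows (the `toy` frame is purely hypothetical) that computes both.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 stroke cases, 3 rows with a missing BMI
toy = pd.DataFrame({
    "stroke": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "bmi": [np.nan, 30.1, 28.4, np.nan, 22.0, np.nan, 25.3, 27.8, 31.0, 24.5],
})
missing = toy[toy["bmi"].isna()]

# Share of all stroke cases that have no BMI (the ratio used in this notebook)
share_of_strokes_missing = (missing["stroke"] == 1).sum() / (toy["stroke"] == 1).sum()

# Share of the missing-BMI rows that are stroke cases (a different question)
share_of_missing_with_stroke = (missing["stroke"] == 1).mean()

print(share_of_strokes_missing, share_of_missing_with_stroke)
```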

### Imputing missing BMI values using a regressor

- **NOTE**: The results do not seem to be better than those of the KNN imputer, so this approach is not used later.

In [None]:
# A really fantastic and intelligent way to deal with blanks, 
# from Thomas Konstantin in: 
# https://www.kaggle.com/thomaskonstantin/analyzing-and-modeling-stroke-data
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

DT_bmi_pipe = Pipeline(steps=[
                              ('scale',StandardScaler()),
                              ('lr',DecisionTreeRegressor(random_state=42))
                             ])
X = df_copy[['age','gender','bmi']].copy()
X.gender = X.gender.replace({'Male':0,'Female':1,'Other':-1}).astype(np.uint8)
# print(X.gender.value_counts())

Missing = X[X.bmi.isna()]
X = X[~X.bmi.isna()]
Y = X.pop('bmi')
DT_bmi_pipe.fit(X,Y)
predicted_bmi = pd.Series(DT_bmi_pipe.predict(Missing[['age','gender']]),
                          index=Missing.index)
df_copy.loc[Missing.index,'bmi'] = predicted_bmi

In [None]:
df_copy.loc[missing_df.index].head()

In [None]:
fig = plt.figure(figsize=(16, 6), facecolor='white')
sns.set_style('white')
gs=fig.add_gridspec(1,2)

ax = [None, None]

ax[0]=fig.add_subplot(gs[0,0])
ax[1]=fig.add_subplot(gs[0,1])

sns.kdeplot(data=df, x='bmi', ax=ax[0], color='coral', zorder=2)
sns.kdeplot(data=df_copy, x='bmi', ax=ax[0], color='xkcd:sky blue', zorder=2)
ax[0].set_title('BMI KDE plot - Orig vs Imputed', fontsize=20,fontweight='bold', fontfamily='monospace')

sns.histplot(data=df, x='bmi', ax=ax[1], element='step', color='coral', alpha=0.1)
sns.histplot(data=df_copy, x='bmi', ax=ax[1], element='step', color='xkcd:sky blue', alpha=0.1)
ax[1].set_title('BMI Histogram - Orig vs Imputed', fontsize=20,fontweight='bold', fontfamily='monospace')

for ax_ in ax:
    ax_.legend(['original', 'imputed'])
    ax_.set_ylabel(None)
    ax_.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
    for direction in ['top','right','left']:
        ax_.spines[direction].set_visible(False)

# sns.despine(left=True)
plt.tight_layout()
plt.show()

## Splitting Dataset

In [None]:
df_copy = df.copy()

In [None]:
from sklearn.model_selection import train_test_split

def train_validation_test_split(
    X, y, train_size=0.8, val_size=None, test_size=None,
    stratify=None, random_state=42, shuffle=True
):
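    if not val_size:
        # Split the remainder equally between the validation and test sets
        val_size = (1 - train_size) / 2.
        test_size = val_size
    elif not test_size:
        # Only val_size was given: the test set takes what is left
        test_size = 1 - train_size - val_size
        
    # The three fractions must sum to 1 (within floating-point tolerance)
    assert abs(train_size + val_size + test_size - 1) < 1e-7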
    
    stratify_1 = y if stratify else None
    
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, stratify=stratify_1, 
        random_state=random_state, shuffle=shuffle)
    
    stratify_2 = y_train_val if stratify else None
    
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=val_size/(train_size+val_size), 
        random_state=random_state, stratify=stratify_2, shuffle=shuffle)
    
    return X_train, X_val, X_test, y_train, y_val, y_test

In [None]:
# After spending some time trying to make this function work,
#  I realized that train_test_split already has a `stratify` parameter that serves the same purpose...
# So this function is actually not needed.

from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split(
    X, y, train_size=0.8, val_size=None, test_size=None, 
    random_state=42, shuffle=True
):
    if not val_size:
        # print("[INFO] Validation size and test size are inferred from train_size!")
        val_size = (1 - train_size) / 2.
        test_size = val_size
    elif not test_size:
        # The test set takes whatever is left after the train and validation sets
        test_size = 1 - train_size - val_size
        
    assert abs(train_size + val_size + test_size - 1) < 1e-7
    
    split = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    for train_val_index, test_index in split.split(X, y):
        X_train_val = X.loc[train_val_index]
        X_test = X.loc[test_index]
        y_train_val = y.loc[train_val_index]
        y_test = y.loc[test_index]
    
    # Must reset the index for the DataFrame to locate the proper indices from the splits
    X_train_val.reset_index(drop=True, inplace=True)
    y_train_val.reset_index(drop=True, inplace=True)
    
    split = StratifiedShuffleSplit(n_splits=1, 
                                   test_size=val_size/(train_size+val_size), 
                                   random_state=random_state)
    for train_index, val_index in split.split(X_train_val, y_train_val):
        X_train = X_train_val.loc[train_index]
        X_val = X_train_val.loc[val_index]
        y_train = y_train_val.loc[train_index]
        y_val = y_train_val.loc[val_index]
    
    return X_train, X_val, X_test, y_train, y_val, y_test

In [None]:
from functools import partial

stratified_func = partial(train_validation_test_split, stratify=True)
split_df = pd.DataFrame(columns=['not_stratified', 'stratified'])

for i, split_func in enumerate((train_validation_test_split, stratified_func)):
    column_name = 'not_stratified' if i == 0 else 'stratified'
    
    *_, y_train, y_val, y_test = split_func(X=df_copy.drop(columns='stroke'),
                                                                y=df_copy.stroke,
                                                                train_size=0.70)
    split_df[column_name] = pd.concat([df_copy.stroke.value_counts(), y_train.value_counts(), 
                          y_val.value_counts(), y_test.value_counts()], axis=0)

# value_counts() sorts by frequency, so the majority class (0 = no stroke) comes first
split_df = split_df.reset_index(drop=True)\
           .rename(index={0: "no_stroke_full", 1: "stroke_full", 
                          2: "no_stroke_train", 3: "stroke_train",
                          4: "no_stroke_val", 5: "stroke_val", 
                          6: "no_stroke_test", 7: "stroke_test"})

background_color = "#E3EDF0"

fig = plt.figure(figsize=(8, 6))
ax = sns.heatmap(split_df, annot=True, cmap="Paired", fmt="", linewidths=2, cbar=False, annot_kws={"size":14, "fontfamily":"monospace"})
fig.patch.set_facecolor(background_color)
ax.set_facecolor(background_color) 
plt.title('Dataset Splits | Stratification', fontsize=15, fontweight='bold', fontfamily="monospace")
plt.xticks(fontsize=13); plt.yticks(fontsize=13)
plt.show()

- The stratified split produced very consistent splits because it preserves the proportion of each `stroke` class in every subset.
- Although it might not improve the model performance, this is preferable because each subset is more representative of the class proportions in the full data.
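
To make the effect of stratification concrete, here is a minimal sketch on synthetic labels (5% positives, a made-up ratio rather than the real one) using the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 50 positives out of 1000 samples
y = np.array([1] * 50 + [0] * 950)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratify=y, both subsets keep the 5% positive rate of the full data
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the number of positives in a small validation or test set can vary noticeably from one random seed to another.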

In [None]:
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(X=df_copy.drop(columns='stroke'),
                                                                             y=df_copy.stroke, 
                                                                             stratify=True,
                                                                             train_size=0.70)

# Visualizations for preprocessed numerical features

## Original distribution

In [None]:
plt.figure(figsize=(12,6))
sns.set_style('white')

sns.boxplot(x="variable", y="value", data=pd.melt(X_train[nums]), palette="cubehelix")

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Original Distribution of Numerical Features', fontsize=20, fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

- `avg_glucose_level` and `bmi` are right-skewed and have a lot of outliers.

## Standard Scaler

In [None]:
from sklearn.preprocessing import StandardScaler
std_sc = StandardScaler()
X_train_std = X_train.copy()
X_train_std[nums] = std_sc.fit_transform(X_train_std[nums])

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_std[nums]), palette="cubehelix")

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Standard Scaler', fontsize=20, fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

- Standard scaler is significantly affected by outliers: the extreme values inflate the mean and standard deviation that it uses for scaling.

## Robust Scaler

In [None]:
from sklearn.preprocessing import RobustScaler
rs = RobustScaler()
X_train_rs = X_train.copy()
X_train_rs[nums] = rs.fit_transform(X_train_rs[nums])
# X_val_rs[nums] = rs.transform(X_val_rs[nums])
# X_test_rs[nums] = rs.transform(X_test_rs[nums])

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_rs[nums]), palette="cubehelix")

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Robust Scaler', fontsize=20, fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

- Robust scaler brings the features to about the same scale while still retaining their distributions, which is much better than Standard Scaler.
- Robust scaler is commonly used to combat outliers: by default it centers on the median and scales by the interquartile range, statistics that are barely influenced by outliers.
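
The difference between the two scalers is easiest to see on a tiny example. This is only a sketch on made-up values with a single extreme point, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values and one outlier (made-up numbers)
x = np.array([[20.0], [22.0], [24.0], [26.0], [28.0], [90.0]])

std_scaled = StandardScaler().fit_transform(x).ravel()
rob_scaled = RobustScaler().fit_transform(x).ravel()

# Spread of the five ordinary values after each scaling
std_spread = std_scaled[:5].max() - std_scaled[:5].min()
rob_spread = rob_scaled[:5].max() - rob_scaled[:5].min()
print(std_spread, rob_spread)
```

The outlier inflates the standard deviation, so `StandardScaler` squeezes the ordinary values into a narrow band, whereas `RobustScaler` keeps them spread out and simply leaves the outlier far away.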

## Quantile Transformer

In [None]:
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(n_quantiles=10, random_state=42)
X_train_qt = X_train.copy()
X_train_qt[nums] = qt.fit_transform(X_train_qt[nums])

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_qt[nums]), palette='cubehelix')

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Quantile Transformer', fontsize=20, fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

- Quantile transformer removed all the outliers, but it also changed the original shape of the distributions (its default output is uniform on [0, 1]), <br>
which could result in loss of original information and of correlation with other features, particularly the target feature.
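
A small sketch of what the quantile transformer does to a skewed feature. The log-normal sample is synthetic and only stands in for a right-skewed column such as `avg_glucose_level`; `n_quantiles=10` mirrors the setting used above.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed feature (assumption: log-normal, not the real data)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

qt = QuantileTransformer(n_quantiles=10, random_state=42)
x_qt = qt.fit_transform(x)

# The raw maximum is far from the median, the transformed one is capped at 1
print(x.max() / np.median(x), x_qt.min(), x_qt.max())
```

Every value is replaced by (an interpolation of) its rank, which is why the outliers disappear and also why the original distances between values are lost.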

## Power Transformer

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
X_train_pt = X_train.copy()
X_train_pt[nums] = pt.fit_transform(X_train_pt[nums])

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_pt[nums]), palette='cubehelix')

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Power Transformer', fontsize=20, fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

- Power Transformer generated more Gaussian-like distributions as expected, as implied by the almost identical lengths of the two whiskers of each box plot.
- Some outliers are still found in `bmi`, which also indicates that the original distribution of `bmi` was very right-skewed.
- This is still not ideal because the ranges of the features are still not exactly the same as one another.
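
To check the "more Gaussian-like" claim numerically rather than by eye, the skewness before and after the transform can be compared. This is a sketch on a synthetic log-normal sample (an assumption standing in for a right-skewed feature), using the default Yeo-Johnson method of `PowerTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (not the real `bmi` column)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=0.6, size=(2000, 1))

pt = PowerTransformer()  # Yeo-Johnson by default, output is standardized
x_pt = pt.fit_transform(x)

skew_before = pd.Series(x.ravel()).skew()
skew_after = pd.Series(x_pt.ravel()).skew()
print(skew_before, skew_after)
```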

## Log Transformation

In [None]:
df_nums_log = np.log(df_copy[nums])
df_nums_log.head()

In [None]:
plt.figure(figsize=(12,6))
sns.set_style('white')
sns.boxplot(x="variable", y="value", data=pd.melt(df_nums_log), palette='cubehelix')

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Log Transformation', fontsize=20, fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

- Log transformation resulted in very different scales, which is not good for training.

# Training

**NOTE**

- The most important metrics to look out for are precision, recall and especially the F1 score (with its variations, e.g. macro-averaged F1).
- The F1 score will be the primary metric to monitor here due to the highly imbalanced `stroke` class (i.e. our `target` variable).
- After some quick research, I decided not to use the `ROC AUC` score here because it can look overly optimistic on heavily imbalanced datasets, <br>
whereas the F1 score focuses on the minority class and works better for evaluating the model in this setting.
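
To see why a metric focused on the `Stroke` class is needed, consider a "model" that always predicts `No Stroke`. The sketch below uses made-up labels with 5% positives; accuracy is shown only as a contrast to the F1 score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 5 stroke cases out of 100
y_true = np.array([1] * 5 + [0] * 95)
y_pred = np.zeros_like(y_true)  # always predict "No Stroke"

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # F1 of the positive class
print(acc, f1)  # 0.95 0.0
```

The accuracy looks excellent even though not a single stroke case is found, while the F1 score of the positive class exposes the problem immediately.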

## Evaluation function

In [None]:
from sklearn.metrics import make_scorer, classification_report, confusion_matrix

target_names = ['No Stroke', 'Stroke']  # [0, 1]

def eval_model_on_train_val(model, transformed=False, 
                            return_pred=False, show_results=True):
    # global X_train, y_train, X_val, y_val, X_train_tf, X_val_tf
    
    if transformed:
        # model requires transformed dataset
        X_train_ = X_train_tf.copy()
        X_val_ = X_val_tf.copy()
    else:
        X_train_ = X_train.copy()
        X_val_ = X_val.copy()
    
    # display(pd.DataFrame(X_train_).head())
    y_pred = model.predict(X_train_)
    if show_results:
        print('[INFO] Evaluating on training set ...')
        print(confusion_matrix(y_train, y_pred))
        print(classification_report(y_train, y_pred, target_names=target_names))
    
    y_pred = model.predict(X_val_)
    if show_results:
        print('\n[INFO] Evaluating on validation set ...')
        print(confusion_matrix(y_val, y_pred))
        print(classification_report(y_val, y_pred, target_names=target_names))
    
    if return_pred:
        return y_pred

## Training Pipelines

In [None]:
df_copy = df.copy()

In [None]:
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(X=df_copy.drop(columns='stroke'),
                                                                             y=df_copy.stroke, 
                                                                             stratify=True,
                                                                             train_size=0.70)

In [None]:
cats = list(df.select_dtypes(exclude=['float64']).columns)
nums = list(df.select_dtypes(include=['float64']).columns)
print(f'Categorical variables:  {cats}')
print(f'Numerical variables:  {nums}')

In [None]:
categorical_cols = cats.copy()
if 'stroke' in categorical_cols:
    categorical_cols.remove('stroke')
print(categorical_cols)

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, LabelEncoder, OrdinalEncoder
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
import xgboost as xgb

cat_pipeline = Pipeline([
    # ("one_hot_encoder", OneHotEncoder(drop='first')),
    ("ordinal_encoder", OrdinalEncoder()),
    # ('std_scaler', StandardScaler(with_mean=True)),
    ('rob_scaler', RobustScaler(with_centering=False)),
])

bmi_pipeline = Pipeline([
    # ('std_scaler_bmi', StandardScaler(with_mean=True)),
    ('rob_scaler_bmi', RobustScaler(with_centering=False)),
    # add_indicator is very useful to add extra columns denoting missing values
    ('KNN_imputer', KNNImputer(n_neighbors=3, add_indicator=True))
])

pipe_1 = ColumnTransformer([
    ('cat', cat_pipeline, categorical_cols),
    # ('std_scaler_nums', StandardScaler(with_mean=True), ['avg_glucose_level', 'age']),
    ('rob_scaler_nums', RobustScaler(with_centering=False), ['avg_glucose_level', 'age']),
    # ('rob_scaler_glucose', RobustScaler(with_centering=False), ['avg_glucose_level']),
    # ('std_scaler_age', StandardScaler(with_mean=False), ['age']),
    ('bmi_pipeline', bmi_pipeline, ['bmi'])
], remainder="passthrough")

train_pipe_1 = Pipeline([
    ('feature_transform', pipe_1),
    
    ('select_kbest', SelectKBest(k=10)),
    # ('pca_2', PCA(n_components=2)),
    
    ('rfc', RandomForestClassifier(random_state=42))
])

In [None]:
# prepare transformed datasets for certain use cases
X_train_tf = pipe_1.fit_transform(X_train)
X_val_tf = pipe_1.transform(X_val)
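
The `bmi_pipeline` above relies on `KNNImputer` with `add_indicator=True`. As a minimal sketch of what that option produces, here is the imputer on a made-up two-column array (think of the columns as `age` and `bmi`; the numbers are hypothetical and `n_neighbors=2` is chosen only to keep the example readable).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: age-like feature, bmi-like feature with one missing value
X = np.array([
    [1.0, 20.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
])

imputer = KNNImputer(n_neighbors=2, add_indicator=True)
X_imputed = imputer.fit_transform(X)

# Two original columns plus one indicator column for the feature that had NaNs
print(X_imputed)
```

The missing value is filled with the mean of its two nearest rows, and the extra column keeps a record of which rows were imputed, so the model can still use "BMI was missing" as a signal.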

**Some extra pipelines for testing if necessary**

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, OrdinalEncoder
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

cat_pipeline = Pipeline([
    ("one_hot_encoder", OneHotEncoder(drop='first')),
    ('cat_scaler', RobustScaler(with_centering=False)),
])

cat_no_scale_pipeline = Pipeline([
    ("one_hot_encoder", OneHotEncoder(drop='first')),
])

pipe_2 = ColumnTransformer([
    ('num_std_scaler', RobustScaler(with_centering=False), nums),
    ('cat', cat_pipeline, categorical_cols),
], remainder="passthrough")

pipe_3 = ColumnTransformer([
    ('num_std_scaler', RobustScaler(with_centering=False), nums),
    ('cat', cat_no_scale_pipeline, categorical_cols),
], remainder="passthrough")

train_pipe_2 = Pipeline([
    ('pipe_2', pipe_2),
    ('KNN_imputer', KNNImputer(n_neighbors=3, add_indicator=True)),
    ('xgb', xgb.XGBClassifier(random_state=42)),
])

train_pipe_3 = Pipeline([
    ('pipe_3', pipe_3),
    ('KNN_imputer', KNNImputer(n_neighbors=3, add_indicator=True)),
    ('xgb', xgb.XGBClassifier(random_state=42)),
])

train_pipe_4 = Pipeline([
    ('pipe_1', pipe_1),
    ('log_reg', LogisticRegression(random_state=42)),
])

**First trial of training**

In [None]:
train_pipe_1.fit(X_train, y_train)

In [None]:
eval_model_on_train_val(train_pipe_1)

## Choosing a model by using `lazypredict` library

In [None]:
X_train_tf = pipe_1.fit_transform(X_train)
X_val_tf = pipe_1.transform(X_val)

In [None]:
from lazypredict.Supervised import LazyClassifier
from sklearn.metrics import f1_score
from functools import partial

custom_metric = partial(f1_score, average='binary')
custom_metric.__name__ = 'f1_binary'
clf1 = LazyClassifier(verbose=0, ignore_warnings=True, random_state=42, custom_metric=custom_metric)
clf2 = LazyClassifier(verbose=0, ignore_warnings=True, random_state=42, custom_metric=custom_metric)

models_train, predictions_train = clf1.fit(X_train_tf, X_train_tf, y_train, y_train)
models_val, predictions_val = clf2.fit(X_train_tf, X_val_tf, y_train, y_val)

models_train

In [None]:
models_dict = clf2.provide_models(X_train_tf, X_val_tf, y_train, y_val)

In [None]:
models_dict.keys()

In [None]:
eval_model_on_train_val(models_dict['AdaBoostClassifier'], transformed=True)

In [None]:
models_train.columns

In [None]:
metric = "f1_binary"
models_train.sort_values(by=metric, ascending=False, inplace=True)

fig = plt.figure(figsize=(6, 11))
sns.set_theme(style="whitegrid")

ax = sns.barplot(y=models_train.index, x=metric, data=models_train)


for p in ax.patches:
    value = f'{(p.get_width() * 100):.2f}%'
    x = p.get_x() + p.get_width() + 0.01
    y = p.get_y() + p.get_height()/2 + 0.15
    ax.annotate(value, (x, y))

plt.title('Training | F1 Score for Stroke Prediction', 
          fontdict=dict(fontsize=16,
                        fontweight='bold', 
                        fontfamily='monospace'))
plt.ylabel(None)
plt.xlabel('F1 Score', fontsize=13, fontweight='bold')
sns.despine()
plt.grid(axis='x')
plt.yticks(fontsize=13)
# plt.savefig('training_result.png', bbox_inches='tight')
plt.show()

In [None]:
metric = "f1_binary"
models_val.sort_values(by=metric, ascending=False, inplace=True)

plt.figure(figsize=(6, 10))
sns.set_theme(style="whitegrid")
ax = sns.barplot(y=models_val.index, x=metric, data=models_val)
plt.title('Validation | F1 Score for Stroke Prediction', 
          fontdict=dict(fontsize=15,
                        fontweight='bold', 
                        fontfamily='monospace'
                       ))
plt.ylabel(None)
plt.xlabel('F1 Score', fontsize=13, fontweight='bold')
sns.despine()

for p in ax.patches:
    value = f'{(p.get_width() * 100):.2f}%'
    x = p.get_x() + p.get_width() + 0.002
    y = p.get_y() + p.get_height()/2 + 0.2
    ax.annotate(value, (x, y))

plt.yticks(fontsize=13)
# plt.savefig('validation_result.png', bbox_inches='tight')
plt.show()

In [None]:
background_color = "#f7fdff"
fig = plt.figure(figsize=(15, 6), facecolor=background_color)
sns.set_style('whitegrid',  {"axes.facecolor": background_color})
sns.lineplot(x=models_train.index, y=metric, data=models_train, color='xkcd:sky blue')
sns.lineplot(x=models_val.index, y=metric, data=models_val, color='coral')
sns.despine()
plt.xticks(rotation=90)

plt.title('F1 Scores for Stroke Prediction', 
          fontdict=dict(fontsize=15,
                        fontweight='bold', 
                        fontfamily='monospace'
                       ))
plt.xlabel(None)
plt.ylabel('F1 score - Stroke')
plt.legend(['Training', 'Validation'])
plt.show()

In [None]:
# Parentheses are required: without them only models_val would be divided by 2
avg_model = (models_train + models_val) / 2

metric = "f1_binary"
avg_model.sort_values(by=metric, ascending=False, inplace=True)

plt.figure(figsize=(8, 10))
sns.set_theme(style="whitegrid")
ax = sns.barplot(y=avg_model.index, x=metric, data=avg_model)
plt.title('Training & Validation | Average F1 Scores\nStroke Prediction', 
          fontdict=dict(fontsize=15,
                        fontweight='bold', 
                        fontfamily='monospace'
                       ))
plt.ylabel(None)
plt.xlabel('Average F1 Score', fontsize=13, fontweight='bold')
sns.despine()

for p in ax.patches:
    value = f'{(p.get_width() * 100):.2f}%'
    x = p.get_x() + p.get_width() + 0.002
    y = p.get_y() + p.get_height()/2 + 0.2
    ax.annotate(value, (x, y))

plt.show()


- It seems that `ExtraTreeClassifier` works the best when averaging the scores on the training and validation sets. <br>
- `DecisionTreeClassifier` and `RandomForestClassifier` also performed quite well in comparison. <br>
These are some of the most popular models, so they will be used as the primary models for further exploration.

## Testing with a shorter pipeline on pre-encoded features

### One Hot Encoding

In [None]:
df_copy = df.copy()

In [None]:
one_hot_cats = pd.get_dummies(df_copy.loc[:, 'gender':'smoking_status'], drop_first=True)

In [None]:
df_copy = df_copy.loc[:, 'hypertension': 'stroke']
df_copy = pd.concat([one_hot_cats, df_copy, df[nums]], axis=1)
df_copy.head()

### Ordinal Encoding

In [None]:
df_copy = df.copy()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
df_copy[categorical_cols] = ord_enc.fit_transform(df_copy[categorical_cols])

In [None]:
df_copy.head()

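- Ordinal encoding produces the same integer codes as label encoding; the difference is that `OrdinalEncoder` is meant for (2-D) feature columns, while `LabelEncoder` is meant for the (1-D) target.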
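
A quick sketch comparing the two encoders on a made-up column (the values mimic `Residence_type`, but the array itself is hypothetical).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

values = np.array(["Urban", "Rural", "Urban", "Rural", "Urban"])

# LabelEncoder: 1-D input, 1-D integer output (intended for the target)
label_codes = LabelEncoder().fit_transform(values)

# OrdinalEncoder: 2-D input, 2-D float output (intended for feature columns)
ordinal_codes = OrdinalEncoder().fit_transform(values.reshape(-1, 1))

print(label_codes, ordinal_codes.ravel())
```

Both assign codes in sorted category order, so the numbers agree; only the expected input shape and the output dtype differ.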

### Train & Fine-tuning

- Using Ordinal Encoding, as it produced the best results after testing.

In [None]:
df_copy = df.copy()

In [None]:
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(X=df_copy.drop(columns='stroke'), 
                                                                             y=df_copy.stroke, 
                                                                             stratify=True,
                                                                             train_size=0.70)

In [None]:
X_train.head()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
X_train[categorical_cols] = ord_enc.fit_transform(X_train[categorical_cols])
X_val[categorical_cols] = ord_enc.transform(X_val[categorical_cols])
X_test[categorical_cols] = ord_enc.transform(X_test[categorical_cols])

In [None]:
from sklearn.tree import ExtraTreeClassifier

etc_pipeline = Pipeline([('scaler', RobustScaler(with_centering=True)),
                         ('imputer', KNNImputer(n_neighbors=6, add_indicator=True)),
                         ('select_kbest', SelectKBest(k=10)),
                         ('etc_clf', ExtraTreeClassifier(random_state=42))])

In [None]:
etc_pipeline.fit(X_train, y_train)

In [None]:
etc_pred = eval_model_on_train_val(etc_pipeline, return_pred=True)

 - In this case, `Ordinal Encoding` performed better than `One Hot Encoding` when both were tested.
 - The next step is some hyperparameter tuning.

In [None]:
etc_cm = confusion_matrix(y_val, etc_pred)

In [None]:
import matplotlib as mpl

background_color = "#E3EDF0"


colors = ["lightblue", "xkcd:sky blue", "xkcd:sky blue", "xkcd:sky blue"]
colormap = mpl.colors.LinearSegmentedColormap.from_list("",  colors)

fig = plt.figure(figsize=(6, 3))
ax = sns.heatmap(etc_cm, cmap=colormap, annot=True, fmt="d", linewidths=5, cbar=False,
            yticklabels=['Actual Non-Stroke','Actual Stroke'],
            xticklabels=['Predicted Non-Stroke','Predicted Stroke'],
            annot_kws={"fontsize": 13, "fontfamily": 'monospace'})

plt.title('ExtraTreeClassifier | Confusion Matrix', size=15, fontfamily='serif')
plt.yticks(size=13, fontfamily='serif')
plt.xticks(size=13, fontfamily='serif')
fig.patch.set_facecolor(background_color)
# ax.set_facecolor(background_color)
plt.show()

In [None]:
# etc_pipeline.get_params()

#### ExtraTreeClassifier

In [None]:
# from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# param_grid = {
#     'select_kbest__k': [6, 7, 8, 9, 10],
#     'imputer__n_neighbors': [2, 4, 6, 8],
#     'etc_clf__max_depth': [35, 40, 45],
#     'etc_clf__max_features': [2, 4, 6],
#     'etc_clf__min_samples_leaf': [2, 4, 6],
#     'etc_clf__min_samples_split': [2, 4, 6]
# }

# grid = GridSearchCV(estimator=etc_pipeline, 
#                      param_grid=param_grid, cv=3,
#                      scoring='f1', refit=True,
#                      verbose=3, n_jobs=4)

# grid.fit(X_train, y_train)

# print(f'\nBest params -> {grid.best_params_}')
# print(f'Best score -> {grid.best_score_}')
# print(f'Validation score -> {grid.score(X_val, y_val)}')

In [None]:
# Best params -> {'etc_clf__max_depth': 35, 'etc_clf__max_features': 6, 'etc_clf__min_samples_leaf': 2, 'etc_clf__min_samples_split': 6, 'imputer__n_neighbors': 8, 'select_kbest__k': 9}
# Best score -> 0.1898148148148148
# Validation score -> 0.20338983050847456

In [None]:
# etc_pred = eval_model_on_train_val(grid, return_pred=True)

In [None]:
# [INFO] Evaluating on training set ...
# [[3369   31]
#  [ 103   72]]
#               precision    recall  f1-score   support

#    No Stroke       0.97      0.99      0.98      3400
#       Stroke       0.70      0.41      0.52       175

#     accuracy                           0.96      3575
#    macro avg       0.83      0.70      0.75      3575
# weighted avg       0.96      0.96      0.96      3575


# [INFO] Evaluating on validation set ...
# [[714  16]
#  [ 31   6]]
#               precision    recall  f1-score   support

#    No Stroke       0.96      0.98      0.97       730
#       Stroke       0.27      0.16      0.20        37

#     accuracy                           0.94       767
#    macro avg       0.62      0.57      0.59       767
# weighted avg       0.93      0.94      0.93       767

- Fine-tuning did not improve the F1 score on `Stroke`: it dropped to 0.20 on the validation set, compared with 0.27 before tuning.
- Fine-tuning also made the training scores much worse than before.
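
To make the 0.20 concrete, the `Stroke` F1 score can be recomputed by hand from the validation confusion matrix printed above. This is a minimal plain-Python sketch; the helper name `f1_from_counts` is made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1

# Validation confusion matrix from above (rows = actual, columns = predicted):
# [[714  16]
#  [ 31   6]]
tn, fp, fn, tp = 714, 16, 31, 6

precision, recall, f1 = f1_from_counts(tp, fp, fn)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
# -> precision=0.27 recall=0.16 f1=0.20
```

Only 6 of the 37 actual stroke cases are caught, which is what keeps the F1 score low even though overall accuracy is 0.94.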

### Final Evaluation on Test Set

In [None]:
# y_pred = etc_pipeline.predict(X_test)
# print(confusion_matrix(y_test, y_pred))
# print(classification_report(y_test, y_pred, target_names=target_names))

- On the held-out test set, which the model had never seen before, it performed very poorly, with an F1 score below 0.10.
- This is likely due to the class imbalance of the dataset itself, or to the available features being insufficient to train a good predictive model.

## Interpreting the model with SHAP

SHAP (SHapley Additive exPlanations) attributes each individual prediction to the features that produced it, which makes a model more intuitive to explain and interpret than traditional means like feature importance.
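
The idea behind Shapley values can be sketched on a toy example. The code below is only an illustration: `toy_risk` is a made-up scoring function, and the brute-force averaging over all feature orderings is not how `shap.TreeExplainer` computes its values internally.

```python
from itertools import permutations

def shapley_values(value_fn, features):
    """Exact Shapley values: average each feature's marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        known = set()
        for f in order:
            before = value_fn(frozenset(known))
            known.add(f)
            after = value_fn(frozenset(known))
            totals[f] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}

def toy_risk(known):
    """Hypothetical risk score given the set of features that are 'known'."""
    score = 0.05                      # baseline risk
    if 'age' in known:
        score += 0.20
    if 'avg_glucose_level' in known:
        score += 0.05
    if 'age' in known and 'bmi' in known:
        score += 0.04                 # interaction shared between age and bmi
    return score

features = ['age', 'avg_glucose_level', 'bmi']
phi = shapley_values(toy_risk, features)
print(phi)
```

The contributions add up to the difference between the full score and the baseline, which is the "additive" part of the name.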

In [None]:
scaler = RobustScaler()
X_train_ = scaler.fit_transform(X_train.copy())
X_val_ = scaler.transform(X_val.copy())
X_test_ = scaler.transform(X_test.copy())


imputer = KNNImputer(n_neighbors=6, add_indicator=True)
X_train_ = imputer.fit_transform(X_train_)
X_val_ = imputer.transform(X_val_)
X_test_ = imputer.transform(X_test_)

X_train_[:, :-1] = scaler.inverse_transform(X_train_[:, :-1])
X_val_[:, :-1] = scaler.inverse_transform(X_val_[:, :-1])
X_test_[:, :-1] = scaler.inverse_transform(X_test_[:, :-1])

In [None]:
# X_train_ = X_train.copy()
# X_val_ = X_val.copy()
# X_train_['bmi_NaN'] = X_train_tf[:, -1]
# X_val_['bmi_NaN'] = X_val_tf[:, -1]

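- Typical usage of `KNNImputer`: scale -> impute -> inverse scale.
- Scaling comes first because `KNNImputer` picks neighbours by distance, so unscaled features with large ranges would otherwise dominate.
- The inverse scaling is performed afterwards (on every column except the added `bmi_NaN` indicator) in order to restore the original values for interpreting later.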

In [None]:
X_train_ = pd.DataFrame(X_train_, columns=list(X_train.columns) + ['bmi_NaN'])
X_val_ = pd.DataFrame(X_val_, columns=list(X_train.columns) + ['bmi_NaN'])
X_test_ = pd.DataFrame(X_test_, columns=list(X_train.columns) + ['bmi_NaN'])
X_train_.head()

In [None]:
# Have to train a new classifier because SHAP does not support scikit-learn's pipeline
# Using RandomForest here because it produced much better SHAP plot than ExtraTreeClassifier for some reason...
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_, y_train)

In [None]:
y_pred = clf.predict(X_val_)
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=target_names))

- F1 score for stroke is still around 10-15%

In [None]:
import shap

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_val_)

In [None]:
sns.set_style('white')

In [None]:
shap.summary_plot(shap_values[1], X_val_, alpha=0.5)

- A SHAP summary plot is similar to a feature-importance chart, but it gives more insight by showing how strongly, and in which direction, each feature pushes the model output for every individual prediction.
- In this graph, the colors represent the values of the features, from low (blue) to high (red).
- In this case, all the numerical features have a significant impact on the model output, particularly the `age` feature, <br>
which has the highest impact on predicting `stroke`.
- `bmi_NaN` (just an indicator for missing BMI values) also shows a very high impact on predicting `stroke`.

In [None]:
shap.dependence_plot('age', shap_values[1], X_val_, interaction_index="age", 
                     alpha=0.5, show=False)
plt.title("Age dependence plot", 
          fontfamily='monospace', fontweight='bold', fontsize=16)
plt.show()

- From the dependence plot, we can see that the SHAP values for `age` increase as `age` increases.
- In other words, the higher the `age`, the more the model leans towards predicting `stroke`.

In [None]:
shap.dependence_plot('avg_glucose_level', shap_values[1], X_val_, 
                     interaction_index="avg_glucose_level", alpha=0.5, show=False)
plt.title("avg_glucose_level dependence plot", 
          fontfamily='monospace', fontweight='bold', fontsize=16)
plt.show()

In [None]:
shap.dependence_plot('bmi', shap_values[1], X_val_, 
                     interaction_index="bmi", alpha=0.5, show=False)
plt.title("BMI dependence plot", 
          fontfamily='monospace', fontweight='bold', fontsize=16)
plt.show()

- The dependence plots of `avg_glucose_level` and `bmi` each show a threshold above which `stroke` predictions become more prevalent.
- The threshold is at around 150 for `avg_glucose_level`, while the threshold for `bmi` can be seen more clearly at close to 30.
- **NOTE**: The numerical features here have not been scaled with any sort of algorithm; after scaling, the SHAP values would likely be different.
- Leaving them unscaled allows us to see the impact of the actual values themselves in the dependence plots.

# CONCLUSION
- More data needs to be collected, especially from stroke sufferers, in order to build a more robust model for predicting whether or not a person has suffered a stroke.
- The data collected also needs to be more balanced and should contain more features if possible.

# APPENDIX

- These are the other methods I tried that did not bring much improvement.

## Hyperparameter Tuning on XGBoost

In [None]:
# To check the trained etc parameters
for k, v in etc_pipeline.get_params().items():
    if str(k).startswith('etc_clf__'):
        print(k, v)

In [None]:
etc_pipeline.get_params().keys()

In [None]:
# from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

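# # NOTE: these are XGBoost hyperparameters, and the results reported below use the
# # `xgb_clf__` prefix, so the prefix here must match the name of the final pipeline step.
# param_grid = {
#     "imputer__n_neighbors": ([3, 4, 5, 6]),
#     "select_kbest__k": ([6, 8, 10]),
#     'etc_clf__min_child_weight': ([1, 5, 10]),
#     'etc_clf__gamma': ([0, 0.5, 1, 1.5, 2, 5]),
#     'etc_clf__subsample': ([0.6, 0.8, 1.0]),
#     'etc_clf__colsample_bytree': ([0.6, 0.8, 1.0]),  # must lie in (0, 1], so 1.2 is dropped
#     'etc_clf__max_depth': ([3, 4, 5, 6, 7, 8])
# }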

# rs_cv = RandomizedSearchCV(estimator=etc_pipeline, n_iter=100, 
#                            param_distributions=param_grid, cv=3,
#                            scoring='f1_macro', refit=True,
#                            verbose=3, n_jobs=4, random_state=42)

# rs_cv.fit(X_train, y_train)

# print(f'\nBest params -> {rs_cv.best_params_}')
# print(f'Best score -> {rs_cv.best_score_}')
# print(f'Validation score -> {rs_cv.score(X_val, y_val)}')

Best params -> {'xgb_clf__subsample': 0.6, 'xgb_clf__min_child_weight': 1, 'xgb_clf__max_depth': 3, 'xgb_clf__gamma': 0, 'xgb_clf__colsample_bytree': 1.0, 'select_kbest__k': 10, 'imputer__n_neighbors': 5} <br>
Best score -> 0.5599306011204763 <br>
Validation score -> 0.5379518072289157

- The validation score (macro-averaged F1) is not much better than before tuning.
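
Since the searches in this appendix are scored with `f1_macro`, it helps to see how that number relates to the per-class scores. The sketch below recomputes it by hand from the ExtraTrees validation confusion matrix reported earlier in the notebook; it is plain arithmetic and the helper `f1` is only illustrative.

```python
def f1(tp, fp, fn):
    """F1 score for one class from its true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Rows = actual class, columns = predicted class (No Stroke, Stroke)
cm = [[714, 16],
      [31, 6]]

# Treat each class in turn as the "positive" class
f1_no_stroke = f1(tp=cm[0][0], fp=cm[1][0], fn=cm[0][1])
f1_stroke = f1(tp=cm[1][1], fp=cm[0][1], fn=cm[1][0])

# Macro averaging is the unweighted mean of the per-class scores
f1_macro = (f1_no_stroke + f1_stroke) / 2
print(f'No Stroke={f1_no_stroke:.2f} Stroke={f1_stroke:.2f} macro={f1_macro:.2f}')
```

Because the majority class scores close to 1, a macro-averaged F1 near 0.6 can still hide a `Stroke` F1 of about 0.2.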

In [None]:
# rs_cv_df = pd.DataFrame(rs_cv.cv_results_)
# rs_cv_df.sort_values('rank_test_score', inplace=True)
# rs_cv_df.head()

In [None]:
# eval_model_on_train_val(rs_cv)

In [None]:
# from sklearn.model_selection import GridSearchCV

# param_grid = {
#     "imputer__n_neighbors": range(1, 5),
#     "select_kbest__k": range(3, 9),
#     "etc_clf__max_depth": [4],
#     "etc_clf__learning_rate": [0.2, 0.3, 0.4],
#     "etc_clf__gamma": [1, 2, 3],
#     "etc_clf__reg_lambda": [9, 10, 11, 12],
#     "etc_clf__scale_pos_weight": [5],
#     "etc_clf__subsample": [0.8],
#     "etc_clf__colsample_bytree": [0.5],
# }

# grid = GridSearchCV(estimator=etc_pipeline, 
#                     param_grid=param_grid, cv=3,
#                     scoring='f1_macro', refit=True,
#                     verbose=1, n_jobs=-1)

# grid.fit(X_train, y_train)

# print(f'\nBest params -> {grid.best_params_}')
# print(f'Best score -> {grid.best_score_}')
# print(f'Validation score -> {grid.score(X_val, y_val)}')

Best params -> {'imputer__n_neighbors': 1, 'select_kbest__k': 3, 'xgb_clf__colsample_bytree': 0.5, 'xgb_clf__gamma': 1, 'xgb_clf__learning_rate': 0.4, 'xgb_clf__max_depth': 4, 'xgb_clf__reg_lambda': 9, 'xgb_clf__scale_pos_weight': 5, 'xgb_clf__subsample': 0.8} <br>
Best score -> 0.5945212854754867 <br>
Validation score -> 0.5837303262377134

In [None]:
# grid_df = pd.DataFrame(grid.cv_results_)
# grid_df.sort_values('rank_test_score', inplace=True)
# grid_df.head()

In [None]:
# eval_model_on_train_val(grid)

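- The training score dropped significantly after tuning. I am not sure of the exact reason, though the shallower trees and stronger regularization in the tuned parameters may simply be reducing overfitting.
- The macro-averaged validation F1 score is still not much better than before tuning.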

## Removal of outliers via IQR

In [None]:
def tukey_outliers(x):
    # `bmi` contains missing values, so use the NaN-aware percentile
    q1 = np.nanpercentile(x, 25)
    q3 = np.nanpercentile(x, 75)
    
    iqr = q3 - q1
    
    lower_boundary = q1 - (iqr * 1.5)
    upper_boundary = q3 + (iqr * 1.5)
    
    outliers = x[(x < lower_boundary) | (x > upper_boundary)]
    return outliers, lower_boundary, upper_boundary

outliers_bmi, lower_boundary_bmi, upper_boundary_bmi = tukey_outliers(X_train["bmi"])
outliers_glucose, lower_boundary_glucose, upper_boundary_glucose = tukey_outliers(X_train["avg_glucose_level"])
len(outliers_bmi), len(outliers_glucose)
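
One thing to watch with Tukey fences on a column such as `bmi` is missing values: `np.percentile` propagates NaN, while `np.nanpercentile` ignores it. The toy array below uses made-up numbers purely to show the arithmetic.

```python
import numpy as np

# Toy column (made-up values) with one extreme value and one missing entry
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 100.0, np.nan])

# The plain percentile becomes NaN as soon as the input contains NaN
naive_q1 = np.percentile(x, 25)

# NaN-aware quartiles ignore the missing entry
q1, q3 = np.nanpercentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Comparisons against NaN are False, so the missing entry is not flagged
outliers = x[(x < lower) | (x > upper)]
print(naive_q1, lower, upper, outliers)
```

Here the quartiles are 12.5 and 17.5, so the fences are 5 and 25 and only the value 100 falls outside them.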

In [None]:
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(X=df_copy.drop(columns='stroke'), y=df_copy.stroke, train_size=0.70)

In [None]:
X_train_iqr = X_train.copy()
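# Assign the clipped columns back instead of relying on an in-place update of a column selection
X_train_iqr['bmi'] = X_train_iqr['bmi'].clip(lower_boundary_bmi, upper_boundary_bmi)
X_train_iqr['avg_glucose_level'] = X_train_iqr['avg_glucose_level'].clip(lower_boundary_glucose, upper_boundary_glucose)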

In [None]:
X_train_iqr.head()

In [None]:
X_train_rs = X_train.copy()
X_train_rs.loc[:, 'age': 'bmi'] = rs.fit_transform(X_train_rs.loc[:, 'age': 'bmi'])

plt.figure(figsize=(12,6))
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_rs[nums]), palette="cubehelix")

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Robust Scaler | Before Outlier Removal', fontsize=20,fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

In [None]:
X_train_iqr_rs = X_train_iqr.copy()
X_train_iqr_rs.loc[:, 'age': 'bmi'] = rs.fit_transform(X_train_iqr_rs.loc[:, 'age': 'bmi'])

plt.figure(figsize=(12,6))
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_iqr_rs[nums]), palette="cubehelix")

sns.despine(left=True, bottom=True)
plt.grid(which='both', axis='y', zorder=0, color='black', linestyle=':', dashes=(2,7), alpha=0.3)
plt.title('Robust Scaler | After Outlier Removal', fontsize=20,fontweight='bold', fontfamily='monospace')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ord_encoder = OrdinalEncoder()
X_val_iqr = X_val.copy()
X_train_iqr.loc[:, 'gender': 'smoking_status'] = ord_encoder.fit_transform(X_train_iqr.loc[:, 'gender': 'smoking_status'])
X_val_iqr.loc[:, 'gender': 'smoking_status'] = ord_encoder.transform(X_val_iqr.loc[:, 'gender': 'smoking_status'])

In [None]:
etc_pipeline.fit(X_train_iqr, y_train)

In [None]:
y_pred = etc_pipeline.predict(X_train_iqr)
print('[INFO] Evaluating on training set ...')
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred, target_names=target_names))

y_pred = etc_pipeline.predict(X_val_iqr)
print('\n[INFO] Evaluating on validation set ...')
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=target_names))

- Clipping the outliers to the IQR boundaries did not improve the validation F1 score; F1_macro is still around 0.60. <br>
This is as expected, because tree models are generally robust to outliers.

## Trying IsolationForest for outlier removal

In [None]:
from sklearn.ensemble import IsolationForest

X_train_tf = pipe_1.fit_transform(X_train)
X_val_tf = pipe_1.transform(X_val)

# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1, random_state=42)

yhat = iso.fit_predict(X_train_tf)
# select all rows that are not outliers
mask = (yhat != -1)
X_train_iso, y_train_iso = X_train_tf[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train_iso.shape, y_train_iso.shape)

In [None]:
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_iso, y_train_iso)

In [None]:
xgb_clf = xgb.XGBClassifier(use_label_encoder=False, random_state=42)
xgb_clf.fit(X_train_iso, y_train_iso)

In [None]:
eval_model_on_train_val(xgb_clf, transformed=True)

- Once again, removing outliers this way did not improve the validation F1 score. F1_macro is still lower than 0.60.

## Try adding BMI_range

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

BMI_idx = 13

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bmi_range=True): # no *args or **kargs
        self.add_bmi_range = add_bmi_range
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        df = pd.DataFrame(X)
        # display(df.head())
        
        if self.add_bmi_range:
        
            def create_bmi_range(bmi):
                # create bmi_range
                if bmi < 18.5:
                    return 0  # underweight
                elif bmi < 25.0:
                    return 1  # normal
                elif bmi < 30.0:
                    return 2  # overweight
                elif bmi < 40.0:
                    return 3  # obesity
                else:
                    return 4  # extreme obesity
            
            # 'bmi_range' will become the last column -> 14
            df['bmi_range'] = df.loc[:, BMI_idx].apply(create_bmi_range)
            
            # drop the bmi column
            df = df.drop(columns=BMI_idx)
            # display(df.head())
            
            return df.to_numpy()
        else:
            return X
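
The if/elif ladder in `create_bmi_range` is a lookup into sorted bin edges, so the same mapping can be sketched with the standard library's `bisect`. This is an alternative illustration, not the code the pipeline uses; `BMI_EDGES` and `bmi_range` are names introduced here.

```python
from bisect import bisect_right

# Upper edges of the first four categories used in create_bmi_range
BMI_EDGES = [18.5, 25.0, 30.0, 40.0]

def bmi_range(bmi):
    """Map a BMI value to 0..4 (underweight, normal, overweight, obesity, extreme obesity)."""
    # bisect_right counts how many edges are <= bmi, which matches the `bmi < edge` tests
    return bisect_right(BMI_EDGES, bmi)

codes = [bmi_range(v) for v in (17.0, 18.5, 24.9, 25.0, 31.2, 40.0)]
print(codes)  # -> [0, 1, 1, 2, 3, 4]
```

Values that sit exactly on an edge fall into the higher category, the same as with the strict `<` comparisons in the class above.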

In [None]:
categorical_cols = ['gender', 'ever_married', 
                    'work_type', 'Residence_type', 
                    'smoking_status', 'hypertension', 
                    'heart_disease']

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, OrdinalEncoder
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

cat_pipeline = Pipeline([
    ("one_hot_encoder", OneHotEncoder(drop='first')),
    # ("label_encoder", OrdinalEncoder()),
    ('std_scaler', StandardScaler(with_mean=False)),
])

bmi_pipeline = Pipeline([
    ('std_scaler_bmi', StandardScaler(with_mean=False)),
    # ('robust_scaler_bmi', RobustScaler(with_centering=False)),
    # add_indicator is very useful to add extra columns denoting missing values
    ('KNN_imputer', KNNImputer(n_neighbors=10, add_indicator=True))
])

transformer_all = ColumnTransformer([
    ('std_scaler_glucose', StandardScaler(with_mean=False), ['avg_glucose_level']),
    ('std_scaler_age', StandardScaler(with_mean=False), ['age']),
    ('cat', cat_pipeline, categorical_cols),
    ('bmi_pipe', bmi_pipeline, ['bmi'])  # index = 13
], remainder="passthrough")

pipe_1 = Pipeline([
    ('feature_transform', transformer_all),
    ('bmi_range_adder', CombinedAttributesAdder()),
    
    # ('select_kbest_8', SelectKBest(k=8)),
    # ('pca_2', PCA(n_components=2)),
    
    # ('rfc', RandomForestClassifier()),
    # ('xgb', xgb.XGBClassifier(use_label_encoder=False)),
])

In [None]:
# X_train_tf = pd.DataFrame(pipe_1.fit_transform(X_train))
X_train_tf = (pipe_1.fit_transform(X_train))
X_val_tf = (pipe_1.transform(X_val))

In [None]:
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1, random_state=42)

yhat = iso.fit_predict(X_train_tf)
# select all rows that are not outliers
mask = (yhat != -1)
X_train_iso, y_train_iso = X_train_tf[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train_iso.shape, y_train_iso.shape)

In [None]:
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_iso, y_train_iso)

In [None]:
xgb_clf = xgb.XGBClassifier(use_label_encoder=False, random_state=42)
xgb_clf.fit(X_train_iso, y_train_iso)

In [None]:
eval_model_on_train_val(xgb_clf, transformed=True)

## Try tuning other classifiers

In [None]:
# from sklearn.ensemble import GradientBoostingClassifier
# from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# from sklearn.metrics import f1_score, make_scorer

# X_train_tf = pipe_1.fit_transform(X_train)
# X_val_tf = pipe_1.transform(X_val)

# gbc = GradientBoostingClassifier(random_state=42)

# gbc_param_grid = {    
#     'n_estimators': [i for i in range(50,201,50)],
#     'max_depth': range(4, 9),
#     'min_samples_split': [i for i in range(8, 17, 2)],
#     'max_leaf_nodes': [i for i in range(8, 15, 2)],
# }

# grid = GridSearchCV(estimator=gbc, param_grid=gbc_param_grid, cv=3,
#                     scoring={'f1': make_scorer(f1_score)}, refit='f1',
#                     verbose=1, n_jobs=-1)

# grid.fit(X_train_tf, y_train)

# print('\nBest params -> {}'.format(grid.best_params_))
# print('Best score -> {}'.format(grid.best_score_))
# print('Validation score -> {}'.format(grid.score(X_val_tf, y_val)))

In [None]:
# X_train_tf = pipe_1.fit_transform(X_train)
# X_val_tf = pipe_1.transform(X_val)

# xgb_clf = xgb.XGBClassifier(use_label_encoder=False, random_state=42)

# param_grid = {
#     "max_depth": [3, 4, 5, 7],
#     "learning_rate": [0.1, 0.01, 0.05],
#     "gamma": [0, 0.25, 1],
#     "reg_lambda": [0, 1, 10],
#     "scale_pos_weight": [1, 3, 5],
#     "subsample": [0.8],
#     "colsample_bytree": [0.5],
# }

# grid = GridSearchCV(estimator=xgb_clf, param_grid=param_grid, cv=3,
#                     scoring={'f1': make_scorer(f1_score)}, refit='f1',
#                     verbose=1, n_jobs=-1)

# grid.fit(X_train_tf, y_train)

# print('\nBest params -> {}'.format(grid.best_params_))
# print('Best score -> {}'.format(grid.best_score_))
# print('Validation score -> {}'.format(grid.score(X_val_tf, y_val)))

In [None]:
# eval_model_on_train_val(grid, transformed=True)

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold
# from sklearn.metrics import f1_score, make_scorer

# X_train_tf = pipe_1.fit_transform(X_train)
# X_val_tf = pipe_1.transform(X_val)

# # Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# # Number of features to consider at every split
# max_features = ['sqrt', 'log2']  # 'auto' was an alias of 'sqrt' and was removed in scikit-learn 1.3
# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# # Minimum number of samples required to split a node
# min_samples_split = [2, 5, 10]
# # Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}

# # cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
# cv = 3

# # Use the random grid to search for best hyperparameters
# # First create the base model to tune
# rf = RandomForestClassifier()
# # Random search of parameters, using 3 fold cross validation, 
# # search across 100 different combinations, and use all available cores
# rf_random = RandomizedSearchCV(estimator=rf,
#                                param_distributions=random_grid,
#                                n_iter=100,
#                                cv=cv,
#                                scoring={'f1': make_scorer(f1_score)}, 
#                                refit='f1',
#                                verbose=1,
#                                random_state=42,
#                                n_jobs=-1)
# # Fit the random search model
# rf_random.fit(X_train_tf, y_train)

In [None]:
# print('\nBest params -> {}'.format(rf_random.best_params_))
# print('Best score -> {}'.format(rf_random.best_score_))
# print('Validation score -> {}'.format(rf_random.score(X_val_tf, y_val)))

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# from sklearn.metrics import f1_score, make_scorer

# X_train_tf = pipe_1.fit_transform(X_train)
# X_val_tf = pipe_1.transform(X_val)

# rfc = RandomForestClassifier(random_state=42)

# # param_grid = {    
# #     'n_estimators': [i for i in range(50,251,50)],
# #     'max_depth': range(4, 11),
# #     'min_samples_split': [i for i in range(8, 17, 2)],
# #     'max_leaf_nodes': [i for i in range(8,17,2)],
# # }

# param_grid = {
#     'bootstrap': [False],
#     'max_depth': range(35, 46),
#     'max_features': [1, 2, 3],
#     'min_samples_leaf': [1, 2, 3, 4],
#     'min_samples_split': [2, 3, 4],  # an integer min_samples_split must be at least 2
#     'n_estimators': [1300, 1350, 1400, 1450, 1500]
# }

# # cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
# cv = 3

# grid = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=cv,
#                     scoring={'f1': make_scorer(f1_score)}, refit='f1',
#                     verbose=1, n_jobs=-1)

# grid.fit(X_train_tf, y_train)

# print('\nBest params -> {}'.format(grid.best_params_))
# print('Best score -> {}'.format(grid.best_score_))
# print('Validation score -> {}'.format(grid.score(X_val_tf, y_val)))

## Random Forest with Resampling

In [None]:
X_train_tf = pipe_1.fit_transform(X_train)
X_val_tf = pipe_1.transform(X_val)

In [None]:
# random forest with random undersampling for imbalanced classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=10, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(model, X_train_tf, y_train, scoring='f1', cv=cv, n_jobs=-1)
print('Mean F1 score: %.3f' % np.mean(scores))

In [None]:
from imblearn.ensemble import EasyEnsembleClassifier

model = EasyEnsembleClassifier(n_estimators=10, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(model, X_train_tf, y_train, scoring='f1', cv=cv, n_jobs=-1)
print('Mean F1 score: %.3f' % np.mean(scores))

## SMOTE Tomek Resampling

In [None]:
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks

resample = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'), random_state=42)

In [None]:
X_res, y_res = resample.fit_resample(X_train_tf, y_train)

In [None]:
X_res_df, y_res_series = pd.DataFrame(X_res), pd.Series(y_res)

In [None]:
X_res_df.head()

In [None]:
X_res_df.isna().sum()

In [None]:
y_res_series.value_counts(), y_train.value_counts(), y_val.value_counts(), y_test.value_counts()

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
model = xgb.XGBClassifier(use_label_encoder=False, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(model, X_res, y_res, scoring='f1', cv=cv, n_jobs=-1)
print('Mean F1 score: %.3f' % np.mean(scores))

In [None]:
model = xgb.XGBClassifier(random_state=42)
model.fit(X_res, y_res)

In [None]:
eval_model_on_train_val(model, transformed=True)

- The validation `f1` score still shows little improvement; `f1_macro` consistently hovers around 60%.
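
For context on why oversampling may not add much, the core of SMOTE is a simple interpolation between a minority-class row and one of its nearest minority-class neighbours. The sketch below shows only that step in plain Python with made-up rows; it is not imblearn's implementation, which also handles the neighbour search and the Tomek-link cleaning.

```python
import random

def smote_like_point(x, neighbor, rng):
    """Create a synthetic point on the line segment between x and one of its neighbours."""
    lam = rng.random()  # random fraction in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(42)

# Two made-up minority-class rows: (age, avg_glucose_level, bmi)
x = [67.0, 228.7, 36.6]
neighbor = [80.0, 105.9, 32.5]

synthetic = smote_like_point(x, neighbor, rng)
print(synthetic)
```

Every synthetic row is a blend of existing stroke rows, so it carries no information beyond what those rows already hold. It is also worth keeping in mind that cross-validating on data that has already been resampled places such blends in the validation folds, which tends to make the cross-validated F1 look better than the score on the untouched validation set.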

# PyCaret Library

In [None]:
# !pip install pycaret

In [None]:
# # compare machine learning algorithms on the stroke prediction dataset
# from pycaret.classification import setup
# from pycaret.classification import compare_models, tune_model, create_model
# from sklearn.ensemble import ExtraTreesClassifier

In [None]:
# # setup the dataset
# grid = setup(data=df, target='stroke', silent=True, n_jobs=None, imputation_type='simple')

In [None]:
# # evaluate models and compare models
# best = compare_models(errors='raise')
# # report the best model
# print(best)

In [None]:
# model = create_model('qda')

In [None]:
# # tune model hyperparameters
# tuned_model = tune_model(best)
# # report the best model
# print(tuned_model)

In [None]:
# model.score()