Stroke is the third most common cause of death and a leading cause of disability among adults, including complete dependence on others for activities of daily living. There are two main types:
* Hemorrhagic stroke (sudden bleeding in the brain, for example from a ruptured aneurysm, which damages brain tissue)
* Ischemic stroke (usually caused by a blocked blood vessel)

We can highlight several risk factors for stroke:
* Age
* Family history of stroke
* Hypertension
* Heart diseases
* Diabetes
* Smoking status
* Alcohol abuse
* Amphetamine, cocaine abuse
* Obesity


In [None]:
import pandas as pd
df = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df.columns = [col.lower() for col in df.columns]

In [None]:
df[((df["age"] < 18) & 
    (df["work_type"] != "children"))].head(10)

Some observations, such as a 7- or 8-year-old running a business, are implausible and are probably data-collection errors. Confirming this would require consulting a specialist; here such rows are dropped.

In [None]:
df = df[((df["age"] >= 18) |
         (df["work_type"] == "children"))]

In [None]:
categorical_columns = ["gender",
                       "hypertension",
                       "heart_disease",
                       "ever_married",
                       "work_type",
                       "residence_type",
                       "smoking_status"]

numerical_columns = ["age",
                     "avg_glucose_level",
                     "bmi"]

target = "stroke"

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "whitegrid")

def plot_cnt_prc(feature, target, data, axes):
    sns.barplot(x = feature, y = "id",
                data = data.groupby([feature, target]).count().reset_index(),
                color = "#d6d6f5", hue = target, ax = axes[0])
    
    axes[0].set_xlabel(axes[0].get_xlabel(), size = 16)
    axes[0].set_ylabel("Quantity", size = 16)
    
    sns.barplot(x = feature, y = target,
                data = (data.groupby(feature)[target].mean() * 100).reset_index(),
                color = "#d6d6f5", ax = axes[1])
    
    axes[1].set_xlabel(axes[1].get_xlabel(), size = 16)
    axes[1].set_ylabel("Percentage", size = 16)

    
columns = ["gender",
           "age",
           "hypertension",
           "heart_disease",
           "ever_married",
           "work_type",
           "residence_type",
           "avg_glucose_level",
           "bmi",
           "smoking_status"]
    

fig, axes = plt.subplots(10, 2, figsize = (20, 70))
data = df.copy()
for ax, col in zip(axes, columns):
    if col in numerical_columns:
        data[col] = pd.qcut(data[col], q = 5,
                            duplicates = "drop")
    plot_cnt_prc(col, target, data, ax)

plt.show()

In [None]:
counts = df[target].value_counts()
plt.figure(figsize = (12, 6))

plt.pie(x = counts,
        labels = counts.keys(),
        autopct = "%.1f%%",
        explode = (0, 0.1),
        colors = ["#99b3ff", "#4d79ff"])
plt.show()

In [None]:
import scipy.stats as stats
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

def correlation_plot(df, columns, 
                     method = "pearson", 
                     figsize = (12, 6)):
    corr = df.loc[:, columns].corr(method = method)
    mask = np.triu(np.ones_like(corr, dtype = bool))
    
    plt.figure(figsize = figsize)
    heatmap = sns.heatmap(data = corr, mask = mask,
                          vmin = -1, vmax = 1,
                          annot = True, cmap = "coolwarm")
    
    heatmap.set_title("Correlation Heatmap", fontdict = {"fontsize": 15})
    plt.show()

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x,y)
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))

In [None]:
data = df.copy()
encoder = OrdinalEncoder()
columns = categorical_columns + [target]

data = pd.DataFrame(encoder.fit_transform(data[columns]), 
                    columns = columns)

correlation_plot(df = data,
                 columns = columns,
                 method = cramers_v)
del data

The dataset does not include highly correlated categorical features.
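
For reference, the `cramers_v` helper defined earlier computes a bias-corrected variant of Cramér's V. Written out from that code, with $r$ and $k$ the number of rows and columns of the contingency table, $n$ the number of observations, and $\chi^2$ the statistic returned by `stats.chi2_contingency` (which, by default, applies Yates' continuity correction to 2×2 tables):

$$
\tilde{\varphi}^2 = \max\left(0,\ \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}
$$

$$
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
$$

Values close to 0 indicate weak association and values close to 1 indicate strong association.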

In [None]:
columns = numerical_columns + [target]
correlation_plot(df = df,
                 columns = columns)

The dataset does not include highly correlated numerical features.

In [None]:
pd.DataFrame(df.isna().sum(), 
             columns = ["na_quantity"])

Assuming that the glucose measurement:
* was taken on an empty stomach
* is reported in mg/dL

we can distinguish four categories (this would certainly require consulting a specialist):
* avg_glucose_level ≤ 70 – glucose level too low
* 70 < avg_glucose_level ≤ 99 – normal blood glucose level
* 99 < avg_glucose_level ≤ 125 – pre-diabetes
* avg_glucose_level > 125 – diabetes

We can assign BMI values to 8 categories (this would also require consulting a specialist):
* BMI < 16 – severely underweight
* 16 ≤ BMI ≤ 16.99 – emaciation
* 17 ≤ BMI ≤ 18.49 – underweight
* 18.5 ≤ BMI ≤ 24.99 – normal weight
* 25 ≤ BMI ≤ 29.99 – overweight
* 30 ≤ BMI ≤ 34.99 – obesity class I
* 35 ≤ BMI ≤ 39.99 – obesity class II
* BMI ≥ 40 – obesity class III
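
As a vectorized alternative to a chain of `if`/`elif` checks, the BMI binning could be sketched with `pd.cut`. This is only an illustration: the helper name `bin_bmi` and the constants `BMI_EDGES` and `BMI_LABELS` are hypothetical, the cut points are the ones used by the notebook's own `bmi` function (16, 16.99, 18.49, ...), and the bins are assumed to be left-closed.

```python
import numpy as np
import pandas as pd

# Cut points taken from the notebook's `bmi` function; bins are left-closed: [a, b).
BMI_EDGES = [-np.inf, 16, 16.99, 18.49, 24.99, 29.99, 34.99, 39.99, np.inf]
BMI_LABELS = ["SEVERELY_UNDERWEIGHT", "EMACIATION", "UNDERWEIGHT", "NORMAL_WEIGHT",
              "OVERWEIGHT", "OBESITY_CLASS_I", "OBESITY_CLASS_II", "OBESITY_CLASS_III"]

def bin_bmi(values):
    """Hypothetical helper: map BMI values to category labels; missing values become "NAN"."""
    series = pd.Series(values, dtype = "float64")
    binned = pd.cut(series, bins = BMI_EDGES, labels = BMI_LABELS, right = False)
    # Convert to plain objects so that the extra "NAN" label can be filled in.
    return binned.astype(object).where(binned.notna(), "NAN")

print(bin_bmi([15.2, 22.0, 27.5, 41.3, np.nan]).tolist())
```

Keeping the missing values as a separate "NAN" category, rather than imputing them, mirrors what the `bmi` function in the next cell does.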


In [None]:
def glucose_level(glucose):
    if glucose <= 70:
        return "TOO_LOW_GLUCOSE_LEVEL"
    elif glucose <= 99:
        return "NORMAL_BLOOD_GLUCOSE_LEVEL"
    elif glucose <= 125:
        return "PRE_DIABETES"
    else:
        return "DIABETES"

def bmi(bmi_level):
    if pd.isna(bmi_level):
        return "NAN"
    elif bmi_level < 16:
        return "SEVERELY_UNDERWEIGHT"
    elif bmi_level < 16.99:
        return "EMACIATION"
    elif bmi_level < 18.49:
        return "UNDERWEIGHT"
    elif bmi_level < 24.99:
        return "NORMAL_WEIGHT"
    elif bmi_level < 29.99:
        return "OVERWEIGHT"
    elif bmi_level < 34.99:
        return "OBESITY_CLASS_I"
    elif bmi_level < 39.99:
        return "OBESITY_CLASS_II"
    else:
        return "OBESITY_CLASS_III"

data = df.copy()
data["avg_glucose_level"] = data["avg_glucose_level"].apply(glucose_level)
data["bmi"] = data["bmi"].apply(bmi)

In [None]:
categorical_columns = ["gender",
                       "hypertension",
                       "heart_disease",
                       "ever_married",
                       "work_type",
                       "residence_type",
                       "smoking_status",
                       "avg_glucose_level",
                       "bmi"]

numerical_columns = ["age"]

In [None]:
encoder = OrdinalEncoder()
columns = categorical_columns + [target]

data = pd.DataFrame(encoder.fit_transform(data[columns]), 
                    columns = columns)

correlation_plot(df = data,
                 columns = columns,
                 method = cramers_v)
del data

In [None]:
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin

class Transformer(BaseEstimator, TransformerMixin):
    def __bmi(self, value):
        if pd.isna(value):
            return "NAN"
        elif value < 16:
            return "SEVERELY_UNDERWEIGHT"
        elif value < 16.99:
            return "EMACIATION"
        elif value < 18.49:
            return "UNDERWEIGHT"
        elif value < 24.99:
            return "NORMAL_WEIGHT"
        elif value < 29.99:
            return "OVERWEIGHT"
        elif value < 34.99:
            return "OBESITY_CLASS_I"
        elif value < 39.99:
            return "OBESITY_CLASS_II"
        else:
            return "OBESITY_CLASS_III"
  
    def __glucose(self, value):
        if value <= 70:
            return "TOO_LOW_GLUCOSE_LEVEL"
        elif value <= 99:
            return "NORMAL_BLOOD_GLUCOSE_LEVEL"
        elif value <= 125:
            return "PRE_DIABETES"
        else:
            return "DIABETES"

    def transform(self, X, y = None):
        X = X.copy()
        X["bmi"] = X["bmi"].apply(self.__bmi)
        X["avg_glucose_level"] = X["avg_glucose_level"].apply(self.__glucose)
        return X

    def fit(self, X, y = None):
        return self

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from category_encoders import WOEEncoder
from sklearn.compose import ColumnTransformer

X = df[categorical_columns + numerical_columns]
Y = df[target]

X_train, X_test, Y_train, Y_test =\
  train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 1)

In [None]:
pipeline = Pipeline([("bmi_glucose", Transformer()),
                     ("woe", WOEEncoder(cols = categorical_columns))])

transformer = ColumnTransformer([("pipeline",
                                  pipeline,
                                  categorical_columns),
                                 ("scale",
                                  StandardScaler(),
                                  numerical_columns)])

X_train = pd.DataFrame(transformer.fit_transform(X_train, Y_train),
                       columns = X_train.columns,
                       index = X_train.index)
X_test = pd.DataFrame(transformer.transform(X_test),
                      columns = X_test.columns,
                      index = X_test.index)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr = LogisticRegression(class_weight = "balanced")
lr.fit(X_train, Y_train)

print(classification_report(Y_test, lr.predict(X_test)))

In [None]:
import optuna 
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X_train_skf = X_train.values
Y_train_skf = Y_train.values

def objective(trial):
    param_grid = {
          "class_weight": "balanced",
          "random_state": 1,
          "solver": "liblinear",
          "C": trial.suggest_float("C", 0.01, 1),
          "penalty": trial.suggest_categorical("penalty", ["l1", "l2"])
      }
  
    skf = StratifiedKFold(n_splits = 3)
    test_scores = []

    for train_index, test_index in skf.split(X_train_skf, Y_train_skf):
        X_train_fold, X_test_fold = X_train_skf[train_index], X_train_skf[test_index]
        Y_train_fold, Y_test_fold = Y_train_skf[train_index], Y_train_skf[test_index]
  
        # Fit and score inside the loop so that every fold contributes to the mean.
        classifier = LogisticRegression(**param_grid)
        classifier.fit(X_train_fold, Y_train_fold)
        test_scores.append(roc_auc_score(Y_test_fold, classifier.predict_proba(X_test_fold)[:, 1]))

    return np.asarray(test_scores).mean()

optuna.logging.disable_default_handler()
study = optuna.create_study(direction = "maximize")
study.optimize(objective, n_trials = 300)

In [None]:
from optuna.visualization import plot_parallel_coordinate
plot_parallel_coordinate(study)

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer

param_grid = {
      "class_weight": ["balanced"],
      "random_state": [1],
      "solver": ["liblinear"],
      "C": [0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2],
      "penalty": ["l1", "l2"]
}

grid = GridSearchCV(estimator = LogisticRegression(),
                    param_grid = param_grid,
                    cv = StratifiedKFold(n_splits = 3), 
                    n_jobs = -1,
                    verbose = 10,
                    # The built-in "roc_auc" scorer avoids `needs_proba`, which is deprecated in recent scikit-learn releases.
                    scoring = "roc_auc")

grid.fit(X_train, Y_train)

In [None]:
lr = LogisticRegression(**grid.best_params_)
lr.fit(X_train, Y_train)

print(classification_report(Y_test, lr.predict(X_test)))

In [None]:
import matplotlib.pyplot as plt
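from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize = (10, 6))
# `plot_confusion_matrix` was removed in scikit-learn 1.2; `ConfusionMatrixDisplay.from_estimator` replaces it.
ConfusionMatrixDisplay.from_estimator(lr, 
                                      X_test, 
                                      Y_test, 
                                      ax = ax, 
                                      values_format = '.0f')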
plt.grid(False)
plt.show()

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, threshold = roc_curve(Y_test, lr.predict_proba(X_test)[:, 1])
data = list(zip(threshold, tpr, fpr))
trh = pd.DataFrame(data, 
                   columns = ["threshold", 
                              "true_positive_rate", 
                              "false_positive_rate"])
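# tpr - fpr is Youden's J statistic; the threshold that maximizes it is a common choice of cut-off.
trh["tpr-fpr"] = trh["true_positive_rate"] - trh["false_positive_rate"]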

In [None]:
trh.sort_values(by = "tpr-fpr", ascending = False).head(3)

In [None]:
data = X_test.copy()
columns = data.columns
data["label"] = Y_test
data["pred_proba"] = lr.predict_proba(data[columns])[:, 1]

data = data.reset_index()

In [None]:
import shap

explainer = shap.Explainer(lr.predict_proba, data.loc[:, columns])
explainer_output = explainer(data.loc[:, columns])

expected_values = explainer_output.base_values[:1, :].reshape(-1)
shap_values = explainer_output.values

In [None]:
shap.summary_plot(shap_values[:, :, 1], data.loc[:, columns])

In [None]:
data.sort_values(by = "pred_proba").head(3)

In [None]:
shap.initjs()
shap.force_plot(expected_values[1], shap_values[508].T[1], df.loc[4581, columns])

In [None]:
data.sort_values(by = "pred_proba", ascending=False).head(5)

In [None]:
shap.initjs()
shap.force_plot(expected_values[1], shap_values[65].T[1], df.loc[218, columns])

In [None]:
shap.initjs()
shap.force_plot(expected_values[1], shap_values[313].T[1], df.loc[4164, columns])

The dataset contains observations whose target is 0 but which closely resemble observations labeled as positive ("success"). Age turned out to be the most relevant attribute in the available set. Adding further features might improve the results.
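
One way to back a claim about feature relevance with a number is to rank features by their mean absolute SHAP value. The sketch below is illustrative only: `rank_features_by_mean_abs_shap` is a hypothetical helper, and the toy array and the names `f0`, `f1`, `f2` are made-up values, not results from this dataset.

```python
import numpy as np

def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Hypothetical helper: order features by mean |SHAP| over all observations."""
    importance = np.abs(np.asarray(shap_values)).mean(axis = 0)
    order = np.argsort(importance)[::-1]  # largest mean |SHAP| first
    return [(feature_names[i], float(importance[i])) for i in order]

# Toy values for illustration only (3 observations x 3 features).
toy_shap = np.array([[ 0.5, -0.1, 0.0],
                     [-0.7,  0.2, 0.1],
                     [ 0.6, -0.3, 0.1]])
toy_names = ["f0", "f1", "f2"]

for name, score in rank_features_by_mean_abs_shap(toy_shap, toy_names):
    print(f"{name}: {score:.3f}")
```

In this notebook the same helper could presumably be called as `rank_features_by_mean_abs_shap(shap_values[:, :, 1], list(columns))`, assuming `shap_values` has the shape (observations, features, classes) used in the summary plot above.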