In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

A. **Exploratory Data Analysis**

Initially the data must be analysed to detect outliers, impute missing values and find patterns in the data. Feature engineering is also an important part of this step: it identifies features that are not very useful and creates new features that correlate more strongly with the target.

In [None]:
df = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df.head()

In [None]:
df.info()

The following can be noted from the info function.
1. **id** is a unique identifier with no role in prediction, so it has to be dropped.
2. The target variable is **stroke**.
3. The following are numerical inputs {**age**, **hypertension**, **heart_disease**, **avg_glucose_level**, **bmi**}.
4. The following are categorical inputs {**gender**, **ever_married**, **work_type**, **Residence_type**, **smoking_status**}.
5. There are missing values in the **bmi** column.
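
The missing-value count can also be read off directly rather than inferred from the `info` output. A minimal sketch on a made-up frame (the `toy` data below is illustrative and is not the stroke dataset):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the real data; all values are made up.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "stroke": [1, 1, 1, 0],
})

# Number of missing entries per column.
missing = toy.isna().sum()
print(missing[missing > 0])
```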

In [None]:
df.drop(columns=["id"], inplace=True)
df.hist(bins=50, figsize=(20, 15))
plt.show()

6. The distributions of the numerical features look fairly robust.
7. **heart_disease** and **hypertension** take binary values rather than following a continuous distribution. It might be better to convert them to strings and then use one-hot encoding for such discrete classes.
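
One way to carry out the conversion suggested in point 7, sketched on a made-up frame. `pd.get_dummies` is used here only for illustration; the pipeline later in this notebook keeps both columns numerical.

```python
import pandas as pd

# Made-up binary flags, for illustration only.
toy = pd.DataFrame({"hypertension": [0, 1, 0], "heart_disease": [1, 0, 0]})

# Cast the 0/1 flags to strings so they are treated as categories,
# then expand each category into its own indicator column.
as_str = toy.astype(str)
encoded = pd.get_dummies(as_str)
print(list(encoded.columns))
```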

In [None]:
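# Restrict to numerical columns; recent pandas versions raise on string columns.
corr_matrix = df.select_dtypes(include="number").corr()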
corr_matrix["stroke"].sort_values(ascending=False)

8. All the numerical features seem to be weakly correlated with the target column, and the correlation is positive for every one of them. None of the numerical features seems useless, so all of them need to be kept.
9. **bmi** is the least correlated, so we may later find more useful features to consider in its place.
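
The coefficients reported by `DataFrame.corr` are Pearson correlations by default. A minimal sketch of the same quantity computed by hand on made-up numbers (the arrays are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up values, for illustration only.
age = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
stroke = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# Pearson r = cov(x, y) / (std(x) * std(y)), using population statistics.
cov = np.mean((age - age.mean()) * (stroke - stroke.mean()))
r_manual = cov / (age.std() * stroke.std())

# NumPy's built-in estimate should agree with the manual one.
r_numpy = np.corrcoef(age, stroke)[0, 1]
print(round(r_manual, 3), round(r_numpy, 3))
```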

In [None]:
print("Distribution of gender")
print(df["gender"].value_counts(), "\n")

print("Distribution of ever_married")
print(df["ever_married"].value_counts(), "\n")

print("Distribution of work_type")
print(df["work_type"].value_counts(), "\n")

print("Distribution of Residence_type")
print(df["Residence_type"].value_counts(), "\n")

print("Distribution of smoking_status")
print(df["smoking_status"].value_counts(), "\n")

B. **Data cleaning and preparation**

1. The categorical column **gender** has just a single instance of **Other**. This can be safely removed.

In [None]:
df = df[df["gender"] != "Other"]
print("Distribution of gender")
print(df["gender"].value_counts())

2. Next let's split the data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["stroke"])
y = df["stroke"]

np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
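
Because the target is highly imbalanced, a plain random 10% split can leave the test set with very few positive cases. One possible variation, not used in this notebook, is to pass `stratify` so that both splits keep the class ratio. A sketch on synthetic labels (`X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 10% positives (illustrative only).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify keeps the share of positives the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=0
)
print(y_tr.mean(), y_te.mean())
```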

3. Now let's process the data through a pipeline so that ML-ready data comes out at the end.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

num_cols = ["age", "hypertension", "heart_disease", "avg_glucose_level", "bmi"]
cat_cols = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]

num_pipeline = Pipeline([
    ("imputer", KNNImputer()),
    ("std_scale", StandardScaler())
])

pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols)
])

In [None]:
X_prepared = pipeline.fit_transform(X_train)
y_prepared = np.array(y_train)
print(X_prepared.shape)
print(y_prepared.shape)

X_test_prepared = pipeline.transform(X_test)
y_test_prepared = np.array(y_test)
print(X_test_prepared.shape)
print(y_test_prepared.shape)
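
Note that the pipeline is fitted on the training set only and then merely applied to the test set, so no test-set statistics enter the preprocessing. The same idea for standardisation alone, sketched with NumPy on made-up numbers:

```python
import numpy as np

# Made-up single-feature data, for illustration only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training data only.
mean, std = train.mean(axis=0), train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training mean and std
print(test_scaled)
```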

4. Let's determine the important features by training a Random Forest Classifier. We won't bother much with hyperparameter tuning, since we are only interested in the feature importances, which we can expect to remain roughly the same.

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=500)
forest_clf.fit(X_prepared, y_prepared)

In [None]:
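importances = forest_clf.feature_importances_
# The ColumnTransformer outputs the numerical columns first, so list them
# before the one-hot columns and start the categorical index after them.
for i, attribute in enumerate(num_cols):
    print(importances[i], attribute)
print()
i = len(num_cols)
cat_encoder = pipeline.named_transformers_["cat"]
for attribute, categories in zip(cat_cols, cat_encoder.categories_):
    for category in categories:
        print(importances[i], attribute, category)
        i += 1
    print()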

There is not a single attribute which is unimportant. However, it is clear that children and people without heart disease have almost no chance of having a stroke.
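
When matching importances to names, keep in mind that a `ColumnTransformer` concatenates its outputs in the order the transformers are listed, so here the numerical columns come first and the one-hot columns follow. A sketch of building a matching name list; the category values below are placeholders and are not read from the fitted encoder:

```python
num_cols = ["age", "hypertension", "heart_disease", "avg_glucose_level", "bmi"]

# Placeholder categories for two of the categorical columns (illustrative).
categories = {
    "gender": ["Female", "Male"],
    "ever_married": ["No", "Yes"],
}

# Numerical names first, then one name per (column, category) pair.
feature_names = list(num_cols)
for attribute, values in categories.items():
    feature_names += [f"{attribute}={value}" for value in values]

for index, name in enumerate(feature_names):
    print(index, name)
```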

C. **Resampling**

It is clear that the dataset is highly skewed. So we need to undersample the majority class or oversample the minority class.

In [None]:
true_mask = y_prepared == 1
false_mask = y_prepared != 1

X_true, y_true = X_prepared[true_mask], y_prepared[true_mask]
X_false, y_false = X_prepared[false_mask], y_prepared[false_mask]

print(X_true.shape)
print(X_false.shape)

In [None]:
batch_size = X_false.shape[0]
mini_batch_size = X_true.shape[0]

# Shuffle the majority class, then pair each chunk of it with every minority sample.
permuted_indices = np.random.permutation(batch_size)
y_true = y_true.reshape(-1, 1)
X_batch, y_batch = [], []
for start in range(0, batch_size, mini_batch_size):
    # Slicing past the end is safe, so the last chunk is simply shorter.
    indices = permuted_indices[start:start + mini_batch_size]

    X_minibatch = np.vstack([X_false[indices], X_true])
    y_minibatch = np.vstack([y_false[indices].reshape(-1, 1), y_true])

    # Shuffle within the mini-batch so the classes are interleaved.
    permutation = np.random.permutation(len(X_minibatch))

    X_batch.append(X_minibatch[permutation])
    y_batch.append(y_minibatch[permutation])
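
In other words, the majority class is cut into chunks of the minority-class size and every chunk is paired with all minority samples, which gives ceil(n_false / n_true) roughly balanced mini-batches. A sketch of the resulting sizes with made-up class counts:

```python
import math

n_false, n_true = 1000, 48  # made-up class counts, for illustration only

n_batches = math.ceil(n_false / n_true)

# Every chunk holds n_true majority samples except possibly the last one.
chunk_sizes = [min(n_true, n_false - start) for start in range(0, n_false, n_true)]

# Each mini-batch is one majority chunk plus all minority samples.
batch_sizes = [size + n_true for size in chunk_sizes]

print(n_batches, batch_sizes[0], batch_sizes[-1])
```

The last mini-batch is the only one that can be noticeably smaller and tilted towards the minority class.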

D. **Outlier Detection**

We use three algorithms to detect outliers. An instance is discarded only if all three algorithms predict that it is an outlier.
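
Each of these detectors labels inliers `1` and outliers `-1`, so the three predictions sum to `-3` only when all of them flag the same instance. A sketch of that rule with hand-written predictions:

```python
import numpy as np

# Hand-written predictions for five instances (1 = inlier, -1 = outlier).
y1 = np.array([1, -1, -1, 1, -1])
y2 = np.array([1, -1, 1, 1, -1])
y3 = np.array([1, -1, -1, -1, -1])

votes = y1 + y2 + y3
keep = votes != -3  # drop only the instances flagged by all three
print(keep)
```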

In [None]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

lof = LocalOutlierFactor(n_jobs=-1)
IF = IsolationForest(n_jobs=-1, random_state=0, bootstrap=True)
svm = OneClassSVM()

for i in range(len(X_batch)):
    y1 = lof.fit_predict(X_batch[i])
    y2 = IF.fit_predict(X_batch[i])
    y3 = svm.fit_predict(X_batch[i])
    y_ = y1 + y2 + y3
    
    X_batch[i] = X_batch[i][y_ != -3]
    y_batch[i] = y_batch[i][y_ != -3]
    print(f"Minibatch {i+1} : ", X_batch[i].shape, y_batch[i].shape)

E. **Model Preparation**

First let's create the base estimators that will be used.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

log_reg = LogisticRegression(max_iter=50000, class_weight="balanced", n_jobs=-1)
svm_clf = LinearSVC(max_iter=50000, class_weight="balanced")
tree_clf = DecisionTreeClassifier(class_weight="balanced")
extra_clf = ExtraTreeClassifier(class_weight="balanced")

Next the final aggregator model is made. Since multi-layer stacking is used, the aggregator is itself a stack.

In [None]:
from sklearn.ensemble import StackingClassifier

estimators = [
    ("svm", svm_clf),
    ("tree", tree_clf),
    ("extra_tree", extra_clf)
]

aggregator = StackingClassifier(
    estimators=estimators,
    final_estimator=log_reg,
    n_jobs=-1
)

Finally the base layer of the stack is made. They all use bagging or boosting in order to combine the base estimators.

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier

log_adaboost = AdaBoostClassifier(
    base_estimator=log_reg, n_estimators=100,
    algorithm="SAMME.R"
)

log_bag = BaggingClassifier(
    log_reg, n_estimators=100, n_jobs=-1,
    max_samples=1.0, bootstrap=True,
    max_features=1.0, bootstrap_features=True,
    oob_score=True
)

svm_adaboost = AdaBoostClassifier(
    base_estimator=svm_clf, n_estimators=100,
    algorithm="SAMME"
)

svm_bag = BaggingClassifier(
    svm_clf, n_estimators=100, n_jobs=-1,
    max_samples=1.0, bootstrap=True,
    max_features=1.0, bootstrap_features=True,
    oob_score=True
)

random_forest = RandomForestClassifier(
    n_estimators=100, n_jobs=-1,
    bootstrap=True, oob_score=True,
    class_weight="balanced_subsample", min_samples_leaf=0.05
)

extra_trees = ExtraTreesClassifier(
    n_estimators=100, n_jobs=-1,
    bootstrap=True, oob_score=True,
    class_weight="balanced_subsample", min_samples_leaf=0.05
)

gb_boost = GradientBoostingClassifier(
    n_estimators=100, subsample=0.8,
    min_samples_leaf=0.1, min_samples_split=0.2
)

xg_boost = XGBClassifier(
    n_estimators=100, n_jobs=-1
)

estimators = [
    ("logistic_bagging", log_bag),
    ("logistic_boosting", log_adaboost),
    ("svm_bagging", svm_bag),
    ("svm_boosting", svm_adaboost),
    ("random_forest", random_forest),
    ("extra_trees", extra_trees),
    ("gradient_boosting", gb_boost),
    ("xg_boost", xg_boost)
]

minibatch_classifier = StackingClassifier(
    estimators=estimators,
    final_estimator=aggregator,
    n_jobs=-1
)

Now we train one instance of the multi-layer stack on each of the minibatches and use soft voting, that is, averaging the predicted probabilities, to determine the final prediction.
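
Soft voting can be sketched in a few lines of NumPy: average the class probabilities over the models, then pick the most probable class for each sample. The probabilities below are made up for illustration:

```python
import numpy as np

# Made-up predict_proba outputs from three models for two samples.
# Shape: (n_models, n_samples, n_classes).
probas = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.8, 0.2], [0.6, 0.4]],
])

mean_proba = probas.mean(axis=0)    # average over the models
labels = mean_proba.argmax(axis=1)  # most probable class per sample
print(mean_proba)
print(labels)
```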

In [None]:
from sklearn.base import clone
from sklearn.ensemble import VotingClassifier
from sklearn.base import BaseEstimator, ClassifierMixin

class Classifier(ClassifierMixin, BaseEstimator):
    """Average the predicted probabilities of one stack per mini-batch."""

    def __init__(self, estimator):
        # Store constructor arguments under their own name so that
        # get_params and clone keep working.
        self.estimator = estimator

    def fit(self, X, y):
        # X and y are lists of mini-batches, not single arrays.
        # Fitted state is reset here so that refitting does not accumulate models.
        self.fitted_estimators_ = []
        self.classes_ = np.array([0, 1])
        for X_minibatch, y_minibatch in zip(X, y):
            estimator = clone(self.estimator)
            estimator.fit(X_minibatch, y_minibatch.ravel())
            self.fitted_estimators_.append(estimator)
        return self

    def predict_proba(self, X):
        # Soft voting: mean of the per-model class probabilities.
        probability = [est.predict_proba(X) for est in self.fitted_estimators_]
        return np.mean(probability, axis=0)

    def predict(self, X):
        probability = self.predict_proba(X)
        return self.classes_[np.argmax(probability, axis=1)]

classifier = Classifier(minibatch_classifier)
classifier.fit(X_batch, y_batch)

F. **Model Evaluation**

Now let's compute the various performance metrics from the predictions on the training and test sets. Note that the stacking classifier has been used as is, without any hyperparameter tuning. Later we may attempt to tune each component of the stacking classifier separately.
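
On an imbalanced target, accuracy alone says little, so precision, recall and F1 are worth reading alongside it. A sketch of the definitions using made-up confusion-matrix counts:

```python
# Made-up counts for illustration: true/false positives and negatives.
tp, fp, fn, tn = 15, 85, 10, 390

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```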

In [None]:
y_predict = classifier.predict(X_prepared)
y_test_predict = classifier.predict(X_test_prepared)
print(y_predict)
print(y_test_predict)

In [None]:
from sklearn.metrics import plot_confusion_matrix, plot_det_curve
from sklearn.metrics import plot_precision_recall_curve, plot_roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Confusion matrix and various curves that characterize the classifier on the training set.

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(20, 15))
plot_confusion_matrix(classifier, X_prepared, y_prepared, ax=ax[0, 0])
plot_det_curve(classifier, X_prepared, y_prepared, ax=ax[0, 1])
plot_precision_recall_curve(classifier, X_prepared, y_prepared, ax=ax[1, 0])
plot_roc_curve(classifier, X_prepared, y_prepared, ax=ax[1, 1])
plt.show()

Evaluation metrics on the training set.

In [None]:
print("Accuracy : ", accuracy_score(y_prepared, y_predict))
print("Precision : ", precision_score(y_prepared, y_predict))
print("Recall : ", recall_score(y_prepared, y_predict))
print("F1_score : ", f1_score(y_prepared, y_predict))

Confusion matrix and various curves that characterize the classifier on the test set.

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(20, 15))
plot_confusion_matrix(classifier, X_test_prepared, y_test_prepared, ax=ax[0, 0])
plot_det_curve(classifier, X_test_prepared, y_test_prepared, ax=ax[0, 1])
plot_precision_recall_curve(classifier, X_test_prepared, y_test_prepared, ax=ax[1, 0])
plot_roc_curve(classifier, X_test_prepared, y_test_prepared, ax=ax[1, 1])
plt.show()

Evaluation metrics on the test set.

In [None]:
print("Accuracy : ", accuracy_score(y_test_prepared, y_test_predict))
print("Precision : ", precision_score(y_test_prepared, y_test_predict))
print("Recall : ", recall_score(y_test_prepared, y_test_predict))
print("F1_score : ", f1_score(y_test_prepared, y_test_predict))