The objective of this project is to build two models, Gaussian Naive Bayes and a bagging ensemble, that help predict a stroke from demographic, clinical, and lifestyle features: Gender, Age, Hypertension, Heart Disease, Ever Married, Work Type, Residence Type, Avg. Glucose Level, BMI, and Smoking Status.

The following libraries are used in this project.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score
from sklearn import metrics, model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


Importing the healthcare dataset and renaming its columns.

In [None]:
df = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
                 
df.columns = ["ID", "Gender", "Age", "Hypertension", "Heart Disease", "Ever Married", "Work Type", "Residence Type", "Avg. Glucose Level", "BMI", "Smoking Status", "Stroke"]


Dropping the ID column. It only identifies a row and is not relevant to the analysis being conducted.

In [None]:
#drop ID column
df = df.drop('ID', axis=1)

Here we visualize the dataset to understand the distribution of each feature and of the label.

In [None]:
# histograms of every numeric column (df.hist skips non-numeric columns)
df.hist(figsize = (15, 15))
plt.show()

The Stroke label is highly imbalanced, which can bias the models toward the majority class. We address this issue before fitting the models.

In [None]:
#plots Stroke feature
df['Stroke'].value_counts(dropna = False).plot.bar(color = 'blue')
plt.title('Imbalanced Stroke Feature')
plt.xlabel('zero & one')
plt.ylabel('count')
plt.show()
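
Alongside the bar chart, the imbalance can be expressed as class shares. The sketch below uses a made-up label series (19 zeros and 1 one), not the real Stroke column, to show how `value_counts(normalize=True)` reports proportions.

```python
import pandas as pd

# Made-up labels for illustration: 19 negatives and 1 positive.
labels = pd.Series([0] * 19 + [1], name="Stroke")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```

On the real data, the same two calls on `df['Stroke']` would give the counts and shares behind the plot.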

Identifying null values in our dataset. The BMI column has missing values, so we decided to delete the affected rows. We have enough data that deleting them should not affect our classification significantly.

In [None]:
print(df.isnull().sum())
print(df.count())

#removing null values in BMI column
df.dropna(axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
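
`dropna(axis=0)` removes a row if any column is null. A slightly more targeted variant, sketched here on a made-up frame, restricts the check to the BMI column with `subset`. The toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stroke data; the values are made up.
toy = pd.DataFrame({
    "Age": [67.0, 61.0, 80.0, 49.0],
    "BMI": [36.6, np.nan, 32.5, np.nan],
    "Stroke": [1, 1, 1, 0],
})

# Drop only the rows whose BMI is missing, then renumber the index from 0.
cleaned = toy.dropna(subset=["BMI"]).reset_index(drop=True)

print(cleaned)
print("rows removed:", len(toy) - len(cleaned))
```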


Verifying that the null values were deleted.

In [None]:
print(df.isnull().sum())
print(df.count())

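Transforming categorical data into binary indicator columns with get_dummies. LabelEncoder was not used because it maps categories to integer codes, which suggests an ordering that these features do not have. get_dummies turns a feature with several categories into separate columns, one per category, which increases the size of our feature set. With drop_first=True, the first category of each feature is dropped and acts as the baseline.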

In [None]:

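# map the 0/1 codes to "No"/"Yes" so that get_dummies treats these columns as categorical
# (assigning the result back avoids calling replace(inplace=True) on a column selection,
#  which newer pandas versions warn about or ignore)
df["Hypertension"] = df["Hypertension"].replace({0: "No", 1: "Yes"})
df["Heart Disease"] = df["Heart Disease"].replace({0: "No", 1: "Yes"})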

df2 = df[["Gender","Age","Hypertension","Heart Disease","Ever Married","Work Type","Residence Type","Avg. Glucose Level","BMI", "Smoking Status","Stroke"]]

gender = pd.get_dummies(df2["Gender"], drop_first=True)
hypertension = pd.get_dummies(df2["Hypertension"], drop_first=True, prefix="HT")
heartdisease = pd.get_dummies(df2["Heart Disease"], drop_first=True, prefix="HD")
evermarried = pd.get_dummies(df2["Ever Married"], drop_first=True, prefix="EM")
worktype = pd.get_dummies(df2["Work Type"], drop_first=True)
residence = pd.get_dummies(df2["Residence Type"],drop_first=True)
smoking = pd.get_dummies(df2["Smoking Status"], drop_first=True)

df3 = pd.concat([df2,gender,hypertension,heartdisease,evermarried,worktype,residence,smoking], axis=1, join='outer', ignore_index=False)
print(df3.head(15))
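
To see which category `drop_first=True` removes, here is a small sketch on a made-up Smoking Status column. The categories are sorted before encoding, so the first one in sort order (here "Unknown", because uppercase letters sort before lowercase) becomes the implicit baseline.

```python
import pandas as pd

# Made-up column for illustration; one row per category.
toy = pd.DataFrame({
    "Smoking Status": ["smokes", "never smoked", "Unknown", "formerly smoked"]
})

# One indicator column per category.
full = pd.get_dummies(toy["Smoking Status"])

# Same encoding, but the first category in sort order is dropped.
reduced = pd.get_dummies(toy["Smoking Status"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with all remaining indicators equal to zero therefore represents the dropped category.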


Dropping the original categorical columns now that get_dummies has produced their indicator columns, then selecting and reordering the columns we keep.

In [None]:
df3.drop(["Gender","Hypertension","Heart Disease","Ever Married","Work Type", "Residence Type","Smoking Status"], axis=1, inplace=True)

# select and reorder the columns; reindex drops any column that is not listed here
df4 = df3.reindex(labels=["Age","Male","HT_Yes","HD_Yes","EM_Yes","Never_worked","Private","Self-employed","children","BMI","Urban","Avg. Glucose Level","formerly smoked", "never smoked", "smokes","Stroke"], axis=1)
print(df4.head(15))
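
Since `reindex` neither renames nor validates columns, it is worth knowing how it treats names that do not match. The sketch below uses a made-up frame, and the "Other" and "Urban" columns here are only illustrative: an unlisted column disappears silently, and a listed name that does not exist is created and filled with NaN.

```python
import pandas as pd

# Made-up frame for illustration.
toy = pd.DataFrame({"Male": [1, 0], "Other": [0, 1], "Age": [67.0, 61.0]})

# "Other" is not listed, so it is dropped.
# "Urban" does not exist in toy, so it is added as an all-NaN column.
reordered = toy.reindex(columns=["Age", "Male", "Urban"])

print(reordered)
print(reordered.isnull().sum())
```

Checking `isnull().sum()` after a reindex is a quick way to catch a misspelled column name.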


Defining the feature set and the label set, and verifying the number of rows present in each. The Urban column (Residence Type) is not included in X.

In [None]:

#feature set
X = df4[["Age","Male","HT_Yes","HD_Yes","EM_Yes","Never_worked","Private","Self-employed","children","BMI","Avg. Glucose Level","formerly smoked", "never smoked", "smokes"]]

#label set
y = df4["Stroke"]

print(X.count())
print(y.count())

Splitting our dataset into a training set and a test set.

In [None]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
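
With a rare positive class, a purely random split can leave the two sets with different stroke rates. One option, shown here as a sketch on synthetic data rather than as the method used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 5% positives (values are made up).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo preserves the 5% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```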

Plotting the label distribution of the training set after the train test split. The training set is 70% of the data and the test set is 30%.

In [None]:
y_train.value_counts(dropna = False).plot.bar(color = 'blue')
plt.title('Stroke Feature Training Set')
plt.xlabel('zero & one')
plt.ylabel('count')
plt.show()

Feature scaling to standardize the independent features.
Each feature is rescaled to a mean of 0 and a variance of 1. The scaler is fitted on the training set only and then applied to the test set.

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
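
Standardization replaces each value x with (x - mean) / std, where the mean and standard deviation are taken from the training data. The sketch below, on synthetic numbers, shows that the training set ends up with mean 0 and standard deviation 1, while the test set is only approximately standardized because it reuses the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: two features centred near 100.
rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(200, 2))
test = rng.normal(loc=100.0, scale=20.0, size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train, then scale
test_scaled = scaler.transform(test)        # reuse the training mean/std

print("train mean/std:", train_scaled.mean(axis=0), train_scaled.std(axis=0))
print("test mean/std: ", test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.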


Next we fix the imbalanced training set. First we look at the count of each class.

In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

Oversampling the training set via SMOTE, which generates synthetic samples of the minority class. We did not want to downsample since we would lose important parts of the dataset.

In [None]:
sm = SMOTE(random_state=42)  # fixed seed so the resampling is reproducible
X_train, y_train = sm.fit_resample(X_train, y_train)

print('After OverSampling, the shape of train_X: {}'.format(X_train.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train==0)))
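
The core idea behind SMOTE is to place each synthetic sample on the line segment between a minority point and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that idea, not imblearn's implementation. The name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distance from point i to every minority point
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # indices of the k nearest neighbours, skipping the point itself
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_like_sample(minority)
print(new_points)
```

Because every synthetic point is a blend of two real minority points, it always lies inside the region those points already cover.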


Fitting a Gaussian Naive Bayes model to the resampled training set and scoring it on the test set.

In [None]:
#Gaussian naive bayes model
clf = GaussianNB()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy Score Gaussian = ", accuracy_score(y_test, y_pred))
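
Because the test set keeps its original class imbalance, accuracy alone can look good even for a model that never predicts a stroke. The sketch below uses made-up labels (95 negatives, 5 positives) and a trivial all-zero prediction to show why recall and the confusion matrix are worth reporting too.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up ground truth: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class.
y_always_zero = np.zeros_like(y_true)

print("accuracy:", accuracy_score(y_true, y_always_zero))
print("recall for class 1:", recall_score(y_true, y_always_zero))
print(confusion_matrix(y_true, y_always_zero))
```

The same three calls can be applied to `y_test` and `y_pred` from the cell above.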


Setting up the cross-validation folds and the base classifier for the bagging ensemble.

In [None]:
seed = 42
kfold = model_selection.KFold(n_splits = 3,random_state = seed,shuffle=True)
  
# initialize the base classifier
base_cls = DecisionTreeClassifier()
  
# number of base classifiers:
# the total number of decision trees trained in the ensemble
num_trees = 100

In [None]:
# bagging classifier
# the base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(base_cls,                             # base estimator to fit on random subsets of the dataset
                            n_estimators = num_trees,           # number of base estimators in the ensemble
                            max_samples=50,                     # the number of samples (rows) to draw from X to train each base estimator
                            bootstrap = True,                   # bootstrap = True draws the samples with replacement
                            random_state = seed)

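Obtaining the accuracy score for the bagging method via cross_val_score. The folds are drawn from the SMOTE-resampled training set, so this score may be optimistic compared with performance on the untouched test set.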

In [None]:
results = model_selection.cross_val_score(model, X_train, y_train, cv = kfold)

print("Bagging Accuracy Score:\t", results.mean())
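
For a figure that is comparable with the Naive Bayes result, the bagging model can also be fitted once and scored on held-out data. The sketch below does this on a synthetic dataset from `make_classification`; the sample sizes, the class weights, and the use of 25 trees are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the stroke data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Each tree is trained on 50 rows drawn with replacement from the training set.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=50,
    bootstrap=True,
    random_state=42,
)
bag.fit(X_tr, y_tr)
y_hat = bag.predict(X_te)

print("held-out accuracy:", accuracy_score(y_te, y_hat))
print("held-out recall for class 1:", recall_score(y_te, y_hat))
```

In the notebook itself, the equivalent step would be `model.fit(X_train, y_train)` followed by `model.predict(X_test)`.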