# Stroke Prediction


According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient




## Data cleaning

In [None]:
# Load the dataset
import pandas as pd
df = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df.head()

In [None]:
del df['id']
df.info()

In [None]:
print(df.isnull().sum())

The bmi column contains NaN values. We will deal with them later; for now, let's do some exploratory data analysis (EDA).

In [None]:
import matplotlib.pyplot as plt

columns = ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
           'work_type', 'Residence_type', 'avg_glucose_level', 'stroke', 'smoking_status']
continuous = {'age', 'avg_glucose_level'}

fig, axs = plt.subplots(2, 5, figsize=(25, 8))
for ax, col in zip(axs.ravel(), columns):
    if col in continuous:
        # Continuous columns: a histogram is more readable than one bar per value
        ax.hist(df[col], bins=30)
    else:
        # Take labels and heights from the same value_counts() result so they stay
        # aligned (unique() returns values in order of appearance, not by count)
        counts = df[col].value_counts()
        ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(col)
fig.tight_layout()
plt.show()

**We can see that our dataset is imbalanced: stroke cases are a small minority of the patients**

In [None]:
import numpy as np

neg, pos = np.bincount(df['stroke'])
total = neg + pos
print('Examples:\n    Total: {}\n    Stroke: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
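
Why the imbalance matters can be illustrated with a small sketch: a model that always predicts "no stroke" already reaches an accuracy equal to the majority-class share. The helper name `majority_baseline_accuracy` and the label counts below are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a dummy model that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels: 95 negatives and 5 positives (NOT the real dataset counts)
toy_labels = [0] * 95 + [1] * 5
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Any accuracy reported later in the notebook is best read against this kind of baseline, which is one reason recall and F1 are also reported.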

Now we can encode our categorical variables

In [None]:
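from sklearn.preprocessing import LabelEncoder

# Encode only the string-valued columns. Applying LabelEncoder to every column
# would also replace numeric columns (age, avg_glucose_level, bmi) with class
# indices and hide the NaN values in bmi.
categorical_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
df_encoded = df.copy()
df_encoded[categorical_cols] = df_encoded[categorical_cols].apply(
    lambda col: LabelEncoder().fit_transform(col))
df_encoded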

In [None]:
# Scaling
from sklearn.preprocessing import StandardScaler

features = ['gender','age','hypertension','heart_disease','ever_married', 'work_type', 'Residence_type', 'avg_glucose_level','bmi', 'smoking_status']
ft_to_scale = ['age', 'work_type', 'avg_glucose_level', 'bmi', 'smoking_status']
scaler = StandardScaler()
df_encoded[ft_to_scale] = scaler.fit_transform(df_encoded[ft_to_scale])

In [None]:
df_encoded

In [None]:
import seaborn as sns
sns.catplot(y="work_type", hue="stroke", kind="count",
            palette="pastel", edgecolor=".6",
            data=df)

In [None]:
sns.catplot(y="smoking_status", hue="stroke", kind="count",
            palette="pastel", edgecolor=".6",
            data=df)

In [None]:
sns.catplot(y="heart_disease", hue="stroke", kind="count",
            palette="pastel", edgecolor=".6",
            data=df)

The previous plot suggests some association between a history of heart disease and stroke occurrence.
By contrast, smokers do not show a markedly higher stroke incidence in this dataset, probably because of the limited data.
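
Count plots make it hard to compare groups of very different sizes, so a per-group stroke rate can help check impressions like the ones above. The sketch below uses a hypothetical helper, `stroke_rate_by`, and a tiny made-up frame; on the real data it would be called as `stroke_rate_by(df, "heart_disease")` or `stroke_rate_by(df, "smoking_status")`.

```python
import pandas as pd

def stroke_rate_by(frame, column, target="stroke"):
    """Return group size and share of positive targets for each category."""
    grouped = frame.groupby(column)[target]
    return pd.DataFrame({"n": grouped.size(), "stroke_rate": grouped.mean()})

# Hypothetical toy data, NOT taken from the real dataset
toy = pd.DataFrame({
    "heart_disease": [0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":        [0, 0, 0, 1, 0, 1, 1, 1],
})
rates = stroke_rate_by(toy, "heart_disease")
print(rates)
```

Keeping the group size `n` next to the rate makes it easier to see when a rate rests on only a handful of patients.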

Deal with NaN values


In [None]:
column_means = df_encoded.bmi.mean()
df_encoded.bmi = df_encoded.bmi.fillna(column_means)

In [None]:
!pip install heatmapz


In [None]:
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot
plt.figure(figsize=(8, 8))
corrplot(df_encoded.corr(), size_scale=1000);

In [None]:
df_encoded.info()

In [None]:
df_encoded.describe()

Because the dataset is imbalanced, we can oversample the minority class in the training set with the SMOTE technique.

In [None]:
from sklearn.model_selection import train_test_split

X = df_encoded[features]
y = df_encoded['stroke']
# split into a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# shapes of the train and test sets
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

In [None]:
from imblearn.over_sampling import SMOTE
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
  

sm = SMOTE(random_state = 2)
# fit_resample accepts the Series directly (Series.ravel() is deprecated in recent pandas)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
  
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))
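
The core idea behind SMOTE can be sketched in a few lines of NumPy: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration only, not the imblearn implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new points by interpolating minority samples with their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)            # starting samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                 # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority-class points (corners of the unit square)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6)
print(synthetic)
```

Because every synthetic point is built from training samples, oversampling is applied to the training split only, as in the cell above, and the test set keeps its original class balance.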

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from imblearn.over_sampling import RandomOverSampler

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train_res, y_train_res)
Y_pred = log_reg.predict(X_test)
sns.heatmap(confusion_matrix(y_test, Y_pred),annot=True,fmt='d',cmap='Blues')
print(classification_report(y_test, Y_pred))

In [None]:
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train_res, y_train_res)
Y_pred = clf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, Y_pred),annot=True,fmt='d',cmap='Blues')
print(classification_report(y_test, Y_pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
# Instantiate model with 200 decision trees
rf = RandomForestClassifier(n_estimators = 200, random_state = 1, criterion='gini')
# Train the model on training data
rf.fit(X_train_res, y_train_res)
Y_pred = rf.predict(X_test)
sns.heatmap(confusion_matrix(y_test, Y_pred),annot=True,fmt='d',cmap='Blues')
print(classification_report(y_test, Y_pred))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)


**LogisticRegression** gives the best F1 score of the three models, so let's tune it!
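
As a reminder of what the F1 score measures, here is a minimal sketch that derives precision, recall and F1 from confusion-matrix counts. The helper `precision_recall_f1` and the counts are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 70 false positives, 10 false negatives
p, r, f1 = precision_recall_f1(tp=30, fp=70, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

True negatives do not appear in the formula, which is why F1 is less flattered by a large "no stroke" majority than accuracy is.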

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid_models = [(LogisticRegression(),[{'C':[0.25,0.5,0.75,1],'random_state':[0],'solver': ['liblinear','lbfgs']}])]
for i,j in grid_models:
    grid = GridSearchCV(estimator=i,param_grid = j, scoring = 'accuracy')
    grid.fit(X_train_res, y_train_res)
    best_accuracy = grid.best_score_
    best_param = grid.best_params_
    print('{}:\nBest Accuracy : {:.2f}%'.format(i,best_accuracy*100))
    print('Best Parameters : ',best_param)
    print('')
    print('----------------')
    print('')

In [None]:
# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2
# and are not used below, so they are no longer imported
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, classification_report, roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.model_selection import cross_val_score

In [None]:
#Using best fit parameters
classifier = LogisticRegression(C= 0.25, random_state= 0, solver= 'liblinear')
classifier.fit(X_train_res, y_train_res)
y_pred = classifier.predict(X_test)
y_prob = classifier.predict_proba(X_test)[:,1]
cm = confusion_matrix(y_test, y_pred)

print(classification_report(y_test, y_pred))
print(f'ROC AUC score: {roc_auc_score(y_test, y_prob)}')
print('Accuracy Score: ',accuracy_score(y_test, y_pred))
print('F1 Score: ',f1_score(y_test,y_pred))
print('Recall: ', recall_score(y_test,y_pred))
# Visualizing Confusion Matrix
plt.figure(figsize = (8, 5))
sns.heatmap(cm, cmap = 'Blues', annot = True, fmt = 'd', linewidths = 5, cbar = False, annot_kws = {'fontsize': 15}, 
            yticklabels = ['No stroke', 'Stroke'], xticklabels = ['Predicted no stroke', 'Predicted stroke'])
plt.yticks(rotation = 0)
plt.show()

# Roc AUC Curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

sns.set_theme(style = 'white')
plt.figure(figsize = (8, 8))
plt.plot(false_positive_rate,true_positive_rate, color = '#b01717', label = 'AUC = %0.3f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], linestyle = '--', color = '#174ab0')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend()
plt.show()

**81.1% AUC** **75.6% Accuracy** **71.05% Recall** 