# STROKE PREDICTION

A stroke (sometimes loosely called "paralysis") is a sudden interruption or reduction of blood flow to the brain. Deprived of oxygen and nutrients, brain cells are damaged and begin to die rapidly.

According to the World Health Organization, 15 million people have a stroke each year. Of these, 5 million die and 5 million are permanently disabled, making stroke the second most common cause of death and a major cause of disability.

If a stroke patient reaches the hospital within the first four and a half hours, thrombolytic (clot-dissolving) treatment can save one out of every 3 to 9 patients, depending on how early they are admitted. In practice, many patients do not arrive in time, so they die or become permanently disabled. It is therefore very important to be able to predict the risk of stroke.

In [None]:
!pip install ycimpute

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from ycimpute.imputer import knnimput

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
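try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    # plot_confusion_matrix was removed in scikit-learn 1.2; fall back to the
    # equivalent ConfusionMatrixDisplay.from_estimator(estimator, X, y).
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator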


from scipy.stats import friedmanchisquare
from statsmodels.stats.contingency_tables import mcnemar

# DATA CHECK

* **id:** unique identifier
* **gender:** "Male", "Female" or "Other"
* **age:** age of the patient
* **hypertension:** 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* **heart_disease:** 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* **ever_married:** "No" or "Yes"
* **work_type:** "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* **Residence_type:** "Rural" or "Urban"
* **avg_glucose_level:** average glucose level in blood
* **bmi:** body mass index
* **smoking_status:** "formerly smoked", "never smoked", "smokes" or "Unknown"
* **stroke:** 1 if the patient had a stroke or 0 if not
* **Note: "Unknown" in smoking_status means that the information is unavailable for this patient**

In [None]:
data = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
data.head(10)

In [None]:
data.info()

* **Categorical:** gender, ever_married, work_type, Residence_type, smoking_status

* **Numerical:** age, hypertension, heart_disease, avg_glucose_level, bmi

hypertension and heart_disease have an integer dtype, but they only take the values 0 and 1, so they are effectively categorical.
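The point about integer columns that are really categorical can be checked programmatically. The following is a minimal sketch on a made-up frame (the `toy` values are illustrative, not taken from the dataset): it finds columns that only contain 0 and 1 and casts them to the `category` dtype.

```python
import pandas as pd

# Toy frame standing in for the stroke data (values are illustrative only).
toy = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 23.0],
    "hypertension": [0, 1, 0, 0],
    "heart_disease": [1, 0, 0, 0],
})

# A column whose distinct values are a subset of {0, 1} is a binary flag,
# even if pandas stores it with an integer dtype.
binary_cols = [
    col for col in toy.columns
    if set(toy[col].dropna().unique()) <= {0, 1}
]

# Casting the flags to "category" makes the intent explicit.
toy = toy.astype({col: "category" for col in binary_cols})

print(binary_cols)
print(toy.dtypes)
```

Whether to cast is a design choice: most scikit-learn estimators accept the 0/1 integers as they are, so the cast mainly helps during exploration.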

In [None]:
data.describe(include='all').T

In [None]:
stroke = data["stroke"]
stroke

In [None]:
stroke.describe()

# Exploratory Data Analysis (EDA)

First, we look at how each feature relates to the target variable ('stroke').

In [None]:
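# correlation matrix (numeric columns only; recent pandas versions raise on string columns)
corrmat = data.select_dtypes(include="number").corr()
corrmat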

In [None]:
plt.subplots(figsize=(10,7))
sns.heatmap(corrmat, vmax=1,cmap="GnBu", square=True)

In [None]:
# Correlation of each numeric column with the target
corr_stroke = data.select_dtypes(include="number").corrwith(stroke, axis=0)
corr_stroke = pd.DataFrame(corr_stroke)
corr_stroke.rename(columns={0: 'stroke'}, inplace=True)

# Visualize the resulting correlation matrix
plt.subplots(figsize=(10,7))
sns.set(font_scale=1.1)
sns.heatmap(corr_stroke, vmax=1, cmap="GnBu", fmt='.4f', annot=True);

**Gender & Stroke**

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)

gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='gender', hue='stroke', data=data, ax=gender_stroke, palette="Set2")
sns.despine()

gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=gender_stroke)
sns.countplot(x='stroke', hue='gender', data=data, ax=gender_stroke, palette="Set2")
sns.despine()

plt.show()

In [None]:
data.groupby('gender')["stroke"].count()

In [None]:
data.groupby(['gender', 'stroke'])['stroke'].count()

**Worktype & Stroke**

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='work_type', hue='stroke', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='work_type', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()


plt.show()

The graphs above show very few, if any, strokes among children. The number of strokes in the Private and Self-employed groups is similar, while government employees (Govt_job) appear less likely to have had a stroke than either group. Perhaps this can be explained by differences in the pressure felt by the workers, although raw counts alone cannot confirm it.
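Count plots compare absolute numbers, which can mislead when the groups differ in size. A minimal sketch of comparing stroke rates per group instead, using a made-up `toy` frame (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy data: one row per patient (illustrative values only).
toy = pd.DataFrame({
    "work_type": ["Private", "Private", "Private", "Private",
                  "Govt_job", "Govt_job", "children", "children"],
    "stroke":    [1, 0, 0, 0, 0, 0, 0, 0],
})

# Because stroke is coded 0/1, its mean within a group is the stroke rate.
summary = toy.groupby("work_type")["stroke"].agg(
    n="count",       # group size
    strokes="sum",   # number of stroke cases
    rate="mean",     # share of the group that had a stroke
)
print(summary)
```

The same pattern should work on the real frame by replacing `toy` with `data` and `work_type` with any categorical column.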

**Residence & Stroke**

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='Residence_type', hue='stroke', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='Residence_type', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()


plt.show()

In [None]:
data.groupby(['Residence_type', 'stroke'])['stroke'].count()

In this dataset, people living in rural areas appear less prone to stroke than urban residents. Air pollution in cities may be one contributing factor, but the data here cannot confirm that.

**Ever Married & Stroke**

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='ever_married', hue='stroke', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='ever_married', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

plt.show()

In [None]:
data.groupby(['ever_married', 'stroke'])['stroke'].count()

Stroke is more common among people who have ever been married, so this may be a meaningful feature. Marital status is closely tied to age, however, which may account for part of the effect.

**Smoking & Stroke**

In [None]:
fig = plt.figure(figsize=(16,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='smoking_status', hue='stroke', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='smoking_status', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

plt.show()

In [None]:
data['smoking_status'].value_counts()

In [None]:
data.groupby(['smoking_status', 'stroke'])['stroke'].count()

The correlation between stroke and smoking status appears to be low, since the stroke rate is similar across the different smoking categories.
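One way to make a statement like this more precise is a chi-square test of independence on the contingency table. The sketch below uses made-up counts (the `counts` table is hypothetical, not the real dataset) and `scipy.stats.chi2_contingency`; this test is an addition for illustration and is not part of the original analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are smoking categories,
# columns are the number of patients without / with a stroke.
counts = pd.DataFrame(
    {"no_stroke": [90, 85, 88], "stroke": [10, 15, 12]},
    index=["never smoked", "formerly smoked", "smokes"],
)

# H0: smoking status and stroke are independent.
chi2, p_value, dof, expected = chi2_contingency(counts)

print("chi2 = %.3f, p = %.3f, dof = %d" % (chi2, p_value, dof))
if p_value > 0.05:
    print("No evidence of an association (fail to reject H0)")
else:
    print("Smoking status and stroke appear associated (reject H0)")
```

On the real data the table could be built with `pd.crosstab(data['smoking_status'], data['stroke'])`.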

**Age & Stroke**

In [None]:
f,ax = plt.subplots(1,2, figsize=(20,10));

data.loc[data['stroke'] ==0]['age'].plot.hist(ax=ax[0], bins=20, edgecolor='black', color='lightsteelblue');
ax[0].set_title('stroke = 0');
ax1 = list(range(0, 85, 5));
ax[0].set_xticks(ax1);

data[data['stroke']==1]['age'].plot.hist(ax=ax[1], color='salmon', bins=20, edgecolor='black');
ax[1].set_title('stroke=1');
x2=list(range(0, 85, 5));
ax[1].set_xticks(x2);
plt.show();

The histograms show that the risk of stroke increases with age.
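The relationship between age and stroke can also be summarized numerically by grouping ages into bands and computing the stroke rate in each. A minimal sketch with made-up values (the `toy` frame and the bin edges are assumptions chosen for illustration):

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "age":    [5, 15, 25, 35, 45, 55, 65, 75, 78, 81],
    "stroke": [0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Right-inclusive age bands: (0, 20], (20, 40], (40, 60], (60, 85].
bins = [0, 20, 40, 60, 85]
toy["age_band"] = pd.cut(toy["age"], bins=bins)

# Mean of a 0/1 column is the stroke rate within each band.
rate_by_band = toy.groupby("age_band", observed=False)["stroke"].mean()
print(rate_by_band)
```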

**Hypertension & Stroke**

In [None]:
fig = plt.figure(figsize=(16,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='hypertension', hue='stroke', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='hypertension', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()


plt.show()

In [None]:
data['hypertension'].value_counts()

In [None]:
data.groupby(['hypertension', 'stroke'])['stroke'].count()

**Heart Disease & Stroke**

In [None]:
fig = plt.figure(figsize=(16,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)

ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='heart_disease', hue='stroke', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='heart_disease', data=data, ax=ax_gender_stroke, palette='Set2')
sns.despine()

plt.show()

In [None]:
data['heart_disease'].value_counts()

In [None]:
data.groupby(['heart_disease', 'stroke'])['stroke'].count()

Both hypertension and heart disease were found to be correlated with stroke.

**Glucose_level**

In [None]:
# Pass the column by keyword; `fill` replaces the deprecated `shade` argument.
sns.kdeplot(data=data, x='avg_glucose_level', fill=True, color="salmon")
sns.set_style("white")
sns.despine()

In [None]:
f,ax = plt.subplots(1,2, figsize=(20,10))

data.loc[data['stroke'] ==0]['avg_glucose_level'].plot.hist(ax=ax[0], bins=20, edgecolor='black', color='lightsteelblue')
ax[0].set_title('stroke = 0')
ax1 = list(range(30, 300, 10))
ax[0].set_xticks(ax1)

data.loc[data['stroke']==1]['avg_glucose_level'].plot.hist(ax=ax[1], color='salmon', bins=20, edgecolor='black')
ax[1].set_title('stroke=1')
x2= list(range(30, 300, 10))
ax[1].set_xticks(x2)
plt.show()

**BMI**

In [None]:
f,ax = plt.subplots(1,2, figsize=(15,7))

data.loc[data['stroke'] ==0]['bmi'].plot.hist(ax=ax[0], bins=20, edgecolor='black', color='lightsteelblue')
ax[0].set_title('stroke = 0')
ax1 = list(range(0, 70, 5))
ax[0].set_xticks(ax1)

data.loc[data['stroke']==1]['bmi'].plot.hist(ax=ax[1], color='salmon', bins=20, edgecolor='black')
ax[1].set_title('stroke=1')
x2= list(range(0, 70, 5))
ax[1].set_xticks(x2)
plt.show()

# DATA CLEANING AND EXAMINATION OF MISSING DATA

In [None]:
data = data.drop("id", axis="columns")
data.head()

In [None]:
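# drop() returns a new frame, so the result must be assigned back.
data_delete = data[data['gender'] == 'Other'].index
data = data.drop(data_delete).reset_index(drop=True)
# Keep the target series aligned with the remaining rows.
stroke = data["stroke"]
data['gender'].value_counts()

Rows with the "Other" value in the gender column were dropped as outliers.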

In [None]:
smoking_status = data['smoking_status']
smoking_status

In [None]:
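# Treat "Unknown" as a missing value; a vectorized replace avoids
# modifying a slice of the frame element by element.
data['smoking_status'] = data['smoking_status'].replace('Unknown', np.nan)
smoking_status = data['smoking_status']

smoking_status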

In [None]:
# Number of missing values per column
total = data.isnull().sum().sort_values(ascending=False)

missing_data = pd.concat([total], axis=1, keys=['Total'])
missing_data.head(12)

In [None]:
ever_married = data["ever_married"]
ever_married

In [None]:
# Encode ever_married as 1 ("Yes") or 0 (anything else) in one vectorized step.
data["ever_married"] = (data["ever_married"] == "Yes").astype(int)

In [None]:
data["ever_married"].value_counts()

In [None]:
# Number of missing values per column
total = data.isnull().sum().sort_values(ascending=False)

missing_data = pd.concat([total], axis=1, keys=['Total'])
missing_data.head(12)

Body mass index (BMI) is a measure of a person's weight in proportion to their height. It is obtained by dividing the person's weight in kilograms by the square of their height in meters.

BMI = body weight (kg) / (height(m) x height(m))
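As a quick worked example of the formula, here is a small helper (the function name `bmi` and the sample values are hypothetical, chosen only for illustration):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return body mass index: weight in kg divided by height in m squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# A person weighing 70 kg with a height of 1.75 m:
print(round(bmi(70, 1.75), 1))  # 22.9
```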

We will use KNN to fill in the missing values in the BMI column. First, we create dummy (one-hot encoded) variables for the categorical columns.
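To illustrate what KNN imputation does, here is a minimal sketch on a made-up frame. Note that it uses scikit-learn's `KNNImputer` rather than the `ycimpute` package used in this notebook, so details such as the distance metric and neighbor weighting may differ; the `toy` values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: the last row is missing its bmi (illustrative values only).
toy = pd.DataFrame({
    "age":               [25.0, 27.0, 60.0, 62.0, 26.0],
    "avg_glucose_level": [85.0, 88.0, 150.0, 155.0, 86.0],
    "bmi":               [22.0, 23.0, 31.0, 32.0, np.nan],
})

# Each missing value is replaced by the mean of that feature over the
# n_neighbors rows that are closest on the features that are present.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Here the incomplete row is closest to the two young, low-glucose rows, so its bmi is filled from them rather than from the older patients.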

In [None]:
data_copy = data.copy()

# Columns without a numeric dtype are treated as categorical.
categorical_cols = data_copy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the categorical columns as floats so that the array
# passed to the imputer is purely numeric.
data_dummy = pd.get_dummies(data_copy[categorical_cols], dtype=float)
data_copy = data_copy.drop(categorical_cols, axis='columns')
data_dummy = pd.concat([data_copy, data_dummy], axis=1)

data_dummy.head()

In [None]:
var_names = list(data_dummy)
array_data = np.array(data_dummy)
data_dummy = knnimput.KNN(k = 4).complete(array_data)
data_dummy = pd.DataFrame(data_dummy, columns = var_names)

In [None]:
data_dummy.head(10)

In [None]:
# Confirm that no missing values remain after imputation
total = data_dummy.isnull().sum().sort_values(ascending=False)

missing_data = pd.concat([total], axis=1, keys=['Total'])
missing_data.head(12)

# MODEL

In [None]:
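# Exclude the target from the feature matrix; leaving it in would leak
# the label to every model.
X = data_dummy.drop(columns="stroke")
y = data_dummy["stroke"].astype(int)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
X_train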

**RANDOM FOREST**

In [None]:
rf_model = RandomForestClassifier()

rf_params = {"max_depth": [2,5],
             "max_features": [2,3],
             "n_estimators": [2,5,10,15],
             "min_samples_split": [2,3]}

rf_cv_model = GridSearchCV(rf_model, rf_params, cv=10, n_jobs=-1, verbose=2).fit(X_train,y_train)

In [None]:
print("Best Parameters: " + str(rf_cv_model.best_params_))

In [None]:
data_grid_rf = pd.DataFrame(rf_cv_model.cv_results_)
data_grid_rf

In [None]:
rf_best_model=rf_cv_model.best_estimator_

#Best model score

rf_crossVal = rf_cv_model.best_score_
rf_crossVal

In [None]:
rf_best_model.fit(X_train, y_train)

In [None]:
rf_best_model.score(X_test, y_test)

In [None]:
y_pred_rf = rf_best_model.predict(X_test)
y_pred_rf

**SUPPORT VECTOR MACHINE**

In [None]:
svc_model = SVC()

svc_params = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-6],'C': [2, 10,25]}]

svc_cv_model = GridSearchCV(svc_model, svc_params, cv=10, n_jobs=-1, verbose=2).fit(X_train,y_train)


In [None]:
print("Best parameters: " + str(svc_cv_model.best_params_))

In [None]:
data_grid_svc = pd.DataFrame(svc_cv_model.cv_results_)
data_grid_svc

In [None]:
svc_best_model = svc_cv_model.best_estimator_

svc_crossVal = svc_cv_model.best_score_
svc_crossVal

In [None]:
svc_best_model.fit(X_train, y_train)

In [None]:
svc_best_model.score(X_test, y_test)

In [None]:
y_pred_svc = svc_best_model.predict(X_test)
y_pred_svc

**LOGISTIC REGRESSION**

In [None]:
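# liblinear supports both the l1 and l2 penalties searched below
# (the default lbfgs solver does not support l1).
lg_model = LogisticRegression(solver="liblinear")

# Seven values of C spaced logarithmically from 1e-3 to 1e3.
lg_params = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}

lg_cv_model = GridSearchCV(lg_model, lg_params, cv=10, n_jobs=-1, verbose=2).fit(X_train,y_train)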

In [None]:
print("Best Parameters: " + str(lg_cv_model.best_params_))

In [None]:
data_grid_lg = pd.DataFrame(lg_cv_model.cv_results_)
data_grid_lg

In [None]:
lg_best_model = lg_cv_model.best_estimator_

lg_crossVal = lg_cv_model.best_score_
lg_crossVal

In [None]:
lg_best_model.fit(X_train, y_train)

In [None]:
lg_best_model.score(X_test, y_test)

In [None]:
y_pred_lg = lg_best_model.predict(X_test)
y_pred_lg

**BEST MODEL SELECT**

In [None]:
models = [rf_best_model,svc_best_model,lg_best_model]

for model in models:
    name = model.__class__.__name__
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("-"*40)
    print(name + ":" )
    print("Test Accuracy: {:.4%}".format(accuracy))
    plot_confusion_matrix(model,X_test,y_test)
    plt.show()
    print(classification_report(y_test,y_pred))

In [None]:
seeds=np.arange(10**4)
np.random.shuffle(seeds)
seeds

In [None]:
seeds=seeds[:35]
seeds

In [None]:
accuracy_rf = []
accuracy_svc = []
accuracy_lg = []

for i in seeds:
    # Re-split with a new seed; all four outputs must be updated together,
    # otherwise the scores compare new predictions with stale labels.
    X_train,X_test,y_train,y_test=train_test_split(X,y,
                                                  test_size=0.30,
                                                  random_state=i,
                                                  stratify=y)
    rf_best_model.fit(X_train,y_train)
    rf_i_acc = rf_best_model.score(X_test,y_test)
    accuracy_rf.append(rf_i_acc)

    svc_best_model.fit(X_train,y_train)
    svc_i_acc = svc_best_model.score(X_test,y_test)
    accuracy_svc.append(svc_i_acc)

    # Use the tuned logistic regression, like the other two models.
    lg_best_model.fit(X_train,y_train)
    lg_acc = lg_best_model.score(X_test,y_test)
    accuracy_lg.append(lg_acc)

In [None]:
accuracy_rf

In [None]:
accuracy_svc

In [None]:
accuracy_lg

In [None]:
d = {'rf_accuracy': accuracy_rf, 'svc_accuracy': accuracy_svc,'lg_accuracy': accuracy_lg}
accuracies = pd.DataFrame(data=d)
accuracies

In [None]:
accuracies.describe().T

In [None]:
sns.boxplot(data = accuracies, orient="h", palette="Set2")

In [None]:
stat, p = friedmanchisquare(accuracy_rf, accuracy_svc ,accuracy_lg)
print('Statistics = %.3f, p=%.3f' % (stat, p))

alpha = 0.05
if p > alpha:
    print('No significant difference between the models (fail to reject H0)')
else:
    print('At least one model differs significantly (reject H0)')

The Friedman test indicates a significant difference between the algorithms. Random forest was chosen as the best model because it has the highest mean accuracy across the repeated splits.
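The selection logic (a Friedman test across repeated splits, then picking the model with the highest mean accuracy) can be sketched on synthetic numbers. The accuracies below are randomly generated for illustration and the names `model_a`, `model_b` and `model_c` are hypothetical; they are not results from this notebook.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Synthetic accuracies for three models over 35 repeated splits.
rng = np.random.default_rng(0)
accuracies = {
    "model_a": 0.95 + rng.normal(0, 0.003, size=35),
    "model_b": 0.93 + rng.normal(0, 0.003, size=35),
    "model_c": 0.94 + rng.normal(0, 0.003, size=35),
}

# H0: all models have the same accuracy distribution across the splits.
stat, p = friedmanchisquare(*accuracies.values())
print("Statistics = %.3f, p = %.3f" % (stat, p))

# If H0 is rejected, pick the model with the highest mean accuracy.
means = {name: float(np.mean(acc)) for name, acc in accuracies.items()}
best = max(means, key=means.get)
print("Highest mean accuracy:", best)
```

The Friedman test only says that at least one model differs; a pairwise post-hoc comparison would be needed to say which pairs differ.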

**FINAL MODEL**

In [None]:
final_model = RandomForestClassifier(max_depth = 5, max_features = 3, n_estimators = 2)
final_tuned = final_model.fit(X_train,y_train)

y_pred_test = final_tuned.predict(X_test)
y_pred_test

In [None]:
print('Training Accuracy: ',accuracy_score(y_train,final_model.predict(X_train)))
print('Test Accuracy: ',accuracy_score(y_test,final_model.predict(X_test)))

**FEATURE SELECTION BY IMPORTANCE**

In [None]:
rf_Importance = pd.DataFrame({"Importance":final_tuned.feature_importances_*100},index = X_train.columns)

s = rf_Importance.sort_values(by = "Importance", axis=0, ascending = False)
s


In [None]:
new_train_columns = []

for index, row in rf_Importance.iterrows():
    if(row["Importance"] >= 0.06):
        new_train_columns.append(index)
    
new_train = X_train[new_train_columns]

X_train = new_train
X_test = X_test[new_train_columns]
rf_model = RandomForestClassifier(max_depth = 5, max_features = 3, n_estimators = 2)
rf_tuned = rf_model.fit(X_train,y_train)

y_pred = rf_tuned.predict(X_test)
y_pred

In [None]:
print('Training Accuracy: ',accuracy_score(y_train,rf_model.predict(X_train)))
print('Test Accuracy: ',accuracy_score(y_test,rf_model.predict(X_test)))
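The manual threshold loop above can also be expressed with scikit-learn's `SelectFromModel`. The following is a minimal sketch on synthetic data generated with `make_classification`, using the mean importance as the cut-off; the data, the threshold and the forest settings are assumptions for illustration and differ from the 0.06 cut-off used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic classification data: 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X_demo)
mask = selector.get_support()  # boolean mask of the kept features

print("Kept %d of %d features" % (mask.sum(), X_demo.shape[1]))
```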