# **STROKE DATASET EDA & PREDICTION PERFORMANCE**

![](https://www.heart.org/-/media/images/news/2019/october-2019/1017strokeptsd_sc.jpg)

# Overview

A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes.

A stroke is a medical emergency, and prompt treatment is crucial. Early action can reduce brain damage and other complications.

**Stroke Statistics**

- In 2018, 1 in every 6 deaths from cardiovascular disease was due to stroke.
- Someone in the United States has a stroke every 40 seconds. Every 4 minutes, someone dies of stroke.
- Every year, more than 795,000 people in the United States have a stroke. About 610,000 of these are first or new strokes.
- About 185,000 strokes (nearly 1 in 4) are in people who have had a previous stroke.
- About 87% of all strokes are ischemic strokes, in which blood flow to the brain is blocked.
- Stroke-related costs in the United States came to nearly 46 billion dollars between 2014 and 2015. This total includes the cost of health care services, medicines to treat stroke, and missed days of work.
- Stroke is a leading cause of serious long-term disability. Stroke reduces mobility in more than half of stroke survivors age 65 and over.

# How will we proceed?

1. **Understanding the Data**

2. **EDA**

3. **Model Building**

4. **Model Performance**

5. **Inference**


# **UNDERSTANDING THE DATA**

# Including Required Packages 

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**READING THE DATA**

In [None]:
df= pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.head()


In [None]:
df.shape

So the dataset has 12 columns. These include the `id` column and the `stroke` target, so the remaining columns are the features used to predict stroke.

In [None]:
df.info()

**DESCRIPTION OF THE DATASET**

In [None]:
df.describe()

**Let Us Check Whether We Have Any Missing Values**

In [None]:
## Step 1: list the features that contain at least one missing value
features_with_na = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
## Step 2: print each feature name and its percentage of missing values
for feature in features_with_na:
    print(feature, np.round(df[feature].isnull().mean() * 100, 2), '% missing values')

So we have found that missing values are present in the feature 'bmi'. We need to handle them, otherwise they will cause problems when we fit the models.
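
The same check can be written as a small summary table. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the stroke data); it reports both the count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the stroke data
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(2),
})

# Keep only the columns that have at least one missing value
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
print(missing_summary)
```

On the real data, replacing `toy` with `df` should give the equivalent table for every column at once.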

# **EDA**

**Number of Numerical Variables**

In [None]:
numerical_features = [feature for feature in df.columns if df[feature].dtypes != 'O']
len(numerical_features)

In [None]:
numerical_features

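These are the columns stored with a numeric dtype. The remaining columns (`gender`, `ever_married`, `work_type`, `Residence_type` and `smoking_status`) hold text labels and are encoded later in the notebook.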

**NOTICE THAT BMI IS A NUMERICAL VARIABLE AND WE NEED TO REPLACE THE NaN VALUES**

In [None]:
numerical_with_nan = [feature for feature in df.columns if df[feature].isnull().sum() > 0 and df[feature].dtypes != 'O']

## We will print the numerical nan variables and percentage of missing values

for feature in numerical_with_nan:
    print("{}: {}% missing value".format(feature, np.around(df[feature].isnull().mean() * 100, 2)))

In [None]:
for feature in numerical_with_nan:
    ## We will replace missing values with the median, since there are outliers
    median_value = df[feature].median()

    ## Create a new indicator feature that flags which rows had a missing value
    df[feature+'nan'] = np.where(df[feature].isnull(), 1, 0)
    ## Assign the result back instead of calling fillna(inplace=True) on a column selection
    df[feature] = df[feature].fillna(median_value)

df[numerical_with_nan].isnull().sum()
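
To see what the imputation step does, here is a minimal sketch on a hypothetical five-row column (the values are invented for illustration). The median is computed from the observed values only, the indicator column records which rows were missing, and the gaps are then filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values and one outlier (60.0)
toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, 60.0, np.nan]})

# The median ignores NaN by default: median of 20, 30, 60
median_bmi = toy["bmi"].median()

# Flag the missing rows before filling them, so the information is not lost
toy["bminan"] = toy["bmi"].isnull().astype(int)
toy["bmi"] = toy["bmi"].fillna(median_bmi)

print(toy)
```

The median is less affected by the outlier than the mean would be, which is the reason given above for preferring it.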

**We need to know the number of discrete variables, Let us find it out !**

In [None]:
discrete_feature=[feature for feature in numerical_features if len(df[feature].unique())<25]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

In [None]:
discrete_feature

**Now let's deal with the Continuous Variables**

In [None]:
continuous_feature=[feature for feature in numerical_features if feature not in discrete_feature]
print("Continuous feature Count {}".format(len(continuous_feature)))

In [None]:
for feature in continuous_feature:
    data=df.copy()
    data[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
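# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True, ax=ax)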

**Results against the Age**

In [None]:
sns.displot(x='age', hue='stroke', data=df, alpha=0.6)
plt.show()

In [None]:
stroke = df[df['stroke']==1]
sns.displot(stroke.age, kind='kde')
plt.show()

In [None]:
sns.displot(stroke.age, kind='ecdf')
plt.grid(True)
plt.show()

In [None]:
from sklearn import preprocessing

categorical_variables = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

# LabelEncoder maps each distinct text label to an integer code.
label_encoder = preprocessing.LabelEncoder()
for feature in categorical_variables:
    # Refit the encoder on each column so every column gets its own mapping
    df[feature] = label_encoder.fit_transform(df[feature])

df.head()
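
After label encoding, the columns hold integers, so it helps to know which integer stands for which label. The sketch below uses a hypothetical list of labels; `LabelEncoder` assigns codes in sorted order of the distinct labels, and `classes_` exposes that order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for one categorical column
raw_labels = ["Male", "Female", "Other", "Female"]

encoder = LabelEncoder()
codes = encoder.fit_transform(raw_labels)

# classes_ lists the labels in code order, so position i is the label for code i
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(codes.tolist())
print(mapping)
```

Because the loop above reuses one encoder object, only the mapping of the last column remains in `classes_` afterwards; keeping one encoder per column is one way to retain all of them.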

In [None]:
# Take an explicit copy so the age binning below does not modify (or warn about) df
stroke = df[df['stroke']==1].copy()

In [None]:
ranges = [0, 30, 40, 50, 60, 70, np.inf]
labels = ['0-30', '30-40', '40-50', '50-60', '60-70', '70+']

stroke['age'] = pd.cut(stroke['age'], bins=ranges, labels=labels)
stroke['age'].head()
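
A quick way to check how `pd.cut` treats values on a bin edge is to run it on a few hand-picked ages. This sketch reuses the `ranges` and `labels` from above on a hypothetical series; by default each bin includes its right edge and excludes its left edge.

```python
import numpy as np
import pandas as pd

ranges = [0, 30, 40, 50, 60, 70, np.inf]
labels = ['0-30', '30-40', '40-50', '50-60', '60-70', '70+']

# Hypothetical ages, including one exactly on the 30 boundary
toy_ages = pd.Series([25, 30, 30.5, 55, 82])
age_bands = pd.cut(toy_ages, bins=ranges, labels=labels)

print(age_bands.tolist())
```

An age of exactly 30 lands in `0-30`, and an age of exactly 0 would fall outside the first bin unless `include_lowest=True` is passed.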

In [None]:
sns.countplot(x=stroke['age'])

**WE SEE THAT AGES BETWEEN 50-60 ARE THE MOST PRONE TO STROKE**

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
sns.countplot(x='gender', hue='age', data=stroke, ax=ax)



**WE NOTICE THAT MALES HAVE A HIGHER TENDENCY TO HAVE A STROKE**

In [None]:
sns.countplot(x=stroke['age'])

In [None]:
for feature in continuous_feature:
    data=df.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature]=np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, VotingClassifier)

**PREPARING THE DATASET FOR MODEL**

In [None]:
# Create an independent copy so later changes do not modify df
data = df.copy()

In [None]:
data = data.drop(['id'],axis=1)


In [None]:
data.head()

In [None]:
categorical_vars = ['gender','hypertension','heart_disease','ever_married','work_type','Residence_type','smoking_status','bminan']

In [None]:
columns = data.columns
columns

In [None]:
continuous_vars= np.setdiff1d(columns, categorical_vars, assume_unique=False)

In [None]:
continuous_vars = np.setdiff1d(continuous_vars,['stroke'],assume_unique=False)

In [None]:
continuous_vars

In [None]:

scaler = StandardScaler()

# One-hot encode the categorical columns (dtype=int keeps the dummy columns numeric)
data = pd.get_dummies(data, columns=categorical_vars, drop_first=True, dtype=int)

# Standardize the continuous columns.
# Note: the scaler is fit on the full dataset, before the train/test split.
data[continuous_vars] = scaler.fit_transform(data[continuous_vars])

# Define the features and target
X = data.drop(['stroke'], axis=1)
y = data[['stroke']]



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1)
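
The split above is random and unseeded, so the class balance of the test set and the reported scores can change from run to run. As a hedged alternative, the sketch below uses synthetic data (`X_toy`, `y_toy` and the 10% positive rate are assumptions for illustration) to show a stratified, seeded split, with the scaler fit on the training part only so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 rows, 2 features, 10% positives (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_toy = np.array([1] * 10 + [0] * 90)

# stratify keeps the positive rate the same in both parts; random_state fixes the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts
toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(y_tr.sum(), y_te.sum())
```

This is not what the notebook does; it is one common way to make the evaluation reproducible and to keep the test set untouched during preprocessing.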

# **Models**

In [None]:
lr = LogisticRegression(random_state=42)

knn = KNeighborsClassifier()
para_knn = {'n_neighbors':np.arange(1, 50)}

grid_knn = GridSearchCV(knn, param_grid=para_knn, cv=5)

dt = DecisionTreeClassifier()
para_dt = {'criterion':['gini','entropy'],'max_depth':np.arange(1, 50), 'min_samples_leaf':[1,2,4,5,10,20,30,40,80,100]}
grid_dt = GridSearchCV(dt, param_grid=para_dt, cv=5)

rf = RandomForestClassifier()

# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators':[100, 350, 500],
    'min_samples_leaf':[2, 10, 30]
}
grid_rf = GridSearchCV(rf, param_grid=params_rf, cv=5)
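
Creating a `GridSearchCV` object does not run the search; that only happens when `fit` is called. The sketch below shows the full cycle on synthetic data from `make_classification` with a deliberately small grid (the data and the grid values are assumptions chosen to keep it fast, not the notebook's settings).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (not the stroke data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# A much smaller grid than the notebook's, to keep the sketch fast
grid_knn_toy = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=5,
)
grid_knn_toy.fit(X_toy, y_toy)  # the cross-validated search runs here

print(grid_knn_toy.best_params_)  # best setting found in the grid
print(grid_knn_toy.best_score_)   # its mean cross-validated accuracy
```

After fitting, `best_estimator_` holds a model refit on all the data passed to `fit` with the best parameters.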

In [None]:
dt = DecisionTreeClassifier(criterion='gini', max_depth=9, min_samples_leaf=10, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=2, random_state=42)

In [None]:
# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt), ('Random Forest', rf)]

In [None]:
for clf_name, clf in classifiers:

    # Fit clf to the training set (ravel turns the one-column target into a 1-D array)
    clf.fit(X_train, np.ravel(y_train))

    # Predict the labels of the test set
    y_pred = clf.predict(X_test)

    # Calculate accuracy on the test set (true labels first, predictions second)
    accuracy = accuracy_score(y_test, y_pred)

    # Report clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

**IN THE RUN REPORTED IN THE INFERENCE SECTION, RANDOM FOREST SCORES HIGHEST (0.967), WITH LOGISTIC REGRESSION CLOSE BEHIND (0.965)**
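
Accuracy alone can hide how many actual stroke cases a model catches, so it is worth looking at recall, precision and the confusion matrix as well. The sketch below uses hypothetical labels and predictions (ten rows, two positives) rather than the notebook's outputs.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions: 8 negatives, 2 positives
y_true = [0] * 8 + [1] * 2
y_hat = [0] * 8 + [1, 0]   # one of the two positives is missed

acc = accuracy_score(y_true, y_hat)     # share of all rows predicted correctly
rec = recall_score(y_true, y_hat)       # share of true positives that were found
prec = precision_score(y_true, y_hat)   # share of predicted positives that are real
cm = confusion_matrix(y_true, y_hat)    # rows: true class, columns: predicted class

print(acc, rec, prec)
print(cm)
```

Here the accuracy is high even though half of the positive cases are missed, which is why recall is usually the more informative number for a rare outcome.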

In [None]:
# Boost the random forest defined above.
# scikit-learn >= 1.2 names this parameter `estimator`; older versions call it `base_estimator`.
ada = AdaBoostClassifier(estimator=rf, n_estimators=100, random_state=1)

ada.fit(X_train, np.ravel(y_train))

y_pred = ada.predict(X_test)

accuracy_score(y_test, y_pred)

In [None]:
importances = pd.Series(data=rf.feature_importances_,
                        index=X_train.columns)

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
plt.figure(figsize=(10, 10))
importances_sorted.plot(kind='barh', color='orange')
plt.title('Feature Importances')
plt.show()

# NEURAL NETWORK APPROACH

**IMPORTING THE NECESSARY LIBRARIES**

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

In [None]:
X_train.shape

In [None]:

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(X_train.shape[1],)))  # 17 features after encoding
model.add(Dense(1024, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.4))
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.2))
# Sigmoid output so that binary_crossentropy receives probabilities in [0, 1]
model.add(Dense(1, activation="sigmoid"))
model.summary()  # Print model summary

In [None]:
model.compile(loss= "binary_crossentropy" , optimizer="adam", metrics=["accuracy"])

In [None]:
Performance = model.fit(X_train, y_train, validation_split =0.1,epochs=5)

In [None]:
model.evaluate(X_test,y_test)
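
`model.evaluate` returns the loss and accuracy, but not the predicted classes. Assuming the network's last layer outputs probabilities (a sigmoid activation), class labels can be obtained by thresholding. In the sketch below, `probs` is a hypothetical array standing in for the output of `model.predict(X_test)`, and the 0.5 cut-off is a conventional default rather than a tuned value.

```python
import numpy as np

# Hypothetical predicted probabilities, shaped (n_samples, 1) like a Keras output
probs = np.array([[0.02], [0.40], [0.51], [0.97]])

# Flatten to 1-D and label a row positive when its probability reaches the threshold
threshold = 0.5
pred_labels = (probs.ravel() >= threshold).astype(int)

print(pred_labels.tolist())
```

Lowering the threshold trades more false alarms for fewer missed cases, which may be a reasonable choice when the positive class is rare.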

In [None]:
my_dpi = 50 # dots per inch (resolution)
plt.figure(figsize=(400/my_dpi, 400/my_dpi), dpi = my_dpi)
plt.plot(Performance.history['accuracy'], label='train accuracy')
plt.plot(Performance.history['val_accuracy'], label='val accuracy')
plt.legend()
# Save before show(): once the figure has been shown, savefig would write an empty image
plt.savefig('AccVal_acc')
plt.show()

# Inference

The test-set accuracies of the models are:

1. **Logistic Regression : 0.965**
2. **K Nearest Neighbours : 0.963**
3. **Classification Tree : 0.963**
4. **Random Forest : 0.967**
5. **Adaboost Classifier: 0.964**
6. **ANN : 0.964**
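
When all models land within a fraction of a percent of each other, it is worth comparing them with a trivial baseline that always predicts the most frequent class. The sketch below assumes a hypothetical 5% positive rate; the real rate can be checked with `df['stroke'].mean()`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 95 negatives and 5 positives
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((100, 1))  # the dummy model ignores the feature values

# Always predict the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

print(baseline_acc)
```

If the baseline on the real data scores close to the figures above, the accuracies say little about how well the models detect strokes, and recall on the positive class becomes the more useful comparison.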

**So we see that the most important predictor of stroke in this analysis is age, so it is advisable to take proper care of elderly people as much as possible. The following warning signs can help recognize a stroke early:**

- **Trouble speaking and understanding what others are saying.** You may experience confusion, slur your words or have difficulty understanding speech.
- **Paralysis or numbness of the face, arm or leg.** You may develop sudden numbness, weakness or paralysis in your face, arm or leg. This often affects just one side of your body. Try to raise both your arms over your head at the same time. If one arm begins to fall, you may be having a stroke. Also, one side of your mouth may droop when you try to smile.
- **Problems seeing in one or both eyes.** You may suddenly have blurred or blackened vision in one or both eyes, or you may see double.
- **Headache.** A sudden, severe headache, which may be accompanied by vomiting, dizziness or altered consciousness, may indicate that you're having a stroke.
- **Trouble walking.** You may stumble or lose your balance. You may also have sudden dizziness or a loss of coordination.




# THANK YOU! IF YOU LIKE THE NOTEBOOK, PLEASE UPVOTE