##About dataset

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

Attribute Information

1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient



##Goal
Predict whether a patient will have a stroke.

##Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

##Import dataset

In [None]:
dataset = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

##Exploratory Data Analysis

In [None]:
dataset.head()

In [None]:
dataset.info()

In [None]:
dataset.describe()

In [None]:
dataset.isnull().sum()

Drop the id variable; it is only an identifier and carries no predictive information.

In [None]:
dataset = dataset.drop(columns=['id'])

###Gender

In [None]:
dataset.gender.value_counts()

In [None]:
sns.countplot(data=dataset, x='gender', palette='mako')
plt.show()

Let's remove 'Other', since there is only 1 entry.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are modified later
dataset = dataset[dataset['gender'] != 'Other'].copy()

In [None]:
fig = plt.figure(figsize=(20,17))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.countplot(data=dataset, x='gender', palette='mako', ax=ax[0])
sns.countplot(data=dataset, x='gender', hue='stroke', palette='mako', ax=ax[1])

ax[0].set_title('gender')
ax[1].set_title('gender and stroke')
plt.show()

Even though the data contains more females than males, the proportion of each group that experienced a stroke is roughly equal.
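
A count plot shows absolute numbers, so a proportion claim like the one above is easier to check numerically. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are hypothetical, not taken from the dataset); on the real data the same `groupby` would be applied to `dataset`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Female', 'Female', 'Male', 'Male'],
    'stroke': [1, 0, 0, 0, 1, 0],
})

# stroke is coded 0/1, so the group mean is the share of patients with a stroke
stroke_rate = toy.groupby('gender')['stroke'].mean()
print(stroke_rate)
```

Because the mean of a 0/1 column is a proportion, this works unchanged for any of the categorical features explored in this notebook.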

###age

In [None]:
dataset.age.value_counts()

In [None]:
fig = plt.figure(figsize=(20,10))

ax = [None for _ in range(2)] 

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)

sns.histplot(x='age', hue='stroke', multiple='stack', binwidth=5, data=dataset, ax=ax[0], palette='mako')
sns.histplot(x='age',  binwidth=5, data=dataset[dataset['stroke'] == 1], ax=ax[1], palette='mako')

ax[0].set_title('Age and Stroke/No Stroke')
ax[1].set_title('Age and Stroke')

fig.tight_layout()
plt.show()

This data suggests that older patients are more likely to have a stroke.
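
One way to put a number on this trend is to bin age and compute the stroke rate per bin. The sketch below uses made-up values, and the bin edges and labels are arbitrary assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    'age':    [10, 25, 35, 45, 55, 65, 70, 82],
    'stroke': [0,  0,  0,  0,  1,  0,  1,  1],
})

# Assumed bin edges; pd.cut uses right-closed intervals: (0, 40], (40, 60], (60, 100]
bins = [0, 40, 60, 100]
labels = ['0-40', '41-60', '61+']
toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels)

# Share of patients with a stroke in each age group
rate_by_age = toy.groupby('age_group', observed=True)['stroke'].mean()
print(rate_by_age)
```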

###hypertension

In [None]:
dataset.hypertension.value_counts()

In [None]:
fig = plt.figure(figsize=(20,17))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.countplot(data=dataset, x='hypertension', palette='mako', ax=ax[0])
sns.countplot(data=dataset, x='hypertension', hue='stroke', palette='mako', ax=ax[1])

ax[0].set_title('hypertension')
ax[1].set_title('hypertension and stroke')
plt.show()

Patients with hypertension have a higher occurrence of stroke.

In [None]:
fig, ax= plt.subplots(1, 2, figsize=(16,8))

sns.violinplot(x='hypertension', y='age', hue='stroke', data=dataset, split=True, ax=ax[0], palette='mako')
sns.boxplot(x='hypertension', y='age', hue='stroke', data=dataset, ax=ax[1], palette= 'mako')

ax[0].set_title('Hypertension, age, stroke')
ax[1].set_title('Hypertension, age, stroke')

plt.show()

Patients with hypertension are concentrated at older ages.

###heart_disease

In [None]:
dataset.heart_disease.value_counts()

In [None]:
fig = plt.figure(figsize=(20,17))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.countplot(data=dataset, x='heart_disease', palette='mako', ax=ax[0])
sns.countplot(data=dataset, x='heart_disease', hue='stroke', palette='mako', ax=ax[1])

ax[0].set_title('heart disease')
ax[1].set_title('heart disease and stroke')
plt.show()

In [None]:
fig, ax= plt.subplots(1, 2, figsize=(16,8))

sns.violinplot(x='heart_disease', y='age', hue='stroke', data=dataset, split=True, ax=ax[0], palette='mako')
sns.boxplot(x='heart_disease', y='age', hue='stroke', data=dataset, ax=ax[1], palette= 'mako')

ax[0].set_title('heart_disease, age, stroke')
ax[1].set_title('heart_disease, age, stroke')

plt.show()

Patients with heart disease are concentrated at older ages.


###work_type

In [None]:
dataset.work_type.value_counts()

In [None]:
fig = plt.figure(figsize=(18,15))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.countplot(data= dataset, y='work_type', order= dataset['work_type'].value_counts().index, palette='mako', ax=ax[0] )
sns.countplot(data=dataset, y = 'work_type', hue = 'stroke', order= dataset['work_type'].value_counts().index, palette='mako', ax= ax[1])

ax[0].set_title('work_type')
ax[1].set_title('work_type and stroke')
plt.show()


###Residence_type

In [None]:
dataset.Residence_type.value_counts()

In [None]:
fig = plt.figure(figsize=(20,17))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.countplot(data=dataset, x='Residence_type', palette='mako', ax=ax[0])
sns.countplot(data=dataset, x='Residence_type', hue='stroke', palette='mako', ax=ax[1])

ax[0].set_title('Residence_type')
ax[1].set_title('Residence_type and stroke')
plt.show()

###avg_glucose_level

In [None]:
dataset.avg_glucose_level.value_counts()

In [None]:
fig = plt.figure(figsize=(20,17))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.histplot(data=dataset, x='avg_glucose_level', palette='mako', binwidth=14, ax=ax[0])
sns.histplot(data=dataset, x='avg_glucose_level', palette='mako', hue='stroke', binwidth=14, ax=ax[1])


ax[0].set_title('avg_glucose_level')
ax[1].set_title('avg_glucose_level and stroke')
plt.show()

This data suggests that strokes occur more often in patients with a higher average glucose level.



###bmi

In [None]:
dataset.bmi.value_counts()

In [None]:
fig = plt.figure(figsize=(20,17))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.histplot(data=dataset, x='bmi', palette='mako', binwidth=5, ax=ax[0])
sns.histplot(data=dataset, x='bmi', palette='mako', hue='stroke', binwidth=5, ax=ax[1])


ax[0].set_title('bmi')
ax[1].set_title('bmi and stroke')
plt.show()

This data suggests that strokes occur more often in patients with a higher bmi.

###smoking_status

In [None]:
dataset.smoking_status.value_counts()

In [None]:
fig = plt.figure(figsize=(20,17))

ax= [None for _ in range(2)]

ax[0] = plt.subplot2grid((3,4), (0,0), colspan = 2)
ax[1] = plt.subplot2grid((3,4), (1,0), colspan = 2)


sns.countplot(data=dataset, x='smoking_status', palette='mako', ax=ax[0])
sns.countplot(data=dataset, x='smoking_status', hue='stroke', palette='mako', ax=ax[1])

ax[0].set_title('smoking_status')
ax[1].set_title('smoking_status and stroke')
plt.show()

Patients who formerly smoked or currently smoke had a slightly higher occurrence of stroke than patients who never smoked.

###stroke

In [None]:
sns.countplot(data = dataset, x= 'stroke', palette='mako')
plt.show()

Note that we are dealing with an imbalanced data set.
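
The practical consequence of the imbalance is that plain accuracy can look good without detecting any strokes. The sketch below uses a made-up target with an assumed 19:1 split (not the real class ratio) to show the accuracy a model would reach by always predicting the majority class.

```python
import pandas as pd

# Hypothetical target column: 19 patients without a stroke, 1 with (made-up ratio)
toy_target = pd.Series([0] * 19 + [1], name='stroke')

# Share of each class in the data
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the most frequent class
majority_baseline = class_share.max()
print('Majority-class baseline accuracy:', majority_baseline)
```

Any accuracy reported later is best read against this baseline, computed on the real `stroke` column.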

###correlation

In [None]:
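# Note: this creates an alias, not a copy, so the encoding below also changes `dataset`;
# the data preparation steps later in the notebook rely on those encoded values
dataset_1 = dataset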

In [None]:
from sklearn.preprocessing import LabelEncoder
# Use a lowercase name so the instance does not shadow the LabelEncoder class
label_encoder = LabelEncoder()
# Encode each categorical column as integer codes (same columns as positions 0, 4, 5, 6, 9)
categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
for column in categorical_columns:
    dataset_1[column] = label_encoder.fit_transform(dataset_1[column])


In [None]:
corr_dset = dataset_1.corr()
sns.heatmap(corr_dset)
plt.show()

The lifestyle variables Residence_type, work_type and smoking_status show only weak correlation with stroke.
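
A heatmap can be hard to read for a single target, so it may help to rank the features by the absolute value of their correlation with stroke. This is a minimal sketch on made-up numbers (the frame `toy` is hypothetical). Note that label-encoded nominal features such as work_type get an arbitrary integer order, so their Pearson correlation should be interpreted with caution.

```python
import pandas as pd

# Hypothetical toy data with already-encoded columns (values are made up)
toy = pd.DataFrame({
    'age':            [20, 30, 40, 50, 60, 70],
    'Residence_type': [0, 1, 0, 1, 0, 1],
    'stroke':         [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()['stroke']
    .drop('stroke')
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```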

##Preparing the Data

Drop the Residence_type column; it shows almost no relationship with stroke.

In [None]:
dataset = dataset.drop(columns=['Residence_type'])

###Split data into dependent and independent variables 

In [None]:
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:, -1].values 

###Handling of the missing data

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer.fit(X[:, 7:8])
X[:, 7:8] = imputer.transform(X[:, 7:8])


###Split dataset into test and train sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state= 0)

###Scale data

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# Reuse the mean and standard deviation learned from the training set
X_test = sc.transform(X_test)
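
Scaling parameters are normally learned from the training set only and then reused on the test set, so that both sets share one scale. The sketch below, on made-up one-column arrays, shows how refitting the scaler on the test set produces different values for the same inputs.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature arrays (values are made up)
X_train_toy = np.array([[0.0], [10.0]])
X_test_toy = np.array([[10.0], [20.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns mean 5 and std 5
X_test_scaled = scaler.transform(X_test_toy)        # reuses the training mean and std

# Refitting on the test set uses the test mean (15) instead, shifting the values
X_test_refit = StandardScaler().fit_transform(X_test_toy)

print(X_test_scaled.ravel())
print(X_test_refit.ravel())
```

With the refit, a test value of 10 is mapped to a different number than a training value of 10, which is what the model would then be fed inconsistently.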

##Models with Over Sampling SMOTE

In [None]:
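from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 0)
# fit_resample returns the oversampled training set; fit alone does not resample anything
X_train, Y_train = smote.fit_resample(X_train, Y_train)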

In [None]:
def models(X_train, Y_train):
  #Use LogisticRegression
  from sklearn.linear_model import LogisticRegression
  log = LogisticRegression(random_state = 0)
  log.fit(X_train, Y_train)

  #Use KNeighbors
  from sklearn.neighbors import KNeighborsClassifier
  knn = KNeighborsClassifier(n_neighbors= 5, metric = 'minkowski', p = 2)
  knn.fit(X_train, Y_train)
  
  #Use SVC(Linear kernel)
  from sklearn.svm import SVC
  svc_lin = SVC(kernel= 'linear', random_state =  0)
  svc_lin.fit(X_train, Y_train)

  #Use SVC (RBF kernel); SVC is already imported above
  svc_rbf = SVC(kernel = 'rbf', random_state = 0)
  svc_rbf.fit(X_train, Y_train)

  #use GaussianNB
  from sklearn.naive_bayes import GaussianNB
  gauss =   GaussianNB()
  gauss.fit(X_train, Y_train)

  #Use DecisionTree
  from sklearn.tree import DecisionTreeClassifier
  tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
  tree.fit(X_train, Y_train)


  #Use the RandomForestClassifier
  from sklearn.ensemble import RandomForestClassifier
  forest = RandomForestClassifier(n_estimators = 200, criterion= 'entropy', random_state= 0, class_weight = 'balanced')
  forest.fit(X_train, Y_train)

  #print the training accuracy for each model
  print('[0] Logistic Regression Training Accuracy', log.score(X_train, Y_train))
  print('[1] K Neighbors Training Accuracy', knn.score(X_train, Y_train))
  print('[2] SVC Linear Training Accuracy', svc_lin.score(X_train, Y_train))
  print('[3] SVC RBF Training Accuracy', svc_rbf.score(X_train, Y_train))
  print('[4] Gaussian NB Training Accuracy', gauss.score(X_train, Y_train))
  print('[5] Decision Tree Training Accuracy', tree.score(X_train, Y_train))
  print('[6] Random Forest Training Accuracy', forest.score(X_train, Y_train))

  ##Visualization
  models_data = {'MODEL': ['Logistic Regression','K Neighbors','SVC Linear','SVC RBF', 'Gaussian NB','Decision Tree','Random Forest'], 
                 'ACCURACY':[log.score(X_train, Y_train), knn.score(X_train, Y_train), svc_lin.score(X_train, Y_train), svc_rbf.score(X_train, Y_train), gauss.score(X_train, Y_train), tree.score(X_train, Y_train), forest.score(X_train, Y_train)]}
  models_data_df = pd.DataFrame(data=models_data)
  
  sns.barplot(x='ACCURACY', y='MODEL', data=models_data_df, orient='h', palette='mako')
  plt.title('models with SMOTE')
  plt.show()

  return log, knn, svc_lin, svc_rbf, gauss, tree, forest

In [None]:
model = models(X_train, Y_train)

##Evaluation

In [None]:
from sklearn.metrics import confusion_matrix
for i in range(len(model)):
  cm = confusion_matrix(Y_test, model[i].predict(X_test))
  #extract TN, FP, FN, TP from the matrix computed above
  TN, FP, FN, TP = cm.ravel()
  test_score = (TP + TN)/(TP + TN + FN + FP)
  print(cm)
  print('model[{}] Testing Accuracy = "{}"'.format(i, test_score ))
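
On an imbalanced target, accuracy alone can hide how many strokes are missed, so recall and precision for the positive class are worth reporting next to it. The sketch below uses made-up labels and predictions (hypothetical, not model output from this notebook) and scikit-learn's `recall_score` and `precision_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions: 8 patients without a stroke, 2 with
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN): share of strokes detected
precision = precision_score(y_true, y_pred)  # TP / (TP + FP): share of alarms that were right

print('accuracy:', accuracy)
print('recall:', recall)
print('precision:', precision)
```

Here accuracy is 0.8 even though only half of the actual strokes are detected, which is why recall is often the more informative number for this kind of problem.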

In [None]:
from sklearn.model_selection import cross_val_score
for i in range(len(model)):
  accuracies = cross_val_score(estimator = model[i], X = X_train, y = Y_train, cv = 10)
  print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
  print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Based on the accuracy scores above, Logistic Regression, SVC Linear and SVC RBF are the best candidates.