 
<h1 style='background-color: #6495ED; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;' > Stroke Prediction  </h1>
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. According to one study, doctors can estimate a patient's risk of ischemic stroke from the severity of their metabolic syndrome, a cluster of conditions that includes high blood pressure, abnormal cholesterol levels, and excess body fat around the abdomen and waist.

<img src="https://guardian.ng/wp-content/uploads/2017/06/brainomixstroke.jpg" width="800px">

<h1 style='background-color:#6495ED ; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;' > Exploratory Data Analysis </h1>

Exploratory Data Analysis (EDA) is the process of examining a dataset to learn its main characteristics. The dataprep.eda package simplifies this step with a small set of APIs. Each API lets the user examine the dataset at a different level of detail, from a whole-table overview down to a single column, and from different angles.


<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcREYfRclj8FnDD3fE9BQNivtsj1XBoFg89AWeGgYRgyldGk_nKvfGe0q-OVDw-8deNpSA4&usqp=CAU" width="800px">




<h1 style='background-color:#6495ED ; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;' > Stroke Prediction with 10 Algorithms </h1>


* 1- Logistic Regression
* 2- K Nearest Neighbor
* 3- Support Vector Machine with linear kernel
* 4- Support Vector Machine with RBF kernel
* 5- Gaussian Naive Bayes
* 6- Decision Tree
* 7- Random Forest Classifier
* 8- XGBoost Classifier
* 9- SGD Classifier
* 10- AdaBoost Classifier






### Attribute Information
* 1) id: unique identifier.
* 2) gender: "Male", "Female" or "Other".
* 3) age: age of the patient.
* 4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension.
* 5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease.
* 6) ever_married: "No" or "Yes".
* 7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed".
* 8) Residence_type: "Rural" or "Urban".
* 9) avg_glucose_level: average glucose level in blood.
* 10) bmi: body mass index.
* 11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" ("Unknown" means the information is unavailable for this patient).
* 12) stroke: 1 if the patient had a stroke or 0 if not.
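
The attribute list above can double as a quick sanity check on the categorical columns. The following is a minimal sketch, not part of the original notebook: `EXPECTED_CATEGORIES` and `unexpected_values` are hypothetical names, the allowed values follow the list above (using the `Govt_job` spelling), and the sample rows are made up for illustration.

```python
import pandas as pd

# Allowed values per categorical column, following the attribute list
# (hypothetical helper constant, not part of the notebook).
EXPECTED_CATEGORIES = {
    "gender": {"Male", "Female", "Other"},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}


def unexpected_values(frame: pd.DataFrame) -> dict:
    """Return {column: values not in the documented list} for columns present in frame."""
    problems = {}
    for column, allowed in EXPECTED_CATEGORIES.items():
        if column not in frame.columns:
            continue  # skip columns that the frame does not contain
        extra = set(frame[column].dropna().unique()) - allowed
        if extra:
            problems[column] = extra
    return problems


# Toy rows, made up for illustration (not taken from the dataset).
sample = pd.DataFrame({
    "gender": ["Male", "Female", "male"],
    "ever_married": ["Yes", "No", "Yes"],
    "smoking_status": ["smokes", "Unknown", "never smoked"],
})
print(unexpected_values(sample))
```

Calling `unexpected_values(df)` right after loading the CSV would flag typos or unexpected labels before any encoding step relies on exact string matches.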



### Dataset Link

##### [Here](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset)


In [None]:
!pip install dataprep

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
from dataprep.eda import create_report, plot, plot_correlation, plot_missing
import plotly.express as px
import plotly.figure_factory as ff
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")

In [None]:
df=pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df

In [None]:
#Drop id 
df.drop(columns=['id'],inplace=True) 

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isna()

In [None]:
df.isnull().sum()

In [None]:
# Check whether any data is missing
plot_missing(df)

In [None]:
# Impute the missing bmi values with the column mean
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
df.info()
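
Mean imputation is easiest to reason about when it is applied one column at a time. The toy frame below is made up for illustration; it sketches how filling only `bmi` leaves gaps in other columns untouched, whereas a frame-wide `fillna` with the `bmi` mean would write that value into every column that has a gap.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset).
toy = pd.DataFrame({
    "bmi": [20.0, np.nan, 30.0],
    "avg_glucose_level": [100.0, 90.0, np.nan],
})

# Fill only the bmi column with its own mean; Series.mean() skips NaN by default.
filled = toy.copy()
filled["bmi"] = filled["bmi"].fillna(filled["bmi"].mean())

# The gap in avg_glucose_level is still there and can be handled separately.
print(filled)
```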

In [None]:
import pandas_profiling as pp
pp.ProfileReport(df)

In [None]:
plot(df)

In [None]:
df

In [None]:
# plots the distribution of column x in various ways and calculates column statistics
plot(df, 'stroke')

In [None]:
plot(df, 'smoking_status')

In [None]:
plot(df, 'bmi')

In [None]:
plot(df, 'heart_disease')

In [None]:
plot_correlation(df)

In [None]:
create_report(df)

In [None]:
plt.subplots(figsize=(6, 8))
sns.countplot(x="stroke", data=df)
plt.show()

In [None]:
plt.figure(figsize=(15,8))
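# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)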

In [None]:
# Convert gender, residence type and marital status into 0's and 1's
# (note: gender "Other" is mapped to 0 together with "Female")
df['gender']=df['gender'].apply(lambda x : 1 if x=='Male' else 0) 
df["Residence_type"] = df["Residence_type"].apply(lambda x: 1 if x=="Urban" else 0)
df["ever_married"] = df["ever_married"].apply(lambda x: 1 if x=="Yes" else 0)

In [None]:
# Removing the observations that have smoking_status type unknown. 
df=df[df['smoking_status']!='Unknown']

In [None]:
df

In [None]:
# One-hot encode smoking_status and work_type
data_dummies = df[['smoking_status','work_type']]
data_dummies=pd.get_dummies(data_dummies)
df.drop(columns=['smoking_status','work_type'],inplace=True)

In [None]:
data_dummies

In [None]:
df

In [None]:
y=df['stroke']
df.drop(columns=['stroke'],inplace=True)
x=df.merge(data_dummies,left_index=True, right_index=True,how='left')
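
The encode, drop and merge steps above can also be written as a single `pd.get_dummies` call with the `columns` argument. This is a sketch of that alternative on a made-up toy frame, not the notebook's own code.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset).
toy = pd.DataFrame({
    "age": [67, 45, 80],
    "smoking_status": ["smokes", "never smoked", "smokes"],
    "stroke": [1, 0, 1],
})

# Separate the target, then one-hot encode the categorical column in place.
target = toy["stroke"]
features = pd.get_dummies(toy.drop(columns=["stroke"]), columns=["smoking_status"])

print(features.columns.tolist())
```

With `columns=[...]`, the untouched columns stay in the frame and the indicator columns are appended, so no index-based merge is needed afterwards.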


In [None]:
#df

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x,y,test_size=0.20,random_state=0)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
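
The split above is purely random. When one class is much rarer than the other, an optional variation is to pass `stratify` so that both partitions keep the same class ratio. The sketch below uses made-up labels (10 positives out of 100) purely for illustration; it is an assumption that stratification helps here, not something the notebook itself does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 10 positive and 90 negative labels.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 10 + [0] * 90)

# stratify=y_demo keeps the 10% positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

print("positives in train:", int(y_tr.sum()), "of", len(y_tr))
print("positives in test:", int(y_te.sum()), "of", len(y_te))
```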

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
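
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. The sketch below reproduces that arithmetic by hand on made-up numbers to show what `fit_transform` and `transform` compute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent: z = (x - mean_train) / std_train.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std().
mean_train = train.mean(axis=0)
std_train = train.std(axis=0)
manual_test = (test - mean_train) / std_train

print(test_scaled, manual_test)
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.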

In [None]:
X_train

In [None]:
def models(X_train,Y_train):
  # Fit ten classifiers on the training set and return them in a fixed order.

  # Logistic Regression
  from sklearn.linear_model import LogisticRegression
  log = LogisticRegression(random_state = 0)
  log.fit(X_train, Y_train)

  # K Nearest Neighbors (Minkowski metric with p = 2, i.e. Euclidean distance)
  from sklearn.neighbors import KNeighborsClassifier
  knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
  knn.fit(X_train, Y_train)

  # Support Vector Machine with a linear kernel
  from sklearn.svm import SVC
  svc_lin = SVC(kernel = 'linear', random_state = 0)
  svc_lin.fit(X_train, Y_train)

  # Support Vector Machine with an RBF kernel
  svc_rbf = SVC(kernel = 'rbf', random_state = 0)
  svc_rbf.fit(X_train, Y_train)

  # Gaussian Naive Bayes
  from sklearn.naive_bayes import GaussianNB
  gauss = GaussianNB()
  gauss.fit(X_train, Y_train)

  # Random Forest (100 trees, entropy split criterion)
  from sklearn.ensemble import RandomForestClassifier
  forest = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
  forest.fit(X_train, Y_train)

  # Decision Tree (entropy split criterion)
  from sklearn.tree import DecisionTreeClassifier
  tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
  tree.fit(X_train, Y_train)

  # XGBoost (gradient-boosted trees)
  from xgboost import XGBClassifier
  xgboost = XGBClassifier(max_depth=5, learning_rate=0.01, n_estimators=100, gamma=0,
                        min_child_weight=1, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.005)
  xgboost.fit(X_train, Y_train)

  # Linear classifier trained with stochastic gradient descent
  from sklearn.linear_model import SGDClassifier
  SGD = SGDClassifier()
  SGD.fit(X_train, Y_train)

  # AdaBoost (2000 boosting rounds)
  from sklearn.ensemble import AdaBoostClassifier
  Ada = AdaBoostClassifier(n_estimators=2000, random_state = 0)
  Ada.fit(X_train, Y_train)

  # Print each model's accuracy on the training data.
  print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train)*100)
  print('[1]K Nearest Neighbor Training Accuracy:', knn.score(X_train, Y_train)*100)
  print('[2]Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, Y_train)*100)
  print('[3]Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, Y_train)*100)
  print('[4]Gaussian Naive Bayes Training Accuracy:', gauss.score(X_train, Y_train)*100)
  print('[5]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train)*100)
  print('[6]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train)*100)
  print('[7]XGBoost Classifier Training Accuracy:', xgboost.score(X_train, Y_train)*100)
  print('[8]SGD Classifier Training Accuracy:', SGD.score(X_train, Y_train)*100)
  print('[9]AdaBoost Classifier Training Accuracy:', Ada.score(X_train, Y_train)*100)

  # The order here defines the model[i] indices used in the evaluation cells.
  return log, knn, svc_lin, svc_rbf, gauss,tree,forest,xgboost,SGD,Ada

model = models(X_train,Y_train)
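
The fit-then-score pattern above repeats the same steps for each classifier, so it can also be expressed as a loop over a dictionary of named estimators. The sketch below is one possible restructuring, not the notebook's method: `fit_all` is a hypothetical helper, only three of the ten classifiers are included to keep it short, and the data comes from `make_classification` rather than the stroke dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def fit_all(estimators, X, y):
    """Fit every estimator in the dict and return {name: fitted estimator}."""
    fitted = {}
    for name, estimator in estimators.items():
        fitted[name] = estimator.fit(X, y)  # fit() returns the estimator itself
    return fitted


# Synthetic data for illustration only (not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "K Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

fitted = fit_all(candidates, X_demo, y_demo)
for name, estimator in fitted.items():
    print(f"{name}: training accuracy {estimator.score(X_demo, y_demo) * 100:.2f}")
```

Keying the models by name avoids having to remember which tuple index belongs to which classifier when the results are inspected later.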

In [None]:
#Show the confusion matrix and accuracy for all of the models on the test data
#Classification accuracy is the ratio of correct predictions to total predictions made.
from sklearn.metrics import confusion_matrix
for i in range(len(model)):
  cm = confusion_matrix(Y_test, model[i].predict(X_test))
  TN = cm[0][0]
  TP = cm[1][1]
  FN = cm[1][0]
  FP = cm[0][1]
  print(cm)
  print('Model[{}] Testing Accuracy = {}'.format(i,  (TP + TN) / (TP + TN + FN + FP)))
  print()# Print a new line

#Show other ways to get the classification accuracy & other metrics 

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

for i in range(len(model)):
  print('Model ',i)
  #Check precision, recall, f1-score
  print( classification_report(Y_test, model[i].predict(X_test)) )
  #Another way to get the models accuracy on the test data
  print( accuracy_score(Y_test, model[i].predict(X_test)))
  print()#Print a new line
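
Accuracy, precision and recall all derive from the same four confusion-matrix counts. The sketch below computes them by hand for the positive class; `binary_metrics` is a hypothetical helper and the counts are made up to show a case where accuracy looks high while recall is zero.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision and recall for the positive class from raw counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up counts for illustration: all 95 negatives are classified correctly,
# but all 5 positives are missed.
print(binary_metrics(tn=95, fp=0, fn=5, tp=0))
```

This is why the per-class precision and recall in `classification_report` are worth reading alongside the single accuracy number.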

In [None]:
#Print Prediction of AdaBoost Classifier model
pred = model[9].predict(X_test)
print(pred)
#Print a space
print()
#Print the actual values
print(Y_test)