# Contents of the Notebook
 
# Part1: Exploratory Data Analysis(EDA):
#### 1)Analysis of the features.

#### 2)Finding any relations or trends considering multiple features.

# Part2: Feature Engineering and Data Cleaning:

#### 1)Converting features into suitable form for modeling.

# Part3: Predictive Modeling
#### 1)Running Basic Algorithms.

#### 2)Cross Validation.

#### 3)Ensembling.


In [None]:
import numpy as np
import pandas as pd 
import matplotlib as mpl 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from cycler import cycler

mpl.rcParams['figure.dpi'] = 120
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

## Data Check

In [None]:
data =pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
data.head()

In [None]:
data.describe(include='all')

In [None]:
data.isnull().sum()

Only 'bmi' has missing values! Let's check it out.

Let's determine the average BMI by gender.

In [None]:
data.groupby('gender')['bmi'].mean()

There is no meaningful difference in the average by gender, so let's fill the missing values with the overall average.

In [None]:
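# Fill only 'bmi' with its own mean; data.mean() on the whole frame
# can fail in recent pandas versions because of the non-numeric columns.
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())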
data.isnull().sum()

No missing values are left.

In [None]:
# Check all the data's dtypes before EDA
data.dtypes

I think the id column doesn't carry much meaning beyond identifying a row, so let's remove it.

In [None]:
data= data.drop(columns='id')

#### Let's divide the features into two categories before we start EDA:

1) **Categorical** : gender, ever_married, work_type, Residence_type, smoking_status

2) **Numerical** : age, hypertension, heart_disease, avg_glucose_level, bmi

+ hypertension & heart_disease have int dtypes, but they only take the values 0 and 1, so they behave like categorical features

# Part1: Exploratory Data Analysis(EDA):

1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient



#### First of all, we will look at the features against the target value ('stroke').


### Gender & Stroke

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)

ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='gender', hue='stroke', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='gender', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()



plt.show()

In [None]:
data_delete = data[data['gender'] == 'Other'].index
data = data.drop(data_delete)

data.groupby(['gender', 'stroke'])['stroke'].count()

- Delete 'Other', since it can act as an outlier for machine learning

- There is not much difference between men and women in the raw counts, but in proportion, males are more likely to have a stroke
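To make the "in proportion" comparison concrete, here is a minimal sketch. It uses a small made-up frame called `toy` (an assumption for illustration, not the real dataset) and relies on the fact that the mean of a 0/1 column is the share of ones.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `data` frame.
toy = pd.DataFrame({
    "gender": ["Male"] * 4 + ["Female"] * 6,
    "stroke": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# `stroke` is 0/1, so its mean within each group is that group's stroke rate.
stroke_rate = toy.groupby("gender")["stroke"].mean()
print(stroke_rate)
```

The same call on the real frame, `data.groupby('gender')['stroke'].mean()`, should give the stroke rate per gender directly.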

### Ever Married & Stroke

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='ever_married', hue='stroke', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='ever_married', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()



plt.show()

There is a greater chance of stroke among people who have been married. It can be a meaningful feature.

### Work Type & Stroke

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='work_type', hue='stroke', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='work_type', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()



plt.show()

Overall, people who work are more likely to have a stroke.

### Residence & Stroke

In [None]:
fig = plt.figure(figsize=(14,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='Residence_type', hue='stroke', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='Residence_type', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()



plt.show()

In [None]:
data.groupby(['Residence_type', 'stroke'])['stroke'].count()

The two groups are too similar to tell apart by eye. I think it is not a very useful feature.

### Smoking & Stroke 

In [None]:
fig = plt.figure(figsize=(16,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='smoking_status', hue='stroke', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='smoking_status', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()



plt.show()

In [None]:
data.groupby(['smoking_status', 'stroke'])['stroke'].count()

My prior expectation was that smoking would have a significant impact on stroke, but the counts don't show a big difference between smokers and non-smokers.

But looking at the percentage within each group, we can see that a stroke might be more likely to occur for people who smoke.
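The percentage view can be sketched with `pd.crosstab` and `normalize='index'`, which turns each row of counts into shares that sum to 1. The frame `toy` below is made up for illustration (an assumption, not the real data).

```python
import pandas as pd

# Hypothetical toy data standing in for the real `data` frame.
toy = pd.DataFrame({
    "smoking_status": ["smokes", "smokes",
                       "never smoked", "never smoked", "never smoked", "never smoked",
                       "formerly smoked", "formerly smoked"],
    "stroke": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Each row is one smoking group; the columns give the share of stroke = 0 and stroke = 1.
share = pd.crosstab(toy["smoking_status"], toy["stroke"], normalize="index")
print(share)
```

On the real frame the equivalent would be `pd.crosstab(data['smoking_status'], data['stroke'], normalize='index')`.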

### Age & Stroke 
#### Now let's check out the numerical features

In [None]:
f, ax = plt.subplots(1, 2, figsize=(20, 10))
xticks = list(range(0, 85, 5))  # shared tick positions for both panels

# Age distribution of patients without a stroke
data.loc[data['stroke'] == 0, 'age'].plot.hist(ax=ax[0], bins=20, edgecolor='black', color='skyblue')
ax[0].set_title('stroke = 0')
ax[0].set_xticks(xticks)

# Age distribution of patients with a stroke
data.loc[data['stroke'] == 1, 'age'].plot.hist(ax=ax[1], bins=20, edgecolor='black', color='red')
ax[1].set_title('stroke = 1')
ax[1].set_xticks(xticks)
plt.show()

### Hypertension & Stroke 

In [None]:
fig = plt.figure(figsize=(16,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='hypertension', hue='stroke', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='hypertension', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()



plt.show()

In [None]:
data.groupby(['hypertension', 'stroke'])['stroke'].count()

### Heart Disease & Stroke 

In [None]:
fig = plt.figure(figsize=(16,11))
gs = fig.add_gridspec(3,4)
sns.set_style("white")
sns.set_context("poster", font_scale = 0.5)


ax_gender_stroke = fig.add_subplot(gs[:2,:2])
sns.countplot(x='heart_disease', hue='stroke', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()

ax_gender_stroke = fig.add_subplot(gs[:2,2:4], sharey=ax_gender_stroke)
sns.countplot(x='stroke', hue='heart_disease', data=data, ax=ax_gender_stroke, palette='coolwarm')
sns.despine()



plt.show()

In [None]:
data.groupby(['heart_disease', 'stroke'])['stroke'].count()

### Glucose_level

In [None]:
sns.set_style("white")
# Newer seaborn expects keyword arguments here, and `fill` replaces the deprecated `shade`
sns.kdeplot(data=data, x='avg_glucose_level', fill=True)
sns.despine()

In [None]:
f, ax = plt.subplots(1, 2, figsize=(20, 10))
xticks = list(range(30, 300, 10))  # shared tick positions for both panels

# Glucose distribution of patients without a stroke
data.loc[data['stroke'] == 0, 'avg_glucose_level'].plot.hist(ax=ax[0], bins=20, edgecolor='black', color='skyblue')
ax[0].set_title('stroke = 0')
ax[0].set_xticks(xticks)

# Glucose distribution of patients with a stroke
data.loc[data['stroke'] == 1, 'avg_glucose_level'].plot.hist(ax=ax[1], bins=20, edgecolor='black', color='red')
ax[1].set_title('stroke = 1')
ax[1].set_xticks(xticks)
plt.show()

The higher your glucose level, the more likely you are to have a stroke!

### BMI

In [None]:
plt.hist('bmi', data=data, histtype='stepfilled',color='skyblue');

In [None]:
f, ax = plt.subplots(1, 2, figsize=(20, 10))
xticks = list(range(0, 70, 5))  # shared tick positions for both panels

# BMI distribution of patients without a stroke
data.loc[data['stroke'] == 0, 'bmi'].plot.hist(ax=ax[0], bins=20, edgecolor='black', color='skyblue')
ax[0].set_title('stroke = 0')
ax[0].set_xticks(xticks)

# BMI distribution of patients with a stroke
data.loc[data['stroke'] == 1, 'bmi'].plot.hist(ax=ax[1], bins=20, edgecolor='black', color='red')
ax[1].set_title('stroke = 1')
ax[1].set_xticks(xticks)
plt.show()

Most people's BMI is between about 20 and 30, and a higher BMI does not seem to mean a higher chance of stroke.

# Part2: Feature Engineering and Data Cleaning:


#### First divide the columns into **Categorical Features** and **Numerical Features**

#### I am going to use dummy variables for the categorical features and StandardScaler for the numerical features

In [None]:
# Categorical Features

cat_columns = [c for c, t in zip(data.dtypes.index, data.dtypes) if t == 'O']

data = pd.get_dummies(data = data, columns = cat_columns)
data = pd.get_dummies(data = data, columns = ['hypertension'])
data = pd.get_dummies(data = data, columns = ['heart_disease'])

data.columns

In [None]:
# Numerical Features

num_columns = [c for c, t in zip(data.dtypes.index, data.dtypes) if t == 'float64']
num_columns

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[num_columns] = scaler.fit_transform(data[num_columns])

data[num_columns]

In [None]:
data.head()

Now the feature engineering is done! Next we are going to split the data into train, validation, and test sets and move on to modeling!

## Train-Valid-Test split

We have gained some insights from the EDA part, but with those alone we cannot accurately predict whether a stroke will occur. So now we will predict it using some great classification algorithms. The following are the algorithms I will use to build the models:

1)Logistic Regression

2)Support Vector Machine

3)Random Forest

4)XGBoost

5)LightGBM

6)K-Nearest Neighbours

In [None]:
x = data.drop('stroke', axis=1).values
y = data['stroke'].values

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.4, random_state=2020,shuffle=True)
x_valid, x_test, y_valid, y_test = train_test_split(x_test, y_test, test_size=0.5, random_state=2020,shuffle=True)

# Part3: Predictive Modeling
## 1)Running Basic Algorithms.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import f1_score

In [None]:
# 1. LogisticRegression

lr = LogisticRegression()

lr.fit(x_train, y_train)

y_pred = lr.predict(x_valid)

print(f"Logistic Regression F1 Score: {f1_score(y_valid, y_pred, average='micro')}")

In [None]:
# 2. Support Vector Machine

svc = SVC(probability=True)

svc.fit(x_train, y_train)

y_pred = svc.predict(x_valid)

print(f"Support Vector Machine F1 Score: {f1_score(y_valid, y_pred, average='micro')}")

In [None]:
# 3. Random Forest

rf = RandomForestClassifier()

rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)

print(f"RandomForest F1 Score: {f1_score(y_valid, y_pred, average='micro')}")

In [None]:
# 4. XGBoost

xgb = XGBClassifier()

xgb.fit(x_train, y_train)

y_pred = xgb.predict(x_valid)

print(f"XGBoost F1 Score: {f1_score(y_valid, y_pred, average='micro')}")

In [None]:
# 5. LightGBM

lgb = LGBMClassifier()

lgb.fit(x_train, y_train)

y_pred = lgb.predict(x_valid)

print(f"LightGBM F1 Score: {f1_score(y_valid, y_pred, average='micro')}")

In [None]:
# 6. KNeighborsClassifier 

knn = KNeighborsClassifier()

knn.fit(x_train, y_train)

y_pred = knn.predict(x_valid)

print(f"KNeighborsClassifier F1 Score: {f1_score(y_valid, y_pred, average='micro')}")

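#### Wow! We get a score of about 96%!
#### The score was better than I thought. I think I did a great job with feature engineering.

Note: with `average='micro'`, the F1 score of a single-label classification problem is the same number as plain accuracy, so the scores printed above are accuracies.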
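One thing worth double-checking before trusting a score like this is how it behaves when the positive class is rare. The sketch below uses made-up labels (5 positives out of 100, an assumption and not the real data) and a "model" that always predicts 0, then compares accuracy, micro-averaged F1, and F1 for the positive class only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 5 positives out of 100 (an assumed imbalance, not the real data).
y_true = np.array([1] * 5 + [0] * 95)

# A "model" that ignores its input and always predicts the majority class.
y_pred_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_majority)
f1_micro = f1_score(y_true, y_pred_majority, average="micro")
# F1 for the positive class only; zero_division=0 avoids a warning
# because no positive is ever predicted.
f1_pos = f1_score(y_true, y_pred_majority, average="binary", zero_division=0)

print(f"accuracy={acc:.2f}  micro-F1={f1_micro:.2f}  positive-class F1={f1_pos:.2f}")
```

If the real validation set is similarly imbalanced, it may be worth also printing `f1_score(y_valid, y_pred)` with the default `average='binary'`, or a confusion matrix, to see how many actual strokes each model catches.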

## 2) Cross Validation.

In [None]:
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
kfold = KFold(n_splits=10, random_state=2020, shuffle=True) # k=10, split the data into 10 equal parts
xyz=[]
accuracy=[]
std=[]
classifiers=['Logistic Regression',
             'SVC',
             'Random Forest',
             'XGB',
             'LGBM',
             'KNeighbors']

models=[LogisticRegression(),
        SVC(),
        RandomForestClassifier(),
        XGBClassifier(),
        LGBMClassifier(),
        KNeighborsClassifier()]

for model in models:
    cv_result = cross_val_score(model, x, y, cv=kfold, scoring="accuracy")
    xyz.append(cv_result.mean())   # mean accuracy over the 10 folds
    std.append(cv_result.std())    # spread of the fold accuracies
    accuracy.append(cv_result)     # raw per-fold scores, used for the box plot
new_models_dataframe2=pd.DataFrame({'CV Mean':xyz,'Std':std},index=classifiers)       
new_models_dataframe2

In [None]:
plt.subplots(figsize=(18,10))
box=pd.DataFrame(accuracy,index=[classifiers])
box.T.boxplot();

In [None]:
new_models_dataframe2['CV Mean'].plot.barh(width=0.8)
plt.title('Average CV Mean Accuracy')
fig=plt.gcf()
fig.set_size_inches(8,5)
plt.show()

## 3) Ensembling

Ensembling is a good way to increase the accuracy or performance of a model. In simple words, it is the combination of various simple models to create a single powerful model.

Let's say we want to buy a phone and ask many people about it based on various parameters. We can then make a strong judgement about a single product after analysing all the different parameters. This is **Ensembling**, which improves the stability of the model. Ensembling can be done in several ways, such as voting, bagging, and boosting.

I am going to use the **Boosting** ensembling method.

### Boosting

Boosting is an ensembling technique that trains classifiers sequentially. It is a step-by-step enhancement of a weak model. Boosting works as follows:

A model is first trained on the complete dataset. The model will get some instances right and some wrong. In the next iteration, the learner focuses more on the wrongly predicted instances by giving them more weight, and so tries to predict them correctly. This iterative process continues, and new classifiers are added to the model until the limit on the accuracy is reached.
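The reweighting loop described above can be sketched by hand. This is a minimal illustration of discrete AdaBoost with decision stumps on synthetic data, not what scikit-learn does internally line for line. The data, the number of rounds, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))                        # synthetic features (assumption)
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)   # labels in {-1, +1}

n_rounds = 20
w = np.full(len(y_toy), 1 / len(y_toy))   # start with equal weight on every sample
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree trained on the current sample weights.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_toy, y_toy, sample_weight=w)
    pred = stump.predict(X_toy)

    # Weighted error of this learner, clipped to keep the log finite.
    err = np.clip(w[pred != y_toy].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # how much say this learner gets

    # Misclassified samples (y * pred == -1) gain weight, correct ones lose weight.
    w = w * np.exp(-alpha * y_toy * pred)
    w = w / w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps.
votes = sum(a * s.predict(X_toy) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(votes >= 0, 1, -1)
train_acc = (ensemble_pred == y_toy).mean()
print(f"training accuracy of the boosted stumps: {train_acc:.3f}")
```

Each stump on its own can only cut along one axis, so the interesting part is that the weighted vote of many such cuts approximates the diagonal boundary.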

#### AdaBoost(Adaptive Boosting)

The weak learner or estimator in this case is a Decision Tree. But we can change the default base_estimator to any algorithm of our choice that supports sample weights. (Newer scikit-learn releases call this parameter `estimator`.)
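As a rough sketch of swapping the weak learner, the example below passes a slightly deeper tree to `AdaBoostClassifier`. The data comes from `make_classification` and the names `X_demo`, `y_demo`, and `ada_demo` are made up for illustration. The weak learner is passed positionally on purpose, since the keyword is `base_estimator` in older scikit-learn releases and `estimator` in newer ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# First positional argument is the weak learner; here a depth-2 tree
# instead of the default depth-1 stump.
ada_demo = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=50,
    learning_rate=0.1,
    random_state=0,
)
ada_demo.fit(X_demo, y_demo)
demo_score = ada_demo.score(X_demo, y_demo)
print(f"training accuracy: {demo_score:.3f}")
```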

In [None]:
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(n_estimators=200, # 200 means that 200 weak learners are chained together and run in sequence
                       random_state=0,
                       learning_rate=0.1)
result=cross_val_score(ada,x,y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost is:',result.mean())

## In conclusion, an accuracy of **95%** was obtained.

### Feel free to give any comments about my notebook!

### Also, if my notebook was helpful, please give me an upvote !!!!!