# Stroke Prediction with ensemble
 - Random Forest Classifier : Accuracy - 0.90 
 - Gradient Boosting Classifier : Accuracy - 0.89

**Context**

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

**Attribute Information**

- 1) id: unique identifier
- 2) gender: "Male", "Female" or "Other"
- 3) age: age of the patient
- 4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- 5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- 6) ever_married: "No" or "Yes"
- 7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- 8) Residence_type: "Rural" or "Urban"
- 9) avg_glucose_level: average glucose level in blood
- 10) bmi: body mass index
- 11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- 12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient



---
### Contents(index)
```
Step 1. Data Load & EDA
Step 2. Feature Engineering
     2-a. Binary Features
     2-b. Continuous Features
     2-c. Categorical Features
Step 3. Train / Test set Split & Upsampling
Step 4. Modeling & Prediction
```

### Step 1. Data Load & EDA

In [None]:
import pandas as pd

In [None]:
ls

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df.nunique()

In [None]:
df.gender.unique()

In [None]:
df.work_type.unique()

In [None]:
df.smoking_status.unique()

In [None]:
df.bmi

In [None]:
df.avg_glucose_level

---
- ID : delete
- Age : transform to Category ( 20s, 30s, 40s ...)
- Gender : transform by One-hot Encoding
- work_type : transform by One-hot Encoding
- smoking_status : transform by One-hot Encoding
- bmi : transform to Category ( 20-29, 30-39 ...)
- avg_glucose_level : transform to Category ( 0-50, 51-100 ...)
- Others : binary

---
After encoding, the table will be sparse, so we use the RandomForest and GradientBoosting algorithms instead of KNN.

In [None]:
df

### Step 2. Feature Engineering

In [None]:
df.gender.value_counts()

In [None]:
df[df['gender'] == 'Other']

In [None]:
df[df['gender'] == 'Other'].index

In [None]:
df.drop(df[df['gender'] == 'Other'].index, inplace=True)

In [None]:
df

### 2-a. Feature Engineering - Binary Features

1. gender : 0 or 1

In [None]:
df['gender'] = df['gender'].apply(lambda x : 0 if x == 'Female' else 1)

2. ever_married : 0 or 1

In [None]:
df['ever_married'].apply(lambda x : 0 if x == 'No' else 1).unique()

In [None]:
df['ever_married'] = df['ever_married'].apply(lambda x : 0 if x == 'No' else 1)

3. Residence_type

In [None]:
df['Residence_type'].value_counts()

In [None]:
df['Residence_type'] = df['Residence_type'].apply(lambda x : 0 if x == 'Rural' else 1)

In [None]:
df
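
The three binary encodings above can also be written with explicit mappings. This is a minimal sketch on a toy frame (the values are illustrative, and `binary_maps` is a name introduced here, not part of the original notebook). Unlike the `else 1` lambdas, `Series.map` turns any unexpected label into `NaN` instead of silently encoding it as 1.

```python
import pandas as pd

# Toy frame with the same three binary columns as the dataset (illustrative values).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# One explicit mapping per column (hypothetical helper dict).
binary_maps = {
    "gender": {"Female": 0, "Male": 1},
    "ever_married": {"No": 0, "Yes": 1},
    "Residence_type": {"Rural": 0, "Urban": 1},
}

# Labels missing from a mapping would become NaN, which makes them easy to spot.
for col, mapping in binary_maps.items():
    toy[col] = toy[col].map(mapping)

print(toy)
```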

### 2-b. Feature Engineering - Continuous Features

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings("ignore")

1. age

In [None]:
sns.distplot(df['age']);

In [None]:
def age_classifier(age):
    if age < 20 :
        return 'age_under 19'
    elif age < 40 :
        return 'age_20 to 39'
    elif age < 60:
        return 'age_40 to 59'
    else:
        return 'age_over 60'

In [None]:
df['age'].apply(lambda x : age_classifier(x))

In [None]:
df['age'] = df['age'].apply(lambda x : age_classifier(x))

In [None]:
df['age'].unique()

In [None]:
df['age'].value_counts()
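
The same age grouping can be sketched with `pd.cut`, which avoids the row-by-row `apply`. The sample ages below are made up for illustration. `right=False` makes each bin include its lower edge and exclude its upper edge, which matches the `<` comparisons in `age_classifier`.

```python
import numpy as np
import pandas as pd

# Illustrative ages, chosen to sit on the bin edges.
ages = pd.Series([0.5, 19.0, 20.0, 39.9, 40.0, 59.0, 60.0, 82.0])

bins = [0, 20, 40, 60, np.inf]
labels = ["age_under 19", "age_20 to 39", "age_40 to 59", "age_over 60"]

# right=False -> intervals [0, 20), [20, 40), [40, 60), [60, inf)
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group)
```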

2. avg_glucose_level

In [None]:
sns.distplot(df['avg_glucose_level']);

In [None]:
df['avg_glucose_level'].describe()

___
Let's split at the 25%, 50%, and 75% quantiles.

In [None]:
def glucose_level_classifier(level):
    if level < 77:
        return 'gl_under 25%'
    elif level < 91:
        return 'gl_26% to 50%'
    elif level < 114:
        return 'gl_50% to 75%'
    else:
        return 'gl_over 75%'

In [None]:
df['avg_glucose_level'].apply(lambda x : glucose_level_classifier(x))

In [None]:
# Check that the four bins are roughly evenly populated
df['avg_glucose_level'].apply(lambda x : glucose_level_classifier(x)).value_counts()

In [None]:
df['avg_glucose_level'] = df['avg_glucose_level'].apply(lambda x : glucose_level_classifier(x))
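
Instead of reading the quartile thresholds off `describe()` and hardcoding them, the split can be derived from the data with `pd.qcut`. This is a minimal sketch on made-up readings, reusing the notebook's label names. Note that `qcut` bins include their upper edge, so values exactly on a threshold may land in a different bin than with the `<` comparisons above.

```python
import pandas as pd

# Illustrative glucose readings, not taken from the dataset.
glucose = pd.Series([55.0, 70.0, 80.0, 90.0, 100.0, 120.0, 180.0, 250.0])

labels = ["gl_under 25%", "gl_26% to 50%", "gl_50% to 75%", "gl_over 75%"]

# q=4 cuts at the 25%, 50% and 75% quantiles of the data itself.
binned = pd.qcut(glucose, q=4, labels=labels)
print(binned.value_counts())
```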

3. bmi

In [None]:
df['bmi'].describe()

---
Let's classify 'bmi' the same as 'avg_glucose_level' (distribution %)

In [None]:
def bmi_classifier(figure):
    if figure < 23:
        return 'bmi_under 25%'
    elif figure < 28:
        return 'bmi_26% to 50%'
    elif figure < 33:
        return 'bmi_50% to 75%'
    else:
        return 'bmi_over 75%'

---
!! But 'bmi' has 201 null values

In [None]:
df.info()

In [None]:
df[df.bmi.isnull()]

We will fill the null values with 28, which is close to both the mean and the median (50% line) of bmi.

In [None]:
sns.distplot(df['bmi']);

In [None]:
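# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['bmi'] = df['bmi'].fillna(28)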

In [None]:
df.bmi.isnull().sum()

No null values remain.
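
The fill value can also be computed rather than typed in by hand. This is a small sketch on an illustrative series; using the median here is an assumption made for the example, since the notebook fills with the constant 28.

```python
import numpy as np
import pandas as pd

# Illustrative BMI values with two missing entries.
bmi = pd.Series([22.0, 27.5, np.nan, 31.0, 45.0, np.nan])

# median() skips NaN by default, and it is less sensitive to outliers than the mean.
fill_value = bmi.median()
filled = bmi.fillna(fill_value)

print(fill_value, filled.isnull().sum())
```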

In [None]:
df['bmi'].apply(lambda x : bmi_classifier(x)).value_counts()

In [None]:
df['bmi'] = df['bmi'].apply(lambda x : bmi_classifier(x))

In [None]:
df

### 2-c. Feature Engineering - Categorical Features

In this part, we apply One-Hot Encoding to all categorical features, including the ones we transformed before (age, avg_glucose_level, bmi).

In [None]:
df.info()

In [None]:
columns = df.columns

In [None]:
columns

In [None]:
df[columns[0]].dtype

In [None]:
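num_cols = []
cat_cols = []
for col in columns:
    # is_integer_dtype matches any integer width, so the split is the same on every platform
    if pd.api.types.is_integer_dtype(df[col]):
        num_cols.append(col)
    else:
        cat_cols.append(col)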

In [None]:
print('numeric columns : {}'.format(num_cols))
print('categorical columns : {}'.format(cat_cols))

1. age

In [None]:
from sklearn.preprocessing import LabelBinarizer 

lb = LabelBinarizer()
X_encoded = lb.fit_transform(df['age']) 
X_encoded

In [None]:
df['age']

---
- columns : [age_20 to 39 ,age_40 to 59, age_over 60, age_under 19]

In [None]:
df['age'].unique()

In [None]:
pd.DataFrame(X_encoded, columns=['age_20 to 39' ,'age_40 to 59', 'age_over 60', 'age_under 19'])

In [None]:
age_df = pd.DataFrame(X_encoded, columns=['age_20 to 39' ,'age_40 to 59', 'age_over 60', 'age_under 19'])
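
The column order used above follows the sorted class labels. Rather than typing the names by hand, they can be read from the fitted binarizer's `classes_` attribute. This is a minimal sketch on illustrative labels; `lb_demo` and `age_demo_df` are names introduced for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Illustrative labels in arbitrary order.
ages = pd.Series(["age_under 19", "age_over 60", "age_20 to 39",
                  "age_40 to 59", "age_over 60"])

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(ages)

# classes_ holds the labels in the same (sorted) order as the encoded columns.
age_demo_df = pd.DataFrame(encoded, columns=lb_demo.classes_)
print(age_demo_df)
```

One caveat: with exactly two classes, `LabelBinarizer` returns a single column, so this pattern only applies to features with three or more categories.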

In [None]:
cat_cols

2. work_type

In [None]:
one_hot_encoded = lb.fit_transform(df['work_type'])

In [None]:
one_hot_encoded[-20:]

In [None]:
df['work_type'].tail(20)

In [None]:
df['work_type'].unique()

---
mapping
- columns : ['Govt_job','Never_worked', 'Private', 'Self-employed', 'Children']

In [None]:
pd.DataFrame(one_hot_encoded, columns=['Govt_job','Never_worked', 'Private', 'Self-employed', 'Children'])

In [None]:
work_type_df = pd.DataFrame(one_hot_encoded, columns=['Govt_job','Never_worked', 'Private', 'Self-employed', 'Children'])

3. avg_glucose_level 

In [None]:
df['avg_glucose_level']

In [None]:
one_hot_encoded = lb.fit_transform(df['avg_glucose_level'])

In [None]:
one_hot_encoded

In [None]:
df['avg_glucose_level'].unique()

mapping
- columns : ['gl_26% to 50%', 'gl_50% to 75%', 'gl_over 75%', 'gl_under 25%']

In [None]:
pd.DataFrame(one_hot_encoded, columns=['gl_26% to 50%', 'gl_50% to 75%', 'gl_over 75%', 'gl_under 25%'])

In [None]:
agl_df = pd.DataFrame(one_hot_encoded, columns=['gl_26% to 50%', 'gl_50% to 75%', 'gl_over 75%', 'gl_under 25%'])

4. bmi

In [None]:
df['bmi']

In [None]:
one_hot_encoded = lb.fit_transform(df['bmi'])
one_hot_encoded

In [None]:
df['bmi'].unique()

---
mapping

- columns : ['bmi_26% to 50%', 'bmi_50% to 75%', 'bmi_over 75%', 'bmi_under 25%']

In [None]:
pd.DataFrame(one_hot_encoded, columns=['bmi_26% to 50%', 'bmi_50% to 75%', 'bmi_over 75%', 'bmi_under 25%'])

In [None]:
bmi_df = pd.DataFrame(one_hot_encoded, columns=['bmi_26% to 50%', 'bmi_50% to 75%', 'bmi_over 75%', 'bmi_under 25%'])

5. smoking_status

In [None]:
df['smoking_status']

In [None]:
df['smoking_status'].unique()

In [None]:
lb.fit_transform(df['smoking_status'])

In [None]:
one_hot_encoded = lb.fit_transform(df['smoking_status'])

mapping
- columns : ['Unknown','formerly smoked','never smoked','smokes']

In [None]:
pd.DataFrame(one_hot_encoded, columns=['Unknown','formerly smoked','never smoked','smokes'])

In [None]:
smoked_df = pd.DataFrame(one_hot_encoded, columns=['Unknown','formerly smoked','never smoked','smokes'])

In [None]:
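# 'age' is now a string column, so it is normally already in cat_cols; add it only if missing
if 'age' not in cat_cols:
    cat_cols.append('age')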

In [None]:
cat_cols

Finally, merge!

In [None]:
df

In [None]:
df.drop(cat_cols,axis=1,inplace=True)
df

In [None]:
# delete meaningless column (for machine learning)
df.drop('id', axis=1, inplace=True)

In [None]:
df

In [None]:
cat_cols

In [None]:
df.reset_index(drop=True)

In [None]:
df = df.reset_index(drop=True)

In [None]:
df

In [None]:
pd.concat([df,age_df,agl_df,work_type_df,bmi_df,smoked_df], axis=1)

In [None]:
pd.concat([df,age_df,agl_df,work_type_df,bmi_df,smoked_df], axis=1).isnull().sum()

In [None]:
final_df = pd.concat([df,age_df,agl_df,work_type_df,bmi_df,smoked_df], axis=1)
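
Resetting the index before the merge matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses toy frames (illustrative values) to show what happens when one frame has a gap left by a dropped row and the other has a fresh index.

```python
import pandas as pd

# 'left' mimics df after a row was dropped: label 1 is missing.
left = pd.DataFrame({"stroke": [0, 1, 0]}, index=[0, 2, 3])
# 'right' mimics an encoded frame built from a NumPy array: labels 0..2.
right = pd.DataFrame({"age_over 60": [1, 0, 1]})

# Without a reset, the labels do not match and NaN rows appear.
misaligned = pd.concat([left, right], axis=1)

# After reset_index(drop=True), both frames share labels 0..2.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.isnull().sum().sum(), aligned.isnull().sum().sum())
```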

### Step 3. Train, Test set split & Upsampling

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = final_df.drop('stroke', axis=1)
y = final_df['stroke']

In [None]:
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=111)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()
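
With so few positive cases, a plain random split can leave the test set with a different stroke rate than the training set. One option, shown here as a sketch on synthetic data rather than as the notebook's method, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 5 positives out of 100 rows (illustrative ratio).
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 5 + [0] * 95, name="stroke")

# stratify=y_demo preserves the 5% positive rate in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=111, stratify=y_demo
)

print(y_tr.value_counts(), y_te.value_counts())
```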

In [None]:
plt.bar(x = y_train.value_counts().index, height = y_train.value_counts().values);

The data is imbalanced, so we upsample class '1' in the **training set** only. The test set is left untouched.

In [None]:
from sklearn.utils import resample

In [None]:
train_df = pd.concat([X_train,y_train], axis=1)

In [None]:
train_df

In [None]:
train_0 = train_df[train_df['stroke']==0]
train_1 = train_df[train_df['stroke']==1]


In [None]:
train_0.shape, train_1.shape

In [None]:
# Draw positives with replacement until they match the number of negatives
upsampled_train_1 = resample(train_1,
                             replace=True,
                             n_samples=len(train_0),
                             random_state=123
                            )
upsampled_train_1.shape

In [None]:
upsampled_train = pd.concat([train_0, upsampled_train_1])

In [None]:
upsampled_train['stroke'].value_counts()

In [None]:
X_train = upsampled_train.drop('stroke',axis=1)
y_train = upsampled_train['stroke']

### Step 4. Modeling & Prediction

1. RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
print('Train Accuracy : {:.2f}'.format(rfc.score(X_train, y_train)))
print('Test Accuracy : {:.2f}'.format(rfc.score(X_test, y_test)))
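
Because the test set keeps its original class imbalance, accuracy alone can look good even for a model that never predicts a stroke. The sketch below illustrates this on synthetic labels with an assumed 95/5 split (not the dataset's exact ratio), using scikit-learn's `DummyClassifier` as a baseline and recall as a second metric.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 95 negatives, 5 positives (illustrative ratio).
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((100, 1))  # features are ignored by the dummy model

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)  # share of correct predictions
rec = recall_score(y_true, y_pred)    # share of true strokes that were found

print(acc, rec)
```

Comparing a model's test accuracy against such a baseline, and checking recall on the positive class, gives a clearer picture than accuracy by itself.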

2. GradientBoostingClassifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

best_params = {}
score = 0
for i in range(1,8):
    for j in [50, 100, 150, 200, 250, 300, 350, 400]:
        gbc = GradientBoostingClassifier(max_depth=i,
                                         n_estimators=j
                                        )
        gbc.fit(X_train, y_train)
        # Score the test set once and reuse the value below
        test_score = gbc.score(X_test, y_test)

        print('max_depth : {}'.format(i))
        print('n_estimators : {}'.format(j))
        print('Train Score : {}'.format(gbc.score(X_train,y_train)))
        print('Test Score : {}'.format(test_score))
        print('----------------------------------------------------')
        # Caution: choosing parameters by test-set score makes the final score optimistic
        if test_score > score:
            score = test_score
            best_params['max_depth'] = i
            best_params['n_estimators'] = j

In [None]:
best_params

In [None]:
score
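
The loop above picks the parameters that score best on the test set, so the reported score is tuned to that same test set. A common alternative is cross-validated search on the training data only. The sketch below uses `GridSearchCV` on synthetic data with a deliberately small grid. It illustrates the technique and is not the original notebook's procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid; the notebook's grid is larger.
param_grid = {"max_depth": [1, 2], "n_estimators": [20, 40]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

If this were applied to the upsampled training set, duplicated rows could land in both the fitting and validation folds, so upsampling inside each fold would be the safer arrangement.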