# Stroke Prediction


### Context
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

### Attribute Information
1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

### Acknowledgements
(Confidential Source) - Use only for educational purposes

This Data Belongs to Respective Owners.

## Exploring Data

In [1]:
import pandas as pd
df = pd.read_csv('Stroke data.csv')
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


`df.info()` shows that `bmi` has 201 null values (4909 non-null out of 5110 rows). Because this is only about 4% of the rows, we drop them here, although dropping rows is generally not the best way to handle missing values.

In [3]:
df = df.dropna()

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4909 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4909 non-null   int64  
 1   gender             4909 non-null   object 
 2   age                4909 non-null   float64
 3   hypertension       4909 non-null   int64  
 4   heart_disease      4909 non-null   int64  
 5   ever_married       4909 non-null   object 
 6   work_type          4909 non-null   object 
 7   Residence_type     4909 non-null   object 
 8   avg_glucose_level  4909 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     4909 non-null   object 
 11  stroke             4909 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 498.6+ KB


After dropping those rows, no column contains null values.
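As an alternative to dropping rows, missing `bmi` values could be filled in. The sketch below uses scikit-learn's `SimpleImputer` with the median on a tiny made-up frame; the frame `demo` and its values are illustrative assumptions, not the stroke data, and this is not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame standing in for the stroke data (values are illustrative only).
demo = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Replace missing bmi values with the column median instead of dropping rows.
imputer = SimpleImputer(strategy="median")
demo_imputed = demo.copy()
demo_imputed[["bmi"]] = imputer.fit_transform(demo[["bmi"]])

print(demo_imputed)
```

In a real workflow the imputer would normally be fitted on the training split only and then applied to the test split.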

In [5]:
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5104,14180,Female,13.0,0,0,No,children,Rural,103.08,18.6,Unknown,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [6]:
(df['smoking_status'] == 'Unknown').value_counts()

False    3426
True     1483
Name: smoking_status, dtype: int64

Here, `1483` of the `4909` remaining values in `smoking_status` are `Unknown`.
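Calling `value_counts()` on the column itself gives the count for every category at once, and `normalize=True` turns the counts into fractions. A small sketch on a made-up series (the values below are assumptions for illustration, not the real column):

```python
import pandas as pd

# Made-up values standing in for df['smoking_status'].
smoking = pd.Series(
    ["never smoked", "Unknown", "smokes", "Unknown", "formerly smoked", "never smoked"],
    name="smoking_status",
)

counts = smoking.value_counts()                # rows per category
shares = smoking.value_counts(normalize=True)  # fraction of rows per category

print(counts)
print(shares.round(2))
```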

### Identifying the Inputs and Targets.

In [7]:
X = df.columns[1:-1]
y = 'stroke'

In [8]:
inputs = df[X]
targets = df[y]

In [9]:
inputs

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked
5,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked
...,...,...,...,...,...,...,...,...,...,...
5104,Female,13.0,0,0,No,children,Rural,103.08,18.6,Unknown
5106,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked
5107,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked
5108,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked


In [10]:
targets

0       1
2       1
3       1
4       1
5       1
       ..
5104    0
5106    0
5107    0
5108    0
5109    0
Name: stroke, Length: 4909, dtype: int64

# Train Test Split

We use `train_test_split` to hold out 20% of the data for testing and keep the remaining 80% for training.

In [11]:
from sklearn.model_selection import train_test_split
train_inputs, test_inputs, train_targets, test_targets = train_test_split(inputs, targets, test_size=0.2, random_state=42)

In [12]:
train_inputs

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
3565,Female,40.0,0,0,Yes,Private,Urban,65.77,31.2,never smoked
898,Female,59.0,0,0,Yes,Self-employed,Urban,81.64,32.8,Unknown
2707,Female,57.0,0,0,Yes,Private,Urban,217.40,36.6,never smoked
4198,Male,81.0,0,0,Yes,Self-employed,Urban,71.18,23.9,formerly smoked
2746,Male,65.0,0,0,Yes,Self-employed,Urban,95.88,28.5,never smoked
...,...,...,...,...,...,...,...,...,...,...
4613,Female,19.0,0,0,No,Private,Urban,89.30,22.1,never smoked
511,Female,51.0,0,0,Yes,Private,Rural,82.93,29.7,smokes
3247,Female,53.0,0,0,Yes,Private,Rural,90.65,22.1,formerly smoked
3946,Female,11.0,0,0,No,children,Rural,93.51,20.8,Unknown


In [13]:
train_targets

3565    0
898     0
2707    0
4198    0
2746    0
       ..
4613    0
511     0
3247    0
3946    0
916     0
Name: stroke, Length: 3927, dtype: int64
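To see how the 80/20 split divides 4909 rows, here is a small sketch on stand-in arrays of the same length. The names `rows` and `labels` are placeholders introduced for illustration; only the sizes are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the cleaned DataFrame (4909).
rows = np.arange(4909)
labels = np.zeros(4909, dtype=int)

train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=0.2, random_state=42
)

print(len(train_rows), len(test_rows))  # 3927 982
```

The training size matches the `Length: 3927` shown for `train_targets`. When the target classes are imbalanced, passing `stratify=labels` keeps the class proportions the same in both parts.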

# Linear Regression

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

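# One-hot encode the categorical columns; numeric columns pass through unchanged.
# In scikit-learn 1.2+ the `sparse` argument is named `sparse_output`.
cat_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
trf = ColumnTransformer([('trf', OneHotEncoder(sparse=False, drop='first'), cat_cols)], remainder='passthrough')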
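To make the effect of the encoder concrete, here is a small sketch on a made-up frame with two categorical columns and one numeric column. The frame `demo` and the name `demo_trf` are assumptions for illustration. It uses `sparse_threshold=0` on the `ColumnTransformer` to get a dense array, so it does not depend on which name the encoder uses for its sparse option.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows using a subset of the dataset's columns.
demo = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "smoking_status": ["smokes", "never smoked", "Unknown", "never smoked"],
    "age": [67.0, 61.0, 49.0, 35.0],
})

# drop="first" removes one dummy column per categorical feature;
# remainder="passthrough" appends the untouched numeric columns at the end.
demo_trf = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), ["gender", "smoking_status"])],
    remainder="passthrough",
    sparse_threshold=0,
)

encoded = demo_trf.fit_transform(demo)
print(encoded.shape)  # 1 gender column + 2 smoking_status columns + age
```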

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
pipe = Pipeline(steps=[
    ('step1',trf),
    ('step2',LinearRegression())
])

In [16]:
pipe.fit(train_inputs, train_targets)



In [17]:
pipe.score(train_inputs, train_targets)

0.0759182317724485

In [18]:
pipe.score(test_inputs, test_targets)

0.08939928237069039

For a regressor, `score` returns the coefficient of determination (R²), not accuracy: the model reaches an R² of about `0.08` on the `Training data` and about `0.09` on the `Test data`.

This poor result is expected, because the target is a binary class label and `LinearRegression` is not designed for classification.
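To illustrate why the `score` of a regressor should not be read as accuracy, here is a small sketch on synthetic data (not the stroke data). The 0.5 threshold used to turn the regression outputs into class labels is an assumption made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Synthetic binary problem: the label depends on a single noisy feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

reg = LinearRegression().fit(X_demo, y_demo)
raw = reg.predict(X_demo)

# .score() on a regressor is R^2, the share of variance explained.
r2 = reg.score(X_demo, y_demo)

# Accuracy only makes sense after turning the outputs into class labels.
acc = accuracy_score(y_demo, (raw >= 0.5).astype(int))

print(round(r2, 3), round(acc, 3))
```

The two numbers measure different things, so a low R² does not translate directly into a low accuracy.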

# Logistic Regression

In [19]:
from sklearn.linear_model import LogisticRegression
pipe2 = Pipeline(steps=[
    ('step1',trf),
    ('step2',LogisticRegression())
])

In [24]:
pipe2.fit(train_inputs, train_targets)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
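The warning above suggests raising `max_iter` or scaling the data. A minimal sketch of both, on synthetic data rather than the stroke data; the step names `scale` and `model` and the value `max_iter=1000` are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (stand-ins for age, glucose, bmi).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(0, 80, 300),
    rng.uniform(50, 250, 300),
    rng.uniform(15, 45, 300),
])
y_demo = (X_demo[:, 0] > 50).astype(int)

# Scale first, then fit; max_iter is raised above the default of 100.
scaled_logreg = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scaled_logreg.fit(X_demo, y_demo)

print(scaled_logreg.score(X_demo, y_demo))
```

In this notebook's pipeline, a scaler would sit between the `ColumnTransformer` step and the model step.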


In [25]:
pipe2.score(train_inputs, train_targets)

0.9602750190985485

In [26]:
pipe2.score(test_inputs, test_targets)

0.9460285132382892

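`LogisticRegression` is about `96.0%` accurate on the `Training data` and about `94.6%` accurate on the `Test data`. These numbers look high, but accuracy alone can be misleading when one class is much more common than the other, so they do not by themselves show that the model is effective.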
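One way to put an accuracy figure in context is to compare it with a baseline that always predicts the most frequent class. A sketch with scikit-learn's `DummyClassifier`, on made-up labels with an assumed 5% positive rate (this ratio is not measured from the dataset):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with 5% positives (an assumed ratio, not taken from the data).
y_demo = np.array([1] * 5 + [0] * 95)
X_demo = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print(baseline_acc)  # 0.95
```

A model is only clearly useful on such data if it beats this baseline or finds positives that the baseline misses.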

# Decision Tree

In [30]:
from sklearn.tree import DecisionTreeClassifier
pipe3 = Pipeline(steps=[
    ('step1',trf),
    ('step2',DecisionTreeClassifier())
])

In [31]:
pipe3.fit(train_inputs, train_targets)



In [32]:
pipe3.score(train_inputs, train_targets)

1.0

In [33]:
pipe3.score(test_inputs, test_targets)

0.9185336048879837

Here the `DecisionTreeClassifier` scores `100%` on the `Training data` but only about `91.9%` on the `Test data`, which indicates overfitting.

We will tune its hyperparameters to reduce overfitting and improve the result on the `Test data`.

## Hyperparameter Tuning and Overfitting

In [43]:
def decision_hyper(**params):
    # Fit a decision tree pipeline with the given hyperparameters and print its scores.
    pipe3 = Pipeline(steps=[
        ('step1', trf),
        ('step2', DecisionTreeClassifier(**params))
    ])
    pipe3.fit(train_inputs, train_targets)
    print('Training accuracy:- ', pipe3.score(train_inputs, train_targets))
    print('Test accuracy:- ', pipe3.score(test_inputs, test_targets))

### `max_depth`

In [44]:
decision_hyper(max_depth=2)

Training accuracy:-  0.9602750190985485
Test accuracy:-  0.9460285132382892




In [45]:
decision_hyper(max_depth=3)

Training accuracy:-  0.9602750190985485
Test accuracy:-  0.9460285132382892




In [46]:
decision_hyper(max_depth=10)

Training accuracy:-  0.9831932773109243
Test accuracy:-  0.9307535641547862




In [47]:
decision_hyper(max_depth=20)

Training accuracy:-  0.9997453526865292
Test accuracy:-  0.9195519348268839




The best test accuracy is obtained with a `max_depth` of `2` or `3`.
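The same comparison can be run in a loop that collects the scores for each depth. A sketch on synthetic data from `make_classification`, used here as a stand-in for the encoded stroke features; the `results` dictionary is a name introduced for this example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in for the encoded stroke features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for depth in [2, 3, 10, 20]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth.
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A growing gap between training and test accuracy as the depth increases is the usual sign of overfitting.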

### max_leaf_nodes

In [54]:
decision_hyper(max_leaf_nodes=4)

Training accuracy:-  0.9602750190985485
Test accuracy:-  0.9460285132382892




In [48]:
decision_hyper(max_leaf_nodes=10)

Training accuracy:-  0.9620575502928445
Test accuracy:-  0.9439918533604889




In [49]:
decision_hyper(max_leaf_nodes=20)

Training accuracy:-  0.9656226126814362
Test accuracy:-  0.9419551934826884




In [50]:
decision_hyper(max_leaf_nodes=50)

Training accuracy:-  0.9742806213394448
Test accuracy:-  0.9368635437881874




Best `max_leaf_nodes` is `4`.

#### Putting all Together

In [55]:
decision_hyper(max_depth = 2, max_leaf_nodes=4)

Training accuracy:-  0.9602750190985485
Test accuracy:-  0.9460285132382892




With these settings the `DecisionTreeClassifier` performs well on the `Test data`. Next we try another model to see whether the accuracy can be improved further.
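The manual search over `max_depth` and `max_leaf_nodes` could also be automated with `GridSearchCV`, which picks hyperparameters by cross-validation on the training data and leaves the test set untouched. The sketch below is an alternative technique, not what this notebook does, and it runs on synthetic data; the grid values mirror the ones tried above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data standing in for the encoded train_inputs/train_targets.
X_tr, y_tr = make_classification(n_samples=400, n_features=8, random_state=0)

param_grid = {
    "max_depth": [2, 3, 10, 20],
    "max_leaf_nodes": [4, 10, 20, 50],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3))
```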

# Random Forest

In [58]:
from sklearn.ensemble import RandomForestClassifier
pipe4 = Pipeline(steps=[
    ('step1', trf),
    ('step2', RandomForestClassifier())
])
pipe4.fit(train_inputs, train_targets)



In [59]:
pipe4.score(train_inputs, train_targets)

1.0

In [60]:
pipe4.score(test_inputs, test_targets)

0.9460285132382892

With default settings, `Random Forest` scores `94.6%` on the `Test data`, higher than the untuned `Decision Tree` (`91.9%`) and equal to the tuned one.

## Hyperparameter Tuning and Overfitting

In [67]:
def random_hyper(**params):
    # Fit a random forest pipeline with the given hyperparameters and print its scores.
    pipe4 = Pipeline(steps=[
        ('step1', trf),
        ('step2', RandomForestClassifier(**params))
    ])
    pipe4.fit(train_inputs, train_targets)
    print('Training accuracy:- ', pipe4.score(train_inputs, train_targets))
    print('Test accuracy:- ', pipe4.score(test_inputs, test_targets))

### max_depth

In [77]:
random_hyper(max_depth=2)



Training accuracy:-  0.9602750190985485
Test accuracy:-  0.9460285132382892


In [78]:
random_hyper(max_depth=5)



Training accuracy:-  0.9605296664120193
Test accuracy:-  0.9460285132382892


In [79]:
random_hyper(max_depth=10)



Training accuracy:-  0.9752992105933282
Test accuracy:-  0.9460285132382892


In [80]:
random_hyper(max_depth=30)



Training accuracy:-  1.0
Test accuracy:-  0.945010183299389


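Test accuracy is identical for a `max_depth` of `2`, `5` and `10`, and slightly lower for `30`. We pick `10` as the best `max_depth`.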

### n_estimators

In [73]:
random_hyper(n_estimators=20)

Training accuracy:-  0.9964349376114082
Test accuracy:-  0.945010183299389




In [74]:
random_hyper(n_estimators=50)

Training accuracy:-  1.0
Test accuracy:-  0.9460285132382892




In [75]:
random_hyper(n_estimators=70)



Training accuracy:-  1.0
Test accuracy:-  0.9460285132382892


In [76]:
random_hyper(n_estimators=100)



Training accuracy:-  1.0
Test accuracy:-  0.9460285132382892


Test accuracy is identical for `n_estimators` of `50`, `70` and `100`. We pick `50` as the best `n_estimators`.

### max_features

In [81]:
random_hyper(max_features=2)



Training accuracy:-  1.0
Test accuracy:-  0.9439918533604889


In [82]:
random_hyper(max_features=3)



Training accuracy:-  1.0
Test accuracy:-  0.945010183299389


In [83]:
random_hyper(max_features=6)



Training accuracy:-  1.0
Test accuracy:-  0.9460285132382892


In [84]:
random_hyper(max_features=10)



Training accuracy:-  1.0
Test accuracy:-  0.9460285132382892


Test accuracy is identical for `max_features` of `6` and `10`. We pick `6` as the best `max_features`.

### Putting it Together.


In [85]:
random_hyper(n_jobs=-1, random_state=42, n_estimators=50, max_features=6, max_depth=10)

Training accuracy:-  0.980646804176216
Test accuracy:-  0.9460285132382892




Combining these settings gives the same test accuracy (`94.6%`) as the default `Random Forest`, so the tuning did not improve the result on the `Test data`.
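When several configurations end up with the same accuracy, a confusion matrix and the recall can show how a model treats each class. A sketch on made-up labels, where the "model" predicts only class 0; the arrays `y_true` and `y_pred` are assumptions for illustration, not outputs of the pipelines above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 4 positives out of 20, and a model that predicts only class 0.
y_true = np.array([1, 1, 1, 1] + [0] * 16)
y_pred = np.zeros(20, dtype=int)

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
accuracy = (y_true == y_pred).mean()
recall = recall_score(y_true, y_pred)  # share of actual positives that were found

print(cm)
print(accuracy, recall)
```

Here the accuracy is 0.8 while the recall is 0.0, because none of the positive cases are detected.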

# XGBoost

In [86]:
from xgboost import XGBClassifier
# Note: 'step2' below is a RandomForestClassifier; XGBClassifier is imported but never used.
pipe5 = Pipeline(steps=[
    ('step1', trf),
    ('step2', RandomForestClassifier())
])
pipe5.fit(train_inputs, train_targets)



In [87]:
print('Training accuracy:- ', pipe5.score(train_inputs, train_targets))
print('Test accuracy:- ', pipe5.score(test_inputs, test_targets))

Training accuracy:-  1.0
Test accuracy:-  0.9460285132382892


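Note that this pipeline fits a `RandomForestClassifier` rather than an `XGBClassifier`, so the scores above are Random Forest scores and match the earlier `Random Forest` result. `XGBoost` itself is not evaluated here.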

# Saving the Best Model

We treat `Random Forest` as the best model, so we save it.

In [88]:
pipe4.score(train_inputs, train_targets)

1.0

In [89]:
pipe4.score(test_inputs, test_targets)

0.9460285132382892

In [90]:
import pickle
# Use a context manager so the file is closed after writing.
with open('stroke_model.pkl', 'wb') as f:
    pickle.dump(pipe4, f)

In [91]:
with open('stroke_model.pkl', 'rb') as f:
    model = pickle.load(f)

In [92]:
model.score(train_inputs, train_targets)

1.0

In [93]:
model.score(test_inputs, test_targets)

0.9460285132382892
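The identical scores before and after loading suggest that the saved model was restored intact. The sketch below runs the same round trip on a small stand-in model trained on synthetic data, and then uses the restored model on a single new row. The names `demo_model`, `restored` and `demo_model.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (not the stroke pipeline).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Write and read the model with context managers so the files are closed.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)

# Predict for one new row; it needs the same columns the model was trained on.
new_row = X_demo[:1]
print(restored.predict(new_row))
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.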