# CASE STUDY: Stroke Dataset
# BY: Arsh Dinesh Vijayvargiya

## Index

1. ***[Import Data](#DataInspection)***
2. ***[Data manipulation](#DataManipulation)***
3. ***[EDA](#EDA)***
4. ***[ML model](#MLmodel)***

<a class = 'anchor' id ='DataInspection'></a>
    
### Inspecting the dataset

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(r"../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

<a id ='DataManipulation'></a>

-------

## Data Manipulation

##### 1. Gender

In [None]:
df.gender.unique()

In [None]:
df[df.gender == 'Other']

In [None]:
df.gender = df.gender.apply(lambda x:1 if x == 'Male' else 0)
df.gender.unique()
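
Note that the lambda above sends every value other than 'Male' to 0, so any 'Other' rows end up with the same code as 'Female'. A minimal sketch of an alternative, assuming one would rather make that choice explicit: `Series.map` with a dictionary leaves unmapped values as NaN, so they can be inspected or dropped before modelling. The toy values below are made up for illustration.

```python
import pandas as pd

# Toy column for illustration, not the real dataset
gender = pd.Series(['Male', 'Female', 'Other', 'Female'])

# Same rule as in the notebook: everything that is not 'Male' becomes 0
encoded_lambda = gender.apply(lambda x: 1 if x == 'Male' else 0)

# Explicit mapping: values missing from the dict become NaN
encoded_map = gender.map({'Male': 1, 'Female': 0})

print(encoded_lambda.tolist())
print(encoded_map.isna().sum())
```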

##### 2. ever_married

In [None]:
df.ever_married.unique()
df.ever_married = df.ever_married.apply(lambda x: 1 if x == 'Yes' else 0)

In [None]:
df[df.ever_married == 1]

##### 3. Residence_type

In [None]:
df.Residence_type.unique()
df.Residence_type = df.Residence_type.apply(lambda x: 1 if x == 'Urban' else 0)

##### 4. bmi

In [None]:
sum(df.bmi.isnull())

In [None]:
df.bmi = df.bmi.fillna(df.bmi.mean(),axis = 0)
sum(df.bmi.isnull())
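
`Series.mean()` skips missing values by default, so the fill value is the mean of the observed bmi entries only. A small sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy values for illustration, not the real bmi column
bmi = pd.Series([20.0, np.nan, 30.0, np.nan])

# mean() ignores the NaN entries, so the fill value here is 25.0
filled = bmi.fillna(bmi.mean())

print(filled.tolist())
print(filled.isnull().sum())
```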

<a id ='EDA'></a>

----------------------------
## EDA

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
bmi_slot = ['Under-weight','Healthy','Overweight','Obese']
df['cate_bmi'] = pd.cut(df.bmi,[-1,18.5,25,30,100],labels = bmi_slot)

In [None]:
age_slot = ['infant','child','teenager','adult','senior-citizen']
df['cate_age'] = pd.cut(df.age,[-1,5,13,20,50,100],labels = age_slot)
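
By default `pd.cut` builds intervals that are closed on the right, so a value sitting exactly on an edge falls into the lower bin (a bmi of 25 counts as 'Healthy', not 'Overweight'). A short sketch with made-up bmi values around the edges:

```python
import pandas as pd

bmi_slot = ['Under-weight', 'Healthy', 'Overweight', 'Obese']

# Toy values chosen to sit on and just above the bin edges
sample_bmi = pd.Series([18.5, 18.6, 25.0, 25.1, 30.0, 30.1])

# Same edges as in the notebook: (-1, 18.5], (18.5, 25], (25, 30], (30, 100]
cate = pd.cut(sample_bmi, [-1, 18.5, 25, 30, 100], labels=bmi_slot)

print(cate.tolist())
```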

In [None]:
def plott(df,x):
    # Count plot of column x split by stroke; each bar is annotated with
    # its share (in %) of the rows in that category.
    fig,axes= plt.subplots(figsize =(15,7))
    axes = sns.countplot(data=df,x = x, hue= 'stroke', order=df[x].unique())
    count = 0
    for i in df[x].unique():
        in_category = df[x]==i
        total_count = in_category.sum()
        # combine both conditions in one mask instead of chained indexing
        stroke = (in_category & (df.stroke==1)).sum()
        no_stroke = total_count - stroke
        has_stroke = round((stroke/total_count)*100,5)
        doesnt_has_stroke = round((no_stroke/total_count)*100,5)
        annote = [doesnt_has_stroke,has_stroke]
        n_count = df[x].nunique()-1
        for n in range(2):
            # assumes bars are stored hue by hue: all stroke=0 bars, then all stroke=1 bars
            p = axes.patches[n+count+(n_count*n)]
            axes.annotate('{:.1f}%'.format(annote[n]),(p.get_x()+0.2,p.get_height()+40))
        count+=1
    return fig,axes
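
The percentages that `plott` writes on the bars are row-wise shares: for each category, the fraction of people with and without stroke. As a cross-check, the same numbers can be obtained as a table with `pd.crosstab(..., normalize='index')`. A minimal sketch on a made-up frame (the column names follow the dataset, the values do not):

```python
import pandas as pd

# Toy data: 4 smokers with 1 stroke, 5 never-smokers with 1 stroke
toy = pd.DataFrame({
    'smoking_status': ['smokes'] * 4 + ['never smoked'] * 5,
    'stroke':         [1, 0, 0, 0] + [1, 0, 0, 0, 0],
})

# Each row sums to 100: share of stroke=0 and stroke=1 within the category
pct = pd.crosstab(toy['smoking_status'], toy['stroke'], normalize='index') * 100

print(pct.round(1))
```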


In [None]:
plott(df,'cate_age')
plt.show()

##### Insight: Older people are at higher risk of stroke.

In [None]:
plott(df,'cate_bmi')
plt.show()

##### Insight: Risk of stroke is significantly higher for people with bmi > 25

In [None]:
fig,axes = plott(df,'gender')
label = ['Male','Female']
axes.set_xticklabels(label)
plt.show()

##### Insight: Gender has little effect on stroke risk in this dataset.

In [None]:
fig,axes = plott(df,'ever_married')
axes.set_xticklabels(['Married','Single'])
plt.show()

##### Insight: Married people show a higher stroke rate than single people in this dataset. Live long, live happy #SingleForever
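
A word of caution: married people also tend to be older, so part of this gap might come from age rather than from marriage itself. The sketch below uses purely synthetic counts (not taken from the stroke dataset) in which the stroke rate depends only on age group, yet the overall rate still looks higher for married people. Comparing rates within each age group is one way to check for this.

```python
import pandas as pd

# Synthetic counts, chosen so that only age drives the stroke rate
rows = [
    # age_group, ever_married, n_people, n_stroke
    ('young', 0, 40, 2),
    ('young', 1, 20, 1),
    ('old',   0, 10, 2),
    ('old',   1, 40, 8),
]
toy = pd.DataFrame(rows, columns=['age_group', 'ever_married', 'n_people', 'n_stroke'])

# Overall rate per marital status, ignoring age
crude = toy.groupby('ever_married')[['n_people', 'n_stroke']].sum()
crude_rate = crude['n_stroke'] / crude['n_people']

# Rate per marital status inside each age group
toy['rate'] = toy['n_stroke'] / toy['n_people']
stratified_rate = toy.pivot(index='age_group', columns='ever_married', values='rate')

print(crude_rate)
print(stratified_rate)
```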

In [None]:
fig,axes = plott(df,'Residence_type')
axes.set_xticklabels(['Urban','Rural'])
plt.show()

##### Insight: Living in an urban area seems to increase the risk of stroke, but not to a significant level.

In [None]:
plott(df,'smoking_status')
plt.show()

##### Insight: People with any history of smoking are at higher risk of stroke.

In [None]:
plott(df,'work_type')
plt.show()

##### Insight: Having a job is associated with a much higher stroke rate. Self-employed people have a noticeably higher probability of stroke than people in government or private jobs, possibly because the benefits that come with a private or government job are not available to the self-employed.

In [None]:
fig,axes = plott(df,'hypertension')
axes.set_xticklabels(['No','Yes'])
plt.show()

##### Insight: Hypertension is a bad sign that can lead to a possible stroke.

In [None]:
fig,axes = plott(df,'heart_disease')
axes.set_xticklabels(['Yes','No'])
plt.show()

##### Insight: A person with heart disease is more likely to get a stroke than a healthy person.


----------------
## Multivariate Analysis

In [None]:
plt.figure(figsize=(16,10))
axes =sns.boxplot(data =df, x='gender', y='age', hue ='stroke' ,order = df.gender.unique())
plt.title('Relation of Age and Gender on Stroke')
label = ['Male','Female']
axes.set_xticklabels(label)
plt.show()

##### Insight: As expected, senior citizens are at higher risk of stroke irrespective of their gender.

In [None]:
plt.figure(figsize=(16,10))
axes =sns.boxplot(data =df, y='avg_glucose_level', x='stroke' ,order = df.stroke.unique())
plt.title('Effect of Avg_Glucose_level on Stroke')
label = ['Yes','No']
axes.set_xticklabels(label)
plt.show()

##### Insight: In general, a low avg_glucose_level is a better sign of not having a stroke in the near future. ;)

In [None]:
plt.figure(figsize=(16,10))
axes =sns.boxplot(data =df, y='avg_glucose_level', x='heart_disease',hue='stroke' ,order = df.heart_disease.unique())
plt.title('Effect of Avg_Glucose_level and Heart_disease on Stroke')
label = ['Yes','No']
axes.set_xticklabels(label)
plt.show()

##### Insight: Heart Disease along with high Average Glucose Level is a strong indicator of possible stroke.

### Summary:

1. The ***major determinant*** features of stroke are:
    
    a) **Age**
    
    b) **Marriage Status**
    
    c) **Heart Disease** 
    
    d) **Hypertension**
    
    
2. As per the dataset, **gender** and **residence type** contribute significantly ***less*** to the risk of stroke.
3. **Self-employed** people are ***more*** prone to stroke.
4. **Married** people experienced stroke ***more*** often than **Single** ones.
5. A healthy **bmi**, low **average glucose level**, and lack of **hypertension** greatly reduce the chances of having a stroke.
6. As people grow older they take on more responsibilities and face more hardships. Possibly as a result of this stress, the share of people having a stroke increases monotonically with age.

<a class = 'anchor' id ='MLmodel'></a>


-------------------------------
## ML Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_absolute_error as mae
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
df.drop(['cate_bmi','cate_age'],axis =1,inplace=True)

# Separating label from dataset
y = df.stroke
X = df.drop('stroke',axis=1)

train_X, test_X, train_y,test_y = train_test_split(X,y,random_state= 1)

cate_feat = ['work_type','smoking_status']
for col in cate_feat:
    le = LabelEncoder()
    train_X[col] = le.fit_transform(train_X[col])
    test_X[col] = le.transform(test_X[col])
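
`LabelEncoder` assigns integer codes following the sorted order of the classes seen in `fit`, and `transform` raises a `ValueError` for a label that was not present during fitting. That is worth keeping in mind when the encoder is fitted on the training split only, as above. A small sketch with made-up category lists:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration, not the real columns
train_col = ['Private', 'Self-employed', 'Govt_job', 'Private']
test_col = ['Govt_job', 'Never_worked']  # 'Never_worked' was not seen in fit

le = LabelEncoder()
train_codes = le.fit_transform(train_col)

print(list(le.classes_))     # classes are stored in sorted order
print(train_codes.tolist())  # codes are positions in that sorted list

# An unseen label in the test split makes transform fail
try:
    le.transform(test_col)
    unseen_error = False
except ValueError:
    unseen_error = True

print(unseen_error)
```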
    
    

In [None]:
# decision tree
dec_tree = DecisionTreeClassifier(random_state=1)
dec_tree.fit(train_X,train_y)
pred = dec_tree.predict(test_X)
error = mae(test_y,pred)
error

In [None]:

acc = accuracy_score(test_y,pred)
print('{:.2f}%'.format(acc*100))
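
For 0/1 labels every wrong prediction contributes an absolute error of exactly 1, so the mean absolute error is just the misclassification rate, i.e. 1 - accuracy. The two numbers reported here therefore carry the same information. A quick check on made-up labels:

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Toy labels: 2 of the 5 predictions are wrong
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

error = mean_absolute_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(error, acc, error + acc)
```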

In [None]:
# Random Forest Classifier
estimators_list = [100,150,200,250,300]

for n in estimators_list:
    rand_for = RandomForestClassifier(n_estimators = n, random_state =1)
    pred= rand_for.fit(train_X,train_y).predict(test_X)
    loss = mae(test_y,pred)
    acc = accuracy_score(test_y,pred)
    print(loss,acc)
    

rand_for = RandomForestClassifier(n_estimators = 200, random_state =1)
pred= rand_for.fit(train_X,train_y).predict(test_X)
loss = mae(test_y,pred)
acc = accuracy_score(test_y,pred)
print('loss: {:.4f}, acc: {:.3f}%'.format(loss,acc*100))

In [None]:
# GradientBoosting Classifier
estimators_list = [25,50,75,100,125]
for n in estimators_list:
    gbc= GradientBoostingClassifier(loss = 'exponential',n_estimators = n,random_state= 1)
    pred = gbc.fit(train_X,train_y).predict(test_X)
    loss = mae(test_y,pred)
    print(n,loss)

gbc= GradientBoostingClassifier(loss = 'exponential',n_estimators = 100,random_state= 1)
pred = gbc.fit(train_X,train_y).predict(test_X)
loss = mae(test_y,pred)
acc = accuracy_score(test_y,pred)
print('loss: {:.4f}, acc: {:.3f}%'.format(loss,acc*100))

In [None]:
final_model = GradientBoostingClassifier(loss = 'exponential',n_estimators = 100,random_state= 1)
pred = final_model.fit(train_X,train_y).predict(test_X)
loss = mae(test_y,pred)
acc = accuracy_score(test_y,pred)

In [None]:
print('loss: {:.4f}, acc: {:.3f}%'.format(loss,acc*100))
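
Accuracy alone can look high when one class is rare, which may well be the case for stroke labels. The sketch below uses synthetic labels with a 5% positive rate (an illustrative figure, not measured from the dataset): a model that predicts "no stroke" for everyone still reaches 95% accuracy while finding none of the actual strokes. `recall_score` and `confusion_matrix` make that visible.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels: 95 people without stroke, 5 with stroke
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts no stroke
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)     # share of actual strokes that were found
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc, rec)
print(cm)
```

Running the same two metrics on `test_y` and `pred` would show whether the models above behave differently from this baseline.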