# Stroke Prediction

---
**1. Importing the necessary libraries**

In [None]:
# To prevent the annoying warning from scikit learn package
import warnings  
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
cmap = sns.cm.mako_r

%matplotlib inline

**2. Importing the Data using pandas read_csv(), then calling head(), info() and describe() on the DataFrame**

In [None]:
stroke = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
stroke.head()

In [None]:
stroke.info()

In [None]:
stroke.drop(columns=['id']).describe()

---
**3. Preprocessing Data before Exploratory Data Analysis**
<ol>
    <li>Using <strong>round()</strong> to round Age to the nearest whole number.</li><br>
    <li>Setting BMI to <strong>NaN</strong> where it is <strong>12</strong> or lower, or <strong>60</strong> or higher. These cut-offs come from a quick Google search and are treated here as outliers.</li><br>
    <li>Sorting the DataFrame first by <strong>Gender</strong>, then by <strong>Age</strong>, and using <strong>Forward Filling</strong> to fill the missing BMI values, so each gap takes the BMI of the preceding row (usually a person of the same gender and a similar age).</li>
</ol>

In [None]:
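# Round Age to the nearest integer (this also turns the column into an integer dtype)
stroke['age'] = stroke['age'].round().astype(int)

# Keep BMI only when 12 < BMI < 60; everything else (including existing NaN) becomes NaN
stroke['bmi'] = stroke['bmi'].where((stroke['bmi'] > 12) & (stroke['bmi'] < 60))

# Sort by Gender, then Age, and forward-fill so each missing BMI takes the previous row's value
stroke.sort_values(['gender', 'age'], inplace=True)
stroke.reset_index(drop=True, inplace=True)
# Assign the result back: ffill(inplace=True) on a column selection may not update the DataFrame
stroke['bmi'] = stroke['bmi'].ffill()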

In [None]:
stroke.info()

Now the <strong>Age</strong> column is <em style='color:blue'>int64</em> and the <strong>BMI</strong> column has no missing values.
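
To make the three preprocessing steps concrete, here is a minimal sketch on a made-up four-row DataFrame (the rows are hypothetical, not from the stroke dataset). It also shows one thing worth being aware of: a plain forward fill can carry a BMI value across the Female/Male boundary, while a per-gender forward fill (`bmi_within_gender`, an illustrative column name) does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the stroke dataset
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "age":    [59.6, 31.2, 20.0, 25.4],
    "bmi":    [28.5, 97.6, np.nan, 22.1],
})

# Same three steps as above: round age, blank out-of-range BMI, sort
toy["age"] = toy["age"].round().astype(int)
toy["bmi"] = toy["bmi"].where((toy["bmi"] > 12) & (toy["bmi"] < 60))
toy = toy.sort_values(["gender", "age"]).reset_index(drop=True)

# Plain forward fill (what the notebook does) vs. forward fill within each gender
toy["bmi_filled"] = toy["bmi"].ffill()
toy["bmi_within_gender"] = toy.groupby("gender")["bmi"].ffill()
print(toy)
```

In this toy case the youngest Male row has no earlier Male row to copy from, so the plain fill borrows the last Female value and the per-gender fill leaves it missing. Which behaviour is preferable depends on the data.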

---
**4. Exploratory Data Analysis**

In [None]:
# Checking if the data is balanced
counts = stroke['stroke'].value_counts()

# seaborn needs x and y as keyword arguments
ax = sns.barplot(x=counts.index, y=counts.values)
ax.set_xlabel("Stroke")
plt.show()

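As the plot above shows, the data is not balanced: far fewer rows have stroke = 1 than stroke = 0. A model trained on it as-is can score a high accuracy just by predicting the majority class, which makes it a bad model for finding strokes. To resolve this issue we can use SMOTE to balance the data. This will be done before fitting our data to the model.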
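
A small sketch of why imbalance matters, using made-up labels (95 negatives and 5 positives, chosen only for illustration): a "model" that always predicts the majority class gets a high accuracy while finding none of the positive cases.

```python
import pandas as pd

# Hypothetical labels: 95 negatives and 5 positives, for illustration only
y_toy = pd.Series([0] * 95 + [1] * 5, name="stroke")

# Share of each class
class_share = y_toy.value_counts(normalize=True)
majority_class = class_share.idxmax()

# A "model" that always predicts the majority class
always_majority = pd.Series(majority_class, index=y_toy.index)
accuracy = (always_majority == y_toy).mean()
recall_positive = ((always_majority == 1) & (y_toy == 1)).sum() / (y_toy == 1).sum()

print(class_share)
print(f"accuracy={accuracy:.2f}, recall on stroke=1: {recall_positive:.2f}")
```

This is why the classification report (precision and recall per class) is more informative than the accuracy score alone on data like this.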

---

In [None]:
# Age vs BMI: stroke cases (opaque) drawn over non-stroke cases (faded)
plt.figure(figsize=(12,8))
ax = sns.scatterplot(x="bmi", y="age", alpha=0.4, data=stroke[stroke['stroke'] == 0])
sns.scatterplot(x="bmi", y="age", alpha=1, data=stroke[stroke['stroke'] == 1], ax=ax)
plt.show()

From the above <strong>Age vs BMI</strong> plot, stroke cases become more frequent from around age <strong>40</strong> and more frequent still after <strong>60+</strong>. Stroke cases also appear mostly among people with a BMI of <strong>25+</strong>.

So, in this data, people aged 40+ with a BMI of 25+ appear more likely to have had a stroke.
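
The scatter plot gives a visual impression only. One way to put numbers on it is to bin age and BMI and compute the stroke rate per bin. The sketch below uses made-up rows and illustrative column names (`age_band`, `bmi_band`); with the real data the same calls could be applied to the `stroke` DataFrame.

```python
import pandas as pd

# Hypothetical toy rows; with the real data, use the `stroke` DataFrame instead
toy = pd.DataFrame({
    "age":    [25, 35, 45, 52, 63, 70, 78, 81],
    "bmi":    [21.0, 27.5, 24.0, 29.0, 31.0, 23.5, 28.0, 26.0],
    "stroke": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Bands matching the thresholds discussed above (left-closed intervals)
toy["age_band"] = pd.cut(toy["age"], bins=[0, 40, 60, 120], right=False,
                         labels=["<40", "40-59", "60+"])
toy["bmi_band"] = pd.cut(toy["bmi"], bins=[0, 25, 100], right=False,
                         labels=["<25", "25+"])

# Stroke rate and number of rows in each (age band, BMI band) cell
rate = (
    toy.groupby(["age_band", "bmi_band"], observed=True)["stroke"]
       .agg(["mean", "count"])
)
print(rate)
```

Keeping the `count` column next to the rate helps to spot cells that are too small to say anything about.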

---

In [None]:
# Average glucose level vs BMI: stroke cases (opaque) drawn over non-stroke cases (faded)
plt.figure(figsize=(12,8))
ax = sns.scatterplot(x="bmi", y="avg_glucose_level", alpha=0.4, data=stroke[stroke['stroke'] == 0])
sns.scatterplot(x="bmi", y="avg_glucose_level", alpha=1, data=stroke[stroke['stroke'] == 1], ax=ax)
plt.show()

---

In [None]:
# Percentage of people with a stroke in each category of a column
def plot_percent_of_stroke_in_each_category(df, column, axis):
    x_axis = []
    y_axis = []

    unique_values = df[column].unique()

    for value in unique_values:
        # Rows in this category that had a stroke, and all rows in this category
        stroke_yes = len(df[(df[column] == value) & (df['stroke'] == 1)])
        total = len(df[df[column] == value])
        percentage = (stroke_yes/total) * 100
        x_axis.append(value)
        y_axis.append(percentage)

    # seaborn needs x and y as keyword arguments
    sns.barplot(x=x_axis, y=y_axis, ax=axis)
    
columns = ['gender', 'hypertension', 'heart_disease', 'ever_married', 
           'work_type', 'Residence_type', 'smoking_status']

fig, axes = plt.subplots(4, 2, figsize=(16, 18))
axes[3, 1].remove()

plot_percent_of_stroke_in_each_category(stroke, 'gender', axes[0,0])
axes[0,0].set_xlabel("Gender")
axes[0,0].set_ylabel("Percentage")

plot_percent_of_stroke_in_each_category(stroke, 'hypertension', axes[0,1])
axes[0,1].set_xlabel("Hypertension")

plot_percent_of_stroke_in_each_category(stroke, 'heart_disease', axes[1,0])
axes[1,0].set_xlabel("Heart Disease")
axes[1,0].set_ylabel("Percentage")

plot_percent_of_stroke_in_each_category(stroke, 'ever_married', axes[1,1])
axes[1,1].set_xlabel("Ever Married")


plot_percent_of_stroke_in_each_category(stroke, 'work_type', axes[2,0])
axes[2,0].set_xlabel("Work Type")
axes[2,0].set_ylabel("Percentage")

plot_percent_of_stroke_in_each_category(stroke, 'Residence_type', axes[2,1])
axes[2,1].set_xlabel("Residence Type")

plot_percent_of_stroke_in_each_category(stroke, 'smoking_status', axes[3,0])
axes[3,0].set_xlabel("Smoking Status")
axes[3,0].set_ylabel("Percentage")

plt.show()

**Insights drawn from the above plot with respect to the Stroke Data**

<ol>
    <li>Both genders have around a 5% chance.</li><br>
    <li>People with a history of Hypertension or Heart Disease show an increased percentage of strokes, at around 12.5% and 16.5% respectively.</li><br>
    <li>People who have ever been married have around a 6.5% chance of stroke. No wonder people these days choose to stay single. (Age is a likely confounder here, since the scatter plot above shows strokes clustering at older ages.)</li><br>
    <li>Self-employed people have a higher chance compared to people in Private and Govt jobs.</li><br>
    <li>Rural and Urban residence don't show much difference.</li><br>
    <li>For some reason, people who used to smoke have a higher chance compared to people who are still smoking. If you have already started smoking, don't stop. JK, do as you wish: these are raw percentages and say nothing about cause.</li>
</ol>
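
The percentages behind these bars can also be computed without an explicit loop. Because `stroke` is coded 0/1, the mean of the column within a group is the share of stroke cases in that group. The sketch below uses made-up rows and shows the group size next to the percentage, since a category with only a handful of rows can give a misleading figure.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 1, 1, 0],
})

# Mean of a 0/1 column = fraction of ones; count = rows in the category
summary = toy.groupby("hypertension")["stroke"].agg(["mean", "count"])
summary["stroke_pct"] = summary["mean"] * 100
print(summary)
```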

---

**5. Preparing the Data for Prediction**

<ol>
    <li>Converting the categorical columns into numerical ones by mapping each category to an integer value using <strong>map()</strong> on the pandas Series object.</li><br>
    <li>As we saw earlier, the data is <strong>imbalanced</strong>. To make it balanced we use a technique called <strong>SMOTE (Synthetic Minority Oversampling Technique)</strong>. There are other techniques, such as the NearMiss algorithm, but I prefer SMOTE. The major difference is that SMOTE increases the number of minority-class samples by linear interpolation between existing minority samples, whereas NearMiss undersamples: it discards majority-class examples, keeping only those selected by their distance to minority-class examples.</li><br>
    <li>Splitting the data into training and testing samples.</li>
</ol>
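
To illustrate what "linear interpolation" means for SMOTE, here is a simplified NumPy sketch of the core idea. It is not the imbalanced-learn implementation: the function name `interpolate_minority`, its parameters and the sample points are all made up for illustration. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbour] - X_min[base])
    return synthetic

# Hypothetical minority samples with two features (say, age and BMI)
X_minority = np.array([[60.0, 28.0], [65.0, 31.0], [72.0, 26.0], [80.0, 33.0]])
new_points = interpolate_minority(X_minority, n_new=5)
print(new_points)
```

One consequence of interpolating is that integer-coded categorical columns can end up with in-between values (for example a work type of 2.4). imbalanced-learn also provides `SMOTENC` for data that mixes numerical and categorical features, which may be worth a look.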

In [None]:
#Converting Categorical Data to Numerical
gender_dict = {'Male': 0, 'Female': 1, 'Other': 2}
ever_married_dict = {'No': 0, 'Yes': 1}
work_type_dict = {'children': 0, 'Never_worked': 1, 'Govt_job': 2, 'Private': 3, 'Self-employed': 4}
residence_type_dict = {'Rural': 0, 'Urban': 1}
smoking_status_dict = {'Unknown': 0, 'never smoked': 1, 'formerly smoked':2, 'smokes': 3}

stroke['gender'] = stroke['gender'].map(gender_dict)
stroke['ever_married'] = stroke['ever_married'].map(ever_married_dict)
stroke['work_type'] = stroke['work_type'].map(work_type_dict)
stroke['Residence_type'] = stroke['Residence_type'].map(residence_type_dict)
stroke['smoking_status'] = stroke['smoking_status'].map(smoking_status_dict)
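
One thing to keep in mind with `map()` and a dict: any value that is not a key in the dict becomes NaN without an error. A quick check like the sketch below (with made-up values, including a deliberate mismatch) can catch spelling or whitespace differences before they reach the model.

```python
import pandas as pd

smoking_status_dict = {'Unknown': 0, 'never smoked': 1, 'formerly smoked': 2, 'smokes': 3}

# Hypothetical values; 'Smokes ' (capitalised, trailing space) is a deliberate mismatch
raw = pd.Series(['smokes', 'Unknown', 'Smokes ', 'never smoked'])
mapped = raw.map(smoking_status_dict)

# Original values that did not match any key in the dict
unmapped = raw[mapped.isna()].unique()
print(unmapped)
```

If `unmapped` is empty for every mapped column, the mapping covered all categories.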

In [None]:
# Splitting into features and value to be predicted
X = stroke.drop(columns=['id', 'stroke'])
y = stroke['stroke']

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2)

sns.barplot(x=['0', '1'], y =[sum(y == 0), sum(y == 1)], ax = ax1)
ax1.set_title("Before Oversampling")
ax1.set_xlabel('Stroke')

#Using SMOTE to balance the Data
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 2) 
X, y = sm.fit_resample(X, y) 

sns.barplot(x=['0', '1'], y =[sum(y == 0), sum(y == 1)], ax = ax2)
ax2.set_title("After Oversampling")
ax2.set_xlabel('Stroke')

plt.tight_layout()
plt.show()

In [None]:
# Splitting the Data into Train and Test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
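
A note on ordering: because the oversampling above runs before the split, the test set also contains synthetic rows, and some of them are interpolated from rows that land in the training set. This tends to make the reported scores look better than they would be on real, unseen data. A common alternative is to split first and resample only the training portion. The sketch below shows that ordering on made-up data with NumPy only; `oversample_minority` is a hypothetical stand-in that simply duplicates minority rows, where the notebook would call `SMOTE.fit_resample` on `X_train, y_train`.

```python
import numpy as np

rng = np.random.default_rng(40)

# Hypothetical imbalanced data: 90 negatives, 10 positives
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 90 + [1] * 10)

# 1) Split first (30% of each class goes to the test set, left untouched)
test_idx = np.concatenate([
    rng.permutation(np.flatnonzero(y_toy == c))[: (y_toy == c).sum() * 3 // 10]
    for c in (0, 1)
])
train_idx = np.setdiff1d(np.arange(len(y_toy)), test_idx)
X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
X_te, y_te = X_toy[test_idx], y_toy[test_idx]

def oversample_minority(X, y, rng):
    """Stand-in for a resampler: duplicate minority rows until classes are equal."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

# 2) Resample the training portion only
X_tr_bal, y_tr_bal = oversample_minority(X_tr, y_tr, rng)

print("train class counts:", np.bincount(y_tr_bal))
print("test class counts: ", np.bincount(y_te))
```

The test set keeps the original class ratio, so the metrics computed on it reflect how the model would behave on data that looks like the real thing.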

**6. Creating a Model for Stroke Prediction**

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay

pipeline = make_pipeline(StandardScaler(), RandomForestClassifier())
pipeline.fit(X_train, y_train)
prediction = pipeline.predict(X_test)

print(f"Accuracy Score : {round(accuracy_score(y_test, prediction) * 100, 2)}%")

In [None]:
print(classification_report(y_test, prediction))

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
ConfusionMatrixDisplay.from_estimator(pipeline, X_test, y_test, cmap=cmap)
plt.grid(False)
plt.show()

---
# Thanks a lot for showing your Interest 