<h1 id="contents" style="text-align:center; background-color:#acdf87;color:white;padding-top:15px;padding-bottom:15px"><strong>CONTENTS</strong><a class="anchor-link" href="https://www.kaggle.com/abtabm/indepth-stroke-analysis-eda-smote-91-acc/#contents">¶</a></h1>


1. [Importing Packages & Dataset](#imp)
1. [EDA](#eda)
   1. [Numerical Features](#num)
   1. [Categorical Features](#cat)
1. [Data Preprocessing](#DataPre)
   1. [Encoding](#enc)
   1. [Splitting](#spl)
   1. [SMOTE](#smt)
1. [Modelling](#model)
   1. [Logistic Regression](#LR)
   1. [SVM](#SVM)
   1. [Decision Tree](#DT)
   1. [Random Forest](#RF)
   1. [XGBoost](#XG)
1. [Conclusion](#concl)

<div id="imp"></div>

<h1 id="importing-packages" style="font-size:20px; color:white; background-color:#5dd466 ;text-align:center;padding-top:10px;padding-bottom:10px"><strong>IMPORTING PACKAGES & DATASET</strong><a class="anchor-link" href="https://www.kaggle.com/abtabm/indepth-stroke-analysis-eda-smote-91-acc/#importing-packages">¶</a></h1>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import plotly.express as px
from IPython.display import display, HTML, Javascript
from plotly.offline import download_plotlyjs,init_notebook_mode
init_notebook_mode(connected=True)

In [None]:
data_o = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
data_o.head(8)

In [None]:
data = data_o.drop("id",axis=1)
data.info()

**A total of 201 missing values in BMI**

In [None]:
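# Impute missing BMI values with the column mean (assignment avoids chained in-place modification)
data["bmi"] = data["bmi"].fillna(data["bmi"].mean())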
data.isna().sum()

In [None]:
data.describe()

**The age column looks odd: its minimum is 0.08, which probably means that infants' ages are stored as fractional values (for example in days or as fractions of a year), or that the column contains some noise. We need to analyze it during data cleaning.**

<div id="eda"></div>

<h1 id= "eda" style="font-size:20px; color:White; background-color:#5dd466 ;text-align:center;padding-top:10px;padding-bottom:10px"><strong>EXPLORATORY DATA ANALYSIS</strong><a class="anchor-link" href="https://www.kaggle.com/abtabm/indepth-stroke-analysis-eda-smote-91-acc/#eda">¶</a></h1>

In [None]:
plt.figure(figsize=(15,10))
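# Restrict the correlation matrix to numeric columns; string columns raise an error in recent pandas
corr = data.corr(numeric_only=True)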
sns.heatmap(corr,cmap = 'viridis',annot=True,fmt=".2f",vmin=-1,vmax=1,linewidths=0.2)


### **NUMERICAL FEATURES**
<div id="num"></div>

In [None]:
fig = plt.figure(figsize=(20,5))
uncat_data = ["age","bmi","avg_glucose_level"]

# histplot with a KDE overlay replaces the deprecated sns.distplot
plt.subplot(1,3,1)
att = data["age"].values
p = sns.histplot(att,kde=True,stat="density",color="violet")
p.set_title("Age Distribution",fontsize=16)
p.set_xlim([min(att),max(att)])

plt.subplot(1,3,2)
att = data["bmi"].values
p = sns.histplot(att,kde=True,stat="density",color="black")
p.set_title("BMI Distribution",fontsize=16)
p.set_xlim([min(att),max(att)])

plt.subplot(1,3,3)
att = data["avg_glucose_level"].values
p = sns.histplot(att,kde=True,stat="density",color="orange")
p.set_title("Avg Glucose Level Distribution",fontsize=16)
p.set_xlim([min(att),max(att)])

In [None]:
import plotly.figure_factory as ff
group_labels = ['0', '1']
l = [data['age'][(data["stroke"] == 0)],data['age'][(data["stroke"] == 1)]]
fig = ff.create_distplot(l, group_labels,curve_type='kde',colors = ['slategray', 'magenta'])
fig.update_layout(title_text='Age & Stroke Distribution',xaxis_title="Age Distribution",yaxis_title="Frequency")
fig.show()

- **We can infer that the risk of stroke increases after the age of 40.**
- **Above the age of 76 the chance of stroke is high.**

In [None]:
import plotly.figure_factory as ff
group_labels = ['0', '1']
l = [data['bmi'][(data["stroke"] == 0)],data['bmi'][(data["stroke"] == 1)]]
fig = ff.create_distplot(l, group_labels,curve_type='kde',colors = ['#F66095', '#2BCDC1'])
fig.update_layout(title_text='BMI & Stroke Distribution',xaxis_title="BMI Distribution",yaxis_title="Frequency")
fig.show()

- Below 18.5 - Underweight
- 18.5-24.9 - Normal
- 25.0-29.9 - Overweight
- 30.0 and above - Obese
- **A BMI of 30 or above is classified as obese; the plot suggests that the chance of stroke is higher in obese people.**
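To check the obesity observation numerically rather than by eye, the BMI values can be binned into the categories listed above and the stroke rate computed per bin. The sketch below uses a small invented DataFrame (`toy`) in place of `data`; only the column names `bmi` and `stroke` and the bin edges come from this notebook, everything else is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    "bmi":    [17.0, 22.5, 24.0, 27.3, 29.9, 31.2, 35.8, 41.0],
    "stroke": [0,    0,    0,    0,    1,    0,    1,    1],
})

# Bin edges follow the categories listed above; right=False makes each bin [low, high).
bins = [0, 18.5, 25.0, 30.0, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]
toy["bmi_group"] = pd.cut(toy["bmi"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the share of stroke cases in each group.
stroke_rate = toy.groupby("bmi_group", observed=False)["stroke"].mean()
print(stroke_rate)
```

Running the same three steps on `data` instead of `toy` would give the per-category stroke rate for the real dataset.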

In [None]:
import plotly.figure_factory as ff
group_labels = ['0', '1']
l = [data['avg_glucose_level'][(data["stroke"] == 0)],data['avg_glucose_level'][(data["stroke"] == 1)]]
fig = ff.create_distplot(l, group_labels,curve_type='kde',colors = ['#393E46', 'rgb(0, 200, 200)'])
fig.update_layout(title_text='Avg Glucose Level & Stroke Distribution',xaxis_title="Avg_Glucose_Level Distribution",yaxis_title="Frequency")
fig.show()

- **An elevated glucose level is associated with a higher chance of stroke, a trait also observed in diabetic patients.**
- **From the graph we can infer that the risk of stroke increases above a level of about 150-160.**

### **CATEGORICAL & BOOLEAN FEATURES**
<div id="cat"></div>

In [None]:
print('No Stroke :', round(data['stroke'].value_counts()[0]/len(data) * 100,2), '% of the dataset')
print('Stroke :', round(data['stroke'].value_counts()[1]/len(data) * 100,2), '% of the dataset')
fig = px.histogram(data, x="stroke", color="stroke",barmode="group") 
fig.update_layout(title_text="Stroke Count")
fig.show()

In [None]:
fig = px.histogram(data, x="smoking_status", color="stroke",barmode="group") 
fig.update_layout(title_text="Smoking Status Count")
fig.show()

**The dataset is highly imbalanced: stroke cases are a small minority. This could eventually lead to a poor model, so resampling is required.**

In [None]:
plt.figure(figsize=(20,5))
sns.set_style(style="darkgrid")

# Recent seaborn versions require the column to be passed as the x keyword
plt.subplot(1,3,1)
sns.countplot(x="ever_married",data=data,palette="Paired",hue="stroke")

plt.subplot(1,3,2)
sns.countplot(x="hypertension",data=data,palette="crest",hue='stroke')

plt.subplot(1,3,3)
sns.countplot(x="heart_disease",data=data,palette="tab10",hue='stroke')

In [None]:
plt.figure(figsize=(20,10))

# Recent seaborn versions require the column to be passed as the x keyword
plt.subplot(2,3,1)
sns.countplot(x="gender",data=data,palette="mako",hue='stroke')

plt.subplot(2,3,2)
sns.countplot(x="work_type",data=data,palette="rocket_r",hue='stroke')

plt.subplot(2,3,3)
sns.countplot(x="Residence_type",data=data,palette="autumn",hue='stroke')

In [None]:
import plotly.express as px
fig = px.scatter(data, x="bmi", y="avg_glucose_level", color="stroke",color_continuous_scale="tropic")
fig.show()

**The imbalance in the dataset makes this plot hard to analyze, but we can see that:**
- **A higher glucose level (150-250) is associated with a higher chance of stroke.**
- **A BMI of 30-40 combined with a high glucose level carries a higher risk of stroke.**

<div id="DataPre"></div>

<h1 id="preprocessing" style="font-size:20px; color:White; background-color:#5dd466 ;text-align:center;padding-top:10px;padding-bottom:10px"><strong>DATA PREPROCESSING</strong><a class="anchor-link" href="https://www.kaggle.com/abtabm/indepth-stroke-analysis-eda-smote-91-acc/#preprocessing"></a></h1>

In [None]:
data_X = data.drop("stroke",axis=1)
data_y = data["stroke"]  # select the target by name rather than by position

### **Checking for unique values in the dataset**

In [None]:
cat_col = data_X.select_dtypes(include = 'object').columns.to_list()
num_col = data_X.select_dtypes(include= ['int64','float64']).columns.to_list()

for i in num_col:
    print(i + ": ",data_X[i].nunique())
    print("------------------------------------------------------------------")

In [None]:
data[data_X["age"]==0.32]

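Converting the age column to int would lose the infants' data: 0.32 would be truncated to 0, although it plausibly encodes a real age (for example 32 days, or about four months if the unit is years).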
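A quick way to see the loss is to cast a few fractional ages to integers. The values below are invented, and reading them as fractions of a year is an assumption made for this sketch, not something the dataset documents here.

```python
import pandas as pd

# Invented sample of ages; fractional values are assumed to be years.
ages = pd.Series([0.08, 0.32, 1.8, 14.0, 67.0], name="age")

as_int = ages.astype(int)                         # truncates toward zero
approx_months = (ages * 12).round().astype(int)   # same ages expressed in months

print(pd.DataFrame({"age": ages, "as_int": as_int, "approx_months": approx_months}))
```

Both infants collapse to age 0 after the cast, which is why the column is kept as a float.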

### **Encoding Data**
<div id="enc"></div>

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
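# One-hot encode gender (0), work_type (5) and smoking_status (9); the other columns pass through unchanged
cat_encode = ColumnTransformer([('encoder', OneHotEncoder(), [0,5,9])], remainder= 'passthrough')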
data_X = cat_encode.fit_transform(data_X)
data_X = pd.DataFrame(data_X)
data_X

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
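# After one-hot encoding, columns 15 and 16 should hold ever_married and Residence_type (both binary)
data_X[15] = encoder.fit_transform(data_X[15])
data_X[16] = encoder.fit_transform(data_X[16])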

In [None]:
data_X.shape

### **Scaling & Splitting The Data**
<div id="spl"></div>

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size= 0.2, random_state=42)

from sklearn.preprocessing import StandardScaler 
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### **Handling Imbalanced Data with SMOTE**
<div id="smt"></div>

In [None]:
plt.figure(figsize=(20,7))
plt.subplot(1,2,1)
ax=sns.countplot(x='stroke', data=pd.DataFrame(y_train), palette='viridis',hue="stroke",dodge=False)
plt.title("Stroke Count Before Oversampling")
# Annotate each bar with its count
for p in ax.patches:
    ax.annotate((p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

from imblearn.over_sampling import SMOTE
samp = SMOTE(random_state=3)
# Oversample the training set only; the test set keeps its original class balance
X_train, y_train = samp.fit_resample(X_train, y_train.to_numpy())
    
plt.subplot(1,2,2)
ax=sns.countplot(x='stroke', data=pd.DataFrame(y_train,columns=["stroke"]), palette='viridis',hue="stroke",dodge=False)
plt.title("Stroke Count After Oversampling")
for p in ax.patches:
    ax.annotate((p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
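SMOTE creates new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The NumPy sketch below illustrates that idea on invented 2-D points; it is a simplified stand-in, not the `imblearn` implementation, and the helper name `smote_like_samples` is hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))             # pick a random minority point
        nb = rng.choice(neighbours[base])           # pick one of its k nearest neighbours
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Invented minority-class points with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(X_min, n_new=5)
print(new_points.shape)  # (5, 2)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.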

<div id="model"></div>

<h1 id="modelling" style="font-size:20px; color:White; background-color:#5dd466 ;text-align:center;padding-top:10px;padding-bottom:10px"><strong>MODELLING</strong><a class="anchor-link" href="https://www.kaggle.com/abtabm/indepth-stroke-analysis-eda-smote-91-acc/#modelling">¶</a></h1>

In [None]:
# plot_roc_curve and plot_precision_recall_curve were removed from scikit-learn; the Display classes replace them
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, classification_report, roc_curve, RocCurveDisplay, auc, average_precision_score, precision_recall_curve, PrecisionRecallDisplay
from sklearn.model_selection import cross_val_score

In [None]:
def metrics(model,x,y_test,y_pred):
    # Cross-validate the model that was passed in (previously model_1 was always used)
    cv = cross_val_score(model,X_train,y_train, cv = 5) 
    roc = roc_auc_score(y_test, y_pred)  
    precision = precision_score(y_test, y_pred)  
    recall = recall_score(y_test, y_pred)  
    f1 = f1_score(y_test, y_pred)
    print(classification_report(y_test, y_pred))
    print("\nCV Accuracy (mean of 5 folds): {:.2f}".format(cv.mean()))
    print("\nAccuracy Score: ",accuracy_score(y_test, y_pred))
    print("\nROC AUC Score: {:.2f}".format(roc))
    print("\nPrecision: {:.2f}".format(precision))
    print("\nRecall: {:.2f}".format(recall))
    print("\nF1: {:.2f}".format(f1))
    f, axes = plt.subplots(1,2, figsize=(20,7))
    sns.set_theme(style = 'white')
    #-------------------------------------CONFUSION MATRIX----------------------------------
    
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, cmap = 'Blues_r', annot = True, fmt = 'd', linewidths = 5, cbar = False, annot_kws = {'fontsize': 15},ax=axes[0] ,yticklabels = ['0', '1'], xticklabels = ['Predicted 0', 'Predicted 1'])
    
    #-------------------------------------ROC_AUC Curve----------------------------------
    
    RocCurveDisplay.from_estimator(model, x, y_test,ax=axes[1]) 
    # Diagonal reference line for a random classifier
    axes[1].plot([0, 1], [0, 1], linestyle = '--', color = '#b01717')
    plt.show()        

### **Logistic Regression**
<div id="LR"></div>

In [None]:
from sklearn.linear_model import LogisticRegression
model_1 = LogisticRegression(random_state=42)
model_1.fit(X_train,y_train)
y_pred = model_1.predict(X_test)
metrics(model_1,X_test,y_test,y_pred)

### **Support Vector Machine**
<div id="SVM"></div>

In [None]:
from sklearn.svm import SVC
model_2 = SVC(random_state=42)
model_2.fit(X_train,y_train)
y_pred = model_2.predict(X_test)
metrics(model_2,X_test,y_test,y_pred)

### **Decision Tree**
<div id="DT"></div>

In [None]:
from sklearn.tree import DecisionTreeClassifier
model_3 = DecisionTreeClassifier(random_state=42)
model_3.fit(X_train,y_train)
y_pred = model_3.predict(X_test)
metrics(model_3,X_test,y_test,y_pred)

### **Random Forest**
<div id="RF"></div>

In [None]:
from sklearn.ensemble import RandomForestClassifier
model_4 = RandomForestClassifier(random_state=42)
model_4.fit(X_train,y_train)
y_pred = model_4.predict(X_test)
metrics(model_4,X_test,y_test,y_pred)

### **XGBoost**
<div id="XG"></div>

In [None]:
from xgboost import XGBClassifier
model_5 = XGBClassifier(random_state=42,eval_metric="error")
model_5.fit(X_train,y_train)
y_pred = model_5.predict(X_test)
metrics(model_5,X_test,y_test,y_pred)

<h1 id="conclusion" style="font-size:20px; color:White; background-color:#5dd466 ;text-align:center;padding-top:10px;padding-bottom:10px"><strong>CONCLUSION</strong><a class="anchor-link" href="https://www.kaggle.com/abtabm/indepth-stroke-analysis-eda-smote-91-acc/#conclusion">¶</a></h1>

<div id="concl"></div>

- A higher **recall** and **F1-score** are required, but the dataset contains few positive (stroke) cases.
- The AUC scores of **Random Forest** and **Logistic Regression** are high.
- Moreover, the **XGBoost** and **Random Forest** models yield more true positives.
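The first point above can be made concrete: with so few stroke cases, accuracy alone can look good for a model that never predicts a stroke. The pure-Python sketch below uses invented labels with a 5% positive rate (an illustrative assumption, not the dataset's exact figure) to show the effect.

```python
# Invented labels: 95 negatives and 5 positives (illustrative, not the real data).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)   # share of actual strokes that were caught

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.95, recall=0.00
```

This is why recall and F1 on the stroke class say more about these models than the headline accuracy does.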

#### I am open to advice and suggestions for improving this notebook, and I am curious to know which metric deserves the most focus for this problem.