**About the Dataset**
* Total of 5110 observations and 12 attributes
* Numeric continuous variables - id, age, avg_glucose_level, bmi
* Categorical variables - gender, hypertension, heart_disease, ever_married, work_type, Residence_type, smoking_status


**Attribute Information**
* id: unique identifier
* gender: patient's gender
* age: age of the patient
* hypertension: if patient suffers from hypertension (0 - No, 1 - Yes)
* heart_disease: if patient suffers from heart disease (0 - No, 1 - Yes)
* ever_married: if patient has ever been married (No, Yes)
* work_type: patient's job type
* Residence_type: patient's residential location
* avg_glucose_level: patient's average blood glucose level
* bmi: patient's body mass index
* smoking_status: patient's past and present smoking status
* stroke: if patient suffers from stroke (0 - No, 1 - Yes)


**Contents**
* Importing libraries and data
* A peek into the data
* Checking for missing values
* EDA on categorical and numeric columns
* Tailoring dataframe for building model
* Comparing multiple classifiers
* Prediction and evaluation

**Aim**
* Develop a model to predict stroke based on numeric and categorical variables

# Importing required libraries and packages

In [None]:
# -- Importing required libraries and packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score
import missingno as msno
%matplotlib inline

# Importing data | Skimming data | Missing values

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.head(3)

In [None]:
# -- getting a feel for the dataset
df.info()

# -- bmi has a few missing values

df.describe()

In [None]:
# -- dropping 'id' column
df.drop(columns='id', inplace=True)

In [None]:
# -- looking for missing values in this dataset

ax = sns.heatmap(df.isna(), yticklabels=False)
ax.set_title(label="(white bars are missing values)")

print('Shape of the dataset - {}'.format(df.shape))

# -- The missing values can be attended to later on

In [None]:
# -- how are the missing values distributed 
msno.matrix(df)

# Handling missing values

In [None]:
# -- Handling missing bmi values using KNNImputer from sklearn
imputer = KNNImputer(n_neighbors=2)

# -- Selecting numerical columns for imputing missing bmi values
df_num = df.select_dtypes(exclude=['object'])
array_ = imputer.fit_transform(df_num)

# -- adding the column names back
colx = df_num.columns
df_num_nn = pd.DataFrame(array_, columns=colx)

df_obj = df.select_dtypes(exclude=['int64','float64'])

# -- concatenating the imputed numeric and object columns 
df = pd.concat([df_obj,df_num_nn], axis=1)

# -- df is new dataframe without missing values 
df.info() , df.describe()
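
To make the imputation step above more concrete, here is a minimal sketch on a made-up array (the values are illustrative and not taken from the dataset). With `n_neighbors=2`, a missing entry is filled with the mean of that feature over the two rows that are closest in the features that are present. Note that `df_num` above still contains the `stroke` column, so the target takes part in the neighbour search; dropping it before imputing may be worth considering.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): columns are age, avg_glucose_level, bmi
toy = np.array([
    [30.0,  80.0, 22.0],
    [32.0,  82.0, 24.0],
    [70.0, 200.0, 35.0],
    [31.0,  81.0, np.nan],   # bmi is missing in this row
])

toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy)

# Rows 0 and 1 are the nearest neighbours by age and glucose level,
# so the missing bmi is replaced by the mean of their bmi values
print(toy_filled[3, 2])
```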

In [None]:
# -- Having a look at the gender column
df.gender.value_counts()

# -- row with 'Other' can be dropped

In [None]:
df.drop(df[df['gender'] == 'Other'].index, inplace = True)
df.gender.value_counts()

# EDA on how Stroke relates to different parameters

**Stroke cases in relation to continuous numeric variables**

In [None]:
# -- total no. of strokes in this dataset
sns.set_theme()
sns.set_palette('husl')

df.stroke.value_counts().plot(kind='bar',label='Stroke-cases Count',figsize=(12,7))
plt.legend()

# -- This is highly imbalanced and needs to be evened out

In [None]:
# -- Does age have anything to do with stroke cases?

plt.figure(figsize=(12,5))
sns.histplot(x='age', data=df, hue='stroke', bins=40)
# -- We can observe that stroke risk increases with age

In [None]:
# -- Stroke cases distribution with average sugar level

plt.figure(figsize=(12,5))
g = sns.histplot(x='avg_glucose_level', data=df, hue='stroke', bins=50, palette='husl')
g.set_title('Stroke vs Avg. Blood Glucose Level')
# -- The distribution shows two small peaks, one at about 75 and one at ~200. Few stroke cases fall between 125 and 175

In [None]:
# -- How do stroke cases vary with Body Mass Indices?
plt.figure(figsize=(12,5))
g = sns.histplot(x='bmi', bins=40, data=df, hue='stroke', alpha=0.4)
g.set_title('stroke vs BMI')
# -- Stroke counts peak where the no-stroke counts do, between BMI values of 20 and 40.
# -- A few outliers appear at BMI values above 60

**Stroke cases in relation to categorical variables**

In [None]:
fig,ax = plt.subplots(4,2, figsize=(17,15))

sns.countplot(ax=ax[0,0],data=df,x='gender', hue='stroke', palette='Dark2')
sns.countplot(ax=ax[0,1],data=df,x='hypertension', hue='stroke', palette='Dark2')
sns.countplot(ax=ax[1,0],data=df,x='heart_disease', hue='stroke', palette='Dark2')
sns.countplot(ax=ax[1,1],data=df,x='ever_married', hue='stroke', palette='Dark2')
sns.countplot(ax=ax[2,0],data=df,x='work_type', hue='stroke', palette='Dark2')
sns.countplot(ax=ax[2,1],data=df,x='Residence_type', hue='stroke', palette='Dark2')
sns.countplot(ax=ax[3,0],data=df,x='smoking_status', hue='stroke', palette='Dark2')
fig.tight_layout()

In [None]:
cat_list=['gender','hypertension','heart_disease','ever_married','work_type','Residence_type','smoking_status']
l = []

for i in cat_list:
    j = df.groupby(i)['stroke'].mean() * 100
    l.append(j)

l = [pd.DataFrame(l[i]) for i in range(len(l))]

sns.set_palette('Set2')
fig, ax = plt.subplots(4,2, figsize=(17,15))
st = fig.suptitle("% STROKE CASES FOR EACH OF THE CATEGORIES", fontsize="x-large")
st.set_y(0.92)

g0=sns.barplot(x=l[0].index, data=l[0], y='stroke', ax=ax[0,0])
g1=sns.barplot(x=l[1].index, data=l[1], y='stroke', ax=ax[0,1])
g2=sns.barplot(x=l[2].index, data=l[2], y='stroke', ax=ax[1,0])
g3=sns.barplot(x=l[3].index, data=l[3], y='stroke', ax=ax[1,1])
g4=sns.barplot(x=l[4].index, data=l[4], y='stroke', ax=ax[2,0])
g5=sns.barplot(x=l[5].index, data=l[5], y='stroke', ax=ax[2,1])
g6=sns.barplot(x=l[6].index, data=l[6], y='stroke', ax=ax[3,0])

# -- Gender and residence type make little difference to the stroke rate
# -- Groups with hypertension, heart disease, or a 'Yes' for ever_married show a higher stroke rate
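
The percentages plotted above rely on the fact that the mean of a 0/1 column is the share of ones. A minimal sketch on a hypothetical toy frame (not the stroke data) shows the same `groupby(...).mean() * 100` pattern in isolation:

```python
import pandas as pd

# Toy frame (hypothetical values): four patients per hypertension group
toy = pd.DataFrame({
    'hypertension': [0, 0, 0, 0, 1, 1, 1, 1],
    'stroke':       [0, 0, 0, 1, 0, 1, 1, 0],
})

# Mean of a 0/1 column within each group = fraction of stroke cases in that group
rate = toy.groupby('hypertension')['stroke'].mean() * 100
print(rate)
```

A count plot answers "how many cases are there", while this rate answers "how likely is a case within the group", which is the fairer comparison when the groups differ a lot in size.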


In [None]:
# -- Stroke risks among married and unmarried individuals grouped by gender
# -- mean() of the 0/1 stroke column gives the stroke rate; count() would only give the group size
gp = df.groupby(['gender','ever_married'], as_index=False)['stroke'].mean()
plt.figure(figsize=(12,5))
g = sns.barplot(x='ever_married', data=gp, y='stroke', hue='gender')

In [None]:
# -- Smoking status based on occupation

gp2 = df.groupby(['work_type','smoking_status'], as_index=False)['stroke'].count()
plt.figure(figsize=(12,5))
sns.barplot(x='work_type', data=gp2, y='stroke', hue='smoking_status')

In [None]:
# -- Categorical variables' distribution along age
fig, (ax0, ax1,ax2, ax3) = plt.subplots(4,figsize=(11,12))
sns.kdeplot(x='age', data=df, hue='work_type', palette='Dark2', fill=True, ax=ax0)
sns.kdeplot(x='age', data=df, hue='smoking_status', palette='Dark2', fill=True, ax=ax1)
sns.kdeplot(x='age', data=df, hue='hypertension', palette='Dark2', fill=True, ax=ax2)
sns.kdeplot(x='age', data=df, hue='heart_disease', palette='Dark2', fill=True, ax=ax3)
fig.tight_layout()

In [None]:
# -- Avg. glucose levels and BMI relations with age
plt.figure(figsize=(8, 5), dpi=80)
sns.scatterplot(x='age',y='avg_glucose_level',data=df,hue='stroke',s=6,marker='o',palette='Dark2')


In [None]:
plt.figure(figsize=(8, 5), dpi=80)
sns.scatterplot(x='age',y='bmi',data=df,hue='stroke',s=6,marker='o',palette='Dark2')

In [None]:
# -- correlation between numerical variables
sns.heatmap(df[['age','avg_glucose_level','bmi']].corr(), annot=True)

# -- no strong correlations seen here

# Modifying dataframe for building model

In [None]:
# -- Using dummy variables showed better accuracy score as opposed to Label Encoder
# -- converting work_type, smoking_status, gender, ever_married and Residence_type to dummy variables
# -- hypertension and heart_disease are already coded as 0 and 1, so they are left as they are

WT_dummy = pd.get_dummies(df['work_type'],prefix_sep='_',prefix='WT',drop_first=True)
SS_dummy = pd.get_dummies(df['smoking_status'],prefix_sep='_',prefix='SS',drop_first=True)
G_dummy = pd.get_dummies(df['gender'],prefix_sep='_',prefix='G',drop_first=True)
M_dummy = pd.get_dummies(df['ever_married'],prefix_sep='_',prefix='M',drop_first=True)
Res_dummy = pd.get_dummies(df['Residence_type'],prefix_sep='_',prefix='Res',drop_first=True)
df = pd.concat([df,WT_dummy,SS_dummy,G_dummy,M_dummy,Res_dummy], axis=1)

# -- dropping original categorical columns as they've been converted to dummy columns

df.drop(columns=['work_type','smoking_status','gender','ever_married','Residence_type'], inplace=True)
df['hypertension']=df.hypertension.astype('int32')
df['heart_disease']=df.heart_disease.astype('int32')
df.head()
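
As a small illustration of what `drop_first=True` does in the cell above, the sketch below encodes a hypothetical four-row column. The first category in sorted order is dropped and becomes the implicit baseline, encoded as a row with all dummy columns off.

```python
import pandas as pd

# Toy column (hypothetical rows) with three smoking categories
toy = pd.DataFrame({'smoking_status': ['smokes', 'never smoked', 'formerly smoked', 'smokes']})

dummies = pd.get_dummies(toy['smoking_status'], prefix='SS', drop_first=True)

# 'formerly smoked' sorts first, so it is dropped and serves as the baseline
print(list(dummies.columns))
print(dummies)
```

Dropping one level avoids perfectly collinear dummy columns, which matters most for linear models such as logistic regression.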

**Balancing 'stroke' column using SMOTE oversampling**

In [None]:
df.stroke.value_counts()

# -- Given the high imbalance in this dataset we carry out some tailoring 

In [None]:
# -- Balancing can be carried out by either undersampling or oversampling
from imblearn.over_sampling import SMOTE

samp = SMOTE()
X = df.drop(columns='stroke')
y = df[['stroke']]
X,y = samp.fit_resample(X,y)
y.value_counts()

In [None]:
# -- joining back the target and predictors into a new DF -> DF2
df2 = pd.concat([X,y], axis=1)

df2.info(), df2.head()
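
One caveat on the balancing step: it runs before the train/test split, so the test set later contains synthetic rows and no longer reflects the original class ratio. A possible alternative, sketched below under stated assumptions, is to split first and oversample only the training part. The data here is synthetic, and plain random oversampling via `sklearn.utils.resample` stands in for SMOTE to keep the sketch dependency-free; `SMOTE().fit_resample` applied to the training split would slot into the same place.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced toy data (hypothetical; not the stroke dataset)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({'age': rng.uniform(20, 80, 200), 'bmi': rng.uniform(18, 40, 200)})
toy_y = pd.Series([1] * 20 + [0] * 180, name='stroke')

# 1) Split first; stratify so both parts keep the original class ratio
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42)

# 2) Oversample the minority class in the training part only
train = pd.concat([Xtr, ytr], axis=1)
minority = train[train['stroke'] == 1]
majority = train[train['stroke'] == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_bal = pd.concat([majority, minority_up])

print(train_bal['stroke'].value_counts())  # balanced training data
print(yte.value_counts())                  # test data keeps the original imbalance
```

With this ordering, the reported test metrics describe performance on data that looks like the original population.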

# Prediction using Logistic Regression

In [None]:
# Importing required prediction libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
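from sklearn.metrics import (roc_auc_score, confusion_matrix, precision_score, recall_score,
                             f1_score, accuracy_score, classification_report, roc_curve, auc)
# -- plot_roc_curve was removed in scikit-learn 1.2; alias its replacement on newer versions
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator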


In [None]:
X = df2.drop(columns='stroke')
y = df2.stroke

# -- Splitting dataset into testing and training sub-sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
type(y_train)

In [None]:
# -- Training the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
# -- Prediction
pred = logreg.predict(X_test)

# Checking accuracy of predictions
accu_ = accuracy_score(y_test, pred)
print('Accuracy score for Logistic Regression is: {:.3f}'.format(accu_))
print(f"The ROC_AUC score for Logistic Regression model is {roc_auc_score(y_test, pred)}")
print(f"The Precision score for Logistic Regression model is {precision_score(y_test, pred)}")
print(f"The recall score for Logistic Regression model is {recall_score(y_test, pred)}")
print(f"The f1 score for Logistic Regression model is {f1_score(y_test, pred)}")
print(f"The confusion matrix for Logistic Regression model is \n {confusion_matrix(y_test, pred)}")
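
A note on the ROC AUC figure: the cell above passes hard 0/1 predictions to `roc_auc_score`, which evaluates the model at a single threshold. The metric is usually computed from continuous scores such as `predict_proba(X_test)[:, 1]`, so that the ranking of all test rows is taken into account. The sketch below uses hypothetical labels and scores to show that the two inputs can give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for six test rows
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.45, 0.4, 0.7, 0.9])

# Hard predictions at the default 0.5 threshold
labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, labels)  # uses one threshold only

print(auc_from_scores, auc_from_labels)
```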

In [None]:
# -- ROC curve for Logistic Regression
plot_roc_curve(logreg, X_test, y_test)
plt.show()

# Can we use any other classifier for better accuracy and roc_auc values?

**Here we try the following classification models and see how they fare against Logistic Regression**
* KNeighborsClassifier
* RandomForestClassifier
* SVC

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
model_list = [LogisticRegression, KNeighborsClassifier, RandomForestClassifier, SVC]

for i in model_list:
    model = i()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc_ = accuracy_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, pred)
    # -- i.__name__ prints the bare class name instead of "<class '...'>"
    print(f"The accuracy score for {i.__name__} model is: {acc_}")
    print(f"The ROC_AUC score for {i.__name__} model is: {roc_auc}")
    print(f"The Precision score for {i.__name__} model is: {precision_score(y_test, pred)}")
    print(f"The recall score for {i.__name__} model is: {recall_score(y_test, pred)}")
    print(f"The f1 score for {i.__name__} model is: {f1_score(y_test, pred)}")
    print(f"The Confusion Matrix for {i.__name__} model is :\n {confusion_matrix(y_test, pred)}")
    plot_roc_curve(model,X_test,y_test)
    print('\n')