<h1> <u>Stroke prediction</u></h1>

<img src="img.jpg" width="60%">

# Problem statement
Stroke is sometimes called a brain attack or a cerebrovascular accident (CVA). It is much like a heart attack, only it occurs in the brain.<br>

It occurs when the supply of blood to the brain is reduced or blocked completely, which prevents brain tissue from getting oxygen and nutrients.<br>

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.<br>
Early identification of stroke can help doctors give the patient the necessary medication.


## Machine Learning problem
Predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.<br>
<b>Type</b> : Supervised Learning<br>
<b>Task</b> : Binary classification<br>
<b>Performance metric</b> : F1 score (because the classes are imbalanced)<br>
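
To illustrate why F1 is preferred over accuracy here, the sketch below scores a classifier that always predicts the majority class. The labels are made-up toy data (95 negatives, 5 positives), not the stroke dataset, and the variable names are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 95 negatives and 5 positives (made-up numbers, not from the dataset)
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 reports the undefined precision as 0 instead of emitting a warning
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy = {accuracy:.2f}")
print(f"F1       = {f1:.2f}")
```

Accuracy looks high even though no positive case is found, while F1 drops to zero.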

## About Dataset
Source : https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

### Attribute Information
<b>id</b>: unique identifier<br>
<b>gender</b>: Male, Female or Other<br>
<b>age</b>: age of the patient<br>
<b>hypertension</b>: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension<br>
<b>heart_disease</b>: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease<br>
<b>ever_married</b>: No or Yes<br>
<b>work_type</b>: children, Govt_job, Never_worked, Private or Self-employed<br>
<b>Residence_type</b>: Rural or Urban<br>
<b>avg_glucose_level</b>: average glucose level in blood<br>
<b>bmi</b>: body mass index<br>
<b>smoking_status</b>: formerly smoked, never smoked, smokes or Unknown<br>
<b>stroke</b>: 1 if the patient had a stroke or 0 if not (target)<br>

# Libraries
Importing all the necessary Python modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import f1_score, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

import pickle
import warnings
warnings.filterwarnings('ignore')

# settings
sns.set_style('whitegrid')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_columns', None)

### K-Fold Cross-Validation
Step 1: Randomly divide a dataset into k groups, or "folds", of roughly equal size.<br>
Step 2: Choose one of the folds to be the holdout set. Fit the model on the remaining k-1 folds.<br>
Step 3: Calculate the test F1-score on the observations in the fold that was held out.<br>
Step 4: Repeat this process k times, using a different set each time as the holdout set.<br>
Step 5: Calculate the average of the k test F1-scores to get the overall test F1-score.
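
The five steps above can also be sketched with scikit-learn's `cross_val_score`. This is a minimal illustration on synthetic data from `make_classification`, not the stroke dataset; `X_demo`, `y_demo` and the model settings are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 90% / 10%); not the stroke dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.9, 0.1], random_state=1)

# Steps 1-2: k stratified folds, each used once as the holdout set
folds = StratifiedKFold(n_splits=5)

# Steps 3-4: fit on k-1 folds and compute F1 on the held-out fold, k times
scores = cross_val_score(LogisticRegression(class_weight='balanced', max_iter=1000),
                         X_demo, y_demo, cv=folds, scoring='f1')

# Step 5: average the k fold scores
print("Fold F1 scores:", scores.round(3))
print("Mean F1:", scores.mean().round(3))
```

A hand-written loop is still useful when out-of-fold predictions need to be kept, which `cross_val_score` does not return.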

In [None]:
# The function below implements the steps above.
def run_kfold(model, X_train, y_train, N_SPLITS = 10):
    f1_list = []
    oofs = np.zeros(len(X_train))
    folds = StratifiedKFold(n_splits=N_SPLITS)
    for i, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
        
        print(f'\n------------- Fold {i + 1} -------------')
        X_trn, y_trn = X_train.iloc[trn_idx], y_train.iloc[trn_idx]
        X_val, y_val = X_train.iloc[val_idx], y_train.iloc[val_idx]
        
        model.fit(X_trn, y_trn)
        # Instead of directly predicting the classes we will obtain the probability of positive class.
        preds_val = model.predict_proba(X_val)[:,1]
        
        fold_f1 = f1_score(y_val, preds_val.round())
        f1_list.append(fold_f1)
        
        print(f'\nf1 score for validation set is {fold_f1}') 
        
        oofs[val_idx] = preds_val
        
    mean_f1 = sum(f1_list)/N_SPLITS
    print("\nMean validation f1 score :", mean_f1)
    
    oofs_score = f1_score(y_train, oofs.round())
    print(f'\nF1 score for oofs is {oofs_score}')
    return oofs

# Data preprocessing

In [None]:
# Load data into memory
data = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

print("No of columns in the data : ", len(data.columns))
print("No of rows in the data : ", len(data))

In [None]:
# random sample of data
data.sample(5)

In [None]:
# statistical summary of the data
data.describe()

In [None]:
# null values
data.isna().sum().to_frame(name="Null count")

## Variable separation

In [None]:
# features
features = ['gender', 'age', 'hypertension', 'heart_disease',
            'ever_married','work_type','Residence_type','avg_glucose_level',
            'bmi','smoking_status']

#target
target = 'stroke'

numerical_features = ['age', 'avg_glucose_level', 'bmi']

categorical_features = ['gender', 'hypertension', 'heart_disease',
                        'ever_married', 'work_type', 'Residence_type', 
                        'smoking_status']

In [None]:
# Replace the 'Other' label in gender with 'Female'
# (done before the category conversion so no unused 'Other' category remains)
data['gender'] = data['gender'].replace({'Other': 'Female'})

# Convert features to the required dtypes
data[numerical_features] = data[numerical_features].astype(np.float64)
data[categorical_features] = data[categorical_features].astype('category')

# Remove id column
data.drop('id', axis=1, inplace=True)

In [None]:
# data types
data[features+[target]].dtypes.to_frame(name="Data type")

## Train Test Split
- Dividing the total dataset into training and testing sets
- For Training 75% of data
- For Testing 25% of data
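
Because the target is imbalanced, a stratified split keeps the share of positive cases about the same in both parts. A minimal sketch on a made-up frame with a 10% positive rate (the `toy` data and its column `x` are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 10% positive rate (made-up data, not the stroke dataset)
toy = pd.DataFrame({'x': range(200), 'stroke': [1] * 20 + [0] * 180})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=1,
                                       stratify=toy['stroke'])

# Both parts should keep a positive rate close to 10%
print(len(toy_train), len(toy_test))
print(toy_train['stroke'].mean(), toy_test['stroke'].mean())
```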

In [None]:
train, test = train_test_split(data, random_state=1,
                               test_size=0.25,
                               stratify=data.stroke)

print("No. of data points in training set : ", len(train))
print("No. of data points in testing set : ", len(test))

## Fill Missing values
Using the K-nearest neighbors, computed on the numerical features, to fill the missing values in bmi
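
As a small illustration of how this kind of imputation behaves, the sketch below fills one missing bmi value in a made-up array. The rows and `n_neighbors=2` are assumptions chosen so the result is easy to follow by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy rows of (age, avg_glucose_level, bmi); values are made up for illustration
X_toy = np.array([
    [30.0,  90.0, 22.0],
    [32.0,  95.0, 24.0],
    [70.0, 200.0, 35.0],
    [31.0,  92.0, np.nan],   # bmi missing
])

# With n_neighbors=2 the gap is filled with the mean bmi of the two rows
# that are closest on the observed columns (age and glucose level)
toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X_toy)

print(X_filled[3])
```

The last row is closest to the first two rows, so its bmi becomes the mean of 22 and 24.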

In [None]:
imputer = KNNImputer(n_neighbors = 5)

train[numerical_features] = imputer.fit_transform(train[numerical_features])
test[numerical_features] = imputer.transform(test[numerical_features])

# Exploratory Data Analysis
Exploratory data analysis (EDA) is used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

Performing EDA on the training set only (best practice to avoid leaking information from the test set)

## Univariate analysis
- Univariate analysis refers to the analysis of one variable.
- The purpose of univariate analysis is to understand the distribution of values for a single variable.

### A. Target distribution

In [None]:
fig, axes = plt.subplots(ncols=2,figsize=(12, 4))
train[target].value_counts(normalize=True).plot \
.bar(width=0.2, color=('red','green'), ax=axes[0], title="Train")

test[target].value_counts(normalize=True).plot \
.bar(width=0.2, color=('red','green'), ax=axes[1], title="Test")

plt.tight_layout()
plt.show()

### B. Histogram

In [None]:
fig, axes = plt.subplots(nrows=3,figsize=(8, 8))
for i, c in enumerate(numerical_features):
    hist = train[c].plot(kind = 'hist', ax=axes[i], 
                         title=c, color='blue', bins=30)
plt.tight_layout()
plt.show()

### Boxplot (Outliers)
An outlier is a data point that differs significantly from other observations.
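
One common way to make "differs significantly" concrete is the 1.5 × IQR rule, which is also what a default box plot uses to decide which points to draw individually. A minimal sketch on made-up readings (the `values` array is an assumption for illustration):

```python
import numpy as np

# Made-up glucose readings with one extreme value
values = np.array([80, 85, 90, 92, 95, 100, 105, 110, 250], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points further than 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```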

In [None]:
fig, axes = plt.subplots(nrows=3, figsize=(8, 7))
for i, c in enumerate(numerical_features):
    box = train[c].plot(kind = 'box', ax=axes[i],
                        vert=False, color='blue')
plt.tight_layout()
plt.show()

### KDE Plot

In [None]:
fig, axes = plt.subplots(nrows=3, figsize=(8, 7))
for i, c in enumerate(numerical_features):
    plot = sns.kdeplot(data=train, x=c, ax=axes[i],
                       fill=True, color='blue')
plt.tight_layout()
plt.show()

### Pie-Charts
Percentage of labels in categorical features

In [None]:
fig, axes = plt.subplots(4, 2, figsize=(12,16))
axes = [ax for axes_row in axes for ax in axes_row]

for i,c in enumerate(categorical_features):
    train[c].value_counts() \
    .plot(kind='pie', ax=axes[i], title=c, autopct="%.2f", fontsize=14)
    axes[i].set_ylabel('')
plt.tight_layout()
plt.show()

## Bivariate analysis
It involves the analysis of two variables, for the purpose of determining the empirical relationship between them.

We perform bivariate analysis of features with respect to target.

### Box plots

In [None]:
fig, axes = plt.subplots(nrows=3, figsize=(8, 8))
for i, c in enumerate(numerical_features): 
    plot = sns.boxplot(x=train[target], y=train[c], ax=axes[i])
    axes[i].set_ylabel(c, fontsize=13)
    axes[i].set_xlabel(target, fontsize=13)
plt.tight_layout()
plt.show()

### Target vs Mean of Numerical features

In [None]:
fig, axes = plt.subplots(ncols=3, figsize=(20, 5))
for i, c in enumerate(numerical_features):
    train.groupby(target)[c].mean().plot(kind = 'bar', ax=axes[i], color=('red','green'))
    axes[i].set_ylabel(f'Mean_{c}', fontsize=14)
    axes[i].set_xlabel('stroke', fontsize=14)
plt.tight_layout()

### Target vs categorical features

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(20,10))
axes = [ax for axes_row in axes for ax in axes_row]

for i, c in enumerate(categorical_features):
    df = train[[c,target]].groupby(c).mean().reset_index()
    # Pass x and y as keywords; recent seaborn versions no longer accept them positionally
    sns.barplot(x=df[c], y=df[target], ax=axes[i])
    axes[i].set_ylabel('Target mean', fontsize=14)
    axes[i].set_xlabel(c, fontsize=14)
    
plt.tight_layout()
plt.show()

# Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of machine learning algorithms.
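
As one illustration of deriving a feature from raw data, a numeric column can be binned into named groups with `pd.cut`. The bin edges and labels below are illustrative assumptions, not necessarily the exact boundaries used in this notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 16, 45, 72], name='age')

# Bin edges are illustrative assumptions; with right=False each interval
# is closed on the left, e.g. [13, 20)
age_bins = [0, 13, 20, 61, np.inf]
age_labels = ['Child', 'Teenager', 'Adult', 'Elder']

age_groups = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(age_groups.tolist())
```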

## New Features with Age and bmi

In [None]:
def age_group(x):
    # Boundary ages (13 and 20) go to the older group instead of falling through to "Elder"
    if x < 13: return "Child"
    elif x < 20: return "Teenager"
    elif x <= 60: return "Adult"
    else: return "Elder"
    
train["age_group"] = train.age.apply(age_group)
test['age_group'] = test.age.apply(age_group)

def bmi_group(x):
    # Boundary values (18.5, 25, 30) go to the higher group instead of falling through to "Obese"
    if x < 18.5: return "UnderWeight"
    elif x < 25: return "Healthy"
    elif x < 30: return "OverWeight"
    else: return "Obese"

train["bmi_group"] = train.bmi.apply(bmi_group)
test['bmi_group'] = test.bmi.apply(bmi_group)

## OneHot encoding
- Replaces each categorical column with binary (0/1) columns, one per category; the first category of each feature is dropped and acts as the baseline.
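
A minimal sketch of the idea on a made-up column, using `pandas.get_dummies` as a stand-in for the encoder (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'smoking_status': ['smokes', 'never smoked',
                                       'formerly smoked', 'smokes']})

# drop_first=True omits the first category (alphabetical order), which then acts
# as the baseline: a row with all zeros belongs to it
encoded = pd.get_dummies(toy, columns=['smoking_status'], drop_first=True, dtype=int)
print(encoded)
```

Here 'formerly smoked' is the dropped baseline, so the third row is encoded as all zeros.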

In [None]:
# add new features
categorical_features.extend(["age_group", "bmi_group"])

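# scikit-learn >= 1.2: `sparse` was renamed to `sparse_output`
# (on older versions, use sparse=False instead)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(train[categorical_features])

# get_feature_names was replaced by get_feature_names_out
cols = encoder.get_feature_names_out(categorical_features)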

train.loc[:, cols] = encoder.transform(train[categorical_features])
test.loc[:, cols] = encoder.transform(test[categorical_features])

# Drop categorical features
train.drop(categorical_features, axis=1, inplace=True)
test.drop(categorical_features, axis=1, inplace=True)

## Feature Scaling
Standardize the numerical features
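
Standardization rescales each feature to zero mean and unit variance, z = (x - mean) / std. The sketch below checks this formula against `StandardScaler` on made-up bmi values (`bmi_toy` is an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up bmi values for illustration
bmi_toy = np.array([[18.0], [24.0], [30.0], [36.0]])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual = (bmi_toy - bmi_toy.mean(axis=0)) / bmi_toy.std(axis=0)
scaled = StandardScaler().fit_transform(bmi_toy)

print(np.allclose(manual, scaled))
print(scaled.ravel().round(3))
```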

In [None]:
scaler = StandardScaler()
scaler.fit(train[numerical_features])

train.loc[:, numerical_features] = scaler.transform(train[numerical_features])
test.loc[:, numerical_features] = scaler.transform(test[numerical_features])

## Correlation
Correlation between features and target

In [None]:
# Correlation with Target

corr = train.corr()[target].sort_values(ascending=False).to_frame()
plt.figure(figsize=(2,8))
sns.heatmap(corr, cmap='Blues', cbar=False, annot=True)
plt.show()

## Preprocessed data

In [None]:
train.head()

In [None]:
# Inputs and Target 
X_train = train.drop(target, axis=1)
y_train = train[target]

X_test = test.drop(target, axis=1)
y_test = test[target]

# Machine Learning
## Decision tree classifier

In [None]:
# Base model
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
train_preds = clf.predict(X_train)
test_preds = clf.predict(X_test)
print("Train f1 Score :", f1_score(y_train, train_preds))
print("Test f1 Score :", f1_score(y_test, test_preds))

In [None]:
# Hyperparameter tuning
params = {
    'max_depth': [4, 6, 8, 10, 12, 14, 16, 20],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [5, 10, 20, 30, 40, 50],
    'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],  # fractions of features; the int 1 would mean a single feature
    'max_leaf_nodes': [8, 16, 32, 64, 128,256],
    'class_weight': [{0: 1, 1: 9}, {0: 1, 1: 4},
                     {0: 1, 1: 5}, {0: 1, 1: 6}, 
                     {0: 1, 1: 7}, {0: 1, 1: 8}]
}

clf = RandomizedSearchCV(DecisionTreeClassifier(random_state=1),
                         params,
                         scoring='f1',
                         verbose=1,
                         random_state=1,
                         cv=5,
                         n_iter=50)

search = clf.fit(X_train, y_train)

print("\nBest f1-score:",search.best_score_)
print("\nBest params:",search.best_params_)

In [None]:
# Cross validation
clf = DecisionTreeClassifier(random_state = 1,
                             **search.best_params_)
oofs = run_kfold(clf, X_train, y_train, N_SPLITS=5)

In [None]:
# Final Decision tree classifier
clf = DecisionTreeClassifier(random_state = 1, 
                             **search.best_params_)
clf.fit(X_train, y_train)

preds_test = clf.predict_proba(X_test)[:, 1]
    
cm = confusion_matrix(y_test,preds_test.round(),normalize='true')
plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, cmap='Blues', cbar=False,fmt='.2f')
plt.show()

## Logistic Regression

In [None]:
# Base model
clf = LogisticRegression(random_state=1, 
                         class_weight='balanced')

clf.fit(X_train, y_train)
train_preds = clf.predict(X_train)
test_preds = clf.predict(X_test)
print("Train f1 Score :", f1_score(y_train, train_preds))
print("Test f1 Score :", f1_score(y_test, test_preds))

In [None]:
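# Hyperparameter tuning
# Note: not every penalty/solver pair is valid (e.g. 'l1' with 'lbfgs');
# such candidates fail to fit and are scored as NaN by the search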
params = {
    'penalty': ['l1', 'l2','elasticnet'],
    'C':[0.0001, 0.001, 0.1, 1, 10, 100,1000],
    'fit_intercept':[True, False],
    'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

clf = RandomizedSearchCV(LogisticRegression(random_state=1,
                                            class_weight='balanced'),
                         params,
                         scoring='f1',
                         verbose=1,
                         random_state=1,
                         cv=5,
                         n_iter=20)

search = clf.fit(X_train, y_train)

print("\nBest f1-score:",search.best_score_)
print("\nBest params:",search.best_params_)

In [None]:
# Cross validation
clf = LogisticRegression(random_state = 1,
                         class_weight='balanced', 
                         **search.best_params_)
oofs = run_kfold(clf, X_train, y_train, N_SPLITS=5)

In [None]:
# Final Logistic regression

clf = LogisticRegression(random_state = 1,
                         class_weight='balanced',
                         **search.best_params_)
clf.fit(X_train, y_train)

preds_test = clf.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, preds_test.round(), normalize='true')
plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, cmap='Blues', cbar=False, fmt='.2f')
plt.show()

## Weights or coefficients learned by Logistic regression for each feature

In [None]:
imp = pd.DataFrame([X_train.columns, 
                    clf.coef_[0]]).T.sort_values(1, ascending=False).reset_index(drop=True)
imp.columns=['feature', 'coeff']
imp

# Save the transformers and the model
One hot encoder<br>
Standard scaler<br>
Logistic regression<br>
<b>Logistic regression gives the higher true positive rate, i.e. it performs better at flagging likely stroke cases, which is what we want.</b>
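
Objects saved with `pickle` can be restored later and should behave exactly like the originals. A minimal round-trip sketch, assuming a small stand-in scaler fitted on made-up numbers (`demo_scaler` and `restored` are hypothetical names) and using in-memory bytes rather than files:

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a small stand-in transformer (made-up numbers)
demo_scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

# Serialize to bytes and restore; with files, use pickle.dump / pickle.load instead
restored = pickle.loads(pickle.dumps(demo_scaler))

sample = np.array([[2.5]])
print(np.allclose(demo_scaler.transform(sample), restored.transform(sample)))
```

Pickled scikit-learn objects are generally meant to be loaded with the same library version that created them.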

In [None]:
with open("onehotencoder.pkl", 'wb') as f:
    pickle.dump(encoder, f)

with open("scaler.pkl", 'wb') as f:
    pickle.dump(scaler, f)

with open("model.pkl", 'wb') as f:
    pickle.dump(clf, f)

## Prediction on single data point

In [None]:
def predict(x):
    X = pd.DataFrame(x, columns=features)
    # converting numerical features as float dtype
    X.loc[:, numerical_features] = X.loc[:, numerical_features].astype('float64')
    # add new features
    X["age_group"] = X.age.apply(age_group)
    X["bmi_group"] = X.bmi.apply(bmi_group)
    
    # converting categorical features as category dtype
    X.loc[:, categorical_features] = X.loc[:, categorical_features].astype('category')
    # Categorical encoding
    cols = encoder.get_feature_names_out(categorical_features)

    X.loc[:, cols] = encoder.transform(X[categorical_features])

    # Drop categorical features
    X.drop(categorical_features, axis=1, inplace=True)

    # Feature scaling
    X.loc[:, numerical_features] = scaler.transform(X[numerical_features])
    return clf.predict(X)[0]

In [None]:
# Random data point
x = [['Male', 67.0, 0, 1, 'Yes', 'Private', 'Urban', 228.69, 36.6, 'formerly smoked']]
y_true = 1
print("y_true :", y_true)
y_pred = predict(x)
print("y_pred :",y_pred)

## Next steps

#### Building a web application for this problem!! Updating soon. 👍
## Thanks for reading!! Please upvote if you like it. 😀