# Logistic regression on stroke prediction dataset

Context:
* Given a dataset that contains several variables and a target variable indicating stroke or no stroke (coded as 1 and 0), I would like to build a model that predicts whether a person is likely to have a stroke.


There are a few steps that I will take to build the prediction model:
1. Data preprocessing -> fill blank values and convert the categorical values to numerical ones
2. Split the data into train and test sets
3. Build a logistic regression model -> train it on the training data
4. Predict on the test data
5. Review the accuracy


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#import library for numerical computation
import pandas as pd
import numpy as np
import pylab as pl

#import library for visualisation
%matplotlib inline 
import matplotlib.pyplot as plt

#import library for model
from sklearn.model_selection import train_test_split
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
#read the data
df = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.smoking_status.unique()

In [None]:
#print the summary
df.describe()

As can be seen from the summary above, the bmi column has missing values (stored as NaN), so we need to replace them with the average of the bmi column.
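
A minimal sketch of this imputation step on a made-up four-row frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from the stroke dataset). It counts the missing values first, then fills them with the mean of the observed values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: one bmi value is missing.
toy = pd.DataFrame({"bmi": [22.0, np.nan, 30.0, 26.0]})

# Count missing values per column before imputing.
print(toy.isna().sum())

# Mean of the observed values only (pandas skips NaN by default).
bmi_mean = toy["bmi"].mean()

# Fill the gap with that mean.
toy["bmi"] = toy["bmi"].fillna(bmi_mean)
print(toy)
```

Running `df.isna().sum()` on the real frame would show in the same way which columns need filling.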

In [None]:
#data preprocessing

#replacing nan value with the average of the columns
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

#replacing gender with numerical codes (Male = 1, Female = 0, Other = 3)
#.map avoids chained indexing, which raises SettingWithCopyWarning and may not update df
df["gender"] = df["gender"].map({'Male': 1, 'Female': 0, 'Other': 3})

#replacing smoking status with numerical values
df["smoking_status"] = df["smoking_status"].map(
    {'never smoked': 0, 'formerly smoked': 1, 'smokes': 2, 'Unknown': 3})
#df['smoking_status'] = df['smoking_status'].apply({0:'never smoked', 1:'formerly smoked', 2:'smokes'}.get)

df.head(5)
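
Integer codes such as 0 to 3 put the categories on a numeric scale, so the model treats 'Unknown' as if it were "more" than 'smokes'. One common alternative, shown here only as a sketch and not as the method used in this notebook, is one-hot encoding with `pd.get_dummies`. The `toy` frame below is made up for illustration:

```python
import pandas as pd

# Toy column with the four smoking_status categories.
toy = pd.DataFrame(
    {"smoking_status": ["never smoked", "smokes", "Unknown", "formerly smoked"]}
)

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["smoking_status"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no ordering between categories is implied.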

**Now that the data has been preprocessed, we select the columns that will be used as independent variables to explain whether a person has a higher probability of having a stroke.**

In [None]:
X = df[["gender","age","hypertension","heart_disease","avg_glucose_level","bmi","smoking_status"]]
y = df["stroke"]

In [None]:
#now we need to standardize X (zero mean and unit variance for each column)
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
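
To make the scaling step concrete: `StandardScaler` replaces each value `x` in a column with `(x - mean) / std`, using the column's mean and its population standard deviation. The sketch below checks this on a small made-up array (`X_toy` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 3 rows, 2 columns on very different scales.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Manual standardization, column by column (np.std uses ddof=0 by default).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same operation through scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```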

In [None]:
#after the data preprocessing, it is time to split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
print("X_train shape: {}".format(X_train.shape))
print("Y_train shape: {}".format(Y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("Y_test shape: {}".format(Y_test.shape))
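
A variation worth considering, sketched here on synthetic data rather than the stroke dataset: pass `stratify` so that both splits keep the same share of positive cases, and fit the scaler on the training split only so that no test-set statistics reach the model. The arrays `X_toy` and `y_toy` and the 10% positive rate are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 3 features, 10% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)

# stratify keeps the positive rate (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=4, stratify=y_toy
)

# Learn mean and std from the training rows only, then apply to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
print(y_tr.mean(), y_te.mean())
```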

In [None]:
#now let's build the logistic regression model (C is the inverse of the regularization strength)
stroke_lr = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,Y_train)
stroke_lr
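
In scikit-learn, `C` is the inverse of the regularization strength, so `C=0.01` regularizes quite strongly. The sketch below, on synthetic data from `make_classification` (an assumption for illustration, not the stroke data), shows how the size of the fitted coefficients changes with `C`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

# Fit the same model with three values of C and record the coefficient norm.
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, solver="liblinear").fit(X_toy, y_toy)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(norms[C], 3))
```

Smaller values of `C` pull the coefficients toward zero, which is why it can be worth trying a few values instead of fixing one.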

In [None]:
#predicting the test dataset
y_hat = stroke_lr.predict(X_test)
y_hat[:5]

In [None]:
#predicted probability of each class for the test set; the columns follow stroke_lr.classes_
yhat_prob = stroke_lr.predict_proba(X_test)
yhat_prob
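
The relation between `predict_proba` and `predict` can be sketched as follows, again on synthetic data (`X_toy`, `y_toy` and the 0.2 threshold are assumptions for illustration). Each row of the probability matrix sums to 1, and for a binary model `predict` corresponds to a 0.5 cut-off on the column for class 1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=1)
clf = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)
print(clf.classes_)  # column order of proba

# Probability of the class labelled 1.
p_positive = proba[:, 1]

# The default decision is equivalent to a 0.5 threshold.
default_pred = (p_positive > 0.5).astype(int)

# A lower, hypothetical threshold flags more cases as positive.
lower_thresh_pred = (p_positive > 0.2).astype(int)
print(default_pred.sum(), lower_thresh_pred.sum())
```

When missing a positive case is costly, lowering the threshold trades more false alarms for fewer misses.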

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
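
#print the raw confusion matrix (rows: true label, columns: predicted label, in the order stroke, no stroke)
print(confusion_matrix(Y_test, y_hat, labels=[1,0]))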

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, y_hat, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['stroke=1','no stroke=0'],normalize= False,  title='Confusion matrix')

In [None]:
from sklearn.metrics import jaccard_score
#Jaccard score for the "no stroke" class (pos_label=0)
jaccard_score(Y_test, y_hat,pos_label=0)
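
Accuracy and the Jaccard score for the majority class can both look high when one class is rare. The sketch below uses made-up labels (95 negatives, 5 positives) and a hypothetical predictor that always answers "no stroke"; it is not the model trained above. It is only meant to show why recall for the positive class is worth checking as well:

```python
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score, recall_score

# Made-up labels: 95 "no stroke" (0) and 5 "stroke" (1).
y_true = np.array([0] * 95 + [1] * 5)

# Hypothetical predictor that always says "no stroke".
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)              # share of correct predictions
rec = recall_score(y_true, y_pred, pos_label=1)   # share of true strokes found
jac0 = jaccard_score(y_true, y_pred, pos_label=0) # Jaccard for the "no stroke" class
print(acc, rec, jac0)
```

The same calls, `accuracy_score(Y_test, y_hat)` and `recall_score(Y_test, y_hat, pos_label=1)`, can be applied to the real predictions.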

In summary:
* Using logistic regression to predict whether a person has a higher probability of having a stroke yields about 95% accuracy on the test set.
* Another important point is that I did not include some categorical columns, such as ever_married, work_type, and residence_type.
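
One possible way to bring such categorical columns into the model is sketched below, using a `ColumnTransformer` inside a `Pipeline`. The six rows in `toy` are made up for illustration, and the choice of columns is an assumption, not something evaluated in this notebook:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic the layout of the dataset.
toy = pd.DataFrame({
    "age": [67, 45, 80, 23, 59, 34],
    "bmi": [36.6, 28.1, 32.5, 22.0, 30.2, 25.4],
    "ever_married": ["Yes", "Yes", "Yes", "No", "Yes", "No"],
    "work_type": ["Private", "Self-employed", "Private",
                  "Private", "Govt_job", "Private"],
    "stroke": [1, 0, 1, 0, 0, 0],
})
numeric = ["age", "bmi"]
categorical = ["ever_married", "work_type"]

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Chain preprocessing and the classifier so both are fitted together.
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(solver="liblinear")),
])
model.fit(toy[numeric + categorical], toy["stroke"])

preds = model.predict(toy[numeric + categorical])
print(preds)
```

Because the encoder and scaler live inside the pipeline, calling `fit` on the training split alone keeps the test data out of the preprocessing step.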