# Do you know your stroke risk?

Source of the data: https://www.sciencedirect.com/science/article/pii/S0933365719302295?via%3Dihub
Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), “Data for: A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets”, Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1

The medical dataset contains 43,400 records of potential patients, including 783 occurrences of stroke.
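
To put those two counts in perspective, the arithmetic below computes the share of stroke records and the accuracy of a hypothetical baseline that always predicts "no stroke". The baseline is an illustration only and is not part of the model built in this notebook.

```python
# Counts quoted in the dataset description
total_records = 43_400
stroke_records = 783

# Share of records in the positive (stroke) class
stroke_rate = stroke_records / total_records

# Accuracy of a hypothetical baseline that always predicts "no stroke"
baseline_accuracy = (total_records - stroke_records) / total_records

print(f"Stroke rate: {stroke_rate:.2%}")
print(f"Always-negative baseline accuracy: {baseline_accuracy:.2%}")
```

A classifier can therefore score about 98% accuracy without identifying a single stroke, which is why accuracy alone is a weak measure on this data.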

Cerebral stroke has become a significant global public health issue. The ideal approach is prevention: controlling the related metabolic factors before a stroke occurs. However, it is difficult for medical staff to decide whether a potential patient needs special precautions based only on monitoring physiological indicators, unless those indicators are obviously abnormal. This project builds a machine learning model to predict whether someone is at risk of having a stroke.

Each row includes numerical factors, such as age and average glucose level, and categorical factors, such as "has heart disease" (yes or no), work type, and smoking status. This is not an exhaustive list. We use this data to determine which factors contribute to having a stroke and, among those, which carry the most weight.

In this notebook, we build our machine learning model. In our initial data analysis, we noticed that individuals who had a stroke make up approximately 1.8% of the data. We use the Synthetic Minority Oversampling Technique (SMOTE) to compensate for this class imbalance.
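
The core idea behind SMOTE is to create new minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbors. The sketch below is a simplified, pure-Python illustration of that interpolation step, not the `imblearn` implementation: it uses only the single nearest neighbor, and the helper names and the three example rows are made up for illustration.

```python
import random

def nearest_neighbor(index, points):
    """Return the point closest (squared Euclidean distance) to points[index]."""
    target = points[index]
    others = [p for i, p in enumerate(points) if i != index]
    return min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(target, p)))

def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

# Hypothetical minority-class rows: (age, average glucose level)
minority = [[67.0, 228.7], [80.0, 105.9], [74.0, 70.1]]

rng = random.Random(0)
synthetic = [
    synthesize(row, nearest_neighbor(i, minority), rng)
    for i, row in enumerate(minority)
]

for row in synthetic:
    print(row)
```

Each synthetic row lies between two real minority rows, so the oversampled class fills in its existing region of feature space instead of duplicating records exactly.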

To view our initial data analysis, please see the notebook titled "stroke_data."

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

data = pd.read_csv('data/stroke_ML_dataset.csv')
data.shape

In [None]:
# Count missing values for each column of the input dataset

data.isnull().sum()

In [None]:
import imblearn
from imblearn.over_sampling import SMOTE
print(imblearn.__version__)

In [None]:
from collections import Counter

In [None]:
# drop nulls from df
stroke_data_df = data.dropna()

# set y as the 'stroke' output, with targets of 0 (No) and 1 (Yes)
y = stroke_data_df['stroke']
target_names = ['0', '1']

# set X as the df after dropping stroke output
X = stroke_data_df.drop('stroke', axis=1)

# set X as the df after dropping stroke output and id
# X = stroke_data_df.drop('stroke', axis=1).drop('id',axis=1)

# define a smote instance with default parameters
oversample = SMOTE()

# rebalance data by applying SMOTE to add instances of 'Yes'
X, y = oversample.fit_resample(X, y)

# show new counts of the output variable by class (both classes should now be equal)
counter = Counter(y)
print(counter)

In [None]:
# over = SMOTE(sampling_strategy=0.1)
# under = RandomUnderSampler(sampling_strategy=0.5)

In [None]:
stroke_data_df = data.dropna()
stroke_data_df.head()

In [None]:
# Create separate DataFrames for records with and without a stroke
stroke_positive = stroke_data_df[stroke_data_df['stroke'] == 1]
stroke_positive.shape

In [None]:
stroke_negative = stroke_data_df[stroke_data_df['stroke'] == 0]
stroke_negative.shape

In [None]:
# Create separate DataFrames for records with and without a stroke
stroke_positive = stroke_data_df[stroke_data_df['stroke'] == 1]
stroke_negative = stroke_data_df[stroke_data_df['stroke'] == 0]

# return a random sample of 750 records for both positive and negative results
stroke_negative_sample = stroke_negative.sample(750)
stroke_positive_sample = stroke_positive.sample(750)

# stack the positive and negative samples into one combined df
# (pd.concat appends rows; pd.merge is intended for key-based joins)
stroke_sample = pd.concat([stroke_negative_sample, stroke_positive_sample], ignore_index=True)

stroke_sample.head()
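
The balanced undersampling step above can be sketched without pandas as follows. This is a minimal illustration with made-up records; `balanced_sample` is a hypothetical helper, and capping the draw at the smallest class size is an added assumption so that sampling without replacement cannot request more rows than a class contains.

```python
import random

def balanced_sample(records, label_key, n_per_class, seed=0):
    """Draw the same number of records from each class, without replacement."""
    rng = random.Random(seed)

    # Group the records by their class label
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)

    # Cap the draw at the smallest class size so the sample is always possible
    n = min(n_per_class, *(len(rows) for rows in by_class.values()))

    sampled = []
    for rows in by_class.values():
        sampled.extend(rng.sample(rows, n))
    rng.shuffle(sampled)
    return sampled

# Made-up records: 5 positives and 45 negatives
records = [{"age": 40 + i, "stroke": 1 if i % 10 == 0 else 0} for i in range(50)]

subset = balanced_sample(records, "stroke", n_per_class=8)
print(len(subset), sum(r["stroke"] for r in subset))
```

Here the request for 8 records per class is reduced to 5, the size of the smaller class, so the result holds 5 positives and 5 negatives.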

## Logistic Regression

In [None]:
# Import Machine Learning algorithm LogisticRegression
from sklearn.linear_model import LogisticRegression

# Import other essential Machine Learning functions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

In [None]:
import warnings
warnings.filterwarnings('ignore')

classifier.fit(X_train, y_train)

In [None]:
predictions = classifier.predict(X_test)
predictions

In [None]:
prediction_df = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
prediction_df

In [None]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

In [None]:
# Calculate classification report
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions,
                            target_names=["No Stroke", "Stroke"]))
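
For reference, the per-class precision, recall, and F1 figures in a classification report can be computed by hand from the actual and predicted labels. The sketch below does this for the positive class on a small made-up example; `binary_report` is a hypothetical helper, not a scikit-learn function.

```python
def binary_report(actual, predicted, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = stroke, 0 = no stroke
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

report = binary_report(actual, predicted)
print(report)
```

Recall on the stroke class is usually the figure to watch here, since it measures how many actual strokes the model catches.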

In [None]:
logmodel = LogisticRegression()

In [None]:
logmodel.fit(X_train, y_train)

In [None]:
# Now predict values for the testing data.
predictions = logmodel.predict(X_test)

In [None]:
predictions

In [None]:
prediction_df = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
prediction_df

In [None]:
print(f"Training Data Score: {logmodel.score(X_train, y_train)}")
print(f"Testing Data Score: {logmodel.score(X_test, y_test)}")

In [None]:
print(classification_report(y_test, predictions))

## Use the cell below to make predictions with the Logistic Regression model

### List of input values in order (with codification)
 - Gender (Female=0,Male=1,Other=2)
 - Age (actual value)
 - Hypertension (No=0,Yes=1)
 - Heart Disease (No=0,Yes=1)
 - Married (No=0,Yes=1)
 - Work Type (Private=0,Self-employed=1,children=2,Govt_job=3,Never_worked=4)
 - Residence Type (Urban=0,Rural=1)
 - Blood Glucose Level (actual value)
 - BMI (actual value)
 - Smoking (never smoked=0,formerly smoked=1,smokes=2,unknown=3)

### Output prediction value
 - Are you at risk of having a stroke? (No=0,Yes=1)
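
To avoid hand-coding the numbers, the codification above can be wrapped in a small helper that turns readable values into a feature row in the listed order. This is a sketch: `encode_record` and the mapping dictionaries are hypothetical names, and the codes are copied from the list above.

```python
# Codes copied from the input list above
GENDER = {"Female": 0, "Male": 1, "Other": 2}
YES_NO = {"No": 0, "Yes": 1}
WORK_TYPE = {"Private": 0, "Self-employed": 1, "children": 2,
             "Govt_job": 3, "Never_worked": 4}
RESIDENCE = {"Urban": 0, "Rural": 1}
SMOKING = {"never smoked": 0, "formerly smoked": 1, "smokes": 2, "unknown": 3}

def encode_record(gender, age, hypertension, heart_disease, married,
                  work_type, residence, glucose, bmi, smoking):
    """Return one feature row in the order the model expects."""
    return [
        GENDER[gender],
        age,
        YES_NO[hypertension],
        YES_NO[heart_disease],
        YES_NO[married],
        WORK_TYPE[work_type],
        RESIDENCE[residence],
        glucose,
        bmi,
        SMOKING[smoking],
    ]

# A list containing one row, the same shape as the `sample` passed to predict()
sample = [encode_record("Male", 76, "Yes", "Yes", "No",
                        "Private", "Urban", 150, 32, "formerly smoked")]
print(sample)
```

An unrecognized category raises a `KeyError`, which surfaces typos in the input before they reach the model.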

In [None]:
sample = [[1,76,1,1,0,0,0,150,32,1]]
prediction = logmodel.predict(sample)
print(prediction)

In [None]:
sample = [[1,76,1,1,0,0,0,150,32,1]]
prediction = classifier.predict(sample)
print(prediction)