# Do you know your stroke risk?

Source of the data: https://www.sciencedirect.com/science/article/pii/S0933365719302295?via%3Dihub
Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), “Data for: A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets”, Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1

The medical dataset contains 43,400 records of potential patients, 783 of which are occurrences of stroke.

Cerebral stroke has become a significant global public health issue. The ideal response is prevention: controlling the related metabolic factors before a stroke occurs. However, it is difficult for medical staff to decide whether a potential patient needs special precautions based only on monitored physiological indicators, unless those indicators are obviously abnormal. This project builds a machine learning model to predict whether someone is at risk of having a stroke.

Each row includes numerical factors, such as age and average glucose level, and categorical factors, such as "has heart disease" (yes or no), work type, and smoking status. This is not an exhaustive list. We use this data to determine which factors contribute to having a stroke and, among those, which carry the most weight.

In this notebook, we build our machine learning model. Our initial data analysis showed that individuals who had a stroke make up approximately 1.8% of the data. This notebook does NOT use the Synthetic Minority Oversampling Technique (SMOTE) to account for that imbalance.
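
The 1.8% figure follows directly from the record counts quoted earlier. The short check below reproduces it and also expresses the imbalance as the number of non-stroke records per stroke record. The variable names are illustrative and do not appear elsewhere in this notebook.

```python
# Counts quoted in the dataset description above.
total_records = 43_400
stroke_records = 783

# Share of records that are strokes, and how many non-stroke records
# exist for every stroke record.
stroke_rate = stroke_records / total_records
non_stroke_per_stroke = (total_records - stroke_records) / stroke_records

print(f"Stroke rate: {stroke_rate:.2%}")
print(f"Non-stroke records per stroke record: {non_stroke_per_stroke:.1f}")
```

With roughly 54 non-stroke records for every stroke record, the positive class is rare enough that plain accuracy needs to be read with care.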

To view our initial data analysis, please see the notebook titled "stroke_data."

In [None]:
# Dependencies and Setup
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

# Read the stroke dataset
ml_stroke_data = pd.read_csv("data/stroke_ML_dataset.csv", delimiter=',', skipinitialspace=True)
ml_stroke_data.shape

In [None]:
# Preview dataframe
ml_stroke_data.head()

In [None]:
# Use info() on ml_stroke_data to check column types and non-null counts
ml_stroke_data.info()

In [None]:
# Use describe() on ml_stroke_data to get summary statistics
ml_stroke_data.describe()

### Exploratory Data Analysis

In [None]:
# Create a histogram of age
sns.displot(ml_stroke_data["age"], bins=20)

In [None]:
# Create a jointplot (scatter kind) showing avg_glucose_level versus age
sns.jointplot(x="avg_glucose_level", y="age",
              kind="scatter", data=ml_stroke_data)
# Show the plot
plt.show()

In [None]:
# One-hot encode the categorical columns
ml_ready_stroke_data = pd.get_dummies(ml_stroke_data, columns=["hypertension", "heart_disease", "ever_married", "work_type", "smoking_status"])
ml_ready_stroke_data.head()
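
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame. The column names mirror two of the columns encoded above, but the rows are hypothetical and are not taken from the stroke dataset.

```python
import pandas as pd

# Hypothetical toy rows; column names match the notebook, values are made up.
toy = pd.DataFrame({
    "age": [67, 34, 52],
    "heart_disease": [1, 0, 0],
    "smoking_status": ["smokes", "never smoked", "formerly smoked"],
})

# Each listed column is replaced by one indicator column per category;
# columns that are not listed (here, age) pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["heart_disease", "smoking_status"])
print(toy_encoded.columns.tolist())
```

Note that a column already coded as 0/1 becomes two complementary indicator columns. If that redundancy is unwanted, `pd.get_dummies` accepts `drop_first=True`; whether that suits this dataset is a judgment call the notebook does not make.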

## Logistic Regression

Now it's time to split the data and train our model!

Split the data into a training set and a testing set using train_test_split.

In [None]:
# Import Machine Learning algorithm LogisticRegression
from sklearn.linear_model import LogisticRegression

# Import other essential Machine Learning functions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# SMOTE is intentionally not used in this notebook, so its import stays commented out
# from imblearn.over_sampling import SMOTE

In [None]:
X = ml_ready_stroke_data.drop(["stroke"], axis = 1)
y = ml_ready_stroke_data["stroke"]

In [None]:
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
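
The split above does not pass `stratify`, so with only about 1.8% positives the stroke rate can differ somewhat between the training and testing sets. As a sketch of one alternative, and not what this notebook does, `train_test_split` accepts a `stratify` argument that preserves the class proportions. The data below is synthetic and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y: 1,000 rows, 2% positives.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 980)

# stratify=y_demo keeps the positive rate the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

print(f"Train positive rate: {y_tr.mean():.3f}")
print(f"Test positive rate:  {y_te.mean():.3f}")
```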

In [None]:
# Create the classifier with scikit-learn's default settings
classifier = LogisticRegression()
classifier

In [None]:
# Suppress all warnings (note: this also hides any solver convergence warnings)
import warnings
warnings.filterwarnings('ignore')

classifier.fit(X_train, y_train)

In [None]:
predictions = classifier.predict(X_test)
predictions

In [None]:
prediction_df = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
prediction_df

In [None]:
# Export the predicted and actual labels as a csv file
prediction_df.to_csv("data/LG_prediction_df.csv", index=False, header=True)

In [None]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
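
For a classifier, `score` returns mean accuracy. On data this imbalanced, a high accuracy may say little on its own, so it helps to compare it against a model that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with about 2% positives. The data and the `_demo` names are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the notebook's class balance (2% positives).
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

print(f"Baseline accuracy: {baseline.score(X_demo, y_demo):.3f}")
print(f"Positives predicted: {int(baseline.predict(X_demo).sum())}")
```

If the logistic regression scores land close to such a baseline, the per-class metrics in the classification report are the more informative numbers.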

#### logmodel

In [None]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()

#### Train and fit a logistic regression model on the training set.

In [None]:
logmodel.fit(X_train, y_train)
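
One stated goal of this project is to see which factors carry the most weight. A fitted `LogisticRegression` exposes its weights in `coef_`, and a rough way to compare them is to standardize the features first so the coefficients share a scale. The sketch below demonstrates the idea on synthetic data in which a feature named "age" drives the label and a feature named "noise" does not. Both features and all `_demo` names are hypothetical, and this step is not part of the notebook's own pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: "age" drives the label, "noise" does not (illustrative names).
rng = np.random.default_rng(0)
age = rng.normal(size=2000)
noise = rng.normal(size=2000)
y_demo = (rng.random(2000) < 1 / (1 + np.exp(-2 * age))).astype(int)

feature_names = ["age", "noise"]
# Standardize so that coefficient magnitudes are comparable across features.
X_demo = StandardScaler().fit_transform(np.column_stack([age, noise]))

demo_model = LogisticRegression().fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem; rank by absolute size.
ranked = sorted(
    zip(feature_names, demo_model.coef_[0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, weight in ranked:
    print(f"{name:>6}: {weight:+.3f}")
```

Applied to this notebook, the same pattern would pair `logmodel.coef_[0]` with `X_train.columns`, keeping in mind that the features here are not standardized, so raw magnitudes are not directly comparable.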

#### Predictions and Evaluations

In [None]:
# Now predict values for the testing data.
predictions = logmodel.predict(X_test)

In [None]:
predictions

#### Create a classification report for the model.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
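
The precision, recall, and f1-score columns in the report are computed per class from confusion-matrix counts. As a worked sketch, the helper below (a hypothetical function, not part of scikit-learn) applies the standard formulas to made-up counts for the stroke class. The counts are illustrative only and are not this model's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from confusion-matrix counts."""
    # Precision: of the rows predicted positive, how many really are positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the rows that really are positive, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: the model never predicts the positive class.
print(precision_recall_f1(tp=0, fp=0, fn=235))

# Made-up counts: some positives are caught, with many false alarms.
print(precision_recall_f1(tp=40, fp=160, fn=195))
```

The first call shows why a model can reach high accuracy on imbalanced data while scoring zero on every stroke-class metric: it simply never predicts a stroke.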