# Do you know your stroke risk?

Source of the data: https://www.sciencedirect.com/science/article/pii/S0933365719302295?via%3Dihub
Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), “Data for: A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets”, Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1

The medical dataset contains 43,400 records of potential patients, including 783 occurrences of stroke.
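The two counts above imply a heavily imbalanced target. The following is a minimal sketch of that arithmetic; the helper name `class_balance` is hypothetical and is not part of the original notebook.

```python
# Counts quoted in the dataset description above
total_records = 43_400
stroke_records = 783


def class_balance(positives, total):
    """Return the positive-class rate and the negative-to-positive ratio."""
    negatives = total - positives
    return {
        "positive_rate": positives / total,
        "negatives_per_positive": negatives / positives,
    }


balance = class_balance(stroke_records, total_records)
print(f"Stroke rate: {balance['positive_rate']:.2%}")
print(f"Non-stroke records per stroke record: {balance['negatives_per_positive']:.1f}")
```

With so few positive cases, plain accuracy is likely to be a misleading metric for any model trained on this data.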

Cerebral stroke has become a significant global public health issue. The ideal response is prevention: controlling the related metabolic factors before a stroke occurs. However, it is difficult for medical staff to decide whether a potential patient needs special precautions based only on monitored physiological indicators, unless those indicators are obviously abnormal. This project builds a machine learning model to predict whether someone is at risk of having a stroke.

The data in each row includes numerical factors, such as age and average glucose level, and categorical factors, such as "has heart disease" (yes or no), work type, and smoking status. This is not an exhaustive list. We use this data to determine which factors contribute to having a stroke and, among those, which carry the most weight.

## Basic analysis of the input dataset

In [None]:
# Dependencies and Setup
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

# Read the stroke dataset
input_csv = pd.read_csv("data/stroke_dataset.csv", delimiter=',', skipinitialspace=True)
input_csv.shape

In [None]:
# Display the input data for preview
input_csv

In [None]:
# Check the number of unique records using groupby
csv_group = input_csv.groupby("id")
len(csv_group)

In [None]:
# Count missing values for each column of the input dataset

input_csv.isnull().sum()

In [None]:
# Count values for each column of the input dataset

input_csv.notnull().sum()

In [None]:
# List the categorical columns present in the dataset
input_csv.select_dtypes(exclude=np.number).columns

In [None]:
# List the numerical columns present in the dataset
input_csv.select_dtypes(include=np.number).columns

## gender

In [None]:
# Overview of gender
input_csv["gender"].value_counts()

## age

In [None]:
# Overview of age
input_csv["age"].value_counts()

## hypertension

In [None]:
# Overview of hypertension
input_csv["hypertension"].value_counts()

## heart_disease

In [None]:
# Overview of heart_disease
input_csv["heart_disease"].value_counts()

## ever_married

In [None]:
# Overview of ever_married
input_csv["ever_married"].value_counts()

## work_type

In [None]:
# Overview of work_type
input_csv["work_type"].value_counts()

## Residence_type

In [None]:
# Overview of Residence_type
input_csv["Residence_type"].value_counts()

## avg_glucose_level

In [None]:
# Overview of avg_glucose_level
input_csv["avg_glucose_level"].value_counts()

## bmi

In [None]:
# Overview of bmi
input_csv["bmi"].value_counts()

In [None]:
# Null bmi

print(input_csv["bmi"].isnull().sum())

## Replace missing bmi data with 0

In [None]:
input_csv["bmi"] = input_csv["bmi"].fillna(0)
input_csv
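Filling with 0 keeps every row, but 0 is not a valid BMI, so summary statistics computed over the filled column are pulled downward. The sketch below illustrates this on made-up toy values (`toy_bmi` is an assumption, not data from the dataset).

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; one entry is missing
toy_bmi = pd.Series([20.0, 30.0, np.nan, 40.0])

# pandas skips NaN by default, so the mean uses the three known values
mean_ignoring_missing = toy_bmi.mean()

# After filling with 0, the placeholder is averaged in like a real measurement
filled = toy_bmi.fillna(0)
mean_after_zero_fill = filled.mean()

# Excluding the placeholder recovers the mean of the known values
valid_only_mean = filled[filled > 0].mean()

print(mean_ignoring_missing, mean_after_zero_fill, valid_only_mean)
```

Any later statistic on `bmi` may therefore need to exclude the zero placeholder, for example by filtering on `bmi > 0`.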

In [None]:
# Create a subset data frame
subset_df = input_csv[((input_csv["id"] >= 72586) & (input_csv["id"] <= 72600))]

subset_df

## Categorize bmi

In [None]:
# Create the bins in which Data will be held
bins = [-1, 0, 18.5, 25, 30, 35, 40, 45, 50, 60, 100]

# Create the names for the ten bins
group_labels = ["unknown", "<18.5", "18.5-24.9", "25-29.9", "30-34.9", "35-39.9", "40-44.9", "45-49.9", "50-59.9", ">60"]

In [None]:
input_csv["bmi group"] = pd.cut(input_csv["bmi"], bins, labels=group_labels)
input_csv
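By default `pd.cut` builds right-inclusive intervals, so each bin edge belongs to the lower bin, and values outside the outermost edges become `NaN`. The sketch below checks this on a few made-up edge values (`toy_bmi` is an assumption), reusing the bins and labels defined above.

```python
import pandas as pd

bins = [-1, 0, 18.5, 25, 30, 35, 40, 45, 50, 60, 100]
group_labels = ["unknown", "<18.5", "18.5-24.9", "25-29.9", "30-34.9",
                "35-39.9", "40-44.9", "45-49.9", "50-59.9", ">60"]

# Toy edge cases: the 0 placeholder, exact bin edges, and a value above the last edge
toy_bmi = pd.Series([0, 18.5, 25.0, 25.1, 60.0, 100.5])

# Default right=True gives intervals (-1, 0], (0, 18.5], (18.5, 25], ...
toy_groups = pd.cut(toy_bmi, bins, labels=group_labels)

print(pd.DataFrame({"bmi": toy_bmi, "bmi group": toy_groups}))
```

Under these defaults a BMI of exactly 25.0 is labeled "18.5-24.9" and 60.0 is labeled "50-59.9", so the label text is only approximate at the edges. Any value above 100 would get no group at all.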

In [None]:
# Create the bins in which Data will be held
bins = [-1, 0, 18.5, 25, 30, 35, 40, 45, 50, 60, 100]

# Create the names for the bins
group_labels = ["unknown", "Underweight", "Normal weight", "Overweight", "Obese Class I (Moderately obese)", "Obese Class II (Severely obese)", "Obese Class III (Very severely obese)", "Obese Class IV (Morbidly Obese)", "Obese Class V (Super Obese)", "Obese Class VI (Hyper Obese)"]

In [None]:
input_csv["bmi group 2"] = pd.cut(input_csv["bmi"], bins, labels=group_labels)
input_csv

In [None]:
# Creating a group based off of the bins
bmi_group = input_csv.groupby("bmi group")
bmi_group[["stroke", "hypertension", "heart_disease"]].sum()

In [None]:
# Creating a group based off of the bins
bmi_group2 = input_csv.groupby("bmi group 2")
bmi_group2[["stroke", "hypertension", "heart_disease"]].sum()

In [None]:
# Mean age within each bmi group
bmi_group2 = input_csv.groupby("bmi group 2")
bmi_group2["age"].mean()

In [None]:
# Number of records within each bmi group
bmi_group2 = input_csv.groupby("bmi group 2")
bmi_group2["id"].count()

## smoking_status

In [None]:
# Overview of smoking_status
input_csv["smoking_status"].value_counts()

In [None]:
# Null smoking_status

print(input_csv["smoking_status"].isnull().sum())

## stroke

In [None]:
# Overview of stroke
input_csv["stroke"].value_counts()

## Replace missing smoking_status data with unknown

In [None]:
input_csv

In [None]:
# Replacing missing data with unknown
input_csv["smoking_status"] = input_csv["smoking_status"].fillna("unknown")
input_csv

In [None]:
# Overview of smoking_status
input_csv["smoking_status"].value_counts()

In [None]:
# groupby smoking_status
smoking_group = input_csv.groupby(["smoking_status","gender","stroke"])
smoking_group.count()

In [None]:
input_csv["age"].min()

In [None]:
input_csv["age"].max()

In [None]:
input_csv["age"].median()

In [None]:
# Draw a box plot of the age column
age = input_csv["age"]

fig1, ax1 = plt.subplots()
ax1.set_title('Box Plot of Age')
ax1.set_ylabel('Age')
ax1.boxplot(age)
plt.show()

In [None]:
# Get quartile calculations
quartiles = age.quantile([.25,.5,.75])
lowerq = quartiles[0.25]
upperq = quartiles[0.75]
iqr = upperq-lowerq

print(f"The lower quartile of age is: {lowerq}")
print(f"The upper quartile of age is: {upperq}")
print(f"The interquartile range of age is: {iqr}")
print(f"The median of age is: {quartiles[0.5]}")

lower_bound = lowerq - (1.5*iqr)
upper_bound = upperq + (1.5*iqr)
print(f"Values below {lower_bound} could be outliers.")
print(f"Values above {upper_bound} could be outliers.")
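The same 1.5 × IQR rule can also be used to pull out the flagged rows rather than only printing the bounds. This is a minimal sketch on made-up values; the helper `iqr_outliers` and the series `toy_age` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outlier_values) using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep only the values that fall outside the two bounds
    mask = (values < lower) | (values > upper)
    return lower, upper, values[mask]


# Toy data for illustration only
toy_age = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper, outliers = iqr_outliers(toy_age)

print(f"Bounds: [{lower}, {upper}]")
print(f"Flagged values: {outliers.tolist()}")
```

Calling `iqr_outliers(input_csv["age"])` would apply the same rule to the real column.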

In [None]:
# Create a subset data frame
subset_df = input_csv[((input_csv["id"] >= 36250) & (input_csv["id"] <= 36450))]

subset_df

## Basic analysis of the subset dataset

## gender

In [None]:
# Overview of gender
subset_df["gender"].value_counts()

## age

In [None]:
# Overview of age
subset_df["age"].value_counts()

## hypertension

In [None]:
# Overview of hypertension
subset_df["hypertension"].value_counts()

## heart_disease

In [None]:
# Overview of heart_disease
subset_df["heart_disease"].value_counts()

## ever_married

In [None]:
# Overview of ever_married
subset_df["ever_married"].value_counts()

## work_type

In [None]:
# Overview of work_type
subset_df["work_type"].value_counts()

## Residence_type

In [None]:
# Overview of Residence_type
subset_df["Residence_type"].value_counts()

## avg_glucose_level

In [None]:
# Overview of avg_glucose_level
subset_df["avg_glucose_level"].value_counts()

## bmi

In [None]:
# Overview of bmi
subset_df["bmi"].value_counts()

In [None]:
# Null bmi

print(subset_df["bmi"].isnull().sum())

In [None]:
input_csv["bmi"].max()

## smoking_status

In [None]:
# Overview of smoking_status
subset_df["smoking_status"].value_counts()

In [None]:
# Null smoking_status

print(subset_df["smoking_status"].isnull().sum())

## stroke

In [None]:
# Overview of stroke
subset_df["stroke"].value_counts()

In [None]:
# Creating a group based off of the bins
subset_bmi_group2 = subset_df.groupby("bmi group 2")
subset_bmi_group2[["stroke", "hypertension", "heart_disease"]].sum()