# Stroke predictions

### Introduction

Blabla

### Imports

In [2]:
import csv
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statistics
from typing import Optional, List, Tuple, Dict, Any, Callable, Iterable
import warnings
warnings.filterwarnings("ignore")


### Data parsing

The variable *input_path* holds the path to the input CSV file (the healthcare dataset). This file is loaded into a pandas DataFrame in which each row corresponds to a unique patient:

In [None]:
input_path = "stroke_data.csv"
df = pd.read_csv(input_path, delimiter=";")

Our dataframe contains the following columns:

In [None]:
df.info()

The meanings of the columns are rather self-explanatory. There are 11 features in total, plus the label. The label is represented by the *stroke* column. Note that only the *bmi* column has missing values.
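The statement about missing values can be checked by counting nulls per column. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only "bmi" contains a gap.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
    "stroke": [1, 1, 0],
})

# Number of missing entries per column.
missing_per_column = toy.isnull().sum()

# Keep only the columns that actually have missing values.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```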

### Data exploration and preprocessing

In the following we will study each column in more detail. Let's start with the *id* column. Since this column is only an identifier and is not relevant for predicting strokes, we will simply drop it:

In [None]:
df = df.drop(["id"], axis=1)

The *gender* column shows the following distribution:

In [None]:
df["gender"].value_counts()

For simplicity, we will only consider two gender options. The *Other* value can be replaced by the most frequent value, which is *Female*. Furthermore, we should convert gender into an integer:

In [None]:
df["gender"] = df["gender"].replace(["Other"], "Female")
gender_conversion = {"Male": 0, "Female": 1}
df["gender"] = df["gender"].map(gender_conversion)
df["gender"] = df["gender"].astype(int)
df["gender"].value_counts()
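Instead of hard-coding the replacement value, the most frequent category can be computed from the data. The sketch below illustrates this on a hypothetical sample series (the variable names `majority`, `gender_clean` and `gender_encoded` are ours and do not appear elsewhere in the notebook):

```python
import pandas as pd

# Hypothetical sample; the real column comes from the CSV file.
gender = pd.Series(["Female", "Male", "Other", "Female", "Male", "Female"])

# Most frequent value among the two categories we keep.
majority = gender[gender != "Other"].mode().iloc[0]

# Replace the rare category, then encode the result as integers.
gender_clean = gender.replace("Other", majority)
gender_encoded = gender_clean.map({"Male": 0, "Female": 1}).astype(int)
print(gender_encoded.tolist())
```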

Next up is the *age* column. All we want to do is convert the datatype from float to integer:

In [None]:
df["age"] = df["age"].astype(int)
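Note that `astype(int)` truncates rather than rounds. If the column contains fractional ages, the two approaches give different results, as this small sketch with made-up values shows:

```python
import pandas as pd

ages = pd.Series([0.08, 1.8, 54.0, 79.9])  # hypothetical ages in years

truncated = ages.astype(int)        # drops the fractional part
rounded = ages.round().astype(int)  # rounds to the nearest integer first
print(truncated.tolist())
print(rounded.tolist())
```

Which behaviour is preferable depends on how ages below one year should be treated; the notebook uses truncation.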

The *hypertension* field takes on the values 0 (no hypertension) and 1 (hypertension). Similarly, *heart_disease* is either 0 (no heart disease) or 1 (heart disease). The distribution is as follows:

In [None]:
df[["hypertension", "heart_disease"]].value_counts()

The column *ever_married* shows whether the patient has ever been married:

In [None]:
df["ever_married"].value_counts()

We should map the string values to integers:

In [None]:
married_conversion = {"No": 0, "Yes": 1}
df["ever_married"] = df["ever_married"].map(married_conversion)
df["ever_married"] = df["ever_married"].astype(int)
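One caveat with this pattern: `map` returns NaN for any label that is missing from the dictionary, and the subsequent `astype(int)` then raises an error. A quick way to spot such labels beforehand is sketched below on a hypothetical series containing a stray lowercase value:

```python
import pandas as pd

# Hypothetical sample with a label that is not in the conversion dictionary.
married = pd.Series(["Yes", "No", "yes"])
married_conversion = {"No": 0, "Yes": 1}

mapped = married.map(married_conversion)

# Labels missing from the dictionary become NaN after map().
unmapped = married[mapped.isnull()].unique().tolist()
print(unmapped)
```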

Next, we have *work_type*. The options for this field read:

In [None]:
df["work_type"].value_counts()

In [None]:
sns.barplot(x="work_type", y="stroke", data=df)

For the *residence_type* field, the distribution is as follows:

In [None]:
df["residence_type"].value_counts()

The following column *avg_glucose_level* describes the average glucose level in mg/dL. From the [Mayo Clinic](https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451) we learn the following in relation to diabetes:

> A blood sugar level less than 140 mg/dL is normal. A reading of more than 200 mg/dL after two hours indicates diabetes. A reading between 140 and 199 mg/dL indicates prediabetes.

Our glucose level distribution is as follows:

In [None]:
df["avg_glucose_level"].describe()
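To relate the distribution to the thresholds quoted above, the values can be binned with `pd.cut`. This is only a rough orientation and not part of the original analysis: the quoted thresholds refer to a reading taken after two hours, whereas our column is an average level. The sample values and the category labels are assumptions, as is the choice to assign a value of exactly 200 mg/dL to the highest bin.

```python
import numpy as np
import pandas as pd

glucose = pd.Series([85.0, 139.9, 140.0, 199.0, 228.7])  # hypothetical values in mg/dL

# Left-closed bins: [-inf, 140), [140, 200), [200, inf).
bins = [-np.inf, 140, 200, np.inf]
labels = ["normal", "prediabetes", "diabetes"]
glucose_category = pd.cut(glucose, bins=bins, labels=labels, right=False)
print(glucose_category.value_counts().to_dict())
```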

The next column is *bmi*, the body mass index (BMI) in kg/m$^2$. From the [CDC](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html) we learn the following in relation to obesity:

| BMI | Weight status |
| ---: | :--- |
| < 18.5 | Underweight |
| 18.5 - 24.9 | Normal weight |
| 25.0 - 29.9 | Overweight |
| ≥ 30.0 | Obese |
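The weight categories can be turned into a categorical column with `pd.cut`. The following is a minimal sketch on made-up BMI values, treating 30.0 and above as obese; the variable names are ours, and this step is not part of the preprocessing performed in this notebook.

```python
import numpy as np
import pandas as pd

bmi_values = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0])  # hypothetical values

# Left-closed bins: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf).
bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]
labels = ["Underweight", "Normal weight", "Overweight", "Obese"]
weight_status = pd.cut(bmi_values, bins=bins, labels=labels, right=False)
print(weight_status.tolist())
```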

From earlier we know that there are 201 missing values for this field. We can replace those missing values with random numbers drawn from a normal distribution that has the mean and standard deviation of the observed values:

In [None]:
# Fit a normal distribution to the observed BMI values.
mean = df["bmi"].mean()
std = df["bmi"].std()
missing_mask = df["bmi"].isnull()

# Draw one random value per missing entry and fill the gaps.
random_bmi = np.random.normal(loc=mean, scale=std, size=missing_mask.sum())
df.loc[missing_mask, "bmi"] = random_bmi

# Check that all missing values have been filled:
df["bmi"].isnull().sum()
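Because the draws are random, the imputed values change on every run, and a normal distribution can in principle produce implausible BMI values. A possible variant, sketched here under our own assumptions (the helper `impute_normal`, the seed, and the clipping to the observed range are not part of the original notebook), makes the step reproducible and bounded:

```python
import numpy as np
import pandas as pd

def impute_normal(values: pd.Series, seed: int = 0) -> pd.Series:
    """Fill NaNs with draws from a normal fitted to the observed values."""
    rng = np.random.default_rng(seed)
    observed = values.dropna()
    mask = values.isnull()
    draws = rng.normal(loc=observed.mean(), scale=observed.std(), size=int(mask.sum()))
    # Keep the draws inside the observed range to avoid implausible values.
    draws = np.clip(draws, observed.min(), observed.max())
    filled = values.copy()
    filled[mask] = draws
    return filled

bmi = pd.Series([36.6, np.nan, 32.5, 34.4, np.nan, 24.0])  # hypothetical values
filled = impute_normal(bmi, seed=42)
print(filled.isnull().sum())
```

Calling the helper twice with the same seed yields the same result, which makes downstream comparisons easier.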

The last feature column is *smoking_status*. The options are as follows:

In [None]:
df["smoking_status"].value_counts()

The value *Unknown* means that this information is unavailable for this patient.