<h1>Stroke Prediction</h1>

<img src="https://vascular.org/sites/default/files/gallery/Stroke_hemorrhagic_0.jpg">

Source: https://vascular.org/sites/default/files/gallery/Stroke_hemorrhagic_0.jpg

<h2>What is Stroke</h2>
A stroke is a sudden interruption in the blood supply of the brain. Most strokes are caused by an abrupt blockage of arteries leading to the brain (ischemic stroke).  Other strokes are caused by bleeding into brain tissue when a blood vessel bursts (hemorrhagic stroke). Because stroke occurs rapidly and requires immediate treatment, stroke is also called a <b>brain attack</b>. When the symptoms of a stroke last only a short time (less than an hour), this is called a <b>transient ischemic attack (TIA) or mini-stroke.</b>

The effects of a stroke depend on which part of the brain is injured, and how severely it is injured. Strokes may cause sudden weakness, loss of sensation, or difficulty with speaking, seeing, or walking. Since different parts of the brain control different areas and functions, it is usually the area immediately surrounding the stroke that is affected. Sometimes people with stroke have a headache, but stroke can also be completely painless. It is very important to recognize the warning signs of stroke and to get immediate medical attention if they occur.

<h2>Types of Stroke</h2>
<h3>Ischemic Stroke</h3>
The most common type of stroke, accounting for almost 80 percent of all strokes, is caused by a clot or other blockage within an artery leading to the brain.

<h3>Intracerebral Hemorrhage</h3>
An intracerebral hemorrhage is a type of stroke caused by the sudden rupture of an artery within the brain. Blood is then released into the brain compressing brain structures.

<h3>Subarachnoid Hemorrhage</h3>
A subarachnoid hemorrhage is also a type of stroke caused by the sudden rupture of an artery. A subarachnoid hemorrhage differs from an intracerebral hemorrhage in that the location of the rupture leads to blood filling the space surrounding the brain rather than inside of it.
Source: http://www.strokecenter.org/patients/about-stroke/what-is-a-stroke/

<h2>Context</h2>
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

<h2>Attribute Information</h2>
<p>1) id: unique identifier<br>
2) gender: "Male", "Female" or "Other"<br>
3) age: age of the patient<br>
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension<br>
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease<br>
6) ever_married: "No" or "Yes"<br>
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"<br>
8) Residence_type: "Rural" or "Urban"<br>
9) avg_glucose_level: average glucose level in blood<br>
10) bmi: body mass index<br>
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*<br>
12) stroke: 1 if the patient had a stroke or 0 if not<br>
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient</p>

<h2>Importing</h2>

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix

In [None]:
data = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

<h2>Data Exploration</h2>

In [None]:
data.shape

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.isna().sum()

In [None]:
col = data.keys()
for i in col:
    print(i, " :", pd.unique(data[i]))
    print("Unique :", len(pd.unique(data[i])))
    print("="*50)

In [None]:
data.describe()

From the data exploration above, we can plan a few modifications to the data:
1. We can drop the **id** column, since all ids are unique.
2. Gender can be encoded as 0 for male and 1 for female; the *Other* class needs a closer look.
3. The age column looks fine.
4. The hypertension column is already encoded as 0/1.
5. The heart_disease column is already encoded as 0/1.
6. The ever_married, Residence_type, and work_type columns can be encoded.
7. bmi has a few NA values that need to be filled.
8. smoking_status can be encoded; the *Unknown* label needs a closer look.
9. stroke is fine.

In [None]:
data.drop("id", inplace=True, axis=1)

<h2>Data Visualization</h2>

In [None]:
px.histogram(data_frame=data, x="gender", color="gender", title="Gender")

Conclusion: <br>
The dataset contains more female than male patients.<br>
The Other count is only 1, so we will relabel it as Female, since Female is the mode.
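
As a minimal sketch of that idea, the snippet below replaces a rare category with the mode. The `gender` series here is a made-up stand-in for the real column, not data from the dataset.

```python
import pandas as pd

# Toy stand-in for the gender column; the values are invented for illustration.
gender = pd.Series(["Female", "Male", "Female", "Other", "Female", "Male"])

# The mode is the most frequent category.
most_common = gender.mode()[0]

# Replace the single rare "Other" entry with the mode.
gender_clean = gender.replace("Other", most_common)

print(gender_clean.value_counts().to_dict())
```

With only one affected row, the choice of replacement barely matters; dropping the row would be an equally simple alternative.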


In [None]:
px.histogram(data_frame=data, x="age", color="stroke", title="age")

Conclusion:<br>
In this dataset, strokes start at age 38 and increase with age.<br>
The dataset has a spike of patients aged 78.

In [None]:
px.histogram(data_frame=data, x="hypertension", color="stroke", title="hypertension")

In [None]:
px.histogram(data_frame=data, x="heart_disease", color="stroke", title="heart_disease")

In [None]:
px.histogram(data_frame=data, x="ever_married", color="stroke", title="ever_married")

In [None]:
px.histogram(data_frame=data, x="work_type", color="stroke", title="work_type")

Conclusion:
We can't conclude that people with the Private work type have more strokes: the histogram shows raw counts, not stroke rates per group.

In [None]:
px.histogram(data_frame=data, x="Residence_type", color="stroke", title="Residence_type")

In [None]:
px.histogram(data_frame=data, x="avg_glucose_level", color="stroke", title="avg_glucose_level")

Conclusion: Higher glucose levels appear to go along with more strokes, but the histogram alone does not confirm this.

In [None]:
px.histogram(data_frame=data, x="bmi", color="stroke", title="bmi")

Conclusion: The BMI column is right-skewed. We will cap the extreme values as outliers.

The highest bmi with a positive stroke label is 59, so we can cap at 60.
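
Capping can be sketched with `Series.clip`, shown here on a few invented BMI values. The threshold of 60 is an assumption for illustration; missing values stay missing and are handled separately.

```python
import numpy as np
import pandas as pd

# Invented BMI values, including one missing entry and two extreme ones.
bmi = pd.Series([22.5, 31.0, np.nan, 71.3, 97.6])

# Values above the cap are pulled down to it; everything else is unchanged.
capped = bmi.clip(upper=60)

print(capped.tolist())
```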

In [None]:
px.box(data["bmi"])

In [None]:
px.histogram(data_frame=data, x="smoking_status", color="stroke", title="smoking_status")

In [None]:
px.histogram(data_frame=data, x="stroke", title="stroke")

In [None]:
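# numeric_only=True skips the string columns, which newer pandas versions refuse to correlate
px.imshow(data.corr(numeric_only=True))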

Conclusion: age is the column most correlated with stroke.
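
A heatmap can be hard to read precisely, so a ranked list of correlations with the target may help. The sketch below uses a small made-up frame (only the column names mirror the dataset) and assumes a pandas version that supports `numeric_only` in `DataFrame.corr`.

```python
import pandas as pd

# Made-up toy data; only the column names mirror the dataset.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 70, 80, 90],
    "hypertension": [1, 0, 1, 0, 1, 0, 1, 0],
    "gender": ["Male", "Female"] * 4,  # string column, skipped by numeric_only=True
    "stroke": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Absolute correlation of every numeric column with the target, strongest first.
target_corr = (
    toy.corr(numeric_only=True)["stroke"]
    .drop("stroke")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```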

Conclusion:
1. BMI: fill the NA values and cap the extreme values.

In [None]:
px.histogram(data[data["bmi"].isna()]["stroke"])

Conclusion: <br>
40 of the rows with a missing BMI are stroke-positive.<br>
We look at the correlations of bmi to decide how to fill the NA values.

In [None]:
data.corr(numeric_only=True)["bmi"]

Conclusion: We will fill the missing BMI values using the age column.

In [None]:
px.scatter(data_frame=data, x="age", y="bmi", title="BMI vs AGE")

<h2>Capping bmi</h2>

In [None]:
# Use .loc to avoid chained assignment, which may silently modify a copy
data.loc[data["bmi"] > 60, "bmi"] = 60

In [None]:
px.scatter(data_frame=data, x="age", y="bmi", title="BMI vs AGE")

<h3>Filling bmi</h3>

In [None]:
data_bmi_train = data[data["bmi"].notnull()]
data_bmi_test = data[data["bmi"].isna()]

In [None]:
age = np.reshape(data_bmi_train["age"].values, [-1,1])
bmi = np.reshape(data_bmi_train["bmi"].values, [-1,1])

In [None]:
# Baseline model: always predict the mean bmi
bmi_mean = bmi.mean()
error = np.sum(np.square(bmi-bmi_mean))
rmse = np.sqrt(error/len(bmi))
print(rmse)

In [None]:
# Baseline model: always predict the median bmi
bmi_median = np.median(bmi)
error = np.sum(np.square(bmi-bmi_median))
rmse = np.sqrt(error/len(bmi))
print(rmse)

In [None]:
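# ExtraTreesRegressor model
model = ExtraTreesRegressor()
model.fit(age, bmi.ravel())  # scikit-learn expects a 1-D target here
pred  = model.predict(age)
# bmi has shape (n, 1) and pred has shape (n,); flatten bmi so the
# difference is not broadcast to an (n, n) matrix
error = np.sum(np.square(bmi.ravel()-pred))
rmse = np.sqrt(error/len(bmi))
print(rmse)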

In [None]:
# LinearRegression model
model = LinearRegression()
model.fit(age, bmi)
pred  = model.predict(age)
error = np.sum(np.square(bmi-pred))
rmse = np.sqrt(error/len(bmi))
print(rmse)

In [None]:
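# Mean model: predict the mean bmi of rows whose age lies in [given_age, given_age + gap)
gap=1
find_bmi_by_mean = lambda given_age: data_bmi_train["bmi"][(data_bmi_train["age"]>=given_age ) & (data_bmi_train["age"]<given_age+gap)].mean()

pred = []
for i in age:
    pred.append(find_bmi_by_mean(i[0]))
# Flatten bmi to shape (n,) so it lines up with pred instead of broadcasting to (n, n)
error = np.sum(np.square(bmi.ravel()-np.array(pred)))
rmse = np.sqrt(error/len(bmi))
print(rmse)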

In [None]:
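# Median model: predict the median bmi of rows whose age lies in [given_age, given_age + gap)
find_bmi_by_median = lambda given_age: data_bmi_train["bmi"][(data_bmi_train["age"]>=given_age ) & (data_bmi_train["age"]<given_age+gap)].median()
pred = []
for i in age:
    pred.append(find_bmi_by_median(i[0]))
# Flatten bmi to shape (n,) so it lines up with pred instead of broadcasting to (n, n)
error = np.sum(np.square(bmi.ravel()-np.array(pred)))
rmse = np.sqrt(error/len(bmi))
print(rmse)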

Conclusion: Linear regression works well
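
The RMSE values above are measured on the same rows the models were fitted on, which tends to flatter flexible models. One way to cross-check the choice is a hold-out comparison. The sketch below uses synthetic `age` and `bmi` arrays (the slope and noise level are invented), so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (age, bmi); the slope and noise level are invented.
rng = np.random.default_rng(0)
age = rng.uniform(0, 80, size=500)
bmi = 20 + 0.15 * age + rng.normal(0, 3, size=500)

age_train, age_val, bmi_train, bmi_val = train_test_split(
    age.reshape(-1, 1), bmi, test_size=0.3, random_state=0
)

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Baseline: predict the training mean for every validation row.
rmse_mean = rmse(bmi_val, np.full_like(bmi_val, bmi_train.mean()))

# Linear regression fitted on the training part only.
lr = LinearRegression().fit(age_train, bmi_train)
rmse_lr = rmse(bmi_val, lr.predict(age_val))

print("mean baseline:", rmse_mean)
print("linear regression:", rmse_lr)
```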

In [None]:
# LinearRegression model
model = LinearRegression()
model.fit(age, bmi)

In [None]:
age_t = np.reshape(data_bmi_test["age"].values, [-1,1])
pred = model.predict(age_t)
# Use .loc to avoid chained assignment, which may silently modify a copy
data.loc[data["bmi"].isna(), "bmi"] = np.ravel(pred)

Replacing the gender value Other with Female

In [None]:
data.loc[data["gender"]=="Other", "gender"] = "Female"

In [None]:
data.head()

## Data transformation

In [None]:
one_hot = pd.get_dummies(data, drop_first=True)

In [None]:
one_hot = pd.get_dummies(one_hot, columns=['hypertension', 'heart_disease'], drop_first=True)

In [None]:
# Add the complementary indicator columns (1 where the *_1 column is 0)
one_hot["hypertension_0"] = (one_hot["hypertension_1"] == 0).astype(int)
one_hot["heart_disease_0"] = (one_hot["heart_disease_1"] == 0).astype(int)

In [None]:
one_hot.head()

In [None]:
one_hot.info()

In [None]:
one_hot.head()

In [None]:
x = one_hot.drop("stroke", axis=1)
y = one_hot["stroke"]

In [None]:
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = .3, stratify=y)
train_x.shape, train_y.shape

## Model 

In [None]:
rfc = RandomForestClassifier()
rfc.fit(train_x, train_y)
rfc.score(test_x, test_y)

### Accuracy is misleading when the classes are highly imbalanced
#### We have to look at the confusion matrix
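
A small worked example shows why. The labels below are invented (95 negatives, 5 positives), and the "model" simply predicts the majority class for everyone.

```python
import numpy as np

# Invented labels: 95 patients without stroke, 5 with stroke.
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = float((y_true == y_pred).mean())

# Recall for the positive class: detected positives / all true positives.
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print("accuracy:", accuracy)
print("recall for stroke class:", recall)
```

The accuracy is 0.95 even though not a single stroke case is detected, which is why per-class metrics are needed here.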

In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

In [None]:
# scikit-learn expects (y_true, y_pred), in that order
print(confusion_matrix(test_y, rfc.predict(test_x)))

print(classification_report(test_y, rfc.predict(test_x)))

As we can see, the model performs poorly on the stroke class: most positive cases are missed.
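
When reading the matrix, it helps to recall the layout scikit-learn uses: rows are the true classes and columns are the predicted classes, given the argument order `(y_true, y_pred)`. The labels in this sketch are invented.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 0, 1])

# Rows = true class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Swapping the two arguments transposes the matrix, so false positives and false negatives trade places.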

### Defining class weights

In [None]:
data["stroke"].value_counts()

In [None]:
class_weight = {0:1, 1:20}
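
The weights above are chosen by hand. As an alternative sketch, scikit-learn can derive "balanced" weights from the label counts using n_samples / (n_classes * count). The labels below are invented (95 negatives, 5 positives), not the real class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels: 95 negatives and 5 positives.
y = np.array([0] * 95 + [1] * 5)
classes = np.array([0, 1])

# "balanced" weights: n_samples / (n_classes * count of each class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Turn the result into the dict form that class_weight= accepts.
balanced_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(balanced_weight)
```

Passing `class_weight="balanced"` to `RandomForestClassifier` applies the same formula without computing the dict by hand.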

In [None]:
rfc = RandomForestClassifier(class_weight=class_weight)
rfc.fit(train_x, train_y)
rfc.score(test_x, test_y)

In [None]:
# scikit-learn expects (y_true, y_pred), in that order
print(confusion_matrix(test_y, rfc.predict(test_x)))
print(classification_report(test_y, rfc.predict(test_x)))

Not much difference

## Undersampling

In [None]:
from imblearn.under_sampling import NearMiss
from collections import Counter

In [None]:
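# sampling_strategy must be passed by keyword in recent imbalanced-learn versions
nm = NearMiss(sampling_strategy=.99)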
train_x_nm, train_y_nm = nm.fit_resample(train_x, train_y)
Counter(train_y_nm)

In [None]:
rfc = RandomForestClassifier()
rfc.fit(train_x_nm, train_y_nm)
rfc.score(test_x, test_y)

In [None]:
# scikit-learn expects (y_true, y_pred), in that order
print(confusion_matrix(test_y, rfc.predict(test_x)))
print(classification_report(test_y, rfc.predict(test_x)))

Made things worse

## Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler

In [None]:
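# Named ros rather than os to avoid shadowing the standard-library module name;
# sampling_strategy must be passed by keyword in recent imbalanced-learn versions
ros = RandomOverSampler(sampling_strategy=.50)
train_x_os, train_y_os = ros.fit_resample(train_x, train_y)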
Counter(train_y_os)
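
To make the effect of a 0.5 sampling ratio concrete, here is a hand-rolled NumPy sketch of random oversampling. It is not imbalanced-learn's implementation, and the labels and the single dummy feature are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training labels: 90 negatives, 10 positives (made-up numbers).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # one dummy feature per row

ratio = 0.5  # desired minority / majority ratio after resampling
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Number of minority rows needed to reach the ratio.
n_target = int(ratio * len(majority_idx))
n_extra = n_target - len(minority_idx)

# Draw the extra rows from the minority class with replacement.
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
keep_idx = np.concatenate([majority_idx, minority_idx, extra_idx])

X_os, y_os = X[keep_idx], y[keep_idx]
print(np.bincount(y_os))
```

Only the training split is resampled; the test split keeps its original class distribution so the evaluation reflects real-world proportions.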

In [None]:
rfc = RandomForestClassifier()
rfc.fit(train_x_os, train_y_os)
rfc.score(test_x, test_y)

In [None]:
# scikit-learn expects (y_true, y_pred), in that order
print(confusion_matrix(test_y, rfc.predict(test_x)))
print(classification_report(test_y, rfc.predict(test_x)))