# Stroke Risk - Fbeta IS THE KEY (LightGBM)
**Notebook Author** <br>
**Aleksander Jakubowski**

**Context** <br>
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

**Attribute Information** <br>
1) id: unique identifier<br>
2) gender: "Male", "Female" or "Other"<br>
3) age: age of the patient<br>
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension<br>
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease<br>
6) ever_married: "No" or "Yes"<br>
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"<br>
8) Residence_type: "Rural" or "Urban"<br>
9) avg_glucose_level: average glucose level in blood<br>
10) bmi: body mass index<br>
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*<br>
12) stroke: 1 if the patient had a stroke or 0 if not<br>
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient<br>

**Acknowledgements** <br>
(Confidential Source) - Use only for educational purposes
If you use this dataset in your research, please credit the author.


In [None]:
import numpy as np 
import pandas as pd 
import lightgbm as lgbm
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import cross_val_score,train_test_split,RandomizedSearchCV
from sklearn.metrics import fbeta_score,make_scorer,recall_score,precision_score
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
data = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

In [None]:
data

## Dataset and Objective

We are not going to do an in-depth analysis of this dataset. I feel that BI tools (e.g. Tableau, Power BI, Qlik) would do a much better job of that.<br>

**Target Variable** <br>
We need to check whether our y (stroke classes) is imbalanced (very likely the case). <br>

**Independent Variables** <br>
Our dataset has a couple of issues to address. The first is categorical features, which we have to encode. When we encode, we need to pick an encoding type (e.g. One Hot or Ordinal), and we base that decision on the model we are going to train. If our model is sensitive to non-standardized/non-normalized values, then we have to deal with those too. <br>
In the majority of problems like this, gradient boosting models are among the best in the field, since they can deal with non-standardized data, categorical features, missing values and class imbalance. <br>
Since this is not a competition, I will not stack many models to get an extremely accurate result. The library and model I pick is Microsoft LightGBM with its Gradient Boosting Machine classifier.<br>

### It is important to take the evaluation metric into account. In a lot of other work here on Kaggle, I saw people repeating the same mistake: they measure accuracy, precision, or in the best case F1 score, and it is just WRONG.<br>

What you need to address is what you don't want to happen. We don't want people who might have a stroke to be predicted as healthy. So we want to see how well our model predicts the positive class, and we want to penalize the score when the model fails to detect a person who is positive. This means we want the Fbeta score with beta > 1, which is a good metric when the cost of a False Positive is smaller than the cost of a False Negative. <br>

**If a patient is put on the list of people potentially at risk of stroke without actually being at risk, not much changes for him/her. However, for a person who is at risk, it is important to make sure that the person gets the best care, knows of his/her state, and starts treatment.**<br>
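
To make the weighting concrete, here is a minimal sketch of Fbeta computed directly from confusion-matrix counts and compared with scikit-learn. The toy labels and the helper name `fbeta_from_counts` are made up for illustration and are not part of the stroke data or of any library.

```python
from sklearn.metrics import fbeta_score


def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta from confusion-matrix counts (hypothetical helper).

    A false negative is weighted beta**2 times as heavily as a false positive,
    so beta > 1 punishes missed positives more than false alarms.
    """
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Toy labels: 4 real positives; the model finds 3 of them and raises 3 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
tp, fp, fn = 3, 3, 1

f1_manual = fbeta_from_counts(tp, fp, fn, beta=1)  # 6 / 10 = 0.6
f2_manual = fbeta_from_counts(tp, fp, fn, beta=2)  # 15 / 22, about 0.682
f2_sklearn = fbeta_score(y_true, y_pred, beta=2)

print(f"F1={f1_manual:.3f}  F2={f2_manual:.3f}  sklearn F2={f2_sklearn:.3f}")
```

Here recall (0.75) is higher than precision (0.5), so F2 comes out above F1: the metric rewards the model for catching positives even at the price of some false alarms.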



For more information about the Fbeta metric, I highly recommend these sources.<br>
[deep ai F2 Score](https://deepai.org/machine-learning-glossary-and-terms/f-score) <br>
[Sklearn Fbeta Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html)<br>


Apart from Fbeta, I suggest also watching Recall and Precision. We might have good Recall and Fbeta while the model is still not precise (it flags most cases as positive).
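
A small sketch of why a single metric can mislead on imbalanced data, using two deliberately useless predictors on made-up labels (5 positives out of 100; this is not the stroke data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score

# Toy imbalanced target: 5 positives, 95 negatives.
y_true = np.array([1] * 5 + [0] * 95)

always_negative = np.zeros_like(y_true)  # never flags anyone
always_positive = np.ones_like(y_true)   # flags everyone

acc_neg = accuracy_score(y_true, always_negative)         # 0.95, looks great
rec_neg = recall_score(y_true, always_negative)           # 0.0, misses every positive
rec_pos = recall_score(y_true, always_positive)           # 1.0, catches every positive
prec_pos = precision_score(y_true, always_positive)       # 0.05, almost all false alarms
f2_pos = fbeta_score(y_true, always_positive, beta=2.0)   # 0.25 / 1.2, about 0.208

print(acc_neg, rec_neg, rec_pos, prec_pos, round(f2_pos, 3))
```

The "always negative" predictor gets 95% accuracy while being worthless, and the "always positive" one gets perfect recall with 5% precision. Reading Fbeta together with Recall and Precision exposes both failure modes.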

Going back to the dataset: as we can see below, we are indeed working with an imbalanced dataset.

In [None]:
pd.DataFrame(data.groupby("stroke").count()["id"])
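
An alternative way to read the imbalance is as shares and as a negative-to-positive ratio. The sketch below uses a made-up 20-row frame rather than the real file, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the `stroke` column: 19 negatives, 1 positive.
toy = pd.DataFrame({"stroke": [0] * 19 + [1]})

counts = toy["stroke"].value_counts()                 # absolute counts per class
shares = toy["stroke"].value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.loc[0] / counts.loc[1]       # negatives per positive

print(counts.to_dict(), shares.to_dict(), imbalance_ratio)
```

On the real data the same two `value_counts` calls on `data["stroke"]` would give the class shares directly.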

In [None]:
fig1 = plt.figure(figsize=(16,5))
sns.heatmap(data.isna().T)
plt.title("Null Values")

In general, our model (Gradient Boosting Machines) can deal with null values; however, I do not know exactly how it handles them.<br>
As we can see, BMI values are missing for some patients, and it might be beneficial to fill them in. However, we cannot just fill them with zeros or the overall avg/median, as that wouldn't make much sense (quite possibly the model would also treat nulls as zeros or something close to that).<br>

In this case we need to use what we have, and we have other features which in real life are related to BMI. E.g. hypertension can be caused by obesity (Obesity-Induced Hypertension).<br>

I feel the important features to group by in this case are: <br>
1) Gender<br>
2) Age (We will create age group)<br>
3) Hypertension<br>

In [None]:
def age_groups(age):
    if age >= 0 and age <=18:
        age_group="to 18 years"
    elif age > 18 and age <=29:
        age_group="18-29 years"
    elif age>29 and age <=40:
        age_group="30-40 years"
    elif age>40 and age <=50:
        age_group = "40-50 years"
    elif age >50 and age <= 60:
        age_group ="50-60 years"
    elif age >60:
        age_group="more than 60 years"
    return age_group    
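
The same binning can also be sketched with `pd.cut`, which is vectorized and returns NaN for a missing age instead of failing. The bin edges and labels below copy the function above; the sample ages are made up.

```python
import numpy as np
import pandas as pd

# Right-closed bins: [0, 18], (18, 29], (29, 40], (40, 50], (50, 60], (60, inf)
bins = [0, 18, 29, 40, 50, 60, np.inf]
labels = [
    "to 18 years",
    "18-29 years",
    "30-40 years",
    "40-50 years",
    "50-60 years",
    "more than 60 years",
]

ages = pd.Series([0, 18, 18.5, 29, 35, 45, 60, 61, 82])  # toy ages, boundary cases included

# include_lowest=True makes the first bin contain age 0 itself.
age_group = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)
print(age_group.tolist())
```

Applied to the notebook's frame, this would be `pd.cut(data["age"], ...)` in place of `data["age"].apply(age_groups)`.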

In [None]:
data["age_group"]=data["age"].apply(age_groups)

In [None]:
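# numeric_only=True skips the remaining text columns, which recent pandas versions refuse to average
data.groupby(["gender","age_group","hypertension","smoking_status"]).mean(numeric_only=True)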

Here we will fill NaN with the mean of each particular group, and later we will drop the age_group feature.

In [None]:
data['bmi'] = data['bmi'].fillna(data.groupby(["gender","age_group","hypertension"])['bmi'].transform('mean'))
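
One thing to watch with group-wise filling: if a group has no observed BMI at all, its mean is NaN and the rows stay missing. The sketch below shows this on a made-up frame and adds a last-resort fallback to the overall median. The fallback is my assumption, not something the notebook does; the `data.isna().sum()` check further down is what tells you whether it is needed.

```python
import numpy as np
import pandas as pd

# Toy frame: the ("Other", 0) group has no observed BMI.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Other"],
    "hypertension": [0, 0, 0, 1, 1, 0],
    "bmi": [24.0, 30.0, np.nan, 28.0, np.nan, np.nan],
})

# Step 1: fill each missing BMI with the mean of its own group.
group_mean = toy.groupby(["gender", "hypertension"])["bmi"].transform("mean")
toy["bmi"] = toy["bmi"].fillna(group_mean)

# The all-NaN group has a NaN mean, so one value is still missing here.
still_missing = int(toy["bmi"].isna().sum())

# Step 2 (assumed fallback): use the overall median for whatever is left.
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].median())
print(still_missing, toy["bmi"].tolist())
```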

In [None]:
data.drop(["age_group","id"],axis=1,inplace=True)

In [None]:
data.isna().sum()

In [None]:
data

## Modeling

In [None]:
enc = OrdinalEncoder()

In [None]:
# Encode every text column as ordinal codes (categories are numbered in sorted order)
objects = list(data.select_dtypes("object").columns)
data[objects] = enc.fit_transform(data[objects])
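
To see what the encoder actually does, here is a minimal sketch on a made-up two-column frame. `OrdinalEncoder` numbers the categories of each column in sorted order and keeps the mapping in `categories_`, so the codes can be translated back.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two of the categorical columns (values are illustrative).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

enc = OrdinalEncoder()
encoded = pd.DataFrame(enc.fit_transform(toy), columns=toy.columns, index=toy.index)

# Position in each list is the code assigned to that category.
mapping = {col: list(cats) for col, cats in zip(toy.columns, enc.categories_)}

# The fitted encoder can turn codes back into the original labels.
decoded = enc.inverse_transform(encoded)

print(encoded)
print(mapping)
```

The codes carry an order ("Female" < "Male" < "Other") that has no real meaning. Tree-based models such as LightGBM can usually still split on such codes, which is why ordinal encoding is an acceptable shortcut here.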

In [None]:
# Features are every column except the target
X = data.drop(columns="stroke")
Y = data["stroke"]

X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.15,random_state=101)

In [None]:
f2_scorer = make_scorer(fbeta_score, beta=2.0)
recall_scorer = make_scorer(recall_score)

In [None]:
random_model = lgbm.LGBMClassifier()

params = {
    'n_estimators': np.random.randint(50,120,80),
    'learning_rate': np.random.uniform(0.01,0.2,20),
    'num_leaves': np.random.randint(25,60,80),
    'boosting_type' : ['gbdt'],
    'objective' : ['binary'],
    'max_depth' : np.random.randint(3,11,20),
    'random_state' : [101],
    "scale_pos_weight": np.random.randint(15,500,1000)
    }

grid = RandomizedSearchCV(random_model,params,verbose=1,cv=10,n_jobs = -1,n_iter=150,scoring=recall_scorer)
grid.fit(X,Y)
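
Note that the search above is fitted on the full `X`, `Y` (so the later test rows take part in tuning) and is scored by recall rather than by the `f2_scorer` defined earlier. Below is a hedged sketch of a variant that tunes on the training split only and optimizes F2. To keep it self-contained it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `lgbm.LGBMClassifier`; the data, the stand-in model, the small `n_iter` and the parameter ranges are all assumptions for illustration.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for lgbm.LGBMClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data (about 10% positives), NOT the stroke dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=101
)

f2 = make_scorer(fbeta_score, beta=2.0)

# Distributions are sampled by the search itself, so runs are reproducible
# via random_state instead of depending on pre-drawn np.random arrays.
param_dist = {
    "n_estimators": randint(50, 120),
    "learning_rate": uniform(0.01, 0.19),  # uniform on [0.01, 0.20]
    "max_depth": randint(3, 11),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=101),
    param_dist,
    n_iter=5,  # kept small for the sketch
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=101),
    scoring=f2,
    random_state=101,
)
search.fit(X_tr, y_tr)  # the test split is never seen during tuning

test_f2 = fbeta_score(y_te, search.predict(X_te), beta=2.0)
print(search.best_params_, round(test_f2, 3))
```

With LightGBM installed, the stand-in could be swapped for `lgbm.LGBMClassifier` and `scale_pos_weight` added to the search space, as in the cell above.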

In [None]:
grid.best_params_

In [None]:
model = lgbm.LGBMClassifier(
    n_estimators = grid.best_params_.get("n_estimators"),
    learning_rate=grid.best_params_.get("learning_rate"),
    objective = "binary",
    num_leaves = grid.best_params_.get("num_leaves"),
    max_depth = grid.best_params_.get("max_depth"),
    boosting_type = "gbdt",
    scale_pos_weight=grid.best_params_.get("scale_pos_weight")
)

model.fit(X_train,y_train)

In [None]:
predictions = model.predict(X_test)
recall_results = recall_score(y_test,predictions)
f2_results = fbeta_score(y_test,predictions,beta=2.0)
precision_results = precision_score(y_test,predictions)
print('Precision Score: %.3f' % (precision_results))
print('F2 Score: %.3f' % (f2_results))
print('Recall Score: %.3f' % (recall_results))

In [None]:
beta_score_list = []
for i in range(1,101):
    fbeta_results = fbeta_score(y_test,predictions,beta=i)
    beta_score_list.append(fbeta_results)

In [None]:
fig2 = plt.figure(figsize=(10,6))
sns.lineplot(x=range(1,101),y=beta_score_list)
plt.title("Fbeta scores with beta 1 to 100")

# SUMMARY

We have built a model using the Gradient Boosting classifier from Microsoft LightGBM. It saved us a lot of preprocessing work. <br>
For encoding we used an ordinal encoder. <br>
To deal with class imbalance, we used the scale_pos_weight parameter of our model.<br>
Hyperparameter tuning was done with RandomizedSearchCV (because it is faster than GridSearchCV).<br>
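
A commonly used starting point for `scale_pos_weight` is the ratio of negative to positive samples in the training target. The sketch below computes it on a made-up label vector; the notebook itself searched a much wider range (15 to 500), so treat the ratio as a baseline to compare against, not as the value that was used.

```python
import numpy as np

# Toy target: 950 negatives, 50 positives (illustrative, not the stroke data).
y_toy = np.array([0] * 950 + [1] * 50)

n_pos = int((y_toy == 1).sum())
n_neg = int((y_toy == 0).sum())
ratio = n_neg / n_pos  # 950 / 50 = 19.0

# Assumed usage: lgbm.LGBMClassifier(scale_pos_weight=ratio)
print(ratio)
```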

We measured our model performance with three metrics.<br>
- FBeta<br>
- Recall<br>
- Precision<br>

Best obtained results<br>
Precision Score: 0.094<br>
F2 Score: 0.333<br>
Recall Score: 0.909<br>

Our conclusion is that with this particular dataset it is really difficult to predict stroke. <br>
The Recall score indicates that we detect most of the true positive cases, so our model is pretty good at this.<br>
Unfortunately, Precision sits at only 9.4%, which means that the majority of our positive predictions are false alarms.<br>
Our final score is the F2 score, which balances Precision and Recall (weighting Recall more heavily) and, in short, indicates the overall performance of our model.<br>
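
As a quick sanity check, the reported F2 can be recomputed from the reported Precision and Recall with the standard Fbeta formula. The three input numbers come from the results above; nothing else is assumed.

```python
precision, recall, beta = 0.094, 0.909, 2.0

# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(round(f2, 3))  # about 0.332, in line with the reported 0.333 given rounded inputs
```

Despite the 90.9% recall, the low precision drags F2 down to roughly one third.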

Now, what can we do with this model? With this knowledge we can consider applications. <br>
If we achieved a little more precision and recall (let's say recall at 96% and precision at 15-20%), then this model would be quite good for indicating which patients are in the group at higher risk of stroke. Other than that, it is a pretty poorly performing model.<br>
We can always search for better parameters, but I doubt we will get much more out of this dataset (if you have any ideas on how to improve it, let me know).<br>

I have seen tons of people measuring accuracy here. They do not take class imbalance or other metrics into account, and that makes their models simply unrealistic.<br>

As always, thank you for your attention.<br>
If you have noticed any mistakes or room for improvement, please leave a comment.<br>