# Stroke prediction with LightGBM

In this activity we use a [dataset available on Kaggle](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) that contains a handful of features describing patients and a target column `stroke`, which indicates whether a patient has had a stroke. LightGBM is a powerful gradient-boosting ensemble model that trains and predicts quickly and accurately. We compare two ways of handling categorical data: LightGBM's built-in handling and manual preprocessing. An important characteristic of this dataset is that it is unbalanced. Some useful references are listed at the end of this notebook.
<img src="https://webstockreview.net/images/pain-clipart-heart-failure.jpg">


In [None]:
# Import packages
import numpy as np # Handling matrices
import pandas as pd # Data processing
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Plotting 
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder # Handling categorical data and normalization
from sklearn.model_selection import train_test_split # Split data in train and test
from sklearn.metrics import roc_auc_score,precision_score,confusion_matrix, accuracy_score, roc_curve, f1_score # Several useful metrics
import lightgbm as lgb # LightGBM

# Set matplotlib configuration
%matplotlib inline
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')

# 1) Review and analysis of data

In [None]:
# Import data
data = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
print("This dataset contains: {} rows and {} columns".format(data.shape[0],data.shape[1]))
data.head()

In [None]:
# Review the type of each feature
data.dtypes

In [None]:
# Our target "stroke" is stored as a number but is categorical in nature, so we convert it to object. We do the same for "heart_disease" and "hypertension"
data["stroke"] = data["stroke"].astype("object")
data["heart_disease"] = data["heart_disease"].astype("object")
data["hypertension"] = data["hypertension"].astype("object")

In [None]:
# Remove id
data.drop("id",axis = 1,inplace = True)

# Analyse missing values
data.isna().sum()

# Only bmi (body mass index) has missing values. It is a numerical feature and can be difficult to impute.

In [None]:
# Look at the distribution of bmi values
data["bmi"].describe()

In [None]:
# The number of missing bmi values is low, so we can remove them. First, we review the relation between bmi and stroke

# Create the boxplot 
plt.figure(figsize=(20,6))
sns.boxplot(x='stroke', y="bmi", data=data);
plt.title("Relation between BMI and stroke", fontsize=24)
plt.xlabel('Stroke')
plt.ylabel('BMI');

# bmi does not appear to be a very relevant feature. Patients without stroke tend to have a lower bmi, although some cases with a high bmi are observed

# Remove missing values
data.dropna(subset = ["bmi"], inplace=True)

In [None]:
# Identify categorical features
cat = (data.dtypes == 'object')
cat_cols = list(cat[cat].index)
print(cat_cols)

# Create a handful of plots
for cols in cat_cols:
    plt.figure(figsize=(8,4));
    sns.countplot(x = data[cols]);

In [None]:
# Take a closer look at some particular features
data["gender"].value_counts()

# The value "Other" is valid, but it appears in only one case. Additionally, the feature "smoking_status" has the class "Unknown".
# We keep this class to avoid creating missing values.

In [None]:
# An important point here is that the dataset is unbalanced:
# only a small number of patients have had a stroke.

In [None]:
# Create a list of numerical_cols
numerical_cols = [cname for cname in data.columns if data[cname].dtype in ['int64', 'float64']]

# Plot the distribution of each numerical feature
data[numerical_cols].hist(bins=15, figsize=(15, 6), layout=(2, 3));

In [None]:
# We can create a model using these features.

# Separate target from predictors 
y = data[["stroke"]].copy()

# LightGBM needs a numerical target
y["stroke"] = y["stroke"].astype("int64")

X = data.drop(['stroke'], axis=1).copy()

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, 
                                                                test_size=0.2,random_state = 0)


# 2) Manual preprocessing of categorical features

Now we transform the categorical data into numerical values. Different approaches are available for this, such as label encoding and one-hot encoding. More information about these methods is available in [this section of the Kaggle ML course](https://www.kaggle.com/alexisbcook/categorical-variables). Reviewing our categorical features, we can distinguish ordinal from nominal ones. Unlike nominal categories, ordinal categories follow an order. For example, heart_disease can be treated as an ordinal feature, because class 1 (having a heart disease) is worse than class 0 (not having a heart disease). On the other hand, we cannot order the feature gender, because none of its classes ("Male", "Female" and "Other") is better than the others.
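One practical detail when one-hot encoding the training and validation splits separately: a rare category (for instance a gender value that occurs only once) may be missing from one split, so the two encoded frames can end up with different columns. The following is a minimal sketch of how the columns could be aligned with `reindex`. The toy values are hypothetical and not taken from the stroke dataset.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
train = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})
valid = pd.DataFrame({"gender": ["Female", "Male"]})

# One-hot encode the nominal feature on each split separately
dummy_train = pd.get_dummies(train[["gender"]], prefix=["gender"])
dummy_valid = pd.get_dummies(valid[["gender"]], prefix=["gender"])

# "Other" is absent from the validation split, so its column is missing there
print("Before alignment:", list(dummy_valid.columns))

# Reindex against the training columns; missing columns are filled with zeros
dummy_valid = dummy_valid.reindex(columns=dummy_train.columns, fill_value=0)
print("After alignment: ", list(dummy_valid.columns))
```

With aligned columns, the model receives the same features in the same order at training and prediction time.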

In [None]:
# Preprocessing data

# We use label encoding, one-hot encoding and MinMaxScaler on different features

# Apply label encoding to the ordinal features. Make copies to avoid changing the original data
ord_cat = ["heart_disease","hypertension"]
label_train = X_train[ord_cat].copy()
label_valid = X_valid[ord_cat].copy()
label_encoder = LabelEncoder()
for col in ord_cat:
    label_train[col] = label_encoder.fit_transform(X_train[col])
    label_valid[col] = label_encoder.transform(X_valid[col])

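# Apply one-hot encoding to each nominal column
nom_cat = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
dummy_train = pd.get_dummies(X_train[nom_cat], prefix=nom_cat)
dummy_valid = pd.get_dummies(X_valid[nom_cat], prefix=nom_cat)
# Align validation columns with training columns (a rare category may be absent from one split)
dummy_valid = dummy_valid.reindex(columns=dummy_train.columns, fill_value=0)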

# Scale numerical data to the range [0, 1]. We use min-max scaling because our data is not normally distributed
scaler = MinMaxScaler()
s_cols_train = pd.DataFrame(scaler.fit_transform(X_train[numerical_cols]))
s_cols_valid = pd.DataFrame(scaler.transform(X_valid[numerical_cols]))

# Rename columns
s_cols_train.columns = numerical_cols
s_cols_valid.columns= numerical_cols

# Scaling removed the index; put it back
s_cols_train.index = X_train.index
s_cols_valid.index = X_valid.index

# Combine scaled numerical, one-hot encoded and label-encoded features
X_train_modified = pd.concat([s_cols_train, dummy_train, label_train], axis=1)
X_valid_modified = pd.concat([s_cols_valid, dummy_valid, label_valid], axis=1)

# 3) Use LightGBM with modified data

In [None]:
# Create the model
d_train = lgb.Dataset(X_train_modified, label=y_train)
params = {}  # Model parameters
params['learning_rate'] = 0.015  # Learning rate
params['boosting_type'] = 'gbdt'  # Gradient Boosting Decision Tree
params['objective'] = 'binary'  # Binary target
params['metric'] = 'auc'  # Metric for binary classification
params['max_depth'] = 200  # Maximum tree depth
params['bagging_fraction'] = 0.8  # Row subsampling; only takes effect when bagging_freq > 0
params['force_row_wise'] = True  # Force row-wise histogram building
params['unbalance'] = True  # Alias of is_unbalance: account for the unbalanced dataset
clf = lgb.train(params, d_train, 200)  # Train the model for 200 boosting rounds

In [None]:
# Prediction on the validation set
y_pred=clf.predict(X_valid_modified, predict_disable_shape_check=True)
y_pred.shape

In [None]:
# Function to plot ROC curve
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

fpr, tpr, thresholds = roc_curve(y_valid, y_pred)
plot_roc_curve(fpr, tpr)

In [None]:
# Create Confusion Matrix
pred_class = y_pred > 0.5
pred_class = pred_class.astype(int)
cm = confusion_matrix(y_valid, pred_class)
print(cm)

# Get accuracy
accuracy = round(accuracy_score(y_valid,pred_class),4)
print("Accuracy: {}".format(accuracy),"\n")

# Get F1 score (required by Task 1 of this dataset)
f1 = f1_score(y_valid,pred_class)
print("F1: {}".format(f1),"\n")

In [None]:
# See the feature importance
importance_feature = pd.DataFrame({'Value':clf.feature_importance(),'Feature':clf.feature_name()}).sort_values(by="Value", ascending=False)

# Create a plot
plt.figure(figsize=(20, 10))
sns.barplot(x = 'Value',y = 'Feature',data = importance_feature);
plt.title("Feature importance");

We correctly predict 19 of the 43 patients with stroke. A different configuration could give a better result, but it would likely also increase the number of mistaken predictions (false positives) considerably. In the last plot, we see that the most important features are glucose level, bmi and age. Hypertension and smoking status are the most important categorical features.
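The trade-off mentioned above can be illustrated by moving the classification threshold. The following is a minimal sketch using hypothetical labels and probabilities (not the outputs of the model above); `counts_at` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical labels and predicted probabilities (illustration only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.30, 0.60, 0.80, 0.15])

def counts_at(threshold):
    """Return (true positives, false positives) for a given cut-off."""
    pred = (y_prob > threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    return tp, fp

# Lowering the threshold catches more positives but also raises false alarms
for t in (0.5, 0.25):
    tp, fp = counts_at(t)
    print(f"threshold={t}: true positives={tp}, false positives={fp}")
```

In this toy example, lowering the cut-off from 0.5 to 0.25 recovers one more positive case at the cost of two additional false positives.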

# 4) Use LightGBM and its method to handle categorical feature

In [None]:
# LightGBM can also handle categorical data directly. We now try its built-in method

# Transform categorical features into the appropriate type that is expected by LightGBM
for c in X_train.columns:
    col_type = X_train[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        X_train[c] = X_train[c].astype('category')
        
for c in X_valid.columns:
    col_type = X_valid[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        X_valid[c] = X_valid[c].astype('category')

In [None]:
# Create the model
d_train = lgb.Dataset(X_train, label=y_train)
params = {}  # Model parameters
params['learning_rate'] = 0.015  # Learning rate
params['boosting_type'] = 'gbdt'  # Gradient Boosting Decision Tree
params['objective'] = 'binary'  # Binary target
params['metric'] = 'auc'  # Metric for binary classification
params['max_depth'] = 200  # Maximum tree depth
params['bagging_fraction'] = 0.8  # Row subsampling; only takes effect when bagging_freq > 0
params['force_row_wise'] = True  # Force row-wise histogram building
params['unbalance'] = True  # Alias of is_unbalance: account for the unbalanced dataset
clf = lgb.train(params, d_train, 200)  # Train the model for 200 boosting rounds

# Prediction on the validation set
y_pred2 = clf.predict(X_valid, predict_disable_shape_check=True)

In [None]:
# Review the params of our model to check whether the categorical columns are correctly considered.
clf.params

In [None]:
# Function to plot ROC curve
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

fpr, tpr, thresholds = roc_curve(y_valid, y_pred2)
plot_roc_curve(fpr, tpr)

In [None]:
# Create Confusion Matrix
pred_class = y_pred2 > 0.5
pred_class = pred_class.astype(int)
cm2 = confusion_matrix(y_valid, pred_class)
print(cm2)

# Get accuracy
accuracy2 = round(accuracy_score(y_valid,pred_class),4)
print("Accuracy: {}".format(accuracy2),"\n")

# Get F1 score (required by Task 1 of this dataset)
f1_2 = f1_score(y_valid,pred_class)
print("F1: {}".format(f1_2),"\n")


In [None]:
# See the feature importance
importance_feature = pd.DataFrame({'Value':clf.feature_importance(),'Feature':clf.feature_name()}).sort_values(by="Value", ascending=False)

# Create a plot
plt.figure(figsize=(20, 10))
sns.barplot(x = 'Value',y = 'Feature',data = importance_feature);
plt.title("Feature importance");

# 5) Compare results

In [None]:
# We can summarize the results
print("Accuracy modifying categorical features manually: {}".format(accuracy))
print("Accuracy using LightGBM to handle categorical features: {}".format(accuracy2),"\n")

# Compare the confusion matrices
print("Confusion matrix using manual handling: \n", cm,"\n")
print("Confusion matrix LightGBM: \n", cm2, "\n")

# We also add the F1 score, which is required by Task 1 of this dataset
print("F1 modifying categorical features manually: {}".format(f1))
print("F1 using LightGBM to handle categorical features: {}".format(f1_2),"\n")

* With this second model, we obtain a better result by using LightGBM's built-in method to handle categorical data. Accuracy improved and we also correctly predict two more patients, without any trade-off. [This page](https://medium.com/swlh/dealing-with-categorical-variables-in-machine-learning-4401b949b093) explains more about using LightGBM to handle categorical data. Any comments are welcome!
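Because the dataset is unbalanced, accuracy on its own can flatter a model, which is one reason to report the F1 score as well. The following is a minimal sketch with hypothetical confusion-matrix counts (not the results obtained above); the two helper functions are illustrative and mirror the standard definitions.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tn + fp + fn + tp)

def f1(tn, fp, fn, tp):
    """Harmonic mean of precision and recall (0 when there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts (tn, fp, fn, tp) for 1000 patients, 50 of them with stroke
always_negative = (950, 0, 50, 0)   # a "model" that never predicts stroke
some_model = (900, 50, 25, 25)      # a model that finds half of the stroke cases

for name, counts in [("always negative", always_negative), ("some model", some_model)]:
    print(f"{name}: accuracy={accuracy(*counts):.3f}, F1={f1(*counts):.3f}")
```

In this toy case the trivial predictor has the higher accuracy, yet its F1 score is zero because it never identifies a stroke patient.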

# References

* https://nitin9809.medium.com/lightgbm-binary-classification-multi-class-classification-regression-using-python-4f22032b36a2
* https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc
* https://www.kaggle.com/alexisbcook/categorical-variables
* https://www.kaggle.com/ashishpatel26/feature-importance-of-lightgbm
* https://medium.com/swlh/dealing-with-categorical-variables-in-machine-learning-4401b949b093