# Risk of Stroke Explained by QLattice
The QLattice is a supervised machine learning tool for symbolic regression developed by Abzu. It is inspired by Richard Feynman's path integral formulation, which is why the Python module used to access it is called Feyn and why the Q in QLattice stands for Quantum.

Abzu provides free QLattices for non-commercial use to anyone. These free community QLattices get allocated to us automatically if we use Feyn without an active subscription, as we will do in this notebook. Read more about how it works here: https://docs.abzu.ai/docs/guides/getting_started/community.html

The feyn Python module is not installed on Kaggle by default, so we have to pip install it first.

In [None]:
!pip install feyn

## Import packages

In [None]:
import feyn
import pandas as pd
import numpy as np

## Import data
.. and drop ID column

In [None]:
data = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
data.drop('id', axis = 1, inplace = True)
data

### Impute missing values
.. in 'bmi' column

In [None]:
missing_col = ['bmi']

# Impute missing values with the column median
for col in missing_col:
    data[col] = data[col].fillna(data[col].median())


## Data Exploration
Consider the balance in the dataset. We're dealing with a heavily skewed outcome variable. We need to deal with this.

In [None]:
data.stroke.value_counts()

Let's have a look at the numerical variation in this dataset. Stroke is closely linked to age: the higher the age, the higher the risk of stroke. The imbalance makes it difficult to eyeball the differences in distributions between stroke and no stroke for most of the other features. However, it looks like hypertension, heart disease, and average glucose level may all play a role. Contrary to my expectation, bmi does not seem to play a major role here.

In [None]:
from seaborn import pairplot
pairplot(data, hue = 'stroke')

For categorical features, a crosstab of occurrence is useful. Here we look at the occurrence of stroke for each smoking status: current and former smokers are overrepresented among stroke events, with occurrences of 5.3% and 7.9%, respectively, while never smokers and unknowns show 4.7% and 3%. Let's move from exploration mode into modelling mode.

In [None]:
pd.crosstab(data['smoking_status'], data['stroke'], normalize = 'index')

## Train-test-split
First we split our data into train and test. To keep things simple, I omit the otherwise important validation set for now. Due to the heavy imbalance in the output variable, I stratify the split on 'stroke' so that both sets keep the same proportion of stroke events.

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size = 0.3, stratify = data.stroke, random_state = 42)

## Sample weights
How do we deal with the imbalance? I choose to apply sample weighting. I assign a weight to each observation according to the prevalence of stroke. In practical terms, this means that stroke individuals weigh about 20 times more than non-stroke individuals, since stroke individuals constitute around 5% of the population. When the models are fitted with stochastic gradient descent, correctly classifying a stroke individual rewards the model about 20 times as much as correctly classifying a non-stroke individual.

In [None]:
sample_weights = np.where(train.stroke == 1, data.stroke.value_counts()[0]/data.stroke.value_counts()[1], 1)
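
To see what this weighting does, here is a minimal sketch on made-up labels (95 non-stroke and 5 stroke cases, not the real dataset). It applies the same ratio-based rule as above and checks that both classes end up carrying the same total weight. The names `labels` and `toy_weights` are only for illustration.

```python
import numpy as np

# Toy labels (illustrative only): 95 non-stroke (0) and 5 stroke (1) cases
labels = np.array([0] * 95 + [1] * 5)

n_negative = int((labels == 0).sum())
n_positive = int((labels == 1).sum())

# Same rule as above: stroke cases get the non-stroke/stroke ratio, others get 1
ratio = n_negative / n_positive
toy_weights = np.where(labels == 1, ratio, 1.0)

# After weighting, each class contributes the same total weight
weight_positive = toy_weights[labels == 1].sum()
weight_negative = toy_weights[labels == 0].sum()
print(ratio, weight_positive, weight_negative)
```

With a 5% prevalence the ratio is 19, which is where the "about 20 times" in the text comes from.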

# QLattice application
Connect to a Community QLattice, available for non-commercial use. For more about this method, visit the [docs page](https://docs.abzu.ai/).

In [None]:
ql = feyn.connect_qlattice()
ql.reset(42)

## Assign semantic types
We distinguish between categorical and numerical features, and the QLattice needs to be told the semantics of each feature. I split by the dtypes of the features, treating object columns as categorical.

In [None]:
stypes = {}
for f in data.columns:
    if data[f].dtype =='object':
        stypes[f] = 'c'
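
The same mapping can also be written as a dictionary comprehension. The sketch below runs on a small made-up DataFrame (`toy` is not the real data) and selects non-numeric columns with `select_dtypes` instead of comparing against 'object'. That swap is my own assumption, made because string columns are not stored with the object dtype in every pandas version. As in the loop above, numeric columns get no entry.

```python
import pandas as pd

# Small made-up frame mimicking a few columns of the stroke data
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0],
    "smoking_status": ["never smoked", "smokes", "Unknown"],
    "stroke": [1, 1, 0],
})

# Mark every non-numeric column as categorical ('c'); numeric columns are left out
toy_stypes = {col: "c" for col in toy.select_dtypes(exclude="number").columns}
print(toy_stypes)
```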

## Model Search
Start the flow of models from the QLattice to your PC. Define the rules of the game: which models are you interested in?

In [None]:
models = ql.auto_run(train,
                     output_name = 'stroke',
                     kind = 'classification',
                     n_epochs = 10,
                     threads = 6,
                     criterion='bic',
                     stypes = stypes,
                     sample_weights=sample_weights
                    )
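
The `criterion='bic'` argument refers to the Bayesian information criterion, which trades goodness of fit against model complexity. How Feyn computes it internally is not shown in this notebook, so the sketch below is only the textbook formula, BIC = k ln(n) - 2 ln(L), applied to hypothetical numbers. The function `bic` and its arguments are illustrative names, not part of the feyn API.

```python
import math

def bic(n_params, n_samples, log_likelihood):
    """Textbook BIC: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

# Hypothetical numbers: two models with an identical fit on 1000 samples
simple_model = bic(n_params=3, n_samples=1000, log_likelihood=-600.0)
complex_model = bic(n_params=12, n_samples=1000, log_likelihood=-600.0)

# With equal fit, the model with fewer parameters gets the lower (better) BIC
print(simple_model < complex_model)
```

Under this criterion, a larger model has to improve the fit enough to pay for its extra parameters, which pushes the search toward simple expressions.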

## Pick preferred model
'models' is a list of differentiated models explaining your output. Go and inspect them.

In [None]:
my_model = models[0]

## Evaluate model
Inspect the model that best fits the chosen BIC criterion. The graph is colored by Pearson correlation, which displays the signal flow through the model. We also look at performance measures across the train and test splits.

In [None]:
my_model.plot(train, test)

## These models are math really
Get the mathematical expression of your model with 'sympify'.

In [None]:
my_model.sympify(2, symbolic_lr=True)

How well would we predict risk of stroke for out-of-sample individuals? Let's have a look at the confusion matrix. With this model we correctly predict 60 out of 75 actual stroke individuals, a recall of 80%. The precision of the model, on the other hand, is low: when we predict stroke we are right only 12% of the time, or 60 of 520 predicted. Depending on the application of the model, you could weigh recall and precision differently by changing the threshold for when to predict stroke (the default here being 50%, heavily affected by our sample weights setting).

In [None]:
my_model.plot_confusion_matrix(test)
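
The recall and precision quoted above follow directly from the confusion matrix counts. As a small worked check, using only the numbers stated in the text (the variable names are mine):

```python
# Counts quoted in the text above (test set)
true_positives = 60        # stroke cases predicted as stroke
actual_positives = 75      # all stroke cases in the test set
predicted_positives = 520  # all individuals predicted as stroke

false_negatives = actual_positives - true_positives     # missed stroke cases
false_positives = predicted_positives - true_positives  # false alarms

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)

print(f"recall = {recall:.2f}, precision = {precision:.2f}")
# prints: recall = 0.80, precision = 0.12
```

So the model misses 15 stroke cases and raises 460 false alarms at the default threshold.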

## Interpretability
Our models are inherently interpretable since we are dealing with simple math. However, to make the model's behavior even easier to grasp, a set of partial plots is available. Here we look at the effect of age on stroke risk at three different values of average glucose level. It becomes obvious that age is strongly linked to risk of stroke, with risk increasing from the age of 20 to 30 (depending on glucose level) all the way up to the oldest individuals in the sample, around 80. One stroke in a young child means that the algorithm has tried to capture it, which is why we see a slightly elevated risk for young children.

In [None]:
my_model.plot_partial(train, by = 'age', fixed = {'avg_glucose_level': [50, 150, 250]})

## Comparing predictive power to other go-to machine learning techniques
We ask: Can such a simple model really compete?

In [None]:
# Do one hot encoding for compatibility
data_ohe = pd.get_dummies(data)

# Perform same train-test-split on prepared data
train_ohe, test_ohe = train_test_split(data_ohe, test_size = 0.3, stratify = data.stroke, random_state=42)

Apply Random Forest, Gradient Boosting and Logistic Regression, which are readily available for benchmarking purposes in the feyn library.

In [None]:
rf = feyn.reference.RandomForestClassifier(train_ohe, output_name='stroke')
gb = feyn.reference.GradientBoostingClassifier(train_ohe, output_name='stroke')
lr = feyn.reference.LogisticRegressionClassifier(train_ohe, output_name='stroke', max_iter=10000)

The resulting AUCs are all around the same level, with Logistic Regression being slightly higher than the QLattice (0.84 vs. 0.83). Now consider that the model provided by the QLattice is a two-feature model, compared to the ten features used by the other models. This makes the functionality of the model much easier to grasp. We often see that you can boil things down to only a couple of features and still achieve competitive predictive power.

In [None]:
rf.plot_roc_curve(test_ohe, label="Random Forest")
gb.plot_roc_curve(test_ohe, label="Gradient Boosting")
lr.plot_roc_curve(test_ohe, label="Logistic Regression")
my_model.plot_roc_curve(test, label = "QLattice")
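
For readers less familiar with AUC: it can be read as the probability that a randomly chosen stroke case gets a higher predicted score than a randomly chosen non-stroke case. The sketch below computes it that way on made-up labels and scores. `pairwise_auc` is an illustrative helper, not a feyn function, and it is not how the plots above are produced.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities, for illustration only
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.4, 0.7, 0.6, 0.9]

auc = pairwise_auc(toy_labels, toy_scores)
print(auc)  # 7 of the 8 positive/negative pairs are ranked correctly
```

Because AUC only depends on how cases are ranked, it is unaffected by the choice of classification threshold, which makes it a fair basis for comparing the four models.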

Thanks for reading. If you are interested, please go and apply the QLattice to your own problems and share your experience.