# Explainable models: use a Community QLattice to find explainable models for diabetes

The QLattice is a supervised machine learning tool for symbolic regression developed by [Abzu](https://www.abzu.ai). It is inspired by Richard Feynman's path integral formulation, which is why the Python module used to access it is called *Feyn* and why the *Q* in QLattice stands for Quantum.

Abzu provides free QLattices for non-commercial use to anyone. These free Community QLattices get allocated automatically when we use Feyn without an active subscription, as we will do in this notebook. Read more about how it works here: https://docs.abzu.ai/docs/guides/getting_started/community.html

The `feyn` Python module is not installed on Kaggle by default, so we have to pip install it first.

__Note__: the pip install will fail unless you enable *Internet* in the notebook *Settings* panel on the right.

In [None]:
!pip install feyn

# Python imports
In this notebook we will use only three Python modules: `feyn` to access the QLattice, `pandas` to load and manipulate the data, and `sklearn` to split the data into train and test sets.

In [None]:
import feyn
import pandas as pd
import sklearn.model_selection

# Data
Read in the data and have a quick look at it:

In [None]:
data = '/kaggle/input/predict-diabetes-based-on-diagnostic-measures/diabetes.csv'
df = pd.read_csv(data)
df

# Adjust data types
The "gender" and "diabetes" features are really boolean, but they are represented as text. Let's start by fixing that.
We will also remove the patient_number column, since it only identifies the patient and a model that used it would overfit.

In [None]:
df["gender"]=(df["gender"]=="male").astype(int)
df["diabetes"]=(df["diabetes"]=="Diabetes").astype(int)
df.drop(["patient_number"], axis=1, inplace=True)

There is also a problem with the three real-valued columns: "chol_hdl_ratio", "bmi" and "waist_hip_ratio". They use a comma as the decimal separator, European-style, which the csv parser in pandas did not know about. Let's fix that too:

In [None]:
df["bmi"] = df["bmi"].str.replace(",",".").astype(float)
df["waist_hip_ratio"] = df["waist_hip_ratio"].str.replace(",",".").astype(float)
df["chol_hdl_ratio"] = df["chol_hdl_ratio"].str.replace(",",".").astype(float)
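The three conversions above follow the same pattern, so they could also be written as a loop. The following is a minimal sketch on a small hypothetical frame (the values are made up for illustration and are not taken from the dataset); `regex=False` makes the replacement a literal string replacement.

```python
import pandas as pd

# Hypothetical sample rows that mimic the comma-decimal text columns.
sample = pd.DataFrame({
    "bmi": ["22,5", "26,4"],
    "waist_hip_ratio": ["0,84", "0,83"],
    "chol_hdl_ratio": ["3,9", "3,6"],
})

decimal_comma_columns = ["bmi", "waist_hip_ratio", "chol_hdl_ratio"]
for column in decimal_comma_columns:
    # Replace the literal comma with a dot, then parse the text as a float.
    sample[column] = sample[column].str.replace(",", ".", regex=False).astype(float)

print(sample.dtypes)
```

`pd.read_csv` also accepts a `decimal` argument, which may make the conversion step unnecessary, depending on how the file is quoted.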

# Target balance
Let's have a quick look at the balance of the target variable:

In [None]:
df.diabetes.value_counts()

We are skewed towards "No diabetes". 
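To express the skew as a share rather than a raw count, `value_counts` can be asked for proportions. This is a small sketch on a hypothetical target column, not the real data:

```python
import pandas as pd

# Hypothetical 0/1 target, for illustration only (not the real dataset).
target = pd.Series([0, 0, 0, 0, 1], name="diabetes")

# normalize=True returns the share of each class instead of raw counts.
shares = target.value_counts(normalize=True)
print(shares)
```

On the notebook's dataframe the equivalent call would be `df.diabetes.value_counts(normalize=True)`.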

# Splitting
Let's split the data into train and test sets. We will stratify by diabetes and take roughly 2/3 (`train_size=.66`) for training. This will leave 20 diabetic patients in the test set, which is quite low for a reliable evaluation.

In [None]:
train, test = sklearn.model_selection.train_test_split(df,stratify=df["diabetes"], train_size=.66, random_state=1)

In [None]:
test.diabetes.value_counts()
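To see what `stratify` buys us, here is a minimal sketch on hypothetical toy data (100 rows, 20 of them positive). It uses the same `train_test_split` arguments as above and compares the positive rate in the two parts; the frame and variable names are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20 of them positive (illustration only).
toy = pd.DataFrame({
    "feature": range(100),
    "diabetes": [1] * 20 + [0] * 80,
})

toy_train, toy_test = train_test_split(
    toy, stratify=toy["diabetes"], train_size=0.66, random_state=1
)

# With stratification, the positive rate should be close in both parts.
train_rate = toy_train["diabetes"].mean()
test_rate = toy_test["diabetes"].mean()
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Without `stratify`, a small minority class can end up noticeably over- or under-represented in the test set purely by chance.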

# Allocate a QLattice
The actual QLattice is a quantum simulator that runs on Abzu's hardware, but we can allocate one to use for our analysis with a single line of code. Hopefully the following line will get us one.

In [None]:
ql = feyn.connect_qlattice()

# Resetting and reproducibility
The QLattice has the potential to store learnings between sessions to enable transfer of learning and federated learning. This is not possible with Community QLattices, since a new one gets allocated whenever we run the notebook, so it is not strictly necessary to call the reset function on our new QLattice. 

But the reset function also allows us to provide a random seed, which will ensure that we get the same results every time we run this notebook:

In [None]:
ql.reset(random_seed=1)


# Search for the best model

We are now ready to instruct the QLattice to search for the best mathematical model to explain the data. Here we use the high-level convenience function that does everything with sensible defaults: https://docs.abzu.ai/docs/guides/essentials/auto_run.html.

For more detailed control, we could use the primitives: https://docs.abzu.ai/docs/guides/primitives/using_primitives.html

NOTE: This will take a minute to complete. It involves work done on the QLattice machine remotely as well as in the local notebook. The part that runs locally is slowing things down because of the limited CPU resources on Kaggle. Running the same on my machine locally only takes 10 seconds!

In [None]:
models = ql.auto_run(train, output_name="diabetes", kind="classification", criterion="aic")
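The `criterion="aic"` argument presumably refers to the Akaike information criterion, which trades goodness of fit against the number of parameters. The sketch below shows the textbook definition only; how Feyn computes or applies it internally is not covered in this notebook, and the helper names and numbers are hypothetical.

```python
import math


def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


def bernoulli_log_likelihood(y_true, p_pred):
    """Log-likelihood of 0/1 labels under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )


# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 1, 1]
p_pred = [0.1, 0.2, 0.7, 0.9]

ll = bernoulli_log_likelihood(y_true, p_pred)
# Same fit, but the second model "pays" for three extra parameters.
print(aic(ll, n_parameters=2), aic(ll, n_parameters=5))
```

Under a criterion of this kind, a more complex expression is preferred only if it improves the fit enough to offset its extra parameters, which favours compact models.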

# A quite simple model
We have ended up with a model that uses only two features, age and glucose. That is quite minimal, and impressive if it works well. 

# Evaluate the performance of the model
Let's look at the ROC curves of the model on the training and the test data:

In [None]:
models[0].plot_roc_curve(train, label="Training data")
models[0].plot_roc_curve(test, label="Test data")

# Seems good
With only two features we get an AUC of .94 on the test data. This is pretty much exactly the same as the performance on the training data, indicating that the model generalises very well. This is consistent with other findings, for example this research paper:
[Symbolic regression outperforms other models for small data sets](https://arxiv.org/abs/2103.15147)
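For readers unfamiliar with the AUC figure quoted above, here is a small sketch of how it can be computed with scikit-learn. The labels and scores are hypothetical and unrelated to the diabetes data.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.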

Let's have a look at the actual mathematical expression found:

In [None]:
models[0].sympify(2)

# Comparison
Finally, let's compare it to the usual modelling methods: random forest, gradient boosting and logistic regression.

In [None]:
rf = feyn.reference.RandomForestClassifier(train, output_name="diabetes")
gb = feyn.reference.GradientBoostingClassifier(train, output_name="diabetes")
lr = feyn.reference.LogisticRegressionClassifier(train, output_name="diabetes", max_iter=10000)

In [None]:
rf.plot_roc_curve(test, label="Random Forest")
gb.plot_roc_curve(test, label="Gradient Boosting")
lr.plot_roc_curve(test, label="Logistic Regression")
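The same kind of comparison can be sketched with scikit-learn directly, which gives the test AUC as a number rather than a plot. This does not use the `feyn.reference` wrappers; the data are synthetic and the variable names are made up for the example, so the printed values say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only (not the diabetes dataset).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.66, random_state=1
)

baselines = {
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
    "Logistic Regression": LogisticRegression(max_iter=10000),
}

test_auc = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    # Use the predicted probability of the positive class as the ranking score.
    test_auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {test_auc[name]:.3f}")
```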

# Pretty consistent
Curiously, all four models perform about equally well. The unique property of the QLattice is that it is able to explain the data using only two features in a fairly straightforward mathematical equation.

# Conclusion
Using the QLattice for symbolic regression, we were able to find a model that predicts diabetes in this dataset with the same accuracy as the usual black-box machine learning techniques.

A simple result such as this one will have much more clinical credibility than a black-box model, and will also help medical researchers understand what actually drives diabetes.

Note, though, that the findings are quite limited by the small size of the data set.