# Using the QLattice to understand mushroom toxicity 
The QLattice is a supervised machine learning tool for symbolic regression developed by [Abzu](https://www.abzu.ai). It is inspired by Richard Feynman's path integral formulation. That's why the Python module used to access it is called *Feyn*, and the *Q* in QLattice stands for Quantum.

Abzu provides free QLattices for non-commercial use to anyone. These free community QLattices get allocated for you automatically if you use Feyn without an active subscription, as we will do in this notebook. Read more about how it works here: https://docs.abzu.ai/docs/guides/getting_started/community.html

The feyn Python module is not installed on Kaggle by default so we have to pip install it first. 

__Note__: the pip install will fail unless you enable *Internet* in the *Settings* panel on the right.

In [None]:
!pip install feyn

# Python imports
In this notebook we will only use three Python modules: the `feyn` module to access the QLattice, the `pandas` module to load and handle the data, and `sklearn` to split the data into train and test sets.

In [None]:
import feyn
import pandas as pd
import sklearn.model_selection

# Getting the Data
First let's load the dataset and take a look

In [None]:
data = '/kaggle/input/mushroom-classification/mushrooms.csv'
df = pd.read_csv(data)
df

In [None]:
df.isna().sum()

# First impressions:
We notice that:
- The target variable is `class`, and can be represented as a boolean
- All data types are categorical. The QLattice works with both categorical and numerical data, but it needs to be told which columns are categorical (by default it assumes they are numerical)
- There are no missing entries

Since every entry in the `veil-type` column has the same value, the column carries no information, so we'll remove it

In [None]:
df.drop('veil-type', axis=1, inplace=True)
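
A quick way to confirm that a column is constant is to count its distinct values. The sketch below uses a small, made-up stand-in for the mushroom table (the `toy` frame and its values are assumptions for illustration); in the notebook the same call could be applied to `df` before the drop.

```python
import pandas as pd

# Hypothetical miniature stand-in for the mushroom table.
toy = pd.DataFrame({
    "odor": ["n", "f", "a", "n"],
    "veil-type": ["p", "p", "p", "p"],
    "class": ["e", "p", "e", "e"],
})

# nunique() counts distinct values per column;
# a count of 1 means the column never varies.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Any column listed here cannot help a model tell the classes apart, which is why dropping it is safe.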

Let's change our target column, `class`, to boolean

In [None]:
df["class"]=df["class"].replace({"p":True, "e":False}).astype(bool)

In [None]:
df

# Splitting the data
Let's split the data into train, test, and holdout sets. We will stratify by `class` and take roughly two thirds (66%) of the entire dataset for training. The remaining third is split evenly into a test set and a holdout set; the holdout set represents how our model could perform in the real world. More on this later.

In [None]:
train, test = sklearn.model_selection.train_test_split(df, stratify=df["class"], train_size=.66, random_state=1)
test, holdout = sklearn.model_selection.train_test_split(test, stratify=test["class"], test_size=.5, random_state=1)
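
To make the proportions concrete, here is a rough sketch of the row counts these two calls produce. The helper `approximate_split_sizes` is hypothetical, and its rounding only approximates what `train_test_split` does internally, so the real counts may differ by a row.

```python
def approximate_split_sizes(n_rows, train_fraction=0.66, holdout_fraction_of_rest=0.5):
    """Estimate the train/test/holdout sizes for the two-step split."""
    # First split: train_fraction of all rows goes to training.
    n_train = round(n_rows * train_fraction)
    n_rest = n_rows - n_train
    # Second split: the remainder is divided between test and holdout.
    n_holdout = round(n_rest * holdout_fraction_of_rest)
    n_test = n_rest - n_holdout
    return {"train": n_train, "test": n_test, "holdout": n_holdout}

sizes = approximate_split_sizes(1000)
print(sizes)
```

For every 1000 rows, about 660 go to training and about 170 each to the test and holdout sets.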

# Setting data types
As mentioned earlier, the QLattice needs to be told which columns are categorical. We accomplish this by running through the dataframe and recording which columns contain object types. This is recorded in the dictionary `stypes` and passed to the QLattice to indicate that these features should be treated as categorical.

In [None]:
stypes = {}

for col in train.columns:
    if train[col].dtype == 'O':
        stypes[col] = 'c'
        
stypes["class"] = 'b'

In [None]:
stypes
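
The same dictionary can be built in one expression. The sketch below is an alternative, not the notebook's method: instead of checking for the object dtype, it marks every non-numeric column as categorical using `pandas.api.types.is_numeric_dtype`. The `toy` frame and its `cap-diameter` column are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame mixing a categorical, a numeric, and a boolean column.
toy = pd.DataFrame({
    "odor": ["n", "f", "a"],
    "cap-diameter": [4.2, 7.1, 5.5],
    "class": [False, True, False],
})

# Every column that is not numeric is flagged as categorical ('c').
# Boolean columns count as numeric here, so 'class' is skipped by the comprehension.
stypes = {col: "c" for col in toy.columns if not is_numeric_dtype(toy[col])}

# The boolean target gets its own semantic type, as in the cell above.
stypes["class"] = "b"
print(stypes)
```

On the mushroom data every feature is a string column, so both approaches should give the same result.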

# Allocate a QLattice
The actual QLattice is a quantum simulator that runs on Abzu's hardware, but we can allocate one with a single line of code. Cool, huh?

In [None]:
ql = feyn.connect_qlattice()

# Resetting and reproducibility
The QLattice has the potential to store learnings between sessions to enable transfer of learning and federated learning. This is not possible with Community QLattices, since a new one gets allocated whenever we run the notebook, so it is not strictly necessary to call the reset function on our new QLattice.

But the reset function also allows us to provide a random seed, which will ensure that we get the same results every time we run this notebook

In [None]:
ql.reset(random_seed=1)

# Search for the best model
We are now ready to instruct the QLattice to search for the best mathematical model to explain the data. Here we use the high-level convenience function that does everything with sensible defaults: https://docs.abzu.ai/docs/guides/essentials/auto_run.html

For more detailed control, we could use the primitives: https://docs.abzu.ai/docs/guides/primitives/using_primitives.html

NOTE: This will take a minute to complete. It involves work done remotely on the QLattice machine as well as in the local notebook. The local part is slowed down by the limited CPU resources on Kaggle. Running the same code on my own machine takes only 10 seconds!

In [None]:
models = ql.auto_run(train, output_name="class", kind="classification", stypes=stypes, criterion="aic")
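
The `criterion="aic"` argument asks for models to be compared using the Akaike information criterion, which rewards a good fit but penalizes extra parameters. Below is a minimal sketch of the textbook formula, AIC = 2k - 2 ln(L), applied to a binary classifier. The function names and the sample values are assumptions for illustration, and feyn's internal computation may differ in its details.

```python
import math

def bernoulli_log_likelihood(y_true, probabilities, eps=1e-12):
    """Log-likelihood of boolean labels given predicted probabilities of True."""
    total = 0.0
    for truth, p in zip(y_true, probabilities):
        # Clip to avoid log(0) for fully confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if truth else math.log(1.0 - p)
    return total

def aic(n_parameters, log_likelihood):
    """Textbook Akaike information criterion: lower is better."""
    return 2 * n_parameters - 2 * log_likelihood

# Hypothetical labels and predicted probabilities.
y_true = [True, False, True, False]
probabilities = [0.9, 0.1, 0.8, 0.2]

ll = bernoulli_log_likelihood(y_true, probabilities)
small_model = aic(3, ll)   # same fit, 3 parameters
large_model = aic(6, ll)   # same fit, 6 parameters
print(small_model, large_model)
```

With an identical fit, the model with fewer parameters gets the lower (better) score, which is why this criterion tends to favor compact models.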

# What did we find?
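`models` is a list of graphs ranked best first (since we passed `criterion="aic"`, the ranking should follow the Akaike information criterion rather than raw accuracy). Each model shows how the selected features, or inputs, interact to produce the output. We can access the best graph and see how it performs on the train and test sets by calling: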

In [None]:
models[0].plot(train,test)

Look at that performance! With only three features (odor, spore print color, and stalk color below ring) we can predict whether a given mushroom is edible or poisonous with incredible accuracy. What's more, we can also see specifically **how** the features interact with one another to predict toxicity.

# Understanding our model
We can see how each feature contributes to the model using `interactive_activation_flow`

In [None]:
from feyn.plots.interactive import interactive_activation_flow

In [None]:
interactive_activation_flow(models[0], train)

# Looking at probability scores
Another way to visualize how the model performs on its predictions is a probability score plot. It shows two histograms of the predicted probability of being poisonous: one for the edible (negative class) mushrooms and one for the poisonous (positive class) mushrooms.

In [None]:
models[0].plot_probability_scores(test)

We can see that the model does a pretty good job. Most often, it correctly assigns high probability scores to poisonous mushrooms and low ones to edible mushrooms.

The fact that we see most of our predictions at low or high probability scores is great! **Most of our predictions are not ambiguous**. Our model will most likely strongly suggest that a mushroom is poisonous, or strongly suggest that it is not. 
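
To illustrate what the plot summarizes, here is a minimal sketch that bins predicted probabilities separately for each true class. The helper `score_histogram` and the sample values are made up for illustration; this is not how feyn draws its plot.

```python
def score_histogram(y_true, scores, n_bins=5):
    """Count scores per equal-width bin over [0, 1], separately for each class."""
    counts = {True: [0] * n_bins, False: [0] * n_bins}
    for truth, score in zip(y_true, scores):
        # A score of exactly 1.0 belongs in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        counts[bool(truth)][index] += 1
    return counts

# Hypothetical labels (True = poisonous) and predicted probabilities.
y_true = [True, True, True, False, False, False]
scores = [0.97, 0.91, 0.65, 0.02, 0.08, 0.15]

hist = score_histogram(y_true, scores)
print("poisonous:", hist[True])
print("edible:   ", hist[False])
```

A well-separated model puts the poisonous counts in the right-most bins and the edible counts in the left-most bins, with little in between.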

# Confusion Matrices: when the model fails
It is important to note the small sliver of pink at the left-most part of the plot and the small sliver at around 0.6-0.7. These are poisonous mushrooms that the model either fails to flag as poisonous (the first case) or flags with less confidence (the second case).

We can visualize these better using **confusion matrices**

A standard confusion matrix shows the four mushrooms that are predicted as edible but are actually poisonous. When no threshold is set, the default of 0.5 is used (in other words, everything with a probability score of 0.5 or higher is predicted to be poisonous). The mushrooms that scored a 0.6-0.7 probability of being poisonous are therefore still predicted as poisonous: a correct, albeit less confident, classification

In [None]:
models[0].plot_confusion_matrix(test)

Now let's set the threshold to 0.8 so we can see the less "confidently" predicted poisonous mushrooms

In [None]:
models[0].plot_confusion_matrix(test, threshold=0.8)

We can see that three additional poisonous mushrooms now fall below the threshold and are counted as predicted edible
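
The effect of the threshold can be reproduced by hand. The sketch below counts the four cells of a confusion matrix for a given threshold, treating a score at or above the threshold as "predicted poisonous". The helper `confusion_counts` and the sample scores are assumptions for illustration only.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return true/false positive/negative counts at the given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and truth:
            counts["tp"] += 1      # poisonous, flagged as poisonous
        elif predicted and not truth:
            counts["fp"] += 1      # edible, flagged as poisonous
        elif not predicted and truth:
            counts["fn"] += 1      # poisonous, missed
        else:
            counts["tn"] += 1      # edible, correctly cleared
    return counts

# Hypothetical labels (True = poisonous) and predicted probabilities.
y_true = [True, True, True, True, False, False]
scores = [0.95, 0.65, 0.70, 0.10, 0.05, 0.30]

default = confusion_counts(y_true, scores)                # threshold 0.5
strict = confusion_counts(y_true, scores, threshold=0.8)
print(default)
print(strict)
```

Raising the threshold moves the mushrooms scored between 0.5 and 0.8 from the "predicted poisonous" column to the "predicted edible" column, so the count of missed poisonous mushrooms grows.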

# A useful model
Imagine you're out in nature and you come across some delicious-looking mushrooms. Are these tasty-looking treats edible? We can note some quick observations of their odor, spore print color, and stalk color below the ring, then use our model to predict whether or not they are poisonous.

We will simulate this using our holdout set and the **predict function**

In [None]:
predictions = models[0].predict(holdout)

This gives us an array of values between 0 and 1, each the predicted probability that the given mushroom is poisonous. Just by looking at the first few entries and comparing them to the true toxicity of the mushrooms, we can see that our model gives a pretty good indication of whether or not a mushroom is safe (e.g. for the first mushroom our model says that there is a 2.66% chance that it is poisonous, and it is, in fact, edible)

In [None]:
predictions

In [None]:
holdout["class"]
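
Rather than comparing the two outputs by eye, the probabilities can be lined up against the true labels. The sketch below is one way to do that; the helper `compare_predictions` is hypothetical, and apart from the 2.66% mentioned above the sample values are made up. In the notebook one would pass `predictions` and `holdout["class"]` instead.

```python
def compare_predictions(probabilities, actual, threshold=0.5):
    """Pair each probability with its true label and report the accuracy."""
    rows = []
    for probability, truth in zip(probabilities, actual):
        predicted = probability >= threshold
        rows.append((probability, predicted, truth, predicted == truth))
    accuracy = sum(row[3] for row in rows) / len(rows)
    return rows, accuracy

# Hypothetical predicted probabilities and true labels (True = poisonous).
probabilities = [0.0266, 0.98, 0.64, 0.03]
actual = [False, True, True, False]

rows, accuracy = compare_predictions(probabilities, actual)
for probability, predicted, truth, correct in rows:
    print(f"p={probability:.4f} predicted={predicted} actual={truth} correct={correct}")
print("accuracy:", accuracy)
```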

# What did we learn?
1. We can predict toxicity with **only three features** very well! 
2. The QLattice is extremely **easy to use**. Admittedly, other machine learning algorithms can also achieve this high accuracy; however, the QLattice requires no one-hot encoding
3. With QLattice models we can clearly see **how** the features interact to predict the target and feyn includes some cool plots and tools to visualize this
4. Our model has useful applications