
## Implementing the feature statistics using Facets

Lack of the right data often poisons an artificial intelligence (AI) project from the start. We are used to downloading ready-to-use datasets from Kaggle, scikit-learn, and other reliable sources.

We focus on learning how to use and implement machine learning (ML) algorithms. However, reality hits AI project managers hard on day one of a project.
Companies rarely have clean or even sufficient data for a project. Corporations have massive amounts of data, but they often come from different departments.

When you finally obtain a training dataset sample, you may find that your
AI model does not work as planned. You might have to change ML models or find
out what is wrong with the data. You are trapped right from the start. What you
thought would be an excellent AI project has turned into a nightmare.

You need to get out of this trap rapidly by first explaining the data availability problem. You must find a way to explain why the datasets require improvements. You must also explain which features require more data or better-quality data. You do not have the time or resources to develop a new explainable AI (XAI) solution for each project.

Facets Overview and Facets Dive provide visualization tools to analyze your
training and testing data feature by feature.

## Setup

In [None]:
!pip install facets-overview

In [2]:
import os
import pandas as pd
import base64
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

from IPython.display import display, HTML
from scipy.stats import entropy

In [3]:
%%shell

wget -q https://raw.githubusercontent.com/PacktPublishing/Hands-On-Explainable-AI-XAI-with-Python/master/Chapter03/DLH_train.csv
wget -q https://raw.githubusercontent.com/PacktPublishing/Hands-On-Explainable-AI-XAI-with-Python/master/Chapter03/DLH_test.csv



## Reading the data files

In [4]:
# Setting the path for each file
dtrain = "/content/DLH_train.csv"
dtest = "/content/DLH_test.csv"
print(dtrain, dtest)

/content/DLH_train.csv /content/DLH_test.csv


In [5]:
# Column names for the Denis Rothman research training and testing data
features = ["colored_sputum", "cough", "fever", "headache", "days", "france", "chicago", "class"]

The data files contain no headers, so we will use our features list to define the column names when loading the data into DataFrames:

In [6]:
train_data = pd.read_csv(dtrain, names=features, sep=r'\s*,\s*', engine="python", na_values="?")
test_data = pd.read_csv(dtest, names=features, sep=r'\s*,\s*', skiprows=[0], engine="python", na_values="?")

train_data.head()

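|   | colored_sputum | cough | fever | headache | days | france | chicago | class |
|---|----------------|-------|-------|----------|------|--------|---------|-------|
| 0 | 1.0 | 3.5 | 9.4 | 3.0 | 3 | 0 | 1 | flu |
| 1 | 1.0 | 3.4 | 8.4 | 4.0 | 2 | 0 | 1 | flu |
| 2 | 1.0 | 3.3 | 7.3 | 3.0 | 4 | 0 | 1 | flu |
| 3 | 1.0 | 3.4 | 9.5 | 4.0 | 2 | 0 | 1 | flu |
| 4 | 1.0 | 2.0 | 8.0 | 3.5 | 1 | 0 | 1 | flu |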
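
The `read_csv` calls above pass `sep=r'\s*,\s*'` and `na_values="?"`. As a rough illustration of what those two options do, the sketch below applies the same regular expression to a single hypothetical line (not taken from the DLH files) using only the standard library. It is not how pandas parses files internally, only a way to see the effect of the pattern.

```python
import re

# Hypothetical raw line with uneven spacing around the commas
# and a "?" standing in for a missing value.
raw_line = "1.0 , 3.5,9.4 ,  ?, flu"

# Same pattern as the sep argument: a comma plus any surrounding whitespace.
fields = re.split(r"\s*,\s*", raw_line)

# Mimic na_values="?" by mapping the marker to None.
cleaned = [None if field == "?" else field for field in fields]

print(fields)
print(cleaned)
```

The regular expression absorbs stray spaces around each comma, so the values arrive without padding, and the `?` marker can then be treated as missing data.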


## Creating feature statistics for the datasets

Facets Overview provides a wide range of statistics for each feature of a dataset. Facets Overview will help you detect missing data, zero values, and non-uniformity in data distributions.

Without Facets Overview or a similar tool, the only way to obtain statistics would be to write our own programs or use spreadsheets. Writing our own functions can be time-consuming and costly. This is where Facets provides statistics with a few lines of code that we will use now.
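
To get a feel for what writing such statistics by hand involves, here is a minimal, dependency-free sketch for a single numeric feature. The function name `feature_summary` and the toy `fever` column are hypothetical and only illustrate the idea; this is not what Facets computes internally.

```python
import math

def feature_summary(values):
    """Summarize one numeric column; None or NaN counts as missing."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "mean": sum(present) / len(present) if present else None,
        "min": min(present, default=None),
        "max": max(present, default=None),
    }

# Hypothetical toy column, not taken from the DLH files.
fever = [9.4, 8.4, None, 0.0, 7.3]
summary = feature_summary(fever)
print(summary)
```

Even this small helper covers only one column type and a handful of statistics. Extending it to categorical features, histograms, and train/test comparisons is the work that Facets Overview saves.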

We will build the statistics generator, then serialize its output and encode it as a string.
As with JSON, we first stringify the information so that it can be sent to JavaScript functions as a string.

In [7]:
gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([
  {"name": "train", "table": train_data},
  {"name": "test", "table": test_data}                             
])

We will create a Base64-encoded UTF-8 string that will be plugged into the
HTML interface.

In [8]:
protostr = base64.b64encode(proto.SerializePartialToString()).decode("utf-8")
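
The reason for the Base64 step can be sketched with the standard library alone. In the example below, `raw_bytes` is a hypothetical stand-in for the serialized statistics, chosen to include bytes that are not valid text on their own.

```python
import base64

# Hypothetical stand-in for the serialized statistics: arbitrary binary
# bytes, some of which (0, 255, 128) are not printable text.
raw_bytes = bytes([0, 255, 16, 128, 10, 34])

# Base64 turns the bytes into plain ASCII characters; decoding the result
# as UTF-8 gives an ordinary Python str.
encoded = base64.b64encode(raw_bytes).decode("utf-8")

# The encoding is reversible, so nothing is lost on the way.
restored = base64.b64decode(encoded)

print(encoded.isascii(), restored == raw_bytes)
```

Because the standard Base64 alphabet contains only letters, digits, `+`, `/`, and `=`, the resulting string has no quotes or line breaks and can be placed inside a quoted JavaScript string without escaping.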

## Display HTML page for Facets Overview

In [9]:
HTML_TEMPLATE = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>
"""
html = HTML_TEMPLATE.format(protostr=protostr)

The protostr variable containing our stringified encoded data is now plugged into the template.

Then, the HTML template named html is sent to IPython's display function.

In [10]:
display(HTML(html))

We can now visualize and explore the data.

## Sorting by distribution distance

The distribution distance between the training set and the test set can be calculated with, for example, the Kullback-Leibler divergence, also called relative entropy.

We can calculate the distribution distance with three variables:
- S is the relative entropy
- X is the dtrain dataset
- Y is the dtest dataset

The equation used by SciPy's `entropy` function for Kullback-Leibler divergence is as follows:

```python
S = sum(X * log(X / Y))
```

If the values of X or Y do not add up to 1, they will be normalized.
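
As a cross-check on the definition, the relative entropy can be written out in plain Python. This is a minimal sketch that assumes every value in `y` is strictly positive. The helper name `relative_entropy` is hypothetical, and the code is not SciPy's implementation.

```python
import math

def relative_entropy(x, y):
    """Kullback-Leibler divergence of x relative to y, in nats.

    Assumes x and y have the same length and that y is strictly positive.
    """
    x_total, y_total = sum(x), sum(y)
    divergence = 0.0
    for x_i, y_i in zip(x, y):
        # Normalize so that each distribution sums to 1.
        p_i, q_i = x_i / x_total, y_i / y_total
        if p_i > 0:  # terms with p_i == 0 contribute nothing
            divergence += p_i * math.log(p_i / q_i)
    return divergence

# Identical distributions have zero relative entropy.
same = relative_entropy([1, 1], [1, 1])

# Raw counts and their normalized probabilities give the same result.
from_counts = relative_entropy([1, 1], [9, 1])
from_probs = relative_entropy([0.5, 0.5], [0.9, 0.1])

print(same, from_counts, from_probs)
```

Note that the measure is not symmetric: swapping the two arguments generally gives a different value, so it matters which dataset is treated as the reference.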

In Facets Overview, a few examples show that entropy increases as
distribution distance increases.

We can start with two data distributions that are similar:

In [12]:
X = [1, 1, 1, 2, 1, 1, 4]
Y = [1, 2, 3, 4, 2, 2, 5]

entropy(X, Y)

0.05045985212037224

However, if the two data distributions begin to change, they will diverge, producing higher entropy values:

In [13]:
X = [10, 1, 1, 20, 1, 10, 4]
Y = [1, 2, 3, 4, 2, 2, 5]

entropy(X, Y)

0.5396425997525232

The relative entropy has increased. The value is now approximately 0.54.

## Facets Dive