# Overview of the example from the paper


![overview](paper_example_image.png)

 Example of an ML pipeline that predicts which patients are at a higher risk of serious complications, under the requirement to achieve comparable false negative rates across intersectional groups by age and race. The pipeline is implemented using native constructs from the popular pandas and scikit-learn libraries. On the left, we highlight potential issues identified by mlinspect. On the right, we show the corresponding dataflow graph extracted by mlinspect to instrument the code and pinpoint issues.

# Add inspections and execute the pipeline

The central entry point of mlinspect is the `PipelineInspector`. To use mlinspect, we pass it the path to a runnable version of the example pipeline. Here, the example pipeline lives in a `healthcare.py` file, but mlinspect also supports other formats, e.g., `.ipynb` Jupyter notebook files. Then, we define the set of inspections and checks we want mlinspect to run. In this example, we use 3 checks: one to compute histograms of sensitive groups and verify that operators cause no significant distribution changes, one to check for missing embeddings in our word embedding transformer, and one to check for the usage of illegal/problematic features. We also use 2 additional inspections: one to track row-level lineage and one to materialize a few example output rows of each operator.

Then, we execute the pipeline. Mlinspect returns an `InspectorResult`, which contains the extracted Dag, the output of our checks, and the output of our inspections.

In [None]:
import os
from mlinspect.utils import get_project_root

from mlinspect import PipelineInspector, OperatorType
from mlinspect.inspections import HistogramForColumns, RowLineage, MaterializeFirstOutputRows
from mlinspect.checks import NoBiasIntroducedFor, NoIllegalFeatures
from demo.feature_overview.no_missing_embeddings import NoMissingEmbeddings

HEALTHCARE_FILE_PY = os.path.join(str(get_project_root()), "example_pipelines", "healthcare", "healthcare.py")

inspector_result = PipelineInspector\
    .on_pipeline_from_py_file(HEALTHCARE_FILE_PY) \
    .add_check(NoBiasIntroducedFor(["age_group", "race"]))\
    .add_check(NoIllegalFeatures())\
    .add_check(NoMissingEmbeddings())\
    .add_required_inspection(RowLineage(5)) \
    .add_required_inspection(MaterializeFirstOutputRows(5)) \
    .execute()

extracted_dag = inspector_result.dag
inspection_results = inspector_result.inspection_to_annotations
check_results = inspector_result.check_to_check_results

# Now, let's look at the extracted Dag

Mlinspect automatically extracted a dataflow graph corresponding to the code in the `healthcare.py` file. Now, we want to look at it. Mlinspect returns the Dag as a `networkx.DiGraph`. Networkx provides a lot of functionality, which makes it easy for users to, e.g., convert the graph to other common formats. In addition, we offer a visualisation function, `save_fig_to_path`, which can be used directly to save an image of the extracted Dag to a file path.

Here, we use that convenience function to save an image of the Dag and then use a jupyter notebook function to show this image. 

In [None]:
from IPython.display import Image
from mlinspect.visualisation import save_fig_to_path

filename = os.path.join(str(get_project_root()), "demo", "feature_overview", "healthcare.png")
save_fig_to_path(extracted_dag, filename)

Image(filename=filename) 

# Want to know the output of a specific operator?

For each operator, the `MaterializeFirstOutputRows` inspection materialized the first `5` output rows. In scikit-learn pipelines especially, a user who just wants to look at some intermediate results has to write custom debugging code ([example stackoverflow post](https://stackoverflow.com/questions/34802465/sklearn-is-there-any-way-to-debug-pipelines)). Using mlinspect, this becomes easy. We can look at the input and output of arbitrary featurizers like OneHotEncoders or Word2Vec models.

Here, we use this functionality to look at the output of a OneHotEncoder and the imputer right before it. For this, we only need to look at the inspection result for the corresponding Dag nodes. In this example, we can see that the OneHotEncoder encounters two different values for the `county` column in the train set. We see that the value `county2` gets transformed to `[1,0]` and `county3` gets transformed to `[0,1]`.

In [None]:
from IPython.display import display

first_rows_inspection_result = inspection_results[MaterializeFirstOutputRows(5)]

relevant_nodes = [node for node in extracted_dag.nodes if node.description in {
    "Imputer (SimpleImputer), Column: 'county'", "Categorical Encoder (OneHotEncoder), Column: 'county'"}]

for dag_node in relevant_nodes:
    if dag_node in first_rows_inspection_result and first_rows_inspection_result[dag_node] is not None:
        print("\n\033[1m{} ({})\033[0m\n{}\n{}".format(
            dag_node.operator_type, dag_node.description, dag_node.source_code, dag_node.code_reference))
        display(first_rows_inspection_result[dag_node])
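
The mapping described above can be reproduced with a small standard-library sketch. It does not use scikit-learn. It assumes that categories are ordered lexicographically, which is the ordering that yields `county2` as `[1, 0]` and `county3` as `[0, 1]`. The function name `one_hot_encode` is made up for this illustration.

```python
def one_hot_encode(values):
    """One-hot encode a list of strings, using lexicographically sorted categories."""
    categories = sorted(set(values))
    position_of = {category: position for position, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position_of[value]] = 1
        encoded.append(row)
    return categories, encoded


county_values = ["county2", "county3", "county2"]
categories, encoded = one_hot_encode(county_values)

for value, row in zip(county_values, encoded):
    print(value, "->", row)
```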

# Want to know the origin of some row in the featurised model input?

We can use the `RowLineage` to get row-level lineage information for e.g., a featurised tuple. In practice, you probably do not want to look at the lineage information yourself, as it can get quite complicated for complex pipelines like the one in our example. In the future, we could e.g., extend the lineage inspection to take a list of lineage ids and materialize all related intermediate results in the pipeline when the user re-executes the pipeline. This way, users do not have to interpret the lineage ids themselves.

Here, we use the functionality of the `RowLineage` to look at a featurised row from the train set that our neural network gets fitted on. We start by printing the first output row from the `DATA_SOURCE` and `GROUP_BY_AGG` operators. As we can see, the `RowLineage` generates unique identifiers for each of the rows when they get created. As these rows flow through the DAG, the lineage id annotations get propagated and combined at operators like `JOIN` and `CONCATENATION`. In our example, the `CONCATENATION` operator is the last one before the model training. By analysing the `Lineage` value for the first output row of the `CONCATENATION` operator, we can see how this featurised row originated from the data initially created by the `DATA_SOURCE` and `GROUP_BY_AGG` operators. When just looking at the `Value` of this featurised row, it is hard to find out that this output row is the feature vector for a patient with the name `Tabby Ward`. With our lineage information, this becomes much easier.

In [None]:
lineage_inspection_result = inspection_results[RowLineage(5)]

relevant_nodes = [node for node in extracted_dag.nodes if node.operator_type in {
    OperatorType.DATA_SOURCE, OperatorType.GROUP_BY_AGG, OperatorType.CONCATENATION}]

for dag_node in relevant_nodes:
    if dag_node in lineage_inspection_result:
        print("\n\033[1m{} ({})\033[0m\n{}\n{}".format(
            dag_node.operator_type, dag_node.description, dag_node.source_code, dag_node.code_reference))
        print("\033[1mFirst output row:\033[0m")
        display(lineage_inspection_result[dag_node].head(1))
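
The propagation idea can be illustrated without mlinspect. The following sketch is an assumption-laden toy, not the actual `RowLineage` implementation: every source row receives a unique `(source name, row index)` identifier, and a join unions the identifier sets of the rows it combines. The helper names `tag_source` and `join`, the column names, and the data are all hypothetical.

```python
from itertools import count


def tag_source(rows, source_name):
    """Attach a one-element lineage set to every row of a data source."""
    row_ids = count()
    return [(row, frozenset({(source_name, next(row_ids))})) for row in rows]


def join(left, right, key):
    """Inner join that unions the lineage sets of the matching rows."""
    joined = []
    for left_row, left_lineage in left:
        for right_row, right_lineage in right:
            if left_row[key] == right_row[key]:
                joined.append(({**left_row, **right_row}, left_lineage | right_lineage))
    return joined


patients = tag_source(
    [{"ssn": "1", "name": "patient_a"}, {"ssn": "2", "name": "patient_b"}], "patients")
histories = tag_source(
    [{"ssn": "2", "smoker": "no"}, {"ssn": "1", "smoker": "yes"}], "histories")

joined_rows = join(patients, histories, "ssn")
for row, lineage in joined_rows:
    print(row, sorted(lineage))
```

Each joined row carries the identifiers of both of its inputs, so a featurised row can be traced back to its source rows even after identifying columns such as the name have been dropped.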

# Did our checks find issues?

Let us look at the `check_results` to see whether some failed. As all 3 failed, we will look into each result in detail.

In [None]:
from IPython.display import display

check_result_df = PipelineInspector.check_results_as_data_frame(check_results)
display(check_result_df)

# What about issue 5? Did we use something forbidden as a feature?

Let us look at the `check_result` of the `NoIllegalFeatures()` check. There, we see that we did use an illegal feature, `race`.

In [None]:
feature_check_result = check_results[NoIllegalFeatures()]
print("Used illegal features: {}".format(feature_check_result.illegal_features))
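
Conceptually, such a check boils down to intersecting the columns that reach the model with a blocklist. The sketch below is a simplified stand-in under that assumption. The blocklist contents and the function name `find_illegal_features` are hypothetical and do not reflect the list that mlinspect actually uses.

```python
# Hypothetical blocklist; the real check may use a different, configurable list.
ILLEGAL_FEATURES = frozenset({"race", "gender", "age"})


def find_illegal_features(feature_columns, illegal=ILLEGAL_FEATURES):
    """Return the sorted feature columns that appear on the blocklist."""
    return sorted(set(feature_columns) & set(illegal))


used_features = ["smoker", "county", "num_children", "race", "income"]
print("Used illegal features: {}".format(find_illegal_features(used_features)))
```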

# What about issue 6? Were there missing embeddings?

For each operator in the DAG, `NoMissingEmbeddings` checked whether it is the Word2Vec transformer. Once it got to see the output rows for this transformer, it checked whether each output array is equal to the 0-vector. If it finds such values with missing embeddings, it remembers a few example rows (here: the first `20`) to help the user understand what is happening.

Here, we only have to look at the output of the `NoMissingEmbeddings` check. It lists only the Word2Vec transformer Dag node as an operator with missing embeddings, together with a list of values with missing embeddings.

In [None]:
embedding_check_result = check_results[NoMissingEmbeddings()]

for dag_node, missing_embeddings_info in embedding_check_result.dag_node_to_missing_embeddings.items():
    print("\n\033[1m{} ({})\033[0m\n{}\n{}".format(
            dag_node.operator_type, dag_node.description, dag_node.source_code, dag_node.code_reference))
    print("\033[1mExamples with missing embeddings: {}\033[0m".format(missing_embeddings_info.missing_embeddings_examples))

**We found a missing embedding for a single rare name.**
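
The core of the zero-vector test described above fits in a few lines. This is a minimal sketch rather than the code of the `NoMissingEmbeddings` check: the function name `find_missing_embeddings`, the `max_examples` parameter, and the sample names and vectors are assumptions made for illustration.

```python
def find_missing_embeddings(inputs, vectors, max_examples=20):
    """Collect up to max_examples inputs whose embedding is the all-zero vector."""
    examples = []
    for text, vector in zip(inputs, vectors):
        if all(component == 0 for component in vector):
            examples.append(text)
            if len(examples) == max_examples:
                break
    return examples


# Hypothetical transformer input and output.
last_names = ["name_a", "name_b", "name_c"]
embeddings = [[0.12, -0.40, 0.05], [0.0, 0.0, 0.0], [0.33, 0.10, -0.21]]

print(find_missing_embeddings(last_names, embeddings))
```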

# We can look at how histograms of sensitive groups change after different Dag nodes

Operators like joins, selections and missing value imputers can cause *data distribution issues*, which can heavily impact the performance of our model. Mlinspect helps with identifying such issues by offering an inspection to calculate histograms for sensitive groups. Thanks to our annotation propagation, this works even if the group columns are projected out at some point (**Issue 2**). To automatically check for significant changes and compute the histograms, we used the `NoBiasIntroducedFor(...)` check.

Our check has already filtered all operators that can cause data distribution issues. Now we will use the result of the check to create a list with all distribution changes. Using this, we can investigate the changes of the different operators one at a time.

In [None]:
no_bias_check_result = check_results[NoBiasIntroducedFor(["age_group", "race"])]

distribution_changes_overview_df = NoBiasIntroducedFor.get_distribution_changes_overview_as_df(no_bias_check_result)
display(distribution_changes_overview_df)

dag_node_distribution_changes_list = list(no_bias_check_result.bias_distribution_change.items())

As we can see, the selection causes the check to fail because of the `race` attribute. Still, we will investigate the changes of all operators to see whether our check missed something because a change was slightly below the threshold of the `NoBiasIntroducedFor(["age_group", "race"])` check (which the user can configure).
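
To illustrate what a threshold on distribution changes can look like, the sketch below compares the share of each group before and after an operator and flags groups whose share dropped by more than a chosen fraction. It is a simplified assumption about how such a check can work, not mlinspect's implementation. The function names, the parameter `min_relative_change`, its value `-0.3`, and the data are hypothetical.

```python
from collections import Counter


def group_ratios(values):
    """Return the share of each group among the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


def relative_changes(before, after):
    """Relative change of each group's share, measured against its share before."""
    ratios_before = group_ratios(before)
    ratios_after = group_ratios(after)
    return {group: (ratios_after.get(group, 0.0) - ratio) / ratio
            for group, ratio in ratios_before.items()}


def flagged_groups(before, after, min_relative_change=-0.3):
    """Groups whose share dropped by more than the allowed relative change."""
    changes = relative_changes(before, after)
    return sorted(group for group, change in changes.items()
                  if change < min_relative_change)


before = ["race1"] * 4 + ["race2"] * 4 + ["race3"] * 2
after = ["race1"] * 4 + ["race2"] * 3 + ["race3"] * 1

print(relative_changes(before, after))
print(flagged_groups(before, after))
```

In this toy data, the share of `race3` falls from 20% to 12.5%, a relative drop of 37.5%, so it is the only group flagged. The share of `race2` also shrinks, but stays within the threshold, which is exactly the kind of change worth inspecting manually.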

## Issue 1: Join might change proportions of groups in data

We start by looking at the first operator that could heavily change the proportion of groups in our data, the join of the `patients` and `histories` datasets. E.g., there could be missing history entries for some patients, leading to many patients being filtered out.

Here, we start by finding the corresponding `JOIN` distribution change info for the `merge` call. Then we use a plot function from `mlinspect` to compare the histograms before and after this join.

In [None]:
_, join_distribution_changes = dag_node_distribution_changes_list[0]
for column, distribution_change in join_distribution_changes.items():
    print("")
    print("\033[1m Column '{}'\033[0m".format(column))
    NoBiasIntroducedFor.plot_distribution_change_histograms(distribution_change)

If you want to write your own plot function instead of using the one provided by `mlinspect` or prefer to look at the raw numbers, you can also directly access the data that backs the plot.

In [None]:
_, join_distribution_changes = dag_node_distribution_changes_list[0]
for column, distribution_change in join_distribution_changes.items():
    print("")
    print("\033[1m Column '{}'\033[0m".format(column))
    display(distribution_change.before_and_after_df)

**As we can see, there are no noteworthy changes because of the join.**

## Issue 3: Selection might change proportions of groups in data

The next operator that could change the data distribution is the filter for patients in a few predefined counties. It could be that patients of different demographic groups are not uniformly distributed across different counties. It could, e.g., be that most of the patients with a specific `age_group` or `race` value live in a specific county.

Again, we need to find the change info for the selection. Then, we look at the histograms before and after.

In [None]:
_, selection_distribution_changes = dag_node_distribution_changes_list[2]
for column, distribution_change in selection_distribution_changes.items():
    print("")
    print("\033[1m Column '{}'\033[0m".format(column))
    NoBiasIntroducedFor.plot_distribution_change_histograms(distribution_change)

**There clearly is an issue here! Many rows with the `race` value `race3` are filtered out!** This is because a lot of patients with `race3` live in `county1` in our example, a county that the selection removes.

## Issue 4: Imputation might change proportions of groups in data

The last operator we look at that can change the distribution of sensitive groups is the missing value imputation for the `race` column. Depending on the imputation strategy, it can also introduce or amplify data distribution issues. For example, it might attribute records with a missing value to the majority race in the dataset.

Again, we need to find the change info for the imputation. Then, we look at the histograms before and after.

In [None]:
_, imputation_distribution_changes = dag_node_distribution_changes_list[3]
for column, distribution_change in imputation_distribution_changes.items():
    print("")
    print("\033[1m Column '{}'\033[0m".format(column))
    NoBiasIntroducedFor.plot_distribution_change_histograms(distribution_change)

**The `most-frequent` imputation amplifies the existing `race` imbalance!**
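
The amplification effect is easy to reproduce on toy data. The sketch below uses only the standard library. The helper `impute_most_frequent` and the data are made up for this illustration, and it only mimics the idea of a most-frequent imputation strategy rather than calling scikit-learn.

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None entries with the most frequent observed value."""
    observed = [value for value in values if value is not None]
    most_frequent_value, _ = Counter(observed).most_common(1)[0]
    return [most_frequent_value if value is None else value for value in values]


def shares(values):
    """Share of each non-missing value."""
    observed = [value for value in values if value is not None]
    return {value: n / len(observed) for value, n in Counter(observed).items()}


# Hypothetical column with three missing entries.
race = ["race1", "race1", "race1", "race2", "race2", None, None, None]
imputed_race = impute_most_frequent(race)

print("before:", shares(race))
print("after: ", shares(imputed_race))
```

Among the observed values, `race1` starts with a share of 60%. After imputation, all previously missing records count as `race1`, and its share grows to 75%, while the share of `race2` shrinks from 40% to 25%.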