In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Fairness Analysis with TensorFlow Data Validation and TensorFlow Model Analysis' Fairness Indicators

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/examples/prompt-design/question_answering.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/examples/prompt-design/question_answering.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/blob/main/language/examples/prompt-design/question_answering.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

When releasing a machine learning model, it's critical to verify not only its performance but also its behavior. Beyond traditional model evaluation, responsible evaluation requires assessing whether the data and the model are fair, i.e., whether they avoid creating or reinforcing bias. Bias occurs when there is stereotyping, prejudice, or favoritism towards some things, people, or groups over others.

In this notebook, you perform an initial fairness analysis on the [Civil Comments dataset](https://www.tensorflow.org/datasets/catalog/civil_comments) using two tools offered by TensorFlow for large-scale ML analysis:
- [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) (TFDV) can analyze training and serving data to compute descriptive statistics, infer a schema, and detect data anomalies.<br>You use TFDV to find potential issues that can lead to fairness disparities, such as missing values and data imbalances.
- [TensorFlow Model Analysis](https://www.tensorflow.org/tfx/guide/tfma) (TFMA) can perform extensive model evaluation on training and test sets, overall or for specific data slices, using a variety of metrics and visualizations. <br>TFMA provides easy access to [Fairness Indicators](https://www.tensorflow.org/tfx/guide/fairness_indicators): a tool designed to support teams in evaluating and improving models for fairness concerns. <br>You use Fairness Indicators to compute and visualize commonly identified fairness metrics.

Learn more about how Google applies fairness in ML at https://ai.google/responsibility/responsible-ai-practices/#fairness.

## Objective

By the end of the notebook, you should be able to:

1. Use TFRecords to load record-oriented binary format data.
2. Use TFDV to generate statistics, and the TFDV widget to answer questions.
3. Use TFMA with Fairness Indicators to calculate fairness metrics, and its API to programmatically access model analysis results.


## Set up
First, you install TFDV and TFMA with some supporting packages.

Then, you import the necessary dependencies for the libraries you'll be using in this exercise, which include TensorFlow Data Validation (TFDV), TensorFlow Model Analysis (TFMA), and Fairness Indicators.

### Install dependencies

The package requirements for this lab have been saved in the `requirements.txt` file. Versions of the TFDV and TFMA packages require specific versions of other packages such as TensorFlow and Apache Beam. You may run the following cell to see the contents of the `requirements.txt` file.

In [1]:
!cat requirements.txt

tensorflow==2.11
pyarrow==6.0.0
apache-beam[gcp]==2.41.0
tensorflow-metadata==1.12.0
tensorflow-model-analysis==0.43.0
tfx-bsl==1.12.0
tensorflow-data-validation==1.12.0
fairness-indicators==0.43.0


Please only run the installation once. The following cell will take 3-5 minutes to install the proper versions of all of the packages.

In [2]:
!pip -q install -r requirements.txt


#### Restart the kernel
After you install/upgrade the packages, you need to restart the notebook kernel so it can find those packages.

Click **Kernel > Restart Kernel**, or uncomment and run the cell below.

In [3]:
# # Automatically restart kernel after installs
# import os

# if not os.getenv("IS_TESTING"):
#     # Automatically restart kernel after installs
#     import IPython

#     app = IPython.Application.instance()
#     app.kernel.do_shutdown(True)

### Import libraries

In [4]:
import sys, os
import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import apache_beam as beam

from google.protobuf import text_format

import pprint

import tensorflow as tf
import tensorflow_model_analysis as tfma
import tensorflow_data_validation as tfdv
from tensorflow_model_analysis.addons.fairness.post_export_metrics import fairness_indicators

from tfx_bsl.tfxio import tf_example_record

## Load data using TFRecords

### About the Civil Comments dataset

Click below to learn more about the Civil Comments dataset, and the pre-processing it has undergone for this exercise.

#### Overview

The Civil Comments dataset comprises approximately 2 million public comments that were submitted to the Civil Comments platform. [Jigsaw](https://jigsaw.google.com/) sponsored the effort to compile and annotate these comments for ongoing [research](https://arxiv.org/abs/1903.04561); they've also hosted competitions on [Kaggle](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) to help classify toxic comments as well as minimize unintended model bias. 

##### Features

Within the Civil Comments data, a subset of comments are tagged with a variety of *identity* attributes pertaining to gender, sexual orientation, religion, race, and ethnicity. 

**NOTE:** These identity attributes are intended *for evaluation purposes only*, to assess how well a classifier trained solely on the comment text performs on different tag sets.

To collect these identity labels, each comment was reviewed by up to 10 annotators, who were asked to indicate all identities that were mentioned in the comment. For example, annotators were posed the question: "What genders are mentioned in the comment?", and asked to choose all of the following categories that were applicable.

* Male
* Female
* Transgender
* Other gender
* No gender mentioned

**NOTE:** *We recognize the limitations of the categories used in the original dataset, and acknowledge that these terms do not encompass the full range of vocabulary used in describing gender.*

Jigsaw used these ratings to generate an aggregate score for each identity attribute representing the percentage of raters who said the identity was mentioned in the comment. For example, if 10 annotators reviewed a comment, and 6 said that the comment mentioned the identity "female" and 0 said that the comment mentioned the identity "male," the comment would receive a `female` score of `0.6` and a `male` score of `0.0`.
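The aggregation described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not Jigsaw's actual pipeline: the function name `identity_scores` and the list-of-selections input format are assumptions made for this example.

```python
from collections import Counter


def identity_scores(annotations, identities):
    """Return the fraction of annotators who selected each identity.

    `annotations` holds one list of selected identities per annotator
    (a hypothetical input format chosen for this sketch).
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    # Count each identity at most once per annotator
    counts = Counter(label for selected in annotations for label in set(selected))
    return {identity: counts[identity] / len(annotations) for identity in identities}


# 10 annotators: 6 selected "female", nobody selected "male"
annotations = [["female"]] * 6 + [[]] * 4
scores = identity_scores(annotations, ["female", "male"])
print(scores)  # {'female': 0.6, 'male': 0.0}
```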

**NOTE:** For the purposes of annotation, a comment was considered to "mention" gender if it contained a comment about gender issues (e.g., a discussion about feminism, wage gap between men and women, transgender rights, etc.), gendered language, or gendered insults. Use of "he," "she," or gendered names (e.g., Donald, Margaret) did not require a gender label. 

##### Label

Each comment was rated by up to 10 annotators for toxicity, who each classified it with one of the following ratings.

* Very Toxic
* Toxic
* Hard to Say
* Not Toxic

Again, Jigsaw used these ratings to generate an aggregate toxicity "score" for each comment (ranging from `0.0` to `1.0`) to serve as the [label](https://developers.google.com/machine-learning/glossary?utm_source=Colab&utm_medium=fi-colab&utm_campaign=fi-practicum&utm_content=glossary&utm_term=label#label), representing the fraction of annotators who labeled the comment either "Very Toxic" or "Toxic." For example, if 10 annotators rated a comment, and 3 of them labeled it "Very Toxic" and 5 of them labeled it "Toxic", the comment would receive a toxicity score of `0.8`.
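As a quick check of the example above, the following sketch computes the toxicity score from a list of individual ratings. The helper name `toxicity_score` is hypothetical, and the two remaining annotators are assumed to have answered "Not Toxic", which the text does not specify.

```python
# Ratings that count towards the toxicity score
TOXIC_RATINGS = {"Very Toxic", "Toxic"}


def toxicity_score(ratings):
    """Fraction of annotators who rated the comment "Very Toxic" or "Toxic"."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(rating in TOXIC_RATINGS for rating in ratings) / len(ratings)


# 3 "Very Toxic" + 5 "Toxic" out of 10 annotators (the other 2 are assumed)
ratings = ["Very Toxic"] * 3 + ["Toxic"] * 5 + ["Not Toxic"] * 2
score = toxicity_score(ratings)
print(score)  # 0.8
```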

**NOTE:** For more information on the Civil Comments labeling schema, see the [Data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data) section of the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition.


#### Preprocessing the data
For the purposes of this exercise, the data has been pre-processed as follows:
    
- the column *toxicity*, i.e. the label, has been converted to a binary label: any value ≥ 0.5 is labeled as 1 (true), meaning a comment is considered toxic if 50% or more of the crowd raters labeled it as toxic; conversely, any value < 0.5 is labeled as 0 (false).
    
- the column *identity* has been grouped by its categories [*gender*, *sexual_orientation*, *disability*, *religion*, *race*], with a value assigned to the category using a threshold of 0.5. For example, if one comment has `{ male: 0.3, female: 1.0, transgender: 0.0, heterosexual: 0.8, homosexual_gay_or_lesbian: 1.0 }`, the data is transformed into `{ gender: [female], sexual_orientation: [heterosexual, homosexual_gay_or_lesbian] }`.

Finally, the *comment_text* column reports the text of the individual comment as is.

**NOTE:** Missing identity fields were converted to False.

**NOTE:** Fairness Indicators currently work with binary and multiclass classification only. 
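The two pre-processing steps can be sketched as follows. This is a minimal illustration, not the script that produced the files used in this notebook: the function names are hypothetical, the category-to-identity mapping is copied from the schema domains shown later in the notebook, and applying `>=` to the identity threshold is an assumption (the text only states it for the label).

```python
THRESHOLD = 0.5

# Identity values per category, as listed in the inferred schema domains
IDENTITY_CATEGORIES = {
    "gender": ["female", "male", "other_gender", "transgender"],
    "sexual_orientation": ["bisexual", "heterosexual",
                           "homosexual_gay_or_lesbian", "other_sexual_orientation"],
    "disability": ["intellectual_or_learning_disability", "other_disability",
                   "physical_disability", "psychiatric_or_mental_illness"],
    "religion": ["atheist", "buddhist", "christian", "hindu", "jewish",
                 "muslim", "other_religion"],
    "race": ["asian", "black", "latino", "other_race_or_ethnicity", "white"],
}


def binarize_toxicity(score, threshold=THRESHOLD):
    """Convert the aggregate toxicity score into a 0/1 label."""
    return 1 if score >= threshold else 0


def group_identities(scores, threshold=THRESHOLD):
    """Keep, per category, the identities whose score reaches the threshold.

    Missing identities are treated as 0.0, i.e. not mentioned.
    """
    grouped = {}
    for category, identities in IDENTITY_CATEGORIES.items():
        selected = [name for name in identities if scores.get(name, 0.0) >= threshold]
        if selected:
            grouped[category] = selected
    return grouped


example = {"male": 0.3, "female": 1.0, "transgender": 0.0,
           "heterosexual": 0.8, "homosexual_gay_or_lesbian": 1.0}
grouped = group_identities(example)
print(grouped)
print(binarize_toxicity(0.8), binarize_toxicity(0.3))
```

For the example comment, `grouped` contains `female` under `gender`, and `heterosexual` and `homosexual_gay_or_lesbian` under `sexual_orientation`, matching the transformation described above.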

### Use TFRecords to load record-oriented binary format data




-------------------------------------------------------------------------------------------------------

The [TFRecord format](https://www.tensorflow.org/tutorials/load_data/tfrecord) is a simple [Protobuf](https://developers.google.com/protocol-buffers)-based format for storing a sequence of binary records. It gives you and your machine learning models the ability to handle arbitrarily large datasets (that don't fit in memory!) over the network by:
1. Splitting up large files into 100-200MB chunks
2. Storing the results as serialized binary messages for faster ingestion

Let's access the publicly available training and validation sets for the pre-processed Civil Comments dataset.

In [5]:
train_tf_file = tf.keras.utils.get_file(
    'train_tf_processed.tfrecord',
    'https://storage.googleapis.com/civil_comments_dataset/train_tf_processed.tfrecord'
)
validate_tf_file = tf.keras.utils.get_file(
    'validate_tf_processed.tfrecord',
    'https://storage.googleapis.com/civil_comments_dataset/validate_tf_processed.tfrecord'
)

Downloading data from https://storage.googleapis.com/civil_comments_dataset/train_tf_processed.tfrecord
Downloading data from https://storage.googleapis.com/civil_comments_dataset/validate_tf_processed.tfrecord


## Analyze fairness with TFDV

### Use TFDV to generate and visualize statistics




-------------------------------------------------------------------------------------------------------

TensorFlow Data Validation enables you to calculate data statistics automatically.

For tabular data, it supports:
- data stored in a TFRecord file
- data stored in a CSV input format
- data loaded in-memory in a Pandas dataframe

And you can also create your own [custom data connector](https://www.tensorflow.org/tfx/data_validation/get_started#writing_custom_data_connector).

Before you train the model, you want to audit the data to better understand its distributions and check for unexpected values. 
<br> Let's use the [tfdv.generate_statistics_from_tfrecord()](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord) functionality to generate overall statistics for the data.

**NOTE:** *It could take up to 1 minute to generate the statistics.*

**FUN FACT:** *TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.*

#### Use TFDV to generate and visualize statistics overall for all features

In [6]:
# Compute the statistics for the training set
stats = tfdv.generate_statistics_from_tfrecord(data_location=train_tf_file)





Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`




### Use TFDV to infer and analyze the data schema




-------------------------------------------------------------------------------------------------------

TFDV enables you to automatically infer the schema given calculated statistics, and visualize it. 

Let's use the [tfdv.infer_schema()](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema) functionality to get familiar with the feature set.

In [7]:
# Infer and visualize the schema
schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'comment_text' | BYTES | required | | - |
| 'disability' | STRING | required | | 'disability' |
| 'gender' | STRING | required | | 'gender' |
| 'race' | STRING | required | | 'race' |
| 'religion' | STRING | required | | 'religion' |
| 'sexual_orientation' | STRING | required | | 'sexual_orientation' |
| 'toxicity' | FLOAT | required | | - |

| Domain | Values |
|---|---|
| 'disability' | 'intellectual_or_learning_disability', 'other_disability', 'physical_disability', 'psychiatric_or_mental_illness' |
| 'gender' | 'female', 'male', 'other_gender', 'transgender' |
| 'race' | 'asian', 'black', 'latino', 'other_race_or_ethnicity', 'white' |
| 'religion' | 'atheist', 'buddhist', 'christian', 'hindu', 'jewish', 'muslim', 'other_religion' |
| 'sexual_orientation' | 'bisexual', 'heterosexual', 'homosexual_gay_or_lesbian', 'other_sexual_orientation' |


From the schema, you can see that you have 7 features, all required. 

There are 6 categorical features, which are: 'comment_text', 'disability', 'gender', 'race', 'religion', and 'sexual_orientation'.
You can get a preview of the values of each categorical feature.

There is 1 numerical feature, which is: 'toxicity'. This is the label. 

### Use TFDV to visualize and analyze the data statistics 




-------------------------------------------------------------------------------------------------------

Let's use the [tfdv.visualize_statistics()](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics) functionality to visualize the generated statistics for the training set, and perform some analysis.

**FUN FACT:** *You can use **tfdv.visualize_statistics()** to visualize training and test set statistics side-by-side too! Other parameters include an allow list of features and a deny list of features to visualize.*

In [8]:
# Visualize the statistics for analysis
tfdv.visualize_statistics(stats)

**What can you learn from the visualizations of the statistics?** 

From this visualization, you can gather a lot of useful information, because all of the features are shown with their statistics and data distributions in one view. 

For example, you can see that there are no missing values. The highlighted 92.08% of zeros for the numerical column 'toxicity' is expected, since the label is 0 for non-toxic comments and 1 for toxic ones.

The widget provides slightly different statistics for numerical and categorical values:
- For numericals, we get: count, missing, mean, std dev, zeroes, min, median, max.
- For categoricals, we get: count, missing, unique, top, freq top, avg str len.


Spend some time exploring the generated stats, and the information they provide you with.  

#### Use TFDV to generate and visualize statistics for subset groups




-------------------------------------------------------------------------------------------------------

TensorFlow Data Validation enables you to calculate and visualize data statistics for specific subset groups, known as *data slices*. You can do so by passing a [tfdv.StatsOptions](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions) object to the [tfdv.generate_statistics_from_tfrecord()](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord) function. 

Let's generate statistics using the feature "gender" as a data slice.


**NOTE:** *It could take a couple of minutes to generate the statistics.*

In [9]:
# Define slice
slice_fn = tfdv.experimental_get_feature_value_slicer(
    features={'gender': None}
)

In [10]:
# Generate statistics from slice
stats_options = tfdv.StatsOptions(experimental_slice_functions=[slice_fn])
sliced_stats = tfdv.generate_statistics_from_tfrecord(
    data_location=train_tf_file,
    stats_options=stats_options,
)

**NOTE**: *The [visualize_statistics()](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics) method currently only supports comparing up to two datasets. Let's get the name of the statistics datasets created by slicing, and compare the two datasets for all examples and for the female gender.*

In [11]:
# Get the name of the statistics datasets created by slicing
for dataset in sliced_stats.datasets:
    print(dataset.name)

All Examples
gender_female
gender_transgender
gender_male
gender_other_gender


In [27]:
# dataset['All Examples']

In [12]:
# Compare the "All Examples" dataset, i.e. the same we created before, and the "gender_female" dataset
tfdv.visualize_statistics(
    lhs_statistics=tfdv.get_slice_stats(sliced_stats, 'All Examples'),
    rhs_statistics=tfdv.get_slice_stats(sliced_stats, 'gender_female'),
)

**What important difference can you spot between the statistics for the overall dataset and for the female-gender slice?**

The statistics from two features differ between the two datasets:

1. `comment_text`: the top comment is different, and the fact that it is a gender-biased comment for females may be problematic; as a next step, you could analyze the correlation between this feature and the label column, *overall* and by the *gender=female* data slice, to see if there are any differences. 

2. `toxicity`: the percentage of gender-related examples that are toxic (100-86.3=13.7%) is nearly double the percentage of toxic examples overall (100-92.02=7.98%). 
<br> In other words, comments related to the female gender are almost two times more likely than comments overall to be labeled as toxic. This skew suggests that a model trained on this dataset might learn a correlation between gender-related content and toxicity. This raises fairness considerations, as the model might be more likely to classify nontoxic comments as toxic if they contain gender terminology, which could lead to [disparate impact](https://developers.google.com/machine-learning/glossary?utm_source=Colab&utm_medium=fi-colab&utm_campaign=fi-practicum&utm_content=glossary&utm_term=disparate-impact#disparate-impact) for gender subgroups. 
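The "nearly double" claim can be checked with a few lines of arithmetic. The two inputs below are the percentages of zero-valued `toxicity` labels quoted above (86.3% for the female-gender slice, 92.02% overall); the variable names are chosen only for this sketch.

```python
# Percentage of examples whose toxicity label is 0 (non-toxic)
female_zero_pct = 86.3
overall_zero_pct = 92.02

# The remaining examples are labeled toxic
female_toxic_pct = 100 - female_zero_pct
overall_toxic_pct = 100 - overall_zero_pct

# How much more often the female-gender slice is labeled toxic
ratio = female_toxic_pct / overall_toxic_pct
print(f"{female_toxic_pct:.2f}% vs {overall_toxic_pct:.2f}% -> ratio {ratio:.2f}")
```

The ratio comes out at roughly 1.7, which is the basis for saying that comments in the female-gender slice are almost twice as likely to be labeled toxic.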

If you have time, why not try to run the same analysis for the other *gender* data slices?

## Analyze fairness with TFMA

### Learn about the pre-trained model



-------------------------------------------------------------------------------------------------------

For this exercise, you want to analyze the performance of a trained model for fairness.

You can access a simple pre-trained binary classification model, trained on the pre-processed Civil Comments dataset, in the folder **saved_model/**.

The model is a deep neural network trained on all features analyzed in the TFDV section: the "comment_text" feature is embedded using the https://tfhub.dev/google/nnlm-en-dim128/1 model, and the other categorical features are set as variable-length strings. The model has two hidden layers with 500 and 100 neurons, and uses the Adagrad optimizer with the loss reduced by summing. To handle the unbalanced classes, class weights are added to the training. The model is trained for 1000 steps with batch size of 512. The trained model has been exported to the saved_model format for serving.

### Run model analysis with TFMA's Fairness Indicators on the validation set



-------------------------------------------------------------------------------------------------------

TFMA allows you to run model analysis on a trained / serving model using the [run_model_analysis()](https://www.tensorflow.org/tfx/model_analysis/api_docs/python/tfma/run_model_analysis) functionality. 

On top of traditional metrics, TFMA provides access to the [Fairness Indicators](https://www.tensorflow.org/tfx/guide/fairness_indicators) library which enables easy computation of common fairness metrics--grouped inside the *FairnessIndicators* metric--at scale.

You need to define an evaluation configuration to specify the evaluation you want to perform; the eval_config should include:
- model_specs to define the column names for example labels and (optional) predictions.
- metrics_specs to define the metrics to compute. The FairnessIndicators metric will be required to render the fairness metrics and you can see a list of additional optional metrics [here](https://www.tensorflow.org/tfx/model_analysis/api_docs/python/tfma/metrics).
- slicing_specs to optionally define which feature(s) you’re interested in investigating. More than one slice can be provided; an empty slicing spec computes the metrics over all examples (the overall slice).
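To build intuition for what a thresholded fairness metric measures, here is a minimal pure-Python sketch of confusion-matrix rates computed at several thresholds. It is not TFMA's implementation: the function `rates_at_threshold`, the toy labels and scores, and the choice to treat `score >= threshold` as a positive prediction are all assumptions made for illustration.

```python
def rates_at_threshold(labels, scores, threshold):
    """Compute confusion-matrix rates, treating scores >= threshold as positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Rates with an empty denominator are undefined and reported as NaN
        return numerator / denominator if denominator else float("nan")

    return {
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
    }


# Made-up binary labels and model scores, for illustration only
labels = [1, 0, 1, 0, 0, 1]
scores = [0.95, 0.6, 0.4, 0.2, 0.8, 0.75]

results = {t: rates_at_threshold(labels, scores, t) for t in [0.1, 0.3, 0.5, 0.7, 0.9]}
for threshold, rates in results.items():
    print(threshold, rates)
```

Raising the threshold makes the classifier more conservative, so recall can only stay the same or drop while the false negative rate moves in the opposite direction. Comparing these rates across slices at the same threshold is the core idea behind the fairness metrics.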

Let's analyze the model's performance on the validation set at multiple thresholds. 

**NOTE:** *It could take up to 10-15 minutes to generate the evaluation results.*

**FUN FACT:** *You can compare up to two models!*

In [13]:
# Define input and output paths
tfma_export_dir = "saved_model"
tfma_eval_result_path = 'tfma_eval_result'

In [14]:
# Define the evaluation config
eval_config = text_format.Parse("""
    model_specs {
      label_key: "toxicity"
    }
    metrics_specs {
      metrics {
        class_name: "FairnessIndicators"
        config: '{ "thresholds": [0.1, 0.3, 0.5, 0.7, 0.9] }'
      }
    }
    slicing_specs {}  # overall slice
    slicing_specs {feature_keys: ["gender"]}  # we can slice by any feature
    
    options {
        compute_confidence_intervals { value: False }  # we can optionally compute CIs for a more detailed analysis
    }
""", tfma.EvalConfig())

# Load the pre-trained model to evaluate
eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=tfma_export_dir,
    eval_config=eval_config,
)

# Run the analysis
eval_result = tfma.run_model_analysis(
    eval_shared_model=eval_shared_model,
    eval_config=eval_config,
    data_location=validate_tf_file,
    output_path=tfma_eval_result_path,
    random_seed_for_testing=42,  # a random seed ensures deterministic results
)





#### Run model analysis with TFMA's Fairness Indicators at scale with Apache Beam


-------------------------------------------------------------------------------------------------------

[Apache Beam](https://beam.apache.org/releases/pydoc/2.6.0/index.html) provides a simple and powerful programming model for building both batch and streaming parallel data processing pipelines.

One of the unique capabilities of TFMA, as well as TFDV, is the ability to perform large-scale distributed computations using [Apache Beam](https://beam.apache.org/releases/pydoc/2.6.0/index.html).

You can explore the code below, which uses TFMA with Beam to generate the same evaluation results as *tfma.run_model_analysis()*. 

For the sake of time, we suggest not running this code here. 

```python
# Access the validation set
tfx_io = tf_example_record.TFExampleRecord(
    file_pattern=validate_tf_file,
    raw_record_column_name=tfma.ARROW_INPUT_COLUMN)

# Perform the model analysis evaluation
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | 'Read TFRecords' >> tfx_io.BeamSource()
        | 'Perform and Save Model Analysis' >> tfma.ExtractEvaluateAndWriteResults(
            eval_config=eval_config,
            eval_shared_model=eval_shared_model,
            output_path=tfma_eval_result_path,
            random_seed_for_testing=42,
        )
    )
```

### Get fairness evaluation results programmatically



-------------------------------------------------------------------------------------------------------

TFMA lets you interactively visualize results using the [render_slicing_metrics()](https://www.tensorflow.org/tfx/model_analysis/api_docs/python/tfma/view/render_slicing_metrics) functionality; unfortunately, there is currently a bug in the rendering for some Jupyter environments including Vertex AI Workbench.

Visualizing results is important for exploration, while accessing results programmatically is important for monitoring and automation.

The [EvalResult](https://www.tensorflow.org/tfx/model_analysis/api_docs/python/tfma/EvalResult) object returned by TFMA's evaluation has its own API that you can leverage to read TFMA results into your programs. 

Let's use the API to access the fairness evaluation results.
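As a rough illustration of the kind of automation this enables, the sketch below compares each slice against the overall slice for one metric. It operates on a plain dictionary shaped like the nested output of `get_metrics_for_all_slices()` printed later in this notebook. The helper `metric_gap` is hypothetical, and the numbers are made up for the example; they are not results from this notebook's model.

```python
def metric_gap(all_slices, metric, baseline=()):
    """Difference between each slice's metric value and the baseline slice."""
    base = all_slices[baseline][metric]["doubleValue"]
    return {
        slice_key: values[metric]["doubleValue"] - base
        for slice_key, values in all_slices.items()
        if slice_key != baseline
    }


METRIC = "fairness_indicators_metrics/false_negative_rate@0.5"

# Hypothetical values, NOT results produced by the model in this notebook
all_slices = {
    (): {METRIC: {"doubleValue": 0.25}},
    (("gender", "female"),): {METRIC: {"doubleValue": 0.30}},
    (("gender", "male"),): {METRIC: {"doubleValue": 0.20}},
}

gaps = metric_gap(all_slices, METRIC)
for slice_key, gap in gaps.items():
    print(slice_key, round(gap, 2))
```

In a real run you would pass `eval_result.get_metrics_for_all_slices()` instead of the toy dictionary, and could flag any slice whose gap exceeds a tolerance you choose.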

In [15]:
# Define the output prettifier
pp = pprint.PrettyPrinter()

In [28]:
pp

<pprint.PrettyPrinter at 0x7ff27b3bb510>

In [16]:
# Get list of slice names
print("Slices:")
pp.pprint(eval_result.get_slice_names())

Slices:
[(),
 (('gender', 'female'),),
 (('gender', 'male'),),
 (('gender', 'transgender'),),
 (('gender', 'other_gender'),)]


In [37]:
metric_names = eval_result.get_metric_names()

# Keep only the recall metrics
filtered_list = [name for name in metric_names if "recall" in name]
print(filtered_list)

# The threshold is the part of each metric name after the "@"
extracted_values = [name.split("@")[-1] for name in filtered_list]
print(extracted_values)

['fairness_indicators_metrics/recall@0.3', 'fairness_indicators_metrics/recall@0.1', 'fairness_indicators_metrics/recall@0.9', 'fairness_indicators_metrics/recall@0.7', 'fairness_indicators_metrics/recall@0.5']
['0.3', '0.1', '0.9', '0.7', '0.5']


In [38]:
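# What is the recall at threshold 0.7 overall for the dataset?
# Look it up in the metrics of the overall slice, which is keyed by the empty tuple.
# Recall equals 1 - false_negative_rate, so expect about 0.456 given the
# false_negative_rate@0.7 of about 0.544 reported for the overall slice.
overall_metrics = eval_result.get_metrics_for_slice(())
overall_metrics['fairness_indicators_metrics/recall@0.7']['doubleValue']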

In [17]:
# Get evaluated metrics overall for the dataset
print("\nMetrics:")
pp.pprint(eval_result.get_metric_names())


Metrics:
['fairness_indicators_metrics/false_discovery_rate@0.7',
 'fairness_indicators_metrics/false_omission_rate@0.7',
 'fairness_indicators_metrics/recall@0.3',
 'fairness_indicators_metrics/true_negative_rate@0.3',
 'fairness_indicators_metrics/false_positive_rate@0.7',
 'fairness_indicators_metrics/precision@0.5',
 'fairness_indicators_metrics/negative_rate@0.5',
 'fairness_indicators_metrics/positive_rate@0.7',
 'fairness_indicators_metrics/false_discovery_rate@0.5',
 'fairness_indicators_metrics/true_negative_rate@0.5',
 'fairness_indicators_metrics/true_positive_rate@0.3',
 'fairness_indicators_metrics/positive_rate@0.5',
 'fairness_indicators_metrics/positive_rate@0.3',
 'fairness_indicators_metrics/false_omission_rate@0.3',
 'fairness_indicators_metrics/recall@0.1',
 'fairness_indicators_metrics/false_negative_rate@0.7',
 'fairness_indicators_metrics/false_negative_rate@0.5',
 'fairness_indicators_metrics/recall@0.9',
 'fairness_indicators_metrics/false_discovery_rate@0.9',

In [18]:
# Get evaluated metrics for a particular slice, and compare it to a baseline slice composed of all data
baseline_slice = ()
female_slice = (('gender', 'female'),)

print("Baseline metric values:")
pp.pprint(eval_result.get_metrics_for_slice(baseline_slice))
print("\nFemale slice metric values:")
pp.pprint(eval_result.get_metrics_for_slice(female_slice))

Baseline metric values:
{'fairness_indicators_metrics/false_discovery_rate@0.1': {'doubleValue': 0.9140197471327217},
 'fairness_indicators_metrics/false_discovery_rate@0.3': {'doubleValue': 0.8791470819540209},
 'fairness_indicators_metrics/false_discovery_rate@0.5': {'doubleValue': 0.8160845728763951},
 'fairness_indicators_metrics/false_discovery_rate@0.7': {'doubleValue': 0.7085648020321387},
 'fairness_indicators_metrics/false_discovery_rate@0.9': {'doubleValue': 0.483878691141261},
 'fairness_indicators_metrics/false_negative_rate@0.1': {'doubleValue': 0.0059416885449772},
 'fairness_indicators_metrics/false_negative_rate@0.3': {'doubleValue': 0.08693173967113445},
 'fairness_indicators_metrics/false_negative_rate@0.5': {'doubleValue': 0.2728167749067293},
 'fairness_indicators_metrics/false_negative_rate@0.7': {'doubleValue': 0.5442172170789001},
 'fairness_indicators_metrics/false_negative_rate@0.9': {'doubleValue': 0.8882997098245129},
 'fairness_indicators_metrics/false_omiss

In [19]:
# Get metrics for all data slices at once
pp.pprint(eval_result.get_metrics_for_all_slices())

{(): {'fairness_indicators_metrics/false_discovery_rate@0.1': {'doubleValue': 0.9140197471327217},
      'fairness_indicators_metrics/false_discovery_rate@0.3': {'doubleValue': 0.8791470819540209},
      'fairness_indicators_metrics/false_discovery_rate@0.5': {'doubleValue': 0.8160845728763951},
      'fairness_indicators_metrics/false_discovery_rate@0.7': {'doubleValue': 0.7085648020321387},
      'fairness_indicators_metrics/false_discovery_rate@0.9': {'doubleValue': 0.483878691141261},
      'fairness_indicators_metrics/false_negative_rate@0.1': {'doubleValue': 0.0059416885449772},
      'fairness_indicators_metrics/false_negative_rate@0.3': {'doubleValue': 0.08693173967113445},
      'fairness_indicators_metrics/false_negative_rate@0.5': {'doubleValue': 0.2728167749067293},
      'fairness_indicators_metrics/false_negative_rate@0.7': {'doubleValue': 0.5442172170789001},
      'fairness_indicators_metrics/false_negative_rate@0.9': {'doubleValue': 0.8882997098245129},
      'fairness