# Guided Exercise: Fairness

### Goals 🎯
In this tutorial, you will view the fairness test results from part 1, identify the root cause and mitigate the fairness issues in the model.

If you missed part 1, go back: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/truera-examples/blob/release/rc-1.37/starter-examples/starter-fairness-part-1.ipynb)

The data used is ACS Employment data made available through the [*folktables* repository](https://github.com/zykls/folktables).

### First, set the credentials for your TruEra deployment.
If you don't have credentials yet, get them instantly by signing up for free at: https://www.truera.com

In [1]:
# connection details
TRUERA_URL = "https://app.truera.net/"
AUTH_TOKEN = "..."

### Install required packages for running in colab

In [2]:
! pip install truera
! pip install --upgrade pandas==1.5.2

### From here, you can run the rest of the notebook to follow the analysis.

In [4]:
import logging

import pandas as pd
import sklearn.metrics
import xgboost as xgb
from sklearn import preprocessing
from sklearn.utils import resample
from truera.client.ingestion import ColumnSpec, ModelOutputContext
from truera.client.truera_authentication import TokenAuthentication
from truera.client.truera_workspace import TrueraWorkspace

# connect to the deployment with the credentials set above
auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

In [5]:
# create the first project and data collection
project_name = "Starter Example Companion - Fairness"
tru.set_project(project_name)
tru.set_data_collection("Data Collection v1")
tru.get_models()

INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v1". 


['model_1']

In [6]:
tru.set_model(tru.get_models()[0])
tru.tester.get_model_test_results(test_types=["fairness"])

INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_1".


| Result | Name | Split | Protected Segment | Comparison Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ❌ | Impact Ratio Test | 2014-CA | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.6502 | Explore in UI |
| ❌ | Impact Ratio Test | 2014-NY | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.6334 | Explore in UI |
| ❌ | Impact Ratio Test | 2015-CA | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.6891 | Explore in UI |
| ❌ | Impact Ratio Test | 2015-NY | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.6613 | Explore in UI |
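
For readers unfamiliar with the metric in the table above, here is a minimal sketch of a disparate impact ratio in plain pandas, using the standard definition: the rate of favorable predictions in the protected segment divided by the rate in the comparison segment. The helper name `disparate_impact_ratio`, the column names, and the toy data are all hypothetical, and the platform's exact computation may differ. A common convention (the four-fifths rule) treats ratios below 0.8 as a warning sign; the threshold used by this project's test is not stated here.

```python
import pandas as pd


def disparate_impact_ratio(predictions: pd.Series, is_protected: pd.Series) -> float:
    """Favorable-outcome rate of the protected segment divided by that of everyone else."""
    protected_rate = predictions[is_protected].mean()
    comparison_rate = predictions[~is_protected].mean()
    return protected_rate / comparison_rate


# Hypothetical toy data: 1 = favorable prediction, 0 = unfavorable.
toy = pd.DataFrame({
    "Sex": ["Female"] * 5 + ["Male"] * 5,
    "prediction": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
})

ratio = disparate_impact_ratio(toy["prediction"], toy["Sex"] == "Female")
print(f"Disparate impact ratio: {ratio:.2f}")
```

A ratio of 1.0 would mean both segments receive favorable predictions at the same rate; values further below 1.0 indicate a larger gap against the protected segment.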


* What? As the model test results show, the first version of the model fails the Impact Ratio Test on every split.

* Why? Exploring the results shows that CA has a better impact ratio than NY in both years. This leads us to a hypothesis: CA might have a smaller difference in ground truth outcomes between men and women. This difference is also referred to as *dataset disparity*.

* We can examine the dataset disparity either in the web app or in the SDK (shown below).

In [7]:
# dataset disparity for 2014-NY
explainer = tru.get_explainer("2014-NY")
explainer.set_segment(segment_group_name="Sex", segment_name="Female")
mean_outcome_females_NY_2014 = explainer.get_ys().mean()
explainer.set_segment(segment_group_name="Sex", segment_name="Male")
mean_outcome_males_NY_2014 = explainer.get_ys().mean()
print("2014-NY dataset disparity: " + str(mean_outcome_females_NY_2014 - mean_outcome_males_NY_2014))

# dataset disparity for 2014-CA
explainer = tru.get_explainer("2014-CA")
explainer.set_segment(segment_group_name="Sex", segment_name="Female")
mean_outcome_females_CA_2014 = explainer.get_ys().mean()
explainer.set_segment(segment_group_name="Sex", segment_name="Male")
mean_outcome_males_CA_2014 = explainer.get_ys().mean()
print("2014-CA dataset disparity: " + str(mean_outcome_females_CA_2014 - mean_outcome_males_CA_2014))

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". 
INFO:truera.client.truera_workspace:Syncing data split "2014-NY" to local.
INFO:truera.client.local.local_truera_workspace:Data split "2014-NY" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing model model_1 to local.
INFO:truera.client.local.local_truera_workspace:Model "model_1" is added and associated with local data collection "Data Collection v1". "model_1" is set as the model for the workspace context.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:The previous data collecti

2014-NY dataset disparity: __truera_label__   -0.124236
dtype: float64


INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing data split "2014-CA" to local.
INFO:truera.client.local.local_truera_workspace:Data split "2014-CA" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.


2014-CA dataset disparity: __truera_label__   -0.107614
dtype: float64


This calculation confirms our hypothesis: the gap in mean ground truth outcome between females and males is smaller in 2014-CA (about -0.108) than in 2014-NY (about -0.124).
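
The same quantity can be sketched without the SDK. The snippet below is a minimal illustration in plain pandas, assuming dataset disparity means the difference in mean label between the protected group and the comparison group, as in the cell above. The helper name `dataset_disparity`, the column names, and the toy data are hypothetical.

```python
import pandas as pd


def dataset_disparity(labels: pd.Series, groups: pd.Series,
                      protected: str, comparison: str) -> float:
    """Mean label of the protected group minus mean label of the comparison group."""
    return labels[groups == protected].mean() - labels[groups == comparison].mean()


# Hypothetical toy data: label 1 = positive ground truth outcome.
toy = pd.DataFrame({
    "Sex": ["Female"] * 4 + ["Male"] * 4,
    "label": [1, 0, 0, 0, 1, 1, 0, 0],
})

gap = dataset_disparity(toy["label"], toy["Sex"], "Female", "Male")
print(f"Dataset disparity: {gap:.2f}")
```

A negative value means the protected group has a lower positive outcome rate in the data itself, before any model is involved.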

Going forward, we should retrain our model on CA data, in the belief that fairer training data will lead to a fairer model. But let's not stop there.

* We notice that sex is included in the training data and is the second leading contributor to disparity. We should remove it so the model cannot learn from it directly.

* What next? Let's see the fairness outcome after removing sex from the training data and using CA as our training set.

In [8]:
# get data and feature map

from smart_open import open

data_s3_file_name = "https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-fairness/starter-fairness-data-compressed.pickle"
with open(data_s3_file_name, "rb") as f:
    data = pd.read_pickle(f)

feature_map_s3_file_name = "https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-fairness/starter-fairness-feature-map.pickle"
with open(feature_map_s3_file_name, "rb") as f:
    feature_map = pd.read_pickle(f)
feature_map.pop("Sex")

tru.add_data_collection("Data Collection v2", pre_to_post_feature_map=feature_map, provide_transform_with_model=False)

# add data splits to the collection we just created
year_begin = 2014
year_end = 2016  # exclusive
states = ["CA", "NY"]

for year in range(year_begin, year_end):
    for state in states:
        data_split_name = f"{year}-{state}"
        print(f"Ingesting data-split: {data_split_name}...")
        ids = [f"{i}" for i in range(len(data[year][state]["data_preprocessed"]))]
        data[year][state]["data_preprocessed"]["id"] = ids
        data[year][state]["data_postprocessed"]["id"] = ids
        data[year][state]["label"] = pd.DataFrame(data[year][state]["label"])
        data[year][state]["label"]["id"] = ids
        data[year][state]["extra_data"] = pd.DataFrame(data[year][state]["extra_data"][["LANX"]])
        data[year][state]["extra_data"]["id"] = ids
        # data
        merged_data = data[year][state]["data_preprocessed"].\
            merge(data[year][state]["data_postprocessed"]).\
            merge(data[year][state]["label"]).\
            merge(data[year][state]["extra_data"])
        tru.add_data(
            data=merged_data,
            data_split_name=data_split_name,
            column_spec=ColumnSpec(
                id_col_name="id",
                pre_data_col_names=list(data[year][state]["data_preprocessed"].columns.drop(
                    ["id", "Sex"])),  # drop sex from pre-data
                post_data_col_names=list(data[year][state]["data_postprocessed"].columns.drop(
                    ["id", "Sex_Male", "Sex_Female"])),
                label_col_names=list(data[year][state]["label"].columns.drop("id")),
                extra_data_col_names=list(data[year][state]["extra_data"].columns.drop("id")) +
                ["Sex"]  # add sex to extra-data
            ))
tru.set_influences_background_data_split(f"{year_begin}-{states[0]}")

Ingesting data-split: 2014-CA...




Uploading tmpmuo_ypy1.parquet (175.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2014-NY...




Uploading tmpv7qd2cb1.parquet (174.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2015-CA...




Uploading tmpt1thw06j.parquet (175.3KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2015-NY...




Uploading tmp6m1uuhzm.parquet (174.0KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


In [9]:
# train xgboost
models = {}
model_name_v2 = "model_2"

models[model_name_v2] = xgb.XGBClassifier(eta=0.2, max_depth=4)

models[model_name_v2].fit(data[2014]["CA"]["data_postprocessed"].drop(["Sex_Female", "Sex_Male", "id"], axis=1),
                          data[2014]["CA"]["label"].drop("id", axis=1))

train_params = {"model_type": "xgb.XGBClassifier", "eta": 0.2, "max_depth": 4}

# ingest the model
tru.set_project(project_name)
tru.set_data_collection("Data Collection v2")
tru.add_python_model(model_name_v2, models[model_name_v2], train_split_name="2014-CA", train_parameters=train_params)

INFO:truera.client.remote_truera_workspace:Uploading xgboost model: XGBClassifier
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!


Uploading MLmodel (222.0B) -- ### -- file upload complete.
Uploading tmp2o9t576d.json (170.2KiB) -- ### -- file upload complete.
Uploading conda.yaml (208.0B) -- ### -- file upload complete.
Uploading xgboost_classification_predict_wrapper.py (573.0B) -- ### -- file upload complete.
Uploading xgboost_classification_predict_wrapper.cpython-310.pyc (1.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "model_2" is added and associated with remote data collection "Data Collection v2". "model_2" is set as the model for the workspace context.


Model uploaded to: http://localhost:8000/home/p/Starter%20Example%20Companion%20-%20Fairness/m/model_2/


INFO:truera.client.remote_truera_workspace:Triggering computations for model predictions on split 2014-CA.
INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v2". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_2".


In [10]:
tru.set_model("model_2")
tru.set_influences_background_data_split("2014-NY")

# set model output context
model_output_context = ModelOutputContext(model_name=model_name_v2,
                                          score_type="probits",
                                          background_split_name="2014-NY",
                                          influence_type='tree-shap-interventional')

for year in range(year_begin, year_end):
    for state in states:
        data_split_name = f"{year}-{state}"
        tru.set_data_split(data_split_name)
        print(f"Ingesting data-split: {data_split_name}...")
        preds = tru.get_ys_pred().reset_index(names="id")
        # predictions
        tru.add_data(data=preds,
                     data_split_name=data_split_name,
                     column_spec=ColumnSpec(id_col_name="id", prediction_col_names="__truera_prediction__"))
        # influences
        tru.compute_feature_influences()

Ingesting data-split: 2014-CA...


INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data split "2014-CA" to local.
INFO:truera.client.local.local_truera_workspace:Data split "2014-CA" is added to local data collection "Data Collection v2", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing data split "2014-NY" to local.
INFO:truera.client.local.local_truera_workspace:Data split "2014-NY" is added to local data collection "Data Collection v2", and set as the data split for the workspace context.
INFO:

Uploading tmpo8wfhetb.parquet (58.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local m

Uploading tmpa8tnc0g0.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2014-NY...


INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local model context to "model_2".
INFO:truera.client.remote_truera_workspace:`model_output_

Uploading tmpoamep60c.parquet (57.7KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local m

Uploading tmptfklsm_k.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2015-CA...


INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing data split "2015-CA" to local.
INFO:truera.client.local.local_truera_workspace:Data split "2015-CA" is added to local data collection "Data Collection v2", and set as the data split for

Uploading tmpu4jos04h.parquet (58.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local m

Uploading tmpkc2zdr7_.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2015-NY...


INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing data split "2015-NY" to local.
INFO:truera.client.local.local_truera_workspace:Data split "2015-NY" is added to local data collection "Data Collection v2", and set as the data split for

Uploading tmpjnjuld41.parquet (57.6KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local m

Uploading tmpwywcyf64.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


In [11]:
# view model test results
tru.set_model(model_name_v2)
tru.tester.get_model_test_results(test_types=["fairness"])

| Result | Name | Split | Protected Segment | Comparison Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ❌ | Impact Ratio Test | 2014-CA | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.7773 | Explore in UI |
| ❌ | Impact Ratio Test | 2014-NY | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.776 | Explore in UI |
| ✅ | Impact Ratio Test | 2015-CA | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.8156 | Explore in UI |
| ✅ | Impact Ratio Test | 2015-NY | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.8053 | Explore in UI |


* What? Our newest model now passes 2 of the 4 impact ratio tests.
* The Dataset Disparity section of the Fairness page in the web app shows a lower positive outcome rate for females in the data. Let's check how the model performs on females with a positive label.

In [12]:
tru.set_model(model_name_v2)
tru.set_data_split("2014-CA")
tru.add_segment_group(
    "Sex + Label", {
        "Other": "(Sex == 'Male') OR (_DATA_GROUND_TRUTH == 0)",
        "Female + Label 1": "(Sex == 'Female') AND (_DATA_GROUND_TRUTH == 1)"
    })
explainer = tru.get_explainer(base_data_split="2014-CA")

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local model context to "model_2".


In [13]:
explainer.set_segment(segment_group_name="Sex + Label", segment_name="Female + Label 1")
print("Female + label 1 performance: " + str(explainer.compute_performance(metric_type="CLASSIFICATION_ACCURACY")))
explainer.set_segment(segment_group_name="Sex + Label", segment_name="Other")
print("Other performance: " + str(explainer.compute_performance(metric_type="CLASSIFICATION_ACCURACY")))

Female + label 1 performance: 0.7719298245614035
Other performance: 0.8663113994439295
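
The comparison above can be reproduced in spirit with plain pandas. This is a minimal sketch, assuming classification accuracy is the fraction of rows where the predicted class equals the label. The column names and toy data are hypothetical and unrelated to the ACS data.

```python
import pandas as pd

# Hypothetical toy data with ground truth labels and predicted classes.
toy = pd.DataFrame({
    "Sex":   ["Female", "Female", "Female", "Female", "Male", "Male", "Female", "Male"],
    "label": [1, 1, 1, 1, 1, 0, 0, 0],
    "pred":  [1, 0, 1, 0, 1, 0, 0, 1],
})

# Same split as the "Sex + Label" segment group: females with a positive label vs. everyone else.
in_segment = (toy["Sex"] == "Female") & (toy["label"] == 1)
correct = toy["label"] == toy["pred"]

segment_accuracy = correct[in_segment].mean()
other_accuracy = correct[~in_segment].mean()

print(f"Female + label 1 accuracy: {segment_accuracy:.2f}")
print(f"Other accuracy: {other_accuracy:.2f}")
```

Because every row in the first segment has a positive label, its accuracy equals the model's recall on that group, which is why a low value there points to missed positive outcomes for females.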


Our hunch was correct: on 2014-CA, the model is less accurate for females with a positive label (about 0.772) than for the rest of the population (about 0.866). To correct for this, we rebalance the training set by oversampling females with `label == 1`.

In [14]:
def rebalance_gender(df, data_type):
    # data_type == 0: df holds the categorical "Sex" column (pre-processed data);
    # any other value: df holds the one-hot "Sex_Female" / "Sex_Male" columns (post-processed data).
    if data_type == 0:
        is_female = df["Sex"] == "Female"
        is_male = df["Sex"] == "Male"
    else:
        is_female = df["Sex_Female"] == 1
        is_male = df["Sex_Male"] == 1
    is_positive = df["PINCP"] == True

    df_female_true = df[is_female & is_positive]
    df_else = df[~(is_female & is_positive)]
    # target size: the number of males with a positive label
    num_samples = len(df[is_male & is_positive])

    # Oversample (with replacement) females with a positive label up to that size
    df_female_true_resampled = resample(
        df_female_true,
        replace=True,
        n_samples=num_samples,
        random_state=1  # fixed seed so the same rows are drawn for each data set
    )

    return pd.concat([df_female_true_resampled, df_else])
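
To see what this kind of rebalancing does to group sizes, here is a small self-contained sketch on hypothetical toy data. It is not the notebook's function: it swaps in pandas' `DataFrame.sample` for `sklearn.utils.resample` to keep the example dependency-light, and only the column names `Sex` and `PINCP` are taken from the cell above.

```python
import pandas as pd

# Hypothetical toy data: 1 positive female, 4 positive males.
toy = pd.DataFrame({
    "Sex":   ["Female"] * 4 + ["Male"] * 6,
    "PINCP": [True, False, False, False, True, True, True, True, False, False],
})

female_pos = (toy["Sex"] == "Female") & toy["PINCP"]
male_pos = (toy["Sex"] == "Male") & toy["PINCP"]

# Draw positive-label females with replacement until they match the positive-label males.
oversampled = toy[female_pos].sample(n=int(male_pos.sum()), replace=True, random_state=1)
rebalanced = pd.concat([oversampled, toy[~female_pos]])

before = int(female_pos.sum())
after = int(((rebalanced["Sex"] == "Female") & rebalanced["PINCP"]).sum())
print(f"Positive-label females before: {before}, after: {after}")
print(f"Rows before: {len(toy)}, after: {len(rebalanced)}")
```

Oversampling duplicates existing rows rather than creating new information, so the result keeps repeated index values; this is one reason the notebook assigns fresh ids to the resampled frames afterwards.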

In [15]:
data[2014]["CA"]["data_preprocessed_resampled"] = rebalance_gender(data[2014]["CA"]["data_preprocessed"].reset_index(drop=True).\
                                        merge(data[2014]["CA"]["label"].reset_index(drop=True), on="id"), 0).drop(["Sex","PINCP"], axis=1)
data[2014]["CA"]["data_postprocessed_resampled"] = rebalance_gender(data[2014]["CA"]["data_postprocessed"].reset_index(drop=True).\
                                        merge(data[2014]["CA"]["label"].reset_index(drop=True), on="id"), 1).drop(["Sex_Male","Sex_Female", "PINCP"], axis=1)
data[2014]["CA"]["label_resampled"] = rebalance_gender(pd.DataFrame(data[2014]["CA"]["label"].reset_index(drop=True)).\
                                        merge(data[2014]["CA"]["data_preprocessed"].reset_index(drop=True), on="id"), 0).drop(["Sex"], axis=1)[["PINCP","id"]]
data[2014]["CA"]["extra_data_resampled"] = rebalance_gender(pd.DataFrame(data[2014]["CA"]["extra_data"].reset_index(drop=True)).\
                                        merge(data[2014]["CA"]["data_preprocessed"].reset_index(drop=True), on="id").\
                                        merge(data[2014]["CA"]["label"].reset_index(drop=True), on="id"), 0)[["LANX","id","Sex"]]

ids = [f"{i}" for i in range(len(data[2014]["CA"]["data_preprocessed_resampled"]))]
data[2014]["CA"]["data_preprocessed_resampled"]["id"] = ids
data[2014]["CA"]["data_postprocessed_resampled"]["id"] = ids
data[2014]["CA"]["label_resampled"]["id"] = ids
data[2014]["CA"]["extra_data_resampled"]["id"] = ids

In [16]:
merged_data = data[2014]["CA"]["data_preprocessed_resampled"].\
            merge(data[2014]["CA"]["data_postprocessed_resampled"]).\
            merge(data[2014]["CA"]["label_resampled"]).\
            merge(data[2014]["CA"]["extra_data_resampled"])
tru.add_data(data=merged_data,
             data_split_name="2014-CA-resampled",
             column_spec=ColumnSpec(
                 id_col_name="id",
                 pre_data_col_names=list(data[2014]["CA"]["data_preprocessed_resampled"].columns.drop(["id"])),
                 post_data_col_names=list(data[2014]["CA"]["data_postprocessed_resampled"].columns.drop(["id"])),
                 label_col_names=list(data[2014]["CA"]["label_resampled"].columns.drop("id")),
                 extra_data_col_names=list(data[2014]["CA"]["extra_data_resampled"].columns.drop("id"))))

Uploading tmp1xrnrkz0.parquet (178.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


In [17]:
# train xgboost
model_name_v3 = "model_3"

models[model_name_v3] = xgb.XGBClassifier(eta=0.2, max_depth=4)
models[model_name_v3].fit(data[2014]["CA"]["data_postprocessed_resampled"].drop("id", axis=1),
                          data[2014]["CA"]["label_resampled"].drop("id", axis=1))

# ingest the model
tru.add_python_model(model_name_v3,
                     models[model_name_v3],
                     train_split_name="2014-CA-resampled",
                     train_parameters=train_params)

INFO:truera.client.remote_truera_workspace:Uploading xgboost model: XGBClassifier
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!


Uploading MLmodel (222.0B) -- ### -- file upload complete.
Uploading tmp259xq10y.json (165.8KiB) -- ### -- file upload complete.
Uploading conda.yaml (208.0B) -- ### -- file upload complete.
Uploading xgboost_classification_predict_wrapper.py (573.0B) -- ### -- file upload complete.
Uploading xgboost_classification_predict_wrapper.cpython-310.pyc (1.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "model_3" is added and associated with remote data collection "Data Collection v2". "model_3" is set as the model for the workspace context.


Model uploaded to: http://localhost:8000/home/p/Starter%20Example%20Companion%20-%20Fairness/m/model_3/


INFO:truera.client.remote_truera_workspace:Triggering computations for model predictions on split 2014-NY.
INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v2". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_3".


In [18]:
tru.set_model(model_name_v3)
tru.set_influences_background_data_split("2014-CA-resampled")

# set model output context
model_output_context = ModelOutputContext(model_name=model_name_v3,
                                          score_type="probits",
                                          background_split_name="2014-CA-resampled",
                                          influence_type='tree-shap-interventional')

data_split_name = "2014-CA-resampled"
tru.set_data_split(data_split_name)
print(f"Ingesting data-split: {data_split_name}...")

# predictions
preds = tru.get_ys_pred().reset_index(names="id")
tru.add_data(data=preds,
             data_split_name=data_split_name,
             column_spec=ColumnSpec(id_col_name="id", prediction_col_names="__truera_prediction__"))

# influences
tru.compute_feature_influences()

Ingesting data-split: 2014-CA-resampled...


INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing data split "2014-CA-resampled" to local.
INFO:truera.client.local.local_truera_workspace:Data split "2014-CA-resampled" is added to local data collection "Data Collection v2", and set a

Uploading tmp1__csb4v.parquet (61.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local m

Uploading tmphhb0rsua.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


| 6b1c748f-3a69-43a1-8125-79d660af0c90 | Age | Class_of_worker | Marital_status | Relationship | Education_attainment | Work_hours_per_week | Occupation_area | Race |
|---|---|---|---|---|---|---|---|---|
| 5329 | 0.062600 | -0.006303 | 0.019363 | 0.043392 | 0.132384 | 0.091780 | 0.167688 | -0.017412 |
| 1970 | 0.025450 | -0.041165 | -0.008946 | 0.017926 | -0.049954 | 0.049862 | -0.253300 | -0.001106 |
| 4612 | -0.168059 | 0.003331 | -0.020178 | -0.040817 | -0.079256 | 0.000743 | 0.005924 | 0.009545 |
| 4897 | 0.086994 | 0.115345 | 0.026898 | 0.071541 | 0.104078 | -0.087838 | 0.025532 | 0.021203 |
| 363 | 0.063667 | -0.078509 | 0.013049 | -0.092652 | 0.132105 | 0.260099 | 0.107027 | 0.058507 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4463 | 0.084367 | 0.030146 | 0.015766 | 0.026285 | -0.072431 | -0.073311 | -0.080442 | -0.077592 |
| 5087 | -0.097702 | -0.004884 | -0.002827 | -0.037647 | -0.051907 | -0.078970 | -0.075990 | 0.004802 |
| 933 | 0.066994 | 0.005356 | 0.019708 | 0.019279 | 0.108000 | 0.209093 | -0.000161 | 0.026896 |
| 2754 | 0.071651 | -0.003272 | 0.026662 | 0.062341 | 0.082455 | 0.064490 | 0.043741 | 0.034906 |
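
A per-row influence table like the one above is easier to read once it is aggregated. The following is a minimal sketch, not part of the TruEra API: it copies only the first five rows shown above into a plain pandas DataFrame (the variable names `influences` and `importance` are illustrative) and ranks features by mean absolute influence. Five rows are far too few to say anything about the model as a whole; in the notebook you would apply the same two lines to the full DataFrame returned by `tru.compute_feature_influences()`.

```python
import pandas as pd

# First five rows of the influence table shown above, copied by hand.
influences = pd.DataFrame(
    {
        "Age": [0.062600, 0.025450, -0.168059, 0.086994, 0.063667],
        "Class_of_worker": [-0.006303, -0.041165, 0.003331, 0.115345, -0.078509],
        "Marital_status": [0.019363, -0.008946, -0.020178, 0.026898, 0.013049],
        "Relationship": [0.043392, 0.017926, -0.040817, 0.071541, -0.092652],
        "Education_attainment": [0.132384, -0.049954, -0.079256, 0.104078, 0.132105],
        "Work_hours_per_week": [0.091780, 0.049862, 0.000743, -0.087838, 0.260099],
        "Occupation_area": [0.167688, -0.253300, 0.005924, 0.025532, 0.107027],
        "Race": [-0.017412, -0.001106, 0.009545, 0.021203, 0.058507],
    },
    index=[5329, 1970, 4612, 4897, 363],
)

# Mean absolute influence per feature, largest first.
# Taking the absolute value keeps positive and negative pushes from cancelling.
importance = influences.abs().mean().sort_values(ascending=False)
print(importance)
```

Averaging absolute values is only one common way to summarize influences; signed means or per-segment breakdowns answer different questions.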


In [19]:
for year in range(year_begin, year_end):
    for state in states:
        data_split_name = f"{year}-{state}"
        tru.set_data_split(data_split_name)
        print(f"Ingesting data-split: {data_split_name}...")
        # predictions
        preds = tru.get_ys_pred().reset_index(names="id")
        tru.add_data(data=preds,
                     data_split_name=data_split_name,
                     column_spec=ColumnSpec(id_col_name="id", prediction_col_names="__truera_prediction__"))
        # influences
        tru.compute_feature_influences()

Ingesting data-split: 2014-CA...


INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local model context to "model_3".
INFO:truera.client.remote_truera_workspace:`model_output_

Uploading tmpej5entzk.parquet (58.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpyzzueqm5
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v2" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:Setting local m

Uploading tmphm57zpdm.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2014-NY...



Uploading tmpe6jwi84j.parquet (57.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...

Uploading tmp4qqvywlr.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2015-CA...



Uploading tmpgy78x9vq.parquet (58.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...

Uploading tmp2uv967s3.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Ingesting data-split: 2015-NY...



Uploading tmp8xe60x55.parquet (57.6KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...

Uploading tmp5n93qn0e.parquet (14.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


In [20]:
# view model test results
tru.set_model(model_name_v3)
tru.tester.get_model_test_results(test_types=["fairness"])

|  | Name | Split | Protected Segment | Comparison Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ✅ | Impact Ratio Test | 2014-CA | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.8213 | Explore in UI |
| ✅ | Impact Ratio Test | 2014-NY | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.8079 | Explore in UI |
| ✅ | Impact Ratio Test | 2015-CA | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.8539 | Explore in UI |
| ✅ | Impact Ratio Test | 2015-NY | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 0.827 | Explore in UI |
| ✅ | Impact Ratio Test | 2014-CA-resampled | Sex--Female: Sex = 'Female' | REST OF POPULATION | DISPARATE_IMPACT_RATIO | 1.048 | Explore in UI |


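### After rebalancing, the model passes the impact ratio test on all four year/state splits (2014-CA, 2014-NY, 2015-CA, 2015-NY) as well as on 2014-CA-resampled, giving us more confidence in the fairness of the model.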
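
To make the `DISPARATE_IMPACT_RATIO` score more concrete, here is a minimal sketch of the commonly used definition: the rate of favorable predictions in the protected segment divided by the rate in the comparison segment. The function name, the toy data, and the 0/1 encoding of predictions are all assumptions made for illustration; TruEra's own computation (for example, how scores are thresholded into favorable outcomes) may differ in its details.

```python
import pandas as pd


def disparate_impact_ratio(predictions: pd.Series, is_protected: pd.Series) -> float:
    """Favorable-outcome rate of the protected group divided by that of everyone else.

    `predictions` holds 1 for a favorable prediction and 0 otherwise;
    `is_protected` is a boolean mask aligned with `predictions`.
    """
    protected_rate = predictions[is_protected].mean()
    comparison_rate = predictions[~is_protected].mean()
    return protected_rate / comparison_rate


# Hypothetical toy data, not taken from the ACS splits used in this notebook.
toy = pd.DataFrame(
    {
        "Sex": ["Female"] * 4 + ["Male"] * 5,
        "prediction": [1, 1, 0, 0, 1, 1, 1, 0, 0],
    }
)

ratio = disparate_impact_ratio(toy["prediction"], toy["Sex"] == "Female")
print(f"Disparate impact ratio: {ratio:.4f}")
```

A ratio of 1 means both groups receive favorable predictions at the same rate; values below 1 mean the protected segment is favored less often, and values above 1 (as on 2014-CA-resampled) mean it is favored more often.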