# Assessing Policies Using Real Data

In this workflow, CFRL takes in an offline trajectory and preprocesses it using `SequentialPreprocessor`. The preprocessed trajectory is then passed into `FQI` to train a counterfactually fair policy, which is assessed using `evaluate_reward_through_fqe()` and `evaluate_fairness_through_model()` based on a `SimulatedEnvironment` that mimics the transition rules of the true environment underlying the training trajectory. The final output of the workflow is the policy trained on the preprocessed data, together with its estimated value and counterfactual fairness metric. This workflow is appropriate when the user wants to know the value and counterfactual fairness that the trained policy achieves when interacting with the true underlying environment.

We begin by importing the libraries needed for this demonstration.

In [2]:
# Need this temporarily to import CFRL before it is officially published to PyPI
import sys
sys.path.append("E:/learning/university/MiSIL/CFRL Python Package/CFRL")

In [None]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from CFRL.reader import read_trajectory_from_dataframe
from CFRL.preprocessor import SequentialPreprocessor
from CFRL.agents import FQI
from CFRL.environment import SimulatedEnvironment
from CFRL.evaluation import evaluate_reward_through_fqe, evaluate_fairness_through_model
np.random.seed(1) # ensure reproducibility
torch.manual_seed(1) # ensure reproducibility

<torch._C.Generator at 0x23d0a6ca650>

## Data Loading

In this demonstration, we use an offline trajectory generated from a `SyntheticEnvironment` using some pre-specified transition rules. Although the trajectory is synthesized, we treat it as if it came from an unknown environment for pedagogical convenience.

The trajectory contains 500 individuals (i.e., $N=500$) and 10 transitions (i.e., $T=10$). The actions are binary (0 or 1) and were sampled using a random policy that selects 0 or 1 with equal probability. The trajectory is stored in a tabular format in a `.csv` file. The sensitive attribute variable is bivariate, stored in columns `z1` and `z2`. The valid values of the sensitive attribute are $[0, 0]$, $[1, 0]$, $[0, 1]$, and $[1, 1]$. The state variable is trivariate, stored in columns `state1`, `state2`, and `state3`. The actions are stored in the column `action` and the rewards in the column `reward`. The tabular data also includes an extra irrelevant column `timestamp`.

We can load and view the tabular data.

In [4]:
trajectory = pd.read_csv('../data/sample_data_large_multi.csv')
trajectory

Unnamed: 0.1,Unnamed: 0,ID,timestamp,z1,z2,action,reward,state1,state2,state3
0,0,1.0,1.0,0.0,0.0,,,2.124345,-0.111756,-0.028172
1,1,1.0,2.0,0.0,0.0,1.0,3.380339,-0.071876,0.545250,-0.020279
2,2,1.0,3.0,0.0,0.0,0.0,1.849111,-1.084077,-1.696634,-1.179136
3,3,1.0,4.0,0.0,0.0,0.0,-4.421291,-2.317520,-1.787875,-2.148363
4,4,1.0,5.0,0.0,0.0,1.0,-5.142691,-2.936506,-3.603797,-3.590126
...,...,...,...,...,...,...,...,...,...,...
5495,5495,500.0,7.0,0.0,0.0,0.0,-12.563265,-4.024293,-6.587401,-3.859436
5496,5496,500.0,8.0,0.0,0.0,0.0,-14.073520,-5.952644,-5.854450,-4.218220
5497,5497,500.0,9.0,0.0,0.0,0.0,-16.691358,-5.687570,-6.008377,-5.618730
5498,5498,500.0,10.0,0.0,0.0,0.0,-18.394408,-7.551435,-6.816310,-6.740886


We now read the trajectory from the tabular format into Trajectory Arrays.

In [5]:
zs, states, actions, rewards, ids = read_trajectory_from_dataframe(
                                                data=trajectory, 
                                                z_labels=['z1', 'z2'], 
                                                state_labels=['state1', 'state2', 'state3'], 
                                                action_label='action', 
                                                reward_label='reward', 
                                                id_label='ID', 
                                                T=10
                                                )

In [6]:
'''zs, states, actions, rewards, ids = read_trajectory_from_dataframe(
                                                data=trajectory, 
                                                z_labels=['z1'], 
                                                state_labels=['state1'], 
                                                action_label='action', 
                                                reward_label='reward', 
                                                id_label='ID', 
                                                T=10
                                                )'''


## Train-test Split

We split the trajectory data into a training set (80%) and a testing set (20%). The training set is used to train the policy, while the testing set is used to evaluate the value and counterfactual fairness metric achieved by the policy.

In [7]:
(
    zs_train, zs_test, 
    states_train, states_test, 
    actions_train, actions_test, 
    rewards_train, rewards_test
) = train_test_split(zs, states, actions, rewards, test_size=0.2)
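
For intuition, the split above partitions individuals (the first axis of each array) and applies the same partition to every array, so that each individual's sensitive attributes, states, actions, and rewards stay aligned. The sketch below reproduces that idea with plain NumPy on toy arrays. The array shapes and contents are illustrative assumptions, not the layout of the actual CFRL Trajectory Arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 3  # toy sizes, not the 500 x 10 data used in this demonstration

# Hypothetical toy arrays: every entry for individual i is set to i,
# which makes it easy to check that the arrays stay aligned after the split.
zs = np.arange(N).reshape(N, 1)
states = np.repeat(np.arange(N), T + 1).reshape(N, T + 1, 1)
actions = np.repeat(np.arange(N), T).reshape(N, T)

# One shared permutation of individuals is applied to every array.
perm = rng.permutation(N)
n_test = int(round(0.2 * N))
test_idx, train_idx = perm[:n_test], perm[n_test:]

zs_train, zs_test = zs[train_idx], zs[test_idx]
states_train, states_test = states[train_idx], states[test_idx]
actions_train, actions_test = actions[train_idx], actions[test_idx]
```

Because the split is done per individual rather than per row, no individual's trajectory is shared between the training and testing sets.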

## Preprocessor Training & Trajectory Preprocessing

We now train the preprocessor and preprocess the trajectory. Note that if we train the preprocessor using only a subset of the data and preprocess the remaining subset of the data, then the resulting preprocessed trajectory might be too small to be useful for policy learning. We essentially want to preprocess as many individuals as possible. Fortunately, we can directly preprocess all individuals using the `train_preprocessor()` function when we set `cross_folds` to a relatively large number.

When `cross_folds=K` where `K` is greater than 1, `train_preprocessor()` internally divides the training data into `K` folds. For each $i=1,\dots,K$, it trains a transition dynamics model on all folds other than the $i$-th one and then uses this model to preprocess the data in the $i$-th fold. This results in `K` folds of preprocessed data, each processed by a model trained on the other folds. These `K` folds are then combined and returned by `train_preprocessor()`. This method allows us to preprocess all individuals in the trajectory while reducing overfitting.
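
The fold-wise procedure described above can be sketched generically as follows. This is only an illustration of the cross-fitting idea, not the internals of `train_preprocessor()`: the helper `out_of_fold_predictions` is hypothetical, and a least-squares regression stands in for the transition dynamics model.

```python
import numpy as np

def out_of_fold_predictions(X, y, K, seed=0):
    """Predict each row with a least-squares model fit on the other K-1 folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    preds = np.empty(len(y), dtype=float)
    for i in range(K):
        held_out = folds[i]
        rest = np.concatenate([folds[j] for j in range(K) if j != i])
        # Fit on every fold except the i-th one (intercept via a column of ones).
        A = np.column_stack([np.ones(len(rest)), X[rest]])
        coef, *_ = np.linalg.lstsq(A, y[rest], rcond=None)
        # Apply that model only to the held-out i-th fold.
        B = np.column_stack([np.ones(len(held_out)), X[held_out]])
        preds[held_out] = B @ coef
    return preds

# Toy data with a known linear relationship (assumed for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)
preds = out_of_fold_predictions(X, y, K=5)
```

Every row receives a prediction, yet no row is ever predicted by a model that saw it during fitting, which is the property that lets all individuals be preprocessed while limiting overfitting.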

To use this functionality, we first initialize a `SequentialPreprocessor` with `cross_folds` greater than 1. We use `cross_folds=5` here.

In [8]:
sp = SequentialPreprocessor(z_space=[[0, 0], [0, 1], [1, 0], [1, 1]], 
                            num_actions=2, 
                            cross_folds=5, 
                            mode='single', 
                            reg_model='nn')

In [9]:
'''sp = SequentialPreprocessor(z_space=[[0], [1]], 
                            num_actions=2, 
                            cross_folds=5, 
                            mode='single', 
                            reg_model='nn')'''


We now simultaneously train the preprocessor and preprocess all individuals in the trajectory using the procedure described above.

In [10]:
states_tilde, rewards_tilde = sp.train_preprocessor(zs=zs_train, 
                                                    xs=states_train, 
                                                    actions=actions_train, 
                                                    rewards=rewards_train)

100%|██████████| 1000/1000 [00:47<00:00, 21.08it/s]
100%|██████████| 1000/1000 [00:51<00:00, 19.55it/s]
100%|██████████| 1000/1000 [01:00<00:00, 16.60it/s]
100%|██████████| 1000/1000 [00:45<00:00, 22.22it/s]
100%|██████████| 1000/1000 [00:42<00:00, 23.81it/s]


## Policy Learning

Now we train a policy using `FQI` and the preprocessed data. Note that although we pass `sp` into `agent` as an internal preprocessor, we set `preprocess=False` during training because the training data `states_tilde` and `rewards_tilde` are already preprocessed.

In [11]:
agent = FQI(num_actions=2, model_type='nn', preprocessor=sp)
agent.train(zs=zs_train, 
            xs=states_tilde, 
            actions=actions_train, 
            rewards=rewards_tilde, 
            max_iter=100, 
            preprocess=False)

100%|██████████| 100/100 [01:16<00:00,  1.30it/s]


## `SimulatedEnvironment` Training

Before moving on to the evaluation stage, there is one more thing to do: We need to train a `SimulatedEnvironment` that mimics the transition rules of the true environment that generated the training trajectory, which will be used by the evaluation functions. To do so, we initialize a `SimulatedEnvironment` and train it on the whole trajectory data (i.e. training set and testing set combined).

In [12]:
env = SimulatedEnvironment(num_actions=2, 
                           state_model_type='nn', 
                           reward_model_type='nn')
env.fit(zs=zs, states=states, actions=actions, rewards=rewards)

  4%|▍         | 44/1000 [00:00<00:17, 53.74it/s]
  3%|▎         | 29/1000 [00:00<00:13, 69.58it/s]
  2%|▏         | 22/1000 [00:00<00:17, 57.40it/s]
  5%|▍         | 46/1000 [00:00<00:20, 46.88it/s]
  6%|▌         | 57/1000 [00:01<00:17, 55.46it/s]
  6%|▌         | 56/1000 [00:00<00:13, 70.80it/s]
  5%|▌         | 52/1000 [00:00<00:16, 59.10it/s]
  5%|▍         | 46/1000 [00:00<00:18, 52.30it/s]
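
To illustrate what fitting a simulated environment means, the sketch below estimates a transition rule from observed transitions by regressing the next state on the current state, action, and sensitive attribute. It is a toy under stated assumptions: the univariate state, the "true" transition rule, and the helper `simulate_next_state` are all hypothetical, and a linear model replaces the neural networks that `SimulatedEnvironment` is configured with above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z = rng.integers(0, 2, size=n)  # binary sensitive attribute
s = rng.normal(size=n)          # current state (univariate for simplicity)
a = rng.integers(0, 2, size=n)  # binary action from a random policy

# Assumed "true" transition rule, used only to generate the toy data.
s_next = 0.8 * s + 0.5 * a - 0.3 * z + 0.1 * rng.normal(size=n)

# Fit a linear transition model: s_next ~ 1 + s + a + z.
features = np.column_stack([np.ones(n), s, a, z])
coef, *_ = np.linalg.lstsq(features, s_next, rcond=None)

def simulate_next_state(s, a, z):
    """Mean next state predicted by the fitted model (hypothetical helper)."""
    return float(coef @ np.array([1.0, s, a, z]))
```

Once fitted, such a model can be queried with sensitive attribute values that differ from the observed ones, which is what makes model-based counterfactual evaluation possible.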


## Value Evaluation

We now estimate the value achieved by the trained policy when interacting with the target environment using fitted Q evaluation (FQE), which is provided by `evaluate_reward_through_fqe()`. We use the testing set for evaluation.

In [13]:
value = evaluate_reward_through_fqe(zs=zs_test, 
                                    states=states_test, 
                                    actions=actions_test, 
                                    rewards=rewards_test, 
                                    policy=agent, 
                                    model_type='nn')
value

100%|██████████| 200/200 [02:03<00:00,  1.62it/s]


-48.450016
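
For intuition about FQE, the following is a minimal tabular sketch. It repeatedly regresses the Q-function of a fixed policy onto the target "reward plus discounted Q-value of the policy's action at the next state". The function `tabular_fqe`, the toy transitions, and the discount factor `gamma=0.5` are assumptions made for illustration. The CFRL function works with continuous states and a neural network (`model_type='nn'`), and its internals may differ.

```python
import numpy as np

def tabular_fqe(transitions, policy, n_states, n_actions, gamma, n_iter=100):
    """Estimate Q^pi from (s, a, r, s_next) tuples for a deterministic policy."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        target_sum = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next in transitions:
            # Bootstrapped target uses the action the evaluated policy would take.
            target_sum[s, a] += r + gamma * q[s_next, policy[s_next]]
            counts[s, a] += 1
        # "Regression" in the tabular case is the average target per (s, a) pair;
        # pairs that never appear in the data keep their previous value.
        q = np.where(counts > 0, target_sum / np.maximum(counts, 1), q)
    return q

# Toy environment: taking action a moves to state a and yields reward a.
transitions = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
policy = [1, 1]  # the policy under evaluation always takes action 1
q = tabular_fqe(transitions, policy, n_states=2, n_actions=2, gamma=0.5)
value = np.mean([q[s0, policy[s0]] for s0 in (0, 1)])  # average over initial states
```

In this toy problem the policy collects a reward of 1 at every step, so its discounted value is $1/(1-0.5)=2$, which the iteration recovers.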

## Counterfactual Fairness Evaluation

We now estimate the counterfactual fairness achieved by the policy when interacting with the target environment. To do so, we use `evaluate_fairness_through_model()`. This function first estimates the counterfactual trajectories of each individual in the data under a set of valid sensitive attribute values. It then takes actions based on the counterfactual states using the policy under evaluation. Finally, it calculates and returns a counterfactual fairness metric (CF metric) following the formula

$\max_{z', z \in eval(Z)} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{I} \left( A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right) \neq A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right) \right),$

where $eval(Z)$ is the set of sensitive attribute values passed in by `z_eval_levels`, $A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z'$, and $A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z$. The CF metric is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete unfairness.
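
As a worked illustration of the formula, suppose the counterfactual actions have already been computed and stored in an array indexed by sensitive attribute level, individual, and time. The sketch below then evaluates the metric directly. The function `cf_metric` and the array layout are hypothetical and only mirror the formula. They are not the implementation of `evaluate_fairness_through_model()`.

```python
import numpy as np
from itertools import combinations

def cf_metric(cf_actions):
    """cf_actions[k, i, t]: action of individual i at time t under the k-th level of Z."""
    worst = 0.0
    # Pairs with z == z' always give 0, so only distinct pairs need checking.
    for a, b in combinations(range(cf_actions.shape[0]), 2):
        # Fraction of (i, t) entries where the two counterfactual actions differ.
        disagreement = float(np.mean(cf_actions[a] != cf_actions[b]))
        worst = max(worst, disagreement)
    return worst

# Toy example with 3 sensitive attribute levels, N=2 individuals, and T=2 steps.
cf_actions = np.array([
    [[0, 1], [1, 1]],  # actions under level 0
    [[0, 1], [1, 1]],  # actions under level 1 (identical to level 0)
    [[1, 1], [1, 0]],  # actions under level 2 (differs in 2 of 4 entries)
])
metric = cf_metric(cf_actions)  # 0.5
```

Levels 0 and 1 agree everywhere, while level 2 disagrees with each of them in 2 of the $NT=4$ entries, so the maximum over pairs is $2/4=0.5$.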

In [14]:
cf_metric = evaluate_fairness_through_model(env=env, 
                                            zs=zs_test, 
                                            states=states_test, 
                                            actions=actions_test, 
                                            policy=agent)
cf_metric

0.0009090909090909091

We can see that our policy achieves a low CF metric value, which indicates it is close to being perfectly counterfactually fair. 

## Bonus: Assessing a Fairness-through-unawareness Policy

Fairness-through-unawareness proposes to ensure fairness by excluding the sensitive attribute from the state variable. Nevertheless, it has been argued that this method can still be unfair because the agent might learn the bias indirectly from the states and rewards, which are often biased. In this section, we train a policy following fairness-through-unawareness using the same training trajectory data and estimate its value and CF metric.
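
The argument above can be illustrated with a small simulation. In the sketch below, the policy never sees the sensitive attribute, yet its actions change when the sensitive attribute is counterfactually flipped, because the state itself depends on the sensitive attribute. The data-generating rule and the threshold policy are assumptions chosen for illustration and are unrelated to the environment used in this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Individual-level noise that is held fixed across counterfactual worlds.
noise = rng.normal(scale=0.5, size=n)

# Assumed biased dynamics: the state is shifted upward by the sensitive attribute.
state_z0 = 0 + noise  # state each individual would have under Z = 0
state_z1 = 1 + noise  # state the same individual would have under Z = 1

def unaware_policy(state):
    """A policy that looks only at the state and never at Z."""
    return (state > 0.5).astype(int)

action_z0 = unaware_policy(state_z0)
action_z1 = unaware_policy(state_z1)

# Share of individuals whose action changes when only Z is flipped.
disagreement = np.mean(action_z0 != action_z1)
```

Here the action flips for every individual whose noise term lies between -0.5 and 0.5, which is a majority of individuals, even though the policy takes no sensitive attribute as input.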

We begin by training a fairness-through-unawareness policy. In the code below, `agent_unaware` only uses `states_train`, `actions_train`, and `rewards_train` during training, which enforces fairness-through-unawareness.

In [15]:
agent_unaware = FQI(num_actions=2, model_type='nn', preprocessor=None)
agent_unaware.train(zs=zs_train, 
                    xs=states_train, 
                    actions=actions_train, 
                    rewards=rewards_train, 
                    max_iter=100, 
                    preprocess=False)

100%|██████████| 100/100 [00:50<00:00,  1.97it/s]


We now estimate the value of the fairness-through-unawareness policy.

In [16]:
value_unaware = evaluate_reward_through_fqe(zs=zs_test, 
                                            states=states_test, 
                                            actions=actions_test, 
                                            rewards=rewards_test, 
                                            policy=agent_unaware, 
                                            model_type='nn')
value_unaware

100%|██████████| 200/200 [01:15<00:00,  2.66it/s]


-58.894543

Finally, we estimate the CF metric of the fairness-through-unawareness policy.

In [17]:
cf_metric_unaware = evaluate_fairness_through_model(env=env, 
                                                    zs=zs_test, 
                                                    states=states_test, 
                                                    actions=actions_test, 
                                                    policy=agent_unaware)
cf_metric_unaware

0.5527272727272727

We can see that the fairness-through-unawareness policy is much less fair than the policy learned using the preprocessed trajectory: its CF metric is about 0.55, compared with about 0.0009. This suggests that the preprocessing method effectively reduced the bias in the training trajectory.