# Assessing Policies Using Real Data

In this workflow, CFRL takes in an offline trajectory and preprocesses it using `SequentialPreprocessor`. The preprocessed trajectory is then passed into `FQI` to train a counterfactually fair policy, which is assessed using `evaluate_reward_through_fqe()` and `evaluate_fairness_through_model()` based on a `SimulatedEnvironment` that mimics the transition rules of the true environment underlying the training trajectory. The final output of the workflow is the policy trained on the preprocessed data, together with its estimated value and counterfactual fairness metric. This workflow is appropriate when the user wants to know the value and counterfactual fairness that the trained policy achieves when interacting with the true underlying environment.

We begin by importing the libraries needed for this demonstration.

In [15]:
# Need this temporarily to import CFRL before it is officially published to PyPI
import sys
sys.path.append("E:/learning/university/MiSIL/CFRL Python Package/CFRL")

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from CFRL.reader import read_trajectory_from_dataframe, convert_trajectory_to_dataframe
from CFRL.preprocessor import SequentialPreprocessor
from CFRL.agents import FQI
from CFRL.environment import SimulatedEnvironment
from CFRL.evaluation import evaluate_reward_through_fqe, evaluate_fairness_through_model
np.random.seed(1) # ensure reproducibility

## Data Loading

In this demonstration, we use an offline trajectory generated from a `SyntheticEnvironment` using some pre-specified transition rules. Although the trajectory is actually synthesized, we treat it as if it came from some unknown environment for pedagogical convenience.

The trajectory contains 500 individuals (i.e. $N=500$) and 10 transitions (i.e. $T=10$). The actions are binary (0 or 1) and were sampled using a random policy that selects 0 or 1 with equal probability. The trajectory is stored in a tabular format in a `.csv` file. The sensitive attribute is bivariate, stored in columns `z1` and `z2`; its valid values are $[0, 0]$, $[1, 0]$, $[0, 1]$, and $[1, 1]$. The state variable is trivariate, stored in columns `state1`, `state2`, and `state3`. The actions are stored in the column `action` and the rewards in the column `reward`. The tabular data also includes an extra irrelevant column `timestamp`.

We can load and view the tabular data.

In [44]:
trajectory = pd.read_csv('../data/sample_data_large_multi.csv')
trajectory

Unnamed: 0.1,Unnamed: 0,ID,timestamp,z1,z2,action,reward,state1,state2,state3
0,0,1.0,1.0,0.0,1.0,,,3.124345,0.888244,0.971828
1,1,1.0,2.0,0.0,1.0,1.0,7.380339,3.078124,3.695250,3.129721
2,2,1.0,3.0,0.0,1.0,0.0,12.299111,3.700923,3.088366,3.605864
3,3,1.0,4.0,0.0,1.0,0.0,10.933709,3.938980,4.468625,4.108137
4,4,1.0,5.0,0.0,1.0,1.0,14.626809,4.944344,4.277053,4.290724
...,...,...,...,...,...,...,...,...,...,...
5495,5495,500.0,7.0,1.0,0.0,0.0,16.140130,6.236726,3.673618,6.401583
5496,5496,500.0,8.0,1.0,0.0,0.0,17.709536,5.232273,5.330467,6.966696
5497,5497,500.0,9.0,1.0,0.0,0.0,17.863392,6.328855,6.008048,6.397695
5498,5498,500.0,10.0,1.0,0.0,0.0,18.654867,5.213347,5.948473,6.023897


We now read the trajectory from the tabular format into Trajectory Arrays.

In [45]:
zs, states, actions, rewards, ids = read_trajectory_from_dataframe(
                                                data=trajectory, 
                                                z_labels=['z1', 'z2'], 
                                                state_labels=['state1', 'state2', 'state3'], 
                                                action_label='action', 
                                                reward_label='reward', 
                                                id_label='ID', 
                                                T=10
                                                )
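
To make the conversion concrete, here is a minimal pandas/NumPy sketch of reshaping a long-format table into per-individual arrays. It is not CFRL's implementation: the helper name `to_arrays` is hypothetical, and the output shapes (`zs` as `(N, zdim)`, `states` as `(N, T+1, xdim)`, `actions` and `rewards` as `(N, T)`) are assumptions based on the displayed table, in which each individual's first row carries an initial state but no action or reward.

```python
import numpy as np
import pandas as pd

def to_arrays(df, z_labels, state_labels, action_label, reward_label, id_label, T):
    """Hypothetical helper: long-format table -> per-individual arrays."""
    zs, states, actions, rewards, ids = [], [], [], [], []
    # sort=False keeps individuals in their order of first appearance
    for ident, group in df.groupby(id_label, sort=False):
        assert len(group) == T + 1, "expected T+1 rows per individual"
        ids.append(ident)
        zs.append(group[z_labels].iloc[0].to_numpy())             # z is constant per individual
        states.append(group[state_labels].to_numpy())             # T+1 states
        actions.append(group[action_label].iloc[1:].to_numpy())   # skip the empty first row
        rewards.append(group[reward_label].iloc[1:].to_numpy())   # skip the empty first row
    return (np.array(zs), np.array(states), np.array(actions),
            np.array(rewards), np.array(ids))

# Toy table with N=2 individuals and T=2 transitions (made-up numbers).
toy = pd.DataFrame({
    "ID":     [1, 1, 1, 2, 2, 2],
    "z1":     [0, 0, 0, 1, 1, 1],
    "state1": [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    "action": [np.nan, 1, 0, np.nan, 0, 1],
    "reward": [np.nan, 5.0, 6.0, np.nan, 7.0, 8.0],
})
toy_zs, toy_states, toy_actions, toy_rewards, toy_ids = to_arrays(
    toy, ["z1"], ["state1"], "action", "reward", "ID", T=2)
print(toy_zs.shape, toy_states.shape, toy_actions.shape, toy_rewards.shape)
```

Under these assumptions, every array is indexed by individual along its first axis, which is what allows the later steps to split and process the data individual by individual.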

In [None]:
'''zs, states, actions, rewards, ids = read_trajectory_from_dataframe(
                                                data=trajectory, 
                                                z_labels=['z1'], 
                                                state_labels=['state1'], 
                                                action_label='action', 
                                                reward_label='reward', 
                                                id_label='ID', 
                                                T=10
                                                )'''

## Train-test Split

We split the trajectory data into a training set (80%) and a testing set (20%). The training set is used to train the policy, while the testing set is used to evaluate the value and counterfactual fairness metric achieved by the policy.

In [46]:
(
    zs_train, zs_test, 
    states_train, states_test, 
    actions_train, actions_test, 
    rewards_train, rewards_test
) = train_test_split(zs, states, actions, rewards, test_size=0.2)
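
When several arrays are passed to `train_test_split`, they are all split with the same shuffled indices along the first axis, so each individual's sensitive attribute, states, actions, and rewards stay together. The NumPy-only sketch below mimics that behaviour on made-up toy arrays; the variable names and array shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 10, 3

# Toy arrays indexed by individual along the first axis.
toy_zs = rng.integers(0, 2, size=(N, 2))
toy_states = rng.normal(size=(N, T + 1, 3))
toy_actions = rng.integers(0, 2, size=(N, T))

# Shuffle the individual indices once and reuse them for every array.
perm = rng.permutation(N)
n_test = int(np.ceil(0.2 * N))
test_idx, train_idx = perm[:n_test], perm[n_test:]

toy_zs_train, toy_zs_test = toy_zs[train_idx], toy_zs[test_idx]
toy_states_train, toy_states_test = toy_states[train_idx], toy_states[test_idx]
toy_actions_train, toy_actions_test = toy_actions[train_idx], toy_actions[test_idx]

print(toy_states_train.shape, toy_states_test.shape)
```

Because the split is made over individuals rather than over single transitions, no individual contributes data to both the training and the testing set.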

## Preprocessor Training & Trajectory Preprocessing

We now train the preprocessor and preprocess the trajectory. Note that if we train the preprocessor using only a subset of the data and preprocess the remaining subset of the data, then the resulting preprocessed trajectory might be too small to be useful. Fortunately, we can directly preprocess all individuals using the `train_preprocessor()` function when we have a relatively large number of `cross_folds`.

When `cross_folds=K` where `K` is greater than 1, `train_preprocessor()` will internally divide the training data into `K` folds. For each $i=1,\dots,K$, it trains a model based on all the folds other than the $i$-th one, which is then used to preprocess data in the $i$-th fold. This results in `K` folds of preprocessed data, each of which is processed using a model that is trained on the other folds. These `K` folds of data are then combined and returned by `train_preprocessor()`. This method allows us to preprocess all individuals in the trajectory while reducing overfitting.
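
The fold-wise procedure can be illustrated with a generic cross-fitting sketch. This is not the internals of `train_preprocessor()`: a plain least-squares regression stands in for the preprocessing model, and the function name `cross_fit` and the toy data are assumptions made for illustration.

```python
import numpy as np

def cross_fit(X, y, K, seed=0):
    """Fit on K-1 folds, predict the held-out fold, and stitch the predictions together."""
    n = len(X)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # K disjoint index sets
    out = np.empty(n)
    Xb = np.column_stack([np.ones(n), X])           # add an intercept column
    for i in range(K):
        held_out = folds[i]
        rest = np.concatenate([folds[j] for j in range(K) if j != i])
        # Train only on the other folds ...
        coef, *_ = np.linalg.lstsq(Xb[rest], y[rest], rcond=None)
        # ... and apply the fitted model to the held-out fold.
        out[held_out] = Xb[held_out] @ coef
    return out, folds

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] - X_toy[:, 1]       # noise-free toy target
processed, folds = cross_fit(X_toy, y_toy, K=5)
print(processed.shape)
```

Every observation is processed exactly once, and always by a model that never saw it during fitting.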

To use this functionality, we first initialize a `SequentialPreprocessor` with `cross_folds` greater than 1. We use `cross_folds=5` here.

In [49]:
sp = SequentialPreprocessor(z_space=[[0, 0], [0, 1], [1, 0], [1, 1]], 
                            num_actions=2, 
                            cross_folds=5, 
                            mode='single', 
                            reg_model='nn')

In [None]:
'''sp = SequentialPreprocessor(z_space=[[0], [1]], 
                            num_actions=2, 
                            cross_folds=5, 
                            mode='single', 
                            reg_model='nn')'''

We now simultaneously train the preprocessor and preprocess all individuals in the trajectory using the procedure described above.

In [50]:
states_tilde, rewards_tilde = sp.train_preprocessor(zs=zs_train, 
                                                    xs=states_train, 
                                                    actions=actions_train, 
                                                    rewards=rewards_train)

100%|██████████| 1000/1000 [00:30<00:00, 32.90it/s]
100%|██████████| 1000/1000 [00:31<00:00, 31.91it/s]
100%|██████████| 1000/1000 [00:31<00:00, 32.24it/s]
100%|██████████| 1000/1000 [00:32<00:00, 30.96it/s]
100%|██████████| 1000/1000 [00:30<00:00, 32.46it/s]


## Policy Learning

Now we train a policy using `FQI` and the preprocessed data. Note that although we pass `sp` into `agent` as an internal preprocessor, we set `preprocess=False` during training because the training data `states_tilde` and `rewards_tilde` are already preprocessed.

In [51]:
agent = FQI(num_actions=2, model_type='nn', preprocessor=sp)
agent.train(zs=zs_train, 
            xs=states_tilde, 
            actions=actions_train, 
            rewards=rewards_tilde, 
            max_iter=100, 
            preprocess=False)

100%|██████████| 100/100 [01:04<00:00,  1.55it/s]
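
For readers unfamiliar with fitted Q iteration, the following is a minimal tabular sketch of the general idea, not CFRL's `FQI` class (which is trained with `model_type='nn'` above). A lookup table replaces the regression model, and the toy environment, the function name `tabular_fqi`, and the discount factor are assumptions made for illustration.

```python
import numpy as np

def tabular_fqi(s, a, r, s_next, n_states, n_actions, gamma, n_iter=100):
    """Fitted Q iteration where the 'regression' is a per-cell average."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        # Bellman optimality target built from the current Q estimate.
        target = r + gamma * Q[s_next].max(axis=1)
        new_Q = np.zeros_like(Q)
        for si in range(n_states):
            for ai in range(n_actions):
                mask = (s == si) & (a == ai)
                if mask.any():
                    new_Q[si, ai] = target[mask].mean()
        Q = new_Q
    return Q

# Toy offline data: taking action a moves to state a and yields reward a.
toy_s = np.array([0, 0, 1, 1])
toy_a = np.array([0, 1, 0, 1])
toy_r = toy_a.astype(float)
toy_s_next = toy_a.copy()

Q_opt = tabular_fqi(toy_s, toy_a, toy_r, toy_s_next, n_states=2, n_actions=2, gamma=0.5)
greedy_policy = Q_opt.argmax(axis=1)   # the learned policy acts greedily with respect to Q
print(Q_opt, greedy_policy)
```

In this toy problem the greedy policy always picks action 1, since that action is the only one that earns a reward.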


## `SimulatedEnvironment` Training

Before moving on to the evaluation stage, there is one more thing to be done: we need to train a `SimulatedEnvironment` that mimics the transition rules of the true environment that generated the trajectory, which will be used by the evaluation functions. To do so, we initialize a `SimulatedEnvironment` and train it on the whole trajectory data (i.e. training set and testing set combined).

In [52]:
env = SimulatedEnvironment(num_actions=2, 
                           state_model_type='nn', 
                           reward_model_type='nn')
env.fit(zs=zs, states=states, actions=actions, rewards=rewards)
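
As a rough illustration of what mimicking the transition rules means, the sketch below fits a simple model of the next state from the current state, action, and sensitive attribute. It is not how `SimulatedEnvironment` works internally: a linear least-squares model replaces the neural network, and the ground-truth rule, the variable names, and the helper `simulate_step` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.integers(0, 2, size=n).astype(float)   # toy univariate sensitive attribute
s = rng.normal(size=n)                         # toy univariate state
a = rng.integers(0, 2, size=n).astype(float)   # binary action

# Hypothetical ground-truth transition rule (unknown to the learner in practice).
s_next = 0.5 + 0.8 * s + 1.0 * a - 0.6 * z + 0.01 * rng.normal(size=n)

# Fit the transition model from observed (state, action, z, next state) tuples.
features = np.column_stack([np.ones(n), s, a, z])
coef, *_ = np.linalg.lstsq(features, s_next, rcond=None)

def simulate_step(state, action, z_value):
    """Predict the next state with the fitted model."""
    return float(coef @ np.array([1.0, state, action, z_value]))

print(coef)
print(simulate_step(0.0, 1.0, 0.0))
```

Once fitted, such a model can generate transitions for any combination of state, action, and sensitive attribute value, including combinations never observed together in the data.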



## Value Evaluation

We now estimate the value achieved by the trained policy when interacting with the target environment using fitted Q evaluation (FQE), which is provided by `evaluate_reward_through_fqe()`. We use the testing set for evaluation.

In [57]:
value = evaluate_reward_through_fqe(zs=zs_test, 
                                    states=states_test, 
                                    actions=actions_test, 
                                    rewards=rewards_test, 
                                    policy=agent, 
                                    model_type='nn')
value

100%|██████████| 200/200 [01:55<00:00,  1.73it/s]


178.94748
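
The idea behind fitted Q evaluation can be sketched in tabular form. This is not the implementation of `evaluate_reward_through_fqe()`: a lookup table replaces the neural network, and the toy data, the function name `tabular_fqe`, the discount factor, and the way the Q-function is summarised into one number (averaging over initial states) are assumptions made for illustration.

```python
import numpy as np

def tabular_fqe(s, a, r, s_next, policy, n_states, n_actions, gamma, n_iter=100):
    """Fitted Q evaluation of a fixed policy, with a per-cell average as the regression."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        # The target follows the evaluated policy at the next state (no max over actions).
        target = r + gamma * Q[s_next, policy[s_next]]
        new_Q = np.zeros_like(Q)
        for si in range(n_states):
            for ai in range(n_actions):
                mask = (s == si) & (a == ai)
                if mask.any():
                    new_Q[si, ai] = target[mask].mean()
        Q = new_Q
    return Q

# Toy offline data: taking action a moves to state a and yields reward a.
toy_s = np.array([0, 0, 1, 1])
toy_a = np.array([0, 1, 0, 1])
toy_r = toy_a.astype(float)
toy_s_next = toy_a.copy()

toy_policy = np.array([1, 0])          # action 1 in state 0, action 0 in state 1
Q_pi = tabular_fqe(toy_s, toy_a, toy_r, toy_s_next, toy_policy,
                   n_states=2, n_actions=2, gamma=0.5)

initial_states = np.array([0, 1])
toy_value = Q_pi[initial_states, toy_policy[initial_states]].mean()
print(toy_value)
```

The only difference from fitted Q iteration is the target: it uses the action chosen by the policy being evaluated rather than the maximising action, so the result estimates that policy's value, not the optimal value.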

## Counterfactual Fairness Evaluation

We now estimate the counterfactual fairness achieved by the policy when interacting with the target environment. To do so, we use `evaluate_fairness_through_model()`. This function first estimates the counterfactual trajectories of each individual in the data under a set of valid sensitive attribute values. Then it calculates a counterfactual fairness metric (CF metric) following the formula

$\max_{z', z \in eval(Z)} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{I} \left( A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right) \neq A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right) \right),$

where $eval(Z)$ is the set of sensitive attribute values passed in by `z_eval_levels`, $A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z'$, and $A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z$. The CF metric is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete unfairness.
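
Given the counterfactual actions, the formula itself is straightforward to compute. The sketch below assumes the actions are already available as an array `cf_actions[l, i, t]` (individual `i`, time `t`, counterfactual world `Z <- z_l`); the function name `cf_metric` and the toy numbers are illustrative, and the estimation of the counterfactual trajectories, which is the hard part, is not shown.

```python
import numpy as np
from itertools import combinations

def cf_metric(cf_actions):
    """Largest average action disagreement over all pairs of sensitive attribute levels."""
    cf_actions = np.asarray(cf_actions)
    worst = 0.0
    # Comparing a level with itself always gives 0, so distinct pairs suffice.
    for l1, l2 in combinations(range(cf_actions.shape[0]), 2):
        disagreement = np.mean(cf_actions[l1] != cf_actions[l2])  # (1/NT) * sum of indicators
        worst = max(worst, disagreement)
    return float(worst)

# Toy example: 3 sensitive attribute levels, N=2 individuals, T=2 time steps.
toy_cf_actions = np.array([
    [[0, 1], [1, 1]],   # actions under Z <- z_0
    [[0, 1], [1, 0]],   # actions under Z <- z_1
    [[1, 1], [1, 0]],   # actions under Z <- z_2
])
print(cf_metric(toy_cf_actions))   # 0.5
```

Here the worst pair is $z_0$ versus $z_2$, whose actions differ in 2 of the 4 individual-time cells.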

In [53]:
cf_metric = evaluate_fairness_through_model(env=env, 
                                            zs=zs_test, 
                                            states=states_test, 
                                            actions=actions_test, 
                                            policy=agent)
cf_metric

0

We can see that our policy achieves a CF metric of 0 on the testing set, the lowest possible value, which indicates that it is counterfactually fair according to the fitted environment model.

## Bonus: Evaluate a Fairness-through-unawareness Policy

Fairness-through-unawareness proposes to ensure fairness by excluding the sensitive attribute from the state variable. Nevertheless, it has been argued that this method can still be unfair because the agent might learn the bias from the states and rewards, which are often biased. In this section, we train a policy following fairness-through-unawareness on the same trajectory data and estimate its value and CF metric.
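
A small simulation illustrates why excluding the sensitive attribute is not enough. In the sketch below, the structural equation, the threshold policy, and all names are hypothetical: the policy never sees `z`, yet its actions change across counterfactual worlds because the state it does see is shifted by `z`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
u = rng.normal(size=n)   # individual-specific noise, shared across counterfactual worlds

def state_given_z(z_value):
    # Hypothetical structural equation: the sensitive attribute shifts the state.
    return u + 1.5 * z_value

def unaware_policy(state):
    # The policy uses only the state, never the sensitive attribute.
    return (state > 0.75).astype(int)

actions_z0 = unaware_policy(state_given_z(0))
actions_z1 = unaware_policy(state_given_z(1))

# Fraction of individuals whose action would change if their z were different.
disagreement = float(np.mean(actions_z0 != actions_z1))
print(disagreement)
```

A counterfactually fair policy would take the same action for the same individual in both worlds, which gives a disagreement of 0.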

We begin by training a fairness-through-unawareness policy. In the code below, `agent_unaware` only uses `states_train`, `actions_train`, and `rewards_train` during training, which enforces fairness-through-unawareness.

In [55]:
agent_unaware = FQI(num_actions=2, model_type='nn', preprocessor=None)
agent_unaware.train(zs=zs_train, 
                    xs=states_train, 
                    actions=actions_train, 
                    rewards=rewards_train, 
                    max_iter=100, 
                    preprocess=False)

100%|██████████| 100/100 [00:52<00:00,  1.92it/s]


We now estimate the value of the fairness-through-unawareness policy.

In [58]:
value_unaware = evaluate_reward_through_fqe(zs=zs_test, 
                                            states=states_test, 
                                            actions=actions_test, 
                                            rewards=rewards_test, 
                                            policy=agent_unaware, 
                                            model_type='nn')
value_unaware

100%|██████████| 200/200 [01:17<00:00,  2.57it/s]


169.3682

Finally, we estimate the CF metric of the fairness-through-unawareness policy.

In [56]:
cf_metric_unaware = evaluate_fairness_through_model(env=env, 
                                                    zs=zs_test, 
                                                    states=states_test, 
                                                    actions=actions_test, 
                                                    policy=agent_unaware)
cf_metric_unaware

0.22818181818181815

We can see that the fairness-through-unawareness policy is much less fair than the policy learned using the preprocessed trajectory (a CF metric of about 0.228 versus 0). In this run, its estimated value is also lower (169.37 versus 178.95). This suggests that the preprocessing method effectively mitigated the bias in the training trajectory.