# Assessing Policies Using Real Data

In this workflow, CFRL takes in an offline trajectory and preprocesses it using `SequentialPreprocessor`. The preprocessed trajectory is then passed into `FQI` to train a counterfactually fair policy, which is assessed using `evaluate_reward_through_fqe()` and `evaluate_fairness_through_model()` based on a `SimulatedEnvironment` that mimics the transition rules of the true environment underlying the training trajectory. The final output of the workflow is the policy trained on the preprocessed data, together with its estimated value and counterfactual fairness metric. This workflow is appropriate when the user wants to know the value and counterfactual fairness achieved by the trained policy when interacting with the true underlying environment.

We begin by importing the libraries needed for this demonstration.

In [1]:
# Need this temporarily to import CFRL before it is officially published to PyPI
import sys
sys.path.append("E:/learning/university/MiSIL/CFRL Python Package/CFRL")

In [28]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from cfrl.reader import read_trajectory_from_dataframe
from cfrl.preprocessor import SequentialPreprocessor
from cfrl.agents import FQI
from cfrl.environment import SimulatedEnvironment
from cfrl.evaluation import evaluate_reward_through_fqe, evaluate_fairness_through_model
np.random.seed(10) # ensure reproducibility
torch.manual_seed(10) # ensure reproducibility

<torch._C.Generator at 0x1f4e501a290>

## Data Loading

In this demonstration, we use an offline trajectory generated from a `SyntheticEnvironment` using some pre-specified transition rules. Although it is actually synthesized, we treat it as if it is from some unknown environment for pedagogical convenience in this demonstration.

The trajectory contains 500 individuals (i.e. $N=500$) and 10 transitions (i.e. $T=10$). The actions are binary ($0$ or $1$) and were sampled using a random policy that selects $0$ or $1$ with equal probability. The trajectory is stored in a tabular format in a `.csv` file. The sensitive attribute variable is univariate and is stored in the column `z1`; its valid values are $0$ and $1$. The state variable is also univariate and is stored in the column `state1`. The actions are stored in the column `action` and the rewards in the column `reward`. The tabular data also includes an extra, irrelevant column `timestamp`.

We can load and view the tabular data.

In [3]:
trajectory = pd.read_csv('../data/sample_data_large_uni.csv')
trajectory

Unnamed: 0.1,Unnamed: 0,ID,timestamp,z1,action,reward,state1
0,0,1.0,1.0,0.0,,,1.324345
1,1,1.0,2.0,0.0,1.0,1.524345,-0.813722
2,2,1.0,3.0,0.0,1.0,-0.613722,-0.526683
3,3,1.0,4.0,0.0,1.0,-0.326683,-0.464447
4,4,1.0,5.0,0.0,1.0,-0.264447,-2.075518
...,...,...,...,...,...,...,...
5495,5495,500.0,7.0,1.0,1.0,-2.468460,-0.941954
5496,5496,500.0,8.0,1.0,1.0,-1.430345,-2.536595
5497,5497,500.0,9.0,1.0,0.0,-1.068298,-0.946557
5498,5498,500.0,10.0,1.0,0.0,-0.273278,-0.709017


We now read the trajectory from the tabular format into Trajectory Arrays.

In [None]:
zs, states, actions, rewards, ids = read_trajectory_from_dataframe(
                                                data=trajectory, 
                                                z_labels=['z1'], 
                                                state_labels=['state1'], 
                                                action_label='action', 
                                                reward_label='reward', 
                                                id_label='ID', 
                                                T=10
                                                )
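To make the conversion from the long table to per-individual arrays more concrete, the following is a minimal standard-library sketch of the grouping idea. It is not CFRL's implementation of `read_trajectory_from_dataframe()`: the toy rows, the nested-list layout, and the choice to drop the action and reward of each individual's first row are assumptions made only for illustration.

```python
from collections import defaultdict

# Toy rows in the same long format as the table above:
# (ID, timestamp, z1, action, reward, state1).
# The first row of each individual has no action or reward
# (None here, shown as missing values in the table).
rows = [
    (1, 1, 0, None, None, 0.5),
    (1, 2, 0, 1, 0.7, -0.2),
    (1, 3, 0, 0, -0.1, 0.3),
    (2, 1, 1, None, None, 1.1),
    (2, 2, 1, 0, 1.0, 0.4),
    (2, 3, 1, 1, 0.6, 0.9),
]

# Group the rows by individual ID.
by_id = defaultdict(list)
for row in rows:
    by_id[row[0]].append(row)

ids, zs, states, actions, rewards = [], [], [], [], []
for ident, group in sorted(by_id.items()):
    group.sort(key=lambda r: r[1])             # order by timestamp
    ids.append(ident)
    zs.append([group[0][2]])                   # sensitive attribute, constant per individual
    states.append([[r[5]] for r in group])     # one state per row
    actions.append([r[3] for r in group[1:]])  # skip the initial row (no action yet)
    rewards.append([r[4] for r in group[1:]])  # skip the initial row (no reward yet)
```

In this layout each individual has one more state than actions or rewards, because the first row records only the initial state.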

## Train-test Split

We split the trajectory data into a training set (80%) and a testing set (20%). The training set is used to train the counterfactually fair policy, while the testing set is used to evaluate the value and counterfactual fairness metric achieved by the policy.

In [5]:
(
    zs_train, zs_test, 
    states_train, states_test, 
    actions_train, actions_test, 
    rewards_train, rewards_test
) = train_test_split(zs, states, actions, rewards, test_size=0.2)

## Preprocessor Training & Trajectory Preprocessing

We now train the preprocessor and preprocess the trajectory. As demonstrated in the other workflows, we might want to first train the preprocessor using only a subset of the data, then preprocess the remaining subset of the data, and finally use the preprocessed subset for policy learning. However, when the amount of data is limited, the preprocessed trajectory resulting from the procedure above might be too small to be useful for policy learning. We essentially want to preprocess as many individuals as possible. Fortunately, we can directly preprocess all individuals using the `train_preprocessor()` function when we set `cross_folds` to a relatively large number.

When `cross_folds=K` where `K` is greater than 1, `train_preprocessor()` will internally divide the training data into `K` folds. For each $i=1,\dots,K$, it trains a transition dynamics model based on all the folds other than the $i$-th one, and this model is then used to preprocess data in the $i$-th fold. This results in `K` folds of preprocessed data, each of which is processed using a model that is trained on the other folds. These `K` folds of preprocessed data are then combined and returned by `train_preprocessor()`. This method allows us to preprocess all individuals in the trajectory while reducing overfitting.
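The cross-fold procedure described above can be sketched in plain Python. This is only an illustration of the fold logic, not CFRL's internal code: the helper names `cross_fold_indices` and `cross_fold_apply` are hypothetical, and the "model" is a toy stand-in (a training mean) rather than a transition dynamics model.

```python
import random

def cross_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_fold_apply(data, k, fit, transform, seed=0):
    """For each fold, fit on the other folds and transform the held-out fold."""
    folds = cross_fold_indices(len(data), k, seed)
    out = [None] * len(data)
    for i, held_out in enumerate(folds):
        # Training data: every fold except the i-th one.
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit(train)
        # The held-out fold is processed by a model that never saw it.
        for j in held_out:
            out[j] = transform(model, data[j])
    return out

# Toy stand-ins: the "model" is the training mean, and
# "preprocessing" subtracts that mean from the held-out value.
def fit_mean(train):
    return sum(train) / len(train)

def center(model, value):
    return value - model

data = [float(v) for v in range(10)]
processed = cross_fold_apply(data, k=5, fit=fit_mean, transform=center)
```

Every element of `data` ends up in `processed`, and each one is transformed by a model fitted without it, which is the property that lets all individuals be preprocessed while limiting overfitting.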

To use this functionality, we first initialize a `SequentialPreprocessor` with `cross_folds` greater than 1. We use `cross_folds=5` here.

In [6]:
sp = SequentialPreprocessor(z_space=[[0], [1]], 
                            num_actions=2, 
                            cross_folds=5, 
                            mode='single', 
                            reg_model='nn')

We now simultaneously train the preprocessor and preprocess all individuals in the trajectory using the procedure described above.

In [7]:
states_tilde, rewards_tilde = sp.train_preprocessor(zs=zs_train, 
                                                    xs=states_train, 
                                                    actions=actions_train, 
                                                    rewards=rewards_train)

100%|██████████| 1000/1000 [00:59<00:00, 16.89it/s]
100%|██████████| 1000/1000 [00:34<00:00, 28.78it/s]
100%|██████████| 1000/1000 [00:38<00:00, 26.31it/s]
100%|██████████| 1000/1000 [00:33<00:00, 30.10it/s]
100%|██████████| 1000/1000 [00:32<00:00, 30.53it/s]


## Policy Learning

Now we train a policy using `FQI` and the preprocessed data, with `sp` as the agent's internal preprocessor. Note that the training data `states_tilde` and `rewards_tilde` are already preprocessed. Thus, we set `preprocess=False` during training so that the input trajectory will not be preprocessed again by the internal preprocessor (i.e. `sp`).

In [8]:
agent = FQI(num_actions=2, model_type='nn', preprocessor=sp)
agent.train(zs=zs_train, 
            xs=states_tilde, 
            actions=actions_train, 
            rewards=rewards_tilde, 
            max_iter=100, 
            preprocess=False)

100%|██████████| 100/100 [01:03<00:00,  1.57it/s]


## `SimulatedEnvironment` Training

Before moving on to the evaluation stage, there is one more thing to do: We need to train a `SimulatedEnvironment` that mimics the transition rules of the true environment that generated the training trajectory, which will be used by the evaluation functions to simulate the true data-generating environment. To do so, we initialize a `SimulatedEnvironment` and train it on the whole trajectory data (i.e. training set and testing set combined).

In [9]:
env = SimulatedEnvironment(num_actions=2, 
                           state_model_type='nn', 
                           reward_model_type='nn')
env.fit(zs=zs, states=states, actions=actions, rewards=rewards)

  3%|▎         | 31/1000 [00:01<00:34, 28.47it/s]
  2%|▏         | 16/1000 [00:00<00:36, 26.66it/s]
  3%|▎         | 26/1000 [00:00<00:30, 31.81it/s]
  2%|▎         | 25/1000 [00:00<00:34, 28.31it/s]


## Value Evaluation

We now estimate the value achieved by the trained policy when interacting with the target environment using fitted Q evaluation (FQE), which is provided by `evaluate_reward_through_fqe()`. We use a discount factor of $0.9$ by setting `gamma=0.9`. We use the testing set for evaluation.

In [None]:
value = evaluate_reward_through_fqe(zs=zs_test, 
                                    states=states_test, 
                                    actions=actions_test, 
                                    rewards=rewards_test, 
                                    policy=agent, 
                                    model_type='nn', 
                                    gamma=0.9)
value

100%|██████████| 200/200 [01:59<00:00,  1.68it/s]


7.3576775
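To make the role of `gamma` concrete, here is a small sketch of the discounting arithmetic, assuming the value is the usual expected discounted cumulative reward. It does not reproduce FQE itself; the function name `discounted_return` and the toy reward sequence are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one reward sequence (t starts at 0)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Three rewards of 1 with gamma = 0.9: 1 + 0.9 + 0.81
toy_rewards = [1.0, 1.0, 1.0]
toy_value = discounted_return(toy_rewards, gamma=0.9)
```

With `gamma=0.9`, a constant reward of 1 at every step can contribute at most $1/(1-0.9)=10$ in total, which gives a sense of scale for values on this kind of discounted sum.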

## Counterfactual Fairness Evaluation

We now estimate the counterfactual fairness achieved by the policy when interacting with the target environment. To do so, we use `evaluate_fairness_through_model()`. This function first estimates the counterfactual trajectories of each individual in the data under a set of valid sensitive attribute values. It then takes actions based on the counterfactual states using the policy that is to be evaluated. Finally, it calculates and returns a counterfactual fairness metric (CF metric) following the formula

$\max_{z', z \in eval(Z)} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{I} \left( A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right) \neq A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right) \right),$

where $eval(Z)$ is the set of sensitive attribute values passed in by `z_eval_levels`, $A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z'$, and $A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z$. The CF metric is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete unfairness. We again use the testing set for evaluation.

In [11]:
cf_metric = evaluate_fairness_through_model(env=env, 
                                            zs=zs_test, 
                                            states=states_test, 
                                            actions=actions_test, 
                                            policy=agent)
cf_metric

0.041818181818181824

We can see that our policy achieves a low CF metric value, which indicates it is close to being perfectly counterfactually fair. Indeed, the CF metric should be exactly 0 if we knew the true underlying environment; it is not exactly 0 here because the true underlying environment must be estimated during preprocessing, which can introduce errors.
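The CF metric formula can be checked by hand on a tiny example. The sketch below takes the counterfactual actions as given and only performs the final counting step; it is not the implementation of `evaluate_fairness_through_model()`, and the function name `cf_metric` and the toy action tables are hypothetical.

```python
from itertools import combinations

def cf_metric(actions_by_z):
    """Largest pairwise action-disagreement rate across sensitive-attribute values.

    actions_by_z maps each sensitive-attribute value to an N x T nested list of
    the actions taken in the counterfactual trajectories under that value.
    """
    worst = 0.0
    # The disagreement rate is symmetric and is 0 for z == z',
    # so unordered pairs of distinct values are enough.
    for z, z_prime in combinations(actions_by_z, 2):
        a, b = actions_by_z[z], actions_by_z[z_prime]
        total = sum(len(row) for row in a)  # N * T
        mismatches = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
        worst = max(worst, mismatches / total)
    return worst

# Two individuals, three time steps, two sensitive-attribute values.
toy_actions_by_z = {
    0: [[0, 1, 1], [1, 0, 0]],
    1: [[0, 1, 0], [1, 0, 0]],
}
toy_metric = cf_metric(toy_actions_by_z)  # 1 mismatch out of 6 action pairs
```

Here the two counterfactual action tables differ in one of six entries, so the toy metric is $1/6$. Identical tables give 0, and tables that disagree everywhere give 1.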

## Comparisons: Assessing the Performance of Baseline Policies

We can follow a similar approach to evaluate the performance of a few baselines: random, fairness-through-unawareness, and full. In this section, we briefly discuss the implementation and performance of these baselines. We will use custom preprocessors and agents to implement these baselines, so we first import the `Preprocessor` and `Agent` classes.

In [12]:
from cfrl.preprocessor import Preprocessor
from cfrl.agents import Agent

### Random

As its name suggests, a random baseline is a policy that selects actions randomly. For this baseline, we implement a custom agent called `RandomAgent` that selects actions at random.

In [None]:
class RandomAgent(Agent):
    def __init__(self, num_action_levels: int):
        self.num_action_levels = num_action_levels
        self.__name__ = 'RandomAgent'

    def act(self, 
            z: list | np.ndarray, 
            xt: list | np.ndarray, 
            xtm1: list | np.ndarray | None = None, 
            atm1: list | np.ndarray | None = None, 
            uat: list | np.ndarray | None = None, 
            **kwargs) -> np.ndarray:
        if uat is None:
            N = z.shape[0]
            out = np.zeros(N)
            for i in range(N):
                out[i] = np.random.randint(self.num_action_levels)
            return out
        else:
            action = (uat.flatten() <= 0.5).astype(int)
            return action

We now initialize an instance of `RandomAgent` and estimate the value of the random policy.

In [None]:
agent_random = RandomAgent(num_action_levels=2)
value_random = evaluate_reward_through_fqe(zs=zs_test, 
                                           states=states_test, 
                                           actions=actions_test, 
                                           rewards=rewards_test, 
                                           policy=agent_random, 
                                           model_type='nn', 
                                           gamma=0.9)
value_random

100%|██████████| 200/200 [01:44<00:00,  1.92it/s]


-1.4444195

Finally, we estimate the CF metric of the random policy.

In [34]:
cf_metric_random = evaluate_fairness_through_model(env=env, 
                                                   zs=zs_test, 
                                                   states=states_test, 
                                                   actions=actions_test, 
                                                   policy=agent_random)
cf_metric_random

0

The random policy achieved perfect fairness. This is expected because all the counterfactual trajectories for the same individual should share the same randomness, which means the random policy should select the same action in these counterfactual trajectories.
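The shared-randomness argument can be illustrated in a few lines. This sketch assumes, as in the `uat` branch of `RandomAgent.act()`, that the action is obtained by thresholding a uniform draw at 0.5; the helper name `random_policy_action` and the toy noise values are illustrative.

```python
import random

def random_policy_action(u):
    """Map a uniform draw u in [0, 1) to a binary action by thresholding at 0.5."""
    return int(u <= 0.5)

# One set of exogenous draws, shared by every counterfactual trajectory
# of the same individual.
rng = random.Random(0)
shared_noise = [rng.random() for _ in range(5)]

# The random policy ignores the state, so the counterfactual value of the
# sensitive attribute cannot change the action once the noise is fixed.
actions_under_z0 = [random_policy_action(u) for u in shared_noise]
actions_under_z1 = [random_policy_action(u) for u in shared_noise]
```

Because both action sequences are computed from the same draws, they agree at every step, which corresponds to a CF metric of 0.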

### Fairness-through-unawareness

Fairness-through-unawareness proposes to ensure fairness by excluding the sensitive attribute from the state variable (and thus from the agent's decision-making). Nevertheless, it has been argued that this method can still be unfair because the agent might learn the bias indirectly from the states and rewards, which are often biased. In this section, we train a policy following fairness-through-unawareness using the same training trajectory data and estimate its value and CF metric.
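A toy example shows how a policy that never sees the sensitive attribute can still act differently across counterfactuals. Everything in this sketch is an assumption made for illustration: the structural model `counterfactual_state` (the state shifts by `delta * z`), the threshold policy, and the noise distribution are not taken from the data or from CFRL.

```python
import random

def unaware_policy(x, threshold=0.5):
    """A policy that looks only at the state, never at the sensitive attribute."""
    return int(x > threshold)

def counterfactual_state(u, z, delta=1.0):
    """Toy structural model (an assumption): the state shifts by delta * z."""
    return u + delta * z

# Exogenous noise shared by both counterfactual worlds of each individual.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

actions_z0 = [unaware_policy(counterfactual_state(u, 0)) for u in noise]
actions_z1 = [unaware_policy(counterfactual_state(u, 1)) for u in noise]

# Fraction of individuals whose action changes when only z is changed.
disagreement = sum(a != b for a, b in zip(actions_z0, actions_z1)) / len(noise)
```

In this toy model a substantial fraction of individuals receive a different action under $Z=0$ than under $Z=1$, even though the policy never reads $Z$, because the state itself carries the effect of the sensitive attribute. Setting `delta=0.0` removes that dependence and the disagreement vanishes.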

We begin by training a fairness-through-unawareness policy. As shown in the code below, we directly use the raw training trajectory for policy learning without performing preprocessing. This enforces fairness-through-unawareness because `agent_unaware` only uses `states_train`, `actions_train`, and `rewards_train` during training (i.e. the sensitive attribute is not used).

In [35]:
agent_unaware = FQI(num_actions=2, model_type='nn', preprocessor=None)
agent_unaware.train(zs=zs_train, 
                    xs=states_train, 
                    actions=actions_train, 
                    rewards=rewards_train, 
                    max_iter=100, 
                    preprocess=False)

100%|██████████| 100/100 [00:47<00:00,  2.11it/s]


We now estimate the value of the fairness-through-unawareness policy.

In [None]:
value_unaware = evaluate_reward_through_fqe(zs=zs_test, 
                                            states=states_test, 
                                            actions=actions_test, 
                                            rewards=rewards_test, 
                                            policy=agent_unaware, 
                                            model_type='nn', 
                                            gamma=0.9)
value_unaware

100%|██████████| 200/200 [01:12<00:00,  2.75it/s]


8.588442

Finally, we estimate the CF metric of the fairness-through-unawareness policy.

In [37]:
cf_metric_unaware = evaluate_fairness_through_model(env=env, 
                                                    zs=zs_test, 
                                                    states=states_test, 
                                                    actions=actions_test, 
                                                    policy=agent_unaware)
cf_metric_unaware

0.44636363636363635

We can see that the fairness-through-unawareness policy is much less fair than the policy learned using the preprocessed trajectory. This suggests that the preprocessing method likely reduced the bias in the training trajectory effectively. 

### Full

The full baseline directly uses the sensitive attribute as part of the state variable for policy learning. It should achieve higher value than the other baselines, but the fairness can be compromised in return. Note that the state variable in our original trajectory does not contain the sensitive attribute. Therefore, to enforce the full baseline, we should concatenate the sensitive attribute to the state variable before policy learning or decision-making. This can be done using the following custom preprocessor, which we call `ConcatenatePreprocessor`.

In [38]:
class ConcatenatePreprocessor(Preprocessor):
        def __init__(self) -> None:
            pass

        def preprocess(
                self, 
                z: list | np.ndarray, 
                xt: list | np.ndarray
            ) -> np.ndarray:
            if xt.ndim == 1:
                xt = xt[np.newaxis, :]
                z = z[np.newaxis, :]
                xt_new = np.concatenate([xt, z], axis=1)
                return xt_new.flatten()
            elif xt.ndim == 2:
                xt_new = np.concatenate([xt, z], axis=1)
                return xt_new
            
        def preprocess_single_step(
                self, 
                z: list | np.ndarray, 
                xt: list | np.ndarray, 
                xtm1: list | np.ndarray | None = None, 
                atm1: list | np.ndarray | None = None, 
                rtm1: list | np.ndarray | None = None, 
                verbose: bool = False
            ) -> tuple[np.ndarray, np.ndarray] | np.ndarray:
            z = np.array(z)
            xt = np.array(xt)
            if verbose:
                print("Preprocessing a single step...")

            xt_new = self.preprocess(z, xt)
            if rtm1 is None:
                return xt_new
            else:
                return xt_new, rtm1
            

        def preprocess_multiple_steps(
                self, 
                zs: list | np.ndarray, 
                xs: list | np.ndarray, 
                actions: list | np.ndarray, 
                rewards: list | np.ndarray | None = None, 
                verbose: bool = False
            ) -> tuple[np.ndarray, np.ndarray] | np.ndarray:
            zs = np.array(zs)
            xs = np.array(xs)
            actions = np.array(actions)
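            # keep None as None so that the `rewards is not None` check below works
            rewards = None if rewards is None else np.array(rewards)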
            if verbose:
                print("Preprocessing multiple steps...")
        
            # some convenience variables
            N, T, xdim = xs.shape
            
            # define the returned arrays; the arrays will be filled later
            xs_tilde = np.zeros([N, T, xdim + zs.shape[-1]])
            rs_tilde = np.zeros([N, T - 1])

            # preprocess the initial step
            np.random.seed(0)
            xs_tilde[:, 0, :] = self.preprocess_single_step(zs, xs[:, 0, :])

            # preprocess subsequent steps
            if rewards is not None:
                for t in range (1, T):
                    np.random.seed(t)
                    xs_tilde[:, t, :], rs_tilde[:, t-1] = self.preprocess_single_step(zs, 
                                                                                    xs[:, t, :], 
                                                                                    xs[:, t-1, :], 
                                                                                    actions[:, t-1], 
                                                                                    rewards[:, t-1]
                                                                                    )
                return xs_tilde, rs_tilde                
            else:
                for t in range (1, T):
                    np.random.seed(t)
                    xs_tilde[:, t, :] = self.preprocess_single_step(zs, 
                                                                    xs[:, t, :], 
                                                                    xs[:, t-1, :], 
                                                                    actions[:, t-1]
                                                                    )
                return xs_tilde

We perform policy learning using an FQI agent with `ConcatenatePreprocessor` as its internal preprocessor. In this case, we can directly pass the raw trajectories to the FQI agent, and the internal `ConcatenatePreprocessor` will concatenate the sensitive attribute to the state variable before policy learning or decision-making. 

In [41]:
cp = ConcatenatePreprocessor()
agent_full = FQI(num_actions=2, model_type='nn', preprocessor=cp)
agent_full.train(zs=zs_train, 
                 xs=states_train, 
                 actions=actions_train, 
                 rewards=rewards_train, 
                 max_iter=100)

100%|██████████| 100/100 [00:55<00:00,  1.81it/s]


We now estimate the value of the full policy.

In [None]:
value_full = evaluate_reward_through_fqe(zs=zs_test, 
                                         states=states_test, 
                                         actions=actions_test, 
                                         rewards=rewards_test, 
                                         policy=agent_full, 
                                         model_type='nn', 
                                         gamma=0.9)
value_full

100%|██████████| 200/200 [01:32<00:00,  2.15it/s]


8.606067

Finally, we estimate the CF metric of the full policy.

In [None]:
cf_metric_full = evaluate_fairness_through_model(env=env, 
                                                 zs=zs_test, 
                                                 states=states_test, 
                                                 actions=actions_test, 
                                                 policy=agent_full)
cf_metric_full

0.4072727272727273

The full policy is also much less fair than the policy learned using the preprocessed trajectory. This again suggests that the preprocessing method likely reduced the bias in the training trajectory effectively. 