# Assessing Preprocessors Using Synthetic Data

In this workflow, CFRL first uses `sample_trajectory()` to sample a trajectory from a 
`SyntheticEnvironment` whose transition rules are pre-specified. It then preprocesses the 
sampled trajectory using some custom preprocessor defined by the user. 
After that, the preprocessed trajectory is passed into `FQI` to train a policy, which is then 
assessed using synthetic data via `evaluate_reward_through_simulation()` and 
`evaluate_fairness_through_simulation()`. The final output of the workflow is the policy trained 
on the preprocessed data as well as its estimated value and counterfactual fairness metric. This 
workflow is appropriate when the user wants to examine the impact of some trajectory preprocessing 
method on the value and counterfactual fairness of the trained policy.

We begin by importing the libraries needed for this demonstration.

In [1]:
# Need this temporarily to import CFRL before it is officially published to PyPI
import sys
sys.path.append("E:/learning/university/MiSIL/CFRL Python Package/CFRL")

In [2]:
import pandas as pd
import numpy as np
#import pytorch as torch
from CFRL.preprocessor import Preprocessor
from CFRL.agents import FQI
from CFRL.environment import SyntheticEnvironment, sample_trajectory
from CFRL.evaluation import evaluate_reward_through_simulation
from CFRL.evaluation import evaluate_fairness_through_simulation
from examples.baseline_agents import RandomAgent
np.random.seed(1) # ensure reproducibility
#torch.manual_seed(1)

## Demonstration Setup

Suppose we have a preprocessor `ConcatenatePreprocessor`, which is defined in the code block below. It appends the sensitive attribute to the state variable, which means the policy will directly use the sensitive attribute during decision-making. We want to assess how this preprocessing method performs in terms of the value and counterfactual fairness of the resulting policies.

In [3]:
class ConcatenatePreprocessor(Preprocessor):
        def __init__(self) -> None:
            pass

        def preprocess(
                self, 
                z: list | np.ndarray, 
                xt: list | np.ndarray
            ) -> np.ndarray:
            if xt.ndim == 1:
                xt = xt[np.newaxis, :]
                z = z[np.newaxis, :]
                xt_new = np.concatenate([xt, z], axis=1)
                return xt_new.flatten()
            elif xt.ndim == 2:
                xt_new = np.concatenate([xt, z], axis=1)
                return xt_new
            
        def preprocess_single_step(
                self, 
                z: list | np.ndarray, 
                xt: list | np.ndarray, 
                xtm1: list | np.ndarray | None = None, 
                atm1: list | np.ndarray | None = None, 
                rtm1: list | np.ndarray | None = None, 
                verbose: bool = False
            ) -> tuple[np.ndarray, np.ndarray] | np.ndarray:
            z = np.array(z)
            xt = np.array(xt)
            if verbose:
                print("Preprocessing a single step...")

            xt_new = self.preprocess(z, xt)
            if rtm1 is None:
                return xt_new
            else:
                return xt_new, rtm1
            

        def preprocess_multiple_steps(
                self, 
                zs: list | np.ndarray, 
                xs: list | np.ndarray, 
                actions: list | np.ndarray, 
                rewards: list | np.ndarray | None = None, 
                verbose: bool = False
            ) -> tuple[np.ndarray, np.ndarray] | np.ndarray:
            zs = np.array(zs)
            xs = np.array(xs)
            actions = np.array(actions)
            # keep None as None; np.array(None) is a 0-d object array, not None
            rewards = None if rewards is None else np.array(rewards)
            if verbose:
                print("Preprocessing multiple steps...")
        
            # some convenience variables
            N, T, xdim = xs.shape
            
            # define the returned arrays; the arrays will be filled later
            xs_tilde = np.zeros([N, T, xdim + zs.shape[-1]])
            rs_tilde = np.zeros([N, T - 1])

            # preprocess the initial step
            np.random.seed(0)
            xs_tilde[:, 0, :] = self.preprocess_single_step(zs, xs[:, 0, :])

            # preprocess subsequent steps
            if rewards is not None:
                for t in range (1, T):
                    np.random.seed(t)
                    xs_tilde[:, t, :], rs_tilde[:, t-1] = self.preprocess_single_step(zs, 
                                                                                    xs[:, t, :], 
                                                                                    xs[:, t-1, :], 
                                                                                    actions[:, t-1], 
                                                                                    rewards[:, t-1]
                                                                                    )
                return xs_tilde, rs_tilde                
            else:
                for t in range (1, T):
                    np.random.seed(t)
                    xs_tilde[:, t, :] = self.preprocess_single_step(zs, 
                                                                    xs[:, t, :], 
                                                                    xs[:, t-1, :], 
                                                                    actions[:, t-1]
                                                                    )
                return xs_tilde
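
The core operation inside `preprocess()` is a concatenation along the feature axis. The following is a minimal NumPy-only sketch of that step on toy inputs; the arrays `xt` and `z` here are hypothetical and are not produced by CFRL.

```python
import numpy as np

# Hypothetical toy inputs: 4 individuals, 3 state variables, 2 sensitive attributes.
xt = np.arange(12, dtype=float).reshape(4, 3)
z = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Batch case: append z to each individual's state along the feature axis.
xt_aug = np.concatenate([xt, z], axis=1)
print(xt_aug.shape)  # (4, 5)

# Single-individual case: promote 1-D inputs to one-row matrices, then flatten back.
xt_one = np.concatenate([xt[1][np.newaxis, :], z[1][np.newaxis, :]], axis=1).flatten()
print(xt_one)  # [3. 4. 5. 0. 1.]
```

The augmented state has `xdim + zdim` columns, which is why `xs_tilde` is allocated with last dimension `xdim + zs.shape[-1]`.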

Meanwhile, we also define the environment in which we want to assess `ConcatenatePreprocessor`. The transition rules defined below describe an environment with bivariate sensitive attributes and trivariate states.

In [4]:
# an environment with bivariate zs and trivariate states
# x0_i = 0.5 + zs_1 + zs_2 + ux0_i (assuming z_coef=1)
def f_x0_multi(
        zs: np.ndarray, 
        ux0: np.ndarray, 
        z_coef: int = 1
    ) -> np.ndarray:
    gamma0 = np.array([np.repeat(np.array([0.5]), repeats=3), 
                       np.repeat(np.array(1 * z_coef), repeats=3), 
                       np.repeat(np.array(1 * z_coef), repeats=3)])
    n = zs.shape[0]
    M = np.concatenate(
        [
            np.ones([n, 1]),
            zs,
        ],
        axis=1,
    )
    x0 = M @ gamma0
    x0 = x0.reshape(-1, 3)
    x0 = x0 + ux0
    return x0

# xt_i = -0.5 + (zs_1 - 0.5) + (zs_2 - 0.5) + 0.3 * (xtm1_1 + xtm1_2 + xtm1_3) 
# + 0.2 * (atm1 - 0.5) + 0.3 * (zs_1 - 0.5) * (atm1 - 0.5) 
# + 0.3 * (zs_2 - 0.5) * (atm1 - 0.5) + uxt_i (assuming z_coef=1)
def f_xt_multi(
        zs: np.ndarray, 
        xtm1: np.ndarray, 
        atm1: np.ndarray, 
        uxt: np.ndarray, 
        z_coef: int = 1
    ) -> np.ndarray:
    gamma = np.array([np.repeat(np.array([-0.5]), repeats=3), 
                      np.repeat(np.array(1 * z_coef), repeats=3), 
                      np.repeat(np.array(1 * z_coef), repeats=3), 
                      np.array([0.3, 0.3, 0.3]),
                      np.array([0.3, 0.3, 0.3]), 
                      np.array([0.3, 0.3, 0.3]), 
                      np.array([0.2, 0.2, 0.2]), 
                      np.array([0.3, 0.3, 0.3]), 
                      np.array([0.3, 0.3, 0.3])])
    n = xtm1.shape[0]
    M = np.concatenate(
        [
            np.ones([n, 1]),
            zs - 0.5,
            xtm1,
            atm1.reshape(-1, 1) - 0.5, 
            (zs[:, 0].reshape(-1, 1) - 0.5) * (atm1.reshape(-1, 1) - 0.5), 
            (zs[:, 1].reshape(-1, 1) - 0.5) * (atm1.reshape(-1, 1) - 0.5)
        ],
        axis=1,
    )
    xt = M @ gamma
    xt = xt.reshape(-1, 3)
    xt = xt + uxt
    return xt

# rt_i = -0.5 + zs_1 + zs_2 + xt_1 + xt_2 + xt_3 + at + urt (assuming z_coef=1)
def f_rt_multi(
        zs: np.ndarray, 
        xt: np.ndarray, 
        at: np.ndarray, 
        urtm1: np.ndarray, 
        z_coef: int = 1
    ) -> np.ndarray:
    lmbda = np.array([-0.5, 1, 1, 1, 1 * z_coef, 1 * z_coef, 1, 1])
    n = xt.shape[0]
    at = at.reshape(-1, 1)
    M = np.concatenate(
        [np.ones([n, 1]), xt, zs, at, urtm1], axis=1
    )
    rt = M @ lmbda
    return rt
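
As a sanity check on the initial-state rule, the sketch below reproduces the matrix product used in `f_x0_multi` with the noise set to zero. The three individuals are hypothetical, and the coefficient matrix is written out by hand for `z_coef=1`.

```python
import numpy as np

# Coefficients for x0: rows are (intercept, z_1, z_2), columns are the 3 state variables.
gamma0 = np.array([[0.5, 0.5, 0.5],
                   [1.0, 1.0, 1.0],
                   [1.0, 1.0, 1.0]])

zs = np.array([[0, 0], [1, 0], [1, 1]], dtype=float)  # three hypothetical individuals
ux0 = np.zeros((3, 3))                                 # noise switched off for clarity

# Design matrix [1, z_1, z_2], one row per individual.
M = np.concatenate([np.ones((zs.shape[0], 1)), zs], axis=1)
x0 = M @ gamma0 + ux0
print(x0)
```

Each row of `x0` equals `0.5 + z_1 + z_2` in every component, so the three individuals start at 0.5, 1.5, and 2.5 respectively. This makes the direct dependence of the states on the sensitive attribute easy to see.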

## Training Trajectory Generation

We now generate a training trajectory for the `FQI` agent with 300 individuals (i.e. $N=300$) and 10 transitions (i.e. $T=10$). Note that we do not train the preprocessor here because `ConcatenatePreprocessor` does not require training.

We first initialize a `SyntheticEnvironment` using the transition rules defined in the previous section. This `SyntheticEnvironment` will be used to generate trajectories throughout this demonstration.

In [5]:
env = SyntheticEnvironment(state_dim=3, 
                           z_coef=1, 
                           f_x0=f_x0_multi, 
                           f_xt=f_xt_multi, 
                           f_rt=f_rt_multi)

Before generating the trajectory, we first generate the sensitive attributes of the 300 individuals in the trajectory. We allow the sensitive attributes to take on four different values: $[0, 0]$, $[0, 1]$, $[1, 0]$, and $[1, 1]$. Each individual's sensitive attribute value is sampled uniformly at random from these four valid values.

In [6]:
zs_in = np.zeros((300, 2))
z_levels = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Z_idx = np.random.choice(range(z_levels.shape[0]), size=300, replace=True)
for i in range(300):
    zs_in[i] = z_levels[Z_idx[i]]
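
The loop in the cell above can also be written with NumPy integer-array indexing. The sketch below compares the two forms; it uses a local `np.random.default_rng` generator, which is an assumption made here to avoid touching the global seed, so the sampled indices differ from those in the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)  # local generator (assumption; the notebook uses np.random.seed)
z_levels = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
idx = rng.integers(0, z_levels.shape[0], size=300)

# Loop version, mirroring the cell above.
zs_loop = np.zeros((300, 2))
for i in range(300):
    zs_loop[i] = z_levels[idx[i]]

# Vectorized version: index the level table with the whole index array at once.
zs_vec = z_levels[idx].astype(float)

print(np.array_equal(zs_loop, zs_vec))  # True
```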

We now generate the trajectory using the `sample_trajectory()` function. The actions in the trajectory are taken using a random agent that selects 0 and 1 with equal probabilities.

In [7]:
behavior_agent = RandomAgent(2)
zs, states, actions, rewards = sample_trajectory(env=env, 
                                                 zs=zs_in, 
                                                 state_dim=3, 
                                                 T=10, 
                                                 policy=behavior_agent)

We check the shapes of the trajectory arrays generated by `sample_trajectory()`.

In [8]:
zs.shape, states.shape, actions.shape, rewards.shape

((300, 2), (300, 11, 3), (300, 10), (300, 10))
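
The shapes reported above indicate that `states` holds $T+1$ time points per individual (the initial state plus one state per transition), while `actions` and `rewards` hold one entry per transition. A small consistency check along these lines is sketched below; `check_trajectory_shapes` is a hypothetical helper written for this illustration and is not part of CFRL.

```python
import numpy as np

def check_trajectory_shapes(zs, states, actions, rewards):
    """Return (N, T) if the four arrays are mutually consistent, else raise ValueError."""
    N, T_plus_1, _ = states.shape
    T = T_plus_1 - 1
    if zs.shape[0] != N:
        raise ValueError("zs and states disagree on the number of individuals")
    if actions.shape != (N, T) or rewards.shape != (N, T):
        raise ValueError("actions and rewards must both have shape (N, T)")
    return N, T

# Placeholder arrays with the same shapes as the output above.
zs_demo = np.zeros((300, 2))
states_demo = np.zeros((300, 11, 3))
actions_demo = np.zeros((300, 10))
rewards_demo = np.zeros((300, 10))

print(check_trajectory_shapes(zs_demo, states_demo, actions_demo, rewards_demo))  # (300, 10)
```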

## Policy Learning

We now learn a policy using `FQI` and `ConcatenatePreprocessor`. We first initialize an `FQI` agent that uses `ConcatenatePreprocessor` as its internal preprocessor.

In [9]:
cp = ConcatenatePreprocessor()
agent = FQI(num_actions=2, model_type='nn', preprocessor=cp)

We now perform training. Since we set `preprocess=True` in `train()`, `agent` will use its internal preprocessor (i.e. `cp`) to automatically preprocess the input training trajectory before using the trajectory for policy learning. Therefore, we can directly pass in the unpreprocessed states and rewards.

In [10]:
agent.train(zs=zs, 
            xs=states, 
            actions=actions, 
            rewards=rewards, 
            max_iter=100, 
            preprocess=True)

100%|██████████| 100/100 [00:46<00:00,  2.17it/s]
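
For readers unfamiliar with fitted Q-iteration, the sketch below shows the generic algorithm on trajectory arrays shaped like the ones above. It is not the CFRL implementation: it swaps in one linear least-squares model per action instead of a neural network, and the function names, the discount factor `gamma=0.9`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def add_bias(x):
    """Prepend a column of ones so the linear model has an intercept."""
    return np.hstack([np.ones((x.shape[0], 1)), x])

def fqi_linear(states, actions, rewards, num_actions=2, gamma=0.9, max_iter=50):
    """Generic fitted Q-iteration with one linear model per action (illustrative only)."""
    N, T = actions.shape
    s = states[:, :-1].reshape(N * T, -1)      # state at time t
    s_next = states[:, 1:].reshape(N * T, -1)  # state at time t + 1
    a = actions.reshape(-1).astype(int)
    r = rewards.reshape(-1).astype(float)
    X, X_next = add_bias(s), add_bias(s_next)
    W = np.zeros((num_actions, X.shape[1]))    # row k holds the weights of Q(., k)
    for _ in range(max_iter):
        # Bellman target built from the current Q estimate.
        target = r + gamma * (X_next @ W.T).max(axis=1)
        W_new = W.copy()
        for k in range(num_actions):
            mask = a == k
            if mask.any():
                # Regress the targets on the states where action k was taken.
                W_new[k] = np.linalg.lstsq(X[mask], target[mask], rcond=None)[0]
        W = W_new
    return W

def greedy_actions(W, x):
    """Pick the action with the largest estimated Q-value for each row of x."""
    return (add_bias(x) @ W.T).argmax(axis=1)

# Toy data: the reward equals the action, so action 1 is always better.
rng = np.random.default_rng(0)
toy_states = rng.normal(size=(50, 6, 3))
toy_actions = rng.integers(0, 2, size=(50, 5))
toy_rewards = toy_actions.astype(float)

W = fqi_linear(toy_states, toy_actions, toy_rewards)
print(greedy_actions(W, toy_states[:, 0]).mean())  # fraction of individuals assigned action 1
```

With a preprocessor such as `ConcatenatePreprocessor`, the preprocessed states would take the place of `toy_states`, so the sensitive attribute becomes an input feature of the Q-function.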


In [11]:
agent.act(z=zs, xt=states[:, 0])

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1], d

## Value Evaluation

We now estimate the value achieved by the trained policy when interacting with the target environment (i.e. `env`). Since the underlying transition rules are known, we can directly use `evaluate_reward_through_simulation()`. This function generates a new trajectory using `agent` under `env` and computes the discounted cumulative rewards collected in the trajectory.

We evaluate the discounted cumulative rewards using a simulation with 100 individuals (i.e. $N=100$) and 500 transitions ($T=500$). The discount factor is $\gamma=0.9$.

In [12]:
value = evaluate_reward_through_simulation(env=env, 
                                           z_eval_levels=[[0, 0], [0, 1], 
                                                          [1, 0], [1, 1]], 
                                           state_dim=3, 
                                           N=100, 
                                           T=500, 
                                           policy=agent)
value

-16.296371287239186
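
To make "discounted cumulative reward" concrete, the sketch below computes $\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\gamma^{t} r_{it}$ from a reward array of shape `(N, T)`. The helper `discounted_value` is hypothetical, and the exact indexing and averaging conventions used inside `evaluate_reward_through_simulation()` may differ.

```python
import numpy as np

def discounted_value(rewards, gamma=0.9):
    """Average over individuals of the discounted sum of rewards along time."""
    rewards = np.asarray(rewards, dtype=float)
    T = rewards.shape[1]
    discounts = gamma ** np.arange(T)  # 1, gamma, gamma^2, ...
    return float((rewards * discounts).sum(axis=1).mean())

# Two hypothetical individuals, three transitions, reward 1 at every step.
toy_rewards = np.ones((2, 3))
print(discounted_value(toy_rewards))  # approximately 1 + 0.9 + 0.81 = 2.71
```

Because $\gamma^{t}$ decays geometrically, rewards far into a long simulation contribute very little to the total.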

## Counterfactual Fairness Evaluation

We now estimate the counterfactual fairness achieved by the policy when interacting with the target environment (i.e. `env`). To do so, we use `evaluate_fairness_through_simulation()`. This function first generates the counterfactual trajectories of each individual under a set of valid sensitive attribute values, using the policy being evaluated. It then calculates and returns a counterfactual fairness metric (CF metric) following the formula

$\max_{z', z \in eval(Z)} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{I} \left( A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right) \neq A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right) \right),$

where $eval(Z)$ is the set of sensitive attribute values passed in by `z_eval_levels`, $A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z'$, and $A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)$ is the action taken in the counterfactual trajectory under $Z=z$. The CF metric is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete unfairness.
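
The formula can be sketched in NumPy as follows, assuming the counterfactual actions have already been simulated and stacked into an array of shape `(K, N, T)`, one slice per sensitive attribute value. The helper `cf_metric_from_actions` and the toy actions are hypothetical; generating the counterfactual trajectories themselves is handled by the library and is not shown.

```python
import numpy as np
from itertools import combinations

def cf_metric_from_actions(cf_actions):
    """Largest pairwise action disagreement rate across sensitive attribute values.

    cf_actions has shape (K, N, T): actions taken by N individuals over T steps
    in the counterfactual trajectory under each of K sensitive attribute values.
    """
    cf_actions = np.asarray(cf_actions)
    worst = 0.0
    for j, k in combinations(range(cf_actions.shape[0]), 2):
        # Fraction of (individual, time) pairs where the two actions differ.
        disagreement = float(np.mean(cf_actions[j] != cf_actions[k]))
        worst = max(worst, disagreement)
    return worst

# Toy example: K=3 values, N=2 individuals, T=2 steps.
toy = np.array([
    [[0, 1], [1, 1]],  # actions under the first value
    [[0, 0], [1, 1]],  # one of four actions differs from the first value
    [[0, 1], [1, 1]],  # identical to the first value
])
print(cf_metric_from_actions(toy))  # 0.25
```

A policy whose actions never change across counterfactual sensitive attribute values yields 0, matching the "perfect fairness" end of the scale.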

We evaluate the counterfactual fairness using a simulation with 100 individuals (i.e. $N=100$) and 10 transitions ($T=10$).

In [13]:
cf_metric = evaluate_fairness_through_simulation(env=env, 
                                                 z_eval_levels=[[0, 0], [0, 1], 
                                                                [1, 0], [1, 1]], 
                                                 state_dim=3, 
                                                 N=100, 
                                                 T=10, 
                                                 policy=agent)
cf_metric

0.47000000000000003

## SCRATCH WORK (SHOULD DELETE LATER)

In [14]:
from CFRL.preprocessor import SequentialPreprocessor
sp = SequentialPreprocessor(z_space=[[0, 0], [0, 1], [1, 0], [1, 1]], 
                            num_actions=2, 
                            cross_folds=5, 
                            mode='single', 
                            reg_model='nn')
states_tilde, rewards_tilde = sp.train_preprocessor(zs=zs, 
                                                    xs=states, 
                                                    actions=actions, 
                                                    rewards=rewards)

100%|██████████| 1000/1000 [00:22<00:00, 44.34it/s]
100%|██████████| 1000/1000 [00:22<00:00, 43.89it/s]
100%|██████████| 1000/1000 [00:21<00:00, 45.70it/s]
100%|██████████| 1000/1000 [00:22<00:00, 44.74it/s]
100%|██████████| 1000/1000 [00:21<00:00, 45.63it/s]


In [15]:
agent_cf = FQI(num_actions=2, model_type='nn', preprocessor=sp)
agent_cf.train(zs=zs, 
               xs=states_tilde, 
               actions=actions, 
               rewards=rewards_tilde, 
               max_iter=100, 
               preprocess=False)

100%|██████████| 100/100 [00:45<00:00,  2.22it/s]


In [16]:
agent_cf.act(z=zs, xt=states[:, 0])

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], d

In [17]:
value = evaluate_reward_through_simulation(env=env, 
                                           z_eval_levels=[[0, 0], [0, 1], 
                                                          [1, 0], [1, 1]], 
                                           state_dim=3, 
                                           N=100, 
                                           T=500, 
                                           policy=agent_cf)
value

-17.855928340263535

In [18]:
cf_metric = evaluate_fairness_through_simulation(env=env, 
                                                 z_eval_levels=[[0, 0], [0, 1], 
                                                                [1, 0], [1, 1]], 
                                                 state_dim=3, 
                                                 N=100, 
                                                 T=10, 
                                                 policy=agent_cf)
cf_metric

0.018000000000000002