Goal: measure how strongly certain features of the policy $\pi_0$ correlate with $P(URS | \pi=\pi_0)$, where $URS$ indicates that the policy was optimized for a random reward function $R \sim U[-1,1]^{|T|}$ (where $|T|$ is the number of transitions with non-zero probability). For simplicity's sake, we assume that the policy was either optimized for some such $R$ or sampled uniformly at random from the set of all policies, with a 50% chance of each scenario. We also assume that the rewards are generated i.i.d., $R(s, a, s') \sim U[-1, 1]$.
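
Under the 50/50 prior above, the target quantity can be written out with Bayes' rule. This assumes that "uniformly random" means uniform over the $|A|^{|S|}$ deterministic policies, where $|S|$ and $|A|$ (notation introduced here) are the numbers of states and actions; that matches how the random policies are drawn in the code further down. The two prior factors of $0.5$ cancel:

$$P(URS \mid \pi = \pi_0) = \frac{P(\pi = \pi_0 \mid URS)}{P(\pi = \pi_0 \mid URS) + |A|^{-|S|}}$$

So a policy counts as evidence for $URS$ exactly when optimization produces it more often than $|A|^{-|S|}$ (about $9.5 \times 10^{-7}$ for 10 states and 4 actions).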

We can also analyze $P(USS | \pi = \pi_0)$, where $USS$ consists of sampling a sparsity factor $k \in [1, |T|]$ and then zeroing out $k$ entries of an $R$ sampled as before.
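
As a rough illustration of the $USS$ sampling step, here is a minimal sketch. It assumes $k$ is drawn uniformly from $\{1, \dots, |T|\}$ (the text above does not fix the distribution of $k$), and `sample_sparse_reward` is a hypothetical helper, not the function used in the rest of this notebook.

```python
import numpy as np

def sample_sparse_reward(num_transitions, rng):
    """Sample R ~ U[-1, 1]^|T|, then zero out k entries, with k uniform on {1, ..., |T|} (assumed)."""
    k = int(rng.integers(1, num_transitions + 1))   # sparsity factor k in [1, |T|]
    rewards = rng.uniform(-1.0, 1.0, size=num_transitions)
    zeroed = rng.choice(num_transitions, size=k, replace=False)  # k distinct positions
    rewards[zeroed] = 0.0
    return k, rewards

rng = np.random.default_rng(0)
k, rewards = sample_sparse_reward(40, rng)
print("k =", k, "| zero entries =", np.count_nonzero(rewards == 0.0))
```

The reward is kept as a flat vector of length $|T|$ here; mapping it back onto the non-zero entries of a transition array is a separate step.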

In [34]:
import mdptoolbox as mdpt, numpy as np
import mdptoolbox.example

In [59]:
### Generate a bunch of MDPs with different parameters, sparsity
from functools import partial

NUM_MDPs = 100
NUM_STATES = 10
NUM_ACTIONS = 4

def get_transition_matrix(num_states, num_actions, generator = np.random.dirichlet, **kwargs):
    """
    Returns a random stochastic transition matrix for a given number of states and actions; each row
    P[a, s, :] is drawn from generator (by default a flat Dirichlet distribution)
    
    Returns:
        P: (num_actions, num_states, num_states) array, where P[a, s, s'] is the probability of 
        transitioning from state s to state s' given action a
    """
    P = np.zeros((num_actions, num_states, num_states)) # (A, S, S') shape
    for a in range(num_actions):
        for s in range(num_states):
            P[a, s, :] = generator(np.ones(num_states))
    return P

def get_reward_matrix(transitions, sparsity = 0.0, generator = partial(np.random.uniform, -1, 1), **kwargs):
    """
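    Returns a reward matrix with the same (A, S, S') shape as transitions. Every transition with non-zero
    probability gets a reward drawn from generator, except for max(1, int(sparsity * |T|)) randomly chosen
    transitions, whose rewards are set to zero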
    """
    num_pos_transitions = np.count_nonzero(transitions)
    num_sparse_rewards = max(1, int(sparsity * num_pos_transitions))
    rewards = np.array([(0 if i < num_sparse_rewards else generator()) for i in range(num_pos_transitions)])
    np.random.shuffle(rewards) # create a random permutation of the rewards
    # num_pos_transitions number of rewards, with num_sparse_rewards number of zeros
    out = np.zeros(transitions.shape)
    for i, (a, s, s_prime) in enumerate(np.argwhere(transitions)):
        out[a, s, s_prime] = rewards[i]
    assert np.count_nonzero(out) == num_pos_transitions - num_sparse_rewards
    return out

DISCOUNT = 0.9
EPSILON = 0.01 # roughly indicates the "skill level" of the agent
MAX_ITER = 1000

In [60]:
def generate_tests(num_mdps = NUM_MDPs, sparsity_levels: np.ndarray = None, mdp_generator = mdpt.mdp.PolicyIterationModified, P_generator = None, **kwargs):
    """
    Generate a bunch of MDPs with different sparsity levels, and return the sparsity levels and the MDPs

    Args:
        sparsity_levels: a list of sparsity levels to generate MDPs with
    Returns:
        sparsity_levels: the sparsity levels used to generate the MDPs, in the same order as the MDPs
        MDPS: an array of MDPs
    """
    # Fall back to the module-level defaults independently for each setting
    max_iter = kwargs.get('max_iter', MAX_ITER)
    epsilon = kwargs.get('epsilon', EPSILON)
    sparsity_levels = sparsity_levels if sparsity_levels is not None else np.arange(num_mdps) / num_mdps
    sparsity_copy = sparsity_levels.copy() # defensive copy
    np.random.shuffle(sparsity_copy)
    transitions = np.array([get_transition_matrix(NUM_STATES, NUM_ACTIONS, **kwargs) if P_generator is None else P_generator(NUM_STATES, NUM_ACTIONS, **kwargs) for i in range(num_mdps)])
    MDPS = np.array([mdp_generator(
        transitions[i], 
        get_reward_matrix(transitions[i], sparsity_copy[i], **kwargs), 
        DISCOUNT, max_iter = max_iter) 
        for i in range(num_mdps)
    ])
    for mdp in MDPS:
        if mdp_generator == mdpt.mdp.ValueIteration:
            mdp.epsilon = epsilon
    return sparsity_copy, MDPS
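
For reference, "optimizing a policy for $R$" with `(A, S, S')`-shaped arrays can be sketched in plain NumPy. This is a minimal value-iteration loop written for illustration only; it is not mdptoolbox's implementation, and the stopping rule (largest value change below `epsilon`) is an assumption that may differ from the library's. The defaults simply mirror `DISCOUNT`, `EPSILON` and `MAX_ITER` above.

```python
import numpy as np

def value_iteration(P, R, discount=0.9, epsilon=0.01, max_iter=1000):
    """P, R: (A, S, S') arrays. Returns (greedy policy, value estimate)."""
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    for _ in range(max_iter):
        # Q[a, s] = sum over s' of P[a, s, s'] * (R[a, s, s'] + discount * V[s'])
        Q = np.einsum("ast,ast->as", P, R + discount * V[None, None, :])
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < epsilon
        V = V_new
        if converged:
            break
    return Q.argmax(axis=0), V

# Toy MDP: action 0 stays put, action 1 moves to the other state;
# the only reward is for staying in state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[0, 1, 1] = 1.0

policy, V = value_iteration(P, R)
print(policy, V)
```

In the toy example the greedy policy moves from state 0 to state 1 and then stays there, which is the kind of "walk to a loop, then loop" behaviour the later features try to pick up.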

We build a transition function with various settings for its properties (e.g. deterministic, sparse, fixed) and train a classifier to predict whether a policy came from URS given $\pi = \pi_0$ (baseline probability = 0.5).

In [61]:
### Generate a bunch of MDPs (with baseline/zero sparsity), solve some of them, 
# generate random policy for others

def transition_function_sparse_loops(states, actions, fixed = False, **kwargs):
    """
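    Sparse, deterministic transition function with guaranteed loops: action 0 is a self-loop in every
    state, and every other action moves to a single randomly chosen state. With fixed = True the random
    choices come from a generator seeded with 0, so every call returns the same matrix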
    TODO: possibly implement terminal states
    """
    # print(fixed)
    rng = np.random.default_rng(seed = 0) if fixed else None
    transitions = np.zeros((actions, states, states))
    for state in range(states):
        for action in range(actions):
            if action == 0: # action 0 always loops back to the current state
                transitions[action, state, state] = 1
            else: # every other action moves deterministically to one randomly chosen state
                next_state = rng.integers(0, states) if fixed else np.random.randint(states)
                transitions[action, state, next_state] = 1
    return transitions

NUM_MDPs = 10000
fixed = False
#print(np.random.uniform(1.0/NUM_ACTIONS/NUM_STATES**2, 1, NUM_MDPs))
sparsity_levels = np.random.uniform(1.0/NUM_ACTIONS/NUM_STATES**2, 1, NUM_MDPs)
# URS would be np.zeros(NUM_MDPs)

MDPS = generate_tests(NUM_MDPs, sparsity_levels = sparsity_levels,
                      P_generator = transition_function_sparse_loops, fixed = fixed)[1]
# Problem with _bounditer in ValueIteration happening when upper uniform bound is too high/sparse
random_pol_indices = np.random.choice(NUM_MDPs, int(NUM_MDPs / 2), replace = False) # The indices of the MDPs with random policies

In [62]:
# print(random_pol_indices)
for i in range(NUM_MDPs): # 50% RR, 50% random
    MDPS[i].run()
for i in random_pol_indices:
    MDPS[i].policy = np.random.randint(NUM_ACTIONS, size = NUM_STATES)
policies = np.array([mdp.policy for mdp in MDPS])
# print(policies.shape)
random_pol_set = set(random_pol_indices)
random_or_rr = np.array([0 if i in random_pol_set else 1 for i in range(NUM_MDPs)])
# 0 if random, 1 if generated from RR

In [63]:
# print([MDPS[1].P[j] == MDPS[0].P[j] for j in range(NUM_ACTIONS)])
assert not fixed or np.all([np.all([MDPS[i].P[j] == MDPS[0].P[j] for j in range(NUM_ACTIONS)]) for i in range(NUM_MDPs)])

In [64]:
print(policies[0:10], random_or_rr[0:10])

[[1 1 2 1 0 3 1 0 1 0]
 [2 2 3 0 1 0 1 3 2 2]
 [3 1 2 0 0 3 1 2 2 3]
 [2 2 3 1 1 1 2 3 1 2]
 [1 2 2 3 1 0 1 1 1 1]
 [0 1 3 0 0 0 3 3 0 1]
 [0 2 0 3 1 0 3 1 3 0]
 [2 2 1 3 3 3 1 3 3 3]
 [3 3 2 3 1 2 1 2 2 3]
 [1 2 3 1 3 2 2 0 0 2]] [0 0 0 0 1 0 0 0 1 0]


In [94]:
### Classifiers (logistic regression, neural network) and loop features
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, log_loss
from sklearn.preprocessing import OneHotEncoder
from tensorflow import keras

def regression(X, y, test_size = 0.2, regression = LogisticRegression):
    """
    Trains a classifier on the given data, and returns the model, its predicted class probabilities on
    the held-out test set, and the test labels. The model class must implement predict_proba
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size)
    model = regression().fit(X_train, y_train)
    return model, model.predict_proba(X_test), y_test

def neural_network(X, y, test_size = 0.2, *args, **kwargs):
    """
    Trains a small feed-forward neural network on the given data, and returns the model, its predicted
    probabilities on the held-out test set, and the test labels
    """
    def build_model():
        model = keras.Sequential([
            keras.layers.Dense(64, activation = 'relu', input_shape = [X.shape[1]]),
            keras.layers.Dropout(0.2),
            keras.layers.Dense(64, activation = 'relu'),
            keras.layers.Dropout(0.2),
            keras.layers.Dense(1, activation = 'sigmoid')
        ])
        return model
    model = build_model()
    model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['mae'])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size)
    model.fit(X_train, y_train, epochs = 100, validation_split = 0.2, verbose = 1, 
              callbacks = [keras.callbacks.EarlyStopping(patience = 3)])
    return model, model.predict(X_test), y_test

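def find_loop_dist_and_length(transitions, policy: np.ndarray, initial_state):
    """
    Follows policy from initial_state until a state repeats, and returns (distance to loop, loop length):
    the number of steps taken before entering the loop, and the number of steps in the loop itself.
    With stochastic transitions this describes a single sampled trajectory
    """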
    visited_states = {}  # Using a dict for quicker lookups
    current_state = initial_state
    step = 0  # Track the number of steps taken to find the loop length directly

    while current_state not in visited_states:
        visited_states[current_state] = step
        # Simulate a transition
        current_state = np.random.choice(np.arange(len(policy)), 1, 
                                         p = transitions[policy[current_state]][current_state]).item()
        step += 1
    
    #distance to loop = visited_states[current_state]; loop length = step - visited_states[current_state]
    return visited_states[current_state], step - visited_states[current_state]


### Generate features
encoder = OneHotEncoder(categories = 'auto', sparse_output = False, drop = 'first')
# Drop first to avoid multicollinearity, large coefficients
# encoder.fit(np.arange(NUM_ACTIONS))
# print(encoder.categories_)

### Train the model
# features = np.array([np.concatenate((np.array(MDPS[i].P).flatten(), policies_encoded[i]), axis = 0)
#                      for i in range(NUM_MDPs)])
features = encoder.fit_transform(policies)
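# NOTE: index [0] selects the distance to the loop (index [1] would be the loop length), and the rollout
# for column j starts from state policies[i][j], i.e. the action index chosen in state j, not from j
loop_lengths = np.array([[find_loop_dist_and_length(MDPS[i].P, policies[i], policies[i][j])[0] for j in range(NUM_STATES)] 
                         for i in range(NUM_MDPs)])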
features = encoder.fit_transform(loop_lengths)
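
To make the two loop features concrete: when transitions are deterministic, following a policy from any state traces a path that eventually enters a cycle. "Distance to loop" is the number of steps before the cycle is entered, and "loop length" is the size of the cycle. The sketch below is a deterministic simplification of `find_loop_dist_and_length`; the `next_state` successor array is a hypothetical stand-in for "the state the policy leads to from each state".

```python
import numpy as np

def loop_dist_and_length(next_state, initial_state):
    """next_state[s] is the state reached from s under the policy (deterministic case)."""
    first_visit = {}  # state -> step at which it was first visited
    state, step = initial_state, 0
    while state not in first_visit:
        first_visit[state] = step
        state = int(next_state[state])
        step += 1
    # state is now the first repeated state, i.e. the entry point of the loop
    return first_visit[state], step - first_visit[state]

# Successor map induced by some policy: 0 -> 1 -> 2 -> 3 -> 2, and 4 -> 4 (self-loop)
next_state = np.array([1, 2, 3, 2, 4])
print(loop_dist_and_length(next_state, 0))  # (2, 2): two steps to reach the 2 <-> 3 cycle
print(loop_dist_and_length(next_state, 4))  # (0, 1): already on a self-loop
```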

In [98]:
features = loop_lengths # for interpretability
model, y_pred, y_test = regression(features, random_or_rr, regression = partial(LogisticRegression, max_iter = 1000))
print("Average cross-entropy loss:", log_loss(y_test, y_pred, normalize = True))
# Column 1 of y_pred is the predicted probability of class 1 (generated from RR, see random_or_rr),
# so a prediction counts as correct when thresholding that column at 0.5 matches the label
print("Accuracy:", np.mean((y_pred[:, 1] >= 0.5) == y_test))
# print(y_pred)
print("Baseline log loss:", log_loss(y_test, np.full(y_pred.shape, 0.5), normalize = True))
print("Model coefficients, intercept:", model.coef_, model.intercept_)
print("Sample outputs:", [(y_pred[i], y_test[i]) for i in range(10)])

Average cross-entropy loss: 0.6326985483625261
Accuracy: 0.6645
Baseline log loss: 0.6931471805599454
Model coefficients, intercept: [[0.18003894 0.03842198 0.02386525 0.04267284 0.0498999  0.09509716
  0.05531086 0.12606996 0.05786498 0.09934552]] [-0.9748501]
Sample outputs: [(array([0.45669389, 0.54330611]), 1), (array([0.34906547, 0.65093453]), 1), (array([0.3190951, 0.6809049]), 0), (array([0.72608517, 0.27391483]), 0), (array([0.47355157, 0.52644843]), 0), (array([0.56282145, 0.43717855]), 1), (array([0.25057232, 0.74942768]), 1), (array([0.64209953, 0.35790047]), 1), (array([0.37293366, 0.62706634]), 0), (array([0.55463121, 0.44536879]), 1)]
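
The baseline log loss printed above is just $\ln 2 \approx 0.6931$: predicting 0.5 for every example costs $-\ln 0.5$ regardless of the labels. A quick check, using a hand-rolled binary log loss written for illustration (not sklearn's `log_loss`, though for binary labels it should compute the same quantity):

```python
import numpy as np

def binary_log_loss(y_true, p_class1):
    """Mean negative log-likelihood of the true labels under predicted P(y = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_class1, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1, 1, 0, 0, 0])  # arbitrary labels; the baseline does not depend on them
baseline = binary_log_loss(y, np.full(len(y), 0.5))
print(baseline, np.log(2))
```

Any model whose log loss is below this value is extracting at least some signal from the features.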


In [99]:
### Grab the five test examples with the highest and lowest predicted probabilities of class 1 (generated from RR)
import networkx as nx

if fixed:
    # Generate a graph of the first MDP
    G = nx.DiGraph()
    for i in range(NUM_STATES):
        G.add_node(i)
    enumerated_edges = {}
    for i in range(NUM_ACTIONS):
        enumerated_edges[i] = []
        for j in range(NUM_STATES):
            for k in range(NUM_STATES):
                if MDPS[0].P[i][j, k] == 1:
                    G.add_edge(j, k, action = i)
                    enumerated_edges[i].append((j, k))
    edge_labels = {(u, v): f"{d['action']}" for u, v, d in G.edges(data=True)}
    pos = nx.spring_layout(G, k=0.5, iterations=20)  # k: Optimal distance between nodes. Increase/decrease to spread nodes out
    nx.draw(G, pos = pos, with_labels = True)
    nx.draw_networkx_edge_labels(G, pos = pos, edge_labels = edge_labels)
    
    for i in range(NUM_ACTIONS):
        print(f"Action {i} transitions:", enumerated_edges[i])

highest_probs = np.argsort(y_pred[:, 1])[-5:]
lowest_probs = np.argsort(y_pred[:, 1])[:5]
#print("Highest probabilities:", [(y_pred[i], y_test[i]) for i in highest_probs])
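# NOTE: i indexes rows of the shuffled test split (y_pred, y_test), whereas policies is in the original
# dataset order, so policies[i] is generally not the policy that produced y_pred[i]
for i in np.concatenate((highest_probs, lowest_probs)):
    print("Policy:", policies[i], "Probability:", y_pred[i], "Actual:", y_test[i])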

Policy: [3 2 0 2 0 2 3 2 0 3] Probability: [0.07292828 0.92707172] Actual: 1
Policy: [0 0 0 1 0 0 0 0 0 0] Probability: [0.05545048 0.94454952] Actual: 0
Policy: [2 2 0 2 2 3 1 2 0 0] Probability: [0.05441745 0.94558255] Actual: 0
Policy: [2 3 3 0 1 1 3 2 1 1] Probability: [0.04486485 0.95513515] Actual: 1
Policy: [1 1 3 2 2 2 1 2 1 1] Probability: [0.04056318 0.95943682] Actual: 0
Policy: [3 1 1 3 1 3 1 3 3 1] Probability: [0.72608517 0.27391483] Actual: 0
Policy: [2 2 0 2 3 2 1 3 0 1] Probability: [0.72608517 0.27391483] Actual: 0
Policy: [2 2 2 0 2 1 0 3 3 0] Probability: [0.72608517 0.27391483] Actual: 0
Policy: [0 1 2 2 3 2 1 3 3 0] Probability: [0.72608517 0.27391483] Actual: 0
Policy: [3 2 2 1 1 1 2 1 1 0] Probability: [0.72608517 0.27391483] Actual: 1
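
Because `train_test_split` shuffles rows, row `i` of `y_pred` is not row `i` of `policies`, so the policies printed above are not necessarily the ones that were scored. One way to recover the mapping is to split an index array alongside the data. `train_test_split` accepts any number of arrays and splits them consistently, so passing `np.arange(len(X))` as an extra array would work. The sketch below does the split by hand with NumPy to stay self-contained; all names and data in it are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
num_examples, test_size = 10, 0.2
features_demo = rng.integers(0, 5, size=(num_examples, 3))   # stand-in feature rows
policies_demo = rng.integers(0, 4, size=(num_examples, 3))   # stand-in policies

# Shuffle row indices once and split them, so every test row keeps its original index.
order = rng.permutation(num_examples)
num_test = int(round(test_size * num_examples))
test_idx, train_idx = order[:num_test], order[num_test:]

X_train, X_test = features_demo[train_idx], features_demo[test_idx]
# Row r of X_test (and of any predictions made from it) belongs to policy test_idx[r].
for r, original in enumerate(test_idx):
    print("test row", r, "-> original index", original, "-> policy", policies_demo[original])
```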


- On random deterministic MDPs, URS doesn't seem to be identifiable, which is perhaps to be expected, as every policy is optimal for some (normalized) reward function
    - This also matches our results when looking at the distribution of optimal policies for "cloud"-y MDPs
- Apparently my neural networks aren't predicting very well
- On MDPs with loops, logistic regression achieves ~0.74-0.76 accuracy (0.56-0.59 log loss); the neural network does slightly better than random?
    - Although the NN's accuracy is basically 0.5
    - This holds true when we use the label predictions for regression (model.predict), as well as the probability prediction (model.predict_proba)
- Distance to loop correlates somewhat well with P(URS) (~0.66-0.68 accuracy, 0.61-0.63 log loss on a diverse dataset of sparse transition functions), length of loop not as well (~0.56 accuracy, 0.687 log loss)
    - Putting them together doesn't give improvement (~0.67 accuracy, 0.62 log loss)
    - Intuitively, the length of the loop an optimal policy takes is its “goal complexity”; distance to loop = “agency” 
- Setting $k \sim U[1, N/2]$ gives ~0.72-0.74 accuracy, ~0.54 log loss
    - 0.56 accuracy, 0.688 log loss with length of loop; 0.66-0.672 accuracy, 0.61 log loss with distance to loop
    - Setting the upper bound of $k$ too high results in some weird MDP package errors, I suspect because sparsity is too high
    - This matches the distribution results we found in reward_function.ipynb, as sparsity didn't seem to "matter" until around 0.9 given (S, A) = (10, 4)
- $k \sim U[1, N]$ gives ~0.74-0.76 accuracy, 0.58-0.59 log loss
    - (Note that this was run with PolicyIterationModified instead of ValueIteration with the same settings, which I don't expect to change any of the results, but I might be wrong)
- TODO: run these tests multiple times to make box plots in the writeup
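
For the repeated-runs TODO, a minimal sketch of the bookkeeping: call one experiment function per seed and collect the scores, so that they can later be passed to a box plot (e.g. matplotlib's `plt.boxplot(scores)`). Both `summarize_runs` and `placeholder_experiment` are hypothetical names; the placeholder returns arbitrary numbers, not real results, and would need to be replaced by "generate MDPs, fit the classifier, return test accuracy".

```python
import numpy as np

def summarize_runs(run_once, num_runs=20, base_seed=0):
    """Call run_once(seed) for several seeds; return the scores and their five-number summary."""
    scores = np.array([run_once(base_seed + i) for i in range(num_runs)])
    return scores, np.percentile(scores, [0, 25, 50, 75, 100])

def placeholder_experiment(seed):
    # Placeholder only: an arbitrary seeded number in [0, 1), NOT a real accuracy.
    rng = np.random.default_rng(seed)
    return float(rng.uniform(0.0, 1.0))

scores, five_number = summarize_runs(placeholder_experiment)
print("min, Q1, median, Q3, max:", five_number)
```

Seeding each run separately keeps individual runs reproducible, which helps when one box-plot outlier needs a closer look.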