You are given a stochastic policy $\pi(a|s)$. In reinforcement learning, a stochastic policy is a probability distribution over the actions available to an agent in a given state. Unlike a deterministic policy, which maps each state to a single action, a stochastic policy samples its action, so different actions can be chosen in the same state with different probabilities.

The stochastic policy guides the agent as it interacts with a predefined MDP environment. The policy is perturbed by an $\epsilon$-greedy-like process: with probability $\epsilon$ the agent performs a random action (i.e. sampled from a uniform distribution over the actions), and with probability $1 - \epsilon$ it performs a policy action (i.e. sampled from $\pi(a|s)$).

The environment has 200 different states and 5 allowed actions. Furthermore, the environment has a fixed starting state and deterministic transitions, so each state-action pair $(s, a)$ leads to a single successor state $s'$ with probability 1:

$$ P(s'|s,a) = 1 \quad \forall (s, a) $$

Your task is to write a function that, given the recorded interactions with the environment, outputs the log-likelihood of the trajectory. Please bear in mind the $\epsilon$-greedy disturbance!

## Hint:

Let $s_i, a_i$ be random variables denoting the state and action sampled at step $i \in \{0, \dots, t\}$. The log-likelihood we are looking for is equal to:

$$ L = \log \bigl(P(s_0) \cdot P(a_0|s_0) \cdot P(s_1|s_0, a_0) \cdot P(a_1|s_1) \cdot \ldots \cdot P(s_{t}|s_{t-1}, a_{t-1}) \cdot P(a_t|s_t)\bigr) $$
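The logarithm of a product like the one above is usually evaluated as a sum of logarithms, because multiplying many small probabilities can underflow in floating point. The following is a minimal sketch of that effect; the per-step probabilities are made up for illustration and are not the exercise data.

```python
import numpy as np

# Made-up per-step probabilities: 1000 steps, each with probability 0.02.
step_probs = np.full(1000, 0.02)

# Naive route: the product underflows to 0.0 in float64, so its log is -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.prod(step_probs))

# Sum of logs: mathematically the same quantity, but numerically stable.
stable = np.sum(np.log(step_probs))

print(naive)   # -inf
print(stable)
```

For the 100-step trajectory in this exercise the product would likely still fit in a float64, but the sum of logs is the safer habit.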

Since the starting state is fixed and the transitions are deterministic, $P(s_0)$ and $P(s_{i+1}|s_i, a_i)$ are all equal to 1 and contribute nothing to the logarithm, so $L = \sum_{i=0}^{t} \log P(a_i|s_i)$. However, note that $P(a_i|s_i)$ is affected by the $\epsilon$-greedy disturbance. Let $E$ be a random variable with the following distribution:

$$ P(E = 1) = \epsilon \quad \text{and} \quad P(E = 0) = 1 - \epsilon $$

$E$ represents the $\epsilon$-greedy disturbance. As such, given 5 possible actions in each state, we have:

$$ P(a|s, E=1) = \frac{1}{5} \quad \text{and} \quad P(a|s, E=0) = \pi(a|s)$$

Try to first calculate $P(a|s)$ using the law of total probability!
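As a way to check a hand derivation, the sketch below applies the law of total probability over $E$ to a single policy row. The helper name `mix_with_uniform` and the row `toy_row` are hypothetical and chosen only for illustration; they are not part of the exercise code.

```python
import numpy as np

def mix_with_uniform(policy_row, epsilon):
    """Marginalise over E: P(a|s) = P(E=1) * 1/n + P(E=0) * pi(a|s)."""
    policy_row = np.asarray(policy_row, dtype=float)
    n_actions = policy_row.shape[-1]
    return epsilon / n_actions + (1.0 - epsilon) * policy_row

# Hypothetical policy row for one state with 5 actions (not exercise data).
toy_row = np.array([0.7, 0.2, 0.1, 0.0, 0.0])
mixed = mix_with_uniform(toy_row, epsilon=0.1)

print(mixed)        # every action now has probability of at least epsilon / 5
print(mixed.sum())  # the mixture is still a valid distribution
```

Note that actions with $\pi(a|s) = 0$ still receive probability $\epsilon / 5$, so the logarithm stays finite for every action that can appear in the trajectory.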


In [1]:
import numpy as np
np.set_printoptions(suppress=True)

TIMESTEPS = 100
N_STATES = 200
N_ACTIONS = 5
EPSILON = 0.1

We define the environment:

In [2]:
class Env():
    def __init__(self):
        self.transitions = np.random.choice(np.arange(N_STATES), size=(N_STATES, N_ACTIONS))
        self.rewards = np.random.randint(0,10, size=(N_STATES, N_ACTIONS))

    def reset(self, starting_state=0):
        state = starting_state
        self.state = state
        self.step_idx = 0
        return state
    
    def step(self, action):
        self.step_idx += 1
        new_state = self.transitions[self.state, action]
        reward = self.rewards[self.state, action]
        self.state = new_state
        terminal = self.step_idx > TIMESTEPS
        return new_state, reward, terminal

The policy:

In [3]:
class Policy():
    def __init__(self):
        logits = np.random.uniform(1,10, size=(N_STATES, N_ACTIONS))        
        self.probabilities = np.exp(logits / 0.4) / np.exp(logits / 0.4).sum(1, keepdims=True)
        
    def get_action_and_probs(self, state):
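        # Draw u ~ U(0, 1): with probability EPSILON take a uniformly random
        # action, otherwise sample an action from the policy row of this state.
        u = np.random.uniform(0,1)
        if u > EPSILON:
            action = np.random.choice(np.arange(N_ACTIONS), p=self.probabilities[state])
        else:
            action = np.random.randint(0, N_ACTIONS)
        # The returned probabilities are the raw policy row (no epsilon mixing).
        return action, self.probabilities[state]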
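To see what the branching in `get_action_and_probs` does to the action distribution, the sketch below repeats the same two-way sampling many times for one made-up policy row and compares the empirical frequencies with the uniform/policy mixture. It uses its own local random generator and hypothetical names (`toy_row`, `n_samples`), so it does not touch the exercise's global seed.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator, independent of np.random.seed
epsilon, n_samples = 0.1, 200_000
toy_row = np.array([0.7, 0.2, 0.1, 0.0, 0.0])  # hypothetical policy row
n_actions = toy_row.size

# Same branching as get_action_and_probs, vectorised over many draws.
explore = rng.random(n_samples) <= epsilon
policy_actions = rng.choice(n_actions, size=n_samples, p=toy_row)
random_actions = rng.integers(0, n_actions, size=n_samples)
sampled = np.where(explore, random_actions, policy_actions)

empirical = np.bincount(sampled, minlength=n_actions) / n_samples
expected = epsilon / n_actions + (1 - epsilon) * toy_row

print(np.round(empirical, 3))
print(expected)
```

The two printed rows should agree up to sampling noise, including nonzero frequencies for the actions the policy itself never picks.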

The following function generates the trajectory data for TIMESTEPS timesteps:

In [4]:
def get_trajectory_data():
    np.random.seed(1)
    actions = np.zeros(TIMESTEPS, dtype=np.int16)
    policy = Policy()
    env = Env()
    policy_matrix = policy.probabilities
    transition_matrix = env.transitions
    state = env.reset()
    for i in range(TIMESTEPS):
        action, probs = policy.get_action_and_probs(state)
        new_state, reward, terminal = env.step(action)
        actions[i] = action
        state = new_state
        if terminal:
            state = env.reset()
    return actions, policy_matrix, transition_matrix

actions, policy_matrix, transition_matrix = get_trajectory_data()

Let's investigate the arrays returned by **get_trajectory_data**. First, there is *actions*:

In [5]:
print(actions)

[1 0 4 2 0 1 0 0 1 3 1 2 4 3 4 4 0 0 2 4 4 4 0 0 2 4 4 4 0 0 2 2 4 1 4 1 4
 4 1 3 2 0 2 2 4 3 3 2 1 2 1 1 0 2 4 4 3 4 1 0 4 4 4 3 3 4 4 0 0 4 0 1 3 3
 1 3 2 4 3 2 1 4 4 1 3 2 3 3 4 1 1 2 1 2 1 2 1 4 1 2]


The *actions* array shows which action was performed at each timestep $i$ (e.g. the agent performed actions[0] at timestep 0). There are 100 timesteps.

In [6]:
print(policy_matrix[:10,:])

[[0.00108576 0.99882944 0.00000009 0.00008223 0.00000248]
 [0.00004114 0.00034041 0.01226394 0.03881531 0.9485392 ]
 [0.00003235 0.01286576 0.00000026 0.98710162 0.        ]
 [0.92229417 0.00309783 0.07457955 0.0000061  0.00002234]
 [0.02003422 0.8683322  0.00000035 0.00174708 0.10988615]
 [0.59156613 0.00000001 0.         0.00000005 0.40843381]
 [0.         0.00000567 0.99741475 0.00007057 0.00250901]
 [0.00000714 0.03011679 0.84377811 0.00000001 0.12609795]
 [0.98458086 0.0043779  0.00000012 0.01104112 0.        ]
 [0.00003149 0.99996665 0.00000098 0.00000086 0.00000002]]


Furthermore, there is *policy_matrix*. The array has shape (N_STATES x N_ACTIONS) and holds $\pi(a|s)$ for each state-action pair (e.g. the policy probability of performing action "1" in state "0", $\pi(a=1|s=0)$, is equal to policy_matrix[0,1]).
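When many state-action pairs need to be looked up at once, NumPy integer-array indexing can pair a vector of states with a vector of actions element-wise. A small sketch with a made-up 3-state, 2-action matrix (the names `toy_policy`, `states` and `acts` are hypothetical):

```python
import numpy as np

# Hypothetical 3-state, 2-action policy matrix (each row sums to 1).
toy_policy = np.array([[0.9, 0.1],
                       [0.5, 0.5],
                       [0.2, 0.8]])
states = np.array([0, 2, 1])   # made-up visited states
acts = np.array([1, 1, 0])     # made-up actions taken in those states

# Integer-array indexing pairs states[k] with acts[k] element-wise.
picked = toy_policy[states, acts]
print(picked)  # [0.1 0.8 0.5]
```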

In [7]:
print(transition_matrix[:10,:])

[[ 61 167 134 135 140]
 [115 128   3  51 143]
 [195  81  59  66 143]
 [ 61 192  19 172  87]
 [ 61  21 120 128  63]
 [ 94 129  84 188  22]
 [142 152 174 104  32]
 [198  54 152 163  57]
 [159  39  14 113 176]
 [110  13  19 143  97]]


Finally, there is *transition_matrix*. The array has shape (N_STATES x N_ACTIONS) and holds the new state reached after performing a given action in a given state (e.g. after performing action "1" in state "0" the agent transitions to state transition_matrix[0,1]). Note the deterministic transitions.

Complete the function that calculates the log-likelihood of the sampled trajectory. You should first find which states were visited (using the starting state, *actions* and *transition_matrix*) and then look up the probability values we are interested in (using the sequence of visited states, *actions* and *policy_matrix*).

1. **Agent starts in state "0" (i.e. the first row of policy_matrix and transition_matrix). Remember to keep the starting state in your list of visited states!**
2. **Keep in mind that *policy_matrix* does not incorporate the $\epsilon$-greedy disturbance!**
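The first step, recovering the visited states, could be sketched as a simple roll-out through the transition table. The helper `visited_states` and the toy arrays below are hypothetical. The sketch also assumes the environment is never reset mid-trajectory, which appears to hold here because `step_idx` never exceeds TIMESTEPS within the 100 recorded steps.

```python
import numpy as np

def visited_states(actions, transition_matrix, start_state=0):
    """Return the state in which each action was taken (same length as actions)."""
    states = np.empty(len(actions), dtype=np.int64)
    state = start_state
    for i, action in enumerate(actions):
        states[i] = state                          # state seen before acting
        state = transition_matrix[state, action]   # deterministic successor
    return states

# Made-up 3-state, 2-action transition table and action sequence.
toy_transitions = np.array([[1, 2],
                            [2, 0],
                            [0, 1]])
toy_actions = np.array([1, 0, 0, 1])
print(visited_states(toy_actions, toy_transitions))  # [0 2 0 1]
```

The returned array starts with the starting state and is aligned with the actions, so entry $i$ is the state $s_i$ in which action $a_i$ was chosen.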

In [8]:
def calculate_loglikelihood(actions, policy_matrix, transition_matrix):
    ## YOUR CODE HERE ##
    return None

print(calculate_loglikelihood(actions, policy_matrix, transition_matrix))

None
