# Model-Free Prediction
The goal of this lesson is to estimate the value function of an unknown MDP under a given policy. In the previous lesson we assumed full knowledge of the MDP, which allowed us to compute the value function with iterative methods without sampling any episode.

Here, on the contrary, we will not assume that we know the MDP, and we will introduce methods for estimating the value function and the action-value function from sampled episodes.

We will use the Student MRP coded in the previous lessons to illustrate our points.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

class StudentMarkovChain():
    """
    This class models the Markov process described in the lesson
    everything is hard coded, this class is not meant to be general
    """
    def __init__(self):
        #transition probabilities of each state
        self.transition = np.array([[0, 0.5 , 0 , 0 , 0 , 0.5 , 0 ],
                                    [0 , 0 , 0.8 , 0 , 0 , 0 , 0.2],
                                    [0 , 0 , 0 , 0.6 , 0.4 , 0 , 0],
                                    [0 , 0 , 0 , 0 , 0 , 0 , 1],
                                    [0.2 , 0.4 , 0.4 , 0 , 0 , 0 , 0],
                                    [0.1 , 0 , 0 , 0 , 0 , 0.9 , 0],
                                    [0 , 0 , 0 , 0 , 0 , 0 , 1]
                                    ])
        #name of states
        self.titles=["C1","C2","C3","Pass","Pub","FB","Sleep"]
        #first state
        self.state=0
        #the class will keep the history
        self.history = [self.titles[self.state]]

    #function to change state in the markov process
    def step(self):
        #Next state is picked following the probabilities of the transition matrix 
        self.state = np.random.choice(range(7),p=self.transition[self.state])
        self.history.append(self.titles[self.state])
        #if the state is not final (Sleep, index 6, is the terminal state)
        if self.state != 6:
            #we return the state, a bool telling if the episode is finished, and a null reward
            return self.state,False,0
        else:
            return self.state,True,0
    #function to restart
    def reboot(self):
        self.state = 0
        self.history = [self.titles[self.state]]


#function to run one episode of the reward process and print its history
def main_markov():
    finished = False
    smc = StudentMarkovRewardProcess()
    while not finished:
        _,finished,_ = smc.step()
    print(smc.history)

class StudentMarkovRewardProcess(StudentMarkovChain):
    """
    Class to add rewards to the student markov chain
    it is inheriting the transition probabilities and the names from the markov chain
    """
    def __init__(self):

        StudentMarkovChain.__init__(self)
        # we are adding here the rewards of the different states
        self.rewards=[-2,-2,-2,10,1,-1,0]
        #and the shape of the history includes the rewards 
        self.history[-1]=(self.history[-1],self.rewards[self.state])

    # change the step function of the markov chain to add the rewards    
    def step(self):
        state,finished,_ = StudentMarkovChain.step(self)
        reward = self.rewards[state]
        self.history[-1]=(self.history[-1],reward)
        return self.state,finished,reward

    #function to restart
    def reboot(self):
        self.state = 0
        self.history = [(self.titles[self.state],self.rewards[self.state])]



## Monte Carlo evaluation
Here we run many episodes and compute, for each state, the average of the returns observed over all the episodes:
$$ v_{\pi}(s) = \mathbb E _{\pi} [G_t |S_t = s]$$
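
The return $G_t$ being averaged is the discounted sum of the rewards collected from time $t$ until the end of the episode. As a minimal sketch (the function name `discounted_returns` and the sample episode are hypothetical, not part of the lesson's code), the returns for every step of one episode can be computed in a single backward pass using $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every step of one episode.

    rewards[t] is assumed to be the reward collected when leaving the
    state visited at step t (the convention used by this lesson's MRP).
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk the episode backwards: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode C1 -> C2 -> C3 -> Pass -> Sleep
episode_rewards = [-2, -2, -2, 10, 0]
print(discounted_returns(episode_rewards))
```

The backward pass gives the same numbers as summing the discounted rewards from each step, but in time linear in the episode length.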

Here is an implementation of **every-visit** Monte Carlo policy evaluation:

In [4]:
#MC evaluation of the returns to compute the value function
def monte_carlo_eval(number_of_sample):
    finished = False
    #initialize markov reward process
    smc = StudentMarkovRewardProcess()
    #values of the states
    values = np.zeros(7)
    #number of times each state was visited
    numbers = np.zeros(7)
    gamma = 0.9
    #Here we start sampling episodes
    for sample in tqdm(range(number_of_sample)):
        finished=False
        history=[(0,-2)]
        #making a full evaluation of the process
        while not finished:
            state, finished, reward = smc.step()
            history.append((state,reward))
        #computing returns (sum discounted rewards)
        returns=[]
        #offline updates
        for i in range(len(history)):
            state = history[i][0]
            current_return = sum([gamma**j*reward for j,(_,reward) in enumerate(history[i:])])
            returns.append((state,current_return))
        #updating values
        for state,local_return in returns:
            numbers[state]+=1
            values[state]+=(local_return-values[state])/numbers[state]
        smc.reboot()
    print(dict(zip(smc.titles,values)))

In [5]:
monte_carlo_eval(2000)

100%|██████████| 2000/2000 [00:03<00:00, 524.79it/s]

{'C1': -4.893740722266557, 'C2': 0.8574090139837567, 'C3': 4.028911854387622, 'Pass': 10.0, 'Pub': 2.002390065808052, 'FB': -7.520075693235476, 'Sleep': 0.0}





Here we managed to compute the value of the states without using any knowledge of the MRP. However, Monte Carlo does not allow learning on the fly (inside an episode): all the updates are made offline, meaning after the episode has finished. The update presented next allows online updates.
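
Because the transition matrix and rewards of the Student MRP happen to be known here, the sampled estimates can be sanity-checked against the exact solution of the Bellman equation $v = R + \gamma P v$. The sketch below assumes the same matrix, rewards and $\gamma = 0.9$ as the cells above; the variable names are illustrative.

```python
import numpy as np

# Transition matrix and rewards copied from the Student MRP above
P = np.array([[0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
              [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
R = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)
titles = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
gamma = 0.9

# Solve (I - gamma * P) v = R directly instead of sampling episodes
v_exact = np.linalg.solve(np.eye(7) - gamma * P, R)
print(dict(zip(titles, np.round(v_exact, 2))))
```

The sampled estimates should approach these values as the number of episodes grows. This check is only possible because the model is available; the sampling methods themselves never use `P` or `R` directly.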

## Temporal Difference evaluation

The goal here is to exploit the Markov property of the MDP (or MRP) to compute the state values. The method relies on the Bellman equation:

$$ v_{\pi}(s) = \mathbb E _{\pi} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) |S_t = s]$$

Therefore we can rewrite the update of the state value in the following manner:

$$ V(S_t) \leftarrow V(S_t) + \alpha( R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$$

To perform the update we only need two consecutive states, so it can be applied at every step of an episode.
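
As a small worked example of this update, here is a sketch of a single TD(0) step. The helper `td0_update`, the constant step size `alpha=0.1` and the sample transition are assumptions made for illustration.

```python
import numpy as np

def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update in place and return the TD error."""
    # TD target: immediate reward plus discounted estimate of the next state
    td_target = reward + gamma * values[next_state]
    td_error = td_target - values[state]
    values[state] += alpha * td_error
    return td_error

# Hypothetical single transition: Pass (index 3, reward 10) -> Sleep (index 6)
values = np.zeros(7)
td_error = td0_update(values, state=3, reward=10, next_state=6)
print(td_error, values[3])
```

Starting from zero estimates, the TD error is 10 and the value of Pass moves one tenth of the way towards its target.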


In [6]:
#TD(0) implemented for the student reward process
def temporal_difference_eval(number_of_sample):
    finished = False
    smc = StudentMarkovRewardProcess()
    values = np.zeros(7)
    numbers = np.zeros(7)
    gamma = 0.9
    for sample in tqdm(range(number_of_sample)):
        finished=False
        state=smc.state
        reward = smc.rewards[state]
        #making a full evaluation of the process
        while not finished:
            former_state,former_reward = state,reward 
            state, finished, reward = smc.step()
            #online updates, with step size alpha = 1/N(former_state)
            numbers[former_state]+=1
            values[former_state] += (former_reward + gamma*values[state] -values[former_state])/numbers[former_state]
        smc.reboot()
    print(dict(zip(smc.titles,values)))

In [8]:
temporal_difference_eval(4000)

100%|██████████| 4000/4000 [00:07<00:00, 529.46it/s]

{'C1': -4.251866548264395, 'C2': 1.0092612149890148, 'C3': 4.1734589428482005, 'Pass': 10.0, 'Pub': 2.1862308492278344, 'FB': -6.238723778685403, 'Sleep': 0.0}





Compared with Monte Carlo, TD reduces the variance of the target at the cost of some bias, since the target bootstraps from the current estimate $V(S_{t+1})$. We would like to get the best of both worlds, and a compromise can be found by using TD($\lambda$).

## TD($\lambda$)

We can obtain multiple TD targets by applying the Bellman equation iteratively:
$$ G_t^{(1)} =  R_{t+1} + \gamma V(S_{t+1}) $$
$$  G_t^{(2)} =  R_{t+1} + \gamma  R_{t+2} + \gamma^{2} V(S_{t+2})$$ 
$$ ... $$
$$ G_t^{(n)}= R_{t+1} + \gamma  R_{t+2} + ... + \gamma^{n-1}R_{t+n} + \gamma^{n} V(S_{t+n})$$

These targets can now be averaged with geometric weights of parameter $\lambda$:
$$ G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)} $$

By varying $\lambda$ we can adjust whether the update is closer to the temporal difference update ($\lambda = 0$) or to the Monte Carlo update ($\lambda \to 1$).
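
To make the role of $\lambda$ concrete, here is a minimal sketch that combines a list of n-step returns into a $\lambda$-return. The function name `lambda_return` and the sample numbers are hypothetical. It assumes the common convention for finite episodes in which the final (Monte Carlo) return receives all the remaining weight; the cell below renormalizes the truncated weights instead.

```python
import numpy as np

def lambda_return(n_step_returns, lam):
    """Combine the n-step returns G^(1), ..., G^(N) of a finite episode.

    Assumes the last entry is the full Monte Carlo return, which receives
    the remaining weight lam**(N-1) so that the weights sum to 1.
    """
    g = np.asarray(n_step_returns, dtype=float)
    n = len(g)
    # Geometric weights (1 - lam) * lam**(k) for k = 0 .. N-1
    weights = (1 - lam) * float(lam) ** np.arange(n)
    # The tail of the geometric series goes to the final return
    weights[-1] = float(lam) ** (n - 1)
    return float(np.dot(weights, g))

# Hypothetical n-step returns for one state
sample_returns = [1.0, 2.0, 4.0]
for lam in (0.0, 0.5, 1.0):
    print(lam, lambda_return(sample_returns, lam))
```

With $\lambda = 0$ only the one-step TD target is used, and with $\lambda = 1$ only the full Monte Carlo return is used.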


In [9]:
#TD lambda for the student reward process forward version
def td_lambda(number_of_sample):
    finished = False
    smc = StudentMarkovRewardProcess()
    values = np.zeros(7)
    numbers = np.zeros(7)
    gamma = 0.9
    # parameter lambda for the update
    lambda_td = 0.5
    for sample in tqdm(range(number_of_sample)):
        finished=False
        state=smc.state
        reward = smc.rewards[state]
        #making a full evaluation of the process
        history=[(0,-2)]
        while not finished:

            state, finished, reward = smc.step()
            history.append((state,reward))


        #offline updates (forward view)
        returns = []
        for i in range(len(history)):
            state = history[i][0]
            discounted_rewards = [reward*gamma**j for j,(_,reward) in enumerate(history[i:])]
            tds = [sum(discounted_rewards[:j])+gamma**j*values[state] for j,(state,reward) in enumerate(history[i:])][1:]
            if state != 6:
                #renormalization of the lambda weights: the episode is finite, so the truncated geometric weights are divided by (1 - lambda**n) to sum to 1
                td_target = (1-lambda_td)/(1-lambda_td**len(tds))*sum([td*lambda_td**j for j,td in enumerate(tds)])
            else:
                td_target=0
            returns.append((state,td_target))

        for state,td_target in returns:
            numbers[state]+=1
            values[state]+=(td_target-values[state])/numbers[state]

        smc.reboot()
    print(dict(zip(smc.titles,values)))

In [10]:
td_lambda(2000)

100%|██████████| 2000/2000 [00:06<00:00, 303.13it/s]

{'C1': -4.413332452197901, 'C2': 0.7623870306962792, 'C3': 3.9575381417084747, 'Pass': 10.0, 'Pub': 1.8784545599088731, 'FB': -6.35082787121543, 'Sleep': 0.0}





We implemented here the forward view of TD($\lambda$), which needs complete episodes. The backward view allows online updates.
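
As a hedged sketch of that backward view (not the implementation used in this lesson), the function below keeps one eligibility trace per state and updates every value at each step of the episode. The function name, the constant step size `alpha`, the seed and the choice of accumulating traces are assumptions made for illustration; the matrix and rewards are copied from the Student MRP.

```python
import numpy as np

P = np.array([[0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
              [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
R = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)
titles = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

def td_lambda_backward(P, R, start=0, terminal=6, episodes=1000,
                       gamma=0.9, lam=0.5, alpha=0.05, seed=0):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    rng = np.random.default_rng(seed)
    n_states = len(R)
    V = np.zeros(n_states)
    for _ in range(episodes):
        traces = np.zeros(n_states)  # traces are reset at each episode
        s = start
        while s != terminal:
            s_next = rng.choice(n_states, p=P[s])
            # TD error for the transition; R[s] is the reward for leaving s
            delta = R[s] + gamma * V[s_next] - V[s]
            # Decay all traces, then mark the state just visited
            traces *= gamma * lam
            traces[s] += 1.0
            # Online update: every state moves in proportion to its trace
            V += alpha * delta * traces
            s = s_next
    return V

V = td_lambda_backward(P, R)
print(dict(zip(titles, np.round(V, 2))))
```

Because a constant step size is used, the estimates keep fluctuating around the true values instead of settling exactly; a decaying step size would be one way to reduce that noise.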