# Planning by Dynamic programming
I spent less time coding on this part of the lesson, but it introduces key concepts such as policy evaluation and value iteration that will be used in the next lesson, so it is useful to understand what is at stake.

Dynamic programming assumes full knowledge of the MDP. This is a rather strong condition, as we don't necessarily know the transition probabilities or the rewards the agent will obtain by acting in a specific state. When the condition holds, however, it allows us to compute the value function without running episodes.

I take the example from the previous class to illustrate the concept.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

class StudentMarkovChain():
    """
    This class models the Markov process described in the lesson
    everything is hard coded, this class is not meant to be general
    """
    def __init__(self):
        #transition probabilities of each state
        self.transition = np.array([[0, 0.5 , 0 , 0 , 0 , 0.5 , 0 ],
                                    [0 , 0 , 0.8 , 0 , 0 , 0 , 0.2],
                                    [0 , 0 , 0 , 0.6 , 0.4 , 0 , 0],
                                    [0 , 0 , 0 , 0 , 0 , 0 , 1],
                                    [0.2 , 0.4 , 0.4 , 0 , 0 , 0 , 0],
                                    [0.1 , 0 , 0 , 0 , 0 , 0.9 , 0],
                                    [0 , 0 , 0 , 0 , 0 , 0 , 1]
                                    ])
        #name of states
        self.titles=["C1","C2","C3","Pass","Pub","FB","Sleep"]
        #first state
        self.state=0
        #the class will keep the history
        self.history = [self.titles[self.state]]

    #function to change state in the markov process
    def step(self):
        #Next state is picked following the probabilities of the transition matrix 
        self.state = np.random.choice(range(7),p=self.transition[self.state])
        self.history.append(self.titles[self.state])
        #if the state is not final (Sleep, index 6, is the terminal state)
        if self.state != 6:
            #we return the state, a bool telling if it is finished and a null reward
            return self.state,False,0
        else:
            return self.state,True,0
    #function to restart
    def reboot(self):
        self.state = 0
        self.history = [self.titles[self.state]]


#function to run the reward process until the terminal state and print its history
def main_markov():
    finished = False
    smc = StudentMarkovRewardProcess()
    while not finished:
        _,finished,_ = smc.step()
    print(smc.history)





class StudentMarkovRewardProcess(StudentMarkovChain):
    """
    Class to add rewards to the student markov chain
    it is inheriting the transition probabilities and the names from the markov chain
    """
    def __init__(self):

        StudentMarkovChain.__init__(self)
        # we are adding here the rewards of the different states
        self.rewards=[-2,-2,-2,10,1,-1,0]
        #and the shape of the history includes the rewards 
        self.history[-1]=(self.history[-1],self.rewards[self.state])

    # change the step function of the markov chain to add the rewards    
    def step(self):
        state,finished,_ = StudentMarkovChain.step(self)
        reward = self.rewards[state]
        self.history[-1]=(self.history[-1],reward)
        return self.state,finished,reward

    #function to restart
    def reboot(self):
        self.state = 0
        self.history = [(self.titles[self.state],self.rewards[self.state])]

## Policy evaluation

Knowing the transition probabilities, we can iteratively compute the value of the states in the Markov reward process. Policy evaluation lets us assess the value of a given policy on an MDP, since an MDP with a fixed policy can be interpreted as an MRP.
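
In matrix form, the iteration used in the next cell can be written as follows. The symbols are my own notation rather than names from the code: $v_k$ is the vector of state values after $k$ sweeps, $R$ the reward vector, $P$ the transition matrix and $\gamma$ the discount factor.

$$
v_{k+1} = R + \gamma P v_k, \qquad v_0 = 0
$$

For $\gamma < 1$ this update is a contraction, so $v_k$ converges to the unique fixed point $v = R + \gamma P v$, which is the value function of the MRP.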

In [2]:
def policy_evaluation(nb):
    mrp = StudentMarkovRewardProcess()
    #start from a value of zero for every state
    state_values = np.zeros(7)
    gamma = 0.9
    for _ in range(nb):
        #Bellman backup: immediate reward plus discounted expected value of the next state
        state_values = np.array(mrp.rewards) + gamma*np.dot(mrp.transition,state_values)
    print(dict(zip(mrp.titles,state_values)))

In [3]:
nb_iterations = 2000
policy_evaluation(nb_iterations)

{'C1': -5.012728910014519, 'C2': 0.9426552976939075, 'C3': 4.087021246797093, 'Pass': 10.0, 'Pub': 1.9083923522141468, 'FB': -7.637608431059506, 'Sleep': 0.0}


We find the same values as those obtained by solving the MRP directly.
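
To make the comparison concrete, here is a minimal sketch of the direct solution, assuming the same transition matrix, rewards and `gamma = 0.9` as in the cells above. The helper name `solve_mrp` is my own, and the data is copied out of the class so the snippet runs on its own. It solves the linear system $(I - \gamma P) v = R$ with `np.linalg.solve`.

```python
import numpy as np

# Transition matrix and rewards copied from StudentMarkovRewardProcess above
P = np.array([[0, 0.5, 0, 0, 0, 0.5, 0],
              [0, 0, 0.8, 0, 0, 0, 0.2],
              [0, 0, 0, 0.6, 0.4, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0.2, 0.4, 0.4, 0, 0, 0, 0],
              [0.1, 0, 0, 0, 0, 0.9, 0],
              [0, 0, 0, 0, 0, 0, 1]])
R = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)
titles = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
gamma = 0.9

def solve_mrp(P, R, gamma):
    """Solve v = R + gamma * P v, i.e. (I - gamma * P) v = R (hypothetical helper)."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

direct_values = solve_mrp(P, R, gamma)
print(dict(zip(titles, direct_values)))
```

The direct solve costs one linear system of size 7, which is fine here. The iterative version is the one that scales to larger state spaces and carries over to the MDP case.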

Using policy evaluation, we can assess the quality of the states under a given policy; knowing which states have the best value, we can act accordingly.
Here we will perform policy evaluation on the MDP with a random policy.
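
For reference, the update behind this evaluation is the Bellman expectation backup. The notation is mine, not taken from the code: $\pi(a \mid s)$ is the probability of picking action $a$ in state $s$, $R(s,a)$ the reward and $P(s' \mid s,a)$ the transition probability. For the random policy, $\pi$ is uniform over the actions $A(s)$ available in $s$.

$$
v_{k+1}(s) = \sum_{a \in A(s)} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_k(s') \Big], \qquad \pi(a \mid s) = \frac{1}{|A(s)|}
$$

The code further down updates the values in place during each sweep, so it uses the newest value of a state as soon as it is available. Both variants converge to the same fixed point.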

In [4]:
class MarkovStudentDecisionProcess():
    """
    Markov Decision Process of the Student graph presented in the lesson

    """
    def __init__(self):

        #fewer states than before
        self.titles=["C1","C2","C3","FB","Sleep"]
        self.state=0
        #transition probabilities depend on the action taken by the agent
        self.transition = [{"Study":[0,1,0,0,0],'Facebook':[0,0,0,1,0]},
                           {'Study':[0,0,1,0,0],'Sleep':[0,0,0,0,1]},
                           {'Study':[0,0,0,0,1],'Pub':[0.2,0.4,0.4,0,0]},
                           {'Facebook':[0,0,0,1,0],"Quit":[1,0,0,0,0]},
                           {"Sleep":[0,0,0,0,1]}
                            ]
        #the reward depends on the state and action taken by the agent
        self.reward=[{"Study":-2,'Facebook':-1},
                   {'Study':-2,'Sleep':0},
                   {'Study':10,'Pub':1},
                   {'Facebook':-1,"Quit":0},
                   {"Sleep":0}]


    #the step now depends on the action taken by the agent
    def step(self,action):
        reward = self.reward[self.state][action]
        self.state = np.random.choice(range(5),p=self.transition[self.state][action])
        #the episode ends when the agent chooses to sleep
        finished=(action == "Sleep")
        return self.state,reward,finished
    #no history stored here 
    def reboot(self):
        self.state = 0


class AgentRandom():
    def __init__(self):
        """
        The random agent has no preference: it just picks actions uniformly at random
        """
        self.actions = [["Study",'Facebook'],
                       ['Study','Sleep'],
                       ['Study','Pub'],
                       ['Facebook',"Quit"],
                       ["Sleep"]]

    #selecting action randomly                   
    def select_action(self,state):
        return np.random.choice(self.actions[state])

    #placeholder update: the random agent does not learn
    def update(self,**kwargs):
        pass



In [8]:
def policy_evaluation(nb_iteration):
    mdp = MarkovStudentDecisionProcess()
    agent = AgentRandom()
    state_values = np.zeros(5)
    gamma = 0.9
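    # performing policy evaluation (values are updated in place during each sweep)
    for _ in range(nb_iteration):
        for state in range(5):
            actions = agent.actions[state]
            # one-step lookahead value of each action available in this state
            action_values = [mdp.reward[state][action] + gamma*np.dot(mdp.transition[state][action],state_values) for action in actions]
            # the random policy gives every action the same probability
            state_values[state] = np.mean(action_values)
    print(state_values)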
        
        
    

In [10]:
policy_evaluation(2000)

[-1.48447749  2.15815786  7.01812859 -2.1236634   0.        ]


Using this evaluation, we can improve the policy by acting greedily with respect to the values and then evaluating the new policy again. This is called policy iteration, and it is guaranteed to converge to the optimal policy in a finite MDP.
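
As an illustration, here is a minimal sketch of policy iteration on the student MDP, assuming the same transitions, rewards and `gamma = 0.9` as above. The helper names (`q_value`, `evaluate`, `policy_iteration`) and the choice of starting policy are my own assumptions, and the MDP data is copied out of the class so the snippet runs on its own.

```python
import numpy as np

# Student MDP copied from MarkovStudentDecisionProcess above
transition = [{"Study": [0, 1, 0, 0, 0], "Facebook": [0, 0, 0, 1, 0]},
              {"Study": [0, 0, 1, 0, 0], "Sleep": [0, 0, 0, 0, 1]},
              {"Study": [0, 0, 0, 0, 1], "Pub": [0.2, 0.4, 0.4, 0, 0]},
              {"Facebook": [0, 0, 0, 1, 0], "Quit": [1, 0, 0, 0, 0]},
              {"Sleep": [0, 0, 0, 0, 1]}]
reward = [{"Study": -2, "Facebook": -1},
          {"Study": -2, "Sleep": 0},
          {"Study": 10, "Pub": 1},
          {"Facebook": -1, "Quit": 0},
          {"Sleep": 0}]
titles = ["C1", "C2", "C3", "FB", "Sleep"]
gamma = 0.9

def q_value(state, action, values):
    """One-step lookahead: immediate reward plus discounted expected next value."""
    return reward[state][action] + gamma * np.dot(transition[state][action], values)

def evaluate(policy, nb_iteration=500):
    """Iterative policy evaluation for a deterministic policy (one action per state)."""
    values = np.zeros(len(titles))
    for _ in range(nb_iteration):
        values = np.array([q_value(s, policy[s], values) for s in range(len(titles))])
    return values

def policy_iteration():
    # assumed starting point: the first action listed for each state
    policy = [next(iter(actions)) for actions in reward]
    while True:
        values = evaluate(policy)
        # greedy improvement: keep the action with the best one-step lookahead
        new_policy = [max(reward[s], key=lambda a: q_value(s, a, values))
                      for s in range(len(titles))]
        # stop once the greedy policy no longer changes
        if new_policy == policy:
            return policy, values
        policy = new_policy

optimal_policy, optimal_values = policy_iteration()
print(dict(zip(titles, optimal_policy)))
print(optimal_values)
```

Under these assumptions the loop settles after two rounds on studying in every class and quitting Facebook. The evaluation step uses a fixed number of sweeps for simplicity; a tolerance-based stopping rule would be the more usual choice.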

I will stop the scripts for this lesson here, even though it covers other key concepts, to focus in more detail on model-free prediction.