# Markov Decision Process


The lessons introduce the notion of a Markov Decision Process (MDP) in three steps. This notion is important because many RL problems can be formalised as MDPs.

## Student Markov Chain


A state $S_t$ is Markov if and only if:

$\mathbb P [S_{t+1}|S_{t}]=\mathbb P [S_{t+1}|S_{1},...,S_{t}]$


A Markov Process is a tuple $(\mathcal S ,\mathcal P )$:
* $\mathcal S$ is a finite set of states
* $\mathcal P$ is a state transition probability matrix: $\mathcal P _{ss'}= \mathbb P [S_{t+1}=s'|S_{t}=s]$
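
As a minimal sketch of what this definition implies (it is not part of the lessons), each row of $\mathcal P$ has to be a probability distribution, and the distribution over states after $n$ steps is the initial distribution multiplied by $\mathcal P^n$. The matrix used here is the student Markov chain introduced just below; the helper names `is_transition_matrix` and `distribution_after` are only illustrative.

```python
import numpy as np

# Student Markov chain; state order: C1, C2, C3, Pass, Pub, FB, Sleep
P = np.array([[0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
              [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])


def is_transition_matrix(P, tol=1e-9):
    """True if P is square, has no negative entry and every row sums to 1."""
    P = np.asarray(P, dtype=float)
    return bool(P.ndim == 2
                and P.shape[0] == P.shape[1]
                and np.all(P >= 0)
                and np.allclose(P.sum(axis=1), 1.0, atol=tol))


def distribution_after(P, start, n):
    """Distribution over states after n steps, starting from state index `start`."""
    mu0 = np.zeros(P.shape[0])
    mu0[start] = 1.0
    return mu0 @ np.linalg.matrix_power(P, n)


print(is_transition_matrix(P))
# after one step from C1 (index 0) the result is simply row 0 of P
print(distribution_after(P, start=0, n=1))
```

Because Sleep only transitions to itself, the probability mass returned by `distribution_after` moves towards the last entry as `n` grows.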



A running example in the course, which I implemented below, is the student Markov chain:
<p align="center">
	<img src="./Images/MP.png">
</p>
I implemented this Markov chain in Python to be able to probe the Markov process:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

class StudentMarkovChain():
    """
    This class models the Markov process described in the lesson
    everything is hard coded, this class is not meant to be general
    """
    def __init__(self):
        #transition probabilities of each state
        self.transition = np.array([[0, 0.5 , 0 , 0 , 0 , 0.5 , 0 ],
                                    [0 , 0 , 0.8 , 0 , 0 , 0 , 0.2],
                                    [0 , 0 , 0 , 0.6 , 0.4 , 0 , 0],
                                    [0 , 0 , 0 , 0 , 0 , 0 , 1],
                                    [0.2 , 0.4 , 0.4 , 0 , 0 , 0 , 0],
                                    [0.1 , 0 , 0 , 0 , 0 , 0.9 , 0],
                                    [0 , 0 , 0 , 0 , 0 , 0 , 1]
                                    ])
        #name of states
        self.titles=["C1","C2","C3","Pass","Pub","FB","Sleep"]
        #first state
        self.state=0
        #the class will keep the history
        self.history = [self.titles[self.state]]

    #function to change state in the markov process
    def step(self):
        #Next state is picked following the probabilities of the transition matrix 
        self.state = np.random.choice(range(7),p=self.transition[self.state])
        self.history.append(self.titles[self.state])
        #we return the new state and a bool telling if the episode is finished,
        #which happens once the terminal state "Sleep" (index 6) is reached
        return self.state,self.state == 6
    #function to restart
    def reboot(self):
        self.state = 0
        self.history = [self.titles[self.state]]


#function to run the markov chain
def main_markov():
    finished = False
    smc = StudentMarkovChain()
    while not finished:
        _,finished = smc.step()
    print(smc.history)


In [2]:
for i in range(10):
    main_markov()

['C1', 'C2', 'C3', 'Pub', 'C3', 'Pass', 'Sleep']
['C1', 'C2', 'C3', 'Pass', 'Sleep']
['C1', 'FB', 'C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'C2', 'C3', 'Pub', 'C2', 'Sleep']
['C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'C2', 'C3', 'Pub', 'C1', 'FB', 'FB', 'C1', 'C2', 'C3', 'Pub', 'C2', 'C3', 'Pub', 'C2', 'C3', 'Pub', 'C1', 'C2', 'C3', 'Pass', 'Sleep']
['C1', 'FB', 'FB', 'FB', 'FB', 'C1', 'C2', 'C3', 'Pass', 'Sleep']
['C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'C2', 'C3', 'Pub', 'C3', 'Pass', 'Sleep']
['C1', 'C2', 'C3', 'Pub', 'C2', 'Sleep']
['C1', 'FB', 'C1', 'C2', 'Sleep']
['C1', 'C2', 'C3', 'Pass', 'Sleep']
['C1', 'FB', 'FB', 'FB', 'C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'C2', 'C3', 'Pass', 'Sleep']


After probing the Markov process, you will see that students are inclined to spend an awful lot of time on Facebook.
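
To put a number on this observation, here is a minimal sketch that simulates many episodes and counts the average number of visits to each state per episode. It does not reuse the class above so that it stays self-contained; the names `TITLES`, `P` and `mean_visits`, the number of episodes and the seed are my own assumptions.

```python
import numpy as np

TITLES = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
# same transition matrix as in StudentMarkovChain
P = np.array([[0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
              [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])


def mean_visits(n_episodes=5000, seed=0):
    """Average number of visits to each state per episode, starting from C1."""
    rng = np.random.default_rng(seed)  # assumption: fixed seed for repeatability
    counts = np.zeros(len(TITLES))
    for _ in range(n_episodes):
        state = 0
        # the terminal state Sleep (index 6) ends the episode and is not counted
        while state != 6:
            counts[state] += 1
            state = rng.choice(7, p=P[state])
    return counts / n_episodes


visits = mean_visits()
for title, v in zip(TITLES, visits):
    print(f"{title}: {v:.2f} visits per episode")
```

Comparing the `FB` line with the others shows how much of an average episode is spent there.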

## Markov Reward Process

A Markov Reward Process is a tuple $(\mathcal S ,\mathcal P,\mathcal R ,\gamma)$ where:
* $\mathcal S$ is a finite set of states
* $\mathcal P$ is a state transition probability matrix: $\mathcal P _{ss'}= \mathbb P [S_{t+1}=s'|S_{t}=s]$
* $\mathcal R$ is a reward function, $\mathcal R _s = \mathbb E [R_{t+1}|S_t = s]$
* $\gamma$ is a discount factor, $\gamma \in [0,1]$

Here we present the Student reward process:

<p align="center">
	<img src="./Images/MRP.png">
</p>


In [7]:

class StudentMarkovRewardProcess(StudentMarkovChain):
    """
    Class to add rewards to the student markov chain
    it is inheriting the transition probabilities and the names from the markov chain
    """
    def __init__(self):
        """
        Constructor 
        """

        StudentMarkovChain.__init__(self)
        # we are adding here the rewards of the different states
        self.rewards=[-2,-2,-2,10,1,-1,0]
        #the entries of the history are now (state name, reward) pairs
        self.history[-1]=(self.history[-1],self.rewards[self.state])

    # override the step function of the markov chain to also return the reward of the state just entered
    def step(self):
        state,finished = StudentMarkovChain.step(self)
        reward = self.rewards[state]
        self.history[-1]=(self.history[-1],reward)
        return self.state,finished,reward

    #function to restart
    def reboot(self):
        self.state = 0
        self.history = [(self.titles[self.state],self.rewards[self.state])]
        
#function to run the markov reward process
def main_markov_reward():
    finished = False
    srp = StudentMarkovRewardProcess()
    while not finished:
        _,finished,_ = srp.step()
    print(srp.history)


In [10]:
#the history now has a reward attached to it
for i in range(10):
    main_markov_reward()

[('C1', -2), ('C2', -2), ('C3', -2), ('Pass', 10), ('Sleep', 0)]
[('C1', -2), ('C2', -2), ('Sleep', 0)]
[('C1', -2), ('C2', -2), ('Sleep', 0)]
[('C1', -2), ('C2', -2), ('C3', -2), ('Pass', 10), ('Sleep', 0)]
[('C1', -2), ('C2', -2), ('C3', -2), ('Pass', 10), ('Sleep', 0)]
[('C1', -2), ('C2', -2), ('C3', -2), ('Pub', 1), ('C1', -2), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('C1', -2), ('FB', -1), ('C1', -2), ('FB', -1), ('FB', -1), ('FB', -1), ('C1', -2), ('C2', -2), ('C3', -2), ('Pub', 1), ('C3', -2), ('Pub', 1), ('C3', -2), ('Pass', 10), ('Sleep', 0)]
[('C1', -2), (

A central notion in reinforcement learning is the return $G_{t}$, the total discounted reward accumulated from time step $t$ until the end of the episode:
$$ G_{t}= R_{t+1}+\gamma R_{t+2}+ ... = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$
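
As a worked example, take the sampled episode C1, C2, C3, Pass, Sleep from the output above, whose rewards are $-2, -2, -2, 10, 0$, and assume $\gamma = 0.9$ (the value used in the code of this section). Each reward is the one attached to the state being left, so the return from the first state is:
$$ G_{0} = -2 + 0.9 \times (-2) + 0.9^{2} \times (-2) + 0.9^{3} \times 10 + 0.9^{4} \times 0 = -2 - 1.8 - 1.62 + 7.29 = 1.87 $$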

The state value function of an MRP is the expected return starting from state $s$:
$$ v(s) = \mathbb E [G_{t}|S_t = s]$$
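
Since $v(s)$ is an expectation, one simple way to approximate it is to average the returns of many sampled episodes that start in $s$. The sketch below does this for the student reward process. It is a plain Monte Carlo estimate under my own assumptions ($\gamma = 0.9$, 5000 episodes, a fixed seed), and the names `sample_return` and `estimate_value` are illustrative.

```python
import numpy as np

# state order: C1, C2, C3, Pass, Pub, FB, Sleep
P = np.array([[0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
              [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
REWARDS = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])


def sample_return(rng, start=0, gamma=0.9):
    """Discounted return of one sampled episode starting in state `start`."""
    state, discount, g = start, 1.0, 0.0
    # Sleep (index 6) is terminal and has reward 0, so it adds nothing
    while state != 6:
        g += discount * REWARDS[state]  # reward collected when leaving `state`
        discount *= gamma
        state = rng.choice(7, p=P[state])
    return g


def estimate_value(start=0, gamma=0.9, n_episodes=5000, seed=0):
    """Monte Carlo estimate of v(start): the mean of sampled returns."""
    rng = np.random.default_rng(seed)  # assumption: fixed seed for repeatability
    returns = [sample_return(rng, start, gamma) for _ in range(n_episodes)]
    return float(np.mean(returns))


v_c1 = estimate_value()
print(f"Estimated v(C1) with gamma=0.9: {v_c1:.2f}")
```

The estimate gets less noisy as `n_episodes` grows, at the cost of a longer run time.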

In [21]:
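def compute_return_markov_reward():
    #discount factor
    gamma =0.9
    finished = False
    srp = StudentMarkovRewardProcess()
    #run one episode from C1 until the terminal state Sleep
    while not finished:
        _,finished,_ = srp.step()
    print(srp.history)
    #discounted sum of the rewards along the episode, j=0 being the start state C1
    print("Return of state 1: ",sum(gamma**j*reward for j,(_,reward) in enumerate(srp.history)))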

In [22]:
compute_return_markov_reward()

[('C1', -2), ('FB', -1), ('C1', -2), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('FB', -1), ('C1', -2), ('C2', -2), ('C3', -2), ('Pass', 10), ('Sleep', 0)]
Return of state 1:  -8.45756140197053
