**Learning Markov Decision Process (MDP) Algorithms with the MDPToolbox Python Package**

In [1]:
! pip install pymdptoolbox

Successfully installed pymdptoolbox-4.0b3


In [2]:
import mdptoolbox.example
import mdptoolbox.mdp
import numpy as np

**Forest Management Example**


*   The forest can be young, middle-aged, or old (states 0, 1, 2)
*   Each year you choose an action: wait (0) or cut (1)
*   If you wait, the forest grows one stage older (s → s + 1); an old forest stays old
*   Each year you wait, there is a 10% chance (p = 0.1) that the whole forest burns down and is reset to the young state (0), losing its accumulated growth
*   If you cut, the forest is also reset to the young state (0)
*   Cutting pays 0 points for a young forest, 1 point for a middle-aged one, and 2 points for an old one
*   Waiting in the oldest state pays 4 points; waiting in any other state pays nothing

What's the best strategy, given these facts?
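The standard way to make "best strategy" precise is the Bellman optimality equation. With discount factor $\gamma$, transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a)$, the optimal value $V^*$ and the optimal policy $\pi^*$ satisfy:

$$
V^*(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big], \qquad
\pi^*(s) = \arg\max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
$$

For the old state ($s = 2$), the rules above give the two candidates

$$
\text{wait: } 4 + \gamma \big( 0.1\, V^*(0) + 0.9\, V^*(2) \big), \qquad \text{cut: } 2 + \gamma\, V^*(0),
$$

so whether waiting beats cutting depends on $\gamma$ and on the values of the other states.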


In [5]:
# Parameters of mdptoolbox.example.forest():
# S:         number of states; an integer greater than 1. Default: 3.
# r1:        reward when the forest is in its oldest state and action 'Wait' is performed. Default: 4.
# r2:        reward when the forest is in its oldest state and action 'Cut' is performed. Default: 2.
# p:         probability of a wildfire, in the open interval (0, 1). Default: 0.1.
# is_sparse: if True, the transition probability matrices are returned in sparse format; otherwise they are dense. Default: False.


In [24]:
# Generate the transition probability array P (shape A × S × S) and the reward matrix R (shape S × A) for the forest-management problem described above.
P, R = mdptoolbox.example.forest(S=3, r1=4, r2=2, p=0.1, is_sparse=False)
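To see what `forest()` produces, here is a minimal plain-NumPy sketch that rebuilds the same P and R from the rules listed earlier. The function name `build_forest` is hypothetical and this is not the toolbox's source code; it was written to reproduce the arrays printed in the cells below, with action 0 = wait and action 1 = cut.

```python
import numpy as np

def build_forest(S=3, r1=4, r2=2, p=0.1):
    """Hypothetical NumPy re-implementation of the forest model (a sketch).

    Returns P with shape (A, S, S) and R with shape (S, A),
    where action 0 = wait and action 1 = cut.
    """
    A = 2
    P = np.zeros((A, S, S))
    # Action 0 (wait): with probability p a fire resets the forest to state 0;
    # otherwise it ages by one stage, and the oldest state stays oldest.
    P[0, :, 0] = p
    for s in range(S - 1):
        P[0, s, s + 1] = 1 - p
    P[0, S - 1, S - 1] = 1 - p
    # Action 1 (cut): the forest always returns to the youngest state.
    P[1, :, 0] = 1.0

    R = np.zeros((S, A))
    R[S - 1, 0] = r1      # waiting in the oldest state
    R[1:S - 1, 1] = 1.0   # cutting a forest that is neither young nor old
    R[S - 1, 1] = r2      # cutting the oldest forest
    return P, R

P, R = build_forest()
print(P[0])
print(R)
```

Changing `S`, `r1`, `r2` or `p` in the call shows how each parameter from the definitions cell reshapes the model.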

In [6]:
P[0]  # action 0 (wait): P[0][s][s'] = probability of moving from state s to s'

array([[0.1, 0.9, 0. ],
       [0.1, 0. , 0.9],
       [0.1, 0. , 0.9]])

In [8]:
P[1]  # action 1 (cut): every state goes back to state 0

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
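Each row of `P[a]` is a probability distribution over next states, so every row must sum to 1. A quick NumPy check on the arrays printed above (copied by hand here, so this is a sketch rather than a call into the toolbox; pymdptoolbox also offers its own validation helper, `mdptoolbox.util.check(P, R)`):

```python
import numpy as np

# Transition arrays copied from the outputs above: P[a][s][s'].
P = np.array([
    [[0.1, 0.9, 0.0],   # action 0 (wait)
     [0.1, 0.0, 0.9],
     [0.1, 0.0, 0.9]],
    [[1.0, 0.0, 0.0],   # action 1 (cut)
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]],
])

# Sum over the next-state axis: one total per (action, current state) pair.
row_sums = P.sum(axis=2)
print(row_sums.shape)               # (2, 3)
print(np.allclose(row_sums, 1.0))   # True
```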

In [9]:
print(P[0][0][1])  # wait in state 0 (young) -> state 1 (middle-aged)

0.9


In [10]:
print(P[0][2][0])  # wait in state 2 (old) -> state 0: the fire probability

0.1
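The index order is `P[action][current_state][next_state]`, so a single row such as `P[0][2]` is the distribution of next states after waiting in the old state. The sketch below samples from that row to show the 10% fire rate empirically; the helper name `sample_next_state` and the seed are illustrative assumptions, not part of the toolbox.

```python
import numpy as np

# Wait-action transition matrix (P[0] above); rows are current states.
P_wait = np.array([[0.1, 0.9, 0.0],
                   [0.1, 0.0, 0.9],
                   [0.1, 0.0, 0.9]])

rng = np.random.default_rng(seed=0)  # seeded so the run is repeatable

def sample_next_state(P_action, s, rng):
    """Hypothetical helper: draw s' with probability P_action[s, s']."""
    return rng.choice(len(P_action), p=P_action[s])

# From the old state (s = 2), roughly 10% of draws should be a fire (s' = 0).
draws = np.array([sample_next_state(P_wait, 2, rng) for _ in range(10_000)])
fire_rate = np.mean(draws == 0)
print(f"empirical fire rate: {fire_rate:.3f}")  # close to P_wait[2, 0] = 0.1
```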


In [11]:
R  # R[s][a]: rows are states, columns are actions (0 = wait, 1 = cut)

array([[0., 0.],
       [0., 1.],
       [4., 2.]])

In [12]:
np.sum(np.multiply(R.T[0], [0, 0, 1])) # oldest (S2)

4.0

In [16]:
np.sum(np.multiply(R.T[0], [1, 0, 0])) # youngest (S0)

0.0

In [17]:
np.sum(np.multiply(R.T[0], [0, 1, 0])) # middle-aged (S1)

0.0
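Multiplying the wait column `R.T[0]` by a one-hot vector and summing is a dot product that picks out a single entry, so it is equivalent to indexing `R` directly. A small NumPy sketch using the reward matrix printed above:

```python
import numpy as np

# Reward matrix from the output above: R[s, a] with a = 0 (wait), 1 (cut).
R = np.array([[0., 0.],
              [0., 1.],
              [4., 2.]])

one_hot_old = np.array([0, 0, 1])                      # selects state 2 (old)
via_one_hot = np.sum(np.multiply(R.T[0], one_hot_old))  # the notebook's approach
via_dot = R[:, 0] @ one_hot_old                         # same thing as a dot product
via_index = R[2, 0]                                     # direct lookup: state 2, action 0

print(via_one_hot, via_dot, via_index)  # 4.0 4.0 4.0
```

Direct indexing (`R[s, a]`) is the simplest form; the one-hot form generalizes to any probability vector over states, giving an expected reward.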

In [18]:
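# Q-learning estimates the optimal policy from simulated transitions, so the
# learned policy can vary between runs. discount=0.1 weights future rewards lightly.
model = mdptoolbox.mdp.QLearning(P, R, discount=0.1)
model.run()
model.policy  # one action per state: 0 = wait, 1 = cut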

(0, 1, 0)
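`model.run()` hides the learning loop. The following is a plain-NumPy sketch of tabular Q-learning on the same forest model; it is not the toolbox's implementation. Uniformly random exploration, a 1/n(s, a) step size, 20,000 steps, the seed, and the function name `q_learning` are all assumptions made for this sketch.

```python
import numpy as np

# Forest model with its defaults (values copied from the outputs above).
# Actions: 0 = wait, 1 = cut.
P = np.array([
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.1, 0.0, 0.9]],
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]])

def q_learning(P, R, discount, n_steps=20_000, seed=0):
    """Hypothetical tabular Q-learning sketch with random exploration."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_steps):
        a = rng.integers(n_actions)               # explore uniformly at random
        s_next = rng.choice(n_states, p=P[a, s])  # simulate one transition
        # Bootstrapped target: immediate reward plus discounted best next value.
        target = R[s, a] + discount * Q[s_next].max()
        visits[s, a] += 1
        Q[s, a] += (target - Q[s, a]) / visits[s, a]  # step size 1 / n(s, a)
        s = s_next
    policy = tuple(int(a) for a in Q.argmax(axis=1))
    return Q, policy

Q, policy = q_learning(P, R, discount=0.1)
print(policy)
```

Because Q-learning only sees sampled transitions, its estimates approach the true action values as each state-action pair is visited more often; with a discount this low the gaps between actions are large, so the greedy policy settles quickly.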

In [19]:
model.policy[0] # should we wait(0) or cut(1) for the youngest one?

0

In [21]:
model.policy[1]  # and for the middle-aged one?

1
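Since the Q-learning result is sampled, an exact cross-check is useful. The sketch below runs value iteration (repeatedly applying the Bellman optimality update) in plain NumPy on the same model; the toolbox provides its own `mdptoolbox.mdp.ValueIteration`, but the function `value_iteration`, its tolerance, and its iteration cap here are assumptions for illustration.

```python
import numpy as np

# Forest model with its defaults (values copied from the outputs above).
P = np.array([
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.1, 0.0, 0.9]],
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]])

def value_iteration(P, R, discount, tol=1e-10, max_iter=10_000):
    """Hypothetical sketch: iterate the Bellman optimality update to convergence."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + discount * sum_{s'} P[a, s, s'] * V[s']
        Q = R + discount * (P @ V).T
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = tuple(int(a) for a in Q.argmax(axis=1))
    return V, policy

print(value_iteration(P, R, discount=0.1)[1])  # (0, 1, 0)
print(value_iteration(P, R, discount=0.9)[1])  # (0, 0, 0)
```

With discount 0.1 the optimal policy is to wait while young, cut when middle-aged, and wait when old, which matches the Q-learning output above. With discount 0.9 the future 4-point reward is worth enough that waiting becomes optimal in every state.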