# pymdptoolbox tutorial

In this notebook, we show how to take an MDP graph and represent it using pymdptoolbox in Python. We then use value iteration to find the optimal policy and the expected value of each state of the given MDP.

## The problem
A forest is managed with two actions: ‘Wait’ and ‘Cut’. An action is chosen each year, with the primary objective of maintaining an old forest for wildlife and the secondary objective of making money by selling cut wood. Each year there is a probability p that a fire burns the forest.

### Visual Representation
![MDP graph of the forest management problem](./mdp.jpeg)

## Step 1
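The first thing we need to do is set up matrices for the transition probabilities and the rewards.

The transition probabilities are represented in a (num actions x num states x num states) array, so `prob[a][s][t]` is the probability of moving from state `s` to state `t` when action `a` is taken.

The rewards are represented in a (num states x num actions) array, so `rewards[s][a]` is the reward for taking action `a` in state `s`.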

In [None]:
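import numpy as np

# Transition probabilities, indexed as prob[action, state, next_state].
prob = np.zeros((2, 5, 5))

# Action 0 ('Wait'): with probability 0.3 the forest returns to state 0,
# otherwise it advances one state (the last state stays where it is).
prob[0] = [[0.3, 0.7, 0., 0., 0.],
           [0.3, 0.0, 0.7, 0., 0.],
           [0.3, 0.0, 0., 0.7, 0.],
           [0.3, 0.0, 0., 0., 0.7],
           [0.3, 0.0, 0., 0., 0.7]]

# Action 1 ('Cut'): the forest always returns to state 0.
prob[1] = [[1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.]]

# Rewards, indexed as rewards[state, action].
rewards = np.zeros((5, 2))
rewards[0] = [0., 0.]
rewards[1] = [0., 1.]
rewards[2] = [0., 1.]
rewards[3] = [0., 1.]
rewards[4] = [0.3, 2.]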
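
The hand-written transition matrices above follow a regular pattern, so they could also be generated from the fire probability p. The sketch below is one possible way to do that. `build_transitions` is a hypothetical helper (not part of pymdptoolbox), and it assumes that action 0 is ‘Wait’ and action 1 is ‘Cut’, which is what the matrices above suggest.

```python
import numpy as np


def build_transitions(n_states, p):
    """Return a (2, n_states, n_states) array of transition probabilities.

    Assumed action order: 0 = 'Wait', 1 = 'Cut'.
    """
    prob = np.zeros((2, n_states, n_states))

    # Wait: with probability p a fire sends the forest back to state 0 ...
    prob[0, :, 0] = p
    # ... otherwise it ages by one state; the oldest state stays the oldest.
    for s in range(n_states):
        prob[0, s, min(s + 1, n_states - 1)] += 1.0 - p

    # Cut: the forest always returns to state 0.
    prob[1, :, 0] = 1.0
    return prob


prob = build_transitions(5, 0.3)
print(prob.shape)
print(prob[0])
```

With `n_states=5` and `p=0.3` this should reproduce the two matrices typed in by hand, which makes it easier to experiment with other fire probabilities.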

## Step 2
Now we need to set up the MDP in pymdptoolbox and run value iteration to get the expected values and the optimal policy.

In [None]:
import mdptoolbox

# The third argument (0.9) is the discount factor.
vi = mdptoolbox.mdp.ValueIteration(prob, rewards, 0.9)
vi.run()
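
To give a feel for what value iteration computes, here is a minimal NumPy sketch of the Bellman update. It is not pymdptoolbox's implementation: the function name `value_iteration`, the `tol` and `max_iter` parameters, and the stopping rule are assumptions made for illustration, so its numbers may differ slightly from the library's.

```python
import numpy as np


def value_iteration(prob, rewards, discount, tol=1e-8, max_iter=10_000):
    """Illustrative value iteration.

    prob has shape (actions, states, states), rewards has shape (states, actions).
    Returns (policy, values), one entry per state.
    """
    n_states = prob.shape[1]
    values = np.zeros(n_states)
    for _ in range(max_iter):
        # q[s, a] = rewards[s, a] + discount * sum_t prob[a, s, t] * values[t]
        q = rewards + discount * (prob @ values).T
        new_values = q.max(axis=1)
        converged = np.max(np.abs(new_values - values)) < tol
        values = new_values
        if converged:
            break
    # Greedy policy: the action with the highest q-value in each state.
    policy = q.argmax(axis=1)
    return policy, values


# Tiny made-up MDP: one state, two actions, both actions stay in that state.
# Action 1 pays 1 each step, so the value approaches 1 / (1 - 0.5) = 2.
demo_prob = np.array([[[1.0]], [[1.0]]])
demo_rewards = np.array([[0.0, 1.0]])
demo_policy, demo_values = value_iteration(demo_prob, demo_rewards, 0.5)
print(demo_policy, demo_values)
```

The same function accepts the `prob` and `rewards` arrays from Step 1, which gives a way to cross-check the library's result.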

Then we can extract the optimal policy and the expected value of each state:

In [None]:
optimal_policy = vi.policy
expected_values = vi.V
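
The policy holds one action index per state, which can be hard to read at a glance. The sketch below pairs each state with an action name and its value. `describe_policy` and `ACTION_NAMES` are hypothetical names, and the mapping of index 0 to ‘Wait’ and index 1 to ‘Cut’ is an assumption based on the matrices from Step 1.

```python
# Assumed order: action 0 = 'Wait', action 1 = 'Cut'.
ACTION_NAMES = ("Wait", "Cut")


def describe_policy(policy, values):
    """Return one human-readable line per state."""
    lines = []
    for state, (action, value) in enumerate(zip(policy, values)):
        lines.append(f"state {state}: {ACTION_NAMES[action]} (value {value:.3f})")
    return lines


# Made-up inputs, only to show the output format; these are not results
# from the forest MDP.
for line in describe_policy((0, 1), (1.5, 2.0)):
    print(line)
```

In the notebook, the call would be `describe_policy(optimal_policy, expected_values)`.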

## Putting it all together

Here is the final code:

In [None]:
import mdptoolbox
import numpy as np

prob = np.zeros((2, 5, 5))

prob[0] = [[0.3, 0.7, 0., 0., 0.],
           [0.3, 0.0, 0.7, 0., 0.],
           [0.3, 0.0, 0., 0.7, 0.],
           [0.3, 0.0, 0., 0., 0.7],
           [0.3, 0.0, 0., 0., 0.7]]

prob[1] = [[1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.]]

rewards = np.zeros((5, 2))
rewards[0] = [0., 0.]
rewards[1] = [0., 1.]
rewards[2] = [0., 1.]
rewards[3] = [0., 1.]
rewards[4] = [0.3, 2.]

vi = mdptoolbox.mdp.ValueIteration(prob, rewards, 0.9)
vi.run()

optimal_policy = vi.policy
expected_values = vi.V

print(optimal_policy)
print(expected_values)
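
If the results look off, a common cause is a malformed input, for example a row of probabilities that does not sum to 1. The sketch below is a minimal sanity check for the array layout used in this notebook. `validate_mdp` is a hypothetical helper, and it only covers the (num states x num actions) reward layout shown above.

```python
import numpy as np


def validate_mdp(prob, rewards, atol=1e-8):
    """Raise ValueError if the arrays do not describe a valid MDP."""
    prob = np.asarray(prob, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    if prob.ndim != 3 or prob.shape[1] != prob.shape[2]:
        raise ValueError("prob must have shape (actions, states, states)")
    n_actions, n_states, _ = prob.shape
    if rewards.shape != (n_states, n_actions):
        raise ValueError("rewards must have shape (states, actions)")
    if np.any(prob < 0):
        raise ValueError("probabilities must be non-negative")
    if not np.allclose(prob.sum(axis=2), 1.0, atol=atol):
        raise ValueError("each row of each transition matrix must sum to 1")


# Tiny made-up MDP with two states and two actions.
demo_prob = np.array([[[0.5, 0.5], [0.0, 1.0]],
                      [[1.0, 0.0], [1.0, 0.0]]])
demo_rewards = np.zeros((2, 2))
validate_mdp(demo_prob, demo_rewards)  # passes without raising
```

Calling `validate_mdp(prob, rewards)` before constructing the solver gives an early, readable error message.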