# pymdptoolbox Introduction Tutorial

In this notebook, we show how to take an MDP graph and represent it with pymdptoolbox in Python. We then use Value Iteration to find the optimal policy and the expected value of each state of the given MDP.

## The problem
A forest is managed with two actions: 'Wait' and 'Cut'. An action is chosen each year with two objectives: first, to maintain an old forest for wildlife, and second, to make money by selling cut wood. Each year, there is a probability p that a fire burns the forest.
In the visual representation, the blue lines are transitions labeled with their probabilities. Each year there is a 30% chance (p = 0.3) that a fire burns the forest and returns it to state 0 (labeled 1 in the graph), and a 70% chance that the forest survives and advances to the next state, or stays in state 4 once it is there.
In the visual representation, the black lines are rewards. For the first four years, waiting earns no reward and the forest still risks burning, while cutting earns a reward of 1 (the reward array in Step 1 assigns 0 to cutting in state 0). Once the forest reaches its fifth year (state 4), waiting maintains wildlife (reward = 0.3) and cutting pays double (reward = 2).
### Visual Representation 
![MDP graph of the forest management problem](./mdp.jpeg "Forest management MDP")

## Step 1
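The first thing we need to do is set up matrices for the transition probabilities and the rewards.

The transition probabilities are stored in a (num actions x num states x num states) array, here 2 x 5 x 5, where `prob[a][s][s2]` is the probability of moving from state `s` to state `s2` under action `a`.

The rewards are stored in a (num states x num actions) array, here 5 x 2, where `rewards[s][a]` is the reward for taking action `a` in state `s`.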

In [None]:
import numpy as np

prob = np.zeros((2, 5, 5))  # one 5 x 5 transition matrix per action (0 = Wait, 1 = Cut)

prob[0] = [[0.3, 0.7, 0., 0., 0.],
           [0.3, 0.0, 0.7, 0., 0.],
           [0.3, 0.0, 0., 0.7, 0.],
           [0.3, 0.0, 0., 0., 0.7],
           [0.3, 0.0, 0., 0., 0.7]]

prob[1] = [[1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.]]

rewards = np.zeros((5, 2))
rewards[0] = [0., 0.]
rewards[1] = [0., 1.]
rewards[2] = [0., 1.]
rewards[3] = [0., 1.]
rewards[4] = [0.3, 2.]
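Before handing these arrays to a solver, it can help to confirm that they describe a valid MDP: each row of every transition matrix should sum to 1, and the reward array should have one row per state and one column per action. The sketch below is a plain NumPy check, not part of pymdptoolbox; the helper name `is_valid_mdp` and the loop that rebuilds the arrays are assumptions made for illustration.

```python
import numpy as np

def is_valid_mdp(prob, rewards, tol=1e-8):
    """Hypothetical helper: check array shapes and that every transition row sums to 1."""
    prob = np.asarray(prob, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Expect (num_actions, num_states, num_states).
    if prob.ndim != 3 or prob.shape[1] != prob.shape[2]:
        return False
    num_actions, num_states, _ = prob.shape
    # Expect (num_states, num_actions).
    if rewards.shape != (num_states, num_actions):
        return False
    # Probabilities cannot be negative.
    if (prob < 0).any():
        return False
    # Each row is a distribution over next states.
    return bool(np.allclose(prob.sum(axis=2), 1.0, atol=tol))

# Rebuild the same forest arrays as above, using a loop instead of literals.
prob = np.zeros((2, 5, 5))
prob[0, :, 0] = 0.3                      # Wait: fire sends the forest back to state 0
for s in range(5):
    prob[0, s, min(s + 1, 4)] = 0.7      # Wait: otherwise advance (state 4 stays put)
prob[1, :, 0] = 1.0                      # Cut: always return to state 0

rewards = np.array([[0.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0],
                    [0.0, 1.0],
                    [0.3, 2.0]])

print(is_valid_mdp(prob, rewards))
```

A check like this catches the most common mistake when typing matrices by hand, which is a row whose entries do not add up to 1.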

## Step 2
Now we need to set up the MDP in pymdptoolbox and run Value Iteration to get the expected values and the optimal policy. The third argument, 0.9, is the discount factor.

In [None]:
import mdptoolbox
vi = mdptoolbox.mdp.ValueIteration(prob, rewards, 0.9)
vi.run()

Then we can extract the optimal policy and the expected value of each state:

In [None]:
optimal_policy = vi.policy
expected_values = vi.V
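To make clearer what Value Iteration computes, here is a minimal NumPy sketch of the Bellman update that the algorithm repeats until the values stop changing. It is an illustration only and not pymdptoolbox's implementation: the function name `value_iteration`, the tolerance, and the iteration cap are assumptions, and the values it returns may differ slightly from `vi.V` because the library applies its own stopping rule.

```python
import numpy as np

def value_iteration(prob, rewards, discount, tol=1e-8, max_iter=10_000):
    """Illustrative value iteration; prob is (A, S, S), rewards is (S, A)."""
    num_actions, num_states, _ = prob.shape
    values = np.zeros(num_states)
    for _ in range(max_iter):
        # q[s, a] = rewards[s, a] + discount * sum over s2 of prob[a, s, s2] * values[s2]
        q = rewards + discount * (prob @ values).T
        new_values = q.max(axis=1)
        converged = np.max(np.abs(new_values - values)) < tol
        values = new_values
        if converged:
            break
    # The greedy policy picks the best action in each state.
    q = rewards + discount * (prob @ values).T
    return q.argmax(axis=1), values

# Same forest MDP as in Step 1.
prob = np.zeros((2, 5, 5))
prob[0, :, 0] = 0.3
for s in range(5):
    prob[0, s, min(s + 1, 4)] = 0.7
prob[1, :, 0] = 1.0

rewards = np.array([[0.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0],
                    [0.0, 1.0],
                    [0.3, 2.0]])

policy, values = value_iteration(prob, rewards, 0.9)
print(policy)
print(values)
```

Each entry of `policy` is an action index (0 for Wait, 1 for Cut), and each entry of `values` is the discounted return expected when starting in that state and acting greedily.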

## Putting it all together

Here is the final code:

In [None]:
import mdptoolbox
import numpy as np

prob = np.zeros((2, 5, 5))

prob[0] = [[0.3, 0.7, 0., 0., 0.],
           [0.3, 0.0, 0.7, 0., 0.],
           [0.3, 0.0, 0., 0.7, 0.],
           [0.3, 0.0, 0., 0., 0.7],
           [0.3, 0.0, 0., 0., 0.7]]

prob[1] = [[1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.]]

rewards = np.zeros((5, 2))
rewards[0] = [0., 0.]
rewards[1] = [0., 1.]
rewards[2] = [0., 1.]
rewards[3] = [0., 1.]
rewards[4] = [0.3, 2.]

vi = mdptoolbox.mdp.ValueIteration(prob, rewards, 0.9)
vi.run()

optimal_policy = vi.policy
expected_values = vi.V

print(optimal_policy)
print(expected_values)
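The printed policy is a sequence of action indices, one per state, which is easier to read once each index is mapped back to its action name. The sketch below shows one way to do that. The helper `describe_policy` and the tuple `example_policy` are hypothetical; the tuple is a placeholder for illustration, not the solver's actual result, and in the notebook you would pass `optimal_policy` instead.

```python
ACTION_NAMES = ("Wait", "Cut")  # index 0 = Wait, index 1 = Cut, matching prob[0] and prob[1]

def describe_policy(policy, action_names=ACTION_NAMES):
    """Hypothetical helper: turn action indices into readable 'state -> action' lines."""
    return [f"state {state}: {action_names[action]}"
            for state, action in enumerate(policy)]

# Placeholder policy for illustration only, not the output of Value Iteration.
example_policy = (0, 1, 0, 1, 0)

for line in describe_policy(example_policy):
    print(line)
```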