# pymdptoolbox Introduction Tutorial

In this notebook, we will show how to take an MDP graph and represent it using pymdptoolbox in Python. Then we will use Value Iteration to find the optimal policy and expected value of the given MDP.

## The problem
The game DieN is played in the following way.
Consider a die with N sides (where N is an integer greater than 1) and a nonempty set B of integers. The rules of the game are:
1. You start with 0 dollars.
2. Roll an N-sided die with a different number from 1 to N printed on each side.
   a. If you roll a number not in B, you receive that many dollars (e.g., if you roll the number 2 and 2 is not in B, then you receive 2 dollars).
   b. If you roll a number in B, then you lose all of your obtained money and the game ends.
3. After you roll the die (and don’t roll a number in B), you have the option to quit the game. If you quit, you keep all the money you’ve earned to that point. If you continue to roll, go back to step 2.
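
To make the rules concrete, here is a minimal simulation sketch of a single game. The function name `play_dien` and the `stop_at` threshold (keep rolling until the bankroll reaches that amount) are assumptions made for illustration; they are not part of the original problem statement or of pymdptoolbox.

```python
import random


def play_dien(n_sides, bad_sides, stop_at, rng):
    """Play one game of DieN and return the final payout in dollars.

    stop_at is a hypothetical threshold policy: keep rolling while the
    bankroll is below it, then quit.
    """
    bankroll = 0  # rule 1: start with 0 dollars
    while bankroll < stop_at:
        roll = rng.randint(1, n_sides)  # rule 2: roll the N-sided die
        if roll in bad_sides:
            return 0  # rule 2b: lose everything and the game ends
        bankroll += roll  # rule 2a: receive that many dollars
    return bankroll  # rule 3: quit and keep the money


rng = random.Random(0)  # fixed seed so repeated runs give the same games
payouts = [play_dien(6, {1, 2, 3}, stop_at=4, rng=rng) for _ in range(10_000)]
average = sum(payouts) / len(payouts)
print(average)
```

With `stop_at=4` the player rolls exactly once, so every payout is one of 0, 4, 5 or 6.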


### Visual Representation 
![MDP graph for the DieN game](./mdp.jpeg)

## Step 1
The first thing we need to do is set up matrices for the transition probabilities and the rewards.
The number of states equals the number of possible bankrolls in the game. For N = 6 and isBadSide = {1,1,1,0,0,0}, the possible bankrolls after roll 1 are {0,4,5,6}; after roll 2 they are {0,8,9,10,11,12}. The set of possible bankrolls keeps growing with every further roll, but I will also include bankrolls that cannot occur, just for convenience.

I will use a truncated matrix that goes no further than roll 2 for DieN = 6.

There is one more state called quit. Note that quit != leave: you can choose to leave (action=0), but you are forced to quit when you roll a number in B.

States, in the index order used by the matrices in the final code: {0, 4, 5, 6, 8, 9, 10, 11, 12, quit}
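
The bankroll sets quoted above can be reproduced with a few lines of Python. This is only a sketch; the helper name `bankrolls_after` is made up for this example.

```python
def bankrolls_after(n_sides, bad_sides, n_rolls):
    """Return the sorted bankrolls that are possible right after roll n_rolls."""
    good_sides = [s for s in range(1, n_sides + 1) if s not in bad_sides]
    alive = {0}  # bankrolls of players who have not rolled a bad side yet
    for _ in range(n_rolls):
        alive = {b + g for b in alive for g in good_sides}
    # 0 stands for a player who rolled a number in B and lost everything
    return sorted(alive | {0})


bad = {1, 2, 3}  # isBadSide = {1,1,1,0,0,0}
print(bankrolls_after(6, bad, 1))  # [0, 4, 5, 6]
print(bankrolls_after(6, bad, 2))  # [0, 8, 9, 10, 11, 12]
```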

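The transition probabilities will be represented in a num actions x num states x num states matrix.

The rewards can be represented either in a num states x num actions array or, as in the final code here, in a num actions x num states x num states array with one reward per transition.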
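
Typing these matrices by hand is error-prone, so here is a sketch of how the same layout could be generated. The function `build_dien_mdp` and its arguments are hypothetical names chosen for this example. It uses plain nested lists indexed `[action][state][next_state]` and assumes the same truncation as this tutorial: after the last modelled roll, rolling again is treated as a certain loss.

```python
def build_dien_mdp(n_sides, bad_sides, n_rolls):
    """Build (states, prob, rewards) for DieN truncated after n_rolls rolls.

    Action 0 is leave and action 1 is roll.
    """
    good = [s for s in range(1, n_sides + 1) if s not in bad_sides]
    levels = [{0}]  # levels[k] = bankrolls reachable after k good rolls
    for _ in range(n_rolls):
        levels.append({b + g for b in levels[-1] for g in good})
    states = sorted(set().union(*levels)) + ["quit"]
    index = {s: i for i, s in enumerate(states)}
    n, quit_i = len(states), len(states) - 1
    can_roll = set().union(*levels[:-1])  # bankrolls before the cut-off
    p_bust = (n_sides - len(good)) / n_sides

    prob = [[[0.0] * n for _ in range(n)] for _ in range(2)]
    rewards = [[[0.0] * n for _ in range(n)] for _ in range(2)]
    for i, s in enumerate(states):
        prob[0][i][i] = 1.0  # leave: stay in place with no further reward
        if s not in can_roll:
            # quit state, or a bankroll past the truncation: rolling loses it all
            prob[1][i][quit_i] = 1.0
            rewards[1][i][quit_i] = 0.0 if s == "quit" else float(-s)
            continue
        for g in good:
            j = index[s + g]
            prob[1][i][j] += 1.0 / n_sides
            rewards[1][i][j] = float(g)  # gain the rolled amount
        prob[1][i][quit_i] = p_bust
        rewards[1][i][quit_i] = float(-s)  # lose the current bankroll
    return states, prob, rewards


states, prob, rewards = build_dien_mdp(6, {1, 2, 3}, 2)
print(states)
```

Passing the nested lists through `np.array(...)` gives arrays of shape `(2, 10, 10)` for this configuration.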

## Step 2
Now we need to set up the MDP in pymdptoolbox and run Value Iteration to get the expected value and optimal policy.

In [1]:
import mdptoolbox
# prob and rewards are the arrays described in Step 1; the last argument is the discount factor
vi = mdptoolbox.mdp.ValueIteration(prob, rewards, 1)
vi.run()

If the import fails with `ModuleNotFoundError: No module named 'mdptoolbox'`, install the package first with `pip install pymdptoolbox`.

Then we can extract the optimal policy and the expected value of each state.

In [None]:
optimal_policy = vi.policy
expected_values = vi.V
print(optimal_policy)
print(expected_values)
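
The policy and values come back as plain sequences indexed by state, which is hard to read on its own. Below is a small sketch that pairs each entry with its state label. `describe_policy` is a made-up helper, and the `policy` and `values` tuples are stand-ins worked out by hand for the truncated model in this tutorial; in practice pass `vi.policy` and `vi.V` instead. Because the rewards here are per-roll gains and losses, each value reads as the expected further gain from that state.

```python
def describe_policy(states, policy, values):
    """Return one readable line per state: label, action name, value."""
    action_names = {0: "leave", 1: "roll"}
    return [
        f"state {state}: {action_names[action]} (expected value {value:.4f})"
        for state, action, value in zip(states, policy, values)
    ]


states = [0, 4, 5, 6, 8, 9, 10, 11, 12, "quit"]
# Stand-ins for vi.policy and vi.V, derived by hand (not solver output)
policy = (1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
values = (31 / 12, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
lines = describe_policy(states, policy, values)
for line in lines:
    print(line)
```

At a bankroll of 5, rolling and leaving have the same expected value in this model, so a solver may report either action for that state.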


## Putting it all together

Here is the final code:

In [None]:
import mdptoolbox.example
import numpy as np
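# State order for rows and columns: 0, 4, 5, 6, 8, 9, 10, 11, 12, quit
prob = np.zeros((2, 10, 10))
# action 0 (leave): stay in the current state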
prob[0] = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]

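# action 1 (roll): each good side (4, 5, 6) has probability 1/6,
# and the three bad sides together send you to quit with probability 0.5
p = 1.0 / 6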
prob[1] = [[0, p, p, p, 0, 0, 0, 0, 0, 0.5],
           [0, 0, 0, 0, p, p, p, 0, 0, 0.5],
           [0, 0, 0, 0, 0, p, p, p, 0, 0.5],
           [0, 0, 0, 0, 0, 0, p, p, p, 0.5],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
# sanity check: every row of both transition matrices must sum to 1
assert np.allclose(prob.sum(axis=2), 1)

rewards = np.zeros((2, 10, 10))
# action 0 (leave): no further reward
rewards[0] = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
# action 1 (roll): gain the rolled amount, or lose the current bankroll on quit
rewards[1] = [[0, 0 + 4, 5, 6, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 4, 5, 6, 0, 0, -4],
              [0, 0, 0, 0, 0, 4, 5, 6, 0, -5],
              [0, 0, 0, 0, 0, 0, 4, 5, 6, -6],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -8],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -9],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -10],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -11],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -12],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

# discount factor 1: future winnings count as much as immediate ones
vi = mdptoolbox.mdp.ValueIteration(prob, rewards, 1)
vi.run()

optimal_policy = vi.policy
expected_values = vi.V

print(optimal_policy)
print(expected_values)