# AAI Workshop 9
<small>(Version 1.2)</small>

Below are seven examples and one exercise to be completed by the given deadline (read the text).

These mainly focus on policy iteration.
    
---

## EXAMPLE 1: Solving an MDP (again)

Again we will use the MDP Toolbox, an implementation of some MDP algorithms in Python. If you did not do this last time, you will need to install MDP Toolbox using:

pip install pymdptoolbox

Documentation is at: https://pymdptoolbox.readthedocs.io/en/latest/index.html

We'll start with the problem that was used in Example 4 in Workshop 8. (This is thus the solution to that Example that I promised.)

We have four states and four actions. The problem is essentially the top-right corner of the MDP example we looked at in the slides, which is also in the textbook.

The actions are: 0 is Right, 1 is Left, 2 is Up and 3 is Down.
    
The motion model is the same as in the lectures (0.8 probability of moving in the direction of the action, and 0.1 probability of moving in each of the directions perpendicular to that of the action).

The states are 0, 1, 2, 3, and they are arranged like this:
    
$$
\begin{array}{cc}
2 & 3\\
0 & 1\\
\end{array}
$$

So that 2 is Up from 0 and 1 is Right of 0, and so on. The cost of any action (in any state) is -0.04.

The reward for state 3 is 1, and the reward for state 1 is -1, and the agent does not leave those states.
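
As a cross-check on the hand-written probability array used in the next cell, here is a minimal sketch that derives the same (A, S, S) array from the grid layout and the motion model described above. The helper names (`COORDS`, `step`, `build_transitions`) are made up for this illustration and are not part of the MDP Toolbox. It assumes that a move into a wall leaves the agent where it is.

```python
import numpy as np

# Grid layout from the text: state -> (row, col), with row 0 at the bottom.
COORDS = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
STATE_AT = {rc: s for s, rc in COORDS.items()}

# Action index -> (d_row, d_col): 0 Right, 1 Left, 2 Up, 3 Down.
MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}
# The two directions perpendicular to each action.
PERPENDICULAR = {0: (2, 3), 1: (2, 3), 2: (1, 0), 3: (1, 0)}
TERMINAL = {1, 3}  # absorbing states


def step(state, direction):
    """State reached by moving in `direction`; hitting a wall means no move."""
    row, col = COORDS[state]
    d_row, d_col = MOVES[direction]
    return STATE_AT.get((row + d_row, col + d_col), state)


def build_transitions():
    """Build the (A, S, S) transition array from the motion model."""
    P = np.zeros((4, 4, 4))
    for a in range(4):
        for s in range(4):
            if s in TERMINAL:
                P[a, s, s] = 1.0          # the agent never leaves
                continue
            P[a, s, step(s, a)] += 0.8    # intended direction
            for side in PERPENDICULAR[a]:
                P[a, s, step(s, side)] += 0.1  # slips sideways
    return P


P = build_transitions()
print(P[0])  # transition matrix for action 0 (Right)
```

Every row of every matrix should sum to 1, which is also what `mdptoolbox.util.check` verifies.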




In [2]:
import mdptoolbox
import numpy as np

# Since the probability array is (A, S, S), there are 4 arrays. Each
# is a 4 x 4 array:
P2 = np.array([[[0.1, 0.8, 0.1, 0  ], # Right, State 0
                [0,   1,   0,   0  ], # State 1 is absorbing
                [0.1, 0,   0.1, 0.8],
                [0,   0,   0,   1  ]],# State 3 is absorbing
               [[0.9, 0,   0.1, 0  ], # Left
                [0,   1,   0,   0  ],
                [0.1, 0,   0.9, 0  ],
                [0,   0,   0,   1  ]],
               [[0.1, 0.1, 0.8, 0  ], # Up
                [0,   1,   0,   0  ],
                [0,   0,   0.9, 0.1],
                [0,   0,   0,   1  ]],
               [[0.9, 0.1, 0,   0  ], # Down
                [0,   1,   0,   0  ],
                [0.8, 0,   0.1, 0.1],
                [0,   0,   0,   1  ]]])

# The reward array is (S, A): one row per state, one entry per action.
# The non-terminal states (0 and 2) have the usual cost for all four
# actions; States 1 and 3 have their terminal rewards:
R2 = np.array([[-0.04, -0.04, -0.04, -0.04],
               [-1,    -1,    -1,    -1],
               [-0.04, -0.04, -0.04, -0.04],
               [1,      1,     1,     1]])

mdptoolbox.util.check(P2, R2)
vi2 = mdptoolbox.mdp.ValueIteration(P2, R2, 0.99)
vi2.run()
print('Values:\n', vi2.V)
print('Policy:\n', vi2.policy)


Values:
 (88.23133912867952, -99.9949804281095, 97.54814303782808, 99.9949804281095)
Policy:
 (1, 0, 0, 0)


This says that the optimum policy is to go Right in every state except State 0, and in State 0 go Left.

We can basically ignore the actions for State 1 and State 3. Since they are terminal states, the action doesn't matter. As the motion model says, whatever action is picked, the agent stays in place, and I strongly suspect that the code is always picking action 0 (whatever 0 is) in this situation.

Right obviously makes sense for State 2 since that takes the agent to the goal state 3.

But why is the policy to go Left in State 0 rather than Up?

The short answer is that it is the action with the highest expected utility :-).

The longer answer is that with Left there is zero probability that the agent will end up in State 1 with its reward of -1, yet the agent will still, eventually, find its way to State 2, and hence to the positive reward of State 3.
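
To make the "highest expected utility" claim concrete, here is a small sketch that computes the expected utility of each action in State 0, using the values printed above (rounded to four decimal places) and the State 0 rows of the motion model. The variable names are chosen for this illustration only.

```python
import numpy as np

GAMMA = 0.99
STEP_REWARD = -0.04
# Values printed by value iteration above, rounded to four decimal places.
V = np.array([88.2313, -99.9950, 97.5481, 99.9950])

# Rows of the transition model for State 0, one per action.
P_state0 = np.array([[0.1, 0.8, 0.1, 0.0],   # Right
                     [0.9, 0.0, 0.1, 0.0],   # Left
                     [0.1, 0.1, 0.8, 0.0],   # Up
                     [0.9, 0.1, 0.0, 0.0]])  # Down

# Q(0, a) = R(0, a) + gamma * sum over s' of P(s' | 0, a) * V(s')
Q_state0 = STEP_REWARD + GAMMA * P_state0 @ V

for name, q in zip(['Right', 'Left', 'Up', 'Down'], Q_state0):
    print(f'{name:5s} {q:8.3f}')
```

Left comes out at roughly 88.2 and Up at roughly 76.1. The 0.1 chance of slipping into State 1 costs Up far more than the delay costs Left.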

---

## EXAMPLE 2: A different discount

The choice of Left in State 0 is partly the result of the discount, which we have set to 0.99 (close to what we used in the lecture, but less than 1 to ensure convergence.)

This means the agent doesn't lose much by putting off the positive reward from State 3.

A lower value of the discount (which means rewards further away count for less) will force the agent to go Up in State 0.

What value will do this?
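
One way to explore this is to solve the MDP for several discounts and watch the action chosen in State 0. The sketch below uses a plain NumPy value iteration rather than the MDP Toolbox, so its stopping rule is an assumption and need not match the toolbox's. The function name `greedy_policy` is made up for this illustration. You could equally loop over `mdptoolbox.mdp.ValueIteration`.

```python
import numpy as np

absorb1, absorb3 = [0, 1, 0, 0], [0, 0, 0, 1]
P2 = np.array([
    [[0.1, 0.8, 0.1, 0], absorb1, [0.1, 0, 0.1, 0.8], absorb3],  # Right
    [[0.9, 0, 0.1, 0],   absorb1, [0.1, 0, 0.9, 0],   absorb3],  # Left
    [[0.1, 0.1, 0.8, 0], absorb1, [0, 0, 0.9, 0.1],   absorb3],  # Up
    [[0.9, 0.1, 0, 0],   absorb1, [0.8, 0, 0.1, 0.1], absorb3],  # Down
])
R2 = np.array([[-0.04] * 4, [-1.0] * 4, [-0.04] * 4, [1.0] * 4])


def greedy_policy(P, R, discount, tol=1e-8, max_iter=100_000):
    """Run plain value iteration and return the greedy action per state."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + discount * sum over s' of P[a, s, s'] * V[s']
        Q = R + discount * (P @ V).T
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return (R + discount * (P @ V).T).argmax(axis=1)


# Add discounts between these two to narrow down where the policy changes.
for discount in (0.99, 0.9):
    print(discount, greedy_policy(P2, R2, discount))
```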

---

## EXAMPLE 3: A different cost of actions

The choice of Left in State 0 is also partly the result of the cost of action/reward for non-terminal states.

The value of -0.04 is not a big price to pay for avoiding State 1.

A higher action cost will force the agent to go Up in State 0. 

What value will do this?
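
To try different action costs without retyping the reward array, a small helper like the following may be convenient. `make_rewards` is a hypothetical name for this sketch. It assumes the Example 1 layout, with State 1 as the bad terminal state and State 3 as the goal.

```python
import numpy as np


def make_rewards(step_cost, n_actions=4):
    """Return the (S, A) reward array for the 2 x 2 grid of Example 1."""
    R = np.full((4, n_actions), float(step_cost))  # States 0 and 2
    R[1, :] = -1.0   # bad terminal state
    R[3, :] = 1.0    # goal terminal state
    return R


print(make_rewards(-0.04))   # the array used in Example 1
print(make_rewards(-0.5))    # a much higher action cost
```

Each array returned can be passed to the solver in place of `R2`.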

---

## EXAMPLE 4: Finally policy iteration

Although we have been looking at the policy, we got it through value iteration.

Solving the same problem using policy iteration is easy with the MDP Toolbox:


In [4]:
pi2 = mdptoolbox.mdp.PolicyIteration(P2, R2, 0.9)
pi2.run()
print('Values:\n', pi2.V)
print('Policy:\n', pi2.policy)


Values:
 (5.633171754225077, -10.000000000000002, 8.425258744923362, 10.000000000000002)
Policy:
 (2, 0, 0, 0)


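Note that this run uses a discount of 0.9 rather than the 0.99 of Example 1, which is why the values are on a different scale and the action in State 0 is now Up. When the two methods are run with the same discount, they typically agree on the policy while disagreeing slightly on the values. This is a feature of the termination/convergence conditions.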
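
For reference, here is a minimal NumPy sketch of what policy iteration does: evaluate the current policy exactly by solving a linear system, then improve it greedily, and stop when the policy no longer changes. This is an illustration under stated assumptions and not the MDP Toolbox's own code. The function name and the all-zeros starting policy are choices made for this sketch.

```python
import numpy as np

absorb1, absorb3 = [0, 1, 0, 0], [0, 0, 0, 1]
P2 = np.array([
    [[0.1, 0.8, 0.1, 0], absorb1, [0.1, 0, 0.1, 0.8], absorb3],  # Right
    [[0.9, 0, 0.1, 0],   absorb1, [0.1, 0, 0.9, 0],   absorb3],  # Left
    [[0.1, 0.1, 0.8, 0], absorb1, [0, 0, 0.9, 0.1],   absorb3],  # Up
    [[0.9, 0.1, 0, 0],   absorb1, [0.8, 0, 0.1, 0.1], absorb3],  # Down
])
R2 = np.array([[-0.04] * 4, [-1.0] * 4, [-0.04] * 4, [1.0] * 4])


def policy_iteration(P, R, discount):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_states = P.shape[1]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # start with action 0 everywhere
    while True:
        # Evaluation: solve (I - discount * P_pi) V = R_pi for V.
        P_pi = P[policy, states, :]   # row s is P[policy[s], s, :]
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(n_states) - discount * P_pi, R_pi)
        # Improvement: pick the best action in each state given V.
        Q = R + discount * (P @ V).T
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy


V, policy = policy_iteration(P2, R2, 0.9)
print('Values:\n', V)
print('Policy:\n', policy)
```

Because the evaluation step is an exact linear solve, there is no convergence threshold on the values, unlike in value iteration.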

---

## EXAMPLE 5: Q-learning

Solving a problem using reinforcement learning (well, the Q-learning kind of RL) is also easy using the MDP Toolbox:



In [22]:
rl2 = mdptoolbox.mdp.QLearning(P2, R2, 0.9)
rl2.run()
print('Values:\n', rl2.V)
print('Policy:\n', rl2.policy)


Values:
 (0.23990723950307208, -8.920270055220184, 3.8875765744931616, 9.999161625022545)
Policy:
 (1, 1, 0, 0)


Run this several times. What do you notice about the policy? Why do you think that this is the case?
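
When thinking about this, it may help to see the core of tabular Q-learning written out. The sketch below is a generic version and not the MDP Toolbox's implementation. The purely random action choice, the restart every 100 steps and the decaying learning rate are all assumptions made for illustration.

```python
import numpy as np

absorb1, absorb3 = [0, 1, 0, 0], [0, 0, 0, 1]
P2 = np.array([
    [[0.1, 0.8, 0.1, 0], absorb1, [0.1, 0, 0.1, 0.8], absorb3],  # Right
    [[0.9, 0, 0.1, 0],   absorb1, [0.1, 0, 0.9, 0],   absorb3],  # Left
    [[0.1, 0.1, 0.8, 0], absorb1, [0, 0, 0.9, 0.1],   absorb3],  # Up
    [[0.9, 0.1, 0, 0],   absorb1, [0.8, 0, 0.1, 0.1], absorb3],  # Down
])
R2 = np.array([[-0.04] * 4, [-1.0] * 4, [-0.04] * 4, [1.0] * 4])


def q_learning(P, R, discount, n_iter=10_000, seed=None):
    """Tabular Q-learning driven by transitions sampled from P."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    s = rng.integers(n_states)
    for n in range(1, n_iter + 1):
        if n % 100 == 0:                  # restart from a random state
            s = rng.integers(n_states)
        a = rng.integers(n_actions)       # explore with a random action
        s_next = rng.choice(n_states, p=P[a, s])  # sample the motion model
        alpha = 1 / np.sqrt(n + 2)        # decaying learning rate
        target = R[s, a] + discount * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q


for seed in (0, 1):
    Q = q_learning(P2, R2, 0.9, seed=seed)
    print('seed', seed, 'policy', Q.argmax(axis=1))
```

Unlike value iteration and policy iteration, the learner never reads `P` directly. It only sees the transitions that happen to be sampled.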

---

## EXAMPLE 6: Now do it yourself

Go back to the first example from the last workshop (the one with the states in a line).

Solve it by policy iteration.

How does the result compare with the result of value iteration?

Look at the setVerbose() function and the time attribute of the MDP objects in the MDP Toolbox. Use setVerbose() to compare the number of iterations used by value iteration and policy iteration, and the time attribute to compare the CPU time each method takes to come up with a solution.

Now solve it using Q-learning.

How does that result compare with the results of value iteration and policy iteration?

---

## EXAMPLE 7: A new version of the MDP of your own

Modify the MDP from Example 1 to make the "bad place" (the state with reward -1) State 2 rather than State 1, and then solve it using (a) policy iteration; and (b) Q-learning.

Hint: you will have to modify the motion model as well as the rewards.


---

## EXERCISE

The reward and probability matrices for the example from the slides are in the file setup.py which you can find 
on Blackboard with last week's workshop material.

Use these to create and solve an MDP using either policy iteration or Q-learning from the MDP Toolbox.

How does the policy you get compare with the one from the lectures? 

Alter the discount until the policies agree.

Now solve the problem using value iteration and compare the number of iterations, and the CPU time, used by the two methods.

Write a short document (PDF, max 1 page) or Jupyter Notebook file (preferred) describing your solution and send 
it to **sparsons@lincoln.ac.uk** with subject *AAI Workshop 9 - NAME SURNAME*. Please submit your work by the 
<u>13th January 2022</u>. **It will not be graded, but only used by the lecturer to check the progress of the class**.