# iLykei Lecture Series

# Advanced Machine Learning and Artificial Intelligence (MScA 32017)

# Reinforcement Learning

## Notebook 1: Q-Value Iterations and Q-Learning

## Yuri Balasanov, Mihail Tselishchev, © iLykei 2018

##### Main texts: 

Hands-On Machine Learning with Scikit-Learn and TensorFlow, Aurelien Geron, © Aurelien Geron 2017, O'Reilly Media, Inc

Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning series), © 2018 Richard S. Sutton, Andrew G. Barto, The MIT Press

This notebook discusses an example of a Markov Decision Process (MDP) from Chapter 16 of Hands-On Machine Learning with Scikit-Learn and TensorFlow.

## Description of Markov Decision Process

Consider the example MDP from slide 16 of the session materials.

Create a NumPy array of transition probabilities.

In [1]:
import numpy as np
nan=np.nan
# Transition probabilities
T=np.array([ # shape=[s,a,s']
    [[.7,.3,.0],[1.0,0.0,0.0],[0.8,0.2,0.0]],
    [[0.0,1.0,0.0],[nan,nan,nan],[0.0,0.0,1.0]],
    [[nan,nan,nan],[0.8,0.1,0.1],[nan,nan,nan]],])
print('Shape of T: ',T.shape)
T

Shape of T:  (3, 3, 3)


array([[[0.7, 0.3, 0. ],
        [1. , 0. , 0. ],
        [0.8, 0.2, 0. ]],

       [[0. , 1. , 0. ],
        [nan, nan, nan],
        [0. , 0. , 1. ]],

       [[nan, nan, nan],
        [0.8, 0.1, 0.1],
        [nan, nan, nan]]])

The first matrix of the array shows the transition probabilities from state $S_0$ under actions $a_0$ (first row), $a_1$ (second row) and $a_2$ (third row). Similarly, the second matrix contains the transition probabilities from state $S_1$ and the last matrix contains the transition probabilities from state $S_2$. A row of `nan` marks an action that is not available in that state.

Create a NumPy array of rewards with the same indexing as `T`.

In [2]:
R=np.array([ # shape=[s,a,s'], same indexing as T
[[10.0,0.0,0.0],[0.0,0.0,0.0],[0.0,0.0,0.0]],
[[0.0,0.0,0.0],[nan,nan,nan],[0.0,0.0,-50.0]],
[[nan,nan,nan],[40.0,0.0,0.0],[nan,nan,nan]],])
print('Shape of R: ',R.shape)
R

Shape of R:  (3, 3, 3)


array([[[ 10.,   0.,   0.],
        [  0.,   0.,   0.],
        [  0.,   0.,   0.]],

       [[  0.,   0.,   0.],
        [ nan,  nan,  nan],
        [  0.,   0., -50.]],

       [[ nan,  nan,  nan],
        [ 40.,   0.,   0.],
        [ nan,  nan,  nan]]])

The actions available in each state are:

In [3]:
possible_actions=[[0,1,2],[0,2],[1]]

This completes the description of the MDP.
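
Before iterating, it can help to confirm that `T` and `possible_actions` describe the same MDP. The following is a minimal sketch, not part of the original notebook: the helper `check_mdp` is a hypothetical name, and it only assumes that available actions carry a probability row summing to 1 while unavailable actions are marked with `nan`.

```python
import numpy as np

nan = np.nan
# Transition probabilities copied from the notebook, shape [s, a, s']
T = np.array([
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]],
])
possible_actions = [[0, 1, 2], [0, 2], [1]]

def check_mdp(T, possible_actions):
    """Hypothetical helper: list the (state, action) pairs that look inconsistent."""
    problems = []
    n_states, n_actions, _ = T.shape
    for s in range(n_states):
        for a in range(n_actions):
            row = T[s, a]
            if a in possible_actions[s]:
                # an available action needs a proper probability distribution
                if np.isnan(row).any() or not np.isclose(row.sum(), 1.0):
                    problems.append((s, a, 'probabilities do not sum to 1'))
            elif not np.isnan(row).all():
                # an unavailable action should contain nan only
                problems.append((s, a, 'unavailable action has probabilities'))
    return problems

problems = check_mdp(T, possible_actions)
print(problems)  # prints []
```

An empty list means every available action has a valid distribution over next states and every `nan` row corresponds to an action missing from `possible_actions`.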

## Q-Value Iterations

Initialize Q to -inf for impossible actions and to 0 for all possible actions.

In [4]:
Q=np.full((3,3),-np.inf)
print(Q)
for state,actions in enumerate(possible_actions):
    Q[state,actions]=0.0
print('Initiated Q:')
print(Q)

[[-inf -inf -inf]
 [-inf -inf -inf]
 [-inf -inf -inf]]
Initiated Q:
[[  0.   0.   0.]
 [  0. -inf   0.]
 [-inf   0. -inf]]


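Set the discount rate $\gamma=0.95$ and the number of iterations of the Q-Value Iteration algorithm
$$Q^*_{k+1}(S,a)=\sum_{S'} P(S,a,S') \left[R(S,a,S') + \gamma \max_{a'} Q^*_k(S',a') \right],$$
where $P(S,a,S')$ is the transition probability stored in `T` and $R(S,a,S')$ is the reward stored in `R`.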

In [5]:
discount_rate=0.95
n_iterations=100

Run just one iteration to reproduce the manual calculations on slide 23.

In [6]:
Q_prev=Q.copy()
for s in range(3):
    for a in possible_actions[s]:
        # expected immediate reward plus discounted value of the best next action
        Q[s,a]=np.sum([T[s,a,sp]*(R[s,a,sp]
                                  +discount_rate*np.max(Q_prev[sp]))
                       for sp in range(3)])
print('Q: ')
print(np.matrix(Q))
print('Q_prev: ')
print(np.matrix(Q_prev))

Q: 
[[  7.   0.   0.]
 [  0. -inf -50.]
 [-inf  32. -inf]]
Q_prev: 
[[  0.   0.   0.]
 [  0. -inf   0.]
 [-inf   0. -inf]]


Array `Q_prev` is $Q^*_1(S,a)$. Note that its first row was initialized to 0. Array `Q` is the next iteration: its first row changed to `[7.,0.,0.]`. This is consistent with slide 23.
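
The three entries that changed can be checked by hand. The arithmetic below is reconstructed from the arrays `T` and `R` and the recursion above, not copied from the slide. Every $\max_{a'} Q(S',a')$ is 0 because all possible actions start at 0.

$$Q(S_0,a_0)=0.7\,(10+0.95\cdot 0)+0.3\,(0+0.95\cdot 0)+0\,(0+0.95\cdot 0)=7,$$
$$Q(S_1,a_2)=1.0\,(-50+0.95\cdot 0)=-50,$$
$$Q(S_2,a_1)=0.8\,(40+0.95\cdot 0)+0.1\,(0+0.95\cdot 0)+0.1\,(0+0.95\cdot 0)=32.$$

All other possible actions have zero reward on every transition, so their values stay at 0 after the first iteration.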

Reinitialize the matrix and run the recursion `n_iterations` times.

In [7]:
Q=np.full((3,3),-np.inf)
for state,actions in enumerate(possible_actions):
    Q[state,actions]=0.0
for iteration in range(n_iterations):
    Q_prev=Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s,a]=np.sum([T[s,a,sp]*(R[s,a,sp]
                                  +discount_rate*np.max(Q_prev[sp]))
                           for sp in range(3)])

After `n_iterations` iterations the quality matrix $Q^*(S,a)$ has approximately converged to

In [8]:
print('Q:')
print(np.matrix(Q))

Q:
[[21.88646117 20.79149867 16.854807  ]
 [ 1.10804034        -inf  1.16703135]
 [       -inf 53.8607061         -inf]]
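
One way to judge how close the matrix is to convergence is to record the largest change of any entry in each iteration. The following is a sketch under the same setup as the notebook: the function name `q_value_iteration` and the list `deltas` are introduced here for illustration only, and entries for impossible actions are excluded from the comparison because they stay at -inf.

```python
import numpy as np

nan = np.nan
T = np.array([  # shape [s, a, s'], copied from the notebook
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]],
])
R = np.array([  # shape [s, a, s'], copied from the notebook
    [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.0]],
    [[nan, nan, nan], [40.0, 0.0, 0.0], [nan, nan, nan]],
])
possible_actions = [[0, 1, 2], [0, 2], [1]]

def q_value_iteration(T, R, possible_actions, discount_rate, n_iterations):
    """Q-value iteration that also returns the largest change per iteration."""
    n_states, n_actions, _ = T.shape
    Q = np.full((n_states, n_actions), -np.inf)
    for s, actions in enumerate(possible_actions):
        Q[s, actions] = 0.0
    finite = np.isfinite(Q)  # mask of possible (state, action) pairs
    deltas = []
    for _ in range(n_iterations):
        Q_prev = Q.copy()
        best_next = Q_prev.max(axis=1)  # max over a' of Q_prev[s', a']
        for s in range(n_states):
            for a in possible_actions[s]:
                Q[s, a] = np.sum(T[s, a] * (R[s, a] + discount_rate * best_next))
        deltas.append(np.max(np.abs(Q[finite] - Q_prev[finite])))
    return Q, deltas

Q, deltas = q_value_iteration(T, R, possible_actions, 0.95, 100)
print('largest change in the first iteration:', deltas[0])
print('largest change in the last iteration: ', deltas[-1])
```

If the last change is still larger than the precision needed, `n_iterations` can be increased or the loop can stop once the change falls below a chosen tolerance.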


The optimal action in each state is the one that maximizes $Q^*(S,a)$:

In [9]:
print('Optimal actions by state: ')
print(np.argmax(Q,axis=1)) # max by rows

Optimal actions by state: 
[0 2 1]


To show the importance of the discount rate, repeat the same iterations with $\gamma=0.9$.

In [10]:
discount_rate=0.9
Q=np.full((3,3),-np.inf)
for state,actions in enumerate(possible_actions):
    Q[state,actions]=0.0
for iteration in range(n_iterations):
    Q_prev=Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s,a]=np.sum([T[s,a,sp]*(R[s,a,sp]
                                  +discount_rate*np.max(Q_prev[sp]))
                           for sp in range(3)])
print('Q:')
print(np.matrix(Q))
print('Optimal actions by state: ')
print(np.argmax(Q,axis=1)) # max by rows

Q:
[[18.91891892 17.02702703 13.62162162]
 [ 0.                -inf -4.87971488]
 [       -inf 50.13365013        -inf]]
Optimal actions by state: 
[0 0 1]


With the higher discount rate ($\gamma=0.95$, future rewards matter more) the best strategy in state $S_1$ is to select action $a_2$ and take the hit of -50 reward points: a sacrifice that opens the opportunity to earn more rewards in the future.
With the lower discount rate ($\gamma=0.9$, future rewards matter less) it is better to stay in $S_1$ forever, with no more rewards but with no big losses either.

## Q-Learning

Apply Q-Learning to the same MDP. This method does not use knowledge of the transition probabilities or of the full reward array. Only the immediate reward is observed after an action is selected.

The iterations follow the equation
$$Q_{k+1}(S,a)=(1-\alpha) Q_k(S,a)+ \alpha \left[r+ \gamma \max_{a'}Q_k(S',a') \right] $$
$$=Q_k(S,a)+\alpha \left( [r+ \gamma \max_{a'}Q_k(S',a')]-Q_k(S,a) \right),$$ where the second term of the last expression is the learning step: the difference between the expected future Q-value $[r+ \gamma \max_{a'}Q_k(S',a')]$ and the current value $Q_k(S,a)$, weighted by the learning rate $\alpha$, is applied as an adjustment to $Q_k(S,a)$.
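
The two forms of the update give the same number, which a few lines of plain Python can confirm. This is an illustrative sketch: the function names and the sample values (a first visit to $(S_0,a_0)$ that returns to $S_0$ with reward 10 while all Q-values are still 0) are assumptions chosen for the example.

```python
def q_update_blend(q_sa, reward, max_q_next, learning_rate, discount_rate):
    """First form: weighted average of the old value and the new target."""
    target = reward + discount_rate * max_q_next
    return (1 - learning_rate) * q_sa + learning_rate * target

def q_update_error(q_sa, reward, max_q_next, learning_rate, discount_rate):
    """Second form: old value plus learning rate times the error."""
    target = reward + discount_rate * max_q_next  # r + gamma * max_a' Q(S', a')
    error = target - q_sa                         # distance from the current estimate
    return q_sa + learning_rate * error

# Hypothetical first step: Q(S0, a0) = 0, reward 10, next-state maximum 0
blend = q_update_blend(0.0, 10.0, 0.0, learning_rate=0.05, discount_rate=0.99)
error = q_update_error(0.0, 10.0, 0.0, learning_rate=0.05, discount_rate=0.99)
print(blend, error)  # both are 0.5
```

With a learning rate of 0.05 a single reward of 10 moves the estimate only to 0.5, so many visits are needed before an entry approaches its target.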

In [11]:
discount_rate=.99
learning_rate0 = 0.05
learning_rate_decay = 0.1
n_iterations = 20000

In [12]:
s = 0 # starting state
Q=np.full((3,3),-np.inf) # -inf for impossible actions
for state,actions in enumerate(possible_actions):
    Q[state,actions]=0.0 # 0 for possible actions

In the cell below the transition probabilities and the reward array are used only to simulate the response of the environment. The agent's learning process does not assume they are known and learns the action-value function $Q(S,a)$ only from experience.

In [13]:
for iteration in range(n_iterations):
    a = np.random.choice(possible_actions[s]) # random action from available in s
    sp = np.random.choice(range(3),p=T[s,a]) # random selection of new state
    reward = R[s,a,sp]
    learning_rate = learning_rate0/(1+iteration*learning_rate_decay) # gradually decaying learning rate
    Q[s,a] = ((1-learning_rate)* \
              Q[s,a]+learning_rate* \
              (reward+discount_rate*np.max(Q[sp])))
    s = sp # next state
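
The separation between agent and environment in the loop above can be made explicit by hiding `T` and `R` behind a function. This is a sketch, not the notebook's code: the name `step` and the seeded generator `rng` are assumptions introduced to make the example repeatable.

```python
import numpy as np

nan = np.nan
T = np.array([  # shape [s, a, s'], copied from the notebook
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]],
])
R = np.array([  # shape [s, a, s'], copied from the notebook
    [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.0]],
    [[nan, nan, nan], [40.0, 0.0, 0.0], [nan, nan, nan]],
])
rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

def step(s, a):
    """Environment response: the agent sees only the new state and the reward."""
    sp = int(rng.choice(3, p=T[s, a]))  # sample the next state from T[s, a]
    return sp, float(R[s, a, sp])

print(step(1, 0))  # a_0 in S_1 always stays in S_1: (1, 0.0)
print(step(1, 2))  # a_2 in S_1 always moves to S_2: (2, -50.0)
print(step(0, 0))  # random: S_0 with reward 10 or S_1 with reward 0
```

The learning loop would then call `step(s, a)` in place of the two lines that read `T` and `R` directly, which keeps the model out of the agent's code.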


In [14]:
Q

array([[  5.03138964,   1.5442264 ,   1.27921677],
       [  0.        ,         -inf, -14.64484033],
       [        -inf,  13.30664129,         -inf]])

In [15]:
np.argmax(Q,axis=1)

array([0, 0, 1])

Note that this iterative process does not select action 2 in state 1 even though the discount rate is raised to 0.99. A possible explanation is that the low initial learning rate and its quick decay do not allow the learning process to see enough of the benefit of taking an immediate significant loss. Because actions and transitions are sampled at random without a fixed seed, the numbers above change from run to run.
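
To see how quickly the learning rate decays, the schedule can be evaluated on its own. The sketch below uses the same constants as the Q-Learning cells; the helper name `learning_rate_at` is introduced here for illustration.

```python
learning_rate0 = 0.05
learning_rate_decay = 0.1
n_iterations = 20000

def learning_rate_at(iteration):
    """Learning rate used by the Q-Learning loop at a given iteration."""
    return learning_rate0 / (1 + iteration * learning_rate_decay)

# the rate is already halved after 10 iterations
for iteration in (0, 10, 100, 1000, n_iterations - 1):
    print(iteration, learning_rate_at(iteration))

# total weight given to new information over the whole run,
# shared among all six possible (state, action) pairs
total = sum(learning_rate_at(i) for i in range(n_iterations))
print('sum of all learning rates:', total)
```

A small total suggests that each Q-value can move only a limited distance from its initial value of 0, which is in line with the explanation above. Trying a smaller `learning_rate_decay` would be one way to test it.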