# Introduction to Reinforcement Learning

This Jupyter notebook and the others in the same folder act as supporting materials for **Chapter 22 Reinforcement Learning** of the book *Artificial Intelligence: A Modern Approach*. The notebooks make use of the implementations in the `reinforcement_learning4e.py` module. We also use the implementation of MDPs in the `mdp4e.py` module to test our agents. It might be helpful if you have already gone through the Jupyter notebook dealing with Markov decision processes. Let us import everything from the `reinforcement_learning4e` module. It might also be helpful to view the source of some of our implementations.

In [1]:
import os, sys
sys.path = [os.path.abspath("../../")] + sys.path
from reinforcement_learning4e import *

Before we start playing with the actual implementations let us review a couple of things about RL.

1. Reinforcement Learning is concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. 

2. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

-- Source: [Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning)

In summary, we have a sequence of state-action transitions, with rewards associated with some states. Our goal is to find the optimal policy $\pi$, which tells us what action to take in each state.

# Passive Reinforcement Learning

In passive Reinforcement Learning the agent follows a fixed policy $\pi$. Passive learning attempts to evaluate the given policy $\pi$ without any knowledge of the reward function $R(s)$ or the transition model $P(s'\ |\ s, a)$.

This is usually done by some method of **utility estimation**. The agent attempts to directly learn the utility of each state that would result from following the policy. Note that at each step, it has to *perceive* the reward and the state; it has no global knowledge of these. Thus, if the actions prescribed by the policy offer only a very low probability of reaching some state $s_+$, the agent may never perceive the reward $R(s_+)$.

Consider a situation where an agent is given the policy to follow. Thus, at any point, it knows only its current state and current reward, and the action it must take next. This action may lead it to more than one state, with different probabilities.

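For a sequence of actions given by $\pi$, the utility $U^{\pi}$ of a state $s$ is:
$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right]$$
where $s_0 = s$ and $s_t$ is the state reached after $t$ steps of following $\pi$. In other words, it is the expected value of the summed discounted rewards until termination.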
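
For a single observed trial, the sum inside the expectation can be computed directly. The following is a minimal sketch, not part of `reinforcement_learning4e`; the function name `discounted_return` and the sample reward sequence are assumptions made for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum over t of gamma**t * rewards[t] for one finite trial."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical trial: three ordinary steps at -0.04, then a terminal reward of +1
rewards = [-0.04, -0.04, -0.04, 1.0]
print(discounted_return(rewards, gamma=0.9))
```

Averaging this quantity over many trials that start in the same state gives an estimate of $U^{\pi}(s)$.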

Based on this concept, we discuss three methods of estimating utility: direct utility estimation, adaptive dynamic programming, and temporal-difference learning.

### Implementation

Passive agents are implemented in `reinforcement_learning4e.py` as various `Agent-Class`es.

To demonstrate these agents, we make use of the `GridMDP` object from the `mdp4e` module. `sequential_decision_7x6_environment` is similar to the environment used in the `MDP` notebook, but it is a 7x6 grid with discounting $\gamma = 0.9$.

1. A specific environment configuration must be chosen in which the agent cannot enter at most 5 cells, i.e. at most 5 obstacles must be placed at random in the environment. Sufficiently different reward values must be chosen for the two terminal states (one reward positive, the other negative).
a) The obstacles are (0,3) (1,1) (3,5) (5,5)
b) The rewards are +1 and -1
2. A stochastic model of the agent's actions must be chosen, specifying that the agent moves in one step from its current cell to the cell in the intended direction with probability 0.6. For the three other directions (the two perpendicular to the intended direction and the one opposite to it), in which the stochastic environment may move the agent, the probabilities must be specified as follows: the probability that the environment moves the agent opposite to the intended direction is smaller than the probability that it moves the agent to the right of the intended direction, and the latter in turn is larger than the probability that it moves the agent to the left of the intended direction (it is mandatory that the agent can move in all 4 intended directions: up, down, right, left).
a) "Forward" = 0.6
b) Opposite = 0.1 (smaller than right)
c) Right = 0.17
d) Left = 0.13 (smaller than right)
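
The stochastic action model above can be written as a small lookup from an intended direction to its possible outcomes. This is a sketch only: the helper names (`turn_right`, `turn_left`, `opposite`, `action_outcomes`) are assumptions introduced here, and the code does not show how `sequential_decision_7x6_environment` is actually configured in `mdp4e`.

```python
# Directions as (dx, dy) vectors, following the notebook's convention
NORTH, SOUTH, WEST, EAST = (0, 1), (0, -1), (-1, 0), (1, 0)

def turn_right(direction):
    """Rotate a direction vector 90 degrees clockwise."""
    dx, dy = direction
    return (dy, -dx)

def turn_left(direction):
    """Rotate a direction vector 90 degrees counter-clockwise."""
    dx, dy = direction
    return (-dy, dx)

def opposite(direction):
    """Reverse a direction vector."""
    dx, dy = direction
    return (-dx, -dy)

def action_outcomes(intended):
    """List (probability, actual direction) pairs for one intended move."""
    return [
        (0.6, intended),               # forward
        (0.17, turn_right(intended)),  # slips to the right
        (0.13, turn_left(intended)),   # slips to the left
        (0.1, opposite(intended)),     # pushed backwards
    ]

for prob, direction in action_outcomes(NORTH):
    print(prob, direction)
```

The four probabilities sum to 1 and satisfy the required ordering: opposite (0.1) and left (0.13) are both smaller than right (0.17).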


In [2]:
from mdp4e import sequential_decision_7x6_environment

<img src="images/grid_mdp_7x6.jpg">  

The `sequential_decision_7x6_environment` is a GridMDP object.
The rewards are **+1** and **-1** in the terminal states, and **-0.04** in the rest. Now we define actions and a policy similar to **Fig 22.1** in the book.
3. A fixed policy must be chosen using the value iteration method (the choice of policy must be based on at least as many iterations as are needed for the agent to choose an action unambiguously in every state of the environment).
The agent must then be trained using direct utility estimation in passive learning.

In [3]:
# Action Directions
to_north = (0, 1)
to_south = (0,-1)
to_west = (-1, 0)
to_east = (1, 0)

policy = {
    (0, 5): to_east,  (1, 5): to_east,  (2, 5): to_south,                  (4, 5): to_south,                     (6, 5): None,
    (0, 4): to_east,  (1, 4): to_east,  (2, 4): to_east, (3, 4): to_east,  (4, 4): to_east,   (5, 4): to_east,   (6, 4): to_north,
                      (1, 3): to_east,  (2, 3): to_east, (3, 3): to_east,  (4, 3): to_north,  (5, 3): to_north,  (6, 3): None,
    (0, 2): to_east,  (1, 2): to_east,  (2, 2): to_east, (3, 2): to_north, (4, 2): to_north,  (5, 2): to_north,  (6, 2): to_west,
    (0, 1): to_north,                   (2, 1): to_north,(3, 1): to_north, (4, 1): to_north,  (5, 1): to_north,  (6, 1): to_north,
    (0, 0): to_north, (1, 0): to_west,  (2, 0): to_north,(3, 0): to_north, (4, 0): to_north,  (5, 0): to_north,  (6, 0): to_north,
}
# (0,3) (1,1) (3,5) (5,5) are blocks

This environment will be used extensively in the following demonstrations.

In [10]:
agent = PassiveDUEAgent(policy, sequential_decision_7x6_environment)
run_single_trial(agent, sequential_decision_7x6_environment)
agent.estimate_U()
print('\n'.join([str(k)+':'+str(v) for k, v in sorted(agent.U.items(), key=lambda item: item[1])]))

(0, 0):-1.7200000000000002
(0, 1):-1.6099999999999999
(0, 2):-1.4933333333333334
(1, 2):-1.36
(1, 3):-1.28
(1, 4):-1.24
(2, 4):-1.2
(3, 4):-1.16
(4, 4):-1.12
(5, 4):-1.08
(5, 3):-1.04
(6, 3):-1.0


4. Using a random number generator, 20 state-transition sequences must be obtained.
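
As an illustration of how such sequences can be drawn with a random number generator, here is a self-contained sketch. It does not use `run_single_trial`; the terminal states (6, 3) and (6, 5) are inferred from the policy table, and the "always intend east" policy, the seed, and the function names are assumptions chosen only to keep the example short.

```python
import random

WIDTH, HEIGHT = 7, 6
BLOCKED = {(0, 3), (1, 1), (3, 5), (5, 5)}
TERMINALS = {(6, 3), (6, 5)}  # assumed from the policy's None entries
# Outcomes of intending "east": forward, right (south), left (north), opposite (west)
EAST_OUTCOMES = [(0.6, (1, 0)), (0.17, (0, -1)), (0.13, (0, 1)), (0.1, (-1, 0))]

def sample_next_state(state, outcomes, rng):
    """Draw one actual move; the agent stays put if the move is blocked or off-grid."""
    directions = [d for _, d in outcomes]
    weights = [p for p, _ in outcomes]
    dx, dy = rng.choices(directions, weights=weights, k=1)[0]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in BLOCKED else state

def generate_trial(start, rng, max_steps=1000):
    """Follow the toy policy from start until a terminal state or the step cap."""
    state, trial = start, [start]
    while state not in TERMINALS and len(trial) <= max_steps:
        state = sample_next_state(state, EAST_OUTCOMES, rng)
        trial.append(state)
    return trial

rng = random.Random(0)  # fixed seed so the 20 sequences are reproducible
trials = [generate_trial((0, 0), rng) for _ in range(20)]
print(len(trials), "trials; first trial has", len(trials[0]), "states")
```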

## Direct Utility Estimation (DUE)
 
The first and most naive method of estimating utility comes from the simplest interpretation of the above definition. We construct an agent that follows the policy until it reaches a terminal state. At each step, it logs its current state and reward. Once it reaches the terminal state, it can estimate the utility of each state for *that* trial by simply summing the discounted rewards from that state to the terminal one.

It can now run this 'simulation' $n$ times and calculate the average utility of each state. If a state occurs more than once in a simulation, each occurrence contributes its own utility value to the average.

Note that this method may be prohibitively slow for very large state spaces. Besides, **it pays no attention to the transition probability $P(s'\ |\ s, a)$.** It misses out on information that it is capable of collecting (say, by recording the number of times an action from one state led to another state). The next method addresses this issue.
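
The averaging described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the source of `PassiveDUEAgent`: the function name, the trial format (lists of `(state, reward)` pairs), and the toy state labels are all made up for the example.

```python
from collections import defaultdict

def direct_utility_estimate(trials, gamma=0.9):
    """Average the observed discounted reward-to-go over every visit of every state."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so each step's return reuses the one after it
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            visits[state] += 1
    return {state: totals[state] / visits[state] for state in totals}

# Two hypothetical trials, each a list of (state, reward) pairs
trials = [
    [("a", -0.04), ("b", -0.04), ("goal", 1.0)],
    [("a", -0.04), ("pit", -1.0)],
]
U = direct_utility_estimate(trials)
for state in sorted(U):
    print(state, U[state])
```

State `a` appears in both trials, so its estimate is the mean of two quite different returns, which is why many trials are needed before the averages settle.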
 
### Examples

The `PassiveDUEAgent` class in the `reinforcement_learning4e` module implements the Agent Program described in **Fig 22.1** of the AIMA Book. `PassiveDUEAgent` sums over rewards to find the estimated utility for each state. It therefore needs to run several trials before its estimates become useful.

In [4]:
#%psource PassiveDUEAgent

Now let's try the `PassiveDUEAgent` on the newly defined `sequential_decision_7x6_environment`:

In [11]:
DUEagent = PassiveDUEAgent(policy, sequential_decision_7x6_environment)

We run 200 trials through the Markov model so that the utility estimates have a chance to settle:

In [12]:
for i in range(200):
    run_single_trial(DUEagent, sequential_decision_7x6_environment)
    DUEagent.estimate_U()

Now let's print our estimated utility for each position:

In [13]:
sorted_dict = dict(sorted(DUEagent.U.items()))
print('\n'.join([str(k)+':'+str(v) for k, v in sorted_dict.items()]))

(0, 0):0.23346490243162893
(0, 1):0.22199089306633613
(0, 2):0.2842610024899924
(0, 4):0.6799999999999999
(1, 0):0.31665469411460473
(1, 2):0.32610679472570037
(1, 3):-0.28800200642144197
(1, 4):0.12299479166666648
(1, 5):0.07124999999999976
(2, 0):-0.05157052040100124
(2, 1):0.38559920361596045
(2, 2):0.4933342366343027
(2, 3):0.5160999419770573
(2, 4):0.6053573696536477
(2, 5):0.23883333333333323
(3, 0):-0.7125000000000001
(3, 1):0.12112346700824853
(3, 2):0.4569095199818807
(3, 3):0.6094453339410846
(3, 4):0.6256035727132658
(4, 1):-0.22843750000000007
(4, 2):0.14554508659033152
(4, 3):0.6884562902238323
(4, 4):0.8031542522018851
(4, 5):0.654228305866321
(5, 1):0.28375000000000006
(5, 2):0.352307443579038
(5, 3):0.27170504122347133
(5, 4):0.8904439509410755
(6, 1):0.52
(6, 2):0.25578125
(6, 3):-1.0
(6, 4):0.9549854439535427
(6, 5):1.0


In [8]:
print('\n'.join([str(k)+':'+str(v) for k, v in sorted(DUEagent.U.items(), key=lambda item: item[1])]))

(6, 3):-1.0
(5, 2):-0.6000000000000001
(1, 0):0.41712506011128425
(0, 0):0.5131216459717928
(0, 1):0.5566184870131354
(0, 2):0.596970659879227
(0, 3):0.5989916992187501
(1, 2):0.6353923554670262
(2, 1):0.64612060546875
(1, 3):0.6519667303352616
(2, 2):0.6785270391594272
(1, 5):0.6799999999999999
(2, 3):0.6988415338959475
(2, 5):0.7
(1, 4):0.70875
(3, 2):0.719210936815116
(3, 1):0.72
(4, 2):0.7337813091278076
(3, 4):0.7391733066146668
(3, 3):0.7559416781945131
(2, 4):0.75810546875
(5, 3):0.8010791015625001
(4, 3):0.8033556276418217
(4, 4):0.8592706295364754
(4, 5):0.889328611157761
(5, 4):0.9190230726866866
(5, 5):0.9577049762018335
(6, 4):0.96
(6, 5):1.0


## Adaptive Dynamic Programming (ADP)
 
This method makes use of knowledge of the past state $s$, the action $a$, and the new perceived state $s'$ to estimate the transition probability $P(s'\ |\ s,a)$. It does this by simply counting the new states that result from previous states and actions.<br>
The program runs through the policy a number of times, keeping track of:
- each occurrence of state $s$ and the policy-recommended action $a$ in $N_{sa}$
- each occurrence of $s'$ resulting from $a$ on $s$ in $N_{s'|sa}$.

It can thus estimate $P(s'\ |\ s,a)$ as $N_{s'|sa}/N_{sa}$, which, in the limit of infinite trials, will converge to the true value.<br>
Using the transition probabilities thus estimated, it can apply `POLICY-EVALUATION` to estimate the utilities $U(s)$, relying on the convergence properties of the Bellman equations.
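
The counting step can be illustrated on its own. The sketch below is not the source of `PassiveADPAgent`; the function name, the `(s, a, s')` triple format, and the toy state and action labels are assumptions for the example.

```python
from collections import defaultdict

def estimate_transition_model(transitions):
    """Estimate P(s' | s, a) as N_{s'|sa} / N_{sa} from observed (s, a, s') triples."""
    n_sa = defaultdict(int)      # N_sa: times action a was taken in state s
    n_s1_sa = defaultdict(int)   # N_s'|sa: times that choice led to s'
    for s, a, s1 in transitions:
        n_sa[(s, a)] += 1
        n_s1_sa[(s, a, s1)] += 1
    return {
        (s, a, s1): count / n_sa[(s, a)]
        for (s, a, s1), count in n_s1_sa.items()
    }

# Hypothetical observations: "east" from s0 succeeded twice and failed once
observed = [
    ("s0", "east", "s1"),
    ("s0", "east", "s1"),
    ("s0", "east", "s0"),
    ("s1", "east", "s2"),
]
P = estimate_transition_model(observed)
for key in sorted(P):
    print(key, P[key])
```

With only a handful of observations the estimates are coarse (here 2/3 and 1/3 for `s0`); they approach the true probabilities only as the counts grow.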
 
### Examples

The `PassiveADPAgent` class in the `reinforcement_learning4e` module implements the Agent Program described in **Fig 22.2** of the AIMA Book. `PassiveADPAgent` uses state transition and occurrence counts to estimate $P$, and then $U$. Go through the source below to understand the agent.

In [9]:
#%psource PassiveADPAgent

We instantiate a `PassiveADPAgent` below with the `GridMDP` shown and train it for 200 trials. The `reinforcement_learning4e` module provides a simple function, `run_single_trial`, that simulates a single trial.

In [10]:
ADPagent = PassiveADPAgent(policy, sequential_decision_7x6_environment)
for i in range(200):
    run_single_trial(ADPagent, sequential_decision_7x6_environment)



The estimated utilities are:

In [11]:
sorted_dict = dict(sorted(ADPagent.U.items()))
print('\n'.join([str(k)+':'+str(v) for k, v in sorted_dict.items()]))

(0, 0):-0.08026888141196768
(0, 1):-0.032096537436560046
(0, 2):0.02043548721496828
(0, 3):0.07248616806429156
(0, 4):0.09301792533298361
(0, 5):0.0
(1, 0):-0.11611317736242366
(1, 2):0.08218194786404698
(1, 3):0.13452983411189218
(1, 4):0.14779769504190282
(1, 5):0.0655462272174191
(2, 0):0.0
(2, 1):0.0875036277840179
(2, 2):0.14167069753779768
(2, 3):0.21131587814754058
(2, 4):0.2580869532622623
(2, 5):0.11727358579713232
(3, 0):0.0
(3, 1):0.0
(3, 2):0.2124274450434772
(3, 3):0.2953635345216884
(3, 4):0.39967774182673516
(4, 0):0.0
(4, 1):0.0
(4, 2):0.2793370193177036
(4, 3):0.3886667230241825
(4, 4):0.5175897351919777
(4, 5):0.6440655530676214
(5, 0):0.0
(5, 1):0.0
(5, 2):0.2812763308445338
(5, 3):0.35697370093837083
(5, 4):0.6544386123370751
(5, 5):0.8112234362131165
(6, 0):0.0
(6, 1):0.0
(6, 2):0.0
(6, 3):-1.0
(6, 4):0.86
(6, 5):1.0


In [12]:
print('\n'.join([str(k)+':'+str(v) for k, v in sorted(ADPagent.U.items(), key=lambda item: item[1])]))

(6, 3):-1.0
(1, 0):-0.11611317736242366
(0, 0):-0.08026888141196768
(0, 1):-0.032096537436560046
(4, 0):0.0
(3, 1):0.0
(5, 1):0.0
(0, 5):0.0
(6, 2):0.0
(3, 0):0.0
(5, 0):0.0
(6, 1):0.0
(4, 1):0.0
(2, 0):0.0
(6, 0):0.0
(0, 2):0.02043548721496828
(1, 5):0.0655462272174191
(0, 3):0.07248616806429156
(1, 2):0.08218194786404698
(2, 1):0.0875036277840179
(0, 4):0.09301792533298361
(2, 5):0.11727358579713232
(1, 3):0.13452983411189218
(2, 2):0.14167069753779768
(1, 4):0.14779769504190282
(2, 3):0.21131587814754058
(3, 2):0.2124274450434772
(2, 4):0.2580869532622623
(4, 2):0.2793370193177036
(5, 2):0.2812763308445338
(3, 3):0.2953635345216884
(5, 3):0.35697370093837083
(4, 3):0.3886667230241825
(3, 4):0.39967774182673516
(4, 4):0.5175897351919777
(4, 5):0.6440655530676214
(5, 4):0.6544386123370751
(5, 5):0.8112234362131165
(6, 4):0.86
(6, 5):1.0


Comparing with the result of `PassiveDUEAgent`, both agents give a utility of -1.0 at the terminal state (6, 3) and 1.0 at the terminal state (6, 5). Another point to notice is that, apart from the terminal states, the state with the highest utility for both agents is (6, 4), which is easy to understand when referring to the map: it lies directly next to the +1 terminal state.

## Temporal-difference learning (TD)
 
Instead of explicitly building the transition model $P$, the temporal-difference model makes use of the expected closeness between the utilities of two consecutive states $s$ and $s'$.
For the transition $s$ to $s'$, the update is written as:
$$U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha \left( R(s) + \gamma U^{\pi}(s') - U^{\pi}(s) \right)$$
This model implicitly incorporates the transition probabilities, because each successor state contributes to the update in proportion to how often it is reached from the current state. Thus, over a number of iterations, it converges similarly to the Bellman equations.
The advantage of the TD learning model is its relatively simple computation at each step, with no need to keep track of various counts.
For $n_s$ states and $n_a$ actions, the ADP model would have $n_s \times n_a$ numbers $N_{sa}$ and $n_s^2 \times n_a$ numbers $N_{s'|sa}$ to keep track of. The TD model only has to keep track of a utility $U(s)$ for each state.
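
A single application of the update rule can be sketched as follows. This is an illustration only, not the source of `PassiveTDAgent`; the function name `td_update`, the fixed `alpha`, and the toy state labels are assumptions.

```python
from collections import defaultdict

def td_update(U, s, reward, s_next, alpha, gamma=0.9):
    """Apply U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)) in place."""
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
    return U[s]

# Utilities default to 0 for states that have not been updated yet
U = defaultdict(float)
U["goal"] = 1.0

# One observed transition: in state "b" the agent received -0.04 and moved to "goal"
new_value = td_update(U, "b", -0.04, "goal", alpha=0.5)
print(new_value)
```

Only the table `U` is stored, one number per state, which is the memory advantage over ADP mentioned above.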
 
### Examples

`PassiveTDAgent` uses temporal differences to learn utility estimates. We learn the difference between the states and back up the values to previous states.  Let us look into the source before we see some usage examples.

In [13]:
#%psource PassiveTDAgent

In creating the `PassiveTDAgent`, we use the **same learning rate** $\alpha$ as given in the footnote of the book: $\alpha(n)=60/(59+n)$

In [14]:
TDagent = PassiveTDAgent(policy, sequential_decision_7x6_environment, alpha = lambda n: 60./(59+n))

Now we run **200 trials** for the agent to estimate Utilities.

In [15]:
for i in range(200):
    run_single_trial(TDagent,sequential_decision_7x6_environment)

The calculated utilities are:

In [16]:
sorted_dict = dict(sorted(TDagent.U.items()))
print('\n'.join([str(k)+':'+str(v) for k, v in sorted_dict.items()]))

(0, 0):-0.058167923242457176
(0, 1):-0.004343240836352189
(0, 2):0.036823965592078185
(0, 3):0.12282370538536669
(0, 4):0.1626234030730335
(0, 5):0.0
(1, 0):-0.10858397515541818
(1, 2):0.07319232552719433
(1, 3):0.1720640375861118
(1, 4):0.20824029606940025
(1, 5):0.0
(2, 0):0.0
(2, 1):0.0662351573608459
(2, 2):0.14199058183780378
(2, 3):0.23228738623345338
(2, 4):0.30999198140288836
(2, 5):-0.04481642609136607
(3, 0):0.0
(3, 1):0.16876023329671247
(3, 2):0.21191872416900676
(3, 3):0.28632984324804917
(3, 4):0.40120751812514505
(4, 0):0.0
(4, 1):0.0
(4, 2):0.21440343103614912
(4, 3):0.36538456696931626
(4, 4):0.4985479030894473
(4, 5):0.5885602255558744
(5, 0):0.0
(5, 1):0.0
(5, 2):-0.7239567785487893
(5, 3):0.27728540692184256
(5, 4):0.5393173195256209
(5, 5):0.8153583891750166
(6, 0):0.0
(6, 1):0.0
(6, 2):0.0
(6, 3):-1
(6, 4):0.86
(6, 5):1


In [17]:
print('\n'.join([str(k)+':'+str(v) for k, v in sorted(TDagent.U.items(), key=lambda item: item[1])]))

(6, 3):-1
(5, 2):-0.7239567785487893
(1, 0):-0.10858397515541818
(0, 0):-0.058167923242457176
(2, 5):-0.04481642609136607
(0, 1):-0.004343240836352189
(4, 0):0.0
(5, 1):0.0
(0, 5):0.0
(6, 2):0.0
(3, 0):0.0
(5, 0):0.0
(1, 5):0.0
(6, 1):0.0
(4, 1):0.0
(2, 0):0.0
(6, 0):0.0
(0, 2):0.036823965592078185
(2, 1):0.0662351573608459
(1, 2):0.07319232552719433
(0, 3):0.12282370538536669
(2, 2):0.14199058183780378
(0, 4):0.1626234030730335
(3, 1):0.16876023329671247
(1, 3):0.1720640375861118
(1, 4):0.20824029606940025
(3, 2):0.21191872416900676
(4, 2):0.21440343103614912
(2, 3):0.23228738623345338
(5, 3):0.27728540692184256
(3, 3):0.28632984324804917
(2, 4):0.30999198140288836
(4, 3):0.36538456696931626
(3, 4):0.40120751812514505
(4, 4):0.4985479030894473
(5, 4):0.5393173195256209
(4, 5):0.5885602255558744
(5, 5):0.8153583891750166
(6, 4):0.86
(6, 5):1


Compared with the previous agents, the result of `PassiveTDAgent` is closer to that of `PassiveADPAgent` than to that of `PassiveDUEAgent`.