# Learn RL Notes

# Books

# 1 Intro to RL
## 1.1 Many faces of RL
<img src="images/rl-faces.png" width=600 />
## 1.2 Terminology 
### Reward, Reward hypothesis
<img src="images/rl-reward.png" width=600 />
### Total reward, sequential decision making
<img src="images/rl-sequence.png" width=600 />
### Agent, Environment, History, State
<img src="images/rl-agent-environment.png" width=600 />
<img src="images/rl-history-state.png" width=600 />
### Environment state, Agent state, Information state
<img src="images/rl-environment-state.png" width=600 />
<img src="images/rl-agent-state.png" width=600 />
<img src="images/rl-information-state.png" width=600 />

## 1.3 MDP, FOMDP
<img src="images/rl-mdp.png" width=600 />
<img src="images/rl-fomdp.png" width=600 />




## 1.4 Major components of RL agent
### 1.4.1 Policy: agent's behavior function
> A policy is the agent's behavior

> it's a map from state to action

#### Deterministic Policy
$$ a = \pi(s) $$
#### Stochastic Policy
$$ \pi(a|s) = P[A_{t} = a | S_{t} = s] $$
<img src="images/maze.png" width=600 />
<img src="images/maze-policy.png" width=600 />
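
The two kinds of policy above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical two-state problem; the state names, action names, probabilities, and helper functions are invented for the example and do not come from the maze figures.

```python
import random

# Hypothetical states and actions, chosen only for illustration.
deterministic_policy = {"s0": "right", "s1": "up"}   # a = pi(s)
stochastic_policy = {                                # pi(a|s)
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"up": 1.0},
}

def act_deterministic(policy, state):
    """Look up the single action the policy prescribes for this state."""
    return policy[state]

def act_stochastic(policy, state, rng):
    """Sample an action from the distribution pi(.|state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights)[0]

rng = random.Random(0)
print(act_deterministic(deterministic_policy, "s0"))
print(act_stochastic(stochastic_policy, "s0", rng))
```

A deterministic policy is the special case of a stochastic policy that puts probability 1 on one action, as in state `s1` here.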

### 1.4.2 Value function: how good is each state and/or action
> Value function is a prediction of the future reward

> Used to evaluate the goodness/badness of states
<img src="images/maze-value-function.png" width=600 />

The value function can therefore be used to select between actions, e.g.
$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2}+\gamma^{2} R_{t+3}+...| S_{t} = s] $$

### 1.4.3 Model: agent's representation of environment
> A Model predicts what the environment will do next
<img src="images/maze-model.png" width=600 />

#### $\mathcal{P}$ predicts next state
$$ \mathcal{P}^{a}_{ss'}=P[S_{t+1}=s'|S_{t}=s, A_{t}=a]$$
#### $\mathcal{R}$ predicts next (immediate) reward
$$ \mathcal{R}^{a}_{s} = E[R_{t+1}|S_{t}=s, A_{t}=a] $$


## 1.5 RL Agent Taxonomy
<img src="images/rl-taxonomy.png" width=600 />

## 1.6 Learning and Planning
Two fundamental problems in sequential decision making
### Reinforcement Learning:
* The environment is initially unknown
* The agent interacts with the environment
* The agent improves its policy
<img src="images/rl-atari-learning.png" width=600 />
### Planning:
* A model of the environment is known
* The agent performs computations with its model (without any external interaction)
* The agent improves its policy
* a.k.a. deliberation, reasoning, introspection, pondering, thought, search
<img src="images/rl-atari-planing.png" width=600 />

## 1.7 Exploration and Exploitation
* Reinforcement learning is like trial-and-error learning
* The agent should discover a good policy
* From its experiences of the environment
* Without losing too much reward along the way

* Exploration finds more information about the environment
* Exploitation exploits known information to maximise reward
* It is usually important to explore as well as exploit
## Examples
* Restaurant Selection
    * Exploitation: Go to your favourite restaurant
    * Exploration: Try a new restaurant
* Online Banner Advertisements
    * Exploitation: Show the most successful advert
    * Exploration: Show a different advert
* Oil Drilling
    * Exploitation: Drill at the best known location
    * Exploration: Drill at a new location
* Game Playing
    * Exploitation: Play the move you believe is best
    * Exploration: Play an experimental move

## 1.8 Prediction and Control
* Prediction: evaluate the future
    * Given a policy
* Control: optimise the future
    * Find the best policy
<img src="images/rl-gridworld-prediction.png" width=600 />
<img src="images/rl-gridworld-control.png" width=600 />


# 2 Markov Decision Processes

## 2.1 Markov Processes
* Markov decision processes formally describe an environment for reinforcement learning
* Where the environment is **fully observable**
* i.e. The current state completely characterises the process
* Almost all RL problems can be formalised as MDPs, e.g. 
    * Optimal control primarily deals with continuous MDPs
    * **Partially observable** problems can be converted into MDPs
    * **Bandits** are MDPs with one state
    
### 2.1.2 Markov Property
> “The future is independent of the past given the present”

A state $S_{t}$ is Markov if and only if $$P[S_{t+1}|S_{t}] = P[S_{t+1}|S_{1}, S_{2}, ..., S_{t}]$$ 
* The **state** captures all relevant information from the **history**
* Once the state is known, the history may be thrown away
* i.e. The state is a sufficient statistic of the future

### 2.1.3 State Transition Matrix
For a Markov state s and successor state s', the **state transition probability** is defined by
$$ \mathcal{P}_{ss'} = P[S_{t+1} = s' | S_{t} = s]$$
**State transition matrix** $\mathcal{P}$ defines transition probabilities from all states s to all successor states s',
$$ \mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\
\vdots & & \vdots \\
\mathcal{P}_{n1} & \dots & \mathcal{P}_{nn}\end{bmatrix}$$
where each row of the matrix sums to 1.

### 2.1.4 Markov Process
> A Markov process is a memoryless random process, i.e. a sequence of random states $S_{1} , S_{2}$ , ... with the Markov property.

A Markov Process (or Markov Chain) is a tuple (S, P)
* S is a (finite) set of states
* P is a state transition probability matrix, $ \mathcal{P}_{ss'} = P[S_{t+1} = s' | S_{t} = s]$
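
To make the tuple (S, P) concrete, here is a minimal sketch that samples an episode from a Markov chain. The three states and their transition probabilities are hypothetical, and `check_rows` and `sample_episode` are illustrative helper names, not part of the notes.

```python
import random

# Hypothetical chain; row i is the distribution over successors of state i.
STATES = ["sunny", "rainy", "end"]
P = [
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
    [0.0, 0.0, 1.0],  # absorbing terminal state
]

def check_rows(P, tol=1e-9):
    """Every row of a transition matrix must sum to 1."""
    return all(abs(sum(row) - 1.0) < tol for row in P)

def sample_episode(P, start, terminal, rng, max_steps=1000):
    """Follow the chain from `start` until `terminal` (or max_steps)."""
    s = start
    episode = [s]
    while s != terminal and len(episode) < max_steps:
        # The next state depends only on the current state (Markov property).
        s = rng.choices(range(len(P)), weights=P[s])[0]
        episode.append(s)
    return episode

rng = random.Random(0)
episode = sample_episode(P, start=0, terminal=2, rng=rng)
print(check_rows(P))
print([STATES[i] for i in episode])
```

Note that the sampler never looks at earlier states in `episode`, which is exactly what "memoryless" means here.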

<img src="images/rl-markov.png" width=600 />
<img src="images/rl-markov-episodes.png" width=600 />
<img src="images/rl-markov-state-transition-matrix.png" width=600 />

## 2.2 Markov Reward Processes
> A Markov reward process is a Markov chain with values.

A Markov Reward Process is a tuple $(S, P, R, \gamma)$
* S is a finite set of states
* P is a state transition probability matrix, $P_{ss'} = P [S_{t+1} = s' | S_{t} = s]$
* R is a reward function, $R_{s} = E[R_{t+1} | S_{t} = s]$
* $\gamma$ is a discount factor, $\gamma \in [0, 1]$

### 2.2.1 Return
The return $G_{t}$ is the total discounted reward from time-step t.
$$ G_{t} = R_{t+1} + \gamma R_{t+2}+ ... = \sum_{k=0}^{\infty}\gamma^{k} R_{t+k+1}$$
* The discount γ ∈ [0, 1] is the present value of future rewards 
* The value of receiving reward R after k + 1 time-steps is $γ^{k} R$ . 
* This values immediate reward above delayed reward.
    * γ close to 0 leads to "myopic" evaluation
    * γ close to 1 leads to "far-sighted" evaluation
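
The effect of γ on the return is easy to see numerically. The sketch below assumes a finite, hypothetical reward sequence (the numbers are made up) and computes $G_{t}$ with the backward recursion $G_{t} = R_{t+1} + \gamma G_{t+1}$, which is equivalent to the sum above.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step is G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-2, -2, -2, 10]  # hypothetical R_{t+1}, ..., R_{t+4}
print(discounted_return(rewards, 0.0))  # -2.0: only the immediate reward counts
print(discounted_return(rewards, 0.5))  # -2.25
print(discounted_return(rewards, 1.0))  # 4.0: all rewards count equally
```

With γ = 0 the late reward of 10 is invisible; with γ = 1 it dominates the return.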

### 2.2.2 Value Function
> The value function v(s) gives the long-term value of state s

The state value function v(s) of an MRP is the expected return starting from state s
$$ v(s) = E[G_{t}|S_{t} = s]$$

### 2.2.3 Example MRP
<img src="images/rl-mrp-returns.png" width=600 />
<img src="images/rl-mrp-gamma0.png" width=600 />
<img src="images/rl-mrp-gamma1.png" width=600 />
<img src="images/rl-mrp-gamma2.png" width=600 />

### 2.2.4 Bellman Equation for MRPs
The value function can be decomposed into two parts: 
* immediate reward $R_{t+1}$
* discounted value of successor state $γv(S_{t+1})$
<img src="images/rl-bellman.png" width=600 />
#### 2.2.4.1 Example MRP Bellman
<img src="images/rl-bellman-example.png" width=600 />
#### 2.2.4.2 Bellman in matrix form
<img src="images/rl-bellman-matrix.png" width=600 />
#### 2.2.4.3 Solving Bellman Equation
<img src="images/rl-bellman-solving.png" width=600 />
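
The decomposition into immediate reward plus discounted successor value can be turned into a simple fixed-point computation. The sketch below assumes a hypothetical two-state MRP (`P`, `R`, and the function name `evaluate_mrp` are illustrative) and repeatedly applies $v \leftarrow R + \gamma P v$ until the values stop changing. For small MRPs a direct linear solve is also possible; the iterative form is used here only to keep the example dependency-free.

```python
def evaluate_mrp(P, R, gamma, tol=1e-10, max_iters=100_000):
    """Iterate v <- R + gamma * P v until successive values differ by < tol."""
    n = len(R)
    v = [0.0] * n
    for _ in range(max_iters):
        v_new = [
            R[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical MRP: state 0 pays reward 1 and moves to the absorbing,
# zero-reward state 1 with probability 0.5.
P = [[0.5, 0.5],
     [0.0, 1.0]]
R = [1.0, 0.0]
v = evaluate_mrp(P, R, gamma=0.9)
print(v)
```

This example can be checked by hand: $v(1) = 0$ and $v(0) = 1 + 0.9 \cdot 0.5 \cdot v(0)$, so $v(0) = 1/0.55 \approx 1.818$.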




## 2.3 Markov Decision Processes
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩ 
* S is a finite set of states
* A is a finite set of actions
* P is a state transition probability matrix, $P^{a}_{ss'}\ =\ P[S_{t+1}=s'|S_{t}=s,\ A_{t}=a] $
* R is a reward function, $R^{a}_{s} = E[R_{t+1}\ | S_{t} = s,\ A_{t} = a] $
* γ is a discount factor γ ∈ [0, 1].

<img src="images/rl-mdp-example.png" width=600 />


### 2.3.1 Policies 
A policy π is a distribution over actions given states, $$π(a|s)=P[A_{t} =a|S_{t} =s]$$
* A policy fully defines the behaviour of an agent
* MDP policies depend on the current state (not the history)
* i.e. Policies are stationary (time-independent), $$A_{t}\ ∼\ π(·|S_{t}),∀t>0$$

* Given an MDP M = ⟨S,A,P,R,γ⟩ and a policy π
* The state sequence S1, S2, ... is a Markov process $⟨S, P^{π}⟩$
* The state and reward sequence S1, R2, S2, ... is a Markov reward process $⟨S, P^{π}, R^{π}, γ⟩$
* where 

$$ P^{\pi}_{s,s'} = \sum_{a \in A}\pi(a|s)P^a_{ss'}$$
$$ R^{\pi}_{s} = \sum_{a \in A}\pi(a|s)R^{a}_{s} $$
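
The two averaging formulas above can be sketched directly. This assumes a hypothetical MDP with two states and two actions, stored as nested lists `P[a][s][s2]`, `R[a][s]`, and a policy `pi[s][a]`; the layout and the name `induced_mrp` are choices made for this example.

```python
def induced_mrp(P, R, pi):
    """Average the MDP dynamics and rewards over the policy's action choice."""
    n_actions = len(P)
    n_states = len(R[0])
    P_pi = [
        [sum(pi[s][a] * P[a][s][s2] for a in range(n_actions))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]
    R_pi = [
        sum(pi[s][a] * R[a][s] for a in range(n_actions))
        for s in range(n_states)
    ]
    return P_pi, R_pi

# Hypothetical MDP: action 0 stays in place, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
]
R = [
    [0.0, 1.0],  # reward for action 0 in states 0 and 1
    [2.0, 0.0],  # reward for action 1 in states 0 and 1
]
pi = [[0.5, 0.5],   # state 0: choose uniformly
      [1.0, 0.0]]   # state 1: always stay

P_pi, R_pi = induced_mrp(P, R, pi)
print(P_pi)  # [[0.5, 0.5], [0.0, 1.0]]
print(R_pi)  # [1.0, 1.0]
```

Each row of `P_pi` is again a probability distribution, so the result is a valid MRP.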

### 2.3.2 Value function

The **state-value function $v_{π}(s)$** of an MDP is the expected return starting from state s, and then following policy π
$$v_{π}(s)=E_{π}[G_{t} |S_{t} =s]$$

The **action-value function $q_{π}(s,a)$** is the expected return starting from state s, taking action a, and then following policy π
$$q_{π}(s,a)=E_{π}[G_{t} |S_{t} =s,A_{t} =a]$$

<img src="images/rl-mdp-state-value.png" width=600 />

### 2.3.3 Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of successor state,
$$ v_{\pi}(s) = E_{\pi} [R_{t+1} + \gamma v_{\pi}(S_{t+1})\ |\ S_{t} = s]$$

The action-value function can similarly be decomposed,
$$ q_{\pi}(s, a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1})\ |\ S_{t} = s, A_{t} = a]$$

**Bellman Expectation Equation for $V_{π}$ and $Q_{π}$**
> state value
<img src="images/rl-mdp-vpi.png" width=600/> 
> action value
<img src="images/rl-mdp-qpi.png" width=600/>
> state value2
<img src="images/rl-mdp-vpi2.png" width=600/>
> action value2
<img src="images/rl-mdp-qpi2.png" width=600/>

**Example**
<img src="images/rl-mdp-bellman-example.png" width=600/>

**Bellman Expectation Equation (Matrix Form)**
<img src="images/rl-mdp-bellman-matrix-form.png" width=600/>

### 2.3.4 Optimal Value Function
The **optimal state-value function** $v_{∗}(s)$ is the maximum value function over all policies
$$ v_{∗}(s) = \underset{\pi}{max}\ v_{\pi} (s)$$

The **optimal action-value function** $q_{∗} (s, a)$ is the maximum action-value function over all policies
$$ q_{∗} (s, a) = \underset{\pi}{max}\ q_{\pi} (s, a)$$


* The optimal value function specifies the best possible performance in the MDP.
* An MDP is “solved” when we know the optimal value function.

<img src="images/rl-mdp-optimal-state.png" width=600/>
<img src="images/rl-mdp-optimal-action.png" width=600/>

### 2.3.5 Optimal policy

Define a **partial ordering over policies**
$$π ≥ π' \text{ if } v_{π}(s) ≥ v_{π'}(s), ∀s$$

**Theorem**

For any Markov Decision Process
* There exists an optimal policy $π_{∗}$ that is better than or equal to all other policies, $π_{∗} ≥ π, ∀π$
* All optimal policies achieve the optimal value function, $v_{π_{∗}} (s) = v_{∗} (s)$
* All optimal policies achieve the optimal action-value function, $q_{π_{∗}} (s, a) = q_{∗} (s, a)$

<img src="images/rl-mdp-optimal-policy.png" width=600/>

### 2.3.6 Finding an Optimal Policy
An optimal policy can be found by maximising over $q_{∗} (s, a)$,
$$ π_{∗} (a|s) =
\begin{cases}
1 & \text{if } a = \underset{a \in A}{argmax}\ q_{∗} (s, a) \\
0 & \text{otherwise}
\end{cases}
$$

* There is always a deterministic optimal policy for any MDP 
* If we know $q_{∗} (s, a)$, we immediately have the optimal policy
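
Reading a policy off $q_{∗}$ is a one-line argmax per state. The sketch below assumes a hypothetical table of optimal action values; the state names, action names, and numbers are invented for illustration.

```python
def greedy_policy(q):
    """Pick, in every state, the action with the largest action value."""
    # Ties are broken in favour of the action listed first.
    return {s: max(action_values, key=action_values.get)
            for s, action_values in q.items()}

# Hypothetical q_* table: q[state][action].
q_star = {
    "s0": {"left": 1.0, "right": 2.5},
    "s1": {"left": 0.3, "right": -1.0},
}
print(greedy_policy(q_star))  # {'s0': 'right', 's1': 'left'}
```

No model of the environment is needed for this step, which is why knowing $q_{∗}$ is enough to act optimally.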

### 2.3.7 Bellman Optimality Equation
**Bellman Optimality Equation for $v_{∗}$ and $q_{∗}$**
> state value
<img src="images/rl-mdp-optimality-state.png" width=600 />
> action value
<img src="images/rl-mdp-optimality-action.png" width=600/>
> state value2
<img src="images/rl-mdp-optimality-state2.png" width=600/>
> action value 2
<img src="images/rl-mdp-optimality-action2.png" width=600/>
**Example**
<img src="images/rl-mdp-optimality-example.png" width=700/>

### 2.3.8 Solving the Bellman Optimality Equation
* Bellman Optimality Equation is **non-linear**
* **No closed form solution (in general)**
* Many iterative solution methods
    * **Value Iteration**
    * **Policy Iteration**
    * **Q-learning**
    * **Sarsa**
    


## 2.4 Extensions to MDPs

### 2.4.1 Infinite and continuous MDPs
The following extensions are all possible:
* Countably infinite state and/or action spaces
    * Straightforward
* Continuous state and/or action spaces
    * Closed form for linear quadratic model (LQR)
* Continuous time
    * Requires partial differential equations
    * Hamilton-Jacobi-Bellman (HJB) equation
    * Limiting case of Bellman equation as time-step → 0
    
### 2.4.2 Partially observable MDPs (POMDP)
A Partially Observable Markov Decision Process is an MDP with **hidden states**. It is a **hidden Markov model** with actions.

Definition
A POMDP is a tuple $(S, A, O, P, R, Z, \gamma)$
* S is a finite set of states
* A is a finite set of actions
* O is a finite set of observations
* P is a state transition probability matrix, $P^{a}_{ss'}\ =\ P[S_{t+1} = s'| S_{t} = s, A_{t} = a]$
* R is a reward function, $R^{a}_{s}\ =\ E [R_{t+1} | S_{t} = s, A_{t} = a]$
* Z is an observation function,$Z^{a}_{s'o}\ =\ P [O_{t+1} = o | S_{t+1} = s' , A_{t} = a]$
* $\gamma$ is a discount factor $\gamma \in [0, 1]$.

**Belief States**

Definition
A history $H_{t}$ is a sequence of actions, observations and rewards,
$$H_{t} = A_{0} , O_{1} , R_{1} , ..., A_{t−1} , O_{t} , R_{t}$$

Definition
A belief state b(h) is a probability distribution over states, conditioned on the history h
$$b(h) = (P[S_{t} = s^{1} | H_{t} = h] , ..., P [S_{t} = s^{n} | H_{t} = h])$$
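
The notes define the belief state but not how to maintain it. One standard approach, shown here as an assumption rather than as part of the lecture material, is a Bayes filter: after taking action a and observing o, the new belief is $b'(s') \propto Z^{a}_{s'o} \sum_{s} P^{a}_{ss'} b(s)$. The sketch uses a hypothetical two-state, one-action POMDP with nested lists `P[a][s][s2]` and `Z[a][s2][o]`; `belief_update` is an illustrative name.

```python
def belief_update(b, a, o, P, Z):
    """One Bayes-filter step: predict with P, correct with Z, normalise."""
    n = len(b)
    unnormalised = [
        Z[a][s2][o] * sum(P[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [x / total for x in unnormalised]

# Hypothetical model: the single action leaves the hidden state unchanged,
# and observation 0 is more likely in state 0 than in state 1.
P = [[[1.0, 0.0], [0.0, 1.0]]]
Z = [[[0.8, 0.2], [0.3, 0.7]]]
b = [0.5, 0.5]
b_next = belief_update(b, a=0, o=0, P=P, Z=Z)
print(b_next)
```

Starting from a uniform belief, seeing observation 0 shifts probability towards state 0 (0.4 / 0.55 ≈ 0.727).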

**Reductions of POMDPs**
<img src="images/rl-pomdp-reduction.png" width=600 />

### 2.4.3 Undiscounted, average reward MDPs
<img src="images/rl-ergodic-markov.png" width=600 />
<img src="images/rl-ergodic-markov2.png" width=600 />
<img src="images/rl-average-reward.png" width=600 />


# 3 Planning by Dynamic Programming
## 3.1 Introduction

### What is Dynamic Programming?

**Dynamic**: sequential or temporal component to the problem

**Programming**: optimising a “program”, i.e. a policy (c.f. linear programming)

* A method for solving complex problems
* By breaking them down into **subproblems**
* Solve the subproblems
* Combine solutions to subproblems

### Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for problems which have two properties:

**Optimal substructure**
* Principle of optimality applies
* Optimal solution can be decomposed into subproblems

**Overlapping subproblems**
* Subproblems recur many times
* Solutions can be cached and reused

**Markov decision processes satisfy both properties**
* Bellman equation gives recursive decomposition
* Value function stores and reuses solutions

### Planning by Dynamic Programming
* Dynamic programming assumes **full knowledge** of the MDP
* It is used for **planning** in an MDP
* For **prediction**:
    * Input: MDP (S, A, P, R, γ) and policy π
    * or: MRP $(S, P^{π} , R^{π} , γ)$
    * Output: value function $v_{π}$
* Or for **control**:
    * Input: MDP (S, A, P, R, γ)
    * Output: optimal value function $v_{∗}$
    * and: optimal policy $π_{∗}$

### Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
* Scheduling algorithms
* String algorithms (e.g. sequence alignment)
* Graph algorithms (e.g. shortest path algorithms)
* Graphical models (e.g. Viterbi algorithm)
* Bioinformatics (e.g. lattice models)

## 3.2 Policy Evaluation
### Iterative Policy Evaluation
* Problem: evaluate a given policy π
* Solution: iterative application of **Bellman expectation backup**
* $v_{1} → v_{2} → ... → v_{π}$
* Using **synchronous backups**,
    * At each iteration k + 1
    * For all states s ∈ S
    * Update $v_{k+1}(s)$ from $v_{k}(s')$
    * where s' is a successor state of s
* We will discuss **asynchronous backups** later
* Convergence to $v_{π}$ will be proven at the end of the lecture
<img src="images/rl-dp-iterative-policy-evaluation.png" width=600 />
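
The synchronous backup described above can be sketched as follows. The environment is an assumption: a 4x4 gridworld with terminal corners, reward -1 per step, γ = 1, and a uniform random policy, in the style of the Small Gridworld example; the exact layout and the helper names `step` and `policy_evaluation` are illustrative.

```python
N = 4                                        # grid is N x N, states 0..15
TERMINALS = {0, N * N - 1}                   # top-left and bottom-right corners
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, move):
    """Deterministic transition; moving off the grid leaves s unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + move[0], c + move[1]
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc
    return s

def policy_evaluation(gamma=1.0, tol=1e-8, max_iters=10_000):
    """Synchronous sweeps: v_{k+1} is computed entirely from v_k."""
    v = [0.0] * (N * N)
    for _ in range(max_iters):
        v_new = [0.0] * (N * N)
        for s in range(N * N):
            if s in TERMINALS:
                continue  # terminal states keep value 0
            # Uniform random policy: pi(a|s) = 1/4, reward -1 on every step.
            v_new[s] = sum(0.25 * (-1.0 + gamma * v[step(s, m)]) for m in MOVES)
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v

v = policy_evaluation()
for row in range(N):
    print([round(x, 1) for x in v[row * N:(row + 1) * N]])
```

Keeping separate `v` and `v_new` lists is what makes the backup synchronous; updating a single list in place would give an asynchronous variant.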

### Small Gridworld Example
<img src="images/rl-dp-poleval-example1.png" width=600 />
<img src="images/rl-dp-poleval-example2.png" width=600 />
<img src="images/rl-dp-poleval-example3.png" width=600 />

## 3.3 Policy Iteration

### How to Improve a Policy
* Given a policy π
    * Evaluate the policy π
        $$ v_{π} (s) = E [R_{t+1} + γR_{t+2} + ...|S_{t} = s]$$
    * Improve the policy by acting greedily with respect to $v_{π}$
        $$ π' = greedy(v_{π} )$$
* **In Small Gridworld the improved policy was optimal, $π' = π_{∗}$**
* **In general, more iterations of improvement / evaluation are needed**
* But this process of policy iteration always converges to $π_{∗}$
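
The evaluate-then-improve loop can be sketched on a tiny problem. Everything below is an illustrative assumption: a hypothetical two-state, two-action MDP stored as `P[a][s][s2]` and `R[a][s]`, deterministic policies stored as a list of action indices, and helper names `evaluate`, `greedy`, and `policy_iteration`.

```python
def evaluate(P, R, policy, gamma, tol=1e-10, max_iters=100_000):
    """Iterative policy evaluation for a deterministic policy."""
    n = len(policy)
    v = [0.0] * n
    for _ in range(max_iters):
        v_new = [
            R[policy[s]][s]
            + gamma * sum(P[policy[s]][s][t] * v[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v

def greedy(P, R, v, gamma):
    """Act greedily with respect to v using a one-step lookahead."""
    n_actions, n = len(P), len(v)

    def q(s, a):
        return R[a][s] + gamma * sum(P[a][s][t] * v[t] for t in range(n))

    return [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n)]

def policy_iteration(P, R, gamma, max_rounds=1000):
    policy = [0] * len(R[0])          # start with action 0 everywhere
    v = evaluate(P, R, policy, gamma)
    for _ in range(max_rounds):
        improved = greedy(P, R, v, gamma)
        if improved == policy:        # policy is stable: stop
            break
        policy = improved
        v = evaluate(P, R, policy, gamma)
    return policy, v

# Hypothetical MDP: action 0 stays, action 1 switches state;
# only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [0.0, 0.0]]
policy, v = policy_iteration(P, R, gamma=0.9)
print(policy, [round(x, 3) for x in v])
```

Here one improvement step is enough: the loop moves from "always stay" to "switch in state 0, stay in state 1", and the next greedy step leaves the policy unchanged.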

### Greedy Policy Iteration
<img src="images/rl-dp-policy-iteration.png" width=600 />

### Proof of Policy Improvement
<img src="images/rl-dp-policy-improvement-proof.png" width=600 />
<img src="images/rl-dp-policy-improvement-proof2.png" width=600 />

### Modified Policy Iteration
* Does policy evaluation need to converge to $v_{π}$ ?
* Or should we introduce a stopping condition
    * e.g. $\epsilon$-convergence of value function
* Or simply stop after k iterations of iterative policy evaluation?
    * For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
* Why not update policy every iteration? i.e. stop after k = 1
    * This is equivalent to value iteration (next section)

### Generalized Policy Iteration
<img src="images/rl-dp-generalized-policy-iteration.png" width=600 />

### Principle of Optimality
Any optimal policy can be subdivided into two components:
* An optimal first action $A_{∗}$
* Followed by an optimal policy from successor state S'

**Theorem (Principle of Optimality)**

A policy π(a|s) achieves the optimal value from state s, $v_{π} (s) = v_{∗} (s)$, if and only if
* For any state s' reachable from s
* π achieves the optimal value from state s' , $v_{π} (s') = v_{∗} (s')$

## 3.4 Value Iteration
### Deterministic Value Iteration
* If we know the solution to subproblems $v_{∗} (s')$
* Then solution $v_{∗} (s)$ can be found by one-step lookahead
$$ v_{∗} (s) ← \underset{a \in A}{max} R^{a}_{s} + \gamma \underset{s' \in S}{\sum}P^{a}_{ss'}v_{*}(s')$$
* The idea of value iteration is to apply these updates iteratively
* Intuition: start with final rewards and work backwards
* **Still works with loopy, stochastic MDPs**

### Example
<img src="images/rl-dp-value-interation-example.png" width=600 />

### Value Iteration Algorithm
* Problem: find optimal policy π
* Solution: iterative application of Bellman optimality backup
* $v_{1} → v_{2} → ... → v_{∗}$
* Using **synchronous backups**
    * At each iteration k + 1
    * For all states s ∈ S
    * Update $v_{k+1} (s)$ from $v_{k} (s')$
* Convergence to $v_{∗}$ will be proven later
* **Unlike policy iteration, there is no explicit policy**
* Intermediate value functions may not correspond to any policy

<img src="images/rl-dp-value-iteration.png" width=600 />
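
A minimal sketch of the algorithm above, assuming a hypothetical two-state, two-action MDP stored as `P[a][s][s2]` and `R[a][s]` (the layout and the name `value_iteration` are illustrative). The loop itself only tracks values; a greedy policy is extracted once at the end.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=100_000):
    """Repeat the Bellman optimality backup until the values stop changing."""
    n_actions, n_states = len(P), len(R[0])

    def backup(v, s, a):
        # One-step lookahead value of taking action a in state s.
        return R[a][s] + gamma * sum(P[a][s][t] * v[t] for t in range(n_states))

    v = [0.0] * n_states
    for _ in range(max_iters):
        # Synchronous backup: every v_new[s] is computed from the old v.
        v_new = [max(backup(v, s, a) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(v, v_new))
        v = v_new
        if delta < tol:
            break
    # The policy is only read off at the end; the loop never stores one.
    policy = [max(range(n_actions), key=lambda a: backup(v, s, a))
              for s in range(n_states)]
    return v, policy

# Hypothetical MDP: action 0 stays, action 1 switches state;
# only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [0.0, 0.0]]
v, policy = value_iteration(P, R, gamma=0.9)
print([round(x, 3) for x in v], policy)
```

For this example the fixed point can be checked by hand: $v_{∗}(1) = 1/(1 - 0.9) = 10$ and $v_{∗}(0) = 0.9 \cdot 10 = 9$.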

### Policy Iteration vs Value Iteration
$$ \tag{Policy Iteration} v_{k+1}(s) = \underset{a \in A}{\sum}\pi(a|s) \left(R^{a}_{s} + \gamma \underset{s' \in S}{\sum}P^{a}_{ss'} v_{k}(s') \right)$$
$$ \tag{Value  Iteration} v_{k+1}(s) = \underset{a \in A}{max}\left(R^{a}_{s} + \gamma \underset{s' \in S}{\sum}P^{a}_{ss'} v_{k}(s') \right)$$
$$ \tag{Policy Iteration Matrix Form} v_{k+1} = R^{\pi} + \gamma P^{\pi}v_{k}$$
$$ \tag{Value  Iteration Matrix Form} v_{k+1} = \underset{a \in A}{max} (R^{a} + \gamma P^{a}v_{k})$$

## Summary of DP Algorithms
<img src="images/rl-dp-synchronous-dp-algs.png" width=600 />

## 3.5 Extensions to Dynamic Programming
## 3.6 Contraction Mapping

# 4 Model-Free Prediction
# 5 Model-Free Control
# 6 Value Function Approximation
# 7 Policy Gradient Methods
# 8 Integrating Learning and Planning
# 9 Exploration and Exploitation
# 10 Case study - RL in games