# Learn RL Notes

# Books

# 1 Intro to RL
## 1.1 Many faces of RL
<img src="images/rl-faces.png" width=600 />
## 1.2 Terminology 
### Reward, Reward hypothesis
<img src="images/rl-reward.png" width=600 />
### Total reward, sequential decision making
<img src="images/rl-sequence.png" width=600 />
### Agent, Environment, History, State
<img src="images/rl-agent-environment.png" width=600 />
<img src="images/rl-history-state.png" width=600 />
### Environment state, Agent state, Information state
<img src="images/rl-environment-state.png" width=600 />
<img src="images/rl-agent-state.png" width=600 />
<img src="images/rl-information-state.png" width=600 />

## 1.3 MDP, FOMDP
<img src="images/rl-mdp.png" width=600 />
<img src="images/rl-fomdp.png" width=600 />




## 1.4 Major components of RL agent
### 1.4.1 Policy: agent's behavior function
> A policy is the agent's behavior

> it's a map from state to action

#### Deterministic Policy
$$ a = \pi(s) $$
#### Stochastic Policy
$$ \pi(a|s) = P[A_{t} = a | S_{t} = s] $$
<img src="images/maze.png" width=600 />
<img src="images/maze-policy.png" width=600 />
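
The two policy types can be sketched in a few lines of Python. This is only an illustration: the state names (`"start"`, `"corridor"`), the compass actions, and all probabilities are hypothetical and do not come from the maze figures above.

```python
import random

ACTIONS = ["N", "E", "S", "W"]  # hypothetical action set

# Deterministic policy: a = pi(s), exactly one action per state.
deterministic_pi = {"start": "E", "corridor": "N"}

# Stochastic policy: pi(a|s), a probability distribution over actions per state.
stochastic_pi = {
    "start": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
    "corridor": {"N": 0.5, "E": 0.5, "S": 0.0, "W": 0.0},
}


def act_deterministic(state):
    """Return the single action the policy prescribes in `state`."""
    return deterministic_pi[state]


def act_stochastic(state, rng):
    """Sample an action a with probability pi(a|state)."""
    probs = stochastic_pi[state]
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same samples
print(act_deterministic("start"))
print(act_stochastic("start", rng))
```

The deterministic policy is the special case of a stochastic policy that puts probability 1 on a single action.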

### 1.4.2 Value function: how good is each state and/or action
> Value function is a prediction of the future reward

> Used to evaluate the goodness/badness of states

<img src="images/maze-value-function.png" width=600 />

The value function can therefore be used to select between actions, e.g.
$$ v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2}+\gamma^{2} R_{t+3}+...| S_{t} = s] $$
where $\gamma \in [0, 1]$ is the discount factor.
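
As a rough sketch of how state values help choose between actions, the snippet below does a one-step lookahead: it scores each action by its immediate reward plus the discounted value of the state it leads to. The values in `v`, the transitions in `step`, and the discount `GAMMA` are all made-up numbers for illustration, and the lookahead assumes a known, deterministic one-step model.

```python
GAMMA = 0.9  # hypothetical discount factor

# Hypothetical state values v(s) for three maze cells.
v = {"A": -3.0, "B": -2.0, "goal": 0.0}

# Hypothetical deterministic one-step model: (state, action) -> (reward, next state).
step = {
    ("B", "left"): (-1.0, "A"),
    ("B", "right"): (-1.0, "goal"),
}


def greedy_action(state, actions):
    """Pick the action that maximises reward + GAMMA * v(next state)."""

    def lookahead(action):
        reward, next_state = step[(state, action)]
        return reward + GAMMA * v[next_state]

    return max(actions, key=lookahead)


print(greedy_action("B", ["left", "right"]))  # right
```

Here `left` scores -1 + 0.9 × (-3) = -3.7 and `right` scores -1 + 0.9 × 0 = -1, so the greedy choice is `right`.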

### 1.4.3 Model: agent's representation of environment
> A model predicts what the environment will do next

<img src="images/maze-model.png" width=600 />

#### $\mathcal{P}$ predicts next state
$$ \mathcal{P}^{a}_{ss'}=P[S_{t+1}=s'|S_{t}=s, A_{t}=a]$$
#### $\mathcal{R}$ predicts next (immediate) reward
$$ \mathcal{R}^{a}_{s} = E[R_{t+1}|S_{t}=s, A_{t}=a] $$
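
A tabular model can be stored as two lookup tables, one for $\mathcal{P}$ and one for $\mathcal{R}$. The sketch below assumes a hypothetical two-state, two-action problem; the state names, action names, and numbers are invented for illustration.

```python
import random

# P[(s, a)] maps each successor s' to P[S_{t+1} = s' | S_t = s, A_t = a].
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the expected immediate reward E[R_{t+1} | S_t = s, A_t = a].
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): -1.0,
}


def predict(state, action, rng):
    """Sample a next state from P and return it with the expected reward from R."""
    successors = P[(state, action)]
    next_state = rng.choices(
        list(successors), weights=list(successors.values()), k=1
    )[0]
    return next_state, R[(state, action)]


rng = random.Random(0)  # seeded for reproducibility
print(predict("s0", "move", rng))
```

An agent that has such a model can simulate transitions without touching the real environment.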


## 1.5 RL Agent Taxonomy
<img src="images/rl-taxonomy.png" width=600 />

## 1.6 Learning and Planning
Two fundamental problems in sequential decision making
### Reinforcement Learning:
* The environment is initially unknown
* The agent interacts with the environment
* The agent improves its policy
<img src="images/rl-atari-learning.png" width=600 />
### Planning:
* A model of the environment is known
* The agent performs computations with its model (without any external interaction)
* The agent improves its policy
* a.k.a. deliberation, reasoning, introspection, pondering, thought, search
<img src="images/rl-atari-planing.png" width=600 />

## 1.7 Exploration and Exploitation
* Reinforcement learning is like trial-and-error learning
* The agent should discover a good policy
* From its experiences of the environment
* Without losing too much reward along the way

* Exploration finds more information about the environment
* Exploitation exploits known information to maximise reward
* It is usually important to explore as well as exploit
### Examples
* Restaurant Selection
    * Exploitation: Go to your favourite restaurant
    * Exploration: Try a new restaurant
* Online Banner Advertisements
    * Exploitation: Show the most successful advert
    * Exploration: Show a different advert
* Oil Drilling
    * Exploitation: Drill at the best known location
    * Exploration: Drill at a new location
* Game Playing
    * Exploitation: Play the move you believe is best
    * Exploration: Play an experimental move
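
One simple way to balance the two is the epsilon-greedy rule, a common strategy that is not part of the notes above: with probability `epsilon` try a random option, otherwise pick the option with the best estimate so far. The sketch below applies it to a hypothetical restaurant-style bandit; the function names, Gaussian reward noise, and default parameters are all assumptions made for illustration.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # exploration: random option
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploitation


def run_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Estimate each option's mean reward from noisy samples."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)  # assumed unit-variance noise
        counts[arm] += 1
        # Incremental running mean of the rewards seen for this option.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.8, 0.5])
print(estimates, counts)
```

With `epsilon=0` the agent never explores and can lock onto a poor option; with `epsilon=1` it never uses what it has learned.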

## 1.8 Prediction and Control
* Prediction: evaluate the future
    * Given a policy
* Control: optimise the future
    * Find the best policy
<img src="images/rl-gridworld-prediction.png" width=600 />
<img src="images/rl-gridworld-control.png" width=600 />


# 2 Markov Decision Processes

## 2.1 Markov Processes
* Markov decision processes formally describe an environment for reinforcement learning
* Where the environment is **fully observable**
* i.e. The current state completely characterises the process
* Almost all RL problems can be formalised as MDPs, e.g. 
    * Optimal control primarily deals with continuous MDPs
    * **Partially observable** problems can be converted into MDPs
    * **Bandits** are MDPs with one state
    
### 2.1.2 Markov Property
> “The future is independent of the past given the present”

A state $S_{t}$ is Markov if and only if $$P[S_{t+1}|S_{t}] = P[S_{t+1}|S_{1}, S_{2}, ..., S_{t}]$$
* The **state** captures all relevant information from the **history**
* Once the state is known, the history may be thrown away
* i.e. The state is a sufficient statistic of the future

### 2.1.3 State Transition Matrix
For a Markov state $s$ and successor state $s'$, the **state transition probability** is defined by
$$ \mathcal{P}_{ss'} = P[S_{t+1} = s' | S_{t} = s]$$
**State transition matrix** $\mathcal{P}$ defines transition probabilities from all states s to all successor states s',
$$ \mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\
\vdots & \ddots & \vdots \\
\mathcal{P}_{n1} & \dots & \mathcal{P}_{nn}\end{bmatrix}$$
where each row of the matrix sums to 1.
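
The row-sum property is easy to check in code. The sketch below uses a hypothetical 3-state matrix stored as nested lists, and also shows how a distribution over states is pushed forward one time-step by multiplying it with the matrix. The helper names are illustrative only.

```python
# Hypothetical 3-state transition matrix; P[i][j] = P[S_{t+1} = j | S_t = i].
P = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.0, 1.0],
]


def is_stochastic(matrix, tol=1e-9):
    """True if every entry is non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def step_distribution(dist, matrix):
    """Propagate a distribution over states one time-step (row vector times P)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


print(is_stochastic(P))  # True
print(step_distribution([1.0, 0.0, 0.0], P))  # [0.5, 0.5, 0.0]
```

Starting in state 0 with certainty, the distribution after one step is simply row 0 of the matrix.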

### 2.1.4 Markov Process
> A Markov process is a memoryless random process, i.e. a sequence of random states $S_{1}, S_{2}, \dots$ with the Markov property.

A Markov Process (or Markov Chain) is a tuple (S, P)
* S is a (finite) set of states
* P is a state transition probability matrix, $ \mathcal{P}_{ss'} = P[S_{t+1} = s' | S_{t} = s]$

<img src="images/rl-markov.png" width=600 />
<img src="images/rl-markov-episodes.png" width=600 />
<img src="images/rl-markov-state-transition-matrix.png" width=600 />
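
Sampling episodes from a Markov chain only needs the transition probabilities of the current state. The sketch below uses a small hypothetical chain (the states and probabilities are invented and are not read from the figures above), with `"Sleep"` treated as an absorbing state that ends the episode.

```python
import random

# Hypothetical chain: P[s] maps each successor s' to P[S_{t+1} = s' | S_t = s].
P = {
    "Class": {"Class": 0.1, "Pub": 0.3, "Sleep": 0.6},
    "Pub": {"Class": 0.7, "Pub": 0.3},
    "Sleep": {"Sleep": 1.0},  # absorbing state
}


def sample_episode(start, rng, terminal="Sleep", max_steps=1000):
    """Sample S_1, S_2, ... until the terminal state (or max_steps) is reached."""
    state = start
    episode = [state]
    while state != terminal and len(episode) < max_steps:
        successors = P[state]
        # The next state depends only on the current state (Markov property).
        state = rng.choices(
            list(successors), weights=list(successors.values()), k=1
        )[0]
        episode.append(state)
    return episode


rng = random.Random(0)  # seeded for reproducibility
for _ in range(3):
    print(" -> ".join(sample_episode("Class", rng)))
```

Each call produces a different episode unless the random generator is re-seeded.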

## 2.2 Markov Reward Processes
> A Markov reward process is a Markov chain with values.

A Markov Reward Process is a tuple $(S, P, R, \gamma)$
* S is a finite set of states
* P is a state transition probability matrix, $P_{ss'} = P [S_{t+1} = s' | S_{t} = s]$
* R is a reward function, $R_{s} = E[R_{t+1} | S_{t} = s]$
* $\gamma$ is a discount factor, $\gamma \in [0, 1]$

### Return
The return $G_{t}$ is the total discounted reward from time-step $t$.
$$ G_{t} = R_{t+1} + \gamma R_{t+2}+ ... = \sum_{k=0}^{\infty}\gamma^{k} R_{t+k+1}$$
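
For a finite reward sequence the return can be computed directly. The sketch below works backwards through the rewards using $G_{t} = R_{t+1} + \gamma G_{t+1}$, which follows from the sum above; the function name and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], i.e. G_t for rewards R_{t+1}, R_{t+2}, ..."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g


rewards = [-2, -2, -2, 10]  # hypothetical rewards R_{t+1}, ..., R_{t+4}
print(discounted_return(rewards, 0.5))  # -2.25
print(discounted_return(rewards, 0.0))  # -2.0
print(discounted_return(rewards, 1.0))  # 4.0
```

With $\gamma = 0$ only the immediate reward counts; with $\gamma = 1$ all rewards are weighted equally.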
## 2.3 Markov Decision Processes
## 2.4 Extensions to MDPs

# 3 Planning by Dynamic Programming
# 4 Model-Free Prediction
# 5 Model-Free Control
# 6 Value Function Approximation
# 7 Policy Gradient Methods
# 8 Integrating Learning and Planning
# 9 Exploration and Exploitation
# 10 Case study - RL in games