# Deep Reinforcement Learning

Notes on David Silver's (DeepMind) reinforcement learning course.

### Lecture 1 - Introduction to Reinforcement Learning

Reinforcement learning sits at the intersection of several fields, with applications not only in machine learning but also in areas such as robotics and control theory. The many faces of reinforcement learning:

1. Machine Learning
2. Reward System
3. Operations Research
4. Bounded Rationality 
5. Optimal Control 
6. Classical/Operant Conditioning

##### Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

1. There is no supervisor, only a reward signal.
2. Feedback is delayed, not instantaneous.
3. Time really matters: decisions are sequential and the data is not i.i.d.
4. The agent's actions affect the subsequent data it receives.
    

### Outline 
- <font color='grey'>About Reinforcement Learning.</font>
- The Reinforcement Learning Problem.
- <font color='grey'>Inside An RL agent.</font>
- <font color='grey'>Problems with Reinforcement Learning.</font>

### The Reinforcement Learning Problem

#### Rewards

- A reward $R_t$ is a scalar feedback signal.
- It indicates how well the agent is doing at time step $t$.
- The agent's job is to maximize cumulative reward.

Reinforcement learning is based on the $\textbf{reward hypothesis}$:
> All goals can be described by the maximization of expected cumulative reward.

###### Examples of Rewards 

- Fly stunt manoeuvres in a helicopter.
    - +ve reward for following desired trajectory.
    - -ve reward for crashing.

### Sequential Decision Making

- Goal: select actions to maximize total future reward.
> That usually means planning ahead.
- Actions may have long term consequences.
- Reward may be delayed.
- It may be better to sacrifice immediate reward to gain more long-term reward.
- Examples:
    - A financial investment (may take months to mature)
    - Refueling a helicopter (might prevent a crash in several hours)

### Agent and Environment 
The agent and the environment interact in a trial-and-error loop, exchanging three signals:

- Observation $O_t$
- Action $A_t$
- Reward $R_t$

- At each time step ${t}$ the $\textbf{agent}$:
    - Executes action $A_t$
    - Receives observation $O_t$
    - Receives scalar reward $R_t$

- At each time step ${t}$, the $\textbf{environment}$:
    - Receives action $A_t$
    - Emits observation $O_t$
    - Emits scalar reward $R_t$

In the RL view, the reward can always be modelled as a scalar.

- The history is the sequence of actions, observations and rewards: $H_t = A_1, O_1, R_1, \ldots, A_t, O_t, R_t$
- i.e. all the observable variables up to time $t$.
- What happens next depends on the history:
    - The agent selects actions.
    - The environment selects observations/rewards.
- $\textbf{State}$ is the information used to determine what happens next.
- Formally, state is a function of the history: $S_t = f(H_t)$
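The interaction loop and the history can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv`, `RandomAgent` and `run_episode` are hypothetical names invented for this example, and the toy environment (guess a coin flip) is an assumption, not something from the lecture.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent tries to guess a coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment receives A_t and emits O_t and a scalar R_t.
        observation = self.rng.choice(["heads", "tails"])
        reward = 1.0 if action == observation else 0.0
        return observation, reward


class RandomAgent:
    """Hypothetical agent that ignores the history and guesses at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, history):
        # A real agent would act on a state S_t = f(H_t); this one does not.
        return self.rng.choice(["heads", "tails"])


def run_episode(env, agent, num_steps):
    history = []  # H_t stored as a list of (A_t, O_t, R_t) triples
    for _ in range(num_steps):
        action = agent.act(history)
        observation, reward = env.step(action)
        history.append((action, observation, reward))
    return history


history = run_episode(CoinFlipEnv(), RandomAgent(), num_steps=5)
total_reward = sum(reward for _, _, reward in history)
```

Here the history simply grows by one triple per time step; any notion of state would be computed from that list.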

### The Environment State   $S_t^e$

- It is the environment's private representation.
- It is whatever data the environment uses to pick the next observation/reward.
- The environment state is usually not visible to the agent.
- Even if $S_t^e$ is visible, it may contain irrelevant information.

### The Agent State $S_t^a$

- It is the agent's internal representation.
- It is whatever data the agent uses to pick the next action, i.e. the information used by reinforcement learning algorithms.
- It can be any function of the history: $S_t^a = f(H_t)$

### Information State or Markov State
An information state (a.k.a. Markov state) contains all the useful information from the history.

##### Definition 
A state $S_t$ is Markov if and only if
   > $P[S_{t+1} | S_t] = P[S_{t+1} | S_1, \ldots, S_t]$

######  "The future is independent of the past given the present"
> $H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$

- Once the state is known, the history can be thrown away.
- The state is a sufficient statistic of the future.
- The environment state $S_t^e$ is Markov.
- The history $H_t$ is Markov.

- State representation really defines what happens next.
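The Markov property can be checked empirically on a small example. The sketch below assumes a hypothetical two-state chain (the `TRANSITION` table is made up for illustration) and compares the estimated probability of the next state given only the present state with the estimate that also conditions on the previous state. If the state is Markov, the extra conditioning should make no noticeable difference.

```python
import random

# Hypothetical two-state chain: TRANSITION[s] is P[S_{t+1} = 1 | S_t = s].
TRANSITION = {0: 0.3, 1: 0.8}


def simulate(num_steps, seed=0):
    rng = random.Random(seed)
    state = 0
    states = [state]
    for _ in range(num_steps):
        state = 1 if rng.random() < TRANSITION[state] else 0
        states.append(state)
    return states


def estimate_next_is_one(states, current, previous=None):
    """Estimate P[S_{t+1} = 1 | S_t = current (, S_{t-1} = previous)] by counting."""
    hits = 0
    total = 0
    for t in range(1, len(states) - 1):
        if states[t] != current:
            continue
        if previous is not None and states[t - 1] != previous:
            continue
        total += 1
        hits += states[t + 1]
    return hits / total


states = simulate(50_000)
given_present = estimate_next_is_one(states, current=0)
given_past_0 = estimate_next_is_one(states, current=0, previous=0)
given_past_1 = estimate_next_is_one(states, current=0, previous=1)
```

All three estimates should land near 0.3 up to sampling noise, which is what "the history can be thrown away" means in practice for this chain.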

#### Fully Observable Environments

Full observability: the agent directly observes the environment state, $O_t = S_t^a = S_t^e$

Agent state = environment state = information state
- Formally, this is a Markov decision process (MDP).

#### Partially Observable Environments

- Partial observability: the agent indirectly observes the environment.
    - A robot with camera vision isn't told its absolute location.
    - A trading agent only observes current prices.
    - A poker-playing agent only observes public cards.

- Now agent state $\neq$ environment state.
- Formally this is a partially observable Markov Decision Process (POMDP)
- The agent must construct its own state representation $S_t^a$, e.g.
    - Complete history: $S_t^a = H_t$
    - Beliefs of environment state: $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$
    - Recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$
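To make the difference between the complete-history state and the recurrent state concrete, here is a minimal sketch in plain Python. The weight matrices `W_S` and `W_O`, the dimensions, and the one-hot observations are all hypothetical values chosen for illustration; a real agent would learn the weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(vec, mat):
    """Row vector times matrix, using plain lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def rnn_state_update(prev_state, observation, w_s, w_o):
    """S_t = sigmoid(S_{t-1} W_s + O_t W_o), with sigmoid applied elementwise."""
    from_state = vec_mat(prev_state, w_s)
    from_obs = vec_mat(observation, w_o)
    return [sigmoid(a + b) for a, b in zip(from_state, from_obs)]


# Hypothetical fixed weights: 2-dimensional state, 3-dimensional observation.
W_S = [[0.5, -0.5], [0.25, 0.75]]
W_O = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]

observations = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

history_state = []      # complete history: grows with every time step
rnn_state = [0.0, 0.0]  # recurrent state: fixed-size summary of the history
for obs in observations:
    history_state.append(obs)
    rnn_state = rnn_state_update(rnn_state, obs, W_S, W_O)
```

The history-based state keeps everything but grows without bound, while the recurrent state stays the same size no matter how many observations arrive.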

### Outline 
- <font color='grey'>About Reinforcement Learning.</font>
- <font color='grey'>The Reinforcement Learning Problem. </font>
- Inside An RL agent.
- <font color='grey'>Problems with Reinforcement Learning.</font>

### Inside An RL Agent

#### Major Components of an RL Agent

- An RL agent may include one or more of these components:
    - $\textbf{Policy}$: Agent's behaviour function.
    - $\textbf{Value Function}$: how good is each state and/or action.
    - $\textbf{Model}$: Agent's representation of the environment.


### Policy 
- A policy is the agent's behaviour.
- It is a map from state to action, e.g.
    - Deterministic policy: $a = \Pi(s)$
    - Stochastic policy: $\Pi(a|s) = P[A=a | S=s]$
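The two kinds of policy can be illustrated with a small sketch. The states `s0`/`s1`, the actions `left`/`right`, and the probabilities are hypothetical and exist only for this example.

```python
import random

# Hypothetical two-state, two-action problem.
ACTIONS = ["left", "right"]


def deterministic_policy(state):
    """Deterministic policy: each state maps to exactly one action."""
    return {"s0": "left", "s1": "right"}[state]


# Stochastic policy: each state maps to a probability distribution over actions.
STOCHASTIC_POLICY = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}


def sample_action(state, rng):
    probs = STOCHASTIC_POLICY[state]
    weights = [probs[action] for action in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_action("s0", rng) for _ in range(1000)]
left_fraction = samples.count("left") / len(samples)
```

Calling `deterministic_policy("s0")` always returns the same action, whereas repeated calls to `sample_action("s0", rng)` return `left` roughly 90% of the time.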



### Value Function
- A value function is a prediction of future reward.
- It is used to evaluate the goodness/badness of states,
- and therefore to select between actions, e.g. with a discount factor $\gamma \in [0, 1]$ weighting later rewards:
    > $V_{\Pi}(s) = E_{\Pi}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots | S_t = s]$
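As a worked example of the sum inside the expectation, the sketch below computes the discounted return of a finite reward sequence and then averages it over a few sampled sequences. Averaging sampled returns is a Monte Carlo estimate, a technique these notes do not cover at this point; it is used here only to show what the expectation refers to. The reward sequences are made up.

```python
def discounted_return(rewards, gamma):
    """R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (return from next step).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_value(sampled_reward_sequences, gamma):
    """Average discounted return over sampled episodes starting in the same state."""
    returns = [discounted_return(seq, gamma) for seq in sampled_reward_sequences]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed from the same start state.
episodes = [[1.0, 0.0, 2.0], [0.0, 0.0, 4.0]]
value_estimate = monte_carlo_value(episodes, gamma=0.5)
```

With $\gamma = 0.5$, the first sequence has return $1 + 0.5 \cdot 0 + 0.25 \cdot 2 = 1.5$ and the second $0.25 \cdot 4 = 1.0$, so the estimate is their average, $1.25$.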
    

### Models

- A model predicts what the environment will do next.
- $\textbf{Transitions}$: $\mathbb{P}$ predicts the next state (i.e. the dynamics).
- $\textbf{Rewards}$: $\mathfrak{R}$ predicts the next (immediate) reward, e.g.
   - $\mathbb{P}_{ss'}^a = \mathbb{P}[S'=s' | S = s, A = a]$
   - $\mathfrak{R}_s^a = \mathbb{E}[R | S = s, A = a]$
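One simple way to obtain such a model, sketched here under the assumption of a small discrete state and action space, is to estimate both quantities from counts of observed transitions. `TabularModel` and the experience tuples are hypothetical, and count-based estimation is just one possible choice of model.

```python
from collections import defaultdict


class TabularModel:
    """Count-based estimates of transition probabilities and expected rewards."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sums = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, state, action, reward, next_state):
        key = (state, action)
        self.next_counts[key][next_state] += 1
        self.reward_sums[key] += reward
        self.visits[key] += 1

    def transition_prob(self, state, action, next_state):
        """Estimate of P[S' = next_state | S = state, A = action]."""
        key = (state, action)
        if self.visits[key] == 0:
            return 0.0  # no data yet for this state-action pair
        return self.next_counts[key][next_state] / self.visits[key]

    def expected_reward(self, state, action):
        """Estimate of E[R | S = state, A = action]."""
        key = (state, action)
        if self.visits[key] == 0:
            return 0.0  # no data yet for this state-action pair
        return self.reward_sums[key] / self.visits[key]


# Hypothetical experience: (state, action, reward, next_state) tuples.
experience = [
    ("s0", "go", 1.0, "s1"),
    ("s0", "go", 0.0, "s0"),
    ("s0", "go", 1.0, "s1"),
    ("s0", "go", 2.0, "s1"),
]
model = TabularModel()
for state, action, reward, next_state in experience:
    model.update(state, action, reward, next_state)
```

After these four transitions, the model estimates a 0.75 probability of moving from `s0` to `s1` under `go`, and an expected immediate reward of 1.0.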

### Categorizing RL Agents (1)

- Value Based
    - <font color='grey'>No Policy (implicit)</font>
    - Value Function

- Policy Based
    - Policy
    - <font color='grey'>No Value Function</font>
- Actor Critic 
    - Policy
    - Value Function 


### Categorizing RL Agents (2)
- Model Free 
    - Policy and/or Value Function
    - <font color='grey'> No Model </font>
- Model Based 
    - Policy and/or Value Function 
    - Model 


### Outline 
- <font color='grey'>About Reinforcement Learning.</font>
- <font color='grey'>The Reinforcement Learning Problem. </font>
- <font color='grey'>Inside An RL agent.</font>
- Problems with Reinforcement Learning.

### Problems within Reinforcement Learning
Two fundamental problems in sequential decision making:

- Reinforcement Learning:
    - The environment is initially unknown.
    - The agent interacts with the environment.
    - The agent improves its policy.
- Planning: 
    - A model of the environment is known.
    - The agent performs computations with its model (without any external interaction)
    - The agent improves its policy
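The planning case can be illustrated with value iteration, one standard planning method that is not introduced in this lecture and is used here only as an example of computing with a known model. The three-state `MODEL` below and the discount `GAMMA` are hypothetical.

```python
# Hypothetical known model: MODEL[state][action] = [(prob, next_state, reward), ...]
MODEL = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"back": [(1.0, "A", 1.0)], "finish": [(1.0, "T", 10.0)]},
    "T": {},  # terminal state: no actions available
}
GAMMA = 0.9


def action_value(values, state, action):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in MODEL[state][action])


def value_iteration(tolerance=1e-9, max_sweeps=1000):
    values = {s: 0.0 for s in MODEL}
    for _ in range(max_sweeps):
        # Synchronous sweep: every new value is computed from the old values.
        new_values = {
            s: max((action_value(values, s, a) for a in MODEL[s]), default=0.0)
            for s in MODEL
        }
        change = max(abs(new_values[s] - values[s]) for s in MODEL)
        values = new_values
        if change < tolerance:
            break
    return values


def greedy_policy(values):
    """Pick, in each non-terminal state, the action with the highest value."""
    return {
        s: max(MODEL[s], key=lambda a: action_value(values, s, a))
        for s in MODEL
        if MODEL[s]
    }


values = value_iteration()
policy = greedy_policy(values)
```

No interaction with an environment takes place: the agent only queries `MODEL`. In this example the resulting policy goes from `A` to `B` and then finishes, even though `back` pays a small immediate reward.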

### Exploration vs Exploitation (1)
- Reinforcement learning is like trial-and-error learning.
- The agent should discover a good policy
- from its experiences of the environment
- without losing too much reward along the way.

> Exploration finds more information about the environment.

> Exploitation exploits known information to maximize reward.


- It is usually important to explore as well as exploit.
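A common way to balance the two, shown here as an assumption rather than as part of the lecture, is an epsilon-greedy rule: with a small probability the agent explores a random action, otherwise it exploits its current estimates. The sketch applies it to a hypothetical three-armed bandit with Gaussian rewards.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, num_steps, seed=0):
    rng = random.Random(seed)
    num_arms = len(true_means)
    estimates = [0.0] * num_arms  # running average reward per arm
    counts = [0] * num_arms       # number of pulls per arm
    total_reward = 0.0
    for _ in range(num_steps):
        if rng.random() < epsilon:
            # Explore: try a random arm to gather information.
            arm = rng.randrange(num_arms)
        else:
            # Exploit: pick the arm that currently looks best.
            arm = max(range(num_arms), key=lambda i: estimates[i])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical bandit: the third arm has the highest mean reward.
estimates, counts, total_reward = epsilon_greedy_bandit(
    [0.0, 1.0, 3.0], epsilon=0.1, num_steps=5000
)
```

With `epsilon=0.0` the agent may lock onto whichever arm looked good first; with `epsilon=1.0` it never uses what it has learned. Values in between trade some reward for information.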


### Prediction and Control

- Prediction: evaluate the future
    - Given a policy.
- Control: optimize the future.
    - Find the best policy.