# Lecture 1: Introduction to Reinforcement Learning

### How is reinforcement learning different?

- There's no supervisor, only a *reward* signal
    - There's no immediate feedback on whether the decision was good or bad
- Feedback is delayed, not instantaneous
    - It may take many steps before the agent learns whether a sequence of decisions was good or bad
- Time really matters (sequential, non-i.i.d. data)
    - The agent moves through a dynamic system, so what it observes at one time step is correlated with what it observes at the next
- Agent's actions affect the subsequent data it receives

### Areas of application

- Robotics: fly stunt maneuvers in a helicopter (reward- good/bad)
- Games: Backgammon (reward- score)
- Investment and trading: manage an investment portfolio (reward- money)
- Machine control: control a power station, wind-mill (reward- efficient energy generation)
- Humanoid robots: make a humanoid robot walk (reward- distance walked/falling over)
- Artificial General Intelligence (AGI) or general AI: play many Atari games better than humans (reward- score)

### Rewards

- A reward $R_{t}$ is a scalar feedback signal
- It indicates how well the agent is doing at step $t$
- The agent's job is to maximize the cumulative reward

> **Reward Hypothesis:** All goals can be described by the maximization of expected cumulative reward.


### Sequential Decision Making

- Goal: select actions to maximize total future reward
- Actions may have long term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward
- Examples:
    - A financial investment (may take months to mature)
    - Refuelling a helicopter (might prevent a crash in several hours)
    - Blocking opponent moves

### Agent and Environment

<img src="Figures/01-agent-and-environment.png" style="width: 550px;"/>
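
The interaction in the figure can be sketched as a loop: at each step the agent picks an action from its latest observation and reward, and the environment answers with a new observation and reward. The sketch below is only illustrative; `LineWorldEnv`, `RandomAgent`, and `run_episode` are hypothetical names, and the toy dynamics and reward values are assumptions, not part of the lecture.

```python
import random


class LineWorldEnv:
    """Hypothetical toy environment: walk from position 0 to position 3."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 or +1; the position is clipped to the range [0, 3]
        self.position = max(0, min(3, self.position + action))
        done = self.position == 3
        reward = 1.0 if done else -0.1  # assumed reward scheme
        return self.position, reward, done


class RandomAgent:
    """Hypothetical agent that ignores its inputs and acts at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, observation, reward):
        return self.rng.choice([-1, 1])


def run_episode(env, agent, max_steps=100):
    """Run the agent-environment loop and return the recorded history."""
    history = []
    observation, reward = env.position, 0.0
    for _ in range(max_steps):
        action = agent.act(observation, reward)          # agent emits an action
        observation, reward, done = env.step(action)     # environment responds
        history.append((action, observation, reward))
        if done:
            break
    return history


history = run_episode(LineWorldEnv(), RandomAgent(seed=0))
```

The returned list plays the role of the history: one (action, observation, reward) triple per time step.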

### History and State

- *History* ($H_{t}$) is the sequence of observations, actions, and rewards up to time $t$.
- *State* ($S_{t}$) is the information used to determine what happens next.
$$S_{t} = f(H_{t})$$  

**Environment state $S_{t}^{e}$:** the information the environment uses, based on the history, to determine what happens next. It is usually not visible to the agent, which does not know everything about the environment. Even when $S_{t}^{e}$ is visible, it may contain irrelevant information.  

**Agent state $S_{t}^{a}$:** the agent's internal representation. It is the set of numbers used by the reinforcement learning algorithm: they summarize everything the agent has seen so far, and the agent uses them to pick the next action.  

**Information state (Markov state):** A state $S_{t}$ is Markov iff
$$\mathbb{P}[S_{t+1} | S_{t}] = \mathbb{P}[S_{t+1} | S_{1},...,S_{t}]$$
The future is independent of the past given the present. Once the state is known, the history may be thrown away.  
- The environment state $S_{t}^{e}$ is Markov.
- The history $H_{t}$ is Markov.

### Fully Observable Environments

The agent directly observes the environment state:
$$O_{t} = S_{t}^{a} = S_{t}^{e}$$
- Agent state = environment state = information state
- This is a Markov Decision Process (MDP)

### Partially Observable Environments

- Agent indirectly observes environment
    - A robot with camera vision isn't told its absolute location
    - A trading agent only observes current prices
    - A poker playing agent only observes public cards
- Agent state $\neq$ environment state
- Formally, this is a partially observable Markov decision process (POMDP)
- Agent must construct its own state representation $S_{t}^{a}$, e.g.
    - Complete history: $S_{t}^{a} = H_{t}$
    - Beliefs of $S_{t}^{e}$: $S_{t}^{a} = (\mathbb{P}[S_{t}^{e} = s^{1}],...,\mathbb{P}[S_{t}^{e} = s^{n}])$
    - RNN: $S_{t}^{a} = \sigma(S_{t-1}^{a}W_{s} + O_{t}W_{o})$
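
As a rough illustration of the recurrent option, the update $S_{t}^{a} = \sigma(S_{t-1}^{a}W_{s} + O_{t}W_{o})$ can be written in plain Python. The weight values, the dimensions, and the helper names `sigmoid`, `vec_mat`, and `update_agent_state` are assumptions made for this sketch; in practice the weights would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(vec, mat):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def update_agent_state(prev_state, observation, W_s, W_o):
    """One recurrent update: sigma(prev_state W_s + observation W_o)."""
    recurrent = vec_mat(prev_state, W_s)
    inputs = vec_mat(observation, W_o)
    return [sigmoid(r + i) for r, i in zip(recurrent, inputs)]


# Illustrative weights: 2-dimensional agent state, 3-dimensional observation.
W_s = [[0.5, -0.2], [0.1, 0.3]]
W_o = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]

state = [0.0, 0.0]
for obs in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    # The new state depends only on the previous state and the new observation.
    state = update_agent_state(state, obs, W_s, W_o)
```

The agent never stores the full history here; each observation is folded into a fixed-size state vector.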

### Major Components of an RL Agent

- An RL agent may include one or more of these components:
    - Policy: the agent's behaviour function. It describes how the agent picks its actions.
    - Value function: how good each state and/or action is, i.e. how much reward the agent can expect after taking an action in a particular state.
    - Model: the agent's representation of the environment, i.e. how the agent thinks the environment works.  

<img src="Figures/01-maze-example.png" style="width: 550px;"/>

#### Policy
- Map from state to action, e.g.
- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $\pi(a | s) = \mathbb{P}[A = a | S = s]$  

<img src="Figures/01-maze-policy.png" style="width: 550px;"/>
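
A minimal sketch of the two kinds of policy, assuming a hypothetical three-state problem with actions `"left"` and `"right"` (the states, actions, and probabilities are made up for illustration). A deterministic policy is a lookup, while a stochastic policy samples an action from $\pi(a | s)$.

```python
import random

# Hypothetical deterministic policy: one action per state.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "left"}

# Hypothetical stochastic policy: a probability for each action in each state.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
    "s2": {"left": 0.8, "right": 0.2},
}


def act_deterministic(state):
    """a = pi(s): a plain table lookup."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the probabilities pi(a | s)."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
fixed_action = act_deterministic("s0")
sampled_action = act_stochastic("s0", rng)
```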

#### Value function
- Prediction of future reward (i.e. expected future total reward)
- Used to evaluate the goodness/badness of states
$$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t} + \gamma R_{t+1} + \gamma^{2} R_{t+2} + \ldots \mid S_{t} = s]$$
$\gamma$: discount factor, $\gamma \in [0, 1]$; values below 1 weight immediate rewards more heavily than distant ones  

<img src="Figures/01-maze-value-function.png" style="width: 550px;"/>
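
To make the discounted sum concrete, the sketch below computes the quantity inside the expectation for one finite reward sequence. Averaging it over sampled episodes is one simple way to estimate $v_{\pi}(s)$; that averaging step is an assumption added here, not something the notes above prescribe, and `sample_rewards` is a hypothetical callable that returns the rewards of one episode started in $s$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is: reward + gamma * (return of the rest).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_value(sample_rewards, gamma, num_episodes):
    """Average discounted return over sampled episodes (a sample-mean estimate)."""
    returns = [discounted_return(sample_rewards(), gamma) for _ in range(num_episodes)]
    return sum(returns) / len(returns)


# Rewards 1, 0, 2 with gamma = 0.5 give 1 + 0.5 * 0 + 0.25 * 2 = 1.5
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

With $\gamma = 0.5$ the reward of 2 received two steps later counts for only 0.5, which is the sense in which immediate rewards matter more.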

#### Model
- A model predicts what the environment will do next
- Transitions: $P$ predicts the next state (i.e. the dynamics)
- Rewards: $R$ predicts the next (immediate) reward, e.g.
$$P_{ss'}^{a} = \mathbb{P}[S' = s' | S = s, A = a] $$
$$R_{s}^{a}  = \mathbb{E}[R | S = s, A = a]$$
- A model is optional: the agent does not have to build one, but it can make use of one if it does.  

<img src="Figures/01-maze-model.png" style="width: 550px;"/>
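
For a small problem, $P_{ss'}^{a}$ and $R_{s}^{a}$ can be stored as tables. The sketch below assumes a hypothetical two-state problem with actions `"stay"` and `"go"`; all probabilities and rewards are invented for illustration, and `predict` is a hypothetical helper.

```python
import random

# P[(s, a)] maps each possible next state s' to its probability.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"): -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "go"): -1.0,
}


def predict(state, action, rng):
    """Sample a next state from P and return it with the expected reward from R."""
    next_states = list(P[(state, action)])
    probs = [P[(state, action)][s] for s in next_states]
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[(state, action)]


rng = random.Random(0)
predicted = predict("s0", "go", rng)
```

An agent holding such tables can simulate what the environment would do without acting in it.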


### Types of RL agents

- Value based
    - No explicit policy (it is implicit in the value function)
    - Value function
    - The policy can be obtained from the value function
- Policy based
    - Policy
    - No value function
    - Stores the policy explicitly (e.g. the arrows in the maze policy figure)
- Actor Critic
    - Policy
    - Value Function
- Model Free
    - Policy and/or Value Function
    - No Model
- Model Based
    - Policy and/or Value Function
    - Model
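
One common way a value-based agent obtains its policy is to act greedily with respect to its value estimates. The sketch below assumes the agent stores action values in a hypothetical table `q[state][action]`; the numbers are made up.

```python
# Hypothetical action-value estimates.
q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.4},
}


def greedy_policy(q_values):
    """For each state, pick the action with the highest estimated value."""
    return {state: max(actions, key=actions.get) for state, actions in q_values.items()}


policy = greedy_policy(q)
```

Nothing policy-specific is stored here: the policy is recomputed from the values whenever it is needed, which is what makes it implicit.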

### Learning and Planning

<img src="Figures/01-learning-planning.png" style="width: 550px;"/>

### Exploration vs Exploitation tradeoff

- *Exploration* finds more information about the environment
- *Exploitation* exploits known information to maximize reward
- Example: restaurant selection
    - Exploitation: Go to your favorite restaurant
    - Exploration: Try a new restaurant
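
One simple way to balance the two is an epsilon-greedy rule: explore with a small probability and exploit otherwise. This rule is not named in the notes above; it is used here only as an illustration, and `choose_restaurant`, the rating values, and the restaurant names are all hypothetical.

```python
import random


def choose_restaurant(estimated_ratings, epsilon, rng):
    """With probability epsilon pick a random restaurant, otherwise the best-rated one."""
    if rng.random() < epsilon:
        return rng.choice(list(estimated_ratings))        # explore
    return max(estimated_ratings, key=estimated_ratings.get)  # exploit


# Hypothetical current estimates of how good each restaurant is.
ratings = {"favorite": 4.5, "nearby": 3.0, "untried": 0.0}

rng = random.Random(0)
tonight = choose_restaurant(ratings, epsilon=0.1, rng=rng)
```

With `epsilon=0.0` the rule always returns the favorite and never learns whether the untried restaurant is better; with `epsilon=1.0` it ignores what it already knows.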

### Prediction and Control

- Prediction: evaluate the future
    - Given a policy
- Control: optimize the future
    - Find the best policy
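
The difference can be sketched on a tiny, fully known problem. The example below assumes a hypothetical deterministic two-state MDP and uses repeated one-step lookahead sweeps (iterative policy evaluation for prediction, value iteration for control). These particular algorithms, the names `backup`, `evaluate`, and `optimize`, and all numbers are assumptions chosen for illustration.

```python
# Hypothetical deterministic MDP: next_state[s][a] and reward[s][a].
next_state = {
    "A": {"stay": "A", "move": "B"},
    "B": {"stay": "B", "move": "A"},
}
reward = {
    "A": {"stay": 0.0, "move": 0.0},
    "B": {"stay": 1.0, "move": 0.0},
}
GAMMA = 0.5


def backup(v, s, a):
    """One-step lookahead: immediate reward plus discounted value of the next state."""
    return reward[s][a] + GAMMA * v[next_state[s][a]]


def evaluate(policy, sweeps=100):
    """Prediction: estimate the value of each state under a given policy."""
    v = {s: 0.0 for s in next_state}
    for _ in range(sweeps):
        v = {s: backup(v, s, policy[s]) for s in next_state}
    return v


def optimize(sweeps=100):
    """Control: estimate the best achievable values, then read off a greedy policy."""
    v = {s: 0.0 for s in next_state}
    for _ in range(sweeps):
        v = {s: max(backup(v, s, a) for a in next_state[s]) for s in next_state}
    policy = {s: max(next_state[s], key=lambda a: backup(v, s, a)) for s in next_state}
    return v, policy


v_stay = evaluate({"A": "stay", "B": "stay"})
v_best, best_policy = optimize()
```

`evaluate` answers "how well does this policy do?", while `optimize` answers "which policy does best?"; here the best policy moves from A to B and then stays in B.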