# RL Foundations

- This notebook is an introductory guide to reinforcement learning (RL), focusing on motivation and intuition

## What is RL?

- RL is somewhat similar to supervised learning

- Recall that in supervised learning, we pair each input with a known outcome (a label) and adjust the model weights so that, given an input, the model reproduces that outcome

- An RL learner follows the same idea: some inputs are provided, and some reward outcomes are logged.
    - The difference is that in RL, between seeing the inputs and receiving the rewards, the agent needs to take an action $A$
    - With a different set of inputs (the **state**), we can take different actions, which can lead to different rewards
    - So the question for the learner is: given a state, which action should we take to maximise our reward?

- Because the reward depends on the action taken, and nobody tells the agent which action is correct, the agent typically has to learn which action to choose through trial and error
    
- While there are many sophisticated approaches to guide how an agent makes decisions, it is often possible to recast an RL problem as a supervised learning problem. For example, suppose we have a dataset with:
    - State: the input context seen by the agent
    - Action: The action taken by the agent
    - Reward: The reward received by the agent for taking action $A$ at state $S$

- Then a simple approach to RL might be: train a model that takes in the state and action, and predicts the reward $R$
    - In this case, the best action in state $S$ is simply the one with the highest predicted reward, $\arg\max_A \hat{R}(S, A)$
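
- A minimal sketch of this "predict the reward, then take the argmax" idea, assuming a tiny made-up problem with 2 states and 3 actions. The data, the `true_mean` table and the helper names (`one_hot_features`, `best_action`) are all hypothetical, and the least-squares fit stands in for whatever supervised model you would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: mean reward for each (state, action) pair.
true_mean = np.array([[1.0, 0.2, 0.5],
                      [0.1, 0.9, 0.3]])
n_states, n_actions = true_mean.shape

# Hypothetical logged dataset of (state, action, reward) triples.
states = rng.integers(0, n_states, size=2000)
actions = rng.integers(0, n_actions, size=2000)
rewards = true_mean[states, actions] + rng.normal(0.0, 0.1, size=2000)

def one_hot_features(s, a):
    """Encode each (state, action) pair as a one-hot vector."""
    x = np.zeros((len(s), n_states * n_actions))
    x[np.arange(len(s)), s * n_actions + a] = 1.0
    return x

# Supervised step: least-squares regression from (state, action) to reward.
X = one_hot_features(states, actions)
weights, *_ = np.linalg.lstsq(X, rewards, rcond=None)

def best_action(state):
    """Predict the reward of every action in `state` and return the argmax."""
    s = np.full(n_actions, state)
    a = np.arange(n_actions)
    predicted = one_hot_features(s, a) @ weights
    return int(np.argmax(predicted))
```

- Note that this only works well here because the logged actions were chosen uniformly at random, so every (state, action) pair is well covered by the data.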

## Formalising the Problem

- For every step $t$
    1. The agent observes the current state $s_t$
    2. The agent chooses an action $a_t$. **How** the agent chooses the action is determined by its **policy**
    3. Based on $s_t, a_t$, the environment will make a transition to some state $s_{t+1}$ with some reward $r_{t+1}$
    4. Based on the knowledge of $s_t, a_t, r_{t+1}, s_{t+1}$, the agent tries to improve its decision making policy!
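
- The four steps above can be sketched as a loop. This is only an illustration: `CoinFlipEnv` is a hypothetical two-state environment invented for this example, and the running-average update with an $\epsilon$-greedy choice is one simple way to "improve the policy", not the only one.

```python
import random

class CoinFlipEnv:
    """Hypothetical environment: reward 1 if the action equals the current state."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 1)   # move to a random next state
        return self.state, reward

def run(n_steps=500, seed=0):
    env = CoinFlipEnv(seed)
    rng = random.Random(seed + 1)
    value = [[0.0, 0.0], [0.0, 0.0]]   # value[s][a]: average reward seen so far
    count = [[0, 0], [0, 0]]           # count[s][a]: times a was tried in s
    total = 0.0
    s = env.state                                  # 1. observe s_t
    for _ in range(n_steps):
        if rng.random() < 0.1:                     # 2. choose a_t using the policy
            a = rng.randint(0, 1)
        else:
            a = max((0, 1), key=lambda act: value[s][act])
        s_next, r = env.step(a)                    # 3. get s_{t+1} and r_{t+1}
        count[s][a] += 1                           # 4. improve using the new experience
        value[s][a] += (r - value[s][a]) / count[s][a]
        total += r
        s = s_next
    return value, total / n_steps

value, average_reward = run()
```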

- Don't be confused by the jargon. You can think of a **policy** as the rule the agent uses to map states to actions. It can be deterministic (the same state always gives the same action) or stochastic (the state gives a probability of choosing each action).
    - Simple example: let's say $s_t$ tells the agent that (i) the forecast says it will rain soon, and (ii) there are water droplets on the window
    - Then the policy is how the agent chooses from the set `[bring umbrella, don't bring umbrella]`
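
- As a rough illustration of the umbrella example, a policy can be written as a plain lookup table (deterministic) or as a function that samples an action (stochastic). The state encoding `(rain_forecast, droplets_on_window)` and the probabilities below are made up for this sketch.

```python
import random

# Deterministic policy: a lookup from state to action.
# State = (rain_forecast, droplets_on_window); this encoding is hypothetical.
deterministic_policy = {
    (True, True): "bring umbrella",
    (True, False): "bring umbrella",
    (False, True): "bring umbrella",
    (False, False): "don't bring umbrella",
}

def stochastic_policy(state, rng):
    """Sample an action from a state-dependent probability (made-up numbers)."""
    rain_forecast, droplets = state
    p_umbrella = 0.1 + 0.4 * rain_forecast + 0.4 * droplets
    return "bring umbrella" if rng.random() < p_umbrella else "don't bring umbrella"

action = stochastic_policy((True, False), random.Random(0))
```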

## Explore/Exploit

- Let's zoom in on an agent's **policy**

- Because the agent is making decisions under uncertainty, every decision is a tradeoff between exploration and exploitation
    - Should I make use of the information I know now to choose the best action?
    - Or should I explore other actions to get more information?
    - For example, suppose you try a new food place. You've tried one dish and quite liked it. Do you stick to that dish (exploitation), or do you try something new off the menu (exploration)?

- We'll go through each of these in a later notebook, but here are some common ways to structure a policy:
    1. $\epsilon$-Greedy: With some probability $\epsilon$, choose a random action. Otherwise, choose the action with the highest estimated reward
    2. Softmax/Boltzmann: Pick actions probabilistically based on their estimated value. That is, take a softmax over the estimated rewards of all $N$ actions and sample from it. The softmax here is slightly different from the usual one, because each estimated reward is divided by a constant $\tau$, called the temperature, before the softmax is applied. A small $\tau$ makes the choice nearly greedy, while a large $\tau$ makes it nearly uniform
    
    \begin{aligned}
        P(a_t | s_t) &= \frac{e^{r_a^{t+1} / \tau}}{\sum_{a'=1}^{N} e^{r_{a'}^{t+1} / \tau}}
    \end{aligned}

    3. UCB Sampling: The idea of upper confidence bound (UCB) sampling is that you greedily pick the action with the highest estimated reward $r_{t+1}$ PLUS an exploration bonus. The bonus is large for actions that have been tried only a few times relative to the total number of trials, and it shrinks as an action is tried more often

    \begin{aligned}
        \text{UCB}(a_t) &= r_{a_t}^{t+1} + c \cdot \sqrt{\frac{\ln(N)}{N_{a_t}}} \\
        \text{where} \\
        &N \text{: Total number of trials} \\
        &N_{a_t} \text{: Count of times action } a_t \text{ has been taken} \\
        &c \text{: Constant controlling the aggressiveness of our exploration}
    \end{aligned}

    4. Thompson Sampling: 
        - Assume that the reward for each action $a_t$ follows some distribution $D$ governed by some parameters $\alpha_{a_t}, \beta_{a_t}, ...$. 
        - Using these parameters, "draw" a reward for each action by taking a sample from the distribution $D(\alpha_{a_t}, \beta_{a_t}, ...)$
        - Using these sampled rewards, greedily take the action with the highest draw
        - Then, using the outcome of the action, update your beliefs about the parameters of that action's distribution

        - This is still not very concrete; we will go through this in more detail later. For now, let's just treat this as intuition building
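
- To make the four ideas above a little more concrete, here is a minimal sketch on a Bernoulli bandit (each action pays 1 with some fixed probability, and there is no state). Everything here is an assumption for illustration: the function names, the default values of `epsilon`, `tau` and `c`, and the choice of a Beta distribution for Thompson sampling, which is a common choice for 0/1 rewards.

```python
import numpy as np

def epsilon_greedy(q, rng, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_probs(q, tau=0.1):
    """Boltzmann probabilities: softmax of the estimated values divided by tau."""
    z = (q - np.max(q)) / tau          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def ucb_scores(q, counts, c=2.0):
    """Estimated value plus a bonus that is larger for rarely tried arms."""
    total = counts.sum()
    return q + c * np.sqrt(np.log(total) / counts)

def run_bandit(strategy, true_p, n_steps=2000, seed=0):
    """Play a Bernoulli bandit and return how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    k = len(true_p)
    q = np.zeros(k)            # running mean reward per arm
    counts = np.zeros(k)       # number of pulls per arm
    alpha = np.ones(k)         # Beta parameters (used by Thompson sampling only)
    beta = np.ones(k)
    for t in range(n_steps):
        if strategy == "epsilon_greedy":
            a = epsilon_greedy(q, rng)
        elif strategy == "softmax":
            a = int(rng.choice(k, p=softmax_probs(q)))
        elif strategy == "ucb":
            # Try each arm once first so that counts are never zero.
            a = t if t < k else int(np.argmax(ucb_scores(q, counts)))
        elif strategy == "thompson":
            # Draw one sample per arm from its Beta belief, then act greedily.
            a = int(np.argmax(rng.beta(alpha, beta)))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        reward = float(rng.random() < true_p[a])
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]   # update the running mean
        alpha[a] += reward                    # update the Beta belief
        beta[a] += 1.0 - reward
    return counts

pulls = run_bandit("thompson", true_p=[0.2, 0.5, 0.8])
```

- `pulls` holds the number of times each arm was chosen; swapping the strategy name lets you compare how quickly each approach concentrates on the best arm.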

## Episodic Tasks vs Continuing Tasks

- Another useful distinction is between episodic and continuing tasks
    - Episodic tasks end at some point (e.g. a chess game always ends)
    - Continuing tasks have no terminal step and go on indefinitely

- The implication here is that for episodic tasks, we can simply sum the rewards up to the terminal timestep $T$, while continuing tasks require a discounted return, with a discount factor $0 \le \gamma < 1$ that keeps the infinite sum finite when rewards are bounded

\begin{aligned}
    G_t &= \sum_{k=t}^{T-1} R_{k+1} & \text{ for episodic} \\
    G_t &= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} & \text{ for continuing}
\end{aligned}
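
- A small sketch of both returns, assuming the rewards are given as a list `[R_{t+1}, R_{t+2}, ...]`. The function names are hypothetical. Since an infinite sum cannot be computed directly, the discounted version works on a finite prefix; with $\gamma < 1$ the terms that are cut off become negligibly small.

```python
def episodic_return(rewards):
    """Undiscounted return: the plain sum of the remaining rewards."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return: sum over k of gamma^k * rewards[k]."""
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A constant reward of 1 with gamma = 0.9 approaches 1 / (1 - 0.9) = 10.
approx_infinite = discounted_return([1.0] * 1000, gamma=0.9)
```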