## Rich Sutton and Andy Barto: A Brief History of RL

Andy and I woke reinforcement learning up because it had fallen into neglect, and we clarified what it was and how it differs from supervised learning. This is what I was characterizing as the origin story of reinforcement learning. It was an obvious idea; Marvin Minsky knew it in 1960 or '59. It's so obvious that everyone knew it, but then it became overshadowed by supervised learning until Harry Klopf started knocking on people's doors and saying, "Hey, this is something that's been neglected." It's real and it's important, and he tasked us to figure it out. In fact, the very first neural network simulation on a digital computer, by Farley and Clark in 1954, was a reinforcement learning system.

I think that at the time, the idea was: search for something that works, and then remember it. Combining search and memory is the essence of reinforcement learning, and that, strangely, had been overlooked. Donald Michie talked about memoization, which is one way to see it: RL is memoized search. You do some operation, you remember the result, and the next time you have to do it, you look it up instead of recomputing, which saves a lot of time. In a sense, RL at its root is memoized, context-sensitive search. Popplestone, at the end of one of his papers from the '50s, talks about an interpolator, like using polynomials instead of a lookup table to look something up with generalization. That's what neural networks do, for example. We didn't invent memoization, but we found a new use for it. I don't know if people were doing memoized search the way reinforcement learners do. Klopf had this idea of a distributed approach. He also had the idea of what you'd call generalized reinforcement: that one of these units could be reinforced by all kinds of signals, not just a binary signal. Without making goal-seeking systems out of goal-seeking components, and without generalized reinforcement, we would just have a specialized reward signal.

> I think that's what reinforcement learning is... It is just focusing on a learning system that actually wants something, that does trial and error and remembers it, and that has a specialized reward signal.

## Bellman Equation Derivation
In everyday life, we learn a lot without getting explicit positive or negative feedback. Imagine, for example, you are riding a bike and hit a rock that sends you off balance. Let's say you recover, so you don't suffer any injury. You might learn to avoid rocks in the future and perhaps react more quickly if you do hit one. We recognize that the state of losing our balance is bad even without falling and hurting ourselves. In reinforcement learning, a similar idea allows us to relate the value of the current state to the value of future states without waiting to observe all the future rewards. We use **Bellman equations** to formalize this connection between the value of a state and its possible successors.

### Bellman equation for the state-value function

First, let's talk about the Bellman equation for the **state-value function**. The Bellman equation for the state-value function defines a relationship between the value of a state and the values of its possible successor states. To derive this relationship from the definitions of the state-value function and the return, let's start by recalling that the state-value function is defined as the expected return starting from the state $s$. Recall that the return is defined as the discounted sum of future rewards.

$$
v_{\pi}(s) \doteq \mathbb{E}_{\pi} [\color{blue}{G_t} | S_t = s] \hspace{100px} \color{blue}{G_t = \sum\limits_{k=0}^{\infty} \gamma^k R_{t+k+1}}
$$

We saw previously that the return at time $t$ can be written recursively as the immediate reward plus the discounted return at time $t+1$.

$$
v_{\pi}(s) = \mathbb{E}_{\pi} [\color{blue}{R_{t+1} + \gamma G_{t+1}} | S_t = s]
$$
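
The recursive form of the return can be checked numerically. The sketch below assumes a finite, made-up reward sequence (so every reward after the last one is treated as zero); the function name `discounted_return` and the reward values are illustrative, not from the lecture.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma**k * R_{t+k+1} for a finite reward list."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0, 3.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

# Direct definition: discounted sum of all future rewards.
direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form: immediate reward plus discounted return from t+1.
recursive = rewards[0] + gamma * discounted_return(rewards[1:], gamma)

print(direct, recursive)
```

Both quantities agree up to floating-point rounding, which is the identity the derivation relies on.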

Now, let's expand this expected return. First, we expand the expected return as a sum over possible action choices made by the agent. Second, we expand over possible rewards and next states conditioned on state $s$ and action $a$. We can break it down in this order because the action choice depends only on the current state, while the next state and reward depend only on the current state and action. The result is a weighted sum of terms consisting of the immediate reward plus the expected future return from the next state $s'$.

$$
v_{\pi}(s) = \color{green}{\sum\limits_a \pi(a|s) \sum\limits_{s'}\sum\limits_{r} p(s',r | s, a)} \left [r + \gamma \color{blue}{\mathbb{E}_{\pi} [G_{t+1} | S_{t+1} = s']}\right ]
$$

All we have done is explicitly write the expectation as it's defined, as a sum of possible outcomes weighted by the probability that they occur. Note that capital $R_{t+1}$ is a random variable, while the little $r$ represents each possible reward outcome. The expected return depends on states and rewards infinitely far into the future. We could recursively expand this equation as many times as we want, but it would only make the expression more complicated.

$$
v_{\pi}(s) = \color{green}{\sum\limits_a \pi(a|s) \sum\limits_{s'}\sum\limits_{r} p(s',r | s, a)} \left [r + \gamma \color{blue}{\sum\limits_{a'} \pi(a'|s') \sum\limits_{s''}\sum\limits_{r'} p(s'',r' | s', a') } \left [r' + \gamma \color{red}{\mathbb{E}_{\pi} [G_{t+2} | S_{t+2} = s'']}\right ] \right ]
$$

Instead, we can notice that this expected return is also the definition of the value function for state $s'$. The only difference is that the time index is $t+1$ instead of $t$. This is not an issue because neither the policy nor $p$ depends on time. Making this replacement, we get the **Bellman equation for the state-value function**. 

$$
v_{\pi}(s) = \color{green}{\sum\limits_a \pi(a|s) \sum\limits_{s'}\sum\limits_{r} p(s',r | s, a)} \left [r + \gamma \color{blue}{v_{\pi}(s')}\right ]
$$

The magic of value functions is that we can use them as a stand-in for the average of an infinite number of possible futures. 
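
To make the equation concrete, here is a minimal sketch on a made-up two-state MDP. The states, actions, probabilities, rewards, and the equiprobable policy are all assumptions chosen for illustration. The values are found by repeatedly applying the right-hand side of the Bellman equation as an update until it stops changing (iterative policy evaluation, a technique not derived in this section), and then the equation is checked at the result.

```python
GAMMA = 0.9

# Hypothetical dynamics p(s', r | s, a):
# (state, action) -> list of (probability, next_state, reward).
P = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "move"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}

# Hypothetical policy pi(a | s): both actions equally likely in every state.
PI = {s: {"stay": 0.5, "move": 0.5} for s in ("A", "B")}


def bellman_rhs(s, v):
    """Right-hand side of the Bellman equation for state s, given value table v."""
    return sum(
        PI[s][a] * sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in P[(s, a)])
        for a in PI[s]
    )


def evaluate_policy(tol=1e-12):
    """Apply the Bellman right-hand side to every state until values settle."""
    v = {s: 0.0 for s in PI}
    while True:
        new_v = {s: bellman_rhs(s, v) for s in PI}
        delta = max(abs(new_v[s] - v[s]) for s in PI)
        v = new_v
        if delta < tol:
            return v


v = evaluate_policy()
for s in PI:
    # At the fixed point, v(s) equals its own Bellman right-hand side.
    print(s, v[s], bellman_rhs(s, v))
```

Each printed pair should match up to the tolerance: the value of a state is recovered from the values of its successors alone, without enumerating any longer futures.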

### Bellman equation for the action-value function

We can derive a similar equation for the action-value function. Recall the definition of the action-value function.

$$
q_{\pi}(s, a) \doteq \mathbb{E}_{\pi} [G_{t} | S_{t} = s, A_{t} = a]
$$

We create a recursive equation for the value of a state-action pair in terms of its possible successor state-action pairs. In this case, the equation does not begin with the policy selecting an action. This is because the action is already fixed as part of the state-action pair. Instead, we skip directly to the dynamics function $p$ to select the immediate reward and next state $s'$. Again, we have a weighted sum over terms consisting of the immediate reward plus the expected future return given a specific next state $s'$.

$$
q_{\pi}(s, a) = \color{green}{\sum\limits_{s'}\sum\limits_{r} p(s',r | s, a)} \left [r + \gamma \color{blue}{\mathbb{E}_{\pi} [G_{t+1} | S_{t+1} = s']}\right ]
$$

However, unlike the Bellman equation for the state-value function, we can't stop here. We want a recursive equation for the value of one state-action pair in terms of the next state-action pair. At the moment, we have the expected return given only the next state. To change this, we can express the expected return from the next state as a sum over the agent's possible action choices. In particular, we can change the expectation to be conditioned on both the next state and the next action and then sum over all possible actions. Each term is weighted by the probability under $\pi$ of selecting $a'$ in the state $s'$.

$$
q_{\pi}(s, a) = \color{green}{\sum\limits_{s'}\sum\limits_{r} p(s',r | s, a)} \left [r + \gamma \sum\limits_{a'} \pi(a'|s') \color{blue}{\mathbb{E}_{\pi} [G_{t+1} | S_{t+1} = s', A_{t+1} = a']}\right ]
$$
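
The step used here is the law of total expectation applied to the next action. Written in terms of the two value functions, and using only the definitions given earlier, it is the identity:

$$
v_{\pi}(s') = \mathbb{E}_{\pi} [G_{t+1} | S_{t+1} = s'] = \sum\limits_{a'} \pi(a'|s') \, \mathbb{E}_{\pi} [G_{t+1} | S_{t+1} = s', A_{t+1} = a'] = \sum\limits_{a'} \pi(a'|s') \, q_{\pi}(s', a')
$$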

Now, this expected return is the same as the definition of the action-value function for $s'$ and $a'$. Making this replacement, we get the **Bellman equation for the action-value function**.

$$
q_{\pi}(s, a) = \color{green}{\sum\limits_{s'}\sum\limits_{r} p(s',r | s, a)} \left [r + \gamma \sum\limits_{a'} \pi(a'|s') \color{blue}{q_{\pi}(s', a')}\right ]
$$
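
As a rough numerical illustration, the sketch below evaluates this equation on a made-up two-state MDP. The dynamics, rewards, and equiprobable policy are assumptions chosen for illustration. It iterates the right-hand side as an update until the table stops changing (an approach not derived in this section), then checks that the result satisfies the equation.

```python
GAMMA = 0.9

# Hypothetical dynamics p(s', r | s, a):
# (state, action) -> list of (probability, next_state, reward).
P = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "move"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}

# Hypothetical policy pi(a | s): both actions equally likely in every state.
PI = {s: {"stay": 0.5, "move": 0.5} for s in ("A", "B")}


def q_rhs(s, a, q):
    """Right-hand side of the Bellman equation for the pair (s, a)."""
    return sum(
        prob * (r + GAMMA * sum(PI[s2][a2] * q[(s2, a2)] for a2 in PI[s2]))
        for prob, s2, r in P[(s, a)]
    )


def evaluate_q(tol=1e-12):
    """Apply the Bellman right-hand side to every pair until values settle."""
    q = {sa: 0.0 for sa in P}
    while True:
        new_q = {(s, a): q_rhs(s, a, q) for (s, a) in P}
        delta = max(abs(new_q[sa] - q[sa]) for sa in P)
        q = new_q
        if delta < tol:
            return q


q = evaluate_q()

# State values recovered from action values: v(s) = sum_a pi(a|s) q(s, a).
v_from_q = {s: sum(PI[s][a] * q[(s, a)] for a in PI[s]) for s in PI}

for (s, a) in P:
    print(s, a, q[(s, a)], q_rhs(s, a, q))
```

Note that the sum over actions sits inside the bracket here, after the dynamics, whereas in the state-value equation the policy sum comes first.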