# Motivation
* Actor-critic methods sit at the intersection of __value-based methods__ (DQN) and **policy-based methods** (REINFORCE).
* If an agent uses a neural network to approximate a value function, it is called a value-based method.
* If an agent uses a neural network to approximate the policy directly, it is called a policy-based method.
* The DQN agent we learned about uses a neural network to approximate the **optimal action-value** function.
* Value-based methods can be used to find the following value functions:
    * $V_{\pi}(s)$ -- __state-value__
    * $Q_\pi(s,a)$ -- __action-value__
    * $A_\pi(s,a)$ -- __advantage function__
    * optimal version is denoted by $V_*$ and so on.
* __Stochastic Policy__: takes in a state and outputs a probability distribution over all possible actions.
* __Deterministic Policy__: takes in a state and outputs a single action for that state.
* The __problem__ with __policy-based methods__ is **high variance**.
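
To make the difference between the two policy types concrete, here is a minimal sketch in plain Python. The list `prefs` stands in for hypothetical network outputs for one state, and the softmax is just one common way (an assumption here) to turn such outputs into action probabilities.

```python
import math
import random

def softmax(preferences):
    """Turn raw action preferences into a probability distribution."""
    m = max(preferences)                       # subtract max for numerical stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_policy(preferences, rng):
    """Sample an action index from the softmax distribution."""
    probs = softmax(preferences)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def deterministic_policy(preferences):
    """Always return the single highest-preference action."""
    return max(range(len(preferences)), key=lambda a: preferences[a])

prefs = [0.1, 2.0, -1.0]          # hypothetical outputs for one state
rng = random.Random(0)            # seeded only to make the demo repeatable
probs = softmax(prefs)
sampled = [stochastic_policy(prefs, rng) for _ in range(5)]
greedy = deterministic_policy(prefs)
```

The stochastic policy can return different actions for the same state on different calls, while the deterministic one always returns the same action.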
### Main idea of the Actor-Critic Method
* Use a value-based method to reduce the variance of a policy-based method.

# Bias and Variance(Important)
* There is always a trade off between **bias and variance**.
<img src = "images/a16.png">
* __1 Quadrant__ : HIGH BIAS, HIGH VARIANCE
* __2 Quadrant__ : LOW BIAS, HIGH VARIANCE
* __3 Quadrant__ : LOW BIAS, LOW VARIANCE -- _desired_
* __4 Quadrant__ : HIGH BIAS, LOW VARIANCE



* An agent tries to estimate value functions or policies from returns. A return (the cumulative reward over a trajectory) is calculated from a single trajectory; **however, value functions, which are what we're trying to estimate, are calculated using the EXPECTATION** of returns.
* A BIG PART OF REINFORCEMENT LEARNING RESEARCH IS ABOUT REDUCING THE VARIANCE WHEN CALCULATING __RETURNS__.

* Simple Return over a trajectory $G_t = R_{t+1}+R_{t+2}+R_{t+3}+ .. + R_T$
* __State value__: $V_\pi(s) = E_\pi[G_t|S_t=s]$
* __Action value__: $Q_\pi(s,a) = E_\pi[G_t|S_t = s,A_t=a]$
* __Advantage value__: $A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)$


* __Goal__ : To reduce the variance while keeping the bias to a minimum.

* __A reinforcement learning agent tries to find (optimal) policies that maximize the total expected reward. But because we are limited to sampling the environment, we can only estimate this expectation. The question is: what is the best way to estimate value functions for our actor-critic methods?__

## Two Ways For Estimating Expected Returns
* Monte Carlo estimate -- **unbiased but high variance**
* Temporal Difference -- **BIASED (because we use the estimated value of the next step, $Q_{t+1}$, to estimate $Q_t$) but low variance** -- **the agent will learn faster but may have problems converging**
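
The two estimates can be compared side by side in a few lines of Python. This is only a sketch: the reward list and the value estimate `v_next` are made-up numbers, and the discount `gamma` is an added parameter that defaults to 1 so it matches the undiscounted return $G_t$ defined above.

```python
def monte_carlo_target(rewards, gamma=1.0):
    """Full return G_t computed from one complete sampled trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, next_value, gamma=1.0):
    """One-step TD target: one real reward plus a bootstrapped estimate."""
    return reward + gamma * next_value

rewards = [1.0, 0.0, 2.0]   # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
v_next = 1.5                # hypothetical current estimate of V(S_{t+1})

mc = monte_carlo_target(rewards)        # uses every reward until the end
td = td_target(rewards[0], v_next)      # uses one reward, then a guess
```

The Monte Carlo target depends on every random reward in the trajectory (hence the variance), while the TD target depends on a single reward plus `v_next`, which may be wrong (hence the bias).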

## Baselines and Critics(Important)
* **Monte-Carlo estimate** is __unbiased__ but __high variance__.
* **TD-Estimate** is **biased** but **low variance**.
* Using a function approximator gives us the power of generalization: when we encounter a new state, whether or not we have visited it before, our deep neural network will still output a value estimate because it has been trained on similar data.
* To repeat: Monte Carlo methods have high variance but no bias, while the TD estimate has low variance but introduces some bias.
* The word **critic** implies that bias has been introduced. A Monte Carlo estimate has no bias, so a baseline trained with it is not a critic.
* So if we train the **baseline** with a **TD estimate** instead of a Monte Carlo estimate, we can say we have a critic.
* Sure, we will be introducing bias, but **we will be reducing the variance of the model**, which improves convergence and speeds up learning.
* In actor-critic methods, all we are trying to do is reduce the high variance commonly associated with policy-based methods.
* **Using a TD critic instead of a Monte Carlo baseline will reduce the variance and thus the convergence problems.**


### Takeaway
* The important takeaway for you, though, is that there are inconsistencies out there. You often see methods named "Actor-Critic" when they are not. I just want to bring the issue to your attention.

## Policy-based, Value-based and Actor Critic
### Actor (Policy Based Approach)
* We play a bunch of matches and then go home and reflect on them. We commit to doing more of what we did in the matches we won and less of what we did in the matches we lost. After repeating this process many, many times, we will have increased the probability of actions that led to wins and decreased the probability of actions that led to losses.
* But this approach is rather inefficient, as it needs lots of data to learn a useful policy. Many of the actions within a game that ended in a loss could have **been really good actions**, so decreasing the probability of good actions taken in a match only because we lost is not the best idea. Sure, if we repeat this process infinitely often we're likely to end up with a good policy, __but at the cost of slow learning.__
* Policy-based agents have **high variance**.
### Critic (Value Based Approach)
* We start playing a match, and even before we get started we start **guessing** what the final score is going to be. We continue to make guesses throughout the match. At first our guesses will be off, but as we get more experienced we will be able to make pretty solid guesses. The better our guesses, the better we can tell good situations from bad ones and good actions from bad ones; the better we make this distinction, the better we'll perform, provided we choose the good actions. This is not a perfect approach either: guesses introduce bias because they'll sometimes be wrong, and, particularly with little experience, they are prone to under- or overestimation. But guesses are **more consistent** through time.
* If we think we are going to win a match five minutes into it, chances are we will still think so **ten minutes** into it.
* This is what gives **TD estimates** lower **variance**.

***
* __Policy-based methods__ -- the agent learns to **act**.
* __Value-based methods__ -- the agent learns to **estimate situations and actions**.

## A Basic Actor-Critic Agent
* Actor Critic = Policy Methods + Value Based Method
* An actor-critic agent uses function approximation to learn both a policy and a value function, so we use two neural networks: one for the actor and one for the critic. The critic learns to evaluate the state-value function $V_\pi$ using the TD estimate. Using the critic, we calculate the **advantage function** and train the actor with that value. A very basic online actor-critic agent has the following two networks and algorithm.
## Actor:
* Takes in a state and **outputs** a distribution over actions $\pi(a|s;\theta_{\pi})$
## Critic:
* Takes in a state and __outputs state value function__ $V(s;\theta_v)$
## Algorithm
* Input the `current state` into the **actor** and get the __action__ to take in that **state**; observe the `next state` and `reward` to get the experience tuple $(s,a,r,s^{'})$.
* Then **train** the critic using the **TD estimate** as the label: the reward $r$ plus the discounted **critic** estimate for $s^{'}$, that is, $r + \gamma V(s^{'};\theta_{v})$.
* Use the **critic** to calculate the **advantage** $A(s,a) = r + \gamma V(s^{'};\theta_v) - V(s;\theta_v)$.
* Finally, train the **actor** using the **advantage**, in which the critic's value $V(s;\theta_v)$ serves as the baseline.
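
The four steps can be sketched without any deep learning library by replacing the two networks with lookup tables. Everything below is an illustrative assumption rather than the course's implementation: `V` is a dictionary playing the role of the critic, `prefs` holds per-state action preferences for a softmax actor, and the `done` flag and learning rates are hypothetical additions.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      gamma=0.99, lr_v=0.1, lr_pi=0.1):
    """One online update from a single experience tuple (s, a, r, s')."""
    # TD target: reward plus the critic's discounted estimate for s'.
    bootstrap = 0.0 if done else V[s_next]
    td_target = r + gamma * bootstrap
    # Advantage estimate A(s, a) = r + gamma * V(s') - V(s).
    advantage = td_target - V[s]
    # Critic update: move V(s) toward the TD target.
    V[s] += lr_v * advantage
    # Actor update: for a softmax policy, the gradient of log pi(a|s)
    # with respect to the preferences is (one_hot(a) - probs).
    probs = softmax(prefs[s])
    for b in range(len(probs)):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += lr_pi * advantage * grad_log_pi
    return advantage

V = {0: 0.0, 1: 0.0}                        # tabular stand-in for the critic
prefs = {0: [0.0, 0.0], 1: [0.0, 0.0]}      # tabular stand-in for the actor
adv = actor_critic_step(V, prefs, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive advantage raises the preference for the action that was taken and lowers the others; a negative advantage does the opposite.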

<img src = "images/a17.png">
### Notes
One important thing to note here is that we sometimes write $V(s;\theta_v)$ or $A(s,a)$, and sometimes $V_\pi(s;\theta_v)$ or $A_\pi(s,a)$.
<br>There are actually two things going on here.
1. A very common thing in reinforcement learning is the oversimplification of notation. In both styles, whether we see $A(s,a)$ or $A_\pi(s,a)$ (value functions with or without a $\pi$), it means we are evaluating a value function of policy $\pi$; in the case of $A$, the advantage function. A different case is when we see a $*$, written as a subscript or superscript. For example, $A^*(s,a)$ means the optimal advantage function. Q-learning, for example, learns the optimal action-value function $Q^*(s,a)$.
2. The other thing is the use of $\theta_v$ in some value functions and not in others. This only means that such value function is using a neural network. For example, $V(s;\theta_v)$ is using a neural network as function approximator, but $A(s,a)$ is not. We are calculating the advantage function $A(s,a)$ using the state-value function $V(s;\theta_v)$, but $A(s,a)$ is not using the function approximator directly.

# A3C: Asynchronous Advantage Actor-Critic, N-step Bootstrapping
* We will calculate the advantage function $A(s,a) = r + \gamma V(s^{'};\theta_v) - V(s;\theta_v)$, and the __critic__ will learn to estimate the state value $V_\pi(s;\theta_v)$.
* If we are using images as inputs to our agent, **A3C** can use a single CNN with the actor and critic sharing weights and splitting into two separate heads, one for the actor and one for the critic. Note that **A3C** is not to be used **exclusively** with CNNs and images. Sharing weights is more efficient, but it is also a more complex approach and can be harder to train.
* It's a good idea to start with separate sets of weights (two networks) and switch to shared weights only to improve performance.
##### Important
* Instead of the **TD estimate**, **A3C** uses **N-step bootstrapping**.
    * **N-step bootstrapping** is simply an abstraction and generalization of the **TD** and **Monte Carlo estimates**.
* **TD** is one-step bootstrapping: our agent goes out and experiences **one time step of real reward** and then bootstraps right there (low variance, but biased).
* **Monte Carlo** goes all the way out to the terminal state and does not bootstrap, because it doesn't need to (high variance, but no bias). The Monte Carlo estimate can be seen as infinite-step bootstrapping. But how about going more than **one** step, though not all the way out?
* Can we take two time steps of real reward and then bootstrap from the second next state? Can we do three, four, or more? We sure can; this is what is called **N-step bootstrapping**.
<img src = "images/a18.png">
* **A3C** uses this type of return to **train** the __critic__.
* For example, in our tennis example, N-step bootstrapping means we wait a little while before guessing what the final score will look like. Experiencing the environment a little longer before we calculate the expected return of the original state allows us to have **less bias** in our prediction while keeping __variance__ under control.
* In practice, bootstrapping only a few steps out, say __four or five steps__, is often best.
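
As a rough illustration of N-step bootstrapping, the function below sums `n` real rewards and then bootstraps from the critic's estimate of the state reached afterwards. The list layout is an assumption made for this sketch: `rewards[k]` is the reward received after acting in state `k`, `values[k]` is the critic's estimate for state `k`, and `values` has one extra entry of 0 for the terminal state.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """N-step bootstrapped return starting at time step t."""
    end = min(t + n, len(rewards))      # do not run past the end of the episode
    g = 0.0
    for k in range(t, end):
        g += (gamma ** (k - t)) * rewards[k]        # real rewards
    g += (gamma ** (end - t)) * values[end]         # bootstrap from V(s_end)
    return g

rewards = [1.0, 1.0, 1.0, 1.0]              # hypothetical 4-step episode
values = [0.5, 0.5, 0.5, 0.5, 0.0]          # hypothetical critic estimates

one_step = n_step_return(rewards, values, t=0, n=1, gamma=1.0)   # TD estimate
two_step = n_step_return(rewards, values, t=0, n=2, gamma=1.0)
full = n_step_return(rewards, values, t=0, n=4, gamma=1.0)       # Monte Carlo
```

With `n=1` this reduces to the TD estimate, and with `n` at least as large as the remaining episode length it reduces to the Monte Carlo return.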

## A3C : Asynchronous Advantage Actor - Critic, Parallel Training 
* Unlike DQN, **A3C** does not use a replay buffer. The main reason we needed a replay buffer was to break the **correlation between sequential states**.
* With a replay buffer, we can randomly select experience tuples to form a mini-batch, which breaks the correlation between sequential experience tuples and also lets us reuse old experience.
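
For reference, a replay buffer of the kind DQN relies on can be sketched in a few lines. The class name, its methods, and the fixed seed are illustrative choices for this sketch, not taken from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of experience tuples with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)   # oldest tuples are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the ordering of consecutive experiences.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for step in range(5):                          # hypothetical experience stream
    buffer.add(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2)
```

Because the buffer keeps tuples generated by older versions of the agent, learning from it is off-policy.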
<img src = "images/a19.png">

### A3C replacement for Replay Buffer
* __A3C__ replaces the replay buffer with parallel training: it creates multiple instances of the environment and agent and runs them all at the same time. Our agent receives mini-batches of decorrelated experiences, just as we need, because the agents will likely be experiencing different states at any given time.
* This type of training allows us to use __on-policy__ learning in our learning algorithm, which is often associated with more __stable__ learning.
<img src = "images/a20.png">

# A3C: Asynchronous Advantage Actor-Critic, Off-policy vs. On-policy
* __On-policy__: the policy used for interacting with the environment is also the policy being __learned__ (or being optimized toward the optimal policy).
* __Off-policy__: the policy used for interacting with the environment is __different__ from the policy being learned.
* **Sarsa** is a good example of **on-policy** learning and **Sarsamax** (Q-learning) is a good example of **off-policy** learning.
    * A Q-learning agent learns about the optimal policy, even though the policy that **generates behaviour is an exploratory policy**, often an **epsilon-greedy policy**.
* **Q-learning** learns an **optimal deterministic policy** even if its behaviour policy is totally **stochastic** (random).
* **Sarsa** learns the best **exploratory** policy, that is, the best policy that still explores.
### Update Equations:
#### Sarsa
$$Q(S,A) \gets Q(S,A) + \alpha [R + \gamma Q(S^{'},A^{'}) - Q(S,A)]$$
#### Q-learning
$$Q(S,A) \gets Q(S,A) + \alpha [R + \gamma \max_a Q(S^{'},a) - Q(S,A)]$$
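
The two update rules differ only in how the next state is evaluated, which a small tabular sketch makes visible. The Q-table below is a made-up nested list indexed as `Q[state][action]`, and the step size and discount are arbitrary illustrative values.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the greedy action, whatever is taken next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def fresh_q():
    # Hypothetical table: 2 states x 2 actions.
    return [[0.0, 0.0], [1.0, 3.0]]

q_sarsa = fresh_q()
sarsa_update(q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)   # explored action 0

q_learn = fresh_q()
q_learning_update(q_learn, s=0, a=0, r=1.0, s_next=1)        # uses max over actions
```

Here the exploratory next action (value 1.0) pulls the Sarsa estimate to about 0.95, while Q-learning uses the greedy value 3.0 and ends up at about 1.85.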

* DQN is also an off-policy learning method: our agent behaves with some exploratory policy, say epsilon-greedy, but it learns about the optimal policy. With off-policy learning, agents are able to __learn from many different sources__, including experiences generated by all earlier versions of the agent itself; thus the replay buffer.
* However, off-policy learning is known to be **unstable** and often **diverges** with deep neural networks.
* **A3C**, on the other hand, is an __on-policy__ method. With on-policy learning, we only use the data generated by the policy currently being learned, and any time we improve the policy we __toss out__ the old data and go collect some more.
* __On-policy__ learning is a bit inefficient in its use of experience, but it often has more stable and consistent convergence properties.
* A simple analogy of on and off policy goes this way:
    * __On-policy__ is learning from your own hands-on experience, for example the projects in this nanodegree. As we can imagine, that is a pretty good way of learning, but it is somewhat data inefficient: we can only do so many projects before we run out of time.
    * __Off-policy__, on the other hand, is learning from someone else's experience. As such it is more sample efficient, because we can learn from many different sources. However, this way of learning is more prone to misunderstandings: other people might not be able to explain things in a way we understand well.
* The nanodegree analogy for the **off-policy** case is learning from watching the lessons: we learn much faster this way, but perhaps not as well or as deeply as we would from self-study or our own hands-on experience.
* Usually a good balance between **off-policy** and **on-policy** is desired.




If we want to learn about an agent that combines on-policy and off-policy learning, read the paper by **Google** titled [Q-prop](https://arxiv.org/abs/1611.02247).
<br>**Q-Prop: Sample-Efficient Policy Gradient** with an **Off-Policy** Critic.

# A2C: Advantage Actor-Critic
* What is the asynchronous part of __A3C__ all about?
* __A3C__ accumulates gradient updates and applies them **asynchronously** to a global neural network.
* Each agent in the simulation does this in its own time: it uses a local copy of the network to collect experience, calculates and accumulates gradients across multiple time steps, and then applies these gradients to the global network asynchronously.
* Asynchronous here means that each agent updates the network on its own, with no synchronization between agents. This also means that the weights one agent is using might differ from the weights in use by another agent at any given time.



There is a **synchronous** implementation of **A3C** known as **A2C**. It has an extra bit of code that synchronizes all agents: it waits for all agents to finish a segment of interaction with their copies of the environment and then updates the network at once, before sending the updated weights back to all agents.
<br>__A2C__ is arguably simpler to implement, gives pretty much the same results, and allegedly in some cases performs even better.
<br>__A3C__ is most easily trained on a CPU, while A2C is more straightforward to extend to a GPU implementation.
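
The difference between the two update styles can be caricatured with plain lists of weights and gradients. This is a deliberately simplified sketch under stated assumptions: gradients are given as made-up numbers, the synchronous version averages them (summing is an equally plausible choice), and the asynchronous version ignores the fact that real workers compute their gradients on possibly stale local weights.

```python
def a2c_sync_update(global_weights, worker_gradients, lr=0.5):
    """Wait for every worker, average their gradients, apply one update."""
    n = len(worker_gradients)
    avg = [sum(g[i] for g in worker_gradients) / n
           for i in range(len(global_weights))]
    return [w - lr * g for w, g in zip(global_weights, avg)]

def a3c_async_update(global_weights, worker_gradient, lr=0.5):
    """Apply one worker's gradient as soon as it arrives."""
    return [w - lr * g for w, g in zip(global_weights, worker_gradient)]

weights = [1.0, 1.0]                       # hypothetical global weights
grads = [[1.0, 0.0], [3.0, 2.0]]           # hypothetical gradients from 2 workers

synced = a2c_sync_update(weights, grads)   # one update from the averaged gradient

asynced = weights
for g in grads:                            # two separate updates, in arrival order
    asynced = a3c_async_update(asynced, g)
```

In the synchronous case every worker then continues from the same `synced` weights, which is what makes batching the workers on a GPU straightforward.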

# GAE: Generalized Advantage Estimation
* There is another way of estimating expected returns, called the **lambda return**. The intuition goes this way.
* Say that after trying n-step bootstrapping we realize that values of n larger than one often perform better, but it is still hard to tell what that number should be: two, three, or something else?
* To make the decision even more difficult, in some problems small values of n are better, while in others large values of n are better.
* How do we get this right? The idea of the lambda return is to create a mixture of all n-step bootstrapping estimates at once.
* **Lambda** ($\lambda$) is a hyperparameter used for __weighting__ the contribution of each n-step estimate to the lambda return.
* Say we set lambda to $0.5$. The lambda return would then be a combination of all n-step returns, weighted by an exponentially decaying factor across the different n-step estimates.
* Notice how the weight depends on the value of lambda we set and decays exponentially at the rate of that value. So to calculate the lambda return for state $s$ at time step $t$, we take all n-step returns and multiply each one by its corresponding weight.
* Then we add them all up; this sum is the **lambda return for state $s$ at time step $t$**.
* Interestingly, when lambda is set to zero, the two-step, three-step, and all other n-step returns except the one-step return get a weight of zero.
* So when lambda is zero, the lambda return equals the **TD estimate**. When lambda is set to __one__, every n-step return other than the infinite-step return gets a weight of __zero__, so the lambda return is equivalent to the **Monte Carlo estimate**.
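
The weighting described above can be checked numerically with a short sketch. It assumes the usual episodic form of the lambda return, in which the n-step return gets weight $(1-\lambda)\lambda^{n-1}$ and all leftover weight goes to the full Monte Carlo return; the reward and value lists are made-up numbers, with `values` holding one extra 0 for the terminal state.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n real rewards from time t, then bootstrap from the critic."""
    end = min(t + n, len(rewards))
    g = sum((gamma ** (k - t)) * rewards[k] for k in range(t, end))
    return g + (gamma ** (end - t)) * values[end]

def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Exponentially weighted mixture of all n-step returns from time t."""
    steps_left = len(rewards) - t            # steps until the episode ends
    total = 0.0
    for n in range(1, steps_left):
        weight = (1.0 - lam) * lam ** (n - 1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight goes to the full (Monte Carlo) return.
    total += lam ** (steps_left - 1) * n_step_return(
        rewards, values, t, steps_left, gamma)
    return total

rewards = [1.0, 1.0, 1.0, 1.0]               # hypothetical episode
values = [0.5, 0.5, 0.5, 0.5, 0.0]           # hypothetical critic estimates

td_like = lambda_return(rewards, values, t=0, lam=0.0)   # equals 1-step return
mixed = lambda_return(rewards, values, t=0, lam=0.5)
mc_like = lambda_return(rewards, values, t=0, lam=1.0)   # equals full return
```

For this toy episode the one-step return is 1.5 and the full return is 4.0, and `lam=0.5` gives a value in between.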
<img src = "images/a21.png">


#### GAE
* Generalized advantage estimation (GAE) applies this __lambda return__ idea to the critic-based advantage estimate.
* We can fit the advantage function just as in **A3C** and **A2C**, but using a mixture of n-step bootstrapping estimates.
* It's important to highlight that this type of return can be combined with virtually any policy-based method; in fact, in the paper that introduced **GAE**, **TRPO** was the policy-based method used.
* By using this type of estimation, **TRPO** plus __GAE__ trains very quickly, because multiple value estimates are spread around at every time step thanks to the lambda-return-style estimate.
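
A common way to compute these advantages is a single backward pass over the one-step TD errors $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, accumulating them with a decay of $\gamma\lambda$. The sketch below assumes that recursive form; the function name, the default hyperparameters, and the data are illustrative, and `values` again carries one extra 0 for the terminal state.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimate for every time step of one finished episode."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially decayed sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0, 1.0]               # hypothetical episode
values = [0.5, 0.5, 0.5, 0.5, 0.0]           # hypothetical critic estimates

adv_td = gae_advantages(rewards, values, gamma=1.0, lam=0.0)   # 1-step advantages
adv_mc = gae_advantages(rewards, values, gamma=1.0, lam=1.0)   # return minus V(s)
adv_mix = gae_advantages(rewards, values, gamma=1.0, lam=0.5)
```

With `lam=0` each entry is the one-step advantage used by the basic actor-critic agent, and with `lam=1` each entry is the Monte Carlo return minus the critic's value.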
<img src = "images/a22.png">

# DDPG: Deep Deterministic Policy Gradient, Continuous Action-Space
* DDPG is a different kind of actor-critic method; in fact, it could be seen as a kind of approximate DQN rather than an actual actor-critic method.
* The reason is that the critic in DDPG is used to approximate the __maximizer over the action values__ of the next state, $\max_a Q(s^{'},a)$, and not as a learned baseline.
* Still, this is a very important algorithm.
* One limitation of the DQN agent is that it only works for **discrete action spaces**.
    * This is because its training label is $r + \gamma \max_a Q(s^{'},a)$, and it is hard to find the max over a continuous action space.
### DDPG
* In **DDPG** we use two networks: an actor and a critic.
## Actor
* The actor here is used to approximate the **optimal policy deterministically**.
* We always want to output the best-believed action for any given state. This is **unlike a stochastic policy**, where we learn to output a **probability distribution** over actions.
* In **DDPG** we want to output the best-believed action every time we query the network for an action (that is, a **deterministic policy**).
* The actor is basically learning to output $\arg\max_{a}Q(s,a)$, which is the best action.
## Critic
* Learns to evaluate the optimal action-value function by using the actor's best-believed action:
* $Q(s,\mu(s;\theta_\mu);\theta_Q)$



In the [DDPG paper](https://arxiv.org/abs/1509.02971), they introduced this algorithm as an "Actor-Critic" method. Though, some researchers think DDPG is best classified as a DQN method for continuous action spaces (along with [NAF](https://arxiv.org/abs/1603.00748)). Regardless, DDPG is a very successful method and it's good for you to gain some intuition.

## DDPG Pipeline
* First, we input the current state into the actor and get an action.
* Then we take that action and observe the next state and reward.
* With `torch.no_grad()`, we pass the next state, together with the actor's action for it, through the critic network to get $Q(s^{'},\mu(s^{'};\theta_\mu);\theta_Q)$.
* The target for the critic is $r + \gamma Q(s^{'},\mu(s^{'};\theta_\mu);\theta_Q)$, trained with an MSE loss.
* The actor is trained with the policy gradient to maximize $Q(s,\mu(s;\theta_\mu);\theta_Q)$.
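
The two training signals in this pipeline can be written down without a deep learning framework by treating the actor and critic as plain functions. The `toy_actor` and `toy_critic` below are hypothetical stand-ins for the networks, and the `done` flag is an added assumption; the full algorithm in the DDPG paper also uses a replay buffer and separate target networks, which are left out of this sketch.

```python
def critic_target(reward, next_state, done, actor, critic, gamma=0.99):
    """Label for the critic: r + gamma * Q(s', mu(s'))."""
    if done:
        return reward                       # nothing to bootstrap from
    next_action = actor(next_state)         # actor's best-believed action for s'
    return reward + gamma * critic(next_state, next_action)

def critic_loss(state, action, target, critic):
    """Squared error between Q(s, a) and the target (MSE for one sample)."""
    return (critic(state, action) - target) ** 2

def actor_objective(state, actor, critic):
    """Quantity the actor is trained to maximize: Q(s, mu(s))."""
    return critic(state, actor(state))

# Hypothetical stand-ins for the two networks, for illustration only.
def toy_actor(state):
    return 0.5 * state

def toy_critic(state, action):
    return state + action

y = critic_target(reward=1.0, next_state=2.0, done=False,
                  actor=toy_actor, critic=toy_critic, gamma=0.5)
loss = critic_loss(state=1.0, action=0.5, target=y, critic=toy_critic)
j = actor_objective(2.0, toy_actor, toy_critic)
```

In a PyTorch implementation, `y` would be computed under `torch.no_grad()` so that the target is treated as a constant, and the actor would be updated by minimizing `-j`.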
