Key Concepts: 
- GPI (Generalized Policy Iteration)
    - Iteratively alternate between approximating the value function and improving the policy, converging toward the optimal value function and policy
    - Feedback between evaluation and improvement
    - Evaluate by applying the Bellman equation as an update rule (an expected update over successor states).
    - Coin-flip example (the gambler's problem) - remember the policy where it learns to bet it all at 25 and 50?
- In-Place Dynamic Programming
- Asynchronous Dynamic Programming. 
    - Update the values of only a subset of states at a time. Some problems have search spaces too large to be solvable otherwise.
- Concrete Resource Allocation Problem
- Policy Iteration Vs. Value Iteration
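    - Value iteration truncates policy evaluation to a single sweep before each policy improvement, whereas policy iteration evaluates the policy to convergence first.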

Chapter 4 Dynamic Programming

- Refers to algorithms that can be used to compute optimal policies given a perfect model of the environment.
    - Assumed a finite MDP (Markov Decision Process).

4.1 Policy Evaluation

- Below is a formalization of DP iterative policy evaluation. Notice how the new value of a state is computed from the expected immediate reward plus the discounted current value estimates of its possible successor states. You can use the same update to re-evaluate states after the policy changes.

![](images/DP-Iterative-Update.png)


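- Iterative policy evaluation converges in the limit. Each sweep applies the Bellman update to every state, which shrinks the error left over from the arbitrary starting values, so repeating the sweeps drives the difference between the estimate and the true value function of the policy toward zero (assuming a discount factor below 1 or guaranteed termination under the policy).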

![](images/Iterative-Policy-Evalutation.png)
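
The evaluation loop described above can be sketched in Python as follows. This is a minimal sketch, not the book's code: the transition format `P[s][a] = [(prob, next_state, reward), ...]`, the function name `policy_evaluation`, and the two-state MDP are all assumptions made up for illustration.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with in-place updates.

    P[s][a] is a list of (prob, next_state, reward) triples (assumed format).
    policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}  # arbitrary starting values
    while True:
        delta = 0.0
        for s in P:
            # Expected update: average over actions, then over successors.
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once a full sweep changes nothing noticeable
            return V


# Hypothetical MDP: state 1 is absorbing with zero reward;
# from state 0, "go" reaches state 1 with reward 1, "stay" loops with reward 0.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 1, 0.0)]},
}
random_policy = {s: {"stay": 0.5, "go": 0.5} for s in P}
V = policy_evaluation(P, random_policy)
print(V)
```

For this toy problem the fixed point can be checked by hand: V(0) = 0.5 * 0.9 * V(0) + 0.5 * 1, so V(0) = 0.5 / 0.55.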

4.4 Value Iteration

![](images/Value-Iteration.png)


- Update the value of a state by setting it to the expected return of the best action available in that state (a max over actions instead of an average under the policy).

![](images/Value-Iteration-Algorithm.png)
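
A minimal Python sketch of value iteration, under the same kind of assumptions: the `P[s][a] = [(prob, next_state, reward), ...]` format, the helper `q_value`, and the three-state chain are hypothetical and chosen only to keep the example small.

```python
def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Back up the value of the best action instead of a policy average.
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Read off a greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical chain 0 - 1 - 2: state 2 is absorbing,
# and entering it from state 1 pays reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
}
V, policy = value_iteration(P)
print(V, policy)
```

In this chain the greedy policy moves right in states 0 and 1, with V(1) = 1 and V(0) = 0.9 for gamma = 0.9.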

4.5 Asynchronous Dynamic Programming
- The dynamic programming methods above are computationally expensive, because they repeatedly sweep through the entire state and action space. It is possible, however, to truncate this process so that you do not have to sweep through all states and actions before making progress. We can also skip states we know are not relevant to optimal behavior, or order the updates so that value information propagates between states more efficiently.
- With asynchronous dynamic programming, we can run an iterative DP algorithm while at the same time the agent is experiencing the environment.
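
As an illustration of why update order matters, here is a small sketch that applies value-iteration backups to individual states in a caller-chosen order rather than in full sweeps. The function names (`backup`, `async_backups`), the transition format, and the chain MDP are assumptions for this example only.

```python
def backup(P, V, s, gamma=0.9):
    """Apply one value-iteration backup to a single state, in place."""
    V[s] = max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        for a in P[s]
    )


def async_backups(P, V, order, gamma=0.9):
    """Back up only the states listed in `order`, one at a time."""
    for s in order:
        backup(P, V, s, gamma)
    return V


# Hypothetical chain 0 - 1 - 2: state 2 is absorbing,
# and entering it from state 1 pays reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
}

# Updating state 1 before state 0 propagates the reward in two backups,
# and state 2 is never touched because its value cannot change.
good_order = async_backups(P, {s: 0.0 for s in P}, order=[1, 0])

# The reverse order leaves state 0 at zero: it was updated before
# state 1 had anything useful to pass back.
bad_order = async_backups(P, {s: 0.0 for s in P}, order=[0, 1])
print(good_order, bad_order)
```

A full asynchronous method would keep selecting states until the values stop changing; the fixed `order` list here only shows the effect of the ordering itself.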

4.6 Generalized Policy Iteration
- Policy iteration is made up of two main processes
    - policy evaluation
    - policy improvement
- The general idea of letting policy evaluation and policy improvement interact is called generalized policy iteration (GPI).
    - Almost all RL methods are well described as GPI.
- The value function stabilizes only when it
is consistent with the current policy, and the policy stabilizes
only when it is greedy with respect to the current value function.
- The evaluation and improvement processes in GPI can be viewed as both competing
and cooperating. They compete in the sense that they pull in opposing directions. Making
the policy greedy with respect to the value function typically makes the value function
incorrect for the changed policy, and making the value function consistent with the policy
typically causes that policy no longer to be greedy.
- The arrows in this diagram correspond to the behavior of policy iteration in that each
takes the system all the way to achieving one of the two goals completely. In GPI
one could also take smaller, incomplete steps toward each goal. In either case, the two
processes together achieve the overall goal of optimality even though neither is attempting
to achieve it directly.

![](images/Evaluation-Improvemnt.png)
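
The evaluation and improvement interplay can be sketched as classic policy iteration, where each step runs all the way to its goal before the other takes over. This is a hedged sketch: the function names, the `P[s][a] = [(prob, next_state, reward), ...]` format, and the chain MDP are assumptions, and ties between actions are broken by taking the first listed action.

```python
def q_value(P, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def evaluate(P, policy, gamma, theta=1e-8):
    """Make V consistent with a deterministic policy (policy[s] is an action)."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def improve(P, V, gamma):
    """Make the policy greedy with respect to V."""
    return {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}


def policy_iteration(P, gamma=0.9):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary start: first action
    while True:
        V = evaluate(P, policy, gamma)
        new_policy = improve(P, V, gamma)
        if new_policy == policy:  # greedy and consistent: both processes stable
            return V, policy
        policy = new_policy


# Hypothetical chain 0 - 1 - 2: state 2 is absorbing,
# and entering it from state 1 pays reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
}
V, policy = policy_iteration(P)
print(V, policy)
```

Replacing `evaluate` with a single sweep would give one of the smaller, incomplete steps that GPI also allows.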

4.7 Efficiency of Dynamic Programming
- In the worst case, DP methods take time polynomial in the number of states and actions to find an optimal policy.
- If n and k denote the number of states
and actions, this means that a DP method takes a number of computational operations
that is less than some polynomial function of n and k. A DP method is guaranteed to
find an optimal policy in polynomial time even though the total number of (deterministic)
policies is k^n. In this sense, DP is exponentially faster than any direct search in policy
space could be. 
- Compared with alternatives such as direct policy search and linear programming, DP scales best: for the largest problems, only DP methods are feasible.
- On problems with large state spaces, asynchronous DP methods are often preferred. To
complete even one sweep of a synchronous method requires computation and memory for
every state. For some problems, even this much memory and computation is impractical,
yet the problem is still potentially solvable because relatively few states occur along
optimal solution trajectories. Asynchronous methods and other variations of GPI can be
applied in such cases and may find good or optimal policies much faster than synchronous
methods can.
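
To get a feel for the gap between k^n and a polynomial in n and k, the following back-of-the-envelope calculation uses made-up sizes. The per-sweep count `n * k * n` is an assumed rough upper bound (every state, every action, every possible successor), not a figure from the text.

```python
n_states = 20   # n: hypothetical number of states
k_actions = 4   # k: hypothetical number of actions per state

# Number of deterministic policies: one action choice per state.
num_policies = k_actions ** n_states  # 4**20 = 1,099,511,627,776

# Rough cost of one full DP sweep with dense transitions:
# n states x k actions x up to n successor states.
ops_per_sweep = n_states * k_actions * n_states

print(f"deterministic policies: {num_policies:,}")
print(f"operations per sweep (rough upper bound): {ops_per_sweep:,}")
```

Even for this tiny problem, enumerating policies means over a trillion candidates, while a sweep costs on the order of a thousand operations.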


4.8 Summary
- Policy evaluation refers to the (typically iterative) computation of the value function for a given policy. Policy improvement refers to the computation of an improved policy given the value function for that policy. Putting these two computations together, we obtain policy iteration and value iteration, the two most popular DP methods. Either of these can be used to reliably compute optimal policies and value functions for finite MDPs given complete knowledge of the MDP.
- An intuitive view of the operation of DP updates is given by their backup diagrams.

![](images/0_6UMWl8MxHQ071yxF.png)

- One last special property of DP methods: all of them update estimates
of the values of states based on estimates of the values of successor states. That is, they
update estimates on the basis of other estimates. We call this general idea bootstrapping.