Nichita Uțiu, Romanian Institute of Science and Technology, Cluj-Napoca, Romania
This is a collection of notes on Sutton and Barto's book "Reinforcement Learning: An Introduction (2nd edition)" from 2018. These are some of the main takeaways from the book structured on a per-chapter basis with a main focus on the theoretical aspects. These can be used as a support for the book.
Note: The numbers in round brackets in the headings indicate the chapters of the book the notes correspond to. $$ \DeclareMathOperator*{\E}{\mathbb{E}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$
A Markov decision process is the quadruplet $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p)$: the set of states, the set of actions, the set of rewards, and the dynamics function $p(s', r \mid s, a)$ giving the probability of ending up in state $s'$ with reward $r$ after taking action $a$ in state $s$.
Note: The continuing formulation (episodes never end) is often preferred in proofs, but many tasks are episodic in nature. To unify the two, we can treat the terminal state as an absorbing state into which all actions loop back deterministically with 0 reward.
By definition, a policy is a conditional probability distribution over actions to be taken given a certain state. It is written as $\pi(a \mid s)$.
$$
\begin{align}
q_\pi(s,a) &\triangleq \E_\pi[G_t \mid S_t = s, A_t = a] = \sum_{s' \in S, r \in R} p(s', r \mid s,a)\Big(r + \gamma \sum_{a' \in A} \pi(a' \mid s')\, q_\pi (s', a')\Big) \\
&= \sum_{s' \in S, r \in R} p(s', r \mid s,a)\big(r + \gamma v_\pi (s')\big)
\end{align}
$$
As we can see, both $v_\pi$ and $q_\pi$ can be expressed recursively in terms of the values of their successor states; these recursive relations are the Bellman equations.
Given these expected values for returns, we can define a partial ordering relation on the set of policies: $$ \pi \geq \pi' \leftrightarrow \forall s \in S \left( v_\pi(s) \geq v_{\pi'} (s) \right) $$ This ordering implies there is a set of policies which are optimal (the greatest elements of the set of policies) and for which we have the following optimal state and state-action value functions: $$ v_*(s) \triangleq \max_\pi v_\pi (s) \qquad q_*(s,a) \triangleq \max_\pi q_\pi(s,a) $$
Note: The optimal value functions also have Bellman equations (the Bellman optimality equations), with the non-optimal version of the function exchanged for the optimal one ($q_*$ instead of $q_\pi$ and $v_*$ instead of $v_\pi$) and the expectation over the policy's actions replaced by a maximum over actions.
A lot of RL algorithms revolve around a framework called General Policy Iteration (GPI) which involves 2 steps:
- Evaluation: taking a policy $\pi$ and estimating its value function $v_\pi$. The obtained estimate is often called $V$.
- Improvement: given the estimate $V$, improving the policy $\pi$.
If we were to write the Bellman equation for every state in $S$, we would obtain a system of $|S|$ equations in $|S|$ unknowns, which could in principle be solved directly; Dynamic Programming (DP) instead solves it iteratively.
DP does the first step of GPI by sweeping through the set of all states, computing for each one the expected return using the Bellman equations, and updating the approximate value function. This sweeping process is repeated until the maximum change over all the states is lower than a small threshold.
From this point of view, DP requires knowledge of the full model of the environment (ie. $p(s', r|s,a)$). This assumption is a strong one and is not met for many environments used in practice.
The policy improvement theorem states that, under the partial ordering relation defined between policies, given 2 deterministic policies $\pi$ and $\pi'$, if $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all $s \in S$, then $\pi' \geq \pi$.
This theorem has the consequence that the greedy policy constructed wrt. another policy is at least as good as that policy. This gives rise to a process of policy improvement by taking the greedy policy wrt. an approximate value function.
The two steps of GPI described before (evaluation and improvement) are therefore the processes of evaluating/approximating the value function given a policy and then updating the policy to be greedy wrt. that value function. GPI can describe the vast majority of RL algorithms.
Note: DP methods are proven to converge to the optimal policy if all states are updated an infinity of times in the limit. As long as this condition is satisfied the states can be updated in any order. This means one can update them asynchronously and in parallel (see section 4.5 on page 85).
Until now we've only discussed policy improvement in the framework of DP. The main incarnation of this is the policy iteration algorithm which alternates between estimation and improvement (see page 80).
However, this process can be simplified by merging estimation and improvement into a single step defined by the following update rule (ie. value iteration): $$ v(s) \leftarrow \max _ a \sum _{s',r} p(s', r|s,a) \left[r + \gamma v(s') \right] $$ Devising an algorithm based on this rule is straightforward. Instead of doing 2 sweeps, we just repeatedly do this combined step (see page 83). Although slightly different, this algorithm still fits into the GPI family, the difference being that the alternation takes place at a per-state level.
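As a minimal sketch of this combined step, here is tabular value iteration in Python. The dictionary `P`, mapping `(s, a)` to a list of `(prob, next_state, reward)` triples, is an assumed stand-in for the dynamics $p(s', r|s,a)$; any MDP encoded this way would work.

```python
def value_iteration(P, states, actions, gamma=0.9, theta=1e-8):
    """Tabular value iteration: repeatedly apply the combined
    evaluation/improvement update until values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # max over actions of the expected one-step return
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # greedy policy wrt. the converged value function
    pi = {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
        for s in states
    }
    return V, pi
```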
One of the main limitations of DP algorithms is that they need full knowledge of the dynamics of the environment. MC methods relax this constraint and are part of the family of model-free methods. We do however consider that we only apply MC methods to episodic tasks in this chapter.
MC methods revolve around sampling episodes given a policy. The most basic evaluation algorithm (see page 92) samples episodes (sequences of states, actions, and rewards) and estimates each state's value as the average of the returns observed after visiting it.
Since the estimators for the value of each state are independent and are not a function of other estimates, MC methods do not bootstrap by definition.
Note: We can take into consideration just the first visit of the state or all visits of the states in a sampled episode when computing the average returns. These two methods are intuitively named first-visit and every-visit MC.
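A sketch of first-visit MC prediction is given below; the episode format (a list of `(state, reward)` pairs, the reward being the one received after leaving that state) is an assumption made for illustration.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) as the average of returns following the
    first visit of s in each sampled episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:  # episode: list of (state, reward) pairs
        G = 0.0
        first_visit = {}
        # remember the index of the first occurrence of each state
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # compute returns backwards and credit only first visits
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```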
One can use MC for estimation in the GPI framework and come up with MC control (ie. using sampled episodes to estimate the true value function of the policy). However, unlike DP methods which consider all possible transitions, MC methods sample trajectories according to a policy, which means some state-action pairs may never be visited and exploration is not maintained. As a result, unless we guarantee exploration, MC control is not guaranteed to converge asymptotically to the true value function.
- One method to solve this is to uniformly sample the starting states and start from them if the environment allows. This is called MC with exploring starts (see page 99).
- Another method is to only explore a subset of stochastic policies for which the probability of any action is non-zero. These are called $\varepsilon$-soft policies, and we use a subset of them called $\varepsilon$-greedy policies for improvement (see page 101 for more details). $\varepsilon$-greedy policies are greedy with probability $1-\varepsilon$ and sample uniformly otherwise.
So far the methods described have been on-policy methods, meaning that the policy improved (target policy) and the one that decides the actions (behavior policy) were the same. If the 2 policies were different, the method would be called off-policy.
One way to evaluate a target policy wrt. a different behavior one is via Importance Sampling. This consists of weighting the sampled trajectory's return to compensate for the difference between the probabilities of the taken actions under the behavior policy and under the target one:
$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \dfrac{\pi(A_k|S_k)}{b(A_k|S_k)}
$$
If we were to use the plain averaged returns without these weights, we would be estimating the value function of the behavior policy $b$ rather than that of the target policy $\pi$.
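To make the two estimators concrete, here is a sketch computing both the ordinary and the weighted importance-sampling estimates of each starting state's value from episodes sampled under a behavior policy; the per-step `(state, action, reward)` episode format and the `pi`/`b` probability callables are assumptions for illustration.

```python
def is_estimates(episodes, pi, b, gamma=1.0):
    """Ordinary vs. weighted importance-sampling estimates of the
    value of each episode's starting state under the target policy pi."""
    ordinary, weighted = {}, {}
    num, rho_sum, count = {}, {}, {}
    for episode in episodes:  # episode: list of (state, action, reward)
        s0 = episode[0][0]
        G, rho = 0.0, 1.0
        for t, (s, a, r) in enumerate(episode):
            G += gamma ** t * r
            rho *= pi(a, s) / b(a, s)   # importance-sampling ratio
        num[s0] = num.get(s0, 0.0) + rho * G
        rho_sum[s0] = rho_sum.get(s0, 0.0) + rho
        count[s0] = count.get(s0, 0) + 1
    for s in num:
        ordinary[s] = num[s] / count[s]                           # unbiased, high variance
        weighted[s] = num[s] / rho_sum[s] if rho_sum[s] else 0.0  # biased, low variance
    return ordinary, weighted
```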
Note: One can implement MC prediction efficiently using cumulative sums to compute averages, rather than memorizing all the transitions of an episode (see section 4.6, specifically page 110).
In order to implement off-policy MC control, we can use importance sampling in conjunction with an $\varepsilon$-soft behavior policy, while the target policy is made greedy wrt. the estimated Q-values.
Note:
- Ordinary importance sampling has infinite variance if the sampling ratios are unbounded; however, it's an unbiased estimator! Weighted sampling is, on the other hand, biased but has very low variance (see figure 5.3 and example 5.5 from page 106 for a comparison).
- As an implementation note, MC methods do not need to memorize rewards and states. Cumulative sums and counts can be kept instead (see page 110).
- The lack of bootstrapping in MC methods makes them more robust against environments which violate the Markov property.
- Other more advanced methods such as discount-aware importance sampling and per-decision importance sampling can be used to further lower the variance of the estimates (see sections 5.8 and 5.9; we skipped over these details).
Like MC methods, TD learning does not need a model, but unlike MC, it does bootstrap. Whereas the MC update moves the estimate towards the sampled return, $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$, TD(0) replaces the return with a bootstrapped target:
$$
V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)
$$
This kind of update is called a TD(0) or single-step TD update. The term $\delta_t \triangleq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
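A sketch of tabular TD(0) prediction follows; the `env` interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions for illustration.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate v_pi with single-step TD updates applied online."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # move V(s) towards the TD target
            s = s_next
    return V
```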
For batch TD and MC (updating on a series of episodes at once), MC is the least-squares estimator of the data whereas TD(0) estimates the values of the maximum likelihood MDP which generates the data.
TD methods are sound and converge to $v_\pi$ for a fixed policy $\pi$, given the usual step-size conditions.
Another key difference between MC and TD methods is that TD methods continually learn during the episode using only single transitions, whereas MC must wait until the end of an episode before updating.
Note: It is also immediately clear that there is a strong connection between gradient ascent/descent algorithms and TD methods.
Sarsa is an on-policy TD(0) control method which estimates Q-values. Sarsa takes 2 actions using a policy, observes the states for both of them, and the reward for the first (the name is actually an acronym for this sequence: State Action Reward State Action). The TD error is then used to learn the Q-values: $$ Q(S, A) \leftarrow Q(S,A) + \alpha [R + \gamma Q(S', A') - Q(S,A)] $$ As with on-policy MC, we need to use a soft policy to ensure convergence (see page 130 for algorithm).
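A sketch of tabular Sarsa with an $\varepsilon$-greedy behavior policy, under the same assumed `env` interface as before and an assumed discrete `actions` list:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
    """On-policy TD(0) control: bootstrap with the action actually taken next."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```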
Q-learning is a closely related algorithm with an update rule very similar to Sarsa's, but it estimates the optimal policy by bootstrapping as if the next action were taken greedily:
$$
Q(S, A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_a Q(S',a) - Q(S,A)]
$$
This sort of bootstrapping makes Q-learning an off-policy method which tries to estimate $q_*$ directly, regardless of the policy used to select actions.
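For comparison with the Sarsa sketch above, here is tabular Q-learning under the same assumed interface; the only change is that the bootstrap uses the greedy (max) value rather than the value of the action that will actually be taken.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
    """Off-policy TD(0) control: bootstrap with max_a Q(S', a)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```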
Expected Sarsa follows the same pattern as Q-learning, but instead of the maximum next value, it bootstraps with the expectation of the next Q-values under the target policy $\pi$, ie. $\sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a)$.
For any algorithm which takes actions with a policy (the behavior policy) that prefers the actions with the highest estimated values (eg. Sarsa or Q-learning with $\varepsilon$-greedy behavior), a maximization bias appears: the maximum over noisy estimates is implicitly used as an estimate of the maximum value, and is therefore biased upwards.
To combat this, we can use 2 value functions and use one of them to compute the target that updates the other. We use their average (or sum) to choose actions, and we update them alternately according to a Bernoulli trial:
$$
Q_1(S,A) \leftarrow Q_1(S,A) + \alpha \left( R + \gamma Q_2(S', \argmax_a Q_1(S',a)) - Q_1(S,A) \right)
$$
With probability 1/2 we instead apply the symmetric update with the roles of $Q_1$ and $Q_2$ swapped. This is known as double Q-learning, and it removes the maximization bias.
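A sketch of the double Q-learning update under the same assumed interface: two tables are kept, actions are chosen from their sum, and on each step a coin flip decides which table selects the argmax and which evaluates it.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
    """Two estimators: one chooses the argmax, the other evaluates it."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # act epsilon-greedily wrt. the sum of both estimates
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q1[(s, a_)] + Q2[(s, a_)])
            s2, r, done = env.step(a)
            if random.random() < 0.5:        # Bernoulli trial: update Q1
                a_star = max(actions, key=lambda a_: Q1[(s2, a_)])
                target = 0.0 if done else gamma * Q2[(s2, a_star)]
                Q1[(s, a)] += alpha * (r + target - Q1[(s, a)])
            else:                            # update Q2 symmetrically
                a_star = max(actions, key=lambda a_: Q2[(s2, a_)])
                target = 0.0 if done else gamma * Q1[(s2, a_star)]
                Q2[(s, a)] += alpha * (r + target - Q2[(s, a)])
            s = s2
    return Q1, Q2
```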
Note: In some cases where many state-action pairs lead to the same state (eg. chess, tic-tac-toe), it is often more useful to design value functions around the states they lead to, called afterstates.
n-step bootstrapping is a family of algorithms which perform temporal-difference learning over chains of multiple actions. They provide a unifying framework for both TD(0) and MC methods (see page 146 for a diagram). For example, MC methods can be viewed as n-step TD with $n$ equal to the episode length, while TD(0) is the 1-step special case.
The return formula can be applied to returns written wrt. the Q-value function. We can then rewrite the update rule for the Q-values as well (see page 147 for algorithm). $$ G_{t:t+n} \triangleq \sum_{k=1}^n \gamma^{k-1} R_{t+k} +\gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), n \geq 1, 0 \leq t < T - n $$
Note: One benefit of using multiple steps for an update is in the case of sparse reward signals, where only a small handful of states generate non-zero TD-errors (see figure 7.4 on page 147).
Dealing with multistep Q-values, we can define an n-step version of Sarsa, leading to n-step control. n-step Sarsa has the following update rule (see page 147 for the algorithm): $$ Q_{t+n} (S_t, A_t) \leftarrow Q_{t+n-1}(S_t, A_t) + \alpha \left[G_{t:t+n} - Q_{t+n-1} (S_t, A_t) \right] $$ The same method can be adapted to expected Sarsa if we replace the last bootstrap estimate with an expectation: $$ G_{t:t+n} \triangleq \sum_{k=1}^n \gamma^{k-1} R_{t+k} +\gamma^n \bar{V}_{t+n-1}(S_{t+n}), \quad n \geq 1,\ 0 \leq t < T - n \\ \bar{V}_{t}(s) \triangleq \sum_a \pi(a|s) Q_t(s,a) $$
Borrowing from MC methods, n-step Sarsa can be further adapted to off-policy control by multiplying the TD error with the importance sampling ratio (for either the state or the state-action value function): $$ Q_{t+n} (S_t, A_t) \leftarrow Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \left[G_{t:t+n} - Q_{t+n-1} (S_t, A_t) \right] \\ \rho_{t:h} \triangleq \prod_{k=t}^{\min(h, T-1)} \dfrac{\pi(A_k|S_k)}{b(A_k|S_k)} $$ With this update rule, we can implement n-step off-policy control (see page 149).
For state-action values, the importance sampling ratio starts one step later than for state values: the first action $A_t$ is already fixed by the pair being updated, so it needs no correction, and the ratio only covers the subsequent actions.
The Tree Backup algorithm allows us to use n-step TD off-policy without importance sampling, by bootstrapping the expectation of the Q-values at each step in the chain (see page 152 for diagram).
$$
G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi ( A_{t+1} \mid S_{t+1}) G_{t+1:t+n}, \quad t < T-1 \\
G_{t:t+1} = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q_{t}(S_{t+1}, a)
$$
Basically, at the final step of the chain the value is bootstrapped as a full expectation over the target policy, while at every intermediate step the actions not taken contribute their expected (bootstrapped) values and the action actually taken contributes the recursively computed deeper return, weighted by its probability under $\pi$.
n-step $Q(\sigma)$ is a generalization of both n-step Sarsa and Tree Backup (see the diagram on page 155 for a comparison). At each timestep, a parameter $\sigma_t \in [0, 1]$ controls the degree of sampling: $\sigma_t = 1$ means the update samples the action actually taken (as in Sarsa), while $\sigma_t = 0$ means it takes a full expectation over actions (as in Tree Backup). If $\sigma_t = 1$ at every step we recover n-step Sarsa, and if $\sigma_t = 0$ we recover the Tree Backup algorithm.
Note: One difference is that, unlike in n-step Sarsa, all intermediate steps are updated, not just the first one, more similar to how Tree Backup works. The diagram on page 155 is a pretty important one!
In this chapter we saw how we can extend previous algorithms such as Q-learning and Sarsa to update using n-steps. The two approaches presented use either sampling or expectation (or both in the case of n-step $Q(\sigma)$). By using multiple steps we ensure faster training in the case of sparse rewards.
Similarly to how MC and TD methods can be unified through the n-step framework, in this chapter we are trying to unify model-free and model-based methods. Whereas model-free methods rely purely on learning, model-based ones rely (mostly) on planning.
Models can be either distribution models or sample models. The former assumes full (or learned) knowledge of the MDP's dynamics $p(s', r \mid s, a)$, whereas the latter only needs to produce sampled transitions and rewards.
We can consider 2 types of planning: state-space planning, which searches through the state space for an optimal policy, and plan-space planning, which searches through the space of plans. The latter needs some operators which transform the plans and a partial ordering relationship over them (eg. evolutionary methods) and is beyond the scope of this book.
A simple example of planning is a planning version of Q-learning, called Q-planning. A simple version of this algorithm involves selecting a state-action pair at random, querying the model for the resulting next state and reward, and then updating with the standard Q-learning rule.
Note: As a rule of thumb, planning can and should be done in small incremental steps intermixed with learning. This is a key requirement for efficient learning.
With a planning agent, real data is used for both model learning and direct reinforcement learning (improving the value function like the other methods) (see page 162 for a concise diagram of the process). The model can also improve the value function through planning; this is called indirect reinforcement learning.
Dyna-Q is a model-based algorithm which does direct RL through Q-learning and learns a deterministic tabular model (each state-action pair is mapped to the last observed reward and next state), which is then used for planning with simulated experience.
Dyna is shown empirically to have faster convergence on deterministic environments (see page 164 for algorithm, and page 165 for the learning curve diagram).
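A sketch of tabular Dyna-Q under the same assumed `env` interface: every real step does a Q-learning update, records the transition in a deterministic model, and then performs `n_planning` simulated Q-learning updates on previously seen state-action pairs.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, num_episodes=100, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}  # (s, a) -> (r, s', done); deterministic model of seen transitions

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def q_update(s, a, r, s2, done):
        best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            q_update(s, a, r, s2, done)          # direct RL
            model[(s, a)] = (r, s2, done)        # model learning
            for _ in range(n_planning):          # planning on simulated experience
                ps, pa = random.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```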
The model can be incorrect and, thus, the policy learned can be suboptimal. Reasons for the model being wrong:
- The environment is stochastic and it is hard to fully learn the MDP dynamics.
- The MDP is learned through function approximation and is, thus, imperfect.
- The environment is non-stationary.
In the Dyna-Q case, we can address non-stationarity by rewarding exploration. We can do this by adding to the model's reward for the pair $(s,a)$ a bonus that grows with the time elapsed since that pair was last tried in the real environment; this variant is called Dyna-Q+.
Prioritized sweeping is based on the observation that sampling state-action pairs uniformly from the model wastes most updates on pairs whose values barely change (this happens especially with sparse rewards). This suggests that planning should work backward from goal states (basically anything with a stronger reward signal).
To implement prioritized sweeping, we keep a priority queue of state-action pairs keyed by the magnitude of their expected value change (eg. the absolute TD error). At each planning step we pop the pair with the highest priority, update it, and then insert all of its predecessor pairs whose own priorities now exceed a small threshold.
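A sketch of the planning loop of prioritized sweeping for a deterministic model; it assumes `Q` is a defaultdict over `(state, action)` pairs, `model[(s, a)] = (r, s2)`, and a predecessor map is maintained alongside the model as in Dyna-Q.

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, actions, start_pairs,
                                  alpha=0.1, gamma=0.95, theta=1e-4, max_updates=100):
    """model[(s, a)] = (r, s2); predecessors[s] = set of (s_bar, a_bar) leading to s."""
    def priority(s, a):
        r, s2 = model[(s, a)]
        return abs(r + gamma * max(Q[(s2, a_)] for a_ in actions) - Q[(s, a)])

    # seed the queue with the pairs whose values just changed
    pq = [(-priority(s, a), (s, a)) for s, a in start_pairs if priority(s, a) > theta]
    heapq.heapify(pq)
    for _ in range(max_updates):
        if not pq:
            break
        _, (s, a) = heapq.heappop(pq)
        r, s2 = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a_)] for a_ in actions) - Q[(s, a)])
        # push predecessors whose priority now exceeds the threshold
        for s_bar, a_bar in predecessors.get(s, ()):
            p = priority(s_bar, a_bar)
            if p > theta:
                heapq.heappush(pq, (-p, (s_bar, a_bar)))
    return Q
```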
Note: Prioritized sweeping can be extended to stochastic MDPs by trying to learn the stochastic model as the frequency of next state-action pairs from each pair. We can also naturally use expected updates as a better approximator.
State-space planning can be viewed as a sequence of value updates based (or not) on the model which vary in the type of update (expected or sample, large or small) and in the order in which they are done (eg. prioritized or sequential).
One-step updates can be classified along 3 dimensions. One is whether they update the state or the state-action value function. Another is whether they estimate the value of the optimal policy or of an arbitrary given policy, giving the four functions $v_*$, $v_\pi$, $q_*$ and $q_\pi$.
For each one of these functions, one can use either sample or expected updates. This differentiation is, however, relevant only in the case of stochastic environments. Each combination of function and update type results in a specific update rule (eg, Sarsa, Q-learning, etc) (see page 172 for a diagram).
Each of them varies in utility in certain situations; for example, sample updates seem better when the branching factor is moderately large (>100), as expected updates would require summing over every possible successor, while sample updates already come close to the correct value after sampling only a fraction of the branches.
Exhaustive sweeps, such as those performed in DP devote an equal amount of time to the entirety of the state-action space. Another approach would be to sample according to a certain distribution, such as the on-policy distribution. Depending on the distribution (ie. if the number of visits of each step does not go to infinity in the limit), the convergence guarantee may be broken. However, it may be fine in practice.
Sampling based on the on-policy distribution is called trajectory sampling and can be advantageous in terms of speed in MDPs with many states and small branching factor, but in the long run, it may hurt exploration and provide diminishing improvements (see page 176 for experiment and diagram). The diminishing results are usually because more time is spent on states whose values are already learned well enough.
RTDP is a variant of asynchronous DP where the values visited are distributed according to the on-policy distribution. Actions are selected greedily and expected updates are applied to the current state. Any set of arbitrary states can be updated as well at each step (eg. limited horizon look-ahead, etc). By using expected updates, RTDP is a form of value iteration.
If the episode ends in the goal state, RTDP will converge to the optimal policy over all provided relevant states given a few simple constraints, such as all the initial values being zero.
Note: This is a very brief presentation of the subject. See chapter 8.7 on page 177 for more in-depth info.
The type of planning done until now, used to improve the estimation of value functions for all states, is called background planning. Another type of planning, which focuses on the current state $S_t$ and on choosing the next action, is called decision-time planning.
The two types of planning are not mutually exclusive and decision-time planning methods can store their results by updating the value functions too, but the primary goal of decision-time planning is to improve the quality of actions taken at each control step. What this type of planning usually provides is deeper and better look-ahead than using the raw value function.
The two types can be combined, and often are, but decision time planning is often used in tasks where inference time is not a concern as the look-ahead incurs some overhead.
Heuristic search is a popular decision-time planning method frequently used in classical AI which uses tree expansion from the current state to look ahead and search for the best action to take. This can be integrated into RL through value function improvement. What this means is the fact that through look-ahead we can improve the estimate of the value function, and thus improve the policy. Unlike in traditional heuristic search, however, we can use this traversal to also do updates on the value function.
Heuristic search methods can be seen as a generalization of single-step greedy action selection. If the look-ahead has a significant enough impact (ie. either the search is deep enough or the evaluations at the leaves are accurate enough), the resulting action selection can be considerably better than greedy one-step selection on the raw value function.
For tree expansion, we can use a random traversal, but we are much more likely to get a better estimate if we use an on-policy distribution (see figure 8.9 on page 189 for how this may be implemented with one-step updates).
Rollout algorithms are decision-time planning algorithms based on MC-simulated returns starting from the current state and following a fixed policy called the rollout policy.
The role of this is to improve the estimates of the action values at the current state; the action with the highest estimated value is then taken which, by the policy improvement theorem, performs at least as well as the rollout policy itself.
This can be viewed **as a single step of asynchronous value iteration** (asynchronous because we do not sweep the entire state-action space) starting from the current state $S_t$.
Note: Rollout algorithms do not perform any learning, instead, they refine already existing policies with MC estimates to improve control.
MCTS is a rollout algorithm which uses a truncated lookahead tree to store estimates of future values. To implement an MCTS algorithm we require these 4 steps:
- A tree policy is used to traverse this memorized expanded tree of lookahead states (ie. the selection step) until a leaf node is reached. Selecting which node to traverse to can be done using whatever method we want (eg. Upper Confidence Bound (UCB), $\varepsilon$-greedy, etc).
- Depending on the implementation, when reaching a leaf node, we can expand it and memorize its children in the search tree as well (ie. the expansion step). When exactly we opt for expansion depends on the algorithm. Some algorithms expand when a certain threshold of visits is reached, while others decide based on the estimated value of the leaf.
- We can also perform a simulation step when reaching a leaf instead of an expansion one. This involves simulating a complete episode to get its return using a simple and fast rollout policy. This is very similar to what classical MC rollout algorithms do.
- The resulting value is then used to update the estimates of all nodes on the root-to-leaf path. This step is called the backup step (see page 186 for figure).
These 4 steps are repeated until a number of iterations is reached, or some other constraint is satisfied. The best action is then chosen wrt. the values accumulated in the tree (or to the number of visits). MCTS is run for every new state we visit, being a decision-time planning method. The learned tree can be discarded, or parts of it can be kept for subsequent decisions.
MCTS benefits from both sample-based evaluation of policies and policy improvement (RL) for the tree. Moreover, it focuses search only on promising states according to MC simulations.
Planning methods require a model of the environment which may be either a distribution or a sample model. The former can be used for expected updates but it is harder to learn than the latter. These models can be priorly known (eg. having some knowledge about the environment like in the game of Go) or learned through experience.
Simulated experience generated by the model can be used to update the value functions used for our policies. Learning and planning can be seen as the same algorithm applied to 2 different sources of experience.
Another aspect through which planning algorithms differ from one another is through how they prioritize the updates. Algorithms such as RTDP, for example, update values wrt. an on-policy distribution while Dyna-Q may sample uniformly from the model.
Planning can also be done to augment decision making for just the current step rather than being used for learning - these are called decision-time planning methods. Rollout algorithms such as MCTS are examples of this, where through clever MC simulations we improve upon a weaker policy.
All RL methods presented (and all RL methods in general) have in common the fact that they estimate value functions, they update the value functions using backups and they follow the framework of Generalized Policy Iteration (GPI).
RL algorithms do, however, have some conceptual differences. These differences can be seen as a spectrum across a few dimensions, out of which we mention the 2 most important ones:
- Width of updates. At the extremes of this dimension lie sample updates at one end and expected updates at the other one.
- Depth of update. Methods vary from using a single-step update (like TD(0) methods do) to using the entire length of the episode as in MC methods.
The classical methods lie at the extremes of this spectrum:
- MC has infinite depth and uses sample updates
- TD(0) has depth one and uses a sample update
- Dynamic Programming has depth 1, but uses expected updates
- Exhaustive search occupies the remaining corner, using full episodes and expected updates
However, many methods, such as n-step TD and heuristic searches, lie somewhere in the middle of the spectrum (see page 190 for one of the most informative diagrams in the book).
Another dimension which we didn't include is whether the method is on- or off- policy. This is orthogonal to the other 2 and generally (if not always) binary (there are no methods which are kinda off-policy).
Other dimensions can be added to this taxonomy. Among these are whether the method operates on continuing or episodic tasks, whether it is asynchronous, what value function it approximates, and so on. These are not exhaustive and not mutually exclusive.
The remaining, and one of the most important dimensions is whether the method is tabular or uses function approximation. All methods presented in part I were tabular; part II will cover function approximation.
Integrating function approximation methods with RL boils down to using function approximation methods such as those used in supervised learning to approximate value functions. This is done so we can apply RL methods to MDPs with a very large or infinite number of states, where we no longer have any guarantee of convergence to an optimal policy, even in the limit of infinite data and time.
However, function approximation has its own issues arising with things such as dealing with non-stationarity, bootstrapping, and delayed targets.
We consider the case of approximating the state-value function using a parameterized model $\hat{v}(s, w) \approx v_\pi(s)$, where $w \in \mathbb{R}^d$ is a weight vector with far fewer components than there are states.
The target of updates in RL is to move the value function for some particular states towards some desired values. We'll use the notation $s \mapsto u$ for an update which shifts the estimated value of state $s$ towards the update target $u$.
Each update can be seen as a single training example in a supervised learning scenario. This means that we should use learning methods which can deal with the non-stationarity that comes from the fact that the policy, and therefore the targets, keep changing as learning progresses.
Unlike in tabular methods where the updates of a few states' values did not affect others, function approximators do not have the same guarantee. An update for the value of one state will influence the approximation of potentially all the other states.
Therefore, an objective that treats every state as equally important is impossible to satisfy. We instead define a distribution $\mu(s)$ over states (usually the on-policy distribution, ie. the fraction of time spent in each state) and minimize the mean squared value error weighted by it: $\overline{VE}(w) \triangleq \sum_s \mu(s) \left[ v_\pi(s) - \hat{v}(s, w) \right]^2$.
To optimize this objective we can use stochastic gradient descent (SGD), adjusting the weights after each example by a small step that reduces the error on it: $w_{t+1} \leftarrow w_t + \alpha \left[ U_t - \hat{v}(S_t, w_t) \right] \nabla \hat{v}(S_t, w_t)$, where $U_t$ is the update target.
This is basically the vanilla SGD algorithm if the target $U_t$ is an unbiased estimate of $v_\pi(S_t)$ and does not depend on $w_t$ (eg. the MC return $G_t$); in that case convergence to a local optimum is guaranteed under the usual step-size conditions.
When bootstrapping, however, we do not have the same guarantee. For example, the TD(0) target $R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t)$ depends on the current weights; since the gradient of that dependence is ignored, the resulting update is only a semi-gradient method.
An SGD variant is obtained with state aggregation, where multiple states are grouped together and share a single weight which acts as their common estimated value.
One important case of function approximation is that of the linear model. In this case, we define a feature function $x: S \to \mathbb{R}^d$ and approximate the value as the inner product $\hat{v}(s, w) \triangleq w^\top x(s)$, whose gradient wrt. $w$ is simply $x(s)$.
Note: Because the discounting factor $\gamma$ is often close to 1, the bound on the asymptotic error of the TD fixed point, which is at most $\frac{1}{1-\gamma}$ times the smallest possible $\overline{VE}$, can be quite loose.
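A sketch of linear semi-gradient TD(0); the feature function `x(state)` returning a NumPy vector of dimension `d`, and the `env`/`policy` interfaces, are the same assumed interfaces used in the earlier sketches.

```python
import numpy as np

def linear_semi_gradient_td0(env, policy, x, d, num_episodes=1000,
                             alpha=0.01, gamma=1.0):
    """v_hat(s, w) = w . x(s); for the linear case the gradient wrt. w is x(s)."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            v_next = 0.0 if done else w @ x(s2)
            delta = r + gamma * v_next - w @ x(s)   # semi-gradient TD error
            w += alpha * delta * x(s)
            s = s2
    return w
```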
Non-linear transformations and dependencies between features are not captured by linear models. We can use different basis functions to capture these.
One example of candidate features are polynomial features, where every component of the state vector is raised to powers up to a given order and multiplied with the other components.
One can also use a Fourier basis for the features, which in practice seems to have better performance than the polynomials, but is not very good at dealing with discontinuities.
If we consider our state's components as belonging to the interval $[0, 1]$, each Fourier basis feature is a cosine of an integer-weighted combination of the components, $x_i(s) = \cos(\pi\, c^i \cdot s)$.
Another possibility is the use of coarse coding, where each state is mapped to a binary vector. Each component of this vector corresponds to whether the state is within a receptive field. A receptive field can be any contiguous set of points in the state space, but, usually, they are repeating and overlapping tiles or circles (see page 215).
Tiling is the simpler and overall more feasible alternative. State aggregation is a particular case of coarse coding where a single such tiling is used. Also, asymmetrical offsets for the tiles are often preferred to uniform ones (see Fig 9.11 on page 219).
Hashing can be used to reduce a high number of tiles to a manageable number. Hashing is, however, non-injective and assigns multiple tiles to the same bin, effectively creating larger, and even non-contiguous tiles.
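A minimal sketch of tile coding for a one-dimensional state in $[0, 1)$, using several uniformly offset tilings; real implementations (multi-dimensional states, asymmetric offsets, hashing) are more involved and are not attempted here.

```python
import numpy as np

def tile_code_1d(s, num_tilings=8, tiles_per_tiling=10):
    """Return a binary feature vector with exactly one active tile per tiling."""
    features = np.zeros(num_tilings * tiles_per_tiling)
    tile_width = 1.0 / tiles_per_tiling
    for tiling in range(num_tilings):
        # each tiling is shifted by a fraction of the tile width
        offset = tiling * tile_width / num_tilings
        index = int((s + offset) / tile_width) % tiles_per_tiling
        features[tiling * tiles_per_tiling + index] = 1.0
    return features
```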
Another important type of coding are Radial Basis Functions (RBFs), which compute a Gaussian of the distance to a set of centers $c_i$, ie. $x_i(s) = \exp\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right)$, producing smooth, continuously varying features.
Artificial Neural Networks (ANNs) can be used together with reinforcement learning methods in several ways. For example, they can be trained to approximate a value function given the TD error, maximize expected reward like a gradient bandit, or through policy-gradient methods.
All of the methods of improving training and generalization in ANNs can be applied to ANNs in RL as well.
When using linear functions, we used an iterative gradient-based method to reach the TD fixed point, which is close to the real minimum of the $\overline{VE}$. Least-Squares TD (LSTD) instead computes this fixed point directly, in closed form, from the observed transitions, at a computational cost of $O(d^2)$ per step instead of $O(d)$.
Another small advantage is the lack of a step-size hyperparameter. However, having no step size means LSTD never forgets, which makes it poorly suited to non-stationary environments and to control, where the target policy keeps changing.
Memory-based methods are a form of non-parametric model which does lazy learning. They store observed examples and, only when a new estimate is required, compute it from the memorized values.
Kernel-based methods rely on a generalization-strength function (a kernel) $k(s, s')$ which measures how much a stored example at $s'$ should influence the estimate at $s$; the estimate is then a kernel-weighted combination of the stored targets.
Using the on-policy distribution has strong theoretical results for semi-gradient methods. We can, however, reweight the distribution according to a random variable. In this case we will use the variable $I_t$, called the interest, indicating how much we care about accurately valuing the state at time $t$.
We also define a discounted version of interest called emphasis.
$$
M_t \triangleq I_t + \gamma^n M_{t-n}
$$
The formula above is for the n-step emphasis. Correspondingly, we also have an emphasis-weighted n-step update rule:
$$
w_{t+n} \leftarrow w_{t+n-1} + \alpha M_t \left[ G_{t:t+n} - \hat{v} (S_t, w_{t+n-1}) \right] \nabla\hat{v}(S_t, w_{t+n-1})
$$
For the MC case, we can let the target be the full return $G_t$ and the emphasis simply $M_t = I_t$.
The benefit of using interest-based methods is that we can arbitrarily control the impact of every step on the learning.
This chapter explored on-policy reinforcement learning with function approximation. For the parametric case, we defined the mean squared value error over the parameters, weighted by the on-policy distribution $\mu$, as the objective to minimize.
We studied the linear parametric model case mostly, as its theory is well developed and it admits convergence guarantees (eg. semi-gradient TD converges to the TD fixed point).
Different functions can be used as basis functions for the features of the linear model. They range from polynomials to Fourier basis function and coarse coding. Coarse coding, specifically tile coding, discretizes the feature space to reduce computational complexity. RBFs can be also used as a form of continuous coding.
Using gradient methods with non-linear models such as ANNs has become popular in recent years under the name of deep RL.
For on-policy control, we study function approximation Sarsa. The extension to this case is simple for episodic tasks but has to be reevaluated in the case of continuing ones. In the process we will also start using state-action values instead of state values.
Here we discuss a semi-gradient version of Sarsa. To do this, we first define a semi-gradient TD update rule for state-action values.
$$
w_{t+1} \leftarrow w_t + \alpha \left[ U_t - \hat{q}(S_t, A_t, w_t) \right] \nabla \hat{q}(S_t, A_t, w_t)
$$
If we replace the target $U_t$ with the one-step Sarsa target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w_t)$, we obtain episodic semi-gradient Sarsa; control then follows the usual GPI pattern with an $\varepsilon$-greedy policy wrt. $\hat{q}$.
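A sketch of episodic semi-gradient Sarsa with a linear state-action value function; the feature function `x(s, a)` and the `env` interface are assumptions, as before.

```python
import random
import numpy as np

def semi_gradient_sarsa(env, actions, x, d, num_episodes=500,
                        alpha=0.01, gamma=1.0, eps=0.1):
    """q_hat(s, a, w) = w . x(s, a); bootstrap on the next chosen action."""
    w = np.zeros(d)

    def q(s, a):
        return w @ x(s, a)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            if done:
                target, a2 = r, None
            else:
                a2 = eps_greedy(s2)
                target = r + gamma * q(s2, a2)
            w += alpha * (target - q(s, a)) * x(s, a)   # semi-gradient update
            s, a = s2, a2
    return w
```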
Just by writing the n-step return in terms of the approximate $\hat{q}$, we can also obtain an n-step version of semi-gradient Sarsa.
Similarly we can also obtain a function-approximation version of expected Sarsa.
Instead of discounting the rewards to get the return, we can instead use the average reward as the optimization objective of our methods. We do this because the discounting setting is difficult to use with function approximation in the continuing setting and average reward is better suited.
For a given policy, we can compute the average expected reward in the limit to infinity steps:
$$
r(\pi) \triangleq \lim_{h \to \infty} \dfrac{1}{h} \sum_{t=1}^{h} \mathbb{E} \left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right] = \sum_s \mu_\pi (s) \sum_a \pi (a \mid s) \sum_{s', r} p (s', r \mid s,a)\, r
$$
Here $\mu_\pi (s)$ stands for the steady-state distribution and is equal to $\mu_\pi (s) = \Pr\{S_t = s \mid A_{0:t-1} \sim \pi\}$. This distribution has the property that, sampling states from it and sampling actions according to $\pi$, we remain under the same distribution.
$$
\sum_s \mu_\pi (s) \sum_a \pi(a \mid s) \sum_r p (s', r \mid s,a) = \mu_\pi (s')
$$
For this distribution to exist, we must assume an **ergodic** MDP. Ergodic means that the probability of getting to a state, in the long run, is not conditioned by the starting state. We can then define a differential version of our returns so that the values are bounded:
$$
G_t \triangleq \sum_{i=1}^\infty \left( R_{t+i} - r(\pi) \right)
$$
Based on these differential returns we can define the differential versions of the Bellman equations and subsequently of the TD error as well. We do this by keeping a running estimate $\bar{R}_t$ of the average reward and using it in place of discounting: $\delta_t \triangleq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)$.
In the continuing formulation, under the assumption of ergodicity of the MDP, for the function approximation methods, the discount is redundant in the computation of returns.
Ergodicity, like in the case of differential returns, ensures that an average reward can be computed in the limit. Under this assumption, we can show that the discounted objective is proportional to the undiscounted average reward, its value being $r(\pi)/(1-\gamma)$, so the ordering of policies is unchanged and discounting adds nothing.
The authors claim this stems from the lack of a policy improvement theorem for the function approximation case, where improvement in one state may not necessarily mean improvement of the policy.
Here we present the average reward formulation of the n-step Sarsa algorithm. The main difference is that at each timestep we also maintain an estimate $\bar{R}$ of the average reward and use the differential TD error in the updates.
In this chapter, we saw adaptations of the Sarsa algorithm to function approximation cases. Due to the use of semi-gradient methods in the continuing setting, we had to use average rewards and differential returns instead of discounting.
This chapter deals with the extension of off-policy methods to function approximation. The challenges are partially the same as those for tabular off-policy methods, but here, the on-policy distribution has a much greater role in guaranteeing convergence of semi-gradient methods.
Unlike tabular methods, function approximation methods are not guaranteed stable. Tabular methods correspond to a special case of function approximation.
One way to adapt off-policy methods to function approximation is to use importance sampling and one step TD(0).
$$
w_{t+1} \leftarrow w_t + \alpha \rho_t \delta_t \nabla \hat{v} (S_t, w_t)
$$
In this equation, $\rho_t \triangleq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ is the single-step importance sampling ratio and $\delta_t$ is the usual TD error (in either its episodic or continuing form).
We can also adapt the tree-backup algorithm to the function approximation case by putting it in the semi-gradient framework. The TD error here is the same as the one mentioned before for the episodic or continuing cases. $$ w_{t+n} \leftarrow w_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{q} (S_t, A_t, w_{t+n-1}) \right] \nabla \hat{q} (S_t, A_t, w_{t+n-1}) \\ G_{t:t+n} \triangleq \hat{q} (S_t, A_t, w_{t-1}) + \sum_{k=t}^{t+n-1} \delta_k \prod_{i=t+1}^k \gamma\, \pi(A_i \mid S_i) $$
Due to the interdependence of states in the approximation of value functions (ie. modifying the estimation for one state modifies it for all of them) off-policy methods can diverge.
As the example on page 260 shows, this happens in the off-policy case and not the on-policy one because over-/under-estimations of the value function may never be corrected: the behavior policy may rarely (or never) take the transitions along which on-policy updates would correct them, so the same estimates keep being inflated.
The book presents, on pages 261 and 263, two examples in which even using expected and/or synchronous updates leads to divergence. Baird's counterexample even diverges for Q-learning. Q-learning does seem to always converge when the behavior policy is sufficiently close to the target greedy one, although there is no theoretical proof of this.
Some specific function approximation methods can be used which do not extrapolate to unobserved targets. This guarantee does not apply to popular ones such as ANNs and linear models with tile coding.
The cause of instability and divergence in RL is not a single factor, but a combination of function approximation, bootstrapping, and off-policy training. These, as shown in the previous section, lead to divergence. We look at the pros of each of the causes individually.
- Function approximation, for example, is the only way to deal with arbitrarily large MDPs and state spaces. This was not possible with tabular methods, and, thus, we cannot give it up.
- Bootstrapping is not strictly required, as we could use non-TD learning such as MC methods. However, MC has large memory requirements, and more often than not TD learns much faster than MC on tasks where states repeat (most tasks). To make TD more like MC, the stochasticity of updates can be diminished by using n-step TD, but sometimes this leads to decreased performance. Overall, bootstrapping is greatly beneficial.
- Off-policy learning may seem the most dispensable, as good on-policy methods such as Sarsa do exist. However, off-policy learning encourages exploration more and has strong behavioral motivations, as it allows parallel learning of multiple target policies using only one behavior policy.
In this section, similar to the book, we describe a toy example of function approximation in which the value function over a small, finite state space is viewed as a point in a vector space with one dimension per state, while the parameterized approximators can only represent a lower-dimensional subspace of it.
Now, given a policy $\pi$, its true value function $v_\pi$ generally lies outside this subspace. The projection operator $\Pi$ maps any value function to the closest representable one under the $\mu$-weighted norm, and the Bellman error $\bar{\delta}_w(s)$ is the difference between the two sides of the Bellman equation when the approximation $\hat{v}(\cdot, w)$ is plugged in; its $\mu$-weighted squared norm is the Mean Squared Bellman Error $\overline{BE}(w) = \lVert \bar{\delta}_w \rVert^2_\mu$.
We can also project the Bellman error, giving rise to the Mean Squared Projected Bellman Error $\overline{PBE}(w) = \lVert \Pi \bar{\delta}_w \rVert^2_\mu$. Unlike the Bellman and value errors, this one can be minimized to 0 within the approximators' subspace, and that point is the TD fixed point $w_{TD}$.
A figure of all these points and operators can be found on page 267.
Until now, for TD, unlike for MC, we only had semi-gradient methods to work with the function approximation case. Because they are not true gradient methods, SGD does not offer its normal convergence guarantees.
One way would be to use the TD error directly $$ \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) $$ We can then define the Mean Squared TD Error which is the expectation of the squared TD error $$ \overline{TDE} (w) = \sum _{s\in\mathcal{S}} \mu(s) \mathbb{E} \left[ \delta^2_t \mid S_t = s, A_t \sim \pi\right] = \mathbb{E}_b\left[ \rho_t \delta_t^2 \right] $$ By minimizing this with SGD, we are applying what we call the naive residual-gradient algorithm. This converges robustly to its minimum, but as the example on page 271 shows, it is sometimes far from the true values.
As an alternative, we can try to minimize the Bellman error $\overline{BE}$ itself, ie. the $\mu$-weighted squared expectation of the TD error; SGD on this objective gives the residual-gradient algorithm.
One disadvantage of this minimization is that its gradient is the product of 2 expectations, so we need two independent samples of the next state from the same state (usually not possible in a real environment), and even then, in the function approximation case, the values it converges to may still be far from the true ones (see page 273).
Note: For the deterministic environments, the TD error and Bellman error minimizations lead to the same results.
For the RL context, the authors consider that something is learnable if it can be learned, eventually, from an infinite amount of experience data. It turns out that 2 different MDPs can generate identical data distributions, so without access to their dynamics some quantities cannot be determined from data at all.
This means that neither the $\overline{VE}$ nor the $\overline{BE}$ is learnable from data alone.
However, for the $\overline{VE}$ the minimizing parameters are still learnable (they coincide with the minimizers of the learnable mean squared return error $\overline{RE}$), whereas the minimizers of the $\overline{BE}$ are not.
We could only use the residual-gradient algorithm with access to the environment's dynamics, or with the ability to draw two independent samples of the next state, which is rarely possible outside of simulators.
This section considers the minimization of the $\overline{PBE}$ with true-gradient (Gradient-TD) methods.
The full derivation can be found in the book from page 278, but the result is that we can decompose the gradient of the $\overline{PBE}$ into a product of expectations which can be estimated using a second, auxiliary weight vector learned alongside $w$; this yields the GTD2 and TDC algorithms, which are stable under off-policy training at only linear, $O(d)$, complexity.
Emphatic-TD treats the discounting coefficient $\gamma$ as a form of probabilistic termination and reweights the updates with an emphasis term so that, even under the behavior policy, the effective distribution of updates recovers the convergence properties of on-policy training.
An example algorithm can be seen on page 282. It is a variation of the interest and emphasis algorithm from chapter 9. Although stable in theory, it has very high variance and, therefore, rarely converges on Baird's counterexample.
Due to the space of policies being very large, many behavior policies may be vastly different from the target one. This can lead to a very high variance in the importance ratios despite having bounded expectations.
Because of this, SGD tends to diverge unless small step-sizes are used. This can be mitigated through things such as: tree-backup methods, having similar target and behavior policies, weighted importance sampling, etc.
This chapter summarized how off-policy learning cannot be easily translated to the framework of function approximation. It showed that bootstrapping methods coupled with function approximation are not stable in an off-policy setup, an effect called by the authors the deadly triad.
Using alternative objectives such as the Bellman error are appealing because they can allow stable SGD optimizations. This error, however, was shown to not be learnable, but an alternative projected Bellman error can be used at the cost of slightly higher computational complexity.
Off-policy learning in the domain of function approximation is still unsolved, but promises of better exploration and parallel learning of multiple policies make it appealing.
Note: In the literature, the Bellman operator is more commonly denoted $T^\pi$, rather than the $B_\pi$ used in the book.
Eligibility traces are a method of unifying TD methods and MC. For example, TD($\lambda$) with $\lambda = 0$ reduces to one-step TD, while $\lambda = 1$ behaves like MC.
The mechanism introduces a short-term memory vector, the eligibility trace $z_t \in \mathbb{R}^d$, which accumulates the gradients of the value function for recently visited states and decays by $\gamma\lambda$ at every step.
This mechanism allows continual and uniform learning, as opposed to the delayed one provided by n-step learning and MC.
We begin by reiterating the definition of the $\lambda$-return, an exponentially weighted average of all the n-step returns: $G_t^\lambda \triangleq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$.
By using an eligibility trace vector updated as $z_t \leftarrow \gamma \lambda z_{t-1} + \nabla \hat{v}(S_t, w_t)$ and applying the update $w_{t+1} \leftarrow w_t + \alpha \delta_t z_t$, we obtain TD($\lambda$), a backward-view method which approximates the forward-view $\lambda$-return updates without waiting for the end of the episode.
One caveat is that TD($\lambda$) only approximates the forward view; exact equivalence with an online forward view requires the more involved true online TD($\lambda$) algorithm.
Eligibility traces pose a complexity increase, especially in the tabular setting, where the trace vector has one component per state (or state-action pair). In the function approximation case, however, the implementation usually incurs just double the memory. n-step methods, by comparison, need to store the last $n$ transitions instead.
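A sketch of semi-gradient TD($\lambda$) for a linear value function, showing how the accumulating trace replaces the per-step gradient; `x(state)`, `env`, and `policy` are the same assumed interfaces as in the earlier sketches.

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, x, d, num_episodes=500,
                            alpha=0.01, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda): z accumulates decayed gradients (features)."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        z = np.zeros(d)                      # eligibility trace, reset each episode
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            z = gamma * lam * z + x(s)       # decay and accumulate
            v_next = 0.0 if done else w @ x(s2)
            delta = r + gamma * v_next - w @ x(s)
            w += alpha * delta * z           # recently visited features get credit
            s = s2
    return w
```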
Eligibility traces provide a way to continuously vary between MC and TD methods. They can be used in both on-policy and off-policy tasks and are usually faster to learn than TD(0).
Through backward-view online methods, one can emulate the forward-view $\lambda$-return algorithms while updating at every step instead of waiting for the end of the episode.
Good values of $\lambda$ are usually intermediate ones; performance typically degrades at both extremes ($\lambda = 0$ and $\lambda = 1$).
The main advantage is data efficiency at a higher computational cost. This may be useful in the case of online and especially real-world environments but might break even when simulation and data gathering are inexpensive.
Policy gradient methods directly take actions using a differentiable policy whose parameters are denoted $\theta$; the policy itself is written $\pi(a \mid s, \theta)$.
The only condition to be able to do policy approximation is to have a policy function which is differentiable wrt. its parameters. In order to ensure exploration, policies are often prevented from becoming deterministic. One way to do this is to parametrize a preference function (ie. logits) and take the softmax of that: $$ \pi(a|s, \theta) \triangleq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} $$ Through these methods, we also have a simple way of dealing with continuous action spaces. In situations of partial observation (or imperfect function approximation), approximate policy methods can also learn stochastic policies (see page 323 for example).
Sometimes directly optimizing the policy may be easier than learning the value functions and facilitates the design of policy parametrization.
Another advantage of approximate policy methods is that the action probabilities change much more smoothly than in $\varepsilon$-greedy action-value methods, which is a large part of why policy gradient methods enjoy stronger convergence guarantees.
Using the policy gradient theorem, we can write the gradient of the performance measure $J(\theta)$ as an expectation (or rather a value it is proportional to):
$$
\nabla J(\theta) \propto \sum_s \mu (s) \sum_a q_\pi (s,a) \nabla \pi (a \mid s, \theta) = \mathbb{E}_\pi \left[ \sum_a q_\pi (S_t, a) \nabla \pi (a \mid S_t, \theta) \right]
$$
We can then further manipulate this expectation so that it is expressed wrt. the sampled return $G_t$ and the action $A_t$ actually taken, yielding the REINFORCE update $\theta_{t+1} \leftarrow \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$, which can be computed from sampled episodes.
Reinforce can be generalized to include a baseline state value function:
$$
\nabla J(\theta) \propto \sum_s \mu (s) \sum_a \left( q_\pi (s,a) - b(s) \right) \nabla \pi (a \mid s, \theta)
$$
This works because the subtracted baseline $b(s)$ does not change the expectation of the gradient (its contribution sums to zero over actions), but it can greatly reduce the variance. A natural choice is a learned estimate of the state value, $b(s) = \hat{v}(s, w)$.
This has the same convergence guarantees but, with well-selected step sizes (one for the policy parameters and one for the baseline), learning is typically considerably faster.
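A sketch of REINFORCE with a learned state-value baseline, using a softmax policy over linear preferences; the feature functions `x_policy(s, a)` and `x_value(s)` and the episode-generating `env` interface are assumptions for illustration, and the book's per-step $\gamma^t$ factor is omitted, as is common in practice.

```python
import numpy as np

def softmax_probs(theta, s, actions, x_policy):
    """Softmax over linear action preferences h(s, a, theta) = theta . x_policy(s, a)."""
    prefs = np.array([theta @ x_policy(s, a) for a in actions])
    prefs -= prefs.max()                      # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_baseline(env, actions, x_policy, x_value, d_theta, d_w,
                       num_episodes=1000, alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0):
    theta = np.zeros(d_theta)                 # policy parameters
    w = np.zeros(d_w)                         # baseline (state-value) weights
    for _ in range(num_episodes):
        # generate one episode with the current policy
        episode, s, done = [], env.reset(), False
        while not done:
            probs = softmax_probs(theta, s, actions, x_policy)
            a = np.random.choice(len(actions), p=probs)
            s2, r, done = env.step(actions[a])
            episode.append((s, a, r))
            s = s2
        # Monte Carlo returns, then policy-gradient updates with a baseline
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            delta = G - w @ x_value(s)        # advantage wrt. the baseline
            w += alpha_w * delta * x_value(s)
            probs = softmax_probs(theta, s, actions, x_policy)
            # gradient of log softmax: x(s, a) minus the expected feature vector
            grad_log = x_policy(s, actions[a]) - sum(
                p * x_policy(s, b) for p, b in zip(probs, actions))
            theta += alpha_theta * delta * grad_log
    return theta, w
```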
Actor-critic methods involve a pretty straightforward modification. They simply replace the return with a bootstrapped approximation of it, so the update rule for the actor weights becomes $\theta_{t+1} \leftarrow \theta_t + \alpha\, \delta_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$, where $\delta_t$ is the one-step TD error computed by the critic.
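A sketch of one-step actor-critic with linear function approximation, reusing `softmax_probs` from the REINFORCE sketch above; the feature functions and `env` interface remain assumptions, and the per-step $\gamma^t$ factor is again dropped for simplicity.

```python
import numpy as np

def actor_critic(env, actions, x_policy, x_value, d_theta, d_w,
                 num_episodes=1000, alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0):
    theta, w = np.zeros(d_theta), np.zeros(d_w)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            probs = softmax_probs(theta, s, actions, x_policy)   # defined above
            a = np.random.choice(len(actions), p=probs)
            s2, r, done = env.step(actions[a])
            # critic: one-step TD error with the bootstrapped value
            v_next = 0.0 if done else w @ x_value(s2)
            delta = r + gamma * v_next - w @ x_value(s)
            w += alpha_w * delta * x_value(s)
            # actor: policy-gradient step scaled by the TD error
            grad_log = x_policy(s, actions[a]) - sum(
                p * x_policy(s, b) for p, b in zip(probs, actions))
            theta += alpha_theta * delta * grad_log
            s = s2
    return theta, w
```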
For continuing problems, we have previously defined performance in terms of the average reward rate $r(\pi)$; the actor-critic updates stay the same, but the critic uses the differential TD error with a running estimate $\bar{R}$ of the average reward.
Parameterizing policies to account for continuous actions amounts to simply changing the distribution they output. Instead of using a softmax distribution with parameterized preferences, we can instead parameterize the mean and variance of a normal distribution.
$$
\pi(a|s, \theta) \triangleq \frac{1}{\sigma(s, \theta) \sqrt{2\pi}} \exp \left(-\frac{(a - \mu(s, \theta))^2}{2 \sigma (s, \theta) ^2}\right)
$$
For both $\mu(s, \theta)$ and $\sigma(s, \theta)$ we can use any parameterized function approximator; the standard deviation is usually parameterized through an exponential so that it stays positive.
This chapter presented an alternative to action-value methods: directly optimizing the policy wrt. a performance measure, typically the value of the initial state. These methods are called policy gradient methods.
Unlike action-value methods, policy gradient ones have an exact analytic expression for the gradient of performance wrt. the policy parameters, which does not require the gradient of the state distribution. This is called the policy gradient theorem, and it underlies their stronger on-policy convergence guarantees.
To get this gradient we can use MC methods and we get the REINFORCE algorithm. The variance of this method can be further reduced by adding a baseline, ie. a function that learns the values for states. If we use bootstrapping to compute this kind of value function, we can train using TD methods and get what are called actor-critic methods.
All these methods have stronger convergence guarantees than value-based methods and are actively researched.
In Mnih et al. 2015, the authors use DQN to play various Atari games with very good performance, using 50 million frames of experience per game (about 38 days of game time).
To do this, the authors tested several models and methods against each other and reached a set of improvements to DQN which led to the greatest performance increases:
First, 4 consecutive frames are stacked and fed to the model at once, and the selected action is repeated for 4 timesteps. This can also be seen as a way of reducing partial observability and making the process more Markovian, while keeping memory and computation low.
Then, as implemented in earlier work, the authors do not feed the experiences $(S_t, A_t, R_{t+1}, S_{t+1})$ to the learner sequentially; instead, they store them in a replay buffer and sample random minibatches from it (experience replay), which breaks the correlation between consecutive updates.
As expected from Q-learning, the value network is trained with semi-gradient methods, but the bootstrap targets are computed using a separate target network whose weights are only periodically copied from the learned network, which further stabilizes training.
Architecture-wise, they tested MLPs, CNNs, and linear models. Out of the three, CNNs, unsurprisingly came out as winners with significant improvements over linear models, especially when paired with the other tweaks we mentioned.
The successes around the game of Go, namely the AlphaGo family of algorithms all revolve around MCTS used as a policy improvement operator. MCTS is a more refined version of a rollout algorithm.
Using MC seems to be necessary, as there appear to be no heuristics that can efficiently evaluate states and state-action pairs for Go (all previous best-performing programs also used MCTS). For example, with MCTS we can keep part of the search subtree between moves instead of fully discarding previous information as generic MC rollouts do.
AlphaGo's version of MCTS, called Asynchronous Policy and Value MCTS (APV-MCTS), unlike traditional MCTS uses both the rollout policy and a learned value function to evaluate nodes in the tree. APV-MCTS also keeps track of edge visit counts and, at the root, selects the action with the most visits.
This work was supported by the European Regional Development Fund and the Romanian Government through the Competitiveness Operational Programme 2014–2020, project ID P_37_679, MySMIS code 103319, contract no. 157/16.12.2016.
© Nichita Uțiu. Distributed under the terms of the Creative Commons Attribution License, http://creativecommons.org/licenses/by/4.0/.