# Monte Carlo Methods

## Introduction

Monte Carlo methods can be used to estimate value functions and
discover optimal policies. Unlike dynamic programming, they do not assume complete
knowledge of the environment. MC methods require only experience: sample
sequences of states, actions, and rewards from actual or simulated interaction with an
environment. Learning from actual experience is striking because it requires no prior
knowledge of the environment’s dynamics, yet can still attain optimal behavior. Learning
from simulated experience is also powerful. Although a model is required, the model need
only generate sample transitions, not the complete probability distributions of all possible
transitions that are required for dynamic programming (DP). In surprisingly many cases it
is easy to generate experience sampled according to the desired probability distributions,
but infeasible to obtain the distributions in explicit form.

Monte Carlo methods are ways of solving the reinforcement learning problem based on
averaging sample returns. To ensure that well-defined returns are available, here we define
Monte Carlo methods only for episodic tasks. That is, we assume experience is divided
into episodes, and that all episodes eventually terminate no matter what actions are
selected. Only on the completion of an episode are value estimates and policies changed.
Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in
a step-by-step (online) sense. The term “Monte Carlo” is often used more broadly for
any estimation method whose operation involves a significant random component. Here
we use it specifically for methods based on **averaging complete returns**.

We adapt the idea of generalized policy iteration (GPI)
developed in Chapter 4 for DP. Whereas there we computed value functions from knowledge
of the MDP, here we learn value functions from sample returns obtained by interacting with the MDP. The value
functions and corresponding policies still interact to attain optimality in essentially the
same way (GPI). As in the DP chapter, first we consider the prediction problem, then policy improvement, and,
finally, the control problem and its solution by GPI. Each of these ideas taken from DP
is extended to the Monte Carlo case in which only sample experience is available.

## Monte Carlo Prediction

We begin by considering Monte Carlo methods for learning the state-value function for a
given policy. Recall that the value of a state is the expected return—expected cumulative
future discounted reward—starting from that state. An obvious way to estimate it from
experience, then, is simply to **average the returns observed after visits to that state**. As
more returns are observed, the average should converge to the expected value. This idea
underlies all Monte Carlo methods.
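
The averaging idea can be sketched in a few lines. This is a minimal illustration, not code from the source: the helper names `discounted_returns` and `running_mean` are hypothetical, and the rewards list is assumed to hold $R_1, \ldots, R_T$ for a single terminated episode.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] for one episode with rewards [R_1, ..., R_T]."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def running_mean(old_mean, count, new_value):
    """Update a sample mean after observing its count-th value."""
    return old_mean + (new_value - old_mean) / count


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Computing the returns backwards avoids re-summing the tail of the episode for every time step, and the running mean avoids storing every observed return.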

### First-visit MC method

Suppose we wish to estimate $v_{\pi}(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes (e.g., rounds of a game) obtained by following $\pi$
and passing through $s$. Each occurrence of state $s$ in an episode is called **a visit to $s$**. Of course, $s$ may be visited multiple times in the same episode; let us call the first time it is visited in an
episode the **first visit to $s$**. The **first-visit MC method** estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$, whereas the **every-visit MC method** averages the returns following all visits to $s$.
These two MC methods are very similar but have slightly different theoretical properties. First-visit MC has been most widely studied. Every-visit MC extends more naturally to function approximation and eligibility traces.

<img src='first-vivit-MC.png'>

The condition "unless $S_t$ appears in $S_0, \ldots, S_{t-1}$" in the box means that only the first occurrence of a state in each episode contributes a return.

First-visit MC is shown in procedural form in the box. Every-visit MC would be the
same except without the check for $S_t$ having occurred earlier in the episode.
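
The procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the boxed algorithm verbatim: the function name `mc_prediction` and the episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state) are choices made here for the example.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by averaging returns observed after visits to s.

    episodes: list of episodes; each episode is a list of (state, reward)
    pairs, where reward is R_{t+1}, received after leaving S_t.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        states = [s for s, _ in episode]
        g = 0.0
        # Walk the episode backwards so G_t is built up incrementally.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            # First-visit check: skip if S_t appeared in S_0, ..., S_{t-1}.
            if first_visit and state in states[:t]:
                continue
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
print(mc_prediction([episode]))                     # first-visit: A -> 3.0, B -> 2.0
print(mc_prediction([episode], first_visit=False))  # every-visit: A -> 2.5, B -> 2.0
```

In the example episode, state `A` is visited twice. First-visit MC uses only the return from the first visit (3.0), whereas every-visit MC averages the returns from both visits (3.0 and 2.0).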

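Both first-visit MC and every-visit MC converge to $v_{\pi}(s)$ as the number of visits (or first visits) to $s$ goes to infinity. For first-visit MC this follows from the law of large numbers (LLN): each return is an independent, identically distributed estimate of $v_{\pi}(s)$ with finite variance, so the average of these estimates converges to their expected value. The every-visit case is less straightforward, but its estimates also converge to $v_{\pi}(s)$.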

Sometimes, even when one has complete knowledge of the environment's dynamics, the ability of MC methods to work with sample episodes alone can be a significant advantage: generating the sample
episodes required by Monte Carlo methods is often much easier than computing the explicit transition probabilities that DP needs.

An important fact about Monte Carlo methods is that the estimates for each
state are independent. The estimate for one state does not build upon the estimate
of any other state, as is the case in DP. In particular, note that the computational expense of estimating the value of
a single state is independent of the number of states. This can make Monte Carlo
methods particularly attractive when one requires the value of only one or a subset
of states. One can generate many sample episodes starting from the states of interest,
averaging returns from only these states, ignoring all others. This is a third advantage
Monte Carlo methods can have over DP methods (after the ability to learn from actual experience and from simulated experience).
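
As a rough illustration of estimating the value of one state while ignoring all others, the sketch below uses a hypothetical symmetric random walk that is not part of the text above. States `1..n_states` are non-terminal, stepping off the left end yields reward 0, and stepping off the right end yields reward 1. The function names and the environment are assumptions made for this example.

```python
import random


def random_walk_return(start, n_states, rng):
    """Undiscounted return of one random-walk episode begun in `start`."""
    s = start
    # States 0 and n_states + 1 are terminal.
    while 0 < s < n_states + 1:
        s += rng.choice((-1, 1))
    return 1.0 if s == n_states + 1 else 0.0


def estimate_state_value(start, n_states, n_episodes, seed=0):
    """Average sampled returns from a single state of interest."""
    rng = random.Random(seed)
    total = sum(random_walk_return(start, n_states, rng) for _ in range(n_episodes))
    return total / n_episodes


# By symmetry, the true value of the middle state of a 5-state walk is 0.5.
print(estimate_state_value(start=3, n_states=5, n_episodes=20000))
```

No value table for the other states is built or consulted. Only returns from episodes that start in the state of interest are averaged.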

## Monte Carlo Estimation of Action Values

If a model is not available, then it is particularly useful to estimate action values (the
values of state–action pairs) rather than state values. With a model, state values alone are
sufficient to determine a policy; one simply looks ahead one step and chooses whichever
action leads to the best combination of reward and next state, as we did in the chapter on
DP. Without a model, however, state values alone are not sufficient (the dynamics function $p$ is unknown). One must explicitly
estimate the value of each action in order for the values to be useful in suggesting a policy.
Thus, one of our primary goals for Monte Carlo methods is to estimate $q_{*}$. To achieve
this, we first consider the policy evaluation problem for action values.

The policy evaluation problem for action values is to estimate $q_{\pi} (s, a)$, the expected return when starting in state s, taking action a and thereafter following policy $\pi$. The
Monte Carlo methods for this are essentially the same as just presented for state values,
except now we talk about visits to a state–action pair rather than to a state. A
state–action pair $(s, a)$ is said to be visited in an episode if ever the state $s$ is visited and action
$a$ is taken in it. The every-visit MC method estimates the value of a state–action pair
as the average of the returns that have followed all the visits to it. The first-visit MC
method averages the returns following the first time in each episode that the state was
visited and the action was selected. These methods converge quadratically, as before, to
the true expected values as the number of visits to each state–action pair approaches
infinity. The only complication is that many state–action pairs may never be visited. If $\pi$ is
a deterministic policy, then in following $\pi$ one will observe returns only for one of the
actions from each state. With no returns to average, the Monte Carlo estimates of the
other actions will not improve with experience. This is a serious problem because the
purpose of learning action values is to help in choosing among the actions available in
each state. To compare alternatives we need to estimate the value of all the actions from
each state, not just the one we currently favor.
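
The coverage problem is easy to see in code. The following sketch assumes episodes are lists of `(state, action, reward)` triples; the function name `mc_action_values` and the toy state and action labels are hypothetical.

```python
from collections import defaultdict


def mc_action_values(episodes, gamma=1.0):
    """First-visit MC estimate of Q(s, a) from (state, action, reward) episodes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        pairs = [(s, a) for s, a, _ in episode]
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = gamma * g + r
            # Only the first visit to (s, a) in this episode contributes.
            if (s, a) not in pairs[:t]:
                totals[(s, a)] += g
                counts[(s, a)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Episodes from a deterministic policy that always picks "right".
episodes = [
    [("s0", "right", 1.0), ("s1", "right", 2.0)],
    [("s0", "right", 3.0)],
]
q = mc_action_values(episodes)
print(q)                    # estimates exist only for the "right" pairs
print(("s0", "left") in q)  # False: no returns were ever observed for "left"
```

Because the policy never selects `"left"`, there is nothing to average for those pairs, however many episodes are collected.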

### Maintaining exploration and exploring starts

This is the general problem of **maintaining exploration**. For policy evaluation to work for action
values, we must assure continual exploration. One way to do this is by specifying that
the episodes start in a state–action pair, and that every pair has a nonzero probability of
being selected as the start. This guarantees that all state–action pairs will be visited an
infinite number of times in the limit of an infinite number of episodes. We call this the
assumption of exploring starts.

The assumption of exploring starts is sometimes useful, but of course it cannot be
relied upon in general, particularly when learning directly from actual interaction with an
environment. In that case the starting conditions are unlikely to be so helpful. The most
common alternative approach to assuring that all state–action pairs are encountered is to consider only policies that are stochastic with a nonzero probability of selecting all
actions in each state.
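
Both ways of maintaining exploration can be sketched briefly. The helpers below are illustrative assumptions, not definitions from the text: `exploring_start` samples the first state–action pair uniformly, and `epsilon_soft_action` is one common way to build a stochastic policy that gives every action a nonzero probability. For simplicity, the same action list is assumed to be available in every state.

```python
import random


def exploring_start(states, actions, rng):
    """Sample a start pair so that every (state, action) has nonzero probability."""
    return rng.choice(states), rng.choice(actions)


def epsilon_soft_action(greedy_action, actions, epsilon, rng):
    """With probability epsilon pick uniformly at random, otherwise act greedily.

    Every action is selected with probability at least epsilon / len(actions).
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action


rng = random.Random(0)
print(exploring_start(["s0", "s1"], ["left", "right"], rng))
print(epsilon_soft_action("left", ["left", "right"], 0.2, rng))
```

Exploring starts need control over the initial state and action, which is usually available only in simulation. The soft policy needs no such control, at the cost of never being fully greedy.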

## Monte Carlo Control

We are now ready to consider how Monte Carlo estimation can be used in control, that
is, to approximate optimal policies. The overall idea is to proceed according to the same
pattern as in DP, that is, according to the idea of generalized policy iteration (GPI). In GPI, one maintains both an approximate policy and an approximate value function. The value function is repeatedly
altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function. These two kinds of changes work against each other to
some extent, as each creates a moving target for the other, but
together they cause both policy and value function to approach
optimality.

To begin, let us consider an MC version of classical policy iteration. In this method, we perform alternating complete steps of policy evaluation and policy improvement, beginning with an arbitrary policy $\pi_0$ and ending with the optimal policy and optimal action-value function.

<img src='MC-policy-iteration.png'>

Policy evaluation is done exactly as described in the preceding section.
Many episodes are experienced, with the approximate action-value function approaching
the true function asymptotically. For the moment, let us assume that we do indeed
observe an infinite number of episodes and that, in addition, the episodes are generated
with **exploring starts**. Under these assumptions, the MC methods will compute each $q_{\pi_k}$ exactly, for arbitrary $\pi_k$.

Policy improvement is done by making the policy greedy with respect to the current value function. In this case, we have an action-value function, and therefore no model is needed to construct the greedy policy (i.e., we do not need to know $p$). For any action-value function
$q$, the corresponding greedy policy is the one that, for each $s \in S$, deterministically chooses an action with maximal action-value:

$\pi_{k+1}(s) = \arg\max_{a} q_{\pi_k}(s, a)$

Policy improvement then can be done by constructing each $\pi_{k+1}$ as the greedy policy with respect to $q_{\pi_k}$. The policy improvement theorem then applies to $\pi_{k}$ and $\pi_{k+1}$ because, for all $s \in S$,

$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}(s, \arg\max_{a} q_{\pi_k}(s, a)) = \max_{a} q_{\pi_k}(s, a) \geq q_{\pi_k}(s, \pi_k(s)) \geq v_{\pi_k}(s)$

As discussed previously, the theorem assures us that each $\pi_{k+1}$ is uniformly better than $\pi_{k}$, or just as good as $\pi_k$, in which case they are both optimal policies. This in turn assures us that the overall process converges to the optimal policy and optimal value function. In this way,
MC methods can be used to find optimal policies given only sample episodes and no other knowledge of the environment's dynamics.
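
The greedy improvement step needs only a table of action values. A minimal sketch, assuming the table is a dict keyed by `(state, action)` and that the same actions are available in every state (both are assumptions for this example, as is the name `greedy_policy`):

```python
def greedy_policy(q, actions):
    """Map each state in q to an action with maximal estimated value.

    q: dict mapping (state, action) -> estimated action value.
    Pairs missing from q are treated as -inf, so they are never preferred.
    Ties are broken in favor of the action listed first in `actions`.
    """
    states = {s for s, _ in q}
    return {
        s: max(actions, key=lambda a: q.get((s, a), float("-inf")))
        for s in states
    }


q = {("s0", "left"): 1.0, ("s0", "right"): 2.0,
     ("s1", "left"): 0.5, ("s1", "right"): 0.1}
print(greedy_policy(q, ["left", "right"]))  # s0 -> right, s1 -> left
```

No transition model appears anywhere in this step, which is the reason for estimating action values rather than state values.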

We made two unlikely assumptions above in order to easily obtain this guarantee of
convergence for the Monte Carlo method.

1. The episodes have exploring starts
2. Policy evaluation could be done with an infinite number of episodes

To obtain a practical algorithm we will have to remove both assumptions. For now, we focus on removing the second assumption, that policy evaluation operates on an infinite number of episodes. This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods such as
iterative policy evaluation, which also converge only asymptotically to the true value function. In both DP and MC cases, there are two ways to solve the problem. One is to hold firm to the idea of approximating $q_{\pi_k}$ in each policy evaluation. Measurements and assumptions are made to obtain bounds
on the magnitude and probability of error in the estimates, and then sufficient steps are
taken during each policy evaluation to assure that these bounds are sufficiently small.
This approach can probably be made completely satisfactory in the sense of guaranteeing
correct convergence up to some level of approximation. However, it is also likely to require far too many episodes to be useful in practice on any but the smallest problems.

### MCES

There is a second approach to avoiding the infinite number of episodes nominally
required for policy evaluation, in which we give up trying to complete policy evaluation
before returning to policy improvement. On each evaluation step we move the value
function toward $q_{\pi_k}$, but we do not expect to actually get close except over many steps.
We used this idea when we first introduced the idea of GPI in Section 4.6. One extreme
form of the idea is value iteration, in which only one iteration of iterative policy evaluation
is performed between each step of policy improvement. The in-place version of value
iteration is even more extreme; there we alternate between improvement and evaluation
steps for single states.

For Monte Carlo policy iteration it is natural to alternate between evaluation and
improvement on an episode-by-episode basis. After each episode, the observed returns
are used for policy evaluation, and then the policy is improved at all the states visited in
the episode. A complete simple algorithm along these lines, which we call Monte Carlo
ES, for Monte Carlo with Exploring Starts, is given in the box below.

<img src='MCES.png'>

In MCES, all the returns for each state–action pair are accumulated and averaged over all episodes, irrespective of what policy was in force when they were observed: the update $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$ maintains a single shared estimate $Q(s, a)$ rather than a separate one for each policy. It is easy to see that MCES cannot converge to any suboptimal policy.
If it did, then the value function would eventually converge to the value function for that policy, and that in turn would cause the policy to change. Stability is achieved only when both the policy and the value function are optimal. Convergence to this optimal fixed point seems inevitable as the changes to the
action-value function decrease over time, but has not yet been formally proved.
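
To make the episode-by-episode alternation concrete, here is a small sketch of Monte Carlo ES on a hypothetical two-state episodic MDP. The environment, the names `TRANSITIONS`, `generate_episode`, and `monte_carlo_es`, and the zero initialization of $Q$ are all assumptions made for this illustration. Returns are averaged with an incremental mean, which is equivalent to averaging a stored list of returns.

```python
import random
from collections import defaultdict

# Hypothetical MDP: TRANSITIONS[(state, action)] = (reward, next_state).
# A next_state of None ends the episode, so every episode terminates.
TRANSITIONS = {
    (0, "a"): (0.0, 1),
    (0, "b"): (1.0, None),
    (1, "a"): (3.0, None),
    (1, "b"): (0.0, None),
}
STATES = [0, 1]
ACTIONS = ["a", "b"]


def generate_episode(policy, start_state, start_action):
    """Take the given first action, then follow `policy` until termination."""
    episode, state, action = [], start_state, start_action
    while state is not None:
        reward, next_state = TRANSITIONS[(state, action)]
        episode.append((state, action, reward))
        state = next_state
        if state is not None:
            action = policy[state]
    return episode


def monte_carlo_es(n_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    policy = {s: rng.choice(ACTIONS) for s in STATES}  # arbitrary initial policy
    q = defaultdict(float)                             # Q initialized to 0
    counts = defaultdict(int)
    for _ in range(n_episodes):
        # Exploring start: every (state, action) pair can begin an episode.
        s0, a0 = rng.choice(STATES), rng.choice(ACTIONS)
        episode = generate_episode(policy, s0, a0)
        pairs = [(s, a) for s, a, _ in episode]
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = gamma * g + r
            if (s, a) in pairs[:t]:
                continue  # first-visit check
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]      # evaluation
            policy[s] = max(ACTIONS, key=lambda act: q[(s, act)])  # improvement
    return policy, dict(q)


policy, q = monte_carlo_es(500)
print(policy)  # {0: 'a', 1: 'a'}
```

In this toy problem the myopic choice in state 0 is `"b"` (immediate reward 1), but taking `"a"` and then `"a"` again yields a return of 3. Because returns gathered under earlier policies stay in the average, the estimate for `(0, "a")` may lie below 3 even after the policy has settled on the optimal actions.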

### Monte Carlo Control without Exploring Starts

