# Monte Carlo Methods

## Introduction

Monte Carlo (MC) methods can be used to estimate value functions and
to discover optimal policies. Unlike dynamic programming (DP), they do not assume complete
knowledge of the environment. MC methods require only experience: sample
sequences of states, actions, and rewards from actual or simulated interaction with an
environment. Learning from actual experience is striking because it requires no prior
knowledge of the environment’s dynamics, yet can still attain optimal behavior. Learning
from simulated experience is also powerful. Although a model is required, the model need
only generate sample transitions, not the complete probability distributions of all possible
transitions that is required for dynamic programming (DP). In surprisingly many cases it
is easy to generate experience sampled according to the desired probability distributions,
but infeasible to obtain the distributions in explicit form.

Monte Carlo methods are ways of solving the reinforcement learning problem based on
averaging sample returns. To ensure that well-defined returns are available, here we define
Monte Carlo methods only for episodic tasks. That is, we assume experience is divided
into episodes, and that all episodes eventually terminate no matter what actions are
selected. Only on the completion of an episode are value estimates and policies changed.
Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in
a step-by-step (online) sense. The term “Monte Carlo” is often used more broadly for
any estimation method whose operation involves a significant random component. Here
we use it specifically for methods based on **averaging complete returns**.

We adapt the idea of generalized policy iteration (GPI)
developed in Chapter 4 for DP. Whereas there we computed value functions from knowledge
of the MDP, here we learn value functions from sample returns with the MDP. The value
functions and corresponding policies still interact to attain optimality in essentially the
same way (GPI). As in the DP chapter, first we consider the prediction problem, then policy improvement, and,
finally, the control problem and its solution by GPI. Each of these ideas taken from DP
is extended to the Monte Carlo case in which only sample experience is available.

## Monte Carlo Prediction

We begin by considering Monte Carlo methods for learning the state-value function for a
given policy. Recall that the value of a state is the expected return—expected cumulative
future discounted reward—starting from that state. An obvious way to estimate it from
experience, then, **is simply to average the returns observed after visits to that state. As
more returns are observed**, the average should converge to the expected value. This idea
underlies all Monte Carlo methods.

### First-visit MC method

Suppose we wish to estimate $v_{\pi} (s)$, the value of a state s under policy $\pi$, given a set of episodes (e.g., rounds of a game) obtained by following $\pi$
and passing through s. Each occurrence of state s in an episode is called **a visit to s**. Of course, s may be visited multiple times in the same episode; let us call the first time it is visited in an
episode the first visit to s. The **first-visit MC method** estimates $v_{\pi} (s)$ as the average of the returns following first visits to s, whereas the **every-visit MC method** averages the returns following all visits to s.
These two MC methods are very similar but have slightly different theoretical properties. First-visit MC has been most widely studied. Every-visit MC extends more naturally to function approximation and eligibility traces.

<img src='first-vivit-MC.png'>

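In the box, the condition "unless $S_t$ appears in $S_0, \dots, S_{t-1}$" means that only the first occurrence of a state in each episode is taken into account.

First-visit MC is shown in procedural form in the box. Every-visit MC would be the
same except without the check for $S_t$ having occurred earlier in the episode.

Both first-visit MC and every-visit MC converge to $v_{\pi} (s)$ as the number of visits to s goes to infinity. For first-visit MC this follows from the law of large numbers (LLN): each return is an independent, identically distributed estimate of $v_{\pi} (s)$ with finite variance, so the average of these estimates converges to their expected value.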
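The first-visit procedure can be sketched in a few lines of Python. This is a minimal illustration, not the boxed algorithm verbatim: the function name `first_visit_mc`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state), and the two hand-written episodes are all assumptions made for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Walk backwards so that G is the discounted return from step t onwards.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: skip s if it already occurred earlier in the episode.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Two hypothetical episodes; state "A" is visited twice in the first one.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 3.0)],
]
V = first_visit_mc(episodes)
print(V)
```

Removing the `if s not in states[:t]` check turns this sketch into the every-visit variant.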

Sometimes, even when one has complete knowledge of the environment's dynamics, the ability of MC methods to work with sample episodes alone can be a significant advantage -- generating the sample
games required by Monte Carlo methods is easy.

An important fact about Monte Carlo methods is that the estimates for each
state are independent. The estimate for one state does not build upon the estimate
of any other state, as is the case in DP. In particular, note that the computational expense of estimating the value of
a single state is independent of the number of states. This can make Monte Carlo
methods particularly attractive when one requires the value of only one or a subset
of states. One can generate many sample episodes starting from the states of interest,
averaging returns from only these states, ignoring all others. This is a third advantage
Monte Carlo methods can have over DP methods (after the ability to learn from actual experience and from simulated experience).

## Monte Carlo Estimation of Action Values

If a model is not available, then it is particularly useful to estimate action values (the
values of state–action pairs) rather than state values. With a model, state values alone are
sufficient to determine a policy; one simply looks ahead one step and chooses whichever
action leads to the best combination of reward and next state, as we did in the chapter on
DP. Without a model, however, state values alone are not sufficient (p is unknown). One must explicitly
estimate the value of each action in order for the values to be useful in suggesting a policy.
Thus, one of our primary goals for Monte Carlo methods is to estimate $q_{*}$. To achieve
this, we first consider the policy evaluation problem for action values.

The policy evaluation problem for action values is to estimate $q_{\pi} (s, a)$, the expected return when starting in state s, taking action a and thereafter following policy $\pi$. The
Monte Carlo methods for this are essentially the same as just presented for state values,
except now we talk about visits to a state–action pair rather than to a state. A state–
action pair s, a is said to be visited in an episode if ever the state s is visited and action
a is taken in it. The every-visit MC method estimates the value of a state–action pair
as the average of the returns that have followed all the visits to it. The first-visit MC
method averages the returns following the first time in each episode that the state was
visited and the action was selected. These methods converge quadratically, as before, to
the true expected values as the number of visits to each state–action pair approaches
infinity. The only complication is that many state–action pairs may never be visited. If $\pi$ is
a deterministic policy, then in following $\pi$ one will observe returns only for one of the
actions from each state. With no returns to average, the Monte Carlo estimates of the
other actions will not improve with experience. This is a serious problem because the
purpose of learning action values is to help in choosing among the actions available in
each state. To compare alternatives we need to estimate the value of all the actions from
each state, not just the one we currently favor.

### Maintaining exploration and exploring starts

This is the general problem of **maintaining exploration**. For policy evaluation to work for action
values, we must assure continual exploration. One way to do this is by specifying that
the episodes start in a state–action pair, and that every pair has a nonzero probability of
being selected as the start. This guarantees that all state–action pairs will be visited an
infinite number of times in the limit of an infinite number of episodes. We call this the
assumption of exploring starts.

The assumption of exploring starts is sometimes useful, but of course it cannot be
relied upon in general, particularly when learning directly from actual interaction with an
environment. In that case the starting conditions are unlikely to be so helpful. The most
common alternative approach to assuring that all state–action pairs are encountered is to consider only policies that are stochastic with a nonzero probability of selecting all
actions in each state.

## Monte Carlo Control

We are now ready to consider how Monte Carlo estimation can be used in control, that
is, to approximate optimal policies. The overall idea is to proceed according to the same
pattern as in the DP chapter, that is, according to the idea of generalized policy iteration (GPI). In GPI, one maintains both an approximate policy and an approximate value function. The value function is repeatedly
altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function. These two kinds of changes work against each other to
some extent, as each creates a moving target for the other, but
together they cause both policy and value function to approach
optimality.

To begin, let us consider an MC version of classical policy iteration. In this method, we perform alternating complete steps of policy evaluation and policy improvement, beginning with an arbitrary policy $\pi_0$ and ending with the optimal policy and optimal action-value function.

<img src='MC-policy-iteration.png'>

Policy evaluation is done exactly as described in the preceding section.
Many episodes are experienced, with the approximate action-value function approaching
the true function asymptotically. For the moment, let us assume that we do indeed
observe an infinite number of episodes and that, in addition, the episodes are generated
with **exploring starts**. Under these assumptions, the MC methods will compute each $q_{\pi_k}$ exactly, for arbitrary $\pi_k$.

Policy improvement is done by making the policy greedy wrt the current value function. In this case, we have an action-value function, and therefore no model is needed to construct the greedy policy (i.e., we do not need to know p). For any action-value function
q, the corresponding greedy policy is the one that, for each $s \in S$, deterministically chooses an action with maximal action-value:

$\pi_{k+1} (s) = argmax_{a} q_{\pi_k} (s, a)$
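As a small illustration of why no model is needed, the greedy policy can be read directly off a table of action values. The sketch below assumes a dict-of-dicts table `Q[state][action]`; the function name `greedy_policy` and the numbers are made up for the example.

```python
def greedy_policy(Q):
    """Return pi(s) = argmax_a Q[s][a] for every state in the table."""
    return {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}

# Hypothetical action-value estimates for two states.
Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.5, "right": -0.3},
}
pi = greedy_policy(Q)
print(pi)
```

Ties are broken here in favor of the first maximal action in the dict, which is one arbitrary but common choice.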

Policy improvement then can be done by constructing each $\pi_{k+1}$ as the greedy policy wrt $q_{\pi_k}$. The policy improvement theorem then applies to $\pi_{k}, \pi_{k+1}$ because, for all $s \in S$,

$q_{\pi_k} (s, \pi_{k+1} (s)) = q_{\pi_k} (s, argmax_{a} q_{\pi_k} (s, a)) = max_{a} q_{\pi_{k}} (s, a) \geq q_{\pi_k} (s, \pi_k (s)) \geq v_{\pi_k} (s)$

As discussed previously, the theorem assures us that each $\pi_{k+1}$ is uniformly better than $\pi_{k}$, or just as good as $\pi_k$, in which case they are both optimal policies. This in turn assures us that the overall process converges to the optimal policy and optimal value function. In this way,
MC methods can be used to find optimal policies given only sample episodes and no other knowledge of the environment's dynamics.

We made two unlikely assumptions above in order to easily obtain this guarantee of
convergence for the Monte Carlo method.

1. The episodes have exploring starts
2. Policy evaluation could be done with an infinite number of episodes

To obtain a practical algorithm we will have to remove both assumptions. For now, we focus on removing the second assumption, that policy evaluation operates on an infinite number of episodes. This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods such as
iterative policy evaluation, which also converge only asymptotically to the true value function. In both DP and MC cases, there are two ways to solve the problem. One is to hold firm to the idea of approximating $q_{\pi_k}$ in each policy evaluation. Measurements and assumptions are made to obtain bounds
on the magnitude and probability of error in the estimates, and then sufficient steps are
taken during each policy evaluation to assure that these bounds are sufficiently small.
This approach can probably be made completely satisfactory in the sense of guaranteeing
correct convergence up to some level of approximation. However, it is also likely to require far too many episodes to be useful in practice on any but the smallest problems.

### MCES

There is a second approach to avoiding the infinite number of episodes nominally
required for policy evaluation, in which we give up trying to complete policy evaluation
before returning to policy improvement. On each evaluation step we move the value
function toward $q_{\pi_k}$, but we do not expect to actually get close except over many steps.
We used this idea when we first introduced the idea of GPI in Section 4.6. One extreme
form of the idea is value iteration, in which only one iteration of iterative policy evaluation
is performed between each step of policy improvement. The in-place version of value
iteration is even more extreme; there we alternate between improvement and evaluation
steps for single states.

For Monte Carlo policy iteration it is natural to alternate between evaluation and
improvement on an episode-by-episode basis. After each episode, the observed returns
are used for policy evaluation, and then the policy is improved at all the states visited in
the episode. A complete simple algorithm along these lines, which we call Monte Carlo
ES (MCES), for Monte Carlo with Exploring Starts, is given in the box below.

<img src='MCES.png'>

In MCES, all the returns for each state-action pair are accumulated and averaged over all episodes, irrespective of what policy was in force when they were observed (the update $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$ modifies a single global estimate of q(s, a)). It is easy to see that MCES cannot converge to any suboptimal policy.
If it did, then the value function would eventually converge to the value function for that policy, and that in turn would cause the policy to change. Stability is achieved only when both the policy and the value function are optimal. Convergence to this optimal fixed point seems inevitable as the changes to the
action-value function decrease over time, but has not yet been formally proved.
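To make the episode-by-episode alternation concrete, here is a minimal sketch of Monte Carlo ES on a toy problem. The two-step MDP (`REWARD`, `NEXT_STATE`), the function name `mc_es`, and the use of a running average in place of stored return lists are assumptions made for the illustration; the toy task is chosen so that every episode terminates regardless of the policy.

```python
import random

# Hypothetical two-step episodic MDP: s0 -> s1 -> terminal, whatever the actions.
REWARD = {"s0": {0: 0.0, 1: 1.0}, "s1": {0: 2.0, 1: 0.0}}
NEXT_STATE = {"s0": "s1", "s1": None}  # None marks termination
ACTIONS = [0, 1]

def mc_es(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in REWARD}
    counts = {s: {a: 0 for a in ACTIONS} for s in REWARD}
    pi = {s: rng.choice(ACTIONS) for s in REWARD}  # arbitrary initial policy

    for _ in range(num_episodes):
        # Exploring start: every state-action pair can begin an episode.
        state, action = rng.choice(list(REWARD)), rng.choice(ACTIONS)
        episode = []
        while state is not None:
            episode.append((state, action, REWARD[state][action]))
            state = NEXT_STATE[state]
            if state is not None:
                action = pi[state]  # follow the current policy after the start

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check on the (state, action) pair.
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
                counts[s][a] += 1
                # Running average of all returns observed so far for (s, a).
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
                # Policy improvement at the visited state.
                pi[s] = max(ACTIONS, key=lambda act: Q[s][act])
    return Q, pi

Q, pi = mc_es(2000)
print(pi)
```

Note that `Q["s0"]` averages returns collected under several successive policies, which is exactly the property discussed above.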

### Monte Carlo Control without Exploring Starts

How can we avoid the unlikely assumption of exploring starts? The only general way to
ensure that all actions are selected infinitely often is for the agent to continue to select
them. There are two approaches to ensuring this, resulting in what we call **on-policy**
methods and **off-policy** methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a
policy different from that used to generate the data. The MCES method is an example of an on-policy method.

#### $\epsilon$-soft policies

In on-policy control methods, the policy is generally soft, meaning that $\pi(a|s) > 0, \forall s\in S, a \in A(s)$, but gradually shifted closer and closer to a deterministic optimal policy. In this section, the on-policy method we present uses $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but
with probability $\epsilon$ they instead select an action at random (explore). That is, all nongreedy actions are given the minimal probability of selection, $\frac{\epsilon}{\|A(s)\|}$, and the remaining bulk of the probability, $1 - \epsilon + \frac{\epsilon}{\|A(s)\|}$, is given to the greedy action (the extra $\frac{\epsilon}{\|A(s)\|}$ arises because the random selection can also pick the greedy action).
The $\epsilon$-greedy policies are examples of $\epsilon$-soft policies, defined as policies for which $\pi(a|s) \geq \frac{\epsilon}{\|A(s)\|}$ for all states and actions, for some $\epsilon > 0$. Among $\epsilon$-soft policies, $\epsilon$-greedy policies are in some sense those that are closest to greedy.
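The probabilities of an $\epsilon$-greedy policy can be computed directly from a list of action values. The helper name `epsilon_greedy_probs` and the sample values are hypothetical; the sketch only illustrates the definition above.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action probabilities for one state."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    # Every action receives the exploration floor epsilon / |A(s)| ...
    probs = [epsilon / n] * n
    # ... and the greedy action additionally receives the remaining 1 - epsilon.
    probs[greedy] += 1.0 - epsilon
    return probs

probs = epsilon_greedy_probs([0.1, 0.5, 0.2, 0.0], epsilon=0.2)
print(probs)  # roughly 0.05 for nongreedy actions and 0.85 for the greedy one
```

Because every entry is at least `epsilon / n`, the resulting policy is $\epsilon$-soft by construction.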

The overall idea of on-policy Monte Carlo control is still that of GPI. As in Monte
Carlo ES, we use first-visit MC methods to estimate the action-value function for the
current policy. Without the assumption of exploring starts, however, we cannot simply
improve the policy by making it greedy with respect to the current value function, because
that would prevent further exploration of nongreedy actions. Fortunately, GPI does not
require that the policy be taken all the way to a greedy policy, only that it be moved
toward a greedy policy. In our on-policy method we will move it only to an $\epsilon$-greedy policy. For any $\epsilon$-soft policy, $\pi$, any $\epsilon$-greedy policy wrt $q_{\pi}$ is guaranteed to be better than or equal to $\pi$.

<img src='epsilon-soft-policies.png'>

That any $\epsilon$-greedy policy wrt $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$ is assured by the policy improvement theorem. Let $\pi^\prime$ be the $\epsilon$-greedy policy. The conditions of the policy improvement theorem apply because for any $s \in S$:

$q_{\pi} (s, \pi^\prime(s)) = \sum_{a} \pi^{\prime}(a|s) q_{\pi}(s, a)$ (This is the expected value of $q_{\pi} (s, \pi^\prime (s))$ where $\pi^\prime (s)$ is a probability distribution)

$= \frac{\epsilon}{\|A(s)\|} \sum_{a} q_{\pi} (s, a) + (1 - \epsilon) max_{a} q_{\pi} (s, a)$

Since $\pi$ is $\epsilon$-soft, $\pi (a | s) \geq \frac{\epsilon}{\|A(s)\|}$ for every action, so the weights

\begin{equation}
w(a) = \frac{\pi(a | s) - \frac{\epsilon}{\|A(s)\|}}{1 - \epsilon} \geq 0, \qquad \sum_{a} w(a) = \frac{1 - \epsilon}{1 - \epsilon} = 1
\end{equation}

are nonnegative and sum to one. A maximum is at least as large as any weighted average with such weights, so

$\implies \frac{\epsilon}{\|A(s)\|} \sum_{a} q_{\pi} (s, a) + (1 - \epsilon) max_{a} q_{\pi} (s, a) \geq \frac{\epsilon}{\|A(s)\|} \sum_{a} q_{\pi} (s, a) + (1 - \epsilon) \sum_{a} \frac{\pi(a | s) - \frac{\epsilon}{\|A(s)\|}}{1 - \epsilon} q_{\pi} (s, a) $

$ = \frac{\epsilon}{\|A(s)\|} \sum_{a} q_{\pi} (s, a) + \sum_{a} \pi(a | s) q_{\pi} (s, a) - \sum_{a} \frac{\epsilon}{\|A(s)\|} q_{\pi} (s, a) $

$ = v_{\pi} (s)$

Thus, by policy improvement theorem, $\pi^\prime \geq \pi$ (ie. $v_{\pi^\prime} (s) \geq v_{\pi} (s), \forall s \in S$).
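The inequality can also be checked numerically. The sketch below draws random action values and random $\epsilon$-soft policies for a single state and verifies that the $\epsilon$-greedy expectation is never smaller than $\sum_a \pi(a|s) q_{\pi}(s, a)$. The function name, the sampling scheme, and the tolerance are assumptions for the illustration; a numerical check of this kind supports the derivation but does not replace it.

```python
import random

def check_improvement(num_actions=4, epsilon=0.1, trials=1000, seed=0):
    rng = random.Random(seed)
    floor = epsilon / num_actions
    for _ in range(trials):
        q = [rng.uniform(-1.0, 1.0) for _ in range(num_actions)]
        # Random epsilon-soft policy: the floor on every action plus a
        # random split of the remaining 1 - epsilon probability mass.
        raw = [rng.random() for _ in range(num_actions)]
        total = sum(raw)
        pi = [floor + (1.0 - epsilon) * x / total for x in raw]
        v_pi = sum(p * qa for p, qa in zip(pi, q))
        # Expected action value under the epsilon-greedy policy.
        lhs = floor * sum(q) + (1.0 - epsilon) * max(q)
        if lhs < v_pi - 1e-12:
            return False
    return True

print(check_improvement())
```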

We now can prove that equality can hold only when both $\pi^\prime$ and $\pi$ are optimal among the $\epsilon$-soft policies, that is, when they are better than or equal to all other
$\epsilon$-soft policies:

<img src='prove_1.png'>

In essence, we have shown in the last few pages that policy iteration works for $\epsilon$-soft
policies. Using the natural notion of greedy policy for $\epsilon$-soft policies, one is assured of
improvement on every step, except when the best policy has been found among the $\epsilon$-soft
policies. This analysis is independent of how the action-value functions are determined at each stage, but it does assume that they are computed exactly. This brings us to
roughly the same point as in the previous section. Now we only achieve the best policy
among the $\epsilon$-soft policies, but on the other hand, we have eliminated the assumption of
exploring starts.
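A minimal sketch of on-policy MC control with an $\epsilon$-greedy policy follows. The toy two-step MDP, the function name `on_policy_mc_control`, and the fixed start state are assumptions chosen to keep the example small; what matters is that exploration now comes from the policy itself rather than from exploring starts.

```python
import random

# Hypothetical two-step episodic MDP: s0 -> s1 -> terminal, whatever the actions.
REWARD = {"s0": {0: 0.0, 1: 1.0}, "s1": {0: 2.0, 1: 0.0}}
NEXT_STATE = {"s0": "s1", "s1": None}  # None marks termination
ACTIONS = [0, 1]

def on_policy_mc_control(num_episodes, epsilon=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in REWARD}
    counts = {s: {a: 0 for a in ACTIONS} for s in REWARD}

    def act(state):
        # epsilon-greedy wrt the current Q: explore with probability epsilon.
        if rng.random() < epsilon:
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[state][a])

    for _ in range(num_episodes):
        state, episode = "s0", []  # fixed start state: no exploring starts
        while state is not None:
            action = act(state)
            episode.append((state, action, REWARD[state][action]))
            state = NEXT_STATE[state]

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # States never repeat within an episode of this toy task,
            # so every visit is a first visit.
            counts[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / counts[s][a]

    greedy = {s: max(ACTIONS, key=lambda a: Q[s][a]) for s in REWARD}
    return Q, greedy

Q, greedy = on_policy_mc_control(5000)
print(greedy)
```

The learned `Q["s0"]` values reflect the $\epsilon$-greedy behavior at `s1`, so they are slightly lower than the values of the fully greedy policy, in line with the remark that only the best $\epsilon$-soft policy is reached.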

#### Off-policy Prediction via Importance Sampling

All learning control methods face a dilemma: They seek to learn action values conditional
on subsequent optimal behavior, but they need to behave non-optimally in order to
explore all actions (to find the optimal actions). How can they learn about the optimal
policy while behaving according to an exploratory policy? The on-policy approach in the
preceding section is actually a compromise—it learns action values not for the optimal
policy, but for a near-optimal policy that still explores. A more straightforward approach
is to use two policies, one that is learned about and that becomes the optimal policy, and
one that is more exploratory and is used to generate behavior. The policy being learned
about is called the **target policy**, and the policy used to generate behavior is called the
**behavior policy**. In this case we say that learning is from data “off” the target policy, and
the overall process is termed **off-policy learning**.

On-policy methods are generally simpler and are considered first. Off-policy methods
require additional concepts and notation, and because the data is due to a different policy,
off-policy methods are often of greater variance and are slower to converge. On the other
hand, off-policy methods are more powerful and general. They include on-policy methods
as the special case in which the target and behavior policies are the same.

We begin the study of off-policy methods by considering the prediction problem, in which both target and behavior policies are fixed. That is,
suppose we wish to estimate $v_{\pi}$ or $q_{\pi}$, but all we have are episodes following another policy b, where $b \neq \pi$. In this case, $\pi$ is the target policy, b is the behavior policy, and both policies are considered fixed and given.

In order to use episodes from b to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under b. That is, we require that $\pi(a|s) > 0 \implies b(a|s) > 0$. This is called **the assumption of coverage**.
It follows from coverage that b must be stochastic in states where it is not identical to $\pi$. The target policy $\pi$, on the other hand, may be deterministic, and, in fact, this is a case of particular interest in control applications. In control, the target policy is typically the
deterministic greedy policy wrt the current estimate of the action-value function. This policy becomes a deterministic optimal policy while the behavior policy remains stochastic and more exploratory, for example, an $\epsilon$-greedy policy. However, in this section, we consider the prediction problem (i.e., estimating value functions for a given policy), in which $\pi$ is unchanging and given.

##### Importance sampling

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the **importance-sampling ratio**.
Given a starting state $S_t$, the probability of the subsequent state-action trajectory, $A_t, S_{t+1}, A_{t+1}, ..., S_{T}$, occurring under any policy $\pi$ is

$P(A_t, S_{t+1}, A_{t+1}, ..., S_{T} | S_t), \forall A_{i} \sim \pi$

$= p(A_t | S_t) p(S_{t+1} | S_t, A_t) .... p(S_T | S_{T-1}, A_{T-1})$ (i.e., chain rule + memoryless (Markov) property)

$= \pi(A_t | S_t)p(S_{t+1} | S_t, A_t)....p(S_T | S_{T-1}, A_{T-1})$

$= \prod^{T-1}_{k=t} \pi(A_k|S_k) p(S_{k+1} | S_k, A_k)$

Thus, the relative probability of the trajectory under the target and behavior policies (The importance-sampling ratio) is:

$\rho_{t:T-1} = \frac{\prod^{T-1}_{k=t} \pi(A_k|S_k) p(S_{k+1} | S_k, A_k)}{\prod^{T-1}_{k=t} b(A_k|S_k) p(S_{k+1} | S_k, A_k)} = \frac{\prod^{T-1}_{k=t} \pi(A_k|S_k)} {\prod^{T-1}_{k=t} b(A_k|S_k)}$

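That is, $\rho_{t:T-1}$ is the ratio between the likelihood of the action sequence under $\pi$ and the likelihood of the same action sequence under b. The transition probabilities p cancel, so the ratio depends only on the two policies and the observed sequence, not on the MDP's dynamics.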
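Because the dynamics cancel, the ratio can be computed from the two policies and the observed state-action pairs alone. In the sketch below the function name `importance_ratio`, the policy format (`policy[state][action]` giving a probability), and the example trajectories are assumptions made for illustration.

```python
def importance_ratio(trajectory, pi, b):
    """Product over the trajectory of pi(a | s) / b(a | s)."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi[s][a] / b[s][a]
    return rho

# Hypothetical single-state task: deterministic target, uniform behavior policy.
pi = {"x": {0: 1.0, 1: 0.0}}
b = {"x": {0: 0.5, 1: 0.5}}

rho_match = importance_ratio([("x", 0), ("x", 0)], pi, b)
rho_mismatch = importance_ratio([("x", 0), ("x", 1)], pi, b)
print(rho_match, rho_mismatch)  # 4.0 0.0
```

A single action that the target policy would never take drives the whole ratio to zero, which is why long trajectories under a deterministic target policy are often discarded entirely.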

Recall that we wish to estimate $v_{\pi}$ or $q_{\pi}$, but all we have are returns $G_t$ generated by the behavior policy, whose expectation is $E[G_t | S_t = s] = v_{b} (s)$. These returns have the wrong expectation and so cannot be averaged to obtain $v_{\pi}$ (i.e., $\pi$ and b induce two different distributions over trajectories). This is where importance sampling comes in.
The ratio $\rho_{t:T-1}$ transforms the returns to have the right expected value:

$v_{\pi} (s) = E[\rho_{t:T-1} G_t | S_t=s]$, where the expectation is over trajectories generated by following b. If $\rho_{t:T-1} < 1$, the observed sequence is more likely under b than under $\pi$, so its return receives less weight.
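A one-step example makes the correction visible, since the expectations can be written out exactly. The rewards and the two policies below are made-up numbers for a single state with two actions.

```python
# Hypothetical one-step task: one state, two actions, deterministic rewards.
reward = {0: 1.0, 1: 5.0}
pi = {0: 0.9, 1: 0.1}  # target policy
b = {0: 0.5, 1: 0.5}   # behavior policy

v_pi = sum(pi[a] * reward[a] for a in reward)  # value under the target policy
v_b = sum(b[a] * reward[a] for a in reward)    # what plain averaging would estimate
# Expectation under b of the ratio-weighted return.
v_is = sum(b[a] * (pi[a] / b[a]) * reward[a] for a in reward)
print(v_pi, v_b, v_is)
```

Here the unweighted expectation under b is 3.0, while weighting each return by its ratio recovers the target value of about 1.4.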

Now we are ready to give a Monte Carlo algorithm that averages returns from a batch of observed episodes following policy b to estimate $v_{\pi}$. It is convenient here to number time steps in a way that increases across episode boundaries. That is, if the first episode of the batch ends in a terminal state at time 100, then the next episode begins at time t=101. This enables us to use time-step
numbers to refer to particular steps in particular episodes. In particular, we can define the set of all time steps in which state s is visited, denoted J(s). This is for an every-visit method; for a first-visit method, J(s) would only include time steps that were first visits to s within their episodes. Also, let T(t) denote the first time of termination following time t (e.g., if the episode containing t=50 terminates at time 60, then T(50) = 60), and let $G_t$ denote the return after t up through T(t). Then $\{G_{t}\}_{t \in J(s)}$ are the returns that pertain to state s, and
$\{\rho_{t:T(t)-1}\}_{t \in J(s)}$ are the corresponding importance-sampling ratios. To estimate $v_{\pi} (s)$, we simply scale the returns by the ratios and average the results:

$V(s) = \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\|J(s)\|}$ (an estimate of the expected value $v_{\pi}(s)$, justified by the LLN)

When importance sampling is done as a simple average in this way it is called **ordinary importance sampling**. An important alternative is **weighted importance sampling**, which uses a weighted average defined as:

$V(s) = \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}}$

or zero if the denominator is zero. To understand these two varieties of importance sampling, consider the estimates of their first-visit methods after observing a single return from state s. In the
weighted-average estimate, the ratio $\rho_{t:T(t)-1}$ cancels in the numerator and denominator for *a single return*, so that the estimate is equal to the observed return independent of the ratio. Given
that this return was the only one observed, that is a reasonable estimate, but its expectation is $v_{b} (s)$ rather than $v_{\pi} (s)$, and in this statistical sense, it is biased. In contrast, the first-visit version of the ordinary importance-sampling estimator is always
$v_{\pi} (s)$ in expectation, but it can be extreme. Suppose the ratio were ten, indicating that the trajectory observed is ten times as likely under the target policy as under the behavior policy. In this case, the ordinary importance-sampling estimate would be ten times the observed return. That is, it would be quite far from the observed return even though the episode's trajectory is
considered very representative of the target policy.
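The two estimators differ only in their denominators, which a short sketch makes explicit. The function names and the sample ratios and returns are hypothetical; the first call reproduces the single-return case described above, with a ratio of ten.

```python
def ordinary_is(ratios, returns):
    """Ordinary importance sampling: divide by the number of returns."""
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)

def weighted_is(ratios, returns):
    """Weighted importance sampling: divide by the sum of the ratios."""
    denom = sum(ratios)
    if denom == 0:
        return 0.0  # convention: zero when the denominator is zero
    return sum(rho * g for rho, g in zip(ratios, returns)) / denom

# A single observed return of 1.0 with ratio 10.
print(ordinary_is([10.0], [1.0]), weighted_is([10.0], [1.0]))  # 10.0 1.0

# Two returns, one of which has ratio zero.
print(ordinary_is([4.0, 0.0], [1.0, 3.0]), weighted_is([4.0, 0.0], [1.0, 3.0]))  # 2.0 1.0
```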

##### Bias Variance tradeoff between OIS and WIS

Formally, the difference between the first-visit methods of the two kinds of importance
sampling is expressed in their biases and variances. Ordinary importance sampling is
unbiased whereas weighted importance sampling is biased (though the bias converges
asymptotically to zero as the number of samples grows). On the other hand, the variance of ordinary importance sampling is in general unbounded because the variance of the ratios
can be unbounded, whereas in the weighted estimator the largest weight on any single return is one. In fact, assuming bounded returns, the variance of the weighted importance-sampling estimator converges to zero even if the variance of the ratios themselves is infinite.
In practice, the weighted estimator usually has dramatically lower variance and is strongly preferred. Nevertheless, we will not totally abandon ordinary importance sampling as it is easier to extend to the approximate methods using function approximation.

The every-visit methods for ordinary and weighted importance sampling are both biased,
though, again, the bias falls asymptotically to zero as the number of samples increases.
In practice, every-visit methods are often preferred because they remove the need to keep
track of which states have been visited and because they are much easier to extend to
approximations.

The off-policy every-visit MC algorithm using weighted importance sampling is given below.

##### Implementation

Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode
basis. For off-policy Monte Carlo methods, we need to separately
consider those that use ordinary importance sampling and those that use weighted
importance sampling. In ordinary importance sampling, the returns are scaled by the importance-sampling ratio $\rho_{t:T(t)-1}$, then simply averaged. For these methods we can use the incremental sample-average methods of Chapter 2, but with the scaled returns in place of the rewards. This leaves the case of off-policy methods using weighted importance sampling. Here
we have to form a weighted average of the returns, and a slightly different incremental algorithm is required.

Suppose we have a sequence of returns $G_1, ...., G_{n-1}$, all starting in the same state and each with a corresponding random weight $W_i$ (i.e $W_i = \rho_{t_i:T(t_i)-1}$). We wish to form the estimate

$V_{n} = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, n \geq 2$

and keep it up-to-date as we obtain a single additional return $G_n$. In addition to keeping track of $V_n$, we must maintain for each state the cumulative sum $C_n$ of the weights given to the first n returns ($C_n = \sum_{k=1}^{n} W_k$, so that $C_n - W_n = \sum_{k=1}^{n-1} W_k$). The update rule for $V_n$ is:

$V_{n+1} = \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k} = \frac{\sum_{k=1}^{n-1} W_k G_k}{C_n} + \frac{W_n G_n}{C_n}$

$= \frac{(C_n - W_n)\sum_{k=1}^{n-1} W_k G_k}{C_n (C_n - W_n)} + \frac{W_n G_n}{C_n}$

$= \frac{C_n\sum_{k=1}^{n-1} W_k G_k}{C_n (C_n - W_n)} - \frac{W_n\sum_{k=1}^{n-1} W_k G_k}{C_n (C_n - W_n)} + \frac{W_n G_n}{C_n}$

$= \frac{\sum_{k=1}^{n-1} W_k G_k}{C_n - W_n} - \frac{W_n\sum_{k=1}^{n-1} W_k G_k}{C_n (C_n - W_n)} + \frac{W_n G_n}{C_n}$

$= \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k} - \frac{W_n\sum_{k=1}^{n-1} W_k G_k}{C_n (\sum_{k=1}^{n-1} W_k)} + \frac{W_n G_n}{C_n}$

$= V_n - \frac{W_n}{C_n}\frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k} + \frac{W_n G_n}{C_n}$

$= V_n - \frac{W_n}{C_n}V_n + \frac{W_n G_n}{C_n}$

$\implies V_{n+1} = V_n + \frac{W_n}{C_n} [G_n - V_n], n \geq 1$

$C_{n+1} = C_n + W_{n+1}, C_0 = 0$
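The incremental rule can be checked against the batch definition of the weighted average. The function names and sample numbers below are hypothetical. The sketch skips zero weights, which leave both the numerator and the denominator unchanged and would otherwise cause a division by zero while the cumulative weight is still zero.

```python
def incremental_weighted_average(weights, returns):
    """Apply V <- V + (W / C) * (G - V) with C accumulating the weights."""
    V, C = 0.0, 0.0
    for W, G in zip(weights, returns):
        if W == 0.0:
            continue  # a zero-weight return changes neither V nor C
        C += W
        V += (W / C) * (G - V)
    return V

def batch_weighted_average(weights, returns):
    """Direct evaluation of sum(W_k * G_k) / sum(W_k)."""
    return sum(W * G for W, G in zip(weights, returns)) / sum(weights)

weights = [2.0, 0.5, 1.0]
returns = [1.0, 4.0, -2.0]
print(incremental_weighted_average(weights, returns))
print(batch_weighted_average(weights, returns))
```

Both calls should print the same value up to floating-point rounding, without the incremental version ever storing the list of past returns.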

The algorithm is nominally for the off-policy case, using weighted importance sampling, but applies as well to the on-policy case just by choosing the target and behavior policies as the same ($\pi = b$).

<img src='off-policy-WIS.png'>

#### Off-policy MC Control

We are now ready to present an example of the second class of learning control methods
we consider in this book: off-policy methods. Recall that the distinguishing feature of
on-policy methods is that they estimate the value of a policy while using it for control.
In off-policy methods these two functions are separated. An advantage of this separation is
that the target policy may be deterministic (e.g., greedy), while the behavior policy can
continue to sample all possible actions.

Off-policy Monte Carlo control methods use one of the techniques presented in the
preceding two sections. They follow the behavior policy while learning about and
improving the target policy. These techniques require that the behavior policy has a
nonzero probability of selecting all actions that might be selected by the target policy
(coverage). To explore all possibilities, we require that the behavior policy be soft (i.e.,
that it select all actions in all states with nonzero probability).

<img src='off-policy-mc-control.png'>

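The weight update is $W \leftarrow W\frac{1}{b(A_t | S_t)}$ rather than $W \leftarrow W\frac{\pi(A_t | S_t)}{b(A_t | S_t)}$ because the target policy is deterministic and greedy: the inner loop continues only when $A_t = \pi(S_t)$, in which case $\pi(A_t | S_t) = 1$.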

A potential problem is that this method learns only from the tails of episodes, when
all of the remaining actions in the episode are greedy. If nongreedy actions are common,
then learning will be slow, particularly for states appearing in the early portions of
long episodes. Potentially, this could greatly slow learning. There has been insufficient
experience with off-policy Monte Carlo methods to assess how serious this problem is. If
it is serious, the most important way to address it is probably by incorporating
temporal-difference learning.

In [51]:
import gym
import numpy as np

class Agent:

    def __init__(self, actions, num_eps, env, init_value=0.1, gamma=0.5):

        self.Q = {}
        self.C = {}
        self.pi = {}
        self.b = {}
        self.num_eps = num_eps
        self.actions = actions
        self.init_value = init_value
        self.env = env
        self.gamma = gamma

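    def train(self):

        curr_ep = 0

        while curr_ep < self.num_eps:
            trajectory = self._generate_episode()
            G = 0  # discounted return, accumulated backwards from the episode end
            W = 1  # importance-sampling weight of the episode tail

            for t in range(len(trajectory['s']) - 1, -1, -1):
                s_t = trajectory['s'][t]
                r_t = trajectory['r'][t]
                a_t = trajectory['a'][t]

                # get index of action in actions list
                a_t = self.actions.index(a_t)
                G = self.gamma * G + r_t
                # weighted importance-sampling update of Q(s_t, a_t)
                self.C[s_t][a_t] = self.C[s_t][a_t] + W
                self.Q[s_t][a_t] = self.Q[s_t][a_t] + W * (G - self.Q[s_t][a_t]) / self.C[s_t][a_t]
                # the target policy is greedy wrt Q (stored as an action index)
                self.pi[s_t] = int(np.argmax(self.Q[s_t]))

                # stop once the behavior action differs from the greedy action in
                # this state: the ratio is zero for all earlier steps
                if a_t != self.pi[s_t]:
                    break

                W = W / self.b[s_t][a_t]

            curr_ep += 1

            print(f'episode: {curr_ep} finished')
            print(f'total reward is {sum(trajectory["r"])}')

        return self.pi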

    def _init_states(self, s):

        # uniform random behavior policy b(a | s)
        p = [1 / len(self.actions)] * len(self.actions)

        if s not in self.b.keys():
            self.b[s] = p

        if s not in self.Q.keys():
            self.Q[s] = [self.init_value] * len(self.actions)
            # cumulative weights start at zero, matching C_0 = 0
            self.C[s] = [0.0] * len(self.actions)
            self.pi[s] = np.random.choice(range(len(self.actions)), p=p)

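    def _generate_episode(self):

        # NOTE: assumes the classic gym API, where reset() returns only the
        # observation and step() returns (obs, reward, done, info).
        # CartPole observations are continuous, so they are rounded to one
        # decimal (a coarse discretization chosen for this sketch) so that the
        # tabular Q can revisit states.
        curr_state = tuple(np.round(self.env.reset(), 1))
        stop = False
        trajectory = {'s': [curr_state],
                      'a': [],
                      'r': []}

        while not stop:

            self._init_states(curr_state)

            action = np.random.choice(self.actions, p=self.b[curr_state])
            # step the agent's own environment rather than the global `env`
            curr_state, reward, stop, _ = self.env.step(action)
            curr_state = tuple(np.round(curr_state, 1))

            if not stop:
                trajectory['s'].append(curr_state)

            trajectory['r'].append(reward)
            trajectory['a'].append(action)

        return trajectory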

In [52]:
env = gym.make("CartPole-v0")
agent = Agent([0, 1], 20000, env, 0.1, 0.5)
best_policy = agent.train()