# Monte Carlo Learning
The previous chapters did not cover real RL methods, mainly because those algorithms had full knowledge of the dynamics of the system. Model-free algorithms instead learn from interactions with the environment.

Model-free learning can be further classified as:
- Monte Carlo (MC) learning, and
- Temporal Difference (TD) learning.

At the end of each episode, MC learning estimates the values of the policy using the samples collected during that episode.

Consequently, MC learning can only be used in episodic tasks, since sequential tasks never end.

## On-Policy MC Learning
### Policy Evaluation
Key ideas:
- State and action values are expectations of returns conditioned on states and state-action pairs respectively.
- Can use the MC method to estimate these expectations.

For example, suppose $c$ trajectories have visited a given state (or state-action pair), with returns $g_1, g_2, \dots, g_c$.

The MC method then estimates the state (or action) value as
$$\frac{1}{c}\sum_{i = 1}^cg_i$$

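The average can also be computed incrementally:
1. Let the return samples be $g_1, g_2, \dots, g_{c-1}, g_c$.
2. Let $\overline{g}_{c-1} = \frac{1}{c - 1}\sum_{i = 1}^{c-1} g_i$ be the mean of the first $c-1$ samples.
3. Can prove that $\overline{g}_c = \overline{g}_{c-1} + \frac{1}{c}(g_c - \overline{g}_{c-1})$, since $c\,\overline{g}_c = (c-1)\overline{g}_{c-1} + g_c$.

This incremental form saves space: only the current mean and the count $c$ need to be stored, not every return sample.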
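A minimal sketch of the incremental average, assuming the returns arrive one at a time as plain floats. The function name `incremental_mean` and the sample values are illustrative only.

```python
def incremental_mean(returns):
    """Return the running means of `returns`, storing only a mean and a count."""
    mean, count = 0.0, 0
    running = []
    for g in returns:
        count += 1
        # g_bar_c = g_bar_{c-1} + (g_c - g_bar_{c-1}) / c
        mean += (g - mean) / count
        running.append(mean)
    return running


samples = [2.0, 4.0, 9.0, 1.0]  # hypothetical return samples
means = incremental_mean(samples)
```

Each entry of `means` equals the ordinary average of the samples seen so far, so the last entry matches `sum(samples) / len(samples)`.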

## Robbins-Monro Algorithm
The Robbins-Monro algorithm attempts to find a root of the equation $f(x) = 0$ under the limitation that we can only obtain measurements of a random function $F(x)$, where $f(x) = \mathbb{E}[F(x)]$.

The problem is solved by iterating
$$X_k = X_{k-1} - \alpha_kF(X_{k-1})$$
where $\{\alpha_k\}_{k\geq 1}$ is a learning rate sequence satisfying the following conditions:
1. (non-negative) $\alpha_k \geq 0$ for all $k$.
2. (converges regardless of start point) $\sum_{k = 1}^{\infty}\alpha_k = \infty$.
3. (converges regardless of noise) $\sum_{k = 1}^\infty \alpha_k^2 < \infty$.

With such a learning rate sequence, the iteration converges to a root of $f$ under some regularity conditions.

### Implementation of Algorithm
Consider estimating action values. Let $F(q) = q - G$, where $q$ is the value to estimate and $G$ is a return sample, so that the root of $\mathbb{E}[F(q)] = 0$ is the expected return.

Observe many samples of returns, and update $q$ using
$$q_k \leftarrow q_{k-1} + \alpha_k(g_k - q_{k-1})$$

Here $q_0$ is an arbitrary initial value and $\alpha_k = 1/k$ is the sequence of learning rates, which makes $q_k$ the running mean of the samples.

After convergence, we have $\mathbb{E}[F(q(s,a))] = q(s,a) - \mathbb{E}[G_t\mid S_t = s, A_t = a] = 0$.

Can analyse estimations of state values similarly by letting $F(v) = v - G$.
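The update $q_k \leftarrow q_{k-1} + \alpha_k(g_k - q_{k-1})$ can be sketched as follows. This is an illustration under assumptions: the returns are drawn from a hypothetical Gaussian with mean 3, and `robbins_monro_mean` is a made-up helper name.

```python
import random


def robbins_monro_mean(sample_return, num_samples, q0=0.0):
    """Estimate E[G] with q_k = q_{k-1} + alpha_k * (g_k - q_{k-1})."""
    q = q0
    for k in range(1, num_samples + 1):
        alpha = 1.0 / k      # alpha_k = 1/k, the learning rate sequence used in the text
        g = sample_return()  # one noisy observation of the return
        q += alpha * (g - q)
    return q


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = robbins_monro_mean(lambda: rng.gauss(3.0, 1.0), num_samples=20000, q0=10.0)
```

With $\alpha_1 = 1$ the first update overwrites `q0`, so the arbitrary initial value has no lasting effect and `estimate` ends up close to the true mean 3.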

Policy evaluation can directly estimate state values or directly estimate action values.

With Bellman expectation equations, can:
1. Use state values to back up action values with dynamics $p$, or
2. Use action values to back up state values with knowledge of policy $\pi$.

Unfortunately, $p$ is unknown in model-free learning, so we can only use action values to back up state values.

- Every-visit MC update uses all return samples to update value estimations.
- First-visit MC update only uses sample when state (or state-action pair) is first visited.

Both techniques converge to the true values as the number of visits to each state (or state-action pair) grows.

### Algorithm 4.1: Evaluation of action values using Every-Visit MC Policy Evaluation
Inputs: env (without model), policy $\pi$.

Output: action value estimates $q(s,a)$

1. (Initialise) Set $q(s,a)$ arbitrarily. If using incremental implementation, set $c(s,a) \leftarrow 0$.
2. (MC update) For each episode:
    1. (Sample trajectory) Use policy $\pi$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    2. (Initialise return) $G\leftarrow 0$.
    3. (Update) For $t \leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G \leftarrow \gamma G + R_{t+1}$.
        2. (Update action value) Update $q(S_t, A_t)$ to reduce $[G - q(S_t, A_t)]^2$. For incremental implementation, perform the following:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + 1$.
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{1}{c(S_t, A_t)}[G - q(S_t, A_t)]$.
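A minimal sketch of Algorithm 4.1 with the incremental implementation. It assumes the trajectories have already been generated by $\pi$ and recorded as lists of `(state, action, reward)` tuples, where `reward` is $R_{t+1}$. The two episodes below are hand-written, hypothetical data.

```python
from collections import defaultdict


def every_visit_mc_q(episodes, gamma=1.0):
    """Every-visit MC evaluation of action values from recorded episodes."""
    q = defaultdict(float)  # action value estimates, initialised to 0
    c = defaultdict(int)    # visit counters
    for episode in episodes:
        g = 0.0
        # Walk backwards so that G can be accumulated as G <- gamma * G + R_{t+1}.
        for state, action, reward in reversed(episode):
            g = gamma * g + reward
            c[state, action] += 1
            q[state, action] += (g - q[state, action]) / c[state, action]
    return dict(q)


episodes = [
    [("s0", "right", 0.0), ("s1", "right", 1.0)],
    [("s0", "right", 0.0), ("s1", "left", 0.0), ("s0", "right", 0.0), ("s1", "right", 1.0)],
]
q = every_visit_mc_q(episodes, gamma=0.5)
```

The pair `("s0", "right")` is visited three times with returns 0.5, 0.5 and 0.125, so its estimate is their average, 0.375.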

### Algorithm 4.2: Every-visit MC update to evaluate state values.
1. (Initialise) Initialise $v(s)$ arbitrarily. If using incremental implementation, initialise $c(s) \leftarrow 0$.
2. (MC update) For each episode:
    1. (Sample trajectory) Use $\pi$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    2. (Initialise return) $G \leftarrow 0$.
    3. (Update) For $t \leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G \leftarrow \gamma G + R_{t+1}$.
        2. (Update state value) Update $v(S_t)$ to reduce $[G - v(S_t)]^2$. For incremental implementation, perform the following:
            1. $c(S_t) \leftarrow c(S_t) + 1$.
            2. $v(S_t) \leftarrow v(S_t) + \frac{1}{c(S_t)}[G - v(S_t)]$.

### Algorithm 4.3: First-visit MC update to estimate action values.
1. (Initialise) Initialise $q(s,a)$ arbitrarily. If using incremental implementation, set $c(s,a) \leftarrow 0$.
2. (MC update) For each episode:
    1. (Sample trajectory) Use $\pi$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    2. (Initialise return) $G \leftarrow 0$.
    3. (Calculate steps that state-action pairs are first visited within episode) Initialise $f(s,a) \leftarrow -1$ for all $s, a$. Then for $t \leftarrow 0, 1, \dots, T-1$:
        1. If $f(S_t, A_t) < 0$, then $f(S_t, A_t) \leftarrow t$.
    4. (Update) For $t \leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G \leftarrow \gamma G + R_{t+1}$.
        2. (Update when first visited) If $f(S_t, A_t) = t$, update $q(S_t, A_t)$ to reduce $[G - q(S_t, A_t)]^2$. For incremental implementation, perform the following:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + 1$
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{1}{c(S_t, A_t)}[G - q(S_t, A_t)]$

### Algorithm 4.4: First-visit MC update to estimate state values.
1. (Initialise) Initialise $v(s)$ arbitrarily. If using incremental implementation, set $c(s) \leftarrow 0$.
2. (MC update) For each episode:
    1. (Sample trajectory) Use $\pi$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    2. (Initialise return) $G \leftarrow 0$.
    3. (Calculate steps that states are first visited within episode) Initialise $f(s) \leftarrow -1$ for all $s$. Then for $t \leftarrow 0, 1, \dots, T-1$:
        1. If $f(S_t) < 0$, then $f(S_t) \leftarrow t$.
    4. (Update) For $t \leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G \leftarrow \gamma G + R_{t+1}$.
        2. (Update when first visited) If $f(S_t) = t$, update $v(S_t)$ to reduce $[G - v(S_t)]^2$. For incremental implementation, perform the following:
            1. $c(S_t) \leftarrow c(S_t) + 1$.
            2. $v(S_t) \leftarrow v(S_t) + \frac{1}{c(S_t)}[G - v(S_t)]$.
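A minimal sketch of first-visit evaluation of state values, assuming recorded episodes given as lists of `(state, reward)` pairs with `reward` being $R_{t+1}$. A dictionary plays the role of $f$, so a missing key stands for $f(s) = -1$. The episode is hypothetical.

```python
from collections import defaultdict


def first_visit_mc_v(episodes, gamma=1.0):
    """First-visit MC evaluation of state values from recorded episodes."""
    v = defaultdict(float)
    c = defaultdict(int)
    for episode in episodes:
        # f[s] is the first time step at which s is visited in this episode.
        f = {}
        for t, (state, _) in enumerate(episode):
            f.setdefault(state, t)
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if f[state] == t:  # update only at the first visit
                c[state] += 1
                v[state] += (g - v[state]) / c[state]
    return dict(v)


episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
v_first = first_visit_mc_v([episode], gamma=1.0)
```

State `A` is visited at $t = 0$ and $t = 2$, but only the return from $t = 0$ (which is 3) is used. An every-visit update would instead average the returns 3 and 2.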

## MC Learning with Exploring Start
This section introduces MC update algorithms that find an optimal policy.

Up to now, we know how to evaluate action values using MC updates. With these estimates, we can improve the policy and obtain a new one.

Repeating estimation and improvement may lead to an optimal policy.

Unfortunately, not every start state leads to optimality: a bad initial policy can keep the agent stuck in a few bad states, so that only the values of those states get updated and the rest are never explored.

Exploring start changes the initial state distribution so that an episode can start with any state-action pair.

### Algorithm 4.5: MC update with exploring start (maintaining policy explicitly)
1. (Initialise) Initialise $q(s,a)$ arbitrarily. If using incremental implementation, set $c(s,a) \leftarrow 0$. Initialise the deterministic policy $\pi(s)$ arbitrarily.
2. (MC update) For each episode:
    1. (Initialise episode start) Choose a pair $(S_0, A_0)$ as the start. Any pair can be chosen.
    2. (Sample trajectory) Starting from $(S_0, A_0)$, use $\pi$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    3. If using first-visit version, perform the following:
        1. $f(s,a) \leftarrow -1$ for all $s,a$.
        2. For each $t\leftarrow 0, 1, \dots, T-1$: If $f(S_t, A_t) < 0$ then $f(S_t, A_t) \leftarrow t$.
    4. (Initialise return) $G \leftarrow 0$.
    5. (Update) For $t \leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G \leftarrow \gamma G + R_{t+1}$.
        2. (Upd. act-val estim.) Update $q(S_t, A_t)$ to reduce $[G - q(S_t, A_t)]^2$. If using incremental implementation, perform the following:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + 1$.
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{1}{c(S_t, A_t)}[G - q(S_t, A_t)]$.
            3. If using first-visit version, update counter and action estimates only when $f(S_t, A_t) = t$.
        3. (Improve policy) $\pi(S_t) \leftarrow \arg\max_a q(S_t, a)$.

### Algorithm 4.6: MC update with exploring start (maintaining policy implicitly)
1. (Initialise) Initialise $q(s,a)$ arbitrarily. If using incremental implementation, initialise $c(s,a) \leftarrow 0$.
2. (MC update) For each episode:
    1. (Initialise episode start) Choose $(S_0, A_0)$ pair randomly.
    2. (Sample trajectory) Starting from $(S_0, A_0)$, use policy derived from action values $q$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$ (choose action that maximises action value).
    3. If using first visit version:
        1. $f(s,a) \leftarrow -1$ for all $s, a$.
        2. For each $t \leftarrow 0, 1, \dots, T-1$: if $f(S_t, A_t) < 0$, then $f(S_t, A_t) \leftarrow t$.
    4. (Initialise return) $G \leftarrow 0$.
    5. (Update) For $t\leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G\leftarrow \gamma G + R_{t+1}$.
        2. (Upd. act-val estim.) Update $q(S_t, A_t)$ to reduce $[G - q(S_t, A_t)]^2$.
        3. If using incremental implementation, perform the following:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + 1$.
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{1}{c(S_t,A_t)}[G - q(S_t, A_t)]$
            3. For first-visit version, perform the above only if $f(S_t, A_t) = t$.
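A minimal sketch of the every-visit variant of Algorithm 4.6. Everything about the environment is an assumption made for illustration: a toy deterministic MDP with two non-terminal states and two actions, where `DYNAMICS` is only used to simulate the environment and is never read by the learner.

```python
import random

# Hypothetical toy MDP: (state, action) -> (next_state, reward); state 2 is terminal.
DYNAMICS = {
    (0, 0): (1, 0.0), (0, 1): (2, 1.0),
    (1, 0): (2, 0.0), (1, 1): (2, 5.0),
}
STATES, ACTIONS, TERMINAL = (0, 1), (0, 1), 2


def greedy(q, state):
    """Action with the largest estimated value (ties go to the first action)."""
    return max(ACTIONS, key=lambda a: q[state, a])


def mc_exploring_starts(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    c = {(s, a): 0 for s in STATES for a in ACTIONS}
    for _ in range(num_episodes):
        # Exploring start: any state-action pair may begin the episode.
        state, action = rng.choice(STATES), rng.choice(ACTIONS)
        trajectory = []
        while True:
            next_state, reward = DYNAMICS[state, action]
            trajectory.append((state, action, reward))
            if next_state == TERMINAL:
                break
            # After the start, follow the policy implied by q (greedy).
            state, action = next_state, greedy(q, next_state)
        g = 0.0
        for s, a, r in reversed(trajectory):  # every-visit update
            g = gamma * g + r
            c[s, a] += 1
            q[s, a] += (g - q[s, a]) / c[s, a]
    return q


q = mc_exploring_starts(num_episodes=500)
policy = {s: greedy(q, s) for s in STATES}
```

In this toy problem the initial greedy policy would always pick action 0 in state 1 and never see the reward 5. The exploring start is what lets the pair `(1, 1)` be sampled, after which the greedy policy switches.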

## MC Learning on Soft Policy
Can explore without exploring start.

### What is a Soft Policy?
$\pi$ is a soft policy iff $\pi(a\mid s) > 0$ holds for every $s, a$. It can thereby choose all possible actions.

Soft policies can help explore more states and state-action pairs.

### $\epsilon$-Soft Policies
$\pi$ is $\epsilon$-soft iff there exists $\epsilon > 0$ s.t. $\pi(a\mid s) \geq \epsilon / |A(s)|$ for all $s, a$.

All $\epsilon$-soft policies are soft policies.

### $\epsilon$-Greedy Policies
The $\epsilon$-soft policy that is closest to a given deterministic policy is called the $\epsilon$-greedy policy of that deterministic policy.

If the deterministic policy is as shown below:
$$
\pi(a\mid s) = \begin{cases}
1 & s \in S, a = a^*\\
0 & s \in S, a \neq a^*
\end{cases}
$$
Then the $\epsilon$-greedy policy will appear as follows:
$$
\pi(a\mid s) = \begin{cases}
1 - \epsilon + \frac{\epsilon}{|A(s)|} & s \in S, a = a^*\\
\frac{\epsilon}{|A(s)|} & s \in S, a \neq a^*
\end{cases}
$$
This policy assigns probability $\epsilon$ equally to all actions, and assigns the remaining $(1-\epsilon)$ to the greedy, exploitative action $a^*$.
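A minimal sketch of these probabilities for a single state, assuming the action values of that state are given as a list. The helper name `epsilon_greedy_probs` and the numbers are illustrative.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state.

    `q_values` lists the action value estimates q(s, a) of a fixed state s.
    """
    n = len(q_values)
    greedy_action = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n              # epsilon spread equally over all actions
    probs[greedy_action] += 1.0 - epsilon  # remaining mass goes to the greedy action
    return probs


probs = epsilon_greedy_probs([0.1, 0.7, 0.3, 0.2], epsilon=0.2)
```

With four actions and $\epsilon = 0.2$, every action receives $0.05$ and the greedy action (index 1) receives an additional $0.8$, so every probability is at least $\epsilon / |A(s)|$.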

MC update with soft policy uses $\epsilon$-soft policies during iterations. In particular, the policy improvement step updates an old $\epsilon$-soft policy to a new $\epsilon$-greedy policy, which can also be justified by the policy improvement theorem. In other words, if $\pi$ is an $\epsilon$-soft policy and $\pi'$ is the $\epsilon$-greedy policy with respect to $q_\pi$, then $\pi \preccurlyeq \pi'$, because for any $s \in S$:
$$\sum_{a}\pi'(a\mid s)q_\pi(s,a) \geq v_\pi(s)$$

### Algorithm 4.7: MC Update with Soft Policy (maintaining policy explicitly)
1. (Initialise) Initialise $q(s,a)$ arbitrarily. If we use incremental implementation, $c(s,a) \leftarrow 0$.
2. Set $\pi$ to arbitrary $\epsilon$-soft policy.
3. (MC update) For each episode:
    1. (Sample Trajectory) Use $\pi$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    2. If using first-visit version, find when each state-action pair is first visited:
        1. $f(s,a)\leftarrow -1$ for all $s,a$.
        2. For every $t\leftarrow 0, 1, \dots, T-1$: if $f(S_t, A_t) < 0$, then set $f(S_t, A_t) \leftarrow t$.
    3. (Initialise return) $G \leftarrow 0$.
    4. (Update) For $t\leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G\leftarrow \gamma G + R_{t+1}$.
        2. (Upd. act-val estimate) Update $q(S_t, A_t)$ to reduce $[G - q(S_t, A_t)]^2$.
        3. If using incremental implementation:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + 1$.
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{1}{c(S_t, A_t)}[G - q(S_t, A_t)]$.
            3. If the first-visit version is being used, update $c$ and $q$ only when $f(S_t, A_t) = t$.
        4. (Improve policy) $A^* \leftarrow \arg\max_a q(S_t, a)$.
        5. Update $\pi(\cdot\mid S_t)$ to the $\epsilon$-greedy policy of the deterministic policy $\pi(a\mid S_t) = 0$ ($a \neq A^*$).

### Algorithm 4.8: MC Update with Soft Policy (maintaining policy implicitly)
1. (Initialise) Initialise action value estimates $q(s,a)$ arbitrarily. If using incremental implementation, initialise $c(s,a) \leftarrow 0$.
2. (MC update) For each episode:
    1. (Sample trajectory) Use $\epsilon$-greedy policy derived from $q$ to generate trajectory $S_0, A_0, R_1, S_1, \dots, S_{T - 1}, A_{T-1}, R_T, S_T$.
    2. If using first-visit version, find when state-action pair is first visited:
        1. $f(s,a) \leftarrow -1$ for each $s, a$.
        2. For each $t\leftarrow 0, 1, \dots, T-1$, if $f(S_t, A_t) < 0$, then set $f(S_t, A_t) \leftarrow t$.
    3. (Initialise return) $G\leftarrow 0$.
    4. (Update) For $t \leftarrow T - 1, T-2, \dots, 0$:
        1. (Calculate return) $G \leftarrow \gamma G + R_{t+1}$.
        2. (Update action-value estimate) Update $q(S_t, A_t)$ to reduce $[G - q(S_t, A_t)]^2$.
        3. If using incremental implementation:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + 1$.
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{1}{c(S_t, A_t)}[G - q(S_t, A_t)]$.
            3. If using first-visit version, update $c$ and $q$ only when $f(S_t, A_t) = t$.
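A minimal sketch of the every-visit variant of Algorithm 4.8. The toy MDP, the fixed start state and $\epsilon = 0.3$ are assumptions chosen for illustration. `DYNAMICS` only simulates the environment and is not used by the learner.

```python
import random

# Hypothetical toy MDP: (state, action) -> (next_state, reward); state 2 is terminal.
DYNAMICS = {
    (0, 0): (1, 0.0), (0, 1): (2, 1.0),
    (1, 0): (2, 0.0), (1, 1): (2, 5.0),
}
STATES, ACTIONS, START, TERMINAL = (0, 1), (0, 1), 0, 2


def greedy(q, state):
    """Action with the largest estimated value (ties go to the first action)."""
    return max(ACTIONS, key=lambda a: q[state, a])


def epsilon_greedy(q, state, epsilon, rng):
    """With probability epsilon act uniformly at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return greedy(q, state)


def mc_soft_policy(num_episodes, epsilon=0.3, gamma=1.0, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    c = {(s, a): 0 for s in STATES for a in ACTIONS}
    for _ in range(num_episodes):
        state, trajectory = START, []
        while state != TERMINAL:  # no exploring start: every episode begins at START
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward = DYNAMICS[state, action]
            trajectory.append((state, action, reward))
            state = next_state
        g = 0.0
        for s, a, r in reversed(trajectory):
            g = gamma * g + r
            c[s, a] += 1
            q[s, a] += (g - q[s, a]) / c[s, a]
    return q


q = mc_soft_policy(num_episodes=5000)
greedy_policy = {s: greedy(q, s) for s in STATES}
```

Because the behaviour is soft, the pair `(1, 1)` is eventually tried even though every episode starts in state 0. The learned `q[0, 0]` is the value under the $\epsilon$-greedy policy, so it stays below the optimal value 5.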

## Off-Policy MC Learning
In an off-policy algorithm, the policy that is updated and the policy that generates samples can be different.

This section uses importance sampling to evaluate a policy and to find an optimal one.

### Importance Sampling
Sometimes we want to estimate the value of a particular state, but that state is very difficult to reach under the given policy.

There are then few samples that can be used to estimate the value of that state, making the variance of the estimate very large.

Importance sampling uses another policy to generate samples that visit the state more frequently, so that the samples are used more efficiently.

Importance sampling is a technique to reduce the variance in MC algorithms.
- Changes sampling probability distributions so that the sampling can be more efficient.
- Ratio of new probability to old probability is called **importance sampling ratio**.

We consider off-policy RL using importance sampling.
- Target policy $\pi$ is policy to update.
- Policy to generate samples is behaviour policy $b$.

Use $b$ to generate samples, use the samples to update statistics about target policy.

Given state $S_t$ at time $t$, can generate trajectory
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{T-1}, A_{T-1}, R_T, S_T$$
using either $\pi$ or $b$. The probabilities of the given trajectory generated are as follows:
\begin{align*}
\text{Pr}_x[A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{T-1}, A_{T-1}, R_T, S_T\mid S_t] &= x(A_t\mid S_t)p(S_{t+1}, R_{t+1}\mid S_t, A_t)x(A_{t+1}\mid S_{t+1})\dots p(S_T, R_T\mid S_{T-1}, A_{T-1})\\
&= \prod_{\tau = t}^{T-1}x(A_\tau\mid S_\tau)\prod_{\tau = t}^{T-1}p(S_{\tau+1}, R_{\tau+1}\mid S_\tau, A_\tau)
\end{align*}
Where $x$ is either $\pi$ or $b$.

The ratio of the two probabilities is the importance sampling ratio:
$$\rho_{t:T-1} = \frac{\text{Pr}_\pi[A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{T-1}, A_{T-1}, R_T, S_T\mid S_t]}{\text{Pr}_b[A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{T-1}, A_{T-1}, R_T, S_T\mid S_t]} = \prod_{\tau = t}^{T-1}\frac{\pi(A_\tau\mid S_\tau)}{b(A_\tau\mid S_\tau)}$$
Notice that the simplification shows the ratio depends on the policies only, not the dynamics.

To make the ratio well-defined, require $\pi$ absolutely continuous w.r.t $b$ ($\pi \ll b$).

This means that for all $s, a$ s.t. $\pi(a\mid s) > 0$, we have $b(a\mid s) > 0$. Furthermore, to clear up divisions by zero, if $\pi(a\mid s) = 0$, then
$$\frac{\pi(a\mid s)}{b(a\mid s)} = 0$$
regardless of the value of $b(a\mid s)$.
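A minimal sketch of computing the ratio for one recorded trajectory, assuming both policies are given as tables mapping `(state, action)` to a probability. The policies and trajectories are hypothetical.

```python
def importance_ratio(trajectory, target, behaviour):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of a trajectory."""
    rho = 1.0
    for state, action in trajectory:
        pi = target[state, action]
        if pi == 0.0:
            return 0.0  # convention: the ratio is 0 whenever pi(a|s) = 0
        rho *= pi / behaviour[state, action]
    return rho


target = {("s", "L"): 1.0, ("s", "R"): 0.0}     # deterministic target policy
behaviour = {("s", "L"): 0.5, ("s", "R"): 0.5}  # uniform behaviour policy
rho_ll = importance_ratio([("s", "L"), ("s", "L")], target, behaviour)
rho_lr = importance_ratio([("s", "L"), ("s", "R")], target, behaviour)
```

Only the two policies enter the computation and the dynamics $p$ are never needed. Here `rho_ll` is $(1 / 0.5)^2 = 4$, while `rho_lr` is 0 because the target policy never takes `R`.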

Now we consider state action pair $(S_t, A_t)$, and we can use either $\pi$ or $b$ to generate the trajectory
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{T-1}, A_{T-1}, R_T, S_T$$
The probability to generate this trajectory using policy $x$ is
\begin{align*}
\text{Pr}_x[R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{T-1}, A_{T-1}, R_T, S_T\mid S_t, A_t] &= p(S_{t+1}, R_{t+1}\mid S_t, A_t)x(A_{t+1}\mid S_{t+1})\dots p(S_T, R_T\mid S_{T-1}, A_{T-1})\\
&= \prod_{\tau = t+1}^{T-1}x(A_\tau\mid S_\tau)\prod_{\tau = t}^{T-1}p(S_{\tau+1}, R_{\tau + 1}\mid S_\tau, A_\tau)
\end{align*}
The importance sampling ratio is
$$\rho_{t+1: T-1} = \prod_{\tau = t+1}^{T-1}\frac{\pi(A_\tau\mid S_\tau)}{b(A_\tau\mid S_\tau)}$$

In on-policy MC update, after getting the return samples $g_1, \dots, g_c$, we use the arithmetic mean $\frac{1}{c}\sum_{i=1}^cg_i$ as the value estimate. This assumes that the samples $g_i$ are equally probable.

In off-policy learning the return samples are generated by $b$, so they do not have the same probabilities under $b$ and under $\pi$.

Relative to $b$, the probabilities of those samples under $\pi$ are proportional to their importance sampling ratios, so we can use a weighted average of the samples as the estimate. Take $\rho_i$ for $1 \leq i \leq c$ as the importance sampling ratio of sample $g_i$, which also acts as its weight. Then the weighted average of those samples is:
$$\frac{\sum_{i=1}^c\rho_ig_i}{\sum_{i=1}^c\rho_i}$$
MC update with importance sampling can be implemented incrementally too, but it does not need the number of samples for each state or state-action pair. Instead, it records the sum of the weights. For example, updating state values can be written as
\begin{align*}
c &\leftarrow c + \rho\\
v &\leftarrow v + \frac{\rho}{c}(g - v)
\end{align*}
where $c = \sum\rho_i$. Updating the action values can be written as:
\begin{align*}
c &\leftarrow c + \rho\\
q &\leftarrow q + \frac{\rho}{c}(g - q)
\end{align*}
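A minimal sketch checking that the incremental update reproduces the weighted average. The returns and ratios are hypothetical, and skipping zero-weight samples is an assumption added here to avoid dividing by $c = 0$.

```python
def weighted_average_incremental(returns, ratios):
    """Incrementally compute sum(rho_i * g_i) / sum(rho_i)."""
    c, v = 0.0, 0.0
    for g, rho in zip(returns, ratios):
        if rho == 0.0:
            continue  # a zero-weight sample does not change the estimate
        c += rho                  # c <- c + rho
        v += (rho / c) * (g - v)  # v <- v + (rho / c) * (g - v)
    return v


returns = [1.0, 3.0, 10.0]
ratios = [2.0, 2.0, 0.0]
v = weighted_average_incremental(returns, ratios)
direct = sum(r * g for r, g in zip(ratios, returns)) / sum(ratios)
```

Both `v` and `direct` equal 2: the third return has weight 0 and is ignored, while the first two have equal weight.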


## Off-Policy MC Policy Evaluation
On-policy algorithms perform the following in a nutshell:
1. Use target policy to generate samples,
2. Use the samples to update value estimates.

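The policy that generates samples and the policy being evaluated are the same, so those algorithms are on-policy. Off-policy MC policy evaluation instead generates samples with a behaviour policy $b$ and uses importance sampling to evaluate the target policy $\pi$.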

### Algorithm 4.9: Evaluate action values using off-policy MC update based on importance sampling.
1. (Initialise) Initialise $q(s,a)$ arbitrarily. If using incremental implementation, initialise $c(s,a) \leftarrow 0$.
2. (MC update) For each episode:
    1. (Designate behaviour policy $b$) Designate $b$ s.t. $\pi \ll b$.
    2. (Sample trajectory) Use $b$ to generate a trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    3. If using first-visit version, perform the following:
        1. Initialise $f(s,a) \leftarrow -1$.
        2. For each $t\leftarrow 0, 1, \dots, T-1$: if $f(S_t,A_t) < 0$, then set $f(S_t, A_t) \leftarrow t$.
    4. (Initialise return and ratio) $G\leftarrow 0$; $\rho \leftarrow 1$.
    5. (Update) For $t \leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G\leftarrow \gamma G + R_{t+1}$
        2. (Upd. act-val. estim.) Update $q(S_t, A_t)$ to reduce $\rho[G - q(S_t, A_t)]^2$.
        3. If using incremental implementation, perform the following:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + \rho$.
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{\rho}{c(S_t, A_t)}[G-q(S_t, A_t)]$.
            3. If using first-visit version, update counter and action value estimates only when $f(S_t, A_t) = t$.
        4. (Update importance sampling ratio) $\rho \leftarrow \rho \frac{\pi(A_t\mid S_t)}{b(A_t\mid S_t)}$.
        5. (Check early stop condition) If $\rho = 0$, **break**.
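A minimal sketch of the every-visit, incremental variant of Algorithm 4.9. It assumes recorded episodes generated by $b$ and tabular policies mapping `(state, action)` to probabilities. All data below is hypothetical.

```python
from collections import defaultdict


def off_policy_mc_q(episodes, target, behaviour, gamma=1.0):
    """Off-policy MC evaluation of action values with weighted importance sampling."""
    q = defaultdict(float)
    c = defaultdict(float)  # cumulative weights instead of visit counts
    for episode in episodes:
        g, rho = 0.0, 1.0
        for state, action, reward in reversed(episode):
            g = gamma * g + reward
            c[state, action] += rho
            q[state, action] += (rho / c[state, action]) * (g - q[state, action])
            # The ratio for step t covers steps t+1 onwards, so update it afterwards.
            rho *= target[state, action] / behaviour[state, action]
            if rho == 0.0:
                break  # all earlier steps would receive zero weight
    return dict(q)


pairs = [(s, a) for s in ("s0", "s1") for a in ("a", "b")]
target = {(s, a): 1.0 if a == "a" else 0.0 for s, a in pairs}  # always takes "a"
behaviour = {pair: 0.5 for pair in pairs}                      # uniform, so pi << b
episodes = [
    [("s0", "a", 0.0), ("s1", "a", 1.0)],
    [("s0", "a", 0.0), ("s1", "b", 10.0)],
]
q = off_policy_mc_q(episodes, target, behaviour)
```

The second episode takes action `b` in `s1`, which the target policy never does, so its return of 10 is not credited to `("s0", "a")`. That estimate stays at 1, whereas a plain average of the two returns would give 5.5.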

## Off-Policy MC Policy Optimisation
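When finding optimal policies, on-policy algorithms:
1. Use latest estimates of the optimal policy to generate samples,
2. Use samples to update optimal policy.

The policy that generates samples and the policy being updated are the same, so those algorithms are on-policy. Off-policy MC policy optimisation instead generates samples with a behaviour policy $b$ and uses importance sampling to update the target policy $\pi$.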

### Algorithm 4.10: Find an optimal policy using off-policy MC update based on importance sampling.
1. (Initialise) Initialise $q(s,a)$ arbitrarily.
2. If using incremental implementation, initialise $c(s,a) \leftarrow 0$.
3. If policy is maintained explicitly, initialise $\pi(s) \leftarrow \arg\max_{a}q(s,a)$.
4. (MC update) For each episode:
    1. (Designate behaviour policy $b$) Designate $b$ s.t. $\pi \ll b$.
    2. (Sample trajectory) Use $b$ to generate a trajectory $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    3. If using first-visit version, perform the following:
        1. Initialise $f(s,a) \leftarrow -1$.
        2. For each $t\leftarrow 0, 1, \dots, T-1$: if $f(S_t,A_t) < 0$, then set $f(S_t, A_t) \leftarrow t$.
    4. (Initialise return and ratio) $G\leftarrow 0$; $\rho \leftarrow 1$.
    5. (Update) For $t \leftarrow T-1, T-2, \dots, 0$:
        1. (Calculate return) $G\leftarrow \gamma G + R_{t+1}$.
        2. (Upd. act-val. estim.) Update $q(S_t, A_t)$ to reduce $\rho[G - q(S_t, A_t)]^2$.
        3. If using incremental implementation, perform the following:
            1. $c(S_t, A_t) \leftarrow c(S_t, A_t) + \rho$.
            2. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \frac{\rho}{c(S_t, A_t)}[G-q(S_t, A_t)]$.
            3. If using first-visit version, update counter and action value estimates only when $f(S_t, A_t) = t$.
        4. If maintaining policy explicitly, $\pi(S_t) \leftarrow \arg\max_a q(S_t, a)$.
        5. (Check early stop condition) If $A_t \neq \pi(S_t)$, **break**.
        6. (Update importance sampling ratio) $\rho \leftarrow \rho \frac{\pi(A_t\mid S_t)}{b(A_t\mid S_t)}$.
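A minimal sketch of the every-visit variant of Algorithm 4.10 with an explicit policy. The toy MDP and the uniform behaviour policy are assumptions for illustration. `DYNAMICS` only simulates the environment and is not read by the learner.

```python
import random

# Hypothetical toy MDP: (state, action) -> (next_state, reward); state 2 is terminal.
DYNAMICS = {
    (0, 0): (1, 0.0), (0, 1): (2, 1.0),
    (1, 0): (2, 0.0), (1, 1): (2, 5.0),
}
STATES, ACTIONS, START, TERMINAL = (0, 1), (0, 1), 0, 2


def off_policy_mc_control(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    c = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    pi = {s: max(ACTIONS, key=lambda a: q[s, a]) for s in STATES}
    b_prob = 1.0 / len(ACTIONS)  # uniform behaviour policy, so pi << b holds
    for _ in range(num_episodes):
        state, trajectory = START, []
        while state != TERMINAL:
            action = rng.choice(ACTIONS)  # sample from the behaviour policy b
            next_state, reward = DYNAMICS[state, action]
            trajectory.append((state, action, reward))
            state = next_state
        g, rho = 0.0, 1.0
        for s, a, r in reversed(trajectory):
            g = gamma * g + r
            c[s, a] += rho
            q[s, a] += (rho / c[s, a]) * (g - q[s, a])
            pi[s] = max(ACTIONS, key=lambda x: q[s, x])  # policy improvement
            if a != pi[s]:
                break  # pi(a|s) = 0, so the ratio for earlier steps is 0
            rho /= b_prob  # pi(a|s) = 1 for the greedy action
    return q, pi


q, pi = off_policy_mc_control(num_episodes=2000)
```

Since the target policy is deterministic, $\pi(A_t \mid S_t)$ is either 0 (stop) or 1, which is why the ratio update reduces to dividing by $b(A_t \mid S_t)$ in this sketch.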