# Random Walk: TD vs MC comparison

In this notebook we compare the performance of temporal-difference learning (TD(0)) and Monte Carlo (MC) prediction.
The example we consider is a Markov reward process,
i.e. a Markov decision process without actions.

For a more detailed description see Example 6.2, page 125, in Sutton & Barto.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

The environment is a finite random walk over the integer states `0, ..., length-1`,
with the agent starting in the middle state.

At each step the agent moves left or right with equal probability.

The two outermost states (`0` and `length-1`) are terminal.

The agent receives a reward of +1 when it enters state `length-1`; every other transition yields a reward of 0.
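
The description above can be sketched roughly as follows. This is only one possible reading, not the reference solution for the exercise cell: the class name `RandomWalkSketch`, the `seed` parameter, and the `(state, reward, done)` return value of `step` are assumptions made for illustration.

```python
import random


class RandomWalkSketch:
    """Hypothetical sketch of the random-walk environment described above."""

    def __init__(self, length, seed=None):
        self.length = length
        self.rng = random.Random(seed)  # private RNG so runs can be reproduced
        self.reset()

    def reset(self):
        # The agent starts in the middle state.
        self.state = self.length // 2
        return self.state

    def step(self):
        # Move one state to the left or right with equal probability.
        self.state += self.rng.choice((-1, 1))
        # Reward +1 only when the right terminal state is entered.
        reward = 1 if self.state == self.length - 1 else 0
        done = self.state in (0, self.length - 1)
        return self.state, reward, done


walk = RandomWalkSketch(7, seed=0)
state, done = walk.reset(), False
while not done:
    state, reward, done = walk.step()
print(state, reward)  # final (terminal) state and the last reward
```

The sketch does not guard against calling `step` after a terminal state has been reached; the caller is expected to call `reset` first.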

In [None]:

class RandomWalk:
    def __init__(self, length):
        ... #??
    
    def step(self):
        ... #??
    
    def reset(self):
        ... #??



Set the length of the random walk used below.

In [None]:
LENGTH = 7

Create an instance and test the environment.

In [None]:
... #??

## 1. Monte Carlo

We implement Monte Carlo prediction, roughly following the pseudo-code for first-visit MC prediction in [Sutton & Barto](http://incompleteideas.net/book/RLbook2020trimmed.pdf#page=114) (page 114 of the linked PDF).
The algorithm is changed by using a constant step size $\alpha$ instead of averaging over all previous returns.
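
To illustrate the difference between the two update rules, here is a small sketch on a made-up sequence of returns. The helper name `incrementalUpdate` and the numbers are assumptions for illustration only. With step size $1/n$ the incremental update reproduces the plain average of all returns, whereas a constant $\alpha$ weights recent returns more heavily.

```python
def incrementalUpdate(value, target, stepSize):
    # Move the estimate a fraction `stepSize` of the way towards the target.
    return value + stepSize * (target - value)


returns = [1.0, 0.0, 1.0, 1.0]  # made-up returns observed for one state

# Step size 1/n: equivalent to averaging over all previous returns.
avg = 0.0
for n, g in enumerate(returns, start=1):
    avg = incrementalUpdate(avg, g, 1.0 / n)

# Constant step size: an exponentially weighted average that never
# fully forgets the initial estimate (0.5 here).
const = 0.5
for g in returns:
    const = incrementalUpdate(const, g, 0.1)

print(avg, const)
```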

In [None]:
# Initialize:
# V(s) <- arbitrarily

# Loop forever (for each episode):
    # Generate an episode: S0, A0, R1, S1, A1, R2, ..., ST-1, AT-1, RT
    # G <- 0
    # Loop for each step of episode, t = T-1, T-2, ..., 0:
        # G <- GAMMA * G + Rt+1
        # Unless St appears in S0, A0, S1, A1, ..., St-1, At-1:
            # V(St) <- V(St) + alpha * (G - V(St))

First, we define a helper function to generate an episode.

In [None]:
def generateEpisode(randomWalk: RandomWalk):
    ... #??

Next, we define the code for a single episode of MC.
The value function is passed as `values` and updated in place.
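
As a rough sketch of the update itself, the function below applies first-visit, constant-$\alpha$ MC to an already recorded episode. The name `monteCarloUpdate`, the `gamma` parameter, and the episode format (a list of `(S_t, R_{t+1})` pairs) are assumptions for illustration; the exercise function below takes the environment instead.

```python
def monteCarloUpdate(episode, values, alpha, gamma=1.0):
    """First-visit constant-alpha MC; `episode` is [(S_t, R_{t+1}), ...]."""
    states = [s for s, _ in episode]
    g = 0.0
    # Walk backwards through the episode, accumulating the return G.
    for t in reversed(range(len(episode))):
        state, reward = episode[t]
        g = gamma * g + reward
        # Update only at the first visit of `state` within the episode.
        if state not in states[:t]:
            values[state] += alpha * (g - values[state])


values = [0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0]
# Hand-written episode for a walk of length 7 that ends in the right terminal state.
episode = [(3, 0), (4, 0), (3, 0), (4, 0), (5, 1)]
monteCarloUpdate(episode, values, alpha=0.1)
print(values)
```

Because the return is 1 from every time step of this episode, each visited state (3, 4 and 5) is moved towards 1 exactly once, while unvisited states keep their initial value.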

In [None]:

def monteCarloEpisode(randomWalk: RandomWalk, values, alpha):
    ... #??



Finally, we perform MC by running many episodes.

In [None]:
# Perform MC by running many episodes
N_EPISODES = 10000
ALPHA = 0.02

... #??

## 2. TD(0)

Next, we define the code for a single episode of TD(0).
The value function is passed as `values` and updated in place.
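
For comparison with MC, a minimal sketch of the TD(0) update applied to recorded transitions is shown here. The name `tdUpdate`, the `gamma` parameter, and the `(S, R, S')` tuple format are assumptions for illustration. Processing the transitions in order gives the same result as updating online while stepping through the environment.

```python
def tdUpdate(transitions, values, alpha, gamma=1.0):
    """Apply the TD(0) update to each (state, reward, nextState) in order."""
    for state, reward, nextState in transitions:
        # TD error: bootstrapped one-step target minus the current estimate.
        tdError = reward + gamma * values[nextState] - values[state]
        values[state] += alpha * tdError


values = [0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0]
# Hand-written episode for a walk of length 7: 3 -> 4 -> 5 -> 6 (terminal).
transitions = [(3, 0, 4), (4, 0, 5), (5, 1, 6)]
tdUpdate(transitions, values, alpha=0.1)
print(values)
```

With all non-terminal values initialised to 0.5, the TD error is zero on every transition except the last, so only the state next to the terminal state changes in this episode.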

In [None]:
## TD(0) to estimate value function V

# Algorithm parameter alpha
# Initialize V(s) such that V(terminal)=0
# for each episode do:
    # Initialize S
    # repeat:
        # Take step, observe R, S'
        # V(S) <- V(S) + alpha * (R + V(S') - V(S))
        # S <- S'
    # until S is terminal

In [None]:

def tdEpisode(randomWalk: RandomWalk, values, alpha):
    ... #??


In [None]:
# Perform TD(0) by running many episodes
... #??

## 3. Comparison

Note that the true value of a non-terminal state $s$ is $s / (LENGTH - 1)$; the terminal states have value 0 and are excluded from the error.
We use this to compare the error made by each method.
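
The closed form can be checked numerically by solving the Bellman equations of the random walk, $v(s) = \tfrac{1}{2} v(s-1) + \tfrac{1}{2} v(s+1)$, with terminal values 0 and a reward of +1 on entering the right terminal state. The following is a sketch under those assumptions; the variable names are chosen for illustration only.

```python
import numpy as np

length = 7
n = length - 2  # number of non-terminal states (1, ..., length-2)

# Linear system A v = b for the non-terminal states.
A = np.eye(n)
b = np.zeros(n)
for i in range(n):
    if i > 0:
        A[i, i - 1] = -0.5  # step to the left neighbour
    if i < n - 1:
        A[i, i + 1] = -0.5  # step to the right neighbour
# From the rightmost non-terminal state, the walk enters the right
# terminal state with probability 0.5 and receives reward +1.
b[-1] = 0.5

v = np.linalg.solve(A, b)
print(v)
```

The solution agrees with $s / (LENGTH - 1)$ for $s = 1, \dots, LENGTH - 2$ up to floating-point error.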

In [None]:
TRUE_VALUES = [i / (LENGTH - 1) for i in range(LENGTH)]

def computeRMS(values):
    errors = [v - t for v, t in zip(values[1:-1], TRUE_VALUES[1:-1])]
    return np.sqrt(np.mean(np.square(errors)))

First, we recreate the left graph of Example 6.2 (p. 125, Sutton & Barto), which shows the estimated value function after different numbers of episodes $n$.

In [None]:
def plotValueSteps(showN, alpha):
    values = [0.5] * LENGTH
    values[0] = 0
    values[-1] = 0

    randomWalk = RandomWalk(LENGTH)
    
    n = np.max(showN) + 1
    
    # Perform TD(0), plotting the value estimates after each episode count in showN
    for i in range(n):
        if i in showN:
            plt.plot(values[1:-1], label = 'n = ' + str(i))
        tdEpisode(randomWalk, values, alpha)
    
    plt.plot(TRUE_VALUES[1:-1], linestyle='dashdot', label="true values")

    plt.legend()
    plt.show()

In [None]:
plotValueSteps([0, 1, 10, 100], 0.1)

Next, we define a function that performs several TD/MC episodes and computes the RMS error after each episode.

In [None]:

def computeRMSs(alpha, nEpisodes, runEpisode):
    # runEpisode is the per-episode update function, e.g. tdEpisode or monteCarloEpisode
    randomWalk = RandomWalk(LENGTH)
    rms = []
    values = [0.5] * LENGTH
    values[0] = 0
    values[-1] = 0
    for _ in range(nEpisodes):
        runEpisode(randomWalk, values, alpha)
        rms.append(computeRMS(values))
    return rms


Using the function above,
we plot the RMS error, averaged over several experiments, against the number of training episodes for both methods and different values of `alpha`.

In [None]:

# Compared alphas
alphasMC = [0.01, 0.02, 0.03, 0.04]
alphasTD = [0.01, 0.05, 0.10, 0.20]

# Number of episodes per experiment
nEpisodes = 200

# Number of experiments
nExperiments = 100

print('MC...')
for alpha in alphasMC:
    print(alpha)
    rmsList = [computeRMSs(alpha, nEpisodes, monteCarloEpisode) for _ in range(nExperiments)]
    rms = np.mean(rmsList, axis=0)
    plt.plot(rms, linestyle='dashdot', label='MC {}'.format(alpha))

print('TD...')
for alpha in alphasTD:
    print(alpha)
    rmsList = [computeRMSs(alpha, nEpisodes, tdEpisode) for _ in range(nExperiments)]
    rms = np.mean(rmsList, axis=0)
    plt.plot(rms, linestyle='solid', label='TD {}'.format(alpha))


plt.xlabel('Episodes')
plt.ylabel('RMS error (mean over experiments)')
plt.legend()
plt.show()
