# Exercise 6: Multi-Step Bootstrapping

In this exercise we will look at n-step methods and eligibility traces. The n-step methods are a class of reinforcement learning algorithms that generalize the Monte Carlo and TD(0) methods discussed earlier and include them as special cases. We also consider eligibility traces, which take the reverse (backward) view when determining the state values. The environment we will be dealing with is a little more typical for control research: the inverted pendulum.
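
To make the "special cases" statement concrete, the n-step return for action values can be written as follows. The notation ($x$ for states, $u$ for actions, $k$ for the time index, $\hat{q}$ for the action-value estimate) is an assumption chosen to match the rest of this sheet:

\begin{align*}
    G_{k:k+n} = R_{k+1} + \gamma R_{k+2} + \dots + \gamma^{n-1} R_{k+n} + \gamma^{n} \hat{q}(x_{k+n}, u_{k+n})
\end{align*}

For $n=1$ this reduces to the one-step TD target $R_{k+1} + \gamma \hat{q}(x_{k+1}, u_{k+1})$. If $n$ reaches up to the end of the episode, the bootstrap term vanishes and the Monte Carlo return remains.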

![](https://miro.medium.com/max/1000/1*TNo3x9zDi1lVOH_3ncG7Aw.gif)

To implement this environment, we will make use of the gymnasium library. Please install the gymnasium library within your preferred Python environment using:

```pip install gymnasium```

In [None]:
import numpy as np
import gymnasium as gym
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6

Check if the installation and import work by executing the following cell. A window with an animation of the pendulum should open, display some random actions, and close automatically.

In [None]:
env = gym.make('Pendulum-v1', render_mode="human")
env = env.unwrapped # removes a built-in time limit of k_T = 200, we want to determine the time limit ourselves

state, _ = env.reset()
for _ in range(300):
    env.render()
    state, reward, terminated, _, _ = env.step(env.action_space.sample()) # take a random action
env.close()

The goal of this environment is to bring the pendulum into the upright position, where the angle $\theta = 0$ and the angular velocity $\frac{\text{d}}{\text{d}t}\theta=\omega=0$. The reward function is already designed that way and does not need further specification. For further information about the environment you may refer to the code and documentation of Farama Foundation's `gymnasium`:

[Documentation of the gymnasium pendulum](https://gymnasium.farama.org/environments/classic_control/pendulum/)

[Pendulum environment in the gymnasium Github repository](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/pendulum.py)
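
For orientation, the reward described in the linked documentation can be sketched as follows. The function name `pendulum_reward` is hypothetical and only serves as an illustration. The environment computes the reward internally, so nothing of this needs to be implemented for the exercise.

```python
import numpy as np


def angle_normalize(theta):
    """Wrap an angle to the interval [-pi, pi)."""
    return ((theta + np.pi) % (2 * np.pi)) - np.pi


def pendulum_reward(theta, omega, torque):
    """Sketch of the Pendulum-v1 reward: a negative quadratic cost.

    The reward is zero only for theta = 0, omega = 0 and zero torque,
    and negative everywhere else.
    """
    cost = angle_normalize(theta) ** 2 + 0.1 * omega ** 2 + 0.001 * torque ** 2
    return -cost


print(pendulum_reward(0.0, 0.0, 0.0))    # upright and at rest
print(pendulum_reward(np.pi, 0.0, 0.0))  # hanging down and at rest
```

Since the reward is never positive, the cumulative reward of an episode is always at most zero, and values closer to zero indicate better performance.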

## 1) Discretization of Action and State Space

Unlike the racetrack environment, the inverted pendulum comes with a continuous action and state space. Although it is possible to handle systems with these characteristics, we have not yet learned how to deal with them. For now, we only know how to implement agents for discrete action and state spaces. Accordingly, we will represent the inverted pendulum within a discrete state / action space, which requires a discretization.

The pendulum has three state variables relating to the momentary angular position $\theta$:
\begin{align*}
    x=\begin{bmatrix}
    \text{cos}(\theta)\\
    \text{sin}(\theta)\\
    \frac{\text{d}}{\text{d}t}\theta
    \end{bmatrix}
    \in
    \begin{bmatrix}
    [-1, 1]\\
    [-1, 1]\\
    [-8 \, \frac{1}{\text{s}}, 8 \, \frac{1}{\text{s}}]
    \end{bmatrix},
\end{align*}

and one input variable which relates to the torque applied at the axis of rotation:

$u = T \in [-2 \, \text{N}\cdot\text{m}, 2 \, \text{N}\cdot\text{m}]$

After the discretization, we want the system to be defined on sets of non-negative integers:

\begin{align*}
    x_d =
    \text{discretize_state}(x)
    \in
    \begin{bmatrix}
    \{0,1,2,...,d_{\theta}-1\}\\
    \{0,1,2,...,d_{\theta}-1\}\\
    \{0,1,2,...,d_{\omega}-1\}
    \end{bmatrix},
\end{align*}

$
u_d =
\text{discretize_action}(u)
\in
\{0,1,2,...,d_{T}-1\}.
$

Since the agent selects its action within the discrete action space, we need to transform it back into a continuous torque before passing it to the environment:

$
u=
\text{continualize_action}(u_d):
\{0,1,2,...,d_{T}-1\} \rightarrow [-2 \, \text{N}\cdot\text{m}, 2 \, \text{N}\cdot\text{m}]
.
$

Write the functions `discretize_state` and `continualize_action`, such that a discrete RL agent can be applied. (Please note that all I/O of `gymnasium` consists of numpy arrays.) Write the functions in such a way that the numbers of discretization intervals $d_\theta, d_\omega, d_T$ are parameters that can be changed for different tests. The discretization intervals should be uniformly distributed over their respective ranges.

A parametrization of $d_\theta = d_\omega = d_T = 15$ can be used to yield satisfactory results in this exercise.
However, does it make a difference if the number of discretization intervals is odd or even? If yes, what should be preferred for the given environment? 
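
As a starting point for this question, the following sketch shows uniform binning on a generic interval. The helper names `uniform_bin_index` and `bin_centers` are hypothetical and not part of the exercise template, and this is only one of several possible ways to discretize uniformly.

```python
import numpy as np


def uniform_bin_index(value, low, high, n_bins):
    """Map a value in [low, high] to a bin index in {0, ..., n_bins - 1}."""
    # relative position of the value within the interval, in [0, 1]
    rel = (np.asarray(value, dtype=float) - low) / (high - low)
    # scale to the number of bins and round down to get the index
    idx = np.floor(rel * n_bins).astype(int)
    # clip so that value == high falls into the last bin
    return np.clip(idx, 0, n_bins - 1)


def bin_centers(low, high, n_bins):
    """Return the center value of each of the n_bins uniform intervals."""
    width = (high - low) / n_bins
    return low + width * (np.arange(n_bins) + 0.5)


# compare the representable values for an odd and an even number of bins
centers_odd = bin_centers(-2.0, 2.0, 5)
centers_even = bin_centers(-2.0, 2.0, 4)
print(centers_odd)
print(centers_even)
```

Comparing the two printed arrays shows whether the value zero is among the bin centers, which is a useful observation for the odd / even question above.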

## Solution 1)

YOUR ANSWER HERE

In [None]:
d_T = 15
d_theta = 15
d_omega = 15


def discretize_state(states):

    # YOUR CODE HERE
    raise NotImplementedError()


def continualize_action(disc_action):

    # YOUR CODE HERE
    raise NotImplementedError()


Use the following cell for debugging:

In [None]:
env = gym.make('Pendulum-v1', render_mode="human")
state, _ = env.reset()
for _ in range(5):
    disc_action = np.random.choice(range(d_T))  # sample from all d_T discrete actions
    cont_action = continualize_action(disc_action)
    print("discrete action: {}, continuous action: {}".format(disc_action, cont_action))
    
    state, reward, terminated, _, _ = env.step(cont_action) # take a random action
    disc_state = discretize_state(state)
    print("discrete state: {}, continuous state: {}".format(disc_state, state))
    
env.close()

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## 2) n-Step Sarsa

Write an on-policy n-step Sarsa control algorithm for the inverted pendulum from scratch. This time, no code template is given. 

Use the following parameters: $\alpha=0.1, \gamma=0.9, \varepsilon=0.1, n=10$ and train for 2000 episodes of 500 time steps each.
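
The parameter $\varepsilon$ refers to $\varepsilon$-greedy action selection. A minimal sketch of it is given below, assuming the action values of the current state are available as a one-dimensional array. The function name `epsilon_greedy` and the use of `np.random.default_rng` are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)


def epsilon_greedy(q_row, epsilon, rng):
    """Select an action index from the action values of a single state."""
    if rng.random() < epsilon:
        # explore: pick any action with equal probability
        return int(rng.integers(len(q_row)))
    # exploit: pick the action with the highest estimated value
    return int(np.argmax(q_row))


q_row = np.array([0.0, 2.0, 1.0])
print(epsilon_greedy(q_row, 0.1, rng))
```

With the tabular `action_values` array used later, `q_row` would correspond to `action_values[disc_state]` for a discrete state given as a tuple.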

![](nStepSARSA_Algo.png)
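
The central quantity of the algorithm is the n-step return. The sketch below computes it from a list of stored rewards and a bootstrap value. The function name `n_step_return` is hypothetical, and how rewards are buffered inside the training loop is left open on purpose.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of the given rewards plus a discounted bootstrap value.

    rewards:         [R_{k+1}, ..., R_{k+n}]
    bootstrap_value: action-value estimate of the pair reached after n steps,
                     or 0.0 if the episode ended before that
    """
    g = bootstrap_value
    # accumulate backwards so that every reward gets the right power of gamma
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# three rewards of 1, gamma = 0.5, bootstrap value 4
print(n_step_return([1.0, 1.0, 1.0], 0.5, 4.0))  # → 2.25
```

The update then moves the action value of the pair visited n steps ago towards this return with step size $\alpha$.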

YOUR ANSWER HERE

In [None]:
env = gym.make('Pendulum-v1') # , render_mode="human"
env = env.unwrapped

alpha = 0.1  # learning rate
gamma = 0.9  # discount factor
epsilon = 0.1  # epsilon greedy parameter
n = 10  # number of steps until bootstrapping (n-step return)

nb_episodes = 2000  # number of episodes
nb_steps = 500  # length of episodes

action_values = np.zeros([d_theta, d_theta, d_omega, d_T])
# int is necessary for indexing
pi = np.zeros([d_theta, d_theta, d_omega], dtype=int)

# we can use this to figure out how well the learning worked
cumulative_reward_history = []

for j in tqdm(range(nb_episodes), position=0, leave=True):

    # YOUR CODE HERE
    raise NotImplementedError()
pi_learned = np.copy(pi)  # save pi in cache under different name for later


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Greedy Execution

Test the learned policy by pure greedy execution.
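
A greedy policy can be read directly from a table of action values by taking the argmax over the action axis. The small array `q` below is made up for illustration; with the arrays of this exercise, the same call could be applied to `action_values`.

```python
import numpy as np

# toy action-value table: three state dimensions plus one action dimension
q = np.zeros((2, 2, 3, 4))
q[1, 0, 2, 3] = 1.0  # make action 3 the best one in state (1, 0, 2)

# greedy policy: best action index for every discrete state
pi_greedy = np.argmax(q, axis=-1)
print(pi_greedy.shape)     # → (2, 2, 3)
print(pi_greedy[1, 0, 2])  # → 3
```

Note that `np.argmax` returns the first index in case of ties, so states that were never updated map to action 0.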

In [None]:
env = gym.make('Pendulum-v1', render_mode="human")
env = env.unwrapped

nb_steps = 200

state, _ = env.reset() # initialize x_0
disc_state = tuple(discretize_state(state)) # use tuple indexing
disc_action = pi_learned[disc_state]

for k in range(nb_steps):
        
    cont_action = continualize_action(disc_action)
    env.render() # comment out for faster execution
    state, reward, terminated, _, _ = env.step(cont_action)
    disc_state = tuple(discretize_state(state))
        
    if terminated:
        break
        
    disc_action = pi_learned[disc_state] # exploitative action
    
env.close()

## 3) Recursive Updates: TD($\lambda$)

Both $n$-step and $\lambda$-return updates are based on a forward view: we have to wait for future states and rewards before an update can be performed.
We therefore introduce eligibility traces, which take a backward view and follow the general idea that previous actions have contributed to the current situation. In contrast to $n$-step learning, however, intuition tells us that recent decisions had a stronger impact on the present situation than decisions made a long time ago. Thus, it may be helpful to integrate a forgetting factor $\lambda$ that decreases the assumed influence of actions over time.

Now extend your solution to task 2 by eligibility traces $z_k(x_k)$ within the action-value update. Test it for different values of $\lambda$. How sensitive is the learning process to the choice of $\lambda$?
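
One possible form of a single backward-view update with accumulating traces is sketched below. As an assumption of this sketch, the trace is indexed by state and action (same shape as the action values), and the function name `sarsa_lambda_update` is hypothetical. Other trace variants, such as replacing traces, are equally valid choices.

```python
import numpy as np


def sarsa_lambda_update(q, z, s, a, reward, s_next, a_next, alpha, gamma, lamb):
    """Perform one in-place update of action values q and traces z.

    s and s_next are tuples of discrete state indices, a and a_next are ints.
    """
    # one-step TD error of the current transition
    td_error = reward + gamma * q[s_next + (a_next,)] - q[s + (a,)]
    # accumulating trace: mark the visited state-action pair
    z[s + (a,)] += 1.0
    # every pair is updated in proportion to its trace
    q += alpha * td_error * z
    # all traces decay with the discount and the forgetting factor
    z *= gamma * lamb
    return td_error


# toy problem with a single state dimension (2 states) and 3 actions
q = np.zeros((2, 3))
z = np.zeros((2, 3))
sarsa_lambda_update(q, z, (0,), 1, 1.0, (1,), 0, alpha=0.1, gamma=0.9, lamb=0.8)
print(q)
print(z)
```

The traces are usually reset to zero at the beginning of each episode, so that credit is not carried over between episodes.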

In [None]:
env = gym.make('Pendulum-v1') # , render_mode="human"
env = env.unwrapped

alpha = 0.1  # learning rate
gamma = 0.9  # discount factor
epsilon = 0.1  # epsilon greedy parameter
lamb = 0.8  # forgetting factor

nb_episodes = 5000  # number of episodes
nb_steps = 500  # length of episodes

action_values = np.zeros([d_theta, d_theta, d_omega, d_T])
# init eligibility trace
eligibility = np.zeros([d_theta, d_theta, d_omega])
# int is necessary for indexing
pi = np.zeros([d_theta, d_theta, d_omega], dtype=int)

# we can use this to figure out how well the learning worked
cumulative_reward_history = []

for j in tqdm(range(nb_episodes), position=0, leave=True):

    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
plt.plot(cumulative_reward_history)
plt.xlabel("episode")
plt.ylabel(r"$\sum R$")
plt.show()

print(np.shape(cumulative_reward_history))

## Greedy Execution

Test the learned policy by pure greedy execution.

In [None]:
env = gym.make('Pendulum-v1', render_mode="human")
env = env.unwrapped

nb_steps = 200

state, _ = env.reset() # initialize x_0
disc_state = tuple(discretize_state(state)) # use tuple indexing
disc_action = pi[disc_state]

for k in range(nb_steps):
        
    cont_action = continualize_action(disc_action)
    env.render() # comment out for faster execution
    state, reward, terminated, _, _ = env.step(cont_action)
    disc_state = tuple(discretize_state(state))
        
    if terminated:
        break
        
    disc_action = pi[disc_state] # exploitative action
    
env.close()

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()