# Tutorial 4 - Intro to RL

In this tutorial, you will learn the basics of reinforcement learning (RL). We use Stable-Baselines3 for the RL algorithms, so the only thing we implement ourselves is the environment.

We will use a 1D spin chain with $N$ lattice sites as an example:
* States: The spins at each lattice site can point either up or down
* Actions: Flip the orientation of a spin at site $n$, $1\leq n \leq N$
* Terminal state: The minimal energy configuration, which is obtained when all spins are aligned
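* Reward: the negative energy $-E$ of the configuration, so high-energy configurations are penalized

The energy is
$$ E = -J \sum_{i=1}^{N-1} s_i s_{i+1}$$
where $J\in\mathbb{R}$ measures the coupling strength (we use $J>0$, so aligned spins minimize $E$) and $s_i\in\{-1, +1\}$ encodes spin up/down. The code below returns $J\sum_i s_i s_{i+1} = -E$ as the reward.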
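As a quick numerical check of the sum $J\sum_i s_i s_{i+1}$ in the formula above, here is a minimal sketch for a chain of four spins. The helper names `coupling_sum` and `count_domain_walls` are illustrative and not part of the notebook. The sketch also shows that every pair of anti-aligned neighbours (a domain wall) lowers the sum by $2J$, so for $W$ walls the sum equals $J(N-1-2W)$.

```python
import numpy as np

def coupling_sum(spins, J=1.0):
    """Return J * sum_i s_i * s_{i+1} for an open chain of +/-1 spins."""
    spins = np.asarray(spins, dtype=float)
    return float(J * np.sum(spins[:-1] * spins[1:]))

def count_domain_walls(spins):
    """Count neighbouring pairs whose spins point in opposite directions."""
    spins = np.asarray(spins)
    return int(np.sum(spins[:-1] != spins[1:]))

aligned = [+1, +1, +1, +1]       # 0 walls, sum =  3.0
one_wall = [+1, +1, -1, -1]      # 1 wall,  sum =  1.0
alternating = [+1, -1, +1, -1]   # 3 walls, sum = -3.0

for chain in (aligned, one_wall, alternating):
    n_bonds = len(chain) - 1
    walls = count_domain_walls(chain)
    # the sum only depends on the number of domain walls
    assert coupling_sum(chain) == n_bonds - 2 * walls
    print(chain, coupling_sum(chain), walls)
```

The magnitude of the sum is largest for the aligned chain, which is why the two fully aligned configurations serve as terminal states.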

In [1]:
import gymnasium as gym
import numpy as np

A Gymnasium environment needs to implement four methods:
* step(): Updates the environment with an action and returns the next agent observation, the reward for taking that action, whether the environment has terminated or been truncated due to the latest action, and information from the environment about the step, e.g. metrics or debug info.
* reset(): Resets the environment to an initial state; required before calling step(). Returns the first agent observation for an episode and information, e.g. metrics or debug info.
* render(): Renders the environment to help visualise what the agent sees. Example modes are “human”, “rgb_array”, and “ansi” for text.
* close(): Closes the environment; important when external software is used, e.g. pygame for rendering, or databases.

Also, the class should have two members:
* action_space
* observation_space

## 1.) Define the environment

In [2]:
class SpinChain(gym.Env):
    render_modes = ["ansi"]
    metadata = {"render_modes": render_modes, "render_fps": 1}
    
    def __init__(self, N, J=1, render_mode="ansi"):
        super().__init__()
        self.actions = []
        self.J = J
        self.N = N
        self.state = np.random.randint(low=0, high=2, size=self.N)
        self.action_space = gym.spaces.Discrete(self.N)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(self.N,), dtype=np.float32)
        self.render_mode = render_mode
    
    def step(self, action):
        self.actions.append(action)
        # carry out action: Flip spin at site "action"
        self.state[action] = (self.state[action] + 1) % 2
        reward, terminated, truncated, info = self.reward(), False, False, {}

        # There are two terminal states: all spins up or all spins down
        if all(self.state.astype(bool)) or not any(self.state.astype(bool)):
            terminated = True
            info = {"message": f"Found the minimum energy configuration with actions {self.actions}."}
        
        # We truncate the game after 300 steps
        if len(self.actions) == 300:
            truncated = True
            info = {"message": "Ended the episode after 300 steps."}
        
        return self.prepare_state(), reward, terminated, truncated, info

    def reset(self, seed=None, options=None):
        # seed Gymnasium's RNG (self.np_random) so that reset(seed=...) is reproducible
        super().reset(seed=seed)
        self.state = self.np_random.integers(low=0, high=2, size=self.N)
        self.actions = []
        info = {"message": f"Reset to start state {self.render()} with energy {self.reward()}"}
        return self.prepare_state(), info

    def render(self):
        state_dict = {0: "↑", 1: "↓"}
        return " ".join([state_dict[x] for x in self.state])
        
    def close(self):
        pass
    
    def reward(self):
        # map state 0 (spin up) to +1 and state 1 (spin down) to -1
        spins = np.array([(-1)**x for x in self.state], dtype=np.float32)
        # J * sum_i s_i * s_{i+1} over nearest neighbours; for J > 0 this is largest when all spins are aligned
        coupling_sum = self.J * np.sum(spins[:-1] * spins[1:])
        return coupling_sum

    def prepare_state(self):
        return self.state.astype(np.float32)


## 2.) Investigate the environment

In [3]:
my_env = SpinChain(10)
state, info = my_env.reset()
print(state, info)

# energy of all spins up:
my_env.state = np.array([0] * my_env.N)
print(f"{my_env.render()}: {my_env.reward()}")
# energy of all spins down:
my_env.state = np.array([1] * my_env.N)
print(f"{my_env.render()}: {my_env.reward()}")

[1. 1. 1. 1. 1. 0. 0. 0. 1. 1.] {'message': 'Reset to start state ↓ ↓ ↓ ↓ ↓ ↑ ↑ ↑ ↓ ↓ with energy 5.0'}
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑: 9.0
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 9.0


In [4]:
# walk around in state space by performing a few random actions
my_env.reset()
for _ in range(10):
    action = np.random.randint(my_env.N)
    state, reward, terminated, truncated, info = my_env.step(action)
    print(action, state, reward, terminated, truncated, info)

8 [0. 1. 1. 1. 1. 1. 1. 0. 0. 1.] 3.0 False False {}
7 [0. 1. 1. 1. 1. 1. 1. 1. 0. 1.] 3.0 False False {}
6 [0. 1. 1. 1. 1. 1. 0. 1. 0. 1.] -1.0 False False {}
0 [1. 1. 1. 1. 1. 1. 0. 1. 0. 1.] 1.0 False False {}
7 [1. 1. 1. 1. 1. 1. 0. 0. 0. 1.] 5.0 False False {}
7 [1. 1. 1. 1. 1. 1. 0. 1. 0. 1.] 1.0 False False {}
1 [1. 0. 1. 1. 1. 1. 0. 1. 0. 1.] -3.0 False False {}
0 [0. 0. 1. 1. 1. 1. 0. 1. 0. 1.] -1.0 False False {}
9 [0. 0. 1. 1. 1. 1. 0. 1. 0. 0.] 1.0 False False {}
0 [1. 0. 1. 1. 1. 1. 0. 1. 0. 0.] -1.0 False False {}


## 3.) Connect to Stable-Baselines3

In [6]:
from gymnasium.envs.registration import register
from stable_baselines3.common.callbacks import CheckpointCallback
from sb3_contrib import TRPO

checkpoint_callback = CheckpointCallback(
    save_freq=10000,                # save every 10,000 steps
    save_path="./checkpoints/",    # folder to save models
    name_prefix="trpo_model",      # filename prefix
    save_replay_buffer=False,      
    save_vecnormalize=False        
)

register(
    id='SpinChain-v0',
    entry_point='__main__:SpinChain',  # 'module_path:ClassName'
)

env = gym.make('SpinChain-v0', N=10, render_mode="ansi")

model = TRPO(
    policy="MlpPolicy",
    device='cpu',  # CUDA for TRPO is only recommended for CNN policies
    env=env,
    gamma=.995,
    learning_rate=1e-4,
    verbose=1
)
print("Start training.")
model.learn(total_timesteps=100000, log_interval=4, progress_bar=True, callback=checkpoint_callback)
print("Done training.")
model.save("trpo_spinchain")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.



Start training.
----------------------------------------
| rollout/                  |          |
|    ep_len_mean            | 249      |
|    ep_rew_mean            | 20.8     |
| time/                     |          |
|    fps                    | 2215     |
|    iterations             | 4        |
|    time_elapsed           | 3        |
|    total_timesteps        | 8192     |
| train/                    |          |
|    explained_variance     | 0.0016   |
|    is_line_search_success | 1        |
|    kl_divergence_loss     | 0.00409  |
|    learning_rate          | 0.0001   |
|    n_updates              | 3        |
|    policy_objective       | 0.0178   |
|    value_loss             | 259      |
----------------------------------------
-----------------------------------------
| rollout/                  |           |
|    ep_len_mean            | 244       |
|    ep_rew_mean            | 7.82      |
| time/                     |           |
|    fps                    | 2221  

Done training.


In [8]:
model = TRPO.load("trpo_spinchain", device='cpu')

obs, _ = env.reset()
for _ in range(20):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"{action:2d} {env.render()}: {env.unwrapped.reward()}")
    if terminated or truncated:
        print(info)
        obs, _ = env.reset()

 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↑ ↓ ↑: 1.0
 9 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↑ ↓ ↓: 3.0
 7 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
 3 ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓: 7.0
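
One possible reading of the rollout above is that the policy sits on a reward plateau. The sketch below is not part of the original notebook, and `reward_of` and `flip_gains` are hypothetical helpers. It recomputes the same quantity as `SpinChain.reward()` with plain NumPy and lists how much the reward changes when each site is flipped on its own.

```python
import numpy as np

def reward_of(bits, J=1.0):
    """Same quantity as SpinChain.reward(): bit 0 -> spin +1, bit 1 -> spin -1."""
    spins = 1.0 - 2.0 * np.asarray(bits, dtype=float)
    return float(J * np.sum(spins[:-1] * spins[1:]))

def flip_gains(bits, J=1.0):
    """Reward change caused by flipping each site on its own."""
    bits = np.asarray(bits)
    base = reward_of(bits, J)
    gains = []
    for site in range(len(bits)):
        flipped = bits.copy()
        flipped[site] = 1 - flipped[site]  # flip a single spin
        gains.append(reward_of(flipped, J) - base)
    return gains

# ↑ ↑ ↑ ↑ ↓ ↓ ↓ ↓ ↓ ↓, one of the two states the rollout alternates between
stuck = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(reward_of(stuck))   # 7.0, as in the rollout
print(flip_gains(stuck))  # [-2.0, -4.0, -4.0, 0.0, 0.0, -4.0, -4.0, -4.0, -4.0, -2.0]
```

From this state no single flip raises the reward. The two zero-gain moves (sites 3 and 4) only shift the domain wall by one site, which is consistent with the repeated flipping of site 3 in the output. Whether this is the actual cause of the trained policy's behaviour is an assumption; reaching an aligned chain from here requires several consecutive flips with no immediate reward gain.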


# Now it's your turn

Below are some suggestions for what you could do. Pick the one (or ones) that interest you most, or just play with the notebook and investigate your own questions.

## Exercise 1: Play with the environment

* Play with spin chains of different lengths
* Implement curriculum learning, where instead of getting a random spin chain the reset function generates spin chains with 1 spin flipped, then with 2, etc.
* Change the action space such that we have 2N actions, where the first N actions set the nth spin to up and the next N actions set the nth spin to down
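
As a starting point for the last two suggestions, here is a minimal sketch of two helpers that could be called from a modified `reset()` and `step()`. The names `curriculum_start` and `decode_set_action` are hypothetical, and the action layout (first $N$ actions set a spin up, next $N$ set it down) follows the description in the list above. This is one way to structure it, not a reference solution.

```python
import numpy as np

def curriculum_start(n_sites, n_flipped, rng=None):
    """Return an aligned chain (0 = up, 1 = down) with n_flipped distinct sites flipped.

    Assumes 1 <= n_flipped <= n_sites - 1; flipping all sites would give an
    aligned (terminal) chain again.
    """
    rng = np.random.default_rng() if rng is None else rng
    # start from all up or all down, chosen at random
    state = np.full(n_sites, rng.integers(0, 2), dtype=np.int64)
    # pick distinct sites so that exactly n_flipped spins differ from the rest
    sites = rng.choice(n_sites, size=n_flipped, replace=False)
    state[sites] = 1 - state[sites]
    return state

def decode_set_action(action, n_sites):
    """Map an action in [0, 2N) to (site, value): value 0 sets up, value 1 sets down."""
    site = action % n_sites
    value = action // n_sites
    return site, value

rng = np.random.default_rng(0)
start = curriculum_start(10, 2, rng)
print(start)
print(decode_set_action(3, 10), decode_set_action(13, 10))  # (3, 0) (3, 1)
```

In a curriculum, `n_flipped` would be increased over the course of training, for example once the agent reliably solves the current difficulty. With the 2N action space, the matching `action_space` would be `gym.spaces.Discrete(2 * N)`, and `step()` would assign `state[site] = value` instead of flipping.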

## Exercise 2: Try other algorithms

Try other RL algorithms implemented in Stable-Baselines3.

## Exercise 3: 2D spin lattice

Modify the environment to a 2D lattice and use a CNN for the policy.
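
For the 2D case, the nearest-neighbour sum runs over horizontal and vertical pairs. The sketch below assumes open boundaries (no wrap-around) and spins stored as a 2D array of ±1. The function name `lattice_coupling_sum` is illustrative and not part of the notebook.

```python
import numpy as np

def lattice_coupling_sum(spins, J=1.0):
    """J times the sum of s_i * s_j over nearest-neighbour pairs on an open 2D grid."""
    spins = np.asarray(spins, dtype=float)
    horizontal = np.sum(spins[:, :-1] * spins[:, 1:])  # left-right neighbours
    vertical = np.sum(spins[:-1, :] * spins[1:, :])    # up-down neighbours
    return float(J * (horizontal + vertical))

aligned = np.ones((3, 3))
checkerboard = np.array([[1, -1, 1],
                         [-1, 1, -1],
                         [1, -1, 1]])

print(lattice_coupling_sum(aligned))       # 12.0 (6 horizontal + 6 vertical bonds)
print(lattice_coupling_sum(checkerboard))  # -12.0 (every bond is anti-aligned)
```

A 2D version of the environment could use this in place of the 1D sum in `reward()`, with an action selecting one of the lattice sites to flip.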