<a href="https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/5_custom_gym_env.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines3 Tutorial - Creating a custom Gym environment

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3-Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo


## Introduction

In this notebook, you will learn how to use your own environment following the OpenAI Gym interface.
Once it is done, you can easily use any compatible (depending on the action space) RL algorithm from Stable Baselines on that environment.

## Install Dependencies and Stable Baselines3 Using Pip



In [1]:
# for autoformatting
# %load_ext jupyter_black

In [2]:
!pip install "stable-baselines3[extra]>=2.0.0a4"
from stable_baselines3 import PPO, A2C, DQN
from stable_baselines3.common.env_util import make_vec_env



## First steps with the gym interface

As you have noticed in the previous notebooks, an environment that follows the gym interface is quite simple to use.
It provides to this user mainly three methods, which have the following signature (for gym versions > 0.26)
- `reset()` called at the beginning of an episode, it returns an observation and a dictionary with additional info (defaults to an empty dict)
- `step(action)` called to take an action with the environment, it returns the next observation, the immediate reward, whether new state is a terminal state (episode is finished), whether the max number of timesteps is reached (episode is artificially finished), and additional information
- (Optional) `render()` which allow to visualize the agent in action. Note that graphical interface does not work on google colab, so we cannot use it directly (we have to rely on `render_mode='rbg_array'` to retrieve an image of the scene).

Under the hood, it also contains two useful properties:
- `observation_space` which one of the gym spaces (`Discrete`, `Box`, ...) and describe the type and shape of the observation
- `action_space` which is also a gym space object that describes the action space, so the type of action that can be taken

The best way to learn about [gym spaces](https://gymnasium.farama.org/api/spaces/) is to look at the [source code](https://github.com/Farama-Foundation/Gymnasium/tree/main/gymnasium/spaces), but you need to know at least the main ones:
- `gym.spaces.Box`: A (possibly unbounded) box in $R^n$. Specifically, a Box represents the Cartesian product of n closed intervals. Each interval has the form of one of [a, b], (-oo, b], [a, oo), or (-oo, oo). Example: A 1D-Vector or an image observation can be described with the Box space.
```python
# Example for using image as input:
observation_space = spaces.Box(low=0, high=255, shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
```                                       

- `gym.spaces.Discrete`: A discrete space in $\{ 0, 1, \dots, n-1 \}$
  Example: if you have two actions ("left" and "right") you can represent your action space using `Discrete(2)`, the first action will be 0 and the second 1.


[Documentation on custom env](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html)

Also keep in mind that Stabe-baselines internally uses the previous gym API (<0.26), so every VecEnv returns only the observation after resetting and returns a 4-tuple instead of a 5-tuple  (terminated & truncated are already combined to done).

In [3]:
import gymnasium as gym

env = gym.make("CartPole-v1")

# Box(4,) means that it is a Vector with 4 components
print("Observation space:", env.observation_space)
print("Shape:", env.observation_space.shape)
# Discrete(2) means that there is two discrete actions
print("Action space:", env.action_space)

# The reset method is called at the beginning of an episode
obs, info = env.reset()
# Sample a random action
action = env.action_space.sample()
print("Sampled action:", action)
obs, reward, terminated, truncated, info = env.step(action)
# Note the obs is a numpy array
# info is an empty dict for now but can contain any debugging info
# reward is a scalar
print(obs.shape, reward, terminated, truncated, info)

Observation space: Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
Shape: (4,)
Action space: Discrete(2)
Sampled action: 0
(4,) 1.0 False False {}


##  Gym env skeleton

In practice this is how a gym environment looks like.
Here, we have implemented a simple grid world were the agent must learn to go always left.

In [4]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class GoLeftEnv(gym.Env):
    """
    Custom Environment that follows gym interface.
    This is a simple env where the agent must learn to go always left.
    """

    # Because of google colab, we cannot implement the GUI ('human' render mode)
    metadata = {"render_modes": ["console"]}

    # Define constants for clearer code
    LEFT = 0
    RIGHT = 1

    def __init__(self, grid_size=10, render_mode="console"):
        super(GoLeftEnv, self).__init__()
        self.render_mode = render_mode

        # Size of the 1D-grid
        self.grid_size = grid_size
        # Initialize the agent at the right of the grid
        self.agent_pos = grid_size - 1

        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions, we have two: left and right
        n_actions = 2
        self.action_space = spaces.Discrete(n_actions)
        # The observation will be the coordinate of the agent
        # this can be described both by Discrete and Box space
        self.observation_space = spaces.Box(
            low=0, high=self.grid_size, shape=(1,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        """
        Important: the observation must be a numpy array
        :return: (np.array)
        """
        super().reset(seed=seed, options=options)
        # Initialize the agent at the right of the grid
        self.agent_pos = self.grid_size - 1
        # here we convert to float32 to make it more general (in case we want to use continuous actions)
        return np.array([self.agent_pos]).astype(np.float32), {}  # empty info dict

    def step(self, action):
        if action == self.LEFT:
            self.agent_pos -= 1
        elif action == self.RIGHT:
            self.agent_pos += 1
        else:
            raise ValueError(
                f"Received invalid action={action} which is not part of the action space"
            )

        # Account for the boundaries of the grid
        self.agent_pos = np.clip(self.agent_pos, 0, self.grid_size)

        # Are we at the left of the grid?
        terminated = bool(self.agent_pos == 0)
        truncated = False  # we do not limit the number of steps here

        # Null reward everywhere except when reaching the goal (left of the grid)
        reward = 1 if self.agent_pos == 0 else 0

        # Optionally we can pass additional info, we are not using that for now
        info = {}

        return (
            np.array([self.agent_pos]).astype(np.float32),
            reward,
            terminated,
            truncated,
            info,
        )

    def render(self):
        # agent is represented as a cross, rest as a dot
        if self.render_mode == "console":
            print("." * self.agent_pos, end="")
            print("x", end="")
            print("." * (self.grid_size - self.agent_pos))

    def close(self):
        pass

### Validate the environment

Stable Baselines3 provides a [helper](https://stable-baselines3.readthedocs.io/en/master/common/env_checker.html) to check that your environment follows the Gym interface. It also optionally checks that the environment is compatible with Stable-Baselines (and emits warning if necessary).

In [5]:
from stable_baselines3.common.env_checker import check_env

In [6]:
env = GoLeftEnv()
# If the environment don't follow the interface, an error will be thrown
check_env(env, warn=True)

### Testing the environment

In [7]:
env = GoLeftEnv(grid_size=10)

obs, _ = env.reset()
env.render()

print(env.observation_space)
print(env.action_space)
print(env.action_space.sample())

GO_LEFT = 0
# Hardcoded best agent: always go left!
n_steps = 20
for step in range(n_steps):
    print(f"Step {step + 1}")
    obs, reward, terminated, truncated, info = env.step(GO_LEFT)
    done = terminated or truncated
    print("obs=", obs, "reward=", reward, "done=", done)
    env.render()
    if done:
        print("Goal reached!", "reward=", reward)
        break

.........x.
Box(0.0, 10.0, (1,), float32)
Discrete(2)
1
Step 1
obs= [8.] reward= 0 done= False
........x..
Step 2
obs= [7.] reward= 0 done= False
.......x...
Step 3
obs= [6.] reward= 0 done= False
......x....
Step 4
obs= [5.] reward= 0 done= False
.....x.....
Step 5
obs= [4.] reward= 0 done= False
....x......
Step 6
obs= [3.] reward= 0 done= False
...x.......
Step 7
obs= [2.] reward= 0 done= False
..x........
Step 8
obs= [1.] reward= 0 done= False
.x.........
Step 9
obs= [0.] reward= 1 done= True
x..........
Goal reached! reward= 1


### Try it with Stable-Baselines

Once your environment follow the gym interface, it is quite easy to plug in any algorithm from stable-baselines

In [8]:
from stable_baselines3 import PPO, A2C, DQN
from stable_baselines3.common.env_util import make_vec_env

# Instantiate the env
vec_env = make_vec_env(GoLeftEnv, n_envs=1, env_kwargs=dict(grid_size=10))

In [9]:
# Train the agent
model = A2C("MlpPolicy", env, verbose=1).learn(5000)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 14.9     |
|    ep_rew_mean        | 1        |
| time/                 |          |
|    fps                | 2258     |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.29    |
|    explained_variance | 0.53     |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 0.0718   |
|    value_loss         | 0.0177   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 12.1     |
|    ep_rew_mean        | 1        |
| time/                 |          |
|    fps                | 2804     |
|    iterations         | 200      |
|    time_elapsed 

In [10]:
# Test the trained agent
# using the vecenv
obs = vec_env.reset()
n_steps = 20
for step in range(n_steps):
    action, _ = model.predict(obs, deterministic=True)
    print(f"Step {step + 1}")
    print("Action: ", action)
    obs, reward, done, info = vec_env.step(action)
    print("obs=", obs, "reward=", reward, "done=", done)
    vec_env.render()
    if done:
        # Note that the VecEnv resets automatically
        # when a done signal is encountered
        print("Goal reached!", "reward=", reward)
        break

Step 1
Action:  [0]
obs= [[8.]] reward= [0.] done= [False]
........x..
Step 2
Action:  [0]
obs= [[7.]] reward= [0.] done= [False]
.......x...
Step 3
Action:  [0]
obs= [[6.]] reward= [0.] done= [False]
......x....
Step 4
Action:  [0]
obs= [[5.]] reward= [0.] done= [False]
.....x.....
Step 5
Action:  [0]
obs= [[4.]] reward= [0.] done= [False]
....x......
Step 6
Action:  [0]
obs= [[3.]] reward= [0.] done= [False]
...x.......
Step 7
Action:  [0]
obs= [[2.]] reward= [0.] done= [False]
..x........
Step 8
Action:  [0]
obs= [[1.]] reward= [0.] done= [False]
.x.........
Step 9
Action:  [0]
obs= [[9.]] reward= [1.] done= [ True]
.........x.
Goal reached! reward= [1.]


## It is your turn now, be creative!

As an exercise, that's now your turn to build a custom gym environment.
There is no constrain about what to do, be creative! (but not too creative, there is not enough time for that)

If you don't have any idea, here is is a list of the environment you can implement:
- Transform the discrete grid world to a continuous one, you will need to change a bit the logic and the action space
- Create a 2D grid world and add walls
- Create a tic-tac-toe game


In [11]:
# Tic Tac Toe environment

class TicTacToeEnv(gym.Env):
    """
    Custom Environment that follows gym interface.
    This is a simple env where the agent must learn to play tic tac toe
    """

    # Because of google colab, we cannot implement the GUI ('human' render mode)
    metadata = {"render_modes": ["console"]}
    AGENT_TURN = 1
    RANDOM_TURN = 2

    def __init__(self, grid_size=3, render_mode="console"):
        super(TicTacToeEnv, self).__init__()
        self.render_mode = render_mode
        self.turn = np.random.choice([self.AGENT_TURN, self.RANDOM_TURN])
        self.grid_size = grid_size
        self.observation_space = spaces.Box(
            low=0, high=2, shape=(grid_size * grid_size,), dtype=np.int8 #flattened board instead of nxn
        )
        self.board = np.zeros((grid_size * grid_size,), dtype=np.int8)
        # action space is picking one of the boxes
        self.action_space = spaces.Discrete(grid_size * grid_size)


    def reset(self, seed=None, options=None):
        """
        Important: the observation must be a numpy array
        :return: (np.array)
        """
        super().reset(seed=seed, options=options)
        self.board = np.zeros((self.grid_size*self.grid_size,), dtype=np.int8)
        self.turn = np.random.choice([self.AGENT_TURN, self.RANDOM_TURN]) #doesn't do shit right now
        return np.array(self.board),{}

    def random_step(self):
        valid_moves = np.where(self.board == 0)[0]
        rand_action = np.random.choice(valid_moves)
        # random step move
        self.board[rand_action] = 2
        board2d = self.board.reshape((self.grid_size,self.grid_size))
        truncated = bool(self.board.flatten().all()) # truncated if all cells are filled

        if (np.any(np.all(board2d == 2, axis=1)) or  # rows
            np.any(np.all(board2d == 2, axis=0)) or  # columns
            np.all(np.diag(board2d) == 2) or         # diagonal
            np.all(np.diag(np.fliplr(board2d)) == 2)): # anti-diagonal
            return np.array(self.board), -1, True, False, {}
        # ----------
        return np.array(self.board), 0, False, truncated, {}

    def agent_step(self,action):
        if (self.board[action]!=0):
            return(
                True,
                self.board,
                -3,
                True, # end the episode if an invalid action is made
                False,
                {"invalid_action":True}
            )
        #Agent makes moves
        self.board[action] = 1 
        # agent action over
        board2d = self.board.reshape((self.grid_size,self.grid_size))
        if (np.any(np.all(board2d == 1, axis=1)) or  # rows
            np.any(np.all(board2d == 1, axis=0)) or  # columns
            np.all(np.diag(board2d) == 1) or         # diagonal
            np.all(np.diag(np.fliplr(board2d)) == 1)): # anti-diagonal
            return True, np.array(self.board), 1, True, False, {}

        reward = 0
        truncated = bool(self.board.flatten().all()) # truncated if all cells are filled
        terminated = False
        # check if the game is over
        if truncated:
            return True, np.array(self.board), reward, terminated, truncated, {}

        # first element False: doesn't want step function to return anything
        return False, np.array(self.board), reward, terminated, truncated, {}



    def step(self, action):
        # such a horrible solution
        # deal with first turn randomness
        # let random agent act first half the time
        if self.turn == self.RANDOM_TURN:
            self.turn = self.AGENT_TURN
            return self.random_step(action)

        return_val, board, reward, terminated, truncated ,info = self.agent_step(action)
        if return_val:
            return board, reward, terminated, truncated, info 
        
        # RANDOM TURN
        return self.random_step()


    def render(self):
        board2d = self.board.reshape((self.grid_size,self.grid_size))
        for i in range(self.grid_size):
            for j in range(self.grid_size):
              if j == self.grid_size - 1:
                  print(board2d[i,j])
              else:
                  print(board2d[i,j],end="|")             
            if i != self.grid_size - 1:
                print("-"*(2*self.grid_size-1))

    def close(self):
        pass

In [12]:
GRID_SIZE = 3
env = TicTacToeEnv(grid_size=GRID_SIZE)
check_env(env,warn=True)

vec_env = make_vec_env(TicTacToeEnv, n_envs=1, env_kwargs=dict(grid_size=GRID_SIZE))

TypeError: TicTacToeEnv.random_step() takes 1 positional argument but 2 were given

In [30]:
model = PPO("MlpPolicy", env, verbose=1).learn(100000)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 3.49     |
|    ep_rew_mean     | -2.5     |
| time/              |          |
|    fps             | 6547     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 3.5         |
|    ep_rew_mean          | -2.36       |
| time/                   |             |
|    fps                  | 4837        |
|    iterations           | 2           |
|    time_elapsed         | 0           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.016748298 |
|    clip_fraction        | 0.13        |
|    clip_range           | 0.2         |
|    entropy_loss   

In [26]:
obs = vec_env.reset()

action , _ = model.predict(obs,deterministic=True)
obs, reward, done, info = vec_env.step(action)
vec_env.render()
print("reward=", reward, "done", done)

0|0|0
-----
0|1|0
-----
0|0|2
reward= [0.] done [False]


In [27]:
action , _ = model.predict(obs,deterministic=True)
obs, reward, done, info = vec_env.step(action)
vec_env.render()
print("reward=", reward, "done", done)

0|0|1
-----
0|1|2
-----
0|0|2
reward= [0.] done [False]


In [28]:
action , _ = model.predict(obs,deterministic=True)
obs, reward, done, info = vec_env.step(action)
vec_env.render()
print("reward=", reward, "done", done)

0|0|0
-----
0|0|0
-----
0|0|0
reward= [1.] done [ True]


In [21]:
action , _ = model.predict(obs,deterministic=True)
obs, reward, done, info = vec_env.step(action)
vec_env.render()
print("reward=", reward, "done", done)

0|0|0
-----
0|0|0
-----
0|0|0
reward= [1.] done [ True]


In [43]:
action , _ = model.predict(obs,deterministic=True)
obs, reward, done, info = vec_env.step(action)
vec_env.render()
print("reward=", reward, "done", done)

2|0|0|0|2
---------
0|1|1|1|1
---------
0|1|0|2|0
---------
0|2|0|0|0
---------
0|0|0|2|0
reward= [0.] done [False]


In [44]:
action , _ = model.predict(obs,deterministic=True)
obs, reward, done, info = vec_env.step(action)
vec_env.render()
print("reward=", reward, "done", done)

0|0|0|0|0
---------
0|0|0|0|0
---------
0|0|0|0|0
---------
0|0|0|0|0
---------
0|0|0|0|0
reward= [1.] done [ True]


In [33]:
total_trials = 1000
num_wins = 0
for i in range(total_trials):
    obs = vec_env.reset()
    done = False
    while not done:
        action , _ = model.predict(obs)
        obs,reward,done,info = vec_env.step(action)
    if reward == 1:
        num_wins += 1
print("num wins for 3x3:", num_wins)

num wins for 3x3: 853


In [49]:
vec_env.reset()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0]], dtype=int8)

In [51]:
# 5x5
total_trials = 1000
num_wins = 0
for i in range(total_trials):
    obs = vec_env.reset()
    done = False
    while not done:
        action , _ = model.predict(obs)
        obs,reward,done,info = vec_env.step(action)
    if reward == 1:
        num_wins += 1
print("num wins:", num_wins)

num wins: 744


how is it even possible that my win rate is 75% but my average episode reward is 0

In [21]:
# 4x4
total_trials = 1000
num_wins = 0
for i in range(total_trials):
    obs = vec_env.reset()
    done = False
    while not done:
        action , _ = model.predict(obs)
        obs,reward,done,info = vec_env.step(action)
    if reward == 1:
        num_wins += 1
print("num wins:", num_wins)

num wins: 892


- Can't debug because fucking vec_env resets when done. Bad because I can't see the last position
- want to implement two agents playing together
- current implementation of the env is probably wrong? embedding a random opponent into the env seems weird