Our custom environment will inherit from the abstract class gym.Env. Don’t forget to add the metadata attribute to your class. There, you should specify the render modes that your environment supports (e.g. "human", "rgb_array", "ansi") and the frame rate at which it should be rendered. Every environment should support None as a render mode; you don’t need to list it in the metadata. In GridWorldEnv, we will support the modes “rgb_array” and “human” and render at 4 FPS.
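
As a rough illustration of how the metadata can be used, the sketch below checks a requested render mode against the declared modes and derives the time between frames. The names METADATA, validate_render_mode and frame_interval_seconds are hypothetical helpers written for this example; they are not part of gym.

```python
# Hypothetical stand-in for the class attribute described above.
METADATA = {"render_modes": ["human", "rgb_array"], "render_fps": 4}


def validate_render_mode(render_mode, metadata=METADATA):
    """Return render_mode if it is None or listed in the metadata, else raise."""
    if render_mode is not None and render_mode not in metadata["render_modes"]:
        raise ValueError(f"Unsupported render mode: {render_mode!r}")
    return render_mode


def frame_interval_seconds(metadata=METADATA):
    """Time between rendered frames implied by render_fps."""
    return 1.0 / metadata["render_fps"]
```

With the metadata above, None, "human" and "rgb_array" are accepted, while "ansi" is rejected because GridWorldEnv does not declare it.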

The __init__ method of our environment will accept the integer size, which determines the size of the square grid. We will set up some variables for rendering and define self.observation_space and self.action_space. In our case, observations should provide information about the location of the agent and target on the 2-dimensional grid. We will represent observations as dictionaries with keys "agent" and "target". An observation may look like {"agent": array([1, 0]), "target": array([0, 3])}. Since we have 4 actions in our environment (“right”, “up”, “left”, “down”), we will use Discrete(4) as an action space. The declaration of GridWorldEnv and the implementation of __init__ appear in the code cell below.

The reset method will be called to initiate a new episode. You may assume that the step method will not be called before reset has been called. Moreover, reset should be called whenever a done signal (terminated or truncated) has been issued. Users may pass the seed keyword to reset to initialize any random number generator that is used by the environment to a deterministic state. It is recommended to use the random number generator self.np_random that is provided by the environment’s base class, gym.Env. If you only use this RNG, you do not need to worry much about seeding, but you need to remember to call super().reset(seed=seed) to make sure that gym.Env correctly seeds the RNG. Once this is done, we can randomly set the state of our environment. In our case, we randomly choose the agent’s location and then repeatedly sample the target’s position until it does not coincide with the agent’s position.
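
The sampling procedure described above can be sketched outside of gym as follows. This is only an illustration: it uses np.random.default_rng as a stand-in for self.np_random, and sample_positions is a hypothetical helper name. It assumes size is at least 2, since otherwise no distinct target cell exists and the loop would never end.

```python
import numpy as np


def sample_positions(rng, size):
    """Sample distinct agent and target cells on a size x size grid."""
    agent = rng.integers(0, size, size=2, dtype=int)
    # Resample the target until it differs from the agent's cell
    target = agent
    while np.array_equal(target, agent):
        target = rng.integers(0, size, size=2, dtype=int)
    return agent, target


# Two generators seeded identically produce the same initial state
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
agent_a, target_a = sample_positions(rng_a, size=5)
agent_b, target_b = sample_positions(rng_b, size=5)
```

Because both generators start from the same seed, the two calls yield identical agent and target positions, which is the determinism that passing seed to reset is meant to provide.
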
The reset method should return a tuple of the initial observation and some auxiliary information. We can use the helper methods _get_obs and _get_info for that:

In [3]:
import gym
from gym import spaces
import pygame
import numpy as np

class GridWorldEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 4}

    def __init__(self, render_mode=None, size=5):
        self.size = size  # The size of the square grid
        self.window_size = 512  # The size of the PyGame window

        # Observations are dictionaries with the agent's and the target's location.
        # Each location is encoded as an element of {0, ..., `size` - 1}^2, i.e. MultiDiscrete([size, size]).
        self.observation_space = spaces.Dict(
            {
                "agent": spaces.Box(0, size - 1, shape=(2,), dtype=int),
                "target": spaces.Box(0, size - 1, shape=(2,), dtype=int),
            }
        )

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = spaces.Discrete(4)

        # The following dictionary maps abstract actions from `self.action_space` to
        # the direction we will walk in if that action is taken.
        # I.e. 0 corresponds to "right", 1 to "up" etc.
        self._action_to_direction = {
            0: np.array([1, 0]),
            1: np.array([0, 1]),
            2: np.array([-1, 0]),
            3: np.array([0, -1]),
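        }

        # Validate and store the render mode (None means no rendering)
        assert render_mode is None or render_mode in self.metadata["render_modes"]
        self.render_mode = render_mode

    def _get_obs(self):
        # Build the observation dictionary from the current state
        return {"agent": self._agent_location, "target": self._target_location}

    def _get_info(self):
        # Auxiliary information (assumed here: the Manhattan distance
        # between the agent and the target)
        return {
            "distance": np.linalg.norm(
                self._agent_location - self._target_location, ord=1
            )
        }
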
    def reset(self, seed=None, options=None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Choose the agent's location uniformly at random
        self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        # We will sample the target's location randomly until it does not coincide with the agent's location
        self._target_location = self._agent_location
        while np.array_equal(self._target_location, self._agent_location):
            self._target_location = self.np_random.integers(
                0, self.size, size=2, dtype=int
            )

        observation = self._get_obs()
        info = self._get_info()

        return observation, info # initial state
    
    def step(self, action):
        # Map the action (element of {0,1,2,3}) to the direction we walk in
        direction = self._action_to_direction[action]
        # We use `np.clip` to make sure we don't leave the grid
        self._agent_location = np.clip(
            self._agent_location + direction, 0, self.size - 1
        )
        # An episode is done iff the agent has reached the target
        terminated = np.array_equal(self._agent_location, self._target_location)
        reward = 1 if terminated else 0  # Binary sparse rewards
        observation = self._get_obs()
        info = self._get_info()

        return observation, reward, terminated, False, info # new state, reward

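The movement rule used in step can be tried in isolation. The sketch below is a standalone illustration, not part of the environment: ACTION_TO_DIRECTION mirrors self._action_to_direction, and move is a hypothetical helper that applies one action and clips the result to the grid.

```python
import numpy as np

# Mirrors self._action_to_direction: 0 = right, 1 = up, 2 = left, 3 = down
ACTION_TO_DIRECTION = {
    0: np.array([1, 0]),
    1: np.array([0, 1]),
    2: np.array([-1, 0]),
    3: np.array([0, -1]),
}


def move(location, action, size):
    """Apply one action and clip the result so the agent stays on the grid."""
    direction = ACTION_TO_DIRECTION[action]
    return np.clip(location + direction, 0, size - 1)


inside = move(np.array([1, 0]), 1, size=5)    # a regular step "up"
at_wall = move(np.array([4, 0]), 0, size=5)   # "right" at the right edge
at_floor = move(np.array([0, 0]), 3, size=5)  # "down" at the bottom edge
```

A move that would leave the grid leaves the agent where it is, because np.clip caps each coordinate at 0 and size - 1.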

In [4]:
from gym.envs.registration import register

register(
    id='gym_examples/GridWorld-v0',
    entry_point='gym_examples.envs:GridWorldEnv',
    max_episode_steps=300,
)