# Make your own custom environment

This page gives an overview of creating new environments, along with the
wrappers, utilities and tests that Gymnasium provides for that purpose.


## Setup

### Recommended solution

1. Install ``pipx`` following the [pipx documentation](https://pypa.github.io/pipx/installation/).
2. Then install Copier:

In [1]:
!py -m pip install --user pipx

Collecting pipx
  Downloading pipx-1.7.1-py3-none-any.whl.metadata (18 kB)
Collecting argcomplete>=1.9.4 (from pipx)
  Downloading argcomplete-3.5.1-py3-none-any.whl.metadata (16 kB)
Collecting userpath!=1.9,>=1.6 (from pipx)
  Downloading userpath-1.9.2-py3-none-any.whl.metadata (3.0 kB)
Downloading pipx-1.7.1-py3-none-any.whl (78 kB)
Downloading argcomplete-3.5.1-py3-none-any.whl (43 kB)
Downloading userpath-1.9.2-py3-none-any.whl (9.1 kB)
Installing collected packages: argcomplete, userpath, pipx
Successfully installed argcomplete-3.5.1 pipx-1.7.1 userpath-1.9.2


In [4]:
!pipx install copier
!pip install copier
!pipx ensurepath

'copier' already seems to be installed. Not modifying existing installation in
'C:\Users\gaojin\pipx\venvs\copier'. Pass '--force' to force installation.
Defaulting to user installation because normal site-packages is not writeable
C:\Users\gaojin\AppData\Roaming\Python\Python312\Scripts is already in PATH.
Success! Added C:\Users\gaojin\.local\bin to the PATH environment variable.

Consider adding shell completions for pipx. Run 'pipx completions' for
instructions.

You will need to open a new terminal or re-login for the PATH changes to take
effect. Alternatively, you can source your shell's config file with e.g.
'source ~/.bashrc'.

Otherwise pipx is ready to go! ✨ 🌟 ✨


In [9]:
!copier --version

copier 9.4.1


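Then run the following command, replacing the string ``path/to/directory`` with the path to the directory where you want to create the new project (the cell below uses ``./learn/my_env``).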

In [None]:
'''
This command can only be run in a terminal
'''
!copier copy https://github.com/Farama-Foundation/gymnasium-env-template.git "./learn/my_env"

## Subclassing gymnasium.Env

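To illustrate the process of subclassing ``gymnasium.Env``, we will implement a very simple game called ``GridWorldEnv``. We will write the code for our custom environment in ``gymnasium_env/envs/grid_world.py``. The environment consists of a 2-dimensional square grid of fixed size (specified via the ``size`` parameter during construction). The agent can move vertically or horizontally between grid cells in each timestep. The goal of the agent is to navigate to a target on the grid that has been placed randomly at the beginning of the episode.

-  Observations provide the locations of the target and the agent.
-  There are 4 actions in our environment, corresponding to the movements "right", "up", "left", and "down".
-  A done signal is issued as soon as the agent has navigated to the grid cell where the target is located.
-  Rewards are binary and sparse: the immediate reward is always 0, except when the agent has reached the target, in which case it is 1.

Let us look at the source code of ``GridWorldEnv`` piece by piece: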


### Declaration and Initialization

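Our custom environment will inherit from the abstract class ``gymnasium.Env``. Do not forget to add the ``metadata`` attribute to your class. There, you should specify the render modes that are supported by your environment (e.g., ``"human"``, ``"rgb_array"``, ``"ansi"``) and the framerate at which your environment should be rendered. Every environment should support ``None`` as a render mode; you do not need to add it to the metadata. In ``GridWorldEnv``, we will support the modes "rgb_array" and "human" and render at 4 FPS.

The ``__init__`` method of our environment will accept the integer ``size``, which determines the size of the square grid. We will set up some variables for rendering and define ``self.observation_space`` and ``self.action_space``. In our case, observations should provide information about the location of the agent and the target on the 2-dimensional grid. We will represent observations as dictionaries with the keys ``"agent"`` and ``"target"``. An observation may look like ``{"agent": array([1, 0]), "target": array([0, 3])}``. Since we have 4 actions in our environment ("right", "up", "left", "down"), we will use ``Discrete(4)`` as the action space. Here is the declaration of ``GridWorldEnv`` and the implementation of ``__init__``: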


In [None]:
# gymnasium_env/envs/grid_world.py
from enum import Enum

import numpy as np
import pygame

import gymnasium as gym
from gymnasium import spaces


class Actions(Enum):
    RIGHT = 0
    UP = 1
    LEFT = 2
    DOWN = 3


class GridWorldEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 4}

    def __init__(self, render_mode=None, size=5):
        self.size = size  # The size of the square grid
        self.window_size = 512  # The size of the PyGame window

        # Observations are dictionaries with the agent's and the target's location.
        # Each location is encoded as an element of {0, ..., `size` - 1}^2.
        self.observation_space = spaces.Dict(
            {
                "agent": spaces.Box(0, size - 1, shape=(2,), dtype=int),
                "target": spaces.Box(0, size - 1, shape=(2,), dtype=int),
            }
        )
        self._agent_location = np.array([-1, -1], dtype=int)
        self._target_location = np.array([-1, -1], dtype=int)

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = spaces.Discrete(4)

        """
        The following dictionary maps abstract actions from `self.action_space` to
        the direction we will walk in if that action is taken.
        i.e. 0 corresponds to "right", 1 to "up" etc.
        """
        self._action_to_direction = {
            Actions.RIGHT.value: np.array([1, 0]),
            Actions.UP.value: np.array([0, 1]),
            Actions.LEFT.value: np.array([-1, 0]),
            Actions.DOWN.value: np.array([0, -1]),
        }

        assert render_mode is None or render_mode in self.metadata["render_modes"]
        self.render_mode = render_mode

        """
        If human-rendering is used, `self.window` will be a reference
        to the window that we draw to. `self.clock` will be a clock that is used
        to ensure that the environment is rendered at the correct framerate in
        human-mode. They will remain `None` until human-mode is used for the
        first time.
        """
        self.window = None
        self.clock = None

### Constructing Observations From Environment States

Since we will need to compute observations both in ``reset`` and
``step``, it is often convenient to have a (private) method ``_get_obs``
that translates the environment’s state into an observation. However,
this is not mandatory and you may as well compute observations in
``reset`` and ``step`` separately:



In [None]:
    def _get_obs(self):
        return {"agent": self._agent_location, "target": self._target_location}

We can also implement a similar method for the auxiliary information
that is returned by ``step`` and ``reset``. In our case, we would like
to provide the Manhattan distance between the agent and the target:



In [None]:
    def _get_info(self):
        return {
            "distance": np.linalg.norm(
                self._agent_location - self._target_location, ord=1
            )
        }
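As a quick sanity check on the distance above, the sketch below confirms that ``np.linalg.norm(..., ord=1)`` agrees with the Manhattan distance written out by hand. The two positions are made-up example values, not taken from the environment.

```python
import numpy as np

# Hypothetical example positions on a 5x5 grid.
agent = np.array([1, 0])
target = np.array([0, 3])

# The 1-norm of the difference, as used in _get_info.
norm_distance = np.linalg.norm(agent - target, ord=1)

# Manhattan distance spelled out: sum of absolute coordinate differences.
manual_distance = np.abs(agent - target).sum()

print(norm_distance, manual_distance)
```

Note that ``np.linalg.norm`` returns a float even for integer inputs, so the value stored under ``"distance"`` is a float.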

Oftentimes, info will also contain some data that is only available
inside the ``step`` method (e.g., individual reward terms). In that case,
we would have to update the dictionary that is returned by ``_get_info``
in ``step``.



### Reset

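The ``reset`` method will be called to initiate a new episode. You may assume that the ``step`` method will not be called before ``reset`` has been called. Moreover, ``reset`` should be called whenever a done signal has been issued. Users may pass the ``seed`` keyword to ``reset`` to initialize any random number generator that is used by the environment to a deterministic state. It is recommended to use the random number generator ``self.np_random`` that is provided by the environment's base class, ``gymnasium.Env``. If you only use this RNG, you do not need to worry much about seeding, but you need to remember to call ``super().reset(seed=seed)`` to make sure that ``gymnasium.Env`` correctly seeds the RNG. Once this is done, we can randomly set the state of our environment. In our case, we randomly choose the agent's location and then repeatedly sample the target's location until it does not coincide with the agent's location.

The ``reset`` method should return a tuple of the initial observation and some auxiliary information. We can use the methods ``_get_obs`` and ``_get_info`` that we implemented earlier for that: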

In [None]:
    def reset(self, seed=None, options=None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Choose the agent's location uniformly at random
        self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        # We will sample the target's location randomly until it does not coincide with the agent's location
        self._target_location = self._agent_location
        while np.array_equal(self._target_location, self._agent_location):
            self._target_location = self.np_random.integers(
                0, self.size, size=2, dtype=int
            )

        observation = self._get_obs()
        info = self._get_info()

        if self.render_mode == "human":
            self._render_frame()

        return observation, info
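The effect of seeding can be illustrated without the environment at all. The sketch below assumes that ``self.np_random`` behaves like a NumPy ``Generator``, created here directly with ``np.random.default_rng``. The helper ``sample_locations`` is hypothetical and only mirrors the sampling logic of ``reset``; the exact numbers drawn inside a real environment may differ.

```python
import numpy as np


def sample_locations(seed, size=5):
    """Hypothetical helper mirroring the sampling done in reset()."""
    rng = np.random.default_rng(seed)
    agent = rng.integers(0, size, size=2, dtype=int)

    # Resample the target until it differs from the agent's cell.
    target = agent
    while np.array_equal(target, agent):
        target = rng.integers(0, size, size=2, dtype=int)
    return agent, target


first = sample_locations(seed=42)
second = sample_locations(seed=42)

# The same seed reproduces the same agent and target locations.
print(np.array_equal(first[0], second[0]), np.array_equal(first[1], second[1]))
```

Because the target is resampled until it differs from the agent, an episode never starts in a terminal state.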

### Step

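The ``step`` method usually contains most of the logic of your environment. It accepts an ``action``, computes the state of the environment after applying that action and returns the 5-tuple ``(observation, reward, terminated, truncated, info)``. See ``gymnasium.Env.step()``. Once the new state of the environment has been computed, we can check whether it is a terminal state and set ``terminated`` accordingly. Since we are using sparse binary rewards in ``GridWorldEnv``, computing ``reward`` is trivial once we know ``terminated``. To gather ``observation`` and ``info``, we can again make use of ``_get_obs`` and ``_get_info``: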

In [None]:
    def step(self, action):
        # Map the action (element of {0,1,2,3}) to the direction we walk in
        direction = self._action_to_direction[action]
        # We use `np.clip` to make sure we don't leave the grid
        self._agent_location = np.clip(
            self._agent_location + direction, 0, self.size - 1
        )
        # An episode is done iff the agent has reached the target
        terminated = np.array_equal(self._agent_location, self._target_location)
        reward = 1 if terminated else 0  # Binary sparse rewards
        observation = self._get_obs()
        info = self._get_info()

        if self.render_mode == "human":
            self._render_frame()

        return observation, reward, terminated, False, info
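The role of ``np.clip`` in ``step`` is easiest to see in isolation. The following sketch uses made-up positions on a grid of size 5 and applies the same clipping expression: a move that would leave the grid leaves the agent where it is, while a move inside the grid behaves normally.

```python
import numpy as np

size = 5
direction_left = np.array([-1, 0])
direction_right = np.array([1, 0])

# Agent in the left-most column tries to move left: clipping keeps it in place.
at_left_edge = np.array([0, 2])
moved_left = np.clip(at_left_edge + direction_left, 0, size - 1)

# Agent in the interior moves right as expected.
interior = np.array([2, 2])
moved_right = np.clip(interior + direction_right, 0, size - 1)

print(moved_left, moved_right)
```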

### Rendering

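Here, we are using PyGame for rendering. A similar approach to rendering is used in many environments that are included with Gymnasium, and you can use it as a skeleton for your own environments: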



In [None]:
    def render(self):
        if self.render_mode == "rgb_array":
            return self._render_frame()

    def _render_frame(self):
        if self.window is None and self.render_mode == "human":
            pygame.init()
            pygame.display.init()
            self.window = pygame.display.set_mode(
                (self.window_size, self.window_size)
            )
        if self.clock is None and self.render_mode == "human":
            self.clock = pygame.time.Clock()

        canvas = pygame.Surface((self.window_size, self.window_size))
        canvas.fill((255, 255, 255))
        pix_square_size = (
            self.window_size / self.size
        )  # The size of a single grid square in pixels

        # First we draw the target
        pygame.draw.rect(
            canvas,
            (255, 0, 0),
            pygame.Rect(
                pix_square_size * self._target_location,
                (pix_square_size, pix_square_size),
            ),
        )
        # Now we draw the agent
        pygame.draw.circle(
            canvas,
            (0, 0, 255),
            (self._agent_location + 0.5) * pix_square_size,
            pix_square_size / 3,
        )

        # Finally, add some gridlines
        for x in range(self.size + 1):
            pygame.draw.line(
                canvas,
                0,
                (0, pix_square_size * x),
                (self.window_size, pix_square_size * x),
                width=3,
            )
            pygame.draw.line(
                canvas,
                0,
                (pix_square_size * x, 0),
                (pix_square_size * x, self.window_size),
                width=3,
            )

        if self.render_mode == "human":
            # The following line copies our drawings from `canvas` to the visible window
            self.window.blit(canvas, canvas.get_rect())
            pygame.event.pump()
            pygame.display.update()

            # We need to ensure that human-rendering occurs at the predefined framerate.
            # The following line will automatically add a delay to keep the framerate stable.
            self.clock.tick(self.metadata["render_fps"])
        else:  # rgb_array
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(canvas)), axes=(1, 0, 2)
            )
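The final transpose is needed because ``pygame.surfarray`` indexes pixels as ``[x, y, channel]``, whereas image arrays are conventionally laid out as ``[row, column, channel]``. The NumPy-only sketch below uses a made-up, non-square array in place of a real surface so that the axis swap is visible.

```python
import numpy as np

width, height = 4, 2

# Stand-in for the array returned by pygame.surfarray.pixels3d:
# shape (width, height, 3), indexed as [x, y, channel].
surface_pixels = np.zeros((width, height, 3), dtype=np.uint8)
surface_pixels[3, 0] = (255, 0, 0)  # red pixel at x=3, y=0

# Swap the first two axes to obtain the (height, width, 3) image layout.
frame = np.transpose(surface_pixels, axes=(1, 0, 2))

print(frame.shape)
print(frame[0, 3])
```

In ``GridWorldEnv`` the window is square, so the shape does not change, but the pixel positions are still mirrored along the diagonal without the transpose.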

### Close

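The ``close`` method should close any open resources that were used by the environment. In many cases, you do not actually have to implement this method. In other environments, ``close`` might also close files that were opened or release other resources. You should not interact with the environment after having called ``close``. In our example, ``render_mode`` may be ``"human"``, so we might need to close the window that has been opened: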


In [None]:
    def close(self):
        if self.window is not None:
            pygame.display.quit()
            pygame.quit()

## Registering Envs

In order for custom environments to be detected by Gymnasium, they must be registered as follows. We will put this code in ``gymnasium_env/__init__.py``.

In [None]:
from gymnasium.envs.registration import register

register(
    id="gymnasium_env/GridWorld-v0",
    entry_point="gymnasium_env.envs:GridWorldEnv",
)

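The environment ID consists of three components, two of which are optional: an optional namespace (here: ``gymnasium_env``), a mandatory name (here: ``GridWorld``) and an optional but recommended version (here: ``v0``). It could also have been registered as ``GridWorld-v0`` (the recommended approach), ``GridWorld`` or ``gymnasium_env/GridWorld``, and the appropriate ID should then be used during environment creation.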
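The three components of an environment ID can be made concrete with a small parser. This is only a sketch: the pattern and the helper ``split_env_id`` are hypothetical and written for illustration. They are not Gymnasium's own parsing code and may accept or reject IDs differently.

```python
import re

# Hypothetical pattern for "[namespace/]name[-v<version>]".
ENV_ID_PATTERN = re.compile(
    r"^(?:(?P<namespace>[\w.-]+)/)?(?P<name>[\w.-]+?)(?:-v(?P<version>\d+))?$"
)


def split_env_id(env_id):
    """Split an ID into (namespace, name, version); missing parts are None."""
    match = ENV_ID_PATTERN.match(env_id)
    if match is None:
        raise ValueError(f"Malformed environment ID: {env_id!r}")
    version = match.group("version")
    return (
        match.group("namespace"),
        match.group("name"),
        None if version is None else int(version),
    )


for env_id in ["gymnasium_env/GridWorld-v0", "GridWorld-v0", "GridWorld"]:
    print(env_id, "->", split_env_id(env_id))
```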

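The ``register`` call above does not pass ``max_episode_steps``. Adding the keyword argument ``max_episode_steps=300`` would ensure that GridWorld environments instantiated via ``gymnasium.make`` are wrapped in a ``TimeLimit`` wrapper (see the wrapper documentation for more information). A done signal is then produced if the agent has reached the target or 300 steps have been executed in the current episode. To distinguish truncation from termination, check the separate ``terminated`` and ``truncated`` values returned by ``step``.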

Apart from ``id`` and ``entry_point``, you may pass the following additional keyword arguments to ``register``:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| ``reward_threshold`` | ``float`` | ``None`` | The reward threshold before the task is considered solved |
| ``nondeterministic`` | ``bool`` | ``False`` | Whether this environment is non-deterministic even after seeding |
| ``max_episode_steps`` | ``int`` | ``None`` | The maximum number of steps that an episode can consist of. If not ``None``, a ``TimeLimit`` wrapper is added |
| ``order_enforce`` | ``bool`` | ``True`` | Whether to wrap the environment in an ``OrderEnforcing`` wrapper |
| ``kwargs`` | ``dict`` | ``{}`` | The default kwargs to pass to the environment class |

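Most of these keywords (except for ``max_episode_steps``, ``order_enforce`` and ``kwargs``) do not alter the behavior of environment instances but merely provide some extra information about your environment. After registration, our custom ``GridWorldEnv`` environment can be created with ``env = gymnasium.make('gymnasium_env/GridWorld-v0')``.

If your environment is not registered, you may optionally pass a module to import, which will register your environment before creating it, like this: ``env = gymnasium.make('module:Env-v0')``, where ``module`` contains the registration code. For the GridWorld environment, the registration code is run by importing ``gymnasium_env``, so if it is not possible to import ``gymnasium_env`` explicitly, you can register while making the environment with ``env = gymnasium.make('gymnasium_env:gymnasium_env/GridWorld-v0')``. This is especially useful when you are only allowed to pass the environment ID into a third-party codebase (e.g., a learning library). It lets you register your environment without editing the library's source code.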

``gymnasium_env/envs/__init__.py`` should have:

In [None]:
from gymnasium_env.envs.grid_world import GridWorldEnv

## Creating a Package

The last step is to structure our code as a Python package. This involves configuring ``pyproject.toml``. A minimal example of how to do this is as follows:

In [None]:
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "gymnasium_env"
version = "0.0.1"
dependencies = [
  "gymnasium",
  "pygame==2.1.3",
  "pre-commit",
]

## Creating Environment Instances

Now you can install your package locally with:

    pip install -e .

You can then create an instance of the environment via:

In [None]:
# run_gymnasium_env.py

import gymnasium
import gymnasium_env
env = gymnasium.make('gymnasium_env/GridWorld-v0')

You can also pass keyword arguments of your environment's constructor to ``gymnasium.make`` to customize the environment. In our case, we could do:

    env = gymnasium.make('gymnasium_env/GridWorld-v0', size=10)

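Sometimes, you may find it more convenient to skip registration and call the environment's constructor yourself. Some may find this approach more Pythonic, and environments instantiated this way are perfectly fine (but remember to add wrappers as well!).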


## Using Wrappers

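Oftentimes, we want to use different variants of a custom environment, or we want to modify the behavior of an environment that is provided by Gymnasium or some other party. Wrappers allow us to do this without changing the environment implementation or adding any boilerplate code. Check out the wrapper documentation for details on how to use wrappers and instructions for implementing your own. In our example, observations cannot be used directly in learning code because they are dictionaries. However, we do not actually need to touch our environment implementation to fix this! We can simply add a wrapper on top of environment instances to flatten observations into a single array: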

In [None]:
import gymnasium
import gymnasium_env
from gymnasium.wrappers import FlattenObservation

env = gymnasium.make('gymnasium_env/GridWorld-v0')
wrapped_env = FlattenObservation(env)
print(wrapped_env.reset())     # E.g.  [3 0 3 3], {'distance': 3.0}
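To make the effect of flattening concrete, here is a NumPy-only sketch of the idea for this particular observation layout: the two length-2 arrays are concatenated into one length-4 array. It illustrates the result only, not how ``FlattenObservation`` is implemented, and the helper name ``flatten_grid_observation`` is hypothetical.

```python
import numpy as np


def flatten_grid_observation(observation):
    """Hypothetical helper: concatenate agent and target into one array."""
    return np.concatenate([observation["agent"], observation["target"]])


# Made-up observation in the dictionary format used by GridWorldEnv.
observation = {"agent": np.array([3, 0]), "target": np.array([3, 3])}
flat = flatten_grid_observation(observation)
print(flat)
```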


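Wrappers have the big advantage that they make environments highly modular. For instance, instead of flattening the observations from GridWorld, you might only want to look at the relative position of the target and the agent. In the section on ObservationWrappers we have implemented a wrapper that does this job. This wrapper is also available in ``gymnasium_env/wrappers/relative_position.py``: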

In [None]:
import gymnasium
import gymnasium_env
from gymnasium_env.wrappers import RelativePosition

env = gymnasium.make('gymnasium_env/GridWorld-v0')
wrapped_env = RelativePosition(env)
print(wrapped_env.reset())     # E.g.  [-3  3], {'distance': 6.0}
