https://gym.openai.com/docs/

- Getting Started with Gym
    + Installation
        - Building from Source
    + Environments
    + Observations
    + Spaces
- Available Environments
    + The registry
- Background: Why Gym? (2016)

# Getting Started with Gym

**Gym is a toolkit for developing and comparing reinforcement learning algorithms.** It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano.

The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms:

https://github.com/openai/gym

These environments have a shared interface, allowing you to write general algorithms.

## Installation

To get started, you’ll need to have Python 3.5+ installed. Simply install `gym` using `pip`:

```
pip install gym
```

And you’re good to go!

### Building from Source

If you prefer, you can also clone the `gym` Git repository directly. This is particularly useful when you’re working on modifying Gym itself or adding environments. Download and install using:

```
git clone https://github.com/openai/gym
cd gym
pip install -e .
```

You can later run `pip install -e .[all]` to perform a full installation containing all environments:

https://github.com/openai/gym#installing-everything

This requires installing several more involved dependencies, including `cmake` and a recent `pip` version.

## Environments

Here’s a bare minimum example of getting something running. This will run an instance of the `CartPole-v0` environment for 1000 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:

https://www.youtube.com/watch?v=J7E6_my3CHk

```python
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action

env.close()
```




Normally, we’ll end the simulation before the cart-pole is allowed to go off-screen. More on that later. For now, please ignore the warning about calling `step()` even though this environment has already returned `done = True`.

If you’d like to see some other environments in action, try replacing `CartPole-v0` above with something like `MountainCar-v0`, `MsPacman-v0` (requires the [Atari dependency](https://github.com/openai/gym#atari)), or `Hopper-v1` (requires the [MuJoCo](https://github.com/openai/gym#mujoco) dependencies). Environments all descend from the `Env` base class.

Note that if you’re missing any dependencies, you should get a helpful error message telling you what you’re missing. ([Let us know](https://github.com/openai/gym/issues) if a dependency gives you trouble without a clear instruction to fix it.) [Installing](https://github.com/openai/gym#environment-specific-installation) a missing dependency is generally pretty simple. You’ll also need a [MuJoCo license](https://www.roboti.us/license.html) for `Hopper-v1`.

## Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s `step` function returns exactly what we need. In fact, `step` returns four values. These are:

- `observation` (object): an environment-specific object representing your observation of the environment
    + For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game

- `reward` (float): amount of reward achieved by the previous action
    + The scale varies between environments, but the goal is always to increase your total reward

- `done` (boolean): whether it’s time to reset the environment again
    + Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated
    + For example, perhaps the pole tipped too far, or you lost your last life

- `info` (dict): diagnostic information useful for debugging
    + It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change)
    + However, official evaluations of your agent are not allowed to use this for learning

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an `action`, and the environment returns an `observation` and a `reward`.

[Figure: the agent-environment loop]

The process gets started by calling `reset()`, which returns an initial `observation`. So a more proper way of writing the previous code would be to respect the `done` flag:

```python
import gym

env = gym.make('CartPole-v0')

for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

env.close()
```

[ 0.00318354  0.01116984  0.03464679 -0.02850192]
[ 0.00340694  0.20577825  0.03407675 -0.31005522]
[ 0.00752251  0.40039853  0.02787565 -0.59179953]
[ 0.01553048  0.59511937  0.01603965 -0.87557294]
[ 0.02743286  0.79001965 -0.0014718  -1.1631703 ]
[ 0.04323326  0.98516074 -0.02473521 -1.45631432]
[ 0.06293647  1.18057736 -0.0538615  -1.756621  ]
[ 0.08654802  1.37626671 -0.08899392 -2.06555696]
[ 0.11407335  1.18215658 -0.13030506 -1.8016737 ]
[ 0.13771648  0.9887098  -0.16633853 -1.55216487]
[ 0.15749068  1.18538983 -0.19738183 -1.89179127]
Episode finished after 11 timesteps
[ 0.01505141 -0.00244625 -0.0284009   0.04634925]
[ 0.01500249  0.1930712  -0.02747391 -0.25515739]
[ 0.01886391 -0.00164793 -0.03257706  0.02873476]
[ 0.01883095 -0.19628792 -0.03200237  0.31096388]
[ 0.01490519 -0.00072499 -0.02578309  0.0083626 ]
[ 0.01489069  0.19475705 -0.02561584 -0.29234241]
[ 0.01878583  0.3902347  -0.03146269 -0.5929929 ]
[ 0.02659053  0.19556697 -0.04332254 -0.31038433]
[ 0.03050187  

[-0.11263953 -0.95351664  0.17878454  1.64350836]
Episode finished after 10 timesteps
[-0.02403197  0.01835612 -0.04677557  0.03698327]
[-0.02366484 -0.17606493 -0.0460359   0.31454888]
[-0.02718614  0.01968153 -0.03974493  0.00771048]
[-0.02679251  0.21535028 -0.03959072 -0.29724278]
[-0.02248551  0.41101358 -0.04553557 -0.60214438]
[-0.01426523  0.6067419  -0.05757846 -0.90881537]
[-0.0021304   0.4124446  -0.07575477 -0.6347711 ]
[ 0.0061185   0.60853689 -0.08845019 -0.95031656]
[ 0.01828923  0.8047309  -0.10745652 -1.26942759]
[ 0.03438385  1.00104825 -0.13284507 -1.59373657]
[ 0.05440482  0.8077291  -0.1647198  -1.3452545 ]
[ 0.0705594   0.61501661 -0.19162489 -1.10830743]
Episode finished after 12 timesteps
[ 0.04239682 -0.02625447  0.01838256 -0.03378576]
[ 0.04187174  0.16859911  0.01770684 -0.32061258]
[ 0.04524372  0.36346448  0.01129459 -0.60765927]
[ 0.05251301  0.16818645 -0.00085859 -0.31144037]
[ 0.05587674 -0.02692326 -0.0070874  -0.01902835]
[ 0.05533827  0.16829961 -0.

[-0.09670696 -0.81633925  0.17906377  1.47991488]
[-0.11303374 -1.01313732  0.20866207  1.82275592]
Episode finished after 16 timesteps
[ 0.01533846 -0.01326086  0.04403862  0.03307243]
[ 0.01507324  0.1812028   0.04470007 -0.24539708]
[ 0.0186973   0.37565876  0.03979213 -0.52365221]
[ 0.02621048  0.18000003  0.02931908 -0.21870065]
[ 0.02981048 -0.0155285   0.02494507  0.08308467]
[ 0.02949991 -0.21099898  0.02660676  0.38353227]
[ 0.02527993 -0.0162647   0.03427741  0.09935569]
[ 0.02495463 -0.21186071  0.03626452  0.40265295]
[ 0.02071742 -0.40747774  0.04431758  0.70654501]
[ 0.01256786 -0.60318476  0.05844848  1.0128424 ]
[ 5.04168740e-04 -7.99035673e-01  7.87053279e-02  1.32329118e+00]
[-0.01547654 -0.99505873  0.10517115  1.63953102]
[-0.03537772 -0.80131492  0.13796177  1.38138441]
[-0.05140402 -0.60815781  0.16558946  1.13483393]
[-0.06356717 -0.4155433   0.18828614  0.89832535]
[-0.07187804 -0.22340347  0.20625265  0.67023891]
Episode finished after 16 timesteps


This should produce a video along with output like that shown above. You should be able to see where the resets happen.
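To make the loop concrete without installing any dependencies, here is a minimal sketch of the same `reset()`/`step()` contract. `LineWalkEnv` and `run_episode` are hypothetical names invented for this illustration and are not part of Gym; the sketch only assumes the four-value return of `step` described above.

```python
class LineWalkEnv:
    """Toy stand-in (not part of Gym): walk along a line until reaching `goal`."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 1 moves right; any other action moves left.
        self.position += 1 if action == 1 else -1
        self.steps += 1
        reached_goal = self.position == self.goal
        reward = 1.0 if reached_goal else 0.0
        # The episode ends at the goal or when the step budget runs out.
        done = reached_goal or self.steps >= self.max_steps
        info = {"steps": self.steps}
        return self.position, reward, done, info


def run_episode(env, policy):
    """Run one episode, respecting `done`; return (total_reward, steps)."""
    observation = env.reset()
    total_reward = 0.0
    steps = 0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


def always_right(observation):
    return 1


def always_left(observation):
    return 0


print(run_episode(LineWalkEnv(), always_right))  # (1.0, 3)
print(run_episode(LineWalkEnv(), always_left))   # (0.0, 20)
```

Because `run_episode` only relies on `reset()` and `step()`, the same function should work with any object that follows this interface, which is the point of having a shared one.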

## Spaces

In the examples above, we’ve been sampling random actions from the environment’s action space. But what actually are those actions? Every environment comes with an `action_space` and an `observation_space`. These attributes are of type `Space`, and they describe the format of valid actions and observations:

```python
import gym

env = gym.make('CartPole-v0')

print(env.action_space)
```

```
Discrete(2)
```

```python
print(env.observation_space)
```

```
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
```


The `Discrete` space allows a fixed range of non-negative integers, so in this case valid `action` values are either 0 or 1. The `Box` space represents an n-dimensional box, so valid observations will be arrays of 4 numbers. We can also check the `Box`’s bounds:

```python
print(env.observation_space.high)
```

```
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
```

```python
print(env.observation_space.low)
```

```
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
```


This introspection can be helpful to write generic code that works for many different environments. `Box` and `Discrete` are the most common `Space`s. You can sample from a `Space` or check that something belongs to it:

```python
from gym import spaces

space = spaces.Discrete(8)  # Set with 8 elements {0, 1, 2, ..., 7}
x = space.sample()
assert space.contains(x)
assert space.n == 8
```

For `CartPole-v0` one of the actions applies force to the left, and one of them applies force to the right. (Can you figure out which is which?)

Fortunately, the better your learning algorithm, the less you’ll have to try to interpret these numbers yourself.
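As a rough illustration of what "belongs to a space" means, the sketch below re-implements the membership checks in plain Python. The helpers `discrete_contains` and `box_contains` are hypothetical and are not Gym's implementation; they only mirror the behaviour described above, using the `CartPole-v0` bounds printed earlier.

```python
def discrete_contains(n, x):
    """True if x is an integer in {0, 1, ..., n - 1}."""
    return isinstance(x, int) and 0 <= x < n


def box_contains(low, high, x):
    """True if x matches the bounds in length and lies within them elementwise."""
    if not (len(x) == len(low) == len(high)):
        return False
    return all(lo <= value <= hi for lo, value, hi in zip(low, x, high))


# Bounds printed for CartPole-v0 above; the lower bounds are their negatives.
HIGH = [4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38]
LOW = [-h for h in HIGH]

# First observation from the sample output in the Observations section.
first_observation = [0.00318354, 0.01116984, 0.03464679, -0.02850192]

print(box_contains(LOW, HIGH, first_observation))     # True
print(box_contains(LOW, HIGH, [5.0, 0.0, 0.0, 0.0]))  # False: first entry exceeds 4.8
print(discrete_contains(2, 1))                        # True
print(discrete_contains(2, 2))                        # False: only 0 and 1 are valid
```

In real code you would call `contains` on the environment's own `action_space` or `observation_space` rather than re-implementing the check.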

# Available Environments

Gym comes with a diverse suite of environments that range from easy to difficult and involve many different kinds of data. View the full list of environments to get the bird's-eye view:

https://gym.openai.com/envs

- Classic control and toy text: complete small-scale tasks, mostly from the RL literature
    + They’re here to get you started

https://gym.openai.com/envs#classic_control

https://gym.openai.com/envs#toy_text

- Algorithmic: perform computations such as adding multi-digit numbers and reversing sequences
    + One might object that these tasks are easy for a computer
    + The challenge is to learn these algorithms purely from examples
    + These tasks have the nice property that it’s easy to vary the difficulty by varying the sequence length

https://gym.openai.com/envs#algorithmic

- Atari: play classic Atari games
    + We’ve integrated the [Arcade Learning Environment](https://github.com/mgbellemare/Arcade-Learning-Environment) (which has had a big impact on reinforcement learning research) in an [easy-to-install](https://github.com/openai/gym#atari) form

https://gym.openai.com/envs#atari

- 2D and 3D robots: control a robot in simulation
    + These tasks use the [MuJoCo](http://www.mujoco.org/) physics engine, which was designed for fast and accurate robot simulation
    + Included are some environments from a recent [benchmark](http://arxiv.org/abs/1604.06778) by UC Berkeley researchers (who incidentally will be [joining us](https://gym.openai.com/blog/team-plus-plus/) this summer)
    + MuJoCo is proprietary software, but offers [free trial licenses](https://www.roboti.us/trybuy.html)

https://gym.openai.com/envs#mujoco

## The registry

`gym`’s main purpose is to provide a large collection of environments that expose a common interface and are versioned to allow for comparisons. To list the environments available in your installation, just ask `gym.envs.registry`:

```python
from gym import envs

print(envs.registry.all())
```

dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v3), EnvSpec(BipedalWalkerHardcore-v3), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0), EnvSpec(KellyCoinflip-v0), EnvSpec(KellyCoinflipGeneralized-v0), EnvSpec(FrozenLake-v0), EnvSpec(FrozenLake8x8-v0), EnvSpec(CliffWalking-v0), EnvSpec(NChain-v0), EnvSpec(Roulette-v0), EnvSpec(Taxi-v3), EnvSpec(GuessingGame-v0), EnvSpec(HotterColder-v0), EnvSpec(Reacher-v2), EnvSpec(Pusher-v2), EnvSpec(Thrower-v2), EnvSpec(Striker-v2), EnvSpec(InvertedPendulum-v2), EnvSpec(InvertedDoublePendulum-v2), EnvSpec(HalfCheetah-v2), EnvSpec(HalfCheetah-v3), EnvSpec(Hopper-v2), EnvSpec(Hopper-v3), EnvSpec(Swimmer-v2), EnvSp

This will give you a list of `EnvSpec` objects. These define parameters for a particular task, including the number of trials to run and the maximum number of steps. For example, `EnvSpec(Hopper-v1)` defines an environment where the goal is to get a 2D simulated robot to hop; `EnvSpec(Go9x9-v0)` defines a Go game on a 9x9 board.

These environment IDs are treated as opaque strings. In order to ensure valid comparisons for the future, environments will never be changed in a fashion that affects performance, only replaced by newer versions. We currently suffix each environment with a `v0` so that future replacements can naturally be called `v1`, `v2`, etc.

It’s very easy to add your own environments to the registry, and thus make them available for `gym.make()`: just `register()` them at load time.
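The idea behind a registry keyed by versioned ID strings can be sketched in a few lines of plain Python. Everything below (`EchoEnv`, `_registry`, and these simplified `register` and `make` functions) is a hypothetical stand-in written for illustration; it is not Gym's code, and Gym's real `register()` takes different arguments.

```python
class EchoEnv:
    """Hypothetical placeholder environment with the reset()/step() interface."""

    def reset(self):
        return 0

    def step(self, action):
        # The observation echoes the action; every episode lasts one step.
        return action, 0.0, True, {}


_registry = {}  # maps an opaque ID string such as 'Echo-v0' to a factory


def register(env_id, entry_point):
    """Record a factory under env_id; existing IDs are never overwritten."""
    if env_id in _registry:
        raise ValueError(
            "{} is already registered; use a new version suffix".format(env_id)
        )
    _registry[env_id] = entry_point


def make(env_id):
    """Build a fresh environment from its registered factory."""
    if env_id not in _registry:
        raise KeyError("No registered environment with id {}".format(env_id))
    return _registry[env_id]()


register('Echo-v0', EchoEnv)
env = make('Echo-v0')
print(env.step(1))  # (1, 0.0, True, {})
```

Refusing to overwrite an existing ID reflects the versioning policy described above: a changed environment gets a new suffix such as `v1` instead of replacing `v0`.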

# Background: Why Gym? (2016)

Reinforcement learning (RL) is the subfield of machine learning concerned with decision making and motor control. It studies how an agent can learn how to achieve goals in a complex, uncertain environment. It’s exciting for two reasons:

- RL is very general, encompassing all problems that involve making a sequence of decisions: 
    + for example, controlling a robot’s motors so that it’s able to [run](https://gym.openai.com/envs/Humanoid-v0) and [jump](https://gym.openai.com/envs/Hopper-v0), making business decisions like pricing and inventory management, or playing [video games](https://gym.openai.com/envs#atari) and [board games](https://gym.openai.com/envs#board_game)
    + RL can even be applied to supervised learning problems with sequential or structured outputs:
    
http://arxiv.org/abs/1511.06732

http://arxiv.org/abs/0907.0786

http://arxiv.org/abs/1601.01705

- RL algorithms have started to achieve good results in many difficult environments
    + RL has a long history, but until recent advances in deep learning, it required lots of problem-specific engineering
    + DeepMind’s [Atari results](https://deepmind.com/dqn.html), [BRETT](http://news.berkeley.edu/2015/05/21/deep-learning-robot-masters-skills-via-trial-and-error/) from [Pieter Abbeel](https://gym.openai.com/blog/welcome-pieter-and-shivon)’s group, and [AlphaGo](https://googleblog.blogspot.com/2016/01/alphago-machine-learning-game-go.html) all used deep RL algorithms which did not make too many assumptions about their environment, and thus can be applied in other settings.

However, RL research is also slowed down by two factors:

- The need for better benchmarks. In supervised learning, progress has been driven by large labeled datasets like [ImageNet](http://www.image-net.org/)
    + In RL, the closest equivalent would be a large and diverse collection of environments
    + However, the existing open-source collections of RL environments don’t have enough variety, and they are often difficult to even set up and use

- Lack of standardization of environments used in publications
    + Subtle differences in the problem definition, such as the reward function or the set of actions, can drastically alter a task’s difficulty
    + This issue makes it difficult to reproduce published research and compare results from different papers

Gym is an attempt to fix both problems.
