# Custom openai-gym environment tutorial

This tutorial demonstrates how to create and load a custom environment for use with `gym`.

## Elements of an environment

An `environment` is an implementation of `gym.Env`.
Any custom environment can be created by inheriting from the `gym.Env` class:
```python
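import gym


class CustomEnv(gym.Env):
    def step(self, action):
        # Advance one timestep; should return (observation, reward, done, info)
        return

    def reset(self):
        # Start a new episode; should return the initial observation
        return

    @property
    def action_space(self):
        # Should return a gym.spaces object describing the valid actions
        return

    @property
    def observation_space(self):
        # Should return a gym.spaces object describing the observations
        return

    def render(self):
        return

    def close(self):
        return

    def seed(self, seed=None):
        return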
```

Note that in this example, I have defined the following methods and attributes of `CustomEnv`:
* step
* action_space
* observation_space
* reset
* *render*
* *close*
* *seed*

The last three methods (render, close, and seed) are optional, and can be left blank.
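
To make the return values of these methods concrete, here is a minimal sketch of a toy environment. `CoinFlipEnv` is a hypothetical example, not part of `gym`, and it is written as a plain class so that it runs without `gym` installed. In real use it would inherit from `gym.Env` and declare `action_space` and `observation_space` with `gym.spaces` objects.

```python
import random


class CoinFlipEnv:
    """Toy stand-in that mirrors the classic gym.Env method contract.

    Hypothetical example: the agent sees a coin (0 or 1) and is rewarded
    for choosing the action that matches it.
    """

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self._rng = random.Random()
        self._t = 0
        self._coin = 0

    def seed(self, seed=None):
        # Optional: make episodes reproducible.
        self._rng = random.Random(seed)
        return [seed]

    def reset(self):
        # Start a new episode and return the first observation.
        self._t = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin

    def step(self, action):
        # Reward 1.0 when the action matches the coin that was just observed.
        reward = 1.0 if action == self._coin else 0.0
        self._t += 1
        self._coin = self._rng.randint(0, 1)
        done = self._t >= self.episode_length
        info = {}
        # Classic gym contract: (observation, reward, done, info)
        return self._coin, reward, done, info


def run_episode(env, policy):
    """Roll out one episode and return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
    return total
```

For example, `run_episode(CoinFlipEnv(episode_length=3), lambda obs: obs)` uses a policy that copies the observed coin, so it collects the full reward on every step.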

## Training an env in openai-gym

For our basic example, let us use the `CartPole-v1` ([source](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)) environment which comes with `gym`.
We will train this using the `stable-baselines` library, but note that other approaches such as `rllib` can be used.

In the `CartPole-v1` environment, the agent learns to balance a pole on top of a moving cart.
The agent knows the full state of the system, that is, the positions and velocities of the cart and the pole, given by
$$\mathrm{state} = \left[x, \dot{x}, \theta, \dot{\theta}\right].$$
Since this is known to the agent, the observation space is defined as 
```python
self.observation_space = spaces.Box(-high, high, dtype=np.float32)
```
where `high` is a length-4 array holding the bounding value for each of the state variables.
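
As a rough sketch of how such a `high` array might be built, the snippet below uses only `numpy`. The threshold values are assumptions based on my reading of the linked `CartPole-v1` source, and `in_bounds` is a hypothetical helper added for illustration (a real `spaces.Box` offers a similar check through `contains`).

```python
import math

import numpy as np

# Assumed thresholds; check them against the linked CartPole source.
x_threshold = 2.4                                 # cart position limit
theta_threshold_radians = 12 * 2 * math.pi / 360  # pole angle limit (12 degrees)

# Position and angle bounds are set to twice the thresholds, while the
# velocities are effectively unbounded (largest float32 value).
high = np.array(
    [
        x_threshold * 2,
        np.finfo(np.float32).max,
        theta_threshold_radians * 2,
        np.finfo(np.float32).max,
    ],
    dtype=np.float32,
)


def in_bounds(state):
    """Hypothetical helper: True if every component lies in [-high, high]."""
    state = np.asarray(state, dtype=np.float32)
    return bool(np.all(state >= -high) and np.all(state <= high))
```

With these assumed values, a cart at the origin is inside the bounds, while a cart position of `5.0` is not.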

This agent has an action space defined as 
```python
self.action_space = spaces.Discrete(2)
```
which tells the model that there are two discrete actions, representing a push of the cart to the left or to the right.

Finally, the entire physics of the cart and pole are defined in `self.step()`.
This is just an exercise in determining the applied force (due to the action) and calculating the updated positions and velocities using Euler integration.
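
The update described above can be sketched in plain Python as follows. The constants and the equations of motion are my reconstruction of what the linked source does, so treat them as assumptions and verify them before relying on the numbers. `euler_step` is a hypothetical name used only for this illustration.

```python
import math

# Constants assumed to match the linked CartPole source.
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_POLE_LENGTH = 0.5
POLE_MASS_LENGTH = MASS_POLE * HALF_POLE_LENGTH
FORCE_MAG = 10.0
TAU = 0.02  # seconds between state updates


def euler_step(state, action):
    """Advance the cart-pole state by one explicit Euler step."""
    x, x_dot, theta, theta_dot = state

    # The discrete action selects the direction of a fixed-magnitude force.
    force = FORCE_MAG if action == 1 else -FORCE_MAG

    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS)
    )
    x_acc = temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS

    # Explicit Euler: positions advance with the current velocities,
    # velocities advance with the current accelerations.
    x = x + TAU * x_dot
    x_dot = x_dot + TAU * x_acc
    theta = theta + TAU * theta_dot
    theta_dot = theta_dot + TAU * theta_acc
    return (x, x_dot, theta, theta_dot)
```

Starting from rest with `euler_step((0.0, 0.0, 0.0, 0.0), 1)`, the positions are unchanged after one step, the cart gains velocity to the right, and the pole starts to rotate in the opposite direction.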


```python
import gym
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)  # suppress warnings

from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy
```

```python
env = gym.make('CartPole-v1')
# Optional: PPO2 requires a vectorized environment to run
# the env is now wrapped automatically when passing it to the constructor
# env = DummyVecEnv([lambda: env])

policy_kwargs = dict(net_arch=[64, 64])
# policy_kwargs = None
model = PPO2(MlpPolicy, env, verbose=1, policy_kwargs=policy_kwargs)  # MlpPolicy is 64x64 FC layers by default
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

model.learn(total_timesteps=5000)

print(f"Untrained mean_reward:\t{mean_reward:.2f} +/- {std_reward:.2f}")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Trained mean_reward:\t{mean_reward:.2f} +/- {std_reward:.2f}")
```


```
Wrapping the env in a DummyVecEnv.
--------------------------------------
| approxkl           | 8.6825254e-05 |
| clipfrac           | 0.0           |
| explained_variance | -0.0192       |
| fps                | 110           |
| n_updates          | 1             |
| policy_entropy     | 0.6930678     |
| policy_loss        | -0.0015568695 |
| serial_timesteps   | 128           |
| time_elapsed       | 4.58e-05      |
| total_timesteps    | 128           |
| value_loss         | 40.957977     |
--------------------------------------
---------------------------------------
| approxkl           | 2.8545483e-05  |
| clipfrac           | 0.0            |
| explained_variance | -0.0532        |
| fps                | 295            |
| n_updates          | 2              |
| policy_entropy     | 0.69272757     |
| policy_loss        | -0.00029773684 |
| serial_timesteps   | 256            |
| time_elapsed       | 1.24           |
| total_timesteps    | 256            |
| value_loss      
```

Training this agent for 5000 steps usually gives a noticeable improvement over the untrained model, although the size of the improvement depends on the random initialization.

You can visualize the results of training by setting `render = True` in the next cell.
This is set to `False` by default because rendering is unreliable in Jupyter notebooks.

```python
render = False
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, done, info = env.step(action)
    if render:
        env.render()
    if done:
        obs = env.reset()

env.close()
```

## Training using a custom environment

A custom environment can be found in [stateless_cartpole](custom_env/stateless_cartpole.py).
This environment is closely modeled on `CartPole-v1`, with one key difference: the observations are now limited to the positions,
$$\mathrm{observation} = \left[x, \theta \right].$$
This configuration comes from an `rllib` [example](https://github.com/ray-project/ray/blob/master/rllib/examples/env/stateless_cartpole.py), where it demonstrates a case in which a fully connected neural network policy is insufficient to learn the cartpole problem (this is an example of a partially observable Markov decision process).
The primary changes in this custom environment can be found in the observation space.
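
To illustrate why dropping the velocities matters, here is a small sketch in plain Python. Both function names are hypothetical, and the time step `TAU` is an assumed value. A single observation says nothing about how fast the cart or pole is moving, but two consecutive observations allow a finite-difference estimate, which is the kind of memory a recurrent policy can learn to exploit.

```python
TAU = 0.02  # assumed time between consecutive observations, in seconds


def to_partial_observation(state):
    """Keep only the positions (x, theta) from the full 4-element state."""
    x, _x_dot, theta, _theta_dot = state
    return (x, theta)


def estimate_velocities(prev_obs, curr_obs, tau=TAU):
    """Approximate (x_dot, theta_dot) from two consecutive partial observations."""
    return tuple((curr - prev) / tau for prev, curr in zip(prev_obs, curr_obs))
```

A feed-forward policy sees only the output of `to_partial_observation` for the current step, so it has no access to the previous observation that `estimate_velocities` needs.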

The next section of code demonstrates how to load your custom environment and how to adjust your `stable_baselines` experiment to use an appropriate policy.

```python
from custom_env.stateless_cartpole import StatelessCartPole
env = StatelessCartPole()
env = DummyVecEnv([lambda: env])

# Note: policy_kwargs is defined here but not passed to PPO2 below
policy_kwargs = dict(layers=[16, 16, 16])
# nminibatches=1 because the recurrent policy is trained on a single environment
model = PPO2(MlpLstmPolicy, env, nminibatches=1, verbose=1)
# model = PPO2(MlpPolicy, env, verbose=1)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

model.learn(total_timesteps=5000)

print(f"Untrained mean_reward:\t{mean_reward:.2f} +/- {std_reward:.2f}")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Trained mean_reward:\t{mean_reward:.2f} +/- {std_reward:.2f}")
```



```
---------------------------------------
| approxkl           | 1.8278017e-08  |
| clipfrac           | 0.0            |
| explained_variance | -0.000102      |
| fps                | 26             |
| n_updates          | 1              |
| policy_entropy     | 0.69314694     |
| policy_loss        | -6.5215863e-06 |
| serial_timesteps   | 128            |
| time_elapsed       | 1e-05          |
| total_timesteps    | 128            |
| value_loss         | 48.960938      |
---------------------------------------
--------------------------------------
| approxkl           | 8.088178e-08  |
| clipfrac           | 0.0           |
| explained_variance | -0.00057      |
| fps                | 139           |
| n_updates          | 2             |
| policy_entropy     | 0.69314694    |
| policy_loss        | -2.626516e-05 |
| serial_timesteps   | 256           |
| time_elapsed       | 4.85          |
| total_timesteps    | 256           |
| value_loss         | 39.527786     |
------------
```

The replay loop also changes when using an LSTM policy: the recurrent state and a mask of finished episodes must be passed to `model.predict` at every step.

```python
# obs = env.reset()
# for i in range(1000):
#     action, _states = model.predict(obs)
#     obs, rewards, done, info = env.step(action)
#     env.render() # uncomment to render
#     if done:
#         obs = env.reset()

# env.close()

obs = env.reset()
# Passing state=None to the predict function means
# it is the initial state
state = None
# When using VecEnv, done is a vector
done = [False for _ in range(env.num_envs)]
for _ in range(1000):
    # We need to pass the previous state and a mask for recurrent policies
    # to reset the LSTM state when a new episode begins
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, _ = env.step(action)
    # Note: with VecEnv, env.reset() is automatically called

    # Show the env
    # env.render()
```

And that's it!
We have loaded and trained a custom environment!

## Other Examples
* [Stock Market](https://towardsdatascience.com/creating-a-custom-openai-gym-environment-for-stock-trading-be532be3910e)
