# Gym environments

In this course, we will mostly address RL environments available in the **OpenAI Gym** framework:

<https://gym.openai.com>

It provides a multitude of RL problems, from simple text-based problems with a few dozen states (Gridworld, Taxi) to continuous control problems (Cartpole, Pendulum), Atari games (Breakout, Space Invaders) and complex robotics simulators (MuJoCo, but the license is expensive):

<https://gym.openai.com/envs>

You can install gym and its dependencies with:

```
pip install atari_py gym
```

In [8]:
import gym
import numpy as np

The main interest of gym is that all problems have a common interface defined by the class `gym.Env`. There are only three methods that have to be implemented and used when creating a new environment:

* `reset()` restarts the environment and returns an initial state $s_0$.

* `step(action)` takes an action $a_t$ and returns the new state $s_{t+1}$ and the reward $r_{t+1}$. It also returns a boolean flag indicating whether the new state is terminal, and a dictionary containing additional (optional) information for debugging.

* `render()` displays the current state of the MDP, either as text or in a graphical window. It should not be called during learning, as rendering slows the simulation down.

Additionally, you override the constructor `__init__()` to set up the state space (called observation space) and the action space.

State and action spaces can be either:

* discrete (`gym.spaces.Discrete(nb_states)`), with each state being an integer between 0 and `nb_states - 1`.

* feature-based (`gym.spaces.Box(low=0, high=255, shape=(SCREEN_HEIGHT, SCREEN_WIDTH, 3))`) for pixel frames.

* continuous:

```python
gym.spaces.Tuple((
    gym.spaces.Box(low=-180.0, high=180.0, shape=(1,)), # First joint
    gym.spaces.Box(low=-180.0, high=180.0, shape=(1,))  # Second joint
))
```

Here is an example of a dummy environment with discrete states and actions, where the transition probabilities and rewards are completely random:

In [9]:
class FooEnv(gym.Env):

    def __init__(self, nb_states, nb_actions):
        "Initialize the environment, can accept additional parameters such as the number of states and actions."
        # State space, can be discrete or continuous.
        self.observation_space = gym.spaces.Discrete(nb_states)
        
        # Action space, can be discrete or continuous.
        self.action_space = gym.spaces.Discrete(nb_actions)    
        
        super().__init__()
    
    def reset(self):
        "Resets the environment and starts from an initial state."
        
        # Sample one state randomly 
        self.state = self.observation_space.sample()
        
        return self.state
    
    def step(self, action):
        """Takes an action and returns a new state, a reward, a boolean (True for terminal states) 
        and a dictionary with additional info (optional)"""
        
        self.state = self.observation_space.sample() # Random transition to another state
        self.reward = np.random.uniform(0, 1, 1)[0] # Random reward
        self.done = False # Continuing task
        self.info = {} # No info
        
        return self.state, self.reward, self.done, self.info

    def render(self, mode='human'):
        "Displays the current state of the environment. Can be text or video frames."
        
        print(self.state)
    
    def close(self):
        "To be called before exiting, to free resources (not needed here)."
        pass

With this interface, we can interact with the environment in a standardized way:

* We first create the environment.

* We pick an initial state with `reset()`.

* For a fixed number of steps (or until the episode terminates):

    * We render the current state with `render()`.
    
    * We select an action using our RL algorithm or randomly.
    
    * We take that action (`step()`), observe the new state and the reward.
    
    * We go into the new state.

In [10]:
# Create the environment
env = FooEnv(10, 4)

# Sample the initial state
state = env.reset()

# Sample 10 transitions
for t in range(10):
    # Render the current state
    env.render()

    # Select an action randomly
    action = env.action_space.sample()
    
    # Sample a single transition
    next_state, reward, done, info = env.step(action)
    
    # Go in the next state
    state = next_state

# Exit cleanly
env.close()

0
3
2
6
4
8
5
8
9
2


That's it. To use any of the gym environments, replace the creation of the `FooEnv` instance with:

```python
env = gym.make('CartPole-v0')
```

if you want to interact with the Cartpole environment. Each environment has a unique key that includes a version number (here `v0`).

To list the available environments, call:
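The loops in this notebook assume the classic Gym interface, where `reset()` returns only the state and `step()` returns four values. Starting with Gym 0.26 (and in its successor Gymnasium), `reset()` returns a `(state, info)` pair and `step()` returns five values, with `done` split into `terminated` and `truncated`. If your installed version differs, a small adapter along these lines may help. This is only a sketch: `unpack_reset` and `unpack_step` are hypothetical helper names, not part of gym.

```python
def unpack_reset(result):
    """Return only the state, whether reset() gave `state` or `(state, info)`."""
    # Assumption: a state is never itself a 2-tuple whose second item is a dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def unpack_step(result):
    """Return (state, reward, done, info) from a 4- or 5-value step() result."""
    if len(result) == 5:
        state, reward, terminated, truncated, info = result
        # The episode is over if it terminated or was cut short (e.g. time limit)
        return state, reward, terminated or truncated, info
    state, reward, done, info = result
    return state, reward, done, info


# Classic 4-value result
print(unpack_step((3, 1.0, False, {})))
# Newer 5-value result: the episode was truncated
print(unpack_step((3, 1.0, False, True, {})))
```

With such helpers, the loops would read `state = unpack_reset(env.reset())` and `next_state, reward, done, info = unpack_step(env.step(action))`.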

```python
envs = gym.envs.registry.all()
print(envs)
```

**Q:** Interact randomly with many gym environments (Cartpole, Pendulum, Breakout, SpaceInvaders, etc). Print the rewards you obtain.

*Note 1:* If you run a fixed number of steps, you should reset the environment when a terminal state is encountered, otherwise you will be stuck in that terminal state:

```python
if done:
    state = env.reset()
```
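As a sketch of how this reset fits into a fixed-length loop, the function below also accumulates the return of each finished episode. `CountdownEnv` is a made-up stand-in that terminates every `horizon` steps so that the example runs without gym; any environment exposing the classic `reset()`/`step()` interface could be passed instead. The names `run_fixed_steps` and `nb_actions` are likewise illustrative.

```python
import random


class CountdownEnv:
    """Hypothetical stub: each episode lasts `horizon` steps, reward is always 1."""

    def __init__(self, horizon=4):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


def run_fixed_steps(env, nb_steps, nb_actions=2):
    """Interact randomly for nb_steps and return the list of episode returns."""
    returns = []          # Sum of rewards of each completed episode
    episode_return = 0.0  # Running sum for the current episode
    state = env.reset()
    for t in range(nb_steps):
        action = random.randrange(nb_actions)  # Random action
        state, reward, done, info = env.step(action)
        episode_return += reward
        if done:  # Terminal state: store the return and start a new episode
            returns.append(episode_return)
            episode_return = 0.0
            state = env.reset()
    return returns


# Two complete episodes of 4 steps fit into 10 steps; the third is unfinished
print(run_fixed_steps(CountdownEnv(horizon=4), nb_steps=10))  # [4.0, 4.0]
```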

*Note 2:* If you stop the execution of a cell but the window does not close, run `env.close()` in a separate cell.

*Note 3:* Some environments run very fast, especially Atari games. A simple solution is to have Python **sleep** briefly after rendering a frame, so that you can see something:

```python
import time
# ...
for t in range(1000):
    env.render()
    time.sleep(0.01) # sleep 10 milliseconds
    # ...
```

In [11]:
import gym 
# Create the environment
env = gym.make('CartPole-v0')

# Sample the initial state
state = env.reset()

# Sample 1000 transitions
for t in range(1000):
    # Render the current state
    env.render()

    # Select an action randomly
    action = env.action_space.sample()
    
    # Sample a single transition
    next_state, reward, done, info = env.step(action)
    
    # Go in the next state
    state = next_state
    
    # If terminal, reset
    if done:
        state = env.reset()

# Exit cleanly
env.close()

In [12]:
import time

# Create the environment
env = gym.make('SpaceInvaders-v0')

# Sample the initial state
state = env.reset()

# Sample 1000 transitions
for t in range(1000):
    # Render the current state
    env.render()
    time.sleep(0.01)

    # Select an action randomly
    action = env.action_space.sample()
    
    # Sample a single transition
    next_state, reward, done, info = env.step(action)
    
    # Go in the next state
    state = next_state
    
    # If terminal, reset
    if done:
        state = env.reset()

# Exit cleanly
env.close()

**Q:** Create a `RecyclingRobot` gym-like environment using last week's exercise.

The parameters `alpha`, `beta`, `r_wait` and `r_search` should be passed to the constructor of the environment and saved as attributes.

The state space is discrete, with two states `high` and `low` which will have indices 0 and 1. The three discrete actions `search`, `wait` and `recharge` have indices 0, 1, and 2.

The initial state of the MDP (`reset()`) should be the high state.

The `step()` method should generate transitions according to the dynamics of the MDP. Depending on the current state and the chosen action, make a transition to another state. For the actions `search` and `wait`, sample the reward from the normal distribution with mean `r_search` (resp. `r_wait`) and standard deviation 0.5.

If the random agent selects `recharge` in `high`, do nothing (next state is high, reward is 0).

Rendering just prints the current state. There is nothing to close.

Interact randomly with the MDP for several steps and observe the rewards. 
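One possible way to organize the dynamics described above before writing `step()` is a lookup table mapping each (state, action) pair to its possible outcomes. This is only a sketch: the `dynamics` function and its tuple layout are illustrative choices, the reward of -3 for depleting the battery is the value used in the solution cell of this notebook, and the noise on the rewards is omitted (only the mean is stored).

```python
HIGH, LOW = 0, 1
SEARCH, WAIT, RECHARGE = 0, 1, 2


def dynamics(alpha, beta, r_search, r_wait):
    """Map (state, action) to a list of (probability, next_state, mean_reward)."""
    return {
        (HIGH, SEARCH): [(alpha, HIGH, r_search), (1 - alpha, LOW, r_search)],
        (HIGH, WAIT): [(1.0, HIGH, r_wait)],
        (HIGH, RECHARGE): [(1.0, HIGH, 0.0)],
        # Searching with a low battery: depleted with probability 1 - beta
        (LOW, SEARCH): [(beta, LOW, r_search), (1 - beta, HIGH, -3.0)],
        (LOW, WAIT): [(1.0, LOW, r_wait)],
        (LOW, RECHARGE): [(1.0, HIGH, 0.0)],
    }


table = dynamics(alpha=0.3, beta=0.2, r_search=6, r_wait=2)

# Sanity check: outcome probabilities sum to 1 for every (state, action) pair
for key, outcomes in table.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12

# Expected immediate reward of searching in the low state:
# 0.2 * 6 + 0.8 * (-3), i.e. about -1.2
expected = sum(p * r for p, _, r in table[(LOW, SEARCH)])
print(expected)
```

A `step()` method can then either sample from such a table or hard-code the same cases with `if`/`elif` branches.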

In [13]:
class RecylingRobot(gym.Env):

    def __init__(self, alpha, beta, r_search, r_wait):
        "Initialize the environment."
        
        # Store parameters
        self.alpha = alpha
        self.beta = beta
        self.r_search = r_search
        self.r_wait = r_wait
        
        # State space, can be discrete or continuous.
        self.observation_space = gym.spaces.Discrete(2)
        
        # Action space, can be discrete or continuous.
        self.action_space = gym.spaces.Discrete(3)    
        
        super().__init__()
    
    def reset(self):
        "Resets the environment and starts from an initial state."
        
        # Start in the high state
        self.state = 0
        
        return self.state
    
    def step(self, action):
        """Takes an action and returns a new state, a reward, a boolean (True for terminal states) 
        and a dictionary with additional info (optional)"""
        
        if self.state == 0: # high
            if action == 0: # search
                p = np.random.rand()
                if p < self.alpha:
                    self.state = 0 # high
                else:
                    self.state = 1 # low
                self.reward = float(np.random.normal(self.r_search, 0.5))  # mean r_search, std 0.5
            elif action == 1: # wait
                self.state = 0 # high
                self.reward = float(np.random.normal(self.r_wait, 0.5))  # mean r_wait, std 0.5
            elif action == 2: # recharge
                self.state = 0 # high
                self.reward = 0.0
        elif self.state == 1: # low
            if action == 0: # search
                p = np.random.rand()
                if p < self.beta:
                    self.state = 1 # low
                    self.reward = float(np.random.normal(self.r_search, 0.5))  # mean r_search, std 0.5
                else:
                    self.state = 0 # high
                    self.reward = -3.0
            elif action == 1: # wait
                self.state = 1 # low
                self.reward = float(np.random.normal(self.r_wait, 0.5))  # mean r_wait, std 0.5
            elif action == 2: # recharge
                self.state = 0 # high
                self.reward = 0.0
        
        return self.state, self.reward, False, {}

    def render(self, mode='human'):
        "Displays the current state of the environment. Can be text or video frames."
        
        print(self.state)
    
    def close(self):
        "To be called before exiting, to free resources (not needed here)."
        pass

In [14]:
# Create the environment
env = RecylingRobot(alpha=0.3, beta=0.2, r_search=6, r_wait=2)

states = ['high', 'low']
actions = ['search', 'wait', 'recharge']

# Sample the initial state
state = env.reset()

# Sample 10 transitions
for t in range(10):
    # Render the current state
    env.render()

    # Select an action randomly
    action = env.action_space.sample()
    
    # Sample a single transition
    next_state, reward, done, info = env.step(action)
    
    print(states[state], "+", actions[action], "->", states[next_state], ":", reward)
    
    # Go in the next state
    state = next_state

# Exit cleanly
env.close()

0
high + recharge -> high : 0.0
0
high + recharge -> high : 0.0
0
high + search -> low : 7.072737524808381
1
low + search -> low : 4.8748445459280205
1
low + recharge -> high : 0.0
0
high + search -> low : 6.499376102875357
1
low + wait -> low : 1.6026651046615343
1
low + wait -> low : 2.640782574427737
1
low + wait -> low : 2.0998397935226567
1
low + search -> high : -3.0


To be complete, let's implement the random agent as a class:

In [15]:
class RandomAgent:
    def __init__(self, env, gamma):
        self.env = env
        self.gamma = gamma
    def select_action(self, state):
        return self.env.action_space.sample()
    def learn(self, state, action, reward, next_state):
        pass

The agent takes the environment as an argument, as well as the discount rate $\gamma$. `select_action()` randomly samples an action from the action space of the environment. `learn()` does nothing because random agents do not learn. This will have to be implemented for RL algorithms.
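The discount rate is not used by the random agent, but it determines how learning agents weigh future rewards. As a reminder, the discounted return of a reward sequence is $\sum_{k} \gamma^k \, r_{t+k+1}$. The helper below (a hypothetical name, not part of gym) computes it for a finite list of rewards:

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```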

**Q:** Modify the interaction loop so that the `RandomAgent` interacts with the `RecyclingRobot` environment for a fixed number of iterations.

In [16]:
# Create the environment
env = RecylingRobot(alpha=0.3, beta=0.2, r_search=6, r_wait=2)

states = ['high', 'low']
actions = ['search', 'wait', 'recharge']

# Creating the random agent
agent = RandomAgent(env, gamma=0.7)

# Sample the initial state
state = env.reset()

# Sample 1000 transitions
for t in range(1000):
    # Render the current state
    env.render()

    # Select an action randomly
    action = agent.select_action(state)
    
    # Sample a single transition
    next_state, reward, done, info = env.step(action)
    
    print(states[state], "+", actions[action], "->", states[next_state], ":", reward)
    
    # Learn from the (s, a, r, s') transition
    agent.learn(state, action, reward, next_state)
    
    # Go in the next state
    state = next_state

# Exit cleanly
env.close()

0
high + wait -> high : 1.2092650868954213
0
high + wait -> high : 1.9985443925921826
0
high + search -> high : 6.878534062032354
0
high + search -> high : 5.876490775384255
0
high + wait -> high : 2.6095647752778515
0
high + search -> low : 6.142351959104785
1
low + wait -> low : 2.318363792841558
1
low + wait -> low : 1.6704729016262323
1
low + search -> high : -3.0
0
high + wait -> high : 1.9673012704614368
0
high + search -> low : 6.742837143430857
1
low + recharge -> high : 0.0
0
high + recharge -> high : 0.0
0
high + wait -> high : 2.3981331489900515
0
high + recharge -> high : 0.0
0
high + wait -> high : 2.2192785911313315
0
high + search -> high : 6.080127019829225
0
high + wait -> high : 0.8007715425065975
0
high + wait -> high : 1.8432478405678347
0
high + search -> high : 5.599776258981913
0
high + search -> low : 5.558733917161804
1
low + wait -> low : 2.525313755731908
1
low + recharge -> high : 0.0
0
high + search -> low : 5.296551032079032
1
low + wait -> low : 2.2929838

That's it! We now "only" need to define classes for all the sampling-based RL algorithms (MC, TD, deep RL) and we can interact with any environment using the previous cell!
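One way to make the previous cell reusable is to wrap it in a function that accepts any environment and any agent exposing `select_action()` and `learn()`. The sketch below is an assumption about how that could look, not the course's reference code; `TwoStateEnv` and `CountingAgent` are made-up stand-ins so that it runs without gym.

```python
import random


class TwoStateEnv:
    """Hypothetical stub: action 1 flips the state, the reward is the new state."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state = 1 - self.state
        return self.state, float(self.state), False, {}


class CountingAgent:
    """Random agent that only counts how many transitions it was shown."""

    def __init__(self, nb_actions):
        self.nb_actions = nb_actions
        self.nb_updates = 0

    def select_action(self, state):
        return random.randrange(self.nb_actions)

    def learn(self, state, action, reward, next_state):
        self.nb_updates += 1


def run(env, agent, nb_steps):
    """Run the interaction loop and return the list of collected rewards."""
    rewards = []
    state = env.reset()
    for t in range(nb_steps):
        action = agent.select_action(state)
        next_state, reward, done, info = env.step(action)
        agent.learn(state, action, reward, next_state)
        rewards.append(reward)
        # Reset on terminal states, otherwise move to the next state
        state = env.reset() if done else next_state
    return rewards


agent = CountingAgent(nb_actions=2)
rewards = run(TwoStateEnv(), agent, nb_steps=100)
print(len(rewards), agent.nb_updates)  # 100 100
```

Swapping in a learning agent then only requires a class with the same two methods.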