<h1 style="text-align: center;">OpenAI Gym</h1>

<img width="700px" src="assets/openai.png">

In this chapter, we'll learn the basics of the OpenAI Gym API and write our first randomly behaving agent to make ourselves familiar with all the concepts.

To experiment with the code on your own (the most useful way to learn anything), it is best to get access to a machine with a GPU. This can be done in various ways:
- Buying a modern GPU suitable for CUDA
- Using cloud instances: Both Amazon AWS and Google Cloud can provide you with GPU-powered instances

<br>

# 1. The anatomy of the agent

---

As we saw in the previous chapter, there are several entities in RL's view of the world:

- **Agent:** A person or a thing that takes an active role. In practice, it's some piece of code, which implements some policy. Basically, this policy must decide what action is needed at every time step, given our observations.

- **Environment:** Some model of the world, which is external to the agent and has the responsibility of providing us with observations and giving us rewards. It changes its state based on our actions.

In [None]:
# Import the libraries
import random

In [None]:
# The environment
class Environment:
    
    # Initialization function
    def __init__(self):
        """
        Function for initializing the internal state of the environment. In here, we assign
        a counter that limits the number of time steps the agent is allowed to take to 
        interact with the environment.
        """
        self.steps_left = 10

    # Get the observation
    def get_observation(self):
        """
        Function for getting the current environment's observation. In here, we return the 
        observation vector of zero since the environment has no internal state.
        """
        return [0.0, 0.0, 0.0]

    # Get the action
    def get_actions(self):
        """
        Function for querying the set of actions it can execute.
        """
        return [0, 1]

    # Signal the end of episode
    def is_done(self):
        """
        Function that signals the end of the episode to the agent.
        """
        return self.steps_left == 0

    
    def action(self, action):
        """
        Function for action which does the following two things:
            - Handles the agent's action
            - Returns the reward for this action.
            
        In here, the reward is random and its action is discarded. Additionally, we update the 
        count of steps and refuse to continue the episodes which are over.
        """
        # If in terminal state, then the game is finished
        if self.is_done():
            raise Exception("Game is over")
            
        # Decreasing the time step by 1
        self.steps_left -= 1
        
        # Return a random reward
        return random.random()

In [None]:
# The agent
class Agent:
    
    # The constructor
    def __init__(self):
        """
        Function for initializing the counter that will keep the total reward accumulated by the 
        agent during the episode
        """
        self.total_reward = 0.0

    def step(self, env):
        """
        The step function accepts the environment instance as an argument and allows the agent to 
        perform the following actions:
        
            - Observe the environment
            - Make a decision about the action to take based on the observations
            - Submit the action to the environment
            - Get the reward for the current step
        
        For our example, the agent is dull and ignores observations obtained during the decision 
        process about which action to take. Instead, every action is selected randomly.
        """
        # Get the current observation
        current_obs = env.get_observation()
        
        # Get the actions
        actions = env.get_actions()
        
        # Take a random action and get the reward
        reward = env.action(random.choice(actions))
        
        # Accumulate the reward
        self.total_reward += reward

In [1]:
# The final piece is the glue code, which creates both classes and runs one episode:        
if __name__ == "__main__":
    env = Environment()
    agent = Agent()

    while not env.is_done():
        agent.step(env)

    print("Total reward got: %.4f" % agent.total_reward)

Total reward got: 4.6047


By running this several times, you'll get different amounts of reward gathered by the agent.

The simplicity of the preceding code allows us to illustrate important basic concepts that come from the RL model. The environment could be an extremely complicated physics model, and an agent could easily be a large neural network implementing the latest RL algorithm, but the basic pattern stays the same: on every step, an agent takes some observations from the environment, does its calculations, and selects the action to issue. The result of this action is a reward and new observation.

You may wonder, if the pattern is the same, why do we need to write it from scratch? Perhaps it is already implemented by somebody and could be used as a library?
Of course, such frameworks exist, but before we spend some time discussing them, let's prepare your development environment.

<br>

# 2. Hardware and software requirements

---

The external libraries we'll use in this book are open source software, including the following:
- **numpy==1.14.2**
- **atari-py==0.1.1**
- **gym==0.10.4**
- **ptan==0.3**
- **opencv-python==3.4.0.12**
- **scipy==1.0.1**
- **pytorch==0.4.0**
- **torch==0.4.0**
- **torchvision==0.2.1**
- **tensorboard-pytorch==0.7.1**
- **tensorflow==1.7.0**
- **tensorboard==1.7.0**

<br>

# 3. OpenAI Gym API

---

Gym is a Python library developed by OpenAI (www.openai.com). Its main goal is to provide a collection of environments for RL experiments, so it is not surprising that the central class in the library is called Env. Each environment provides the following pieces of functionality:

- A **set of actions** that can be executed in an environment. These <u>actions</u> can be <u>discrete</u> or continuous. It can also be a <u>combination</u> of them.

- The **observations** that an environment provides the agent with.

- **Step method** for executing an action which returns the <u>current observation</u>, <u>reward</u>, and <u>indication that the episode is over</u>.

- **Reset method** for returning the environment to its initial state and obtaining the first observation.


Let's talk about these components of the environment in detail:

<br>

### 3.1. Action space

The actions that an agent can execute can be <u>discrete</u>, <u>continuous</u>, or a <u>combination of both</u>:
- **Discrete actions:** 
    - A fixed set of things that an agent could do. 
    - For example, directions in a grid (like left, right, up, or down). 
    - Another example is a push button (which can be either pressed or released). Both are discrete action spaces because the main characteristic of a discrete action space is that only one action from a fixed set can be taken at a time.


- **Continuous actions:** 
    - A continuous action has a value attached to it, for instance, a steering wheel (which can be turned to a specific angle) or an accelerator pedal (which can be pressed with different levels of force). A description of a continuous action includes the boundaries of the value that the action could have. In the case of a steering wheel, it could be from −720 degrees to 720 degrees. For an accelerator pedal, it's usually from 0 to 1.


- **Multiple actions:**
    - Of course, we're not limited to a single action to perform, and the environment could have multiple actions, such as pushing multiple buttons simultaneously or steering the wheel and pressing two pedals (brake and accelerator). To support such cases, Gym defines a special container class that allows the nesting of several action spaces into one unified action.

<br>

### 3.2. Observation space

Besides the reward, the environment provides the agent with observations. These **observations** can be as simple as a bunch of numbers or as complex as multidimensional tensors (like color images from several cameras).

An observation can be **discrete** (much like action spaces). An example of such a discrete observation space could be a light bulb, which could be in two states: on or off, given to us as a Boolean value.

The following diagram shows the hierarchy of the Space class in Gym:

<img width="500px" src="assets/hierarchy of the space.png">

The Space class has two methods:
- **sample()**: This returns a random sample from the space
- **contains(x)**: This checks if the argument x belongs to the space's domain

These two methods are abstract and are re-implemented in the child classes:

1. **Box:** 
    - This class represents an n-dimensional tensor of rational numbers with intervals (that is <u>[low, high]</u>). 
    - For example, let's say an accelerator pedal has a single value between 0.0 and 1.0. This can be encoded as ***Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)***.
        - The shape is a tuple of length 1 with a single value of 1 (one-dimensional tensor with a single value).
        - The dtype parameter specifies the value type. In here, it is a NumPy 32-bit float.
        - Another example for Box can be an Atari screen observation, which is an RGB image of size 210 × 160: ***Box(low=0, high=255, shape=(210, 160, 3), dtype=np.uint8)***.
            - The shape argument is a tuple of three elements. The first dimension is the height of the image, the second is the width, and the third equals 3, which corresponds to the three color planes of RGB. So, in total, every observation is a 3D tensor with 100,800 bytes.


2. **Discrete:** 
    - This class represents a set of items that is numbered from 0 to n-1. This contains a field named **n** which is the item count. 
    - For example, <i>Discrete(n=4)</i> as an action space can mean that there are four directions in which we can move (left, right, up, down).


3. **Tuple:**
    - This class allows us to combine several Space class instances together. This enables us to create action and observation spaces of any complexity that we want. 
    - For example, let's say we want to create an action space for a car. The car has several controls that can change at every timestep, including the <u>steering wheel angle</u>, <u>brake pedal position</u>, and <u>accelerator pedal position</u>. These three controls can be represented by three float values in one single Box instance. Besides these three controls, the car has extra discrete controls as well, such as a turn signal (which could be "off," "right," or "left"), a horn ("on" or "off"), and others. To combine all these controls into one action space, we can create ***Tuple(spaces=(Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32), Discrete(n=3), Discrete(n=2)))***.

There are other subclasses defined in Gym as well; however, these three are the most useful ones.
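The 100,800-byte figure quoted for the Atari observation can be checked with a few lines of arithmetic. The sketch below uses only the standard library; the variable names are illustrative, and the one-byte item size is simply the size of an unsigned 8-bit integer.

```python
# Shape of the Atari observation from the Box example: height, width, color planes
shape = (210, 160, 3)

# An unsigned 8-bit integer (uint8) occupies exactly one byte
bytes_per_value = 1

# Multiply all dimensions together to get the number of values in the tensor
num_values = 1
for dim in shape:
    num_values *= dim

total_bytes = num_values * bytes_per_value
print(total_bytes)  # 100800
```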

All subclasses implement the sample() and contains() methods:

- The **sample()** function performs a random sample corresponding to the Space class and parameters. This is mostly useful for action spaces, when we need to choose the random action.
    - **Discrete.sample()** returns a random element from a discrete range.
    - **Box.sample()** returns a random tensor with proper dimensions and values lying inside the given range.


- The **contains()** method verifies that the given arguments comply with Space parameters, and it is used in the internals of Gym to check an agent's actions for sanity. 
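To make the contract of these two methods concrete, here is a minimal pure-Python sketch. The classes `SimpleDiscrete` and `SimpleBox` are hypothetical stand-ins written for this illustration, not Gym's implementation; they only mimic the sample()/contains() behavior described above and use plain lists instead of NumPy arrays.

```python
import random


class SimpleDiscrete:
    """Hypothetical stand-in for a discrete space with items 0..n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Pick one of the n items uniformly at random
        return random.randrange(self.n)

    def contains(self, x):
        # Only integers in the range [0, n) belong to the space
        return isinstance(x, int) and 0 <= x < self.n


class SimpleBox:
    """Hypothetical stand-in for a flat box of `size` floats in [low, high]."""

    def __init__(self, low, high, size):
        self.low, self.high, self.size = low, high, size

    def sample(self):
        # Draw every component independently from [low, high]
        return [random.uniform(self.low, self.high) for _ in range(self.size)]

    def contains(self, x):
        # Check both the length and the bounds of every component
        return len(x) == self.size and all(self.low <= v <= self.high for v in x)


directions = SimpleDiscrete(4)        # e.g. left, right, up, down
pedal = SimpleBox(0.0, 1.0, size=1)   # e.g. accelerator pedal position

print(directions.contains(directions.sample()))  # a sample always belongs to its space
print(directions.contains(4))                    # 4 is outside 0..3
print(pedal.contains([1.5]))                     # 1.5 is outside [0.0, 1.0]
```

The sanity check that contains() performs is the same idea the text describes for Gym's internals: an action is validated against the space before it is used.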

<br>

Every environment has two members of type Space: 
1. **action_space**
2. **observation_space**

Note that dealing with pixels of the screen is different from handling discrete observations. In case of using pixels, we may want to preprocess images with convolutional layers or with other methods from the computer vision toolbox.

<br>

### 3.3. The environment

The environment is represented in Gym by the Env class, which has the following members:
- **action_space**: This is the field of the Space class, providing the allowed actions in the environment. 
- **observation_space**: This field has the same Space class, but specifies the observations provided by the environment.
- **reset()**: This resets the environment to its initial state, returning the initial observation vector.
- **step()**: This method allows the agent to give the action and returns the information about the outcome of the action: 
    - The next observation 
    - The local reward
    - End-of-episode flag
- **render()**: This is an extra utility method that allows us to obtain the observation in a human-friendly form. However, we won't use it.

<br>

Communication with the environment is performed via two methods: 
1. **reset**: <br>The reset() method has no arguments, and it instructs an environment to reset into its initial state and obtain the initial observation. Note that you have to call reset() after the creation of the environment. As you may remember, the agent's communication with the environment could have an end (like a "Game Over" screen). Such sessions are called episodes, and after the end of the episode, an agent needs to start over. The value returned by this method is the first observation of the environment.<br>
2. **step**: The step() method is the central piece of the environment's functionality. It does several things in one call:
        1. Telling the environment which action we'll execute on the next step
        2. Getting the new observation from the environment after this action
        3. Getting the reward the agent gained with this step
        4. Getting the indication that the episode is over
    
The first item (the action) is passed as the only argument to this method, and the rest are returned by it. Precisely, it returns a tuple of four elements (observation, reward, done, and extra_info). They have these types and meanings:
- **observation**: This is a NumPy vector or a matrix with observation data.
- **reward**: This is the float value of the reward.
- **done**: This is a Boolean indicator, which is True when the episode is over.
- **extra_info**: This could be anything environment-specific with extra information about the environment. The usual practice is to ignore this value in general RL methods (not taking into account the specific details of the particular environment).

So, you may have already got the idea of environment usage in an agent's code: in a loop, call the step() method with an action to perform until this method's done flag becomes True. Then we can call reset() to start over. There is only one piece missing: how we create Env objects in the first place.
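The loop described above can be sketched without installing Gym. The `CountdownEnv` class below is a hypothetical toy environment written for this illustration; it only follows the reset()/step() protocol described in this section (a four-element tuple returned from step()), and its observations and rewards are made up.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment: every episode lasts a fixed number of steps."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.steps_left = 0

    def reset(self):
        # Return to the initial state and hand back the first observation
        self.steps_left = self.episode_length
        return [float(self.steps_left)]

    def step(self, action):
        # The action is ignored here; a real environment would react to it
        self.steps_left -= 1
        obs = [float(self.steps_left)]
        reward = 1.0
        done = self.steps_left == 0
        return obs, reward, done, {}


def run_episode(env):
    """Run one episode with random actions and return (steps, total_reward)."""
    obs = env.reset()
    steps, total_reward, done = 0, 0.0, False
    while not done:
        action = random.choice([0, 1])
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
    return steps, total_reward


env = CountdownEnv(episode_length=5)
for episode in range(2):
    # reset() is called inside run_episode, so the same env object is reused
    print(run_episode(env))
```

Because reset() is called at the start of every episode, the same environment object can be reused for as many episodes as needed.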

<br>

### 3.4. Creation of the environment

Every environment has a unique name of the ***EnvironmentName-vN*** form, where N is the number used to distinguish between different versions of the same environment. To create an environment, the Gym package provides the make(env_name) function, whose only argument is the environment's name as a string.
Note that the same environment can have different variations in its settings and observation spaces. For example, the Atari game Breakout has these environment names:

- **Breakout-v0, Breakout-v4**: The original breakout with a random initial position and direction of the ball.
- **BreakoutDeterministic-v0**, **BreakoutDeterministic-v4**: Breakout with the same initial placement and speed vector of the ball.
- **BreakoutNoFrameskip-v0**, **BreakoutNoFrameskip-v4**: Breakout with every frame displayed to the agent.
- **Breakout-ram-v0**, **Breakout-ram-v4**: Breakout with observation of full Atari emulation memory (128 bytes) instead of screen pixels.
- **Breakout-ramDeterministic-v0**, **Breakout-ramDeterministic-v4**
- **Breakout-ramNoFrameskip-v0**, **Breakout-ramNoFrameskip-v4**

In total, there are 12 environments for good old Breakout. In case you've never seen it before, here is a screenshot of its gameplay:

<img src="assets/space invadors.png">
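The count of 12 follows from combining two observation types, three variants, and two versions. Here is a short sketch that rebuilds the names listed above from those parts; it is plain string formatting with illustrative variable names, and no Gym call is involved.

```python
from itertools import product

observation_types = ["", "-ram"]                  # screen pixels or emulator memory
variants = ["", "Deterministic", "NoFrameskip"]   # plain, Deterministic, NoFrameskip
versions = [0, 4]

names = [
    "Breakout%s%s-v%d" % (obs_type, variant, version)
    for obs_type, variant, version in product(observation_types, variants, versions)
]

print(len(names))  # 12
for name in names:
    print(name)
```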

The environments can be divided into several groups:
1. **Classic control problems**: These are toy tasks that are used in optimal control theory and RL papers as benchmarks or demonstrations. They are usually simple, with a low-dimension observation and action spaces, but they are useful as quick checks when implementing algorithms. Think about them as the "MNIST for RL". 
<img src="assets/classic_control.png">


2. **Atari 2600**: These are games from the classic game platform from the 1970s. There are 63 unique games.
<img src="assets/atari.png">


3. **Algorithmic**: These are problems that aim to perform small computation tasks, such as copying the observed sequence or adding numbers.
<img src="assets/Algorithms.png">


4. **Board games**: These are the games of Go and Hex.


5. **Box2D**: These are environments that use the Box2D physics simulator to learn walking or car control.
<img src="assets/Box2D.png">


6. **MuJoCo**: This is another physics simulator used for several continuous control problems.
<img src="assets/MuJoCo.png">

7. **Parameter tuning**: This is RL being used to optimize neural network parameters.


8. **Toy text**: These are simple grid-world text environments.
<img src="assets/Toy text.png">


9. **PyGame**: These are several environments implemented using the PyGame engine.


10. **Doom**: These are nine mini-games implemented on top of ViZdoom.


The full list of environments can be found at https://gym.openai.com/envs or on the wiki page in the project's GitHub repository.

<br>

### 3.5. The CartPole session

The CartPole environment is from the "classic control" group, and its gist is to control a platform with a stick attached to it by its bottom end. The tricky part is that this stick tends to fall right or left, and you need to balance it by moving the platform to the right or left on every step. If you see a warning message when creating the environment, it is not your fault but a small inconsistency inside Gym, which doesn't affect the result.

<img width="400px" src="assets/CartPole.png">

In [2]:
# Import Gym package
import gym

# Create CartPole environment
e = gym.make('CartPole-v0')

The **observation** of this environment is four float numbers containing information about: 
- The x coordinate of the stick's center of mass
- Its speed
- Its angle to the platform
- Its angular speed.

Of course, by applying some math and physics knowledge, it wouldn't be complicated to convert these numbers into actions when we need to balance the stick, but our problem is much trickier: how do we learn to balance this system without knowing the exact meaning of the observed numbers, only by getting the reward? The reward in this environment is 1, given on every time step. The episode continues until the stick falls, so to accumulate more reward, we need to balance the platform in a way that keeps the stick from falling.

In [2]:
# Reset the environment + Obtain the first observation (we always need to reset the newly created environment)
obs = e.reset()
obs

array([ 0.02576695, -0.01319964, -0.00929848, -0.04565716])

Below, the action_space field is of the Discrete type, so our **actions** will be one of the following:
- **0**: Pushing the platform to the left 
- **1**: Pushing the platform to the right. 

In [3]:
# Get the action space
e.action_space

Discrete(2)

Below, the **observation space** is Box(4,), which means a vector of size four with values inside the [−inf, inf] interval.

In [13]:
# Get the observation space
e.observation_space

Box(4,)

Below, we push our platform to the left by executing action 0 and get a tuple of four elements:
1. New observation that is a new vector of four numbers
2. Reward of 1.0
3. The done flag = False, which means that the episode is not over yet
4. Extra information about the environment that is an empty dictionary

In [5]:
# Take action 0 (going left) and get the tuple of (new observation, reward, terminal state, extra information)
e.step(0)

(array([ 0.02550296, -0.20818703, -0.01021162,  0.24407757]), 1.0, False, {})

Below, we use the sample() method on action_space and observation_space. This method returns a random sample from the underlying space, which for our Discrete action space means a random number, 0 or 1, and for the observation space means a random vector of four numbers.

The random sample of the observation space is not very useful; however, the sample from the action space can be used when we're not sure how to perform an action. This feature is especially handy for us, as we don't know any RL methods yet, but still want to play around with the Gym environment.

In [10]:
# Random sample from action space
e.action_space.sample()

0

In [9]:
# Random sample from action space
e.action_space.sample()

1

In [11]:
# Random sample from observation space
e.observation_space.sample()

array([1.56953299e+00, 2.80430355e+38, 1.02720976e-01, 3.39920022e+38],
      dtype=float32)

In [12]:
# Random sample from observation space
e.observation_space.sample()

array([ 3.3692710e+00, -2.4931366e+37, -2.9104799e-01,  2.1715210e+38],
      dtype=float32)

<br>

# 4. The random CartPole agent
---

Although this environment is much more complex than our first example in The anatomy of the agent section, the code of the agent is much shorter. This is the power of reusability, abstractions, and third-party libraries!

So, here is the code (you can find it in Chapter02/02_cartpole_random.py):

In [16]:
# Import the gym library
import gym

# Executing the main program
if __name__ == "__main__":
    
    # Get the CartPole environment
    env = gym.make("CartPole-v0")
    
    # Initialize total reward to 0
    total_reward = 0.0
    
    # Initialize the total number of time steps
    total_steps = 0
    
    # Reset the environment + Obtain the first observation
    obs = env.reset()
    
    # Infinite loop
    while True:
        
        # Sample a random action (from a discrete action space which is a random number of 0 or 1)
        action = env.action_space.sample()
        
        # Take the above action and get the tuple of (new observation, reward, terminal state, extra information)
        obs, reward, done, _ = env.step(action)
        
        # Append the reward into total reward
        total_reward += reward
        
        # Increment the total time step
        total_steps += 1
        
        # If episode ended then break
        if done:
            break

    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

Episode done in 12 steps, total reward 12.00


On average, our random agent makes 12–15 steps before the pole falls and the episode ends. Most of the environments in Gym have a **"reward boundary"**, which is the average reward that the agent should gain during 100 consecutive episodes to "solve" the environment. For CartPole, this boundary is 195, which means that, on average, the agent must hold the stick for 195 time steps or longer. From this perspective, our random agent's performance looks poor. However, don't be disappointed too early, because we are just at the beginning, and soon we will solve CartPole and many other much more interesting and challenging environments.
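As a sketch of how such a boundary could be checked, the helper below keeps the rewards of the most recent episodes and compares their mean with the boundary. The function name `is_solved` and its parameters are hypothetical, chosen for this illustration; Gym itself is not involved.

```python
from collections import deque


def is_solved(episode_rewards, boundary=195.0, window=100):
    """Return True if the mean of the last `window` rewards reaches `boundary`.

    Fewer than `window` finished episodes never count as solved.
    """
    recent = deque(episode_rewards, maxlen=window)
    if len(recent) < window:
        return False
    return sum(recent) / window >= boundary


# An agent that lasts 13 steps per episode is far from the boundary
print(is_solved([13.0] * 100))                 # False
# 100 consecutive episodes of 200 steps each reach it
print(is_solved([13.0] * 50 + [200.0] * 100))  # True
```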

<br>

# 5. The extra Gym functionality – wrappers and monitors

---

What we discussed so far covers two-thirds of the Gym core API and the essential functions required to start writing agents. The rest of the API you can live without, but it will make your life easier and your code cleaner. So, let's look at a quick overview of the rest of the API.

<br>

### 5.1. Wrappers

Very often, we want to extend the environment's functionality in some generic way. For example:

- An environment gives observations and we usually want to accumulate them in some buffer and provide the last N observations to the agent (common in dynamic computer games since one single frame is not enough to get the full information about the game state).

- We may want to crop or preprocess image pixels to make them easier for the agent to digest, or to normalize the reward somehow.

There are many such situations in which we'd like to "wrap" the existing environment and add some extra logic. Gym provides you with a convenient framework for these situations, called the **Wrapper** class. The class structure is shown in the following diagram.

<img width="400px" src="assets/wrapper.png">

The Wrapper class inherits from the Env class. Its constructor accepts only one argument: the instance of the Env class to be "wrapped." To add extra functionality, you need to redefine the methods you want to extend, such as step() or reset(). The only requirement is to call the original method of the superclass.

To handle more specific requirements, such as a Wrapper class that wants to process only observations from the environment or only actions, there are subclasses of Wrapper that allow filtering of only a specific portion of information. They are as follows:

- **ObservationWrapper**: You need to redefine the observation(obs) method of the parent. The obs argument is an observation from the wrapped environment, and this method should return the observation that will be given to the agent.
- **RewardWrapper**: This exposes the reward(rew) method, which could modify the reward value given to the agent.
- **ActionWrapper**: You need to override the action(act) method, which could tweak the action passed by the agent to the wrapped environment.
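The delegation mechanism behind these classes can be sketched in plain Python. `ConstantEnv`, `MiniWrapper`, `MiniRewardWrapper`, and `ScaleReward` below are hypothetical names written for this illustration and are much simpler than Gym's real classes; the point is only to show how a subclass can change one piece of the step() result while forwarding everything else to the wrapped environment.

```python
class ConstantEnv:
    """Hypothetical toy environment that ends after three steps."""

    def reset(self):
        self.steps = 0
        return [0.0]

    def step(self, action):
        self.steps += 1
        return [float(self.steps)], 1.0, self.steps >= 3, {}


class MiniWrapper:
    """Simplified wrapper: forwards reset() and step() to the wrapped env."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class MiniRewardWrapper(MiniWrapper):
    """Simplified reward filter: only the reward passes through reward()."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, self.reward(reward), done, info

    def reward(self, rew):
        raise NotImplementedError


class ScaleReward(MiniRewardWrapper):
    """Multiply every reward by a constant factor."""

    def __init__(self, env, factor):
        super().__init__(env)
        self.factor = factor

    def reward(self, rew):
        return rew * self.factor


env = ScaleReward(ConstantEnv(), factor=0.1)
env.reset()
total, done = 0.0, False
while not done:
    _, reward, done, _ = env.step(0)
    total += reward
print(round(total, 2))  # 0.3
```

The agent's loop is unchanged by the wrapper: it still calls reset() and step(), and only the reward it receives differs.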

For example, let's imagine a situation where we want to intervene in the stream of actions sent by the agent and, with a given probability epsilon (for example 10%; the code below uses a default of 0.5), replace the current action with a random one. This might look unwise, but this is one of the most practical methods for solving the "exploration/exploitation problem". As you remember from chapter 1, by issuing random actions, we make our agent explore the environment and from time to time drift away from the beaten track of its policy. This is an easy thing to do, using the ActionWrapper class:

In [7]:
# Import the libraries
import gym
import random

# Random action wrapper class
class RandomActionWrapper(gym.ActionWrapper):
    
    # Init function
    def __init__(self, env, epsilon = 0.5):
        
        # Initialize the wrapper by calling a parent's init method
        super(RandomActionWrapper, self).__init__(env)
        
        # Initialize the epsilon (probability of a random action)
        self.epsilon = epsilon

    # Action function (This is a method that we need to override from a parent's class to tweak the agent's actions)
    def action(self, action):
        
        # Get a random number + check if it's less than epsilon
        if random.random() < self.epsilon:
            
            # Sample a random action from the action space
            print("Random!")
            return self.env.action_space.sample()
        
        # Otherwise, return the original action unchanged
        return action

    
# Executing the main program
if __name__ == "__main__":
    
    # Get the CartPole environment and pass it to our wrapper constructor
    env = RandomActionWrapper(gym.make("CartPole-v0"))

    # Reset the environment + Obtain the first observation
    obs = env.reset()
    
    # Initialize total reward to 0
    total_reward = 0.0

    # Infinite loop
    while True:
        
        # Take action 0 and get the tuple of (new observation, reward, terminal state, extra information)
        obs, reward, done, _ = env.step(0)
        
        # Append reward into total reward
        total_reward += reward
        
        # If terminal state
        if done:
            
            # Break the loop
            break
    
    # Print the total reward
    print("Reward got: %.2f" % total_reward)

Random!
Random!
Random!
Random!
Random!
Random!
Random!
Random!
Reward got: 12.00


**You can play with the epsilon parameter on the wrapper's creation and verify that randomness improves the agent's score on average.**

<br>

### 5.2. Monitor (OPTIONAL)

Another class you should be aware of is Monitor. It is implemented like Wrapper and can write information about your agent's performance to a file, with an optional video recording of your agent in action. Some time ago, it was possible to upload the result of the Monitor class's recording to the https://gym.openai.com website and see your agent's position in comparison to other people's results (see the following screenshot), but, unfortunately, at the end of August 2017, OpenAI decided to shut down this upload functionality and froze all the results. There are several efforts to implement an alternative to the original website, but they are not ready yet. I hope this situation will be resolved soon, but at the time of writing it's not possible to check your result against those of others.

Just to give you an idea of how the Gym web interface looked, here is the CartPole environment leaderboard:

<img width="700px" src="assets/cartpolev0.png">

Every submission in the web interface had details about training dynamics. For example, the following is the author's solution for one of Doom's mini-games:

<img width="700px" src="assets/doom.png">

Despite this, Monitor is still useful, as you can take a look at your agent's life inside the environment. So, here is how we add Monitor to our random CartPole agent, which is the only difference (the entire code is in Chapter02/04_cartpole_random_monitor.py):

In [None]:
# Import the gym library
import gym

# Executing the main program
if __name__ == "__main__":
    
    # Get the CartPole environment
    env = gym.make("CartPole-v0")
    
    # Monitor
    env = gym.wrappers.Monitor(env, "recording")

    # Initialize total reward to 0
    total_reward = 0.0
    
    # Initialize the total number of time steps
    total_steps = 0
    
    # Reset the environment + Obtain the first observation
    obs = env.reset()

    # Infinite loop
    while True:
        
        # Sample a random action (from a discrete action space which is a random number of 0 or 1)
        action = env.action_space.sample()
        
        # Take the above action and get the tuple of (new observation, reward, terminal state, extra information)
        obs, reward, done, _ = env.step(action)
        
        # Append reward into total reward
        total_reward += reward
        
        # Increment the total time step
        total_steps += 1
        
        # If terminal state
        if done:
            
            # Break the loop
            break

    # Print the reward
    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))
    
    # Close the environment
    env.close()
    env.env.close()

The second argument in <code>gym.wrappers.Monitor(env, "recording")</code> is the name of the directory it will write the results to. This directory shouldn't exist, otherwise your program will fail with an exception (to overcome this, you could either remove the existing directory or pass the force=True argument to the Monitor class' constructor).
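If you prefer to remove a stale directory yourself before creating the Monitor, a sketch using only the standard library could look like this. The helper name `remove_if_exists` is hypothetical, and the temporary location is used only so the example leaves no files behind; the Monitor itself is not created here.

```python
import os
import shutil
import tempfile


def remove_if_exists(path):
    """Delete `path` and everything inside it; do nothing if it is absent."""
    if os.path.isdir(path):
        shutil.rmtree(path)


# Demonstration inside a temporary location so nothing important is touched
base = tempfile.mkdtemp()
recording_dir = os.path.join(base, "recording")

os.makedirs(recording_dir)           # simulate a leftover from a previous run
remove_if_exists(recording_dir)
print(os.path.exists(recording_dir))

remove_if_exists(recording_dir)      # calling it again is harmless
shutil.rmtree(base)                  # clean up the temporary location
```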

The Monitor class requires the FFmpeg utility to be present on the system, which is used to convert captured observations into an output video file. This utility must be available, otherwise Monitor will raise an exception. The easiest way to install FFmpeg is using your system's package manager, which is OS distribution-specific.
To start this example, one of these three extra prerequisites should be met:
- The code should be run in an X11 session with the OpenGL extension (GLX)
- The code should be started in an Xvfb virtual display
- You can use X11 forwarding in ssh connection

The cause of this is video recording, which is done by taking screenshots of the window drawn by the environment. Some environments use OpenGL to draw their picture, so the graphical mode with OpenGL needs to be present. This could be a problem for a virtual machine in the cloud, which physically doesn't have a monitor and graphical interface running. To overcome this, there is a special "virtual" graphical display, called Xvfb (X11 virtual framebuffer), which basically starts a virtual graphical display on the server and forces the program to draw inside it. This is enough to make Monitor happily create the desired videos.
To start your program in the Xvfb environment, you need to have it installed on your machine (it usually requires installing the xvfb package) and run the special script, xvfb-run:

                $ xvfb-run -s "-screen 0 640x480x24" python 04_cartpole_random_monitor.py
                [2017-09-22 12:22:23,446] Making new env: CartPole-v0
                [2017-09-22 12:22:23,451] Creating monitor directory recording
                [2017-09-22 12:22:23,570] Starting new video recorder writing to
                recording/openaigym.video.0.31179.video000000.mp4
                Episode done in 14 steps, total reward 14.00
                [2017-09-22 12:22:26,290] Finished writing results. You can upload
                them to the scoreboard via gym.upload('recording')
                
As you can see from the preceding log, the video has been written successfully, so you can peek inside one of your agent's sessions by playing it.

Another way to record your agent's actions is to use ssh X11 forwarding, which uses the ssh ability to tunnel X11 communications between the X11 client (Python code which wants to display some graphical information) and X11 server (software which knows how to display this information and has access to your physical display).
In X11 architecture, the client and the server are separated and can work on different machines. To use this approach, you need the following:

1. An X11 server running on your local machine. Linux comes with an X11 server as a standard component (all desktop environments are using X11). On a Windows machine, you can set up third-party X11 implementations such as the open source VcXsrv (available at https://sourceforge.net/projects/vcxsrv/).

2. The ability to log in to your remote machine via ssh, passing the -X command-line option: ssh -X servername. This enables X11 tunneling and allows all processes started in this session to use your local display for graphics output.

Then you can start a program that uses the Monitor class and it will display the agent's actions, capturing the images into a video file.

<br>

# 6. Summary

---

Congratulations! You have started to learn the practical side of RL! In this chapter, we installed OpenAI Gym with tons of environments to play with, studied its basic API, and created a randomly behaving agent. You also learned how to extend the functionality of existing environments in a modular way and got familiar with a way to record your agent's activity using the Monitor wrapper.
In the next chapter, we will do a quick DL recap using PyTorch, which is a favorite library among DL researchers. Stay tuned.

***THE END***