In [1]:
# pip install gym

Note: you may need to restart the kernel to use updated packages.


In [1]:
import gym

In [2]:
e = gym.make('CartPole-v0')

Here, we have imported the gym package and created an environment called CartPole.

An observation in this environment consists of four floating-point numbers: the x coordinate of the stick's center of mass, its speed, its angle to the platform, and its angular speed.

The reward in this environment is 1, and it is given on every time step. The episode continues until the stick falls, so to accumulate more reward, we need to move the platform in a way that keeps the stick from falling.

In [3]:
obs = e.reset()
obs

array([-0.03456031, -0.01822479, -0.04376858, -0.01860489])

Here, we reset the environment and obtained the first observation.

The action_space field is of the Discrete type, so our actions will be just 0 or 1, where 0 means pushing the platform to the left and 1 means pushing it to the right. The observation space is Box(4,), which means a vector of four float32 values. The printed bounds of roughly ±3.4e38 are the largest finite float32 values, so the space is effectively the [−inf, inf] interval.

In [4]:
e.action_space

Discrete(2)

In [5]:
e.observation_space

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
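
The bound 3.4028234663852886e+38 printed for the Box is the largest finite value that a 32-bit float can hold. A small standard-library check, offered as a sketch (the names FLOAT32_MAX and roundtrip are ours and are not part of Gym):

```python
import struct

# Largest finite IEEE 754 single-precision value: (2 - 2^-23) * 2^127.
FLOAT32_MAX = (2 - 2**-23) * 2.0**127

# Packing the value into 4 bytes and unpacking it again loses nothing,
# which confirms that it is exactly representable as a float32.
roundtrip = struct.unpack("<f", struct.pack("<f", FLOAT32_MAX))[0]

print(FLOAT32_MAX)               # 3.4028234663852886e+38
print(roundtrip == FLOAT32_MAX)  # True
```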

In [6]:
e.step(0)

(array([-0.0349248 , -0.21269262, -0.04414068,  0.25995362]), 1.0, False, {})

Here, we pushed our platform to the left by executing the action 0 and got a tuple of four elements:
*    A new observation, which is a new vector of four numbers  
*    A reward of 1.0  
*    The done flag with value False, which means that the episode is not over yet and we are more or less okay  
*    Extra information about the environment, which is an empty dictionary  
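
The four elements are usually unpacked into separate names right away. The sketch below reuses the values printed above as a plain Python tuple, so it runs without Gym; in the real environment the observation is a NumPy array rather than a list, and the names x, x_speed, angle, and angular_speed are only illustrative labels.

```python
# Values copied from the e.step(0) output shown above; a plain list stands
# in for the NumPy array that the environment actually returns.
step_result = ([-0.0349248, -0.21269262, -0.04414068, 0.25995362], 1.0, False, {})

# Unpack the tuple into observation, reward, done flag, and extra info.
obs, reward, done, info = step_result

# Illustrative names for the four observation components described earlier.
x, x_speed, angle, angular_speed = obs

if not done:
    print("Episode continues, reward for this step: %.1f" % reward)
```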

In [7]:
e.action_space.sample()

1

In [8]:
e.action_space.sample()

0

In [9]:
e.observation_space.sample()

array([ 4.17177486e+00, -4.02094942e+37,  1.23289436e-01,  3.11900481e+38],
      dtype=float32)

In [10]:
e.observation_space.sample()

array([-2.8129265e-01, -1.8738752e+38, -2.9526111e-01,  7.3845509e+36],
      dtype=float32)

The sample() method returns a random sample from the underlying space. For our Discrete action space, this is either 0 or 1; for the observation space, it is a random vector of four numbers. A random sample from the observation space is not very useful, but a sample from the action space can be used when we are not sure which action to take.
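
Conceptually, sampling from a discrete space of size n amounts to drawing a uniform random integer from 0 to n - 1. The class below is a hypothetical stand-in written for illustration only; it is not Gym's implementation, and the name DiscreteSketch is an assumption of this sketch.

```python
import random


class DiscreteSketch:
    """Hypothetical stand-in for a discrete space with n actions (not Gym code)."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Uniformly pick one of the integers 0, 1, ..., n - 1.
        return random.randrange(self.n)


space = DiscreteSketch(2)
samples = [space.sample() for _ in range(1000)]
print(sorted(set(samples)))
```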

In [16]:
import gym


if __name__ == "__main__":
    env = gym.make("CartPole-v0")

    total_reward = 0.0
    total_steps = 0
    obs = env.reset()

    # Above, we created the environment and initialized the counter of steps
    # and the reward accumulator. On the last line, we reset the environment
    # to obtain the first observation.

    while True:
        # In this loop, we sample a random action, then ask the environment to
        # execute it and return the next observation (obs), the reward, and
        # the done flag. If the episode is over, we stop the loop and show how
        # many steps we have taken and how much reward has been accumulated.
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if done:
            break

    print("Episode done in %d steps, total reward %.2f" % (
        total_steps, total_reward))

# On average, our random agent takes 12 to 15 steps before the pole falls and the
# episode ends. Most of the environments in Gym have a "reward boundary," which is
# the average reward that the agent should gain during 100 consecutive episodes to
# "solve" the environment. For CartPole, this boundary is 195, which means that, on
# average, the agent must hold the stick for 195 time steps or longer.

Episode done in 20 steps, total reward 20.00
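
The reward boundary described above can be checked with a few lines of plain Python. This is a minimal sketch that assumes "100 consecutive episodes" means the most recent 100 episodes; the function name is_solved and its parameters are our own and do not come from Gym.

```python
def is_solved(episode_rewards, boundary=195.0, window=100):
    """Return True if the mean reward of the last `window` episodes reaches `boundary`."""
    if len(episode_rewards) < window:
        # Not enough episodes yet to judge.
        return False
    recent = episode_rewards[-window:]
    return sum(recent) / window >= boundary


# An agent that always lasts 20 steps is far from the boundary,
# while one that always lasts 200 steps clears it.
print(is_solved([20.0] * 100))   # False
print(is_solved([200.0] * 100))  # True
```

Appending total_reward to a list after each episode of the loop above and passing that list to is_solved would show how far the random agent is from the boundary.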