
## Classic control environments

Gym provides environments for several classic control tasks, such as cart pole balancing, pendulum swing-up, and mountain car climbing. Let's see how to create a Gym environment for the cart pole balancing task. The cart pole environment is shown below:


<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/17.PNG?raw=1'/>

Cart pole balancing is one of the classic control problems. As the figure above shows, a pole is attached to a cart, and the goal of our agent is to keep the pole standing upright on the cart, as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/18.PNG?raw=1'/>


To do this, the agent can perform two actions: pushing the cart to the left and pushing the cart to the right. This video, https://youtu.be/qMlcsc43-lg, shows an RL agent balancing the pole by moving the cart left and right.

Now, let's learn how to create the cart pole environment using Gym. The environment id of the cart pole environment is `CartPole-v0`, so we can create it with `gym.make("CartPole-v0")` once the setup below is complete.




## Setup

**Render OpenAI Gym Environments in Colab**

It is possible to visualize the game your agent is playing, even on Colab. This section shows how to generate a video in Colab of an episode played by your agent. The video process is based on suggestions found [here](https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t).

Begin by installing **gym**, **pyvirtualdisplay**, **xvfb**, **python-opengl**, and **ffmpeg**.

In [3]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

Next, we install the packages needed to display an Atari game.

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1

Next, we define the functions used to show the video by embedding it in the Colab notebook.

In [4]:
import gym
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

display = Display(visible=0, size=(1400, 900))
display.start()

"""
Utility functions to enable video recording of a gym environment
and to display the recording.
To enable video, just do "env = wrap_env(env)".
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    # read-only binary mode is enough; the file is not modified
    with open(mp4, 'rb') as f:
      video = f.read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

The Gym library lets us query several attributes of an environment, such as its action space, observation space, and episode limits. The following function prints them for a given environment id.

In [5]:
def query_environment(name):
  env = gym.make(name)
  spec = gym.spec(name)
  print(f"Action Space: {env.action_space}")
  print(f"Observation Space: {env.observation_space}")
  print(f"Max Episode Steps: {spec.max_episode_steps}")
  print(f"Nondeterministic: {spec.nondeterministic}")
  print(f"Reward Range: {env.reward_range}")
  print(f"Reward Threshold: {spec.reward_threshold}")

In [6]:
query_environment("CartPole-v0")

Action Space: Discrete(2)
Observation Space: Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
Max Episode Steps: 200
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: 195.0


Now we are ready to play the game.  We use a simple random agent.

In [7]:
env = wrap_env(gym.make("CartPole-v0"))

observation = env.reset()

while True:
    env.render()

    # your agent goes here
    action = env.action_space.sample()

    observation, reward, done, info = env.step(action)

    # stop as soon as the episode ends
    if done:
        break
            
env.close()
show_video()

We can close the environment, along with its rendering window, using the `close` function:

In [8]:
env.close()

## State space

Now, let's look at the state space of our cart pole environment. What are the states here? In the frozen lake environment we had 16 discrete states (S to G). Here we can start by describing the state with the cart position, which is a continuous value. So our state space is continuous, unlike the frozen lake environment, where the state space was discrete.

But the cart position alone cannot describe the state of the environment completely, so we also include the cart velocity, the pole angle, and the pole velocity at the tip. We can therefore describe a state by an array of values as shown below:

`array([cart position, cart velocity, pole angle, pole velocity at the tip])`

Note that all of these values are continuous, that is:

* The value of cart position ranges from -4.8 to 4.8
* The value of cart velocity ranges from -Inf to Inf
* The value of pole angle ranges from -0.418 radians to 0.418 radians 
* The value of pole velocity at the tip ranges from -Inf to Inf
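
To make the layout of the state array concrete, here is a minimal sketch that gives each of the four components a name and pairs it with the range listed above. `CartPoleState`, `RANGES`, and the sample values are illustrative names and numbers introduced here, not part of Gym and not real environment output.

```python
import math
from typing import NamedTuple


class CartPoleState(NamedTuple):
    """Named view of the 4-element state array described above."""
    cart_position: float
    cart_velocity: float
    pole_angle: float          # radians
    pole_tip_velocity: float


# Ranges listed above, keyed by field name.
RANGES = {
    "cart_position": (-4.8, 4.8),
    "cart_velocity": (-math.inf, math.inf),
    "pole_angle": (-0.418, 0.418),
    "pole_tip_velocity": (-math.inf, math.inf),
}

# Illustrative values only, not real environment output.
raw_observation = [0.01, -0.02, 0.03, 0.04]
state = CartPoleState(*raw_observation)

# Print every component next to its allowed range.
for name, value in state._asdict().items():
    low, high = RANGES[name]
    print(f"{name:>18}: {value:+.2f}  (range {low} to {high})")
```

The order of the fields matters: position, velocity, angle, tip velocity, exactly as in the array above.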

Thus, our state space contains an array of continuous values. Let's learn how we can obtain this from Gym. To get the state space, we can just type `env.observation_space` as shown below:

In [9]:
print(env.observation_space)

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)


`Box` implies that our state space consists of continuous values rather than discrete ones. In the frozen lake environment, the state space was `Discrete(16)`, which means 16 discrete states (S to G). Here the state space is a `Box` with shape `(4,)`, which means it is continuous and each state is an array of 4 values.
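
The difference between the two kinds of space can be sketched with two small stand-in classes. `ToyDiscrete` and `ToyBox` are hypothetical names written for this illustration; they are not Gym's own `Discrete` and `Box` classes, and the bounds are the ones quoted in this section.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ToyDiscrete:
    """A finite set of states labelled 0 .. n-1."""
    n: int

    def contains(self, x: int) -> bool:
        return isinstance(x, int) and 0 <= x < self.n


@dataclass(frozen=True)
class ToyBox:
    """A continuous region: each coordinate lies between its low and high bound."""
    low: Tuple[float, ...]
    high: Tuple[float, ...]

    @property
    def shape(self) -> Tuple[int]:
        return (len(self.low),)

    def contains(self, x: Sequence[float]) -> bool:
        return len(x) == len(self.low) and all(
            lo <= value <= hi for lo, value, hi in zip(self.low, x, self.high)
        )


inf = float("inf")
frozen_lake_states = ToyDiscrete(16)
cart_pole_states = ToyBox(low=(-4.8, -inf, -0.418, -inf),
                          high=(4.8, inf, 0.418, inf))

print(frozen_lake_states.contains(15))                     # a valid discrete state
print(cart_pole_states.shape)                              # one entry per state component
print(cart_pole_states.contains([0.0, 0.1, 0.02, -0.3]))   # a point inside the box
```

A discrete space is fully described by a count, while a box needs a lower and an upper bound for every component.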

For example, let's reset our environment and see what the initial state looks like. We can reset the environment using the `reset` function:



In [10]:
print(env.reset())

[-0.01983735  0.04149103 -0.04353716  0.04083448]


This is our initial state. As we can see, it is an array of 4 values denoting the cart position, cart velocity, pole angle, and pole velocity at the tip, respectively. That is:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/19.PNG?raw=1'/>

Okay, how can we obtain the maximum and minimum values of our state space? We can obtain the maximum values using `env.observation_space.high` and the minimum values using `env.observation_space.low`.

For example, let's look at the maximum value of our state space:

In [11]:
print(env.observation_space.high)

[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


It implies that:

1. The maximum value of the cart position is 4.8.
2. The maximum value of the cart velocity is +Inf. Since infinity is not a finite number, it is represented by the largest finite float32 value, 3.4028235e+38.
3. The maximum value of the pole angle is 0.418 radians.
4. The maximum value of the pole velocity at the tip is +Inf, so it is also represented by the largest finite float32 value, 3.4028235e+38.
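
The two less obvious numbers in this output can be checked with the standard library alone. The sketch below decodes the largest finite single-precision value from its IEEE 754 bit pattern and converts the pole-angle bound to degrees; the variable names are illustrative.

```python
import math
import struct

# Largest finite IEEE 754 single-precision (float32) value: bit pattern 0x7F7FFFFF.
FLOAT32_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
print(FLOAT32_MAX)  # 3.4028234663852886e+38

# The pole-angle bound printed above, converted from radians to degrees.
pole_angle_bound_deg = math.degrees(0.41887903)
print(round(pole_angle_bound_deg, 2))  # 24.0
```

So the velocity bounds are simply the float32 stand-in for infinity, and the pole-angle bound of about 0.418 radians corresponds to roughly 24 degrees.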

Similarly, we can obtain the minimum value of our state space as:

In [12]:
print(env.observation_space.low)

[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]


## Action space

Now, let's look at the action space. We already learned that in the cart pole environment we perform two actions: pushing the cart to the left and pushing the cart to the right. The action space is therefore discrete.

In order to get the action space, we can just type `env.action_space` as shown below:

In [13]:
print(env.action_space)

Discrete(2)


As we can observe, `Discrete(2)` implies that our action space is discrete and contains two actions. Note that the actions are encoded as numbers, as shown below:

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-with-python/20.PNG?raw=1'/>
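
In code, this encoding can be written as a small enumeration. The sketch below assumes the usual CartPole convention that 0 pushes the cart to the left and 1 pushes it to the right; `CartPoleAction` and `sample_action` are hypothetical helpers, and `sample_action` only mimics what sampling from a two-action discrete space does.

```python
import random
from enum import IntEnum


class CartPoleAction(IntEnum):
    # Assumed encoding: 0 = push left, 1 = push right.
    PUSH_LEFT = 0
    PUSH_RIGHT = 1


def sample_action(rng: random.Random) -> CartPoleAction:
    """Pick one of the two actions uniformly at random."""
    return CartPoleAction(rng.randrange(len(CartPoleAction)))


rng = random.Random(0)  # seeded so the run is repeatable
actions = [sample_action(rng) for _ in range(5)]
print([action.name for action in actions])
```

Because `IntEnum` members are integers, a value such as `CartPoleAction.PUSH_RIGHT` can be used anywhere the plain action number 1 is expected.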

## Cart Pole Balancing with Random Policy

Let's create an agent with a random policy, that is, an agent that selects a random action at every step and tries to balance the pole. The agent receives a reward of +1 for every time step the pole stays upright on the cart. We will generate 100 episodes and look at the return (sum of rewards) obtained in each episode. Let's go through this step by step.

Set the number of episodes and number of time steps in the episode:

In [14]:
num_episodes = 100
num_timesteps = 50

In [15]:
# for each episode
for i in range(num_episodes):
  # set the Return to 0
  return_val = 0
  # initialize the state by resetting the environment
  state = env.reset()

  # for each step in the episode
  for t in range(num_timesteps):
    # render the environment
    env.render()

    # randomly select an action by sampling from the environment
    random_action = env.action_space.sample()

    # perform the randomly selected action
    next_state, reward, done, info = env.step(random_action)

    # update the return
    return_val = return_val + reward

    # if the next state is a terminal state then end the episode
    if done:
      break

  # for every 10 episodes, print the return (sum of rewards)
  if i % 10 == 0:
    print("Episode: {}, Return: {}".format(i, return_val))

Episode: 0, Return: 18.0
Episode: 10, Return: 21.0
Episode: 20, Return: 35.0
Episode: 30, Return: 10.0
Episode: 40, Return: 13.0
Episode: 50, Return: 37.0
Episode: 60, Return: 11.0
Episode: 70, Return: 29.0
Episode: 80, Return: 28.0
Episode: 90, Return: 21.0
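
The printed returns can be summarized with a little arithmetic. The sketch below uses only the ten sampled returns from the output above, the 50-step cap set by `num_timesteps`, and the reward threshold of 195.0 reported earlier for `CartPole-v0`; the other variable names are illustrative.

```python
from statistics import mean

# Returns printed above for every 10th episode (a sample, not all 100 episodes).
sampled_returns = [18.0, 21.0, 35.0, 10.0, 13.0, 37.0, 11.0, 29.0, 28.0, 21.0]

num_timesteps = 50        # per-episode step cap used in the loop above
reward_threshold = 195.0  # value printed by query_environment("CartPole-v0")

average_return = mean(sampled_returns)
best_return = max(sampled_returns)

print(f"Average sampled return: {average_return:.1f}")
print(f"Best sampled return: {best_return}")
# With +1 reward per step, no episode can return more than the step cap.
print(f"Upper bound on the return with this cap: {float(num_timesteps)}")
print(f"Cap allows reaching the threshold: {num_timesteps >= reward_threshold}")
```

The random policy keeps the pole up for roughly 22 steps on average in this sample, far below the threshold, and with a 50-step cap no policy could reach 195 in this loop.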


In [16]:
# Close the environment:
env.close()

In [17]:
# check the pole
show_video()

## Mountain Car environment

We don’t need to implement the Mountain Car environment ourselves; the OpenAI Gym library provides that implementation. Let’s see a random agent (an agent that takes random actions) in our environment:

In [23]:
env = wrap_env(gym.make("MountainCar-v0"))
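done = True            # forces a reset on the very first step
episodes_started = 0   # separate counter, so the loop variable is not modified
episode_return = 0.0

for episode in range(5):
    for step in range(200):
        if done:
            # report the episode that just finished (nothing to report before the first reset)
            if episodes_started > 0:
                print("Episode return: ", episode_return)
            obs = env.reset()
            episodes_started += 1
            episode_return = 0.0
            env.render()
        # take a random action and accumulate the reward
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        episode_return += reward
        env.render()

# note: the return of the last episode is not printed, because the loops end first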

env.close()
show_video()

Episode return:  -200.0
Episode return:  -200.0
Episode return:  -200.0
Episode return:  -200.0
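
Every return above is exactly -200.0. A plausible reading is that the random agent never reached the goal, assuming `MountainCar-v0` gives a reward of -1 on every step and cuts the episode off after 200 steps. Neither value is stated in this notebook, so treat both as assumptions. Under those assumptions the arithmetic is:

```python
# Assumed values, not taken from the notebook output:
reward_per_step = -1.0   # assumed per-step reward in MountainCar-v0
max_episode_steps = 200  # assumed step limit; matches the inner loop length above

# An agent that never reaches the goal collects the per-step reward on every step.
episode_return = sum(reward_per_step for _ in range(max_episode_steps))
print(episode_return)  # -200.0
```

Any return higher than -200.0 would mean the car reached the goal before the step limit.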


Now we need to replace random actions with something better. There are many algorithms one could use. For this, I think an approach called **deep Q-learning** is a good fit.