# System Design Integration and Control - Part II

### OpenAI Gym Environments
Gym is a standard API for RL environments and is commonly paired with simulation backends such as Box2D and MuJoCo. Install it with:

`pip install gym`

or, alternatively, from source:

`git clone https://github.com/openai/gym` <br>
`cd gym` <br>
`pip install -e .` <br>

In [1]:
import gym

### Environment Instance Creation

In [2]:
env = gym.make("FrozenLake-v1",  is_slippery=False)
print("Observation Space: " + str(env.observation_space))
print("Action Space: " + str(env.action_space))
observation = env.reset()
print("Observation after reset: " + str(observation))
print("The action space can be sampled, for example:", env.action_space.sample())

Observation Space: Discrete(16)
Action Space: Discrete(4)
Observation after reset: 0
The action space can be sampled, for example: 2


#### Environments can be rendered

In [3]:
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


### Understanding the Environment

Is the __state__ space _discrete_ or _continuous_? <br>
How many dimensions does it have? <br>
What is the state data type? <br>

Is the __action__ space _discrete_ or _continuous_? <br>
How many actions are there? <br>
What is the action data type? <br>

Is the environment __episodic__, that is, does it run in episodes that terminate? If so, when? <br>

What characterizes the reward: is it negative or positive, discrete or continuous, smooth or sparse? <br>
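
Most of these questions can be answered by inspecting the space objects directly. The helper below is a minimal sketch, not part of Gym: `describe_space` is a hypothetical name, and it relies only on the attributes that Gym's `Discrete` (`n`) and `Box` (`shape`) spaces expose, so any other space type would need its own branch.

```python
def describe_space(space):
    """Summarize a Gym-style space using duck typing (illustrative helper)."""
    if hasattr(space, "n"):
        # Discrete(n): a single integer in the range 0 .. n-1
        kind, size = "discrete", int(space.n)
    else:
        # Assumed to be Box-like: an array with a fixed shape
        kind, size = "continuous", tuple(space.shape)
    return {"kind": kind, "size": size, "dtype": str(getattr(space, "dtype", None))}
```

With the environment created above, `describe_space(env.observation_space)` and `describe_space(env.action_space)` would be the calls to try.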

### The agent takes a step in the environment and receives: <font color=blue> new obs, reward, done and info </font>

In [4]:
#observation = env.reset()
env.render()
observation, reward, done, info = env.step(1)
print("obs", observation, "rew", reward, "done", done)
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG
obs 4 rew 0.0 done False
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
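
The observation `4` after action `1` is consistent with numbering the 4x4 grid row by row, so that state = row * 4 + column. The sketch below is a hand-written stand-in for the non-slippery map rendered above, not Gym's implementation; `grid_step`, `MAP`, and `MOVES` are illustrative names, and the action encoding (0 = Left, 1 = Down, 2 = Right, 3 = Up) is the one FrozenLake uses.

```python
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]  # copied from the rendered output
NROW, NCOL = len(MAP), len(MAP[0])
# Action encoding: 0 = Left, 1 = Down, 2 = Right, 3 = Up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def grid_step(state, action):
    """Deterministic (non-slippery) transition on the 4x4 map."""
    row, col = divmod(state, NCOL)
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the agent in place
    row = min(max(row + d_row, 0), NROW - 1)
    col = min(max(col + d_col, 0), NCOL - 1)
    tile = MAP[row][col]
    reward = 1.0 if tile == "G" else 0.0  # reward only at the goal
    done = tile in "HG"                   # holes and the goal end the episode
    return row * NCOL + col, reward, done

print(grid_step(0, 1))  # (4, 0.0, False)
```

The reward is sparse in this model: every transition returns 0.0 except the one that reaches `G`.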


### Let's define the Agent-Environment Loop: One Episode

In [5]:
done = False
obs = env.reset()
while not done:
    action = env.action_space.sample()
    obs, rew, done, _ = env.step(action)
    print("obs",obs,"rew",rew, "done", done)
    


obs 0 rew 0.0 done False
obs 0 rew 0.0 done False
obs 0 rew 0.0 done False
obs 0 rew 0.0 done False
obs 0 rew 0.0 done False
obs 4 rew 0.0 done False
obs 8 rew 0.0 done False
obs 9 rew 0.0 done False
obs 10 rew 0.0 done False
obs 11 rew 0.0 done True
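
The loop above uses the older Gym calling convention, where `reset()` returns only the observation and `step()` returns four values. Gym 0.26 and later (and Gymnasium) return `(obs, info)` from `reset()` and five values from `step()`, splitting `done` into `terminated` and `truncated`. The helpers below are a hedged sketch for code that has to run under either convention; `reset_compat` and `step_compat` are hypothetical names, not part of any library.

```python
def reset_compat(env):
    """Return only the observation, whichever reset() convention env uses."""
    out = env.reset()
    # Newer API: (obs, info) with info being a dict; older API: obs alone.
    # Assumption: a plain observation is never itself a (value, dict) pair.
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out

def step_compat(env, action):
    """Return (obs, reward, done, info), whichever step() convention env uses."""
    out = env.step(action)
    if len(out) == 5:
        obs, rew, terminated, truncated, info = out
        # The episode is over if it terminated or was cut off by a time limit
        return obs, rew, bool(terminated or truncated), info
    return out
```

Replacing `env.reset()` with `reset_compat(env)` and `env.step(action)` with `step_compat(env, action)` leaves the rest of the loop unchanged.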


### The Agent-Environment Loop: Run One or Several Episodes

In [11]:
def run_episode(bRender=False, model=None):
    """Run one episode in the global `env` and return the total reward."""
    done = False
    obs = env.reset()
    sum_reward = 0
    while not done:
        if model is None:
            # No model given: act randomly
            action = env.action_space.sample()
        else:
            # Trained model: take its deterministic action
            action, _ = model.predict(obs, deterministic=True)
        obs, rew, done, _ = env.step(action)
        sum_reward += rew
        if bRender:
            env.render()
        #print("obs",obs,"rew",rew, "done", done)
    print("Episode ended sum_reward =",sum_reward)
    return sum_reward  # total episode reward, not only the last step's reward

for i in range(10):
    run_episode()

Episode ended sum_reward = 11.0
Episode ended sum_reward = 23.0
Episode ended sum_reward = 24.0
Episode ended sum_reward = 25.0
Episode ended sum_reward = 23.0
Episode ended sum_reward = 17.0
Episode ended sum_reward = 22.0
Episode ended sum_reward = 22.0
Episode ended sum_reward = 13.0
Episode ended sum_reward = 13.0
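
Single-episode totals from a random policy vary considerably, so it is usually more informative to summarize several episodes. The sketch below assumes a callable that runs one episode and returns its total reward; `summarize_returns` is an illustrative name, and the stand-in data are the ten totals printed above.

```python
from statistics import mean, pstdev

def summarize_returns(episode_fn, n_episodes=10):
    """Call episode_fn n_episodes times; return (mean, std, list of returns)."""
    returns = [float(episode_fn()) for _ in range(n_episodes)]
    return mean(returns), pstdev(returns), returns

# Stand-in for a real episode runner: replays the totals printed above
logged = iter([11.0, 23.0, 24.0, 25.0, 23.0, 17.0, 22.0, 22.0, 13.0, 13.0])
avg, std, _ = summarize_returns(lambda: next(logged), n_episodes=10)
print(f"mean return = {avg:.1f}, std = {std:.1f}")
```

Any function that plays one episode and returns the summed reward can be passed as `episode_fn`.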


In [13]:
env = gym.make("CartPole-v1")
run_episode(bRender=True)
env.close()

Episode ended sum_reward = 11.0


### Box2D (physics library) environments such as LunarLander-v2 require the box2d library in addition to gym
`pip install box2d`

In [14]:
env = gym.make("LunarLander-v2")
run_episode(bRender=True)
env.close()

Episode ended sum_reward = -472.4686204662702


## Learning Libraries for solving Gym Environments 

https://stable-baselines3.readthedocs.io/en/master/ <br>
https://github.com/AI4Finance-Foundation/ElegantRL <br>
https://docs.ray.io/en/latest/rllib/index.html <br>


In [19]:
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

run_episode(bRender=True, model=model)
env.close()

Episode ended sum_reward = 156.0


In [21]:
env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

run_episode(bRender=True, model=model)
env.close()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 88.2     |
|    ep_rew_mean     | -192     |
| time/              |          |
|    fps             | 1761     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 88.1        |
|    ep_rew_mean          | -170        |
| time/                   |             |
|    fps                  | 1336        |
|    iterations           | 2           |
|    time_elapsed         | 3           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008924368 |
|    clip_fraction        | 0.0343      |
|    clip_range           | 0.2         |
|    entropy_loss   
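
The `total_timesteps` column follows from PPO's rollout arithmetic: with Stable-Baselines3's default `n_steps=2048` and a single environment, every iteration adds 2048 timesteps, and `learn` keeps collecting whole rollouts until the requested `total_timesteps` is reached. The sketch below reproduces that arithmetic in plain Python; `ppo_schedule` is a hypothetical helper, not an SB3 function, and its defaults are assumptions that hold only if `n_steps` and the number of environments were left unchanged.

```python
import math

def ppo_schedule(total_timesteps, n_steps=2048, n_envs=1):
    """Return (iterations, timesteps actually collected) for a PPO run."""
    rollout_size = n_steps * n_envs  # timesteps gathered per iteration
    # Rollouts are collected whole, so the request is rounded up
    iterations = math.ceil(total_timesteps / rollout_size)
    return iterations, iterations * rollout_size

print(ppo_schedule(10000))  # (5, 10240)
```

Under these assumptions, a request for 10000 timesteps rounds up to 5 rollouts, so training stops after 10240 collected steps.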

In [24]:
env = gym.make("BipedalWalker-v3")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50000)

run_episode(bRender=True, model=model)
env.close()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 338      |
|    ep_rew_mean     | -116     |
| time/              |          |
|    fps             | 1333     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 216         |
|    ep_rew_mean          | -117        |
| time/                   |             |
|    fps                  | 1076        |
|    iterations           | 2           |
|    time_elapsed         | 3           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008955124 |
|    clip_fraction        | 0.08        |
|    clip_range           | 0.2         |
|    entropy_loss   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 487         |
|    ep_rew_mean          | -111        |
| time/                   |             |
|    fps                  | 913         |
|    iterations           | 11          |
|    time_elapsed         | 24          |
|    total_timesteps      | 22528       |
| train/                  |             |
|    approx_kl            | 0.009828264 |
|    clip_fraction        | 0.0928      |
|    clip_range           | 0.2         |
|    entropy_loss         | -5.61       |
|    explained_variance   | -0.148      |
|    learning_rate        | 0.0003      |
|    loss                 | 0.297       |
|    n_updates            | 100         |
|    policy_gradient_loss | -0.0089     |
|    std                  | 0.976       |
|    value_loss           | 0.418       |
-----------------------------------------
-----------------------------------------
| rollout/                |       

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 644         |
|    ep_rew_mean          | -106        |
| time/                   |             |
|    fps                  | 922         |
|    iterations           | 20          |
|    time_elapsed         | 44          |
|    total_timesteps      | 40960       |
| train/                  |             |
|    approx_kl            | 0.006945964 |
|    clip_fraction        | 0.0585      |
|    clip_range           | 0.2         |
|    entropy_loss         | -5.23       |
|    explained_variance   | 0.171       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.0745      |
|    n_updates            | 190         |
|    policy_gradient_loss | -0.00605    |
|    std                  | 0.889       |
|    value_loss           | 0.199       |
-----------------------------------------
-----------------------------------------
| rollout/                |       