# Reinforcement Learning "Hello World"

## Copyright

*Copyright Geoscience DS&ML Special Interest Group, 2022.*
*License: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)*
*Author(s): [Altay Sansal](https://github.com/tasansal)*

## Prerequisites

This notebook assumes familiarity with Reinforcement Learning concepts (agents, spaces, actions, ...) and knowledge of [OpenAI's Gym](https://www.gymlibrary.ml/) APIs and environments. If you are new to RL or Gym, the following resources are a good starting point:
1. [OpenAI Gym Introduction](https://www.gymlibrary.ml/content/api/)
2. [OpenAI Spinning Up in Deep RL](https://spinningup.openai.com/en/latest/)
    * [Part 1: Key Concepts in RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)
    * [Part 2: Kinds of RL Algorithms](https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html)
    * [Part 3: Intro to Policy Optimization](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)
    * [Key Papers](https://spinningup.openai.com/en/latest/spinningup/keypapers.html)
    * [RL Algorithms](https://spinningup.openai.com/en/latest/user/algorithms.html)
3. [David Silver's RL Course](https://www.davidsilver.uk/teaching/)
4. [Berkeley's Deep RL Bootcamp](https://sites.google.com/view/deep-rl-bootcamp/lectures)
5. [PyTorch DQN from Scratch](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
6. [OpenAI Gym Tutorials](https://www.gymlibrary.ml/content/tutorials/)
7. [More Resources](https://github.com/dennybritz/reinforcement-learning#resources)
8. [If you like MATLAB...](https://www.mathworks.com/products/reinforcement-learning.html)

The first two resources are the bare minimum; the rest go into more detail for the curious.

## Introduction

Here we use the `CartPole-v1` environment from `Gym`, which is part of the classic control environments.

The task is to keep a pole upright. The pole is attached by an un-actuated joint to a cart that moves along a frictionless track, and the goal is to balance it by pushing the cart to the left or to the right.

With an untrained agent, the environment looks like this:
<img src="https://www.gymlibrary.ml/_images/cart_pole.gif" width="350"/>

More information about the environment can be found [here](https://www.gymlibrary.ml/environments/classic_control/cart_pole/) and the source code for the environment can be found [here](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py).

The environment has the following action and observation spaces:

| Property          | Details             |
|-------------------|---------------------|
| Action Space      | `Discrete(2)`       |
| Observation Space | `Box(4, 'float32')` |

Action Space:

| Num | Action                 |
|:---:|------------------------|
|  0  | Push cart to the left  |
|  1  | Push cart to the right |

Observation Space:

| Num | Observation           |     Min      |     Max     |
|:---:|-----------------------|:------------:|:-----------:|
|  0  | Cart Position         |    `-4.8`    |    `4.8`    |
|  1  | Cart Velocity         |    `-inf`    |    `inf`    |
|  2  | Pole Angle            | `-0.418 rad` | `0.418 rad` |
|  3  | Pole Angular Velocity |    `-inf`    |    `inf`    |

Let's take a more detailed look!

We first import `gym` and make the environment. Then we look at its pre-configured action and observation spaces using built-in attributes.

In [5]:
import gym

env_id = "CartPole-v1"
env = gym.make(env_id)

In [6]:
env.action_space

Discrete(2)

In [7]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
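The printed bounds can be related back to the table above. As a rough sketch (the constant names `X_THRESHOLD` and `THETA_THRESHOLD_DEG` are chosen here for illustration, with values assumed from the CartPole documentation linked earlier), the position and angle limits are twice the thresholds at which an episode terminates, and the velocity limits are the largest finite `float32` value standing in for infinity:

```python
import math

# Assumed termination thresholds, as documented for CartPole
X_THRESHOLD = 2.4          # cart position limit
THETA_THRESHOLD_DEG = 12   # pole angle limit, in degrees

# The observation space is twice as wide as the termination thresholds
position_bound = 2 * X_THRESHOLD
angle_bound = 2 * math.radians(THETA_THRESHOLD_DEG)

# Largest finite float32 value, used in place of infinity
float32_max = (2 - 2**-23) * 2.0**127

print(position_bound)           # 4.8
print(round(angle_bound, 7))    # 0.418879
print(f"{float32_max:.7e}")     # 3.4028235e+38
```

In other words, an observation can still be inside the `Box` while the episode has already ended, because termination happens at half of the position and angle bounds.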

Let's see what the initial observation looks like if we reset the environment five times.

In [21]:
for idx in range(5):
    print(f"reset {idx + 1}", env.reset())

reset 1 [ 0.01589157 -0.03395329  0.04970384  0.01610303]
reset 2 [-0.0248685   0.0477733   0.03452688 -0.03332323]
reset 3 [ 0.02041535 -0.0374088  -0.03411317  0.00380573]
reset 4 [ 0.02248388 -0.00815395  0.01375304 -0.04466613]
reset 5 [ 0.04557251 -0.01278952 -0.04608022  0.00124771]


As expected, the cart and pole parameters are different every time we reset the environment.

Reminder: Values are `[cart_position, cart_velocity, pole_angle(rad), pole_angular_velocity]`.
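To make the ordering easier to remember, one option is to unpack the array into named fields. The `CartPoleObs` type below is a hypothetical helper, not part of Gym; the numbers are copied from "reset 1" above.

```python
from collections import namedtuple

# Hypothetical helper type; field names follow the order given in the reminder
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# Values copied from "reset 1" in the output above
obs = CartPoleObs(*[0.01589157, -0.03395329, 0.04970384, 0.01610303])

print(obs.pole_angle)                     # 0.04970384
print(all(abs(v) < 0.05 for v in obs))    # True
```

The second check reflects that every reset shown above starts close to the upright, centered state, with all four values within ±0.05.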

Now let's reset the environment and take 10 steps while recording them, so that we can visualize the result. Since there is no trained model yet, the actions are sampled randomly from the action space.

In [22]:
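from gym import wrappers

# Wrap a fresh CartPole environment so the rollout is recorded to disk.
# The Monitor wrapper needs ffmpeg (or avconv) to write the video.
env_w = wrappers.Monitor(gym.make(env_id), "./gym-results", force=True)
env_w.reset()
for idx in range(10):
    # No trained policy yet, so sample a random action
    action = env_w.action_space.sample()
    observation, reward, done, info = env_w.step(action)

    if done:
        break

env_w.close()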

DependencyNotInstalled: Found neither the ffmpeg nor avconv executables. On OS X, you can install ffmpeg via `brew install ffmpeg`. On most Ubuntu variants, `sudo apt-get install ffmpeg` should do it. On Ubuntu 14.04, however, you'll need to install avconv with `sudo apt-get install libav-tools`. Alternatively, please install imageio-ffmpeg with `pip install imageio-ffmpeg`

Here, `step` returns four values: **observation** is the current state, **reward** is the reward for the action we have just taken (not cumulative), **done** indicates whether the episode has ended (because the task finished, failed, etc.), and **info** is a diagnostic dictionary (populated only if the `step` implementation adds something to it). In the CartPole environment, `info` is empty.
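Because the reward is per step, the total for an episode has to be accumulated by the caller. The sketch below illustrates this with `ToyEnv`, a hypothetical stand-in that only mimics the `reset`/`step` signature used in this notebook and gives a reward of 1 per step, as CartPole does; it is not the real environment.

```python
import random


class ToyEnv:
    """Hypothetical stand-in with the same reset/step signature as the Gym env."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        observation = [0.0, 0.0, 0.0, 0.0]
        reward = 1.0                       # per-step reward
        done = self.t >= self.max_steps    # episode ends after max_steps
        info = {}                          # no diagnostics
        return observation, reward, done, info


def run_episode(env, policy):
    """Roll out one episode and return the summed reward."""
    observation = env.reset()
    episode_return, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        episode_return += reward  # accumulate the per-step rewards ourselves
    return episode_return


def random_policy(observation):
    """Ignore the observation and pick left (0) or right (1) at random."""
    return random.choice([0, 1])


print(run_episode(ToyEnv(max_steps=10), random_policy))  # 10.0
```

The same `run_episode` loop should work with any object exposing this four-value `step` interface, which is the interface the cells in this notebook rely on.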

In [6]:
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Instantiate the agent
model = PPO("MlpPolicy", env, verbose=1)
# Train the agent
model.learn(total_timesteps=100_000, log_interval=10)
# Save the agent
model.save("dqn_lunar")
del model  # delete trained model to demonstrate loading

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 146         |
|    ep_rew_mean          | 146         |
| time/                   |             |
|    fps                  | 1599        |
|    iterations           | 10          |
|    time_elapsed         | 12          |
|    total_timesteps      | 20480       |
| train/                  |             |
|    approx_kl            | 0.006854622 |
|    clip_fraction        | 0.074       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.583      |
|    explained_variance   | 0.933       |
|    learning_rate        | 0.0003      |
|    loss                 | 2.43        |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.0129     |
|    value_loss           | 16.2        |
-----------------------------------------

In [7]:
model = PPO.load("dqn_lunar", env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [8]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
mean_reward

500.0
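A mean of 500.0 is the best possible score here. As a small sketch, assuming the `CartPole-v1` time limit of 500 steps and a reward of 1 per step (the constant names below are illustrative), a mean of 500.0 over 10 episodes is only possible if every episode ran to the cap, so the standard deviation should be zero as well:

```python
from statistics import fmean, pstdev

MAX_EPISODE_STEPS = 500   # assumed CartPole-v1 time limit
REWARD_PER_STEP = 1.0     # assumed reward for every step the pole stays up

# Per-episode returns implied by a mean of 500.0 over 10 episodes
episode_returns = [MAX_EPISODE_STEPS * REWARD_PER_STEP] * 10

mean_return = fmean(episode_returns)
std_return = pstdev(episode_returns)  # population standard deviation
print(mean_return, std_return)  # 500.0 0.0
```

Printing `std_reward` in the cell above is a quick way to check this against the actual evaluation.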

In [13]:
# Enjoy trained agent
# import imageio

obs = env.reset()

# Collect rendered frames while the trained agent acts
img = env.render(mode="rgb_array")
images = []
for i in range(1000):
    images.append(img)
    action, _states = model.predict(obs)
    obs, reward, done, info = env.step(action)
    img = env.render(mode="rgb_array")
    if done:
        # Start a new episode instead of stepping a finished environment
        obs = env.reset()

# Optionally save every 10th frame as a GIF
# imageio.mimsave(
#     "test.gif",
#     [img for i, img in enumerate(images) if i % 10 == 0],
#     fps=29,
# )

KeyboardInterrupt: 