## Types of Reinforcement Learning

* __Policy Gradients__

* __Deep Q Networks (DQN)__

* __Markov Decision Processes (MDP)__

### Terminology

The AI (player) is the __agent__ which makes __observations__ within an __environment__, takes __actions__, and receives __rewards__.
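The terms above map onto a simple loop. The following is a minimal sketch using a made-up environment: `CoinFlipEnv`, `run_episode`, and `always_one` are hypothetical names for illustration, not part of Gym. The only thing borrowed from Gym is the shape of the `step` return value (observation, reward, done, info) used later in this notebook.

```python
import random


class CoinFlipEnv:
    """Toy environment (hypothetical): guess the outcome of a biased coin."""

    def __init__(self, n_steps=10, seed=0):
        self.n_steps = n_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy first observation: nothing has been flipped yet

    def step(self, action):
        coin = 1 if self.rng.random() < 0.8 else 0  # lands on 1 about 80% of the time
        reward = 1.0 if action == coin else 0.0     # reward for a correct guess
        self.t += 1
        done = self.t >= self.n_steps               # episode ends after n_steps
        return coin, reward, done, {}               # observation, reward, done, info


def run_episode(env, policy):
    """Run one episode and return the total reward the agent collected."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)                        # the agent picks an action
        obs, reward, done, info = env.step(action)  # the environment responds
        total_reward += reward
    return total_reward


def always_one(obs):
    """A trivial policy: ignore the observation and always guess 1."""
    return 1


total = run_episode(CoinFlipEnv(n_steps=10), always_one)
print(total)
```

Swapping `always_one` for any other function of the observation changes the policy without touching the loop, which is the separation between agent and environment that the rest of the notebook relies on.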

__policy__: The algorithm an agent uses to determine its actions. This can be a neural network, for example.

* __Stochastic policy__ - a policy that involves some randomness, such as the one a robot vacuum uses

* __Policy search__ - search combinations of policy parameters and keep the ones that maximize performance

  * Brute force the search, checking all combinations

  * Genetic algorithm - create a random set of policies, keep the 20% that perform the best, generate new sets from those, and evolve the policy until it performs well

  * Evaluate the gradients of the rewards with regard to the policy parameters (called policy gradients)
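The genetic approach in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a tuned implementation: `evaluate` is a hypothetical stand-in for "run the policy in the environment and return its total reward", the two-parameter policy and its hidden `target` are invented for the example, and new candidates are produced by adding Gaussian noise to a surviving parent.

```python
import random


def evaluate(params):
    """Hypothetical score: higher is better, best at the hidden target."""
    target = (0.5, -1.0)
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def genetic_search(n_generations=30, population_size=50, keep_fraction=0.2,
                   noise=0.1, seed=0):
    rng = random.Random(seed)
    # Start from a random set of parameter pairs
    population = [(rng.uniform(-2, 2), rng.uniform(-2, 2))
                  for _ in range(population_size)]
    n_keep = max(1, int(population_size * keep_fraction))
    for _ in range(n_generations):
        # Keep the best-performing 20%
        population.sort(key=evaluate, reverse=True)
        survivors = population[:n_keep]
        # Refill the population with mutated copies of the survivors
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)
            children.append(tuple(p + rng.gauss(0, noise) for p in parent))
        population = survivors + children
    return max(population, key=evaluate)


best = genetic_search()
print(best, evaluate(best))
```

Because the survivors are carried over unchanged, the best score in the population can never get worse from one generation to the next.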

## Policy Search

<img src="images/agent_env.PNG" width="800" height="400"/>

In [4]:
!pip install pandas



In [5]:
import gym
import tensorflow as tf 
import numpy as np
import pandas as pd
from tensorflow import keras
# from tensorflow.python.keras.optimizers import Adam

# tf.enable_eager_execution()

In [6]:
env = gym.make('CartPole-v1')
obs = env.reset()
obs

array([-0.04313881,  0.03761702,  0.02104993, -0.00929697])

In [7]:
# env.render()

In [8]:
action = 1
obs, reward, done, info = env.step(action)
print(obs, reward, done, info)

[-0.04238647  0.23243086  0.02086399 -0.29526476] 1.0 False {}


In [21]:
def basic_policy(obs): 
    angle = obs[2] 
    return 0 if angle < 0 else 1 
 
totals = [] 
for episode in range(500): 
    episode_rewards = 0 
    obs = env.reset() 
    for step in range(200): 
        action = basic_policy(obs) 
        obs, reward, done, info = env.step(action) 
        episode_rewards += reward 
        if done: 
            break 
    totals.append(episode_rewards)

In [22]:
def render_policy_net(model, n_max_steps=200, seed=42):
    frames = []
    env = gym.make("CartPole-v1")
    env.seed(seed)
    np.random.seed(seed)
    obs = env.reset()
    for step in range(n_max_steps):
        frames.append(env.render(mode="rgb_array"))
        left_proba = model.predict(obs.reshape(1, -1))
        action = int(np.random.rand() > left_proba)
        obs, reward, done, info = env.step(action)
        if done:
            break
    env.close()
    return frames

In [24]:
n_environments = 50
n_iterations = 5000

envs = [gym.make("CartPole-v1") for _ in range(n_environments)]
for index, env in enumerate(envs):
    env.seed(index)
np.random.seed(42)
observations = [env.reset() for env in envs]
optimizer = keras.optimizers.RMSprop()
loss_fn = keras.losses.binary_crossentropy

for iteration in range(n_iterations):
    # if angle < 0, we want proba(left) = 1., or else proba(left) = 0.
    target_probas = np.array([([1.] if obs[2] < 0 else [0.])
                              for obs in observations])
    with tf.GradientTape() as tape:
        left_probas = model(np.array(observations))
        loss = tf.reduce_mean(loss_fn(target_probas, left_probas))
    print("\rIteration: {}, Loss: {:.3f}".format(iteration, loss.numpy()), end="")
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    actions = (np.random.rand(n_environments, 1) > left_probas.numpy()).astype(np.int32)
    for env_index, env in enumerate(envs):
        obs, reward, done, info = env.step(actions[env_index][0])
        observations[env_index] = obs if not done else env.reset()

for env in envs:
    env.close()

Iteration: 4999, Loss: 0.029

In [None]:

from datetime import datetime

now = datetime.now()

current_time = now.strftime("%H:%M:%S")
print("Current Time =", current_time)

In [36]:
# def exec_time(func):
from datetime import datetime

now = datetime.now()

future = datetime.now()
print("Execution Time =", (future-now))


Execution Time = 0:00:09.707189


In [10]:
np.mean(totals), np.std(totals), np.min(totals), np.max(totals) 

(42.0, 8.54072596445993, 24.0, 68.0)

The hard-coded policy never kept the pole upright for more than 68 steps (`np.max(totals)` above) and averaged only 42. It is a brute-force rule rather than a learned one: always pushing toward the side the pole leans forces the cart to move more and more erratically to keep up with the pole, and it ultimately fails. The next step is a neural network policy. It takes all 4 observation values as input and outputs a single probability. Since only two actions are possible (move left or move right), one output is enough: the network outputs $p_{left}$, and $p_{right} = 1 - p_{left}$.

$p_{left}$ is the probability of choosing action 0 (left) and $p_{right}$ is the probability of choosing action 1 (right).

If it outputs 0.7, then we pick action 0 with a 70% probability, and action 1 with a 30% probability.
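A quick way to see what "pick action 0 with a 70% probability" means in practice is to sample many actions and count them. The sketch below uses only the standard library; `sample_action` is a hypothetical helper, and the comparison is written the other way round from the notebook's `np.random.rand() > left_proba`, which yields the same distribution.

```python
import random


def sample_action(p_left, rng):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1


rng = random.Random(42)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(round(left_fraction, 2))  # should land close to 0.7
```

Sampling instead of always taking the most likely action lets the agent keep exploring the less likely action from time to time.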

<img src="images/neural_net.PNG" width="800" height="400"/>

In [11]:
n_inputs = 4 # == env.observation_space.shape[0] 
keras.backend.clear_session()
model = keras.models.Sequential([ 
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]), 
    keras.layers.Dense(1, activation="sigmoid"), 
])

### Policy Gradients

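> REINFORCE algorithm

> * Have the agent play the game several times. At each step, compute the gradients that would make the chosen action more likely, but do not apply them yet

> * After several episodes, compute each action's advantage

> * If an action's advantage is positive (the action was probably good), apply its gradients to make the action more likely; if it is negative, apply the opposite gradients to make it less likely. In practice, multiply each gradient vector by the corresponding action's advantage

> * Compute the mean of all resulting gradient vectors, and use it to perform a Gradient Descent step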
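The steps above can be sketched without TensorFlow on a deliberately tiny problem. Everything here is an assumption made for illustration: the policy has a single parameter `theta` with $p_{left} = \sigma(\theta)$, each episode is one step long, going left earns a reward of 1 and going right earns 0, and the gradient `p_left - y_target` is the hand-derived gradient of binary cross-entropy through a sigmoid (standing in for the gradient tape used in the cells below). The names `train`, `normalize`, and `sigmoid` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normalize(rewards):
    """Turn raw rewards into advantages (zero mean, unit standard deviation)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0.0:
        # Every episode scored the same, so no action stands out
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def train(n_iterations=200, n_episodes_per_update=10, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # p_left starts at 0.5
    for _ in range(n_iterations):
        rewards, grads = [], []
        # 1. Play several episodes; compute gradients but do not apply them
        for _ in range(n_episodes_per_update):
            p_left = sigmoid(theta)
            action = 0 if rng.random() < p_left else 1
            y_target = 1.0 - action            # pretend the chosen action was right
            grads.append(p_left - y_target)    # d(cross-entropy)/d(theta)
            rewards.append(1.0 if action == 0 else 0.0)
        # 2. Compute each action's advantage
        advantages = normalize(rewards)
        # 3. Weight each gradient by its advantage, 4. average, then descend
        mean_grad = sum(a * g for a, g in zip(advantages, grads)) / len(grads)
        theta -= learning_rate * mean_grad
    return sigmoid(theta)


final_p_left = train()
print(final_p_left)
```

In this toy setting every mixed batch pushes `theta` upward, so the probability of the rewarded action only grows; real environments are noisier, which is why the notebook averages over several episodes per update.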

In [12]:
def play_one_step(env, obs, model, loss_fn): 
    with tf.GradientTape() as tape: 
        left_proba = model(obs[np.newaxis]) 
        action = (tf.random.uniform([1, 1]) > left_proba) 
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32) 
        loss = tf.reduce_mean(loss_fn(y_target, left_proba)) 
    grads = tape.gradient(loss, model.trainable_variables) 
    obs, reward, done, info = env.step(int(action[0, 0].numpy())) 
    return obs, reward, done, grads

In [13]:
def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn): 
    all_rewards = [] 
    all_grads = [] 
    for episode in range(n_episodes): 
        current_rewards = [] 
        current_grads = [] 
        obs = env.reset() 
        for step in range(n_max_steps): 
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn) 
            current_rewards.append(reward) 
            current_grads.append(grads) 
            if done: 
                break 
        all_rewards.append(current_rewards) 
        all_grads.append(current_grads) 
    return all_rewards, all_grads

In [14]:
def discount_rewards(rewards, discount_factor): 
    discounted = np.array(rewards) 
    for step in range(len(rewards) - 2, -1, -1): 
        discounted[step] += discounted[step + 1] * discount_factor 
    return discounted 
 
def discount_and_normalize_rewards(all_rewards, discount_factor): 
    all_discounted_rewards = [discount_rewards(rewards, discount_factor) 
                              for rewards in all_rewards] 
    flat_rewards = np.concatenate(all_discounted_rewards) 
    reward_mean = flat_rewards.mean() 
    reward_std = flat_rewards.std() 
    return [(discounted_rewards - reward_mean) / reward_std 
            for discounted_rewards in all_discounted_rewards]

In [15]:
discount_rewards([10, 0, -50], discount_factor=0.8)

array([-22, -40, -50])
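
As a check on the output above, the same numbers can be worked out by hand. Writing $r_t$ for the reward at step $t$, $\gamma$ for the discount factor, and $G_t$ for the discounted sum of rewards from step $t$ onward (notation introduced here for the example, not used elsewhere in this notebook), `discount_rewards` applies $G_t = r_t + \gamma G_{t+1}$ starting from the last step and moving backward. With rewards $[10, 0, -50]$ and $\gamma = 0.8$:

$$
\begin{aligned}
G_2 &= -50 \\
G_1 &= 0 + 0.8 \times (-50) = -40 \\
G_0 &= 10 + 0.8 \times (-40) = -22
\end{aligned}
$$

Unrolling the recursion gives the same first value directly: $G_0 = 10 + 0.8 \times 0 + 0.8^2 \times (-50) = -22$.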

In [16]:
discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_factor=0.8) 

[array([-0.28435071, -0.86597718, -1.18910299]),
 array([1.26665318, 1.0727777 ])]

In [17]:
n_iterations = 150 
n_episodes_per_update = 10
n_max_steps = 200 
discount_factor = 0.95

In [18]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.binary_crossentropy

In [20]:
for iteration in range(n_iterations):
    all_rewards, all_grads = play_multiple_episodes(
        env, n_episodes_per_update, n_max_steps, model, loss_fn)
    all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                       discount_factor)
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
            for episode_index, final_rewards in enumerate(all_final_rewards)
                for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)
    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

KeyboardInterrupt: 

In [19]:
env = gym.make("CartPole-v1")
env.seed(42);

for iteration in range(n_iterations):
    all_rewards, all_grads = play_multiple_episodes(
        env, n_episodes_per_update, n_max_steps, model, loss_fn)
    # Report the mean total reward per episode for this iteration
    total_rewards = sum(map(sum, all_rewards))
    print("\rIteration: {}, mean rewards: {:.1f}".format(
        iteration, total_rewards / n_episodes_per_update), end="")
    # discount_factor is the name defined in the hyperparameter cell above
    all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                       discount_factor)
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        # Weight every step's gradient by its advantage, then average
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
                 for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)
    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

env.close()

Using RASCOM - Anaconda prompt with the (base) environment,

* Navigate to `D:\gal\ml-agents`

* Launch Unity Hub

  * Launch project `Project` using Unity `2018.4.17f1`
 
* Navigate to `ML-Agents/Examples/<pick an example>/Scenes/<name of scene>`

In [None]:
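# create unity env
# Assumption: UnityEnv is the gym wrapper class in the installed gym_unity
# release (the one accepting worker_id / use_visual / no_graphics)
from gym_unity.envs import UnityEnv
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Raw string so the backslashes in the Windows path are not read as escapes
env_id = r"Project\Assets\ML-Agents\Builds\UnityEnvironment.exe"
env = UnityEnv(env_id, worker_id=2, use_visual=False, no_graphics=False)

# run stable baselines
env = DummyVecEnv([lambda: env])  # The algorithms require a vectorized environment to run
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)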