## Types of Reinforcement Learning

* __Policy Gradients__

* __Deep Q Networks (DQN)__

* __Markov Decision Processes (MDP)__

### Terminology

The AI (player) is the __agent__ which makes __observations__ within an __environment__, takes __actions__, and receives __rewards__.

__policy__: The algorithm an agent uses to determine its actions. This can be a neural network, for example.

    Stochastic Policy - A policy that involves randomness, such as the one a robot vacuum uses
    
    Policy Search - search combinations of parameters, find the ones that maximize performance
    
        * Brute force the search, checking all combinations
        
        * Genetic policy algorithm - create a random set of parameters, keep the 20% that perform
        
          the best; from those, generate new sets...evolve the policy until it performs well
          
        * Evaluate the gradients of the rewards with regard to the policy parameters
          
          (called policy gradients)
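
As a rough illustration of the genetic approach listed above, the following sketch evolves two parameters against a toy scoring function. The `evaluate` function, the two-parameter policy, and every hyperparameter except the 20% survival rate are assumptions made for the example; a real run would score each parameter set by playing episodes in the environment.

```python
import random

def evaluate(params):
    # Hypothetical stand-in for "play episodes and total the rewards":
    # the score peaks at 0 when params == (0.5, -0.25).
    a, b = params
    return -((a - 0.5) ** 2 + (b + 0.25) ** 2)

def genetic_search(n_generations=30, population_size=50, keep_fraction=0.2,
                   mutation_scale=0.1, seed=0):
    rng = random.Random(seed)
    # Start from random parameter sets.
    population = [(rng.uniform(-1, 1), rng.uniform(-1, 1))
                  for _ in range(population_size)]
    n_keep = max(1, int(population_size * keep_fraction))
    for _ in range(n_generations):
        # Keep the best-scoring 20% of the population.
        survivors = sorted(population, key=evaluate, reverse=True)[:n_keep]
        # Refill the population with randomly mutated copies of the survivors.
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)
            children.append(tuple(p + rng.gauss(0, mutation_scale) for p in parent))
        population = survivors + children
    return max(population, key=evaluate)

best = genetic_search()
print(best, evaluate(best))
```

Because the survivors are carried over unchanged, the best score in the population never gets worse from one generation to the next.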

## Policy Search

<img src="images/agent_env.PNG" width="800" height="400"/>

In [1]:
import gym
import tensorflow as tf 
import numpy as np
import pandas as pd
from tensorflow import keras
# from tensorflow.python.keras.optimizers import Adam

# tf.enable_eager_execution()

In [2]:
env = gym.make('CartPole-v1')
obs = env.reset()
obs

array([-0.02444914,  0.04753504, -0.02570328, -0.04110926])

In [3]:
# env.render()

In [4]:
action = 1
obs, reward, done, info = env.step(action)
print(obs, reward, done, info)

[-0.02349844  0.24301596 -0.02652546 -0.34178972] 1.0 False {}


In [5]:
def basic_policy(obs): 
    angle = obs[2] 
    return 0 if angle < 0 else 1 
 
totals = [] 
for episode in range(500): 
    episode_rewards = 0 
    obs = env.reset() 
    for step in range(200): 
        action = basic_policy(obs) 
        obs, reward, done, info = env.step(action) 
        episode_rewards += reward 
        if done: 
            break 
    totals.append(episode_rewards)

In [6]:
np.mean(totals), np.std(totals), np.min(totals), np.max(totals) 

(42.406, 8.392446842250477, 24.0, 71.0)

This hard-coded policy never kept the pole up for more than 71 steps (the value of `np.max(totals)` above) and averaged about 42. It is a hand-written rule rather than a learned policy: the cart has to move more and more erratically to keep up with the pole, and it ultimately fails.

A neural network policy instead takes all 4 observation values as input and outputs a probability. Since only two actions are possible (move left or move right), a single output is enough: $$p_{left} \quad \text{and} \quad p_{right} = 1 - p_{left}$$

$p_{left}$ is the probability of choosing action 0 (move left) and $p_{right}$ is the probability of choosing action 1 (move right).

If it outputs 0.7, then we pick action 0 with a 70% probability, and action 1 with a 30% probability.
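
To make the sampling rule concrete, here is a minimal sketch using only the standard library. The helper name `sample_action` and the fixed seed are assumptions for illustration; the notebook itself draws the random number with TensorFlow.

```python
import random

def sample_action(p_left, rng):
    # Action 0 (left) with probability p_left, otherwise action 1 (right).
    return 0 if rng.random() < p_left else 1

rng = random.Random(42)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(f"fraction of left moves: {left_fraction:.3f}")
```

Over many draws the fraction of left moves should settle close to 0.7. Sampling, rather than always taking the most probable action, lets the agent keep exploring.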

<img src="images/neural_net.PNG" width="800" height="400"/>

In [18]:
n_inputs = 4 # == env.observation_space.shape[0] 
keras.backend.clear_session()
model = keras.models.Sequential([ 
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]), 
    keras.layers.Dense(1, activation="sigmoid"), 
])

### Policy Gradients

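> REINFORCE algorithm

> * Have the AI play the game several times; at each step, compute the gradients that would make the chosen action more likely, but do not apply them yet

> * Compute each action's advantage

> * If an action's advantage is positive (the action was probably good), apply its gradients to make the action more likely; if it is negative, apply the opposite gradients to make it less likely. In practice, multiply each gradient vector by the corresponding action's advantage

> * Compute the mean of all resulting gradient vectors, use it to perform a Gradient Descent step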
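
The last two steps can be sketched on plain Python lists. The helper name `mean_weighted_gradient`, the two-parameter policy, and the numbers below are all made up for illustration; the notebook does the same thing per trainable variable with `tf.reduce_mean`.

```python
def mean_weighted_gradient(step_grads, advantages):
    """Scale each step's gradient by its advantage, then average over steps."""
    n_steps = len(step_grads)
    n_params = len(step_grads[0])
    return [
        sum(adv * grad[i] for grad, adv in zip(step_grads, advantages)) / n_steps
        for i in range(n_params)
    ]

# Three recorded steps for a hypothetical two-parameter policy.
step_grads = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
advantages = [1.5, -0.5, 0.0]  # good action, bad action, neutral action

update = mean_weighted_gradient(step_grads, advantages)
print(update)
```

The step with a negative advantage pushes the second parameter in the opposite direction, and the step with zero advantage contributes nothing to the update.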

In [8]:
def play_one_step(env, obs, model, loss_fn): 
    with tf.GradientTape() as tape: 
        left_proba = model(obs[np.newaxis]) 
        action = (tf.random.uniform([1, 1]) > left_proba) 
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32) 
        loss = tf.reduce_mean(loss_fn(y_target, left_proba)) 
    grads = tape.gradient(loss, model.trainable_variables) 
    obs, reward, done, info = env.step(int(action[0, 0].numpy())) 
    return obs, reward, done, grads

In [9]:
def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn): 
    all_rewards = [] 
    all_grads = [] 
    for episode in range(n_episodes): 
        current_rewards = [] 
        current_grads = [] 
        obs = env.reset() 
        for step in range(n_max_steps): 
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn) 
            current_rewards.append(reward) 
            current_grads.append(grads) 
            if done: 
                break 
        all_rewards.append(current_rewards) 
        all_grads.append(current_grads) 
    return all_rewards, all_grads

In [10]:
def discount_rewards(rewards, discount_factor): 
    discounted = np.array(rewards) 
    for step in range(len(rewards) - 2, -1, -1): 
        discounted[step] += discounted[step + 1] * discount_factor 
    return discounted 
 
def discount_and_normalize_rewards(all_rewards, discount_factor): 
    all_discounted_rewards = [discount_rewards(rewards, discount_factor) 
                              for rewards in all_rewards] 
    flat_rewards = np.concatenate(all_discounted_rewards) 
    reward_mean = flat_rewards.mean() 
    reward_std = flat_rewards.std() 
    return [(discounted_rewards - reward_mean) / reward_std 
            for discounted_rewards in all_discounted_rewards]

In [11]:
discount_rewards([10, 0, -50], discount_factor=0.8)

array([-22, -40, -50])

In [12]:
discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_factor=0.8) 

[array([-0.28435071, -0.86597718, -1.18910299]),
 array([1.26665318, 1.0727777 ])]

In [13]:
n_iterations = 150 
n_episodes_per_update = 10
n_max_steps = 200 
discount_factor = 0.95

In [16]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)  # `lr` is a deprecated alias
loss_fn = keras.losses.binary_crossentropy

In [19]:
env = gym.make("CartPole-v1")
env.seed(42);

for iteration in range(n_iterations):
    all_rewards, all_grads = play_multiple_episodes(
        env, n_episodes_per_update, n_max_steps, model, loss_fn)
    total_rewards = sum(map(sum, all_rewards))                     # Not shown in the book
    print("\rIteration: {}, mean rewards: {:.1f}".format(          # Not shown
        iteration, total_rewards / n_episodes_per_update), end="") # Not shown
    all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                       discount_factor)
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
                 for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)
    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

env.close()

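AttributeError: 'Tensor' object has no attribute 'numpy'

This error comes from the `action[0, 0].numpy()` call in `play_one_step`. The `.numpy()` method is only available on eager tensors, so the call fails when TensorFlow is not executing eagerly (for example, TensorFlow 1.x with the `tf.enable_eager_execution()` line in the first cell left commented out). Under TensorFlow 2.x, where eager execution is the default, the call works.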