## Types of Reinforcement Learning

* __Policy Gradients__

* __Deep Q Networks (DQN)__

* __Markov Decision Processes (MDP)__

### Terminology

The AI (player) is the __agent__ which makes __observations__ within an __environment__, takes __actions__, and receives __rewards__.
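The loop described above can be sketched with a toy environment. `LineWorld` and `always_right` are hypothetical names invented for illustration and are not part of any library. They only mimic the `reset()` / `step(action)` shape used by the gym cells later in this notebook.

```python
class LineWorld:
    """Hypothetical toy environment: walk from position 0 to a goal position."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right.
        self.position += 1 if action == 1 else -1
        done = self.position == self.goal
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


def always_right(obs):
    # A trivial policy: ignore the observation and always move right.
    return 1


toy_env = LineWorld()
obs = toy_env.reset()
total_reward = 0.0
steps = 0
done = False
while not done:
    action = always_right(obs)                      # the agent picks an action
    obs, reward, done, info = toy_env.step(action)  # the environment responds
    total_reward += reward
    steps += 1

print(steps, total_reward)
```

Here the agent is the loop plus `always_right`, the observation is the current position, and the reward is only paid when the goal is reached.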

__policy__: The algorithm an agent uses to determine its actions. This can be a neural network, for example.

    Stochastic Policy - A policy that involves some randomness, such as the one a robot vacuum uses
    
    Policy Search - search combinations of policy parameters to find the ones that maximize performance
    
        * Brute force the search, checking all combinations
        
        * Genetic policy algorithm - create a random set of parameters, keep the 20% that perform the best, and generate new sets from those...evolve the policy until it performs well
          
        * Evaluate the gradients of the rewards with regard to the policy parameters (called policy gradients)
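The genetic approach in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `evolve` and `toy_score` are hypothetical names, and the score is a simple made-up function standing in for "run some episodes and return the total reward".

```python
import random


def evolve(score, n_params=2, population_size=50, generations=30,
           keep_fraction=0.2, mutation_scale=0.1, seed=0):
    """Keep the best-scoring parameter sets and mutate them into a new population."""
    rng = random.Random(seed)
    # Start from a random set of parameter vectors.
    population = [[rng.uniform(-1, 1) for _ in range(n_params)]
                  for _ in range(population_size)]
    n_keep = max(1, int(population_size * keep_fraction))
    for _ in range(generations):
        # Rank by score, highest first, and keep the top 20%.
        survivors = sorted(population, key=score, reverse=True)[:n_keep]
        # Refill the population with mutated copies of the survivors.
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)
            children.append([p + rng.gauss(0, mutation_scale) for p in parent])
        population = survivors + children
    return max(population, key=score)


def toy_score(params):
    # Hypothetical objective (assumption): best possible score is 0 at [0.5, -0.25].
    return -((params[0] - 0.5) ** 2 + (params[1] + 0.25) ** 2)


best = evolve(toy_score)
print(best, toy_score(best))
```

Because the survivors are carried over unchanged, the best score never gets worse from one generation to the next. In a real policy search, `score` would be replaced by a function that evaluates a policy in the environment.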

## Policy Search

<img src="images/agent_env.PNG" width="800" height="400"/>

In [4]:
import gym
import numpy as np
env = gym.make('CartPole-v1')
obs = env.reset()
obs

array([-0.03814324, -0.03463322,  0.02370684, -0.00810226])

In [2]:
env.render()

True

In [5]:
action = 1
obs, reward, done, info = env.step(action)
print(obs, reward, done, info)

[-0.03883591  0.16014087  0.0235448  -0.29321213] 1.0 False {}


In [6]:
def basic_policy(obs):
    # obs[2] is the pole angle: push left if it leans left, right otherwise
    angle = obs[2]
    return 0 if angle < 0 else 1

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    # cap each episode at 200 steps
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

In [7]:
np.mean(totals), np.std(totals), np.min(totals), np.max(totals) 

(42.616, 9.584390643123848, 24.0, 72.0)

Even across 500 episodes, this policy never kept the pole upright for more than 72 steps. It is a hard-coded baseline rather than a trained model: the cart has to move more and more erratically to keep up with the pole, and it ultimately fails. The neural net below takes all 4 observation values as input and outputs a single probability. Since only two actions are possible (move left or move right), the output gives $$p_{left} \quad \text{and} \quad p_{right} = 1 - p_{left}$$

$p_{left}$ is the probability of choosing $action_0$ (move left) and $p_{right}$ is the probability of choosing $action_1$ (move right).

If it outputs 0.7, then we pick action 0 with a 70% probability, and action 1 with a 30% probability.
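The sampling step described above can be sketched with the standard library alone. `sample_action` is a hypothetical helper name, and the value 0.7 is simply the example from the text, not a real model output.

```python
import random


def sample_action(p_left, rng=random):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1


# Seeded generator so the run is repeatable.
rng = random.Random(42)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(left_fraction)
```

Over many draws the fraction of left moves should land close to 0.7. Sampling, rather than always taking the more likely action, means the agent still tries the less likely action some of the time.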

<img src="images/neural_net.PNG" width="800" height="400"/>

In [8]:
import tensorflow as tf
from tensorflow import keras

n_inputs = 4 # == env.observation_space.shape[0]

model = keras.models.Sequential([
    # small hidden layer over the 4 observation values
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    # single sigmoid output: the probability of action 0 (left)
    keras.layers.Dense(1, activation="sigmoid"),
])