# Cart Pole using Vanilla Policy Gradients

We use an **on-policy** vanilla policy gradient with the following implemented:
- **Causality**: each action is weighted only by the rewards that come after it (rewards-to-go)
- Mean-reward **baselines**
- **Discounting**, since the sum of rewards in an infinite-horizon problem can grow without bound and blow up the gradient
- Gradient **averaging** across all mini-batches of the policy's trajectories, to estimate the expectation of the gradient
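
Combining the items in the list above, the gradient estimate that this notebook approximates can be written roughly as follows. The notation is introduced here for illustration and is not from the original notebook: $N$ is the number of sampled episodes, $T_i$ the length of episode $i$, $\gamma$ the discount factor (`GAMMA`), and $b$ the baseline, assumed to be the mean of all discounted rewards-to-go in the sample (which is how the training cell computes it).

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T_i-1} \gamma^{\,t'-t}\, r_{i,t'} - b \right)
$$

The inner sum starts at $t' = t$ rather than $0$, which is the causality item; the factor $\gamma^{\,t'-t}$ is the discounting; and the $1/N$ in front is the averaging over sampled trajectories.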

The following are NOT implemented:
- **Parallelism** in sampling
- Frequent model **saving**
- **Decaying** num_of_episodes, since episode length increases as the policy improves
- Hyper-parameter **tuning**; the ideas in this [link](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f) should be used for it.

The following tutorial on gradient averaging in the context of TensorFlow's **keras** helped me complete this notebook:
https://medium.com/analytics-vidhya/tf-gradienttape-explained-for-keras-users-cc3f06276f22
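
The idea behind gradient averaging does not depend on TensorFlow, so here is a minimal NumPy sketch of it. It assumes a toy one-parameter least-squares loss with an analytic gradient (the function `summed_loss_grad` and all the data are made up for illustration); the point is only that summing per-batch gradients and dividing once at the end gives the same result as one pass over the full sample.

```python
import numpy as np

def summed_loss_grad(w, x, y):
    # Gradient of sum_i (w * x_i - y_i)^2 with respect to the scalar weight w.
    return np.sum(2.0 * (w * x - y) * x)

rng = np.random.default_rng(0)
x = rng.normal(size=12)
y = 3.0 * x + rng.normal(scale=0.1, size=12)  # hypothetical toy data
w = 0.5
batch_size = 4

# Accumulate the per-batch gradients, then divide once at the end.
accumulated = 0.0
for start in range(0, len(x), batch_size):
    stop = start + batch_size
    accumulated += summed_loss_grad(w, x[start:stop], y[start:stop])
averaged = accumulated / len(x)

# The same quantity computed in a single pass over the full sample.
full_pass = summed_loss_grad(w, x, y) / len(x)
print(averaged, full_pass)
```

This is why a sample that is too large for one forward/backward pass can be split into mini-batches without changing the gradient that is finally applied.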

## Step 1: Imports

In [1]:
import gym
import tensorflow as tf
from tensorflow import keras
import numpy as np
import datetime as dt
import math

## Step 2: Environment

In [2]:
GAMMA = 0.99

env = gym.make("CartPole-v0")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print(state_size, action_size)

STORE_PATH = '/Users/SV/Desktop/Lyra/CS285/PolicyGradients'
logger = tf.summary.create_file_writer(STORE_PATH + f"/PG-CartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")


4 2


## Step 3: Network

In [3]:
network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(action_size, activation='softmax')
])


## Step 4: Sample Action

In [4]:
def sample_action(state):
    # A single state is passed in, so the mini-batch has N = 1,
    # and index 0 selects the action probabilities for that state.
    softmax_out = network(state.reshape((1, -1)))
    selected_action = np.random.choice(action_size, p=softmax_out.numpy()[0])
    return selected_action
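
To make the sampling step concrete, here is a small standalone sketch of drawing actions from a softmax distribution. The logits are hypothetical and a hand-written `softmax` stands in for the network's output layer; with enough draws, the empirical action frequencies should approach the softmax probabilities.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
probs = softmax(np.array([2.0, 0.0]))  # hypothetical logits for 2 actions

# Draw many actions; the empirical frequencies should approach probs.
draws = rng.choice(len(probs), size=10_000, p=probs)
freqs = np.bincount(draws, minlength=len(probs)) / len(draws)
print(probs, freqs)
```

Sampling from the distribution, rather than always taking the most probable action, is what keeps the policy exploring while it is being trained on-policy.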

## Step 5: Sample Episode

In [5]:
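def get_rewards2go(rewards):
    # Discounted reward-to-go: walk the episode backwards so that
    # result[t] = rewards[t] + GAMMA * result[t + 1].
    reward_sum = 0
    result = []
    for reward in reversed(rewards):
        reward_sum = reward + GAMMA * reward_sum
        result.append(reward_sum)
    result.reverse()
    return result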

def sample_episode(states, rewards, rewards2go, actions):
    state = env.reset()
    raw_rewards = []
    while True:
        action = sample_action(state)
        next_state, reward, done, _ = env.step(action)        
        
        states.append(state)        
        actions.append(action)
        raw_rewards.append(reward)
        state = next_state
        
        if done:
            rewards.extend(raw_rewards)             
            rewards2go.extend(get_rewards2go(raw_rewards))
            return
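
A small worked example may help when checking the reward-to-go recursion by hand. The sketch below reimplements the same backward pass as a standalone function (the name `discounted_rewards_to_go` and the explicit `gamma` argument are introduced here for illustration) and uses `gamma=0.5` so the numbers are easy to verify.

```python
def discounted_rewards_to_go(rewards, gamma):
    # Walk backwards: each entry is the reward at that step plus
    # gamma times the reward-to-go of the following step.
    running = 0.0
    out = []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()
    return out

example = discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

The last step only sees its own reward (1.0), the middle step sees 1 + 0.5 * 1 = 1.5, and the first step sees 1 + 0.5 * 1.5 = 1.75.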


## Step 6: Training

In [6]:
sample_size = 20000
batch_size = 5120
steps = 1000

network_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

def averageGradients(grads, N):
    for i in range(len(grads)):
        grads[i] = grads[i]/N
        
        
def addGradients(grads, batch_grads):
    sum_grad = []
    for (grad, batch_grad) in zip(grads, batch_grads):
        sum_grad.append(grad + batch_grad)
    return sum_grad

def generate_sample_batcher(states, rewards2go, actions):
    rewards2go = np.array(rewards2go)
    states = np.array(states)
    actions = np.array(actions)
    
    baseline = np.mean(rewards2go)
    rewards2go = rewards2go - baseline
        
    def sample_batcher(n):
        return states[n:n+batch_size], rewards2go[n:n+batch_size], actions[n:n+batch_size]

    return sample_batcher

for step in range(steps):
    
    # This is one gradient step    
    rewards = []
    rewards2go = []
    states = []
    actions = []
    sum_reward = 0

    N = 0
    while len(states) < sample_size:
        sample_episode(states, rewards, rewards2go, actions)
        N += 1
        if N % 30 == 0:
            print("Sampling Episode - ", N)
    print("Sampled", N, "Episodes")
    
    
    avg_reward = np.sum(np.array(rewards)) / N
    bat_per_epoch = math.floor(len(states) / batch_size)
    sample_batcher = generate_sample_batcher(states, rewards2go, actions)
    
    gradients = None
    for i in range(bat_per_epoch):
        n = i*batch_size
        batch_states, batch_rewards2go, batch_actions = sample_batcher(n)
        
        with tf.GradientTape() as tape:
            predictions = network(batch_states)
            loss = tf.keras.losses.sparse_categorical_crossentropy(y_true=batch_actions, y_pred=predictions, from_logits=False)
            weighted_loss = loss * batch_rewards2go
        batch_gradients = tape.gradient(weighted_loss, network.trainable_variables)

        if gradients is None:
            gradients = batch_gradients
        else:
            gradients = addGradients(gradients, batch_gradients)
    
    averageGradients(gradients, N)
    network_optimizer.apply_gradients(zip(gradients, network.trainable_variables)) 

    if step % 100 == 0:
        print("Saving model, actor & critic @ timestep", step)
        network.save_weights(STORE_PATH + f"/network{dt.datetime.now().strftime('%d%m%Y%H%M')}")
        # For Actor-Critic
        # critic.save_weights(STORE_PATH + f"/critic{dt.datetime.now().strftime('%d%m%Y%H%M')}")

    print(f"Step: {step}, AvgReward: {avg_reward}, BatchLen: {bat_per_epoch}")
    with logger.as_default():
        tf.summary.scalar('avgReward', avg_reward, step=step)



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Sampling Episode -  30
Sampling Episode -  60
Sampling Episode -  90
Sampling Episode -  120
Sampling Episode -  150
Sampling Episode -  180
Sampling Episode -  210
Sampling Episode -  240
Sampling Episode -  270
Sampling Episode -  300
Sampling Episode -  330
Sampling Episode -  360
Sampling Episode -  390
Sampling Episode -  420
Sampling Episode -  450
Sampling Episode -  480
Sampling Episode -  510
Sampling Episode -  540
Sampling Episode -  570
Sampling Episode -  600
Sampling Episode -  630
Sampling Episode -  660
Sampled 676 Episodes
Saving model, actor & critic @ timestep 0
Step: 0, AvgReward: 29.59467455621302, BatchLen: 3
Sampling Episode -  30
Sampling Episode -  60
Sampling Episo

Sampling Episode -  180
Sampling Episode -  210
Sampled 211 Episodes
Step: 20, AvgReward: 95.08056872037915, BatchLen: 3
Sampling Episode -  30
Sampling Episode -  60
Sampling Episode -  90
Sampling Episode -  120
Sampling Episode -  150
Sampling Episode -  180
Sampling Episode -  210
Sampled 214 Episodes
Step: 21, AvgReward: 93.57943925233644, BatchLen: 3
Sampling Episode -  30
Sampling Episode -  60
Sampling Episode -  90
Sampling Episode -  120
Sampling Episode -  150
Sampling Episode -  180
Sampled 197 Episodes
Step: 22, AvgReward: 101.92893401015229, BatchLen: 3
Sampling Episode -  30
Sampling Episode -  60
Sampling Episode -  90
Sampling Episode -  120
Sampling Episode -  150
Sampling Episode -  180
Sampled 186 Episodes
Step: 23, AvgReward: 108.03763440860214, BatchLen: 3
Sampling Episode -  30
Sampling Episode -  60
Sampling Episode -  90
Sampling Episode -  120
Sampling Episode -  150
Sampling Episode -  180
Sampled 183 Episodes
Step: 24, AvgReward: 110.24043715846994, BatchLen

KeyboardInterrupt: 
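
The training loop multiplies the sparse categorical cross-entropy of the taken actions by the baseline-subtracted rewards-to-go. The NumPy sketch below shows what that weighted loss looks like on a tiny example. All probabilities, actions and rewards here are made up for illustration and are not taken from the run above.

```python
import numpy as np

# Hypothetical softmax outputs for 3 timesteps and 2 actions.
probs = np.array([[0.7, 0.3],
                  [0.4, 0.6],
                  [0.5, 0.5]])
actions = np.array([0, 1, 1])             # actions actually taken
rewards_to_go = np.array([2.0, 1.0, 0.0])

# Mean baseline, mirroring what generate_sample_batcher does.
advantages = rewards_to_go - rewards_to_go.mean()

# Sparse categorical cross-entropy is -log of the probability
# assigned to the action that was taken.
cross_entropy = -np.log(probs[np.arange(len(actions)), actions])

# Minimizing this sum raises log-probabilities of actions with
# positive advantage and lowers those with negative advantage.
surrogate_loss = np.sum(cross_entropy * advantages)
print(advantages, surrogate_loss)
```

Because cross-entropy is the negative log-probability, minimizing the weighted loss with the optimizer is equivalent to gradient ascent on the log-probability weighted by the advantage.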