# Cart Pole using Vanilla Policy Gradients

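We use an **on-policy** vanilla policy gradient with the following implemented:
- **Causality**: each action is weighted only by the rewards that follow it (rewards-to-go)
- Mean-reward **Baselines**
- **Discounting**, since an undiscounted sum of rewards grows with the horizon and, in an infinite-horizon problem, can blow up the gradient
- Gradient **Averaging**: the gradients of all mini-batches of the policy's trajectories are summed, then divided by the number of sampled episodes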
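
Taken together, these choices correspond to the usual reward-to-go estimator, sketched below. The notation is an assumption and does not appear in the notebook: $N$ is the number of sampled episodes (`num_episodes`), $T_i$ is the length of episode $i$, $\gamma$ is `GAMMA`, and $b$ is the mean of all discounted rewards-to-go collected for one gradient step.

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T_i-1} \gamma^{\,t'-t} \, r_{i,t'} - b \right)
$$

Causality is the inner sum starting at $t' = t$, the discount is the factor $\gamma^{\,t'-t}$, the baseline is $b$, and the averaging is the leading $1/N$.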

The following are NOT implemented:
- **Parallelism** in sampling
- Frequent model **Saving**
- **Decaying** `num_episodes` (reason: as the policy improves, episodes get longer, which makes training slow)
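
One possible way to decay the episode count is sketched here; it is not part of this notebook. It picks the number of episodes so that each gradient step sees roughly the same number of transitions. The function name `episodes_for_step` and the parameters `target_transitions`, `min_episodes`, and `max_episodes` are all hypothetical.

```python
def episodes_for_step(avg_episode_len, target_transitions=6000,
                      min_episodes=20, max_episodes=300):
    """Hypothetical schedule: fewer episodes as episodes get longer.

    avg_episode_len is the average episode length observed in the
    previous gradient step (for CartPole it equals the average reward).
    """
    if avg_episode_len <= 0:
        return max_episodes
    wanted = round(target_transitions / avg_episode_len)
    # Clamp so the estimate never relies on too few or too many episodes.
    return int(max(min_episodes, min(max_episodes, wanted)))


# Short episodes early in training keep the full count,
# long episodes later on reduce it.
for avg_len in (20, 70, 200):
    print(avg_len, episodes_for_step(avg_len))
```

If something like this were used, the gradient would need to be divided by the number of episodes actually sampled in that step rather than by a fixed constant.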

## Step 1: Imports

In [1]:
import gym
import tensorflow as tf
from tensorflow import keras
import numpy as np
import datetime as dt
import math

## Step 2: Environment

In [2]:
GAMMA = 0.95

env = gym.make("CartPole-v0")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print(state_size, action_size)

STORE_PATH = '/Users/SV/Desktop/Lyra/CS285/PolicyGradients'
logger = tf.summary.create_file_writer(STORE_PATH + f"/PGCartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")


4 2


## Step 3: Network

In [3]:
network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(action_size, activation='softmax')
])


## Step 4: Sample Action

In [4]:
def sample_action(state):
    # The network expects a batch, so the single state is reshaped to
    # shape (1, state_size), i.e. a mini-batch with N = 1.
    # Row 0 of the softmax output holds the action probabilities for that state.
    softmax_out = network(state.reshape((1, -1)))
    selected_action = np.random.choice(action_size, p=softmax_out.numpy()[0])
    return selected_action
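
A note on sampling from float32 probabilities: `np.random.choice` checks that `p` sums to 1 within a tight tolerance, and float32 softmax outputs can occasionally miss it and raise `ValueError: probabilities do not sum to 1`. The sketch below shows one defensive variant that renormalizes in float64 first. It is independent of the network above, and `sample_from_probs` is a hypothetical helper name.

```python
import numpy as np


def sample_from_probs(probs, rng=None):
    """Sample an action index from a vector of action probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    # Cast to float64 and renormalize so the sum-to-one check passes.
    p = np.asarray(probs, dtype=np.float64)
    p = p / p.sum()
    return int(rng.choice(len(p), p=p))


rng = np.random.default_rng(0)
probs = np.array([0.9, 0.1], dtype=np.float32)  # stand-in for a softmax output
print(sample_from_probs(probs, rng))
```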

## Step 5: Sample Episode

In [5]:
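def discounted_rewardsToGo(rewards):
    # Walk the episode backwards so that each step accumulates
    # reward_t + GAMMA * (discounted reward-to-go from step t+1).
    reward_sum = 0
    result = []
    for reward in reversed(rewards):
        reward_sum = reward + GAMMA * reward_sum
        result.append(reward_sum)
    result = np.array(result)
    # Optional per-episode normalization (disabled):
    # result -= np.mean(result)
    # result /= np.std(result)
    # Reverse so that result[t] is the discounted reward-to-go from step t.
    return result[::-1]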

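def sample_episode(states, rewards, actions):
    # Runs one episode with the current policy and appends its states, actions
    # and discounted rewards-to-go to the caller's lists.
    # Returns the undiscounted total reward of the episode.
    # Written against the classic gym API: reset() returns the observation and
    # step() returns 4 values (gym >= 0.26 and gymnasium changed both).
    state = env.reset()
    raw_rewards = []
    while True:
        states.append(state)
        action = sample_action(state)
        actions.append(action)
        state, reward, done, _ = env.step(action)
        raw_rewards.append(reward)

        if done:
            tot_reward = sum(raw_rewards)
            episode_rewards = discounted_rewardsToGo(raw_rewards)
            rewards.extend(episode_rewards)
            return tot_reward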
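
A small worked example makes the backward recursion concrete. This is a standalone sketch of the same idea as `discounted_rewardsToGo`, with the discount passed in as a parameter (`gamma`, a name chosen here) so the numbers are easy to check by hand.

```python
import numpy as np


def rewards_to_go(rewards, gamma):
    """Discounted reward-to-go for every step of one episode."""
    running = 0.0
    out = []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return np.array(out[::-1])


# Three steps with reward 1 and gamma = 0.5:
#   step 2: 1
#   step 1: 1 + 0.5 * 1   = 1.5
#   step 0: 1 + 0.5 * 1.5 = 1.75
print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))
```

With CartPole's reward of 1 per step and `GAMMA = 0.95`, every reward-to-go stays below 1 / (1 - 0.95) = 20 no matter how long the episode runs, which is the point of the discount.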


## Step 6: Training

In [6]:
num_episodes = 300
steps = 500
batch_size = 256
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

def averageGradients(grads):
    # In place: divide the summed gradients by the number of sampled episodes.
    for i in range(len(grads)):
        grads[i] = grads[i]/num_episodes

def addGradients(grads, batch_grads):
    sum_grad = []
    for (grad, batch_grad) in zip(grads, batch_grads):
        sum_grad.append(grad + batch_grad)
    return sum_grad

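def batch_step(states, actions, rewards):
    with tf.GradientTape() as tape:
        predictions = network(states)
        # Cross-entropy against the taken action is -log pi(a|s) per sample.
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true=actions, y_pred=predictions, from_logits=False)
        # Weight each sample by its baseline-subtracted discounted reward-to-go.
        weighted_loss = loss * rewards
    # With a non-scalar target, tape.gradient sums the per-sample gradients.
    batch_gradients = tape.gradient(weighted_loss, network.trainable_variables)
    return batch_gradients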

for step in range(steps):
    
    # This is one gradient step    
    rewards = []
    states = []
    actions = []
    sum_reward = 0

    for episode in range(num_episodes):
        sum_reward += sample_episode(states, rewards, actions)
        if episode % 30 == 0:
            print("Sampling Episode - ", episode)
    
    baseline = np.mean(np.array(rewards))
    for i in range(len(rewards)):
        rewards[i] -= baseline
    
    gradients = None
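    # floor() drops the final partial batch, so up to batch_size - 1
    # trailing samples are not used in this gradient step.
    bat_per_epoch = math.floor(len(states) / batch_size)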
    for i in range(bat_per_epoch):
        n = i*batch_size
        states_np = np.array(states[n:n+batch_size])
        actions_np = np.array(actions[n:n+batch_size])
        rewards_np = np.array(rewards[n:n+batch_size])
        batch_gradients = batch_step(states_np, actions_np, rewards_np)

        if gradients is None:
            gradients = batch_gradients
        else:
            gradients = addGradients(gradients, batch_gradients)
    
    averageGradients(gradients)
    optimizer.apply_gradients(zip(gradients, network.trainable_variables)) 
    
    avg_reward = sum_reward / num_episodes
    print(f"Step: {step}, AvgReward: {avg_reward}, BatchLen: {bat_per_epoch}")
    with logger.as_default():
        tf.summary.scalar('avgReward', avg_reward, step=step)
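
Why does weighting the cross-entropy by the reward give a policy gradient? For a softmax policy, the cross-entropy of the taken action is -log π(a|s), and the gradient of that weighted loss with respect to the logits is (π - one_hot(a)) times the weight. The NumPy sketch below checks this against a finite-difference gradient. It works on logits directly rather than on the Keras network, and all function names are hypothetical.

```python
import numpy as np


def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()


def weighted_loss(logits, action, advantage):
    # Cross-entropy of the taken action, scaled by its advantage.
    return -np.log(softmax(logits)[action]) * advantage


def analytic_grad(logits, action, advantage):
    one_hot = np.zeros(len(logits))
    one_hot[action] = 1.0
    return (softmax(logits) - one_hot) * advantage


def numeric_grad(logits, action, advantage, eps=1e-6):
    logits = np.asarray(logits, dtype=np.float64)
    grad = np.zeros(len(logits))
    for k in range(len(logits)):
        step = np.zeros(len(logits))
        step[k] = eps
        # Central difference along logit k.
        grad[k] = (weighted_loss(logits + step, action, advantage)
                   - weighted_loss(logits - step, action, advantage)) / (2 * eps)
    return grad


logits = np.array([0.2, -0.4])
print(analytic_grad(logits, action=0, advantage=1.5))
print(numeric_grad(logits, action=0, advantage=1.5))
```

A positive weight gives a negative gradient on the taken action's logit, so a descent step raises that action's probability. A negative weight (reward-to-go below the baseline) lowers it.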



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Sampling Episode -  0
Sampling Episode -  30
Sampling Episode -  60
Sampling Episode -  90
Sampling Episode -  120
Sampling Episode -  150
Sampling Episode -  180
Sampling Episode -  210
Sampling Episode -  240
Sampling Episode -  270
Step: 0, AvgReward: 24.30666666666667, BatchLen: 28
Step: 1, AvgReward: 25.956666666666667, BatchLen: 30
...
Step: 26, AvgReward: 69.99666666666667, BatchLen: 82
Step: 27, AvgReward: 66.07666666666667, BatchLen: 77
Step: 28, AvgReward: 68.47666666666667, BatchLen: 80
Step: 29, AvgReward: 77.17333333333333, 
...
Step: 55, AvgReward: 176.93666666666667, BatchLen: 207
Step: 56, AvgReward: 179.86333333333334, BatchLen: 210
Step: 57, AvgReward: 181.14666666666668, BatchLen: 212
...
Step: 83, AvgReward: 194.9, BatchLen: 228
Step: 84, AvgReward: 197.30666666666667, BatchLen: 231
Step: 85, AvgReward: 195.81666666666666, BatchLen: 229
Step: 86, AvgReward: 194.98666666666668, BatchLen: 228
Sampling Episode -  0
Sampling Episode -  30
Sampling Episode - 

KeyboardInterrupt: 