
Implementation Details
1. Define the Cartpole Environment
2. Define HyperParameters
3. Define Neural Network'
   1. 3 Layers Neural Network in which the input of the cartpole environment
   2. Output is the softmax layer of the action size 
   3. Adam Optimizer
   4. Loss Function- reduce_mean(discounted_reward of the episode*neg_loss_with_logits)
   5. Loop through the episodes
       1. Pass through the neural network and get probability ditribution.
       2. Take action depending upon the probability distribution
       3. Store all the states, rewards and actions untill the episode is done depending upon the actions given by the network.
       4. when done calculate the discounted rewards and pass the states passed by the neural network to train the network.
   6. This way the whole policy update is done as after each episode the policy is updated

## Defining the Cartpole Environment

In [1]:
import gym
import numpy as np

In [2]:
env = gym.make("CartPole-v1")
env = env.unwrapped
env.seed(1)

[1]

## Hyperparameters

In [3]:
gamma=0.95   ##Discount factor
episodes=1000  ##Episodes
actions=env.action_space.n  ##Number of actions
state_size=4  ##Number of states
learning_rate=0.01

## Neural Network

In [4]:
import tensorflow as tf

In [5]:
tf.compat.v1.disable_eager_execution()
input_state=tf.compat.v1.placeholder(tf.float32, shape=[None, state_size],name="input_state")
action_space=tf.compat.v1.placeholder(tf.int32, shape=[None, actions],name="action_space")
discounted_reward=tf.compat.v1.placeholder(tf.float32, shape=[None,],name="discounted_reward")
fc1=tf.keras.layers.Dense(10,activation='relu',name="fc1")(input_state)
fc2=tf.keras.layers.Dense(actions,activation='relu',name="fc2")(fc1)
fc3=tf.keras.layers.Dense(actions,name="fc3")(fc2)
action_output=tf.keras.layers.Dense(actions,activation='softmax',name="action_output")(fc3)
neg_loss_prob=tf.compat.v1.nn.softmax_cross_entropy_with_logits_v2(logits = fc3, labels = action_space)
loss=tf.math.reduce_mean(discounted_reward*neg_loss_prob)
training=tf.compat.v1.train.AdamOptimizer(learning_rate).minimize(loss)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


## Building Gameplay Bot for Cartpole

In [6]:
def discounted_rewards(rewards):
    cummulative_reward=0.0
    discounted_episode_rewards = np.zeros_like(rewards)
    for i in reversed(range(len(rewards))):
        cummulative_reward = cummulative_reward * gamma + rewards[i]
        discounted_episode_rewards[i] = cummulative_reward
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std)
    return discounted_episode_rewards

In [7]:
allRewards = []
total_rewards = 0
maximumRewardRecorded = 0
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    for each in range(episodes):
        episode_rewards_sum = 0
        episode_states=[]
        episode_rewards=[]
        episode_actions=[]
        state = env.reset()
        env.render()
        while True:
            action_probability=sess.run(action_output,feed_dict={input_state:state.reshape([1,4])})
            action = np.random.choice(range(action_probability.shape[1]), p=action_probability.ravel())
            new_state, reward, done, info=env.step(action)
            episode_states.append(state)
            episode_rewards.append(reward)
            action_ = np.zeros(actions)
            action_[action] = 1
            episode_actions.append(action_)
            if done:
                episode_rewards_sum = np.sum(episode_rewards)
                allRewards.append(episode_rewards_sum)
                total_rewards = np.sum(allRewards)
                maximumRewardRecorded = np.amax(allRewards)
                print("==========================================")
                print("Episode: ", each+1)
                print("Reward: ", episode_rewards_sum)
                print("Max reward so far: ", maximumRewardRecorded)
                discounted_episode_rewards=discounted_rewards(episode_rewards)
                loss_, _ = sess.run([loss, training], feed_dict={input_state: np.vstack(np.array(episode_states)),
                                                                 action_space: np.vstack(np.array(episode_actions)),
                                                                 discounted_reward: discounted_episode_rewards 
                                                                })
                break
            state = new_state
env.close()

Episode:  1
Reward:  9.0
Max reward so far:  9.0
Episode:  2
Reward:  13.0
Max reward so far:  13.0
Episode:  3
Reward:  14.0
Max reward so far:  14.0
Episode:  4
Reward:  21.0
Max reward so far:  21.0
Episode:  5
Reward:  14.0
Max reward so far:  21.0
Episode:  6
Reward:  12.0
Max reward so far:  21.0
Episode:  7
Reward:  37.0
Max reward so far:  37.0
Episode:  8
Reward:  41.0
Max reward so far:  41.0
Episode:  9
Reward:  42.0
Max reward so far:  42.0
Episode:  10
Reward:  12.0
Max reward so far:  42.0
Episode:  11
Reward:  15.0
Max reward so far:  42.0
Episode:  12
Reward:  14.0
Max reward so far:  42.0
Episode:  13
Reward:  47.0
Max reward so far:  47.0
Episode:  14
Reward:  10.0
Max reward so far:  47.0
Episode:  15
Reward:  14.0
Max reward so far:  47.0
Episode:  16
Reward:  10.0
Max reward so far:  47.0
Episode:  17
Reward:  11.0
Max reward so far:  47.0
Episode:  18
Reward:  37.0
Max reward so far:  47.0
Episode:  19
Reward:  56.0
Max reward so far:  56.0
Episode:  20
Reward:  2

Episode:  88
Reward:  29.0
Max reward so far:  117.0
Episode:  89
Reward:  24.0
Max reward so far:  117.0
Episode:  90
Reward:  27.0
Max reward so far:  117.0
Episode:  91
Reward:  16.0
Max reward so far:  117.0
Episode:  92
Reward:  32.0
Max reward so far:  117.0
Episode:  93
Reward:  38.0
Max reward so far:  117.0
Episode:  94
Reward:  71.0
Max reward so far:  117.0
Episode:  95
Reward:  20.0
Max reward so far:  117.0
Episode:  96
Reward:  46.0
Max reward so far:  117.0
Episode:  97
Reward:  34.0
Max reward so far:  117.0
Episode:  98
Reward:  23.0
Max reward so far:  117.0
Episode:  99
Reward:  21.0
Max reward so far:  117.0
Episode:  100
Reward:  23.0
Max reward so far:  117.0
Episode:  101
Reward:  21.0
Max reward so far:  117.0
Episode:  102
Reward:  21.0
Max reward so far:  117.0
Episode:  103
Reward:  16.0
Max reward so far:  117.0
Episode:  104
Reward:  19.0
Max reward so far:  117.0
Episode:  105
Reward:  12.0
Max reward so far:  117.0
Episode:  106
Reward:  14.0
Max reward s

Episode:  183
Reward:  10.0
Max reward so far:  117.0
Episode:  184
Reward:  9.0
Max reward so far:  117.0
Episode:  185
Reward:  9.0
Max reward so far:  117.0
Episode:  186
Reward:  9.0
Max reward so far:  117.0
Episode:  187
Reward:  8.0
Max reward so far:  117.0
Episode:  188
Reward:  10.0
Max reward so far:  117.0
Episode:  189
Reward:  10.0
Max reward so far:  117.0
Episode:  190
Reward:  11.0
Max reward so far:  117.0
Episode:  191
Reward:  8.0
Max reward so far:  117.0
Episode:  192
Reward:  9.0
Max reward so far:  117.0
Episode:  193
Reward:  10.0
Max reward so far:  117.0
Episode:  194
Reward:  10.0
Max reward so far:  117.0
Episode:  195
Reward:  9.0
Max reward so far:  117.0
Episode:  196
Reward:  10.0
Max reward so far:  117.0
Episode:  197
Reward:  8.0
Max reward so far:  117.0
Episode:  198
Reward:  9.0
Max reward so far:  117.0
Episode:  199
Reward:  9.0
Max reward so far:  117.0
Episode:  200
Reward:  10.0
Max reward so far:  117.0
Episode:  201
Reward:  10.0
Max reward

Episode:  277
Reward:  10.0
Max reward so far:  117.0
Episode:  278
Reward:  10.0
Max reward so far:  117.0
Episode:  279
Reward:  11.0
Max reward so far:  117.0
Episode:  280
Reward:  10.0
Max reward so far:  117.0
Episode:  281
Reward:  9.0
Max reward so far:  117.0
Episode:  282
Reward:  10.0
Max reward so far:  117.0
Episode:  283
Reward:  9.0
Max reward so far:  117.0
Episode:  284
Reward:  9.0
Max reward so far:  117.0
Episode:  285
Reward:  10.0
Max reward so far:  117.0
Episode:  286
Reward:  9.0
Max reward so far:  117.0
Episode:  287
Reward:  9.0
Max reward so far:  117.0
Episode:  288
Reward:  10.0
Max reward so far:  117.0
Episode:  289
Reward:  11.0
Max reward so far:  117.0
Episode:  290
Reward:  9.0
Max reward so far:  117.0
Episode:  291
Reward:  10.0
Max reward so far:  117.0
Episode:  292
Reward:  9.0
Max reward so far:  117.0
Episode:  293
Reward:  10.0
Max reward so far:  117.0
Episode:  294
Reward:  10.0
Max reward so far:  117.0
Episode:  295
Reward:  9.0
Max rewa