After multi-armed and contextual bandits, the full reinforcement learning problem additionally requires taking observations from the world and choosing actions that maximize reward not just in the present, but over the long run. Such problems are formalized as **Markov Decision Processes (MDPs)**. <U>These environments not only provide rewards and state transitions given actions, but those rewards also depend on the state of the environment and the action taken within that state. These dynamics are also temporal, and rewards can be delayed over time.</U>

* ***Delayed reward*** in OpenAI Gym's CartPole:

Keeping the pole in the air as long as possible means moving in ways that will be advantageous for both the present and the future. To accomplish this we will adjust the reward value for each observation-action pair using a function that weighs actions over time.

To take reward over time into account, the form of Policy Gradient we used in the previous tutorials needs a few adjustments. The first is that we now need to update our agent with more than one experience at a time. To accomplish this, we collect experiences in a buffer and then occasionally use them to update the agent all at once. These sequences of experience are sometimes referred to as rollouts, or experience traces. We can’t just apply these rollouts by themselves, however; we also need to ensure that the rewards are properly adjusted by a discount factor.

Intuitively this allows each action to be a little bit responsible for not only the immediate reward, but all the rewards that followed. 
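
As a sketch of the standard definition (the symbols $G_t$, $r_t$ and $T$ are notation introduced here for illustration and do not appear in the text above), the discounted return from step $t$ of an episode of length $T$ with discount factor $\gamma$ can be written as:

```latex
G_t \;=\; \sum_{k=0}^{T-1-t} \gamma^{k}\, r_{t+k} \;=\; r_t + \gamma\, G_{t+1}, \qquad G_T = 0
```

The recursive form on the right is what a backward pass over the episode's rewards computes. With $\gamma = 0.99$, a reward received 100 steps in the future is weighted by $0.99^{100} \approx 0.37$, so distant rewards still count, but less than immediate ones.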

In [1]:
import tensorflow as tf
import tensorflow.contrib as tc
import numpy as np
import gym
import matplotlib.pyplot as plt
%matplotlib inline

# Python 2/3 compatibility: fall back to range where xrange is not defined.
try:
    xrange = xrange
except NameError:
    xrange = range

In [2]:
env = gym.make('CartPole-v0')

In [3]:
gamma = 0.99

def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(xrange(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
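
To make the discounting concrete, here is a small self-contained check on a three-step episode. The function name `discounted_returns` and the toy reward list are illustrative assumptions; the loop mirrors the backward accumulation used in `discount_rewards` above, with the input cast to float so that integer rewards are not truncated.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Backward pass: each step's return is its reward plus gamma times the next step's return."""
    rewards = np.asarray(rewards, dtype=np.float64)
    out = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(rewards.size)):
        running = running * gamma + rewards[t]
        out[t] = running
    return out

# Three steps, each with reward 1 (as in CartPole, where every surviving step earns +1).
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.99)
print(example)  # approximately [2.9701, 1.99, 1.0]
```

The earliest action receives the largest value because it is credited with every reward that followed it.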

In [4]:
class pg_net():
    def __init__(self, lr, n_obs, n_action, n_hidden, layer_norm):
        self.n_obs = n_obs
        self.n_action = n_action
        self.n_hidden = n_hidden
        self.layer_norm = layer_norm
        
        self.obs, self.output, self.chosen_action = self.build_net()
        
        self.reward_plc = tf.placeholder(shape=[None], dtype=tf.float32)
        self.action_plc = tf.placeholder(shape=[None], dtype=tf.int32)
        
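        self.shape = tf.shape(self.output)
        # Flat index of each taken action in the flattened output: row * n_action + action.
        self.indexes = tf.range(0, tf.shape(self.output)[0])*tf.shape(self.output)[1] + self.action_plc
        # Probability the policy assigned to the action actually taken at each step.
        self.responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), self.indexes)

        # Policy-gradient loss: negative mean of log-probability weighted by the discounted reward.
        loss = -tf.reduce_mean(tf.log(self.responsible_outputs)*self.reward_plc)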
        
        tvars = tf.trainable_variables()
        self.gradient_holders = []
        for idx,var in enumerate(tvars):
            placeholder = tf.placeholder(tf.float32,name=str(idx)+'_holder')
            self.gradient_holders.append(placeholder)
        
        self.gradients = tf.gradients(loss, tvars)
        
        optimizer = tf.train.AdamOptimizer(learning_rate=lr)
        self.update_batch = optimizer.apply_gradients(zip(self.gradient_holders,tvars))
        
    def build_net(self):
        obs_plc = tf.placeholder(shape=[None, self.n_obs], dtype=tf.float32)
        
        x = tf.layers.dense(obs_plc, self.n_hidden)
        x = tf.nn.relu(x)
        if self.layer_norm:
            # Reassign x so the normalized activations are actually passed to the next layer.
            x = tc.layers.layer_norm(x, center=True, scale=True)
        
        x = tf.layers.dense(x, self.n_action)
        if self.layer_norm:
            x = tc.layers.layer_norm(x, center=True, scale=True)
        x = tf.nn.softmax(x)
        
        chosen_action = tf.argmax(x, 1)
    
        return obs_plc, x, chosen_action
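
The `indexes` and `responsible_outputs` tensors above pick out, for each step, the probability of the action that was actually taken. The same flat-index arithmetic can be sketched in plain NumPy; the probabilities and actions below are made-up illustrative values, not output from the network.

```python
import numpy as np

# Hypothetical policy outputs for 3 steps with 2 actions each (each row sums to 1).
probs = np.array([[0.6, 0.4],
                  [0.3, 0.7],
                  [0.5, 0.5]])
actions = np.array([0, 1, 1])  # action taken at each step

n_steps, n_actions = probs.shape
# Row offset into the flattened array plus the action column.
flat_idx = np.arange(n_steps) * n_actions + actions
responsible = probs.reshape(-1)[flat_idx]

print(flat_idx)     # [0 3 5]
print(responsible)  # [0.6 0.7 0.5]

# Equivalent fancy indexing, shown only as a cross-check.
assert np.allclose(responsible, probs[np.arange(n_steps), actions])
```

With two actions, step `k` maps to flat index `2k` or `2k + 1`, which is the pattern visible in index arrays printed during training.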

In [9]:
tf.reset_default_graph() #Clear the Tensorflow graph.

myAgent = pg_net(lr=1e-2, n_obs=4, n_action=2, n_hidden=8, layer_norm=False) #Load the agent.

total_episodes = 5000 #Set total number of episodes to train agent on.
max_ep = 999
update_frequency = 5

init = tf.global_variables_initializer()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0
    total_reward = []
    total_length = []

    # Gradient buffer: one zeroed array per trainable variable.
    gradBuffer = sess.run(tf.trainable_variables())
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0

    while i < total_episodes:
        s = env.reset()
        running_reward = 0
        ep_history = []
        for j in range(max_ep):
            #Probabilistically pick an action given our network outputs.
            a_dist = sess.run(myAgent.output, feed_dict={myAgent.obs:[s]})
            a = np.random.choice(len(a_dist[0]), p=a_dist[0])

            s1,r,d,_ = env.step(a) #Get the next state, reward and done flag for the chosen action.
            ep_history.append([s,a,r,s1])
            s = s1
            running_reward += r
            if d:
                #Update the network.
                ep_history = np.array(ep_history, dtype=object)
                ep_history[:,2] = discount_rewards(ep_history[:,2])
                feed_dict={myAgent.reward_plc:ep_history[:,2],
                        myAgent.action_plc:ep_history[:,1],myAgent.obs:np.vstack(ep_history[:,0])}
                grads, indx, rep = sess.run([myAgent.gradients, myAgent.indexes, myAgent.responsible_outputs],
                                                feed_dict=feed_dict)
                #Debug output: flat indexes and probabilities of the actions taken.
                print(indx)
                print(rep)
                #Accumulate this episode's gradients in the buffer.
                for idx,grad in enumerate(grads):
                    gradBuffer[idx] += grad

                #Apply the accumulated gradients every update_frequency episodes, then reset the buffer.
                if i % update_frequency == 0 and i != 0:
                    feed_dict = dict(zip(myAgent.gradient_holders, gradBuffer))
                    _ = sess.run([myAgent.update_batch], feed_dict=feed_dict)
                    for ix,grad in enumerate(gradBuffer):
                        gradBuffer[ix] = grad * 0

                total_reward.append(running_reward)
                total_length.append(j)
                break

        #Update our running tally of scores.
        if i % 100 == 0:
            print(np.mean(total_reward[-100:]))
        i += 1

[ 0  2  5  7  8 11 12 14 17 18 21 22 25 27 28 31 32 34 36 39 40 42]
[0.49916077 0.4860956  0.51965755 0.51128477 0.50428104 0.5110443
 0.5052062  0.48910004 0.5168593  0.49129605 0.5147863  0.49323213
 0.5129822  0.5051179  0.5201501  0.5059404  0.5191729  0.49301788
 0.48644993 0.5191087  0.48924538 0.48343438]
22.0
[ 1  2  4  7  9 10 12 15 17 18 20 22 25 26 29 31 33 35 37 38 40 42 44 46
 49 51 53 55 56 58 61 62 64 67 69 71 72 75 76 79 80 82 84 86 88]
[0.49761227 0.47307503 0.5027847  0.5050895  0.49485222 0.47516724
 0.5047846  0.5067981  0.49165818 0.47773236 0.50494665 0.49138483
 0.51453793 0.49366245 0.51238227 0.50431406 0.481754   0.50677073
 0.53472614 0.43676808 0.4664289  0.4967295  0.49055165 0.48118457
 0.5249458  0.5170059  0.50225556 0.48698536 0.48601097 0.5179123
 0.5159161  0.52418506 0.4799757  0.5291918  0.51941204 0.47436976
 0.51725686 0.483959   0.52663755 0.49335676 0.5369547  0.4973824
 0.45493752 0.44684297 0.4397426 ]
[ 0  3  4  6  8 10 13 14 17 19 21 22 25 2

[  1   2   4   7   8  11  12  14  17  18  20  23  25  26  28  31  33  35
  37  39  40  42  44  46  49  51  52  55  57  58  61  63  64  66  68  70
  73  75  76  79  81  82  84  87  89  90  93  95  96  98 101 102 104 107]
[0.47295666 0.56025046 0.527378   0.4949577  0.53064257 0.49286297
 0.53350365 0.50918686 0.51012224 0.51281255 0.49335694 0.52586466
 0.5018404  0.5206773  0.5008032  0.5189162  0.49527285 0.47323757
 0.43410435 0.42653143 0.5834171  0.57142997 0.5611238  0.52205783
 0.49900845 0.47609958 0.5726874  0.4763117  0.42741045 0.57892656
 0.43264496 0.4202058  0.5903245  0.579054   0.55289274 0.5121505
 0.5107962  0.48779702 0.555638   0.49034294 0.4471652  0.6004492
 0.54461014 0.49906632 0.45949373 0.61032057 0.46930563 0.39944246
 0.6134094  0.5857092  0.49509168 0.5749214  0.4934229  0.53775615]
[ 0  2  5  6  9 11 12 15 17 19 21 22 24 26 28 30 33 35 36 38 40 42 44 46]
[0.5329336  0.50135905 0.52074957 0.5031827  0.51929176 0.49575937
 0.544765   0.49709296 0.45691434 0.4

[  1   2   4   6   9  11  13  14  16  18  21  22  25  27  29  30  33  35
  36  39  41  42  44  46  49  51  52  54  57  59  60  63  64  66  68  71
  73  75  76  78  80  82  85  86  89  90  92  95  97  98 100 103 104 107]
[0.46942168 0.5892811  0.52909935 0.46926636 0.5675215  0.52861637
 0.4669452  0.5918765  0.52936465 0.46995082 0.5673925  0.4709253
 0.56697714 0.5292563  0.46423775 0.59675044 0.47205415 0.40425444
 0.62941563 0.4067105  0.3728724  0.6571486  0.62324286 0.5828517
 0.51096636 0.41796303 0.6216666  0.5782573  0.51937485 0.4220274
 0.62037253 0.4253467  0.6188722  0.57191795 0.47431752 0.56339025
 0.52524954 0.42447034 0.6200304  0.57215434 0.47045967 0.4325233
 0.6042008  0.4330971  0.6044056  0.43204457 0.39378497 0.6429803
 0.60867035 0.42534465 0.38529545 0.6530783  0.3782115  0.661012  ]
[ 1  3  4  6  8 10 12 15 17 18 21 23 24 27 28 31 32 35 36 38 40 42 44 47]
[0.43857196 0.39494428 0.63647264 0.6019519  0.5527496  0.4777575
 0.44031483 0.59605116 0.5580191  0.47956

[ 0  2  5  7  8 11 12 15 17 19 21 23 25 26 28 30 32 34 36 39 41 43 44 47
 49 51 52 55 56 58 61 63 64 66 68 71 72 74 77 79 81 83 84 86]
[0.5143318  0.42700192 0.6226909  0.57488036 0.504037   0.5792252
 0.497286   0.58440274 0.5106279  0.39561188 0.3402436  0.29602823
 0.2567413  0.77840453 0.7354758  0.6888586  0.6342145  0.55866724
 0.4276908  0.6212819  0.57251376 0.4365693  0.6366707  0.4411758
 0.36665195 0.31376895 0.7281259  0.31874967 0.7238843  0.67696404
 0.38479027 0.32474428 0.7187616  0.6718542  0.6131771  0.45744777
 0.6170473  0.54926336 0.52330244 0.44210565 0.37339905 0.32449597
 0.7191842  0.6725699 ]
[ 0  3  5  7  8 10 12 15 16 19 20 23 24 27 28 30 33 35 36 38 41 42 44 47
 49 50 53 55 56 59 60 62 65 66 69 71 72 74 77 79 80]
[0.5014119  0.5789035  0.4945602  0.3961421  0.6588573  0.59997296
 0.50677687 0.5748621  0.5139975  0.5715924  0.519802   0.5686166
 0.524636   0.5659     0.5288469  0.43658558 0.6137323  0.5603252
 0.53699094 0.44077381 0.6103768  0.4424457  0.39

KeyboardInterrupt: 