# Homework 2: Deep Deterministic Policy Gradient (DDPG) 

## Scott Scheraga  7/5/2020


Implement the DDPG algorithm.
Please do so in native TensorFlow 2, i.e. any use of tensorflow.compat.v1 methods should be kept to a minimum.
Train your agent on the HalfCheetahBulletEnv-v0 environment.
https://github.com/openai/gym/blob/master/docs/environments.md#pybullet-robotics-environments
https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA/edit#heading=h.wz5to0x8kqmr
Push your code to GitHub with instructions for running it.
Perform experiments that help you to understand the performance of your agent.

Record a short clip (10–15 seconds) of your trained agent running on the environment. You can use gym.wrappers.monitoring.video_recorder for this or simply use screen recording software.

Write a paper describing the environment, your agent, and your experiments.

- There is a strict limit of 2000 words.
    - You will be penalized if you exceed this limit.
    - There is no lower bound; if you are able to meet the requirements with fewer words, please do so.
    - There is also no page limit, and you may include as many figures as you wish. However, each figure must be adequately described and must be of your own creation.
    - Cite your sources in ACM or IEEE citation style. References are not included in the word count.
- Include a link to your code.
- Describe the environment. What are the observations and available actions? What is an episode in this environment?
- Briefly explain the DDPG algorithm.
- Report on your experiments and discuss the results. At a minimum, you should have graphs showing the following:
    - learning behavior of your agent, i.e., total reward attained per episode during training;
    - performance of your trained agent, i.e., total reward attained per episode with your trained agent (show at least 200 episodes' worth of data); and
    - how the learning behavior changes with choice of hyperparameters.
- Describe and reflect on challenges you faced.




In [1]:
#Code is largely built off of https://keras.io/examples/rl/ddpg_pendulum/


#from gym.envs.registration import registry, make, spec, register
#python -m pybullet_envs.examples.enjoy_TF_HalfCheetahBulletEnv_v0_2017may

import tensorflow as tf
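# Eager execution is the default in TensorFlow 2; the call below is only needed on TF 1.x.
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()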
#tf.config.run_functions_eagerly(True)
import gym
import pybullet_envs
#import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
from gym import wrappers
from IPython import display

from datetime import datetime

%matplotlib inline




In [2]:

#env = gym.make('Pendulum-v0')
#env = gym.make('CartPole-v1')
env = gym.make('HalfCheetahBulletEnv-v0')

print(env.observation_space.shape)
print(env.action_space.shape)
num_states = env.observation_space.shape[0]
print("Size of State Space ->  {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space ->  {}".format(num_actions))

#Set the upper and lower bounds, to be used for scaling the Actor network's output
#and for clipping actions to the legal range in the policy function.
upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]
print("Max Value of Action ->  {}".format(upper_bound))
print("Min Value of Action ->  {}".format(lower_bound))


(26,)
(6,)
Size of State Space ->  26
Size of Action Space ->  6
Max Value of Action ->  1.0
Min Value of Action ->  -1.0




In [3]:
class OUActionNoise:
    #The idea behind the Ornstein-Uhlenbeck process is that noise at a given timestep
    #is correlated with the noise at the previous timestep, so the perturbation keeps
    #pointing in the same direction for a while instead of cancelling itself out at
    #each subsequent timestep. This favors exploration and suits control tasks.
    
    #Later in the code, this class is instantiated with the arguments:
    # mean=np.zeros(1), std_deviation=float(std_dev) * np.ones(1), where std_dev = 0.2,
    #with everything else left at its default.
    def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, x_initial=None):
        self.theta = theta
        self.mean = mean
        self.std_dev = std_deviation
        self.dt = dt
        self.x_initial = x_initial
        self.reset()

    def __call__(self):
        # Formula taken from https://www.wikipedia.org/wiki/Ornstein-Uhlenbeck_process.
        x = (
            self.x_prev
            + self.theta * (self.mean - self.x_prev) * self.dt
            + self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape)
        )
        # Store x in x_prev so that the next noise sample depends on the current one
        self.x_prev = x
        return x

    def reset(self):
        if self.x_initial is not None:
            self.x_prev = self.x_initial
        else:
            self.x_prev = np.zeros_like(self.mean)


In [4]:
class Buffer:
    #Defines a replay buffer. record() stores experiences and learn() trains on a random batch of them.
    def __init__(self, buffer_capacity=100000, batch_size=256):
    
        # Number of "experiences" to store at max
        self.buffer_capacity = buffer_capacity
        # Num of tuples to train on.
        self.batch_size = batch_size

        #Number of times record method was called.
        self.buffer_counter = 0
    
        # Create empty buffer arrays for incoming observation tuples for
        #state, action, reward and next_state  
        self.state_buffer = np.zeros((self.buffer_capacity, num_states))
        self.action_buffer = np.zeros((self.buffer_capacity, num_actions))
        self.reward_buffer = np.zeros((self.buffer_capacity, 1))
        self.next_state_buffer = np.zeros((self.buffer_capacity, num_states))

    
    def record(self, obs_tuple):
        
        #Once buffer_capacity is exceeded, the index wraps around to zero
        #and the oldest records are overwritten.
        #Otherwise the index is the same as buffer_counter.
        index = self.buffer_counter % self.buffer_capacity

        # Given a (s,a,r,s') observation tuple input, 
        #sort the data into our 4 buffers.
        self.state_buffer[index] = obs_tuple[0]
        self.action_buffer[index] = obs_tuple[1]
        self.reward_buffer[index] = obs_tuple[2]
        self.next_state_buffer[index] = obs_tuple[3]
        
        #Increment the counter for number of records 
        self.buffer_counter += 1

    # Compute the loss and update parameters
    def learn(self):
        # Sample only from slots that have been filled: the smaller of the
        #current buffer counter and the buffer capacity
        record_range = min(self.buffer_counter, self.buffer_capacity)
        
        # Remove temporal correlations in the data by drawing batch_size random
        #indices (with replacement) from the filled part of the buffer.
        batch_indices = np.random.choice(record_range, self.batch_size)

        # Convert the sampled (s,a,r,s') batches to float32 tensors
        state_batch = tf.convert_to_tensor(self.state_buffer[batch_indices])
        state_batch = tf.cast(state_batch, dtype=tf.float32)
        action_batch = tf.convert_to_tensor(self.action_buffer[batch_indices])
        action_batch = tf.cast(action_batch, dtype=tf.float32)
        reward_batch = tf.convert_to_tensor(self.reward_buffer[batch_indices])
        reward_batch = tf.cast(reward_batch, dtype=tf.float32)
        next_state_batch = tf.convert_to_tensor(self.next_state_buffer[batch_indices])
        next_state_batch = tf.cast(next_state_batch, dtype=tf.float32)
        
        
        
        # Trains and updates the Actor & Critic networks. 
        with tf.GradientTape() as tape:
            #target_actor generates target_actions from the next state batch
            target_actions = target_actor(next_state_batch)
            
            #y is the moving expected-return target that the critic tries to match:
            #the reward batch plus gamma (the discount factor for future rewards) times
            #the target critic's value for the next states and the target actor's
            #actions in those states.
            y = reward_batch + gamma * target_critic([next_state_batch, target_actions])
            
            #Generate a critic value based on the shuffled state and action batch data. 
            critic_value = critic_model([state_batch, action_batch])
            #Generate the Critic loss, which is the mean squared error between the moving 
            #expected-return target y and the critic_value. 
            critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))

        #generate the critic gradient using the tape.gradient function, using the critic loss, and 
        #trainable variables in the critic model
        critic_grad = tape.gradient(critic_loss, critic_model.trainable_variables)
        #update the critic model by applying the critic gradient to the model's trainable variables
        #using the critic optimizer, which is defined later.
        critic_optimizer.apply_gradients(
            zip(critic_grad, critic_model.trainable_variables)
        )

        with tf.GradientTape() as tape:
            #Generate actions by inputting the state_batch information to the actor model
            actions = actor_model(state_batch)
            #Unlike before, generate a critic value from the state batch and the actor_model's own actions.
            critic_value = critic_model([state_batch, actions])
            
            #the actor loss is the negative of the mean of all of the values in the critic model's output
            #It is negative in this implementation so that the algorithm seeks to maximize this value. 
            actor_loss = -tf.math.reduce_mean(critic_value)

        #generate the actor gradient using the tape.gradient function, using the actor loss, and 
        #trainable variables in the actor model   
        actor_grad = tape.gradient(actor_loss, actor_model.trainable_variables)
        #update the actor model by applying the actor gradient to the model's trainable variables
        #using the actor optimizer, which is defined later.
        actor_optimizer.apply_gradients(
            zip(actor_grad, actor_model.trainable_variables)
        )
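
The wrap-around indexing in `record` and the sampling range in `learn` can be sketched with a tiny NumPy buffer. The capacity, batch size, and stored values below are made up for illustration, and the variable names are hypothetical stand-ins for the class attributes.

```python
import numpy as np

capacity, batch_size = 5, 3
rng = np.random.default_rng(0)
storage = np.zeros((capacity, 1))
counter = 0

# Record 7 items into a buffer that holds 5:
# items 5 and 6 overwrite the oldest entries in slots 0 and 1.
for item in range(7):
    index = counter % capacity
    storage[index] = item
    counter += 1

print(storage.ravel())

# Only sample from slots that have been written at least once.
record_range = min(counter, capacity)
# choice() samples with replacement by default, so an index may repeat.
batch_indices = rng.choice(record_range, batch_size)
batch = storage[batch_indices]
print(batch_indices, batch.shape)
```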

#update_target is called at every timestep, and slowly moves the target actor and 
#target critic toward the learned networks, using the update rate tau.
def update_target(tau):
    #start a fresh new_weights list
    new_weights = []
    #take the current target critic's weights
    target_variables = target_critic.weights
    #For each weight in the critic model, append to new_weights:
    #critic weight * tau + corresponding target_critic weight * (1 - tau)
    for i, variable in enumerate(critic_model.weights):
        new_weights.append(variable * tau + target_variables[i] * (1 - tau))

    #set the target critic's weights to the new weights    
    target_critic.set_weights(new_weights)
    
    
    #start a fresh new_weights list
    new_weights = []
    #take the current target actor's weights
    target_variables = target_actor.weights
    #For each weight in the actor model, append to new_weights:
    #actor weight * tau + corresponding target_actor weight * (1 - tau)
    for i, variable in enumerate(actor_model.weights):
        new_weights.append(variable * tau + target_variables[i] * (1 - tau))

    #set the target actor's weights to the new weights
    target_actor.set_weights(new_weights)
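
To make the effect of `tau` concrete, here is a minimal NumPy sketch of the same soft-update rule applied to plain arrays rather than Keras weights. The function name `soft_update` and the toy weight values are assumptions for illustration only.

```python
import numpy as np

def soft_update(target_weights, source_weights, tau):
    """Return tau * source + (1 - tau) * target for each pair of arrays."""
    return [
        tau * src + (1.0 - tau) * tgt
        for src, tgt in zip(source_weights, target_weights)
    ]

# Toy "networks": the source weights are all ones, the target starts at zero.
source = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

tau = 0.01
for step in range(100):
    target = soft_update(target, source, tau)

# After n updates toward a fixed source, the remaining gap is (1 - tau) ** n.
gap = 1.0 - target[0][0, 0]
print("gap after 100 updates:   ", gap)
print("expected (1 - tau) ** 100:", (1.0 - tau) ** 100)
```

With a fixed source, each update shrinks the gap between target and source by a factor of `1 - tau`, which is why a small `tau` makes the target networks trail the learned networks slowly.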


In [5]:
def get_actor():  #makes actor network
    # Initialize the final layer's weights between -3e-3 and 3e-3
    last_init = tf.random_uniform_initializer(minval=-0.003, maxval=0.003)
    #Input layer has 26 inputs, Two dense layers of size 512, with relu activation, 
    #with batch normalization layers in between, and a 
    #6-output final layer, with tanh activation. Initialize the kernel with the random uniform 
    #values in last_init. 
    inputs = layers.Input(shape=(num_states,))
    out = layers.Dense(512, activation="relu")(inputs)
    out = layers.BatchNormalization()(out)
    out = layers.Dense(512, activation="relu")(out)
    out = layers.BatchNormalization()(out)
    outputs = layers.Dense(num_actions, activation="tanh", kernel_initializer=last_init)(out)
    
    #Multiply the network outputs by env.action_space.high[0], which is 1.0
    outputs = outputs * upper_bound

    #generate and return the model based on the inputs and outputs layers
    model = tf.keras.Model(inputs, outputs)
    return model

def get_critic():
    # Critic inputs are states and actions. (Sizes of 26 and 6 for HalfCheetah)
    state_input = layers.Input(shape=(num_states,))
    action_input = layers.Input(shape=(num_actions,))
    #action_out = layers.Dense(32, activation="relu")(action_input)  #Changed from 32 to 512 in V5
    #action_out = layers.BatchNormalization()(action_out)#Changed from 32 to 512 in V5

    concat = layers.Concatenate()([state_input, action_input])
    
    #Two dense layers of size 512, with relu activation, 
    #with batch normalization layers in between, and a 
    # 1-output dense final layer,
    out = layers.Dense(512, activation="relu")(concat)
    out = layers.BatchNormalization()(out)
    out = layers.Dense(512, activation="relu")(out)
    out = layers.BatchNormalization()(out)
    outputs = layers.Dense(1)(out)

    # Outputs a single value for a given state-action pair
    model = tf.keras.Model([state_input, action_input], outputs)

    return model
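
The critic is described as producing a single value per state-action pair, which matters because `Buffer.learn` adds its output to a reward batch of shape `(batch_size, 1)`. The NumPy sketch below, with made-up numbers and hypothetical variable names, computes the target `y = r + gamma * Q'` and the mean squared error, and shows how a wider critic output would broadcast silently instead of raising an error.

```python
import numpy as np

gamma = 0.99
rewards = np.array([[1.0], [0.5], [-0.2]])    # shape (3, 1), like reward_batch
target_q = np.array([[2.0], [1.0], [0.0]])    # shape (3, 1): one value per sample
critic_q = np.array([[2.5], [1.5], [0.1]])    # current critic estimates

y = rewards + gamma * target_q                 # TD target, shape (3, 1)
critic_loss = np.mean(np.square(y - critic_q)) # mean squared TD error

print("y shape:", y.shape)
print("critic loss:", critic_loss)

# If the critic emitted several values per sample, NumPy (and TensorFlow)
# would broadcast the rewards across them without any error.
wide_q = np.zeros((3, 6))
y_wide = rewards + gamma * wide_q
print("y_wide shape:", y_wide.shape)
```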


In [6]:
#The policy function generates actions from the state, adds noise, 
#and returns only values within the upper and lower bounds.
def policy(state, noise_object):
    #Given the input state, use the actor model to generate actions
    sampled_actions = tf.squeeze(actor_model(state))
    #Draw the next noise sample. Note that a noise object built with a length-1 mean
    #returns one value, which NumPy broadcasts across all action dimensions.
    noise = noise_object()
    
    # Add the noise to the sampled actions
    sampled_actions = sampled_actions.numpy() + noise
    #sampled_actions = sampled_actions + noise
    
    #If any of the noisy action values are greater than upper_bound (1.0)
    #or smaller than lower_bound (-1.0), clip them to 1.0 or -1.0 respectively 
    legal_action = np.clip(sampled_actions, lower_bound, upper_bound)

    #Remove 1-D entries from the shape of the legal_action array
    return np.squeeze(legal_action)
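
A small NumPy sketch of the noise-and-clip step, using made-up action values. It also shows the difference between a noise vector of length 1, which is broadcast so that every action dimension receives the same perturbation, and one noise value per action dimension. All names and numbers here are hypothetical, and plain Gaussian noise stands in for the OU process.

```python
import numpy as np

rng = np.random.default_rng(0)
lower_bound, upper_bound = -1.0, 1.0

# Pretend actor output for a 6-dimensional action space.
raw_actions = np.array([0.95, -0.98, 0.10, 0.00, -0.50, 0.70])

# Noise of shape (1,) is broadcast: every dimension gets the same offset.
shared_noise = rng.normal(scale=0.2, size=1)
shared = np.clip(raw_actions + shared_noise, lower_bound, upper_bound)

# Noise of shape (6,) perturbs each dimension independently.
independent_noise = rng.normal(scale=0.2, size=raw_actions.shape)
independent = np.clip(raw_actions + independent_noise, lower_bound, upper_bound)

print("shared noise     :", shared)
print("independent noise:", independent)
```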


In [7]:
std_dev = 0.2
#Instantiate OU noise given the values below. 
ou_noise = OUActionNoise(mean=np.zeros(1), std_deviation=float(std_dev) * np.ones(1))

actor_model = get_actor()
critic_model = get_critic()

target_actor = get_actor()
target_critic = get_critic()

# Set the target actor and target critic models to be 
#the same as their non-target models. 
target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())

# Learning rate for actor-critic models
critic_lr = 0.001  #originally .002
actor_lr = 0.0001  #originally .001

#set optimizers as Adam, using the learning rates above
critic_optimizer = tf.keras.optimizers.Adam(critic_lr)
actor_optimizer = tf.keras.optimizers.Adam(actor_lr)

#Set the number of episodes to run.
total_episodes = 100
# Discount factor for future rewards
gamma = 0.99
# Used to update target networks
tau = 0.01

#Instantiate the buffer with a capacity of 50,000 and a batch size of 64, as opposed to the defaults of 100,000 and 256
buffer = Buffer(50000, 64)  


In [8]:
"""
actor_model.load_weights("chet_actor400v5.h5")
critic_model.load_weights("chet_critic400v5.h5")
#
target_actor.load_weights("chet_target_actor400v5.h5")
target_critic.load_weights("chet_target_critic400v5.h5")
# define an empty list
# open file and read the content in a list
with open('rewardlist400v5.txt', 'r') as filehandle:
    for line in filehandle:
        # remove linebreak which is the last character of the string
        currentPlace = line[:-1]

        # add item to the list
        ep_reward_list.append(currentPlace)

"""


In [None]:
# To store reward history of each episode
ep_reward_list = []

# To store average reward history of last few episodes
#avg_reward_list = []
env.render()

dateTimeObj = datetime.now()



print("Time is: ", dateTimeObj)

# Takes about 20 min to train
for ep in range(total_episodes):

    prev_state = env.reset()
    episodic_reward = 0

    for g in range (999):
    #while True:
        # Uncomment this to see the Actor in action
        # But not in a python notebook.
        # env.render()

        #Convert previous state to a tensor
        tf_prev_state = tf.expand_dims(tf.convert_to_tensor(prev_state), 0)

        #Choose an action: the actor's output for the previous state plus
        #OU noise, clipped to the legal range (see the policy function). 
        action = policy(tf_prev_state, ou_noise)
        
        # Receive state, reward, and done information by stepping the environment forward
        state, reward, done, info = env.step(action)

        #Add information to the buffer
        buffer.record((prev_state, action, reward, state))
        #Add timestep reward to the episodic reward
        episodic_reward += reward

        #Update the actor and critic networks
        buffer.learn()
        #Update the target actor and target critic networks, with the tau update variable.
        update_target(tau)

        
        # End this episode when `done` is True
        #if done:
           # break

        prev_state = state

    ep_reward_list.append(episodic_reward)

    # Mean of last 40 episodes
    #avg_reward = np.mean(ep_reward_list[-40:])
    print("Episode * {} *  Episodic Reward: {}".format(ep, episodic_reward ))
    #avg_reward_list.append(avg_reward)
    
dateTimeObj = datetime.now()
print("Time is: ", dateTimeObj)

# Plotting graph
# Episodes versus episodic rewards
plt.plot(ep_reward_list)
plt.xlabel("Episode")
plt.ylabel("Episodic Reward")
plt.show()

actor_model.save_weights("chet_actor100v8.h5")
critic_model.save_weights("chet_critic100v8.h5")

target_actor.save_weights("chet_target_actor100v8.h5")
target_critic.save_weights("chet_target_critic100v8.h5")

with open('rewardlist100v8.txt', 'w') as filehandle:
    for listitem in ep_reward_list:
        filehandle.write('%s\n' % listitem)
        
        
       

#  Record:  V1- nearly unmodified from the Keras example. Increased some networks from 16 or 32 to 512.
#V2- Simplified critic network, so that it is just two 512 layers
#V3- Same as V2, but reduced learning rates, so that there is hopefully less bouncing. 



Time is:  2020-12-27 16:32:58.291945
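
Raw episodic rewards tend to be noisy, and the commented-out `avg_reward` lines in the training cell hint at smoothing over the last 40 episodes. Below is a minimal sketch of such a trailing average, using synthetic rewards rather than results from this notebook. The function name `trailing_mean` and the window size in the example are assumptions.

```python
import numpy as np

def trailing_mean(values, window=40):
    """Mean of the last `window` entries at each position (fewer at the start)."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out[i] = values[start:i + 1].mean()
    return out

# Synthetic stand-in for ep_reward_list; not real training results.
fake_rewards = np.arange(10, dtype=float)
smoothed = trailing_mean(fake_rewards, window=4)
print(smoothed)
```

Plotting `trailing_mean(ep_reward_list)` alongside the raw curve is one way to make the learning trend easier to read.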


In [19]:
actor_model.load_weights("chet_actor100v8.h5")
critic_model.load_weights("chet_critic100v8.h5")

target_actor.load_weights("chet_target_actor100v8.h5")
target_critic.load_weights("chet_target_critic100v8.h5")
"""
with open('rewardlist200v8.txt', 'w') as filehandle:
    for listitem in ep_reward_list:
        filehandle.write('%s\n' % listitem)
"""      

In [20]:

target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())


In [None]:
#actor_model.load_weights("chet_actor140v5.h5")
#critic_model.load_weights("chet_critic140v5.h5")

#target_actor.load_weights("chet_target_actor140v5.h5")
#target_critic.load_weights("chet_target_critic140v5.h5")

env.reset()
img = plt.imshow(env.render(mode='rgb_array')) # only call this once
for _ in range(800):
    img.set_data(env.render(mode='rgb_array')) # just update the data
    display.display(plt.gcf())
    display.clear_output(wait=True)
    action = env.action_space.sample()
    #env.step(action)  #investigate 
    env.render()

In [23]:
#https://stackabuse.com/reading-and-writing-lists-to-a-file-in-python/
#To Load files:
places = ['Berlin', 'Cape Town', 'Sydney', 'Moscow']

with open('listfile.txt', 'w') as filehandle:
    for listitem in places:
        filehandle.write('%s\n' % listitem)

#Each list item is written to the output file followed by a linebreak "\n". To read the entire list from the file listfile.txt back into memory, this Python code shows how it works:

# define an empty list
places = []

# open file and read the content in a list
with open('listfile.txt', 'r') as filehandle:
    for line in filehandle:
        # remove linebreak which is the last character of the string
        currentPlace = line[:-1]

        # add item to the list
        places.append(currentPlace)
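
Note that the reading loop above yields strings. For a reward history, the values would need to be converted back to numbers before plotting. A minimal sketch, assuming one value per line; the file name and the helper names `save_rewards` and `load_rewards` are hypothetical, and a temporary directory is used so that no real file is touched.

```python
import os
import tempfile

def save_rewards(path, rewards):
    """Write one reward per line."""
    with open(path, "w") as filehandle:
        for value in rewards:
            filehandle.write("%s\n" % value)

def load_rewards(path):
    """Read rewards back as floats, skipping blank lines."""
    with open(path, "r") as filehandle:
        return [float(line) for line in filehandle if line.strip()]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "rewardlist_example.txt")
    save_rewards(path, [-12.5, 3.0, 41.25])
    restored = load_rewards(path)

print(restored)
```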
