
Several gradient steps on recorded trajectory #13

Open
Martinez-Piazuelo opened this issue Mar 21, 2019 · 0 comments

Martinez-Piazuelo commented Mar 21, 2019

Hi. Please correct me if I am missing something, but I think that there is a mistake in this line.

Notice that "train_op" calls "apply_gradients", which updates the parameters of the policy network using the log probabilities that the policy outputs when the recorded batch of states is fed. The problem is that the parameters are updated for every time step of the recorded trajectory ("train_op" is called within a temporal for loop), and, due to the fact that the log probabilities are being computed on the fly during the training loop, these probabilities are no longer the ones that were followed during the recorded trajectory. This produces a distribution miss match that violates the on-policy nature of the Reinforce algorithm, and, I believe, may lead to incorrect gradients. I think that the proper way to do the update is to accumulate the gradients of the log probabilities of the entire batch and then perform all the updates.

The same issue applies to the "all_rewards" buffer that is used to normalize the returns: as the policy changes, the mean and standard deviation of the returns change as well, so a normalization based on statistics accumulated under older policies is no longer valid (at least in a strict sense).
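Again, just a sketch of the alternative I have in mind (plain NumPy, the function name is mine and not from the repo): normalize with the statistics of the current episode only, instead of a buffer that mixes returns generated by older policies:

```python
import numpy as np

def normalized_returns(rewards, gamma=0.99):
    """Discounted returns normalized with the mean/std of the CURRENT
    episode only, so the statistics are not contaminated by returns
    generated under earlier versions of the policy."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return (returns - returns.mean()) / (returns.std() + 1e-8)

# Example with a dummy episode of rewards.
print(normalized_returns([1.0, 0.0, 1.0, 1.0]))
```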

As stated before, please correct me if I am wrong. I was simply looking for a REINFORCE implementation in TensorFlow and came across this issue.
