
Try replacing Distributions with tf.Distributions #52

Closed
ryanjulian opened this issue Apr 15, 2018 · 8 comments
Labels: big (big multi-feature projects), refactor

Comments

@ryanjulian
Owner

No description provided.

@eric-heiden
Collaborator

To add some details: Distributions are used by policies and other modules to provide distribution functionality, such as computing the KL divergence between two distributions, given the parameters of each distribution (which are often the output tensors of a neural network). tf.Distributions most likely implements the same functionality, so we should try to replace our code with the TF counterpart.
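To make the overlap concrete, here is a minimal sketch (assuming TF 1.x and its tf.distributions module; shapes and names are illustrative, not our actual code) of computing a diagonal-Gaussian KL divergence directly from parameter tensors, which is the kind of operation our Distributions provide today:

```python
import numpy as np
import tensorflow as tf

# Parameters of two diagonal Gaussians. In rllab these would be the
# output tensors of a policy network (a mean and std per action dim).
old_mean = tf.placeholder(tf.float32, [None, 2])
old_std = tf.placeholder(tf.float32, [None, 2])
new_mean = tf.placeholder(tf.float32, [None, 2])
new_std = tf.placeholder(tf.float32, [None, 2])

old_dist = tf.distributions.Normal(loc=old_mean, scale=old_std)
new_dist = tf.distributions.Normal(loc=new_mean, scale=new_std)

# Per-dimension KL, summed over action dimensions (diagonal Gaussian).
kl = tf.reduce_sum(tf.distributions.kl_divergence(old_dist, new_dist), axis=-1)

with tf.Session() as sess:
    print(sess.run(kl, feed_dict={
        old_mean: np.zeros((3, 2), np.float32), old_std: np.ones((3, 2), np.float32),
        new_mean: 0.1 * np.ones((3, 2), np.float32), new_std: np.ones((3, 2), np.float32),
    }))
```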

@ryanjulian
Owner Author

If possible, remove rllab Distributions entirely.

@ryanjulian added the big label May 23, 2018
@ryanjulian removed this from the Week of May 21 milestone May 29, 2018
@silrep2
Collaborator

silrep2 commented May 30, 2018

@ryanjulian @eric-heiden

  1. Is there a log_likelihood_sym equivalent in the TensorFlow distributions?
  2. The distribution is created by the policy and then used by BatchPolopt, which passes its parameters as a dictionary. But a TensorFlow distribution requires parameter placeholders at construction time (inside the policy), not at call time (inside BatchPolopt), and we need an actual distribution instance (not just a dictionary) to call the KL function. Should I create another distribution based on the policy.dist type? Is there a good way to achieve that while avoiding a big change?
  3. The KL calculation is based on an empty placeholder. Where can I find the source of the data that feeds it?

@ryanjulian
Owner Author

ryanjulian commented May 30, 2018

  1. I think it should be tf.Distribution.log_prob. Be sure to test an example and compare outputs to verify (see the comparison sketch after this list).

  2. Fixing the placeholders will probably require a larger refactor, but that's okay; the way the current code is organized for this case is pretty bad. Two patterns immediately come to mind: (1) as you mentioned, create the placeholders outside of Policy and pass them to the Policy constructor, or (2) have Policy create its own placeholders in its constructor and provide an API for BatchPolopt to access them. I prefer (2).

  3. KL divergence is an operation defined on two distributions. In BatchPolopt it is used by the optimizer to calculate the distance between the old policy distribution and an updated policy distribution, to make sure the updated policy does not diverge too much from the old one.

    The short answer to your question is that old_dist_info_vars is part of the data generated by the sampler while the agent interacts with the environment. It is a list of distributions over actions: one distribution $Pr(a_i)$ for each action $a_i$ taken. dist_info_vars is calculated on the fly by the optimizer, by taking the sequence of states $[s_1, s_2, ..., s_i]$ and passing it to an updated policy $\pi_{new}(a_i|s_i)$. More info on why this is below.
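Before that, here is a minimal sketch of the comparison suggested in (1), assuming TF 1.x and a diagonal Gaussian. The closed-form expression below is the log-likelihood that a diagonal-Gaussian log_likelihood_sym computes from mean/log_std; the shapes are illustrative:

```python
import numpy as np
import tensorflow as tf

actions = tf.placeholder(tf.float32, [None, 2])
mean = tf.placeholder(tf.float32, [None, 2])
log_std = tf.placeholder(tf.float32, [None, 2])

# tf.distributions: per-dimension log density, summed over the action dimension.
dist = tf.distributions.Normal(loc=mean, scale=tf.exp(log_std))
tf_ll = tf.reduce_sum(dist.log_prob(actions), axis=-1)

# Closed-form diagonal-Gaussian log-likelihood (what log_likelihood_sym
# computes from the mean/log_std parameters); 2 is the action dimension.
z = (actions - mean) / tf.exp(log_std)
manual_ll = (-tf.reduce_sum(log_std, axis=-1)
             - 0.5 * tf.reduce_sum(tf.square(z), axis=-1)
             - 0.5 * 2 * np.log(2 * np.pi))

with tf.Session() as sess:
    feed = {actions: np.random.randn(4, 2).astype(np.float32),
            mean: np.zeros((4, 2), np.float32),
            log_std: np.zeros((4, 2), np.float32)}
    tf_out, manual_out = sess.run([tf_ll, manual_ll], feed)
    print(np.allclose(tf_out, manual_out))  # expect True
```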

Sampled KL-divergence

It is intractable to evaluate the KL divergence directly on our policies, so we estimate it from samples. This leads to the complicated structure in the KL-related parts of BatchPolopt.

The policy $\pi(a_i|s_i) = Pr(a_i|s_i)$ is a distribution which gives the probability of taking an action $a_i$ while in state $s_i$. While rllab collects rollouts, at every step it passes the current state $s_i$ to the policy $\pi(a_i|s_i)$. This yields a particular unconditioned distribution $Pr(a_i)$ for that time step. In the case of BatchPolopt, we just choose an action $a_i$ at random with underlying probability $Pr(a_i)$ and execute it in the environment, but we also store that distribution $Pr(a_i)$ for later. The sequence of distributions $[Pr(a_1), Pr(a_2), ..., Pr(a_i)]$ is denoted old_dist_info_vars in BatchPolopt.

Of course, because of how TensorFlow works, old_dist_info_vars is a placeholder, and rather than representing distributions directly, it represents their underlying parameters (vars), which are then fed to TensorFlow during the optimization.

Later in BatchPolopt, during the optimization, we would like to calculate the KL divergence between the old policy $\pi_{old}(a_i|s_i)$ and a hypothetical new policy $\pi_{new}(a_i|s_i)$ created by a parameter update. We already have a bunch of sampled distributions from the old policy, old_dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{old}$. To generate samples from the new policy, we just have to feed the old states $[s_1, s_2, ..., s_i]$ to the policy $\pi_{new}(a_i|s_i)$. This yields dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{new}$.

Finally, equipped with old_dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{old}$ and dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{new}$, we can calculate a sampled KL divergence. It is tractable to calculate the KL divergence between the sampled distributions, $KL[Pr_{old}(a_i) \| Pr_{new}(a_i)]$. So we calculate this quantity for each pair of distributions, and then take their average to estimate the overall KL divergence.
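As a rough sketch of how this sampled KL could look with tf.distributions (the layer and variable names here are made up for illustration, not the actual BatchPolopt code): old_dist_info_vars arrives through placeholders, dist_info_vars comes from re-running the sampled states through the updated policy, and the per-step KLs are averaged:

```python
import tensorflow as tf

# States collected by the sampler, plus the distribution parameters the old
# policy produced at those states (old_dist_info_vars), fed as placeholders.
states = tf.placeholder(tf.float32, [None, 4], name='states')
old_mean = tf.placeholder(tf.float32, [None, 2], name='old_mean')
old_log_std = tf.placeholder(tf.float32, [None, 2], name='old_log_std')

# dist_info_vars: computed on the fly by running the same states through a
# (hypothetical, tiny) updated policy network.
new_mean = tf.layers.dense(states, 2, name='policy_mean')
new_log_std = tf.get_variable('policy_log_std', shape=[2],
                              initializer=tf.zeros_initializer())

old_dist = tf.distributions.Normal(loc=old_mean, scale=tf.exp(old_log_std))
new_dist = tf.distributions.Normal(loc=new_mean, scale=tf.exp(new_log_std))

# KL[Pr_old(a_i) || Pr_new(a_i)] per time step (summed over action dims),
# then averaged over the batch: the sampled KL divergence.
per_step_kl = tf.reduce_sum(tf.distributions.kl_divergence(old_dist, new_dist), axis=-1)
mean_kl = tf.reduce_mean(per_step_kl)
```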

@silrep2
Collaborator

silrep2 commented Jun 5, 2018

Hi @ryanjulian

I wrote some test code to verify my idea, but I found some problems.

  1. As you recommended, I created two distributions inside the policy and exposed their placeholders.
    Then I tried to replace old_dist_info_vars and dist_info_vars with my new distributions.
    It seems that the placeholder for dist_info_vars was created by policy.dist_info_sym, so I assigned dist_info_vars["mean"] (the original mean placeholder) to my new placeholder. But I get: You must feed a value for placeholder tensor 'policy/norm.mean'.
    I thought that when dist_info_vars["mean"] is fed, self.policy._mean would be fed as well. What is the problem?
  2. I have tested the tf.Distribution.log_prob function: it matches log_likelihood_sym apart from an extra reduce_sum. But if I uncomment the code to use tf_dist.log_prob, I get the following ValueError. Do you have any hint about that?
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients, between variables ["<tf.Variable 'policy/mean_network/hidden_0/W:0' shape=(4, 32) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/hidden_0/b:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/hidden_1/W:0' shape=(32, 32) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/hidden_1/b:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/output/W:0' shape=(32, 1) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/output/b:0' shape=(1,) dtype=float32_ref>", "<tf.Variable 'policy/output_std_param/param:0' shape=(1,) dtype=float32_ref>"] and loss Tensor("Neg:0", shape=(), dtype=float32).

@silrep2
Collaborator

silrep2 commented Jun 6, 2018

@ryanjulian
Please ignore my last comment; I already solved it.
My current solution is to create two distributions in the policy: for the old distribution, create placeholders for its parameters; for the new distribution, use the existing placeholder, which comes from a layer output. All changes are here.

It is runnable, but still test code. I also keep the legacy DiagonalGaussian because the sampler uses the distribution to calculate entropy, so we may need a copy of the sampler in the TensorFlow sandbox. If you think this solution is okay, I will replace the remaining uses.
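For reference, here is a minimal sketch of that pattern as I understand it. It is purely illustrative (the class and attribute names are made up, and this is not the code in the linked changes): the policy owns a "new" distribution built from its layer outputs and an "old" distribution built from placeholders, and exposes a symbolic KL between them.

```python
import tensorflow as tf

class ToyGaussianPolicy(object):
    """Illustration only: a policy owning both distributions (names made up)."""

    def __init__(self, obs_dim, action_dim):
        self.obs = tf.placeholder(tf.float32, [None, obs_dim], name='obs')

        # "New" distribution: parameters come straight from layer outputs.
        hidden = tf.layers.dense(self.obs, 32, activation=tf.nn.tanh, name='hidden')
        self._mean = tf.layers.dense(hidden, action_dim, name='mean')
        self._log_std = tf.get_variable('log_std', [action_dim],
                                        initializer=tf.zeros_initializer())
        self.new_dist = tf.distributions.Normal(loc=self._mean,
                                                scale=tf.exp(self._log_std))

        # "Old" distribution: parameters are placeholders, fed by BatchPolopt
        # from the sampled data (old_dist_info_vars).
        self.old_mean_ph = tf.placeholder(
            tf.float32, [None, action_dim], name='old_mean')
        self.old_log_std_ph = tf.placeholder(
            tf.float32, [None, action_dim], name='old_log_std')
        self.old_dist = tf.distributions.Normal(loc=self.old_mean_ph,
                                                scale=tf.exp(self.old_log_std_ph))

    def kl_sym(self):
        # Sampled KL between the old (placeholder-fed) and new (network-fed)
        # distributions, averaged over the batch.
        kl = tf.distributions.kl_divergence(self.old_dist, self.new_dist)
        return tf.reduce_mean(tf.reduce_sum(kl, axis=-1))
```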

@ryanjulian
Owner Author

See rlworkgroup/garage#15

@ryanjulian
Owner Author

@silrep2 can you show me what you mean with a pull request?
