
Try replacing Distributions with tf.Distributions #52

Closed
ryanjulian opened this issue Apr 15, 2018 · 8 comments
Labels: big (big multi-feature projects), refactor

Comments

@ryanjulian
Owner

No description provided.

@eric-heiden
Collaborator

To add some details: Distributions are used by policies and other modules to provide distribution functionality, such as computing the KL divergence between two distributions, given the parameters of each distribution (which are often the output tensors of a neural network). tf.Distributions most likely implements the same functionality, so we should try to replace our code with the TF counterpart.
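To make the overlap concrete, here is a minimal sketch (assuming TF 1.x and its tf.distributions module; shapes and names are illustrative, not our actual code) of computing a diagonal-Gaussian KL divergence directly from parameter tensors, which is the kind of operation our Distributions provide today:

```python
import numpy as np
import tensorflow as tf

# Parameters of two diagonal Gaussians. In rllab these would be the
# output tensors of a policy network (a mean and std per action dim).
old_mean = tf.placeholder(tf.float32, [None, 2])
old_std = tf.placeholder(tf.float32, [None, 2])
new_mean = tf.placeholder(tf.float32, [None, 2])
new_std = tf.placeholder(tf.float32, [None, 2])

old_dist = tf.distributions.Normal(loc=old_mean, scale=old_std)
new_dist = tf.distributions.Normal(loc=new_mean, scale=new_std)

# Per-dimension KL, summed over action dimensions (diagonal Gaussian).
kl = tf.reduce_sum(tf.distributions.kl_divergence(old_dist, new_dist), axis=-1)

with tf.Session() as sess:
    print(sess.run(kl, feed_dict={
        old_mean: np.zeros((3, 2), np.float32), old_std: np.ones((3, 2), np.float32),
        new_mean: 0.1 * np.ones((3, 2), np.float32), new_std: np.ones((3, 2), np.float32),
    }))
```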

@ryanjulian
Owner Author

If possible, remove rllab Distributions entirely.

@ryanjulian added the big label May 23, 2018
@ryanjulian removed this from the Week of May 21 milestone May 29, 2018
@silrep2
Collaborator

silrep2 commented May 30, 2018

@ryanjulian @eric-heiden

  1. Is there a log_likelihood_sym equivalent in the TensorFlow distributions?
  2. The distribution is created by the policy and then used by BatchPolopt, which passes its parameters as a dictionary. But a TensorFlow distribution requires parameter placeholders at construction time (inside the policy), not at call time (inside BatchPolopt), and we need an actual distribution instance (not just a dictionary) to call the KL function. Should I create another distribution based on the policy.dist type? Is there a good way to achieve that while avoiding a big change?
  3. The KL calculation is based on an empty placeholder. Where can I find the source of the data that feeds it?

@ryanjulian
Owner Author

ryanjulian commented May 30, 2018

  1. I think it should be tf.Distribution.log_prob. Be sure to test an example and compare outputs to verify (see the comparison sketch after this list).

  2. Fixing the placeholders will probably require a larger refactor, but that's okay; the way the current code is organized for this case is pretty bad. Two patterns immediately come to mind: (1) as you mentioned, create the placeholders outside of Policy and pass them to the Policy constructor, or (2) have Policy create its own placeholders in its constructor and provide an API for BatchPolopt to access them. I prefer (2).

  3. KL divergence is an operation defined on two distributions. In BatchPolopt it is used by the optimizer to calculate the distance between the old policy distribution and an updated policy distribution, to make sure the updated policy does not diverge too much from the old one.

    The short answer to your question is that old_dist_info_vars is part of the data generated by the sampler while the agent interacts with the environment. It is a list of distributions over actions: one distribution $Pr(a_i)$ for each action $a_i$ taken. dist_info_vars is calculated on the fly by the optimizer, by taking the sequence of states $[s_1, s_2, ..., s_i]$ and passing it to an updated policy $\pi_{new}(a_i|s_i)$. More info on why this is below.
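Before that, here is a minimal sketch of the comparison suggested in (1), assuming TF 1.x and a diagonal Gaussian. The closed-form expression below is the log-likelihood that a diagonal-Gaussian log_likelihood_sym computes from mean/log_std; the shapes are illustrative:

```python
import numpy as np
import tensorflow as tf

actions = tf.placeholder(tf.float32, [None, 2])
mean = tf.placeholder(tf.float32, [None, 2])
log_std = tf.placeholder(tf.float32, [None, 2])

# tf.distributions: per-dimension log density, summed over the action dimension.
dist = tf.distributions.Normal(loc=mean, scale=tf.exp(log_std))
tf_ll = tf.reduce_sum(dist.log_prob(actions), axis=-1)

# Closed-form diagonal-Gaussian log-likelihood (what log_likelihood_sym
# computes from the mean/log_std parameters); 2 is the action dimension.
z = (actions - mean) / tf.exp(log_std)
manual_ll = (-tf.reduce_sum(log_std, axis=-1)
             - 0.5 * tf.reduce_sum(tf.square(z), axis=-1)
             - 0.5 * 2 * np.log(2 * np.pi))

with tf.Session() as sess:
    feed = {actions: np.random.randn(4, 2).astype(np.float32),
            mean: np.zeros((4, 2), np.float32),
            log_std: np.zeros((4, 2), np.float32)}
    tf_out, manual_out = sess.run([tf_ll, manual_ll], feed)
    print(np.allclose(tf_out, manual_out))  # expect True
```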

Sampled KL-divergence

It is intractable to evaluate the KL divergence directly on our policies, so we estimate it from samples. This leads to the complicated structure in the KL-related parts of BatchPolopt.

The policy $\pi(a_i|s_i) = Pr(a_i|s_i)$ is a distribution which gives the probability of taking an action $a_i$ while in state $s_i$. While rllab collects rollouts, at every step it passes the current state $s_i$ to the policy $\pi(a_i|s_i)$. This yields a particular unconditioned distribution $Pr(a_i)$ for that time step. In the case of BatchPolopt, we just choose an action $a_i$ at random with underlying probability $Pr(a_i)$ and execute it in the environment, but we also store that distribution $Pr(a_i)$ for later. The sequence of distributions $[Pr(a_1), Pr(a_2), ..., Pr(a_i)]$ is denoted old_dist_info_vars in BatchPolopt.

Of course, because of how TensorFlow works, old_dist_info_vars is a placeholder, and rather than representing distributions directly, it represents their underlying parameters (vars), which are then fed to TensorFlow during the optimization.

Later in BatchPolopt, during the optimization, we would like to calculate the KL divergence between the old policy $\pi_{old}(a_i|s_i)$ and a hypothetical new policy $\pi_{new}(a_i|s_i)$ created by a parameter update. We already have a bunch of sampled distributions from the old policy, old_dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{old}$. To generate samples from the new policy, we just have to feed the old states $[s_1, s_2, ..., s_i]$ to the policy $\pi_{new}(a_i|s_i)$. This yields dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{new}$.

Finally, equipped with old_dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{old}$ and dist_info_vars $= [Pr(a_1), Pr(a_2), ..., Pr(a_i)]_{new}$, we can calculate a sampled KL divergence. It is tractable to calculate the KL divergence between the sampled distributions, $KL[Pr_{old}(a_i) \| Pr_{new}(a_i)]$. So we calculate this quantity for each pair of distributions, and then take their average to estimate the overall KL divergence.
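As a rough sketch of how this sampled KL could look with tf.distributions (the layer and variable names here are made up for illustration, not the actual BatchPolopt code): old_dist_info_vars arrives through placeholders, dist_info_vars comes from re-running the sampled states through the updated policy, and the per-step KLs are averaged:

```python
import tensorflow as tf

# States collected by the sampler, plus the distribution parameters the old
# policy produced at those states (old_dist_info_vars), fed as placeholders.
states = tf.placeholder(tf.float32, [None, 4], name='states')
old_mean = tf.placeholder(tf.float32, [None, 2], name='old_mean')
old_log_std = tf.placeholder(tf.float32, [None, 2], name='old_log_std')

# dist_info_vars: computed on the fly by running the same states through a
# (hypothetical, tiny) updated policy network.
new_mean = tf.layers.dense(states, 2, name='policy_mean')
new_log_std = tf.get_variable('policy_log_std', shape=[2],
                              initializer=tf.zeros_initializer())

old_dist = tf.distributions.Normal(loc=old_mean, scale=tf.exp(old_log_std))
new_dist = tf.distributions.Normal(loc=new_mean, scale=tf.exp(new_log_std))

# KL[Pr_old(a_i) || Pr_new(a_i)] per time step (summed over action dims),
# then averaged over the batch: the sampled KL divergence.
per_step_kl = tf.reduce_sum(tf.distributions.kl_divergence(old_dist, new_dist), axis=-1)
mean_kl = tf.reduce_mean(per_step_kl)
```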

@silrep2
Collaborator

silrep2 commented Jun 5, 2018

Hi @ryanjulian

I wrote some test code to verify my idea, but I found some problems.

  1. As you recommended, I created two distributions inside the policy and exposed their placeholders.
    Then I tried to replace old_dist_info_vars and dist_info_vars with my new distributions.
    It seems that the placeholder for dist_info_vars was created by policy.dist_info_sym, so I assigned dist_info_vars["mean"] (the original mean placeholder) to my new placeholder. But I get: You must feed a value for placeholder tensor 'policy/norm.mean'.
    I thought that when dist_info_vars["mean"] is fed, self.policy._mean would be fed as well. What is the problem?
  2. I have tested the tf.Distribution.log_prob function: it matches log_likelihood_sym apart from an extra reduce_sum. But if I uncomment the code to use tf_dist.log_prob, I get the following ValueError. Do you have any hint about that?
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients, between variables ["<tf.Variable 'policy/mean_network/hidden_0/W:0' shape=(4, 32) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/hidden_0/b:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/hidden_1/W:0' shape=(32, 32) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/hidden_1/b:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/output/W:0' shape=(32, 1) dtype=float32_ref>", "<tf.Variable 'policy/mean_network/output/b:0' shape=(1,) dtype=float32_ref>", "<tf.Variable 'policy/output_std_param/param:0' shape=(1,) dtype=float32_ref>"] and loss Tensor("Neg:0", shape=(), dtype=float32).

@silrep2
Collaborator

silrep2 commented Jun 6, 2018

@ryanjulian
Please ignore my last comment; I already solved it.
My current solution is to create two distributions in the policy: for the old distribution, create placeholders for its parameters; for the new distribution, use the existing placeholder, which comes from a layer output. All changes are here.

It is runnable, but still test code. I also keep the legacy DiagonalGaussian because the sampler uses the distribution to calculate entropy, so we may need a copy of the sampler in the TensorFlow sandbox. If you think this solution is okay, I will replace the remaining uses.
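For reference, here is a minimal sketch of that pattern as I understand it. It is purely illustrative (the class and attribute names are made up, and this is not the code in the linked changes): the policy owns a "new" distribution built from its layer outputs and an "old" distribution built from placeholders, and exposes a symbolic KL between them.

```python
import tensorflow as tf

class ToyGaussianPolicy(object):
    """Illustration only: a policy owning both distributions (names made up)."""

    def __init__(self, obs_dim, action_dim):
        self.obs = tf.placeholder(tf.float32, [None, obs_dim], name='obs')

        # "New" distribution: parameters come straight from layer outputs.
        hidden = tf.layers.dense(self.obs, 32, activation=tf.nn.tanh, name='hidden')
        self._mean = tf.layers.dense(hidden, action_dim, name='mean')
        self._log_std = tf.get_variable('log_std', [action_dim],
                                        initializer=tf.zeros_initializer())
        self.new_dist = tf.distributions.Normal(loc=self._mean,
                                                scale=tf.exp(self._log_std))

        # "Old" distribution: parameters are placeholders, fed by BatchPolopt
        # from the sampled data (old_dist_info_vars).
        self.old_mean_ph = tf.placeholder(
            tf.float32, [None, action_dim], name='old_mean')
        self.old_log_std_ph = tf.placeholder(
            tf.float32, [None, action_dim], name='old_log_std')
        self.old_dist = tf.distributions.Normal(loc=self.old_mean_ph,
                                                scale=tf.exp(self.old_log_std_ph))

    def kl_sym(self):
        # Sampled KL between the old (placeholder-fed) and new (network-fed)
        # distributions, averaged over the batch.
        kl = tf.distributions.kl_divergence(self.old_dist, self.new_dist)
        return tf.reduce_mean(tf.reduce_sum(kl, axis=-1))
```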

@ryanjulian
Owner Author

See rlworkgroup/garage#15

@ryanjulian
Owner Author

@silrep2 can you show me what you mean with a pull request?
