
Question #2

Open

jpeg729 opened this issue Apr 2, 2018 · 4 comments

Comments

jpeg729 commented Apr 2, 2018

According to my understanding, the controller C at time t has two inputs: zt and ht, where ht is the prediction produced by M from zt-1, ht-1 and at-1.

So basically, the controller makes its decisions based on...

  1. zt: the current state of the world,
  2. M(zt-1, at-1, ht-1): i.e. what it thought the current state of the world would be given the previous state of the world + its previous chosen action + the previous hidden state of M.

It seems counter-intuitive to me that the current state of the world, together with the expectation of the current state of the world, should be a sufficient basis for a strong controller. It doesn't seem to make full use of the predictor's capabilities.

Is this correct?

hardmaru commented Apr 3, 2018

Hi @jpeg729,

That is a great question!

You are correct: the controller's calculation of at is based on zt and ht, and it doesn't use ht+1, since ht+1 needs at to be calculated first, so it's kind of a chicken-and-egg problem.

In that sense, one can view ht as a compressed representation of all of the zi and ai for i ∈ {0 ... t-1}. Thus, in addition to the current observation zt, the controller's decision will be based on this compressed representation of the entire history up to the point at which it has to make a decision at.

This is something I thought about when constructing the setup and the algorithm, since ideally we would want to use the current h. For the experiments I tried, this seemed to be good enough, though the fact that a more complicated controller (with an extra hidden layer) in the Car Racing setup improves the results quite a bit suggests that there is more we can do here.
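For reference, the current ordering looks roughly like this, in the same pseudocode style as the rollout function further below (env, rnn, vae are the same hypothetical globals; this is illustrative, not the exact implementation):

# current setup: h is rolled forward only after the action is chosen
def rollout_current(controller):
  ''' env, rnn, vae are global variables '''
  obs = env.reset()
  h = rnn.initial_state()
  done = False
  cumulative_reward = 0
  while not done:
    z = vae.encode(obs)
    a = controller.action([z, h])   # decision uses z_t and the previous roll-forward h_t
    obs, reward, done = env.step(a)
    cumulative_reward += reward
    h = rnn.forward([a, z, h])      # h_{t+1} only exists after a_t is chosen
  return cumulative_reward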

One thing I have thought about trying, but haven't gotten around to, is to have the controller calculate a temporary ā = controller.action([z, h]), roll forward using this temporary ā to arrive at a temporary h̄ = rnn.forward([ā, z, h]), and see if we can get a policy with this rolled-forward temporary hidden state. It doesn't look as elegant as the current approach though. Other methods of rolling forward to do planning might also help, at the expense of complexity and elegance.
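A rough sketch of that idea in the same pseudocode style (ā and h̄ become a_temp and h_temp; this is just an illustration, not something that has been implemented):

# hypothetical "temporary roll-forward" variant
def rollout_rollforward(controller):
  ''' env, rnn, vae are global variables '''
  obs = env.reset()
  h = rnn.initial_state()
  done = False
  cumulative_reward = 0
  while not done:
    z = vae.encode(obs)
    a_temp = controller.action([z, h])    # temporary action ā from the old h
    h_temp = rnn.forward([a_temp, z, h])  # temporary rolled-forward state h̄
    a = controller.action([z, h_temp])    # final action uses the rolled-forward state
    obs, reward, done = env.step(a)
    cumulative_reward += reward
    h = rnn.forward([a, z, h])            # commit the real transition with the chosen action
  return cumulative_reward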

Alternatively, we can modify the RNN's roll-forward operation to only depend on z, and not a, but have the MDN layer's prediction be based on a instead, so you can roll forward h and use it in the same time step to make the prediction. I feel this might be the best option if we really want to use the forward hidden state. While this will allow the use of a more current h, the MDN (currently just a linear layer) will need to have more capacity to compensate for the extra processing needed there.

# modification to use the forward h:
def rollout(controller):
  ''' env, rnn, vae are global variables '''
  obs = env.reset()
  h = rnn.initial_state()
  done = False
  cumulative_reward = 0
  while not done:
    z = vae.encode(obs)
    h = rnn.forward([z, h])
    # (Note: MDN-layer modified to rely on a and h, not just h)
    a = controller.action([z, h])
    obs, reward, done = env.step(a)
    cumulative_reward += reward
  return cumulative_reward

If you come up with a more elegant way to calculate the forward state, feel free to share!

Best.

jpeg729 commented Apr 4, 2018

I must revise my opinion of the usefulness of the RNN+MDN to the controller. If we conceptually separate the RNN and the MDN, then we can simplistically consider zt to be an encoding of object positions, and RNN(zt, ht-1, at-1) to be an encoding of object velocities and accelerations.

One remark: if the RNN has no knowledge of previous actions, then it will be confused by any changes that result directly from the player's actions. It would seem more logical to do RNN(zt, at-1, ht-1), since that would allow the RNN to more accurately calculate the velocity and acceleration of the player's glyph.
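A rough sketch of what I mean, in the same pseudocode style as above (zero_action is a hypothetical default action for t = 0):

# hypothetical variant: roll M forward with z_t and the previous action a_{t-1}
def rollout_prev_action(controller):
  ''' env, rnn, vae are global variables '''
  obs = env.reset()
  h = rnn.initial_state()
  a_prev = zero_action()  # hypothetical default action before the first step
  done = False
  cumulative_reward = 0
  while not done:
    z = vae.encode(obs)
    h = rnn.forward([z, a_prev, h])  # a_{t-1} is already known, so h can reflect the player's own influence
    a = controller.action([z, h])
    obs, reward, done = env.step(a)
    cumulative_reward += reward
    a_prev = a
  return cumulative_reward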

I have been digging into the demo source code to verify certain details, and I noticed this...

  • In both demos get_action doesn't receive the true zt as input, only the expected_zt. Is this because simulating the entire game in the browser is too hard? (My old laptop won't play the demos at all.)
  • In VizDoom get_action also receives the LSTM cell state. Presumably it isn't needed for the Car Racing demo.

In an adversarial setting, it may make sense to provide the controller with both zt and expected_zt. This sort of information could be valuable, since it allows the controller to measure the unexpectedness of the opponent's actions: "I thought he was going to do this, but he actually did that".
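As a rough sketch of that idea (rnn.predict_z is a hypothetical helper returning M's expected next z):

# hypothetical variant: controller sees both the observed z and the z that M expected
def rollout_with_surprise(controller):
  ''' env, rnn, vae are global variables '''
  obs = env.reset()
  h = rnn.initial_state()
  z_expected = vae.encode(obs)  # no prediction available yet at t = 0
  done = False
  cumulative_reward = 0
  while not done:
    z = vae.encode(obs)
    a = controller.action([z, z_expected, h])  # the gap between z and z_expected measures surprise
    obs, reward, done = env.step(a)
    cumulative_reward += reward
    h = rnn.forward([a, z, h])
    z_expected = rnn.predict_z(h)  # hypothetical helper: M's expected z for the next step
  return cumulative_reward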

hardmaru commented Apr 4, 2018

Hi @jpeg729

Thanks for your comments and reply.

I just want to clarify one thing with you. The controller receives the sampled zt, rather than the expected zt.

The sampling is achieved in two parts. For example, in the DoomRNN code:

(1) sample the individual z's inside each of the 5 mixtures, in line 622:

var zs = math.add(mu, math.multiply(std, epsilon)) // 5 possible z's

(doing this inside GPU ops instead was faster, which is why the sampling in line 682 was commented out)

(2) sample which mixture we should choose, in lines 673-679:

idx = sample_softmax(normalize(sub_p));
if (idx < 0) { // sampling error (due to bad precision in gpu mode)
  idx = randi(0, num_mixture);
}

k = num_mixture*i+idx;
next_z[i] = zs[k]; // + std[k] * epsilon[i]; (no need to sample here, already done inside the deeplearn.js op)
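Paraphrased in Python/NumPy, the same two-step sampling looks roughly like this (a sketch, not the actual deeplearn.js code; logpi, mu and std are assumed to have shape (z_dim, num_mixture)):

import numpy as np

def sample_next_z(logpi, mu, std):
  ''' logpi, mu, std: arrays of shape (z_dim, num_mixture) '''
  z_dim, num_mixture = mu.shape
  # (1) sample a candidate z inside every mixture component
  epsilon = np.random.randn(z_dim, num_mixture)
  zs = mu + std * epsilon
  # (2) for each dimension of z, sample which mixture to use from its softmax weights
  next_z = np.zeros(z_dim)
  for i in range(z_dim):
    p = np.exp(logpi[i] - np.max(logpi[i]))
    p /= p.sum()
    idx = np.random.choice(num_mixture, p=p)
    next_z[i] = zs[i, idx]
  return next_z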

I originally intended to sample the idx (which mixture we use) inside GPU mode as well for efficiency, but at the time of development there was a weird bug with deeplearn.js where it would only work in Chrome but not in iOS/Safari, so I had to resort to sampling in normal JS outside of GPU/deeplearn.js. I think older laptops without support for WebGL (v1) will unfortunately not run these demos, as pure JS on the raw CPU didn't seem fast enough. The deeplearn.js (now tensorflow.js) engineers know about this bug and the workaround I did for iOS/Safari, and it should be solved in the (near) future.

Let me know if any of this is unclear, or if you have any further insights and comments!

Best.

(btw, please don't "close" this issue since I want your comments and this discussion to be clearly visible to other readers in the future)

AliBaheri commented Oct 25, 2018

Hi @hardmaru
In your first response to the first question in this thread, there is a statement which has confused me:

In that sense, one can view ht as a compressed representation of all of the zi and ai for i ∈ {0 ... t-1}. Thus in addition to the current observation zt, the controller's decision will be based on this compressed representation of the entire history up to the point in which it has to make a decision at.

If ht is just a compressed representation of what has been done in the past, and zt is the current observation coming from V, then which component has the role of computing the future prediction?
