Calculation method of value estimate #23
Comments
Thank you for your outstanding work. Your paper mentions both an estimated value and a real value; could you explain how each is computed? Thank you.

The estimated value comes from the average output of the critic. The real value is computed by resetting the state of the MuJoCo simulator to states sampled from the replay buffer, and then following the trajectory to completion, starting from the state (and corresponding action) sampled from the buffer. Alternatively, you could estimate the real value by running trajectories from the initial start states (i.e., just reset the env and run), and compute the corresponding value from the critic on those start states.

Hey Scott, when you say the estimated value comes from the average output of the critic — which critic do you use to compute the estimate? Do you use both critics and average between them, or is the averaging over the starting states?

For TD3 it's the min of the critics. Both the estimated and real values are averages over state-action pairs in the replay buffer. So for the estimated value, it was: sample state-action pairs from the buffer -> evaluate both critics on these pairs -> take the min of the two critics.
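As a rough sketch of the two quantities described in this thread: the code below is a NumPy-only toy version (the helper names and inputs are hypothetical; the actual setup uses PyTorch critics and MuJoCo rollouts). It assumes the critics have already been evaluated on sampled state-action pairs, and that each rollout's reward sequence has been collected by restoring a buffer state and following the policy to termination.

```python
import numpy as np

def estimated_value(q1_vals, q2_vals):
    """Estimated value: elementwise min of the two critics' outputs on
    sampled (state, action) pairs, averaged over the pairs (TD3)."""
    return float(np.minimum(q1_vals, q2_vals).mean())

def discounted_return(rewards, gamma=0.99):
    """Discounted return of one rollout that starts from a buffer
    state-action pair and follows the policy to episode end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def true_value(rollout_rewards, gamma=0.99):
    """'Real' value: average discounted return over many such rollouts."""
    return float(np.mean([discounted_return(r, gamma) for r in rollout_rewards]))
```

The value-estimation bias reported in the paper is then the gap between `estimated_value` and `true_value` computed over the same sampled pairs.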