Calculation method of value estimate #23
Comments
Thank you for your outstanding work. Your paper mentions both an estimated value and a real value; could you explain how each is computed? Thank you.

The estimated value comes from the average output of the critic. The real value is computed by resetting the state of the MuJoCo simulator to states sampled from the replay buffer, and then following the trajectory to completion, starting from the state (and corresponding action) sampled from the buffer. Alternatively, you could estimate the real value by running trajectories from the initial start states (i.e., just reset the env and run), and compute the corresponding value from the critic on those start states.

Hey Scott, when you say the estimated value comes from the average output of the critic — which critic do you use to compute the estimate? Do you use both critics and average between them, or is the averaging over the starting states?

For TD3 it's the min of the critics. Both the estimated and real values are averages over state-action pairs in the replay buffer. So for the estimated value, it was: sample state-action pairs from the buffer -> evaluate both critics on these pairs -> take the min of the two critics.
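As a rough sketch of the two quantities described in this thread: the code below is a NumPy-only toy version (the helper names and inputs are hypothetical; the actual setup uses PyTorch critics and MuJoCo rollouts). It assumes the critics have already been evaluated on sampled state-action pairs, and that each rollout's reward sequence has been collected by restoring a buffer state and following the policy to termination.

```python
import numpy as np

def estimated_value(q1_vals, q2_vals):
    """Estimated value: elementwise min of the two critics' outputs on
    sampled (state, action) pairs, averaged over the pairs (TD3)."""
    return float(np.minimum(q1_vals, q2_vals).mean())

def discounted_return(rewards, gamma=0.99):
    """Discounted return of one rollout that starts from a buffer
    state-action pair and follows the policy to episode end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def true_value(rollout_rewards, gamma=0.99):
    """'Real' value: average discounted return over many such rollouts."""
    return float(np.mean([discounted_return(r, gamma) for r in rollout_rewards]))
```

The value-estimation bias reported in the paper is then the gap between `estimated_value` and `true_value` computed over the same sampled pairs.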