Should help with striking a balance between MC and TD(n) learning from #353. Could also try the TD(λ) method later.
This would require a lot more tracking on either the game-worker or TF-worker side; leaning towards the TF worker, since it can simplify the current process of generating experience and reduce the passing of buffers back and forth between threads.
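As a rough sketch of the target such TF-worker-side tracking would compute, assuming transitions arrive as rewards plus a bootstrap value estimate for the state n steps ahead (all names here are hypothetical, not from the repo):

```python
def n_step_target(rewards, bootstrap_value, gamma, done):
    """n-step TD target: discounted sum of the next n rewards, plus a
    discounted value estimate of the state reached after them.

    With n = 1 this is the 1-step TD target; with n spanning the whole
    episode (done=True) it degenerates to the MC return, which is the
    balance this issue is about.
    """
    # Start from v(s_{t+n}) unless the episode ended inside the window.
    target = 0.0 if done else bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```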
Close #353.
For now, replace the Monte Carlo (MC) method, which takes the total
discounted reward sum (i.e. the return) as the learning target, with a
1-step temporal-difference (TD(1)) method, which processes experiences
as they come in and uses the network's own biased estimate of the value
function as the learning target.
May add back the option for MC learning later in the form of TD(n)
support from #354.
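A minimal sketch of the two learning targets being swapped here (function names and shapes are illustrative, not the repo's actual API):

```python
import numpy as np

def mc_targets(rewards, gamma):
    """MC targets: the full discounted return from each step, which can
    only be computed once the episode has finished."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def td1_target(reward, next_value, gamma, done):
    """TD(1) target: bootstraps from the network's own (biased) value
    estimate, so it can be computed as each experience comes in."""
    return reward + (0.0 if done else gamma * next_value)
```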
Add config for a target network and for double Q-learning with the
target net.
Also move the experience config out of rollout for better grouping.
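A sketch of the double-Q target this config would enable, assuming `online_q` and `target_q` are callables returning per-action Q-values for a batch of states (interfaces assumed, not the repo's):

```python
import numpy as np

def double_q_targets(rewards, next_states, dones, gamma, online_q, target_q):
    """Double Q-learning with a target net: the online network selects
    the next action, the periodically synced target network evaluates
    it, reducing the overestimation bias of plain max-based targets.

    dones is a 0/1 float array marking terminal transitions.
    """
    next_actions = np.argmax(online_q(next_states), axis=1)
    next_values = target_q(next_states)[np.arange(len(next_states)), next_actions]
    return rewards + gamma * (1.0 - dones) * next_values
```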
Improvement on #353 and setup for #354.
Rewrite training algorithm (again) to remove the concept of episodes and
instead focus on pure learning steps according to the DQN algorithm.
Also add a proper replay buffer implementation.
Add/rewrite some configs/metrics code to mesh with above.
Also reorganize source tree, general housekeeping.
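For illustration, a minimal uniform replay buffer along the lines of what this commit describes (the repo's actual implementation may differ):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', done) transitions,
    sampled uniformly at random for each DQN learning step."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Decoupling learning from episodes then amounts to: push each incoming transition into the buffer and, once it holds enough samples, draw a random batch and take one gradient update per learning step.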