The current Monte Carlo approach to computing reward targets seems to have far too much variance, due to the randomness in both the game itself and its starting positions, as well as the 2-player aspect. Rewrite the training algorithm to use temporal difference (TD) learning instead, which trains directly on (reward, next-state) transitions and uses a target network to improve stability.
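For illustration only, here is a minimal sketch of what a one-step TD target with a frozen target network could look like. This is not code from this repo; the names (`q_net`, `target_net`), the PyTorch framing, and the discount factor are assumptions.

```python
import torch

# Assumed setup: q_net and target_net are torch.nn.Module Q-networks with
# identical architecture; gamma is the discount factor (hypothetical value).
def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """Compute one-step TD targets: r + gamma * max_a' Q_target(s', a')."""
    with torch.no_grad():
        # Bootstrap from the frozen target network for stability.
        next_q = target_net(next_states).max(dim=1).values
        # Terminal transitions get no bootstrapped value.
        return rewards + gamma * (1.0 - dones) * next_q

def td_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions actually taken in the batch.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    targets = td_targets(target_net, rewards, next_states, dones, gamma)
    return torch.nn.functional.mse_loss(q_sa, targets)
```

The target network would then be synced with the online network (e.g. by copying its weights) every fixed number of learning steps, which is what gives the targets their stability relative to Monte Carlo returns.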
Improvement on #353 and setup for #354.
Rewrite the training algorithm (again) to remove the concept of episodes and
instead focus on pure learning steps according to the DQN algorithm.
Also add a proper replay buffer implementation (a rough sketch follows below).
Add/rewrite some config/metrics code to mesh with the above.
Also reorganize the source tree and do general housekeeping.
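For reference, a minimal replay buffer along the lines described above might look like the sketch below. It assumes uniform random sampling and transitions stored as plain tuples; it is not the implementation from this PR.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, reward, next_state, done)
    transitions, sampled uniformly at random for each DQN learning step."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of transitions into per-field tuples.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

Each pure learning step would then sample a batch from this buffer, compute the TD loss against the target network, take an optimizer step, and periodically copy the online network weights into the target network.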