The current Monte Carlo approach to computing reward targets seems to have far too much variance, due to the randomness in both the game itself and its starting positions, as well as the 2-player aspect. Rewrite the training algorithm to use temporal difference (TD) learning instead, which trains directly on (reward, next-state) transitions and uses a target network to improve stability.
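For illustration only, here is a minimal sketch of what a one-step TD target with a frozen target network could look like. This is not code from this repo; the names (`q_net`, `target_net`), the PyTorch framing, and the discount factor are assumptions.

```python
import torch

# Assumed setup: q_net and target_net are torch.nn.Module Q-networks with
# identical architecture; gamma is the discount factor (hypothetical value).
def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """Compute one-step TD targets: r + gamma * max_a' Q_target(s', a')."""
    with torch.no_grad():
        # Bootstrap from the frozen target network for stability.
        next_q = target_net(next_states).max(dim=1).values
        # Terminal transitions get no bootstrapped value.
        return rewards + gamma * (1.0 - dones) * next_q

def td_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions actually taken in the batch.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    targets = td_targets(target_net, rewards, next_states, dones, gamma)
    return torch.nn.functional.mse_loss(q_sa, targets)
```

The target network would then be synced with the online network (e.g. by copying its weights) every fixed number of learning steps, which is what gives the targets their stability relative to Monte Carlo returns.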
Improvement on #353 and setup for #354.
Rewrite the training algorithm (again) to remove the concept of episodes and
instead focus on pure learning steps according to the DQN algorithm.
Also add a proper replay buffer implementation (a rough sketch follows below).
Add/rewrite some config/metrics code to mesh with the above.
Also reorganize the source tree and do general housekeeping.
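For reference, a minimal replay buffer along the lines described above might look like the sketch below. It assumes uniform random sampling and transitions stored as plain tuples; it is not the implementation from this PR.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, reward, next_state, done)
    transitions, sampled uniformly at random for each DQN learning step."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of transitions into per-field tuples.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

Each pure learning step would then sample a batch from this buffer, compute the TD loss against the target network, take an optimizer step, and periodically copy the online network weights into the target network.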