Please feel free to use Issues to ask questions.
A policy gradient algorithm that, instead of using Monte-Carlo returns, uses a state-conditioned value function to compute the advantage of actions and to bootstrap n-step returns.
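As a minimal sketch of the idea (function and variable names here are illustrative, not this repo's actual API): the n-step return sums n discounted rewards and then bootstraps from the learned value function, and the advantage is that return minus the current state's value.

```python
import torch

def n_step_returns(rewards, values, last_value, dones, n, gamma=0.99):
    """Bootstrapped n-step returns for a single rollout.

    rewards, dones: 1-D tensors of length T
    values: V(s_t) for t = 0..T-1; last_value: V(s_T), used for bootstrapping
    """
    T = rewards.shape[0]
    values_ext = torch.cat([values, last_value.reshape(1)])
    returns = torch.zeros(T)
    for t in range(T):
        G, discount = 0.0, 1.0
        terminated = False
        for k in range(t, min(t + n, T)):
            G = G + discount * rewards[k]
            discount *= gamma
            if dones[k]:          # episode ended: no bootstrap term
                terminated = True
                break
        if not terminated:
            G = G + discount * values_ext[min(t + n, T)]  # bootstrap with V(s_{t+n})
        returns[t] = G
    return returns

# Policy-gradient advantage: A_t = G_t^{(n)} - V(s_t)
# advantages = n_step_returns(rewards, values, last_value, dones, n) - values.detach()
```

If n is at least as long as the episode, the bootstrap term never fires and the returns reduce to Monte-Carlo returns, which is why setting n longer than the timeout recovers them.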
Features:
- Written in PyTorch; uses Weights & Biases to track learning
- Minimal requirements: numpy, pytorch, gym, wandb (for logging)
- Synchronous (each update uses several episodes)
- The n-step horizon is configurable (e.g., 1, 5, or even 200); set it longer than the episode timeout to recover Monte-Carlo returns
- Supports discrete actions only
- Tested on CartPole only
- Optional: delayed reward (the accumulated reward is given only every 40 steps); see the wrapper sketch after this list
- Optional: self-imitation learning for sparse rewards; see the loss sketch after this list
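For the delayed-reward option, a gym wrapper along these lines could accumulate rewards and release the sum every 40 steps. This is a sketch, not the repo's exact code, and it assumes the classic 4-tuple `gym` step API:

```python
import gym

class DelayedRewardWrapper(gym.Wrapper):
    """Accumulate rewards and release the sum every `delay` steps
    (and at episode end), giving 0 in between."""

    def __init__(self, env, delay=40):
        super().__init__(env)
        self.delay = delay
        self._acc = 0.0
        self._t = 0

    def reset(self, **kwargs):
        self._acc, self._t = 0.0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._acc += reward
        self._t += 1
        if done or self._t % self.delay == 0:
            reward, self._acc = self._acc, 0.0  # release accumulated reward
        else:
            reward = 0.0                        # withhold reward this step
        return obs, reward, done, info
```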
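Self-imitation learning (SIL, Oh et al. 2018) replays the agent's own past transitions and imitates only actions whose observed return exceeded the current value estimate, which helps when rewards are sparse or delayed. A sketch of the per-batch loss (names here are illustrative):

```python
import torch

def sil_loss(log_probs, values, returns, value_coef=0.5):
    """Self-imitation loss: only transitions where the observed return
    exceeded the value estimate contribute (clipped advantage)."""
    advantage = (returns - values).clamp(min=0.0)            # (R - V)_+
    policy_loss = -(log_probs * advantage.detach()).mean()   # imitate good actions
    value_loss = 0.5 * (advantage ** 2).mean()               # push V up toward R
    return policy_loss + value_coef * value_loss
```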
| Different n-steps | Effect of SIL in the delayed-reward setting |
|---|---|

Each plot shows results for 3 seeds.