Synchronous n-step advantage actor-critic (A2C) with self-imitation learning in PyTorch 🚀

Advantage actor-critic

Please feel free to use Issues to ask questions.

What is A2C?

A policy gradient algorithm where, instead of using Monte-Carlo returns, we use a learned state-value function to compute the advantage of each action and to bootstrap n-step returns.
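
As a rough illustration (a minimal sketch, not necessarily the exact code in this repo; the function name, `gamma=0.99`, and shapes are illustrative assumptions), the n-step return bootstraps from the critic's value estimate, and the advantage is the gap between that return and the critic's estimate at each visited state:

```python
import torch

def n_step_returns(rewards, last_value, gamma=0.99):
    """Compute bootstrapped n-step returns for one rollout segment (sketch).

    rewards:    tensor of shape (n,) with rewards r_1 ... r_n
    last_value: critic estimate V(s_{n+1}) used to bootstrap (0 if the episode ended)
    """
    returns = torch.empty_like(rewards)
    running = last_value
    # Walk backwards: G_t = r_t + gamma * G_{t+1}, seeded with V(s_{n+1}).
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Advantage: A_t = G_t - V(s_t).
# The actor maximizes log pi(a_t | s_t) * A_t (with A_t detached),
# and the critic regresses V(s_t) toward G_t.
```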

Features of this implementation

  • Written in PyTorch; uses Weights & Biases (wandb) to track learning
  • Minimal requirements: NumPy, PyTorch, Gym, and wandb (for logging)
  • Synchronous: several episodes are collected for each update
  • Adjustable n-step (e.g., 1, 5, or even 200); set it longer than the episode timeout to recover Monte-Carlo returns
  • Supports discrete actions only
  • Tested on CartPole only
  • Optional: delayed reward (a cumulative reward is given only every 40 steps)
  • Optional: self-imitation learning (SIL) for sparse rewards (see the sketch after this list)
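
For the self-imitation learning option, here is a minimal sketch of the SIL objective following Oh et al. (2018), not necessarily the exact form used in this repo; the function name, batch shapes, and `value_coef=0.01` are illustrative assumptions. Transitions replayed from a buffer contribute only when their stored return exceeds the critic's current estimate:

```python
import torch

def sil_losses(log_probs, values, returns, value_coef=0.01):
    """Self-imitation losses on replayed transitions (sketch, after Oh et al. 2018).

    log_probs: log pi(a_t | s_t) for replayed actions, shape (B,)
    values:    current critic estimates V(s_t), shape (B,)
    returns:   stored discounted returns R_t from the buffer, shape (B,)
    """
    # Only imitate actions that turned out better than the critic expected.
    clipped_adv = torch.clamp(returns - values, min=0.0)
    policy_loss = -(log_probs * clipped_adv.detach()).mean()
    value_loss = 0.5 * (clipped_adv ** 2).mean()
    return policy_loss + value_coef * value_loss
```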

Learning curves

[Figures: learning curves for different n-steps, and the effect of SIL in the delayed-reward setting. Each plot shows 3 seeds.]
