Please feel free to use Issues to ask questions.
A policy gradient algorithm that, instead of using Monte-Carlo returns, uses a state-conditioned value function to compute the advantage of actions and to bootstrap n-step returns.
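As a minimal sketch of the idea (function and variable names here are illustrative, not this repo's actual API): the n-step return sums n discounted rewards and then bootstraps from the learned value function, and the advantage is that return minus the current state's value.

```python
import torch

def n_step_returns(rewards, values, last_value, dones, n, gamma=0.99):
    """Bootstrapped n-step returns for a single rollout.

    rewards, dones: 1-D tensors of length T
    values: V(s_t) for t = 0..T-1; last_value: V(s_T), used for bootstrapping
    """
    T = rewards.shape[0]
    values_ext = torch.cat([values, last_value.reshape(1)])
    returns = torch.zeros(T)
    for t in range(T):
        G, discount = 0.0, 1.0
        terminated = False
        for k in range(t, min(t + n, T)):
            G = G + discount * rewards[k]
            discount *= gamma
            if dones[k]:          # episode ended: no bootstrap term
                terminated = True
                break
        if not terminated:
            G = G + discount * values_ext[min(t + n, T)]  # bootstrap with V(s_{t+n})
        returns[t] = G
    return returns

# Policy-gradient advantage: A_t = G_t^{(n)} - V(s_t)
# advantages = n_step_returns(rewards, values, last_value, dones, n) - values.detach()
```

If n is at least as long as the episode, the bootstrap term never fires and the returns reduce to Monte-Carlo returns, which is why setting n longer than the timeout recovers them.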
Features:
- Written in PyTorch; uses Weights & Biases to track learning
- Minimal requirements: numpy, pytorch, gym, wandb (for logging)
- Synchronous (each update uses several episodes)
- The n-step horizon is configurable (e.g., 1, 5, or even 200); set it longer than the episode timeout to recover Monte-Carlo returns
- Supports discrete actions only
- Tested on CartPole only
- Optional: delayed reward (the accumulated reward is given only every 40 steps); see the wrapper sketch after this list
- Optional: self-imitation learning for sparse rewards; see the loss sketch after this list
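For the delayed-reward option, a gym wrapper along these lines could accumulate rewards and release the sum every 40 steps. This is a sketch, not the repo's exact code, and it assumes the classic 4-tuple `gym` step API:

```python
import gym

class DelayedRewardWrapper(gym.Wrapper):
    """Accumulate rewards and release the sum every `delay` steps
    (and at episode end), giving 0 in between."""

    def __init__(self, env, delay=40):
        super().__init__(env)
        self.delay = delay
        self._acc = 0.0
        self._t = 0

    def reset(self, **kwargs):
        self._acc, self._t = 0.0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._acc += reward
        self._t += 1
        if done or self._t % self.delay == 0:
            reward, self._acc = self._acc, 0.0  # release accumulated reward
        else:
            reward = 0.0                        # withhold reward this step
        return obs, reward, done, info
```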
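Self-imitation learning (SIL, Oh et al. 2018) replays the agent's own past transitions and imitates only actions whose observed return exceeded the current value estimate, which helps when rewards are sparse or delayed. A sketch of the per-batch loss (names here are illustrative):

```python
import torch

def sil_loss(log_probs, values, returns, value_coef=0.5):
    """Self-imitation loss: only transitions where the observed return
    exceeded the value estimate contribute (clipped advantage)."""
    advantage = (returns - values).clamp(min=0.0)            # (R - V)_+
    policy_loss = -(log_probs * advantage.detach()).mean()   # imitate good actions
    value_loss = 0.5 * (advantage ** 2).mean()               # push V up toward R
    return policy_loss + value_coef * value_loss
```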
| Different n-steps | Effect of SIL in the delayed-reward setting |
|---|---|

Each plot shows results for 3 seeds.