
pytorch-DDPG

PyTorch implementation of DDPG, a continuous action-space algorithm

DDPG

DDPG takes a structure similar to actor-critic: a policy network is optimized to maximize the output of a value network, while the value network, by interacting with the environment, tries to estimate the true expected return of any given state-action pair. There are four networks in DDPG: two online (updating) networks, the policy and value networks, and two target networks which are copies of them. In optimization, the value network uses an approach similar to DQN (see https://github.com/williamium3000/pytorch-DQN for more details and an implementation), while the policy network uses an approach similar to policy gradient, where the network is updated in the direction where the value function gives the highest value. In other words, the policy network updates under the supervision of the value network.
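The two updates can be sketched roughly as follows. This is an illustrative sketch, not the exact code in agent.py: network sizes and names are placeholders, and the soft update of the target networks is omitted.

    import copy
    import torch
    import torch.nn as nn

    state_dim, action_dim = 3, 1  # e.g. Pendulum-v0; sizes are only illustrative
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
    value = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    target_policy, target_value = copy.deepcopy(policy), copy.deepcopy(value)  # slowly-updated copies

    policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    value_opt = torch.optim.Adam(value.parameters(), lr=1e-3)
    gamma = 0.95

    def ddpg_update(state, action, reward, next_state, done):
        # Value network: regress Q(s, a) towards r + gamma * Q_target(s', pi_target(s')), as in DQN.
        with torch.no_grad():
            next_q = target_value(torch.cat([next_state, target_policy(next_state)], dim=1))
            target = reward + gamma * (1.0 - done) * next_q
        q = value(torch.cat([state, action], dim=1))
        value_loss = nn.functional.mse_loss(q, target)
        value_opt.zero_grad(); value_loss.backward(); value_opt.step()

        # Policy network: move towards actions the value network scores highest.
        policy_loss = -value(torch.cat([state, policy(state)], dim=1)).mean()
        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()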

Implementation

  • agent.py: agent that interacts with the environment and performs the DDPG updates

  • env.py: code adapted from PaddlePaddle that turns CartPole into a continuous action-space problem

  • experience_replay.py: experience replay buffer (same mechanism as in DQN)

  • train.py: training entry point

  • network.py: value and policy networks in PyTorch

To start training, modify the environment and hyperparameters in train.py as you want and run

    python train.py

We record the reward of each episode in TensorBoard; feel free to check it with

tensorboard --logdir=runs 
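For reference, per-episode rewards can be written to that directory roughly like this (a minimal sketch; run_one_episode is a hypothetical helper standing in for the training loop in train.py):

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="runs")            # same directory as --logdir above
    for episode in range(2000):
        episode_reward = run_one_episode()            # hypothetical helper returning the episode return
        writer.add_scalar("reward/episode", episode_reward, episode)
    writer.close()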

Experiments and results

Environment

CartPole-v0

The traditional CartPole-v0 involves a pole on a cart, and the agent is required to control the cart (push left/right) to prevent the pole from falling. At each step the agent receives a reward of 1 as long as the pole has not fallen, with a maximum of 200 steps. In our experiment, since DDPG works on continuous action spaces, we convert the traditional environment into a continuous one via PaddlePaddle open-source code (see env.py).
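The actual conversion lives in env.py; one common way to build such a wrapper, shown here purely as an illustration (not necessarily what env.py does), is to accept a continuous action and map it back to the original discrete push:

    import gym
    import numpy as np

    class ContinuousCartPole(gym.ActionWrapper):
        """Accept a continuous action in [-1, 1] and map it to CartPole's discrete push."""
        def __init__(self, env):
            super().__init__(env)
            self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

        def action(self, act):
            # 0: push left, 1: push right
            return int(np.asarray(act).flatten()[0] > 0.0)

    env = ContinuousCartPole(gym.make("CartPole-v0"))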

Pendulum-v0

Pendulum-v0 involves a pendulum starting in a random position, and the goal is to swing it up so it stays upright.

Results

Both agents are trained for 2000 episodes with a learning rate of 0.001, a batch size of 32, a learning frequency of 5 steps, and a gamma of 0.95. It is worth mentioning that since the policy network outputs a deterministic continuous value (an action given a state), we need to add some noise to push the agent to explore. We use Gaussian noise with a mean of 0.01 and a standard deviation of 1.
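For example, exploration can be added on top of the deterministic policy output roughly like this (a sketch using the noise parameters above; the function name and action bounds are illustrative, and the actual sampling is in agent.py):

    import numpy as np

    def noisy_action(policy_action, noise_mean=0.01, noise_std=1.0, low=-1.0, high=1.0):
        # Add Gaussian exploration noise to the deterministic action and clip to the valid range.
        noise = np.random.normal(noise_mean, noise_std, size=np.shape(policy_action))
        return np.clip(policy_action + noise, low, high)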

CartPole-v0

We plot the reward of each episode. A demo of the trained agent follows.

Pendulum-v0

We plot the reward of each episode. A demo of the trained agent follows.
