cartpole

A collection of deep reinforcement learning algorithms implemented in PyTorch and OpenAI Gym to solve the cart-pole problem, which has a continuous state space but a discrete action space.

Problem definition

From gym.openai.com

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

A state (observation) consists of 4 continuous variables: Cart Position, Cart Velocity, Pole Angle, and Pole Velocity At Tip. There are 2 discrete actions: 0 pushes the cart to the left, 1 pushes it to the right. The difference between CartPole-v0 and CartPole-v1 is the maximum episode length: 200 steps and 500 steps, respectively.
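A quick way to inspect these spaces is the minimal sketch below. It assumes the classic Gym API of this repository's era (reset() returns the state, step() returns a 4-tuple) and runs a random policy as a baseline; nothing here is taken from the repository's own scripts.

```python
import gym

# CartPole-v1 allows up to 500 steps per episode (v0 allows 200).
env = gym.make("CartPole-v1")

print(env.observation_space)  # Box with 4 continuous dimensions
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()            # random policy
    state, reward, done, info = env.step(action)  # +1 reward per step the pole stays up
    total_reward += reward
print("random-policy return:", total_reward)
```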

Algorithms

DQN (Deep Q Networks)

ref: Mnih, Human-level control through deep reinforcement learning, Algorithm 1

Compared with tabular Q-learning, DQN represents the action-value function Q with a neural network, called the Q-network. The network is trained to minimize the temporal-difference error, i.e., the loss is the MSE between the current Q estimate and the TD target. Note that the implementation details may differ from the original paper. For example, the current implementation uses two networks: a value network, which computes action values for control and is updated at every step, and a target network, which computes the TD target and is updated less frequently by copying the parameters of the value network.

[figure: DQN results]
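The two-network TD update described above can be sketched roughly as follows. The function name, batch layout, and discount factor are illustrative assumptions for this example and may not match dqn.py exactly.

```python
import torch
import torch.nn as nn

def dqn_loss(value_net, target_net, batch, gamma=0.99):
    """One TD (MSE) loss step for a DQN-style update.

    `batch` is assumed to hold tensors: states (N, 4), actions (N,),
    rewards (N,), next_states (N, 4), dones (N,) with 1.0 for terminal steps.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) from the value network for the actions actually taken.
    q = value_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # TD target computed with the (frozen) target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1.0 - dones)

    return nn.functional.mse_loss(q, target)

# The target network is refreshed occasionally by copying the value network:
# target_net.load_state_dict(value_net.state_dict())
```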

Double DQN

ref: Van Hasselt, Deep Reinforcement Learning with Double Q-learning

This algorithm improves on DQN by changing the TD target from

$$y_t = r_t + \gamma \max_a Q(s_{t+1}, a;\ \theta^-)$$

to

$$y_t = r_t + \gamma\, Q\!\left(s_{t+1}, \arg\max_a Q(s_{t+1}, a;\ \theta);\ \theta^-\right),$$

i.e., the value network (parameters $\theta$) selects the next action while the target network (parameters $\theta^-$) evaluates it, which reduces the overestimation caused by the max operator.

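Only the target computation changes. A rough sketch, again with illustrative names that are not necessarily those used in ddqn.py:

```python
import torch

def double_dqn_target(value_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double-DQN target: the value network picks the action,
    the target network evaluates it."""
    with torch.no_grad():
        best_actions = value_net(next_states).argmax(dim=1, keepdim=True)   # (N, 1)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1) # (N,)
        return rewards + gamma * next_q * (1.0 - dones)
```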
Asynchronous one-step Q-learning

ref: Mnih, Asynchronous Methods for Deep Reinforcement Learning, Algorithm 1

Instead of using experience replay as DQN does, this algorithm asynchronously executes multiple agents in parallel (one process per agent), each interacting with its own instance of the environment and updating a shared Q-network.
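A stripped-down sketch of this Hogwild-style scheme with torch.multiprocessing is shown below. The worker function, fixed epsilon, and small MLP are assumptions for illustration only; unlike Algorithm 1 in the paper (and presumably adqn.py / SharedAdam.py), it omits the target network, gradient accumulation, and the shared optimizer state.

```python
import gym
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def make_q_net():
    # Small MLP mapping the 4-dim state to 2 action values.
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def worker(rank, shared_net, gamma=0.99, episodes=10, epsilon=0.1):
    """Each process owns its own environment but updates the shared network."""
    env = gym.make("CartPole-v0")
    opt = torch.optim.Adam(shared_net.parameters(), lr=1e-3)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            q = shared_net(torch.as_tensor(state, dtype=torch.float32))
            if torch.rand(1).item() < epsilon:          # epsilon-greedy exploration
                action = env.action_space.sample()
            else:
                action = int(q.argmax())
            next_state, reward, done, _ = env.step(action)
            with torch.no_grad():                        # one-step TD target
                target = reward + (0.0 if done else gamma * shared_net(
                    torch.as_tensor(next_state, dtype=torch.float32)).max())
            loss = (q[action] - target) ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()                                   # writes into shared parameters
            state = next_state

if __name__ == "__main__":
    net = make_q_net()
    net.share_memory()                                   # parameters live in shared memory
    procs = [mp.Process(target=worker, args=(rank, net)) for rank in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```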

REINFORCE or Monte-Carlo Policy Gradient

ref: Sutton, Reinforcement Learning: An Introduction (second edition), Chapter 13.3

The agent runs through an entire episode and then updates the policy based on the discounted returns obtained.

[figure: REINFORCE results]
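A sketch of that episode-level update is given below; the function name and the return normalisation are illustrative choices and not necessarily what reinforce.py does.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """Monte-Carlo policy-gradient update from one finished episode.

    `log_probs` are the saved log pi(a_t | s_t) values (with grad);
    `rewards` are the per-step rewards of the same episode.
    """
    # Discounted return G_t for every step, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)
    # Normalising the returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Gradient ascent on sum_t G_t * log pi(a_t | s_t) == descent on its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```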

Actor-Critic

ref: Sutton, Reinforcement Learning: An Introduction (second edition), Chapter 13.5

This implementation uses a single network to compute both the policy (actor) and the state value (critic).

[figure: Actor-Critic results]
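One possible shape of such a shared network is sketched below; the layer sizes and head names are illustrative and not necessarily those used in actor_critic.py.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """One network with a shared trunk and two heads:
    a policy head (actor) and a state-value head (critic)."""

    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits over actions
        self.value_head = nn.Linear(hidden, 1)           # scalar V(s)

    def forward(self, state):
        h = self.trunk(state)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

# Acting: sample an action from the returned probabilities.
# Learning: use reward + gamma * V(s') - V(s) as the advantage for the actor
# loss and as the TD error for the critic loss.
probs, value = ActorCritic()(torch.zeros(4))
```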

Todo

  • unify code style and parameters
  • tune performance
  • compare the performance of different algorithms in one notebook
  • upgrade from v0 to v1