VRER: A Sample-Efficient Variance Reduction based Experience Replay Method for Policy Optimization Algorithms in tf2


vrer-pg, short for variance reduction based policy gradient, is a library implementing the variance reduction based experience replay method proposed in the paper "Zheng, H., et al., 2022. Variance Reduction based Experience Replay for Reinforcement Learning Policy Optimization."

The research was conducted by Hua Zheng under the supervision of Professor Wei Xie and Professor Ben Feng. We would appreciate a citation if you use the code or results!

Acknowledgement: This library is built on top of rlalgorithms-tf2, which no longer exists on GitHub, but we would like to acknowledge its author unsignedrant.

1. Introduction


Experience replay allows agents to remember and reuse historical transitions. However, reusing all transitions uniformly, regardless of their significance, implicitly biases learning toward out-of-date observations. To overcome this limitation, we propose a general variance reduction based experience replay (VRER) approach, which allows policy optimization algorithms to selectively reuse the most relevant samples and improve policy gradient estimation. It puts more weight on historical observations that are more likely to have been sampled from the target distribution. Unlike other ER methods, VRER is a theoretically justified and simple-to-use approach. Our theoretical and empirical studies demonstrate that VRER can accelerate learning of the optimal policy and enhance the performance of state-of-the-art policy optimization approaches.
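
As a conceptual illustration only (not the library's actual implementation), the selection idea can be sketched as follows: transitions generated by an old behavioral policy are reused only when the likelihood ratios between the current policy and that behavioral policy are well behaved, e.g. their sample variance stays below a threshold. A minimal, hypothetical sketch:

import numpy as np

def select_reusable_policies(ratios_per_policy, c=2.0):
    """Toy selection rule: reuse the batch generated by behavioral policy k
    only if the sample variance of its likelihood ratios
    pi_current(a|s) / pi_k(a|s) stays below a constant c.
    Illustrative only; see the paper for the actual criterion."""
    return [k for k, ratios in ratios_per_policy.items() if np.var(ratios) < c]

# Ratios near 1 mean the old samples look as if drawn from the current policy.
batches = {0: np.array([0.9, 1.1, 1.0]), 1: np.array([0.1, 5.0, 2.5])}
print(select_reusable_policies(batches))  # -> [0]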

1.1. Description


vrer-pg is a TensorFlow-based AI library that facilitates experimentation with variance reduction based policy optimization on top of existing reinforcement learning algorithms. It provides well-tested components that can be easily modified or extended. The available algorithms can be used directly from code or through the command line.

1.2. Installation


pip install git+https://github.com/zhenghuazx/vrer_policy_gradient.git

Verify installation:

vrer-pg

OUT:

vrer-pg 1.0.1

Usage:
	vrer-pg <command> <agent> [options] [args]

Available commands:
	train      Train given an agent and environment
	play       Play a game given a trained agent and environment
	tune       Tune hyperparameters given an agent, hyperparameter specs, and environment

Use vrer-pg <command> to see more info about a command
Use vrer-pg <command> <agent> to see more info about command + agent

1.3. Results


Performance improvement of state-of-the-art policy optimization algorithms after applying VRER. Results show the mean performance curves and 95% confidence intervals of PPO(-VRER), TRPO(-VRER), and VPG(-VRER).

[Figure: convergence curves of PPO(-VRER), TRPO(-VRER), and VPG(-VRER)]

1.4. Reproduction

All results for PPO(-VRER) and TRPO(-VRER) can be reproduced by the scripts in reproduction, and results for VPG(-VRER) by the scripts in vrer_policy_gradient/vpgvrer-script and vrer_policy_gradient/vpg-script.

2. Features


2.1. Command line options

All features are available through the command line. For more information, check the command line options.

2.2. Intuitive hyperparameter tuning from cli

The command line tuning interface is based on optuna, which provides many hyperparameter features and types. Three types are currently used by vrer-pg:

  • Categorical:

    vrer-pg tune <agent> --env <env> --interesting-param <val1> <val2> <val3> # ...
    
  • Int / log uniform:

    vrer-pg tune <agent> --env <env> --interesting-param <min-val> <max-val>
    

In both examples, if --interesting-param is not specified, it takes its default value; if only one value is specified, it is treated as a fixed value.
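
For example, categorical and log-uniform specs can be mixed in one call (illustrative values; per the tables in section 4, --n-steps is categorical and --lr is log-uniform). Remaining tuning options are listed by vrer-pg tune <agent>:

vrer-pg tune ppo --env CartPole-v0 --lr 1e-5 1e-2 --n-steps 128 256 512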

2.3. Early stopping / reduce on plateau

Training stops early when a plateau is reached a pre-specified number of times without improvement. On each plateau, the learning rate is reduced by a pre-determined factor. To activate these features:

--divergence-monitoring-steps <train-steps-at-which-should-monitor>
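
For example, combined with the plateau flags listed in the tables of section 4 (illustrative values):

vrer-pg train ppo --env CartPole-v0 --divergence-monitoring-steps 100000 --plateau-reduce-patience 5 --plateau-reduce-factor 0.5 --early-stop-patience 3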

2.4. Models are loaded from .cfg files

To facilitate experimentation and eliminate redundancy, all agents support loading models by passing either --model <model.cfg> or --actor-model <actor.cfg> and --critic-model <critic.cfg>. If no models are passed, the default ones will be loaded. A typical model.cfg file looks like:

[convolutional-0]
filters=32
size=8
stride=4
activation=relu
initializer=orthogonal
gain=1.4142135

[convolutional-1]
filters=64
size=4
stride=2
activation=relu
initializer=orthogonal
gain=1.4142135

[convolutional-2]
filters=64
size=3
stride=1
activation=relu
initializer=orthogonal
gain=1.4142135

[flatten-0]

[dense-0]
units=512
activation=relu
initializer=orthogonal
gain=1.4142135
common=1

[dense-1]
initializer=orthogonal
gain=0.01
output=1

[dense-2]
initializer=orthogonal
gain=1.0
output=1

This should generate a keras model similar to the one below, with output units 6 and 1, respectively:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 84, 84, 1)]  0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 20, 20, 32)   2080        input_1[0][0]                    
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 9, 9, 64)     32832       conv2d[0][0]                     
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 7, 7, 64)     36928       conv2d_1[0][0]                   
__________________________________________________________________________________________________
flatten (Flatten)               (None, 3136)         0           conv2d_2[0][0]                   
__________________________________________________________________________________________________
dense (Dense)                   (None, 512)          1606144     flatten[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 6)            3078        dense[0][0]                      
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            513         dense[0][0]                      
==================================================================================================
Total params: 1,681,575
Trainable params: 1,681,575
Non-trainable params: 0
__________________________________________________________________________________________________

Notes

  • You don't have to worry about this if you're going to use the default models, which are loaded automatically.
  • common=1 marks a layer to be reused by the following layers, which means dense-1 and dense-2 are called on the output of dense-0.
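
For non-image environments, a much smaller fully connected configuration works the same way. A minimal, hypothetical example following the same .cfg conventions, with a shared trunk marked common=1 feeding separate actor and critic heads:

[dense-0]
units=64
activation=tanh
common=1

[dense-1]
output=1

[dense-2]
output=1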

2.5. Training history checkpoints

Saving training history is available for further benchmarking / visualizing results. This is achieved by specifying --history-checkpoint <history.parquet>, which results in a .parquet file updated at the end of each episode (a quick way to inspect such a file is sketched after the column list). A sample data point will have these columns:

  • mean_reward most recent mean of agent episode rewards.
  • best_reward most recent best of agent episode rewards.
  • episode_reward most recent episode reward.
  • step most recent agent step.
  • time training elapsed time.
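
A minimal sketch for inspecting such a file, assuming pandas with a parquet engine (e.g. pyarrow) is installed, plus matplotlib for the plot, and that history.parquet was produced by a previous run:

import pandas as pd

# Each row is appended at the end of an episode during training.
history = pd.read_parquet('history.parquet')
print(history[['step', 'mean_reward', 'best_reward', 'time']].tail())

# Mean reward vs. environment steps gives a quick learning curve.
history.plot(x='step', y='mean_reward')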

2.6. Reproducible results

All operation results are reproducible by passing --seed <some-seed> or seed=some_seed to the agent constructor.
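
For example, to repeat a run exactly (illustrative command):

vrer-pg train ppo --env CartPole-v0 --seed 2021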

2.7. Gameplay output to .jpg frames

Gameplay visual output can be saved to .jpg frames by passing --frame-dir <some-dir> to the play command.
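
For example (illustrative paths; the weights should come from a previously trained ppo agent):

vrer-pg play ppo --env CartPole-v0 --weights ppo-cartpole.tf --frame-dir frames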

2.8. Resume training / history

Weights are saved to .tf files by specifying --checkpoints <ckpt1.tf> <ckpt2.tf>. To resume training, --weights <ckpt1.tf> <ckpt2.tf> loads the weights saved earlier. If --history-checkpoint <ckpt.parquet> is specified and the file exists, further training history is appended to the same ckpt.parquet, and the agent metrics are restored from the most recent values in the history file.
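
For example, a first run that saves weights and history, followed by a second run that resumes from them (illustrative values):

vrer-pg train ppo --env CartPole-v0 --checkpoints ppo-cartpole.tf --history-checkpoint ppo-cartpole.parquet --max-steps 100000
vrer-pg train ppo --env CartPole-v0 --weights ppo-cartpole.tf --checkpoints ppo-cartpole.tf --history-checkpoint ppo-cartpole.parquet --target-reward 195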

3. Usage


All agents / commands are available through the command line.

vrer-pg <command> <agent> [options] [args]

Note: Unless called from the command line with --weights passed, all models passed to agents in code should have their weights loaded beforehand when used for resuming training or playing.
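
A minimal sketch of that pattern, following the Section 4 examples (the 'ppo' entry of vrer_policy_gradient.agents and the checkpoint path are assumptions; Model.load_weights is standard Keras):

from tensorflow.keras.optimizers import Adam
import vrer_policy_gradient
from vrer_policy_gradient import PPO
from vrer_policy_gradient.utils.common import ModelReader, create_envs

envs = create_envs('CartPole-v0', 4, False)
# Build the default PPO model as in the Section 4 examples.
model_cfg = vrer_policy_gradient.agents['ppo']['model']['ann'][0]  # assumed key
model = ModelReader(
    model_cfg,
    output_units=[envs[0].action_space.n, 1],
    input_shape=envs[0].observation_space.shape,
    optimizer=Adam(learning_rate=3e-4),
).build_model()
# Restore previously saved weights before resuming training or playing.
model.load_weights('ppo-cartpole.tf')  # illustrative checkpoint path
agent = PPO(envs, model, checkpoints=['ppo-cartpole.tf'])
agent.fit(target_reward=195)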

Command line

vrer-pg train trpovrer

Out

vrer-pg 1.0.1

Usage:
	vrer-pg <command> <agent> [options] [args]

Available commands:
	train      Train given an agent and environment
	play       Play a game given a trained agent and environment
	tune       Tune hyperparameters given an agent, hyperparameter specs, and environment

Use vrer-pg <command> to see more info about a command
Use vrer-pg <command> <agent> to see more info about command + agent

4. Algorithms


General notes

  • The default hyperparameters do not work well for all environments, which means you either need to tune them for the given environment, or pass previously tuned ones, in order to get good results.
  • --model <model.cfg> or --actor-model <actor.cfg> and --critic-model <critic.cfg> are optional, meaning the default model(s) will be loaded if none are specified, so you don't have to worry about it.
  • You can also use external models by passing them to the agent constructor. If you do, you will have to ensure your model outputs match what the implementation expects, or modify it accordingly (see the sketch after these notes).
  • For atari environments / the ones that return an image by default, use the --preprocess flag for image preprocessing.
  • For checkpoints to be saved, --checkpoints <checkpoint1.tf> <checkpoint2.tf> should be specified. The number of passed checkpoints should match the number of models the agent accepts.
  • For loading weights, either for resuming training or for playing a game, pass --weights <weights1.tf> <weights2.tf>; as with checkpoints, their number should match the number of agent models.
  • For using a random seed, seed=some_seed should be passed to the agent constructor and ModelReader constructor if specified from code. From the command line, all you need is to pass --seed <some-seed>.
  • To save training history, history_checkpoint=some_history.parquet should be specified to the agent constructor, or alternatively --history-checkpoint <some-history.parquet> from the command line. If the history checkpoint exists, training metrics will automatically resume from where they left off.

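The snippet below sketches an external model for PPO on a discrete-action environment. The two-output layout (action logits, then value) and the compiled optimizer are assumptions based on the ModelReader examples in this section; verify them against the implementation before relying on this:

import tensorflow as tf
from vrer_policy_gradient import PPO
from vrer_policy_gradient.utils.common import create_envs

envs = create_envs('CartPole-v0', 4, False)
inputs = tf.keras.Input(shape=envs[0].observation_space.shape)
x = tf.keras.layers.Dense(64, activation='tanh')(inputs)
logits = tf.keras.layers.Dense(envs[0].action_space.n)(x)  # actor head
value = tf.keras.layers.Dense(1)(x)                        # critic head
model = tf.keras.Model(inputs, [logits, value])
# compile() attaches the optimizer the agent is assumed to use.
model.compile(optimizer=tf.keras.optimizers.Adam(3e-4))
agent = PPO(envs, model, seed=2021)
agent.fit(target_reward=195)
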
4.1. TRPO-VRER

  • Number of models: 1
  • Action spaces: discrete, continuous
flags help default hp_type
--actor-iterations Actor optimization iterations per train step 10 int
--actor-model Path to actor model .cfg file - -
--advantage-epsilon Value added to estimated advantage 1e-08 log_uniform
--beta1 Beta1 passed to a tensorflow.keras.optimizers.Optimizer 0.9 log_uniform
--beta2 Beta2 passed to a tensorflow.keras.optimizers.Optimizer 0.999 log_uniform
--buffer_size Maximum capacity of replay buffer 100 categorical
--cg-damping Conjugate gradient damping parameter 0.001 log_uniform
--cg-iterations Conjugate gradient iterations per train step 10 -
--cg-residual-tolerance Conjugate gradient residual tolerance parameter 1e-10 log_uniform
--checkpoints Path(s) to new model(s) to which checkpoint(s) will be saved during training - -
--clip-norm Clipping value passed to tf.clip_by_value() 0.1 log_uniform
--critic-iterations Critic optimization iterations per train step 3 int
--critic-model Path to critic model .cfg file - -
--display-precision Number of decimals to be displayed 2 -
--divergence-monitoring-steps Steps after which plateau and early stopping monitoring become active - -
--early-stop-patience Number of plateau reductions after which training is stopped 3 -
--entropy-coef Entropy coefficient for loss calculation 0 log_uniform
--env gym environment id - -
--fvp-n-steps Value used to skip every n-frames used to calculate FVP 5 int
--gamma Discount factor 0.99 log_uniform
--grad-norm Gradient clipping value passed to tf.clip_by_value() 0.5 log_uniform
--history-checkpoint Path to .parquet file to save training history - -
--lam GAE-Lambda for advantage estimation 1.0 log_uniform
--log-frequency Log progress every n games - -
--lr Learning rate passed to a tensorflow.keras.optimizers.Optimizer 0.0007 log_uniform
--max-frame If specified, max & skip will be applied during preprocessing - categorical
--max-kl Maximum KL divergence used for calculating Lagrange multiplier 0.001 log_uniform
--max-steps Maximum number of environment steps, when reached, training is stopped - -
--mini-batches Number of mini-batches to use per update 4 categorical
--monitor-session Wandb session name - -
--n-envs Number of environments to create 1 categorical
--n-steps Transition steps 512 categorical
--num_reuse_each_iter Number of randomly sampled transitions from each behavioral policy in the reuse set 3 categorical
--opt-epsilon Epsilon passed to a tensorflow.keras.optimizers.Optimizer 1e-07 log_uniform
--plateau-reduce-factor Factor multiplied by current learning rate when there is a plateau 0.9 -
--plateau-reduce-patience Number of non-improvements before the learning rate is reduced 10 -
--ppo-epochs Gradient updates per training step 4 categorical
--preprocess If specified, states will be treated as atari frames and preprocessed accordingly - -
--quiet If specified, no messages by the agent will be displayed to the console - -
--reward-buffer-size Size of the total reward buffer, used for calculating the mean reward value to be displayed 100 -
--seed Random seed - -
--target-reward Target reward when reached, training is stopped - -
--value-loss-coef Value loss coefficient for value loss calculation 0.5 log_uniform
--weights Path(s) to model(s) weight(s) to be loaded by the agent - -

Run TRPO-VRER from the command line:

vrer-pg train trpovrer --n-envs 4 --target-reward 195 --env CartPole-v0
number of reuse:  0
time: 0:00:15.744019, steps: 2048, games: 101, speed: 130 steps/s, mean reward: 20.1, best reward: -inf
number of reuse:  1
Best reward updated: -inf -> 20.1
time: 0:00:17.821029, steps: 4096, games: 183, speed: 986 steps/s, mean reward: 24.3, best reward: 20.1
number of reuse:  1
Best reward updated: 20.1 -> 24.3
time: 0:00:20.266888, steps: 6144, games: 266, speed: 837 steps/s, mean reward: 24.25, best reward: 24.3
number of reuse:  3
time: 0:00:22.747528, steps: 8192, games: 345, speed: 826 steps/s, mean reward: 25.9, best reward: 24.3
number of reuse:  0
Best reward updated: 24.3 -> 25.9
time: 0:00:24.894729, steps: 10240, games: 410, speed: 954 steps/s, mean reward: 29.28, best reward: 25.9
number of reuse:  5
Best reward updated: 25.9 -> 29.28
time: 0:00:26.956820, steps: 12288, games: 481, speed: 993 steps/s, mean reward: 30.03, best reward: 29.28
number of reuse:  4
Best reward updated: 29.28 -> 30.03
time: 0:00:29.242799, steps: 14336, games: 548, speed: 896 steps/s, mean reward: 31.22, best reward: 30.03
number of reuse:  0
Best reward updated: 30.03 -> 31.22
time: 0:00:31.301686, steps: 16384, games: 600, speed: 995 steps/s, mean reward: 34.61, best reward: 31.22
number of reuse:  5
Best reward updated: 31.22 -> 34.61
time: 0:00:33.378286, steps: 18432, games: 648, speed: 986 steps/s, mean reward: 41.5, best reward: 34.61
number of reuse:  0
Best reward updated: 34.61 -> 41.5
time: 0:00:35.461156, steps: 20480, games: 691, speed: 983 steps/s, mean reward: 45.26, best reward: 41.5
number of reuse:  0
Best reward updated: 41.5 -> 45.26
time: 0:00:37.535053, steps: 22528, games: 729, speed: 988 steps/s, mean reward: 50.42, best reward: 45.26
number of reuse:  11
Best reward updated: 45.26 -> 50.42
time: 0:00:39.798340, steps: 24576, games: 768, speed: 905 steps/s, mean reward: 50.69, best reward: 50.42
number of reuse:  12
Best reward updated: 50.42 -> 50.69
time: 0:00:42.106897, steps: 26624, games: 801, speed: 887 steps/s, mean reward: 56.27, best reward: 50.69
number of reuse:  13
Best reward updated: 50.69 -> 56.27
time: 0:00:44.179972, steps: 28672, games: 833, speed: 988 steps/s, mean reward: 59.49, best reward: 56.27

4.1.1 TRPO Non-command line

''' TRPO '''
from tensorflow.keras.optimizers import Adam
import os
import vrer_policy_gradient
from vrer_policy_gradient import TRPO
from vrer_policy_gradient.utils.common import ModelReader, create_envs
    
for i in range(5):
    seed = i + 2021
    n_envs = 4
    n_steps = 128
    clip_norm = 0.2
    entropy_coef = 0.0
    mini_batches = 32
    lr = 0.0003
    max_steps = 400000
    target_reward = -70
    problem = 'Acrobot-v1'
    checkpoints = ['trpo-actor-acrobot-v1-seed-{}.tf'.format(i), 'trpo-critic-acrobot-v1-seed-{}.tf'.format(i)]
    history_checkpoint = 'trpo-acrobot-v1-seed-{}.parquet'.format(i)
    path = "TRPO/{}/approx-2nd-seed-{}-id-{}/".format(problem, seed, i)
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
    checkpoints = [path + ckpt for ckpt in checkpoints]  # avoid shadowing the loop index i
    history_checkpoint = path + history_checkpoint
    
    envs = create_envs(problem, n_envs, False)
    actor_model_cfg = 'TRPO/models/ann-actor.cfg'  # rlalgorithms_tf2.agents['trpo']['actor_model']['ann'][0]
    critic_model_cfg = 'TRPO/models/ann-critic.cfg'  # rlalgorithms_tf2.agents['trpo']['critic_model']['ann'][0]
    optimizer = Adam(learning_rate=lr)
    actor_model = ModelReader(
        actor_model_cfg,
        output_units=[envs[0].action_space.n],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()
    critic_model = ModelReader(
        critic_model_cfg,
        output_units=[1],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()
    agent = TRPO(envs, actor_model, critic_model, seed=seed, n_steps=n_steps, entropy_coef=entropy_coef, mini_batches=mini_batches,
                 clip_norm=clip_norm, checkpoints=checkpoints, history_checkpoint=history_checkpoint, log_frequency=4)
    agent.fit(target_reward=target_reward, max_steps=max_steps)

4.1.2 TRPO-VRER Non-command line

''' TRPO-VRER '''
from tensorflow.keras.optimizers import Adam
import os
from vrer_policy_gradient import TRPOVRER
from vrer_policy_gradient.utils.common import ModelReader, create_envs
import vrer_policy_gradient

for i in range(5):
    seed = i + 2021
    n_envs = 4
    n_steps = 128
    clip_norm = 0.2
    entropy_coef = 0.0
    mini_batches = 32
    lr = 0.0003
    max_steps = 400000
    target_reward = -70
    buffer_size = 100
    problem = 'Acrobot-v1'
    checkpoints = ['trpovrer-actor-problem-{}-buffer_size-{}-seed-{}.tf'.format(problem, buffer_size, i),
                   'trpovrer-critic-problem-{}-buffer_size-{}-seed-{}.tf'.format(problem, buffer_size, i)]
    history_checkpoint = 'trpovrer-problem-{}-buffer_size-{}-seed-{}.parquet'.format(problem, buffer_size, i)
    path = "trpovrer/{}/approx-2nd-buffer_size-{}-seed-{}-id-{}/".format(problem, buffer_size, seed, i)
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
    checkpoints = [path + ckpt for ckpt in checkpoints]  # avoid shadowing the loop index i
    history_checkpoint = path + history_checkpoint

    envs = create_envs(problem, n_envs, False)

    actor_model_cfg = vrer_policy_gradient.agents['trpovrer']['actor_model']['ann'][0]
    critic_model_cfg = vrer_policy_gradient.agents['trpovrer']['critic_model']['ann'][0]
    optimizer = Adam(learning_rate=lr)
    actor_model = ModelReader(
        actor_model_cfg,
        output_units=[envs[0].action_space.n],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()
    critic_model = ModelReader(
        critic_model_cfg,
        output_units=[1],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()

    agent = TRPOVRER(envs, actor_model, critic_model, seed=seed, n_steps=n_steps, entropy_coef=entropy_coef, mini_batches=mini_batches,
                 clip_norm=clip_norm, checkpoints=checkpoints, history_checkpoint=history_checkpoint, log_frequency=4, buffer_size=buffer_size)
    agent.fit(target_reward=target_reward, max_steps=max_steps)

4.2. PPO-VRER and PPO

  • Number of models: 1
  • Action spaces: discrete, continuous
flags help default hp_type
--advantage-epsilon Value added to estimated advantage 1e-08 log_uniform
--beta1 Beta1 passed to a tensorflow.keras.optimizers.Optimizer 0.9 log_uniform
--beta2 Beta2 passed to a tensorflow.keras.optimizers.Optimizer 0.999 log_uniform
--buffer_size Maximum capacity of replay buffer 100 categorical
--checkpoints Path(s) to new model(s) to which checkpoint(s) will be saved during training - -
--clip-norm Clipping value passed to tf.clip_by_value() 0.1 log_uniform
--display-precision Number of decimals to be displayed 2 -
--divergence-monitoring-steps Steps after which plateau and early stopping monitoring become active - -
--early-stop-patience Number of plateau reductions after which training is stopped 3 -
--entropy-coef Entropy coefficient for loss calculation 0.01 log_uniform
--env gym environment id - -
--gamma Discount factor 0.99 log_uniform
--grad-norm Gradient clipping value passed to tf.clip_by_value() 0.5 log_uniform
--history-checkpoint Path to .parquet file to save training history - -
--lam GAE-Lambda for advantage estimation 0.95 log_uniform
--log-frequency Log progress every n games - -
--lr Learning rate passed to a tensorflow.keras.optimizers.Optimizer 0.0007 log_uniform
--max-frame If specified, max & skip will be applied during preprocessing - categorical
--max-steps Maximum number of environment steps, when reached, training is stopped - -
--mini-batches Number of mini-batches to use per update 4 categorical
--model Path to model .cfg file - -
--monitor-session Wandb session name - -
--n-envs Number of environments to create 1 categorical
--n-steps Transition steps 128 categorical
--num_reuse_each_iter Number of randomly sampled transitions from each behavioral policy in the reuse set 3 categorical
--opt-epsilon Epsilon passed to a tensorflow.keras.optimizers.Optimizer 1e-07 log_uniform
--plateau-reduce-factor Factor multiplied by current learning rate when there is a plateau 0.9 -
--plateau-reduce-patience Number of non-improvements before the learning rate is reduced 10 -
--ppo-epochs Gradient updates per training step 4 categorical
--preprocess If specified, states will be treated as atari frames and preprocessed accordingly - -
--quiet If specified, no messages by the agent will be displayed to the console - -
--reward-buffer-size Size of the total reward buffer, used for calculating the mean reward value to be displayed 100 -
--seed Random seed - -
--target-reward Target reward when reached, training is stopped - -
--value-loss-coef Value loss coefficient for value loss calculation 0.5 log_uniform
--weights Path(s) to model(s) weight(s) to be loaded by the agent - -

Command line

vrer-pg train ppo --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --preprocess --checkpoints ppo-pong.tf

or

vrer-pg train ppovrer --env BipedalWalker-v3 --target-reward 200 --n-envs 16 --checkpoints ppo-bipedal-walker.tf

4.2.1 PPO Non-command line

''' PPO '''
from tensorflow.keras.optimizers import Adam
import os
from vrer_policy_gradient import PPO
from vrer_policy_gradient.utils.common import ModelReader, create_envs

for i in range(5):
    seed = i + 2021
    n_envs = 4
    n_steps = 128
    clip_norm = 0.2
    entropy_coef = 0.01
    mini_batches = 128
    lr = 0.0003
    max_steps = 1000000
    target_reward = -70
    problem = 'Acrobot-v1'
    checkpoints = 'ppo-acrobot-v1-seed-{}.tf'.format(i)
    history_checkpoint = 'ppo-acrobot-v1-seed-{}.parquet'.format(i)
    path = "PPO/{}/approx-2nd-seed-{}-id-{}/".format(problem, seed, i)
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
    checkpoints = path + checkpoints
    history_checkpoint = path + history_checkpoint

    envs = create_envs(problem, n_envs, False)
    model_cfg = 'model/ann-actor-critic.cfg' # rlalgorithms_tf2.agents['ppo']['model']['ann'][0]
    optimizer = Adam(learning_rate=lr)
    model = ModelReader(
        model_cfg,
        seed=seed,
        output_units=[envs[0].action_space.n, 1],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()
    agent = PPO(envs, model, seed=seed, n_steps=n_steps, entropy_coef=entropy_coef, mini_batches=mini_batches, clip_norm=clip_norm, checkpoints=[checkpoints], history_checkpoint=history_checkpoint, log_frequency=4)
    agent.fit(target_reward=target_reward, max_steps=max_steps)

4.2.2 PPO-VRER Non-command line

''' PPO-VRER '''
from tensorflow.keras.optimizers import Adam
import os
from vrer_policy_gradient import PPOVRER
from vrer_policy_gradient.utils.common import ModelReader, create_envs
import vrer_policy_gradient

for i in range(5):
    seed = i + 2021
    n_envs = 4
    n_steps = 128
    clip_norm = 0.2
    entropy_coef = 0.01
    mini_batches = 128
    lr = 0.0003
    max_steps = 400000
    buffer_size = 400
    target_reward = 201
    problem = 'CartPole-v0'
    checkpoints = 'ppo-problem-{}-buffer_size-{}-seed-{}.tf'.format(problem, buffer_size, i)
    history_checkpoint = 'ppo-problem-{}-buffer_size-{}-seed-{}.parquet'.format(problem, buffer_size, i)
    path = "PPO/{}/approx-2nd-buffer_size-{}-seed-{}-id-{}/".format(problem, buffer_size, seed, i)
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
    checkpoints = path + checkpoints
    history_checkpoint = path + history_checkpoint

    envs = create_envs(problem, n_envs, False)

    model_cfg = vrer_policy_gradient.agents['ppovrer']['model']['ann'][0]
    optimizer = Adam(learning_rate=lr)
    model = ModelReader(
        model_cfg,
        seed=seed,
        output_units=[envs[0].action_space.n, 1],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()
    agent = PPOVRER(envs, model, seed=seed, n_steps=n_steps, entropy_coef=entropy_coef, mini_batches=mini_batches, clip_norm=clip_norm, checkpoints=[checkpoints], history_checkpoint=history_checkpoint, log_frequency=4, buffer_size=buffer_size)
    agent.fit(target_reward=target_reward, max_steps=max_steps)

5. Contact


Website: https://zhenghuazx.github.io/hua.zheng/

Email: hua.zheng0908@gmail.com

Project link: https://github.com/zhenghuazx/vrer_policy_gradient

6. Cite Us


The paper is under review. Please check the arXiv page: https://arxiv.org/abs/2110.08902.
