Closed
Description
I've tried running CleanRL's PPO on Pendulum-v1 and cannot get it to reach an optimal solution. However, similar experiments with stable-baselines3's PPO worked.
I used the CleanRL PPO command below, along with several hyperparameter variants. I have tried toggling learning-rate annealing and clip_vloss, but I just don't see where I am going wrong.
python -m cleanrl.ppo_continuous_action --env-id Pendulum-v1 --exp-name ppo --seed 1 --track --wandb-project-name cleanRL --gamma 0.99 --vf-coef 0.5 --ent-coef 0.01 --norm-adv --num-envs 4 --clip-coef 0.2 --num-steps 2048 --clip-vloss --gae-lambda 0.95 --learning-rate 3e-4 --anneal-lr --max-grad-norm 0.5 --update-epochs 20 --num-minibatches 4 --total-timesteps 5000000 --torch-deterministic --capture-video --exp-name testing_video --target-kl 0.01
The stable-baselines3 script below achieves the optimal solution. However, SB3 is too complex to understand, which is why I opted for CleanRL.
import os

import gymnasium as gym
from stable_baselines3 import PPO


def train():
    # Create environment (no render during training)
    env = gym.make("Pendulum-v1")

    # Create the PPO model with tuned hyperparameters
    model = PPO(
        "MlpPolicy",
        env,
        # Larger network
        policy_kwargs=dict(net_arch=dict(pi=[256, 256], vf=[256, 256])),
        # Learning parameters
        learning_rate=1e-3,
        n_steps=2048,
        batch_size=64,
        n_epochs=20,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        # Increase entropy coefficient to encourage exploration
        ent_coef=0.01,
        verbose=1,
        device="auto",
    )

    # Train for longer (200k steps)
    model.learn(total_timesteps=200000)

    # Save the model
    os.makedirs("models", exist_ok=True)
    model.save("models/ppo_pendulum")
    env.close()


if __name__ == "__main__":
    train()

In addition, CleanRL's rpo_continuous_action works too:
python -m cleanrl.rpo_continuous_action --no-cuda --track --wandb-project-name cleanRL --env-id Pendulum-v1 --seed 5
So what's the issue with cleanRL/ppo_continuous_action? Am I missing something?
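For reference, here is a rough sketch that puts the update schedules of the two PPO configurations above side by side. It is only arithmetic on the numbers already quoted, not a diagnosis. The `PPOSchedule` class is a hypothetical helper written for this comparison and belongs to neither library. It assumes that CleanRL derives its minibatch size as (num_envs * num_steps) // num_minibatches, and that SB3's `batch_size` is the minibatch size drawn from a rollout of n_steps * n_envs transitions. The update counts are upper bounds, since `--target-kl 0.01` may stop the CleanRL epochs early.

```python
from dataclasses import dataclass


@dataclass
class PPOSchedule:
    """Hypothetical helper: update schedule implied by a PPO configuration."""

    num_envs: int
    num_steps: int        # steps collected per environment per rollout
    update_epochs: int    # passes over each rollout
    minibatch_size: int   # samples per gradient step

    @property
    def rollout_size(self) -> int:
        # Transitions gathered before each optimisation phase.
        return self.num_envs * self.num_steps

    @property
    def minibatches_per_epoch(self) -> int:
        return self.rollout_size // self.minibatch_size

    @property
    def max_updates_per_rollout(self) -> int:
        # Upper bound: KL-based early stopping can reduce this.
        return self.update_epochs * self.minibatches_per_epoch

    @property
    def max_updates_per_1k_steps(self) -> float:
        return 1000 * self.max_updates_per_rollout / self.rollout_size


# CleanRL command: --num-envs 4 --num-steps 2048 --num-minibatches 4 --update-epochs 20
cleanrl = PPOSchedule(num_envs=4, num_steps=2048, update_epochs=20,
                      minibatch_size=(4 * 2048) // 4)

# SB3 script: one environment, n_steps=2048, batch_size=64, n_epochs=20
sb3 = PPOSchedule(num_envs=1, num_steps=2048, update_epochs=20,
                  minibatch_size=64)

for name, cfg in [("cleanrl", cleanrl), ("sb3", sb3)]:
    print(name, cfg.rollout_size, cfg.minibatch_size,
          cfg.max_updates_per_rollout, cfg.max_updates_per_1k_steps)
```

Under these assumptions, the SB3 run takes up to 640 gradient steps per 2048 environment steps with minibatches of 64, while the CleanRL run takes at most 80 gradient steps per 8192 environment steps with minibatches of 2048. The two runs also differ in learning rate (1e-3 constant versus 3e-4 annealed) and in the SB3 script's explicit 256x256 network, so they are not a like-for-like comparison.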