
ppo_continuous_action doesn't work on Pendulum-v1? #504

@RishiMalhotra920

Description


I've tried running CleanRL/PPO on Pendulum-v1 and I cannot get it to reach an optimal solution. However, similar experiments with stable-baselines3/ppo worked.

I've used the CleanRL/ppo command below and several hyperparameter variants. I have tried toggling learning-rate annealing and clip_vloss, but I just don't see where I am going wrong.

python -m cleanrl.ppo_continuous_action --env-id Pendulum-v1 --exp-name ppo --seed 1 --track --wandb-project-name cleanRL --gamma 0.99 --vf-coef 0.5 --ent-coef 0.01 --norm-adv --num-envs 4 --clip-coef 0.2 --num-steps 2048 --clip-vloss --gae-lambda 0.95 --learning-rate 3e-4 --anneal-lr --max-grad-norm 0.5 --update-epochs 20 --num-minibatches 4 --total-timesteps 5000000 --torch-deterministic --capture-video --exp-name testing_video --target-kl 0.01

The stable-baselines3 script below achieves the optimal solution. However, sb3 is too complex to understand, which is why I opted for CleanRL.

import os

import gymnasium as gym
from stable_baselines3 import PPO


def train():
    # Create environment (no render during training)
    env = gym.make("Pendulum-v1")

    # Create the PPO model with tuned hyperparameters
    model = PPO(
        "MlpPolicy",
        env,
        # Larger network
        policy_kwargs=dict(net_arch=dict(pi=[256, 256], vf=[256, 256])),
        # Learning parameters
        learning_rate=1e-3,
        n_steps=2048,
        batch_size=64,
        n_epochs=20,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        # Increase entropy coefficient to encourage exploration
        ent_coef=0.01,
        verbose=1,
        device="auto",
    )

    # Train for longer (200k steps)
    model.learn(total_timesteps=200000)

    # Save the model
    os.makedirs("models", exist_ok=True)
    model.save("models/ppo_pendulum")

    env.close()


if __name__ == "__main__":
    train()

In addition, cleanrl/rpo_continuous_action works too.

python -m cleanrl.rpo_continuous_action --no-cuda --track --wandb-project-name cleanRL --env-id Pendulum-v1 --seed 5
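For what it's worth, the two runs above are not hyperparameter-equivalent. A quick sketch of the arithmetic, using only the values from my command and script (whether this explains the gap is an open question):

```python
# CleanRL command: 4 envs x 2048 steps per rollout, split into 4 minibatches.
cleanrl_rollout = 4 * 2048                 # 8192 transitions per update
cleanrl_minibatch = cleanrl_rollout // 4   # 2048 transitions per gradient step

# SB3 script: 1 env x 2048 steps per rollout, batch_size=64.
sb3_rollout = 1 * 2048                     # 2048 transitions per update
sb3_minibatch = 64                         # 64 transitions per gradient step

# The CleanRL run takes far fewer, much larger gradient steps per epoch.
print(cleanrl_minibatch // sb3_minibatch)  # → 32
```

The SB3 run also uses a 1e-3 learning rate with [256, 256] networks, versus 3e-4 with CleanRL's default network, plus a --target-kl 0.01 early stop that the SB3 script doesn't set.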

So what's the issue with cleanRL/ppo_continuous_action? Am I missing something?
