Why it does not success even when the score is high? #378

alexxchen · 2022-08-14T06:13:08Z

I tested some environments with different task settings. For example, on reach-v2, I find it strange that the agent does not succeed even when the score is 4700+. I tried adding an extra reward on success, but it didn't help.
Are there any tricks to solve this problem?

avnishn · 2022-08-24T18:51:16Z

Hmm, reach is a pretty easy task. Are you computing your success metric correctly? There is a guide on how to do so in the paper.

alexxchen · 2022-08-30T06:40:08Z

> Hmm, reach is a pretty easy task. Are you computing your success metric correctly? There is a guide on how to do so in the paper.

Yes, reach-v2 is very easy to optimize. The score rises quickly under an evolutionary algorithm. But a high score does not necessarily lead to success at the end.

gunnxx · 2022-11-09T11:06:24Z

I have a similar issue. I only tested on the MT10 tasks, training a separate policy on each task. I run an evaluation every 160k environment steps; each evaluation runs 50 episodes and computes the success percentage over those 50 episodes. Higher reward does not always correlate with a higher success rate.

[Plots omitted: reach-v2; window-open-v2 (red) and window-close-v2 (green); drawer-open-v2 (orange) and drawer-close-v2 (blue)]

I trained using stable_baselines3.SAC.
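For readers comparing the two numbers, here is a minimal sketch of how a success rate over evaluation episodes can be computed separately from the return. It assumes an episode counts as successful if `info["success"]` is positive at any step, which is only one possible convention (check the paper for the recommended one). The function names and the toy data are hypothetical, not part of Meta-World or stable_baselines3.

```python
from typing import Sequence


def episode_succeeded(success_flags: Sequence[float]) -> bool:
    """Assumed convention: success at ANY step marks the episode successful."""
    return any(flag > 0 for flag in success_flags)


def success_rate(episodes: Sequence[Sequence[float]]) -> float:
    """Fraction of episodes that succeeded, given per-step success flags."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(episode_succeeded(ep) for ep in episodes) / len(episodes)


def mean_return(rewards: Sequence[Sequence[float]]) -> float:
    """Average undiscounted return over episodes."""
    if not rewards:
        raise ValueError("need at least one episode")
    return sum(sum(ep) for ep in rewards) / len(rewards)


# Toy data (made up): episode 0 collects more reward but never succeeds,
# episode 1 collects less reward but does succeed.
toy_flags = [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
toy_rewards = [[9.0, 9.5, 9.5], [2.0, 10.0, 10.0]]

rate = success_rate(toy_flags)
avg_return = mean_return(toy_rewards)
```

The toy data is only there to show that the two metrics are computed from different signals, so one can be high while the other is low.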

gunnxx · 2022-11-10T05:59:39Z

To reproduce, here is the code:

from typing import Any, Dict, List, Tuple

import gym
import metaworld
import numpy as np
import os.path
import random

from gym.wrappers import TimeLimit
from metaworld.envs.mujoco.sawyer_xyz.sawyer_xyz_env import SawyerXYZEnv
from seals.util import AbsorbAfterDoneWrapper
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import get_latest_run_id


class ChangingTaskAndDoneTerminalWrapper(gym.Wrapper):
    """
    Option to change task every time the environment is reset
    and `done` returns `True` when `info["success"]=True`.

    `done` signal is needed to reset the environment in the sb3.
    Sb3 can handle `done` signal due to terminal states or time limit.
    """

    def __init__(self,
        env: SawyerXYZEnv,
        tasks: List[metaworld.Task],
        change_task: bool = False) -> None:
        """
        Constructor.

        :param env: Environment.
        :param tasks: List of metaworld task objects.
        :param change_task: Whether to change task every reset.
        """
        super().__init__(env)
        self.tasks = tasks
        self.change_task = change_task
        self.env.set_task(self.tasks[0])
    

    def reset(self, **kwargs) -> Any:
        """
        Reset the environment.
        """
        if self.change_task:
            task = random.choice(self.tasks)
            self.env.set_task(task)

        return super().reset(**kwargs)
    

    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, Dict[str, Any]]:
        """
        Step in the environment.
        """
        obs, rew, _, info = super().step(action)
        ## `sb3.EvalCallback` looks at `info["is_success"]` to log the success rate
        info["is_success"] = info["success"]
        return obs, rew, info["success"], info


def create_single_env(
    env_name: str,
    changing_inner_task: bool,
    handle_absorb_state: bool = False) -> gym.Env:
    """
    Create single gym environment.
    Note that we use `seals.util.AbsorbAfterDoneWrapper` and it should be before `gym.wrappers.TimeLimit`

    :param env_name: Name of the environment. Only `metaworld.MT10` tasks!
    :param changing_inner_task: Whether to change inner task every reset.
    :param handle_absorb_state: Whether to use `seals.util.AbsorbAfterDoneWrapper` or not.

    Return:
        Single gym environment.
    """
    mt1 = metaworld.MT1(env_name)
    env = mt1.train_classes[env_name]()
    env = ChangingTaskAndDoneTerminalWrapper(env, mt1.train_tasks, changing_inner_task)
    if handle_absorb_state: env = AbsorbAfterDoneWrapper(env)
    env = TimeLimit(env, env.max_path_length)
    return env


"""
Main function.
"""

env_name = "reach-v2"
changing_inner_task = True
seed = 0
device = "cuda:0"

## set log directory
tb_logdir = "logs/%s" % env_name
tb_logname = "sac"

if changing_inner_task: tb_logdir = tb_logdir + "/changing_goal"
else: tb_logdir = tb_logdir + "/static_goal"

latest_run_id = get_latest_run_id(tb_logdir, tb_logname)

## create env
vecenv = make_vec_env(
    env_id  = lambda: create_single_env(env_name, changing_inner_task),
    n_envs  = 8,
    seed    = seed
)

eval_vecenv = make_vec_env(
    env_id  = lambda: create_single_env(env_name, changing_inner_task),
    n_envs  = 10,
    seed    = seed + 1
)

## create algo
algo = SAC(
    policy          = "MlpPolicy",
    env             = vecenv,
    learning_rate   = 0.0003,
    buffer_size     = int(1e6),
    learning_starts = int(1e3),
    batch_size      = 512,
    tau             = 0.005,
    gamma           = 0.99,
    train_freq      = 512,              ## train every `train_freq * vecenv.num_envs`
    gradient_steps  = 128,
    policy_kwargs   = {"net_arch": {"pi": [256, 256], "qf": [256, 256]}},
    tensorboard_log = tb_logdir,
    device          = device,
    seed            = seed,
    verbose         = 1
)

## create callback
eval_callback = EvalCallback(
    eval_env                = eval_vecenv,
    n_eval_episodes         = 50,
    eval_freq               = 20000,    ## eval every `eval_freq * vecenv.num_envs`
    best_model_save_path    = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
    log_path                = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
)

algo.learn(
    total_timesteps = int(1e8),
    callback        = eval_callback,
    tb_log_name     = tb_logname
)

seolhokim · 2022-11-11T07:33:46Z

Same issue, even in a fixed-goal environment.

gunnxx · 2022-11-28T13:27:17Z

I found the culprit.

1. The metaworld.Benchmark object should only be created once. Recreating the benchmark object samples new tasks (i.e., start/object/goal positions). In my code, I call create_single_env multiple times, so metaworld.MT1 is created multiple times. This messes up the evaluation.
2. Try not to end the episode when the agent successfully reaches the goal, i.e., don't terminate when info["success"] == True. Somehow it destabilizes training.

Orange: I let the environment run for the full 500 timesteps even when the agent reaches the goal in the middle of the episode.
Blue: I stop the episode when info["success"] == True.
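The first point above can be sketched as follows: build the benchmark once and hand the same object to every environment factory, instead of constructing `metaworld.MT1` inside the factory. This is a minimal sketch, not the exact fix used in the thread. `make_env_factory` is a hypothetical helper; it only relies on the `train_classes`, `train_tasks` and `set_task` names already used in the reproduction code.

```python
from typing import Any, Callable


def make_env_factory(benchmark: Any, env_name: str) -> Callable[[], Any]:
    """Return a zero-argument env factory bound to ONE shared benchmark.

    `benchmark` is expected to expose `train_classes` and `train_tasks`,
    as metaworld.MT1 does in the reproduction code above.
    """
    env_cls = benchmark.train_classes[env_name]
    tasks = list(benchmark.train_tasks)  # same task list for every env

    def _factory() -> Any:
        env = env_cls()
        env.set_task(tasks[0])  # fixed task; sample from `tasks` if desired
        return env

    return _factory


# Intended usage (requires metaworld, so it is left as a comment here):
#
#   import metaworld
#   mt1 = metaworld.MT1("reach-v2")            # created exactly once
#   train_factory = make_env_factory(mt1, "reach-v2")
#   eval_factory = make_env_factory(mt1, "reach-v2")
```

Because both factories close over the same benchmark object, training and evaluation environments draw from the same sampled tasks.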

krzentner · 2023-01-26T19:59:05Z

Yeah, every time you create a benchmark object it samples 50 new goal locations. That's intended behavior, so I'm going to close this. I have added another bug for the button-press environments, which do genuinely have this problem.

krzentner · 2023-01-26T20:03:58Z

Also worth mentioning: the training instability when success is achieved is known, which is one reason why success states are not terminal states. Adding a wrapper to make them terminal states is not recommended.
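One possible alternative to a terminal-on-success wrapper is to record success without touching `done`. The sketch below is an assumption-laden illustration: `LatchedSuccessWrapper` is a hypothetical name, it is duck-typed rather than a `gym.Wrapper` subclass so that it runs without gym installed, and it uses the old 4-tuple `step` API from the reproduction code. It latches success so that `info["is_success"]` is still true on the final step of the episode, on the assumption that the evaluation callback reads that key when the episode ends.

```python
from typing import Any, Dict, Tuple


class LatchedSuccessWrapper:
    """Report success in `info` without ever ending the episode early."""

    def __init__(self, env: Any) -> None:
        self.env = env
        self._succeeded = False

    def reset(self, **kwargs: Any) -> Any:
        self._succeeded = False  # clear the latch for the new episode
        return self.env.reset(**kwargs)

    def step(self, action: Any) -> Tuple[Any, float, bool, Dict[str, Any]]:
        obs, rew, done, info = self.env.step(action)
        # Remember success if it was observed at any step of this episode.
        self._succeeded = self._succeeded or bool(info.get("success", 0.0))
        info["is_success"] = self._succeeded
        # `done` is passed through unchanged: success is NOT terminal.
        return obs, rew, done, info
```

In a real setup this logic would more naturally live in a `gym.Wrapper` subclass placed under the time-limit wrapper, so that episodes always run to the full length.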

goldbird5 · 2023-07-27T02:24:19Z

This might be an additional insight.

I trained a single task with SAC, terminating the episode on success (info["success"] == True), and I found the same situation.
Rendering shows that the agent stops right before the goal and just stays still.

This may be because hovering is simply the best way to maximize the return: the agent keeps collecting rewards (between 0 and 10) until max_path_length, rather than terminating with a single reward of 10.
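The argument above can be checked with a small calculation. This is only a sketch under stated assumptions: the hover reward of 9.0 and the 400 remaining steps are made-up illustrative values, and the discount factor 0.99 is taken from the SAC configuration earlier in the thread.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the reward sequence."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


GAMMA = 0.99          # from the SAC config above
HOVER_REWARD = 9.0    # assumption: reward per step while waiting near the goal
STEPS_LEFT = 400      # assumption: steps remaining when the agent gets close

# Option A: hover next to the goal and keep collecting reward.
hover_return = discounted_return([HOVER_REWARD] * STEPS_LEFT, GAMMA)

# Option B: touch the goal, receive 10 once, and the episode terminates.
terminate_return = discounted_return([10.0], GAMMA)

hovering_is_better = hover_return > terminate_return
```

Under these assumptions hovering yields a far larger return than terminating, which is consistent with the behaviour seen in the rendering.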

pseudo-rnd-thoughts · 2023-07-29T20:51:19Z

@reginald-mclean is there any point in modifying the reward function so that this behaviour is not optimal?

reginald-mclean · 2023-07-29T21:04:21Z

Yeah, there could be some value there. I would just be concerned about what that reward would have to be to avoid this behaviour. Tuning the reward function would mean giving the agent a very large value when success=True, and that large value could destabilize value function training.

krzentner · 2023-07-31T23:41:59Z

One thing to note is that Meta-World intentionally does not terminate the episode on success, to avoid this problem. You should always use episodes of length 500, without any terminal states. Formally, every Meta-World task is an infinite-horizon MDP, from which a finite 500-step sequence is sampled for each episode.

To reliably avoid this problem, the reward on termination would need to be two orders of magnitude larger than at any other timestep (technically, 5000 vs 10), which would destabilize most value function training.
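The 5000 vs 10 figure can be reproduced with simple arithmetic. The sketch below assumes undiscounted returns, a maximum per-step reward of 10 and 500 steps per episode, as stated in the thread; `min_terminal_bonus` is a hypothetical helper name.

```python
MAX_STEP_REWARD = 10.0   # per-step reward upper bound, as stated above
EPISODE_LENGTH = 500     # steps per episode, as stated above


def min_terminal_bonus(max_step_reward: float, steps_remaining: int) -> float:
    """Upper bound on the reward forgone by terminating early.

    A terminal bonus must be at least this large for early termination to be
    no worse than collecting the maximum reward on every remaining step.
    """
    return max_step_reward * steps_remaining


# Worst case: success is possible at the first step, so the agent would give
# up (almost) a full episode of maximum reward by terminating.
worst_case_bonus = min_terminal_bonus(MAX_STEP_REWARD, EPISODE_LENGTH)
ratio_to_step_reward = worst_case_bonus / MAX_STEP_REWARD
```

The ratio of 500 between the required bonus and an ordinary step reward is the "two orders of magnitude" gap mentioned above.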

alexxchen changed the title ~~Why it does success even when the score is high?~~ Why it does not success even when the score is high? Aug 14, 2022

krzentner closed this as completed Jan 26, 2023
