
Why it does not success even when the score is high? #378

Closed
alexxchen opened this issue Aug 14, 2022 · 12 comments

Comments

@alexxchen

I tested some environments with different task settings. For example, on reach-v2 I find it strange that the agent does not succeed even when the score is 4700+. I tried adding an extra score on success, but it didn't work.
Are there any tricks to solve this problem?

@alexxchen alexxchen changed the title from "Why it does success even when the score is high?" to "Why it does not success even when the score is high?" on Aug 14, 2022
@avnishn
Contributor

avnishn commented Aug 24, 2022

Hmm, reach is a pretty easy task. Are you computing your success metric correctly? There is a guide on how to do so in the paper.
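
(For reference, a minimal sketch of the usual convention, assuming the pre-gymnasium gym API used elsewhere in this thread: an episode counts as a success if info["success"] is 1 at any timestep of the fixed-length episode. The random-action policy below is just a placeholder.)

import metaworld

# Minimal sketch, not the exact procedure from the paper: an episode counts as
# a success if info["success"] == 1 at ANY timestep, not only at the final one.
mt1 = metaworld.MT1("reach-v2")                  # build the benchmark once
env = mt1.train_classes["reach-v2"]()
env.set_task(mt1.train_tasks[0])

def run_episode(env, policy):
    """Roll out one fixed-length episode and return (total reward, success)."""
    obs = env.reset()
    total_reward, solved = 0.0, False
    for _ in range(env.max_path_length):         # always 500 steps, no early stop
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
        solved = solved or bool(info["success"])
    return total_reward, solved

# Placeholder policy: random actions.
score, success = run_episode(env, lambda obs: env.action_space.sample())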

@alexxchen
Author

> Hmm, reach is a pretty easy task. Are you computing your success metric correctly? There is a guide on how to do so in the paper.

Yes, reach-v2 is very easy to optimize. The score rises quickly under an evolutionary algorithm, but it seems that a high score does not necessarily lead to final success.

@gunnxx

gunnxx commented Nov 9, 2022

I have a similar issue. I only tested on the MT10 tasks and trained a separate policy on each of them. I run an evaluation every 160k environment steps; each evaluation runs 50 episodes and computes the success percentage over those 50 episodes. Higher reward does not always correlate positively with a higher success rate.

reach-v2: [plot]

window-open-v2 (red) and window-close-v2 (green): [plot]

drawer-open-v2 (orange) and drawer-close-v2 (blue): [plot]

I trained using stable_baselines3.SAC.
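
(For what it's worth, a rough sketch — not the exact evaluation code used above — of logging mean return next to success rate for a trained SB3 policy, so the mismatch between the two is visible directly. It assumes `env` is a single, non-vectorized Meta-World environment built the same way as during training, with the old gym 4-tuple step API.)

import numpy as np
from stable_baselines3 import SAC

def evaluate(model: SAC, env, n_episodes: int = 50):
    """Run full-length episodes and report (mean return, success rate)."""
    returns, successes = [], []
    for _ in range(n_episodes):
        obs = env.reset()
        ep_return, solved = 0.0, False
        for _ in range(env.max_path_length):          # full 500-step episodes
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
            ep_return += reward
            solved = solved or bool(info["success"])
        returns.append(ep_return)
        successes.append(float(solved))
    return float(np.mean(returns)), float(np.mean(successes))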

@gunnxx

gunnxx commented Nov 10, 2022

To reproduce, here is the code:

from typing import Any, Dict, List, Tuple

import gym
import metaworld
import numpy as np
import os.path
import random

from gym.wrappers import TimeLimit
from metaworld.envs.mujoco.sawyer_xyz.sawyer_xyz_env import SawyerXYZEnv
from seals.util import AbsorbAfterDoneWrapper
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import get_latest_run_id


class ChangingTaskAndDoneTerminalWrapper(gym.Wrapper):
    """
    Option to change task every time the environment is reset
    and `done` returns `True` when `info["success"]=True`.

    `done` signal is needed to reset the environment in the sb3.
    Sb3 can handle `done` signal due to terminal states or time limit.
    """

    def __init__(self,
        env: SawyerXYZEnv,
        tasks: List[metaworld.Task],
        change_task: bool = False) -> None:
        """
        Constructor.

        :param env: Environment.
        :param tasks: List of metaworld task objects.
        :param change_task: Whether to change task every reset.
        """
        super().__init__(env)
        self.tasks = tasks
        self.change_task = change_task
        self.env.set_task(self.tasks[0])
    

    def reset(self, **kwargs) -> Any:
        """
        Reset the environment.
        """
        if self.change_task:
            task = random.choice(self.tasks)
            self.env.set_task(task)

        return super().reset(**kwargs)
    

    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, Dict[str, Any]]:
        """
        Step in the environment.
        """
        obs, rew, _, info = super().step(action)
        ## `sb3.EvalCallback` looks at "is_success" to log the success rate
        info["is_success"] = info["success"]
        done = bool(info["success"])
        return obs, rew, done, info


def create_single_env(
    env_name: str,
    changing_inner_task: bool,
    handle_absorb_state: bool = False) -> gym.Env:
    """
    Create single gym environment.
    Note that `seals.util.AbsorbAfterDoneWrapper`, if used, must be applied before `gym.wrappers.TimeLimit`.

    :param env_name: Name of the environment. Only `metaworld.MT10` tasks!
    :param changing_inner_task: Whether to change inner task every reset.
    :param handle_absorb_state: Whether to use `seals.util.AbsorbAfterDoneWrapper` or not.

    Return:
        Single gym environment.
    """
    mt1 = metaworld.MT1(env_name)
    env = mt1.train_classes[env_name]()
    env = ChangingTaskAndDoneTerminalWrapper(env, mt1.train_tasks, changing_inner_task)
    if handle_absorb_state: env = AbsorbAfterDoneWrapper(env)
    env = TimeLimit(env, env.max_path_length)
    return env


"""
Main script.
"""

env_name = "reach-v2"
changing_inner_task = True
seed = 0
device = "cuda:0"

## set log directory
tb_logdir = "logs/%s" % env_name
tb_logname = "sac"

if changing_inner_task: tb_logdir = tb_logdir + "/changing_goal"
else: tb_logdir = tb_logdir + "/static_goal"

latest_run_id = get_latest_run_id(tb_logdir, tb_logname)

## create env
vecenv = make_vec_env(
    env_id  = lambda: create_single_env(env_name, changing_inner_task),
    n_envs  = 8,
    seed    = seed
)

eval_vecenv = make_vec_env(
    env_id  = lambda: create_single_env(env_name, changing_inner_task),
    n_envs  = 10,
    seed    = seed + 1
)

## create algo
algo = SAC(
    policy          = "MlpPolicy",
    env             = vecenv,
    learning_rate   = 0.0003,
    buffer_size     = int(1e6),
    learning_starts = int(1e3),
    batch_size      = 512,
    tau             = 0.005,
    gamma           = 0.99,
    train_freq      = 512,              ## train every `train_freq * vecenv.num_envs`
    gradient_steps  = 128,
    policy_kwargs   = {"net_arch": {"pi": [256, 256], "qf": [256, 256]}},
    tensorboard_log = tb_logdir,
    device          = device,
    seed            = seed,
    verbose         = 1
)

## create callback
eval_callback = EvalCallback(
    eval_env                = eval_vecenv,
    n_eval_episodes         = 50,
    eval_freq               = 20000,    ## eval every `eval_freq * vecenv.num_envs`
    best_model_save_path    = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
    log_path                = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
)

algo.learn(
    total_timesteps = int(1e8),
    callback        = eval_callback,
    tb_log_name     = tb_logname
)

@seolhokim

seolhokim commented Nov 11, 2022

Same issue, even in a fixed-goal environment.

@gunnxx

gunnxx commented Nov 28, 2022

I found the culprit.

  1. The metaworld.Benchmark object should only be created once. Recreating the benchmark object samples new tasks (i.e. new start/object/goal positions). In my code, create_single_env is called multiple times, so metaworld.MT1 is created multiple times, which messes up the evaluation. (See the sketch after this comment.)
  2. Do not stop the episode even when the agent successfully reaches the goal, i.e. do not terminate when info["success"] == True. Somehow it destabilizes training.
    [Plot comparing the two runs:]
    Orange: the environment runs for the full 500 timesteps even if the goal is reached in the middle of the episode.
    Blue: the episode stops when info["success"] == True.
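
(A minimal sketch of point 1; the names here are illustrative, not the exact fix applied. Build metaworld.MT1 once and reuse its tasks in every environment instance, instead of constructing MT1 inside the env factory.)

import metaworld

ENV_NAME = "reach-v2"
MT1 = metaworld.MT1(ENV_NAME)         # created exactly once; re-creating it
                                      # re-samples the 50 start/goal positions

def make_env():
    env = MT1.train_classes[ENV_NAME]()
    env.set_task(MT1.train_tasks[0])  # or hand MT1.train_tasks to a wrapper
    return env

train_env = make_env()                # both envs now share the same task set
eval_env = make_env()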

@krzentner
Contributor

krzentner commented Jan 26, 2023

Yeah, every time you create a benchmark object it samples 50 new goal locations. That's intended behavior, so I'm going to close this. I have opened a separate issue for the button-press environments, which genuinely do have this problem.

@krzentner
Contributor

Also worth mentioning: the training instability when success is achieved is a known problem, which is one reason why success states are not terminal states. Adding a wrapper to make them terminal states is not recommended.
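
(A sketch of that recommendation; the SuccessInfoWrapper name is made up for illustration, and the pre-gymnasium gym API from the code above is assumed. Success is exposed only through info, and TimeLimit is the only thing that ends an episode.)

import gym
from gym.wrappers import TimeLimit

class SuccessInfoWrapper(gym.Wrapper):
    """Expose success to loggers via info["is_success"] without ever
    turning it into a terminal `done` signal."""
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["is_success"] = bool(info["success"])
        return obs, reward, done, info           # `done` is left untouched

def make_env(env_cls, task) -> gym.Env:
    """`env_cls` and `task` come from a benchmark object created once."""
    env = env_cls()
    env.set_task(task)
    env = SuccessInfoWrapper(env)
    # Episodes always run the full 500 steps; success never terminates them.
    return TimeLimit(env, max_episode_steps=env.max_path_length)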

@goldbird5

Maybe this can offer an additional insight.

I trained a single task with SAC, terminating the episode on success (info["success"] == True), and ran into the same situation.
Rendering shows that the agent stops right before the goal and just remains still.

This is likely because that is the best way to maximize the return: the agent keeps collecting rewards (between 0 and 10) until max_path_length rather than terminating and receiving a single reward of 10.
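
(A rough back-of-the-envelope comparison; all numbers are illustrative, using the [0, 10] per-step reward range mentioned above.)

max_path_length = 500
steps_to_reach = 50        # assumed number of steps to first reach the goal
avg_shaped_reward = 5      # assumed average per-step reward on the way there

# Hover just short of the goal for the rest of the episode at ~10 per step:
return_keep_going = steps_to_reach * avg_shaped_reward \
    + (max_path_length - steps_to_reach) * 10                # ~4750

# Terminate on success: rewards on the way there, plus one final ~10:
return_terminate = steps_to_reach * avg_shaped_reward + 10   # ~260

print(return_keep_going, return_terminate)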

@pseudo-rnd-thoughts
Member

@reginald-mclean is there any point in modifying the reward function so that this behaviour is not optimal?

@reginald-mclean
Collaborator

Yeah, there could be some value there. I would just be concerned about what that reward would need to be to avoid this behaviour: tuning the reward function would mean giving the agent a very large value when success=True, and that large value could destabilize value-function training.

@krzentner
Contributor

krzentner commented Jul 31, 2023

One thing to note is that Meta-World intentionally does not terminate the episode on success, to avoid this problem. You should always use episodes of length 500, without any terminal states. Formally, every Meta-World task is an infinite-horizon MDP, from which a finite 500-state sequence is sampled for each episode.

To reliably avoid this problem, the reward on termination would need to be two orders of magnitude larger than at any other timestep (technically, 5000 vs 10), which would destabilize most value function training.
