New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why it does not success even when the score is high? #378
Comments
Hmm reach is a pretty easy task. Are you computing your success metric correctly. There is a guide on how to do so in the paper. |
Yes, reach-v2 is very easy to optimize. The score rises up fast under evolutionary algorithm. But it seems that the high score does not necessarily lead to final success. |
I have similar issue. I only tested on the MT10 tasks and train a policy on each of the tasks. I run evaluation every 160k steps in the environment where each evaluation I run 50 episodes and compute the success percentage from those 50 episodes. Higher reward does not always positively correlate with higher success rate.
I trained using |
To reproduce, here is the code from typing import Any, Dict, List, Tuple
import gym
import metaworld
import numpy as np
import os.path
import random
from gym.wrappers import TimeLimit
from metaworld.envs.mujoco.sawyer_xyz.sawyer_xyz_env import SawyerXYZEnv
from seals.util import AbsorbAfterDoneWrapper
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import get_latest_run_id
class ChangingTaskAndDoneTerminalWrapper(gym.Wrapper):
"""
Option to change task every time the environment is reset
and `done` returns `True` when `info["success"]=True`.
`done` signal is needed to reset the environment in the sb3.
Sb3 can handle `done` signal due to terminal states or time limit.
"""
def __init__(self,
env: SawyerXYZEnv,
tasks: List[metaworld.Task],
change_task: bool = False) -> None:
"""
Constructor.
:param env: Environment.
:param tasks: List of metaworld task objects.
:param change_task: Whether to change task every reset.
"""
super().__init__(env)
self.tasks = tasks
self.change_task = change_task
self.env.set_task(self.tasks[0])
def reset(self, **kwargs) -> Any:
"""
Reset the environment.
"""
if self.change_task:
task = random.choice(self.tasks)
self.env.set_task(task)
return super().reset(**kwargs)
def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, Dict[str, Any]]:
"""
Step in the environment.
"""
obs, rew, _, info = super().step(action)
## `sb3.EvalCallback` look at "is_success" to log success rate
info["is_success"] = info["success"]
return obs, rew, info["success"], info
def create_single_env(
env_name: str,
changing_inner_task: bool,
handle_absorb_state: bool = False) -> gym.Env:
"""
Create single gym environment.
Note that we use `seals.util.AbsorbAfterDoneWrapper` and it should be before `gym.wrappers.TimeLimit`
:param env_name: Name of the environment. Only `metaworld.MT10` tasks!
:param changing_inner_task: Whether to change inner task every reset.
:param handle_absorb_state: Whether to use `seals.util.AbsorbAfterDoneWrapper` or not.
Return:
Single gym environment.
"""
mt1 = metaworld.MT1(env_name)
env = mt1.train_classes[env_name]()
env = ChangingTaskAndDoneTerminalWrapper(env, mt1.train_tasks, changing_inner_task)
if handle_absorb_state: env = AbsorbAfterDoneWrapper(env)
env = TimeLimit(env, env.max_path_length)
return env
"""
Main function.
"""
env_name = "reach-v2"
changing_inner_task = True
seed = 0
device = "cuda:0"
## set log directory
tb_logdir = "logs/%s" % env_name
tb_logname = "sac"
if changing_inner_task: tb_logdir = tb_logdir + "/changing_goal"
else: tb_logdir = tb_logdir + "/static_goal"
latest_run_id = get_latest_run_id(tb_logdir, tb_logname)
## create env
vecenv = make_vec_env(
env_id = lambda: create_single_env(env_name, changing_inner_task),
n_envs = 8,
seed = seed
)
eval_vecenv = make_vec_env(
env_id = lambda: create_single_env(env_name, changing_inner_task),
n_envs = 10,
seed = seed + 1
)
## create algo
algo = SAC(
policy = "MlpPolicy",
env = vecenv,
learning_rate = 0.0003,
buffer_size = int(1e6),
learning_starts = int(1e3),
batch_size = 512,
tau = 0.005,
gamma = 0.99,
train_freq = 512, ## train every `train_freq * vecenv.num_envs`
gradient_steps = 128,
policy_kwargs = {"net_arch": {"pi": [256, 256], "qf": [256, 256]}},
tensorboard_log = tb_logdir,
device = device,
seed = seed,
verbose = 1
)
## create callback
eval_callback = EvalCallback(
eval_env = eval_vecenv,
n_eval_episodes = 50,
eval_freq = 20000, ## eval every `eval_freq * vecenv.num_envs`
best_model_save_path = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
log_path = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
)
algo.learn(
total_timesteps = int(1e8),
callback = eval_callback,
tb_log_name = tb_logname
) |
Same issue. even in fixed goal environment. |
Yeah, every time you create a benchmark object it samples 50 new goal locations. That's intended behavior, so I'm going to close this. I have added another bug for the button-press environments, which do genuinely have this problem. |
Also worth mentioning is that the training instability when success is achieved is know, which is one reason why success states are not terminal states. Adding a wrapper to make them terminal states is not recommended. |
Maybe this could be an additional insight, I guess. I just trained single task with termination when it successes(info["success"]==True) using SAC, I found the same situation. This may be because it is obviously the best way to get maximum 'return', gaining more rewards(between 0 and 10) until max_path_length rather than just terminating with getting single reward 10. |
@reginald-mclean is there any point to modifying the reward function such that this behaviour is not optimal |
Yeah there could be some value there. I would just be concerned about what that reward is to avoid this behaviour, tuning that reward function would mean giving a very large value to the agent when success=True and that large value could destabilize value function training. |
One thing to note is that Meta-World intentionally does not terminate the episode on success, to avoid this problem. You should always use length 500 episodes, without any terminal states. Formally, every meta-world task is an infinite horizon MDP, from which a finite 500 state sequence is sampled for each episode. To reliably avoid this problem, the reward on termination would need to be two orders of magnitude larger than at any other timestep (technically, 5000 vs 10), which would destabilize most value function training. |
I tested some environment with different task settings. For example on reach-v2, I think it is strange that it does not success even when the score is 4700+. I tried to add an extra score when it success, but it didn't work.
Is there any tricks to slove this problem?
The text was updated successfully, but these errors were encountered: