ReturnnForwardJobV2, wait for checkpoint a bit #464
Conversation
Why is this a problem? The job should only be runnable if the checkpoint exists. And then why is it a problem only for this job and not for any other job that uses checkpoints? If your problem is filesystem sync related, could it be solved (more generally) by tuning …?
Good question. I was also asking on Slack. You actually answered this hypothesis:
I was checking the code now. In the manager, the relevant check is:

```python
def _sis_runnable(self):
    """True if all inputs are available, also checks if new inputs are requested"""
    if not self._sis_update_possible():
        # Short cut used for most jobs
        return self._sis_all_path_available()
    ...
```

and:

```python
def _sis_all_path_available(self):
    """True if all current inputs are available, no update of the inputs is done"""
    for path in list(self._sis_inputs):
        if not path.available(debug_info=self):
            return False
    return True
```

And:

```python
@finished_results_cache.caching(get_key=lambda self, debug_info=None: ("available", self.rel_path()))
def available(self, debug_info=None):
    """Returns True if the computations creating the path are completed
    :return:
    """
    # Use custom set function, check hasattr for backwards compatibility
    if hasattr(self, "_available") and self._available:
        return self._available(self)
    path = self.get_path()
    if self.creator is None:
        return os.path.isfile(path) or os.path.isdir(path)
    else:
        job_path_available = self.creator.path_available(self)
        if self.creator._sis_finished() and not job_path_available:
            if debug_info:
                logging.warning(
                    "Job marked as finished but requested output is not available: %s %s" % (self, debug_info)
                )
            else:
                logging.warning("Job marked as finished but requested output is not available: %s" % self)
        return job_path_available
```

which calls the job's `path_available`:

```python
def path_available(self, path):
    """Returns True if given path is available yet
    :param path: path to check
    :return:
    """
    assert isinstance(path, AbstractPath)
    assert path.creator == self
    return self._sis_finished()
```

So, this logic is exactly as your hypothesis. However, there is also this override of `path_available`:

```python
def path_available(self, path):
    # if job is finished the path is available
    res = super().path_available(path)
    if res:
        return res
    # learning rate files are only available at the end
    if path == self.out_learning_rates:
        return super().path_available(path)
    # maybe the file already exists
    res = os.path.exists(path.get_path())
    if res:
        return res
    # maybe the model is just a pretrain model
    file = os.path.basename(path.get_path())
    directory = os.path.dirname(path.get_path())
    if file.startswith("epoch."):
        segments = file.split(".")
        pretrain_file = ".".join([segments[0], "pretrain", segments[1]])
        pretrain_path = os.path.join(directory, pretrain_file)
        return os.path.exists(pretrain_path)
    return False
```

If this job is finished, it would again possibly lead to the problem you were describing. If it is not finished, it should have checked that the file exists, though. I wonder a bit, because I think this was actually the common case where I saw this problem: when the error occurred and I checked, the checkpoint did exist. Specifically, this is a mini task, which should run on my local engine, i.e. the same host that the Sisyphus manager runs on. So it should not really be possible that the Sisyphus manager sees the file and then this local task does not.
Good question. I only saw it for this job.
I think it's already high enough. I never saw this problem for any other job so far.
Oh yes, I remember. And I agree with your investigation on the runnability of Jobs and the conclusion.
Which leads again to the question: But why? I lean towards a "weak reject" of this PR, because the problem does not seem intrinsically related to this Job, and then we should have a more generic implementation for any Job to "wait a bit" after it is "theoretically runnable" until it is "practically runnable". Which is actually what I also found in this code:
Maybe you can tune …?
Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case? I really don't like the … Then, the task itself can check if inputs are available, and if not, wait a bit. Thus, in the ideal case, it would run directly, and only if not available, it would wait a bit.
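A minimal sketch of what such an in-task check could look like (the function, its name, and the defaults are made up for illustration, not the actual Sisyphus API):

```python
import os
import time

def wait_for_inputs(paths, timeout=60.0, poll_interval=1.0):
    """Before a task starts, check that all input paths exist on this host;
    if some are missing (e.g. due to filesystem sync delays), poll for a
    while instead of failing immediately. Returns True if all appeared
    within the timeout, False otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        missing = [p for p in paths if not os.path.exists(p)]
        if not missing:
            return True  # ideal case: everything is there, run directly
        if time.monotonic() >= deadline:
            return False  # give up after the timeout
        time.sleep(poll_interval)
```

In the ideal case the first check succeeds and the task starts immediately; only when the filesystem lags does it actually wait.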
Yes, I do agree and like this suggestion.
So, I opened rwth-i6/sisyphus#159 to discuss such a generic solution. It's a bit unclear, though, whether such a generic solution would always work.
@albertz is this still open? |
It's fixed, so this here is obsolete. |