
ReturnnForwardJobV2, wait for checkpoint a bit #464

Closed

Conversation

@albertz (Member) commented Nov 28, 2023

No description provided.

@michelwi (Contributor)

Why is this a problem? The job should only be runnable if the checkpoint exists.

And then why is it a problem only for this job and not any other job that uses checkpoints?

If your problem is filesystem sync related, could it be solved (more generally) by tuning gs.WAIT_PERIOD_JOB_FS_SYNC?
https://github.com/rwth-i6/sisyphus/blob/450ed58ec6ec964db0b8377b2b685d37b77fa3db/sisyphus/global_settings.py#L183
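
For illustration, a minimal sketch of how such a setting is typically raised in a Sisyphus settings.py (assuming the usual settings.py override mechanism; the value here is only illustrative, not a recommendation):

    # settings.py -- sketch only; the value is illustrative.
    # Seconds to wait after a job finishes before treating its outputs as
    # available, giving a shared filesystem time to propagate the files.
    WAIT_PERIOD_JOB_FS_SYNC = 60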

@albertz (Member, Author) commented Nov 28, 2023

Why is this a problem?

Good question. I was also asking on Slack, and you actually responded with this hypothesis:

If I had to hazard a guess: the manager sees the finished.task.i file and assumes all outputs are available. The asr-raid-X fileserver, however, needs another second to also propagate the visibility of the model checkpoint file from the compute node the training job runs on to your local node

Why now? Maybe more load than usual, admins tweaking stuff, network maintenance, cosmic rays, ... 🤷

The job should only be runnable if the checkpoint exists.

I just checked the code. In the Job, we have this logic:

    def _sis_runnable(self):
        """True if all inputs are available, also checks if new inputs are requested"""

        if not self._sis_update_possible():
            # Short cut used for most jobs
            return self._sis_all_path_available()
        ...

and

    def _sis_all_path_available(self):
        """True if all current inputs are available no update of the inputs is done"""
        for path in list(self._sis_inputs):
            if not path.available(debug_info=self):
                return False
        return True

And AbstractPath.available is implemented like this:

    @finished_results_cache.caching(get_key=lambda self, debug_info=None: ("available", self.rel_path()))
    def available(self, debug_info=None):
        """Returns True if the computations creating the path are completed
        :return:
        """

        # Use custom set function, check hasattr for backwards compatibility
        if hasattr(self, "_available") and self._available:
            return self._available(self)

        path = self.get_path()
        if self.creator is None:
            return os.path.isfile(path) or os.path.isdir(path)
        else:
            job_path_available = self.creator.path_available(self)
            if self.creator._sis_finished() and not job_path_available:
                if debug_info:
                    logging.warning(
                        "Job marked as finished but requested output is not available: %s %s" % (self, debug_info)
                    )
                else:
                    logging.warning("Job marked as finished but requested output is not available: %s" % self)
            return job_path_available

self.creator is the Job, and Job.path_available is implemented like this:

    def path_available(self, path):
        """Returns True if given path is available yet

        :param path: path to check
        :return:
        """
        assert isinstance(path, AbstractPath)
        assert path.creator == self
        return self._sis_finished()

So, this logic matches your hypothesis exactly.

However, the ReturnnTrainingJob has a special path_available, implemented like this:

    def path_available(self, path):
        # if job is finished the path is available
        res = super().path_available(path)
        if res:
            return res

        # learning rate files are only available at the end
        if path == self.out_learning_rates:
            return super().path_available(path)

        # maybe the file already exists
        res = os.path.exists(path.get_path())
        if res:
            return res

        # maybe the model is just a pretrain model
        file = os.path.basename(path.get_path())
        directory = os.path.dirname(path.get_path())
        if file.startswith("epoch."):
            segments = file.split(".")
            pretrain_file = ".".join([segments[0], "pretrain", segments[1]])
            pretrain_path = os.path.join(directory, pretrain_file)
            return os.path.exists(pretrain_path)

        return False

If this job is finished, this could again possibly lead to the problem you were describing.

If it is not finished, though, it should have checked that the file exists. That puzzles me a bit, because I think this was actually the common case where I saw the problem.

When the error occurred and I checked, the checkpoint did exist.

Specifically, this is a mini task, which should run on my local engine, i.e. the same host that the Sisyphus manager runs on. So it should not really be possible that the Sisyphus manager sees the file and then this local task does not.

And then why is it a problem only for this job and not any other job that uses checkpoints?

Good question. I only saw it for this job.

If your problem is filesystem sync related, could it be solved (more generally) by tuning gs.WAIT_PERIOD_JOB_FS_SYNC? https://github.com/rwth-i6/sisyphus/blob/450ed58ec6ec964db0b8377b2b685d37b77fa3db/sisyphus/global_settings.py#L183

I think it's already high enough. I never saw this problem for any other job so far.

@michelwi (Contributor)

You actually answered this hypothesis

Oh yes, I remember. And I agree with your investigation of the runnability of Jobs and with the conclusion

So it should not really be possible that the Sisyphus manager sees the file and then this local task does not.

Which leads again to the question: But why?

I lean towards a "weak reject" of this PR, because the problem does not seem intrinsically related to this Job; instead we should have a more generic implementation for any Job to "wait a bit" after it is "theoretically runnable" until it is "practically runnable". That is actually what gs.WAIT_PERIOD_JOB_FS_SYNC should do. (Maybe there is some bug in Sisyphus around this option?)

I also found this code:
https://github.com/rwth-i6/sisyphus/blob/cbf529c36c1f247dc0400d5f5485fde240c6e612/sisyphus/task.py#L140-L146

I think [gs.WAIT_PERIOD_JOB_FS_SYNC is] already high enough. I never saw this problem for any other job so far.

Maybe you can tune gs.WAIT_PERIOD_MTIME_OF_INPUTS as well?
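
For illustration, a sketch of raising that setting as well (again assuming the settings.py override mechanism; the value is arbitrary):

    # settings.py -- sketch only; the value is arbitrary.
    # Seconds that should have passed since an input file's mtime before a
    # task starts, as a guard against filesystem caching/propagation delays.
    WAIT_PERIOD_MTIME_OF_INPUTS = 120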

@albertz (Member, Author) commented Nov 28, 2023

I also found this code:
https://github.com/rwth-i6/sisyphus/blob/cbf529c36c1f247dc0400d5f5485fde240c6e612/sisyphus/task.py#L140-L146

Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case?

I really don't like WAIT_PERIOD_JOB_FS_SYNC. I want to get rid of any arbitrary sleeps before doing anything. See also rwth-i6/sisyphus#146. Ideally, if something is ready, it should directly execute the next task without any delay.

Then the task itself can check whether its inputs are available and, if not, wait a bit. Thus, in the ideal case it would run directly, and only if something is not yet available would it wait.
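
As an illustration of that idea (not the actual patch in this PR; the function name, timeout, and poll interval are made up), a minimal sketch of such a wait-and-check loop:

    import os
    import time

    def wait_for_file(path: str, timeout: float = 60.0, poll: float = 1.0) -> bool:
        """Poll until `path` exists or `timeout` seconds have passed.

        In the ideal case the file is already there and we return immediately;
        only if the filesystem has not caught up yet do we sleep and retry.
        """
        start = time.monotonic()
        while not os.path.exists(path):
            if time.monotonic() - start > timeout:
                return False  # give up; the caller can raise or retry
            time.sleep(poll)
        return True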

@michelwi (Contributor)

Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case?

Yes, I agree and like this suggestion.

@albertz (Member, Author) commented Nov 28, 2023

So, I opened rwth-i6/sisyphus#159 to discuss such a generic solution. It's a bit unclear, though, whether such a generic solution would always work.

@christophmluscher (Contributor)

@albertz is this still open?

@albertz (Member, Author) commented Jan 11, 2024

It's fixed, so this PR is obsolete.

@albertz albertz closed this Jan 11, 2024
@albertz albertz deleted the albert-forward-wait-for-file branch January 11, 2024 18:54