
ReturnnForwardJobV2, wait for checkpoint a bit #464

Closed

Conversation

@albertz (Member) commented Nov 28, 2023

No description provided.

@michelwi (Contributor)

Why is this a problem? The job should only be runnable if the checkpoint exists.

And then why is it a problem only for this job and not any other job that uses checkpoints?

If your problem is filesystem sync related, could it be solved (more generally) by tuning gs.WAIT_PERIOD_JOB_FS_SYNC?
https://github.com/rwth-i6/sisyphus/blob/450ed58ec6ec964db0b8377b2b685d37b77fa3db/sisyphus/global_settings.py#L183
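
For illustration, a minimal sketch of how such a setting is typically raised in a Sisyphus settings.py (assuming the usual settings.py override mechanism; the value here is only illustrative, not a recommendation):

    # settings.py -- sketch only; the value is illustrative.
    # Seconds to wait after a job finishes before treating its outputs as
    # available, giving a shared filesystem time to propagate the files.
    WAIT_PERIOD_JOB_FS_SYNC = 60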

@albertz (Member, Author) commented Nov 28, 2023

Why is this a problem?

Good question. I was also asking on Slack, and you actually responded with this hypothesis:

If I had to hazard a guess: the manager sees the finished.task.i file and assumes all outputs are available. The asr-raid-X fileserver, however, needs another second to also propagate the visibility of the model checkpoint file from the compute node the training job runs on to your local node

Why now? Maybe more load than usual, admins tweaking stuff, network maintenance, cosmic rays, ... 🤷

The job should only be runnable if the checkpoint exists.

I just checked the code. In the Job, we have this logic:

    def _sis_runnable(self):
        """True if all inputs are available, also checks if new inputs are requested"""

        if not self._sis_update_possible():
            # Short cut used for most jobs
            return self._sis_all_path_available()
        ...

and

    def _sis_all_path_available(self):
        """True if all current inputs are available no update of the inputs is done"""
        for path in list(self._sis_inputs):
            if not path.available(debug_info=self):
                return False
        return True

And AbstractPath.available is implemented like this:

    @finished_results_cache.caching(get_key=lambda self, debug_info=None: ("available", self.rel_path()))
    def available(self, debug_info=None):
        """Returns True if the computations creating the path are completed
        :return:
        """

        # Use custom set function, check hasattr for backwards compatibility
        if hasattr(self, "_available") and self._available:
            return self._available(self)

        path = self.get_path()
        if self.creator is None:
            return os.path.isfile(path) or os.path.isdir(path)
        else:
            job_path_available = self.creator.path_available(self)
            if self.creator._sis_finished() and not job_path_available:
                if debug_info:
                    logging.warning(
                        "Job marked as finished but requested output is not available: %s %s" % (self, debug_info)
                    )
                else:
                    logging.warning("Job marked as finished but requested output is not available: %s" % self)
            return job_path_available

self.creator is the Job, and Job.path_available is implemented like this:

    def path_available(self, path):
        """Returns True if given path is available yet

        :param path: path to check
        :return:
        """
        assert isinstance(path, AbstractPath)
        assert path.creator == self
        return self._sis_finished()

So, this logic matches your hypothesis exactly.

However, the ReturnnTrainingJob has a special path_available, implemented like this:

    def path_available(self, path):
        # if job is finished the path is available
        res = super().path_available(path)
        if res:
            return res

        # learning rate files are only available at the end
        if path == self.out_learning_rates:
            return super().path_available(path)

        # maybe the file already exists
        res = os.path.exists(path.get_path())
        if res:
            return res

        # maybe the model is just a pretrain model
        file = os.path.basename(path.get_path())
        directory = os.path.dirname(path.get_path())
        if file.startswith("epoch."):
            segments = file.split(".")
            pretrain_file = ".".join([segments[0], "pretrain", segments[1]])
            pretrain_path = os.path.join(directory, pretrain_file)
            return os.path.exists(pretrain_path)

        return False

If this job is finished, this could again possibly lead to the problem you were describing.

If it is not finished, though, it should have checked that the file exists. That puzzles me a bit, because I think this was actually the common case where I saw the problem.

When the error occurred and I checked, the checkpoint did exist.

Specifically, this is a mini task, which should run on my local engine, i.e. the same host that the Sisyphus manager runs on. So it should not really be possible that the Sisyphus manager sees the file and then this local task does not.

And then why is it a problem only for this job and not any other job that uses checkpoints?

Good question. I only saw it for this job.

If your problem is filesystem sync related, could it be solved (more generally) by tuning gs.WAIT_PERIOD_JOB_FS_SYNC? https://github.com/rwth-i6/sisyphus/blob/450ed58ec6ec964db0b8377b2b685d37b77fa3db/sisyphus/global_settings.py#L183

I think it's already high enough. I never saw this problem for any other job so far.

@michelwi (Contributor)

You actually answered this hypothesis

Oh yes, I remember. And I agree with your investigation of the runnability of Jobs and with the conclusion

So it should not really be possible that the Sisyphus manager sees the file and then this local task does not.

Which leads again to the question: But why?

I lean towards a "weak reject" of this PR, because the problem does not seem intrinsically related to this Job; instead we should have a more generic implementation for any Job to "wait a bit" after it is "theoretically runnable" until it is "practically runnable". That is actually what gs.WAIT_PERIOD_JOB_FS_SYNC should do. (Maybe there is some bug in Sisyphus around this option?)

I also found this code:
https://github.com/rwth-i6/sisyphus/blob/cbf529c36c1f247dc0400d5f5485fde240c6e612/sisyphus/task.py#L140-L146

I think [gs.WAIT_PERIOD_JOB_FS_SYNC is] already high enough. I never saw this problem for any other job so far.

Maybe you can tune gs.WAIT_PERIOD_MTIME_OF_INPUTS as well?
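
For illustration, a sketch of raising that setting as well (again assuming the settings.py override mechanism; the value is arbitrary):

    # settings.py -- sketch only; the value is arbitrary.
    # Seconds that should have passed since an input file's mtime before a
    # task starts, as a guard against filesystem caching/propagation delays.
    WAIT_PERIOD_MTIME_OF_INPUTS = 120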

@albertz (Member, Author) commented Nov 28, 2023

I also found this code:
https://github.com/rwth-i6/sisyphus/blob/cbf529c36c1f247dc0400d5f5485fde240c6e612/sisyphus/task.py#L140-L146

Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case?

I really don't like WAIT_PERIOD_JOB_FS_SYNC. I want to get rid of any arbitrary sleeps before doing anything. See also rwth-i6/sisyphus#146. Ideally, if something is ready, it should directly execute the next task without any delay.

Then the task itself can check whether its inputs are available and, if not, wait a bit. Thus, in the ideal case it would run directly, and only if something is not yet available would it wait.
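
As an illustration of that idea (not the actual patch in this PR; the function name, timeout, and poll interval are made up), a minimal sketch of such a wait-and-check loop:

    import os
    import time

    def wait_for_file(path: str, timeout: float = 60.0, poll: float = 1.0) -> bool:
        """Poll until `path` exists or `timeout` seconds have passed.

        In the ideal case the file is already there and we return immediately;
        only if the filesystem has not caught up yet do we sleep and retry.
        """
        start = time.monotonic()
        while not os.path.exists(path):
            if time.monotonic() - start > timeout:
                return False  # give up; the caller can raise or retry
            time.sleep(poll)
        return True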

@michelwi (Contributor)

Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case?

Yes, I agree and like this suggestion.

@albertz (Member, Author) commented Nov 28, 2023

So, I opened rwth-i6/sisyphus#159 to discuss such a generic solution. It's a bit unclear, though, whether such a generic solution would always work.

@christophmluscher (Contributor)

@albertz is this still open?

@albertz (Member, Author) commented Jan 11, 2024

It's fixed, so this PR is obsolete.

@albertz albertz closed this Jan 11, 2024
@albertz albertz deleted the albert-forward-wait-for-file branch January 11, 2024 18:54