Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReturnnTrainingJob: model in config is an absolute path #495

Open
albertz opened this issue Apr 8, 2024 · 9 comments
Open

ReturnnTrainingJob: model in config is an absolute path #495

albertz opened this issue Apr 8, 2024 · 9 comments

Comments

@albertz
Copy link
Member

albertz commented Apr 8, 2024

self.returnn_config.post_config["model"] = os.path.join(self.out_model_dir.get_path(), "epoch")

Once you move the job dir to a new location, this will thus break.

More annoyingly, RETURNN automatically silently recursively creates non-existing directories for model, so it will not crash but still run without errors. When you moved the job with existing checkpoints, it will not find the old checkpoints but start training again from scratch. However, it will also overwrite the learning rates file, so afterwards, the old checkpoints can not really be used anymore (if you care about having a corresponding correct learning rates file), and the learning rates file will have mixed values from the old and new training run.

I'm not sure if we consider this a bug that we have absolute paths here? We could fix this by using a relative path. There might be a number of similar issues here and probably in other jobs as well.

@michelwi
Copy link
Contributor

michelwi commented Apr 8, 2024

replacing

os.path.join(self.out_model_dir.get_path(), "epoch")

with

self.out_model_dir.join_right("epoch")

will not be resolving the path in the manager (which would trigger an error if anybody ever activated the check), but I am afraid that once the config is written to file, there is no way to change the path.

One could write a relative path as you said, but many other paths (e.g. inputs to the job) in the config file will still break. This holds true for any Job and not just ReturnnTrainingJob. I guess there is no way to move any job dir while the job is running.

Other idea: manually rerun the task create_files after the move?

@JackTemaki
Copy link
Contributor

This is more a general issue of absolute paths in Sisyphus, not only relevant for the ReturnnTrainingJob. For full transfer-ability during "active" jobs all paths need to be relative, which is currently not the case. A much bigger problem than with Returnn configs is actually with bliss corpora, because they are wrong across jobs, not only within a job.

In general, I would consider this is not a bug, but more an inconvenient design decision. At least I never expected that you can move Sisyphus setups without manual intervention (on the contrary, I was surprised by how good it does work and how few thinks broke for my last setup).

I think the first step needed towards relative setups would be a Path.get_relative_path. But I will not actively contribute towards relative Sisyphus setups as I do not see an urgent need. Only for invalid job outputs (like the bliss files) I see a it as a strong issue that should be fixed.

@michelwi
Copy link
Contributor

michelwi commented Apr 9, 2024

Actually, now that you mentioned incorrect bliss files: imagine a setup that does audio processing and then updates the corpus file to contain the paths to the processed audio files.

No way Sisyphus could know that the file needs to be regenerated. We would need to implement specific move_job functions for each job specifically that would rerun specific tasks as needed.

@albertz
Copy link
Member Author

albertz commented Apr 9, 2024

If all paths are relative, then this would not be so much a problem.

Despite, in this example, I would argue that the job which generates the new Bliss corpus should also own the processed audio files, i.e. have a copy of them (or hardlink).

@michelwi
Copy link
Contributor

michelwi commented Apr 9, 2024

Despite, in this example, I would argue that the job which generates the new Bliss corpus should also own the processed audio files, i.e. have a copy of them (or hardlink).

Yes, but still if the Jobs output is

output/corpus.xml.gz
output/audio/file1.wav
output/audio/file2.wav
...

and in the corpus we write the audio path as audio/fileX.wav then the path would only be correct if the working directory of any subsequent Job that uses the corpus was the corpus creation jobdir.. so we kind of need absolute paths here.

@albertz
Copy link
Member Author

albertz commented Apr 9, 2024

then the path would only be correct if the working directory of any subsequent Job that uses the corpus was the corpus creation jobdir

I would say that we should fix RASR that the path is relative w.r.t. the bliss XML file.

@christophmluscher
Copy link
Contributor

christophmluscher commented Apr 9, 2024

For RASR I think you can already specify relative paths within the bliss XML file. You would need to specify the audio dir.

class CorpusObject(tk.Object):
    """
    A simple container object to track additional information for a bliss corpus
    """

    def __init__(self):
        self.corpus_file: Optional[tk.Path] = None  # bliss corpus xml
        self.audio_dir: Optional[tk.Path] = None  # audio directory if paths are relative (usually not needed)
        self.audio_format: Optional[str] = None  # format type of the audio files, see e.g. get_input_node_type()
        self.duration: Optional[float] = None  # duration of the corpus, is used to determine job time

Gets set here:

def crp_set_corpus(crp, corpus):
    """
    :param CommonRasrParameters crp:
    :param meta.CorpusObject corpus: object with corpus_file, audio_dir, audio_format, duration
    """
    config = RasrConfig()
    config.file = corpus.corpus_file
    config.audio_dir = corpus.audio_dir
    config.warn_about_unexpected_elements = True
    config.capitalize_transcriptions = False
    config.progress_indication = "global"

    crp.corpus_config = config
    crp.audio_format = corpus.audio_format
    crp.corpus_duration = corpus.duration

@michelwi
Copy link
Contributor

michelwi commented Apr 9, 2024

For RASR I think you can already specify relative paths within the bliss XML file.

Fine, then bundle files.. those do not support rel. paths

@albertz
Copy link
Member Author

albertz commented Apr 9, 2024

Fine, then bundle files.. those do not support rel. paths

I would always argue, whenever some software does not support this properly, we should fix this.

I'm also not sure whether this is really such a common problem. Despite Bliss or bundle files, what else is there? I guess some other similar training jobs referencing into themselves, e.g. run_cmd += ["checkpoint.save_dir=" + self.out_checkpoint_dir.get_path()] in FairseqHydraTrainingJob.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants