Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReturnnConfig ignores file caching #310

Open
albertz opened this issue Sep 13, 2022 · 6 comments
Open

ReturnnConfig ignores file caching #310

albertz opened this issue Sep 13, 2022 · 6 comments
Assignees

Comments

@albertz
Copy link
Member

albertz commented Sep 13, 2022

When you put a tk.Path(..., cached=True) object somewhere into your ReturnnConfig, it will not use file caching.

Is this expected behavior? At least I did not expect this.

The reason for that:

config = instanciate_delayed(config)

This code recursively goes through config, and does:

if isinstance(o, DelayedBase):
    o = o.get()

Now, tk.Path derives from DelayedBase, and has this get:

def get(self):
    return self.get_path()

And get_path returns the uncached file path.

So, in the written RETURNN config, you will not have any hashed file paths.

(Btw, in the RasrConfig, I think it calls str(path) instead, and Path.__str__ returns self.get_cached_path().)

Another note: I just realize, the RETURNN config writing is also a separate Task anyway, which likely runs on a different node, so file caching could not be done directly there. However, the common file_caching function is implemented like:

def file_caching(path):
  return f"`cf {path}`"

So, it anyway does not return the file path of a cached file, but just this special formatted string, which only RASR can really handle properly. So Path.__str__ likely would break other tools.

I don't really have a good solution or suggestion currently.

@albertz
Copy link
Member Author

albertz commented Sep 13, 2022

If it is expected behavior, I think at least the documentation in ReturnnConfig should warn about this.

But then somewhat my open question remains: How do I do caching here? I see that other people often use caching via RETURNN, e.g. HDFDataset and others support it directly, or ExternSprintDataset, where the RASR config gets written via RasrConfig, thus has cached paths. So is this the suggested solution? Implement it on RETURNN side anyway, also wherever else it is currently missing?

@critias
Copy link

critias commented Sep 13, 2022

Path.get was introduced to have the same interface between Variable and Path. Not accessing the cached version by default does feel unintuitive to me as well given that __str__ uses the cached version.

The described problem that it's written in an external file and read on a different node requires of cause support by the used toolkit, like in the given example with RASR, but that isn't possible with most toolkits.

So I don't have any suggestions for the write on one node read on another setup, changing the default behavior of get could be considered and would probably be the more intuitive solution (with a warning that the default behavior changed, and an option to switch back to the old behavior or to disable the warning).

Any other opinions?

Side note: I considered a while ago to deprecating the current tk.Path(..., cached=True) in favor of something like CachedPath(tk.Path(...)) to make the difference more explicit and more flexible (this would allow different caching methods for the same file or even support it for writing), but since it wasn't a pressing issue and the current solution worked I didn't do it in the end. Would anybody find this change useful?

@albertz
Copy link
Member Author

albertz commented Sep 13, 2022

I'm a bit afraid that changing the behavior of Path.get now would break lots of setups, esp how the file_cashing function above works, i.e. not returning a real path, but this special "`cf <path>`" string, where many tools or other code would probably expect a real path.

We discussed this a bit more internally. And it looks like there is no real good generic solution to this problem. For ReturnnConfig, maybe we can introduce some solution, somewhat similar to RasrConfig. I was thinking about returning sth like _DelayedCodeFormat("cf({!r})", path), and then making sure there is a cf function in the ReturnnConfig, via python_epilog or so.
Example:

    args = [
        _DelayedCodeFormat("lambda: '--config=' + cf({!r})", files["config"]),
        _DelayedCodeFormat("lambda: '--*.corpus.file=' + cf({!r})", files["corpus"]),
        _DelayedCodeFormat(
            "lambda: '--*.corpus.segments.file=' + cf({!r})", (files["segments"] if "segments" in files else "")),
        _DelayedCodeFormat("lambda: '--*.feature-cache-path=' + cf({!r})", files["features"]),
        _DelayedCodeFormat("lambda: '--*.feature-extraction.file=' + cf({!r})", files["feature_extraction_config"]),
        "--*.log-channel.file=/dev/null",
        "--*.window-size=1",
    ]
    args += [
        "--*.corpus.segment-order-shuffle=true",
        "--*.segment-order-sort-by-time-length=true",
        "--*.segment-order-sort-by-time-length-chunk-size=%i" % {"train": epoch_split * 1000}.get(data, -1),
    ]
    d = {
        "class": "ExternSprintDataset",
        "sprintTrainerExecPath": tools_paths.get_rasr_exe("nn-trainer"),
        "sprintConfigStr": args,
        "suppress_load_seqs_print": True,  # less verbose
        "input_stddev": 3.,
        "orth_vocab": self.vocab.get_opts() if self.vocab else None,
    }

And:

class _DelayedCodeFormat(DelayedFormat):
    """Delayed code"""

    def get(self) -> CodeWrapper:
        """get"""
        return CodeWrapper(super(_DelayedCodeFormat, self).get())

@michelwi
Copy link
Contributor

I don't understand what is the issue here: returnn does not support (in general) the "cf /path/to/file" file caching method we commonly use, so ignoring it is the correct solution at the moment.

Also the structure of the recipes (writing the config on a different node) does not support file caching where sisyphus would do the copying. So ignoring it is the correct solution.

I propose two solutions to enable general file caching:
A) returnn gets an implementation where for general parameters where a string path is expected also "cf /path/to/file" is allowed and then returnn would do the file caching.
B) during config writing we inspect every tk.Path object if the cached property is set and then serialize it as cf("/path/to/file") + make sure that a cf function exists in the prolog.

@albertz
Copy link
Member Author

albertz commented Sep 13, 2022

Well, one thing is just that the behavior to me was unexpected (that the tk.Path cached is just ignored). I didn't really expect that this cached option is really only for RasrConfig, maybe some shell scripts wrapped as Sisyphus jobs, but nothing else really.

But you are right, the current way the gs.file_caching works is really specific for RASR or for shell scripts, and would not work in other tools, unless they would add special handling for that.

However, there are potential solutions to still do it in a generic way.

A) returnn gets an implementation where for general parameters where a string path is expected also "cf /path/to/file" is allowed and then returnn would do the file caching.

This would be tricky, or basically almost impossible, as there is no central place for file handling in RETURNN. It's rather spread over the whole code, and even some other user code could do file handling. And then also TensorFlow and other libraries would do file handling.

B) during config writing we inspect every tk.Path object if the cached property is set and then serialize it as cf("/path/to/file") + make sure that a cf function exists in the prolog.

This is similar to my solution now. I don't do this automatically but rather manually. It would have been a bit tricky to do automatically, as it would have to go recursively through objects, e.g. DelayedFormat (not just list/tuple/dict, as instanciate_delayed currently is doing it). I introduced _DelayedCodeFormat as a replacement, see above.

@michelwi
Copy link
Contributor

This is similar to my solution now. I don't do this automatically but rather manually.

Yes, the user manually taking care of it however they want is of cause also an option.

It would have been a bit tricky to do automatically, as it would have to go recursively through objects, e.g. DelayedFormat (not just list/tuple/dict, as instanciate_delayed currently is doing it).

I see. But then there is the question of whether or not the addition of cf {} would be wanted if a user already is doing some "post processing" of the generated string. I see some use cases where it would break and others where it would be the intended behaviour.
And I see already the arguments "you should have used a tk.Path object with cached=False instead" and "then you can add 'cf' to your DelayedFormat as well".
We can discuss what is the most expected behavior and then change the implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants