Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend ReturnnRasrDataInput to produce valid Dataset config dict #130

Merged
merged 8 commits into from
Apr 25, 2023

Conversation

DanEnergetics
Copy link
Contributor

Similar to the OggZipHdfDataInput the ReturnnRasrDataInput gets a working get_data_dict implementation. This is useful for replacing the ReturnnRasrTrainingJob with the standard ReturnnTrainingJob for more functionality, e.g. producing different feature flow files for train and dev (see also this discussion)

Copy link
Contributor

@JackTemaki JackTemaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR needs further offline discussion, as there are multiple efforts simultaneously that try to solve the same things. I will try to organize this as soon as possible.

@Marvin84
Copy link
Contributor

My understanding was that we would for now include all different subroutines needed for everyone, leaving the freedom of choosing, and once everything is available we will decide how to merge?

@JackTemaki
Copy link
Contributor

Okay, I do not like this approach here, but we can discuss the PR if you want. Just be aware that once this gets merged and uses refactoring gets difficult, so any alternative has to be at a new location/under a new name, and the code that gets merge here can not be deleted/cleaned up easily. I leave this also to @christophmluscher and @michelwi to decide.

My comments are then:

  • I would like if the parameters that are specifically for the dataset are not just pasted here, but grouped as they are somewhat independent (e.g. in a dataclass)
  • I would really prefer if we do not include any code from the ReturnRasrTrainingJob at all. The purpose of our work should be to get rid of that job, so only take the minimal code from there that is needed and put it somewhere else.
  • if possible, please also check how much parameters can be removed/are not necessary. It would be unfortunate if they get hashed without reason.

@JackTemaki
Copy link
Contributor

Maybe a little bit more background information: We want to have RETURNN dataset creation helpers independent at a single place in the end, so also the OggZipHdf input is already somewhat problematic. But maybe we just implement these two inputs now and then just deprecate them once we have something nicer. I just hope that will work out without creating a mess...

@Marvin84
Copy link
Contributor

Marvin84 commented Apr 18, 2023 via email

@JackTemaki
Copy link
Contributor

The. main point for having this is to be able to have separate feature flows for train and dev. Do other approaches offer this possibility?

It am not discussing the what, just the how! I already said in my opinion this can be merged, as long as you get rid of the ReturnnRasrTrainingJob dependency here.

@DanEnergetics
Copy link
Contributor Author

I'm also okay with more (offline) discussions and would even welcome it. Regarding your comments, @JackTemaki, I'm unsure about these things:

  • Where to copy ReturnnRasrTrainingJob.create_config function? Another module? As a static method of the ReturnnRasrDataInput class?
  • Which parameters to drop? I know which one I'm not using but I cannot be sure that they aren't relevant to someone. Right now exactly those parameters are hashed that the ReturnnRasrTrainingJob uses for its hash.

@michelwi
Copy link
Contributor

I leave this also to [...] @michelwi to decide.

I will most likely not use or work with this code, so you may do however you want.

@JackTemaki
Copy link
Contributor

JackTemaki commented Apr 18, 2023

  • I know which one I'm not using but I cannot be sure that they aren't relevant to someone

Exactly this is the problem, this should be cleaned up so that we actually know what the parameters are doing. But I understand this is a time critical issue, and you just want working code. Then please add a comment to the "get" functions that this is outdated code triggering unknown parameters of RASR via the ReturnnRasrTrainingJob and should be used with care.

I would still require to unify the input features for the dataset separately.

@JackTemaki
Copy link
Contributor

JackTemaki commented Apr 19, 2023

Sorry that I did not formulate this well: If you really just copy the exact logic without cleaning up, then you should keep just calling the code from the ReturnnRasrTrainingJob instead!

Copy link
Contributor

@JackTemaki JackTemaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice, you even changed it to frozen dataclass :) I did not want to complain about everything so I skipped on that one.

Maybe some extra docstrings, otherwise good to go

@DanEnergetics
Copy link
Contributor Author

Yes no problem :D it was mostly necessary because the ReturnnRasrTrainingJob.create_config method does not have any default values.
The docstrings will be added by tomorrow.

@@ -1,15 +1,18 @@
__all__ = [
"RasrTrainingArgs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer ReturnnRasrTrainingArgs to match the job class name.

@@ -39,6 +59,7 @@ def __init__(
shuffle_data: bool = True,
stm: Optional[tk.Path] = None,
glm: Optional[tk.Path] = None,
rasr_training_args: Optional[RasrTrainingArgs] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rasr_training_args: Optional[RasrTrainingArgs] = None,
returnn_rasr_training_args: Optional[RasrTrainingArgs] = None,

Same here

@christophmluscher christophmluscher merged commit b10adea into main Apr 25, 2023
@christophmluscher christophmluscher deleted the mann-extend-returnn-rasr-data-input branch April 25, 2023 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants