
GetBestRecogTrainExp job and related helpers #108

Open
albertz opened this issue Dec 2, 2022 · 4 comments

albertz commented Dec 2, 2022

At a high level, my training setup works like this:

  • Run training, keep some fixed epochs (either via the default predefined pattern or a custom one), and the N best epochs per train/dev score.
  • Run recog (or translation, or whatever inference) on the fixed epochs + the M best epochs on some other dev set (e.g. Hub5'00 for Switchboard).
  • Select the best epoch from the recog results.
  • Run recog for all relevant eval sets on the selected best epoch.

I want the recog on the fixed epochs to run as soon as those epochs are ready. I do that via Job.update. For the other epochs, this needs the final learning-rate file with the scores, so it depends on that. This is then also done via Job.update, to dynamically add some recogs. Note that the number of epochs on which recog is performed is variable, because there might be overlaps between those sets.
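
Schematically, the dynamic part could look like this (a minimal sketch, not the actual implementation; parse_learning_rates and add_recog_for_epoch are hypothetical helpers):

import os
from sisyphus import Job

class DynamicRecogScheduler(Job):
    """Sketch: once the final learning-rate file exists, parse the
    per-epoch scores and add recogs for the N best epochs."""

    def __init__(self, learning_rates_file, fixed_epochs, n_best):
        self.learning_rates_file = learning_rates_file  # output of the training job
        self.fixed_epochs = set(fixed_epochs)
        self.n_best = n_best

    def update(self):
        # Called by the Sisyphus manager while the graph is being updated.
        if not os.path.exists(self.learning_rates_file.get_path()):
            return  # training not finished yet; check again later
        scores = parse_learning_rates(self.learning_rates_file.get_path())  # hypothetical
        best = sorted(scores, key=scores.get)[:self.n_best]  # epochs with lowest dev score
        # The effective set of recogs is variable: best epochs can overlap the fixed ones.
        for epoch in set(best) - self.fixed_epochs:
            add_recog_for_epoch(epoch)  # hypothetical: dynamically extends the graph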

I assume this is a quite reasonable and common pipeline, which you are probably also running in this or a similar way.

I think it would be good if we had some common pipeline or helper code for this, rather than everyone having their own custom solution.

So I want to discuss this here. We can implement something new, or use some existing code. For example, I have implemented exactly this already. See my GetBestRecogTrainExp job, the recog_training_exp function, and related code.
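
For reference, the rough usage shape is something like this (names and arguments are illustrative only, not the actual API; see the linked code for the real signatures):

train_job = ReturnnTrainingJob(**training_kwargs)
# Hypothetical shape: pass the training experiment plus a generic recog
# function (model -> score); the helper handles fixed + dynamic epochs,
# summarizes the WERs, and links to the best epoch.
recog_training_exp(train_job, recog_func=my_recog)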


michelwi commented Dec 5, 2022

I think for this to consistently work, we would either have to have a Job that does a variable number of recognitions (and scorings) itself, or use the async workflow in sisyphus, which would then make the graph dependent on some intermediate output (i.e. the scores of the epochs).
I think I would propose using the second solution, although I am unfamiliar with async workflows.

I assume this is a quite reasonable and common pipeline, which you are probably also running in this or a similar way.

Well, I do in fact only run recognitions on a fixed set of epochs independent of their score. This might be sub-optimal, but my experiments have shown that once you are in the converged phase of model training, WER and CV score become decorrelated. Therefore I always have a fixed graph that is pre-determined ab initio.

I want the recog on the fixed epochs to run as soon as those epochs are ready. I do that via Job.update.

Job.update is not even necessary for this, as the ReturnnTrainingJob internally sets the checkpoints as available as soon as they are written to disk.


albertz commented Dec 5, 2022

I think for this to consistently work, we would either have to have a Job that does a variable number of recognitions (and scorings) itself, or use the async workflow in sisyphus, which would then make the graph dependent on some intermediate output (i.e. the scores of the epochs).

See my solution in my current implementation. I just pass a generic function, like (model -> score), which is then executed for all relevant models, i.e. all relevant epochs.
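
Schematically, the pattern is just this (a sketch with hypothetical type names):

from typing import Callable, Dict

RecogFunc = Callable[["Checkpoint"], float]  # hypothetical: model checkpoint in, score (e.g. WER) out

def recog_on_epochs(epochs, checkpoints: Dict[int, "Checkpoint"], recog: RecogFunc) -> Dict[int, float]:
    # Plain synchronous code: the callback is a normal function,
    # and `epochs` may have been determined dynamically (e.g. in Job.update).
    return {epoch: recog(checkpoints[epoch]) for epoch in epochs}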

This does not need the async workflow. I'm personally also not a fan of the async workflow. (See here.)

I want the recog on the fixed epochs to run as soon as those epochs are ready. I do that via Job.update.

Job.update is not even necessary for this, as the ReturnnTrainingJob internally sets the checkpoints as available as soon as they are written to disk.

Yeah, that was a bit misleading. I only use Job.update to check whether the learning-rate-scores file has become available, and then to parse it and do the logic.

I assume this is a quite reasonable and common pipeline, which you are probably also running in this or a similar way.

Well, I do in fact only run recognitions on a fixed set of epochs independent of their score.

This could be an option for the GetBestRecogTrainExp job. GetBestRecogTrainExp also does a bit more: it collects the recog scores (WERs), summarizes the final results, and links to the best epoch.


michelwi commented Dec 5, 2022

See my solution in my current implementation. I just pass a generic function, like (model -> score), which is then executed for all relevant models, i.e. all relevant epochs.

This does not need the async workflow. I'm personally also not a fan of the async workflow. (See here.)

Yes, you do not explicitly use the async workflow functions of sisyphus, but you are still passing a callback function that only gets executed once the corresponding ReturnnTrainingJob finishes (via Job.update) and then adds new Jobs to the sis graph. Or in terms of the blog post you linked: your code is still red.

Even on a high level, your proposed workflow would not work without some form of asynchronicity, because the number of epochs that would need to be recognized is not known until some part of the graph has already been computed.
By not using the async workflow, one might run the risk of accessing a Job's outputs before they are computed. Although I admit that in your implementation it is very nicely made sure that the train_job finishes first. But once we start to enforce check_is_worker more strictly, your approach does not work any more, as you always need to access the file with the scores inside the manager.

Note that the number of epochs on which recog is performed is variable, because there might be overlaps between those sets.

There is a possibility, if we are willing to give up this bit of efficiency. Then we would always know exactly which jobs need running. Pseudo-pipeline code:

train_job = ReturnnTrainingJob(**training_kwargs)

# these run immediately when epoch becomes available
for epoch in fixed_recognition_epochs:
    run_recognition(train_job.out_checkpoints[epoch])

# these only run once the train_job is finished
# as the learning_rates file needs to be completed
N = 5
nbest_job = GetNBestEpochsByScoreJob(train_job.out_checkpoints, train_job.out_learning_rates, N)
for epoch_idx in range(N):
    run_recognition(nbest_job.out_checkpoints[epoch_idx])

Where GetNBestEpochsByScoreJob just copies (or symlinks) the N best epochs according to the provided scores.
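
A sketch of what such a job could look like (assuming the learning-rates file can be parsed into per-epoch dev scores; parse_learning_rates is a hypothetical helper):

import os
from sisyphus import Job, Task

class GetNBestEpochsByScoreJob(Job):
    """Sketch: symlink the N best checkpoints (by dev score) under stable
    names, so downstream recog jobs can depend on them statically."""

    def __init__(self, checkpoints, learning_rates, n):
        self.checkpoints = checkpoints        # assumed: dict epoch -> checkpoint path
        self.learning_rates = learning_rates  # scores file from training
        self.n = n
        self.out_checkpoints = {i: self.output_path(f"best.{i}.ckpt") for i in range(n)}

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        scores = parse_learning_rates(self.learning_rates.get_path())  # hypothetical
        best_epochs = sorted(scores, key=scores.get)[:self.n]  # lowest score first
        for i, epoch in enumerate(best_epochs):
            os.symlink(self.checkpoints[epoch].get_path(), self.out_checkpoints[i].get_path())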


albertz commented Dec 5, 2022

Yes, you do not explicitly use the async workflow functions of sisyphus, but you are still passing a callback function that only gets executed once the corresponding ReturnnTrainingJob finishes (via Job.update) and then adds new Jobs to the sis graph.

Why is this problematic? I don't see any problem with this.

Or in terms of the blog post you linked: your code is still red.

No, the blog post means something a bit different. It refers to a well-known metaphor, the coloring of functions w.r.t. async: functions are distinguished as either "async" or "normal". See also here and here; there are many more follow-up thoughts on this aspect.

The idea is that async functions can call any function, but from a normal function, you can’t call an async function. This creates an artificial divide between functions, complicating… everything.

These blog posts, and many others as well, give many arguments for why and how this complicates everything, along with other downsides.

A callback function is just a normal function. It is not an async function. So the callback can be called everywhere without any complicated overhead.
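
To make the distinction concrete (plain Python, nothing Sisyphus-specific):

import asyncio

async def fetch_score() -> float:  # "red": an async function
    await asyncio.sleep(0.1)
    return 12.3

def summarize() -> None:  # "blue": a normal function
    # fetch_score() here would only give us a coroutine object, not a float;
    # crossing the divide requires spinning up an event loop:
    score = asyncio.run(fetch_score())
    print(f"best score: {score}")

def on_scores_ready(score: float) -> None:
    # A plain callback: still a normal function, callable from anywhere,
    # no async machinery involved.
    print(f"best score: {score}")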

I'm not really speaking about asynchronous execution in general (which we have all the time with Sisyphus), but specifically about the Python async concept, which the Sisyphus async workflow uses. I don't have any issue with asynchronous execution in general.

Even on a high level, your proposed workflow would not work without some form of asynchronicity, because the number of epochs that would need to be recognized is not known until some part of the graph has already been computed.

Yes I know. Why is that a problem? My code handles that just fine.

By not using the async workflow, one might run the risk of accessing a Job's outputs before they are computed

I don't see how that could happen. A job will only run when its inputs are available. So I cannot access anything before it is computed.

... as you always need to access the file with the scores inside the manager.

When you want it to be dynamic, as I outlined in my initial post, this logic must be within the manager. There is not really any way around that. You could potentially defer the determination of the list of relevant epochs to another job, but then, again in the manager, you need to read that list. At some point in the manager, you need to get that dynamic list of epochs.

But again, if this is done carefully, I don't see too much of a problem with it. I think in Job.update, this might be a valid thing to do. And regarding check_is_worker, I'm sure we can find some solution to still do that cleanly, e.g. there could be an explicit context scope like allow_reading_files_in_manager or so; see the sketch below.
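
Such a scope could be as simple as this sketch (nothing like this exists in Sisyphus today; is_worker is an assumed helper):

import contextlib

_reading_allowed = False  # module-level flag; sketch only

@contextlib.contextmanager
def allow_reading_files_in_manager():
    # Hypothetical explicit scope: temporarily permit file reads in the manager.
    global _reading_allowed
    prev, _reading_allowed = _reading_allowed, True
    try:
        yield
    finally:
        _reading_allowed = prev

def check_is_worker():
    # Hypothetical stricter check; the scope above would whitelist specific reads.
    if not is_worker() and not _reading_allowed:  # is_worker(): assumed helper
        raise RuntimeError("file access in manager outside allowed scope")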

... Where GetNBestEpochsByScoreJob just copies (or symlinks) the N best epochs according to the provided scores.

Multiple problems with this:

  • As you mention, it is inefficient, as you potentially run some recogs multiple times. This will actually be quite common for the last N epochs.
  • There can be multiple scores, for multiple losses, or maybe you calculate some other metrics. Let's say you have M metrics. With this scheme, to really cover all potentially relevant epochs, you need to run M * N recogs.

And I don't really see the advantage of this. The code is maybe slightly simpler, but what else? And this issue is about having some common code which everyone can use, so that with such common pipeline code, it would be just as simple anyway.
