Enables running of multiple tasks in batches
Sometimes it's more efficient to run a group of tasks all at once rather than one at a time. With luigi, it's difficult to take advantage of this because your batch size is also the minimum granularity you're able to compute. If you have a job that runs hourly, you can't combine the hourly runs when many of them get backlogged. If you have a task that runs daily, you can't get hourly runs.

In order to gain efficiency when many jobs are queued up, this change allows workers to tell the scheduler how jobs can be batched. If there are several hourly jobs of the same type in the scheduler, it can combine them into a single job for the worker. We allow parameters to be combined in three ways: we can combine all the arguments in a csv, take the min and max to form a range, or just provide the min or max. The csv gives the most specificity, but range and min/max are available for when that's all you need. In particular, the max function provides an implementation of #570, allowing jobs that overwrite each other to be grouped by just running the largest one.

In order to implement this, the scheduler creates a new task based on the information sent by the worker. It's possible (as in the max/min case) that the new task already exists, but if it doesn't it will be cleaned up at the end of the run. While this new task is running, the tasks it replaces are marked as BATCH_RUNNING. When the head task becomes DONE or FAILED, the BATCH_RUNNING tasks are updated accordingly. They also have their tracking urls updated to match the batch task.

This is a fairly big change to how the scheduler works, so there are a few issues with it in the initial implementation:

- newly created batch tasks don't show up in dependency graphs
- the run summary doesn't know what happened to the batched tasks
- we can't limit how big batches can be (how should we handle ranges?)
- batching takes quadratic time for simplicity of implementation
- I'm not sure what would happen if there was a yield in a batch run function

On the worker side, batching is accomplished by setting a batch_class, batcher_args and batcher_aggregate_args. The batch class is the Python class that runs the batched version of the job. This can be set equal to the current class by overriding the class method get_batch_class.

The batcher_args are the arguments passed from the current class to the batch class. These come in pairs. So if the original class has parameters machine and filename that need to go to host and files in the batcher, you'll use [('machine', 'host'), ('filename', 'files')] for batcher_args.

The final value is batcher_aggregate_args, which explains which arguments are to be aggregated and how. Using the machine, filename example, we might want to batch multiple files together for the same machine. For that, we could do something like {'filename': 'csv'} to combine them all as comma-separated values. Now if we have multiple machine, filename pairs such as ('m1', 'f1'), ('m1', 'f2'), ('m2', 'f3'), ('m2', 'f4'), we'd end up with batch jobs with host, files pairs of ('m1', 'f1,f2') and ('m2', 'f3,f4').

The worker sends the batch class, batcher args and batcher aggregate args to the scheduler once per class, which is why these are class methods. It doesn't make sense to have different ways to batch per individual task, so that's not allowed.
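
The machine, filename example can be sketched in plain Python. This is only an illustration of the grouping and aggregation described above, not the scheduler code from this commit: build_batches and AGGREGATORS are hypothetical names, and the "min-max" string used for range is an assumed format that the message does not specify.

```python
from collections import defaultdict

# Hypothetical aggregators, keyed by the method names from the commit message.
# The "min-max" string format for "range" is an assumption.
AGGREGATORS = {
    "csv": lambda values: ",".join(str(v) for v in values),
    "range": lambda values: "{}-{}".format(min(values), max(values)),
    "min": min,
    "max": max,
}


def build_batches(tasks, batcher_args, batcher_aggregate_args):
    """Group task parameter dicts and combine the aggregated parameters.

    tasks: list of dicts mapping original parameter names to values.
    batcher_args: list of (original_name, batch_name) pairs.
    batcher_aggregate_args: dict mapping original_name to an aggregator name.
    """
    # Parameters that are not aggregated must be equal for tasks to share a batch.
    fixed = [src for src, _ in batcher_args if src not in batcher_aggregate_args]
    groups = defaultdict(list)
    for params in tasks:
        key = tuple(params[name] for name in fixed)
        groups[key].append(params)

    batches = []
    for group in groups.values():
        batch = {}
        for src, dst in batcher_args:
            values = [params[src] for params in group]
            if src in batcher_aggregate_args:
                # Combine the values from every task in the group.
                batch[dst] = AGGREGATORS[batcher_aggregate_args[src]](values)
            else:
                # Identical across the group by construction, so take the first.
                batch[dst] = values[0]
        batches.append(batch)
    return batches


tasks = [
    {"machine": "m1", "filename": "f1"},
    {"machine": "m1", "filename": "f2"},
    {"machine": "m2", "filename": "f3"},
    {"machine": "m2", "filename": "f4"},
]
batches = build_batches(
    tasks,
    batcher_args=[("machine", "host"), ("filename", "files")],
    batcher_aggregate_args={"filename": "csv"},
)
print(batches)
```

With the four tasks above, this yields one batch per machine, matching the ('m1', 'f1,f2') and ('m2', 'f3,f4') pairs from the example. Swapping 'csv' for 'max' would instead keep only the largest filename per machine, which is the behaviour described for jobs that overwrite each other.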
1 parent f771622, commit bcde8bb
Showing 11 changed files with 637 additions and 45 deletions.