
Allow release of resources during task running. #2346

Merged: 13 commits into spotify:master, Apr 8, 2018

Conversation

@riga (Contributor) commented Feb 3, 2018

Description

At scheduling time, the luigi scheduler needs to be aware of the maximum resource consumption a task might have once it runs. For some tasks, however, it can be beneficial to release or reduce a resource between two steps within their run method (e.g. after some heavy computation). In this case, a different task waiting for that particular resource can already be scheduled.

I simply added another method to the TaskStatusReporter, which is forwarded to the task in TaskProcess._run_get_new_deps. The scheduler also got two new RPC methods, set_running_task_resources and get_running_task_resources.

(Maybe the TaskStatusReporter should be renamed since it's doing more than just reporting the status now.)

Within a task, the resources can be updated via set_running_resources(), which gives you complete access to the resource dict held by the scheduler for that task. Another approach could be a method like release_resource(), used only to reduce resource values. Right now, one can also increase resources, which is certainly dangerous. Which way do you prefer?
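
For illustration, here is a minimal sketch of the pattern this enables. The task and the two helper functions are invented for the example, and set_running_resources() is the method name at this stage of the PR (the merged API ended up as decrease_running_resources, see below):

import luigi

class GridSubmissionTask(luigi.Task):
    # worst-case claim, checked by the scheduler before run() starts
    resources = {'grid_jobs': 2}

    def run(self):
        submit_heavy_batch()   # hypothetical phase that needs both units

        # hand one unit back mid-run; another task waiting on
        # 'grid_jobs' can now be scheduled while this one keeps running
        self.set_running_resources({'grid_jobs': 1})

        poll_remote_jobs()     # hypothetical phase with the reduced claim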

Motivation and Context

We have some long-running tasks which control remote jobs on computing grids. Although those grids have batch systems / queues, we want to use luigi's resource system to ensure we don't exceed a certain number (a few thousand) of running jobs.

I also added test cases and some docs =)

Edit: In the third commit, I propagated the running resources to the visualizer (I didn't think about that when I opened the PR).

@riga (Author) commented Feb 4, 2018

The nonhdfs tests fail, the others are fine. I guess it's because a remote scheduler is used in one of the test cases. Any advice?

@Tarrasch (Contributor) commented Feb 6, 2018

> The nonhdfs tests fail, the others are fine. I guess it's because a remote scheduler is used in one of the test cases. Any advice?

I would simply disable the test case for the remote scheduler if it's too hard to get it working, as long as the in-memory scheduler test works. I think there are already a couple of tests like that.

> Which way do you prefer?

I think you're the best person to shape this API, since you are the one using/needing it. Some things to consider:

  • You can still make the luigi Task reject invalid changes in its own code. For example, if the starting resource is 5, disallow changes to values greater than 5.
  • What would make your code easier: set_resources(), release_resources(), or decrease_resources()? I slightly lean towards the decrease version, so I could just specify the one resource I want to decrease and not respecify the others, like decrease_resources({'my_resource': 3}). But I'll let you pick. :)
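
To make the trade-off concrete, a hypothetical sketch of the two calling styles under discussion (both method names are proposals from this comment, not the final merged API):

import luigi

class ExampleTask(luigi.Task):
    resources = {'my_resource': 5, 'other_resource': 1}

    def run(self):
        # set-style: the caller must restate the full resource dict
        self.set_resources({'my_resource': 2, 'other_resource': 1})

        # decrease-style: name only the resource being reduced; the
        # others keep their current values
        self.decrease_resources({'my_resource': 3})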

@Tarrasch (Contributor) left a comment

Nice.

@daveFNbuck, since you created the resources feature, do you mind reviewing this PR?

- for resource, amount in six.iteritems(getattr(task, 'resources_running', task.resources)):
+ resources_running = getattr(task, "resources_running", task.resources)
+ if resources_running:
+     for resource, amount in six.iteritems(resources_running):

Contributor:

Nice code improvement :)

if self.reduce_foo:
    self.set_running_resources({"foo": 1})

time.sleep(2)

Contributor:

I suppose it was too hard to make it work without using time.sleep, right?

Contributor Author:

Yeah, this time also interferes with the worker's wait_interval. I haven't yet come up with a better approach to check that the scheduler really allows the two tasks to run in parallel...

@daveFNbuck (Contributor)

The API needs to be designed so that you can't increase any of the resources, as that can push you over the limit. We want to be able to guarantee that you never exceed a resource limit.
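
A hypothetical sketch of the kind of guard this constraint implies, clamping every update so a claim can only shrink (illustrative only, not the code that was merged):

def decrease_running_resources(task, decreases):
    # reduce a running task's resource claims; never increase them:
    # ignore negative requests and never drop below zero
    for resource, amount in decreases.items():
        current = task.resources_running.get(resource, 0)
        task.resources_running[resource] = max(0, current - max(0, amount))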

@riga (Author) commented Feb 7, 2018

I disabled the remote scheduler tests on Travis. Some other tests fail; I'm not entirely sure whether my changes cause the problems...

@riga (Author) commented Feb 13, 2018

Could you maybe start the tests again? I'm curious whether this was just a glitch. Thanks! =)

@dlstadther (Collaborator)

Travis build has been restarted

@riga (Author) commented Feb 13, 2018

Hmm, the nonhdfs tests fail again. In particular:

  • contrib.docker_runner_test.TestDockerTask
  • contrib.spark_test.PySparkTaskTest

I don't see how these tests are related to the changes in this PR :/

@dlstadther (Collaborator)

@riga I think this is a Travis issue. We didn't merge anything with failing tests, but these two tests have been failing for all new PRs.

@riga (Author) commented Mar 2, 2018

Hi! Is there any progress on this matter?

@dlstadther (Collaborator)

@riga These tests were fixed in #2356

Mind pulling those changes in here? Thanks!

@riga (Author) commented Mar 2, 2018

One test still fails, but it is not clear to me what is actually causing this...

@Tarrasch (Contributor) left a comment

Since you are adding new RPC methods, did you consider adding tests here?

https://github.com/spotify/luigi/blob/master/test/scheduler_api_test.py

luigi.build([task_a, task_b], self.factory, workers=2,
            scheduler_port=self.get_http_port())

@skipOnTravis("https://travis-ci.org/spotify/luigi/jobs/338398994")

Contributor:

There was really no way to get these tests to run on Travis? I'm not sure having them helps at all then. This is a luigi core feature you are adding; such important functionality should be properly tested even on Travis.

Contributor Author:

Agreed. Maybe the Travis errors I saw at the beginning were also related to #2356. I'll check again.

@riga (Author) commented Mar 3, 2018

I noticed that the same task resource dicts are used in multiple places, especially:

  • when setting task.resources_running to task.resources before a task starts running
  • when setting the resources_running of batch tasks

As this PR introduces dynamic resources_running, the initial resource dict needs to be copied in those places (036b813).
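
In sketch form, the aliasing issue and the fix (illustrative, not the literal scheduler code):

# before: both attributes alias the same dict, so an in-run update to
# resources_running would silently mutate the task's declared resources
task.resources_running = task.resources

# after (036b813): copy on assignment so in-run updates stay local
task.resources_running = dict(task.resources)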

@riga (Author) commented Mar 16, 2018

I fixed one of the two failing tests in this PR.

I wasn't aware that there were hardcoded attributes in Task.no_unpicklable_properties. Those attributes consist only of temporary task callbacks added by TaskProcess right before the task's run method is called. As a new callback is added in this PR (decrease_running_resources), the Spark task tests failed.

What do you think about making this dynamic (depending on a mapping defined at class level)? In addition, Task.no_unpicklable_properties can use this mapping to obtain the list of unpicklable properties at runtime. I implemented this in dc2a74a and c319bf1.
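
Roughly the shape of that idea, as a sketch (the mapping name and contents are approximations, not the exact code from dc2a74a and c319bf1):

class TaskProcess(object):
    # single source of truth: reporter methods that are forwarded to
    # the task as temporary callbacks right before run() is called
    forward_reporter_callbacks = {
        'set_tracking_url': 'update_tracking_url',
        'set_status_message': 'update_status_message',
        'decrease_running_resources': 'decrease_running_resources',
    }

class Task(object):
    def unpicklable_properties(self):
        # derived at runtime instead of hardcoded, so a newly added
        # callback is covered automatically
        return tuple(TaskProcess.forward_reporter_callbacks)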

I'm about to solve the remaining failure.

@Tarrasch (Contributor)

Awesome to see progress. I commented on the commits you linked to.

@riga (Author) commented Mar 16, 2018

Sounds good to me. If you're tired of being stuck with #2346, you can submit this as a separate PR.

I think I'm close to fixing the issue; I'm just getting a tornado error:

RuntimeError: Cannot share PollIOLoops across processes

This is closely related to what you guys experienced as well ;)
https://github.com/spotify/luigi/blob/master/test/server_test.py#L45-L53

import server_test


luigi.notifications.DEBUG = True

Contributor:

Is this copied from anywhere? Is it necessary? You probably just want to delete this line.

Contributor Author:

done

luigi.build([task_c, task_d], self.factory, workers=2,
            scheduler_port=self.get_http_port())

def test_local_scheduler(self):

Contributor:

Btw, maybe this test can live outside of server_test.ServerTestBase?

Contributor Author:

done

# the total "foo" resource (3) is sufficient to run both tasks in parallel shortly after
# the first task started, so the entire process should not exceed 4 seconds
task_c = ResourceTestTask(param="c", reduce_foo=True)
task_d = ResourceTestTask(param="d")

Contributor:

You shouldn't need to use new task names, right? Might as well call them "a" and "b".

Contributor Author:

done

task_d = ResourceTestTask(param="d")

with self.assert_duration(max_duration=4):
    luigi.build([task_c, task_d], self.factory, workers=2,

Contributor:

I wonder if you get strange behavior because of the simplistic and unintuitive way luigi.build is implemented. I think it just runs those two serially (it won't attempt task_d before task_c is complete). You might just want to build a single WrapperTask that depends on both of those tasks.
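
A sketch of that suggestion, reusing the task names from the snippets above (the wrapper class itself is hypothetical):

class ParallelWrapper(luigi.WrapperTask):
    # requiring both tasks lets a single build() call schedule them together
    def requires(self):
        return [ResourceTestTask(param="a", reduce_foo=True),
                ResourceTestTask(param="b")]

# then, inside the test, replace the two-task build with:
# luigi.build([ParallelWrapper()], self.factory, workers=2,
#             scheduler_port=self.get_http_port())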

Contributor Author:

Yep, you're right, will be simplified in the next commit.

@riga (Author) commented Mar 16, 2018

The test that demonstrates how two tasks with concurrent resources can be accelerated via resource reduction within the run() method requires a remote scheduler. I adapted some of the server_test._ServerTest features to set up the server, remote scheduler, etc.

@riga (Author) commented Mar 26, 2018

Anything else I can/should do? =)


class ConcurrentRunningResourcesTest(unittest.TestCase):

    def get_app(self):

Contributor:

Can you remove this in a follow-up? You don't need it anymore, right?

Contributor Author:

Yup!

@Tarrasch merged commit 65dc89c into spotify:master on Apr 8, 2018
@Tarrasch (Contributor) commented Apr 8, 2018

Awesome! Well done!

@tiamot commented Nov 9, 2018

Where is the documentation on how this is used? I am attempting to reduce the resources of long-running tasks to allow more tasks to run in parallel, based on the current documentation:
https://luigi.readthedocs.io/en/stable/luigi_patterns.html?highlight=decrease#decreasing-resources-of-running-tasks

I attempted the following, expecting that after 10 seconds the first task would release the resource and allow the second task to run in parallel:

import luigi
from time import sleep


class Y(luigi.Task):
    wait = luigi.IntParameter(default=10)
    n = luigi.IntParameter()

    resources = {'A': 1}

    def output(self):
        return luigi.LocalTarget('out{}.txt'.format(self.n))

    def run(self):
        sleep(self.wait)
        # intended to release the 'A' resource at this point
        self.decrease_running_resources({'A': 0})
        sleep(self.wait)
        f = self.output().open('w')
        f.close()


class Z(luigi.WrapperTask):
    tasks = [Y(n=x) for x in range(20)]

    def requires(self):
        yield self.tasks

$ luigid
$ luigi --module x Z --workers 20
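
For reference, decrease_running_resources reduces a claim by the given amount rather than setting it to that value, so a decrease of 0 leaves the claim unchanged; a minimal sketch under that by-amount reading (Y2 is a hypothetical variant of Y above):

import luigi
from time import sleep

class Y2(luigi.Task):
    resources = {'A': 1}

    def output(self):
        return luigi.LocalTarget('out-y2.txt')

    def run(self):
        sleep(10)
        # decrease the running claim on 'A' by 1, releasing it entirely
        self.decrease_running_resources({'A': 1})
        sleep(10)
        with self.output().open('w'):
            pass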

@Tarrasch (Contributor) commented Nov 9, 2018

See the discussion in #2576, which you created.
