
add parallel support for orchestrations #46023

Merged
merged 10 commits on Apr 10, 2018

Conversation

@mattp- (Contributor) commented Feb 14, 2018

What does this PR do?

Originally, the parallel global state requisite did not work correctly when invoked under an orch; this fixes that, and also makes running any other saltmod state (function, runner, wheel) work.

I'm opening this now for review of the patch, as I'm not sure of the full repercussions of my changes, in particular providing orchestration_jid for __pub_jid. Additionally, tests are pending (submitting this now to see how Jenkins responds).

Finally: I noticed there is a parallel_runners feature recently implemented by @smarsching. What was the reasoning for a runner-specific parallel versus the state engine feature?

Previous Behavior

parallel: True did not work in orch.

New Behavior

parallel: True works in orch; parallel: True also works on nested orch runners.
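
For illustration, a minimal orchestration SLS sketch of the kind this enables; the target globs and state file names here are hypothetical, not taken from this PR:

  # orch/deploy.sls -- run with: salt-run state.orchestrate orch.deploy
  deploy_web:
    salt.state:
      - tgt: 'web*'        # hypothetical minion target
      - sls: apache        # hypothetical state file
      - parallel: True     # previously did not work when invoked under an orch

  deploy_db:
    salt.state:
      - tgt: 'db*'
      - sls: postgres
      - parallel: True     # with this change, both salt.state jobs run concurrently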

Tests written?

No

Commits signed with GPG?

No

Please review Salt's Contributing Guide for best practices.

See GitHub's page on GPG signing for more information about signing commits with GPG.

@mattp- (Contributor Author) commented Feb 15, 2018

added commit 30166e1 which should fix #43668 and #39832

@smarsching (Contributor) commented

@mattp I think that your patch is very useful and can replace the salt.parallel_runners state for some, but not all use cases.

The parallel = True attribute means that a state decorated with it does not block other states from running. In contrast, the salt.parallel_runners state does block other states from running, but executes a number of sub-tasks (in the form of runners) in parallel. This makes it possible to have another state in the orchestration sequence depend on the completion of all the parallel tasks in a previous state.

With your patch, defining several states with parallel = True should work fine when one only cares about the tasks being run in parallel. However, if one needs some sequential logic in combination with parallel sub-tasks, I do not think it will help. The only way to make such a scenario work would be to put the parallel tasks into a separate file and use state.orchestrate with that file. In that case, state.orchestrate would act as the synchronization point for all the parallel tasks defined in the other file. However, I think that it is less user-friendly (and also less readable) if one is forced to use a separate state file for the parallel tasks and cannot embed them into the same state file as the sequential tasks.

The reason why I specifically added the feature for runners is that this was the place where I needed it. Think of the following example:

You want to set up some clustered database service by first installing the necessary software on a number of nodes and then creating a database on the cluster. The software installation can happen on all nodes in parallel, but you can only configure the database after the installation has finished on all nodes.

There might be similar cases in the regular state (non-orchestration) logic, but I have not encountered them so far. This might be due to the fact that the logic on a single node is typically state-based while runner / orchestration logic is more action-based (though it internally uses the state engine as well).

I hope this information gives you some insight into why I implemented the salt.parallel_runners feature.
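
To make the distinction concrete, here is a rough sketch of the pattern described above, as I read the documented parallel_runners syntax (the orchestration file names are hypothetical): salt.parallel_runners runs its sub-runners concurrently but blocks as a whole, so a later state can require it as a synchronization point.

  install_all_nodes:
    salt.parallel_runners:
      - runners:
          install_node_a:
            - name: state.orchestrate
            - kwarg:
                mods: orch.install_node_a   # hypothetical sub-orchestration
          install_node_b:
            - name: state.orchestrate
            - kwarg:
                mods: orch.install_node_b   # hypothetical sub-orchestration

  configure_database:
    salt.runner:
      - name: state.orchestrate
      - mods: orch.configure_db             # hypothetical; runs only after both installs finish
      - require:
        - salt: install_all_nodes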

@rallytime requested review from a team on February 15, 2018 19:32
@ghost requested review from a team on February 15, 2018 19:32
@mattp- (Contributor Author) commented Feb 15, 2018

@smarsching that was informative, thank you. I think you could maybe achieve something similar with require/require_in and parallel: True'ing the saltmod runner states, no? More options are usually better either way :)

@austinpapp (Contributor) commented

So I think they accomplish the same thing, just differently:
https://gist.github.com/austinpapp/e2775271305731327ae335c5c7e0ad17

However, the distinction is really threading vs. forking, @mattp-.

Regardless, I'd be happy to have either or both. It may prove itself useful down the road.

@smarsching (Contributor) commented

@mattp- & @austinpapp You are right, I did not think about combining the parallel attribute with require. This way, one should be able to get the desired behavior. I definitely like that idea. 👍
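
For reference, a sketch of that combination (target and state names are hypothetical): the two salt.state jobs run concurrently thanks to parallel: True, and the require on the final step acts as the synchronization point.

  install_node_a:
    salt.state:
      - tgt: 'node-a'         # hypothetical target
      - sls: db.install       # hypothetical state file
      - parallel: True

  install_node_b:
    salt.state:
      - tgt: 'node-b'
      - sls: db.install
      - parallel: True

  create_database:
    salt.function:
      - name: cmd.run
      - tgt: 'node-a'
      - arg:
        - /usr/local/bin/create-db.sh   # hypothetical command
      - require:                        # waits for both parallel jobs to finish
        - salt: install_node_a
        - salt: install_node_b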

salt/state.py Outdated
@@ -2326,8 +2326,7 @@ def check_requisite(self, low, running, chunks, pre=False):
                     continue
                 if run_dict[tag].get('proc'):
                     # Run in parallel, first wait for a touch and then recheck
-                    time.sleep(0.01)
-                    return self.check_requisite(low, running, chunks, pre)
+                    run_dict[tag].get('proc').join()
@cachedout (Contributor):
Eeeeeek. Will this block?

@isbm (Contributor):
This will create a zombie until parent quits.

@mattp- (Contributor Author):
@cachedout yes
@isbm according to https://docs.python.org/2/library/multiprocessing.html#all-platforms join() will waitpid and reap, unless I'm misreading

@isbm (Contributor):
Yes. But .join() will block everything. And if you did not set it daemonic, it will be joined anyway after the start.

@mattp- force-pushed the parallel-orch branch 2 times, most recently from 7319051 to 1e43f2b on March 27, 2018 01:23
@mattp- changed the base branch from develop to 2017.7.5 on March 27, 2018 01:24
@mattp- (Contributor Author) commented Mar 27, 2018

This is now rebased to target 2017.7.5.
@isbm I revisited the proc.join() behavior of this and found a solution that doesn't involve using it along the way. Also included a fix for measuring duration for parallel processes, as well as another fix from #39832.
It's been a bit of work, but I believe this should fix all outstanding parallel issues, whether minion- or master-side.

@mattp- (Contributor Author) commented Mar 27, 2018

this also fixes #44828

@mattp- (Contributor Author) commented Apr 3, 2018

Accidentally introduced a regression when I revisited the proc.wait behavior; should be fixed now, but I'll wait to see what Jenkins has to say.

@rallytime (Contributor) commented

@mattp- Looks like your new tests are failing on the Python 3 runs.

Commit messages:

Originally the parallel global state requisite did not work correctly when invoked under an orch; this fixes that, as well as running any other saltmod state (function, runner, wheel). It's not clear to me why the recursive calls were chosen originally. This should address saltstack#43668.

Rather than join()'ing on each running process, we instead use check_running to assert completion of our processes. This should provide more correct timing results, as measuring durations of a longer-running join()'d process could trample a shorter parallel process that just happened to be checked after instead of before.

Previously, durations were only recording the time to spawn the multiprocessing proc, not the actual time of completion, which is completely wrong. This should capture the full duration correctly now. We unfortunately have to duplicate start & complete times instead of using the passed-in start_time attr, as that value is only a time (not a date), so it is impossible to re-calculate the full duration based on that alone (i.e. what happens if start_time is 23:59 with a roll-over to the next day). This fixes saltstack#44828.

@ninja- noticed there was some useful code already in _call_parallel_target to mitigate KeyErrors for potentially empty cdata, but it wasn't being executed due to the invoking method making the same mistake before calling it. This moves that code up to eliminate that potential stacktrace. This should close saltstack#39832.

This should hopefully exercise a few different facets of parallel that were previously not covered in the code base.

Seems to be encountering preexisting failures in the functionality, unrelated to my changes.
@mattp- force-pushed the parallel-orch branch 2 times, most recently from 738518c to ed529e7 on April 5, 2018 20:07

parallel: True codepath incompatibilities uncovered by the added tests; additionally use salt.serializers.msgpack to avoid other py2/py3 back-compat issues.
@rallytime (Contributor) commented

@mattp- Thanks for fixing up those tests.

@cachedout and @isbm Can you guys review this again?

salt/state.py Outdated
utc_finish_time = datetime.datetime.utcnow()
delta = (utc_finish_time - utc_start_time)
# duration in milliseconds.microseconds
duration = (delta.seconds * 1000000 + delta.microseconds)/1000.0
@isbm (Contributor) commented Mar 27, 2018:
I believe lint will cry here: you want spaces around /. Update: Lint cries not, but still please put spaces around /. 😆

@mattp- (Contributor Author):
likewise, this duration code is actually in a few different spots, all without spaces. will update to pad properly in all cases in salt.state

salt/state.py Outdated
elif 'name' in cdata['kwargs']:
    name = cdata['kwargs']['name']
else:
    name = low.get('name', low.get('__id__'))
Contributor:
I am not sure here. Can it be that kwargs has a key name while args is not of length 1? I would place the kwargs check prior to the args check then. But I am not sure if I am correct. Also, I would simplify this and fix a bug which can occur once args contains a False value:

name = (cdata['args'] or [None])[0] or cdata['kwargs'].get('name')
if not name:
    name = low.get('name', low.get('__id__'))

You can also write this all into one line 😉 but this is better when you have "first get it from various arguments, or get it from low, if still not found".

@mattp- (Contributor Author):
I'll admit, this code is a replication of what occurs in the serial codepath earlier in the file (https://github.com/bloomberg/salt/blob/a9866c7a031ac6b7fe898c2380131e2a6de82c9f/salt/state.py#L1894). I also don't fully know all the ways / types of low data that can be thrown at us here :)
I'll update in both places

salt/state.py Outdated
@@ -1879,7 +1891,7 @@ def call(self, low, chunks=None, running=None, retries=1):
             # enough to not raise another KeyError as the name is easily
             # guessable and fallback in all cases to present the real
             # exception to the user
-            if len(cdata['args']) > 0:
+            if 'args' in cdata and cdata['args']:
Contributor:
Still not nice. Just:

if cdata.get('args'):

@mattp- (Contributor Author):
fixed in both places

salt/state.py Outdated
while True:
    if self.reconcile_procs(run_dict):
        break
    time.sleep(0.01)
@isbm (Contributor) commented Apr 6, 2018:
For what reason did you put this time.sleep here?

@mattp- (Contributor Author):
Some amount of sleep time is needed since we are simply looping until the processes are 'complete' inside reconcile_procs, to collect their returns and merge them back into the running state tree; with no sleep we'd peg the CPU for no reason.

@isbm (Contributor):
Ah, right. self.reconcile_procs will still go through its checks, so the CPU is still under load. But you can still use 0.0001, which will bring the CPU load down while doing faster loops.

@mattp- (Contributor Author):
It seems we do a similar polling loop in call_chunks() that uses 0.01; since these will be polling long-running processes, I think it makes sense to stick with 0.01 (or perhaps change them both), but I think 10ms latency is sufficient. If someone needed more, we should perhaps consider not polling and instead using a child watcher that Tornado provides.

@isbm (Contributor):
I'd actually change that. I forget where, but we (SUSE) had submitted a bugfix because this number actually dramatically kills performance. 😉 So if you just need to make sure the CPU doesn't go crazy, 0.0001 is fine for that purpose too, while still being small enough.

@mattp- (Contributor Author):
ah ok, will update

@mattp- (Contributor Author):
updated to 0.0001

if six.PY2:
    from urllib import quote
elif six.PY3:
    from urllib.parse import quote  # pylint: disable=no-name-in-module
Contributor:
This is wrong. Correct way is this:

from salt.ext.six.moves.urllib.parse import quote

@mattp- (Contributor Author):
fixed

@rallytime (Contributor) commented

@cachedout can you review this?

@cachedout (Contributor) left a comment:
This looks right.
