
add parallel support for orchestrations #46023

Merged
merged 10 commits on Apr 10, 2018

Conversation

@mattp- (Contributor) commented Feb 14, 2018

What does this PR do?

Originally, the parallel global state requisite did not work correctly when invoked under an orch; this fixes that, and also makes running any other saltmod state (function, runner, wheel) work.

I'm opening this now for review of the patch, as I'm not sure of the full repercussions of my changes, in particular providing orchestration_jid for __pub_jid. Additionally, tests are pending (submitting this now to see how Jenkins responds).

Finally: I noticed there is a parallel_runners feature recently implemented by @smarsching. What was the reasoning for a runner-specific parallel versus the state engine feature?

Previous Behavior

parallel: True did not work in orch.

New Behavior

parallel: True works in orch; parallel: True also works on nested orch runners.
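
For illustration, a minimal orchestration SLS sketch of the kind this enables; the target globs and state file names here are hypothetical, not taken from this PR:

  # orch/deploy.sls -- run with: salt-run state.orchestrate orch.deploy
  deploy_web:
    salt.state:
      - tgt: 'web*'        # hypothetical minion target
      - sls: apache        # hypothetical state file
      - parallel: True     # previously did not work when invoked under an orch

  deploy_db:
    salt.state:
      - tgt: 'db*'
      - sls: postgres
      - parallel: True     # with this change, both salt.state jobs run concurrently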

Tests written?

No

Commits signed with GPG?

No

Please review Salt's Contributing Guide for best practices.

See GitHub's page on GPG signing for more information about signing commits with GPG.

@mattp- (Contributor Author) commented Feb 15, 2018

added commit 30166e1 which should fix #43668 and #39832

@smarsching (Contributor) commented

@mattp I think that your patch is very useful and can replace the salt.parallel_runners state for some, but not all use cases.

The parallel = True attribute means that a state decorated with it does not block other states from running. In contrast, the salt.parallel_runners state does block other states from running, but executes a number of sub-tasks (in the form of runners) in parallel. This makes it possible to have another state in the orchestration sequence depend on the completion of all the parallel tasks in a previous state.

With your patch, defining several states with parallel = True should work fine when one only cares about the tasks being run in parallel. However, if one needs some sequential logic in combination with parallel sub-tasks, I do not think it will help. The only way to make such a scenario work would be to put the parallel tasks into a separate file and use state.orchestrate with that file. In that case, state.orchestrate would act as the synchronization point for all the parallel tasks defined in the other file. However, I think that it is less user-friendly (and also less readable) if one is forced to use a separate state file for the parallel tasks and cannot embed them into the same state file as the sequential tasks.

The reason why I specifically added the feature for runners is that this was the place where I needed it. Think of the following example:

You want to set up some clustered database service by first installing the necessary software on a number of nodes and then creating a database on the cluster. The software installation can happen on all nodes in parallel, but you can only configure the database after the installation has finished on all nodes.

There might be similar cases in the regular state (non-orchestration) logic, but I have not encountered them so far. This might be due to the fact that the logic on a single node is typically state-based while runner / orchestration logic is more action-based (though it internally uses the state engine as well).

I hope this information gives you some insight into why I implemented the salt.parallel_runners feature.
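
To make the distinction concrete, here is a rough sketch of the pattern described above, as I read the documented parallel_runners syntax (the orchestration file names are hypothetical): salt.parallel_runners runs its sub-runners concurrently but blocks as a whole, so a later state can require it as a synchronization point.

  install_all_nodes:
    salt.parallel_runners:
      - runners:
          install_node_a:
            - name: state.orchestrate
            - kwarg:
                mods: orch.install_node_a   # hypothetical sub-orchestration
          install_node_b:
            - name: state.orchestrate
            - kwarg:
                mods: orch.install_node_b   # hypothetical sub-orchestration

  configure_database:
    salt.runner:
      - name: state.orchestrate
      - mods: orch.configure_db             # hypothetical; runs only after both installs finish
      - require:
        - salt: install_all_nodes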

@rallytime requested review from a team on February 15, 2018 19:32
@ghost requested review from a team on February 15, 2018 19:32
@mattp- (Contributor Author) commented Feb 15, 2018

@smarsching that was informative, thank you. I think you could maybe achieve something similar with require/require_in and parallel: True'ing the saltmod runner states, no? More options are usually better either way :)

@austinpapp (Contributor) commented

So I think they accomplish the same thing, just differently:
https://gist.github.com/austinpapp/e2775271305731327ae335c5c7e0ad17

However, the distinction is really threading vs. forking, @mattp-.

Regardless, I'd be happy to have either or both. It may prove itself useful down the road.

@smarsching (Contributor) commented

@mattp- & @austinpapp You are right, I did not think about combining the parallel attribute with require. This way, one should be able to get the desired behavior. I definitely like that idea. 👍
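
For reference, a sketch of that combination (target and state names are hypothetical): the two salt.state jobs run concurrently thanks to parallel: True, and the require on the final step acts as the synchronization point.

  install_node_a:
    salt.state:
      - tgt: 'node-a'         # hypothetical target
      - sls: db.install       # hypothetical state file
      - parallel: True

  install_node_b:
    salt.state:
      - tgt: 'node-b'
      - sls: db.install
      - parallel: True

  create_database:
    salt.function:
      - name: cmd.run
      - tgt: 'node-a'
      - arg:
        - /usr/local/bin/create-db.sh   # hypothetical command
      - require:                        # waits for both parallel jobs to finish
        - salt: install_node_a
        - salt: install_node_b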

salt/state.py Outdated
@@ -2326,8 +2326,7 @@ def check_requisite(self, low, running, chunks, pre=False):
                     continue
                 if run_dict[tag].get('proc'):
                     # Run in parallel, first wait for a touch and then recheck
-                    time.sleep(0.01)
-                    return self.check_requisite(low, running, chunks, pre)
+                    run_dict[tag].get('proc').join()
@cachedout (Contributor):
Eeeeeek. Will this block?

@isbm (Contributor):
This will create a zombie until parent quits.

@mattp- (Contributor Author):
@cachedout yes
@isbm according to https://docs.python.org/2/library/multiprocessing.html#all-platforms join() will waitpid and reap, unless I'm misreading

@isbm (Contributor):
Yes. But .join() will block everything. And if you did not set it daemonic, it will be joined anyway after the start.

@mattp- force-pushed the parallel-orch branch 2 times, most recently from 7319051 to 1e43f2b on March 27, 2018 01:23
@mattp- changed the base branch from develop to 2017.7.5 on March 27, 2018 01:24
@mattp- (Contributor Author) commented Mar 27, 2018

This is now rebased to target 2017.7.5.
@isbm I revisited the proc.join() behavior of this and found a solution that doesn't involve using it along the way. Also included a fix for measuring duration for parallel processes, as well as another fix from #39832.
It's been a bit of work, but I believe this should fix all outstanding parallel issues, whether minion- or master-side.

@mattp- (Contributor Author) commented Mar 27, 2018

this also fixes #44828

@mattp- (Contributor Author) commented Apr 3, 2018

Accidentally introduced a regression when I revisited the proc.wait behavior; should be fixed now, but I'll wait to see what Jenkins has to say.

@rallytime (Contributor) commented

@mattp- Looks like your new tests are failing on the Python 3 runs.

Commit messages:

Originally the parallel global state requisite did not work correctly when invoked under an orch; this fixes that, as well as running any other saltmod state (function, runner, wheel). It's not clear to me why the recursive calls were chosen originally. This should address saltstack#43668.

Rather than join()'ing on each running process, we instead use check_running to assert completion of our processes. This should provide more correct timing results, as measuring durations of a longer-running join()'d process could trample a shorter parallel process that just happened to be checked after instead of before.

Previously, durations were only recording the time to spawn the multiprocessing proc, not the actual time of completion, which is completely wrong. This should capture the full duration correctly now. We unfortunately have to duplicate start & complete times instead of using the passed-in start_time attr, as that value is only a time (not a date), so it is impossible to re-calculate the full duration based on that alone (i.e. what happens if start_time is 23:59 with a roll-over to the next day). This fixes saltstack#44828.

@ninja- noticed there was some useful code already in _call_parallel_target to mitigate KeyErrors for potentially empty cdata, but it wasn't being executed due to the invoking method making the same mistake before calling it. This moves that code up to eliminate that potential stacktrace. This should close saltstack#39832.

This should hopefully exercise a few different facets of parallel that were previously not covered in the code base.

Seems to be encountering preexisting failures in the functionality, unrelated to my changes.
@mattp- force-pushed the parallel-orch branch 2 times, most recently from 738518c to ed529e7 on April 5, 2018 20:07

parallel: True codepath incompatibilities uncovered by the added tests; additionally use salt.serializers.msgpack to avoid other py2/py3 back-compat issues.
@rallytime (Contributor) commented

@mattp- Thanks for fixing up those tests.

@cachedout and @isbm Can you guys review this again?

salt/state.py Outdated
utc_finish_time = datetime.datetime.utcnow()
delta = (utc_finish_time - utc_start_time)
# duration in milliseconds.microseconds
duration = (delta.seconds * 1000000 + delta.microseconds)/1000.0
@isbm (Contributor) commented Mar 27, 2018:
I believe lint will cry here: you want spaces around /. Update: Lint cries not, but still please put spaces around /. 😆

@mattp- (Contributor Author):
likewise, this duration code is actually in a few different spots, all without spaces. will update to pad properly in all cases in salt.state

salt/state.py Outdated
elif 'name' in cdata['kwargs']:
    name = cdata['kwargs']['name']
else:
    name = low.get('name', low.get('__id__'))
Contributor:
I am not sure here. Can it be that kwargs has a key name while args is not of length 1? I would place the kwargs check prior to the args check then. But I am not sure if I am correct. Also, I would simplify this and fix a bug which can occur once args contains a False value:

name = (cdata['args'] or [None])[0] or cdata['kwargs'].get('name')
if not name:
    name = low.get('name', low.get('__id__'))

You can also write this all into one line 😉 but this is better when you have "first get it from various arguments, or get it from low, if still not found".

@mattp- (Contributor Author):
I'll admit, this code is a replication of what occurs in the serial codepath earlier in the file (https://github.com/bloomberg/salt/blob/a9866c7a031ac6b7fe898c2380131e2a6de82c9f/salt/state.py#L1894). I also don't fully know all the ways / types of low data that can be thrown at us here :)
I'll update in both places

salt/state.py Outdated
@@ -1879,7 +1891,7 @@ def call(self, low, chunks=None, running=None, retries=1):
             # enough to not raise another KeyError as the name is easily
             # guessable and fallback in all cases to present the real
             # exception to the user
-            if len(cdata['args']) > 0:
+            if 'args' in cdata and cdata['args']:
Contributor:
Still not nice. Just:

if cdata.get('args'):

@mattp- (Contributor Author):
fixed in both places

salt/state.py Outdated
while True:
    if self.reconcile_procs(run_dict):
        break
    time.sleep(0.01)
@isbm (Contributor) commented Apr 6, 2018:
For what reason did you put this time.sleep here?

@mattp- (Contributor Author):
Some amount of sleep time is needed since we are simply looping until the processes are 'complete' inside reconcile_procs, to collect their returns and merge them back into the running state tree; with no sleep we'd peg the CPU for no reason.

@isbm (Contributor):
Ah, right. self.reconcile_procs will still go through its checks, so the CPU is still under load. But you can still use 0.0001, which will bring the CPU load down while doing faster loops.

@mattp- (Contributor Author):
It seems we do a similar polling loop in call_chunks() that uses 0.01; since these will be polling long-running processes, I think it makes sense to stick with 0.01 (or perhaps change them both), but I think 10ms latency is sufficient. If someone needed more, we should perhaps consider not polling and instead using a child watcher that Tornado provides.

@isbm (Contributor):
I'd actually change that. I forget where, but we (SUSE) had submitted a bugfix because this number actually dramatically kills performance. 😉 So if you just need to make sure the CPU doesn't go crazy, 0.0001 is fine for that purpose too, while still being small enough.

@mattp- (Contributor Author):
ah ok, will update

@mattp- (Contributor Author):
updated to 0.0001

if six.PY2:
    from urllib import quote
elif six.PY3:
    from urllib.parse import quote  # pylint: disable=no-name-in-module
Contributor:
This is wrong. Correct way is this:

from salt.ext.six.moves.urllib.parse import quote

@mattp- (Contributor Author):
fixed

@rallytime (Contributor) commented

@cachedout can you review this?

@cachedout (Contributor) left a comment:
This looks right.
