add parallel support for orchestrations #46023
Conversation
@mattp I think that your patch is very useful and can replace the `parallel_runners` feature. With your patch, defining several states with `parallel: True` achieves the same effect. The reason why I specifically added the feature for runners is that this was the place where I needed it. Think of the following example: you want to set up some clustered database service by first installing the necessary software on a number of nodes and then creating a database on the cluster. The software installation can happen on all nodes in parallel, but you can only configure the database after the installation has finished on all nodes. There might be similar cases in the regular (non-orchestration) state logic, but I have not encountered them so far. This might be due to the fact that the logic on a single node is typically state-based while runner / orchestration logic is more action-based (though it internally uses the state engine as well). I hope this information gives you some insight into why I implemented the `parallel_runners` feature.
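The install-in-parallel, configure-only-afterwards pattern described above can be sketched outside Salt with Python's multiprocessing; the function names and messages here are made up for illustration.

```python
import multiprocessing

# Use the fork start method so this sketch also works when run as a plain
# script on platforms that default to spawn.
ctx = multiprocessing.get_context('fork')


def install(node):
    # Stand-in for installing the database software on one node.
    return (node, 'installed')


def orchestrate(nodes):
    # Phase 1: install on all nodes in parallel.
    with ctx.Pool(len(nodes)) as pool:
        results = dict(pool.map(install, nodes))
    # Phase 2: configure only after *every* install has finished.
    assert all(v == 'installed' for v in results.values())
    return 'database configured on %d nodes' % len(results)


print(orchestrate(['node1', 'node2', 'node3']))
# prints: database configured on 3 nodes
```

The point is the barrier between the two phases: `pool.map` does not return until all parallel installs complete, mirroring what a requisite on the parallel states achieves in an orchestration.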
@smarsching that was informative, thank you. I think you could maybe achieve similar results with `require`/`require_in` and `parallel: True` on the statemod runner(s), no? More options are usually better either way :)
so I think they accomplish the same thing, just differently; the distinction is really threading vs. forking, @mattp-. Regardless, I'd be happy to have either or both. It may prove itself useful down the road.
@mattp- & @austinpapp You are right, I did not think about combining the two approaches.
salt/state.py
Outdated
@@ -2326,8 +2326,7 @@ def check_requisite(self, low, running, chunks, pre=False):
                 continue
             if run_dict[tag].get('proc'):
                 # Run in parallel, first wait for a touch and then recheck
-                time.sleep(0.01)
-                return self.check_requisite(low, running, chunks, pre)
+                run_dict[tag].get('proc').join()
Eeeeeek. Will this block?
This will create a zombie until parent quits.
@cachedout yes
@isbm according to https://docs.python.org/2/library/multiprocessing.html#all-platforms join will waitpid and reap, unless I'm misreading
Yes. But `.join()` will block everything. And if you did not set it daemonic, it will anyway be joined after the start.
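A minimal stand-alone sketch of the two points being debated here: polling `is_alive()` keeps the caller responsive, while `join()` blocks but also reaps the finished child via `waitpid`, so no zombie remains. This is illustrative code, not Salt's.

```python
import multiprocessing
import time

# Fork start method keeps this runnable as a plain script everywhere.
ctx = multiprocessing.get_context('fork')

p = ctx.Process(target=time.sleep, args=(0.05,))
p.start()

# Non-blocking alternative: poll is_alive() and do other work in between.
while p.is_alive():
    time.sleep(0.01)        # avoid pegging the CPU while polling

# join() both waits for the child (a no-op now, since it already exited)
# and reaps it via waitpid, so no zombie is left behind.
p.join()
assert p.exitcode == 0
```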
this is now rebased to target 2017.7.5.
this also fixes #44828
accidentally introduced a regression when I revisited the proc.wait behavior; should be fixed now but I'll wait to see what jenkins has to say.
@mattp- Looks like your new tests are failing on the Python 3 runs.
originally the parallel global state requisite did not work correctly when invoked under an orch - this fixes that, as well as running any other saltmod state (function, runner, wheel).
it's not clear to me why the recursive calls were chosen originally. this should address saltstack#43668
rather than join()'ing on each running process, we instead use check_running to assert completion of our processes. This should provide more correct timing results, as measuring durations of a longer-running join()'d process could trample a shorter parallel process that just happened to be checked after instead of before.
previously durations were only recording the time to spawn the multiprocessing proc, not the actual time of completion, which is completely wrong. This should capture the full duration correctly now. We unfortunately have to duplicate start & complete times instead of using the passed-in start_time attr, as that value is only a time (not a date), so it is impossible to re-calculate the full duration based on that alone (i.e., what happens if start_time is 23:59 with a roll-over to the next day). This fixes saltstack#44828
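The midnight-rollover problem the commit message describes can be demonstrated with stdlib `datetime` alone; the timestamps below are illustrative, not taken from Salt.

```python
import datetime

start = datetime.datetime(2018, 2, 15, 23, 59, 0)
finish = datetime.datetime(2018, 2, 16, 0, 1, 0)   # two minutes later

# With full datetimes the delta is correct even across the day boundary.
delta = finish - start
duration_ms = (delta.days * 86400 * 1000000
               + delta.seconds * 1000000
               + delta.microseconds) / 1000.0
assert duration_ms == 120000.0

# With only the time-of-day (what a time-only start_time preserves),
# naive subtraction goes negative and the real duration is unrecoverable.
start_t = start.time()     # 23:59:00
finish_t = finish.time()   # 00:01:00
naive_seconds = ((finish_t.hour - start_t.hour) * 3600
                 + (finish_t.minute - start_t.minute) * 60)
assert naive_seconds < 0
```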
@ninja- noticed there was some useful code already in _call_parallel_target to mitigate KeyErrors for potentially empty cdata, but it wasn't being executed due to the invoking method making the same mistake before calling it. this moves that code up to eliminate that potential stacktrace. this should close saltstack#39832
this should hopefully exercise a few different facets of parallel that were previously not covered in the code base.
seems to be encountering preexisting failures unrelated to my changes.
fixed `parallel: True` codepath incompatibilities uncovered by the added tests; additionally, use salt.serializers.msgpack to avoid other py2/py3 backcompat issues.
@mattp- Thanks for fixing up those tests. @cachedout and @isbm Can you guys review this again?
salt/state.py
Outdated
utc_finish_time = datetime.datetime.utcnow()
delta = (utc_finish_time - utc_start_time)
# duration in milliseconds.microseconds
duration = (delta.seconds * 1000000 + delta.microseconds)/1000.0
I believe lint will cry here: you want spaces around `/`. Update: lint cries not, but still please put spaces around `/`. 😆
likewise, this duration code is actually in a few different spots, all without spaces. will update to pad properly in all cases in salt.state
salt/state.py
Outdated
elif 'name' in cdata['kwargs']:
    name = cdata['kwargs']['name']
else:
    name = low.get('name', low.get('__id__'))
I am not sure here. Can it be that `kwargs` has a key `name` while `args` is not of length 1? I would place the `kwargs` check before the `args` check then. But I am not sure if I am correct. Also, I would simplify this and fix a bug that can occur once `args` contains a `False` value:
name = (cdata['args'] or [None])[0] or cdata['kwargs'].get('name')
if not name:
    name = low.get('name', low.get('__id__'))
You can also write this all into one line 😉 but this is better when you have "first get it from various arguments, or get it from `low` if still not found".
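The suggested fallback chain can be exercised with made-up `cdata`/`low` dictionaries; the field names follow the snippet above, but the data is illustrative, not real Salt lowstate.

```python
def extract_name(cdata, low):
    # The reviewer's suggested fallback chain, wrapped for testing:
    # positional arg -> kwargs['name'] -> low's name -> low's __id__.
    name = (cdata['args'] or [None])[0] or cdata['kwargs'].get('name')
    if not name:
        name = low.get('name', low.get('__id__'))
    return name


low = {'__id__': 'my-state-id'}

# A positional arg wins when present.
assert extract_name({'args': ['from-args'], 'kwargs': {}}, low) == 'from-args'
# Falls back to kwargs['name'].
assert extract_name({'args': [], 'kwargs': {'name': 'from-kwargs'}}, low) == 'from-kwargs'
# Falls back to low's name / __id__.
assert extract_name({'args': [], 'kwargs': {}}, low) == 'my-state-id'
# The False-value case mentioned above: a falsy first arg no longer raises,
# though it does fall through to the later fallbacks.
assert extract_name({'args': [False], 'kwargs': {}}, low) == 'my-state-id'
```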
I'll admit, this code is a replication of what occurs in the serial codepath earlier in the file (https://github.com/bloomberg/salt/blob/a9866c7a031ac6b7fe898c2380131e2a6de82c9f/salt/state.py#L1894). I also don't fully know all the ways / types of low data that can be thrown at us here :)
I'll update in both places
salt/state.py
Outdated
@@ -1879,7 +1891,7 @@ def call(self, low, chunks=None, running=None, retries=1):
             # enough to not raise another KeyError as the name is easily
             # guessable and fallback in all cases to present the real
             # exception to the user
-            if len(cdata['args']) > 0:
+            if 'args' in cdata and cdata['args']:
Still not nice. Just:
if cdata.get('args'):
fixed in both places
salt/state.py
Outdated
while True:
    if self.reconcile_procs(run_dict):
        break
    time.sleep(0.01)
For what reason did you put this `time.sleep` here?
some amount of sleep time is needed since we are simply looping until the processes are 'complete' inside reconcile_procs to collect their returns and merge them back into the running state tree; if no sleep we'd peg the cpu for no reason
Ah, right. `self.reconcile_procs` will still go through its checks and pass through, so the CPU is still under load. But you can use `0.0001`, which will lower the CPU load while doing faster loops.
It seems we do a similar polling loop in call_chunks() that uses 0.01; since these will be polling long-running processes I think it makes sense to stick with 0.01 (or perhaps change them both), but I think 10ms latency is sufficient. If someone needed more, we should perhaps consider not polling and instead using a child watcher that tornado provides.
I'd actually change that. I forgot where, but we (SUSE) had submitted a bugfix where this number actually dramatically kills performance. 😉 So if you just need to make sure the CPU doesn't go crazy, `0.0001` is just fine for that reason too, and small enough.
ah ok, will update
updated to 0.0001
salt/utils/files.py
Outdated
if six.PY2:
    from urllib import quote
elif six.PY3:
    from urllib.parse import quote  # pylint: disable=no-name-in-module
This is wrong. The correct way is this:
from salt.ext.six.moves.urllib.parse import quote
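Outside of Salt's vendored six, the same shim can be sketched with a stdlib-only try/except; inside Salt the `salt.ext.six.moves.urllib.parse` import above remains the correct form.

```python
# py2/py3 compatibility shim for urllib's quote, stdlib only:
try:
    from urllib.parse import quote          # Python 3
except ImportError:                          # pragma: no cover
    from urllib import quote                # Python 2 fallback

# quote percent-encodes unsafe characters but leaves '/' alone by default.
assert quote('a file name.sls') == 'a%20file%20name.sls'
assert quote('/srv/salt/orch.sls') == '/srv/salt/orch.sls'
```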
fixed
@cachedout can you review this?
This looks right.
What does this PR do?
originally the parallel global state requisite did not work correctly when invoked under an orch - this fixes that, as well as running any other saltmod state (function, runner, wheel).
I'm opening this now as a review of the patch, as I'm not sure of the full repercussions of my changes. In particular, providing orchestration_jid for __pub_jid. Additionally, tests (submitting this now to see how jenkins responds).
Finally: I noticed there is a parallel_runners feature recently implemented by @smarsching. Wondering about the reasoning for a runner-specific parallel versus the state engine feature?
Previous Behavior
parallel: True did not work in orch.
New Behavior
parallel: True works in orch; parallel: True works on nested orch runners.
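A hypothetical orchestration SLS exercising the fixed behavior might look like this; the target names, SLS files, and the `my_setup.configure_cluster` runner are all made up for illustration:

```yaml
# Hypothetical orch SLS: two parallel salt.state runs, then a runner that
# requires both, mirroring the install-then-configure example above.
install_on_node1:
  salt.state:
    - tgt: node1
    - sls: database.install
    - parallel: True

install_on_node2:
  salt.state:
    - tgt: node2
    - sls: database.install
    - parallel: True

configure_database:
  salt.runner:
    - name: my_setup.configure_cluster
    - require:
      - salt: install_on_node1
      - salt: install_on_node2
```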
Tests written?
No
Commits signed with GPG?
No
Please review Salt's Contributing Guide for best practices.
See GitHub's page on GPG signing for more information about signing commits with GPG.