ctx.run fails due to bad command exit code #660

Closed
flazzarini opened this issue Sep 5, 2019 · 24 comments

@flazzarini commented Sep 5, 2019

Scenario

I stumbled upon this issue when using fabric2, where I had a task that runs alembic to upgrade/downgrade multiple local databases. When executing the task it would sometimes (seemingly at random) fail with the following exception.

Please note that this only happens when pty is set to False.

Version of invoke used: 1.3.0.

Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed

Reproduction of the bug

I replicated the fabric task as a pure invoke task, as shown below. When executing the following code I randomly get the error mentioned above. Executing the task multiple times shows that the error occurs in different runs each time.

Code

from invoke import task


@task
def upgrade(ctx):
    # Run the same command several times; the failure appears at random.
    for index in range(1, 10):
        try:
            print("Run upgrade {0}".format(index))
            ctx.run("./env/bin/alembic upgrade head")
        except Exception as exc:
            print(exc)

First run

$ ./env/bin/invoke upgrade
Run upgrade 1
Run upgrade 2
Run upgrade 3
Run upgrade 4
Run upgrade 5
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 6
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 7
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 8
Run upgrade 9

Second run

$ ./env/bin/invoke upgrade
Run upgrade 1
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 2
Run upgrade 3
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 4
Run upgrade 5
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 6
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 7
Run upgrade 8
Run upgrade 9

Investigation

After investigating the issue with one of my colleagues, we might have identified the cause. However, this is merely an assumption rather than a certainty. It seems that the worker threads spawned for the Runner access Local.returncode before the process has had a chance to finish, i.e. without waiting on the process.

I've tried changing Local.returncode (in invoke/runners.py) to add a Popen wait() call, which seems to fix the issue I had:

    def returncode(self):
        if self.using_pty:
            # No subprocess.returncode available; use WIFEXITED/WIFSIGNALED to
            # determine which of WEXITSTATUS / WTERMSIG to use.
            # TODO: is it safe to just say "call all WEXITSTATUS/WTERMSIG and
            # return whichever one of them is nondefault"? Probably not?
            # NOTE: doing this in an arbitrary order should be safe since only
            # one of the WIF* methods ought to ever return True.
            code = None
            if os.WIFEXITED(self.status):
                code = os.WEXITSTATUS(self.status)
            elif os.WIFSIGNALED(self.status):
                code = os.WTERMSIG(self.status)
                # Match subprocess.returncode by turning signals into negative
                # 'exit code' integers.
                code = -1 * code
            return code
            # TODO: do we care about WIFSTOPPED? Maybe someday?
        else:
            # Proposed change: block (up to 3 seconds) so the subprocess has
            # actually been reaped and returncode is populated before we read it.
            self.process.wait(timeout=3)
            return self.process.returncode
With this change applied, the task completes without errors:

$ ./env/bin/invoke upgrade
Run upgrade 1
Run upgrade 2
Run upgrade 3
Run upgrade 4
Run upgrade 5
Run upgrade 6
Run upgrade 7
Run upgrade 8
Run upgrade 9

Please consider the corresponding pull request: #661.
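As background on why the added wait() matters: subprocess.Popen.returncode is only populated once the child has been reaped via poll(), wait(), or communicate(). Here is a minimal standalone sketch of that behaviour (not Invoke code; it assumes a Unix system with /bin/true available):

import subprocess

# Spawn a child process that exits almost immediately.
proc = subprocess.Popen(["/bin/true"])

# returncode is still None at this point: Popen only fills it in once the
# child has been reaped via poll(), wait(), or communicate(). A reader racing
# ahead of the wait sees exactly this None, matching "Exit code: None" above.
print(proc.returncode)   # -> None

proc.wait()
print(proc.returncode)   # -> 0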

flazzarini pushed a commit to flazzarini/invoke that referenced this issue Sep 5, 2019
@florisla (Contributor) commented Sep 13, 2019

Thanks, #661 works for me (Python 3.6, Windows 10).

Invoke 1.3.0 is unusable for some of my tasks which happen to do a lot of sub-process calls -- it's hit and miss whether the task completes at all.

Invoke 1.2.0 and the #661 branch don't stumble on those same tasks.

@rsora commented Sep 23, 2019

Thanks @flazzarini, #661 also solves the issue we had in our GitHub Actions CI workflows (arduino/arduino-cli#416), on both Windows and Linux environments.

I hope that this PR will be merged ASAP 😸

@beornling commented:

Good work @flazzarini. I'm running into the same basic issue, but using invoke's c.run() directly. My use case has a failure rate of ~10%. To test some changes to runners.py, I simplified your test to just use the Windows dir command.

tasks.py

from invoke import task

@task
def test(ctx):
    print('Beginning test...')
    failures = 0
    count = 0
    for _ in range(0, 1000):
        count += 1
        try:
            resp = ctx.run('dir', hide='both')
            if not resp.ok:
                failures += 1
                #print(resp)
        except Exception as exc:
            failures += 1
            #print(exc)

    print(f'Finished: ran {count} times with {failures} failures')

Here's the baseline running pyinvoke 1.3.0.

Beginning test...
Finished: ran 1000 times with 83 failures

Running pyinvoke 1.3.0 + change from #661.

Beginning test...
Finished: ran 1000 times with 0 failures

Since your change in #661 might wait a bit too long, I tested a few alternatives. First, instead of waiting with a static timeout, I tried just returning self.process.poll(), as that updates the process returncode as well.

Running pyinvoke 1.3.0 + 'return self.process.poll()'

Beginning test...
Finished: ran 1000 times with 6 failures

Lastly, I tested with just setting the wait timeout to the existing self.input_sleep value, which defaults to 0.01 seconds.

Running pyinvoke 1.3.0 + 'self.process.wait(timeout=self.input_sleep)'

Beginning test...
Finished: ran 1000 times with 0 failures

As you can see, even a tiny wait is enough to fix this issue. Hopefully this change makes it in soon, or we'll have to find an alternative to pyinvoke, since this sort of instability is causing all sorts of random errors.
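To illustrate the difference between those two alternatives outside of Invoke, here is a minimal standalone sketch (not Invoke code; it assumes a Unix sleep binary):

import subprocess

# A child process that stays alive briefly.
proc = subprocess.Popen(["sleep", "0.2"])

# poll() is non-blocking: it returns None while the child is still running,
# which is presumably why returning poll() only reduced the failure count.
print(proc.poll())       # -> None, the child is almost certainly still running

# wait(timeout=...) blocks until the child exits, or raises
# subprocess.TimeoutExpired once the timeout elapses; even a small timeout
# gives the child a chance to be reaped before returncode is read.
proc.wait(timeout=1)
print(proc.returncode)   # -> 0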

@SwampFalc commented:

I'd like to report I'm also running into this issue. A quick fix to this would be very much appreciated.

@cod3monk3y commented:

I think this is the same issue:
#683

My analysis and (hack-ish) workaround might also be helpful to the discussion:
cod3monk3y@1123485

@krupan commented Jan 11, 2020

I think I might be seeing the same bug. I explained what I'm seeing here: #683 (comment)

bitprophet added this to the Next bugfix milestone Jan 11, 2020
@bitprophet (Member) commented:

I'm seeing this myself now, mostly via Travis-CI on other projects using Invoke to run e.g. pytest. Thanks to @flazzarini for the post & investigation. Thread race conditions are the obvious culprit, so this is unsurprising (and wearying: the whole thread-shutdown logic has already gone through a few mutations).

I'll look at the PR in my next block of OSS time.

mawillcockson added a commit to mawillcockson/install-keepass that referenced this issue Jan 24, 2020
Have to wait for this to be merged, or work around the issue.

pyinvoke/invoke#660 (comment)
claims v1.2.0 works

As of right now:
https://giant.gfycat.com/BabyishAggravatingGoldeneye.webm
@bitprophet (Member) commented Jan 24, 2020

Attempting to reliably repro using a variant of @beornling's snippet. Platforms:

  • macOS 10.14, pyenv-installed Python 3.7.4
  • Ubuntu 19.10 desktop edition (on a Thinkpad), apt-installed Python 3.7.5

When the snippet uses ls and is part of the integration suite (ideally I want this to live there), I cannot repro on either host after a good half dozen tries each.

When I tease it out as a regular standalone task, I can get 1/1000 errors on macOS maybe once every 5 runs, which isn't surprising given I've never experienced this myself locally despite heavy dogfooding.

On Linux, it pretty reliably (Edit: I can still occasionally get 0 errors, though) spits out a low single-digit number of errors (2, 3, 5) per run. This isn't great (it's no 83/1000 per @beornling), but it at least means I may have a way forward re: testing fixes, whether the subprocess.wait() seen in the comments/#661 or something "cleaner".

I may add that task as-is to the CI command set as well - I'd rather it be a real test but since something about running it inside pytest seems to buffer against the problem, pragmatism may have to win out.

@bitprophet (Member) commented Jan 24, 2020

Also, it sounds like this started happening in Invoke 1.3.0 - on the off chance someone sees this while I am still working and can verify whether they saw it earlier, please chime in. I'll try to verify myself, as it may provide a clue.

Invoke 1.4 changed up a bunch of how Runner is organized, but I don't believe it actually modified the guts of what's relevant here. I've been testing on master so far, which is ~= 1.4.0.

Tweaking the test from ls to e.g. sleep 1 seems to induce more errors (36/1000) but of course takes significantly longer (many minutes) to run. Meh; ls is fine for now.

@bitprophet (Member) commented:

On macOS, I upped the test length to 10,000 cycles and am still not really getting much, ever.

On Linux, I checked out v1.2.0 and did 3x test runs: no errors. v1.3.0... unfortunately the same. On master the repro is slightly more reliable. Back on 1.3.0, I am now getting some repros there too.

Got smarter and started using e.g. seq 8 | parallel -n0 inv -c temp temp (file and task names are actually slightly... cruder 😇), which is a much nicer way to leverage the hardware; I also suspect load may exacerbate the issue, as it often does for race conditions.

I then bumped up to seq 24, basically running 3x batches of 8x jobs in parallel (the system is a quad-core w/ HT enabled), which really ought to be enough to conclusively yea/nay the issue. (Sadly this is all likely to be moot-ish on CI, which has far fewer cores, but I guess we'll see. I suspect the issue is more prevalent on heavily virtualized/containerized platforms anyway, given how often I seem to hit it when pushing updates.)

@SwampFalc commented:

I can confirm it only started happening after I upgraded my invoke to 1.3. The task runs from a cron script once every 10 minutes, and while I did not spot it right away, I do know I had done the update at roughly the time the errors started (after having run for two years without any errors).

@bitprophet (Member) commented Jan 24, 2020

Thanks @SwampFalc! Exactly what I was looking for.

I'm definitely seeing the same here; anytime I git checkout 1.2.0 the issue vanishes, only to reappear on 1.3. Will git log/bisect shortly.

@mawillcockson commented Jan 24, 2020

Edit: I added nothing new that @beornling didn't already cover, just a confirmation that it happens more frequently on Windows.

For me this happens on Python 3.8.1 with invoke 1.4, with or without running the script through pipenv: https://giant.gfycat.com/BabyishAggravatingGoldeneye.webm

@bitprophet (Member) commented:

Sadly I don't dev on Windows so I've no offhand idea why it's so much worse there, but I'm relatively confident any fix that kills the issue on Unixes will work there as well. Will definitely be looking for confirmation of any final fix from y'all Windows users 👍

@bitprophet (Member) commented:

Skimming git/changelog shows that the culprit was probably changes for one of #552 or #539, both of which touched Runner code; I'm guessing #552 as it's more relevant. Diff time.

@bitprophet (Member) commented:

Got the "regression test" on Travis anyhow. No likey pytest so it's just its own line - in the middle of the big 'build' step that Travis...doesn't short-circuit? which always confuses the hell out of me. Been meaning to switch to CircleCI anyways, which looks like it does the intuitive thing there; maybe I'll fuck with that soon.

Anyway, it does fail in basically all cells:

[screenshot: Travis build matrix with the regression check failing in nearly every cell]

@bitprophet (Member) commented Jan 25, 2020

git bisect confirms the issue was introduced in 75c4d5e, a.k.a. the commit that both closed subprocess stdin and modified what it means for an IO worker thread to be "dead".

The latter is the usual culprit for this class of issue, so again, not surprising. I independently tried tweaking just the subprocess-stdin-closing part, and that doesn't seem to make a difference, other than making the issue worse when it's commented out.

I tried reverting just the ExceptionHandlingThread.is_dead change and poof, the issue immediately disappears 😞

This is why I really hate it when that code changes...


I don't see any material in or related to #552 that clearly explains why this change was directly relevant to the stdin blocking problem. My suspicion is that the OP in that ticket got turned around while debugging and thought the dead-thread checker was part and parcel of the problem (and I clearly didn't catch this, or it wasn't fully relevant at the time).

(It occurs to me today, rereading the code, that this may be partly due to the use of "dead" here to mean "a thread that died unexpectedly"... it's a bit confusing. The fact that the tests in that PR had to be "fixed" to keep passing should have been a sign to both of us that they were failing correctly - that's on me.)


I also tested with a sample of one of the triggers for the issue that #552 fixes (Invoke never telling its subprocess about mirrored stdin closing): echo Hi | python -c "from invoke import run; run('cat')".

So I think for now this is probably the culprit, and reverting it seems unlikely to completely undo the good works of #552. I'm willing to revert it and then wait to see whether there's a true corner case I'm missing here that the is_dead tweak actually fixes.

bitprophet added a commit that referenced this issue Jan 25, 2020
See comments in #660 for rationale and testing.

tl;dr I believe this part of the fix for #552 was erroneous
and I cannot currently prove that it's necessary to fix
the main #552 bug; and I can prove that it is causing #660;
so, revert it is. Plus clarifying the tests that were
'fixed' by #552.
@bitprophet (Member) commented:

Travis seems to confirm: the regular tests all still pass, and the new regression checker does too.

Just merged up through master. I'd love it if someone else (especially a Windows user) could grab one of the branches (1.3, 1.4 or master) and confirm it's good for them as well. I'll release after that happens.

@cod3monk3y commented:

@bitprophet I'm on Windows. I'll have a look right now.

@cod3monk3y commented:

@bitprophet master (8b4206e) works on Windows. I built a regression test that fails 100% of the time for me. I'm working on cleaning it up and I'll post it in a new repo for you to look at. Basically, it's a task file with 26 tasks (a through z) and a call to invoke a b c ... z. Each task outputs its name (e.g. task 'a' outputs 'a') into a shared file. Since the tasks from the command-line invocation appear to execute in order, the last line in the file should be 'z'.

I've found iterating on that test 5 times will consistently produce the failure.

I also built a search (bisector-esque) and it looks like 75c4d5e (8-Jul-2019) introduced the problem.

But the latest on master (8b4206e) is working great on my Windows box.
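For anyone who wants to reproduce this, here is a minimal sketch of what such an 'a to z' tasks file could look like (a hypothetical reconstruction; the real harness is in the repo linked a few comments below):

# tasks.py -- hypothetical sketch of the "a to z" check described above.
import string

from invoke import task

def _make_task(letter):
    @task(name=letter)
    def _run(ctx):
        # Each task appends its own name to a shared file. If all 26 tasks
        # invoked as `invoke a b c ... z` really ran to completion, in order,
        # the last line of out.txt will be "z".
        with open("out.txt", "at") as fh:
            fh.write(letter + "\n")
    return _run

# Register one task per letter (a, b, ..., z) as module-level names so that
# invoke's collection loader picks them up.
for _letter in string.ascii_lowercase:
    globals()[_letter] = _make_task(_letter)

Deleting out.txt, running invoke a b c ... z a handful of times, and checking the file's last line should then reproduce the failure on affected versions, as the results further down also suggest.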

@bitprophet (Member) commented Jan 26, 2020

Cool, that sounds like I identified the culprit accurately then - thanks so much @cod3monk3y!

I'll release the fix as 1.3.x and 1.4.x next chance I get, should be this week sometime.

@cod3monk3y commented:

@bitprophet Here's a hacked up regression search system with the test I mentioned for this issue:

https://github.com/cod3monk3y/regress

Feel free to take what you need, if you find anything useful!

@florisla (Contributor) commented Jan 27, 2020

I can confirm the fix works.

Using a similar 'a to z' tasks file I tested on Windows 10 with various releases of invoke and multiple Python versions.

py.exe -3.8 -m pip install --upgrade invoke==1.2.0
del out.txt && py.exe -3.8 -m invoke a b c d e f g h i j k l m o q r s t u v w x y z && py.exe -3.8 -c "import invoke; print(invoke.__version__, open('out.txt', 'rt').read()[-3:].strip())"

Results (invoke version, followed by the last letter written to out.txt):

1.2.0              z
1.3.0              b
1.4.0 (release)    c
1.4.0 (git master) z

This is reproducible on Python 2.7, 3.4, 3.5, 3.6, 3.7 and 3.8.

Thanks so much for invoke, and for looking into this!

@bitprophet (Member) commented:

1.3.1 and 1.4.1 are out!
