ctx.run fails due to bad command exit code #660
Comments
Thanks @flazzarini, #661 also solves the same issue we had in our GitHub Actions CI workflows (arduino/arduino-cli#416), on both Windows and Linux environments. I hope this PR gets merged ASAP 😸
Good work @flazzarini. I'm running into the same basic issue, but directly using pyinvoke c.run(). My use case has a failure rate of ~10%. To test some changes to runners.py, I simplified your test to just use the Windows "dir" command. tasks.py:
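A minimal sketch of what that tasks.py might look like - the task name, cycle count, and failure accounting here are assumptions, not the poster's original file:

```python
# Hypothetical reconstruction of the repro tasks.py described above.
from invoke import task

@task
def repro(c, cycles=1000):
    failures = 0
    for _ in range(cycles):
        # warn=True keeps the loop going when a run reports a bad exit
        # code; hide=True suppresses the command's output.
        result = c.run("dir", hide=True, warn=True)
        if not result.ok:
            failures += 1
    print("{}/{} runs reported a bad exit code".format(failures, cycles))
```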
Here's the baseline running pyinvoke 1.3.0.
Running pyinvoke 1.3.0 + change from #661.
Since your change in #661 might wait a bit too long, I tested a few alternatives. Instead of waiting for a static timeout, I first tried just returning self.process.poll(), since that updates the process returncode as well. Running pyinvoke 1.3.0 + 'return self.process.poll()':
Lastly, I tested setting the wait timeout to the currently configured self.input_sleep, which defaults to 0.01s. Running pyinvoke 1.3.0 + 'self.process.wait(timeout=self.input_sleep)':
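For reference, here is roughly what those two experiments look like expressed as a Local subclass; the subclass framing is mine (in the thread these were edits made directly to runners.py), and it assumes pty=False so that self.process is a plain subprocess.Popen:

```python
import subprocess
from invoke.runners import Local

class PatchedLocal(Local):
    def returncode(self):
        # Variant 1: poll() reaps the child if it has already exited and
        # records the real exit code on the Popen object as a side effect.
        # return self.process.poll()

        # Variant 2: a tiny bounded wait; input_sleep defaults to 0.01s.
        try:
            self.process.wait(timeout=self.input_sleep)
        except subprocess.TimeoutExpired:
            pass  # still running; fall through to whatever is recorded
        return self.process.returncode
```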
As you can see, even a tiny wait is enough to fix this issue. Hopefully this change makes it in soon, or we'll have to find an alternative to pyinvoke, since this sort of instability is causing all kinds of random errors.
I'd like to report that I'm also running into this issue. A quick fix would be very much appreciated.
I think this is the same issue. My analysis and (hack-ish) workaround might also be helpful to the discussion.
I think I might be seeing the same bug. I explained what I'm seeing here: #683 (comment)
I'm seeing this myself now, mostly via Travis-CI on other projects using Invoke to run e.g. pytest. Thanks to @flazzarini for the post & investigation; thread race conditions are the obvious culprit, so that's unsurprising (and wearying - the whole thread shutdown thing has already gone through a few mutations). I'll look at the PR during my next block of OSS time.
Have to wait for this to be merged, or work around the issue. pyinvoke/invoke#660 (comment) claims v1.2.0 works. As of right now: https://giant.gfycat.com/BabyishAggravatingGoldeneye.webm
Attempting to reliably repro using a variant of @beornling's snippet, on a few platforms.
When the snippet runs inside a pytest test, I can't get it to reproduce. When I tease it out as a regular standalone task, I can get 1/1000 errors on macOS, maybe one run in every 5, which isn't surprising given I've never experienced this myself locally despite heavy dogfooding. On Linux it seems to pretty reliably (edit: I can still occasionally get 0 errors, though) spit out a low single-digit number of errors (2, 3, 5) per run. This isn't great (it's no 83/1000 per @beornling), but it at least means I may have a way forward re: testing fixes. I may add that task as-is to the CI command set as well - I'd rather it be a real test, but since something about running it inside pytest seems to buffer against the problem, pragmatism may have to win out.
Also, it sounds like this started happening in Invoke 1.3.0 - on the off chance someone sees this while I'm still working and can verify whether they saw it earlier, please chime in. I'll try to verify myself, as it may provide a clue. Invoke 1.4 changed up a bunch of how Runner is organized, but I don't believe it actually modified the guts of what's relevant here. I've been testing on master so far, which is ~= 1.4.0.
On macOS, I upped the test length to 10,000 cycles and am still not really getting much, ever. On Linux, I checked out v1.2.0 and did 3 test runs: no errors. v1.3.0... unfortunately the same. On master it's slightly more reliable. Back on 1.3.0, I am getting some repros there now. Got smarter and started scripting repeated runs, and bumped the cycle count up further.
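For anyone wanting to run the same kind of sweep, a trivial harness along these lines works; the task name repro is my placeholder from the earlier sketch, not something from this thread:

```python
# Hypothetical harness: run the repro task many times and count failures.
import subprocess

errors = 0
runs = 100
for _ in range(runs):
    proc = subprocess.run(["invoke", "repro"], capture_output=True)
    if proc.returncode != 0:
        errors += 1
print("{} failing runs out of {}".format(errors, runs))
```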
I can confirm it only started happening after I upgraded my invoke to 1.3. It's a cron script running once every 10 minutes, and while I did not spot it right away, I do know I had done the update at roughly that time (after it had run for 2 years without any errors).
Thanks @SwampFalc! Exactly what I was looking for. I'm definitely seeing the same here.
Edit: I've added nothing new that @beornling didn't already, just confirmation that it happens more frequently on Windows. For me, this happens regardless of how I run the script.
Sadly I don't dev on Windows, so I've no offhand idea why it's so much worse there, but I'm relatively confident any fix that kills the issue on Unixes will work there as well. Will definitely be looking for confirmation of any final fix from y'all Windows users 👍
Got the "regression test" on Travis anyhow. No likey pytest so it's just its own line - in the middle of the big 'build' step that Travis...doesn't short-circuit? which always confuses the hell out of me. Been meaning to switch to CircleCI anyways, which looks like it does the intuitive thing there; maybe I'll fuck with that soon. |
The latter is the usual culprit for this class of issue, so again, not surprising. I independently tried tweaking just the subprocess stdin closing part, and that doesn't seem to make a difference, beyond the issue getting worse when that code is commented out. I tried reverting just the is_dead change, and that seems to do the trick. This is why I really hate it when that code changes... I don't see any material in/related to #552 that clearly explains why this change was directly relevant to the stdin blocking problem. My suspicion is that the OP in that ticket got turned around while debugging and thought the dead-thread checker was part and parcel of the problem (and I clearly didn't catch this, or it wasn't fully relevant at the time). (It occurs to me today, rereading the code, that this may be partly due to the use of "dead" here to mean "a thread that died unexpectedly"... it's a bit confusing. The fact that the "fixed" tests in that PR had to be "fixed" to continue passing should have been a sign to both of us that they were failing correctly - that's on me.) With a sample of one of the triggers for the issue that #552 fixes (Invoke never telling its subprocess about mirrored stdin closing), the revert still behaves correctly.
So I think for now, this is probably the culprit, and reverting it seems unlikely to completely undo the good works of #552. I'm willing to revert it and then wait to see whether there's a true corner case I'm missing here that the is_dead tweak actually fixes.
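For context on that "mirrored stdin closing" trigger, the failure mode #552 addressed is easy to demonstrate with plain subprocess (this is a standalone illustration, not invoke's code): a child that reads stdin to EOF never exits if the parent forgets to close the pipe it mirrors input into.

```python
import subprocess
import sys

# Child that reads its stdin until EOF before exiting.
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdin.read()"],
    stdin=subprocess.PIPE,
)

# Without this close, proc.wait() below blocks forever: the child never
# sees EOF on stdin. This is the "never telling the subprocess about
# stdin closing" trigger described above.
proc.stdin.close()

print(proc.wait())  # 0, once stdin has been closed
```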
Travis seems to confirm: the regular tests all still pass, and the new regression checker does too. Just merged up through master - I'd love it if someone else (especially a Windows user) could grab one of the branches (1.3, 1.4, or master) and confirm it's good for them as well. I'll release after that happens.
@bitprophet I'm on Windows. I'll have a look right now.
@bitprophet master (8b4206e) works on Windows. I built a regression test that fails 100% of the time for me. I'm working on cleaning it up and will post it in a new repo for you to look at. Basically, it's a task file with 26 tasks (a through z) and a single invoke call that runs all of them (roughly like the sketch below). I've found that iterating on that test 5 times will consistently produce the failure. I also built a search (bisector-esque) tool, and it looks like 75c4d5e (8-Jul-2019) introduced the problem. But the latest on master (8b4206e) is working great on my Windows box.
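A guess at the shape of that tasks file, generating the 26 tasks programmatically; the echo body and the exact invocation are placeholders, not the actual test:

```python
# tasks.py - hypothetical reconstruction of the "a through z" repro file,
# invoked as: invoke a b c d e f g h i j k l m n o p q r s t u v w x y z
import string
from invoke import task

def _make_task(letter):
    def body(c):
        c.run("echo {}".format(letter), hide=True)
    body.__name__ = letter  # each Task takes its name from the function
    return task(body)

# Create module-level tasks a, b, ..., z for invoke's collection loader.
for _letter in string.ascii_lowercase:
    globals()[_letter] = _make_task(_letter)
```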
Cool, that sounds like I identified the culprit accurately then - thanks so much @cod3monk3y! I'll release the fix as 1.3.x and 1.4.x next chance I get; should be this week sometime.
@bitprophet Here's a hacked up regression search system with the test I mentioned for this issue: https://github.com/cod3monk3y/regress Feel free to take what you need, if you find anything useful!
I can confirm the fix works. Using a similar 'a to z' tasks file, I tested on Windows 10 with various releases of invoke and multiple Python versions.
Thanks so much for invoke, and for looking into this!
1.3.1 and 1.4.1 are out!
Scenario
I stumbled upon this issue when using fabric2, where I had a task that would execute alembic to upgrade/downgrade multiple local databases. When executing the task, it would sometimes (completely randomly) fail with an exception complaining about the command's exit code.
Please note the following is only happening when pty is set to False. Version of invoke used: 1.3.0.
Reproduction of the bug
I replicated the fabric task as a pure invoke task, shown below. When executing it, I randomly get the error mentioned above. I'll execute the task multiple times to show that the error occurs on different runs each time.
Code
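The original snippet didn't survive extraction; here is a plausible minimal stand-in, assuming (per the thread) that any quick command run repeatedly with pty=False can trigger it:

```python
# tasks.py - stand-in for the original repro; the command and loop count
# are guesses, not the original code.
from invoke import task

@task
def migrate(c):
    for i in range(20):
        # On invoke 1.3.0 this intermittently raises UnexpectedExit even
        # though the command itself succeeds, per the behavior described.
        c.run("echo step {}".format(i), hide=True, pty=False)
```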
First run
Second run
Investigation
After investigating the issue with one of my colleagues, we may have identified the cause - though this is merely an assumption rather than a certainty. It seems that the threads spawned for the Runners try to access the Local.returncode function before the process has had a chance to finish, without waiting on the process. I've tried changing the code of returncode by adding a Popen wait call, which seems to fix the issue I had. Please consider the following pull request: #661