New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xonsh appears to be closing file descriptors in subprocesses #2984

Closed
nkabir opened this Issue Jan 15, 2019 · 28 comments

Comments

Projects
None yet
5 participants
@nkabir
Copy link

nkabir commented Jan 15, 2019

xonfig

+------------------+---------------------+
| xonsh            | 0.8.8               |
| Git SHA          | b64a2126            |
| Commit Date      | Jan 2 19:02:13 2019 |
| Python           | 3.6.7               |
| PLY              | 3.11                |
| have readline    | True                |
| prompt toolkit   | 2.0.7               |
| shell type       | prompt_toolkit2     |
| pygments         | 2.3.1               |
| on posix         | True                |
| on linux         | True                |
| distro           | ubuntu              |
| on darwin        | False               |
| on windows       | False               |
| on cygwin        | False               |
| on msys2         | False               |
| is superuser     | False               |
| default encoding | utf-8               |
| xonsh encoding   | utf-8               |
| encoding errors  | surrogateescape     |
+------------------+---------------------+

Expected Behavior

$ redo hello should produce hello executable

Current Behavior

$ redo hello
redo  hello
Traceback (most recent call last):
  File "/home/nkabir/.local/bin/redo-ifchange", line 11, in <module>
    sys.exit(main())
  File "/home/nkabir/.local/lib/python2.7/site-packages/redo/cmd_ifchange.py", line 41, in main
    jobserver.setup(0)
  File "/home/nkabir/.local/lib/python2.7/site-packages/redo/jobserver.py", line 238, in setup
    'prefix your Makefile rule with a "+"')
ValueError: broken --jobserver-auth from make; prefix your Makefile rule with a "+"
hello

Steps to Reproduce

$ pip install --user redo-tools
$ pip3 install --user xonsh

Place hello.do and hello.c into the same folder:

hello.do

#!/usr/bin/env xonsh

#del $MAKEFLAGS
#del $REDO_CHEATFDS

gcc -o $3 hello.c

redo-ifchange hello.c

hello.c

#include <stdio.h>
int main()
{
   printf("Hello, World!");
   return 0;
}

Finally, execute

$ redo hello

Note that explicitly deleting MAKEFLAGS and REDO_CHEATFDS is a work-around that forces the process to be serial.

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Jan 31, 2019

Thanks for reporting @nkabir! I am not familar with what redo is or how it works. Is there a more minimal example that you can show? I use make and compilers in xonsh all the time....

@nkabir

This comment has been minimized.

Copy link
Author

nkabir commented Jan 31, 2019

Hi @scopatz

Thanks for getting back! I created a minimal demonstration repository here

https://github.com/nkabir/fdcheck

Clone the repository to an Ubuntu 18.04 host and run make all and make flagged to reproduce the error. Please let me know if you encounter any issues. Xonsh is wonderful.-)

Also, redo is very versatile build tool that fits well with xonsh (aside from the file handle issue)

https://redo.readthedocs.io/en/latest/

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 1, 2019

Hmmm redo-tools seems to require Python 2, which I don't have anymore. Do you have an example that doesn't rely on Python 2 or is there a docker container that you can point me to? Sorry for all the trouble just getting started with this one.

@nkabir

This comment has been minimized.

Copy link
Author

nkabir commented Feb 3, 2019

Hi @scopatz. Which operating system do you primarily use? redo currently uses Python2. Python2 is still available in all of the major Linux repositories. If you're running Ubuntu, you can launch a new, clean Ubuntu 18.04 container named demo with

$ lxc launch ubuntu:18.04 demo

Or you can log into a clean Debian instance here:
https://taasproject.com

Then clone https://github.com/nkabir/fdcheck and run make all

These may be easier than debugging through a Docker instance.

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 4, 2019

What happens if you do:

__xonsh__.commands_cache.threadable_predictors['redo'] = lambda *a, **k: False

in your current session or xonshrc? Does this fix the issue?

@nkabir

This comment has been minimized.

Copy link
Author

nkabir commented Feb 4, 2019

Hi Anthony.

I placed the line at the top of my .xonshrc and then executed redo from within xonsh and there was no change. Could this be an issue with running Python2.7 processes inside xonsh hosted on Python3.6?

Traceback (most recent call last):
  File "/home/nkabir/.local/bin/redo-ifchange", line 11, in <module>
    sys.exit(main())
  File "/home/nkabir/.local/lib/python2.7/site-packages/redo/cmd_ifchange.py", line 41, in main
    jobserver.setup(0)
  File "/home/nkabir/.local/lib/python2.7/site-packages/redo/jobserver.py", line 238, in setup
    'prefix your Makefile rule with a "+"')
ValueError: broken --jobserver-auth from make; prefix your Makefile rule with a "+"

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 4, 2019

No, it shouldn't be a problem to run Python 2 inside of xonsh. I am curious as to why you think this is a xonsh filehandle error and not a redo error. The traceback seems to indicate an incorrect redo configuration and doesn't mention anything about xonsh. Does this work in bash or another shell?

@nkabir

This comment has been minimized.

Copy link
Author

nkabir commented Feb 4, 2019

It works in bash. The error appeared when I ported redo scripts to xonsh.

I've brought this to the attention of Avery (redo author) and he suspected it was a xonsh issue. It's briefly covered here:
https://redo.readthedocs.io/en/latest/FAQParallel/

It's not a blocker for me at the moment--I'm disabling parallel builds with

del $MAKEFLAGS
del $REDO_CHEATFDS

at the top of my scripts. I can report back once I need parallel builds and have time to dig deeper. Another thanks for xonsh. My build scripts are much easier to maintain with xonsh's seamless integration with Python. xonsh has been a pleasure to use!

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 4, 2019

Oh interesting, so redo is actually launching xonsh, potentially via make. That is very interesting. I am still not sure why you / Avery think that this is a file descriptor issue. Can you rope Avery into this conversation to explain? Or is there a minimal Makefile that I can use to test this out?

I am not sure how many folks have tried to use xonsh inside of make or gdb or other tools with assume that $SHELL is a posix-compatible shell. We, of course, want xonsh to work with these tools.

xonsh has been a pleasure to use!

Thanks again!

@nkabir

This comment has been minimized.

Copy link
Author

nkabir commented Feb 5, 2019

I've posted to the redo mailing list here:

https://groups.google.com/d/msg/redo-list/eG63pNTm9KA/ya8DxlX-FgAJ

And I created this project that has a minimal Makefile to reproduce the issue:

https://github.com/nkabir/fdcheck

It runs as-is on a stock Ubuntu 18.04 host.

@rdpate

This comment has been minimized.

Copy link

rdpate commented Feb 5, 2019

I have followed this issue from the redo mailing list. I was unable to make xonsh after cloning because I do not have xonsh installed as required by Makefile. I do not see bootstrap instructions -- though I did figure out scripts/xon.sh seems to work (but scripts/xonsh does not because "#!/usr/bin/env python3 -u" looks for an executable named "python3 -u" instead of python3 with the option -u).

A minimal example is:

xonsh -c 'echo x | dd of=/dev/fd/3' 3>&1

This fails (I used scripts/xon.sh instead of xonsh), but should instead do the same as: xonsh -c 'echo x | dd'. I believe the reason is xonsh reportedly keeps the subprocess.Popen(close_fds=True) default (in Python3, the same default was False in Python2). The minimal example works in dash 0.5.8-2.10 as installed on Ubuntu 18.04.

My understanding of the problem is that:

  1. xonsh closes file descriptors when running commands (hence using /dev/fd/3 in the example)
  2. redo expects the POSIX behavior of inherited descriptors (fd 3 in my example) not being closed

Here's a more complex example that doesn't require /dev/fd. I am not sure how to adapt it to xonsh's tests. Note in particular for the line marked A, the heredoc is a placeholder for "a program which uses a file descriptor read from an environment variable". Two examples of such programs are redo and GNUMake.

#!/bin/sh -Cue
export FD=3
exec 3>&1  # actually a pipe(2) in the situation the OP encountered
for sh in scripts/xon.sh sh; do
    printf \\n%s\\n "testing $sh"
    rc=0
    "$sh" <<'END' || rc=$?
echo TEST | sh -c 'cat >&"$FD"' # mark A
END
    printf %s\\n "exit: $rc"
done
@apenwarr

This comment has been minimized.

Copy link

apenwarr commented Feb 5, 2019

Below is a test program that doesn't use redo that demonstrates the problem with xonsh. It uses os.pipe2() to create two file descriptors connected to each other by a pipe. It then launched a subprocess, closes its own copy of the 'write' side of the pipe, and waits for the 'read' side to become ready. Since nobody ever writes to the write side, this happens only when the no more processes exist that inherited the write fd. Our parent process closed its copy, so the only copy is inherited by our subprocess and its children.

We run this test twice, once with each of two different subprocesses:

  1. sh -c 'sleep 2 & exit'
  2. xonsh -c 'sleep 2 &; exit'

In case (1), we end up waiting a full two seconds, even though 'sh' has exited right away. (In sh, the 'exit' is not necessary, but I include it for symmetry with xonsh.) In case (2), our test program ends up finishing before the 'sleep' command has returned. This proves that the sleep command did not inherit the write fd as it should have. On the other hand, if I remove the '; exit', then it waits 2 seconds, but that's because xonsh itself refuses to exit while it has a child process still running, even though that child was backgrounded.

This test reveals several weirdnesses in xonsh:

i) It shouldn't touch file descriptors that it didn't personally open. I opened the write fd, I didn't refer to it from my xonsh command, and so the correct behaviour is to leave it alone. Doing this wrong will break parallel builds in make and redo, but also break any programs that rely on fd passing to children, and others that rely on the above trick to determine if an entire process pipeline has ended before continuing.

ii) xonsh should exit right away if a background process exists and there are no remaining commands. It isn't helping anyone by converting "xonsh -c 'sleep 2 &'" from an explicit background into a foreground process.

iii) The "don't exit until everything is done" behaviour might perhaps be nice in interactive mode, but in interactive mode, the "; exit" also doesn't exit, which is at least consistent. In batch mode, it behaves differently depending whether or not you have "; exit" at the end, which is pretty strange.

#!/usr/bin/python3
import os, subprocess, sys, time

def test(argv):
    t0 = time.time()
    print('\ntesting %r' % argv)
    r, w = os.pipe2(0)
    p = subprocess.Popen(argv, close_fds=False)
    os.close(w)
    b = os.read(r, 4096)
    assert not b
    t1 = time.time()
    t = t1 - t0
    print('pipe closed; p exit status = %r; t = %r' % (p.poll(), t))
    assert t >= 2.0, 'sleep did not inherit write fd'
    print('ok.')

test(['sh', '-c', 'sleep 2 & exit'])
test(['xonsh', '-c', 'sleep 2 &; exit'])
@apenwarr

This comment has been minimized.

Copy link

apenwarr commented Feb 5, 2019

Here's another example that uses just GNU make.

$ time make -sj2 SHELL=sh
real	0m2.010s
user	0m0.000s
sys	0m0.000s

$ time make -sj2 SHELL=xonsh
make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
real	0m5.001s
user	0m0.764s
sys	0m0.164s

Makefile:

default: all
all:
	$(MAKE) s
s:
	$(MAKE) a b
a:
	sleep 2
b:
	sleep 2
@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 5, 2019

Thanks @rdpate @apenwarr for the detailed descriptions here! I am still trying to piece together what the underlying issue is here. Part of the problem, I suspect, is that the two example cases @apenwarr listed are not semantically equivalent.

  1. sh -c 'sleep 2 & exit'

This means: run the sleep command for 2 sec AND if the exit code is zero (ie successful completion), run the exit command

  1. xonsh -c 'sleep 2 &; exit'

This means: run the sleep command for 2 sec in the background. Then run the exit command.

A proper translation of (1) would be either:

  1. xonsh -c "sleep 2 && exit", or the more Pythonic xonsh -c "sleep 2 and exit"

(3) behaves as expected in that it runs for two seconds and then exits. A proper translation of (2) into bash would be:

  1. sh -c "bg sleep 2 && exit"

This fails because there is no job control in Bash when not running as a REPL. Maybe it is a little strange that xonsh has job control even when not in interactive mode, but it shouldn't get in anyone's way usually.

Can you please clarify which semantics, (1) or (2), are actually important?

@apenwarr

This comment has been minimized.

Copy link

apenwarr commented Feb 5, 2019

Well... no. All of that is wrong.

& and && are completely different operators in sh. The only thing they have in common is that they are statement separators, like ";" is. '&' runs the preceding statement in the background, while '&&' works as you described in both sh and xonsh.

It is obviously not true that
sh -c "sleep 2 & exit"
is the same as
xonsh -c "sleep 2 && exit"
because you can try running them and immediately discover that one returns right away, and the other returns after 2 seconds.

And this command you suggested,
sh -c "bg sleep 2 && exit"
is just completely wrong. The 'bg' command in sh does not take a command line under any circumstances, it takes a jobspec, like %1 or %2. It also happens to print an error "bg: no job control" in this case, because apparently the bg command doesn't bother to parse its arguments when there is no job control, but that's irrelevant to the situation at hand. When run interactively you get the "right" error:

bash$ bg sleep 2
-bash: bg: sleep: no such job
-bash: bg: 2: no such job

"job control" in a shell is a very specific thing that is different from the ability to run subprocesses. The latter is available both interactively and non-interactively. For example, this works fine:

bash$ bash -c 'date; sleep 2 & echo hello world; date; wait; date; bg'
Tue Feb  5 13:17:18 EST 2019
hello world
Tue Feb  5 13:17:18 EST 2019
Tue Feb  5 13:17:20 EST 2019
bash: line 0: bg: no job control

Notice how 'hello world' prints before 2 seconds have elapsed, and yet the 'wait' command works (and waits for 'sleep 2' to finish), even though job control is disabled.

In any case, this syntax stuff is a tangent from the original bug, which is about file descriptors not working as expected. rdpate's one-liner example from above is nice and simple and avoids the need for background processes:

bash$ sh -c 'echo x | dd of=/dev/fd/3' 3>&1
x
0+1 records in
0+1 records out
2 bytes copied, 3.0603e-05 s, 65.4 kB/s

bash$ xonsh -c 'echo x | dd of=/dev/fd/3' 3>&1
/bin/dd: failed to open '/dev/fd/3': No such file or directory

Of course, you need to run those two commands from sh, because xonsh also doesn't understand 3>&1. It does understand 2>&1, apparently as some kind of special case, leading me to suspect that xonsh has a really big blind spot around file descriptors in general.

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 5, 2019

Hmm I didn't think xonsh supported >& at all. It must be there for posix compatibility, but clearly isn't implemented generally enough.

@apenwarr

This comment has been minimized.

Copy link

apenwarr commented Feb 5, 2019

That doesn't matter. 3>&1 is interpreted by bash in both versions of that one-liner. xonsh still manages to then eat fd #3.

@rdpate

This comment has been minimized.

Copy link

rdpate commented Feb 5, 2019

Hmm I didn't think xonsh supported >& at all. It must be there for posix compatibility, but clearly isn't implemented generally enough.

My minimal test case is intended to be run from a shell such as zsh or bash, which then runs xonsh and shows the failure. (As @apenwarr mentioned.) I got a little creative (using dd and /dev/fd) in this test case to show that the descriptor manipulation is NOT done in xonsh (the dup(2) is done in a parent and it is a child that uses the descriptor).

You might see the other test case I presented, in which I was explicit in /bin/sh calling xonsh.

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 6, 2019

Also, sh -c "sleep 2 & exit" does not return right away for me, but rather waits 2 seconds.

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 6, 2019

Oddly, it only returns right away in bash, but waits in xonsh

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 6, 2019

Setting close_fds=False seems to fix this.

@apenwarr

This comment has been minimized.

Copy link

apenwarr commented Feb 6, 2019

Oddly, it only returns right away in bash, but waits in xonsh

Wow, you're right. I just tried to strace what xonsh does when it launches a subprocess and... it's pretty complicated, involves starting several subthreads, and will surely lead to weird bugs. But that's for another day!

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 6, 2019

A potential fix is in #3015, please test it out with redo

@nkabir

This comment has been minimized.

Copy link
Author

nkabir commented Feb 6, 2019

Thanks, Anthony! I do not have enough knowledge about the internals of either tool to make an authoritative assessment but wanted to provide smoke test results as soon as possible. The patch appears to work based on my build scripts and an example from above.

This is on a clean Ubuntu 18.04:

xonsh -c 'echo x | dd of=/dev/fd/3' 3>&1
x
0+1 records in
0+1 records out
2 bytes copied, 0.000256735 s, 7.8 kB/s

I've also removed the xonsh workarounds from my redo scripts

del $MAKEFLAGS
del $REDO_CHEATFDS

and redo is no longer complaining.

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 6, 2019

excellent!

@melund melund closed this in #3015 Feb 6, 2019

@melund

This comment has been minimized.

Copy link
Member

melund commented Feb 6, 2019

@nkabir #3015 is merged now. Thanks a lot for all the work hunting down this bug. It is really appreciated .

@scopatz

This comment has been minimized.

Copy link
Member

scopatz commented Feb 6, 2019

Much agreed with @melund! If there are lingering problems, please open new issues.

@nkabir

This comment has been minimized.

Copy link
Author

nkabir commented Feb 6, 2019

Much gratitude to @apenwarr and @rdpate for their deep and insightful low-level analysis...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment