Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sage-cleaner #14055

Closed
jdemeyer opened this issue Feb 4, 2013 · 123 comments
Closed

Fix sage-cleaner #14055

jdemeyer opened this issue Feb 4, 2013 · 123 comments

Comments

@jdemeyer
Copy link

jdemeyer commented Feb 4, 2013

sage-cleaner got completely broken due to incompatible filenames between sage-cleaner and the Sage library.

Apply:

  1. attachment: 14055_cleaner_sagelib.patch to the Sage library.
  2. attachment: 14055_simplify_env.patch to the Sage library.
  3. attachment: trac_14055_maxima_debug.patch to the Sage library.
  4. attachment: 14055_cleaner.patch to local/bin.
  5. attachment: 14055_cleaner_hostname.patch to local/bin.

Update ECL spkg with http://boxen.math.washington.edu/home/vbraun/spkg/ecl-12.12.1.p3.spkg

CC: @ppurka @nbruin

Component: scripts

Author: Jeroen Demeyer, Volker Braun

Reviewer: John Palmieri, Leif Leonhardy, Volker Braun

Merged: sage-5.10.beta2

Issue created by migration from https://trac.sagemath.org/ticket/14055

@jdemeyer jdemeyer added this to the sage-5.10 milestone Feb 4, 2013
@jdemeyer
Copy link
Author

jdemeyer commented Feb 4, 2013

comment:1

I am working on a patch.

@jdemeyer
Copy link
Author

jdemeyer commented Feb 4, 2013

Author: Jeroen Demeyer

@jhpalmieri
Copy link
Member

comment:4

In the line

pidfile = os.path.join(DOT_SAGE, 'tmp', 'cleaner-%s.pid'%HOSTNAME)

would it make sense to use 'temp' instead of 'tmp', so all of the sage-cleaner information is in the same directory?

(Is there any good reason for having both directories tmp and temp? At some point, we should clean this up.)

@nexttime
Copy link
Mannequin

nexttime mannequin commented Mar 20, 2013

comment:5

THERE ARE STILL SOME INSTANCES OF UPPERCASE SAGE.

I don't like the mixture of tmp, temp, TMP and TEMP or whatever.

temp and tmp shouldn't be hardcoded.

"Deleting .../spawned_processes" gets printed even if .../spawned_processes is not a directory (not e), and also rmtree() is attempted, ignoring any OSError.

I don't like "%s" % i where i is (or should be) an int.

Command line option handling is certainly suboptimal. (argv[1] could be a float as well, there's no usage, and additional parameters simply get ignored.)

Likewise, if DOT_SAGE isn't set, an ugly KeyError is raised, instead of printing a meaningful message.

@nexttime
Copy link
Mannequin

nexttime mannequin commented Mar 20, 2013

Changed keywords from none to orphans

@nexttime
Copy link
Mannequin

nexttime mannequin commented Mar 20, 2013

Reviewer: Punarbasu Purkayastha, Leif Leonhardy

@nexttime
Copy link
Mannequin

nexttime mannequin commented Mar 20, 2013

comment:6

Ooops.

@nexttime
Copy link
Mannequin

nexttime mannequin commented Mar 20, 2013

Changed reviewer from Punarbasu Purkayastha, Leif Leonhardy to John Palmieri, Leif Leonhardy

@jdemeyer
Copy link
Author

comment:7

Leif: the subject of this ticket is not "Fix all possible issues in sage-cleaner", the goal is to just make it work.

@jdemeyer
Copy link
Author

comment:8

Replying to @jhpalmieri:

Is there any good reason for having both directories tmp and temp?

Absolutely not, but at least the directory used in sage-cleaner should match the directory used in the Sage library.

@jhpalmieri
Copy link
Member

comment:9

I am having a problem with this patch, but I don't know why. I am repeatedly running sage -tp 2 devel/sage/sage/homology/s*. Without the patch applied, everything is fine. With the patch, after the first successful run, I get this (I also removed the &>/dev/null after the call to sage-cleaner in spkg/bin/sage):

$ sage -tp 2 devel/sage/sage/homology/s*
SAGE_TMP_ROOT = /Users/palmieri/.sage/temp/jpalmieri538
Starting sage-cleaner
Checking PIDs []
sage-cleaner is finished
Running doctests with ID 2013-03-20-11-16-30-613ad8e5.
Sorting sources by runtime so that slower doctests are run first....
Doctesting 3 files using 2 threads.
sage -t devel/sage/sage/homology/simplicial_complex_morphism.py
    [187 tests, 0.3 s]
sage -t devel/sage/sage/homology/simplicial_complex_homset.py
    [49 tests, 0.1 s]
sage -t devel/sage/sage/homology/simplicial_complex.py
    Time out after testing finished
**********************************************************************
Tests run before process timed out:

...

**********************************************************************
----------------------------------------------------------------------
sage -t devel/sage/sage/homology/simplicial_complex.py  # Time out after testing finished
----------------------------------------------------------------------
Total time for all tests: 26.9 seconds
    cpu time: 0.4 seconds
    cumulative wall time: 0.5 seconds

Without the patch, the file simplicial_complex.py passes tests in under 7 seconds; cpu time and cumulative wall time are each around 6 or 7 seconds. With the patch, the time out is pretty consistent for each run after the first. I'm seeing this on two different OS X 10.8 machines.

Also, if I delete the directory DOT_SAGE/tmp, then the pidfile won't get written at all during doctesting or when running sage. What script is responsible for creating that directory? If this directory is missing, I don't get the time-out any more, so the problem is really with sage-cleaner.

On the bright side, it seems to be cleaning out the appropriate files, directories, and processes.

@jhpalmieri
Copy link
Member

comment:10

Maybe the problem is that time.sleep(wait + 3) used to be called before the first call to cleanup(), and now it's not. So now cleanup() can be called before any doctesting processes have started, so it returns zero, so sage-cleaner thinks there is nothing to be cleaned up and exits right away. I'm not sure why that results in a timeout in the doctesting, but anyway, this change fixes the problem for me:

diff --git a/sage-cleaner b/sage-cleaner
--- a/sage-cleaner
+++ b/sage-cleaner
@@ -115,6 +115,7 @@
         else:
             wait = 10
     
+        time.sleep(wait)
         # Initial cleanup, ignore time
         running_sages = cleanup()
         cleanup_time = 0.0

I don't know if the initial sleep should be for wait or a smaller number. Does it matter?

@jhpalmieri
Copy link
Member

comment:11

Replying to @jdemeyer:

Replying to @jhpalmieri:

Is there any good reason for having both directories tmp and temp?

Absolutely not, but at least the directory used in sage-cleaner should match the directory used in the Sage library.

Where is DOTSAGE/tmp used? Certainly the temp version is used in the definition of SAGE_TMP, so that is ubiquitous. For the file cleaner-HOSTNAME.pid, is that referenced anywhere in the Sage library? It looks like we could put that file in temp instead of tmp safely. Indeed, I can't find any other references to DOTSAGE/tmp...

@novoselt
Copy link
Member

comment:12

I've applied this patch to SAGE_ROOT/local/bin, ran "make ptestlong" and still got two ecl processes eating RAM when tests were finished without errors.

@jdemeyer
Copy link
Author

comment:13

jhpalmieri: I can totally reproduce problems with devel/sage/sage/homology/simplicial_complex.py, but they are different from yours and unrelated to sage-cleaner: #14323.

@jhpalmieri
Copy link
Member

comment:14

If you look at the output from the doctest command, I find it suspicious that the message "sage-cleaner is finished" appears before any doctests are run.

@jdemeyer
Copy link
Author

comment:15

Replying to @jhpalmieri:

I find it suspicious that the message "sage-cleaner is finished" appears before any doctests are run.

Sure, but I don't see how that could lead to doctest failures. It could be that your problem is also an instance of #14323, let's first fix that.

@nexttime
Copy link
Mannequin

nexttime mannequin commented Mar 20, 2013

comment:16

Replying to @jhpalmieri:

Maybe the problem is that time.sleep(wait + 3) used to be called before the first call to cleanup(), and now it's not. So now cleanup() can be called before any doctesting processes have started, so it returns zero, so sage-cleaner thinks there is nothing to be cleaned up and exits right away.
[...]
I don't know if the initial sleep should be for wait or a smaller number. Does it matter?

I wouldn't rely on some wall time.

Unfortunately the cleaner cannot wait() for its parent, but it could poll whether it is still running (or -- much better I think -- wait for another signal), and do nothing until its parent has "vanished" (terminated, and no other process with the same PID is meanwhile running -- that's also a potential, although probably unlikely, problem with left-over Sage cleaner and spawned_processes PID files; in the latter case, even unrelated processes of the same user could get killed.)

If there's already another Sage cleaner instance running, it could of course exit immediately (modulo the problem mentioned before).

@jhpalmieri
Copy link
Member

comment:17

I also get the same problem with devel/sage/sage/homology/examples.py, but that file uses libGAP, too. If I doctest both of these, I only get a failure in examples.py, while the other passes doctests. Adding my patch from my comment above makes the failures go away. I'm still not sure if these problems are related to #14323 or not. If they are, then at least I can reliably produce the failures, and maybe that will help to track them down.

@jhpalmieri
Copy link
Member

comment:18

In case it helps, I can duplicate this problem on bsd.math.washington.edu.

@jdemeyer
Copy link
Author

comment:19

John: please check whether the new spkg at #14323 fixes your problem.

@jdemeyer
Copy link
Author

comment:20

Replying to @jhpalmieri:

In case it helps, I can duplicate this problem on bsd.math.washington.edu.

I can indeed reproduce the problem and #14323 doesn't help.

@jhpalmieri
Copy link
Member

comment:21

On boxen, with the spkg from #14323, I see that the issue from that ticket seems to be fixed, but then I can reproduce the issue here.

@jdemeyer
Copy link
Author

comment:22

I found the problem with expect.py: the doctest process starts the sage-cleaner, somehow leading to trouble.

@jdemeyer
Copy link
Author

Changed merged from sage-5.10.beta1 to none

@jdemeyer
Copy link
Author

comment:92

I am still seeing the linear_algebra.rst doctest failures. I don't understand how this ticket could cause it, though.

@jdemeyer jdemeyer reopened this Apr 30, 2013
@vbraun
Copy link
Member

vbraun commented May 2, 2013

comment:93

I've been running repeated doctests and the only thing that I see on taurus is occasional (about 5% of doctests runs) failure with the maxima pty interface. The trouble always starts with the assertion failure where input != echo:

File "devel/sage/doc/en/constructions/plotting.rst", line 52, in doc.en.constructions.plotting
Failed example:
    L = [(i/100.0, maxima.eval('jacobi_sn (%s/100.0,2.0)'%i))for i in range(-300,300)]
Exception raised:
    Traceback (most recent call last):
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 466, in _run
        self.execute(example, compiled, test.globs)
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 825, in execute
        exec compiled in globs
      File "<doctest doc.en.constructions.plotting[0]>", line 1, in <module>
        L = [(i/RealNumber('100.0'), maxima.eval('jacobi_sn (%s/100.0,2.0)'%i))for i in range(-Integer(300),Integer(300))]
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1208, in eval
        for L in code.split('\n') if L != ''])
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/interfaces/maxima.py", line 754, in _eval_line
        assert line_echo.strip() == line.strip()
    AssertionError

And that has nothing to do with this ticket...

@vbraun
Copy link
Member

vbraun commented May 2, 2013

comment:94

For the record, this is the comparison that fails:

        assert line_echo.strip() == line.strip(), 'mismatch:\n' + line_echo + line
    AssertionError: mismatch:
    jacobi_sn (-228/100 .0,2.0);
    jacobi_sn (-228/100.0,2.0);

@vbraun
Copy link
Member

vbraun commented May 2, 2013

Attachment: trac_14055_maxima_debug.patch.gz

Initial patch

@vbraun

This comment has been minimized.

@vbraun
Copy link
Member

vbraun commented May 2, 2013

comment:95

I don't know why there are random spaces sprinkled in, but it certainly doesn't have anything to do with this ticket. I haven't found any other failures on taurus.

@jdemeyer
Copy link
Author

jdemeyer commented May 2, 2013

comment:96

Volker, I don't see how your patch would affect the error reported in [comment:82].

@vbraun
Copy link
Member

vbraun commented May 2, 2013

comment:97

What I'm saying is, I could not reproduce (on taurus) the error reported in comment:82. It might be a miscompile or a transient hardware error as far as I can tell.

@vbraun
Copy link
Member

vbraun commented May 2, 2013

@vbraun
Copy link
Member

vbraun commented May 2, 2013

comment:99

Running with nice -19 and background load seems to trigger it, I now get some maxima segfaults somewhat reliably. They look like a race in the multi-threaded ecl to me, for example the following failure:

sage: s = solve(eqn, a,b,c); s ## line 419 ##

;;; Unhandled lisp initialization error
;;; Message:
unbound-variable
;;; Arguments:

Internal or unrecoverable error in:

Lisp initialization error.

  [2: No such file or directory]

;;; ECL C Backtrace
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_dump_c_backtrace+0x28) [0x7f01f330e1d8]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_internal_error+0x3f) [0x7f01f32f91af]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1242f4) [0x7f01f32f92f4]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_funcall+0x70) [0x7f01f32dc3f0]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_error+0xdb) [0x7f01f32fa13b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x125482) [0x7f01f32fa482]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(FEwrong_type_argument+0x1e) [0x7f01f32fa4ae]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(stream_dispatch_table+0x17) [0x7f01f32ece17]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_write_char+0x1b) [0x7f01f32ed6ab]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x13766b) [0x7f01f330c66b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(_ecl_write_symbol+0x156) [0x7f01f330cbc6]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_write_ugly_object+0x26) [0x7f01f330bcc6]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1242db) [0x7f01f32f92db]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_funcall+0x70) [0x7f01f32dc3f0]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_error+0xdb) [0x7f01f32fa13b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1252d8) [0x7f01f32fa2d8]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_interpret+0x19cd) [0x7f01f32de67d]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x10e34f) [0x7f01f32e334f]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_eval_with_env+0x2eb) [0x7f01f32e4f0b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_signal_simple_error+0x26d) [0x7f01f32a9e4d]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(FEwrong_type_nth_arg+0x109) [0x7f01f32f9cf9]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(_ecl_sethash+0) [0x7f01f3326170]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x14df28) [0x7f01f3322f28]
;;; /lib64/libpthread.so.0() [0x324440f500]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1503e0) [0x7f01f33253e0]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x15172c) [0x7f01f332672c]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_extend_hashtable+0x185) [0x7f01f3326335]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x151716) [0x7f01f3326716]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x11c4dc) [0x7f01f32f14dc]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_init_module+0x4e8) [0x7f01f32f7298]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(init_lib_LSP+0x4c9) [0x7f01f3218089]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_init_module+0x399) [0x7f01f32f7149]

@vbraun
Copy link
Member

vbraun commented May 4, 2013

strace of failed ecl startup (from taurus)

@vbraun
Copy link
Member

vbraun commented May 4, 2013

comment:100

Attachment: log.10693.gz

I've attached an strace from a failed ECL startup on taurus. It seems that during dlopen of ECL an unrelated thread (id 16046) happens to finish which causes a SIGCHLD. If the ECL signal handler is not turned off during init then that would explain things going south. Nils, since you wrote it maybe you have an opinion on that?

@nbruin
Copy link
Contributor

nbruin commented May 4, 2013

comment:101

The last time I looked at it, ECL as a library with multithread support is rather fundamentally incompatible with the way we use it. As far as I know, if thread support is enabled then ECL expects one (dedicated?) thread to handle asynchronous signals and I think that thread may even be split off when there's otherwise only one thread running. I don't see how to consolidate that with our own signal management. The solution up to know has been to only use ECL in single threaded mode, which may mean building it without thread support. I know ECL has been moving towards multithreadedness. However, as an "embeddable" lisp there's a good argument they should also keep a strictly single-threaded version as an option and I hope they're continuing with that (perhaps sending a message that we really appreciate single threadedness might help).

Our signal switching code has been a little behind the times for a while already. Some improvements were suggested here:

http://comments.gmane.org/gmane.lisp.ecl.general/8211

That code might be outdated already as well.

What is the hypothesis here, by the way? That an event triggered by sage-cleaner produces a signal meant for our own signal handler that happens to end up in with the ecl signal handler? If that's the problem I see no way around it. Then we just need to write our own signal handler to always be in control, keeping flags on whether some signals might have to be dispatched to ECL and do so for signals we deem appropriate for that. That's major surgery (I've actually been surprised we were able to avoid that for so long and I hope we can continue)

I guess normally, ECL expects SIGCHLD to be ignored. We apparently set a different mask? Changing to ECL's signal mask would include ignoring a SIGCHLD if it happens to be in control when the signal arrives. Can we afford to? Do we already try to do that? In that case is the problem that the signal arrives in a small window where we've switched handlers but not masks? Then we should be a little more careful about the switch, i.e., perhaps disable signals during the switch completely.

@vbraun
Copy link
Member

vbraun commented May 4, 2013

comment:102

For the record, we build ECL with --disable-threads so it shouldn't have any multithread support. It does link with pthreads, though perhaps boehmgc needs that.

I don't think the sage-cleaner has anything to do with the problem. It just slightly changes timings so that we happen to trigger a rather unlikely race on taurus/skynet. It certainly should not send any signals except for term/kill.

The new doctest framework seems to be the only thing that uses SIGCHLD in the Sage library.

@nbruin
Copy link
Contributor

nbruin commented May 4, 2013

comment:103

Replying to @vbraun:

For the record, we build ECL with --disable-threads so it shouldn't have any multithread support. It does link with pthreads, though perhaps boehmgc needs that.

OK, good.

I don't think the sage-cleaner has anything to do with the problem.

So perhaps this should go on a different ticket then?

The new doctest framework seems to be the only thing that uses SIGCHLD in the Sage library.

I took a quick look at unixint.d and it seems ECL does install a handler for SIGCHLD. Perhaps you're just unlucky that the SIGCHLD gets delivered at a very inopportune moment when ECL hasn't quite finished setting up its handlers? Or are we just sloppy in how we change handlers? The handler they use is defined in unixsys.d as si::wait-for-all-processes.

@vbraun
Copy link
Member

vbraun commented May 4, 2013

comment:104

Switching off ECL_OPT_TRAP_SIGCHLD seems to fix it (knock on wood). 50x testing linear_algebra.rst on taurus works with the new spkg (you need to recompile maxima after ecl).

@vbraun

This comment has been minimized.

@vbraun
Copy link
Member

vbraun commented May 4, 2013

diff for review only: ecl-12.12.1.p2 -> ecl-12.12.1.p3

@jdemeyer
Copy link
Author

jdemeyer commented May 5, 2013

comment:105

Attachment: ecl-p3.diff.gz

Volker, if this works, it's a very nice piece of debugging. Testing right away...

@vbraun
Copy link
Member

vbraun commented May 6, 2013

comment:107

I ran 500 iterations for linear_algebra.rst and didn't get any failures. The "random space in output" presumably has also been caused by the SIGCHLD handler.

@jdemeyer
Copy link
Author

jdemeyer commented May 8, 2013

Merged: sage-5.10.beta2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants