Errno 16 / NFS problems with parallel/decorate.py #9616

Closed
dandrake opened this issue Jul 28, 2010 · 21 comments

Comments

@dandrake
Contributor

In 4.5.2.alpha1, many people are seeing this failure:

sage -t -long "devel/sage/sage/parallel/decorate.py"        


------------------------------------------------------------
Unhandled SIGSEGV: A segmentation fault occurred in Sage.
This probably occurred because a *compiled* component
of Sage has a bug in it (typically accessing invalid memory)
or is not properly wrapped with _sig_on, _sig_off.
You might want to run Sage under gdb with 'sage -gdb' to debug this.
Sage will now terminate (sorry).
------------------------------------------------------------

**********************************************************************
File "/mnt/usb1/scratch/drake/release/tmp/sage-4.5.2.alpha1/devel/sage/sage/parallel/decorate.py", line 300:
    sage: g()
Expected:
    '10'
Got:
    [Errno 16] Device or resource busy: '/home/drake/.sage/temp/sage.math.washington.edu/30336/dir_0/.nfs0000000000591f8700069d5c'
    '10'
**********************************************************************
File "/mnt/usb1/scratch/drake/release/tmp/sage-4.5.2.alpha1/devel/sage/sage/parallel/decorate.py", line 311:
    sage: g()
Expected:
    'a'
Got:
    [Errno 16] Device or resource busy: '/home/drake/.sage/temp/sage.math.washington.edu/30336/dir_1/.nfs0000000000591f8d00069d5d'
    'a'
**********************************************************************

and so on. See https://groups.google.com/group/sage-release/msg/88b030aa31926459 and the rest of that thread.

This seems related to #9501.
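For background: on NFS, deleting a file that some process still holds open makes the client rename it to a hidden `.nfsXXXX` placeholder, and unlinking that placeholder fails with `EBUSY` (errno 16) until the last handle is closed. The sketch below shows one way a cleanup routine could tolerate this. `remove_tree_tolerant` is a hypothetical helper written for illustration; it is not code from Sage and not the fix adopted on this ticket.

```python
import errno
import os
import tempfile


def remove_tree_tolerant(root):
    """Delete everything under root, skipping entries that report EBUSY.

    Hypothetical helper (an assumption, not Sage code). Returns the list of
    paths that were left behind because they were busy or still non-empty.
    """
    skipped = []
    tolerated = (errno.EBUSY, errno.ENOTEMPTY)
    # Walk bottom-up so files are removed before their parent directories.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.remove(path)
            except OSError as exc:
                if exc.errno != errno.EBUSY:
                    raise
                skipped.append(path)  # e.g. a .nfsXXXX placeholder
        for name in dirnames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    os.remove(path)  # symlink to a directory
                else:
                    os.rmdir(path)
            except OSError as exc:
                if exc.errno not in tolerated:
                    raise
                skipped.append(path)
    try:
        os.rmdir(root)
    except OSError as exc:
        if exc.errno not in tolerated:
            raise
        skipped.append(root)
    return skipped


if __name__ == "__main__":
    demo = tempfile.mkdtemp()
    os.makedirs(os.path.join(demo, "dir_0"))
    open(os.path.join(demo, "dir_0", "result"), "w").close()
    print(remove_tree_tolerant(demo))
```

On a local filesystem nothing is skipped; on NFS the returned list would name the placeholders that are still held open, so the caller can retry later instead of printing the error into the doctest output.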

CC: @nexttime @jhpalmieri @kcrisman @malb @sagetrac-mvngu @simon-king-jena @williamstein

Component: doctest coverage

Keywords: fork nfs device resource busy

Author: Mitesh Patel

Reviewer: John Palmieri

Merged: sage-4.5.2.rc0

Issue created by migration from https://trac.sagemath.org/ticket/9616

@qed777
Mannequin

qed777 mannequin commented Jul 28, 2010

comment:1

For what they're worth, tests on sage.math with variations on

#!/bin/bash                                                                     

# This does not keep overall statistics:                                        
# env SAGE_TEST_GLOBAL_ITER=100 ./sage -tp 1 -long /path/to/file.py                     

SAGE_TEST="./sage -t -long"
#SAGE_TEST="env DOT_SAGE=/dev/shm/$USER/.sage $SAGE_TEST"                       
#SAGE_TEST="env DOT_SAGE=/scratch/$USER/.sage $SAGE_TEST"                       
RUNS=100
for I in `seq 1 $RUNS`;
do
    $SAGE_TEST devel/sage/sage/parallel/decorate.py
    CODE[$I]=$?

    echo "Results after $I of $RUNS runs:"
    echo "${CODE[*]}" | tr ' ' '\n' | sort -n | uniq -c
done

end with

Results after 100 of 100 runs:                                          
     1 0                                                                      
    99 128                                                                    
Results after 100 of 100 runs:                                       
   100 0                                                                      
Results after 100 of 100 runs:                                       
   100 0                                                                      

for the default DOT_SAGE, /scratch/$USER/.sage, and /dev/shm/$USER/.sage, respectively.
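The same repeat-and-tally loop can be sketched in Python. This is only an illustration of the measurement above: `tally_exit_codes` is a hypothetical helper, and the stand-in command would need to be replaced with the real `./sage -t -long ...` call, with `DOT_SAGE` passed through `extra_env` to compare temp-directory locations.

```python
import collections
import os
import subprocess
import sys


def tally_exit_codes(cmd, runs, extra_env=None):
    """Run cmd `runs` times and count how often each exit code occurs.

    Hypothetical helper (an assumption, not part of Sage). extra_env holds
    environment overrides, e.g. {"DOT_SAGE": "/scratch/<user>/.sage"}.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    counts = collections.Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        counts[result.returncode] += 1
    return counts


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. The real invocation
    # would be something like:
    #   ["./sage", "-t", "-long", "devel/sage/sage/parallel/decorate.py"]
    cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
    for code, count in sorted(tally_exit_codes(cmd, runs=3).items()):
        print(count, code)
```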

@qed777
Mannequin

qed777 mannequin commented Jul 28, 2010

comment:2

For now, should we tag the relevant tests with # not tested or back out the whole patch? Other options?

@qed777
Mannequin

qed777 mannequin commented Jul 28, 2010

comment:3

By the way, here are the latest doctesting exit codes (cf. #9243), from the top of sage-doctest:

# Return value in process exit code:
# 0: all tests passed
# 1: file not found
# 2: KeyboardInterrupt
# 4: doctest process was terminated by a signal
# 8: the doctesting framework raised an exception
# 16: script called with bad options
# 32: (used internally in sage-ptest)
# 64: time out
# 128: failed doctests
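Because the nonzero values are distinct powers of two, an exit status can be read as a set of flags. The sketch below decodes a status that way. Treating combined values (for example 132) as several flags at once is an assumption based only on the table above, and `describe_exit_code` is a hypothetical helper, not part of sage-doctest.

```python
# Flag values copied from the sage-doctest comment quoted above.
EXIT_FLAGS = {
    1: "file not found",
    2: "KeyboardInterrupt",
    4: "doctest process was terminated by a signal",
    8: "the doctesting framework raised an exception",
    16: "script called with bad options",
    32: "(used internally in sage-ptest)",
    64: "time out",
    128: "failed doctests",
}


def describe_exit_code(code):
    """Return the descriptions of all flags set in code (hypothetical helper)."""
    if code == 0:
        return ["all tests passed"]
    # Assumption: several conditions may be OR-ed into one exit status.
    return [text for bit, text in sorted(EXIT_FLAGS.items()) if code & bit]


if __name__ == "__main__":
    for status in (0, 128):
        print(status, describe_exit_code(status))
```

Under this reading, the 99 runs that returned 128 in comment:1 were ordinary doctest failures rather than crashes or timeouts.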

@nexttime
Mannequin

nexttime mannequin commented Jul 28, 2010

comment:4

According to William on sage-release, the segfault is an intentional part of a doctest, so I've changed the ticket's title.

@nexttime
Mannequin

nexttime mannequin commented Jul 28, 2010

Changed keywords from fork nfs segfault to fork nfs device resource busy

@nexttime nexttime mannequin changed the title from "segfault / NFS problems with parallel/decorate.py" to "Errno 16 / NFS problems with parallel/decorate.py" Jul 28, 2010
@jhpalmieri
Member

comment:5

Replying to @qed777:

For now, should we tag the relevant tests with # not tested or back out the whole patch? Other options?

If we back out the whole patch, I have more confidence that the doctests will get fixed quickly.

@qed777
Mannequin

qed777 mannequin commented Jul 28, 2010

Attachment: trac_9616-backout_9501_fork_deco.patch.gz

Back out #9501

@qed777
Mannequin

qed777 mannequin commented Jul 28, 2010

Author: Mitesh Patel

@qed777
Mannequin

qed777 mannequin commented Jul 28, 2010

comment:6

Replying to @jhpalmieri:

Replying to @qed777:

For now, should we tag the relevant tests with # not tested or back out the whole patch? Other options?

If we back out the whole patch, I have more confidence that the doctests will get fixed quickly.

Adapting the procedure in this comment at #9583, I've attached a patch that undoes (or should undo) all of #9501. If the patch gets a positive review, we can open a new ticket for re-merging #9501.

@qed777 qed777 mannequin added the s: needs review label Jul 28, 2010
@nexttime
Mannequin

nexttime mannequin commented Jul 28, 2010

comment:7

Hmmm, of course a simple procedure, but we'd back out too much in my opinion...

But I can live with that. (And I'm currently too (laz|bus)y to sort out the desirable parts of the original patch.)

@nexttime
Mannequin

nexttime mannequin commented Jul 28, 2010

Attachment: trac_9616-backout_only_some_of_9501.patch.gz

Backs out only the ticket-relevant parts of #9501 (a subset of Mitesh's patch)

@nexttime
Mannequin

nexttime mannequin commented Jul 28, 2010

comment:8

Replying to @nexttime:

(And I'm currently too (laz|bus)y to sort out the desirable parts of the original patch.)

Couldn't resist though (simpler than expected).

Not heavily tested: I only ran ./sage -t -long devel/sage/sage/parallel/ successfully and rebuilt the documentation without errors or warnings
(Ubuntu 9.04 x86_64, Core2, gcc 4.3.3).

So now two concurrent patches to review... ;-)

@jhpalmieri
Member

comment:9

I've tested mpatel's patch on 5 machines: 4 on which the problem originally occurred (sage.math and skynet machines eno, iras, and taurus) and one machine (running OS X) which didn't have the original problem. After applying the patch, all tests pass for the directory "parallel" on all 5 machines. Long doctests for the whole Sage library pass on sage.math and taurus except for previously known, unrelated, failures.

I don't know if I'll get to leif's patch.

Since this is a rollback to a previous situation, I think this is good enough for a positive review for mpatel's patch, though. Opinions?

@nexttime
Mannequin

nexttime mannequin commented Jul 29, 2010

comment:10

Let the release managers decide... ;-)

@dandrake
Contributor Author

Reviewer: John Palmieri

@dandrake
Contributor Author

comment:11

Replying to @jhpalmieri:

Since this is a rollback to a previous situation, I think this is good enough for a positive review for mpatel's patch, though. Opinions?

You've done some good testing, and since the original patch was an enhancement, and didn't fix any bugs or failing doctests (right?), I think a positive review is warranted here.

@nexttime
Mannequin

nexttime mannequin commented Jul 29, 2010

comment:12

Replying to @dandrake:

... since the original patch was an enhancement, and didn't fix any bugs or failing doctests (right?)

Well, what I put back in is mostly fixes (and improvements) to the documentation (which one might consider bugfixes, too).

@nexttime
Mannequin

nexttime mannequin commented Jul 29, 2010

comment:13

Besides the above, this would also be lost completely:

       - ``reset_interface`` -- if True (the default), all the 
         pexpect interfaces are reset in the forked off 
         subprocesses.  You definitely want this, since not doing 
         this can lead to weird issues.

@nexttime
Mannequin

nexttime mannequin commented Jul 29, 2010

comment:14

Oops, forget my last comment: the reset is simply performed unconditionally in Mitesh's version.

@qed777
Mannequin

qed777 mannequin commented Jul 29, 2010

comment:15

A partial backout, since it retains only some of the changes from #9501, needs a new review, which currently, at least, we don't have. Given the need to press forward with the 4.5.2 release cycle, I'm merging attachment: trac_9616-backout_9501_fork_deco.patch into 4.5.2.rc0.

This may not be an ideal resolution, but it seems reasonable given the circumstances. Absolutely no offense is intended.

I've opened #9631 for re-merging #9501 after we fix the NFS/doctest problem.

@qed777
Mannequin

qed777 mannequin commented Jul 29, 2010

Merged: sage-4.5.2.rc0
