Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent mpz_t state after interrupted sig_realloc() #24986

Closed
embray opened this issue Mar 15, 2018 · 87 comments
Closed

Inconsistent mpz_t state after interrupted sig_realloc() #24986

embray opened this issue Mar 15, 2018 · 87 comments

Comments

@embray
Copy link
Contributor

embray commented Mar 15, 2018

TODO

Use sig_occurred() to check whether an exception from Cysignals is currently being handled while in Integer.tp_dealloc. If so, assume that the state of the object's mpz struct may not be consistent, so do not call mpz_clear on it and do not place it back in the free pool.
Don't forget to re-enable the test that was disabled in #25137 in order to test that this is fixed.

As discussed on sage-devel, I'm fairly consistently (roughly 9 times out of 10) getting the following failure on Cygwin:

sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx
**********************************************************************
File "src/sage/structure/coerce_actions.pyx", line 786, in
sage.structure.coerce_actions.IntegerMulAction._repr_name_
Failed example:
    IntegerMulAction(ZZ, GF5)
Expected:
    Left Integer Multiplication by Integer Ring on Finite Field of size 5
Got:
    Left Integer Multiplication by Integer Ring on Finite Field of size 1
**********************************************************************
1 item had failures:
   1 of   4 in sage.structure.coerce_actions.IntegerMulAction._repr_name_
    [143 tests, 1 failure, 3.30 s]
----------------------------------------------------------------------
sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx  # 1
doctest failed

Obviously Sage doesn't even allow creation of an order 1 field. In fact, I traced the cause of this to a specific line in FiniteFieldFactory.create_key_and_extra_args where, by chance, an Integer with a value of 1 is constructed (using fast_tp_new) whose (mp_limb*)(Integer.value._mp_d) member is assigned the same address as the _mp_d of the Integer that happens to hold the field's order.

The result is that the order is then set to 1 as well. This happens after the check that order>1 so creation of the field still succeeds. Clearly there is a subtle bug either in fast_tp_new, or in the memory allocator itself.

Depends on #27073

CC: @jdemeyer @xcaruso

Component: cython

Author: Jeroen Demeyer

Branch/Commit: 521bac9

Reviewer: Erik Bray

Issue created by migration from https://trac.sagemath.org/ticket/24986

@embray embray added this to the sage-8.2 milestone Mar 15, 2018
@embray

This comment has been minimized.

@jdemeyer
Copy link

comment:2

We should also consider the possibility of a bug in Cygwin, as this has not been observed on other platforms.

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:3

Replying to @jdemeyer:

We should also consider the possibility of a bug in Cygwin, as this has not been observed on other platforms.

Yep, I agree. Though I think there's also a possibility of a subtle bug. The code around which this is happening looks like this:

            order = Integer(order)
            if order <= 1:
                raise ValueError("the order of a finite field must be at least 2")

            if order.is_prime():
                p = order
                n = Integer(1)

It's the last line, n = Integer(1) where that Integer has the same _mp_d as order for some reason. Also in this case order is already passed in as an Integer, so I'm wondering if there's some subtle bug around order = Integer(order).

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:4

Maybe not--by the time I get to order = Integer(order) the next two Integers in the pool already shared the same _mp_d:

(gdb) p __pyx_v_4sage_5rings_7integer_integer_pool_count
$37 = 7
(gdb) p ((struct __pyx_obj_4sage_5rings_7integer_Integer *)__pyx_v_4sage_5rings_7integer_integer_pool[6]).value
$38 = {{_mp_alloc = 1, _mp_size = 0, _mp_d = 0x602b85c40}}
(gdb) p ((struct __pyx_obj_4sage_5rings_7integer_Integer *)__pyx_v_4sage_5rings_7integer_integer_pool[5]).value
$39 = {{_mp_alloc = 1, _mp_size = 0, _mp_d = 0x602b85c40}}

I'm going to add in some tracing and see if that reveals anything (or makes the problem go away entirely which would point more toward a Cygwin bug if it's that sensitive.

@jdemeyer
Copy link

comment:5

Can you compile the Sage library with -O0 -g just to check for a potential compiler bug? You will need to do rm -rf src/build first.

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:6

I'll give that a try in a bit. A compiler bug is also certainly a possibility. This is gcc 6.4.0.

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:7

It's not possible that there'd be anywhere in Sage where it creates Integer objects while the GIL is released, is there?

@jdemeyer
Copy link

comment:8

More generally, I think that Sage never releases the GIL. Libraries used by Sage (e.g. Numpy) might.

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:9

Aha! I put some code in fast_tp_dealloc to check, before returning an Integer to the pool, if there is already an entry in the pool with the same _mp_d, and raise a RuntimeError if so. And this condition first occurs in earlier test in the same module:

sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx
**********************************************************************
File "src/sage/structure/coerce_actions.pyx", line 758, in sage.structure.coerce_actions.IntegerMulAction._call_
Failed example:
    alarm(0.5); (2^(10^7)) * P
Expected:
    Traceback (most recent call last):
    ...
    AlarmInterrupt
Got:
    RuntimeError: [fast_tp_dealloc] DUPLICATE _mp_d: 0x0x60404ebe0L!
                            (integer_pool_count: 3)
    Exception RuntimeError: RuntimeError('[fast_tp_dealloc] DUPLICATE _mp_d: 0x0x60404ebe0L!\n                        (integer_pool_count: 3)',) in 'sage.rings.integer.fast_tp_dealloc' ignored
    Traceback (most recent call last):
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 541, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 951, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.structure.coerce_actions.IntegerMulAction._call_[11]>", line 1, in <module>
        alarm(RealNumber('0.5')); (Integer(2)**(Integer(10)**Integer(7))) * P
    SystemError: error return without exception set

The presence of alarm() also explains the slight randomness of this error. So it seems that somewhere the allocation/deallocation of integers not safe to interruption by a signal...

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:10

Specifically this is inside fast_mul which is doing some pretty heavy creation and destruction of Integers. If I just run the same code manually at the prompt and hit Ctrl-C, depending on exactly where it is fast_mul, sometimes I just get a KeyboardInterrupt raised from cysignals, or sometimes my RuntimeError gets raised.

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:11

Two new notes:

  1. I can reproduce the problem reliably on Cygwin by the following:
sage: E = EllipticCurve(GF(5), [4,0])
sage: set_random_seed(0); P = E.random_element()
sage: alarm(0.5); (2^(10^7)) * P
---------------------------------------------------------------------------
AlarmInterrupt                            Traceback (most recent call last)
<ipython-input-3-1f9f5650ad9d> in <module>()
----> 1 alarm(RealNumber('0.5')); (Integer(2)**(Integer(10)**Integer(7))) * P
...
/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer._shift_helper (build/cythonized/sage/rings/integer.c:40438)()
   6371         # Now finally call into MPIR to do the shifting.
   6372         cdef Integer z = PY_NEW(Integer)
-> 6373         sig_on()
   6374         if n < 0:
   6375             mpz_fdiv_q_2exp(z.value, self.value, -n)

src/cysignals/signals.pyx in cysignals.signals.sig_raise_exception()

AlarmInterrupt:
sage: a = Integer(5)
sage: a
5
sage: b = Integer(1)
sage: a
1

This same process works correctly on Linux though, which is still suspicious.

  1. If I remove the following lines from fast_tp_dealloc the problem appears to go away:
        # Here we free any extra memory used by the mpz_t by
        # setting it to a single limb.
        if o_mpz._mp_alloc > 10:
            _mpz_realloc(o_mpz, 1)

so it's possible also something fishy is happening with _mpz_realloc.

@jdemeyer
Copy link

comment:12

Replying to @embray:

Specifically this is inside fast_mul

Just to be clear that we are talking about the same thing: you mean fast_mul inside src/sage/structure/coerce_actions.pyx?

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:13

Yes. Then at some point the _mpz_realloc I mentioned above sets the pointer to the same address as one of the existing (presumably unused) integers in the pool.

@jdemeyer
Copy link

comment:14

It would be good to know exactly which code is executing when the interrupt/alarm happens.

There are two ways to find out:

  1. Run Python under gdb

  2. Using cysignals assume that Cygwin has a backtrace() function. You'll need to recompile cysignals with debugging. Within Sage, it suffices to do

SAGE_DEBUG=yes ./sage -f cysignals

@jdemeyer
Copy link

comment:15

Something else to try (just shooting in the dark here): explicitly call sage.ext.memory.init_memory_functions() before doing your tests.

@embray
Copy link
Contributor Author

embray commented Mar 15, 2018

comment:16

Replying to @jdemeyer:

Something else to try (just shooting in the dark here): explicitly call sage.ext.memory.init_memory_functions() before doing your tests.

What would explicitly calling it do? Shouldn't it be called automatically anyways? I confirmed in gdb that mpir is using the sage/cysignals allocation functions. I actually tried disabling it and it didn't seem to make a difference on the problem.

@jdemeyer
Copy link

comment:17

Replying to @embray:

Shouldn't it be called automatically anyways?

Of course it should. But the whole point of debugging is to verify that everything that should happen actually happens.

@jdemeyer
Copy link

comment:18

Does the cysignals testsuite pass (./sage -f -c cysignals)? Does anything change if cysignals (and the Sage library) is compiled with -O0?

@embray
Copy link
Contributor Author

embray commented Mar 16, 2018

comment:19

After rebuilding with SAGE_DEBUG=yes (mostly--singular is failing its debug builds for some reason, which is something I'll have to investigate another time), I can no longer reproduce the problem, which is disconcerting. I confirmed that the Sage modules compiled with -O0.

For reference's sake, with the alarm(0.5) (and actually with other values as well, such as alarm(1)), the AlarmInterrupt is almost always raised from the same place:

sage: alarm(1); (2^(10^7)) * P

*** SIG 14 *** inside sig_on
do_raise_exception(sig=14)
PyErr_Occurred() = 0x0
Raising Python exception 0 ms after signals...
---------------------------------------------------------------------------
AlarmInterrupt                            Traceback (most recent call last)
<ipython-input-17-57b8340ec195> in <module>()
----> 1 alarm(Integer(1)); (Integer(2)**(Integer(10)**Integer(7))) * P

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer.__mul__ (build/cythonized/sage/rings/integer.c:12921)()
   1856             return y
   1857
-> 1858         return coercion_model.bin_op(left, right, operator.mul)
   1859
   1860     cpdef _mul_(self, right):

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce.pyx in sage.structure.coerce.CoercionModel_cache_maps.bin_op (build/cythonized/sage/structure/coerce.c:9695)()
   1114             action = self.get_action(xp, yp, op, x, y)
   1115             if action is not None:
-> 1116                 return (<Action>action)._call_(x, y)
   1117
   1118         try:

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce_actions.pyx in sage.structure.coerce_actions.IntegerMulAction._call_ (build/cythonized/sage/structure/coerce_actions.c:10587)()
    772             return fast_mul_long(a, n_long)
    773
--> 774         return fast_mul(a, nn)
    775
    776     def _repr_name_(self):

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce_actions.pyx in sage.structure.coerce_actions.fast_mul (build/cythonized/sage/structure/coerce_actions.c:11610)()
    899         sig_check()
    900         pow2a += pow2a
--> 901         n = n >> 1
    902     sum = pow2a
    903     n = n >> 1

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer.__rshift__ (build/cythonized/sage/rings/integer.c:40813)()
   6444         if not isinstance(x, Integer):
   6445             return x >> int(y)
-> 6446         return (<Integer>x)._shift_helper(y, -1)
   6447
   6448     cdef _and(Integer self, Integer other):

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer._shift_helper (build/cythonized/sage/rings/integer.c:40452)()
   6375         # Now finally call into MPIR to do the shifting.
   6376         cdef Integer z = PY_NEW(Integer)
-> 6377         sig_on()
   6378         if n < 0:
   6379             mpz_fdiv_q_2exp(z.value, self.value, -n)

src/cysignals/signals.pyx in cysignals.signals.sig_raise_exception()

AlarmInterrupt:

On occasion it's elsewhere, but most of the time it's in that shift call. To learn much else I'll have to drop into gdb, but first I'll have to rebuild some modules with -O(>0) and see if the problem returns...

@jdemeyer
Copy link

comment:20

In other to narrow the problem down, it would be good to know if it matters whether cysignals is compiled with -O0 or whether the Sage library is compiled with -O0.

@jdemeyer
Copy link

comment:21

Replying to @embray:

On occasion it's elsewhere, but most of the time it's in that shift call.

The problem is that the sig_on() hides the actual place where the signal is raised. I hope that we can learn something by knowing precisely what is interrupted.

@embray
Copy link
Contributor Author

embray commented Mar 16, 2018

comment:22

Replying to @embray:

After rebuilding with SAGE_DEBUG=yes (mostly--singular is failing its debug builds for some reason, which is something I'll have to investigate another time), I can no longer reproduce the problem, which is disconcerting. I confirmed that the Sage modules compiled with -O0.

Oops--disregard that. I had forgotten exactly where I left off on this yesterday, and it appears I still had the troublesome _mpz_realloc disabled in the source I was building from. After re-enabling that the problem returns, even in the debug build.

@embray
Copy link
Contributor Author

embray commented Mar 16, 2018

comment:23

Replying to @jdemeyer:

Replying to @embray:

On occasion it's elsewhere, but most of the time it's in that shift call.

The problem is that the sig_on() hides the actual place where the signal is raised. I hope that we can learn something by knowing precisely what is interrupted.

Right, but doesn't that at least mean it's at least somewhere between that sig_on() and the corresponding sig_off()?

@jdemeyer
Copy link

comment:24

OK, so the problem persists with -O0. Good, so it looks like a real bug (either in Sage, cysignals, MPIR or Cygwin).

@embray
Copy link
Contributor Author

embray commented Mar 16, 2018

comment:25

Yes, likely one of those four things--probably not the compiler though I wouldn't rule it out completely.

@jdemeyer
Copy link

comment:26

Replying to @embray:

Right, but doesn't that at least mean it's at least somewhere between that sig_on() and the corresponding sig_off()?

Yes, but I would like to know the exact place in the MPIR code where it is interrupted.

@embray embray added this to the sage-8.7 milestone Dec 28, 2018
@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Jan 6, 2019

Changed commit from 539d0e5 to 521bac9

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Jan 6, 2019

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

6d4bb9fdoctest framework: clear exception
521bac9Throw away mpz_t after an interrupt happened

@jdemeyer
Copy link

jdemeyer commented Jan 6, 2019

Changed dependencies from #26900 to none

@embray
Copy link
Contributor Author

embray commented Jan 7, 2019

comment:76

Did you do anything about comment:70? It's not clear what you did to address this, if anything. Does cysignals 1.8.1 fix this?

@jdemeyer
Copy link

jdemeyer commented Jan 7, 2019

comment:78

Replying to @embray:

Did you do anything about comment:70? It's not clear what you did to address this, if anything. Does cysignals 1.8.1 fix this?

Yes, it is fixed by cysignals 1.8.1: sagemath/cysignals@7030f02

@embray
Copy link
Contributor Author

embray commented Jan 7, 2019

comment:79

Okay, great. I'm testing now on Cygwin, and will give positive review assuming it passes.

Also need to make sure cysignals 1.8.1 lands in Debian but that's no reason to hold up the ticket.

@embray
Copy link
Contributor Author

embray commented Jan 7, 2019

comment:80

Been running this test in a loop for more than an hour now with no crash. Of course, this still isn't 100% fool-proof but it's definitely much better.

@embray
Copy link
Contributor Author

embray commented Jan 7, 2019

Reviewer: Erik Bray

@vbraun
Copy link
Member

vbraun commented Jan 17, 2019

comment:81

See patchbot

@jdemeyer
Copy link

comment:82

So this is genuinely breaking docbuild tests somehow. That's bizarre...

@jdemeyer
Copy link

comment:83

It's the doctest change from this ticket which is breaking things. Apparently the sphinxbuild.py test is raising an exception and then it tries to display the traceback, long after the exception has been handled by the doctest framework. I consider this a fishy test, so the solution is fixing that test.

@jdemeyer
Copy link

comment:84

In fact, the test already fails on Python 3 with vanilla Sage 8.6.

@jdemeyer
Copy link

Dependencies: #27073

@vbraun
Copy link
Member

vbraun commented Jan 27, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants