Inconsistent mpz_t state after interrupted sig_realloc() #24986
Comments
comment:2
We should also consider the possibility of a bug in Cygwin, as this has not been observed on other platforms. |
comment:3
Replying to @jdemeyer:
Yep, I agree. Though I think there's also a possibility of a subtle bug. The code around which this is happening looks like this:
    order = Integer(order)
    if order <= 1:
        raise ValueError("the order of a finite field must be at least 2")
    if order.is_prime():
        p = order
        n = Integer(1)
It's the last line, |
comment:4
Maybe not--by the time I get to
I'm going to add in some tracing and see if that reveals anything (or makes the problem go away entirely, which would point more toward a Cygwin bug if it's that sensitive). |
comment:5
Can you compile the Sage library with |
comment:6
I'll give that a try in a bit. A compiler bug is also certainly a possibility. This is gcc 6.4.0. |
comment:7
It's not possible that there'd be anywhere in Sage where it creates Integer objects while the GIL is released, is there? |
comment:8
More generally, I think that Sage never releases the GIL. Libraries used by Sage (e.g. Numpy) might. |
comment:9
Aha! I put some code in
The presence of |
comment:10
Specifically this is inside |
comment:11
Two new notes:
This same process works correctly on Linux though, which is still suspicious.
    # Here we free any extra memory used by the mpz_t by
    # setting it to a single limb.
    if o_mpz._mp_alloc > 10:
        _mpz_realloc(o_mpz, 1)
so it's also possible that something fishy is happening with |
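The quoted snippet appears to belong to the path that recycles an Integer's mpz struct. As a rough illustration only, here is a toy Python model of that path combined with the sig_occurred() guard proposed in the ticket description. ToyMpz, ToyPool, POOL_SIZE and the interrupted flag are all hypothetical names invented for this sketch; they are not Sage or cysignals APIs, and the real code is Cython operating on C structs.

```python
POOL_SIZE = 4  # hypothetical pool capacity, chosen only for this sketch


class ToyMpz:
    """Hypothetical stand-in for an mpz_t."""

    def __init__(self):
        self.alloc = 1    # plays the role of _mp_alloc
        self.limbs = [0]  # plays the role of the buffer behind _mp_d


class ToyPool:
    """Hypothetical stand-in for the Integer free pool."""

    def __init__(self):
        self.free = []
        # Stand-in for "sig_occurred() reports an exception in flight".
        self.interrupted = False

    def dealloc(self, mpz):
        if self.interrupted:
            # The struct may be inconsistent: do not clear it and do not
            # recycle it. In this model the object is simply abandoned.
            return "leaked"
        if mpz.alloc > 10:
            # Mirrors the quoted snippet: shrink to a single limb.
            mpz.limbs = [0]
            mpz.alloc = 1
        if len(self.free) < POOL_SIZE:
            self.free.append(mpz)
            return "pooled"
        return "cleared"


pool = ToyPool()

big = ToyMpz()
big.alloc, big.limbs = 32, [0] * 32
status_normal = pool.dealloc(big)           # shrunk, then recycled

pool.interrupted = True
suspect = ToyMpz()
status_interrupted = pool.dealloc(suspect)  # neither cleared nor recycled
```

The design trade-off in this model is a small leak in exchange for never handing a possibly corrupted struct back to the pool.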
comment:12
Replying to @embray:
Just to be clear that we are talking about the same thing: you mean |
comment:13
Yes. Then at some point the |
comment:14
It would be good to know exactly which code is executing when the interrupt/alarm happens. There are two ways to find out:
|
comment:15
Something else to try (just shooting in the dark here): explicitly call |
comment:16
Replying to @jdemeyer:
What would explicitly calling it do? Shouldn't it be called automatically anyways? I confirmed in gdb that mpir is using the sage/cysignals allocation functions. I actually tried disabling it and it didn't seem to make a difference on the problem. |
comment:17
Replying to @embray:
Of course it should. But the whole point of debugging is to verify that everything that should happen actually happens. |
comment:18
Does the cysignals testsuite pass ( |
comment:19
After rebuilding with For reference's sake, with the
On occasion it's elsewhere, but most of the time it's in that shift call. To learn much else I'll have to drop into gdb, but first I'll have to rebuild some modules with |
comment:20
In order to narrow the problem down, it would be good to know if it matters whether cysignals is compiled with |
comment:21
Replying to @embray:
The problem is that the |
comment:22
Replying to @embray:
Oops--disregard that. I had forgotten exactly where I left off on this yesterday, and it appears I still had the troublesome |
comment:23
Replying to @jdemeyer:
Right, but doesn't that mean it's at least somewhere between that |
comment:24
OK, so the problem persists with |
comment:25
Yes, likely one of those four things--probably not the compiler though I wouldn't rule it out completely. |
comment:26
Replying to @embray:
Yes, but I would like to know the exact place in the MPIR code where it is interrupted. |
Changed dependencies from #26900 to none |
comment:76
Did you do anything about comment:70? It's not clear what you did to address this, if anything. Does cysignals 1.8.1 fix this? |
comment:78
Replying to @embray:
Yes, it is fixed by cysignals 1.8.1: sagemath/cysignals@7030f02 |
comment:79
Okay, great. I'm testing now on Cygwin, and will give positive review assuming it passes. Also need to make sure cysignals 1.8.1 lands in Debian but that's no reason to hold up the ticket. |
comment:80
Been running this test in a loop for more than an hour now with no crash. Of course, this still isn't 100% fool-proof but it's definitely much better. |
Reviewer: Erik Bray |
comment:81
See patchbot |
comment:82
So this is genuinely breaking docbuild tests somehow. That's bizarre... |
comment:83
It's the doctest change from this ticket which is breaking things. Apparently the |
comment:84
In fact, the test already fails on Python 3 with vanilla Sage 8.6. |
Dependencies: #27073 |
Changed branch from u/jdemeyer/inconsistent_mpz_t_state_after_interrupted_sig_realloc__ to |
TODO
Use sig_occurred() to check whether an exception from Cysignals is currently being handled while in Integer.tp_dealloc. If so, assume that the state of the object's mpz struct may not be consistent, so do not call mpz_clear on it and do not place it back in the free pool.
As discussed on sage-devel, I'm fairly consistently (roughly 9 times out of 10) getting the following failure on Cygwin:
Obviously Sage doesn't even allow creation of an order 1 field. In fact, I traced the cause of this to a specific line in FiniteFieldFactory.create_key_and_extra_args where, by chance, an Integer with a value of 1 is constructed (using fast_tp_new) whose (mp_limb*)(Integer.value._mp_d) member is assigned the same address as the _mp_d of the Integer that happens to hold the field's order.
The result is that the order is then set to 1 as well. This happens after the check that order>1, so creation of the field still succeeds. Clearly there is a subtle bug either in fast_tp_new, or in the memory allocator itself.
Depends on #27073
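To make the aliasing in this description concrete, here is a minimal Python model of it. It is a sketch under stated assumptions, not Sage code: ToyMpz and create_key are hypothetical names, and a shared Python list stands in for two mpz structs whose _mp_d pointers hold the same address.

```python
class ToyMpz:
    """Hypothetical stand-in for an mpz_t; `limbs` plays the role of _mp_d."""

    def __init__(self, limbs):
        self.limbs = limbs  # sharing this list models sharing an address

    def set_value(self, n):
        self.limbs[0] = n

    def value(self):
        return self.limbs[0]


def create_key(order_mpz, one_mpz):
    """Hypothetical mimic of the order check followed by n = Integer(1)."""
    if order_mpz.value() <= 1:
        raise ValueError("the order of a finite field must be at least 2")
    one_mpz.set_value(1)      # corresponds to constructing Integer(1)
    return order_mpz.value()  # what the rest of the code sees as the order


# Aliased case: both objects use the same limb buffer.
shared = [0]
order = ToyMpz(shared)
order.set_value(7)
one = ToyMpz(shared)
result_aliased = create_key(order, one)

# Healthy case: each object owns its own buffer.
result_distinct = create_key(ToyMpz([7]), ToyMpz([0]))
```

In the aliased case the check passes with order 7, yet the value read afterwards is 1, which matches the symptom reported above; with distinct buffers the order stays 7.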
CC: @jdemeyer @xcaruso
Component: cython
Author: Jeroen Demeyer
Branch/Commit:
521bac9
Reviewer: Erik Bray
Issue created by migration from https://trac.sagemath.org/ticket/24986