Docbuild segfaults when pari is compiled with threading #26608

Closed
timokau opened this issue Oct 31, 2018 · 46 comments


@timokau
Contributor

timokau commented Oct 31, 2018

This ticket is a follow-up to this sage-packaging discussion. To summarize:

Sage does not work with PARI's threading. Instead of relying on PARI being compiled without threading, I made use of the "nthreads" option to disable threading at runtime in #26002.

However, since #24655 (unconditionally enabling the threaded docbuild), the docbuild segfaults when PARI is compiled with threading support. Apparently Sage somehow uses PARI in a way that ignores the nthreads option. We get the following backtrace (provided by Antonio with an older version of Sage):

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 113, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 65, in mapstar
    return map(*args)
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/__init__.py", line 70, in build_ref_doc
    getattr(ReferenceSubBuilder(doc, lang), format)(*args, **kwds)
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/__init__.py", line 720, in _wrapper
    getattr(DocBuilder, build_type)(self, *args, **kwds)
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/__init__.py", line 104, in f
    runsphinx()
  File "/build/sagemath-doc/src/sage-8.0/local-python/sage_setup/docbuild/sphinxbuild.py", line 207, in runsphinx
    sphinx.cmdline.main(sys.argv)
  File "/usr/lib/python2.7/site-packages/sphinx/cmdline.py", line 296, in main
    app.build(opts.force_all, filenames)
  File "/usr/lib/python2.7/site-packages/sphinx/application.py", line 333, in build
    self.builder.build_update()
  File "/usr/lib/python2.7/site-packages/sphinx/builders/__init__.py", line 251, in build_update
    'out of date' % len(to_build))
  File "/usr/lib/python2.7/site-packages/sphinx/builders/__init__.py", line 265, in build
    self.doctreedir, self.app))
  File "/usr/lib/python2.7/site-packages/sphinx/environment/__init__.py", line 549, in update
    self._read_serial(docnames, app)
  File "/usr/lib/python2.7/site-packages/sphinx/environment/__init__.py", line 569, in _read_serial
    self.read_doc(docname, app)
  File "/usr/lib/python2.7/site-packages/sphinx/environment/__init__.py", line 677, in read_doc
    pub.publish()
  File "/usr/lib/python2.7/site-packages/docutils/core.py", line 217, in publish
    self.settings)
  File "/usr/lib/python2.7/site-packages/sphinx/io.py", line 55, in read
    self.parse()
  File "/usr/lib/python2.7/site-packages/docutils/readers/__init__.py", line 78, in parse
    self.parser.parse(self.input, document)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/__init__.py", line 191, in parse
    self.statemachine.run(inputlines, document, inliner=self.inliner)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 171, in run
    input_source=document['source'])
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2753, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2328, in explicit_markup
    self.explicit_list(blank_finish)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2358, in explicit_list
    match_titles=self.state_machine.match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 319, in nested_list_parse
    node=node, match_titles=match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2631, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2338, in explicit_construct
    return method(self, expmatch)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2081, in directive
    directive_class, match, type_name, option_presets)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2130, in run_directive
    result = directive_instance.run()
  File "/build/sagemath-doc/src/sage-8.0/src/sage_setup/docbuild/ext/sage_autodoc.py", line 1749, in run
    nested_parse_with_titles(self.state, self.result, node)
  File "/usr/lib/python2.7/site-packages/sphinx/util/nodes.py", line 208, in nested_parse_with_titles
    return state.nested_parse(content, 0, node, match_titles=1)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/lib/python2.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2326, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2338, in explicit_construct
    return method(self, expmatch)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2081, in directive
    directive_class, match, type_name, option_presets)
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 2130, in run_directive
    result = directive_instance.run()
  File "/usr/lib/python2.7/site-packages/docutils/parsers/rst/__init__.py", line 410, in run
    self.state, self.state_machine)
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 189, in plot_directive
    return run(arguments, content, options, state_machine, state, lineno)
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 779, in run
    close_figs=context_opt == 'close-figs')
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 644, in render_figures
    run_code(code_piece, code_path, ns, function_name)
  File "/usr/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 524, in run_code
    six.exec_(code, ns)
  File "/usr/lib/python2.7/site-packages/six.py", line 709, in exec_
    exec("""exec _code_ in _globs_, _locs_""")
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "sage/misc/classcall_metaclass.pyx", line 329, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1698)
    if cls.classcall is not None:
  File "/usr/lib/python2.7/site-packages/sage/geometry/triangulation/point_configuration.py", line 331, in __classcall__
    .__classcall__(cls, points, connected, fine, regular, star, defined_affine)
  File "sage/misc/cachefunc.pyx", line 1005, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6065)
    ArgSpec(args=['self', 'algorithm', 'deg_bound', 'mult_bound', 'prot'],
  File "/usr/lib/python2.7/site-packages/sage/structure/unique_representation.py", line 1027, in __classcall__
    instance = typecall(cls, *args, **options)
  File "sage/misc/classcall_metaclass.pyx", line 496, in sage.misc.classcall_metaclass.typecall (build/cythonized/sage/misc/classcall_metaclass.c:2148)
    """
  File "/usr/lib/python2.7/site-packages/sage/geometry/triangulation/point_configuration.py", line 367, in __init__
    PointConfiguration_base.__init__(self, points, defined_affine)
  File "sage/geometry/triangulation/base.pyx", line 398, in sage.geometry.triangulation.base.PointConfiguration_base.__init__ (build/cythonized/sage/geometry/triangulation/base.cpp:4135)
    self._init_points(points)
  File "sage/geometry/triangulation/base.pyx", line 456, in sage.geometry.triangulation.base.PointConfiguration_base._init_points (build/cythonized/sage/geometry/triangulation/base.cpp:4982)
    red = matrix([ red.row(i) for i in red.pivot_rows()])
  File "sage/matrix/matrix2.pyx", line 517, in sage.matrix.matrix2.Matrix.pivot_rows (build/cythonized/sage/matrix/matrix2.c:8414)
    """
  File "sage/matrix/matrix_integer_dense.pyx", line 2217, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.pivots (build/cythonized/sage/matrix/matrix_integer_dense.c:19162)
    sage: matrix(3, range(9)).elementary_divisors()
  File "sage/matrix/matrix_integer_dense.pyx", line 2019, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.echelon_form (build/cythonized/sage/matrix/matrix_integer_dense.c:17749)
   
  File "sage/matrix/matrix_integer_dense.pyx", line 5719, in sage.matrix.matrix_integer_dense.Matrix_integer_dense._hnf_pari (build/cythonized/sage/matrix/matrix_integer_dense.c:46635)
    most `\max\mathcal{S}` where `\mathcal{S}` denotes the full
SignalError: Segmentation fault 

That shows us that src/sage/matrix/matrix_integer_dense.pyx is involved. Apparently that file uses the cypari C bindings directly instead of the libs/pari.py interface (where the nthreads option is set). For example:

def LLL_gram(self, flag=0):
    if self._nrows != self._ncols:
        raise ArithmeticError("self must be a square matrix")

    n = self.nrows()
    # maybe should be /unimodular/ matrices ?
    P = self.__pari__()
    try:
        U = P.qflllgram(flag)
    except (RuntimeError, ArithmeticError) as msg:
        raise ValueError("qflllgram failed, "
                         "perhaps the matrix is not positive definite")
    if U.matsize() != [n, n]:
        raise ValueError("qflllgram did not return a square matrix, "
                         "perhaps the matrix is not positive definite")
    MS = matrix_space.MatrixSpace(ZZ, n)
    U = MS(U.sage())
    # Fix last column so that det = +1
    if U.det() == -1:
        for i in range(n):
            U[i,n-1] = - U[i,n-1]
    return U

Can someone more familiar with Cython and cypari tell whether the options defined in libs/pari.py would apply here? Why isn't libs/pari.py used?
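To illustrate the failure mode being asked about (a toy sketch with made-up names, not Sage's or cypari2's actual API): a runtime default like nthreads only protects callers that go through the wrapper that applies it; code that talks to the bindings directly never sees it.

```python
# Hypothetical sketch of the two call paths; class and function names
# are invented for illustration and are NOT Sage's real modules.

class PariBindings:
    """Stands in for the low-level cypari2 bindings."""
    def __init__(self):
        # Whatever PARI was configured with (e.g. --mt=pthread).
        self.nthreads = 8

    def hnf(self, rows):
        # In real PARI with --mt=pthread this could fan work out to
        # threads; here we just report which setting was in effect.
        return ("hnf", tuple(rows), self.nthreads)

_bindings = PariBindings()

def pari_interface():
    """Stands in for libs/pari.py: applies safe defaults when used."""
    _bindings.nthreads = 1  # disable threading at runtime (cf. #26002)
    return _bindings

# Direct use of the bindings (what matrix_integer_dense.pyx effectively
# does) before anything touches the interface never sees the default:
direct = _bindings.hnf([1, 2, 3])

# Going through the interface applies nthreads=1 first:
wrapped = pari_interface().hnf([1, 2, 3])

print("direct call ran with nthreads =", direct[2])
print("wrapped call ran with nthreads =", wrapped[2])
```

The point of the sketch: whether the "safe" setting applies depends entirely on which call path runs first, which matches the symptom that the docbuild segfaults even though nthreads is nominally disabled.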

CC: @antonio-rojas @jdemeyer @kiwifb @dimpase @saraedum @embray

Component: documentation

Keywords: docbuild, pari

Issue created by migration from https://trac.sagemath.org/ticket/26608

@timokau timokau added this to the sage-8.5 milestone Oct 31, 2018
@embray
Contributor

embray commented Oct 31, 2018

comment:1

I think there's a little bit of misinformation / misconception here.

There's nothing about Sage's docbuild program that uses multi-threading. It uses a process pool and builds each sub-document in a separate process.

(There are some cases where it does not run builds in subprocesses when it probably should, and I think that is contributing somewhat to the explosion of memory usage in the docbuild, but that's a separate issue).
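The process-pool pattern described above can be sketched minimally as follows (document names are hypothetical; the real code lives in sage_setup/docbuild and assumes the POSIX "fork" start method):

```python
import multiprocessing
import os

def build_document(name):
    # Each sub-document is built in a separate worker *process*, not a
    # thread; the worker's pid differs from the parent's.
    return (name, os.getpid())

# Use "fork" explicitly: that is the default on POSIX and the relevant
# case here, since forked workers inherit the parent's memory.
ctx = multiprocessing.get_context("fork")
docs = ["reference/matrices", "reference/geometry", "tutorial"]
with ctx.Pool(processes=2) as pool:
    results = pool.map(build_document, docs)

parent = os.getpid()
for name, pid in results:
    print(f"{name} built in process {pid} (parent is {parent})")
```

Note that with fork, each worker inherits the parent's state at fork time, which is exactly the situation in which a library carrying hidden thread state can misbehave in the child.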

@embray
Contributor

embray commented Oct 31, 2018

comment:2

I don't know if PARI uses openblas in its multi-threaded mode but I wonder if this is related to #26585

@antonio-rojas
Contributor

comment:3

Note that the code excerpts in the last lines of the backtrace are nonsense since I was compiling an older version of the docs. Here's the "translated" version:

  File "sage/misc/cachefunc.pyx", line 1005, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6065)
    w = self.f(*args, **kwds)
  File "/usr/lib/python2.7/site-packages/sage/structure/unique_representation.py", line 1027, in __classcall__
    instance = typecall(cls, *args, **options)
  File "sage/misc/classcall_metaclass.pyx", line 496, in sage.misc.classcall_metaclass.typecall (build/cythonized/sage/misc/classcall_metaclass.c:2148)
    return (<PyTypeObject*>type).tp_call(cls, args, kwds)
  File "/usr/lib/python2.7/site-packages/sage/geometry/triangulation/point_configuration.py", line 367, in __init__
    PointConfiguration_base.__init__(self, points, defined_affine)
  File "sage/geometry/triangulation/base.pyx", line 398, in sage.geometry.triangulation.base.PointConfiguration_base.__init__ (build/cythonized/sage/geometry/triangulation/base.cpp:4135)
    self._init_points(points)
  File "sage/geometry/triangulation/base.pyx", line 456, in sage.geometry.triangulation.base.PointConfiguration_base._init_points (build/cythonized/sage/geometry/triangulation/base.cpp:4982)
    red = matrix([ red.row(i) for i in red.pivot_rows()])
  File "sage/matrix/matrix2.pyx", line 517, in sage.matrix.matrix2.Matrix.pivot_rows (build/cythonized/sage/matrix/matrix2.c:8414)
    v = self.transpose().pivots()
  File "sage/matrix/matrix_integer_dense.pyx", line 2217, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.pivots (build/cythonized/sage/matrix/matrix_integer_dense.c:19162)
    E = self.echelon_form()
  File "sage/matrix/matrix_integer_dense.pyx", line 2019, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.echelon_form (build/cythonized/sage/matrix/matrix_integer_dense.c:17749)
    H_m = self._hnf_pari(flag, include_zero_rows=include_zero_rows)
  File "sage/matrix/matrix_integer_dense.pyx", line 5719, in sage.matrix.matrix_integer_dense.Matrix_integer_dense._hnf_pari (build/cythonized/sage/matrix/matrix_integer_dense.c:46635)
    sig_on()
SignalError: Segmentation fault 


@timokau
Contributor Author

timokau commented Oct 31, 2018

comment:5

Replying to @embray:

I don't know if PARI uses openblas in its multi-threaded mode but I wonder if this is related to #26585

I'll test if that openblas patch fixes it. Very interesting ticket, I wonder if that also causes #26130 (I've heard darwin is somewhat more prone to threading bugs).

@embray
Contributor

embray commented Oct 31, 2018

comment:6

I'm not sure if it does, but it might. I grepped the pari/gp source and it doesn't use openblas_set_num_threads directly, but something else might be. I did see some reference to omp_set_num_threads but I don't think we compile with OpenMP by default in Sage.

@embray
Contributor

embray commented Oct 31, 2018

comment:7

There's also some multi-threading support in FLINT which could be problematic, but I have no idea if that's relevant in this case.

@embray
Contributor

embray commented Oct 31, 2018

comment:8

I read on the mailing list post "It is called indirectly via matplotlib when rendering plots, see full backtrace below (btw, I had to downgrade to an old version of Sage to get a meaningful backtrace - I really dislike this trend of hiding build output, it makes it very hard to debug stuff)"

I had a very similar problem to this; it actually came from the BLAS library by way of a NumPy ufunc (I think for the "dot" product of a matrix and a vector, or of two matrices). I feel like I actually fixed this, but now I can't remember.

@embray
Contributor

embray commented Oct 31, 2018

comment:9

Do you know some direct, specific way to reproduce this so that I can try it?

@embray
Contributor

embray commented Oct 31, 2018

comment:10

Replying to @timokau:

Replying to @embray:

I don't know if PARI uses openblas in its multi-threaded mode but I wonder if this is related to #26585

I'll test if that openblas patch fixes it. Very interesting ticket, I wonder if that also causes #26130 (I've heard darwin is somewhat more prone to threading bugs).

I don't think it's related, because this only showed up when we patched fflas-ffpack to allow configuring the number of threads to use with openblas (by default it just sets it to 1). But conceivably there's a similar bug elsewhere. Possibly related to fork(). I have found many bugs in different projects related to threads/fork interaction.

@embray
Contributor

embray commented Oct 31, 2018

comment:11

You know what though--I'm looking at the relevant code in openblas, and openblas_set_num_threads(1) might actually cause it to spin up a single thread (beyond the main thread) which is actually good enough to invoke the bug I fixed. I'm going to try to confirm that though.

@jdemeyer

comment:12

Replying to @embray:

I'm not sure if it does, but it might. I grepped the pari/gp source and it doesn't use openblas_set_num_threads directly, but something else might be.

I don't think that PARI uses BLAS in any way.


@jdemeyer

comment:14

Replying to @embray:

I have found many bugs in different projects related to threads/fork interaction.

I'm also betting on this. PARI might setup some data structures related to threading (when compiled with threading support) which are invalid when running in a forked child process.


@embray
Contributor

embray commented Oct 31, 2018

comment:15

Nevermind; that does not appear to be the case, I don't think.

@embray
Contributor

embray commented Oct 31, 2018

comment:16

Replying to @jdemeyer:

Replying to @embray:

I have found many bugs in different projects related to threads/fork interaction.

I'm also betting on this. PARI might setup some data structures related to threading (when compiled with threading support) which are invalid when running in a forked child process.

Yes, I think you must be right. PARI has its own thread management, and it does not implement any pthread_atfork handler that I can find, which is strong cause to suspect it...
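For comparison, this is the kind of hook being discussed: pthread_atfork() lets a library reset its thread-pool state in a forked child. Python exposes the same idea as os.register_at_fork, which we can use to sketch the pattern (toy state standing in for PARI's; POSIX only, since it uses os.fork):

```python
import os

# Toy stand-in for a library's global thread-pool state (analogous to
# PARI's `static struct mt_pstate *pari_mt`): valid in the parent, but
# the worker threads it refers to do not exist in a forked child.
pool_state = {"threads": ["worker-1", "worker-2"]}

def reset_pool_after_fork():
    # What a pthread_atfork child handler would do: drop the stale
    # state so the child re-initializes instead of touching dead threads.
    pool_state["threads"] = []

os.register_at_fork(after_in_child=reset_pool_after_fork)

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: the after-fork handler has run, the stale pool is gone.
    os.write(w, str(len(pool_state["threads"])).encode())
    os._exit(0)
else:
    os.close(w)
    child_view = os.read(r, 16).decode()
    os.waitpid(pid, 0)
    print("parent sees", len(pool_state["threads"]), "threads;",
          "child saw", child_view)
```

A library without such a handler is in the situation suspected here: the child inherits a state pointer describing threads that only ever existed in the parent.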

@embray
Contributor

embray commented Oct 31, 2018

comment:17

ISTM PARI/GP is not even built with multi-threading enabled unless you run its Configure with either --mt=pthread or --mt=mpi. On my system anyways this is not done by default...

@embray
Contributor

embray commented Oct 31, 2018

comment:18

Now if I build PARI with --mt=pthread I can make a child process segfault if it tries to do some multi-threaded work. Granted, this is partly with my own bad code which does some things improperly. Now I'd be curious if

a) Anyone having this problem is building PARI with --mt=pthread, and

b) Exactly what code is being run in Sage that invokes multi-threading in PARI.

@timokau
Contributor Author

timokau commented Oct 31, 2018

comment:19

Replying to @embray:

ISTM PARI/GP is not even built with multi-threading enabled unless you run its Configure with either --mt=pthread or --mt=mpi. On my system anyways this is not done by default...

Yes, that is why this issue came up on sage-packaging. Some distros ship pari with threading enabled. Sage does not. My effort in #26002 was not to change that, but to make sage compatible with a system pari.

@embray
Contributor

embray commented Oct 31, 2018

comment:20

I can also make it deadlock with the right combination of evil calls.

@timokau
Contributor Author

timokau commented Oct 31, 2018

comment:21

Even when you explicitly set nthreads to 1 before going forth with that evilness?

@embray
Contributor

embray commented Nov 8, 2018

comment:22

I don't know. I've been sick for the last week, so I completely forgot exactly where I left this. Nevertheless, I now know that unless I compile PARI with --mt=pthread the problem won't occur at all. So I'll try doing that again, and then see if I can figure out exactly what code in Sage exhibits the bug. That will help me pinpoint it.

@timokau
Contributor Author

timokau commented Nov 8, 2018

comment:23

Okay, glad you're feeling better :)

Let me know if I can help.

@embray
Contributor

embray commented Jan 18, 2019

comment:24

Totally forgot about this...

@embray embray removed this from the sage-8.5 milestone Jan 18, 2019
@embray embray added this to the sage-8.7 milestone Jan 18, 2019
@embray embray self-assigned this Jan 18, 2019
@jdemeyer

comment:25

Too bad, I also forgot about this. I'm literally right now returning from a week-long PARI/GP workshop where I could have discussed this.

@antonio-rojas
Contributor

comment:26

The alternative multiprocess doc build introduced in #27490 works as a temporary workaround. I just replaced the revert with a one-line patch to enable it unconditionally for 8.7.

@embray
Contributor

embray commented Mar 25, 2019

comment:27

Removing most of the rest of my open tickets out of the 8.7 milestone, which should be closed.

@embray embray removed this from the sage-8.7 milestone Mar 25, 2019
@embray embray added the pending label Mar 25, 2019
@dimpase
Member

dimpase commented Aug 7, 2019

comment:28

See also #28242, where we enable system PARI in vanilla Sage.

@embray
Contributor

embray commented Aug 7, 2019

comment:29

Replying to @embray:

Now if I build PARI with --mt=pthread I can make a child process segfault if it tries to do some multi-threaded work. Granted, this is partly with my own bad code which does some things improperly. Now I'd be curious if

I wish I knew what I meant by this, because I want to investigate this again but I don't have a clear way yet to even reproduce the issue, and I can't find whatever example code this might be referring to... :(

@embray
Contributor

embray commented Aug 7, 2019

comment:30

Some PARI notes regarding threading:

  • Setting `--mt=pthread` in PARI's `Configure` script results in it compiling a test file called `config/pthread.c`; if that succeeds, it sets the variables `thread_engine=pthread` and `enable_tls=yes`, and should print "Using mt engine pthread" (similar result if you use `mpi`, but for now I'm just looking at pthread).

  • This also outputs the macro `#define PARI_MT_ENGINE "pthread"` in the generated `paricfg.h`.

    • In `src/language/paricfg.c` this sets a global constant `const char *paricfg_mt_engine = PARI_MT_ENGINE;`. It is only used when printing the version info, like:

      $ gp --version
          GP/PARI CALCULATOR Version 2.11.1 (released)
          amd64 running linux (x86-64/GMP-6.0.0 kernel) 64-bit version
          compiled: Aug  7 2019, gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.4)
          threading engine: pthread
          (readline v6.3 disabled, extended help enabled)

  • In the generated `Makefile` this also adds some bits, such as `MT_LIBS=-lpthread`; the file `parimt.h` contains the contents of `src/mt/pthread.h`; and the module `src/mt/pthread.c`, which implements a standard interface (including the `pari_mt_init` function), is built alongside some generic code in `src/mt/mt.c`.

    • If instead we had `--mt=single` (the default), it would compile `src/mt/single.c`, which implements the same interfaces mostly as no-ops (its `pari_mt_init` just sets the global variable `pari_mt_nbthreads = 1`).

  • In `src/mt/pthread.c` there is a global pointer declared `static struct mt_pstate *pari_mt`. This points to a `struct mt_pstate` which contains several variables describing the state of PARI's thread pool, including pointers to the threads themselves.

    • When functions like `mt_queue_start()` or `mt_queue_start_lim()` (for which `mt_queue_start()` is just a wrapper) are called, they initialize one of these `mt_pstate` structs and set the global `pari_mt` to point to it.

    • The global `pari_mt` is also referenced in `mt_queue_reset()`.

    • `pari_mt` also contains a mutex, which is used when calling `mt_queue_get` to synchronize access to per-thread results in such a way that it blocks until a result is available (but still allows interruption by SIGINT).

@dimpase
Member

dimpase commented Aug 7, 2019

comment:31

and Sage installs paricfg.h with

#define PARI_MT_ENGINE "single"

Well, I can use this on #28242 to test whether we got a single-threaded libpari.
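A quick way to run that check is to parse the installed header for the macro (a small sketch; the path to paricfg.h varies by installation, so here it operates on the header text):

```python
import re

def pari_mt_engine(header_text):
    """Extract the threading engine from the contents of paricfg.h."""
    m = re.search(r'#define\s+PARI_MT_ENGINE\s+"([^"]+)"', header_text)
    return m.group(1) if m else None

# Example with the line Sage installs:
sample = '#define PARI_MT_ENGINE "single"\n'
print(pari_mt_engine(sample))  # single
```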

@embray
Contributor

embray commented Aug 8, 2019

comment:32

Managed to reproduce the problem again by doing a parallel docbuild after building PARI with --mt=pthread, which eventually gives a segfault.

Unfortunately the build_many function doesn't give us any info on the traceback from the original exception, so I added the following patch to the docbuild code so it would at least print the traceback:

diff --git a/src/sage_setup/docbuild/__init__.py b/src/sage_setup/docbuild/__init__.py
index e406bca..4912b5c 100644
--- a/src/sage_setup/docbuild/__init__.py
+++ b/src/sage_setup/docbuild/__init__.py
@@ -49,6 +49,7 @@ import shutil
 import subprocess
 import sys
 import time
+import traceback
 import warnings

 logger = logging.getLogger(__name__)
@@ -136,6 +137,8 @@ def builder_helper(type):
             if ABORT_ON_ERROR:
                 raise
         except BaseException as e:
+            exc_type, exc_value, exc_traceback = sys.exc_info()
+            traceback.print_tb(exc_traceback)
             # We need to wrap a BaseException that is not an Exception in a
             # regular Exception. Otherwise multiprocessing.Pool.get hangs, see
             # #25161

With that, I get a long traceback (much of which is stuff in Sphinx that isn't interesting). But as previously reported--and unsurprisingly--the segfault originates from some code for a plot:

  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 644, in render_figures
    run_code(code_piece, code_path, ns, function_name)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/matplotlib/sphinxext/plot_directive.py", line 524, in run_code
    six.exec_(code, ns)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/six.py", line 709, in exec_
    exec("""exec _code_ in _globs_, _locs_""")
  File "<string>", line 1, in <module>
  File "<string>", line 2, in <module>
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/categories/finite_coxeter_groups.py", line 750, in permutahedron
    vertices = [v.change_ring(AA) for v in vertices]
  File "sage/modules/free_module_element.pyx", line 1495, in sage.modules.free_module_element.FreeModuleElement.change_ring (build/cythonized/sage/modules/free_module_element.c:11182)
    return M(self.list(), coerce=True)
  File "sage/structure/parent.pyx", line 902, in sage.structure.parent.Parent.__call__ (build/cythonized/sage/structure/parent.c:9225)
    return mor._call_with_args(x, args, kwds)
  File "sage/structure/coerce_maps.pyx", line 171, in sage.structure.coerce_maps.DefaultConvertMap_unique._call_with_args (build/cythonized/sage/structure/coerce_maps.c:4872)
    return C._element_constructor(x, **kwds)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 5601, in _element_constructor_
    return FreeModule_generic_field._element_constructor_(self, e, *args, **kwds)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 1028, in _element_constructor_
    return self.element_class(self, x, coerce, copy)
  File "sage/modules/free_module_element.pyx", line 4119, in sage.modules.free_module_element.FreeModuleElement_generic_dense.__init__ (build/cythonized/sage/modules/free_module_element.c:28564)
    entries = [coefficient_ring(x) for x in entries]
  File "sage/structure/parent.pyx", line 900, in sage.structure.parent.Parent.__call__ (build/cythonized/sage/structure/parent.c:9198)
    return mor._call_(x)
  File "sage/structure/coerce_maps.pyx", line 157, in sage.structure.coerce_maps.DefaultConvertMap_unique._call_ (build/cythonized/sage/structure/coerce_maps.c:4449)
    return C._element_constructor(x)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/qqbar.py", line 759, in _element_constructor_
    return x._algebraic_(AA)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/universal_cyclotomic_field.py", line 606, in _algebraic_
    return R(QQbar(self))
  File "sage/structure/parent.pyx", line 900, in sage.structure.parent.Parent.__call__ (build/cythonized/sage/structure/parent.c:9198)
    return mor._call_(x)
  File "sage/categories/map.pyx", line 789, in sage.categories.map.Map._call_ (build/cythonized/sage/categories/map.c:6953)
    cpdef Element _call_(self, x):
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/universal_cyclotomic_field.py", line 260, in _call_
    zeta = QQbar.zeta(k)
  File "sage/misc/cachefunc.pyx", line 1949, in sage.misc.cachefunc.CachedMethodCaller.__call__ (build/cythonized/sage/misc/cachefunc.c:10274)
    w = self._instance_call(*args, **kwds)
  File "sage/misc/cachefunc.pyx", line 1825, in sage.misc.cachefunc.CachedMethodCaller._instance_call (build/cythonized/sage/misc/cachefunc.c:9759)
    return self.f(self._instance, *args, **kwds)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/qqbar.py", line 1366, in zeta
    nf = CyclotomicField(n)
  File "sage/structure/factory.pyx", line 369, in sage.structure.factory.UniqueFactory.__call__ (build/cythonized/sage/structure/factory.c:2146)
    return self.get_object(version, key, kwds)
  File "sage/structure/factory.pyx", line 406, in sage.structure.factory.UniqueFactory.get_object (build/cythonized/sage/structure/factory.c:2350)
    return self._cache[version, cache_key]
  File "sage/misc/weak_dict.pyx", line 704, in sage.misc.weak_dict.WeakValueDictionary.__getitem__ (build/cythonized/sage/misc/weak_dict.c:3653)
    cdef PyObject* wr = PyDict_GetItemWithError(self, k)
  File "sage/cpython/dict_del_by_value.pyx", line 58, in sage.cpython.dict_del_by_value.PyDict_GetItemWithError (build/cythonized/sage/cpython/dict_del_by_value.c:1261)
    ep = mp.ma_lookup(mp, <PyObject*><void*>key, PyObject_Hash(key))
  File "sage/rings/real_lazy.pyx", line 1398, in sage.rings.real_lazy.LazyNamedUnop.__hash__ (build/cythonized/sage/rings/real_lazy.c:15533)
    return hash(complex(self))
  File "sage/rings/real_lazy.pyx", line 822, in sage.rings.real_lazy.LazyFieldElement.__complex__ (build/cythonized/sage/rings/real_lazy.c:9872)
    return self.eval(complex)
  File "sage/rings/real_lazy.pyx", line 1352, in sage.rings.real_lazy.LazyNamedUnop.eval (build/cythonized/sage/rings/real_lazy.c:14982)
    arg = self._arg.eval(R)
  File "sage/rings/real_lazy.pyx", line 1129, in sage.rings.real_lazy.LazyBinop.eval (build/cythonized/sage/rings/real_lazy.c:12722)
    left = self._left.eval(R)
  File "sage/rings/real_lazy.pyx", line 1130, in sage.rings.real_lazy.LazyBinop.eval (build/cythonized/sage/rings/real_lazy.c:12734)
    right = self._right.eval(R)
  File "sage/rings/real_lazy.pyx", line 1647, in sage.rings.real_lazy.LazyAlgebraic.eval (build/cythonized/sage/rings/real_lazy.c:17915)
    self.eval(self.parent().interval_field(64)) # up the prec
  File "sage/rings/real_lazy.pyx", line 1673, in sage.rings.real_lazy.LazyAlgebraic.eval (build/cythonized/sage/rings/real_lazy.c:18066)
    roots = self._poly.roots(ring = AA if isinstance(self._parent, RealLazyField_class) else QQbar)
  File "sage/rings/polynomial/polynomial_element.pyx", line 7721, in sage.rings.polynomial.polynomial_element.Polynomial.roots (build/cythonized/sage/rings/polynomial/polynomial_element.c:61844)
    rts = complex_roots(self, retval='algebraic')
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/polynomial/complex_roots.py", line 258, in complex_roots
    rts = cfac.roots(multiplicities=False)
  File "sage/rings/polynomial/polynomial_element.pyx", line 7629, in sage.rings.polynomial.polynomial_element.Polynomial.roots (build/cythonized/sage/rings/polynomial/polynomial_element.c:59297)
    ext_rts = self.__pari__().polroots(precision=L.prec())
  File "sage/rings/polynomial/polynomial_element.pyx", line 6021, in sage.rings.polynomial.polynomial_element.Polynomial.__pari__ (build/cythonized/sage/rings/polynomial/polynomial_element.c:49003)
    return self._pari_with_name(self._parent.variable_name())
  File "sage/rings/polynomial/polynomial_element.pyx", line 6074, in sage.rings.polynomial.polynomial_element.Polynomial._pari_with_name (build/cythonized/sage/rings/polynomial/polynomial_element.c:49395)
    vals = [x.__pari__() for x in self.list()]
  File "sage/rings/complex_number.pyx", line 593, in sage.rings.complex_number.ComplexNumber.__pari__ (build/cythonized/sage/rings/complex_number.c:7107)
    return self.real().__pari__()
  File "sage/rings/real_mpfr.pyx", line 3248, in sage.rings.real_mpfr.RealNumber.__pari__ (build/cythonized/sage/rings/real_mpfr.c:22355)
    sig_on()

Here it's probably building the reference docs for sage.categories.finite_coxeter_groups, which contains a plot using CoxeterGroup.permutahedron(); much further down, that happens to use PARI to compute some complex roots of a polynomial.

It looks like it's not actually reaching the polroots() call; it crashes earlier, while converting the polynomial coefficients to PARI values...?

@embray
Contributor

embray commented Aug 8, 2019

comment:33

The nffactor function in PARI ("Factorization of the univariate polynomial (or rational function) T over the number field nf") internally uses a parallel Chinese remainder routine in the internal function polint_chinese. This results in a thread pool being launched. This is being called at some point during the categories reference doc build.

@dimpase
Member

dimpase commented Aug 8, 2019

comment:34

If #26002 has any effect, the pool should have just a single thread, no?

@embray
Contributor

embray commented Aug 8, 2019

comment:35

On further digging, yes: I just found #26002, and PARI indeed has pari_mt_nbthreads = 1, so maybe I'm barking up the wrong tree. The polint_chinese call does not use any threads in this case.

@dimpase
Member

dimpase commented Aug 8, 2019

comment:36

hmm, maybe it still does something with thread-local variables?

@embray
Contributor

embray commented Aug 8, 2019

comment:37

On a wild guess, I tried switching the docbuild to use my multiprocessing.Pool replacement from #27490 (this is on Linux), and the crashes no longer occur.

The major difference is that in multiprocessing.Pool, each docbuild subprocess is forked from a separate thread from the main thread, whereas in my replacement they're just forked directly from the main thread.

Somehow this alone is enough to leave some structures in PARI in a bad state, and only if it was built with multithreading support in the first place (apparently).
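The difference can be sketched in plain Python (a minimal sketch of the idea, not the actual #27490 code): a fork-based "pool" that forks each worker directly from the calling thread never involves a helper thread, so the child inherits the main thread's view of any thread-local state.

```python
import os

def run_in_fork(func, *args):
    """Run func(*args) in a child process forked directly from the
    calling (main) thread, and return its exit status.

    Minimal sketch of the multiprocessing.Pool-replacement idea:
    no helper threads are involved in spawning the worker.
    """
    pid = os.fork()
    if pid == 0:
        # Child: do the work and exit immediately, skipping atexit handlers.
        try:
            func(*args)
            os._exit(0)
        except BaseException:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

# Pretend each task is one docbuild job.
statuses = [run_in_fork(print, "building doc", i) for i in range(3)]
print(statuses)
```

Each child here runs with exactly the thread-local state the main thread had at fork time, which is the property the real docbuild workers need.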

@embray
Contributor

embray commented Aug 8, 2019

comment:38

Going back to comment [comment:30]: if --mt=pthread is given, then PARI's Configure sets the variable enable_tls="yes". This in turn leads to defining a macro called ENABLE_TLS in paricfg.h.

The only effect of ENABLE_TLS is in src/headers/parisys.h:

#ifdef ENABLE_TLS
#  define THREAD __thread
#else
#  define THREAD
#endif

So variables in PARI declared THREAD use TLS in this case. This alone could be enough to suspect strange behavior in PARI when it's forked from a thread...
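Python's threading.local behaves much like C's __thread and shows the same hazard (an analogy, not PARI's actual code): state initialized in the main thread is simply absent when viewed from a freshly started thread.

```python
import threading

state = threading.local()
state.stack = "initialized"   # think of PARI's pari_mainstack, set up at init

seen = {}

def worker():
    # A brand-new thread gets brand-new thread-local storage:
    # the attribute set in the main thread does not exist here.
    seen["has_stack"] = hasattr(state, "stack")

t = threading.Thread(target=worker)
t.start()
t.join()

print("main thread sees stack:", hasattr(state, "stack"))   # True
print("new thread saw stack:  ", seen["has_stack"])         # False
```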

@dimpase
Member

dimpase commented Aug 8, 2019

comment:39

Replying to @embray:

On a wild guess, I tried switching the docbuild to use my multiprocessing.Pool replacement from #27490 (this is on Linux), and the crashes no longer occur.

This is matching what I observed on #28242 (and the workaround I added to that branch) --
so we're in the same wilderness :-)

@embray
Contributor

embray commented Aug 8, 2019

comment:40

Replying to @embray:

So variables in PARI declared THREAD use TLS in this case. This alone could be enough to suspect strange behavior in PARI when it's forked from a thread...

Yup. That's really all it is. PARI has tons of global variables declared __thread, including but not least of all pari_mainstack. When multiprocessing.Pool spins up a new thread, all of those __thread variables are suddenly set back to their initialization values (0x0, typically).

This isn't so unusual in PARI's case. It assumes that it is the only one starting new threads, and that it manages all of them; it does not expect ever to be used inside someone else's multi-threaded application.
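Putting the two together gives a Python sketch of the crash mechanism (using threading.local as a stand-in for PARI's __thread globals): when a process is forked from a newly started thread, as can happen under multiprocessing.Pool, the child inherits that thread's fresh, empty thread-local state rather than the main thread's initialized one.

```python
import os
import threading

tls = threading.local()
tls.mainstack = 0xDEADBEEF  # set in the main thread, like pari_init() would

def fork_from_here(out):
    # This function runs in a new thread, whose TLS starts out empty.
    pid = os.fork()
    if pid == 0:
        # Child: it sees the forking thread's (empty) TLS, so the
        # "mainstack" set up in the main thread is gone.
        os._exit(0 if not hasattr(tls, "mainstack") else 1)
    _, status = os.waitpid(pid, 0)
    out.append(os.WEXITSTATUS(status))

out = []
t = threading.Thread(target=fork_from_here, args=(out,))
t.start()
t.join()
print("child saw empty TLS:", out == [0])
```

In PARI's case the child would then dereference an uninitialized pari_mainstack, which is the segfault seen during the docbuild.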

@embray
Contributor

embray commented Aug 15, 2019

comment:41

#28356 proposes a workaround for this issue.

It won't solve the issue in general (PARI is not safe to use in arbitrary multi-threaded code), but at least it won't crash when building the docs.

@jdemeyer

comment:42

Did somebody check that this problem is limited to docbuilds? Does the Sage testsuite pass (apart from doc-related tests of course)?

@antonio-rojas
Contributor

comment:43

Replying to @jdemeyer:

Did somebody check that this problem is limited to docbuilds? Does the Sage testsuite pass (apart from doc-related tests of course)?

Yes, there are no test suite failures related to this.

@embray
Contributor

embray commented Aug 26, 2019

comment:44

Replying to @jdemeyer:

Did somebody check that this problem is limited to docbuilds? Does the Sage testsuite pass (apart from doc-related tests of course)?

In principle it's not limited to docbuilds: the broader problem, for which I would like to find a better resolution, is that cypari2 Pari instances are simply not thread-safe. A more general approach would be to make Sage's single Pari instance actually thread-local.

As it is, Sage doesn't use threads for much of anything, whether in the tests, or in general, so the problem arises primarily in the docbuild.

But this can cause problems for anyone who carelessly tries to use multiprocessing.Pool, or threads in general, in Sage* :(

* If they do anything in those threads that happens to use PARI.
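The thread-local-instance idea could look roughly like this (a hypothetical sketch, not cypari2's actual API; DummyPari stands in for the real Pari constructor): each thread lazily creates and caches its own instance, so no thread ever touches state backed by another thread's TLS.

```python
import threading

class ThreadLocalInstance:
    """Hand each thread its own lazily-created instance of some class.

    Hypothetical sketch of a "thread-local Pari instance": the factory
    would be the real Pari() constructor; here we use a dummy class.
    """
    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        inst = getattr(self._local, "instance", None)
        if inst is None:
            # First access from this thread: create and cache an instance.
            inst = self._local.instance = self._factory()
        return inst

class DummyPari:  # stand-in for the real interface class
    pass

shared = ThreadLocalInstance(DummyPari)

# Repeated access from one thread yields the same instance...
assert shared.get() is shared.get()

# ...but a different thread gets its own.
other = []
t = threading.Thread(target=lambda: other.append(shared.get()))
t.start()
t.join()
print("distinct per thread:", other[0] is not shared.get())
```

For PARI this would only help if each per-thread instance also ran the library's per-thread initialization, which is exactly what the TLS build expects.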

@embray
Contributor

embray commented Nov 25, 2019

comment:45

Closing this ticket for now, since the immediate problem (Sage's docbuild) has been worked around by #28356. The broader issue still should have a ticket to track it, even if it won't be resolved any time soon, so I have opened #28800.
