Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in multiple tests on OS X, Python 3.9 #376

Closed
hyanwong opened this issue Nov 30, 2020 · 52 comments
Closed

Segfault in multiple tests on OS X, Python 3.9 #376

hyanwong opened this issue Nov 30, 2020 · 52 comments
Milestone

Comments

@hyanwong
Copy link
Member

Tests all pass with OS X on python 3.8, but on 3.9 there are multiple failures. The first I get, from running python3 -m pytest -vs -n0 is in tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors:

python3 -m pytest -vs -n0
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.0, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/local/anaconda3/envs/python39/bin/python3
cachedir: .pytest_cache
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini, testpaths: tests
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
collected 598 items                                                                                                                                  

tests/test_cli.py::TestMain::test_cli_main PASSED
tests/test_cli.py::TestMain::test_main PASSED
tests/test_cli.py::TestDefaultPaths::test_get_default_path PASSED
tests/test_cli.py::TestDefaultPaths::test_get_ancestors_path PASSED
tests/test_cli.py::TestDefaultPaths::test_get_ancestors_trees_path PASSED
tests/test_cli.py::TestDefaultPaths::test_get_output_trees_path PASSED
tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors Fatal Python error: Segmentation fault

Current thread 0x0000000118bb1dc0 (most recent call first):
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 1095 in __complete_epoch
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 1122 in __match_ancestors_single_threaded
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 1169 in match_ancestors
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 370 in match_ancestors
  File "/Users/yan/Documents/Research/BDI/tsinfer/tests/test_cli.py", line 132 in setUp
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/unittest/case.py", line 547 in _callSetUp
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/unittest/case.py", line 589 in run
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/unittest/case.py", line 653 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/unittest.py", line 278 in runtest
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 256 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 310 in from_call
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 216 in call_and_report
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 127 in runtestprotocol
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 338 in pytest_runtestloop
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 313 in _main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 257 in wrap_session
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pytest/__main__.py", line 5 in <module>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/runpy.py", line 87 in _run_code
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/runpy.py", line 197 in _run_module_as_main
zsh: segmentation fault  python3 -m pytest -vs -n0

Running in parallel identifies at least these others as problems

maximum crashed workers reached: 16

====================================================================== FAILURES ======================================================================
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw1] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw1' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_infer_from_ts'
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw2] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw2' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors'
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw3] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw3' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_infer'
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw0] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw0' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_infer_bad_file'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw6] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw6' crashed while running 'tests/test_inference.py::TestAugmentedAncestorsRoundTrip::test_simple_example'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw7] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw7' crashed while running 'tests/test_inference.py::TestMissingDataRoundTrip::test_random_data_invariant_sites_ancestral_state'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw5] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw5' crashed while running 'tests/test_inference.py::TestUnfinalisedErrors::test_match_samples_unfinalised'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw8] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw8' crashed while running 'tests/test_inference.py::TestAlgorithmsExactlyEqualPrecision24::test_four_samples'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw9] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw9' crashed while running 'tests/test_inference.py::TestExtractAncestors::test_simple_simulation'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw10] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw10' crashed while running 'tests/test_evaluation.py::TestPerfectInference::test_single_tree_defaults'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw4] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw4' crashed while running 'tests/test_inference.py::TestAugmentedAncestorsRoundTrip::test_random_data_large_example'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw11] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw11' crashed while running 'tests/test_evaluation.py::TestPerfectInference::test_small_smc_path_compression'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw13] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw13' crashed while running 'tests/test_inference.py::TestSparseAncestorsRoundTrip::test_single_locus_two_samples'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw14] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw14' crashed while running 'tests/test_inference.py::TestGeneratedAncestors::test_no_recombination'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw12] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw12' crashed while running 'tests/test_inference.py::TestAugmentedAncestorsRoundTrip::test_single_locus'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw15] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw15' crashed while running 'tests/test_inference.py::TestMatchSamples::test_partial_bad_indexes'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw16] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw16' crashed while running 'tests/test_inference.py::TestPathCompressionSamplesCEngine::test_simulation_with_error'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw18] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw18' crashed while running 'tests/test_evaluation.py::TestMeanSampleAncestry::test_random_data_inferred_no_simplify'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw17] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw17' crashed while running 'tests/test_evaluation.py::TestCountSampleChildEdges::test_inferred_simplify'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw19] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw19' crashed while running 'tests/test_evaluation.py::TestNodeSpan::test_inferred'
@jeromekelleher
Copy link
Member

Great, thanks @hyanwong. Can you run this through gdb please?

gdb python
run -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
bt

This should run until it segfaults, and then we want to print out the backtrace. (bt)

Or whatever equivalent debugger is present in the clang toolchain.

@jeromekelleher jeromekelleher added this to the Release 0.2.0 milestone Nov 30, 2020
@hyanwong
Copy link
Member Author

OS X needs to codesign gdb, or run under sudo. Once I do that, it's not super helpful with the default install of Python 3.9 (below). I guess I need to recompile Python 3.9 for this to work?

Reading symbols from /usr/local/bin/python3...
(No debugging symbols found in /usr/local/bin/python3)
(gdb) run -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
Starting program: /usr/local/bin/python3 -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
[New Thread 0x2603 of process 34225]
[New Thread 0x1903 of process 34225]
warning: unhandled dyld version (16)
[New Thread 0x2303 of process 34225]

Thread 3 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x2303 of process 34225]
0x000000010000e000 in ?? ()
(gdb) bt
#0  0x000000010000e000 in ?? ()
#1  0x0000000100000000 in ?? ()
#2  0x0000000000000006 in ?? ()
#3  0x00007ffeefbffbc0 in ?? ()
#4  0x00007ffeefbffc39 in ?? ()
#5  0x00007ffeefbffc3c in ?? ()
#6  0x00007ffeefbffc43 in ?? ()
#7  0x00007ffeefbffc47 in ?? ()
#8  0x00007ffeefbffc4b in ?? ()
#9  0x0000000000000000 in ?? ()

@benjeffery
Copy link
Member

Can you run

make clean
CFLAGS="-ggdb -O0" python3 setup.py build_ext --inplace

before the pytest?

Although if your stack is only 9 deep then we are probably still in Python-land, which is weird.

@hyanwong
Copy link
Member Author

Oddly, that hangs when running.

I'll try to recompile Python 3.9 with the debug flags.

@hyanwong
Copy link
Member Author

Hmm https://unconj.ca/blog/setting-up-gdb-for-debugging-python-on-os-x.html:

"Note that for some reason, Python always sends a SIGTRAP before it begins executing the script. Don’t worry, just continue. If you want to find out why your program segfaults, you can run bt:"

@benjeffery
Copy link
Member

Ah didn't know that! Just do continue then!

@hyanwong
Copy link
Member Author

Can't get the damn thing to do anything other than hang now. Argh.

@benjeffery
Copy link
Member

make clean
make

?

@hyanwong
Copy link
Member Author

Yep. Tried all that

@jeromekelleher
Copy link
Member

Where did you get your Python from @hyanwong? Is there a debug version? We'll need the debug symbols everywhere to get to the bottom of this.

@hyanwong
Copy link
Member Author

Slightly weirdly, after make; make clean, I can't now run the tests normally:

/usr/local/bin/python3.9 -m pytest tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.0, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
gw0 [0] / gw1 [0] / gw2 [0] / gw3 [0]                                                                                                                

======================================================================= ERRORS =======================================================================
_________________________________________________________ ERROR collecting tests/test_cli.py _________________________________________________________
ImportError while importing test module '/Users/yan/Documents/Research/BDI/tsinfer/tests/test_cli.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/local/Cellar/python@3.9/3.9.0_2/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_cli.py:35: in <module>
    import tsinfer
tsinfer/__init__.py:38: in <module>
    from .inference import *  # NOQA
tsinfer/inference.py:36: in <module>
    import _tsinfer
E   ModuleNotFoundError: No module named '_tsinfer'
============================================================== short test summary info ===============================================================
ERROR tests/test_cli.py

@hyanwong
Copy link
Member Author

Where did you get your Python from @hyanwong? Is there a debug version? We'll need the debug symbols everywhere to get to the bottom of this.

I'm trying it on 2 different python 3.9 versions, one from conda, another from the OS X installation repo "Homebrew". Neither seem to have debug symbols compiled in, so I guess I'll need to compile my own.

@jeromekelleher
Copy link
Member

Yes, looks like compiling your own might be the simplest for getting debug symbols, as there's mac-specific technical issues with the conda distributed binaries.

@hyanwong
Copy link
Member Author

I guess this is CPython, right?

@benjeffery
Copy link
Member

From reading your link https://unconj.ca/blog/setting-up-gdb-for-debugging-python-on-os-x.html it seems they have symbols without recompiling python.

@hyanwong
Copy link
Member Author

Well now. If I compile my own version of python3.9.1rc1, the tests all now pass.

@jeromekelleher
Copy link
Member

Interesting/annoying. What version of Python and where does it come from are we using in CI @benjeffery? Is this some issue with Apple breaking things again?

@hyanwong
Copy link
Member Author

I don't think this is Apple, as they don't provide Python 3.9, as far as I know. I suspect it's conda? Or maybe there's been a bug fix in Python 3.9.1?

@jeromekelleher
Copy link
Member

I suspect it's conda?

You hit the problem via Homebrew as well though, right?

@hyanwong
Copy link
Member Author

hyanwong commented Nov 30, 2020

Both on 3.9.0.

I've just tried 3.9.1 compiled via the pyenv package (https://github.com/pyenv/pyenv)

pyenv install -v 3.9-dev
# compiles python3.9d
python3.9d -m pytest                      
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.1rc1+, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini, testpaths: tests
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
gw0 ok / gw1 ok / gw2 ok / gw3 ok
gw0 [598] / gw1 [598] / gw2 [598] / gw3 [598]

Works.

So perhaps we should simply not support 3.9, but only 3.9.1 on OS X? Apparently 3.9.1 will be released in December.

@jeromekelleher
Copy link
Member

I'd very much like to know what happened here - segfaults shouldn't happen, ever.

@hyanwong
Copy link
Member Author

I can recompile cpython at 3.9.0 and see if it segfaults.

@jeromekelleher
Copy link
Member

Yes please - including debug symbols if you can.

@benjeffery
Copy link
Member

The CI python is from miniconda:
/usr/local/miniconda/envs/anaconda-client-env/bin/python
Python 3.9.0

Here are the flags it was built with:

CC = "clang"
CFLAGS = "-Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC  -O2  -isystem /usr/local/miniconda/envs/anaconda-client-env/include -fPIC  -O2  -isystem /usr/local/miniconda/envs/anaconda-client-env/include "

Obtained with the awesome
python -m sysinfo

@benjeffery
Copy link
Member

I'm running gdb on CI now.

@benjeffery
Copy link
Member

Proving tricky!

Reading symbols from python...
(No debugging symbols found in python)
warning: `/tmp/lto.o': can't open to read symbols: No such file or directory.
Starting program: /usr/local/miniconda/envs/anaconda-client-env/bin/python -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
Note: this version of macOS has System Integrity Protection.
Because `startup-with-shell' is enabled, gdb has worked around this by
caching a copy of your shell.  The shell used by "run" is now:
    /Users/runner/Library/Caches/gdb/bin/bash
Unable to find Mach task port for process-id 2464: (os/kern) failure (0x5).
 (please check gdb is codesigned - see taskgated(8))
The program is not being run.
No stack.

@hyanwong
Copy link
Member Author

Yes, that's why I was having problems!

@hyanwong
Copy link
Member Author

hyanwong commented Nov 30, 2020

You can run as sudo. Or code-sign the gdb binary.

@hyanwong
Copy link
Member Author

More oddities. This version of python, compiled from source by Pyenv, crashes:

(base) yan@Yans-New-Air tsinfer % /Users/yan/.pyenv/versions/3.9.0/bin/python3.9          
Python 3.9.0 (default, Nov 30 2020, 17:02:19) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
>>> 

Whereas this one, compiled by me from the GitHub repo, passes:

(base) yan@Yans-New-Air tsinfer % /usr/local/bin/python3.9                      
Python 3.9.0 (tags/v3.9.0:9cf6752276, Nov 30 2020, 20:02:19) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin

loading with python3.9 -v reveals that the only different library loaded is encodings.ascii which isn't loaded in the version that crashes. But that can't be anything to do with it, I would have thought.

@hyanwong
Copy link
Member Author

Could it be the same issue as pybind11 had, here?

@jeromekelleher
Copy link
Member

All very strange.

@jeromekelleher
Copy link
Member

I'd be happy this isn't our fault if I could get a backtrace from gbd - did you have any success building these with debug symbols @hyanwong? Avoiding the CLI tests will get rid of a level of indirection, hopefully make things more straightforward.

@benjeffery
Copy link
Member

benjeffery commented Nov 30, 2020

Maybe the CLI is a red herring here, and it just happens to be the first test that provokes the segfault. Can we find a test that doesn't involve the CLI that segfaults?

tests/test_inference.py::TestUnfinalisedErrors::test_match_samples_unfinalised

I'm still digging. Can't get gdb working on CI as can't sign the binary so bisecting to find the cause.

@hyanwong
Copy link
Member Author

Unfortunately the version with gdb is the one that passed. I'm getting intermittent segfault failures using 3.9.0rc1, all when calling evaluation.py (see below), but not in a repeatable manner - seems like it could be a race condition. I don't (seem to) get them when calling evaluation.py directly though - I might just need to keep on rerunning. But I'm not sure whether to dig deeper as this might simply be an RC1 thing.

E               subprocess.CalledProcessError: Command '['/usr/local/bin/python3.9', 'evaluation.py', 'ancestor-properties', '-n', '5', '-l', '0.5', '-d', '/var/folders/00/l36df9q14hv13hhw8wglncch0000gr/T/tsi_eval53fi_31d']' died with <Signals.SIGSEGV: 11>.

@hyanwong
Copy link
Member Author

Yay! Got one (although this is using 3.9.0rc1, not the actual release:

(gdb) run  evaluation.py ancestor-properties -n 5 -l 0.5
Starting program: /usr/local/bin/python3.9 evaluation.py ancestor-properties -n 5 -l 0.5
[New Thread 0x2607 of process 24849]
[New Thread 0x1b2f of process 24849]
warning: unhandled dyld version (16)
[New Thread 0x1e0b of process 24849]
[New Thread 0x1f03 of process 24849]
          num_sites  num_trees  estimated_anc_num  estimated_anc_mean_len  ...  recombination_rate  mutation_rate       Ne  random_seed
length                                                                     ...                                                         
50000.0        40.1       32.5               10.8                0.887581  ...        1.000000e-08   1.000000e-08  10000.0  617528857.4
100000.0       80.5       60.7               18.6                0.806291  ...        1.000000e-08   1.000000e-08  10000.0  680026157.5
150000.0      125.2       97.1               26.4                0.721979  ...        1.000000e-08   1.000000e-08  10000.0  618470870.1
200000.0      155.7      118.7               32.2                0.647576  ...        1.000000e-08   1.000000e-08  10000.0  441609977.5
250000.0      212.9      153.8               39.3                0.659095  ...        1.000000e-08   1.000000e-08  10000.0  514572925.7
300000.0      234.6      184.9               45.3                0.522309  ...        1.000000e-08   1.000000e-08  10000.0  557663632.2
350000.0      283.6      217.7               57.7                0.473193  ...        1.000000e-08   1.000000e-08  10000.0  581671619.9
400000.0      341.0      262.1               67.9                0.454860  ...        1.000000e-08   1.000000e-08  10000.0  442279087.5
450000.0      402.9      298.5               80.9                0.379989  ...        1.000000e-08   1.000000e-08  10000.0  446110263.2
500000.0      425.8      321.3               81.2                0.419182  ...        1.000000e-08   1.000000e-08  10000.0  621938852.6

[10 rows x 13 columns]

Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001310bb in _Py_DECREF (filename=0x100400640 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422	    if (--op->ob_refcnt != 0) {

@hyanwong
Copy link
Member Author

hyanwong commented Nov 30, 2020

Ah, no, I get the same error with python 3.9.0 (the actual release, not the rc1 version), but it's not 100% repeatable. I'll see if it goes away with 3.9.1rc1

Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422	    if (--op->ob_refcnt != 0) {

@benjeffery
Copy link
Member

benjeffery commented Nov 30, 2020

Can you get the full bt?
I've added some logging and now the bug is gone.... looking like a race condition in TreeSequenceBuilder_add_mutations

@hyanwong
Copy link
Member Author

It seems to disappear with python 3.9.1rc1. Stupidly I forgot to backtrace. Doh. Need to recompile 3.9.0 and do it again. Damn!

@hyanwong
Copy link
Member Author

hyanwong commented Dec 1, 2020

Here's the full bt @benjeffery. Anecdotally it seems to be triggered only when I'm running a lot of threads at a time. I've left the GDB session running in case you need any more detail. Off to bed now!

Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422	    if (--op->ob_refcnt != 0) {
(gdb) bt
#0  0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
#1  0x00000001001316ac in _Py_XDECREF (op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:497
#2  0x0000000100130ebd in meth_dealloc (m=0x117ea29b0) at Objects/methodobject.c:169
#3  0x000000010013c5e6 in _Py_Dealloc (op=0x117ea29b0) at Objects/object.c:2209
#4  0x000000010011449e in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0x117ea29b0) at ./Include/object.h:430
#5  0x000000010011827c in _Py_XDECREF (op=0x117ea29b0) at ./Include/object.h:497
#6  0x000000010011ed49 in free_keys_object (keys=0x117525310) at Objects/dictobject.c:598
#7  0x00000001001164b9 in dictkeys_decref (dk=0x117525310) at Objects/dictobject.c:333
#8  0x000000010011ab06 in dict_dealloc (mp=0x117ea2f50) at Objects/dictobject.c:2026
#9  0x000000010013c5e6 in _Py_Dealloc (op=0x117ea2f50) at Objects/object.c:2209
#10 0x00000001002defbe in _Py_DECREF (filename=0x1003f67be "Python/pystate.c", lineno=768, op=0x117ea2f50) at ./Include/object.h:430
#11 0x00000001002e0624 in _PyInterpreterState_ClearModules (interp=0x102808210) at Python/pystate.c:768
#12 0x00000001002b6398 in _PyImport_Cleanup (tstate=0x100c05790) at Python/import.c:623
#13 0x00000001002d98bd in Py_FinalizeEx () at Python/pylifecycle.c:1431
#14 0x0000000100316d0d in Py_RunMain () at Modules/main.c:679
#15 0x00000001003170b9 in pymain_main (args=0x7ffeefbffab0) at Modules/main.c:707
#16 0x0000000100317107 in Py_BytesMain (argc=7, argv=0x7ffeefbffb20) at Modules/main.c:731
#17 0x00000001000013b2 in main ()

@benjeffery
Copy link
Member

I think that confirms this is the same issue kind of issue as the pybind one you linked above. Will look into the reasoning there and how they fixed.

@jeromekelleher
Copy link
Member

I'd be surprised if this turned out to be a problem with our code now - seems like a bug in Python like the pybind one? I guess we just disable the tests for now until 3.9.1 is released?

@benjeffery
Copy link
Member

I don't think this is worth much more digging for now, so I agree we wait for 3.9.1.

@jeromekelleher
Copy link
Member

I think we're satisfied that this isn't our bug, so I'm going to take it out of the next release and remove the "bug" label. We can keep it open until we've got CI running on 3.9.1, though.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 11, 2020

The semi-standard OS X package manager "homebrew" has released python 3.9.1 and all the tests pass on my laptop with that installation. So hopefully when Github Actions & Conda have a 3.9.1 release this will all start working again. I'll run with 3.9 from now on, so I should spot any problems I hope.

@jeromekelleher
Copy link
Member

I had a go at reproducing this using fastmac. I wasn't able to get a working environment in about an hour - I tried with homebrew/pip initially, and this failed because numcodecs/blosc seems to have some issue on recent llvm compilers. I tried with conda, but borked up the whole thing and had to give up. I'll have another go at some point when I get a chance.

Macs are such a horrid unix environment...

@benjeffery
Copy link
Member

I doubt this will be the last time so certainly worth keeping what ever incantations you use to get the debugger working.

@jeromekelleher
Copy link
Member

I've gotten a working environment using fastmac. Using python 3.9 from conda I can reproduce the segfault if I use the built-in clang 12 compiler. However, if I install clang11 from conda-forge (conda install -c conda-forge c-compiler) and compile use this, there's no issue.

So, it does seem to be some connection with clang12, and maybe something to do with using different compilers for building Python and the extension module (which shouldn't matter, but hey). @hyanwong, did you get a version of this working where both Python and the extension module were build with clang12? I failed at this because of problems with numcodecs under clang12.

@jeromekelleher
Copy link
Member

Huh, and now I can't reproduce even if I revert to using clang12 as the compiler. There must be something else in the toolchain that's causing the problem, which conda-forge c-compiler solves.

So, I'm inclined to say this is a weird mac problem caused by it being a fragmented crappily executed unix development environment. I think this is a wider problem (see the nodes here e.g.), so there's no much we can do. If people hit the problem we can advise them to conda install -c conda-forge c-compiler, but we'll be shipping conda packages for tsinfer 0.2 soon, so that might no be necessary either. (Good motivation for shipping conda packages, too!)

I'll see if I can get CI working by changing the compilers.

jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Dec 17, 2020
@hyanwong
Copy link
Member Author

hyanwong commented Dec 17, 2020

I found that compiling wasn't very reproducible, but seemed to depend on number of threads. OTOH I didn't test extensively, so could be accidental correlation rather than causation.

@hyanwong
Copy link
Member Author

Just to note for posterity that compiling with the defaiult Mac compilers seems to work OK now on apple silicon in the most recent macOS 14.2 (Sonoma). See #909.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants