Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segmentation fault when trying to run msprime.log_arg_likelihood() #2107

Closed
savitakartik opened this issue Sep 13, 2022 · 17 comments · Fixed by #2114
Closed

segmentation fault when trying to run msprime.log_arg_likelihood() #2107

savitakartik opened this issue Sep 13, 2022 · 17 comments · Fixed by #2114
Labels
Milestone

Comments

@savitakartik
Copy link

import tskit
import msprime
import random

random.seed(6)
nreps = 20
ne = [10, 1e2, 1e3, 1e4, 3e4, 5e4, 7e4, 1e5]

for i in range(nreps):
        ifile = "slim_msp_rep_" + str(i+1) + ".trees"
        ts = tskit.load(ifile)
        for j in range(len(ne)):
                print(msprime.log_arg_likelihood(ts, recombination_rate = 0, Ne = ne[j]))

This code results in a Segmentation fault more often than not. The tree sequences are from a single-locus SLiM model without recombination, followed by recapitation of the deep past with msprime. I use a random seed of 6 with SLiM and msprime.
I'm not sure how to reproduce the error as the code runs fine for the same tree sequence on occasion.

@jeromekelleher jeromekelleher added this to the 1.2.1 milestone Sep 13, 2022
@jeromekelleher
Copy link
Member

Thanks for the bug report @savitakartik.

It looks like this will fail for a specific tree sequence file that you have stored - could you narrow it down to one file that definitely does result in a segfault, at least some of the time, please?

The file shouldn't be too big, so hopefully you can just attach it here.

@JereKoskela
Copy link
Member

This is speculation without seeing the details, but since part of the tree comes from SLiM I'm guessing this will have something to do with simultaneous mergers. I wouldn't be surprised if those could cause the way the likelihood calculation traverses nodes and edges to go haywire. The likelihood is only really valid for tree sequences which are output by the Hudson coalescent with recombination algorithm; for anything else we should expect things to break. Even if the code runs, the output won't be very meaningful.

@jeromekelleher
Copy link
Member

That's true - still, we shouldn't segfault, so I'd like to get to the bottom of this.

@savitakartik
Copy link
Author

savitakartik commented Oct 11, 2022

For example, this tree sequence file segfaults for Ne=10, 1000, 10000
slim_msp_rep_7.zip

@jeromekelleher
Copy link
Member

Hmm, I can't reproduce. I've run this code:

import msprime
import tskit
import tqdm

ts = tskit.load("slim_msp_rep_7.trees")
print(ts)

for _ in tqdm.tqdm(range(1000)):
    ll = msprime.log_arg_likelihood(ts, recombination_rate = 0, Ne=1000)
print(ll)

Are you running this on the BMRC cluster? What version of msprime/Python etc?

@savitakartik
Copy link
Author

Sorry, wrong file - could you try this file but without the tqdm module, although I don't know why it would make a difference. I get a segmentation fault with:

import msprime
import tskit

ts = tskit.load("slim_rep_7.trees")
print(ts)

for i in range(1000):
        ll = msprime.log_arg_likelihood(ts, recombination_rate = 0, Ne=1000)
        print(i, ll)

but when I import tqdm (e.g. your code above), it works fine.

I'm using Python 3.10.5, msprime 1.2.0, tskit 0.5.2 on the BMRC cluster.
slim_rep_7.zip

@jeromekelleher
Copy link
Member

Great, just reproduced:

$ gdb python3
(gdb) run segfault.py 
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffff73d5274 in msp_log_likelihood_arg (ts=ts@entry=0x7fffffffd590, r=0, Ne=1000, 
    r_lik=r_lik@entry=0x7fffffffd588) at lib/likelihood.c:179
179                 edge = last_parent_edge[edges->child[edge]];
(gdb) bt
#0  0x00007ffff73d5274 in msp_log_likelihood_arg (ts=ts@entry=0x7fffffffd590, r=0, Ne=1000, 
    r_lik=r_lik@entry=0x7fffffffd588) at lib/likelihood.c:179
#1  0x00007ffff73be4db in msprime_log_likelihood_arg (self=<optimised out>, 
    args=<optimised out>, kwds=<optimised out>) at msprime/_msprimemodule.c:3198
#2  0x00000000005f6929 in PyCFunction_Call ()
#3  0x00000000005f74f6 in _PyObject_MakeTpCall ()
#4  0x000000000057164c in _PyEval_EvalFrameDefault ()
#5  0x0000000000569dba in _PyEval_EvalCodeWithName ()
#6  0x00000000005f6eb3 in _PyFunction_Vectorcall ()
#7  0x000000000056cc1f in _PyEval_EvalFrameDefault ()
#8  0x0000000000569dba in _PyEval_EvalCodeWithName ()
#9  0x00000000006902a7 in PyEval_EvalCode ()
#10 0x000000000067f951 in ?? ()
#11 0x000000000067f9cf in ?? ()
#12 0x000000000067fa71 in ?? ()
#13 0x0000000000681b97 in PyRun_SimpleFileExFlags ()
#14 0x00000000006b9d32 in Py_RunMain ()
#15 0x00000000006ba0bd in Py_BytesMain ()
#16 0x00007ffff7dd7083 in __libc_start_main (main=0x4efd60 <main>, argc=2, 
    argv=0x7fffffffdea8, init=<optimised out>, fini=<optimised out>, 
    rtld_fini=<optimised out>, stack_end=0x7fffffffde98) at ../csu/libc-start.c:308
#17 0x00000000005fc5fe in _start ()

so the segfault happens here

Digging in:

(gdb) frame 0
#0  0x00007ffff73d5274 in msp_log_likelihood_arg (ts=ts@entry=0x7fffffffd590, r=0, Ne=1000, r_lik=r_lik@entry=0x7fffffffd588) at lib/likelihood.c:179
179                 edge = last_parent_edge[edges->child[edge]];
(gdb) print edge
$1 = 36521
(gdb) print edges->child[edge]
$2 = 246130
(gdb) print last_parent_edge[246130]
Cannot access memory at address 0x2853448

so, looks like we've run over in the last_parent_edge array. But hmm, there's only 36K nodes, so perhaps the child[edge] above was unset?

(gdb) print nodes->num_rows
$3 = 36522
$4 = 36522
(gdb) print edges->num_rows
$5 = 36521
(gdb) print edge
$6 = 36521

Ah, no, we've overrun the edge table. How did edge become equal to the number of rows in the table? Because we increment edge here and assume that it corresponds to a valid index.

So, we just need to check this, I think. It probably can't happen if we have binary trees.

@JereKoskela - should we fix this corner case and return a value, or should we check for binary trees in the input and return an error when the trees don't conform to our expectations?

@JereKoskela
Copy link
Member

JereKoskela commented Oct 12, 2022

I think it would be cleanest to return an error when the tree isn't binary, since any number we return won't really mean much. Is there an easy, existing way to check that @jeromekelleher, or do you just mean adding a check to when we increment edge?

@jeromekelleher
Copy link
Member

Something like this in the Python code is probably best:

for tree in ts.trees():
     if np.any(tree.num_children_array > 2):
          raise ValueError(...)

Although should probably check for unary nodes etc as well.

@JereKoskela
Copy link
Member

@jeromekelleher, any idea why python3 -m pytest would insist that Tree object has no attribute num_children_array?

Running git submodule update --init --recursive, then python -m pip install -r requirements/development.txt and make in the project root all complete without errors, and I'm parallel with the latest version of the main msprime repo.

@benjeffery
Copy link
Member

@JereKoskela what version of tskit are you using? Pip won't upgrade it unless asked and the dev requirements might need a pin.
python -m pip install --upgrade tskit

@JereKoskela
Copy link
Member

Upgrading tskit did the job. I had naively assumed reinstalling requirements would do that. Thanks @benjeffery!

@benjeffery
Copy link
Member

Sounds like we need a pin. Will raise an issue!

@jeromekelleher
Copy link
Member

Reopening this until we update the CHANGELOG

@jeromekelleher jeromekelleher modified the milestones: 1.2.2, 1.2.1 Oct 20, 2022
@GertjanBisschop
Copy link
Member

Small detail. The implemented check will also flag a polytomy when a tree has multiple roots. tree.num_children_array[-1] tracks the number of children of the virtual root. Will post a pr with a more informative error message in that particular case.

@jeromekelleher
Copy link
Member

Did we update the changelong in #2232?

@GertjanBisschop
Copy link
Member

The changelog update for the bugfix was done in #2227.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants