"likelihood function" error when running GARD in hyphy v.2.3.14 #854

Closed
lkelly3 opened this issue Oct 29, 2018 · 8 comments
@lkelly3

lkelly3 commented Oct 29, 2018

Hello

I have been running GARD with hyphy v.2.3.14 for a number of datasets, with the following command:
mpirun -x LD_LIBRARY_PATH -np ${NSLOTS} HYPHYMPI ${HYPHY_TEMPLATES_DIR}/GARD.bf $INPUTFILE 012345 "General Discrete" 3 $OUTPUTSTR > $LOGNAME

Most datasets run without any error, but a few fail after apparently running normally for a while.

This is an example of what I see in the log file:

GENERATION 3 with 2 breakpoints (~2% converged)
Breakpoints    c-AIC       Delta c-AIC    [BP 1]    [BP 2]
0              26644.60
1              24363.32    2281.279       608
2              23792.32    571.002        609       1972
GA has considered 63/ 861328 (1375 over all runs) unique models
Total run time 0 hrs 1 mins 35 seconds
Throughput 14.47 models/second
Allocated time remaining 999 hrs 58 mins 25 seconds (approx. 52103888.15789473 more models.)

GENERATION 4 with 2 breakpoints (~2% converged)
Breakpoints    c-AIC       Delta c-AIC    [BP 1]    [BP 2]
0              26644.60
1              24363.32    2281.279       608
2              23778.11    585.217        609       1955
GA has considered 126/ 861328 (1438 over all runs) unique models
Total run time 0 hrs 1 mins 53 seconds
Throughput 12.73 models/second
Allocated time remaining 999 hrs 58 mins 7 seconds (approx. 45810951.38053098 more models.)
Error:

Master node received an error:HyPhy killed by signal 15

Function call stack
1 : MPI Receive from -1 storing actual sender node into fromNode and storing the string result into result_String
Standard input redirect:
Empty Associative List-------
2 : fromNode=ReceiveJobs(0,0)
Standard input redirect:
Empty Associative List-------
3 : CleanUpMPI(0)
Standard input redirect:
Empty Associative List-------
4 : ExecuteAFile from file ibfPath using basepath /share/apps/centos7/hyphy/gcc/7.1.0/openmpi/3.0.0-gcc/2.3.14/lib/hyphy/TemplateBatchFiles/.
Standard input redirect:
Empty Associative List-------

In the errors file I get the following:

HYPHYMPI terminated.
Error:
Internal error, dumping the offending likelihood function to /tmp/hyphy.dump
ERROR: [_LikelihoodFunction::LocateTheBump (index 7) current value -12090.46287788592 (parameter = 0), best value -12090.46287788592 (parameter = 0)); delta = 5.456968210637569e-12 ]

Parameter name givenTree2.FRAX03_FRAEX38873_V2_000290300_1_R0.t

Function call stack
1 : Optimize storing into, res, the following likelihood function:lf ;


MPI_ABORT was invoked on rank 61 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

Function call stack
1 : Optimize storing into, res, the following likelihood function:lf ;

[The "HYPHYMPI terminated. / Error: HyPhy killed by signal 15" message above, with or without the Optimize call stack, repeats many more times at this point in the errors file.]

[nxv11:29389] 63 more processes have sent help message help-mpi-api.txt / mpi-abort
[nxv11:29389] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[warn] Epoll MOD(1) on fd 31 failed. Old events were 6; read change was 0 (none); write change was 2 (del): Bad file descriptor
[warn] Epoll MOD(4) on fd 31 failed. Old events were 6; read change was 2 (del); write change was 0 (none): Bad file descriptor

All of the datasets that fail show this same type of error. However, when I resubmitted some of them with exactly the same settings, they ran to completion without any problem. So it seems that the error might be occurring at random, rather than being specific to certain datasets.

I would be grateful for any suggestions on how to fix this issue. Please let me know if you need me to send more info or example input/output files.

Many thanks
Laura
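
Because resubmitting a failed job with identical settings sometimes succeeds, a batch driver could retry automatically while the underlying cause is investigated. The following is only a minimal sketch of that workaround, not a fix: `run_with_retries` and its parameters are hypothetical names, and it assumes a failed run exits with a non-zero status (the MPI_ABORT message suggests this, but the thread does not confirm it).

```python
import subprocess
import sys
import time
from pathlib import Path


def run_with_retries(cmd, log_path, max_attempts=3, pause_seconds=0.0):
    """Run cmd with stdout redirected to log_path; retry on a non-zero exit.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. (Hypothetical helper, not part of HyPhy.)
    """
    for attempt in range(1, max_attempts + 1):
        # Overwrite the log on each attempt so it reflects the latest run.
        with Path(log_path).open("w") as log:
            result = subprocess.run(
                cmd, stdout=log, stderr=subprocess.PIPE, text=True
            )
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed (exit {result.returncode})",
              file=sys.stderr)
        time.sleep(pause_seconds)
    return None


# Hypothetical usage with the command from the post above; the slot count,
# template directory, input file and output name are placeholders:
# cmd = ["mpirun", "-x", "LD_LIBRARY_PATH", "-np", "<NSLOTS>", "HYPHYMPI",
#        "<HYPHY_TEMPLATES_DIR>/GARD.bf", "<INPUTFILE>", "012345",
#        "General Discrete", "3", "<OUTPUTSTR>"]
# run_with_retries(cmd, "<LOGNAME>", max_attempts=3)
```

Retrying only hides a stochastic failure; it says nothing about why the optimizer check trips in the first place.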

@spond
Member

spond commented Oct 29, 2018

Dear @lkelly3,

Please try the beta branch. I've been tinkering with the internal checks for consistency, and the beta branch is less stringent. This will be incorporated in the next release.

Best,
Sergei
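
For context on why a less stringent check could help in the first report: there the current and best values print identically and the delta is 5.456968210637569e-12. A quick check, sketched below with values copied from that error message, compares the delta with the spacing of double-precision numbers (one ULP) at that magnitude. This is only arithmetic on the logged numbers and assumes Python 3.9+ for `math.ulp`; it does not describe how HyPhy's internal check is implemented.

```python
import math

# Values copied from the first LocateTheBump error message in this thread.
log_likelihood = -12090.46287788592
reported_delta = 5.456968210637569e-12

# Spacing between adjacent double-precision numbers at this magnitude.
ulp = math.ulp(abs(log_likelihood))

# Express the reported delta as a multiple of that spacing.
delta_in_ulps = reported_delta / ulp

print(f"ulp = {ulp:.3e}, delta = {delta_in_ulps:.2f} ULPs")
```

The delta works out to roughly three ULPs, i.e. it is at the level of floating-point rounding for a log-likelihood of this size.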

@lkelly3
Author

lkelly3 commented Oct 31, 2018

Hi Sergei

Thanks for your suggestion. I have now tried the beta version of hyphy, testing it with 8 datasets that I previously tried three times, unsuccessfully, with v.2.3.14.

I ran the beta version with the same command/parameter settings as above. 5 of the datasets successfully ran to completion without error; the other 3 quit with the same type of error as before, e.g.:
HYPHYMPI terminated.
Error:
Internal error dumping the offending likelihood function to /tmp/hyphy.dump

ERROR: [_LikelihoodFunction::LocateTheBump (index 36) current value -1010917.389203595 (parameter = 0), best value -1010848.074485539 (parameter = 0)); delta = 69.31471805600449 ]

Do you have any suggestions as to what might be causing this type of error? It seems that if I keep resubmitting the failed datasets they will probably run successfully eventually, because each time I tried this (with v.2.3.14) a few of the previously failed datasets completed OK. However, I ultimately have thousands of datasets I would like to analyse with GARD, so I would like to understand what is causing these issues.

Thanks for your help
Laura
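
When triaging many failed runs, it may help to pull the numbers out of each LocateTheBump message and see how large the discrepancy is. The sketch below parses the error line quoted in this comment; `parse_bump_error` and the regular expression are hypothetical helpers written against the two messages shown in this thread, so other HyPhy versions may format the line differently.

```python
import math
import re

# Error line copied from the beta-branch run quoted above.
line = ("ERROR: [_LikelihoodFunction::LocateTheBump (index 36) "
        "current value -1010917.389203595 (parameter = 0), "
        "best value -1010848.074485539 (parameter = 0)); "
        "delta = 69.31471805600449 ]")

# A signed decimal number with an optional exponent.
NUM = r"(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
PATTERN = re.compile(
    r"LocateTheBump \(index (\d+)\) current value " + NUM
    + r" .*?best value " + NUM
    + r" .*?delta = " + NUM
)


def parse_bump_error(text):
    """Return (index, current, best, delta) from an error line, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return (int(m.group(1)), float(m.group(2)),
            float(m.group(3)), float(m.group(4)))


index, current, best, delta = parse_bump_error(line)

# Size of the discrepancy relative to double-precision spacing at this scale.
ulps = delta / math.ulp(abs(best))
print(f"index={index} best-current={best - current:.6f} "
      f"delta={delta:.6f} (~{ulps:.1e} ULPs)")
```

Here the delta matches best minus current and is on the order of 10^11 ULPs, so this failure is far larger than rounding noise, unlike the first report.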

@lkelly3
Author

lkelly3 commented Nov 15, 2018

Hello again

Just to let you know that I still haven't been able to resolve this problem. I have managed to successfully run the 3 datasets that previously failed multiple times (see my previous comment) via the datamonkey server; all ran to completion successfully (with evidence of recombination found), so I don't think the problem lies with these datasets per se. The only difference, to my knowledge, between the analysis I ran with hyphy and on the datamonkey server is that I specified the GTR model with hyphy.

As I mentioned before, I have a lot of other datasets I would like to analyse, so I would really like to resolve this problem if possible. Any further help with this would be gratefully received!

Many thanks
Laura

@spond
Member

spond commented Nov 15, 2018

Dear Laura,

Would you mind e-mailing one of the datasets that is causing the issue to me? I'd like to reproduce the error and offer a more robust fix.

Best,
Sergei

@lkelly3
Author

lkelly3 commented Nov 15, 2018 via email

@spond
Member

spond commented Nov 15, 2018

Dear @lkelly3,

GitHub strips attachments that don't have one of the approved extensions (e.g. .txt). Can you please e-mail it to me directly at the spond at temple edu address?

Best,
Sergei

@rdvelazquez
Contributor

Dear @spond,

I've also experienced stochastic errors when using GARD with HyPhy 2.3.14. The file linked below runs without issue about a quarter of the time and fails with the same errors as listed by @lkelly3 the other 75 percent of the time.

Example GARD input file: https://github.com/veg/tools-iuc/blob/master/tools/hyphy/test-data/gard-in1.fa

Datamonkey is currently using HyPhy version 2.3.11 of GARD because of issues with the newer versions. The example file runs fine on datamonkey.

This just came up for me today as I was installing GARD on our galaxy instance.

We've also noted that there are issues with GARD and the aocc compilers (I'm unsure of the specifics).

It sounds like the issues are likely addressed in the newer version but just thought I'd provide some additional input here.

Best,
Ryan

@spond
Copy link
Member

spond commented Feb 19, 2019

Dear @rdvelazquez,

I think the best way to resolve this issue is to port GARD.bf to run with libv3 and HyPhy 2.4. Let's keep this ticket open until it is done.

Best,
Sergei
