BUSTED-PH with large number of sequences #39

Open
rsiani opened this issue Jun 9, 2023 · 3 comments



rsiani commented Jun 9, 2023

Hello there,
I have previously used BUSTED-PH successfully on ~500 seqs. Now I am trying to improve the analysis for a publication, and I managed to get up to ~1500 seqs. However, after running for several hours, the program seems to die quietly without producing results. Unfortunately, I didn't take the precaution of redirecting stdout to a file, so I can't tell at what point it crashes...

Anyway, I just wanted to ask whether, at a theoretical level, there is any issue with this many sequences and whether I should consider an alternative method, or whether I could solve it simply by throwing more resources at it (I am currently limited to 14 threads).

Member

spond commented Jun 9, 2023

Dear @rsiani,

You should be able to run BUSTED-PH on ~1,500 sequences in a reasonable amount of time (~1 day or so, would be my guess).

How long is your alignment and which version of HyPhy are you using?

I would suggest adding ENV="TOLERATE_NUMERICAL_ERRORS=1" to the command line invocation of the program, because larger alignments can sometimes trigger internal numerical consistency checks, which by default are treated as errors.
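
As a rough illustration of the suggestion above, here is a minimal Python sketch that appends the ENV argument to an existing HyPhy command and captures stdout and stderr in a log file, so a silent crash leaves a trace. The helper names (`build_command`, `run_logged`) and the log file name are hypothetical; the base command is whatever invocation you already use.

```python
import subprocess
from pathlib import Path


def build_command(base_cmd, tolerate_numerical_errors=True):
    """Return a copy of base_cmd with the ENV argument appended."""
    cmd = list(base_cmd)
    if tolerate_numerical_errors:
        # The shell turns ENV="TOLERATE_NUMERICAL_ERRORS=1" into this single
        # argument (the quotes are removed before the program sees it).
        cmd.append("ENV=TOLERATE_NUMERICAL_ERRORS=1")
    return cmd


def run_logged(cmd, log_path):
    """Run cmd, write stdout and stderr to log_path, return the exit code."""
    with Path(log_path).open("w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Usage would look something like `run_logged(build_command(my_hyphy_command), "busted_ph.log")`, where `my_hyphy_command` is your existing invocation as a list of arguments. A non-zero exit code together with the tail of the log should show where the run stopped.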

One other possibility is that the program is running out of memory, but that should trigger sooner in the execution.

Best,
Sergei

Author

rsiani commented Jun 9, 2023

Dear @spond,
thanks for the fast reply. I managed to get one run to complete in a few hours, but only with SRV turned off. Now I am trying again with SRV on. The alignment is around 400 bp and I am using the latest version, 2.5.51.

As soon as I have results from that run as well, I will update you!

Best,
Rob

Member

spond commented Jun 9, 2023

Dear @rsiani,

Generally, you get a ~3-5x performance hit with SRV on. Here it could be worse than that because of the additional memory overhead. Each branch requires storing 9 transition matrices (default settings, with 3x3 rate classes), which comes to about 800 MB for a tree of 1500 sequences, so there's a lot of memory movement, which slows things down considerably.
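
The ~800 MB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes an unrooted, strictly bifurcating tree (2N - 3 branches), 61 x 61 codon transition matrices (sense codons under the universal genetic code), and 8-byte double-precision entries. These are assumptions for illustration, not details confirmed in this thread, and HyPhy's actual memory layout may differ.

```python
def transition_matrix_bytes(n_sequences, n_matrices=9, n_states=61, bytes_per_entry=8):
    """Estimate memory for per-branch transition matrices, in bytes."""
    # Assumption: unrooted binary tree, so 2N - 3 branches for N sequences.
    n_branches = 2 * n_sequences - 3
    # Each matrix is n_states x n_states entries of bytes_per_entry bytes.
    return n_branches * n_matrices * n_states * n_states * bytes_per_entry


print(f"{transition_matrix_bytes(1500) / 1e6:.0f} MB")  # prints "803 MB"
```

With the same assumptions, ~500 sequences gives roughly a third of that, which may be part of why the smaller analysis ran without trouble.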

I'd be curious to learn how long it takes. Make sure to specify --starting-points K where K ~ 10 to get a good starting guess for the optimization.

Also, for this many branches, you could consider increasing the number of rate classes (of course this will slow the performance down).

Which CPUs do you have?

Best,
Sergei
