BUSTED-PH with large number of sequences #39

rsiani · 2023-06-09T07:38:30Z

Hello there,
I have previously used BUSTED-PH successfully on ~500 seqs. Now I am trying to improve the analysis for a publication, and I managed to get up to ~1500 seqs. However, after running for several hours, the program seems to die quietly without producing results. Unfortunately, I didn't think to redirect stdout to a file, so I can't tell at what point it's crashing...

Anyway, I just wanted to ask whether, at a theoretical level, there is any issue with this many sequences, and whether I should consider an alternative method. Or could I solve it simply by throwing more resources at it? (I am currently limited to 14 threads.)

spond · 2023-06-09T12:46:46Z

Dear @rsiani,

You should be able to run BUSTED-PH on ~1,500 sequences in a reasonable amount of time (~1 day or so, would be my guess).

How long is your alignment and which version of HyPhy are you using?

I would suggest adding ENV="TOLERATE_NUMERICAL_ERRORS=1" to the command line invocation of the program, because larger alignments can sometimes trigger internal numerical consistency checks, which are treated as errors by default.

One other possibility is that the program is running out of memory, but that should trigger sooner in the execution.

Best,
Sergei

rsiani · 2023-06-09T13:13:10Z

Dear @spond,
Thanks for the fast reply. I managed to get one run to complete in a few hours, but only with SRV turned off. Now I am trying again with SRV on. The alignment is around 400 bp and I am using the latest version, 2.5.51.

As soon as I have results from that as well I will update you!

Best,
Rob

spond · 2023-06-09T14:23:36Z

Dear @rsiani,

Generally, you get a ~3-5x performance hit with SRV on. Here it could be worse than that because of the additional memory overhead. Each branch requires storage for 9 transition matrices (default settings, with 3x3 rate classes), which comes to about 800 MB for a tree of 1500 sequences, so there is a lot of memory movement, and that slows things down considerably.
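
For anyone who wants to check the ~800 MB figure, here is a rough back-of-the-envelope sketch. It assumes things not stated above: 61 sense codons (universal genetic code), dense 61x61 matrices stored as 8-byte doubles, and an unrooted binary tree with 2N - 3 branches. The function name is made up for illustration, and HyPhy's actual memory layout may well differ.

```python
def transition_matrix_bytes(n_sequences, n_codons=61,
                            matrices_per_branch=9, bytes_per_entry=8):
    """Rough estimate of memory used by per-branch transition matrices.

    Assumptions (not taken from HyPhy itself): dense n_codons x n_codons
    matrices of 8-byte doubles, and an unrooted binary tree.
    """
    branches = 2 * n_sequences - 3                    # unrooted binary tree
    per_matrix = n_codons * n_codons * bytes_per_entry
    return branches * matrices_per_branch * per_matrix


total = transition_matrix_bytes(1500)
print(f"{total / 1e6:.0f} MB")  # about 803 MB
```

Under these assumptions the estimate scales linearly with the number of sequences and with the number of rate-class combinations, so going from 3x3 to, say, 4x4 classes would push it to roughly 1.4 GB.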

I'd be curious to learn how long it takes. Make sure to specify --starting-points K where K ~ 10 to get a good starting guess for the optimization.
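
Since the earlier run died without leaving any output, it may help to launch the job through a small wrapper that keeps a log. The following is only a sketch: `build_command` and `run_logged` are hypothetical helper names, and the base command (the path to the BUSTED-PH batch file plus your alignment and tree arguments) is left for you to fill in. The only options it adds are the two mentioned in this thread, `--starting-points` and `ENV=TOLERATE_NUMERICAL_ERRORS=1`.

```python
import subprocess


def build_command(base_command, starting_points=10,
                  tolerate_numerical_errors=True):
    """Append the options discussed in this thread to a base command list."""
    command = list(base_command)
    command += ["--starting-points", str(starting_points)]
    if tolerate_numerical_errors:
        # No shell quoting is needed when arguments are passed as a list.
        command.append("ENV=TOLERATE_NUMERICAL_ERRORS=1")
    return command


def run_logged(command, log_path):
    """Run the command, writing stdout and stderr to log_path.

    Returns the process exit code, so a silent crash still leaves a trace.
    """
    with open(log_path, "w") as log:
        completed = subprocess.run(command, stdout=log,
                                   stderr=subprocess.STDOUT)
    return completed.returncode


# Example usage (the base command here is a placeholder, not a real path):
# cmd = build_command(["hyphy", "BUSTED-PH.bf", "<your analysis arguments>"])
# exit_code = run_logged(cmd, "busted_ph.log")
```

A nonzero exit code together with the tail of the log file should show whether the run stopped on a numerical check or on something else, such as running out of memory.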

Also, for this many branches, you could consider increasing the number of rate classes (of course this will slow the performance down).

Which CPUs do you have?

Best,
Sergei
