dispersion value meaning #36
Yeah, sorry, this is quite obscure -- it should be better in the output and in the documentation. D_LP is dispersion; from memory, I think D_Cols is average column confidence. So the MSAs in this ensemble have very low dispersion and therefore probably have very few errors -- assuming they are actually from muscle5 and not MAFFT :-)
Thanks for the answer. Why does it matter how the alignment was done? I mean, the concept should remain the same regardless of muscle or MAFFT. Sorry, I am a bit confused.
For dispersion of an ensemble to correlate with error in the alignment, you need a large set of MSAs which are known to have state-of-the-art accuracy on structural benchmarks and which vary as much as possible according to model parameters (gap penalties and substitution matrix) and guide tree, where the variations do not compromise average benchmark accuracy. AFAIK muscle5 is the only algorithm that can do this.
Correction -- "MSAs which are known to have state-of-the-art accuracy on structural benchmarks" is wrong; of course, in practice we don't have structural benchmarks to compare with. What I mean is that the algorithm used to generate each MSA has high accuracy. The trick is to get many alternative alignments of the *same* sequences such that they are all equally plausible. If they vary, then this is necessarily due to errors, and the number of errors in a typical alignment from the ensemble can therefore be estimated.
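To make the idea above concrete, here is a rough sketch of one way to quantify disagreement within an ensemble. This is an illustration only, not muscle5's actual D_LP formula: it treats each MSA as a set of asserted residue pairs and averages a Jaccard-style distance over all pairs of MSAs.

```python
from itertools import combinations

def residue_pairs(msa):
    """Collect the cross-sequence residue pairs an MSA asserts.

    msa is a list of equal-length gapped strings. Each pair is
    (seq_i, residue_index_i, seq_j, residue_index_j) for two residues
    placed in the same column.
    """
    # Residue index per column for each sequence (None on gaps).
    idx = []
    for row in msa:
        count, cols = -1, []
        for ch in row:
            if ch != "-":
                count += 1
                cols.append(count)
            else:
                cols.append(None)
        idx.append(cols)
    pairs = set()
    for col in range(len(msa[0])):
        for i, j in combinations(range(len(msa)), 2):
            a, b = idx[i][col], idx[j][col]
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs

def dispersion(ensemble):
    """Mean pairwise disagreement between MSAs in the ensemble.

    The distance between two MSAs is the fraction of asserted residue
    pairs not shared by both. Identical alignments give 0; the more
    the alignments differ, the closer the value gets to 1.
    """
    dists = []
    for a, b in combinations(ensemble, 2):
        pa, pb = residue_pairs(a), residue_pairs(b)
        union = pa | pb
        dists.append(1 - len(pa & pb) / len(union) if union else 0.0)
    return sum(dists) / len(dists) if dists else 0.0
```

With many replicates, a low value indicates that the alternative alignments largely agree -- the property the thread attributes to a low D_LP.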
This makes sense. So it means that if I generate multiple alignments using MAFFT (say, either by running it repeatedly or by using different parameters, like changing the number of iterations), I can use the dispersion method in muscle to estimate errors in the MAFFT alignments, and if the dispersion is not too high, then I can just use one of the MSAs from MAFFT. Am I thinking right?
Wrong, sorry. First, muscle5 is more accurate than MAFFT on average. Second, you don't know how to vary MAFFT parameters in the right way -- you need to maximize the parameter changes to vary the alignment without degrading average accuracy on benchmarks. This is very hard to figure out. Even if you do figure it out, you are only varying a small number of parameters, such as gap penalties and the number of iterations. This is not sufficient to get enough variation in the ensemble. If you see less variation in MAFFT, this may be explained by the fact that you are varying far fewer parameters. In muscle5, roughly 200 parameters are varied, including all substitution matrix values and gap penalties, plus the guide tree. It would be a major research project to figure out how to vary a comparable number of parameters in MAFFT, and even if you succeeded, the accuracy with MAFFT defaults is lower than muscle5's, so the a priori assumption is that the default muscle5 alignment is better than anything in your MAFFT ensemble.
Hi Robert,
As noted at the start of this issue, I need to do a better job with the output and documentation here. To answer your question, you breathe biology into this exercise by using the MSA for something. Alignments are a means to an end -- what is the end here? Let's say you want to measure the squrgle coefficient of the spike ACE binding domain. Then you do this: calculate the squrgle coefficient S from every MSA; this gives you the mean and standard deviation of S, which tells you the uncertainty in S due to alignment errors.
I'm not familiar with the "squrgle coefficient S". I couldn't find a definition online. Could you provide a definition or links to a definition? Thanks.
:-)) It means whatever you want it to mean -- it was a nonsense word serving as a placeholder for whatever it is you want to infer from an alignment. |
Thanks for the clarification. |
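The procedure described a few comments up can be sketched in a few lines. The statistic used here (gap fraction) is a made-up stand-in for the "squrgle" placeholder -- substitute whatever quantity you actually want to infer from the alignment:

```python
import statistics

def gap_fraction(msa):
    # Placeholder statistic standing in for "S" -- in practice, replace
    # this with whatever you actually want to infer from the alignment.
    total = sum(len(row) for row in msa)
    gaps = sum(row.count("-") for row in msa)
    return gaps / total

def ensemble_uncertainty(ensemble, stat=gap_fraction):
    # Compute S on every MSA in the ensemble; the spread across
    # replicates estimates the uncertainty in S due to alignment errors.
    values = [stat(msa) for msa in ensemble]
    return statistics.mean(values), statistics.stdev(values)
```

Each MSA is a list of gapped strings; the mean is your estimate of S and the standard deviation is its alignment-error uncertainty.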
Thank you so much for the detailed explanation. It makes sense now.
The dispersion calculation gives me two values; I don't know which one is mentioned on the website (where < 0.05 means the alignment is likely fine):
@disperse file=ensemble_mafft.efa D_LP=0.005485 D_Cols=1
What is D_LP and D_Cols?
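For reference, the `@disperse` line shown above can be split into its key=value fields with a small helper. This is a sketch that assumes the token layout in the output line quoted here; muscle5 itself does not ship such a parser:

```python
def parse_disperse(line):
    # Split an "@disperse key=value key=value ..." line into a dict,
    # skipping the leading "@disperse" tag.
    fields = {}
    for token in line.split()[1:]:
        key, _, value = token.partition("=")
        fields[key] = value
    return fields
```

Applied to the line above, this yields the D_LP and D_Cols values as strings, ready to be compared against a threshold such as 0.05.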