dispersion value meaning #36
Yeah, sorry, this is quite obscure -- it should be better in the output and in the documentation. D_LP is dispersion; from memory, I think D_Cols is average column confidence. So the MSAs in this ensemble have very low dispersion and therefore probably have very few errors -- assuming they are actually from muscle5 and not MAFFT :-)
Thanks for the answer. Why does it matter how the alignment was done? I mean, the concept should remain the same regardless of muscle or MAFFT. Sorry, I am a bit confused.
For dispersion of an ensemble to correlate with error in the alignment, you need a large set of MSAs which are known to have state-of-the-art accuracy on structural benchmarks and which vary as much as possible according to model parameters (gap penalties and substitution matrix) and guide tree, where the variations do not compromise average benchmark accuracy. AFAIK muscle5 is the only algorithm that can do this.
Correction -- "MSAs which are known to have state-of-the-art accuracy on structural benchmarks" is wrong; of course, in practice we don't have structural benchmarks to compare with. What I mean is that the algorithm used to generate each MSA has high accuracy. The trick is to get many alternative alignments of the *same* sequences such that they are all equally plausible. If they vary, then this is necessarily due to errors, and the number of errors in a typical alignment from the ensemble can therefore be estimated.
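To make the idea above concrete, here is a rough sketch of one way to quantify disagreement within an ensemble. This is an illustration only, not muscle5's actual D_LP formula: it treats each MSA as a set of asserted residue pairs and averages a Jaccard-style distance over all pairs of MSAs.

```python
from itertools import combinations

def residue_pairs(msa):
    """Collect the cross-sequence residue pairs an MSA asserts.

    msa is a list of equal-length gapped strings. Each pair is
    (seq_i, residue_index_i, seq_j, residue_index_j) for two residues
    placed in the same column.
    """
    # Residue index per column for each sequence (None on gaps).
    idx = []
    for row in msa:
        count, cols = -1, []
        for ch in row:
            if ch != "-":
                count += 1
                cols.append(count)
            else:
                cols.append(None)
        idx.append(cols)
    pairs = set()
    for col in range(len(msa[0])):
        for i, j in combinations(range(len(msa)), 2):
            a, b = idx[i][col], idx[j][col]
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs

def dispersion(ensemble):
    """Mean pairwise disagreement between MSAs in the ensemble.

    The distance between two MSAs is the fraction of asserted residue
    pairs not shared by both. Identical alignments give 0; the more
    the alignments differ, the closer the value gets to 1.
    """
    dists = []
    for a, b in combinations(ensemble, 2):
        pa, pb = residue_pairs(a), residue_pairs(b)
        union = pa | pb
        dists.append(1 - len(pa & pb) / len(union) if union else 0.0)
    return sum(dists) / len(dists) if dists else 0.0
```

With many replicates, a low value indicates that the alternative alignments largely agree -- the property the thread attributes to a low D_LP.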
This makes sense. So it means that if I generate multiple alignments using MAFFT (say, either by running it repeatedly or by using different parameters, like changing the number of iterations), I can use the dispersion method in muscle to estimate errors in the MAFFT alignments, and if the dispersion is not too high, then I can just use one of the MSAs from MAFFT. Am I thinking right?
Wrong, sorry. First, muscle5 is more accurate than MAFFT on average. Second, you don't know how to vary MAFFT parameters in the right way -- you need to maximize the parameter changes to vary the alignment without degrading average accuracy on benchmarks. This is very hard to figure out. Even if you do figure it out, you are only varying a small number of parameters, such as gap penalties and the number of iterations. This is not sufficient to get enough variation in the ensemble. If you see less variation in MAFFT, this may be explained by the fact that you are varying far fewer parameters. In muscle5, roughly 200 parameters are varied, including all substitution matrix values and gap penalties, plus the guide tree. It would be a major research project to figure out how to vary a comparable number of parameters in MAFFT, and even if you succeeded, the accuracy with MAFFT defaults is lower than muscle5's, so the a priori assumption is that the default muscle5 alignment is better than anything in your MAFFT ensemble.
Hi Robert,
As noted at the start of this issue, I need to do a better job with the output and documentation here. To answer your question, you breathe biology into this exercise by using the MSA for something. Alignments are a means to an end -- what is the end here? Let's say you want to measure the squrgle coefficient of the spike ACE binding domain. Then you do this: calculate the squrgle coefficient S from every MSA; this gives you the mean and standard deviation of S, which tells you the uncertainty in S due to alignment errors.
I'm not familiar with the "squrgle coefficient S". I couldn't find a definition online. Could you provide a definition or links to a definition? Thanks.
:-)) It means whatever you want it to mean -- it was a nonsense word serving as a placeholder for whatever it is you want to infer from an alignment. |
Thanks for the clarification. |
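The procedure described a few comments up can be sketched in a few lines. The statistic used here (gap fraction) is a made-up stand-in for the "squrgle" placeholder -- substitute whatever quantity you actually want to infer from the alignment:

```python
import statistics

def gap_fraction(msa):
    # Placeholder statistic standing in for "S" -- in practice, replace
    # this with whatever you actually want to infer from the alignment.
    total = sum(len(row) for row in msa)
    gaps = sum(row.count("-") for row in msa)
    return gaps / total

def ensemble_uncertainty(ensemble, stat=gap_fraction):
    # Compute S on every MSA in the ensemble; the spread across
    # replicates estimates the uncertainty in S due to alignment errors.
    values = [stat(msa) for msa in ensemble]
    return statistics.mean(values), statistics.stdev(values)
```

Each MSA is a list of gapped strings; the mean is your estimate of S and the standard deviation is its alignment-error uncertainty.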
Thank you so much for the detailed explanation. It makes sense now.
The dispersion calculation gives me two values; I don't know which one is mentioned on the website (where < 0.05 means the alignment is likely fine):
@disperse file=ensemble_mafft.efa D_LP=0.005485 D_Cols=1
What is D_LP and D_Cols?
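For reference, the `@disperse` line shown above can be split into its key=value fields with a small helper. This is a sketch that assumes the token layout in the output line quoted here; muscle5 itself does not ship such a parser:

```python
def parse_disperse(line):
    # Split an "@disperse key=value key=value ..." line into a dict,
    # skipping the leading "@disperse" tag.
    fields = {}
    for token in line.split()[1:]:
        key, _, value = token.partition("=")
        fields[key] = value
    return fields
```

Applied to the line above, this yields the D_LP and D_Cols values as strings, ready to be compared against a threshold such as 0.05.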