dispersion value meaning #36

Closed
Adarsh931 opened this issue Jun 13, 2022 · 12 comments

@Adarsh931

The dispersion calculation gives me two values, and I don't know which one is the value mentioned on the website (where <0.05 means the alignment is likely fine):
@disperse file=ensemble_mafft.efa D_LP=0.005485 D_Cols=1

What is D_LP and D_Cols?

@rcedgar
Owner

rcedgar commented Jun 14, 2022

Yeah, sorry, this is quite obscure -- it should be clearer in the output and in the documentation. D_LP is dispersion; from memory, I think D_Cols is average column confidence. So the MSAs in this ensemble have very low dispersion and therefore probably have very few errors -- assuming they are actually from muscle5 and not mafft :-)
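
For anyone scripting around this output, here is a minimal sketch that pulls the two values out of a `@disperse` line like the one quoted above. The function name `parse_disperse_line` is hypothetical, and the sketch assumes the line is a tag followed by space-separated `key=value` fields with no spaces inside the file name. The 0.05 cutoff is the one quoted from the website in the question.

```python
def parse_disperse_line(line: str) -> dict:
    """Parse a line such as '@disperse file=x.efa D_LP=0.005485 D_Cols=1'.

    Hypothetical helper: assumes space-separated key=value fields.
    """
    tag, *fields = line.split()
    if tag != "@disperse":
        raise ValueError(f"not a @disperse line: {line!r}")
    record = {}
    for field in fields:
        key, _, value = field.partition("=")
        record[key] = value
    # D_LP is the dispersion; D_Cols is (from memory, per the reply above)
    # the average column confidence.
    record["D_LP"] = float(record["D_LP"])
    record["D_Cols"] = float(record["D_Cols"])
    return record


result = parse_disperse_line(
    "@disperse file=ensemble_mafft.efa D_LP=0.005485 D_Cols=1"
)
# Compare the dispersion (D_LP) with the 0.05 cutoff quoted in the question.
print(result["file"], result["D_LP"], result["D_LP"] < 0.05)
```

The value to compare against the threshold is D_LP, not D_Cols.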

@Adarsh931
Author

Adarsh931 commented Jun 14, 2022 via email

@rcedgar
Owner

rcedgar commented Jun 14, 2022

For dispersion of an ensemble to correlate with error in the alignment, you need a large set of MSAs which are known to have state-of-the-art accuracy on structural benchmarks and which vary as much as possible according to model parameters (gap penalties and substitution matrix) and guide tree, where the variations do not compromise average benchmark accuracy. AFAIK muscle5 is the only algorithm that can do this.

@rcedgar
Owner

rcedgar commented Jun 14, 2022

Correction -- "MSAs which are known to have state-of-the-art accuracy on structural benchmarks" is wrong, of course in practice we don't have structural benchmarks to compare with. What I mean is, the algorithm used to generate each MSA has high accuracy. The trick is to get many alternative alignments of the same sequences such that they are all equally plausible. If they vary, then this is necessarily due to errors and the number of errors in a typical alignment from the ensemble can therefore be estimated.
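
To make the idea of "variation between equally plausible alignments" concrete, here is a rough sketch of one way to score how much two alignments of the same sequences disagree. This is an illustrative measure only and is not claimed to be the formula muscle5 uses for D_LP. It assumes each MSA is held in memory as a dict mapping sequence name to a gapped string with `-` as the gap character; `aligned_pairs` and `disagreement` are hypothetical names.

```python
from itertools import combinations


def aligned_pairs(msa: dict) -> set:
    """Return the set of letter pairs that the MSA places in the same column.

    Each letter is identified by (sequence name, ungapped position).
    """
    names = sorted(msa)
    next_pos = {name: 0 for name in names}  # next ungapped index per sequence
    pairs = set()
    ncols = len(msa[names[0]])
    for col in range(ncols):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, next_pos[name]))
                next_pos[name] += 1
        # Every pair of letters sharing this column is an aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def disagreement(msa_a: dict, msa_b: dict) -> float:
    """Fraction of aligned letter pairs found in only one of the two MSAs.

    Illustrative measure, not necessarily muscle5's dispersion.
    """
    a, b = aligned_pairs(msa_a), aligned_pairs(msa_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


msa1 = {"s1": "AC-GT", "s2": "ACGGT"}
msa2 = {"s1": "ACG-T", "s2": "ACGGT"}
print(disagreement(msa1, msa1), disagreement(msa1, msa2))
```

Identical alignments score 0, and the score grows as the two alignments place more letters in different columns. Averaging such a score over many pairs of MSAs in an ensemble would give one number summarizing the variation.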

@Adarsh931
Author

Adarsh931 commented Jun 14, 2022 via email

@rcedgar
Owner

rcedgar commented Jun 14, 2022

Wrong, sorry. First, muscle5 is more accurate than MAFFT on average. Second, you don't know how to vary MAFFT parameters in the right way -- you need to maximize the parameter changes to vary the alignment without degrading average accuracy on benchmarks. This is very hard to figure out. Even if you do figure it out, you are only varying a small number of parameters such as gap penalties and the number of iterations. This is not sufficient to get enough variation in the ensemble. If you see less variation in MAFFT, this may be because you are varying far fewer parameters. In muscle5, roughly 200 parameters are varied, including all substitution matrix values and gap penalties, plus the guide tree. It would be a major research project to figure out how to vary a comparable number of parameters in MAFFT, and even if you succeeded, MAFFT with its defaults is less accurate than muscle5, so the a priori assumption is that the default muscle5 alignment is better than anything in your MAFFT ensemble.

@greg-harhay

Hi Robert,
I am requesting clarification about the dispersion metrics. I have aligned the spike gene from 192 coronavirus genomes with muscle5 using the -align and -diversified options to create an ensemble of 100 alignments. Using the -disperse option to measure the dispersion in the ensemble yields D_LP=5.066e-06 D_Cols=0.0002444. I tried to dig through the code, but I'm not a C coder and need a little help interpreting these results. Since these are apparently measures of dispersion, I presume low values in both numbers are desirable, preferably 0, but maybe small numbers are OK. Any suggested thresholds? Any advice about how I could go about breathing some biology into these numbers? Thanks.

@rcedgar
Owner

rcedgar commented Jul 13, 2022

As noted at the start of this issue, I need to do a better job with the output and documentation here. To answer your question, you breathe biology into this exercise by using the MSA for something. Alignments are a means to an end, what is the end here? Let's say you want to measure the squrgle coefficient of the spike ACE binding domain. Then you do this: calculate the squrgle coefficient S from every MSA and this gives you the mean and standard deviation of S. This tells you the uncertainty in S due to alignment errors.
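
A minimal sketch of that recipe, under stated assumptions: the ensemble is already loaded as a list of MSAs (each a dict of sequence name to gapped string), and `gap_fraction` is just a hypothetical stand-in for whatever quantity S you actually care about. The names `summarize` and `gap_fraction` and the toy ensemble are illustrative, not part of muscle.

```python
from statistics import mean, stdev


def gap_fraction(msa: dict) -> float:
    """Placeholder statistic S: fraction of alignment cells that are gaps."""
    total = sum(len(row) for row in msa.values())
    gaps = sum(row.count("-") for row in msa.values())
    return gaps / total


def summarize(ensemble: list, statistic) -> tuple:
    """Compute S on every MSA and return (mean, sample standard deviation)."""
    values = [statistic(msa) for msa in ensemble]
    return mean(values), stdev(values)  # stdev needs at least two MSAs


# Toy ensemble: three alternative alignments of the same two sequences.
ensemble = [
    {"s1": "AC-G", "s2": "ACGG"},
    {"s1": "A-CG", "s2": "ACGG"},
    {"s1": "AC-G-", "s2": "ACG-G"},
]
s_mean, s_std = summarize(ensemble, gap_fraction)
print(s_mean, s_std)
```

The standard deviation is the part that reflects uncertainty in S due to alignment differences across the ensemble; swap `gap_fraction` for the real downstream calculation.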

@greg-harhay

I'm not familiar with the "squrgle coefficient S". I couldn't find a definition online. Could you provide a definition or links to one? Thanks.

@rcedgar
Owner

rcedgar commented Jul 14, 2022

:-)) It means whatever you want it to mean -- it was a nonsense word serving as a placeholder for whatever it is you want to infer from an alignment.

@greg-harhay

Thanks for the clarification.

@Adarsh931
Author

Adarsh931 commented Oct 11, 2022 via email

@rcedgar rcedgar closed this as completed Oct 11, 2022