Add average amino acid identity (AAI) #16

Open
ninjatacoshell opened this issue Mar 11, 2016 · 5 comments

@ninjatacoshell

An alternative to ANI for more distantly related genomes is average amino acid identity (AAI; see Konstantinidis and Tiedje 2005 and Rodriguez-R and Konstantinidis 2014). Instead of DNA FASTA files, the user would need to supply protein FASTA files.

This web tool only lets you calculate AAI for two genomes at a time.

This web tool lets you calculate AAI for multiple genomes, but only the ones that are stored in its database (i.e. no user-generated genomes). And the database doesn't appear to have been updated since around 2012.

This web tool lets you calculate AAI for up to 10 genomes at a time, but you have to run them through RAST first, which is inconvenient.

So being able to run AAI on your own machine, like pyani already does for ANI and tetranucleotide regression, would be very useful.

@widdowquinn
Owner

I like the idea, but I'm inclined to leave this to a later version of pyani that integrates with pyrbbh.

For AAI we need to define equivalent proteins for comparison. That can be done in several ways, and I'd like to leave the choice of method to the user. I'm not sure that there's a data standard for specifying such equivalence for pairs of proteins, or for whole groups. I'll have to put some time into looking around for one (suggestions welcome) or devising one that works here.

widdowquinn self-assigned this on Mar 30, 2016
widdowquinn added the enhancement label (something we'd like pyani to do that it doesn't already) on Mar 30, 2016
@ninjatacoshell
Author

In the original paper (Konstantinidis and Tiedje 2005), they calculated AAI by searching all protein-coding sequences from the query genome against the reference genome using TBLASTN, with cut-offs of at least 30% identity and at least 70% coverage. They called this one-way BLAST. They then took the top-matching segment and ran the reverse search with BLASTX (presumably with the same cut-offs), which they called two-way BLAST. In their analysis, two-way BLAST was slightly more reliable.
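For illustration, the one-way calculation described above could be sketched roughly as follows. This is a minimal sketch and not the authors' implementation: the `Hit` record, the `one_way_aai` function, and the definition of coverage as alignment length over query length are assumptions made for the example, and the hits are assumed to have already been parsed from the search output.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Hit:
    """One search hit (hypothetical record, not a pyani or BLAST type)."""
    query: str           # query protein ID
    subject: str         # matched reference sequence ID
    identity: float      # percent identity over the alignment
    aln_length: int      # alignment length in residues
    query_length: int    # full length of the query protein
    bitscore: float


def one_way_aai(hits, min_identity=30.0, min_coverage=70.0):
    """Mean identity of the best passing hit per query protein.

    Coverage is assumed to be alignment length / query length;
    the thresholds follow the cut-offs quoted in the comment above.
    Returns None when no hit passes the filters.
    """
    best = {}
    for hit in hits:
        coverage = 100.0 * hit.aln_length / hit.query_length
        if hit.identity < min_identity or coverage < min_coverage:
            continue  # discard hits below either cut-off
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit  # keep the top-scoring hit per query
    if not best:
        return None
    return mean(h.identity for h in best.values())
```

Ranking hits by bit score is one plausible choice; ranking by E-value or identity would also be defensible and may give slightly different values.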

How would BBH compare to their two-way BLAST in terms of computation time? And would it be invulnerable to inconsistencies in the annotation between different genomes the way their two-way BLAST is?

@widdowquinn
Owner

The method from Konstantinidis and Tiedje is one of several ways to define 'equivalent proteins/CDS'. It happens to be one that doesn't require a prior protein annotation on the 'reference', but it does require one on the query.

The two-way BLAST search is likely to be more reliable than the one-way analysis for the same reasons RBH/BBH matches are more reliable than one-way BLAST matches, in general (as described in, e.g. https://github.com/widdowquinn/Teaching-Dundee-BS32010/blob/master/workshop_2/06-RBBH.ipynb and https://github.com/widdowquinn/Teaching-Dundee-BS32010/blob/master/lecture/2016-03-21_BS32010_Pritchard.pdf).
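As a rough illustration of the RBH/BBH idea, the sketch below keeps a pair only when each protein is the other's best hit. It assumes the hits from each search direction are available as `(query, subject, bitscore)` tuples; the function names `best_hits` and `reciprocal_best_hits` are hypothetical and not part of pyani or pyrbbh.

```python
def best_hits(hits):
    """Map each query ID to the subject ID of its highest-scoring hit.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    Ties keep the first hit seen.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    `forward` holds hits of genome A proteins against genome B,
    `reverse` holds hits of genome B proteins against genome A.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)
```

A one-way best hit such as a paralogue matching an already-claimed protein is dropped here, which is the reason reciprocal matches tend to be more reliable.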

In terms of differences in computation time, I don't know off-hand how it would work out. I'd expect reciprocal BLASTP of a query protein complement against a protein database built from the reference protein complement to be faster than BLASTX of the query against an untranslated genome, but I wouldn't be upset if that weren't true ;) As for inconsistencies in annotation: given that the K&T method already relies on one protein annotation, I wouldn't consider it invulnerable to "annotation inconsistency". You could try two-way TBLASTX if you want to ignore annotation altogether (but although you're then invulnerable to annotation inconsistency, you also do not gain any of annotation's many advantages…)
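A reciprocal BLASTP run of the kind mentioned above could be set up along these lines. This is only a sketch: it builds NCBI BLAST+ command lines (`makeblastdb` and `blastp`) without running them, and the function names, the `.faa` inputs, and the output file naming are assumptions for the example rather than anything pyani does.

```python
from pathlib import Path

# Tabular output with the columns needed for identity/coverage filtering
OUTFMT = "6 qseqid sseqid pident length qlen slen bitscore"


def blastp_commands(query_faa, reference_faa, outdir="."):
    """Build commands for one search direction (query proteins vs reference proteins)."""
    out = Path(outdir)
    query_stem = Path(query_faa).stem
    ref_stem = Path(reference_faa).stem
    db = out / f"{ref_stem}_db"                      # hypothetical database name
    hits = out / f"{query_stem}_vs_{ref_stem}.tsv"   # hypothetical output name
    return [
        ["makeblastdb", "-in", str(reference_faa), "-dbtype", "prot",
         "-out", str(db)],
        ["blastp", "-query", str(query_faa), "-db", str(db),
         "-outfmt", OUTFMT, "-out", str(hits)],
    ]


def reciprocal_blastp_commands(genome_a_faa, genome_b_faa, outdir="."):
    """Build commands for both directions: A vs B, then B vs A."""
    return (blastp_commands(genome_a_faa, genome_b_faa, outdir)
            + blastp_commands(genome_b_faa, genome_a_faa, outdir))
```

Each command list could then be passed to `subprocess.run(cmd, check=True)`. Timing the two directions this way against a BLASTX run would be one way to answer the computation-time question empirically.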

@ninjatacoshell
Author

I don't know if it will help, but they've put their Ruby script for calculating AAI on GitHub: https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/aai.rb. Perhaps it (or part of it) could be ported to Python?

@sbridel

sbridel commented Mar 29, 2017

Suggestion: https://github.com/dparks1134/CompareM, which uses DIAMOND and Prodigal to find equivalent proteins.

An AAI feature would be very nice to have in pyani.

widdowquinn added this to the 0.3.1 milestone on May 28, 2020
widdowquinn added the method label (the issue relates to how results are calculated) on May 29, 2020