Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI merges FASTA headers when querying with multiple same peptides #12

Merged

Conversation

bmesuere
Copy link
Member

This branch fixes both #12 and #17.

When querying the Unipept API through the CLI with a FASTA file containing multiple of the same peptides, the resulting output contains 1 single FASTA header instead of the header for these peptides. See the following example:

$ cat data/IITHPNFNGNTLDNDIMLIK 
>a|1
IITHPNFNGNTLDNDIMLIK
>b|2
IITHPNFNGNTLDNDIMLIK
$ unipept pept2prot -i data/IITHPNFNGNTLDNDIMLIK
fasta_header,peptide,uniprot_id,taxon_id
>b|2,IITHPNFNGNTLDNDIMLIK,P00761,9823
>b|2,IITHPNFNGNTLDNDIMLIK,C5IWV5,9823
>b|2,IITHPNFNGNTLDNDIMLIK,F1SRS2,9823
>b|2,IITHPNFNGNTLDNDIMLIK,P00761,9823
>b|2,IITHPNFNGNTLDNDIMLIK,C5IWV5,9823
>b|2,IITHPNFNGNTLDNDIMLIK,F1SRS2,9823
$ unipept pept2lca -i data/IITHPNFNGNTLDNDIMLIK 
fasta_header,peptide,taxon_id,taxon_name,taxon_rank
>b|2,IITHPNFNGNTLDNDIMLIK,9823,Sus scrofa,species
>b|2,IITHPNFNGNTLDNDIMLIK,9823,Sus scrofa,species

Is this expected behaviour?

Original pull request by @silox on Tue Dec 16 2014 at 00:28.
Merged by @silox on Wed Apr 08 2015 at 18:17.

@bmesuere
Copy link
Member Author

Conclusion from the meeting: No, this is not expected behaviour and should be altered so the output as follows:

$ cat data/IITHPNFNGNTLDNDIMLIK 
>a|1
IITHPNFNGNTLDNDIMLIK
>b|2
IITHPNFNGNTLDNDIMLIK
$ unipept pept2prot -i data/IITHPNFNGNTLDNDIMLIK
fasta_header,peptide,uniprot_id,taxon_id
>a|1,IITHPNFNGNTLDNDIMLIK,P00761,9823
>a|1,IITHPNFNGNTLDNDIMLIK,C5IWV5,9823
>a|1,IITHPNFNGNTLDNDIMLIK,F1SRS2,9823
>b|2,IITHPNFNGNTLDNDIMLIK,P00761,9823
>b|2,IITHPNFNGNTLDNDIMLIK,C5IWV5,9823
>b|2,IITHPNFNGNTLDNDIMLIK,F1SRS2,9823
$ unipept pept2lca -i data/IITHPNFNGNTLDNDIMLIK 
fasta_header,peptide,taxon_id,taxon_name,taxon_rank
>a|1,IITHPNFNGNTLDNDIMLIK,9823,Sus scrofa,species
>b|2,IITHPNFNGNTLDNDIMLIK,9823,Sus scrofa,species

Original comment by @silox on Wed Dec 10 2014 at 17:53.

@bmesuere
Copy link
Member Author

The fasta mapper now maps a peptide to a list of fasta headers and not to a single fasta headers. (This was the cause of the overwritement of fasta headers as multiple peptides got mapped to the same header).

When the results get written to stdout, we calculate what header we need to take from the array in the faster mapper by dividing the peptides already written and dividing that by the total amount of occurences of that peptide in the batch, dividing that by the amount of FASTA headers for that peptide.

Original comment by @silox on Tue Dec 16 2014 at 00:32.

@bmesuere
Copy link
Member Author

I'm not sure this is the way to go after reviewing this as all the headers and their counts get saved in memory.

Original comment by @silox on Sun Mar 29 2015 at 21:15.

@bmesuere
Copy link
Member Author

Since we run this on huge files, we must make sure we don't store things in memory that aren't needed anymore. #13 is related.

Original comment by @bmesuere on Sun Mar 29 2015 at 21:24.

@bmesuere
Copy link
Member Author

This branch fixes both #12 and #17 and is (imo) ready to merge. There should be almost no slowdown on account of the << and .shift operations as these are O(1) operations.

Original comment by @silox on Wed Apr 08 2015 at 17:26.

@bmesuere
Copy link
Member Author

Reviewed and seems ok. You can merge this.

Original comment by @bmesuere on Wed Apr 08 2015 at 18:16.

@bmesuere bmesuere mentioned this pull request Aug 14, 2019
@bmesuere bmesuere added this to the 1.0 milestone Aug 14, 2019
@bmesuere bmesuere merged commit 1f59f98 into migrated/pr-12/master Aug 14, 2019
@bmesuere bmesuere deleted the migrated/pr-12/fix/12/deduplicate-fasta-headers branch August 14, 2019 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants