Retrieving blast sequences doesn't work well with numbers #88

Closed
wwood opened this issue Jun 14, 2012 · 5 comments

@wwood
Contributor

wwood commented Jun 14, 2012

See also a user's discussion of what appears to be the same bug
https://groups.google.com/forum/?fromgroups#!topic/sequenceserver/zUSIujAnHRI

ben@ben:/tmp$ cat numbers.fa
>378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
>186233
AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGG
ben@ben:/tmp$ cat characters.fa
>characters378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
>characters186233
AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGG

then creating databases

ben@ben:/tmp$ makeblastdb -dbtype nucl -parse_seqids -in characters.fa 


Building a new DB, current time: 06/14/2012 09:54:17
New DB name:   characters.fa
New DB title:  characters.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2 sequences in 0.0204051 seconds.
ben@ben:/tmp$ makeblastdb -dbtype nucl -parse_seqids -in numbers.fa


Building a new DB, current time: 06/14/2012 09:54:22
New DB name:   numbers.fa
New DB title:  numbers.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2 sequences in 0.000734091 seconds.

The problem is shown below. numbers.fa doesn't work, while characters.fa does:

ben@ben:/tmp$ blastdbcmd -entry 378462 -db numbers.fa
Error: 378462: OID not found
BLAST query/options error: Entry not found in BLAST database
ben@ben:/tmp$ blastdbcmd -entry 'lcl|378462' -db numbers.fa
>lcl|378462 
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG

whereas with characters

ben@ben:/tmp$ blastdbcmd -entry characters378462 -db characters.fa
>lcl|characters378462 
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
ben@ben:/tmp$ blastdbcmd -entry 'lcl|characters378462' -db characters.fa
>lcl|characters378462 
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG

Currently, when sequenceserver queries the BLAST database for a sequence, it uses only the text between the first and second "|" characters of the id:

id  = cid.include?('|') ? cid.split('|')[1] : cid.split('|')[0]

I can't see/remember any reason for parsing this - why not just throw blastdbcmd the whole first word (all of cid)?
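
To make the difference concrete, here is a minimal Ruby sketch comparing the quoted parsing with the "whole first word" alternative. The method names `id_between_pipes` and `id_whole_word` are made up for illustration and are not part of sequenceserver; the sketch only shows which string each approach would hand to blastdbcmd.

```ruby
# Parsing as quoted above: when the id contains a pipe, keep only the
# field between the first and second '|'.
def id_between_pipes(cid)
  cid.include?('|') ? cid.split('|')[1] : cid.split('|')[0]
end

# Hypothetical alternative: keep the whole first whitespace-delimited word.
def id_whole_word(cid)
  cid.split.first
end

cid = 'lcl|378462'
puts id_between_pipes(cid) # the bare number, which blastdbcmd rejected above
puts id_whole_word(cid)    # the full 'lcl|...' form, which blastdbcmd accepted above
```

For ids without a pipe, such as characters378462, both methods return the id unchanged.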

Do you think this is a bug in blastdbcmd that should be reported?

wwood pushed a commit to wwood/sequenceserver that referenced this issue Jun 14, 2012
…34, leave them be

Attempting to parse them only introduces errors, (at least)
specifically when the identifier is a number after lcl. Fixes wurmlab#88

Signed-off-by: Ben J. Woodcroft <donttrustben near gmail.com>
@yeban
Collaborator

yeban commented Jun 15, 2012

Impressive debugging @wwood :). I have no clue either why SS reads between the first and second pipe (|). I remember asking myself the same question the last time I touched the code for some simple refactoring. The logic was already in place; I didn't bother changing it lest I should break sequence retrieval.

@yeban yeban closed this as completed in 6d83a08 Dec 6, 2014
@yeban
Collaborator

yeban commented Dec 6, 2014

While our hack works, this should be fixed upstream in the long term. @vivekiitkgp Please could you report this issue to NCBI?

@raivivek
Member

raivivek commented Dec 6, 2014

@yeban Okay. I will.

raivivek added a commit to raivivek/sequenceserver that referenced this issue Dec 7, 2014
Signed-off-by: Vivek Rai <vivekraiiitkgp@gmail.com>
raivivek added a commit to raivivek/sequenceserver that referenced this issue Dec 7, 2014
Sequences with only numeric FASTA ids are not properly retrieved using
blastdbcmd. While our hack fixes this, it is to be reported upstream.

Signed-off-by: Vivek Rai <vivekraiiitkgp@gmail.com>
raivivek added a commit to raivivek/sequenceserver that referenced this issue Dec 8, 2014
Sequences with only numeric FASTA ids are not properly retrieved using
blastdbcmd. While our hack fixes this, it is to be reported upstream.

Signed-off-by: Vivek Rai <vivekraiiitkgp@gmail.com>
yeban pushed a commit to yeban/sequenceserver that referenced this issue Dec 8, 2014
Sequences with only numeric FASTA ids are not properly retrieved using
blastdbcmd. While our hack fixes this, it is to be reported upstream.

Signed-off-by: Vivek Rai <vivekraiiitkgp@gmail.com>
yeban added a commit to yeban/sequenceserver that referenced this issue Dec 9, 2014
…rget_only.

reopen wurmlab#88

Signed-off-by: Anurag Priyam <anurag08priyam@gmail.com>
@yeban
Collaborator

yeban commented Dec 9, 2014

Our hack fails. Reverting the change. @vivekiitkgp has reported this to NCBI. Will wait for the issue to be fixed upstream.

@yeban
Collaborator

yeban commented Jan 5, 2017

In my understanding, unless it's fixed upstream there's no way around it except to disallow numeric ids (implemented via --doctor), or to create a FASTA index ourselves or via another tool. I am going with the former.
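
As a rough illustration of the first option, the sketch below scans FASTA text and reports ids made up solely of digits. It is not the actual --doctor implementation; the method name `numeric_fasta_ids` and the rule "first word of the defline, all digits" are assumptions made for this example.

```ruby
# Hypothetical helper: return the ids in FASTA text that consist only of digits.
# The id is taken to be the first word after '>' on each defline.
def numeric_fasta_ids(fasta_text)
  fasta_text.each_line
            .select { |line| line.start_with?('>') }
            .map    { |line| line[1..-1].split.first.to_s }
            .select { |id| id.match?(/\A\d+\z/) }
end

# Sample input mirroring numbers.fa from the original report.
numbers_fa = <<~FASTA
  >378462
  GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
  >186233
  AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGG
FASTA

offending = numeric_fasta_ids(numbers_fa)
puts "numeric ids found: #{offending.join(', ')}" unless offending.empty?
```

A check like this could run before makeblastdb so that a database with numeric-only ids is flagged rather than built.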

@yeban yeban closed this as completed Jan 5, 2017
@yeban yeban added this to the 1.1 milestone Jan 5, 2017
@yeban yeban added vendor-issue and removed bug labels Jan 5, 2017