Retrieving blast sequences doesn't work well with numbers #88
See also a user's discussion of what appears to be the same bug
ben@ben:/tmp$ cat numbers.fa
>378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
>186233
AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGG
ben@ben:/tmp$ cat characters.fa
>characters378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
>characters186233
AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGG
Then creating the databases:
ben@ben:/tmp$ makeblastdb -dbtype nucl -parse_seqids -in characters.fa
Building a new DB, current time: 06/14/2012 09:54:17
New DB name: characters.fa
New DB title: characters.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2 sequences in 0.0204051 seconds.
ben@ben:/tmp$ makeblastdb -dbtype nucl -parse_seqids -in numbers.fa
Building a new DB, current time: 06/14/2012 09:54:22
New DB name: numbers.fa
New DB title: numbers.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2 sequences in 0.000734091 seconds.
The problem is shown below.
ben@ben:/tmp$ blastdbcmd -entry 378462 -db numbers.fa
Error: 378462: OID not found
BLAST query/options error: Entry not found in BLAST database
ben@ben:/tmp$ blastdbcmd -entry 'lcl|378462' -db numbers.fa
>lcl|378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
whereas with identifiers that contain characters, both forms work:
ben@ben:/tmp$ blastdbcmd -entry characters378462 -db characters.fa
>lcl|characters378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
ben@ben:/tmp$ blastdbcmd -entry 'lcl|characters378462' -db characters.fa
>lcl|characters378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
Currently, when sequenceserver queries the blast database for a sequence, it uses only what is between the first and second "|" characters of the identifier:
id = cid.include?('|') ? cid.split('|')[1] : cid.split('|')[0]
I can't see/remember any reason for parsing this - why not just throw blastdbcmd the whole first word (all of
Do you think this is a bug in blastdbcmd that should be reported?
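To make the difference concrete, here is a minimal Ruby sketch of the two approaches. It assumes the existing line indexes the split result with [1] (when a "|" is present) and [0] (otherwise), which matches the "between the first and second pipe" behaviour described above. The method names id_between_pipes and whole_first_word are made up for illustration and are not part of sequenceserver.

```ruby
# Behaviour described in this issue (assumed indices [1] and [0]):
# keep only the text between the first and second "|".
def id_between_pipes(cid)
  cid.include?('|') ? cid.split('|')[1] : cid.split('|')[0]
end

# Hypothetical alternative: hand blastdbcmd the whole first word of the
# defline, untouched, so a prefix such as "lcl|" is preserved.
def whole_first_word(defline)
  defline.sub(/\A>/, '').split.first
end

puts id_between_pipes('lcl|378462')            # prints 378462 (prefix lost)
puts id_between_pipes('lcl|characters378462')  # prints characters378462
puts whole_first_word('>lcl|378462')           # prints lcl|378462
```

With the first approach the numeric identifier reaches blastdbcmd as a bare 378462, which is exactly the form that fails with "OID not found" in the session above. The second keeps lcl|378462, the form that works for both databases.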
…34, leave them be
Attempting to parse them only introduces errors, (at least) specifically when the identifier is a number after lcl.
Fixes wurmlab#88
Signed-off-by: Ben J. Woodcroft <donttrustben near gmail.com>
Impressive debugging @wwood :). I have no clue either why SS reads b/w the first and second pipe (|). I remember asking myself the same question the last time I touched the code for some simple refactoring. The logic was already in place; I didn't bother changing it lest I should break sequence retrieval.