Assignment not making sense #31

Closed
shump2 opened this issue Nov 3, 2021 · 11 comments

@shump2

shump2 commented Nov 3, 2021

Hi,
I have created a database (nt.ACC.taxonomy, ~82 million records) from an updated version of the entire NCBI nt database.
Some sequence queries fail to assign to the correct species despite the blastn hits being correct.
For example, when I search one sequence against the nt database with the custom taxonomy file, blastn returns 11 hits, but the final assignment after the alignment step is a very strange one: COVID. I am not sure where to begin in resolving this. Are there any known issues?
Any assistance here would be much appreciated.

Blastn file
haplo_51122 KU317715.1 99.361 313 2 0 1 313 309 621 1.57e-157 568 307 plus 621 313
haplo_51122 MW124469.1 99.361 313 2 0 1 313 343 655 1.57e-157 568 307 plus 655 313
haplo_51122 MZ157283.1 99.361 313 2 0 1 313 384 696 1.57e-157 568 307 plus 15460 313
haplo_51122 KU317714.1 99.042 313 3 0 1 313 309 621 7.32e-156 562 304 plus 622 313
haplo_51122 KU317712.1 99.042 313 3 0 1 313 285 597 7.32e-156 562 304 plus 601 313
haplo_51122 KM245630.1 99.042 313 3 0 1 313 384 696 7.32e-156 562 304 plus 15461 313
haplo_51122 DQ525222.1 98.026 304 6 0 1 304 335 638 7.43e-146 529 286 plus 638 313
haplo_51122 KU317713.1 98.893 271 3 0 1 271 279 549 1.63e-132 484 262 plus 549 313
haplo_51122 MW124560.1 90.096 313 31 0 1 313 346 658 3.63e-109 407 220 plus 658 313
haplo_51122 MW124542.1 90.096 313 31 0 1 313 346 658 3.63e-109 407 220 plus 658 313
haplo_51122 DQ525226.1 90.099 303 30 0 1 303 335 637 2.83e-105 394 213 plus 638 313

Result
seq_01 superkingdom:Viruses;96.0;phylum:Pisuviricota;96.0;class:Pisoniviricetes;96.0;order:Nidovirales;96.0;family:Coronaviridae;96.0;genus:Betacoronavirus;96.0;species:Severe acute respiratory syndrome-related coronavirus;96.0;

@qunfengdong
Owner

Thanks for the report. To be clear, are you using the entire nt database? BLCA is designed to work with marker genes rather than a generic database. You will need to use a particular family of marker genes as the database, so that the database entries are of similar length. Otherwise, the subsequent multiple sequence alignment is not reliable. If you are using the entire nt database, the multiple sequence alignment may be the problem.

@qunfengdong
Owner

If you can make your database and query available for us to download, we can take a look.

@YJulyXing
Collaborator

YJulyXing commented Nov 4, 2021 via email

@shump2
Author

shump2 commented Nov 4, 2021

Yes, it's correct. I changed it as I copied it in.
I will send on the database soon. It's large.
The blast searches are quite accurate, and I guess the final assignment will be based on the bit score ranks. I'm not sure how the COVID sequence appears!

@YJulyXing
Collaborator

That's weird. Could you also send the sequence for this one entry? I'll take a look.

@shump2
Author

shump2 commented Nov 4, 2021

Hi @YJulyXing @qunfengdong
See here the link to the data: https://drive.google.com/drive/folders/1-WGhTt9wesYZbpY80I-A44QZkCtmtxte?usp=sharing

  1. nt.ACC.taxonomy file of the entire nt database ( wget https://ftp.ncbi.nlm.nih.gov/blast/db/nt.{00..47}.tar.gz)
  2. test (the sequence - should be Strombus gigas)
  3. test.blastn (the resulting blastn file generated)
  4. test.blca.out (clustalo)
  5. test1.blca.out (muscle)

Thanks for taking the time to look into this issue.

@YJulyXing
Collaborator

YJulyXing commented Nov 4, 2021 via email

@shump2
Author

shump2 commented Nov 4, 2021

OK, I understand that, but these are all the same species, albeit some are whole mitogenomes. Why would they return a different species from the blast hits?

@qunfengdong
Owner

@shump2 @YJulyXing We should have made this clearer in our documentation. For BLCA, we expect the database sequences to be of similar length: that is, the sequences come from one gene family. For example, all 16S gene sequences have more or less similar lengths (not identical, but similar). If the database sequences differ dramatically in length, the multiple sequence alignment may become a problem. In your case, if some of the sequences are whole mitogenomes while others are a particular gene within the mitogenome, they are of very different lengths, which may prevent a reliable multiple sequence alignment.
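
One way to spot this kind of length mismatch is to look at the subject lengths in the BLAST output itself. The sketch below is only an illustration and is not part of BLCA. It assumes that the 15th whitespace-separated column of the tabular output posted above is the subject sequence length; the function name `flag_length_outliers` and the `max_ratio` threshold are hypothetical.

```python
from statistics import median

# Assumption: the 15th column (0-based index 14) of the tabular blastn
# output is the subject sequence length. Adjust if your -outfmt differs.
SLEN_COL = 14


def flag_length_outliers(blast_lines, max_ratio=2.0):
    """Return (accession, subject_length) pairs whose subject length is more
    than max_ratio times larger or smaller than the median over all hits."""
    hits = []
    for line in blast_lines:
        fields = line.split()
        if len(fields) <= SLEN_COL:
            continue  # skip blank or truncated lines
        hits.append((fields[1], int(fields[SLEN_COL])))
    if not hits:
        return []
    mid = median(length for _, length in hits)
    return [
        (acc, length)
        for acc, length in hits
        if length > mid * max_ratio or length < mid / max_ratio
    ]


# Five of the rows from the blastn file in this issue.
example = [
    "haplo_51122 KU317715.1 99.361 313 2 0 1 313 309 621 1.57e-157 568 307 plus 621 313",
    "haplo_51122 MW124469.1 99.361 313 2 0 1 313 343 655 1.57e-157 568 307 plus 655 313",
    "haplo_51122 MZ157283.1 99.361 313 2 0 1 313 384 696 1.57e-157 568 307 plus 15460 313",
    "haplo_51122 KU317714.1 99.042 313 3 0 1 313 309 621 7.32e-156 562 304 plus 622 313",
    "haplo_51122 KM245630.1 99.042 313 3 0 1 313 384 696 7.32e-156 562 304 plus 15461 313",
]

print(flag_length_outliers(example))
# [('MZ157283.1', 15460), ('KM245630.1', 15461)]
```

On these rows the two entries of about 15 kb are flagged, which would be consistent with whole mitogenomes sitting among gene records of roughly 600 bp.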

@YJulyXing
Collaborator

YJulyXing commented Nov 4, 2021 via email

@qunfengdong
Owner

@shump2 Do you mind providing your blast database? @YJulyXing needs to check whether your database entries really have the correct taxa IDs in the NCBI taxa database you used.
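
A quick local check along these lines is to pull the taxonomy lines for just the accessions that blastn reported. This is a minimal sketch, assuming the taxonomy file has one record per line in the form `accession<TAB>lineage`. Check that against your own nt.ACC.taxonomy before relying on it. The helper `lookup_taxonomy` and the example accessions are hypothetical placeholders, not real records.

```python
def lookup_taxonomy(taxonomy_lines, accessions):
    """Stream over taxonomy lines and return {accession: lineage} for the
    requested accessions only, stopping early once all have been found.

    Assumption: each line is 'accession<TAB>lineage'."""
    wanted = set(accessions)
    found = {}
    for line in taxonomy_lines:
        acc, sep, lineage = line.rstrip("\n").partition("\t")
        if sep and acc in wanted:
            found[acc] = lineage
            if len(found) == len(wanted):
                break
    return found


# Placeholder records for illustration only (not real accessions or lineages).
demo_lines = [
    "ACC_A.1\tspecies:Example species one;genus:Examplegenus;\n",
    "ACC_B.1\tspecies:Example species two;genus:Examplegenus;\n",
]
hits = ["ACC_B.1", "ACC_MISSING.1"]

found = lookup_taxonomy(demo_lines, hits)
missing = sorted(set(hits) - set(found))
print(found)    # lineage recorded for each hit that is present
print(missing)  # hits with no taxonomy record at all

# With a real file, pass the open handle so the ~82 million lines are streamed:
# with open("nt.ACC.taxonomy") as fh:
#     found = lookup_taxonomy(fh, hit_accessions)
```

If a hit is missing, or its lineage is clearly not the organism the accession belongs to, the taxonomy file rather than the alignment step would be the first thing to look at.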

@shump2 shump2 closed this as completed Jan 19, 2022