Assignment not making sense #31

shump2 · 2021-11-03T15:49:25Z

Hi,
I have created a db (nt.ACC.taxonomy ~82 million records) using an updated version of the entire nt NCBI database.
Some sequence queries fail to assign to the correct species despite the blastn hits being correct.
For example, when I search one sequence against the nt database with the custom taxonomy file, blastn returns 11 hits, but the final assignment is a very strange one: a SARS-related coronavirus (COVID). I am not sure where to begin resolving this. Are there any known issues?
Any assistance here would be much appreciated.

Blastn file
haplo_51122 KU317715.1 99.361 313 2 0 1 313 309 621 1.57e-157 568 307 plus 621 313
haplo_51122 MW124469.1 99.361 313 2 0 1 313 343 655 1.57e-157 568 307 plus 655 313
haplo_51122 MZ157283.1 99.361 313 2 0 1 313 384 696 1.57e-157 568 307 plus 15460 313
haplo_51122 KU317714.1 99.042 313 3 0 1 313 309 621 7.32e-156 562 304 plus 622 313
haplo_51122 KU317712.1 99.042 313 3 0 1 313 285 597 7.32e-156 562 304 plus 601 313
haplo_51122 KM245630.1 99.042 313 3 0 1 313 384 696 7.32e-156 562 304 plus 15461 313
haplo_51122 DQ525222.1 98.026 304 6 0 1 304 335 638 7.43e-146 529 286 plus 638 313
haplo_51122 KU317713.1 98.893 271 3 0 1 271 279 549 1.63e-132 484 262 plus 549 313
haplo_51122 MW124560.1 90.096 313 31 0 1 313 346 658 3.63e-109 407 220 plus 658 313
haplo_51122 MW124542.1 90.096 313 31 0 1 313 346 658 3.63e-109 407 220 plus 658 313
haplo_51122 DQ525226.1 90.099 303 30 0 1 303 335 637 2.83e-105 394 213 plus 638 313
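For anyone reading the rows above, here is a minimal Python sketch that parses them into named fields. The column order is an assumption inferred from the 16 fields per row (the 12 standard tabular fields followed by score, sstrand, slen and qlen); `Hit` and `parse_hit` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Subset of one blastn tabular row (column order is an assumption)."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    sstart: int
    send: int
    bitscore: float
    slen: int
    qlen: int


def parse_hit(line: str) -> Hit:
    """Parse one whitespace-separated row with 16 fields."""
    f = line.split()
    if len(f) != 16:
        raise ValueError(f"expected 16 fields, got {len(f)}")
    return Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[8]), int(f[9]),
               float(f[11]), int(f[14]), int(f[15]))


# Two rows copied from the blastn file above.
rows = """\
haplo_51122 KU317715.1 99.361 313 2 0 1 313 309 621 1.57e-157 568 307 plus 621 313
haplo_51122 MZ157283.1 99.361 313 2 0 1 313 384 696 1.57e-157 568 307 plus 15460 313
"""
hits = [parse_hit(line) for line in rows.splitlines() if line.strip()]
for h in hits:
    print(h.sseqid, h.pident, h.slen)
```

If that column reading is right, the second hit is a 15,460 bp subject, while the first is only 621 bp.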

Result
seq_01 superkingdom:Viruses;96.0;phylum:Pisuviricota;96.0;class:Pisoniviricetes;96.0;order:Nidovirales;96.0;family:Coronaviridae;96.0;genus:Betacoronavirus;96.0;species:Severe acute respiratory syndrome-related coronavirus;96.0;

qunfengdong · 2021-11-03T15:55:59Z

Thanks for the report. To be clear, are you using the entire nt database? BLCA is designed to work with marker genes rather than a generic database. You will need to use a particular family of marker genes as the database, so that the database entries are of similar length. Otherwise, the subsequent multiple sequence alignment is not reliable, which may be the problem if you are using the entire nt database.

qunfengdong · 2021-11-03T15:56:48Z

if you can make your database and query available for us to download, we can take a look.

YJulyXing · 2021-11-04T05:09:26Z

Hi, could you check whether the blastn file you showed is correct? In your blastn output the query ID is "haplo_51122", but in the result file the ID is "seq_01". Do they refer to the same sequence?


shump2 · 2021-11-04T08:02:36Z

Yes, it's correct; I changed the ID when I copied it in.
I will send the database soon. It's large.
The blast searches are quite accurate, and I assume the final assignment is based on the bit-score ranking. I'm not sure how the COVID sequence appears!

YJulyXing · 2021-11-04T13:35:43Z

That's weird. Could you also send the sequence for this one entry? I'll take a look.

shump2 · 2021-11-04T16:27:28Z

Hi @YJulyXing @qunfengdong
See here the link to the data: https://drive.google.com/drive/folders/1-WGhTt9wesYZbpY80I-A44QZkCtmtxte?usp=sharing

1. nt.ACC.taxonomy file of the entire nt database (wget https://ftp.ncbi.nlm.nih.gov/blast/db/nt.{00..47}.tar.gz)
2. test (the sequence - should be Strombus gigas)
3. test.blastn (the resulting blastn file generated)
4. test.blca.out (clustalo)
5. test1.blca.out (muscle)

Thanks for taking the time to look into this issue.

YJulyXing · 2021-11-04T19:14:17Z

Also, we don't think that using the entire nt database would work. You need to extract a family of marker genes. If the gene sits inside a whole genome, it will disrupt the multiple sequence alignment.
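One possible way to reduce a genome-length entry to just the marker region is to cut it at the BLAST hit coordinates. The sketch below is only an illustration of that idea, not BLCA's own procedure; `trim_to_hit` and the toy `genome` string are hypothetical, and the coordinates 384 and 696 are taken from the MZ157283.1 row earlier in the thread (assuming those columns are sstart and send). It does not reverse-complement minus-strand hits.

```python
def trim_to_hit(subject: str, sstart: int, send: int, flank: int = 0) -> str:
    """Return the hit region of a subject sequence.

    Coordinates are 1-based and inclusive, as in BLAST tabular output.
    `flank` optionally keeps extra bases on each side, clipped to the
    sequence ends.
    """
    lo, hi = min(sstart, send), max(sstart, send)
    lo = max(1, lo - flank)
    hi = min(len(subject), hi + flank)
    return subject[lo - 1:hi]


# Toy stand-in for a mitogenome: the "gene" is the run of C's.
genome = "A" * 383 + "C" * 313 + "G" * 100
region = trim_to_hit(genome, 384, 696)
print(len(region))
```

A database rebuilt from regions trimmed this way would contain entries of comparable length, which is the property the maintainers ask for.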


shump2 · 2021-11-04T19:35:43Z

OK, I understand that, but these hits are all the same species, albeit some are whole mitogenomes. So why does the result return a different species from the blast hits?

qunfengdong · 2021-11-04T22:14:02Z

@shump2 @YJulyXing We should have made this clearer in our documentation. For BLCA, we expect the database sequences to be of similar length: that is, the sequences are from one gene family. For example, all 16S gene sequences have more or less similar lengths (not identical, but similar). If the database sequences have dramatically different lengths, the multiple sequence alignment may become a problem. In your case, if some of the sequences correspond to whole mitogenomes while others correspond to a particular gene within the mitogenome, they are of very different lengths, which may prevent a reliable multiple sequence alignment.
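A quick way to see the length mismatch described here is to compare the subject lengths of the hits. The sketch below assumes the second-to-last column of the blastn rows posted earlier in this thread is the subject length; `flag_length_outliers` and the 3x-median cutoff are arbitrary illustrative choices, not part of BLCA.

```python
from statistics import median


def flag_length_outliers(slens: dict[str, int], max_ratio: float = 3.0) -> list[str]:
    """Return accessions whose length exceeds max_ratio times the median length."""
    med = median(slens.values())
    return sorted(acc for acc, n in slens.items() if n > max_ratio * med)


# Subject lengths copied from the blastn rows in this thread
# (assumed to be the slen column).
slens = {
    "KU317715.1": 621, "MW124469.1": 655, "MZ157283.1": 15460,
    "KU317714.1": 622, "KU317712.1": 601, "KM245630.1": 15461,
    "DQ525222.1": 638, "KU317713.1": 549, "MW124560.1": 658,
    "MW124542.1": 658, "DQ525226.1": 638,
}
print(flag_length_outliers(slens))
# -> ['KM245630.1', 'MZ157283.1']
```

The two flagged entries are roughly 15.5 kb long against a median of 638 bp, which is consistent with whole mitogenomes mixed in with single-gene records.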

YJulyXing · 2021-11-04T23:13:33Z

What is your blast database? Are you using the default blast database (16SMicrobial)?


qunfengdong · 2021-11-04T23:15:49Z

@shump2 do you mind providing your blast database? @YJulyXing needs to check whether your database entries really have the correct taxon IDs in the NCBI taxonomy database you used.
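A check along these lines can also be run locally. The sketch below streams a taxonomy file and reports the lineage stored for each hit accession, plus any accession with no entry. It assumes each line has the form `accession<TAB>lineage`; that layout, the function name `lookup_taxonomy`, and the example lines are assumptions for illustration and should be compared against the real nt.ACC.taxonomy file.

```python
from typing import Iterable


def lookup_taxonomy(lines: Iterable[str], accessions: set[str]) -> dict[str, str]:
    """Stream taxonomy lines and keep only the requested accessions."""
    found: dict[str, str] = {}
    for line in lines:
        acc, _, lineage = line.rstrip("\n").partition("\t")
        if acc in accessions:
            found[acc] = lineage
            if len(found) == len(accessions):
                break  # everything located, stop reading early
    return found


# Made-up lines that only illustrate the assumed format.
example = [
    "ACC_A.1\tspecies:Example species one;genus:Examplegenus;\n",
    "ACC_B.1\tspecies:Example species two;genus:Othergenus;\n",
]
wanted = {"ACC_A.1", "ACC_C.1"}
found = lookup_taxonomy(example, wanted)
missing = sorted(wanted - found.keys())
print(found)
print("missing:", missing)
```

With the real data, an open file handle on the taxonomy file can be passed as `lines`, so the roughly 82 million records are never all held in memory, and `wanted` would be the 11 accessions from the blastn output. A hit accession mapped to an unexpected lineage, or missing entirely, would point to a taxonomy-file problem rather than an alignment problem.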

shump2 closed this as completed Jan 19, 2022
