How to add the corresponding taxonomic level prefix based on the taxonomy name? #82

Closed
YeGuoZJU opened this issue Jul 10, 2023 · 6 comments

@YeGuoZJU

YeGuoZJU commented Jul 10, 2023

Dear Dr. Shen, thank you for such a handy tool!

I have a question. I used metawrap classify_bins to assign taxonomy to genomic bins, and the results are as follows:

bin.10.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae
bin.6.fa Bacteria;Bacillota;Clostridia;Eubacteriales
bin.17.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae;Anaerostipes;Anaerostipes hadrus
bin.21.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Oscillospiraceae
bin.4.fa Bacteria;Bacillota;Clostridia;uncultured Clostridia bacterium
bin.8.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae
bin.15.fa Bacteria;Bacillota
bin.19.fa Bacteria;Verrucomicrobiota;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
bin.23.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Oscillospiraceae
bin.3.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Tannerellaceae;Parabacteroides
bin.12.fa Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanobrevibacter;Methanobrevibacter smithii
bin.7.fa Bacteria;Bacillota;Negativicutes;Acidaminococcales;Acidaminococcaceae;Phascolarctobacterium;Phascolarctobacterium faecium
bin.16.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae;Anaerobutyricum;Anaerobutyricum hallii
bin.11.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides uniformis
bin.2.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae
bin.13.fa Bacteria;Bacillota;Clostridia;Eubacteriales
bin.5.fa Bacteria
bin.9.fa Bacteria;Bacillota;Clostridia;Eubacteriales
bin.18.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae
bin.22.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Phocaeicola

I want to add the corresponding taxonomic level prefix to each taxonomy name, like this:

bin.1.fa d__Bacteria;p__Actinobacteria;c__Actinomycetia;o__Bifidobacteriales;f__Bifidobacteriaceae;g__Bifidobacterium;s__Bifidobacterium bifidum

I have tried many times without success. Can you give me some advice? Thank you very much!

@shenwei356
Owner

https://bioinf.shenwei.me/taxonkit/usage/#reformat

 taxonkit reformat --verbose -i 2 -P -F -T test.tsv \
    | cut -f 1,3

bin.5.fa        k__Bacteria;p__;c__;o__;f__;g__;s__
bin.9.fa        k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__;g__;s__
bin.18.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae;g__;s__
bin.22.fa       k__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__

Well, I think it should be possible to optionally remove the empty ;p__;c__;o__;f__;g__;s__ for Bacteria.
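A note on the input for the command above: taxonkit reads tab-delimited files, and -i 2 takes the lineage from the second field. The listing in the question is displayed with a space after the bin name, so if the file on disk really is space-separated (an assumption; the metawrap output may already use tabs), only the first space should become a tab, because species names such as "Anaerostipes hadrus" contain spaces themselves. A minimal sketch, with to_tsv as a hypothetical helper name:

```shell
#!/bin/sh
# to_tsv (hypothetical helper): replace only the FIRST space on each
# line with a tab, leaving spaces inside taxon names untouched.
to_tsv() {
    awk '{
        i = index($0, " ")
        if (i > 0)
            $0 = substr($0, 1, i - 1) "\t" substr($0, i + 1)
        print
    }'
}

# Example with one line from the question.
printf '%s\n' \
    'bin.17.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae;Anaerostipes;Anaerostipes hadrus' \
    | to_tsv
```

Lines that contain no space at all, or that are already tab-separated before the first space, should be checked by eye before running this.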

shenwei356 added a commit that referenced this issue Jul 11, 2023
@shenwei356
Owner

Try the new binary:

taxonkit_linux_amd64.tar.gz

taxonkit reformat --verbose -i 2 -P -T test.tsv \
    | cut -f 1,3
bin.5.fa        k__Bacteria;;;;;;
bin.9.fa        k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;;;
bin.18.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae;;
bin.22.fa       k__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;

You can also remove the trailing semicolons with sed.

taxonkit reformat --verbose -i 2 -P -T test.tsv \
    | cut -f 1,3 \
    | sed -E  's/;+$//'
bin.5.fa        k__Bacteria
bin.9.fa        k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales
bin.18.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae
bin.22.fa       k__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola
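To see what the sed expression does without running taxonkit, here is a small self-contained sketch. The pattern ;+$ is anchored to the end of the line, so it only strips the trailing run of semicolons; empty ranks in the middle of a lineage are left alone. The function name trim_trailing_semicolons and the second sample line (a lineage with missing middle ranks) are illustrative only.

```shell
#!/bin/sh
# Strip one or more semicolons at the end of each line.
# Same expression as the sed call above, wrapped in a function.
trim_trailing_semicolons() {
    sed -E 's/;+$//'
}

# First line: only trailing empties, so they are all removed.
# Second line: the empties sit between filled ranks, so nothing changes.
printf '%s\n' \
    'k__Bacteria;;;;;;' \
    'k__Bacteria;p__Bacillota;c__Clostridia;;;;s__uncultured Clostridia bacterium' \
    | trim_trailing_semicolons
```

If the inner gaps should disappear as well, that needs a different expression; this one deliberately keeps the rank positions intact.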

@YeGuoZJU
Author

Thank you very much for your reply. Following your suggestion, I got the output I needed.
However, one part of the result puzzles me:

bin.10.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae
bin.6.fa Bacteria;Bacillota;Clostridia;Eubacteriales
bin.17.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae;Anaerostipes;Anaerostipes hadrus
bin.21.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Oscillospiraceae
bin.4.fa Bacteria;Bacillota;Clostridia;uncultured Clostridia bacterium
bin.8.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae
bin.15.fa Bacteria;Bacillota
bin.19.fa Bacteria;Verrucomicrobiota;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
bin.23.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Oscillospiraceae
bin.3.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Tannerellaceae;Parabacteroides
bin.12.fa Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanobrevibacter;Methanobrevibacter smithii
bin.7.fa Bacteria;Bacillota;Negativicutes;Acidaminococcales;Acidaminococcaceae;Phascolarctobacterium;Phascolarctobacterium faecium
bin.16.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae;Anaerobutyricum;Anaerobutyricum hallii
bin.11.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides uniformis
bin.2.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae
bin.13.fa Bacteria;Bacillota;Clostridia;Eubacteriales
bin.5.fa Bacteria
bin.9.fa Bacteria;Bacillota;Clostridia;Eubacteriales
bin.18.fa Bacteria;Bacillota;Clostridia;Eubacteriales;Lachnospiraceae
bin.22.fa Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Phocaeicola

[image]

The result:

[image]

The entry bin.15.fa shows no output at all. Shouldn't it be displayed as k__Bacteria;p__Bacillota, like the first two levels of bin.13.fa or bin.2.fa?

Sorry to bother you again, and thank you for your reply.

shenwei356 added a commit that referenced this issue Jul 11, 2023
…TaxId with the parent-child pair, use the last child only. #82
@shenwei356
Owner

Since some taxon names are ambiguous/duplicated, taxonkit reformat uses the parent-child pair to query the TaxIds, for lineages with more than one node.

For example, in bin.13.fa Bacteria;Bacillota;Clostridia;Eubacteriales, Clostridia;Eubacteriales is used. It works because in the NCBI taxonomy, Clostridia is the parent node of Eubacteriales.

$ echo Eubacteriales | taxonkit name2taxid | taxonkit lineage -i 2
Eubacteriales   186802  cellular organisms;Bacteria;Terrabacteria group;Bacillota;Clostridia;Eubacteriales

However, bin.15.fa Bacteria;Bacillota is a simplified lineage provided by the binner. In the NCBI taxonomy, Bacteria is not the direct parent of Bacillota (Terrabacteria group sits between them), so the parent-child query failed to find the TaxId.

$ echo Bacillota | taxonkit name2taxid | taxonkit lineage -i 2
Bacillota       1239    cellular organisms;Bacteria;Terrabacteria group;Bacillota

I just fixed this. Now, for lineages with more than one node, if querying the TaxId with the parent-child pair fails, taxonkit reformat retries with the last child only.

$ taxonkit reformat --verbose -i 2 -P -T test.tsv \
    | cut -f 1,3 \
    | sed -E  's/;+$//'
12:33:54.455 [INFO] parsing complete lineages from field 2
12:33:54.455 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp
12:33:54.455 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
12:33:54.455 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp
12:33:54.455 [INFO] parsing nodes file: /home/shenwei/.taxonkit/nodes.dmp
12:33:54.483 [INFO] 71301 merged nodes parsed
12:33:54.565 [INFO] 468129 delnodes parsed
12:33:55.820 [INFO] 2499662 names parsed
12:33:56.062 [INFO] 2499662 nodes parsed
12:33:56.062 [INFO] creating links: child name -> parent name -> taxid
12:34:00.677 [INFO] created links: child name -> parent name -> taxid
bin.10.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae
bin.6.fa        k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales
bin.17.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae;g__Anaerostipes;s__Anaerostipes hadrus
bin.21.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Oscillospiraceae
bin.4.fa        k__Bacteria;p__Bacillota;c__Clostridia;;;;s__uncultured Clostridia bacterium
bin.8.fa        k__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae
bin.15.fa       k__Bacteria;p__Bacillota
bin.19.fa       k__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila
bin.23.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Oscillospiraceae
bin.3.fa        k__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Tannerellaceae;g__Parabacteroides
bin.12.fa       k__Archaea;p__Euryarchaeota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter;s__Methanobrevibacter smithii
bin.7.fa        k__Bacteria;p__Bacillota;c__Negativicutes;o__Acidaminococcales;f__Acidaminococcaceae;g__Phascolarctobacterium;s__Phascolarctobacterium faecium
bin.16.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae;g__Anaerobutyricum;s__Anaerobutyricum hallii
bin.11.fa       k__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis
bin.2.fa        k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae
bin.13.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales
bin.5.fa        k__Bacteria
bin.9.fa        k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales
bin.18.fa       k__Bacteria;p__Bacillota;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae
bin.22.fa       k__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola
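The lookup order described above (parent-child pair first, then the last child alone) can be sketched in a few lines of shell and awk. This is only an illustration of the idea, not taxonkit's actual implementation: toy_table and lookup_taxid are hypothetical names, and the two table rows are taken from the lineages quoted in this thread.

```shell
#!/bin/sh
# Toy lookup table (hypothetical): child name, parent name, TaxId,
# tab-separated. Rows mirror the NCBI lineages quoted in this thread.
toy_table() {
    printf 'Eubacteriales\tClostridia\t186802\n'
    printf 'Bacillota\tTerrabacteria group\t1239\n'
}

# lookup_taxid LINEAGE
# Prints the TaxId of the last node of a ';'-separated lineage.
# Tries the (parent, child) pair first, then the child name alone.
lookup_taxid() {
    toy_table | awk -F '\t' -v lineage="$1" '
        BEGIN {
            n = split(lineage, node, ";")
            child = node[n]
            parent = (n > 1) ? node[n - 1] : ""
        }
        $1 == child && $2 == parent   { pair = $3 }     # parent-child match
        $1 == child && by_name == ""  { by_name = $3 }  # first child-only match
        END {
            if (pair != "")         print pair
            else if (by_name != "") print by_name
            else                    exit 1              # nothing found
        }'
}

lookup_taxid 'Bacteria;Bacillota;Clostridia;Eubacteriales'   # pair matches
lookup_taxid 'Bacteria;Bacillota'                            # falls back to child
```

Note the trade-off: the child-only fallback reintroduces the ambiguity that the parent-child pair was meant to avoid, so a duplicated taxon name could resolve to the wrong TaxId. The sketch simply takes the first match.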

taxonkit_linux_amd64.tar.gz

@shenwei356
Owner

It would be much easier if the TaxIds of the bins were available. Then you could specify the TaxId column with -I.

@YeGuoZJU
Author

Thank you, Dr. Shen, for your patient answer. Indeed, if metawrap classify_bins provided the TaxIds of the bins, this would be an easy task.

All in all, thank you very much! Wish you all the best.
