output taxid for certain rank or complete lineage #8

Closed
tolot27 opened this issue Jan 26, 2018 · 13 comments

Comments

@tolot27
Contributor

tolot27 commented Jan 26, 2018

Similar to the format string in reformat, it would be useful to have placeholders for the taxids of the available ranks.
For downstream analysis and visualization, e.g. of metagenomic data, the taxids of the species and subspecies ranks would be most useful. Taxonomic classifiers often pick a particular strain, or many different strains of the same species/subspecies, randomly or incorrectly, which clutters the output.
With such a placeholder, the dataset can easily be filtered during visualization rather than during processing.

@shenwei356
Owner

It's a good suggestion. But in some cases, e.g. viruses, a lineage does not have that many rank levels. There will be some missing/"unclassified xxx" ranks, which do not actually refer to any taxid.

For example, for taxid 1327037:

$ echo 1327037 | taxonkit lineage | taxonkit reformat -f "{k};{p};{c};{o};{f};{g};{s}"
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y       Viruses;;;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y
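
To make the gaps explicit, the reformatted lineage can be paired with the placeholder letters of the format string. This is a minimal sketch, assuming the seven-placeholder format `{k};{p};{c};{o};{f};{g};{s}` used in the command above; the helper name `label_ranks` is hypothetical and not part of taxonkit.

```shell
# label_ranks: read semicolon-separated lineages (one per line, in the order
# k;p;c;o;f;g;s) and print one "placeholder<TAB>name" pair per line.
# Empty fields are reported as "(missing)".
# Hypothetical helper for illustration, not a taxonkit command.
label_ranks() {
    awk -F ';' '
    BEGIN { split("k p c o f g s", rank, " ") }
    {
        for (i = 1; i <= 7; i++)
            printf "%s\t%s\n", rank[i], ($i == "" ? "(missing)" : $i)
    }'
}

# Reformatted lineage of taxid 1327037, copied from the output above.
echo 'Viruses;;;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y' | label_ranks
```

For this lineage, the `p`, `c` and `g` rows come out as "(missing)", which are exactly the ranks that have no taxid to report.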

@tolot27
Contributor Author

tolot27 commented Jan 26, 2018

Even if there are not that many rank levels, the last rank is still species (see https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?name=1327037).

@shenwei356
Owner

shenwei356 commented Jan 27, 2018

Implemented in v0.2.4-dev3 or later versions.

Note that lots of taxids share the same taxon name, like "diastema", "solieria" and "environmental samples", so it is not just a matter of mapping a taxon name to a taxid. I use both the taxon name and the name of its parent taxon to get the right taxid. This also brings more accurate results when using the flag -F/--fill-miss-rank to estimate and fill missing ranks. But it's slower because names.dmp is read twice.
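
The idea of keying on both the taxon name and the parent's name can be illustrated with a toy lookup. This is only a sketch of the principle, not taxonkit's implementation: the three-column table layout, the helper name `resolve_taxid` and the IDs 1001/1002 are all made up for the example (they are not the names.dmp format and not real NCBI taxids).

```shell
# resolve_taxid TABLE NAME PARENT_NAME
# TABLE is a tab-separated file with columns: taxid, taxon name, parent taxon name.
# Hypothetical layout for illustration only.
resolve_taxid() {
    awk -F '\t' -v name="$2" -v parent="$3" \
        '$2 == name && $3 == parent { print $1 }' "$1"
}

# Two taxa sharing the same name under different parents (made-up IDs).
table=$(mktemp)
printf '%s\t%s\t%s\n' \
    1001 'environmental samples' 'Bacteria' \
    1002 'environmental samples' 'Siphoviridae' > "$table"

resolve_taxid "$table" 'environmental samples' 'Bacteria'      # prints 1001
resolve_taxid "$table" 'environmental samples' 'Siphoviridae'  # prints 1002
rm -f "$table"
```

Looking up the name alone would return both rows; adding the parent name makes the match unique in this toy table.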

PS: I spent 7 hours 😫

$ cat lineage.txt
9606    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
349741  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
11932   Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
314101  cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y


$ cat lineage.txt | taxonkit reformat -t
9606    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens  Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens      2759;7711;40674;9443;9604;9605;9606                                           
349741  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835    Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila   2;74201;203494;48461;1647988;239934;239935
239935  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila      2;74201;203494;48461;1647988;239934;239935
11932   Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle  Viruses;;;;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle    10239;;;;11632;11749;11932
314101  cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B     Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B   2;;;;;;314101
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y     Viruses;;;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y  10239;;;28883;10699;;1327037


$ cat lineage.txt | taxonkit reformat -t -F
9606    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens  Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens      2759;7711;40674;9443;9604;9605;9606
349741  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835    Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila   2;74201;203494;48461;1647988;239934;239935
239935  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila      2;74201;203494;48461;1647988;239934;239935
11932   Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle  Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle  10239;;;;11632;11749;11932
314101  cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B     Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;uncultured murine large bowel bacterium BAC 54B      2;;;;;;314101
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y     Viruses;unclassified Viruses phylum;unclassified Viruses class;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y    10239;;;28883;10699;;1327037


$ cat lineage.txt | taxonkit reformat -t -F | cut -f 2,3 | perl -pe 's/^/Original: /; s/\n/\n\n/; s/\t/\nReformat: /'
Original: cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
Reformat: Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens

Original: cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila

Original: cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila

Original: Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle

Original: cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
Reformat: Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;uncultured murine large bowel bacterium BAC 54B

Original: Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y

@tolot27
Contributor Author

tolot27 commented Jan 29, 2018

Thanks for your great work.
Currently, I don't understand why you have to read names.dmp twice. The only case I can imagine is when only a taxon name is provided, without any taxid.

I suggest taxid 376619 (Francisella tularensis subsp. holarctica LVS) as an additional test case because it also includes a subspecies.

@shenwei356
Owner

Thanks for your suggestions. I was being foolish. I'll improve it soon.

@shenwei356
Owner

shenwei356 commented Jan 29, 2018

The code has been optimized for speed and memory usage. You may re-download the binaries.

And 376619 has been added to the test data set, where 349741 is also a subspecies.

$ cat lineage.txt
9606    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
9913    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
376619  cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS
349741  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101  cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
11932   Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
$ cat lineage.txt | ./taxonkit reformat -t -F > lineage.txt.reformat.fill
$ cat lineage.txt.reformat.fill \
    | perl -pe 's/^/Taxid   : /;
        s/\t/\nLineage : /;
        s/\t/\nReformat: /;
        s/\t/\nTaxids  : /;
        print "\n";'

Taxid   : 9606
Lineage : cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
Reformat: Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens
Taxids  : 2759;7711;40674;9443;9604;9605;9606

Taxid   : 9913
Lineage : cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
Reformat: Eukaryota;Chordata;Mammalia;unclassified Mammalia order;Bovidae;Bos;Bos taurus
Taxids  : 2759;7711;40674;;9895;9903;9913

Taxid   : 376619
Lineage : cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS
Reformat: Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis
Taxids  : 2;1224;1236;72273;34064;262;263

Taxid   : 349741
Lineage : cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Taxids  : 2;74201;203494;48461;1647988;239934;239935

Taxid   : 239935
Lineage : cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Taxids  : 2;74201;203494;48461;1647988;239934;239935

Taxid   : 314101
Lineage : cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
Reformat: Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;uncultured murine large bowel bacterium BAC 54B
Taxids  : 2;;;;;;314101

Taxid   : 11932
Lineage : Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
Taxids  : 10239;;;;11632;11749;11932

Taxid   : 1327037
Lineage : Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y
Taxids  : 10239;;;28883;10699;;1327037
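
Because the taxid list ends with the species-level taxid, strain-level and species-level hits can be collapsed onto one key downstream. The following is a minimal sketch, assuming tab-separated input whose last column is the semicolon-separated taxid list with species as the final entry; the helper name `count_by_species` is hypothetical.

```shell
# count_by_species: count input rows per species-level taxid, taken as the
# last entry of the semicolon-separated list in the final tab-separated column.
# Rows without a species-level taxid are skipped. Hypothetical helper.
count_by_species() {
    awk -F '\t' '
    {
        n = split($NF, ids, ";")
        species = ids[n]
        if (species == "") next                      # nothing to group on
        if (!(species in count)) order[++m] = species  # remember first-seen order
        count[species]++
    }
    END { for (i = 1; i <= m; i++) printf "%s\t%d\n", order[i], count[order[i]] }'
}

# Three rows reduced to "query taxid <TAB> taxid list", taken from the output above.
printf '%s\t%s\n' \
    349741 '2;74201;203494;48461;1647988;239934;239935' \
    239935 '2;74201;203494;48461;1647988;239934;239935' \
    376619 '2;1224;1236;72273;34064;262;263' | count_by_species
# prints:
# 239935<TAB>2
# 263<TAB>1
```

The strain 349741 and the species 239935 end up in the same bucket, which is the kind of de-cluttering the original request was after.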

shenwei356 added a commit that referenced this issue Jan 29, 2018
@shenwei356
Owner

@tolot27 Tell me if you have more suggestions or feedback. I would like to release a new version soon.

@tolot27
Contributor Author

tolot27 commented Jan 30, 2018

Two points I thought about: First, with the parameter -r, --miss-rank-repl string it is possible to choose a prefix other than "unclassified" for missing ranks, but it is not possible to omit the prefix altogether, because the empty string "" triggers the default.
Second, there is no fallback for the subspecies taxid. Only the parameter -R, --miss-taxid-repl string exists, which is just a static string, like the prefix for the subspecies rank. Because my main aim is just to filter out strain-level resolution but keep subspecies resolution where applicable, I currently use awk to copy the species taxid to the subspecies taxid if the latter is empty. Then I can use the subspecies taxid as the taxon column with KronaTools. Maybe I should stick with awk rather than enhancing taxonkit with such special cases.

@shenwei356
Owner

shenwei356 commented Jan 30, 2018

For the first issue, you may use a special symbol and remove it afterwards.

Subspecies is supported using the placeholder {S} in flag -f. 😄 http://bioinf.shenwei.me/taxonkit/usage/#reformat
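
The special-symbol workaround for the first point could look roughly like this: pick a marker that never occurs in taxon names as the replacement string, then delete it in a later pipeline step. The marker `@@`, the helper name `strip_marker` and the sample line are assumptions for illustration; the sample is hand-written, not actual taxonkit output.

```shell
# strip_marker: delete every occurrence of the marker "@@" together with
# one optional space following it. Marker choice is an assumption.
strip_marker() {
    sed 's/@@ \{0,1\}//g'
}

# Hand-written sample line, not real taxonkit output.
echo 'Viruses;@@ Viruses phylum;@@ Viruses class;Caudovirales' | strip_marker
# prints: Viruses;Viruses phylum;Viruses class;Caudovirales
```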

@tolot27
Contributor Author

tolot27 commented Jan 30, 2018

Subspecies is supported using the placeholder {S}.

I know, but that was not my point. I was talking about a column containing the taxid of the subspecies if one is defined, or the taxid of the species if no subspecies is defined. Currently, the subspecies taxid column is empty if no subspecies is defined, which is indeed correct. But a merged column is required, for instance for KronaTools. That is probably out of scope for taxonkit, though.

@shenwei356
Owner

shenwei356 commented Jan 30, 2018

Yes, that can be done using awk.
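
One possible awk approach, as a sketch rather than the exact script used in this thread: append a column holding the last non-empty entry of the taxid list, so a "species;subspecies" list falls back to the species taxid when the subspecies is empty. The two-column input layout and the helper name `deepest_taxid` are assumptions, and the IDs 111/222 in the second sample row are made up.

```shell
# deepest_taxid: append a column with the last non-empty entry of the
# semicolon-separated taxid list found in the final tab-separated column.
# Hypothetical helper; the column stays empty if the list has no entries.
deepest_taxid() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        n = split($NF, ids, ";")
        deepest = ""
        for (i = n; i >= 1; i--)
            if (ids[i] != "") { deepest = ids[i]; break }
        print $0, deepest
    }'
}

# Columns: query taxid, "species;subspecies" taxids.
# Row 1 has no subspecies; row 2 uses made-up IDs to show the other case.
printf '%s\t%s\n' 239935 '239935;' 111 '111;222' | deepest_taxid
# prints:
# 239935<TAB>239935;<TAB>239935
# 111<TAB>111;222<TAB>222
```

The appended column can then serve as the single taxon column that tools such as KronaTools expect.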

@shenwei356
Owner

v0.2.4 is out.

You can now set the prefix "unclassified" with the flag -p/--miss-rank-repl-prefix.

@tolot27
Contributor Author

tolot27 commented Jul 9, 2018

Since no "deepest" taxid of supported/requested ranks exists, you can close this issue.
