Distance matrix from classification and class2tree #849

Closed
morgan-sparks opened this issue Oct 5, 2020 · 8 comments

@morgan-sparks
Just a quick question, not really a bug. I used the classification() function to download the taxonomic hierarchy for a list of species (including insects, plants, molluscs, etc.) and then converted it to a tree using class2tree(). Works great! Obviously these are not true phylogenetic relationships, but I was curious how distances are determined for the phylo object. I am guessing it has to do with the taxonomic hierarchy, so that a genus is a fixed distance from a family, a family a fixed distance from an order, and so on, with no distinction made between, say, the distance from plants to reptiles and the distance from reptiles to amphibians. I am trying to make a variance-covariance matrix to control for phylogenetic non-independence in a statistical model, and I am wondering how far off the mark I am using this method. Thanks!
@sckott
Contributor

sckott commented Oct 5, 2020

thx for your question @morgan-sparks - @trvinh can you give some details?

It'd be worth adding a more detailed description to the function documentation too as it's pretty thin on what is being done within the function.

@trvinh
Contributor

trvinh commented Oct 6, 2020

Hi @morgan-sparks ,
yes, you are right: the classification is basically based on the taxonomy hierarchy string. Given a list of species, the function first collects all available taxonomy ranks and their corresponding IDs for each species. Then it tries to align the rank vectors, just like a sequence alignment (where each position is a taxonomy rank instead of a nucleotide or an amino acid), and replaces the ranks with their taxonomy IDs. After that, the aligned ID matrix is used to cluster species with similar taxonomy strings together (using the hierarchical clustering function hclust).

For example, these are the rank vectors for 3 species:

SpecA: strain, species, family, phylum, kingdom, superkingdom
SpecB: strain, species, genus, family, kingdom, superkingdom
SpecC: strain, species, phylum, kingdom, superkingdom

Then our aligned rank matrix will look like:

Name	strain	species	genus	family	phylum	kingdom	superkingdom
SpecA	strain	species	NA	family	phylum	kingdom	superkingdom
SpecB	strain	species	genus	family	NA	kingdom	superkingdom
SpecC	strain	species	NA	NA	phylum	kingdom	superkingdom
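
The alignment step can be sketched in a few lines of Python. This is only an illustration, not the code class2tree runs: it assumes a fixed reference order of ranks (the hypothetical `RANK_ORDER` below, copied from the table header), whereas the real function has to work that order out from the data.

```python
# Hypothetical reference order of ranks, most specific first. This is an
# assumption of the sketch; it simply mirrors the header of the table above.
RANK_ORDER = ["strain", "species", "genus", "family",
              "phylum", "kingdom", "superkingdom"]

# Rank vectors of the three example species.
rank_vectors = {
    "SpecA": ["strain", "species", "family", "phylum", "kingdom", "superkingdom"],
    "SpecB": ["strain", "species", "genus", "family", "kingdom", "superkingdom"],
    "SpecC": ["strain", "species", "phylum", "kingdom", "superkingdom"],
}


def align_ranks(ranks, order=RANK_ORDER):
    """Place each rank in its reference column; missing ranks become None (NA)."""
    present = set(ranks)
    return [rank if rank in present else None for rank in order]


aligned = {name: align_ranks(ranks) for name, ranks in rank_vectors.items()}

for name, row in aligned.items():
    print(name, ["NA" if rank is None else rank for rank in row])
```

Each row now has one entry per column of the aligned matrix, with `None` standing in for NA.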

It will be converted into this ID matrix (note: any missing ranks will have a pseudo ID from the previous rank):

Name	strain	species	genus	family	phylum	kingdom	superkingdom
SpecA	11	21	21	41	51	61	71
SpecB	12	22	31	41	41	61	71
SpecC	13	23	23	23	52	61	71
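
The pseudo-ID rule amounts to a forward fill along each row. The sketch below is a minimal illustration of that rule and not taxize's implementation; the helper name `fill_pseudo_ids` is made up for this example. Applied literally, SpecA's missing genus takes its species ID 21, SpecB's missing phylum takes its family ID 41, and SpecC's missing genus and family both take its species ID 23.

```python
# Aligned IDs with None where a species lacks the rank.
# Columns: strain, species, genus, family, phylum, kingdom, superkingdom.
raw_ids = {
    "SpecA": [11, 21, None, 41, 51, 61, 71],
    "SpecB": [12, 22, 31, 41, None, 61, 71],
    "SpecC": [13, 23, None, None, 52, 61, 71],
}


def fill_pseudo_ids(row):
    """Replace each missing ID with the ID of the closest preceding rank.

    A missing value at the very start of a row has nothing to copy from and
    stays None in this sketch.
    """
    filled = []
    last_seen = None
    for value in row:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


id_matrix = {name: fill_pseudo_ids(row) for name, row in raw_ids.items()}

for name, row in id_matrix.items():
    print(name, row)
```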

That aligned ID matrix will be used for clustering. As you can see, our tree will be:

((SpecA, SpecB), SpecC),

since SpecA and SpecB belong together to the family 41, and all three share the same kingdom 61.
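
To see why this matrix gives that topology, here is a small pure-Python sketch. It makes two assumptions that are not taken from the class2tree source: the distance between two species is the fraction of columns in which their IDs differ, and clusters are merged by average linkage. The actual function relies on R's hclust and may measure distance differently.

```python
from itertools import combinations

# ID matrix from the example. SpecB's missing phylum carries its family ID 41,
# following the "pseudo ID from the previous rank" rule.
ids = {
    "SpecA": [11, 21, 21, 41, 51, 61, 71],
    "SpecB": [12, 22, 31, 41, 41, 61, 71],
    "SpecC": [13, 23, 23, 23, 52, 61, 71],
}


def mismatch_distance(a, b):
    """Fraction of columns whose IDs differ (an assumed metric for this sketch)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2, ids):
    """Average pairwise distance between the members of two clusters."""
    members1, members2 = c1[1], c2[1]
    total = sum(mismatch_distance(ids[p], ids[q])
                for p in members1 for q in members2)
    return total / (len(members1) * len(members2))


def average_linkage(ids):
    """Agglomerative clustering; returns the tree as nested tuples."""
    # Each cluster is a pair (subtree, list of member names).
    clusters = [(name, [name]) for name in ids]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], ids))
        rest = [c for c in clusters if c is not c1 and c is not c2]
        clusters = [((c1[0], c2[0]), c1[1] + c2[1])] + rest
    return clusters[0][0]


tree = average_linkage(ids)
print(tree)  # (('SpecA', 'SpecB'), 'SpecC')
```

SpecA and SpecB differ in 4 of 7 columns, while each differs from SpecC in 5 of 7, so the A/B pair is merged first and SpecC joins last.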

So, the classification clusters the taxa based not only on the ranks (e.g. species, family, or phylum) but also on the actual taxonomic clades (specified by the IDs). This means it can place Arabidopsis within the plant clade and snakes within the reptiles. The more information you have in the aligned ID matrix (taxa with detailed taxonomy strings, and enough taxa to cover all possible ranks), the higher the resolution of your taxonomy tree. For example, here are the first few lines of a real aligned ID matrix I am working with (not exactly the same as the one class2tree produces; I just want to give you an idea of how much data this matrix can or should contain):

abbrName	ncbiID	fullName	strain	norank_491	norank_57678	norank_56615	norank_1086053	isolate	forma	varietas	formaspecialis	subspecies	species	norank_600669	norank_44542	speciessubgroup	norank_2642122	norank_2619919	norank_2627626	norank_1919231	norank_38063	norank_2622230	norank_2642239	norank_2629106	norank_2618147	norank_2625095	norank_2641301	norank_2621923	norank_256003	norank_2627676	norank_2621901	norank_2644672	norank_1104572	norank_2642439	norank_1234666	norank_2619537	norank_2646925	norank_2643926	norank_2635519	norank_2638681	norank_2624677	norank_2685009	norank_2644212	norank_2631227	norank_2144190	norank_2622732	norank_69483	norank_2643768	norank_2641396	norank_2668073	norank_2648758	norank_1655640	norank_2639520	norank_2634974	norank_2620651	norank_2641160	norank_2634096	norank_2648975	norank_2684913	norank_1700837	norank_132567	norank_2626337	norank_2634179	norank_2629569	norank_2623058	norank_1727652	norank_2684908	norank_2636082	norank_886737	norank_2638413	norank_2638438	norank_2632030	norank_2630403	norank_1679063	norank_2648658	norank_2642017	norank_2631942	norank_2630810	norank_2625419	norank_2565780	norank_2629969	norank_2648942	norank_2648404	norank_2631116	norank_2624956	norank_2610901	norank_2499238	norank_2752537	norank_857194	norank_220671	norank_254878	norank_2641196	norank_2633939	norank_2678336	norank_2640012	speciesgroup	norank_44537	subgenus	genus	norank_1535325	subtribe	norank_1304792	norank_651142	norank_1593277	norank_2116545	norank_114584	norank_2683658	norank_218105	norank_1113537	norank_1142503	norank_588816	norank_1803510	family	norank_469895	norank_715341	tribe	norank_588815	norank_1648033	subfamily	norank_359160	norank_147370	norank_2601530	norank_337687	norank_1912919	superfamily	norank_2601529	norank_74971	norank_104431	norank_43741	norank_37567	norank_43738	parvorder	norank_480117	norank_480118	infraorder	norank_33351	norank_33349	norank_33347	suborder	norank_33343	order	norank_159987	norank_33083	norank_355688	norank_9263	
norank_91836	norank_314147	norank_6970	norank_4734	superorder	norank_71275	norank_186626	norank_1437201	norank_1489908	norank_1437010	subcohort	norank_1489872	norank_91827	norank_9347	norank_123369	norank_123368	norank_123367	norank_123366	norank_123365	cohort	norank_186625	norank_1489341	infraclass	norank_32525	norank_71240	norank_4447	subclass	norank_1437183	norank_1520881	norank_85512	class	norank_2283796	norank_2290931	norank_2692248	norank_2283794	norank_2683659	norank_2683660	norank_715962	norank_715989	norank_404260	norank_436492	norank_58024	norank_436491	norank_436489	norank_436486	norank_8492	norank_1329799	norank_32561	norank_8457	norank_32524	norank_78536	norank_58023	norank_716546	norank_3193	norank_32523	norank_1338369	superclass	norank_117571	norank_117570	norank_7776	norank_7742	subphylum	norank_197562	norank_197563	norank_716545	phylum	norank_1783276	norank_112252	norank_2611352	norank_2696291	norank_554915	norank_33630	norank_1206795	norank_33634	norank_2697495	norank_2698737	norank_3208	norank_1935183	norank_1783275	norank_2611341	norank_1798711	norank_68336	norank_1783257	norank_33511	norank_88770	norank_1783272	norank_1783270	norank_1206794	norank_33317	subkingdom	norank_33213	norank_6072	kingdom	norank_999999005	norank_999999001	norank_999999002	norank_999999003	norank_999999004	norank_33154	superkingdom	norank_131567	root
ncbi100226	100226	Streptomyces coelicolor A3(2)	100226	100226	100226	100226	100226	100226	100226	100226	100226	100226	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1902	1477431	1477431	1477431	1883	1883	1883	1883	1883	1883	1883	1883	1883	1883	1883	1883	1883	1883	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	2062	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	85011	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	1760	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	201174	1783272	1783272	1783272	1783272	1783272	1783272	1783272	1783272	1783272	1783272	1783272	1783272	1783272	1783272	2	131567	1
ncbi10090	10090	Mus musculus	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	10090	862507	10088	10088	10088	10088	10088	10088	10088	10088	10088	10088	10088	10088	10088	10088	10066	10066	10066	10066	10066	10066	39107	39107	39107	39107	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	1963758	1963758	9989	9989	9989	9989	9989	9989	314147	314147	314147	314146	314146	314146	314146	314146	1437010	1437010	1437010	1437010	9347	9347	9347	9347	9347	9347	9347	9347	9347	9347	32525	32525	32525	32525	32525	32525	32525	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	32524	32524	32524	32524	32524	32523	1338369	8287	117571	117570	7776	7742	89593	89593	89593	89593	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	33511	33511	33511	33511	33511	33511	33511	33213	6072	33208	33208	999999001	999999002	999999003	999999004	33154	2759	131567	1
ncbi10116	10116	Rattus norvegicus	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10116	10114	10114	10114	10114	10114	10114	10114	10114	10114	10114	10114	10114	10114	10114	10066	10066	10066	10066	10066	10066	39107	39107	39107	39107	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	337687	1963758	1963758	9989	9989	9989	9989	9989	9989	314147	314147	314147	314146	314146	314146	314146	314146	1437010	1437010	1437010	1437010	9347	9347	9347	9347	9347	9347	9347	9347	9347	9347	32525	32525	32525	32525	32525	32525	32525	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	40674	32524	32524	32524	32524	32524	32523	1338369	8287	117571	117570	7776	7742	89593	89593	89593	89593	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	7711	33511	33511	33511	33511	33511	33511	33511	33213	6072	33208	33208	999999001	999999002	999999003	999999004	33154	2759	131567	1

Have I answered your question, Morgan? I hope that it can help! :-)
Best,
Vinh

sckott added this to the v0.9.99 milestone Oct 6, 2020
@sckott
Contributor

sckott commented Oct 6, 2020

thanks @trvinh - can you add some of that to the function documentation?

@trvinh
Contributor

trvinh commented Oct 6, 2020

> thanks @trvinh - can you add some of that to the function documentation?

@sckott: yes, I can. But where should I add it? Directly in the Rd block of each subfunction? How would users see that, since those functions are internal only?

@sckott
Contributor

sckott commented Oct 6, 2020

I was thinking in the documentation for the class2tree function https://github.com/ropensci/taxize/blob/master/R/class2tree.R#L1-L52 - perhaps in the @details section

you can also add docs to specific internal methods if you want, and then use @keywords internal so that the manual page doesn't show up in the listing of docs pages, but the user can still get to it by doing ?fxn-name

@trvinh
Contributor

trvinh commented Oct 6, 2020

I will try :-)

@sckott
Contributor

sckott commented Oct 6, 2020

thank you!

@morgan-sparks
Author

Thanks to you both @sckott @trvinh, this was very helpful and a nice option for those of us trying to make basic distance-based phylogenies that span very broad species lists.
