class2tree replacement #634

gedankenstuecke · 2017-10-02T10:39:04Z

As discussed in #611, this update will use the full NCBI Taxonomy information to create the hierarchical clustering.

Description

The old class2tree function did not take unnamed ranks into account when clustering the species. This left many splits in the trees unresolved, because the species shared all of their named taxonomy levels. The new class2tree function makes full use of the NCBI Taxonomy string, including the unnamed ranks, leading to higher-resolution trees that have fewer multifurcations.
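To illustrate the idea of using unnamed ranks, here is a minimal sketch rather than the code from this PR. It assumes a lineage data frame with `name`, `rank` and `id` columns in the shape returned by `classification()`; the rows are hand-written and purely illustrative. Each "no rank" entry gets a unique label built from its taxon ID, so that it can be compared across species like any named rank.

```r
# Hypothetical, hand-written lineage; values are for illustration only.
lineage <- data.frame(
  name = c("cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa"),
  rank = c("no rank", "superkingdom", "no rank", "kingdom"),
  id   = c("131567", "2759", "33154", "33208"),
  stringsAsFactors = FALSE
)

# Give every unnamed rank a unique label by appending its taxon ID,
# so it no longer collapses into one shared "no rank" level.
unnamed <- lineage$rank == "no rank"
lineage$rank[unnamed] <- paste0("norank_", lineage$id[unnamed])

print(lineage$rank)
```

Without this relabelling, all unnamed levels would carry the same rank label and could not be told apart when lineages are lined up against each other.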

After many months of discussion @trvinh and I (mostly him!) have figured out how to do this in R instead of Python! 😂

Related Issue

#611

Example

A full example can be found in this gist

In short:

library("taxize")

spnames <- c('Homo_sapiens','Pan_troglodytes','Macaca_mulatta','Mus_musculus','Rattus_norvegicus','Bos_taurus','Canis_lupus','Ornithorhynchus_anatinus','Xenopus_tropicalis','Takifugu_rubripes','Gallus_gallus','Ciona_intestinalis','Branchiostoma_floridae','Schistosoma_mansoni','Caenorhabditis_elegans','Anopheles_gambiae','Drosophila_melanogaster','Ixodes_scapularis','Ustilago_maydis','Neurospora_crassa','Monodelphis_domestica','Danio_rerio','Nematostella_vectensis','Cryptococcus_neoformans')

outTaxize <- classification(spnames, db='ncbi')
tree <- class2tree(outTaxize)
plot(tree)

will now yield a more finely resolved tree (plot image not included here).

As we just updated an existing function (and – to be honest – as we so far couldn't find out how to run the tests, some update on the docs on that would be appreciated) we didn't include new tests.


gedankenstuecke · 2017-10-02T12:25:42Z

Subsequent to the initial commit we

added zoo to the DESCRIPTION as an additional dependency, so that we can use na.locf
modified taxize-package.R to import transpose from data.table, complete.cases from stats, and na.locf from zoo
ran roxygen2 to generate the NAMESPACE file
fixed the wrong classification output to make sure that we adhere to the expected class2tree output
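For readers unfamiliar with na.locf ("last observation carried forward"), the following base-R sketch shows the kind of gap filling it performs. The helper `fill_forward()` is a hypothetical stand-in written for this illustration; it is not the zoo implementation, and how class2tree applies na.locf internally is not shown here.

```r
# Hypothetical base-R stand-in for last-observation-carried-forward.
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))      # index of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]  # slot 1 (NA) covers gaps before the first value
}

ranks <- c(NA, "a", NA, NA, "b", NA)
filled <- fill_forward(ranks)
print(filled)
```

Note one difference: this sketch keeps leading NA values, whereas zoo::na.locf drops them by default (`na.rm = TRUE`).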

We don't fully understand why the other tests fail for us though.

sckott · 2017-10-02T19:45:24Z

thanks @gedankenstuecke and @trvinh - having a look

the travis failure should just be because Travis doesn't share environment variables on pull request builds, so no worries. do check that it passes locally for you though, and i'll check as well

sckott · 2017-10-02T20:21:14Z

opened issue for tests #635

basically, start R in the taxize directory, then run e.g.

library(devtools)
library(testthat)
load_all()
# to test an individual test file
test_file("tests/testthat/test-class2tree.R")
# to run all tests
devtools::test()

may want to run Sys.setenv(NOT_CRAN = "true") before running tests as well, as many are skipped on CRAN, and you do want to run them locally presumably
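Put together, a local run restricted to the class2tree tests could look like the sketch below. It assumes devtools is installed and that R was started in the package root (the directory containing DESCRIPTION); the guard simply skips the test run otherwise.

```r
# Tell testthat we are not on CRAN, so skip_on_cran() tests are run too.
Sys.setenv(NOT_CRAN = "true")
not_cran <- identical(Sys.getenv("NOT_CRAN"), "true")

# Only run when devtools is available and we are inside a package directory.
if (requireNamespace("devtools", quietly = TRUE) &&
    file.exists("DESCRIPTION")) {
  # filter matches test file names, here tests/testthat/test-class2tree.R
  devtools::test(filter = "class2tree")
}
```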

sckott

thanks for this! made some specific comments inline, a few more general ones:

please make width of both code and comments be 80 characters - i am not yet consistent in this pkg, but trying to get there
since a lot of code is being added, i imagine there's more edge cases? do add a few more tests to test suite

sckott · 2017-10-02T20:37:01Z

R/class2tree.R

  dat <- rbind.fill(lapply(input, class2tree_helper))
-  df <- dat[ , !apply(dat, 2, function(x) any(is.na(x))) ]
-  if (!inherits(df, "data.frame")) {


fine to remove this, but is there now no chance of resulting in no ranks in common?

Uhm, I don't think so. Since we take even the root rank into account, any 2 taxa will always have at least one rank in common. Therefore I think checking for !inherits(df, "data.frame") is not necessary.

sckott · 2017-10-02T20:39:56Z

R/class2tree.R

+    colnames(mTaxonDf) <- c(taxonName[1],"rank")
+
+    ### merge with index2RankDf (Df contains all available ranks from input data)
+    fullRankIDdf <- merge(fullRankIDdf,mTaxonDf, by=c("rank"), all.x = T)


Please always use TRUE and FALSE throughout instead of T and F

I will check and make all the changes that you requested.
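As background for the TRUE/FALSE request: T and F are ordinary variables that can be reassigned, while TRUE and FALSE are reserved words. A short sketch of the difference:

```r
T <- 0                    # legal: T is just a variable that defaults to TRUE
masked <- isTRUE(T)       # FALSE, because T now holds 0
rm(T)                     # remove our copy; the base definition is visible again
restored <- isTRUE(T)     # TRUE

# Assigning to the reserved word TRUE fails when evaluated.
reserved_error <- inherits(
  tryCatch(eval(parse(text = "TRUE <- 0")), error = function(e) e),
  "error"
)
```

Code that passes `all.x = T` therefore depends on nobody having redefined T, which is why spelling out TRUE is the safer habit.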

sckott · 2017-10-02T20:41:23Z

R/class2tree.R

+  joinedDf <- within(joinedDf, rankDf[rankDf=='no rank'] <- paste0("norank_",idDf[rankDf=='no rank']))
+
+  df <- data.frame(t(data.frame(rev(joinedDf$rankDf))), stringsAsFactors = FALSE)
+  outDf <- data.frame(tip = x[nrow(x), "name"], df, stringsAsFactors = FALSE)


Is there a reason this fxn and the next below don't explicitly return any output?

@sckott I think you don't mean that they should use a return() statement, but just have an implicit return, i.e., a last line of

outDf
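A small sketch of the distinction, using two hypothetical toy functions: a function whose last expression is an assignment still returns that value, but invisibly, whereas a bare final expression returns it visibly.

```r
# Last expression is an assignment: the value is returned invisibly.
ends_with_assignment <- function(x) {
  out <- x * 2
}

# Last expression is the object itself: the value is returned visibly.
ends_with_value <- function(x) {
  out <- x * 2
  out
}

res_assign <- withVisible(ends_with_assignment(2))
res_value <- withVisible(ends_with_value(2))
```

Both calls produce the same value; only the visibility flag differs, which matters mostly at the console and for readability.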

gedankenstuecke · 2017-10-02T20:51:28Z

Thanks for the hints on how to run single files. I looked desperately for that earlier today and couldn't find it. The class2tree tests pass on my machine.

I got some errors/warnings when running the whole of it

downstream: ...........W
…
get_gbifid: ..W..........................
get_ids: ......WW.W..
get_natservid: ........12
get_tpsid: ..................
get_tsn: ..W
…
iucn_id: 3.4
iucn_summary: 56..78
…
tax_rank: W

Also there were some tests that waited for user-input somehow when I ran your commands, I just proceeded by pressing <enter>.

From the logs:

Warnings -----------------------------------------------------------------------
1. downstream - multiple data sources (@test-downstream.R#43) - > 1 result; no direct match found

2. get_gbifid accepts ask-argument (@test-get_gbifid.R#18) - More than one GBIF ID found for taxon 'Dugesia'; refine query or set ask=TRUE

3. works on a variety of names (@test-get_ids.R#40) - More than one id found for taxon 'Boissiera squarrosa'; refine query or set ask=TRUE

4. works on a variety of names (@test-get_ids.R#40) - More than one tpsid found for taxon 'Boissiera squarrosa'; refine query or set ask=TRUE

5. works on a variety of names (@test-get_ids.R#41) - More than one tpsid found for taxon 'Arthrostylidium pubescens'; refine query or set ask=TRUE

6. get_tsn accepts ask and verbose arguments (@test-get_tsn.R#19) - > 1 result; no direct match found

7. tax_rank returns the correct class (@test-tax_rank.R#8) - > 1 result; no direct match found

Failed -------------------------------------------------------------------------
1. Failure: get_natservid fails well (@test-get_natservid.R#36) ----------------
get_natservid("Ruby*", "common", rows = "foobar", verbose = FALSE) did not throw an error.

2. Failure: get_natservid fails well (@test-get_natservid.R#38) ----------------
get_natservid("Ruby*", "common", rows = 0, verbose = FALSE) did not throw an error.
…
3. Error: iucn_id returns the correct class (@test-iucn_id.R#7) ----------------

4. Error: iucn_id fails well (@test-iucn_id.R#24) ------------------------------
…
5. Error: iucn_summary returns the correct value (@test-iucn_summary.R#7) ------
…
6. Error: iucn_summary gives expected result for lots of names (@test-iucn_summary.R#22)
…
7. Failure: iucn_summary and iucn_summary_id fail well (@test-iucn_summary.R#59)
error$message does not match "Not Found".
Actual value: "need an API key for Red List data"
…
8. Error: iucn_summary and iucn_summary_id fail well (@test-iucn_summary.R#61) -
need an API key for Red List data

sckott · 2017-10-02T20:54:16Z

@gedankenstuecke see the additional comment above about NOT_CRAN #634 (comment)

sckott · 2017-10-02T20:56:11Z

tests do currently take a long time to run, so you may only want to deal with the tests for this fxn. (eventually we will mock HTTP requests and it will be much better, but not here yet)

yes, many tests go through a get*() fxn and sometimes those have >1 option, and ideally in tests we would avoid those = will be fixed hopefully soon

gedankenstuecke · 2017-10-02T21:00:49Z

Ah, i ran the tests before the edit was made about cran, will try again!

gedankenstuecke · 2017-10-04T11:38:05Z

I started addressing @sckott's comments and made sure we consistently use TRUE and wrapped all lines to 80 characters width.

gedankenstuecke · 2017-10-04T13:08:42Z

Okay, I think it should now address all of @sckott's comments 🙂
I changed the following:

added the explicit return statements to the two functions where they were missing
added a check that class2tree function will throw an error if there are duplicate taxa in the classification input.
added a test that checks that you get a proper error when there are duplicate taxa (the original function crashed without a proper error too).
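A duplicate check of this kind could be sketched as follows. The helper name `check_duplicate_taxa()` and the error message are hypothetical and not taken from the PR; the sketch assumes each list element is a lineage data frame whose last row holds the tip taxon, as in the diff above.

```r
# Hypothetical helper: stop with a clear message on duplicate tip taxa.
check_duplicate_taxa <- function(input) {
  # The tip taxon is the name in the last row of each lineage.
  tips <- vapply(input, function(x) x[nrow(x), "name"], character(1))
  dups <- unique(tips[duplicated(tips)])
  if (length(dups) > 0) {
    stop("Input contains duplicate taxa: ",
         paste(dups, collapse = ", "), call. = FALSE)
  }
  invisible(tips)
}

# Toy lineages, hand-written for illustration.
human <- data.frame(name = c("Homo", "Homo sapiens"),
                    rank = c("genus", "species"),
                    stringsAsFactors = FALSE)
chimp <- data.frame(name = c("Pan", "Pan troglodytes"),
                    rank = c("genus", "species"),
                    stringsAsFactors = FALSE)

ok <- check_duplicate_taxa(list(human, chimp))
dup_result <- tryCatch(check_duplicate_taxa(list(human, human)),
                       error = function(e) e)
```

Failing early with a named list of the offending taxa is friendlier than letting the clustering step crash later without a proper error.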


gedankenstuecke · 2017-10-04T16:24:37Z

We had overlooked some details on how the classification is returned from class2tree, this is fixed now and there are no more WIP PR from our side. 👍

sckott · 2017-10-07T01:02:37Z

all looks good, thanks much @gedankenstuecke and @trvinh !

gedankenstuecke · 2017-10-07T07:46:13Z

Thanks for all the support @sckott! And congratulations @trvinh! That’s an amazing first open source contribution! 😍🎉🍾

trvinh · 2017-10-07T07:49:59Z

thanks so much, @sckott, you made my day :) And also thanks to @gedankenstuecke, we did it 👍

sckott · 2017-10-07T20:06:34Z

Indeed, great first contribution @trvinh ! (& thx of course to you too @gedankenstuecke )

trvinh and others added 10 commits October 2, 2017 12:32

replace class2tree.R

15db2d2

first commit

cac465a

Merge pull request #1 from trvinh/class2tree-replace

d1c2b2a

replace class2tree.R

adding zoo (maybe?)

cf4d476

adding zoo try 2 (maybe?)

1832a98

import transpose (maybe?)

cb16911

added missing data.tables declaration

0b5c8d0

corrected classification output

693dc91

fix the classification data type

56c3f03

Merge pull request #2 from trvinh/class2tree-replace

e5c7d3d

fix the classification data type

sckott suggested changes Oct 2, 2017


adhering to style guide

0cadd5e

gedankenstuecke added 3 commits October 4, 2017 14:35

added return statements

356f13d

stop on duplicate taxa

4c0c9b4

extended class2tree-test

ed42729

gedankenstuecke added 2 commits October 4, 2017 16:21

rename variables, fix $classification, remove camel case

a0b2aae

Fixes following things:
* removes CamelCase for snake_thing
* renames variables to remove confusing `list` references that describe data frames
* fixes output of `class2tree$classification` to be the expected one.

class2tree $classification correct

90fad80

Before we gave out the internal, wrongly labelled levels. Now we have the right labels and return NA for missing data as expected.

sckott merged commit ca16afe into ropensci:master Oct 7, 2017

sckott added this to the v1.0 milestone Oct 7, 2017

gedankenstuecke mentioned this pull request Oct 7, 2017

Improving the class2tree resolution #611

Closed

gedankenstuecke mentioned this pull request Oct 9, 2017

Making a first release BIONF/PhyloProfile#43

Closed

sckott mentioned this pull request Nov 22, 2017

Node names for class2tree #644

Closed

sckott modified the milestones: v1.0, v0.9.4 Jan 19, 2018
