class2tree replacement #634

Merged
merged 16 commits into from Oct 7, 2017

gedankenstuecke
Contributor

As discussed in #611, this update will use the full NCBI Taxonomy information to create the hierarchical clustering.

Description

The old class2tree function did not take unnamed ranks into account when clustering the species. This left many splits unresolved, because the taxa involved shared all of their named taxonomy levels. The new class2tree function makes full use of the NCBI Taxonomy string, including the unnamed ranks, leading to higher-resolution trees that have fewer multifurcations.
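
To illustrate the point about unnamed ranks, here is a minimal base-R sketch, not the code from this PR. The data frame only mimics the name/rank/id columns of a classification() result, with hand-written values chosen for illustration. Relabelling each `no rank` row with its taxon ID follows the `norank_` idea used in the changed code.

```r
# Hand-written stand-in for one element of a classification() result
# (columns name, rank, id); values are illustrative, not fetched from NCBI.
lineage <- data.frame(
  name = c("cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa"),
  rank = c("no rank", "superkingdom", "no rank", "kingdom"),
  id   = c("131567", "2759", "33154", "33208"),
  stringsAsFactors = FALSE
)

# Only the named ranks remain usable if "no rank" rows are ignored.
named_only <- lineage$rank[lineage$rank != "no rank"]

# Give every unnamed rank a unique label built from its taxon ID,
# so it can serve as its own level when lineages are compared.
unnamed <- lineage$rank == "no rank"
lineage$rank[unnamed] <- paste0("norank_", lineage$id[unnamed])

print(named_only)
print(lineage$rank)
```

With the unnamed levels kept, two lineages that agree on every named rank can still differ at a `norank_` level, which is what allows the extra splits to be resolved.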

After many months of discussion @trvinh and I (mostly him!) have figured out how to do this in R instead of Python! 😂

Related Issue

#611

Example

A full example can be found in this gist

In short:

library("taxize")

spnames <- c('Homo_sapiens','Pan_troglodytes','Macaca_mulatta','Mus_musculus','Rattus_norvegicus','Bos_taurus','Canis_lupus','Ornithorhynchus_anatinus','Xenopus_tropicalis','Takifugu_rubripes','Gallus_gallus','Ciona_intestinalis','Branchiostoma_floridae','Schistosoma_mansoni','Caenorhabditis_elegans','Anopheles_gambiae','Drosophila_melanogaster','Ixodes_scapularis','Ustilago_maydis','Neurospora_crassa','Monodelphis_domestica','Danio_rerio','Nematostella_vectensis','Cryptococcus_neoformans')

outTaxize <- classification(spnames, db='ncbi')
tree <- class2tree(outTaxize)
plot(tree)

will now yield

[image: plot of the resulting tree]

As we only updated an existing function, we didn't include new tests. To be honest, we also couldn't find out how to run the tests so far, so an update to the docs on that would be appreciated.

@gedankenstuecke
Contributor Author

After the initial commit we made the following changes:

  • added zoo to the DESCRIPTION as an additional dependency, so that we can use na.locf
  • modified taxize-package.R to import transpose from data.table, complete.cases from stats, and na.locf from zoo
  • ran roxygen2 to regenerate the NAMESPACE file
  • fixed the wrong classification output to make sure that we adhere to the expected class2tree output
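
The import changes listed above could look roughly like this in taxize-package.R. This is a sketch of the usual roxygen2 `@importFrom` pattern, not a copy of the actual file; the three function/package pairs are the ones named in the list.

```r
# Sketch only: roxygen2 turns tags of this form into importFrom(...)
# directives in NAMESPACE when the documentation is regenerated.
#' @importFrom data.table transpose
#' @importFrom stats complete.cases
#' @importFrom zoo na.locf
NULL
```

Each package imported from this way also needs to be listed in DESCRIPTION, which is why zoo was added there.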

We don't fully understand why the other tests fail for us though.

@sckott
Contributor

sckott commented Oct 2, 2017

thanks @gedankenstuecke and @trvinh - having a look

the travis failure should just be because Travis doesn't share environment variables on pull request builds, so no worries. do check that it passes locally for you though, and i'll check as well

@sckott
Contributor

sckott commented Oct 2, 2017

opened issue for tests #635

basically, start R in the taxize directory, then run e.g.

library(devtools)
library(testthat)
load_all()
# to test an individual test file
test_file("tests/testthat/test-class2tree.R")
# to run all tests
devtools::test()

may want to run Sys.setenv(NOT_CRAN = "true") before running tests as well, as many are skipped on CRAN, and you do want to run them locally presumably
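
Putting the two pieces together, a local session might start like this. It is a sketch that assumes the skipped tests are guarded by testthat's skip_on_cran(), which lets a test run when the NOT_CRAN environment variable is "true". The devtools call is left commented out so the snippet also runs outside the taxize directory.

```r
# Tell testthat that this is a local (non-CRAN) run.
Sys.setenv(NOT_CRAN = "true")

# Confirm the variable is visible to the current R session.
not_cran <- identical(Sys.getenv("NOT_CRAN"), "true")
message("CRAN-skipped tests will ", if (not_cran) "run" else "be skipped")

# Then, from the taxize directory:
# devtools::test()
```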

@sckott
Contributor

sckott left a comment

thanks for this! made some specific comments inline, a few more general ones:

  • please make width of both code and comments be 80 characters - i am not yet consistent in this pkg, but trying to get there
  • since a lot of code is being added, i imagine there's more edge cases? do add a few more tests to test suite

R/class2tree.R Outdated
dat <- rbind.fill(lapply(input, class2tree_helper))
df <- dat[ , !apply(dat, 2, function(x) any(is.na(x))) ]
if (!inherits(df, "data.frame")) {

Contributor

fine to remove this, but is there now no chance of resulting in no ranks in common?

Contributor

uhm, I don't think so. Since we take even the root rank into account, any 2 taxa will always have at least one rank in common. Therefore I think checking for !inherits(df, "data.frame") is not necessary.

Contributor

ok, thanks
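
For context on what an `inherits(df, "data.frame")` check can catch, here is a base-R sketch with made-up data. The column names are hypothetical, and this is not claimed to be the reason the check was originally there. Column subsetting with `[` simplifies to a plain vector when exactly one column survives, unless `drop = FALSE` is given.

```r
# Made-up rank table: only the root column is free of NA.
dat <- data.frame(
  root  = c("cellular organisms", "cellular organisms"),
  genus = c("Homo", NA),
  stringsAsFactors = FALSE
)
keep <- !apply(dat, 2, function(x) any(is.na(x)))

simplified <- dat[, keep]                # one column left: plain vector
preserved  <- dat[, keep, drop = FALSE]  # stays a data.frame

print(inherits(simplified, "data.frame"))
print(inherits(preserved, "data.frame"))
```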

R/class2tree.R Outdated
colnames(mTaxonDf) <- c(taxonName[1],"rank")

### merge with index2RankDf (Df contains all available ranks from input data)
fullRankIDdf <- merge(fullRankIDdf,mTaxonDf, by=c("rank"), all.x = T)

Contributor

Please always use TRUE and FALSE throughout instead of T and F

Contributor

I will check and make all the changes that you requested.

Contributor

thx
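
The reasoning behind the TRUE/FALSE request can be shown in a few lines of base R: `T` and `F` are ordinary variables that can be reassigned, while `TRUE` and `FALSE` are reserved words.

```r
# T is just a variable bound to TRUE by default, so it can be overwritten.
T <- 0
t_is_true <- isTRUE(T)

# TRUE is a reserved word; trying to assign to it is an error.
res <- try(eval(parse(text = "TRUE <- 0")), silent = TRUE)
assign_failed <- inherits(res, "try-error")

rm(T)  # remove the override so T resolves to base::T again
```

Code that passes `all.x = T` therefore changes behaviour if something earlier in the session has redefined `T`, whereas `all.x = TRUE` cannot be affected.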

R/class2tree.R Outdated
joinedDf <- within(joinedDf, rankDf[rankDf=='no rank'] <- paste0("norank_",idDf[rankDf=='no rank']))

df <- data.frame(t(data.frame(rev(joinedDf$rankDf))), stringsAsFactors = FALSE)
outDf <- data.frame(tip = x[nrow(x), "name"], df, stringsAsFactors = FALSE)

Contributor

Is there a reason this fxn and the next below don't explicitly return any output?

Contributor

@sckott I think you don't mean that they should use a return() statement, but just have an implicit return, i.e., as the last line

outDf
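
A small base-R sketch of the difference being discussed (the function names are hypothetical): a function whose last expression is an assignment still returns that value, but invisibly, whereas ending on the object itself returns it visibly.

```r
# Last expression is an assignment: value is returned, but invisibly.
ends_in_assignment <- function() {
  out_df <- data.frame(tip = "Homo sapiens", stringsAsFactors = FALSE)
}

# Last expression is the object itself: implicit, visible return.
ends_in_object <- function() {
  out_df <- data.frame(tip = "Homo sapiens", stringsAsFactors = FALSE)
  out_df
}

vis_assignment <- withVisible(ends_in_assignment())$visible
vis_object     <- withVisible(ends_in_object())$visible
same_value     <- identical(ends_in_assignment(), ends_in_object())
```

Both versions hand back the same data frame to a caller that assigns the result; the visible form is simply easier to read and prints when called at the console.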

@gedankenstuecke
Contributor Author

Thanks for the hints on how to run single files. I looked desperately for that earlier today and couldn't find it. The class2tree tests pass on my machine.

I got some errors/warnings when running the whole of it

downstream: ...........W
…
get_gbifid: ..W..........................
get_ids: ......WW.W..
get_natservid: ........12
get_tpsid: ..................
get_tsn: ..W
…
iucn_id: 3.4
iucn_summary: 56..78
…
tax_rank: W

Also there were some tests that waited for user-input somehow when I ran your commands, I just proceeded by pressing <enter>.

From the logs:

Warnings -----------------------------------------------------------------------
1. downstream - multiple data sources (@test-downstream.R#43) - > 1 result; no direct match found

2. get_gbifid accepts ask-argument (@test-get_gbifid.R#18) - More than one GBIF ID found for taxon 'Dugesia'; refine query or set ask=TRUE

3. works on a variety of names (@test-get_ids.R#40) - More than one id found for taxon 'Boissiera squarrosa'; refine query or set ask=TRUE

4. works on a variety of names (@test-get_ids.R#40) - More than one tpsid found for taxon 'Boissiera squarrosa'; refine query or set ask=TRUE

5. works on a variety of names (@test-get_ids.R#41) - More than one tpsid found for taxon 'Arthrostylidium pubescens'; refine query or set ask=TRUE

6. get_tsn accepts ask and verbose arguments (@test-get_tsn.R#19) - > 1 result; no direct match found

7. tax_rank returns the correct class (@test-tax_rank.R#8) - > 1 result; no direct match found

Failed -------------------------------------------------------------------------
1. Failure: get_natservid fails well (@test-get_natservid.R#36) ----------------
get_natservid("Ruby*", "common", rows = "foobar", verbose = FALSE) did not throw an error.

2. Failure: get_natservid fails well (@test-get_natservid.R#38) ----------------
get_natservid("Ruby*", "common", rows = 0, verbose = FALSE) did not throw an error.
…
3. Error: iucn_id returns the correct class (@test-iucn_id.R#7) ----------------

4. Error: iucn_id fails well (@test-iucn_id.R#24) ------------------------------
…
5. Error: iucn_summary returns the correct value (@test-iucn_summary.R#7) ------
…
6. Error: iucn_summary gives expected result for lots of names (@test-iucn_summary.R#22)
…
7. Failure: iucn_summary and iucn_summary_id fail well (@test-iucn_summary.R#59)
error$message does not match "Not Found".
Actual value: "need an API key for Red List data"
…
8. Error: iucn_summary and iucn_summary_id fail well (@test-iucn_summary.R#61) -
need an API key for Red List data

@sckott
Contributor

sckott commented Oct 2, 2017

@gedankenstuecke see additional comment above about not cran #634 (comment)

@sckott
Contributor

sckott commented Oct 2, 2017

tests do currently take a long time to run, so you may only want to deal with the tests for this fxn. (eventually we will mock HTTP requests and it will be much better, but not here yet)

yes, many tests go through a get*() fxn and sometimes those have >1 option, and ideally in tests we would avoid those; will be fixed hopefully soon

@gedankenstuecke
Contributor Author

Ah, i ran the tests before the edit was made about cran, will try again!

@gedankenstuecke
Contributor Author

I started addressing @sckott's comments and made sure we consistently use TRUE and wrapped all lines to 80 characters width.

@gedankenstuecke
Contributor Author

Okay, I think it should now address all of @sckott's comments 🙂
I changed the following:

  • added the explicit return statements to the two functions where they were missing
  • added a check that class2tree function will throw an error if there are duplicate taxa in the classification input.
  • added a test that checks that you get a proper error when there are duplicate taxa (the original function crashed without a proper error too).
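
A duplicate check of this kind could be sketched as follows. The function name and error message are hypothetical, not the ones added in this PR; the idea is only to fail early with a readable message when the same taxon appears more than once.

```r
# Hypothetical helper: stop with a clear message on duplicated taxon names.
check_duplicate_taxa <- function(taxa) {
  dups <- unique(taxa[duplicated(taxa)])
  if (length(dups) > 0) {
    stop("duplicate taxa in input: ", paste(dups, collapse = ", "),
         call. = FALSE)
  }
  invisible(taxa)
}

# Usage: a unique input passes, a duplicated one raises an error.
check_duplicate_taxa(c("Homo sapiens", "Pan troglodytes"))
err <- tryCatch(
  check_duplicate_taxa(c("Homo sapiens", "Homo sapiens")),
  error = function(e) conditionMessage(e)
)
print(err)
```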

Fixes the following things:
* removes CamelCase for snake_thing
* renames variables to remove confusing `list` references that describe data frames
* fixes output of `class2tree$classification` to be the expected one. Before, we gave out the internal, wrongly labelled levels. Now we have the right labels and return NA for missing data as expected.

@gedankenstuecke
Contributor Author

We had overlooked some details on how the classification is returned from class2tree. This is fixed now, and nothing in this PR is still a work in progress from our side.

@sckott
Contributor

sckott commented Oct 7, 2017

all looks good, thanks much @gedankenstuecke and @trvinh !

@sckott merged commit ca16afe into ropensci:master Oct 7, 2017
@sckott added this to the v1.0 milestone Oct 7, 2017

@gedankenstuecke
Contributor Author

Thanks for all the support @sckott! And congratulations @trvinh! That's an amazing first open source contribution! 🎉

@trvinh
Contributor

trvinh commented Oct 7, 2017

thanks so much, @sckott, you made my day :) And also thanks to @gedankenstuecke, we did it

@sckott
Contributor

sckott commented Oct 7, 2017

Indeed, great first contribution @trvinh ! (& thx of course to you too @gedankenstuecke )

4 participants