get_names result order #78

Closed
lisafisler opened this issue Sep 22, 2020 · 9 comments

@lisafisler

lisafisler commented Sep 22, 2020

Hello,

My issue is quite simple: get_names() returns the correct names when I give it species identifiers (here from "itis", but it's the same with "col"), but the results come back in the wrong order. For example, "ITIS:715228", which is the species Megapodius decollatus, appears as the first element in the second request below, although it should be second. This problem does not occur with get_ids(), which returns results in the right order.

library(tidyverse)
library(taxadb)
td_create("itis")
get_names("ITIS:715228")
[1] "Megapodius decollatus"
get_names(c("ITIS:553896", "ITIS:715228", NA))
[1] "Megapodius decollatus" "Falcipennis canadensis" NA

Thank you for your help with this issue.

For info, my sessionInfo() gives out:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4 readr_1.3.1 tidyr_1.1.1 tibble_3.0.3
[8] ggplot2_3.3.2 tidyverse_1.3.0 taxadb_0.1.0

loaded via a namespace (and not attached):
[1] progress_1.2.2 tidyselect_1.1.0 haven_2.3.1 colorspace_1.4-1 vctrs_0.3.2 generics_0.0.2
[7] yaml_2.2.1 blob_1.2.1 rlang_0.4.7 pillar_1.4.6 glue_1.4.1 withr_2.2.0
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.2 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
[19] lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0
[25] curl_4.3 fansi_0.4.1 broom_0.7.0 arkdb_0.0.5 Rcpp_1.0.5 backports_1.1.8
[31] scales_1.1.1 jsonlite_1.7.0 fs_1.5.0 bit_4.0.4 hms_0.5.3 digest_0.6.25
[37] stringi_1.4.6 duckdb_0.2.1 grid_4.0.2 cli_2.0.2 tools_4.0.2 magrittr_1.5
[43] RSQLite_2.2.0 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 prettyunits_1.1.1
[49] reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11 R6_2.4.1
[55] compiler_4.0.2

@cboettig
Member

Wow, that's crazy! Sorry about that. I'm having trouble reproducing this.

Does it do that without the NA entry too?

Here's my sessionInfo:


> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] taxadb_0.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        pillar_1.4.6      compiler_4.0.2    dbplyr_1.4.4      R.methodsS3_1.8.1 prettyunits_1.1.1 R.utils_2.10.1    tools_4.0.2       progress_1.2.2    bit_4.0.4         digest_0.6.25     packrat_0.5.0     MonetDBLite_0.6.1 RSQLite_2.2.0     jsonlite_1.7.1    evaluate_0.14     memoise_1.1.0     lifecycle_0.2.0   tibble_3.0.3     
[20] pkgconfig_2.0.3   rlang_0.4.7       DBI_1.1.0         rstudioapi_0.11   curl_4.3          yaml_2.2.1        xfun_0.17         duckdb_0.2.1      arkdb_0.0.6       dplyr_1.0.2       knitr_1.29        rappdirs_0.3.1    generics_0.0.2    vctrs_0.3.4       hms_0.5.3         bit64_4.0.5       tidyselect_1.1.0  glue_1.4.2        R6_2.4.1         
[39] rmarkdown_2.3     readr_1.3.1       purrr_0.3.4       blob_1.2.1        magrittr_1.5      codetools_0.2-16  ellipsis_0.3.1    htmltools_0.5.0   assertthat_0.2.1  stringi_1.5.3     crayon_1.3.4      R.oo_1.24.0      

@cboettig
Member

P.S. You may already know this, but in the meantime use filter_id etc. to get the full table rather than relying on the ordering in get_names(). You might also try updating your packages to the latest versions (e.g. via update.packages()).
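As a rough illustration of that workaround, here is a minimal base-R sketch of recovering names in input order from a full table. The data frame `tbl` is a hypothetical stand-in for what filter_id might return, and the column names `taxonID` and `scientificName` are assumptions, so check the columns of your actual result before relying on this.

```r
# Input ids, in the order we want the names back.
ids <- c("ITIS:553896", "ITIS:715228", NA)

# Hypothetical stand-in for a table returned by filter_id(); the rows are
# deliberately out of input order, as a database backend might return them.
# Column names are assumed, not taken from the taxadb documentation.
tbl <- data.frame(
  taxonID = c("ITIS:715228", "ITIS:553896"),
  scientificName = c("Megapodius decollatus", "Falcipennis canadensis"),
  stringsAsFactors = FALSE
)

# match() gives, for each input id, the row of tbl that holds it (NA if
# the id is absent), so indexing by it follows the input order.
names_in_order <- tbl$scientificName[match(ids, tbl$taxonID)]
print(names_in_order)
```

Because the lookup is keyed on the identifier, the result does not depend on the order in which the rows come back.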

@lisafisler
Author

Unfortunately, it does the same with or without the NA. I have updated every package I could, with no change.

Thanks, yes, it works well with filter_id instead, even though it takes one more step to get to the information I want. Keep me posted if you find the problem; I will work with filter_id in the meantime.

@cboettig
Member

cboettig commented Sep 22, 2020

Thanks. Some database backends don't enforce consistent row ordering. I've added an additional command to assert a consistent order; can you please test again with the dev version?

remotes::install_github("ropensci/taxadb")
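For readers wondering what enforcing a consistent order can look like, here is a minimal base-R sketch of one common approach: carry a sort index along with the input, join, then sort on that index. This is not necessarily the change made in taxadb; the table `taxa` and the column names `id`, `name`, and `sort` are hypothetical.

```r
# Input ids, in the order the caller wants the results back.
ids <- c("ITIS:715228", "ITIS:553896", NA)
input <- data.frame(id = ids, sort = seq_along(ids), stringsAsFactors = FALSE)

# Hypothetical lookup table standing in for the taxonomic database.
taxa <- data.frame(
  id = c("ITIS:553896", "ITIS:715228"),
  name = c("Falcipennis canadensis", "Megapodius decollatus"),
  stringsAsFactors = FALSE
)

# Left join. By default merge() sorts on the key, so the input order is
# lost here, much as it can be with a database join.
joined <- merge(input, taxa, by = "id", all.x = TRUE)

# Sorting on the carried index restores the input order.
result <- joined[order(joined$sort), "name"]
print(result)
```

The extra sort adds work on top of the join, which may be part of why an order-preserving lookup can feel slower.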

@lisafisler
Author

Great, thanks! It seems to have done the trick! Hurray :-)

The only trouble I see is that get_names seems a bit slower now than with the previous version. It's only slightly slower, but since I can clearly see the difference with my small dataset of 3 species, I am worried that the slowdown would grow considerably with a huge dataset. But maybe this extra step always takes the same amount of time no matter how many species there are, in which case it wouldn't add much and wouldn't be a problem in the end.

@cboettig
Member

Thanks! Interesting that it's noticeably slower. I think you won't see that scale linearly with a very large number of names. Can you tell me what td_connect() shows?

If the speed of get_names is important to your workflow, you could be our first beta tester for https://github.com/cboettig/taxalight/ ? 😊

@lisafisler
Author

It gives the result almost instantly with the original version, and it takes approximately 3 seconds for one get_names request with the dev version.

> td_connect()
<duckdb_connection 27f60 driver=<duckdb_driver 05450 dbdir='/Users/lisafisler/Library/Application Support/taxadb/database/duckdb' read_only=FALSE>>

I don't really have a very big database, I was just concerned for people who do. But I'd be happy to test taxalight anyway! What are the main differences with taxadb?

@cboettig
Member

taxalight has only get_names(), get_ids(), and tl (which returns the taxonomic table for the requested species and/or ids). You can't do operations on the full taxonomic database with taxalight, like asking "how many names are in the class Aves". It's also stricter about matching: a scientific name must match case exactly, and there are no 'starts with' or similar options.

At the moment it only has accepted taxonomic identifiers and scientific names available as queries. We can probably add query by common name and query by synonym identifier (for authorities that assign IDs to synonyms).

@lisafisler
Author

Thanks! I'll give it a go.

@cboettig cboettig closed this as completed Mar 5, 2021