name_backbone_checklist: unclear documentation or suspect behavior for non-exact matches #564

Closed
damianooldoni opened this issue Nov 30, 2022 · 2 comments

@damianooldoni
Collaborator

I am very happy you added the function name_backbone_checklist() to rgbif! 👍
I had no idea you were interested in something like this. It will make our own inborutils::gbif_species_name_match() unnecessary and we will deprecate it soon.

While using name_backbone_checklist() I found it slightly strange that the verbose arg is described as:

(logical) should the matching return non-exact matches

but the following occurs:

case1: verbose is FALSE

Fuzzy matches are returned, although I have always considered fuzzy matches to be non-exact matches.
See example below:

library(rgbif)

name_data <- data.frame(
  name = c(
    "Cirsium arvense (L.) Scop.", # a plant
    "Puma concuolor (Linnaeus, 1771)", # a mis-spelled big cat
    "Fake species (John Waller 2021)", # a fake species
    "Calopteryx" # Just a Genus   
  ), description = c(
    "a plant",
    "a mis-spelled big cat",
    "a fake species",
    "just a GENUS"
  ), 
  kingdom = c(
    "Plantae",
    "Animalia",
    "Johnlia",
    "Animalia"
  )
)

output <- name_backbone_checklist(name_data)
output[,c("scientificName", "verbatim_name", "matchType")]
#> # A tibble: 4 × 3
#>   scientificName                 verbatim_name                   matchType
#>   <chr>                          <chr>                           <chr>    
#> 1 Cirsium arvense (L.) Scop.     Cirsium arvense (L.) Scop.      EXACT    
#> 2 Puma concolor (Linnaeus, 1771) Puma concuolor (Linnaeus, 1771) FUZZY    
#> 3 <NA>                           Fake species (John Waller 2021) NONE     
#> 4 Calopteryx Leach, 1815         Calopteryx                      EXACT
Created on 2022-11-30 with reprex v2.0.2

case2: verbose is TRUE and matchType is NONE

If verbose is TRUE I get, as expected, more rows. However, these new rows have matchType equal to EXACT or FUZZY, which seems to contradict the documentation of the verbose arg. As a result, filtering on matchType == "EXACT" gives different results depending on the value of verbose. The only logical rule I found to identify these "suspect" matches in the output df is that they share a verbatim_index value with a row where matchType is NONE.

Example:

library(rgbif)
name_data <- data.frame(
  name = c(
    "Cirsium arvense (L.) Scop.", # a plant
    "Puma concuolor (Linnaeus, 1771)", # a mis-spelled big cat
    "Fake species (John Waller 2021)" # a fake species
  ), description = c(
    "a plant",
    "a mis-spelled big cat",
    "a fake species"
  ), 
  kingdom = c(
    "Plantae",
    "Animalia",
    "Johnlia"
  )
)

output <- name_backbone_checklist(name_data, verbose = TRUE)
output[,c("scientificName", "verbatim_name", "matchType")]
#> # A tibble: 6 × 3
#>   scientificName                 verbatim_name                   matchType
#>   <chr>                          <chr>                           <chr>    
#> 1 Cirsium arvense (L.) Scop.     Cirsium arvense (L.) Scop.      EXACT    
#> 2 Cirsium arcense Scop.          Cirsium arvense (L.) Scop.      FUZZY    
#> 3 Cirsium apoense Nakai          Cirsium arvense (L.) Scop.      FUZZY    
#> 4 Puma concolor (Linnaeus, 1771) Puma concuolor (Linnaeus, 1771) FUZZY    
#> 5 <NA>                           Fake species (John Waller 2021) NONE     
#> 6 Faku Péringuey, 1916           Fake species (John Waller 2021) FUZZY
no_match_idx <- subset(output, matchType == "NONE")$verbatim_index

no_match_output_verbatim <- subset(
  output,
  verbatim_index %in% no_match_idx & matchType != "NONE"
)

no_match_output_verbatim
#> # A tibble: 1 × 26
#>   usageKey scientifi…¹ canon…² rank  status confi…³ match…⁴ kingdom phylum order
#>      <int> <chr>       <chr>   <chr> <chr>    <int> <chr>   <chr>   <chr>  <chr>
#> 1  1725165 Faku Périn… Faku    GENUS SYNON…      39 FUZZY   Animal… Arthr… Orth…
#> # … with 16 more variables: family <chr>, genus <chr>, species <chr>,
#> #   kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
#> #   familyKey <int>, genusKey <int>, speciesKey <int>, synonym <lgl>,
#> #   class <chr>, acceptedUsageKey <int>, verbatim_name <chr>,
#> #   verbatim_kingdom <chr>, verbatim_index <dbl>, and abbreviated variable
#> #   names ¹​scientificName, ²​canonicalName, ³​confidence, ⁴​matchType
Created on 2022-11-30 with reprex v2.0.2
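
A possible way to separate the main match from the alternatives in a verbose result is sketched below. It rests on an assumption that is not confirmed anywhere in this thread: that the first row for each verbatim_index is the match returned with verbose = FALSE. The data frame is typed in by hand from the output above so that the sketch runs offline, and its verbatim_index values (1, 1, 1, 2, 3, 3) are inferred from that output rather than taken from the API. The column is_alternative is a hypothetical name.

```r
# Hand-built copy of the verbose = TRUE output shown above (no API call).
# The verbatim_index values are inferred from that output, not queried.
verbose_output <- data.frame(
  scientificName = c(
    "Cirsium arvense (L.) Scop.",
    "Cirsium arcense Scop.",
    "Cirsium apoense Nakai",
    "Puma concolor (Linnaeus, 1771)",
    NA,
    "Faku Péringuey, 1916"
  ),
  matchType = c("EXACT", "FUZZY", "FUZZY", "FUZZY", "NONE", "FUZZY"),
  verbatim_index = c(1, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# ASSUMPTION: the first row per verbatim_index is the main match;
# every later row with the same index is a rejected alternative.
verbose_output$is_alternative <- duplicated(verbose_output$verbatim_index)

# Keeping only the first row per input name gives one row per name.
primary <- verbose_output[!verbose_output$is_alternative, ]
primary[, c("scientificName", "matchType", "verbatim_index")]
```

Under that assumption, primary has one row per input name with matchType EXACT, FUZZY and NONE, which is the same set of match types that the verbose = FALSE call returns for these three names.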

case3: verbose is TRUE and matchType is FUZZY

One exact match with verbose = FALSE:

library(rgbif)
name_data <- data.frame(
  name = c(
    "Calopteryx" # Just a Genus   
  ), description = c(
    "just a GENUS"
  ), 
  kingdom = c(
    "Animalia"
  )
)

output <- name_backbone_checklist(name_data, verbose = FALSE)
output[,c("scientificName", "verbatim_name", "matchType", "kingdom")]
#> # A tibble: 1 × 4
#>   scientificName         verbatim_name matchType kingdom 
#>   <chr>                  <chr>         <chr>     <chr>   
#> 1 Calopteryx Leach, 1815 Calopteryx    EXACT     Animalia
Created on 2022-11-30 with reprex v2.0.2

versus four exact matches with verbose = TRUE (plus three fuzzy matches, which I would expect, since verbose is meant to show non-exact matches):

library(rgbif)
name_data <- data.frame(
  name = c(
    "Calopteryx" # Just a Genus   
  ), description = c(
    "just a GENUS"
  ), 
  kingdom = c(
    "Animalia"
  )
)

output <- name_backbone_checklist(name_data, verbose = TRUE)
output[,c("scientificName", "verbatim_name", "matchType", "kingdom")]
#> # A tibble: 7 × 4
#>   scientificName                  verbatim_name matchType kingdom 
#>   <chr>                           <chr>         <chr>     <chr>   
#> 1 Calopteryx Leach, 1815          Calopteryx    EXACT     Animalia
#> 2 Calopteryx Vander Linden, 1825  Calopteryx    EXACT     Animalia
#> 3 Calopteryx de Charpentier, 1839 Calopteryx    EXACT     Animalia
#> 4 Colopteryx O.Hofmann, 1898      Calopteryx    FUZZY     Animalia
#> 5 Calepteryx Leach, 1815          Calopteryx    FUZZY     Animalia
#> 6 Colopteryx Ridgway, 1888        Calopteryx    FUZZY     Animalia
#> 7 Calopteryx A.C.Sm.              Calopteryx    EXACT     Plantae
Created on 2022-11-30 with reprex v2.0.2
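
To make the ambiguity concrete, the sketch below counts the EXACT rows in the verbose output above, first overall and then restricted to the kingdom that was supplied. The data frame is copied by hand from that output so that no API call is needed; nothing here is rgbif functionality.

```r
# Hand-built copy of the verbose = TRUE output for "Calopteryx" shown above.
calopteryx <- data.frame(
  scientificName = c(
    "Calopteryx Leach, 1815",
    "Calopteryx Vander Linden, 1825",
    "Calopteryx de Charpentier, 1839",
    "Colopteryx O.Hofmann, 1898",
    "Calepteryx Leach, 1815",
    "Colopteryx Ridgway, 1888",
    "Calopteryx A.C.Sm."
  ),
  matchType = c("EXACT", "EXACT", "EXACT", "FUZZY", "FUZZY", "FUZZY", "EXACT"),
  kingdom = c(rep("Animalia", 6), "Plantae"),
  stringsAsFactors = FALSE
)

# Filtering on matchType alone keeps all four homonyms.
exact_rows <- subset(calopteryx, matchType == "EXACT")
nrow(exact_rows)

# Restricting to the supplied kingdom still leaves three candidates.
exact_in_kingdom <- subset(exact_rows, kingdom == "Animalia")
nrow(exact_in_kingdom)
```

So even a filter on both matchType and kingdom does not recover the single row returned with verbose = FALSE.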

Conclusion

I hope I explained the issue clearly enough. Improving the documentation of the verbose arg can help, but I am not sure it will be enough. Maybe reshaping the output with verbose = TRUE could help (see #515)?

Thanks a lot!

Session Info
> devtools::session_info()
─ Session info ───────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23 ucrt)
 os       Windows 10 x64 (build 19044)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  Dutch_Belgium.utf8
 ctype    Dutch_Belgium.utf8
 tz       Europe/Paris
 date     2022-11-30
 rstudio  2022.07.2+576 Spotted Wakerobin (desktop)
 pandoc   NA

─ Packages ───────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
 cachem        1.0.6   2021-08-19 [1] CRAN (R 4.2.1)
 callr         3.7.3   2022-11-02 [1] CRAN (R 4.2.2)
 cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.1)
 colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.1)
 conditionz    0.1.0   2019-04-24 [1] CRAN (R 4.2.1)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.2.1)
 data.table    1.14.4  2022-10-17 [1] CRAN (R 4.2.2)
 DBI           1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
 devtools      2.4.5   2022-10-11 [1] CRAN (R 4.2.2)
 digest        0.6.30  2022-10-18 [1] CRAN (R 4.2.2)
 dplyr         1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.1)
 fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.1)
 fortunes      1.5-4   2016-12-29 [1] CRAN (R 4.2.0)
 fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.1)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
 ggplot2       3.4.0   2022-11-04 [1] CRAN (R 4.2.2)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
 gtable        0.3.1   2022-09-01 [1] CRAN (R 4.2.1)
 htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
 htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.2.1)
 httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.2.1)
 httr          1.4.4   2022-08-17 [1] CRAN (R 4.2.1)
 jsonlite      1.8.3   2022-10-21 [1] CRAN (R 4.2.2)
 later         1.3.0   2021-08-18 [1] CRAN (R 4.2.1)
 lazyeval      0.2.2   2019-03-15 [1] CRAN (R 4.2.1)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.2.1)
 mime          0.12    2021-09-28 [1] CRAN (R 4.2.0)
 miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.1)
 oai           0.4.0   2022-11-10 [1] CRAN (R 4.2.2)
 pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
 pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.2.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
 pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.2.1)
 plyr          1.8.8   2022-11-11 [1] CRAN (R 4.2.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.2.1)
 processx      3.8.0   2022-10-26 [1] CRAN (R 4.2.2)
 profvis       0.3.7   2020-11-02 [1] CRAN (R 4.2.1)
 promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
 ps            1.7.2   2022-10-26 [1] CRAN (R 4.2.2)
 purrr         0.3.5   2022-10-06 [1] CRAN (R 4.2.2)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
 Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.2.1)
 remotes       2.4.2   2021-11-30 [1] CRAN (R 4.2.1)
 rgbif       * 3.7.3   2022-09-03 [1] CRAN (R 4.2.1)
 rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.1)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.2.1)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
 shiny         1.7.3   2022-10-25 [1] CRAN (R 4.2.2)
 stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
 stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
 tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
 urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.2.1)
 usethis       2.1.6   2022-05-25 [1] CRAN (R 4.2.1)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
 uuid          1.1-0   2022-04-19 [1] CRAN (R 4.2.0)
 vctrs         0.5.0   2022-10-22 [1] CRAN (R 4.2.2)
 whisker       0.4     2019-08-28 [1] CRAN (R 4.2.1)
 xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.1)
 xtable        1.8-4   2019-04-21 [1] CRAN (R 4.2.1)

 [1] C:/R/library
 [2] C:/R/R-4.2.1/library
@damianooldoni damianooldoni changed the title Unclear documentation or suspect behavior of non-exact matches name_backbone_checklist: unclear documentation or suspect behavior of non-exact matches Nov 30, 2022
@damianooldoni damianooldoni changed the title name_backbone_checklist: unclear documentation or suspect behavior of non-exact matches name_backbone_checklist: unclear documentation or suspect behavior for non-exact matches Nov 30, 2022
@jhnwllr
Collaborator

jhnwllr commented Dec 1, 2022

@damianooldoni thanks for pointing this out.

verbose

I am not sure if I exactly understand what verbose=TRUE does in the name matching API.

According to the API docs "verbose" means:
https://www.gbif.org/developer/species#p_verbose

If true it shows alternative matches which were considered but then rejected

I think I will adopt this language for the rgbif docs.

strict

strict=TRUE will remove fuzzy matches from the results. I should probably add this option for name_backbone_checklist.

There is a strict arg for name_backbone

rgbif::name_backbone("Puma concuolor (Linnaeus, 1771)",strict=TRUE)
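
Until name_backbone_checklist has such an option, one possible workaround is to call name_backbone with strict = TRUE once per row. The sketch below is only an illustration: strict_checklist and its matcher argument are hypothetical names, not part of rgbif, and the sketch assumes that name_backbone returns a one-row table for each name. The matcher argument exists only so the helper can be tried with a stand-in function instead of a live API call.

```r
# Hypothetical helper (not part of rgbif): strict matching for a checklist,
# done by calling name_backbone(strict = TRUE) once per input row.
strict_checklist <- function(name_data, matcher = rgbif::name_backbone) {
  results <- lapply(seq_len(nrow(name_data)), function(i) {
    res <- as.data.frame(
      matcher(
        name = as.character(name_data$name[i]),
        kingdom = as.character(name_data$kingdom[i]),
        strict = TRUE
      ),
      stringsAsFactors = FALSE
    )
    # Keep track of which input row each result belongs to.
    res$verbatim_name <- as.character(name_data$name[i])
    res$verbatim_index <- i
    res
  })

  # Unmatched names may come back with fewer columns, so pad with NA
  # before stacking the per-name results.
  all_cols <- unique(unlist(lapply(results, names)))
  padded <- lapply(results, function(res) {
    for (col in setdiff(all_cols, names(res))) {
      res[[col]] <- NA
    }
    res[all_cols]
  })
  do.call(rbind, padded)
}

# Usage (requires network access and the rgbif package):
# output <- strict_checklist(name_data)
# output[, c("verbatim_name", "matchType")]
```

Calling the API once per name is slower than a single name_backbone_checklist call, so this is only reasonable for short lists.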

Any comments @mdoering?

@mdoering

mdoering commented Dec 1, 2022

makes 100% sense
