name_backbone_checklist: unclear documentation or suspect behavior for non-exact matches #564

damianooldoni · 2022-11-30T13:40:57Z

I am very happy you added the function name_backbone_checklist() to rgbif! 👍
I had no idea you were interested in something like this. It will make our own inborutils::gbif_species_name_match() unnecessary, and we will deprecate it soon.

While using name_backbone_checklist() I found it slightly strange that the verbose arg is described as:

(logical) should the matching return non-exact matches

but the following occurs:

case1: `verbose` is `FALSE`

Fuzzy matches are returned, although I have always thought of fuzzy matches as non-exact matches.
See example below:

library(rgbif)

name_data <- data.frame(
  name = c(
    "Cirsium arvense (L.) Scop.", # a plant
    "Puma concuolor (Linnaeus, 1771)", # a mis-spelled big cat
    "Fake species (John Waller 2021)", # a fake species
    "Calopteryx" # Just a Genus   
  ), description = c(
    "a plant",
    "a mis-spelled big cat",
    "a fake species",
    "just a GENUS"
  ), 
  kingdom = c(
    "Plantae",
    "Animalia",
    "Johnlia",
    "Animalia"
  )
)

output <- name_backbone_checklist(name_data)
output[,c("scientificName", "verbatim_name", "matchType")]
#> # A tibble: 4 × 3
#>   scientificName                 verbatim_name                   matchType
#>   <chr>                          <chr>                           <chr>    
#> 1 Cirsium arvense (L.) Scop.     Cirsium arvense (L.) Scop.      EXACT    
#> 2 Puma concolor (Linnaeus, 1771) Puma concuolor (Linnaeus, 1771) FUZZY    
#> 3 <NA>                           Fake species (John Waller 2021) NONE     
#> 4 Calopteryx Leach, 1815         Calopteryx                      EXACT
Created on 2022-11-30 with reprex v2.0.2

case2: `verbose` is `TRUE` and `matchType` is `NONE`

If verbose is TRUE I get, as expected, more rows. However, these new rows have matchType equal to EXACT or FUZZY, which seems to contradict the documentation of the verbose arg. As a result, filtering on matchType == "EXACT" gives different results depending on the value of the verbose arg. The only rule I found to identify these "suspect" matches in the output df is that they share a verbatim_index value with a row where matchType is NONE.

Example:

library(rgbif)
name_data <- data.frame(
  name = c(
    "Cirsium arvense (L.) Scop.", # a plant
    "Puma concuolor (Linnaeus, 1771)", # a mis-spelled big cat
    "Fake species (John Waller 2021)" # a fake species
  ), description = c(
    "a plant",
    "a mis-spelled big cat",
    "a fake species"
  ), 
  kingdom = c(
    "Plantae",
    "Animalia",
    "Johnlia"
  )
)

output <- name_backbone_checklist(name_data, verbose = TRUE)
output[,c("scientificName", "verbatim_name", "matchType")]
#> # A tibble: 6 × 3
#>   scientificName                 verbatim_name                   matchType
#>   <chr>                          <chr>                           <chr>    
#> 1 Cirsium arvense (L.) Scop.     Cirsium arvense (L.) Scop.      EXACT    
#> 2 Cirsium arcense Scop.          Cirsium arvense (L.) Scop.      FUZZY    
#> 3 Cirsium apoense Nakai          Cirsium arvense (L.) Scop.      FUZZY    
#> 4 Puma concolor (Linnaeus, 1771) Puma concuolor (Linnaeus, 1771) FUZZY    
#> 5 <NA>                           Fake species (John Waller 2021) NONE     
#> 6 Faku Péringuey, 1916           Fake species (John Waller 2021) FUZZY
no_match_idx <- subset(output, matchType == "NONE")$verbatim_index

no_match_output_verbatim <- subset(
  output,
  verbatim_index %in% no_match_idx & matchType != "NONE"
)

no_match_output_verbatim
#> # A tibble: 1 × 26
#>   usageKey scientifi…¹ canon…² rank  status confi…³ match…⁴ kingdom phylum order
#>      <int> <chr>       <chr>   <chr> <chr>    <int> <chr>   <chr>   <chr>  <chr>
#> 1  1725165 Faku Périn… Faku    GENUS SYNON…      39 FUZZY   Animal… Arthr… Orth…
#> # … with 16 more variables: family <chr>, genus <chr>, species <chr>,
#> #   kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
#> #   familyKey <int>, genusKey <int>, speciesKey <int>, synonym <lgl>,
#> #   class <chr>, acceptedUsageKey <int>, verbatim_name <chr>,
#> #   verbatim_kingdom <chr>, verbatim_index <dbl>, and abbreviated variable
#> #   names ¹scientificName, ²canonicalName, ³confidence, ⁴matchType
Created on 2022-11-30 with reprex v2.0.2
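
A possible way to get one row per input name even with verbose = TRUE is to keep only the first row for each verbatim_index. This is a minimal sketch, not rgbif behaviour: it assumes the first row per verbatim_index is the match that would be returned with verbose = FALSE (true for the output above, but not confirmed by the docs). The data frame is hand-copied from the output above so that it runs without calling the API, and the verbatim_index values 1, 2, 3 are assumed.

```r
# Hand-copied subset of the verbose = TRUE output shown above.
# The verbatim_index values are an assumption (one index per input name).
out <- data.frame(
  scientificName = c(
    "Cirsium arvense (L.) Scop.",
    "Cirsium arcense Scop.",
    "Cirsium apoense Nakai",
    "Puma concolor (Linnaeus, 1771)",
    NA,
    "Faku P\u00e9ringuey, 1916"
  ),
  matchType = c("EXACT", "FUZZY", "FUZZY", "FUZZY", "NONE", "FUZZY"),
  verbatim_index = c(1, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Keep the first row per input name; the remaining rows are treated as
# rejected alternatives (assumption, see the text above).
best <- out[!duplicated(out$verbatim_index), ]
best[, c("scientificName", "matchType", "verbatim_index")]
```

With this filter the fake species keeps its NONE row and the "Faku" alternative is dropped, which is the same result as with verbose = FALSE for these three names.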

case3: `verbose` is `TRUE` and `matchType` is `FUZZY`

One exact match with verbose = FALSE:

library(rgbif)
name_data <- data.frame(
  name = c(
    "Calopteryx" # Just a Genus   
  ), description = c(
    "just a GENUS"
  ), 
  kingdom = c(
    "Animalia"
  )
)

output <- name_backbone_checklist(name_data, verbose = FALSE)
output[,c("scientificName", "verbatim_name", "matchType", "kingdom")]
#> # A tibble: 1 × 4
#>   scientificName         verbatim_name matchType kingdom 
#>   <chr>                  <chr>         <chr>     <chr>   
#> 1 Calopteryx Leach, 1815 Calopteryx    EXACT     Animalia
Created on 2022-11-30 with reprex v2.0.2

versus four exact matches with verbose = TRUE (plus three fuzzy ones, which I would expect, as verbose is meant to show non-exact matches):

library(rgbif)
name_data <- data.frame(
  name = c(
    "Calopteryx" # Just a Genus   
  ), description = c(
    "just a GENUS"
  ), 
  kingdom = c(
    "Animalia"
  )
)

output <- name_backbone_checklist(name_data, verbose = TRUE)
output[,c("scientificName", "verbatim_name", "matchType", "kingdom")]
#> # A tibble: 7 × 4
#>   scientificName                  verbatim_name matchType kingdom 
#>   <chr>                           <chr>         <chr>     <chr>   
#> 1 Calopteryx Leach, 1815          Calopteryx    EXACT     Animalia
#> 2 Calopteryx Vander Linden, 1825  Calopteryx    EXACT     Animalia
#> 3 Calopteryx de Charpentier, 1839 Calopteryx    EXACT     Animalia
#> 4 Colopteryx O.Hofmann, 1898      Calopteryx    FUZZY     Animalia
#> 5 Calepteryx Leach, 1815          Calopteryx    FUZZY     Animalia
#> 6 Colopteryx Ridgway, 1888        Calopteryx    FUZZY     Animalia
#> 7 Calopteryx A.C.Sm.              Calopteryx    EXACT     Plantae
Created on 2022-11-30 with reprex v2.0.2
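
To make the difference easier to see, the rows can be counted per input name and match type. The sketch below uses a hand-copied version of the verbose = TRUE output above, so it runs offline; verbatim_index = 1 is an assumption, based on there being a single input name.

```r
# Hand-copied from the verbose = TRUE output above (no API call needed).
# verbatim_index = 1 is assumed, since only one name was submitted.
out <- data.frame(
  scientificName = c(
    "Calopteryx Leach, 1815",
    "Calopteryx Vander Linden, 1825",
    "Calopteryx de Charpentier, 1839",
    "Colopteryx O.Hofmann, 1898",
    "Calepteryx Leach, 1815",
    "Colopteryx Ridgway, 1888",
    "Calopteryx A.C.Sm."
  ),
  matchType = c("EXACT", "EXACT", "EXACT", "FUZZY", "FUZZY", "FUZZY", "EXACT"),
  kingdom = c(rep("Animalia", 6), "Plantae"),
  verbatim_index = 1,
  stringsAsFactors = FALSE
)

# Count candidate rows per input name and match type.
match_counts <- as.data.frame(
  table(verbatim_index = out$verbatim_index, matchType = out$matchType),
  stringsAsFactors = FALSE
)
match_counts
```

One input name yields four EXACT rows here, one of them in kingdom Plantae although Animalia was supplied, so a plain filter on matchType == "EXACT" no longer returns a single row per name.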

Conclusion

I hope I explained the issue clearly enough. Improving the documentation of the verbose arg would help, but I am not sure it will be enough. Maybe reshaping the output with verbose = TRUE could help (see #515)?

Thanks a lot!

Session Info

> devtools::session_info()
─ Session info ───────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23 ucrt)
 os       Windows 10 x64 (build 19044)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  Dutch_Belgium.utf8
 ctype    Dutch_Belgium.utf8
 tz       Europe/Paris
 date     2022-11-30
 rstudio  2022.07.2+576 Spotted Wakerobin (desktop)
 pandoc   NA

─ Packages ───────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
 cachem        1.0.6   2021-08-19 [1] CRAN (R 4.2.1)
 callr         3.7.3   2022-11-02 [1] CRAN (R 4.2.2)
 cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.1)
 colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.1)
 conditionz    0.1.0   2019-04-24 [1] CRAN (R 4.2.1)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.2.1)
 data.table    1.14.4  2022-10-17 [1] CRAN (R 4.2.2)
 DBI           1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
 devtools      2.4.5   2022-10-11 [1] CRAN (R 4.2.2)
 digest        0.6.30  2022-10-18 [1] CRAN (R 4.2.2)
 dplyr         1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.1)
 fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.1)
 fortunes      1.5-4   2016-12-29 [1] CRAN (R 4.2.0)
 fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.1)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
 ggplot2       3.4.0   2022-11-04 [1] CRAN (R 4.2.2)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
 gtable        0.3.1   2022-09-01 [1] CRAN (R 4.2.1)
 htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
 htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.2.1)
 httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.2.1)
 httr          1.4.4   2022-08-17 [1] CRAN (R 4.2.1)
 jsonlite      1.8.3   2022-10-21 [1] CRAN (R 4.2.2)
 later         1.3.0   2021-08-18 [1] CRAN (R 4.2.1)
 lazyeval      0.2.2   2019-03-15 [1] CRAN (R 4.2.1)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.2.1)
 mime          0.12    2021-09-28 [1] CRAN (R 4.2.0)
 miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.1)
 oai           0.4.0   2022-11-10 [1] CRAN (R 4.2.2)
 pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
 pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.2.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
 pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.2.1)
 plyr          1.8.8   2022-11-11 [1] CRAN (R 4.2.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.2.1)
 processx      3.8.0   2022-10-26 [1] CRAN (R 4.2.2)
 profvis       0.3.7   2020-11-02 [1] CRAN (R 4.2.1)
 promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
 ps            1.7.2   2022-10-26 [1] CRAN (R 4.2.2)
 purrr         0.3.5   2022-10-06 [1] CRAN (R 4.2.2)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
 Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.2.1)
 remotes       2.4.2   2021-11-30 [1] CRAN (R 4.2.1)
 rgbif       * 3.7.3   2022-09-03 [1] CRAN (R 4.2.1)
 rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.1)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.2.1)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
 shiny         1.7.3   2022-10-25 [1] CRAN (R 4.2.2)
 stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
 stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
 tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
 urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.2.1)
 usethis       2.1.6   2022-05-25 [1] CRAN (R 4.2.1)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
 uuid          1.1-0   2022-04-19 [1] CRAN (R 4.2.0)
 vctrs         0.5.0   2022-10-22 [1] CRAN (R 4.2.2)
 whisker       0.4     2019-08-28 [1] CRAN (R 4.2.1)
 xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.1)
 xtable        1.8-4   2019-04-21 [1] CRAN (R 4.2.1)

 [1] C:/R/library
 [2] C:/R/R-4.2.1/library


jhnwllr · 2022-12-01T09:21:05Z

@damianooldoni thanks for pointing this out.

verbose

I am not sure I understand exactly what verbose=TRUE does in the name matching API.

According to the API docs "verbose" means:
https://www.gbif.org/developer/species#p_verbose

If true it shows alternative matches which were considered but then rejected

I think I will adopt this language for the rgbif docs.

strict

strict=TRUE will remove fuzzy matches from the results. I should probably add this option for name_backbone_checklist.

There is already a strict arg for name_backbone:

rgbif::name_backbone("Puma concuolor (Linnaeus, 1771)",strict=TRUE)
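
Until name_backbone_checklist has such an argument, a post-hoc filter on matchType is one possible workaround. This is only a sketch: it drops FUZZY rows from a data frame that has already been returned, and it is not claimed to be equivalent to what strict does in the API. The data frame is hand-copied from the verbose = FALSE output in the first comment so that it runs offline.

```r
# Hand-copied from the verbose = FALSE output in the issue description.
out <- data.frame(
  verbatim_name = c(
    "Cirsium arvense (L.) Scop.",
    "Puma concuolor (Linnaeus, 1771)",
    "Fake species (John Waller 2021)",
    "Calopteryx"
  ),
  matchType = c("EXACT", "FUZZY", "NONE", "EXACT"),
  stringsAsFactors = FALSE
)

# Drop fuzzy matches after the fact; the mis-spelled name disappears
# from the result instead of being reported as unmatched.
no_fuzzy <- subset(out, matchType != "FUZZY")
no_fuzzy
```

Note that the filtered name is removed entirely here rather than kept as a NONE row, so the result has fewer rows than the input.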

Any comments @mdoering?

mdoering · 2022-12-01T09:25:03Z

makes 100% sense

damianooldoni changed the title ~~Unclear documentation or suspect behavior of non-exact matches~~ name_backbone_checklist: unclear documentation or suspect behavior of non-exact matches Nov 30, 2022

damianooldoni changed the title ~~name_backbone_checklist: unclear documentation or suspect behavior of non-exact matches~~ name_backbone_checklist: unclear documentation or suspect behavior for non-exact matches Nov 30, 2022

jhnwllr added this to the 3.7.4 milestone Dec 1, 2022

jhnwllr added the documentation label Dec 1, 2022

This was referenced Dec 1, 2022

Add strict arg to name_backbone_checklist #565

Closed

updating name_backbone_checklist docs verbose arg #567

Merged

jhnwllr closed this as completed Dec 1, 2022

jhnwllr mentioned this issue Dec 6, 2022

Preparing for CRAN release v3.7.4 #573

Merged
