Pre-submission inquiry: CoordinateCleaner #199

Closed
azizka opened this issue Mar 1, 2018 · 5 comments

azizka commented Mar 1, 2018

Hi,

I'd like to submit the CoordinateCleaner package. It's a package for automated cleaning of biological and paleontological collection data, useful for conservation, ecology and evolutionary biology. There is overlap with scrubr, but CoordinateCleaner has a lot of additional functions. So hopefully it could still be appropriate?

In particular the package adds:

  • additional record-level tests to identify problems (capitals, urban areas, oceans, biodiversity institutions)
  • a novel geo-referenced database of ~10,000 biodiversity institutions, with a related app to add more at http://biodiversity-institutions.surge.sh/
  • novel algorithms to test for problems unidentifiable on record level, in particular conversion errors in coordinate annotation and rasterized sampling schemes
  • functions suited to clean fossil occurrences
  • tutorials on presenting pipelines to clean coordinates from the biggest public data-providers in the field (www.gbif.org, www.paleobiodb.org)
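To make the idea of a record-level test concrete, here is a minimal sketch of flagging records that fall within a buffer around reference points such as country capitals. It is written in Python purely for illustration and is not the package's R implementation; the function names, the 10 km default buffer, and the tiny gazetteer are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flag_near_reference(records, reference_points, buffer_km=10.0):
    """Return one boolean per record: True if it lies within buffer_km of any reference point."""
    return [
        any(haversine_km(lon, lat, rlon, rlat) <= buffer_km for rlon, rlat in reference_points)
        for lon, lat in records
    ]


# Hypothetical one-entry gazetteer (approximate lon/lat of Stockholm) and two records
capitals = [(18.07, 59.33)]
records = [(18.06, 59.33), (11.97, 57.71)]
flags = flag_near_reference(records, capitals)
```

The same pattern would apply to any point gazetteer (institutions, centroids); only the reference list and the buffer change.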

CoordinateCleaner is on CRAN and I had discussed it briefly with @sckott when he was in Stockholm in October 2016. There is also a manuscript ready for submission to Methods in Ecology and Evolution linked with the package.

Thanks,
Alex

Summary

  • What does this package do? (explain in 50 words or less):
    Scan data sets of recent and fossil species occurrence records for geo-referencing and dating imprecision and data-entry errors in a standardized and reproducible way.

  • Paste the full DESCRIPTION file inside a code block below:

```
Package: CoordinateCleaner
Type: Package
Title: Automated Cleaning of Occurrence Records from Biological Collections
Version: 1.0-7
Date: 2018-03-01
Authors@R: c(person(given = "Alexander", family = "Zizka", email = "alexander.zizka@bioenv.gu.se",
                    role = c("aut", "cre")),
             person(given = "Daniele", family = "Silvestro", role = c("ctb")))
Description: Automated cleaning of geographic species occurrence records by automated flagging of problems common to biodiversity data from biological collections. Includes automated tests to easily flag (and exclude) records assigned to country or province centroid, the open ocean, the headquarters of the Global Biodiversity Information Facility, urban areas or the location of biodiversity institutions (museums, zoos, botanical gardens, universities). Furthermore identifies per species outlier coordinates, zero coordinates, identical latitude/longitude and invalid coordinates. Also implements an algorithm to identify data sets with a significant proportion of rounded coordinates. Especially suited for large data sets. See <https://github.com/azizka/CoordinateCleaner/wiki> for more details and tutorials.
License: GPL-3
Depends: R (>= 3.0.0), sp
Imports: geosphere, ggplot2, methods, raster, rgeos, rnaturalearth, stats
LazyData: true
RoxygenNote: 6.0.1
```

karthik commented Mar 1, 2018

Thanks @azizka. We will discuss this and get back to you.

sckott self-assigned this Mar 5, 2018

sckott commented Mar 6, 2018

hi @azizka - nice to see you here.

As I probably mentioned to you in Stockholm, we do have an overlap policy where we try not to have packages that overlap too much. But here I think we might be okay. What do you think are the areas of overlap for the two packages? Maybe we can just avoid overlapping functionality.

azizka commented Mar 22, 2018

Hi @sckott & @karthik , thanks!

I dug into both packages, and the overlap seems rather small after all. See below for a by-function comparison table with scrubr from CRAN (please correct it if I missed something).

In two sentences: The aim of both packages is identical, namely to improve the quality of occurrence records from large databases, but beyond that there is actually little overlap. A few basic functionalities are virtually identical; otherwise scrubr includes date and taxonomic cleaning, while CoordinateCleaner includes many unique features for coordinates and fossils, and enables custom gazetteers and custom precision for the match with political centroids and capitals.

Sorry for the slow reply, I was out of office.

| Functionality | CoordinateCleaner 1.0-7 | scrubr 0.1.1 | Percent overlap |
|---|---|---|---|
| Missing coordinates | cc_val | coord_incomplete | 100% |
| Coordinates outside CRS | cc_val | coord_impossible | 100% |
| Duplicated records | cc_dupl | dedup | The aim is identical, methods differ |
| 0/0 coordinates | cc_zero | coord_unlikely | 100% |
| Identical lon/lat | cc_equ | - | 0% |
| Country capitals | cc_cap | - | 0% |
| Political unit centroids | cc_cen | "not ready yet" | 0% |
| Coordinates incongruent with additional location information | cc_count | coord_within | 100% |
| Coordinates assigned to GBIF headquarters | cc_gbif | - | 0% |
| Coordinates assigned to the location of biodiversity institutions | cc_inst | - | 0% |
| Spatial outliers | cc_outl | - | 0% |
| Coordinates within the ocean | cc_sea | - | 0% |
| Coordinates in urban area | cc_urb | - | 0% |
| Coordinate conversion error | dc_ddmm | - | 0% |
| Rounded coordinates/rasterized collection | dc_round | - | 0% |
| Fossils: invalid age range | tc_equal | - | 0% |
| Fossils: excessive age range | tc_range | - | 0% |
| Fossils: temporal outlier | tc_outl | - | 0% |
| Fossils: PyRate interface | WritePyrate | - | 0% |
| Wrapper functions to run all tests | CleanCoordinates, CleanCoordinatesDS, CleanCoordinatesFOS | - | 0% |
| Database of biodiversity institutions | institutions | - | 0% |
| Taxonomic cleaning | - | tax_no_epithet | 0% |
| Missing date | - | date_missing | 0% |
| Add date | - | date_create | 0% |
| Date format | - | date_standardize | 0% |
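
For readers unfamiliar with the basic checks in the comparison (missing coordinates, coordinates outside the valid range, 0/0 coordinates, identical lon/lat), the following is a rough sketch of what such tests amount to. It is plain Python for illustration only and does not reproduce either package's R functions; the function name and flag labels are hypothetical, and the exact-match zero test is a simplifying assumption.

```python
import math


def basic_coordinate_flags(lon, lat):
    """Return a list of problem labels for one lon/lat pair (decimal degrees)."""
    # Missing: None or NaN in either coordinate; no further tests are possible
    if lon is None or lat is None or any(
        isinstance(v, float) and math.isnan(v) for v in (lon, lat)
    ):
        return ["missing"]

    flags = []
    # Outside the valid range of a lon/lat coordinate reference system
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        flags.append("out_of_range")
    # Exactly 0/0, a common data-entry placeholder (assumption: exact match, no buffer)
    if lon == 0 and lat == 0:
        flags.append("zero")
    # Identical longitude and latitude, often a copy-paste error
    if lon == lat:
        flags.append("equal_lon_lat")
    return flags
```

Note that in this sketch a 0/0 record receives both the "zero" and the "equal_lon_lat" label, since it satisfies both conditions.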

sckott commented Apr 6, 2018

thanks for this @azizka - Sorry about the delay in responding. We agree that the overlap isn't sufficient to warrant concern. We'd like you to submit the package for review. Open a new issue and fill out the issue template you'll see.

sckott commented Apr 6, 2018

closing this, looking forward to your submission
