
Enhancing bdchecks: a biodiversity data quality checks system in R

Zupiter edited this page Mar 27, 2020 · 6 revisions

Background

The heterogeneous nature of biodiversity data and the exponential growth of data volume result in problems with data completeness, consistency, and reliability. During the last two decades, various tools have been developed to improve data quality, yet most data users still struggle to assess data quality and to compare quality issues between datasets. bdchecks is a set of two R packages that serve as a holistic system for performing, developing, and managing biodiversity data checks. bdchecks offers features for different types of R users:

  • An interactive and user-friendly Shiny app for inexperienced R users.
  • Full command-line functionality for more experienced R users.
  • An Admin app for advanced R users to easily edit, test, add, and manage their collection of data checks.

Related work

We view bdchecks as a data-checks 'factory', and we are therefore developing an architecture that makes it easy to create, document, test, maintain, and manage hundreds of different data checks, coupled with several user interfaces. To the best of our knowledge, this is uncharted territory in R, which means that the entire infrastructure had to be built from the ground up, with a great deal of creativity and novelty. With each new development cycle, we have been able to enhance the mechanisms of bdchecks and address different issues, and continuous improvement will remain part of its life cycle. bdchecks is a core component of the bdverse, a family of R packages that form a general framework for facilitating biodiversity data science. One might even call it the beating heart of the bdverse: the more robust bdchecks is, the more powerful other bdverse features can become. This project focuses on the software engineering aspect of bdchecks and on the development of new features and capabilities.

Details of your coding project

A bdcheck object (i.e., a data check) is built as an S4 object (S4 is one of R's object-oriented programming systems), which is well suited to building large systems that can evolve over time (Wickham 2019). Each check is created from a YAML file that holds all the necessary metadata and an R function file. The R documentation (roxygen2 comments) is generated automatically from the metadata. In addition, a data test YAML file stores all the testing scenarios for each check, and each testing scenario is automatically converted to a unit test. A testing report, comparing the expected result of each scenario with its observed result, can be generated by running a single function (perform_test_dc()). The key tasks of your coding project are:

  • Becoming sufficiently familiar with the bdchecks packages and the bdverse architecture.
  • Formulating a standardized code structure and coding best practices for checks.
  • Developing automatic flowcharts of each check's code to make the code logic easy to review (e.g., using funflow() or DiagrammeR).
  • Experimenting with incorporating checks from other packages (e.g., CoordinateCleaner, scrubr).
  • Developing meta-checks (i.e., checks on checks as quality indices).
  • Establishing collaborations with biodiversity data checks key players (e.g., TDWG, Open science lab for biodiversity, OBIS).
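
The check structure described above (metadata plus a function, with scenarios compared against observed results) can be illustrated with a minimal sketch in base R. Everything in it is an assumption made for illustration: the class name DataCheckSketch, its slots, the latitude_in_range check, and the run_scenarios() helper are hypothetical and do not reproduce the actual bdchecks classes, YAML schema, or perform_test_dc(). A plain list stands in for the YAML metadata file so that the example has no dependencies beyond the methods package.

```r
library(methods)

# Hypothetical S4 class for a data check; slot names are illustrative only
# and are not the bdchecks class definition.
setClass(
  "DataCheckSketch",
  slots = c(
    name        = "character",
    description = "character",
    target      = "character",  # column the check inspects
    fun         = "function"    # returns TRUE (pass) or FALSE (fail) per value
  )
)

# In bdchecks the metadata lives in a YAML file; a list stands in for it here.
meta <- list(
  name        = "latitude_in_range",
  description = "Latitude must lie between -90 and 90 degrees.",
  target      = "decimalLatitude"
)

check <- new(
  "DataCheckSketch",
  name        = meta$name,
  description = meta$description,
  target      = meta$target,
  fun         = function(x) !is.na(x) & x >= -90 & x <= 90
)

# Documentation lines derived from the metadata (roxygen2-style comments).
roxygen_lines <- c(
  paste0("#' @title ", meta$name),
  paste0("#' @description ", meta$description)
)

# Test scenarios: each row pairs an input value with its expected result.
scenarios <- data.frame(
  input    = c(45.2, -91, NA, 90),
  expected = c(TRUE, FALSE, FALSE, TRUE)
)

# Hypothetical helper: run the check on every scenario and compare
# expected with observed results.
run_scenarios <- function(check, scenarios) {
  observed <- check@fun(scenarios$input)
  data.frame(
    check    = check@name,
    input    = scenarios$input,
    expected = scenarios$expected,
    observed = observed,
    ok       = observed == scenarios$expected
  )
}

report <- run_scenarios(check, scenarios)
print(report)
```

Keeping the metadata, the function, and the scenarios as separate pieces is what lets documentation and tests be derived mechanically; a meta-check could, under the same assumptions, be a function that inspects such a report rather than the data itself.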

Skills Required

R software engineering; R package development; TDD.
Advantage: experience in working with biodiversity data.

Expected impact

bdchecks has the potential to centralize the effort to develop a sustainable infrastructure for biodiversity data quality checks in R. This will promote better scientific software engineering, better user experience, and the engagement of domain experts.

Mentors

Students, please contact mentors below after completing at least one of the tests below.

  • Povilas Gibas povilasgibas@gmail.com is bdchecks papa and a Ph.D. candidate in biomedical data analysis, expected to graduate in 2020. Povilas joined bdverse in 2018 as a Google Summer of Code student, and since then he has been perfecting the data-analysis workflows of bdchecks and bdDwC. He is interested in statistical computing, computational data-analysis workflows, and data visualization. When he is not working, you can find him on stackoverflow.com, where he learns and helps others learn new things about R.

  • Tomer Gueta tomer.gu@gmail.com is the founding director of the bdverse project. He is a postdoctoral fellow at the Faculty of Civil and Environmental Engineering at the Technion, working with Prof. Yohay Carmel. His research deals with developing tools and methodologies for data-intensive biodiversity research. Over the last three years, Tomer has served as a GSoC mentor with the R project organization.

  • Thiloshon Nagarajah thiloshon@gmail.com is the Shiny lead of the bdverse development team. He is a past GSoC and GCI student for Fedora Project, Sahana Foundation, and R Language. Thiloshon joined bdverse as a Google Summer of Code student developer in 2017 and has since been a student, contributor, mentor, and now a core member of the bdverse team. All things Shiny in bdverse are the magic of Thiloshon.

  • Vijay Barve vijay.barve@gmail.com is the author and maintainer of bdvis and a key member in the bdverse development team. Vijay is a biodiversity data scientist and has been a GSoC student and mentor since 2012 with the R project organization. Vijay has contributed to several packages on CRAN.

Tests

Students, please do one or more of the following tests before contacting the mentors above. We designed these tests to be incorporated into your proposal rather quickly.

  • Medium: review the bdchecks code and documentation, and write an R Markdown document describing the bdchecks architecture in your own words. We encourage you to read the bdchecks draft manuscript.
  • Medium: fork bdchecks, add one new data check (whatever you want), and submit a PR.
  • Hard: write an R Markdown document describing in detail your ideas for a data check code structure and coding best practices. Give a concrete example from the bdchecks code, along with practical suggestions.

Solutions of tests

Students, please post here a GitHub link to your test solutions in the format:

  1. Name - Email - University - Link to solutions
  2. Rudra Patil - rudrapatil1@gmail.com - Vellore Institute of Technology (VIT) - https://github.com/Rudra-Patil/Gsoc-2020-Enhancing-bdchecks
  3. Aakash Khandelwal - aakashkhandelwal3021@gmail.com - Indian Institute of Technology Kanpur - https://github.com/Zupiter/GSOC2020_R_Enhancing_bdchecks
  4. Martynas Jočys - martjocys@gmail.com - Vilnius University (VU) - Solutions