R Markdown

Peter Desmet edited this page Jan 30, 2019 · 8 revisions

The R Markdown file contains the R code and documentation to transform your source data to a Darwin Core checklist. It is the core functionality of the recipe and acts as executable instructions for your mapping. This is the file you will have to adapt the most. We also call it the "mapping script".

R Markdown?

An R Markdown file mixes text (written in Markdown) with executable code chunks (in R). It is comparable to an R script in which the comments explaining the code are given as much value as the code itself. This lets you describe and then execute each step of your data processing in the same document, nudging you to document what you are doing more thoroughly. This approach is called "literate programming" and it is one of the steps towards making research more reproducible. You can run the code of an R Markdown file by opening it in RStudio and choosing Run > Run All (or by running it code chunk by code chunk), or you can render it as a report using Knit. R Markdown supports a range of output formats for these reports (including HTML and PDF).

Where is it located?

The mapping script is called dwc_mapping.Rmd and is located in the src (= source) directory. It currently contains functional code (and documentation) to map the example source data to a taxon core and a distribution extension. You can try it yourself by simply running the code.

Sections

The mapping script is divided into sections, which are described below, along with whether and how you can adapt them:

Setup

This section includes the general settings in R, such as loading the packages required for the mapping process. We advise you to leave this part untouched as it sets the scene for the code to work.

Read source data

In this section, you import the source checklist as the data frame input_data. If you want to import another source data file, you will have to adapt this section. The next step is to inspect whether the source data has been imported correctly. The function head() returns the first rows of input_data, which allows you to quickly screen the content of the dataset. Alternatively, you can use View(input_data) in your RStudio console to see it as a formatted table.
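As a minimal sketch of this import-and-inspect step, the example below reads a small CSV file into input_data and screens it with head(). The file content, the column names and the use of base R's read.csv() are assumptions made for illustration; the actual mapping script may read a different file format with a different function.

```r
# Hypothetical source data, written to a temporary file so the sketch is self-contained
source_file <- tempfile(fileext = ".csv")
writeLines(
  c(
    "scientific_name,kingdom,country",
    "Acer negundo,Plantae,BE",
    "Elodea canadensis,Plantae,BE",
    "Procyon lotor,Animalia,BE"
  ),
  source_file
)

# Import the source checklist as the data frame input_data
input_data <- read.csv(source_file, stringsAsFactors = FALSE)

# Inspect the first rows to check that the import worked as expected
head(input_data)
```

If you point source_file at your own data, check in the output of head() that the column names and the first values look as expected before continuing.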

Data preparation

Before you can start the mapping process, a few more preparatory steps are needed. Some of these steps are required, while other steps are optional and are only suggested here to improve the quality of your data. These steps include:

  1. Basic cleaning of input_data (optional) (see Tidy data)
  2. Basic cleaning of scientific names (optional) (see Scientific names)
  3. Addition of taxon ranks (required if you don't have these in your source data) (see Taxon ranks)
  4. Addition of taxon IDs (required if you don't have these in your source data) (see Taxon IDs)

These steps are explained in the Data preparation section of this wiki.
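To give a rough idea of what such preparatory steps can look like, here is a sketch that cleans whitespace in scientific names and adds a taxon rank and a taxon ID. The column names, the assumption that every record is a species, and the ID pattern "example-checklist:taxon:" are hypothetical; the Data preparation section describes the approach recommended for real checklists.

```r
library(dplyr)

# Hypothetical source data with untidy scientific names
input_data <- data.frame(
  scientific_name = c("  Acer negundo ", "Elodea  canadensis", "Procyon lotor"),
  stringsAsFactors = FALSE
)

input_data <- input_data %>%
  mutate(
    # Collapse repeated whitespace and trim leading/trailing whitespace
    scientific_name = trimws(gsub("\\s+", " ", scientific_name)),
    # Assumption: all records in this example are species
    taxon_rank = "species",
    # Hypothetical ID pattern: a fixed prefix followed by the row number
    taxon_id = paste0("example-checklist:taxon:", row_number())
  )

input_data
```

Row numbers are only stable as long as the order of the source data does not change, so treat this ID pattern as a placeholder rather than a recommendation.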

Mapping

The ultimate goal of the mapping script is to transform your source checklist to a Darwin Core Archive. This is what we call "mapping". For both the taxon core and any of the extensions you use, you will have to choose which Darwin Core terms to include and populate, depending on the scope and the content of your checklist. Note, however, that some terms in the taxon core are required (see GBIF data quality requirements). We distinguish three types of mapping:

  • Static: the value of this Darwin Core field is independent of the record in the checklist, i.e. the content of this field remains unchanged over the whole checklist. This mostly concerns metadata fields in the taxon core.
  • Unaltered (relative to the input data): the value of this Darwin Core field is an exact copy of the corresponding field in the input data.
  • Altered (relative to the input data): the value of this Darwin Core field is based on one or more fields in the input data. Altered values are used for Darwin Core terms for which the content of input_data serves as a basis but needs to be standardized or corrected.

For each type of mapping, we provide examples in the section Tidyverse functions, where you will also find more information on a number of functions you can use to make your mapping easy and readable.
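As a quick illustration of the three mapping types, the sketch below adds one static, one unaltered and one altered term with mutate(). The source columns (scientific_name, rank), their values and the dwc_ column naming are assumptions chosen for this example, not the content of the actual mapping script.

```r
library(dplyr)

# Hypothetical source data
input_data <- data.frame(
  scientific_name = c("Acer negundo", "Salix alba subsp. vitellina"),
  rank = c("sp.", "subsp."),
  stringsAsFactors = FALSE
)

taxon <- input_data %>%
  mutate(
    # Static: the same value for every record
    dwc_language = "en",
    # Unaltered: an exact copy of a field in the input data
    dwc_scientificName = scientific_name,
    # Altered: based on the input data, but standardized
    dwc_taxonRank = case_when(
      rank == "sp." ~ "species",
      rank == "subsp." ~ "subspecies"
    )
  )

taxon
```

Values of rank that match none of the conditions in case_when() become NA, which makes unexpected source values easy to spot.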

Whether it is a taxon core or an extension, mapping is sequential:

  1. You start a core or extension by copying the input_data to a new data frame with the name of the extension, e.g. taxon or distribution. In some cases, further processing is required to structure the data correctly, e.g. ignoring multiple distribution rows for a taxon core or transforming columns to rows for a description extension. Tidyverse functions like distinct(), union(), gather() and spread() exist to help you with this. See also example checklists.
  2. You then add Darwin Core terms as columns to the data frame, using the mutate() function. That way, your data frame contains both the original data and the newly added Darwin Core terms, allowing you to compare them.
  3. When all Darwin Core terms have been added to the data frame, you keep only those columns (which carry a dwc_ prefix) and drop that prefix. You write this data frame to a .csv file, which can then be uploaded to an IPT for publication.
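
The three steps above can be sketched as follows for a taxon core. The source columns, the choice of distinct() to collapse repeated rows, and the output file name taxon.csv in a temporary directory are assumptions for illustration; rename_with() requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical source data: one row per taxon and region
input_data <- data.frame(
  scientific_name = c("Acer negundo", "Acer negundo", "Procyon lotor"),
  kingdom = c("Plantae", "Plantae", "Animalia"),
  region = c("Flanders", "Wallonia", "Flanders"),
  stringsAsFactors = FALSE
)

# Step 1: start the core from input_data, keeping one row per taxon
taxon <- input_data %>%
  distinct(scientific_name, kingdom)

# Step 2: add Darwin Core terms as columns, next to the original data
taxon <- taxon %>%
  mutate(
    dwc_scientificName = scientific_name,
    dwc_kingdom = kingdom
  )

# Step 3: keep only the Darwin Core columns and drop their dwc_ prefix
taxon <- taxon %>%
  select(starts_with("dwc_")) %>%
  rename_with(~ sub("^dwc_", "", .x))

# Write the result to a .csv file (here in a temporary directory)
output_file <- file.path(tempdir(), "taxon.csv")
write.csv(taxon, output_file, row.names = FALSE, na = "")
```

Keeping the dwc_ prefix until the last step makes it easy to tell the original columns from the mapped ones while you are still comparing them.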