---
title: "Overview of Validation Workflows"
output: html_document
---
```{r options, message=FALSE, warning=FALSE, include=FALSE}
library(pointblank)
```

There are six validation workflows in **pointblank**:

1. [**VALID-I: Data Quality Reporting**](../articles/VALID-I.html)
2. [**VALID-II: Pipeline Data Validation**](../articles/VALID-II.html)
3. [**VALID-III: Expectations in Unit Tests**](../articles/VALID-III.html)
4. [**VALID-IV: Data Tests for Conditionals**](../articles/VALID-IV.html)
5. [**VALID-V: Table Scan**](../articles/VALID-V.html)
6. [**VALID-VI: R Markdown Document Validation**](../articles/VALID-VI.html)


The first workflow ([**VALID-I**](../articles/VALID-I.html)) is used for comprehensive reporting of the data quality of a target table. It typically uses as many validation functions as the user needs to reach an adequate level of validation coverage for that table. This is an *agent*-based workflow that uses: (1) the `create_agent()` function, (2) one or more validation functions, and (3) the `interrogate()` function. The *agent* generated by `create_agent()` is given a target table and accepts validation functions (e.g., `col_vals_gt()`, `col_is_numeric()`, `rows_distinct()`, etc.), building up a validation plan. The validations are not evaluated, and the resulting intel is not stored, until `interrogate()` is called. An *agent* object, whether pre- or post-interrogation, can be printed, yielding a *Validation Report*. A separate set of functions (e.g., the *Post-interrogation* functions, which include `get_data_extracts()`, `get_sundered_data()`, and more) can be called on the *agent* to collect the intel or variations/extracts of the target table.
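
The three steps of the agent-based workflow can be sketched as follows. This is a minimal illustration rather than canonical usage: the `orders` data frame, its columns, and the threshold of 100 are invented for the example, and `vars()` and `%>%` are assumed to be available once **pointblank** is loaded.

```r
library(pointblank)

# Hypothetical target table, invented for this illustration
orders <- data.frame(
  id = 1:5,
  amount = c(120.5, 310, 99.9, 450.25, 205)
)

# (1) create the agent, (2) add validation steps, (3) interrogate
agent <-
  create_agent(tbl = orders, tbl_name = "orders", label = "VALID-I sketch") %>%
  col_is_numeric(columns = vars(amount)) %>%
  col_vals_gt(columns = vars(amount), value = 100) %>%
  rows_distinct() %>%
  interrogate()

# Printing `agent` at the console would display the validation report

# Post-interrogation functions collect intel from the agent
passed_everything <- all_passed(agent)                  # one logical value
failing_rows <- get_data_extracts(agent)                # failing rows, by step
clean_rows <- get_sundered_data(agent, type = "pass")   # rows with no failures
```

Nothing is evaluated while the plan is being built; the checks only run when `interrogate()` is reached at the end of the pipe.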

The second workflow ([**VALID-II**](../articles/VALID-II.html)) is meant for repeated data-quality checks in a data-transformation pipeline that involves tabular data. The principal mode of operation is to use validation functions to either warn the user of unforeseen data integrity problems or stop the pipeline dead so that dependent, downstream processes (which would use the data to some extent) are never initiated. Both the **Data Quality Reporting** and the **Pipeline Data Validation** workflows use a common set of validation functions, but the latter doesn't use an *agent*, and the validations eagerly interrogate the data at each invocation.
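
A pipeline of this kind might look like the sketch below. The `orders` data frame and the thresholds are invented for the example. The sketch assumes that a validation function applied directly to a table returns that table when the check passes and raises an error when it fails, which is the stopping behavior described above.

```r
library(pointblank)

# Hypothetical pipeline input, invented for this illustration
orders <- data.frame(
  id = 1:4,
  amount = c(120.5, 310, 99.9, 450.25)
)

# Checks that pass hand the table on to the next pipeline step
checked <-
  orders %>%
  col_vals_not_null(columns = vars(amount)) %>%
  col_vals_gt(columns = vars(amount), value = 0)

# A failing check raises an error, so nothing downstream runs;
# tryCatch() is used here only to keep the example from halting
outcome <- tryCatch(
  {
    orders %>% col_vals_gt(columns = vars(amount), value = 200)
    "pipeline continued"
  },
  error = function(e) "pipeline stopped"
)
```

To warn rather than stop, the `actions` argument of a validation function can be given `warn_on_fail()`.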

Data can be tested in the same way that function output is tested by using the **Expectations in Unit Tests** workflow ([**VALID-III**](../articles/VALID-III.html)). This uses a suite of `expect_*()` functions that are analogous to the validation functions but with simplified interfaces. These functions are used directly on data (no *agent*) and they serve as tests in the **testthat** testing framework. Unit testing on data is important if your package functions produce or transform data, and the **testthat**-compatible functions offered by **pointblank** make testing data a little bit easier and a lot more precise.
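
As a rough sketch, expectations on data sit inside a `test_that()` block just like any other expectation. The `orders` data frame and the test description are invented for the example; the `expect_*()` calls mirror the validation functions of the same name.

```r
library(testthat)
library(pointblank)

# Hypothetical output of a package function, invented for this illustration
orders <- data.frame(
  id = 1:4,
  amount = c(120.5, 310, 99.9, 450.25)
)

test_that("order amounts are present, positive, and not duplicated", {
  expect_col_vals_not_null(orders, columns = vars(amount))
  expect_col_vals_gt(orders, columns = vars(amount), value = 0)
  expect_rows_distinct(orders)
})
```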

Evaluation of data can produce logical output (`TRUE`/`FALSE`) through use of the **Data Tests for Conditionals** workflow ([**VALID-IV**](../articles/VALID-IV.html)). This uses the analogous suite of `test_*()` functions. This workflow suits programming contexts where the result of a data validation should determine the code path, since these functions always return a logical value. As in the [**VALID-III**](../articles/VALID-III.html) workflow, the function signatures are simplified. In fact, the arguments for the complementary `expect_*()` and `test_*()` functions are the same.
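
A short sketch of branching on a data test is shown below. The `orders` data frame, the threshold, and the `status` messages are invented for the example.

```r
library(pointblank)

# Hypothetical input table, invented for this illustration
orders <- data.frame(
  id = 1:4,
  amount = c(120.5, 310, 99.9, 450.25)
)

# The test returns a single logical value, so it can drive an if/else
if (test_col_vals_gt(orders, columns = vars(amount), value = 100)) {
  status <- "all amounts above 100"
} else {
  status <- "some amounts at or below 100"
}
```

Because one value in `amount` is below 100, the `else` branch is taken here.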

A target table can be scanned and described with the **Table Scan** workflow ([**VALID-V**](../articles/VALID-V.html)). This is useful for getting table dimensions, important statistical values by column, interactions by column, and a view of missingness in easy-to-parse HTML output. Components of this HTML report can be reordered or omitted as needed.
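
A table scan might be requested as in the sketch below. It assumes that `scan_data()` is the entry point for this workflow and that its `sections` argument takes a string of letters that selects and orders the report components (`"O"` overview, `"V"` variables, `"I"` interactions, `"C"` correlations, `"M"` missing values, `"S"` sample). The example uses `small_table`, a demonstration dataset that ships with **pointblank**.

```r
library(pointblank)

# Keep only the overview, variables, and sample sections, in that order
report <- scan_data(tbl = small_table, sections = "OVS")

# Printing `report` (for example, in an R Markdown chunk) shows the HTML output
```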

The **R Markdown Document Validation** workflow ([**VALID-VI**](../articles/VALID-VI.html)) can contain a combination of workflow elements. The ideal workflow to use here is that from [**VALID-II**](../articles/VALID-II.html) (**Pipeline Data Validation**), since using it in chunks that have the option `validate = TRUE` set results in a special display of validation results in the rendered HTML document. The [**VALID-I**](../articles/VALID-I.html) workflow can also be used since the agent report prints nicely as an HTML table. The `scan_data()` function of the [**VALID-V**](../articles/VALID-V.html) workflow likewise produces useful output. Finally, the `test_*()` functions of the [**VALID-IV**](../articles/VALID-IV.html) workflow can be used should logical values be needed within the code.
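
The body of a validation chunk could look like the sketch below. It assumes the chunk option is spelled `validate = TRUE`, as in the `validate_rmd()` documentation, and shows the chunk header only as a comment. The checks are illustrative and use the `small_table` dataset that ships with **pointblank**.

```r
# Body of an R Markdown chunk whose header is assumed to be:
#   {r validate = TRUE}
library(pointblank)

# Validation functions used directly on the data, as in VALID-II
validated <-
  small_table %>%
  col_exists(columns = vars(a, d)) %>%
  col_vals_lt(columns = vars(a), value = 10)
```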