Skip to content
Permalink
master
Go to file
 
 
Cannot retrieve contributors at this time
120 lines (83 sloc) 5.69 KB
---
title: "Overview of the inverseRegex Package"
author: "Jasper Watson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{overview}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r, echo = FALSE}
library(inverseRegex)
```
# inverseRegex
The inverseRegex package allows users to reverse engineer regular expression patterns for R objects. Individual characters that make up an object are categorised into common groups and encoded into run-lengths. For example, the phrase "Hello World!" can be translated to `"[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"`.
This could be useful to summarise a dataset without viewing all individual entries or to aid in data cleaning. One could check that a column of dates all follow a "nnnn-nn-nn" format or that a column of strings consisted entirely of alphabetic characters (no zeros entered instead of the letter O for example).
## Usage
The main function to use is `inverseRegex(x)` which will identify the different characters that make up the input object `x`. The different groups that can be identified are
- `'[[:digit:]]'`
- `'[[:lower:]]'`
- `'[[:upper:]]'`
- `'[[:alpha:]]'`
- `'[[:alnum:]]'`
- `'[[:space:]]'`
- `'[[:punct:]]'`
See `?regex` for an explanation of their meanings.
By default the only groups that will be identified are `[[:digit:]]`, `[[:upper:]]`, and `[[:lower:]]`, with any other characters being left as is. This can altered with the following arguments:
- `combineCases`: Use `'[[:alpha:]]'` instead of `'[[:lower:]]'` and `'[[:upper:]]'`.
- `combineAlphanumeric`: Use '`[[:alnum:]]`' instead of '`[[:digit:]]`', '`[[:lower:]]`', '`[[:upper:]]`', and '`[[:alpha:]]`'.
- `combinePunctuation`: Use '`[[:punct:]]`' instead of leaving punctuation characters as is.
- `combineSpace`: Use '`[[:space:]]`' instead of leaving space characters as is.
Some examples of these arguments are below:
```{r, echo = TRUE}
inverseRegex('1aA')
inverseRegex('1aA', combineCases = TRUE)
inverseRegex('1aA', combineAlphanumeric = TRUE)
inverseRegex('Hello World!')
inverseRegex('Hello World!', combineSpace = TRUE, combinePunctuation = TRUE)
```
Users can also specify the different run lengths that will be identified. The `inverseRegex` function has an argument called `numbersToKeep` which allows the user to specify what lengths of repeated sequences should be identified explicitly. The default value is `c(2, 3, 4, 5, 10)`. Run lengths not requested will be identified with a `+`.
```{r, echo = TRUE}
inverseRegex('abcd1234567')
inverseRegex('abcd1234567', numbersToKeep = NULL)
inverseRegex('abcd1234567', numbersToKeep = 1:10)
```
## Non-character Inputs
Many objects with a class other than `character` are supported, including `logical`, `integer`, `numeric`, `Date`, `POSIXct`, `factor`, `matrix`, `data.frame`, and `tibble`. They are all (except `logical`) converted to characters first and then the collection of regex patterns returned either as character vectors or as the same class as the input object if it was a matrix, data frame, or tibble. See `?inverseRegex` for a full description of how they are treated. If users need a different character conversion method they can do it themselves prior to calling `inverseRegex`.
Special mention of numerics and data frames will be given here:
### Inputs of Class `numeric`
An attempt has been made to convert numeric values into characters as directly as possible without losing or adding any information. When passed a numeric vector `inverseRegex` will convert it to character using: `vapply(x, format, character(1), nsmall = 1)`. This will force at least one decimal place for all entries but will not add extra decimal places beyond that unless they were present in the individual input element; it will however remove trailing decimal zeros. For example:
```{r, echo = TRUE}
vapply(c(1, 1.0, 1.10, 1.12, 1.123), format, character(1), nsmall = 1)
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123), numbersToKeep = 2:10)
## Vectors of class integer are just converted using as.character.
inverseRegex(1L)
```
Numerics are treated differently if they are present in a matrix, data frame, or tibble. In the case of a matrix if it has a mode of numeric then the entire object will be converted to character using `trimws(format(x))`. For data frames and tibbles each column of type numeric will be converted using `trimws(format(x))`. This means that unlike for numeric vectors described above, all numeric entries in matrices, data frames, and tibbles will have the same number of decimal places.
```{r, echo = TRUE}
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123))
inverseRegex(data.frame(a = c(1, 1.0, 1.10, 1.12, 1.123)))
```
### Inputs of Class `data.frame`
When giving a data frame `inverseRegex` will return a data frame of similar dimensions with each column representing an individual call to inverseRegex.
```{r, echo = TRUE}
unique(inverseRegex(iris, numbersToKeep = 2:10))
```
## Identifying Rare Patterns
One of the main use cases of the package is to identify irregular entries in a dataset. To this end there is a function `occurrencesLessThan` which will call `inverseRegex` and return logical values with `TRUE` giving the location of any regex patterns that occur less than a certain number of times.
```{r, echo = TRUE}
occurrencesLessThan(c(LETTERS, 1))
## When called on a data frame occurrencesLessThan will assess each column individually.
x <- iris
x$Species <- as.character(x$Species)
x[27, 'Species'] <- 'set0sa'
unique(occurrencesLessThan(x))
```
What constitutes a "rare" pattern can be specified with the `fraction` or `n` arguments. See `?occurrencesLessThan` for a full description.
You can’t perform that action at this time.