Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign up| --- | |
| title: "Overview of the inverseRegex Package" | |
| author: "Jasper Watson" | |
| date: "`r Sys.Date()`" | |
| output: rmarkdown::html_vignette | |
| vignette: > | |
| %\VignetteIndexEntry{overview} | |
| %\VignetteEngine{knitr::rmarkdown} | |
| %\VignetteEncoding{UTF-8} | |
| --- | |
| ```{r setup, include = FALSE} | |
| knitr::opts_chunk$set( | |
| collapse = TRUE, | |
| comment = "#>" | |
| ) | |
| ``` | |
| ```{r, echo = FALSE} | |
| library(inverseRegex) | |
| ``` | |
| # inverseRegex | |
| The inverseRegex package allows users to reverse engineer regular expression patterns for R objects. Individual characters that make up an object are categorised into common groups and encoded into run-lengths. For example, the phrase "Hello World!" can be translated to `"[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"`. | |
| This could be useful to summarise a dataset without viewing all individual entries or to aid in data cleaning. One could check that a column of dates all follow a "nnnn-nn-nn" format or that a column of strings consisted entirely of alphabetic characters (no zeros entered instead of the letter O for example). | |
| ## Usage | |
| The main function to use is `inverseRegex(x)` which will identify the different characters that make up the input object `x`. The different groups that can be identified are | |
| - `'[[:digit:]]'` | |
| - `'[[:lower:]]'` | |
| - `'[[:upper:]]'` | |
| - `'[[:alpha:]]'` | |
| - `'[[:alnum:]]'` | |
| - `'[[:space:]]'` | |
| - `'[[:punct:]]'` | |
| See `?regex` for an explanation of their meanings. | |
| By default the only groups that will be identified are `[[:digit:]]`, `[[:upper:]]`, and `[[:lower:]]`, with any other characters being left as is. This can altered with the following arguments: | |
| - `combineCases`: Use `'[[:alpha:]]'` instead of `'[[:lower:]]'` and `'[[:upper:]]'`. | |
| - `combineAlphanumeric`: Use '`[[:alnum:]]`' instead of '`[[:digit:]]`', '`[[:lower:]]`', '`[[:upper:]]`', and '`[[:alpha:]]`'. | |
| - `combinePunctuation`: Use '`[[:punct:]]`' instead of leaving punctuation characters as is. | |
| - `combineSpace`: Use '`[[:space:]]`' instead of leaving space characters as is. | |
| Some examples of these arguments are below: | |
| ```{r, echo = TRUE} | |
| inverseRegex('1aA') | |
| inverseRegex('1aA', combineCases = TRUE) | |
| inverseRegex('1aA', combineAlphanumeric = TRUE) | |
| inverseRegex('Hello World!') | |
| inverseRegex('Hello World!', combineSpace = TRUE, combinePunctuation = TRUE) | |
| ``` | |
| Users can also specify the different run lengths that will be identified. The `inverseRegex` function has an argument called `numbersToKeep` which allows the user to specify what lengths of repeated sequences should be identified explicitly. The default value is `c(2, 3, 4, 5, 10)`. Run lengths not requested will be identified with a `+`. | |
| ```{r, echo = TRUE} | |
| inverseRegex('abcd1234567') | |
| inverseRegex('abcd1234567', numbersToKeep = NULL) | |
| inverseRegex('abcd1234567', numbersToKeep = 1:10) | |
| ``` | |
| ## Non-character Inputs | |
| Many objects with a class other than `character` are supported, including `logical`, `integer`, `numeric`, `Date`, `POSIXct`, `factor`, `matrix`, `data.frame`, and `tibble`. They are all (except `logical`) converted to characters first and then the collection of regex patterns returned either as character vectors or as the same class as the input object if it was a matrix, data frame, or tibble. See `?inverseRegex` for a full description of how they are treated. If users need a different character conversion method they can do it themselves prior to calling `inverseRegex`. | |
| Special mention of numerics and data frames will be given here: | |
| ### Inputs of Class `numeric` | |
| An attempt has been made to convert numeric values into characters as directly as possible without losing or adding any information. When passed a numeric vector `inverseRegex` will convert it to character using: `vapply(x, format, character(1), nsmall = 1)`. This will force at least one decimal place for all entries but will not add extra decimal places beyond that unless they were present in the individual input element; it will however remove trailing decimal zeros. For example: | |
| ```{r, echo = TRUE} | |
| vapply(c(1, 1.0, 1.10, 1.12, 1.123), format, character(1), nsmall = 1) | |
| inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123), numbersToKeep = 2:10) | |
| ## Vectors of class integer are just converted using as.character. | |
| inverseRegex(1L) | |
| ``` | |
| Numerics are treated differently if they are present in a matrix, data frame, or tibble. In the case of a matrix if it has a mode of numeric then the entire object will be converted to character using `trimws(format(x))`. For data frames and tibbles each column of type numeric will be converted using `trimws(format(x))`. This means that unlike for numeric vectors described above, all numeric entries in matrices, data frames, and tibbles will have the same number of decimal places. | |
| ```{r, echo = TRUE} | |
| inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123)) | |
| inverseRegex(data.frame(a = c(1, 1.0, 1.10, 1.12, 1.123))) | |
| ``` | |
| ### Inputs of Class `data.frame` | |
| When giving a data frame `inverseRegex` will return a data frame of similar dimensions with each column representing an individual call to inverseRegex. | |
| ```{r, echo = TRUE} | |
| unique(inverseRegex(iris, numbersToKeep = 2:10)) | |
| ``` | |
| ## Identifying Rare Patterns | |
| One of the main use cases of the package is to identify irregular entries in a dataset. To this end there is a function `occurrencesLessThan` which will call `inverseRegex` and return logical values with `TRUE` giving the location of any regex patterns that occur less than a certain number of times. | |
| ```{r, echo = TRUE} | |
| occurrencesLessThan(c(LETTERS, 1)) | |
| ## When called on a data frame occurrencesLessThan will assess each column individually. | |
| x <- iris | |
| x$Species <- as.character(x$Species) | |
| x[27, 'Species'] <- 'set0sa' | |
| unique(occurrencesLessThan(x)) | |
| ``` | |
| What constitutes a "rare" pattern can be specified with the `fraction` or `n` arguments. See `?occurrencesLessThan` for a full description. |