Paper
Title: Regular expressions and reshaping using data tables and the nc package
Abstract: Regular expressions are powerful tools for extracting tables from non-tabular text data. Capturing regular expressions that describe information to extract from column names can be especially useful when reshaping a data table from wide (one row with many columns) to tall (one column with many rows). We present the R package nc (short for named capture), which provides functions for data reshaping, regular expressions, and a uniform interface to three C libraries (PCRE, RE2, ICU). We describe the main features of nc, then provide detailed comparisons with related R packages (stats, utils, data.table, tidyr, reshape2, cdata).
- Output RJwrapper.pdf
- Main input/source file to edit is hocking.Rnw
- Makefile takes care of creating submission.zip
11 Oct 2020
figure-who-cols-new-data.R runs the new timings and figure-who-cols-new.R makes the new figure.
5 Oct 2020
figure-who-rows-dt-data.R and figure-iris-rows-dt-data.R compute timings, figure-who-rows-dt.R plots
figure-who-cols-dt-data.R computes timings, figure-who-cols-dt.R plots
figure-iris-cols-dt-valgrind.R was run under valgrind; no memory problems were found.
figure-iris-cols-dt-data.R computes timings of the new data.table methods, figure-iris-cols-dt.R makes the figure.
17 May 2020
Maybe add a comparison with tidyfast::dt_pivot_longer?
29 Oct 2019
figure-iris-cols-new.R makes a new figure based on timings computed using updated R packages.
28 Oct 2019
figure-iris-cols.R makes a figure from the timings computed by figure-iris-cols-data.R. The figure shows that wide-to-tall reshaping with the data.table or nc packages is much faster than with the other packages (cdata, stats, tidyr). The experiment uses inputs with a fixed number of rows and a variable number of input reshape columns, and each function outputs a table with multiple (2) reshape columns. The quadratic time complexity of cdata, stats, and tidyr results in significant slowdowns when there are at least 10,000 input reshape columns.
In contrast, everything below appears to be linear in the number of input columns when the output has only a single reshape column:
Note that stats::reshape is missing from the second plot here, but its result for a smaller N.col size can be seen at https://github.com/tdhock/nc-article/blob/master/figure-who-cols.png
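For readers unfamiliar with the distinction between multiple and single reshape columns, here is a minimal sketch on the built-in iris data. It is not the code in the timing scripts. The output column names (dim.int, part.dim, cm) are chosen here for illustration only, and the nc calls are guarded because they assume the nc package is installed.

```r
library(data.table)

iris.dt <- data.table(iris)

# Multiple (2) reshape columns: the two Sepal.* input columns are stacked
# into one output column, and the two Petal.* input columns into another.
tall.multiple <- melt(
  iris.dt,
  measure.vars = patterns("^Sepal", "^Petal"),
  value.name = c("Sepal", "Petal"),
  variable.name = "dim.int") # illustrative name

# Single reshape column: all four numeric input columns are stacked
# into one output column.
tall.single <- melt(
  iris.dt,
  id.vars = "Species",
  variable.name = "part.dim", # illustrative name
  value.name = "cm")          # illustrative name

# The same two reshapes with nc, where a capturing regular expression on
# the column names defines the output (assumes nc is installed).
if (requireNamespace("nc", quietly = TRUE)) {
  nc.multiple <- nc::capture_melt_multiple(
    iris.dt, column = ".*?", "[.]", dim = ".*")
  nc.single <- nc::capture_melt_single(
    iris.dt, part = ".*?", "[.]", dim = ".*", value.name = "cm")
}
```

With 150 input rows, the first result has two rows per flower (one per dimension) and the second has four rows per flower (one per part and dimension).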
25 Oct 2019
figure-who-both-rows.R makes a figure.
24 Oct 2019
figure-who-complex-rows.R makes a figure.
23 Oct 2019
figure-who-rows.R makes a figure.
figure-who-cols.R makes a figure.










