Title: Wide-to-tall data reshaping using regular expressions and the nc package.
Abstract: Regular expressions are powerful tools for extracting tables from non-tabular text data. Capturing regular expressions that describe information to extract from column names can be especially useful when reshaping a data table from wide (few rows with many regularly named columns) to tall (fewer columns with more rows). We present the R package nc (short for named capture), which provides functions for wide-to-tall data reshaping using regular expressions. We describe the main new ideas of nc, and provide detailed comparisons with related R packages (stats, utils, data.table, tidyr, tidyfast, tidyfst, reshape2, cdata).
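As a minimal sketch of the idea in the abstract (not code taken from the article sources), the following uses nc::capture_melt_single to reshape the built-in iris data from wide to tall. The argument names part and dim are chosen here for illustration; each named argument is a capture group whose matches become an output column, and the unnamed "[.]" matches the separator without capturing it.

```r
library(nc)

# iris has four wide measurement columns named part.dim:
# Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
iris.dt <- data.table::data.table(iris)

iris.tall <- nc::capture_melt_single(
  iris.dt,
  part = "Sepal|Petal",  # capture group "part" becomes an output column
  "[.]",                 # literal dot separator, not captured
  dim = "Length|Width")  # capture group "dim" becomes an output column

# One output row per (flower, part, dim) combination, with the
# measurement stored in a single reshape column named "value".
print(head(iris.tall))
```

The Species column does not match the pattern, so it is copied to the output as an identifier column.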
- Local output RJwrapper.pdf
- Main input/source file to edit is hocking.Rnw
- Makefile takes care of creating submission.zip
Compare with tidyfst::longer_dt? It should be the same as data.table::melt; see https://hope-data-science.github.io/tidyfst/articles/example3_reshape.html
figures-iris-dt contains figures to explain melt, for LatinR data.table tutorial.
figure-who-cols-new-data.R runs new timings, and figure-who-cols-new.R makes the new figure.
figure-who-rows-dt-data.R and figure-iris-rows-dt-data.R compute timings, and figure-who-rows-dt.R plots them.
figure-who-cols-dt-data.R computes timings, and figure-who-cols-dt.R plots them.
figure-iris-cols-dt-valgrind.R was run under valgrind, which reported no memory problems.
figure-iris-cols-dt-data.R computes timings of new data table methods, and figure-iris-cols-dt.R makes the figure.
maybe add comparison with tidyfast::dt_pivot_longer?
figure-iris-cols-new.R makes a new figure based on timings computed using updated R packages.
figure-iris-cols.R makes a figure, based on data computed by figure-iris-cols-data.R, which shows that wide-to-tall data reshaping using either the data.table or nc package is much faster than using other packages (cdata, stats, tidyr). This experiment uses inputs with a fixed number of rows and a variable number of input reshape columns. Each function in the experiment outputs a table with multiple (2) reshape columns. The figure shows that the quadratic time complexity of cdata, stats, and tidyr results in significant slowdowns when there are at least 10,000 input reshape columns.
In contrast, everything in the plot below appears to be linear in the number of input columns when the output has only a single reshape column.
Note that stats::reshape is missing from the second plot, but its result for a smaller N.col size can be seen at https://github.com/tdhock/nc-article/blob/master/figure-who-cols.png
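The two kinds of experiment described above (multiple versus single output reshape columns, with a variable number of input reshape columns) can be sketched as follows. This is only an illustration under stated assumptions, not the code in the figure-*-data.R scripts: the helper make_wide and the variable n.copies are hypothetical, and timings use base system.time with data.table::melt.

```r
library(data.table)

# Hypothetical helper: copy the four iris measurement columns n.copies
# times, so the wide table has 4 * n.copies input reshape columns.
make_wide <- function(n.copies) {
  wide.df <- data.frame(Species = iris$Species)
  for (i in seq_len(n.copies)) {
    for (col.name in names(iris)[1:4]) {
      wide.df[[paste0(col.name, ".", i)]] <- iris[[col.name]]
    }
  }
  data.table(wide.df)
}

wide.dt <- make_wide(n.copies = 3)

# Multiple (2) output reshape columns: one for Sepal, one for Petal.
time.multiple <- system.time({
  tall.multiple <- melt(
    wide.dt,
    id.vars = "Species",
    measure.vars = patterns("^Sepal", "^Petal"),
    value.name = c("Sepal", "Petal"))
})

# Single output reshape column: all measurements stacked into "value".
time.single <- system.time({
  tall.single <- melt(wide.dt, id.vars = "Species")
})

print(dim(tall.multiple))
print(dim(tall.single))
```

Increasing n.copies and recording the elapsed times would give one curve per reshaping function, which is the general shape of the comparisons plotted in the figures.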
figure-who-both-rows.R makes a figure.
figure-who-complex-rows.R makes a figure.
figure-who-rows.R makes a figure.
figure-who-cols.R makes a figure.