README.org

Paper

Title: Regular expressions and reshaping using data tables and the nc package

Abstract: Regular expressions are powerful tools for extracting tables from non-tabular text data. Capturing regular expressions that describe information to extract from column names can be especially useful when reshaping a data table from wide (one row with many columns) to tall (one column with many rows). We present the R package nc (short for named capture), which provides functions for data reshaping, regular expressions, and a uniform interface to three C libraries (PCRE, RE2, ICU). We describe the main features of nc, then provide detailed comparisons with related R packages (stats, utils, data.table, tidyr, reshape2, cdata).
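As a minimal sketch of the wide-to-tall reshaping described in the abstract, the following assumes the nc package is installed and uses the built-in iris data. The capture group names `part` and `dim` are illustrative choices, not names required by the package.

```r
## Sketch: reshape iris from wide to tall with nc, using a capturing
## regular expression to pull information out of the column names.
## Assumes the nc package is installed; group names are illustrative.
tall <- nc::capture_melt_single(
  iris,
  part = ".*",  # captures "Sepal" or "Petal"
  "[.]",        # literal dot separator, not captured
  dim = ".*")   # captures "Length" or "Width"

## Species does not match the pattern, so it is kept as an id column.
## The four matching columns are stacked into a single value column.
print(head(tall))
```

Each named argument becomes a capture group and an output column, so the regular expression doubles as the definition of the output table.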

11 Oct 2020

figure-who-cols-new-data.R runs the new timings, and figure-who-cols-new.R makes the new figure:

figure-who-cols-new.png

5 Oct 2020

figure-who-rows-dt-data.R and figure-iris-rows-dt-data.R compute timings, figure-who-rows-dt.R plots

figure-who-rows-dt.png

figure-who-cols-dt-data.R computes timings, figure-who-cols-dt.R plots

figure-who-cols-dt.png

figure-iris-cols-dt-valgrind.R was run under valgrind, which reported no memory problems.

figure-iris-cols-dt-data.R computes timings of new data table methods, figure-iris-cols-dt.R makes

figure-iris-cols-dt.png

17 May 2020

maybe add comparison with tidyfast::dt_pivot_longer?

29 Oct 2019

figure-iris-cols-new.R makes a new figure based on timings computed using updated R packages.

figure-iris-cols-new.png

28 Oct 2019

figure-iris-cols.R makes a figure, based on timings computed by figure-iris-cols-data.R, which shows that wide-to-tall reshaping with either the data.table or nc package is much faster than with the other packages (cdata, stats, tidyr). The experiment uses inputs with a fixed number of rows and a variable number of input reshape columns. Each function in the experiment outputs a table with two reshape columns. The figure shows that the quadratic time complexity of cdata, stats, and tidyr results in significant slowdowns when there are at least 10,000 input reshape columns.

figure-iris-cols.png
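To illustrate what "multiple (2) reshape columns" means in the experiment above, here is a small sketch using data.table on the original iris data. It is not the timing script itself, and the output column names `Sepal` and `Petal` are assumptions chosen for illustration.

```r
## Sketch: wide-to-tall reshape producing two output reshape columns.
## Assumes the data.table package is installed; this is an illustration,
## not the code in figure-iris-cols-data.R.
library(data.table)
iris.dt <- data.table(iris)

tall2 <- melt(
  iris.dt,
  ## Each pattern selects one group of input columns; each group
  ## becomes one output column.
  measure.vars = patterns("^Sepal", "^Petal"),
  value.name = c("Sepal", "Petal"))

## variable is 1 for the Length columns and 2 for the Width columns.
print(head(tall2))
```

The four input reshape columns are stacked into two output columns, so the result has twice as many rows as the input.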

In contrast, every method below appears to be linear in the number of input columns when the output has only a single reshape column:

figure-who-cols-minimal.png

source: figure, timings.

Note that stats::reshape is missing from the second plot, but its result for a smaller N.col can be seen at https://github.com/tdhock/nc-article/blob/master/figure-who-cols.png

25 Oct 2019

figure-who-both-rows.R makes

figure-who-both-rows.png

24 Oct 2019

figure-who-complex-rows.R makes

figure-who-complex-rows.png

23 Oct 2019

figure-who-rows.R makes

figure-who-rows.png

figure-who-cols.R makes

figure-who-cols.png
