
Read into a 'melted' structure #760

Merged
merged 33 commits into tidyverse:master from the melt branch on Nov 16, 2018

Conversation

@nacnudus (Contributor) commented Dec 7, 2017

Referencing #724

readr::tokenize_melt("1,as\n3.3,2017-01-03,5\n\n6")
#> # A tibble: 6 x 4
#>     row   col      value data_type
#>   <int> <int>      <chr>     <chr>
#> 1     0     0          1   integer
#> 2     0     1         as character
#> 3     1     0        3.3    double
#> 4     1     1 2017-01-03      date
#> 5     1     2          5   integer
#> 6     2     0          6   integer

This form is useful when the source is ragged and/or has mixed data types within columns. Once it has been read into R in this form, it can be traversed and transformed into something tidier.
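
For instance, here is a minimal sketch of one such traversal, assuming the melted tibble above is stored in a variable called cells (the reshaping uses dplyr/tidyr and is not part of this proposal):

library(dplyr)
library(tidyr)

# The melted output above, re-created by hand for illustration
cells <- tibble::tribble(
  ~row, ~col, ~value,       ~data_type,
     0,    0, "1",          "integer",
     0,    1, "as",         "character",
     1,    0, "3.3",        "double",
     1,    1, "2017-01-03", "date",
     1,    2, "5",          "integer",
     2,    0, "6",          "integer"
)

# Traverse back towards a rectangle: one output column per source column
cells %>%
  select(row, col, value) %>%
  spread(col, value)
#> a 3-row tibble with columns row, `0`, `1`, `2`; absent cells become NA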

readr's stated purpose is 'Read Rectangular Text Data', but since the implementation would be a thin wrapper around existing C++ code, it would seem to belong here. Otherwise, would it be possible to use the readr headers from another package?

Example csv from the New Zealand 2017 election data. More examples can be found in this paper.

bool isInteger(const std::string& x, LocaleInfo* pLocale) {
  // Leading zero
  if (x[0] == '0')
    return false;
@nacnudus (Contributor, Author) Dec 7, 2017

This is a mistake. I don't think the leading-zero test is necessary at all.

src/parse.cpp Outdated
}

row[i] = t.row();
col[i] = t.col();
@nacnudus (Contributor, Author) Dec 7, 2017

Off-by-one errors. Add 1 to t.row() and t.col().

@jimhester (Member) commented Dec 12, 2017

Thanks for opening the PR! I cleaned up some things and I think the interface makes more sense as a series of melt_*() functions analogous to the read_*() and spec_*() functions. So far I have implemented melt_delim().

If you want to implement the remaining functions and add some tests, this could make it into the next release (which is imminent). Otherwise I will do so at a later date; I generally want this release to focus only on bug fixes and to save new features for the next one.

@nacnudus (Contributor, Author) commented Dec 18, 2017

Thanks for reviewing and tidying up my code! Making a series of melt_*() functions is a good idea, but introduces complications that I won't be able to address before CRAN shuts down over Christmas and New Year, so I'll aim for the following release.

I'll keep pushing to this branch and update this comment with progress so far.

  • Is there an integer alternative to size_t that supports long vectors but means the row and col columns are printed nicely?
  • Use the Reader.h class to benefit from the progress indicator, chunking, and handling of text encoding.
  • Change column order to row,col,data_type,value, which is easier to read when value contains a long string.
  • When values are missing or empty, say so in the data_type column.
  • Decide how to handle comments in the cases a#comment vs a,#comment. Should the second version return a missing or empty value, or none at all? Consider that comments at the beginnings of lines should probably cause the line to be ignored altogether, including in the row numbers.
  • Decide whether a blank first line should be returned as a missing or empty value in column 1, and be counted in the row numbers.
  • Produce a zero-row data frame rather than an empty one when n_max = 0 (see the sketch after this list).
  • Produce a zero-row data frame rather than an empty one when the file to be read is empty.
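
A sketch of the behaviour those last two items aim for (melt_csv() is hypothetical at this point in the thread; the column set follows the list above):

library(readr)

empty <- tempfile(fileext = ".csv")
file.create(empty)   # a genuinely empty file

melt_csv("1,as\n3.3,2017-01-03,5\n\n6", n_max = 0)
#> expected: 0 rows, but still the columns row, col, data_type, value
melt_csv(empty)
#> expected: the same zero-row data frame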

nacnudus added 7 commits Dec 18, 2017
Some of these fail and are being discussed in PR tidyverse#760
This is to start again based on src/read.cpp instead of src/parse.cpp
The `read_*` functions skip empty rows altogether, and still will by default
with this commit.  The new `melt_*` functions in the next commit will *not* skip
empty rows by default, because they can be meaningful when the data is less
regular, e.g. more than one table per file, separated by blank lines.
This was removed from the `read_*()` functions because it would truncate
floating-point values if they weren't encountered until after guess_max.  But
the `melt_*()` functions won't have the same problem.
@nacnudus (Contributor, Author) commented Dec 30, 2017

There are now melt_*() variants of all relevant read_*() functions.

melt_delim()
melt_csv()
melt_csv2()
melt_tsv()

melt_delim_chunked()
melt_csv_chunked()
melt_csv2_chunked()
melt_tsv_chunked()

melt_fwf()
melt_table()
melt_table2()
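
A minimal sketch of the new interface, using the example from the top of the thread (only the shape is shown, since the exact cell values depend on the decisions listed above):

library(readr)

melt_csv("1,as\n3.3,2017-01-03,5\n\n6")
#> a tibble with one row per cell and the columns row, col, data_type, value
#> (row and col now start at 1, per the off-by-one fix above)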

The read_*() functions that don't have a melt_*() variant (because it wouldn't make sense) are these.

read_file()
read_file_raw()
read_lines()
read_lines_chunked()
read_lines_raw()
read_log()
read_rds()

Reverted commits

The melt_*() functions are brought into line with the read_*() ones (as suggested by @jimhester) by basing them on src/read.cpp rather than src/parse.cpp. This means they now handle locales, encoding and embedded nulls, have *_chunked() variants and can show progress bars. The previous commits in this PR are reverted by 9af36d to make it easier to review the latest ones, which start again from scratch.

Implementation

The implementation is almost the same as that of the read_*() functions; the main differences are that column names are not required and column types are not inferred.

Where read_*() has R/read_delim.R, R/read_fwf.R, etc., melt_*() has R/melt_delim.R, R/melt_fwf.R, etc. R/tokenizer.R is untouched (besides the skip_empty_rows argument, see below).

At the C++ level, where read_*() has read_tokens_() and read_tokens_chunked_() in src/read.cpp, melt_*() has melt_tokens_() and melt_tokens_chunked_().

The engine is Reader::melt(), called by Reader::meltToDataFrame(). These correspond to Reader::read() and Reader::readToDataFrame().

melt_*_chunked() variants are generated by generate_melt_chunked_fun() and generate_melt_delimited_chunked(), similar to generate_chunked_fun() and generate_read_delimited_chunked().
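
A sketch of a chunked call, assuming melt_csv_chunked() takes a callback and chunk size the same way read_csv_chunked() does (the file name is hypothetical):

library(readr)

# Summarise the data types seen in each chunk of melted cells
count_types <- DataFrameCallback$new(function(cells, pos) {
  as.data.frame(table(cells$data_type))
})

melt_csv_chunked("large-untidy-file.csv", count_types, chunk_size = 10000)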

Integers

read_*() doesn't infer integer types, because it caused problems if the first doubles weren't encountered until after guess_max rows. That isn't a problem for melt_*(), so integer types are inferred here.

Row and column numbers are returned as doubles rather than integers to support long vectors. This looks odd when printed in a tibble, so I'm open to suggestions: maybe attempt conversion to integer when possible?
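
One possible approach, sketched here rather than implemented in the PR: keep doubles internally for long-vector support, but downcast when every value fits in an integer (cells stands for a melted result; maybe_integer is a hypothetical helper).

maybe_integer <- function(x) {
  fits <- length(x) > 0 && !anyNA(x) &&
    all(x == trunc(x)) && max(x) <= .Machine$integer.max
  if (fits) as.integer(x) else x
}

cells$row <- maybe_integer(cells$row)
cells$col <- maybe_integer(cells$col)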

skip_empty_rows argument

One change affects the existing read_*() and tokenize_*() functions: they gain an argument skip_empty_rows = TRUE. This is backwards-compatible, and all tests run unaltered. It exists so that the melt_*() functions can have the opposite default, skip_empty_rows = FALSE, because

  • empty rows are often meaningful to untidy data
  • skipping rows messes around with the row numbers, which are explicit in the output of melt_*()
  • there shouldn't be a distinction between empty rows and empty columns, given untidy data

It would be possible to omit this argument and hide the functionality elsewhere, but it is easier to document the difference between melt_*() and read_*() when it is exposed. Users who notice that melt_*() doesn't skip a blank row, but read_*() does (silently), will hopefully find an explanation in the docs.
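
A sketch of the differing defaults on inline data (the exact label given to the blank row follows the 'missing or empty' decision above, so it is not shown here):

library(readr)

x <- "a,b\n1,2\n\n3,4\n"

read_csv(x)                           # skip_empty_rows = TRUE: the blank line is dropped silently
read_csv(x, skip_empty_rows = FALSE)  # opt back in to keeping it
melt_csv(x)                           # skip_empty_rows = FALSE: the blank line keeps its row number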

skip_empty_rows is fed into tokenize_*(), which constructs a list of such settings to pass into the C++ code as an Rcpp::List, so no change is needed at the C++ level to accommodate the extra setting, except that src/Tokenizer.cpp hands it off to src/TokenizerDelim.*, src/TokenizerFwf.* and src/TokenizerWs.*, which store it in a member variable. There it is handled by one-liners that avoid skipping rows and suppress warnings about incomplete lines/fields.

If the argument were omitted from exported interfaces, I still think it should be passed to C++ in the spec argument, appended to the list on the R side, similar to the way chunking settings are handled.
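
For illustration, a sketch of where the setting ends up on the R side, assuming the tokenizer constructors simply gain the argument as described:

library(readr)

spec <- tokenizer_csv(skip_empty_rows = FALSE)
spec$skip_empty_rows
#> [1] FALSE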

Tests

There is 100% test coverage on the new code, with 79.96% coverage overall. Corner cases are tested, such as trailing newlines, so any implementation details I've forgotten to mention here should be visible in the tests.
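
For example, a corner-case test in the spirit of those described might look like this (illustrative only, not copied from the test files):

library(testthat)
library(readr)

test_that("a trailing newline does not add cells", {
  expect_equal(nrow(melt_csv("a,b\n1,2\n")), 4)
  expect_equal(max(melt_csv("a,b\n1,2\n")$row), 2)
})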

@jimhester jimhester merged commit d07eb69 into tidyverse:master Nov 16, 2018
0 of 2 checks passed
@jimhester (Member) commented Nov 16, 2018

Thank you so much for this work, and I apologize that it took so long to merge. I think this will be very useful for people dealing with almost-rectangular data and other less structured formats!

@nacnudus (Contributor, Author) commented Nov 16, 2018

Woop! Made my day. Thank you for all the great work you do.

@nacnudus nacnudus deleted the melt branch Nov 30, 2018