Fixing of ragged fixed width format files and reading a subset of columns #353

ghaarsma · 2016-01-19T00:01:07Z

This pull request should fix issues #300 and #326. The current fwf reader is broken if you want to read only a subset of columns.

Ragged fwf files (where the width of the last column is variable) do exist, but they should be a minority. This implementation assumes a ragged fwf file only when the last end position is NA, Inf, or omitted.

x <- '12345A\n67890BBBBBBBBB\n54321C'

col_names <- c('A','B','C')
start <- c(1,3,6)
end   <- c(2,5,6)

# Read all columns, non Ragged
col_positions <- fwf_positions(start,end,col_names)
df1 <- read_fwf(x,col_positions = col_positions);df1

# Read subset of columns, it works!
col_positions <- fwf_positions(start[1:2],end[1:2],col_names[1:2])
df2 <- read_fwf(x,col_positions = col_positions);df2

# Read Ragged
col_positions <- fwf_positions(start,end[1:2],col_names)
df3 <- read_fwf(x,col_positions = col_positions);df3

# Read Ragged, alternate way
col_positions <- fwf_positions(start,end=c(2,5,Inf),col_names)
df4 <- read_fwf(x,col_positions = col_positions);df4

# Read Ragged alternate way with fwf_widths
col_positions <- fwf_widths(widths = c(2,3,NA),col_names)
df5 <- read_fwf(x,col_positions = col_positions);df5

jschoeley · 2016-01-25T16:04:46Z

Good solution! Gives the user the power to explicitly specify ragged data via the column position parameter. Serves the stated purpose of fwf_positions -- "to read in only selected fields" -- much better than the current implementation.

Also the single failing check is a positive sign as it checks for the problematic fwf_positions behavior that @ghaarsma fixes in this pull request.

hadley · 2016-06-02T10:24:43Z

R/read_fwf.R

-#'   use \code{fwf_positions}. The width of the last column will be silently
-#'   extended to the next line break.
+#'   use \code{fwf_positions}. If the width of the last column is variable (a
+#'   ragged fwf file), supply the last end position as NA, Inf or simply ommit it.


I think you only need to use a single sentinel value here, and I'd recommend sticking with Inf.

Actually, NA would be easier, because Inf is only available in doubles, not integers.
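
To illustrate the sentinel point, here is a minimal C++ sketch, assuming the end positions eventually reach the tokenizer as plain `int`s. The names `kNoEnd`, `toIntegerEnd` and `isRaggedEnd` are hypothetical and not part of readr; the only outside fact relied on is that R stores `NA_integer_` as `INT_MIN`.

```cpp
#include <cassert>
#include <climits>
#include <cmath>
#include <limits>

// Hypothetical sentinel meaning "no end position: read to the end of the line".
// R itself stores NA_integer_ as INT_MIN, so the same value is reused here.
constexpr int kNoEnd = INT_MIN;

// double can represent infinity; int cannot, so Inf cannot survive a
// conversion to an integer position vector.
static_assert(std::numeric_limits<double>::has_infinity, "double has Inf");
static_assert(!std::numeric_limits<int>::has_infinity, "int has no Inf");

// Hypothetical helper: map a user-supplied end position onto an int.
// Non-finite input (Inf, NaN, and therefore R's NA_real_) becomes the sentinel.
inline int toIntegerEnd(double end) {
  if (!std::isfinite(end)) return kNoEnd;
  assert(end >= 1 && end <= INT_MAX);  // positions are 1-based
  return static_cast<int>(end);
}

// True when an integer end position marks a ragged last column.
inline bool isRaggedEnd(int end) { return end == kNoEnd; }
```

With a single integer sentinel, the check on the C++ side is one comparison, whereas accepting Inf would need a conversion step like `toIntegerEnd` somewhere before the positions become integers.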

hadley · 2016-06-02T10:28:21Z

I think this is a reasonable approach, although it still needs quite a bit of work. @ghaarsma are you interested in continuing to work on it?

@jschoeley A failing test is not a good sign for a PR - tests need to be updated too.

hadley · 2016-06-02T10:28:39Z

src/TokenizerFwf.cpp

@@ -164,6 +171,14 @@ Token TokenizerFwf::nextToken() {
    row_++;
    col_ = 0;

+    // Proceed to the end of the line. This is needed in case the last column
+    // in the file is not being read.
+    while(fieldEnd != end_ && *fieldEnd != '\r' && *fieldEnd != '\n') {


Are you sure this won't end up accidentally skipping blank lines?

Yes, there is room for improvement.

You don't have to proceed to the end of the line if the line is too short (tooShort = true): you are already there.

You don't have to proceed to the end of the line if the format is ragged: reading the ragged column already does this.

However, if the line is neither short nor ragged and the user does not read the final column, you are not yet at the EOL. In that case you have to proceed to the EOL, and because you don't know the remaining width, the while loop is needed.
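
The three cases above can be sketched as follows. This is a standalone illustration, not the code in `TokenizerFwf.cpp`: `needsSkipToEol` and `skipToEndOfLine` are hypothetical helpers, and only the loop condition is taken from the diff.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: skipping is only needed when the line was not too
// short and the last requested column is not ragged.
inline bool needsSkipToEol(bool tooShort, bool isRagged) {
  return !tooShort && !isRagged;
}

// Hypothetical helper: advance to the next line terminator (or the end of
// the input) without consuming it. Because the terminator itself is left
// in place, a blank line that follows is not swallowed by this loop.
inline const char* skipToEndOfLine(const char* pos, const char* end) {
  assert(pos <= end);  // both pointers must refer to the same buffer
  while (pos != end && *pos != '\r' && *pos != '\n') {
    ++pos;
  }
  return pos;
}
```

For the input `"12345A\n\n54321C"`, calling `skipToEndOfLine` from the `A` stops on the first `'\n'`; calling it again from a position that is already a terminator returns that same position, so the empty second line is still visible to whatever consumes line endings afterwards.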

ghaarsma · 2016-06-03T23:20:30Z

Hadley,
I made a few modifications based on your comments and did a push to https://github.com/ghaarsma/readr. Not sure about the exact work process, so let me know if you need anything else. Code is a little cleaner now. Thanks for the feedback.

The (only) way to indicate a Ragged fixed width format is to have the last position of the end vector as NA. I have checked it with empty lines inside the file (it pushes in a row of NA's), which I assume is the intended behavior.
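
The behavior described above (NA as the last end position, and a row of NAs for an empty line) can be modeled with a small standalone C++ sketch. Everything here is hypothetical: `Field`, `kNoEnd` and `sliceLine` are illustration names, and truncating a short non-ragged field is an assumption of the sketch, not necessarily what readr does.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel for "read to the end of the line" (R's NA_integer_
// is INT_MIN, so the same value is used here).
constexpr int kNoEnd = INT_MIN;

// Hypothetical result type: a field is either missing (NA) or has a value.
struct Field {
  bool missing;
  std::string value;
};

// Slice one line into fields using 1-based, inclusive start/end positions.
// Assumes end[i] >= start[i] for every non-sentinel end position.
inline std::vector<Field> sliceLine(const std::string& line,
                                    const std::vector<int>& start,
                                    const std::vector<int>& end) {
  assert(start.size() == end.size());
  std::vector<Field> out;
  for (std::size_t i = 0; i < start.size(); ++i) {
    assert(start[i] >= 1);
    const std::size_t from = static_cast<std::size_t>(start[i] - 1);
    if (from >= line.size()) {
      // The line ends before this field starts: record a missing value.
      out.push_back(Field{true, ""});
      continue;
    }
    // A ragged field runs to the end of the line; otherwise clamp to it.
    const std::size_t to =
        end[i] == kNoEnd
            ? line.size()
            : std::min(static_cast<std::size_t>(end[i]), line.size());
    out.push_back(Field{false, line.substr(from, to - from)});
  }
  return out;
}
```

With `start = {1, 3, 6}` and `end = {2, 5, kNoEnd}`, the line `67890BBBBBBBBB` yields `67`, `890` and `BBBBBBBBB`, while an empty line yields three missing fields.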

Some simple tests

x <- '12345A\n67890BBBBBBBBB\n54321C'

col_names <- c('A','B','C')
start <- c(1,3,6)
end   <- c(2,5,6)

# Read all columns, non Ragged
col_positions <- fwf_positions(start,end,col_names)
df1 <- read_fwf(x,col_positions = col_positions);df1

# Read subset of columns, it works!
col_positions <- fwf_positions(start[1:2],end[1:2],col_names[1:2])
df2 <- read_fwf(x,col_positions = col_positions);df2

# Read Ragged
col_positions <- fwf_positions(start,end=c(2,5,NA),col_names)
df4 <- read_fwf(x,col_positions = col_positions);df4

# Read Ragged alternate way with fwf_widths
col_positions <- fwf_widths(widths = c(2,3,NA),col_names)
df5 <- read_fwf(x,col_positions = col_positions);df5

hadley · 2016-06-04T19:44:23Z

R/read_fwf.R

@@ -76,7 +77,9 @@ fwf_widths <- function(widths, col_names = NULL) {
 #' @rdname read_fwf
 #' @export
 #' @param start,end Starting and ending (inclusive) positions of each field.
+#'    Use NA or Inf as last end field when reading a ragged fwf file.


I think you missed the Inf here

hadley · 2016-06-04T19:45:44Z

This is looking good. Can you please add those examples as unit tests, and add a bullet point to NEWS? (it should include the original issue number and your github user name)

ghaarsma · 2016-06-06T21:21:28Z

Added new unit tests and a bullet point to the NEWS.md. Let me know if this is all correct. Some of the work process is a first time for me.

hadley · 2016-06-06T21:52:48Z

This looks great!

There's one last thing for you to attempt - can you please rebase/merge to bring your branch up to date with the other changes? If you get stuck, don't worry too much, as I can do it by hand, but it's a good learning experience for future PRs if you want to give it a shot.

You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA, Inf or simply omitted. Updated documentation in read_fwf to reflect changes.

ghaarsma · 2016-06-07T00:02:43Z

Hadley,

I tried to follow: https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request. Had some minor issues, but I think I got it. Please check when merging the pull request.

codecov-io · 2016-06-07T00:03:20Z

Current coverage is 70.00%

Merging #353 into master will increase coverage by <.01%

@@             master       #353   diff @@
==========================================
  Files            56         56          
  Lines          2803       2807     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           1961       1965     +4   
  Misses          842        842          
  Partials          0          0

Powered by Codecov. Last updated by e41bc7e...f5b91b9

hadley · 2016-06-07T01:06:27Z

Perfect - thanks!

hadley reviewed Jun 2, 2016

hadley added the in progress label Jun 2, 2016

hadley reviewed Jun 4, 2016

ghaarsma added 5 commits June 6, 2016 18:24

Fixed implementation of reading Ragged fwf (fixed width format) files.

b013814

You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA,Inf or simply omitted.

Changes made to the reading of fixed width files are discussion with …

8af8bc5

…Hadley The only way to assume a fwf ragged file is for the last end position to be NA.

Changes made to the reading of fixed width files are discussion with …

7d92635

…Hadley The only way to assume a fwf ragged file is for the last end position to be NA.

Updated and appended read_fwf tests and NEWS.md

f5b91b9

hadley merged commit 8b253b2 into tidyverse:master Jun 7, 2016

hadley removed the in progress label Jun 7, 2016
