Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing of ragged fixed width format files and reading a subset of columns #353

Merged
merged 5 commits into from
Jun 7, 2016
Merged

Conversation

ghaarsma
Copy link
Contributor

This pull request should fix items 300 and 326. The current fwf format is kind of broken if you want to read a subset of columns.

Ragged fwf (where the last column width if variable) do exists, but they should be a minority. This implementation assumes ragged fwf only when the last end position is NA, Inf or omitted.

x <- '12345A\n67890BBBBBBBBB\n54321C'

col_names <- c('A','B','C')
start <- c(1,3,6)
end   <- c(2,5,6)

# Read all columns, non Ragged
col_positions <- fwf_positions(start,end,col_names)
df1 <- read_fwf(x,col_positions = col_positions);df1

# Read subset of columns, it works!
col_positions <- fwf_positions(start[1:2],end[1:2],col_names[1:2])
df2 <- read_fwf(x,col_positions = col_positions);df2

# Read Ragged
col_positions <- fwf_positions(start,end[1:2],col_names)
df3 <- read_fwf(x,col_positions = col_positions);df3

# Read Ragged, alternate way
col_positions <- fwf_positions(start,end=c(2,5,Inf),col_names)
df4 <- read_fwf(x,col_positions = col_positions);df4

# Read Ragged alternate way with fwf_widths
col_positions <- fwf_widths(widths = c(2,3,NA),col_names)
df5 <- read_fwf(x,col_positions = col_positions);df5

@jschoeley
Copy link

Good solution! Gives the user the power to explicitly specify ragged data via the column position parameter. Serves the stated purpose of fwf_positions -- "to read in only selected fields" -- much better than the current implementation.

Also the single failing check is a positive sign as it checks for the problematic fwf_positions behavior that @ghaarsma fixes in this pull request.

#' use \code{fwf_positions}. The width of the last column will be silently
#' extended to the next line break.
#' use \code{fwf_positions}. If the width of the last column is variable (a
#' ragged fwf file), supply the last end position as NA, Inf or simply ommit it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you only need to use a single sentinel value here, and I'd recommend sticking with Inf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually NA would be easier since because Inf is only available in doubles, not integers

@hadley
Copy link
Member

hadley commented Jun 2, 2016

I think this is a reasonable approach, although it still needs quite a bit of work. @ghaarsma are you interested in continuing to work on it?

@jschoeley A failing test is not a good sign for a PR - tests need to be updated too.

@@ -164,6 +171,14 @@ Token TokenizerFwf::nextToken() {
row_++;
col_ = 0;

// Proceed to the end of the line. This is needed in case the last column
// in the file is not being read.
while(fieldEnd != end_ && *fieldEnd != '\r' && *fieldEnd != '\n') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this won't end up accidentally skipping blank lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there is room for improvement.

  1. You don't have to proceed to the end of line if you are short. tooShort = true (you are already there).
  2. You don't have to proceed to the end of line if format is ragged (reading the ragged column will do this).
  3. However if not short and not ragged then you will not be at the EOL if the user does not wish to read the final column. In that case you have to proceed to EOL and you don't know the width, therefore the while loop.

@ghaarsma
Copy link
Contributor Author

ghaarsma commented Jun 3, 2016

Hadley,
I made a few modifications based on your comments and did a push to https://github.com/ghaarsma/readr. Not sure about the exact work process, so let me know if you need anything else. Code is a little cleaner now. Thanks for the feedback.

The (only) way to indicate a Ragged fixed width format is to have the last position of the end vector as NA. I have checked it with empty lines inside the file (it pushes in a row of NA's), which I assume is the intended behavior.

Some simple tests

x <- '12345A\n67890BBBBBBBBB\n54321C'

col_names <- c('A','B','C')
start <- c(1,3,6)
end   <- c(2,5,6)

# Read all columns, non Ragged
col_positions <- fwf_positions(start,end,col_names)
df1 <- read_fwf(x,col_positions = col_positions);df1

# Read subset of columns, it works!
col_positions <- fwf_positions(start[1:2],end[1:2],col_names[1:2])
df2 <- read_fwf(x,col_positions = col_positions);df2

# Read Ragged
col_positions <- fwf_positions(start,end=c(2,5,NA),col_names)
df4 <- read_fwf(x,col_positions = col_positions);df4

# Read Ragged alternate way with fwf_widths
col_positions <- fwf_widths(widths = c(2,3,NA),col_names)
df5 <- read_fwf(x,col_positions = col_positions);df5

@@ -76,7 +77,9 @@ fwf_widths <- function(widths, col_names = NULL) {
#' @rdname read_fwf
#' @export
#' @param start,end Starting and ending (inclusive) positions of each field.
#' Use NA or Inf as last end field when reading a ragged fwf file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you missed the Inf here

@hadley
Copy link
Member

hadley commented Jun 4, 2016

This is looking good. Can you please add those examples as unit tests, and add a bullet point to NEWS? (it should include the original issue number and your github user name)

@ghaarsma
Copy link
Contributor Author

ghaarsma commented Jun 6, 2016

Added new unit tests and a bullet point to the NEWS.md. Let me know if this is all correct. Some of the work process is a first time for me.

@hadley
Copy link
Member

hadley commented Jun 6, 2016

This looks great!

There's one last thing for you to attempt - can you please rebase/merge to bring your branch up to date with the other changes? If you get stuck, don't worry too much, as I can do it by hand, but it's a good learning experience for future PRs if you want to give it a shot.

You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA,Inf or simply omitted.
You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA,Inf or simply omitted.

Updated documentation inf read_fwf to reflect changes.
…Hadley

The only way to assume a fwf ragged file is for the last end position to be NA.
…Hadley

The only way to assume a fwf ragged file is for the last end position to be NA.
@ghaarsma
Copy link
Contributor Author

ghaarsma commented Jun 7, 2016

Hadley,

I tried to follow: https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request. Had some minor issues, but I think I got it. Please check when merging the pull request.

@codecov-io
Copy link

Current coverage is 70.00%

Merging #353 into master will increase coverage by <.01%

@@             master       #353   diff @@
==========================================
  Files            56         56          
  Lines          2803       2807     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           1961       1965     +4   
  Misses          842        842          
  Partials          0          0          

Powered by Codecov. Last updated by e41bc7e...f5b91b9

@hadley hadley merged commit 8b253b2 into tidyverse:master Jun 7, 2016
@hadley hadley removed the in progress label Jun 7, 2016
@hadley
Copy link
Member

hadley commented Jun 7, 2016

Perfect - thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants