New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing of ragged fixed width format files and reading a subset of columns #353

Merged
merged 5 commits into from Jun 7, 2016

Conversation

Projects
None yet
4 participants
@ghaarsma
Contributor

ghaarsma commented Jan 19, 2016

This pull request should fix items 300 and 326. The current fwf format is kind of broken if you want to read a subset of columns.

Ragged fwf (where the last column width if variable) do exists, but they should be a minority. This implementation assumes ragged fwf only when the last end position is NA, Inf or omitted.

x <- '12345A\n67890BBBBBBBBB\n54321C'

col_names <- c('A','B','C')
start <- c(1,3,6)
end   <- c(2,5,6)

# Read all columns, non Ragged
col_positions <- fwf_positions(start,end,col_names)
df1 <- read_fwf(x,col_positions = col_positions);df1

# Read subset of columns, it works!
col_positions <- fwf_positions(start[1:2],end[1:2],col_names[1:2])
df2 <- read_fwf(x,col_positions = col_positions);df2

# Read Ragged
col_positions <- fwf_positions(start,end[1:2],col_names)
df3 <- read_fwf(x,col_positions = col_positions);df3

# Read Ragged, alternate way
col_positions <- fwf_positions(start,end=c(2,5,Inf),col_names)
df4 <- read_fwf(x,col_positions = col_positions);df4

# Read Ragged alternate way with fwf_widths
col_positions <- fwf_widths(widths = c(2,3,NA),col_names)
df5 <- read_fwf(x,col_positions = col_positions);df5
@jschoeley

This comment has been minimized.

jschoeley commented Jan 25, 2016

Good solution! Gives the user the power to explicitly specify ragged data via the column position parameter. Serves the stated purpose of fwf_positions -- "to read in only selected fields" -- much better than the current implementation.

Also the single failing check is a positive sign as it checks for the problematic fwf_positions behavior that @ghaarsma fixes in this pull request.

#' use \code{fwf_positions}. The width of the last column will be silently
#' extended to the next line break.
#' use \code{fwf_positions}. If the width of the last column is variable (a
#' ragged fwf file), supply the last end position as NA, Inf or simply ommit it.

This comment has been minimized.

@hadley

hadley Jun 2, 2016

Member

I think you only need to use a single sentinel value here, and I'd recommend sticking with Inf.

This comment has been minimized.

@hadley

hadley Jun 2, 2016

Member

Actually NA would be easier since because Inf is only available in doubles, not integers

fwf_positions <- function(start, end, col_names = NULL) {
stopifnot(length(start) == length(end))

This comment has been minimized.

@hadley

hadley Jun 2, 2016

Member

I think you should leave this check

if (beginOffset_.size() != endOffset_.size())
// File is assumed to be ragged (last column can have variable width)
// when length of end vector is one shorter then begin vector.
isRagged_ = beginOffset_.size()==(endOffset_.size()+1L);

This comment has been minimized.

@hadley

hadley Jun 2, 2016

Member

This will get simpler once you are checking for explicit NA

This comment has been minimized.

@ghaarsma

ghaarsma Jun 3, 2016

Contributor

I agree that we should only use one single sentinel value (NA).
I think the code will be indeed cleaner if we check for explicit NA inside the cpp code and leave the original length checks on start and end in the R and cpp code.

break;
if (lastCol && isRagged_) {
// Last column is ragged, so read until end of line (ignoring width)
if (lastCol) {

This comment has been minimized.

@hadley

hadley Jun 2, 2016

Member

Won't this always be true?

This comment has been minimized.

@ghaarsma

ghaarsma Jun 3, 2016

Contributor

Yes It is. The if (lastCol) {} check should be removed.

@hadley

This comment has been minimized.

Member

hadley commented Jun 2, 2016

I think this is a reasonable approach, although it still needs quite a bit of work. @ghaarsma are you interested in continuing to work on it?

@jschoeley A failing test is not a good sign for a PR - tests need to be updated too.

@@ -164,6 +171,14 @@ Token TokenizerFwf::nextToken() {
row_++;
col_ = 0;

// Proceed to the end of the line. This is needed in case the last column
// in the file is not being read.
while(fieldEnd != end_ && *fieldEnd != '\r' && *fieldEnd != '\n') {

This comment has been minimized.

@hadley

hadley Jun 2, 2016

Member

Are you sure this won't end up accidentally skipping blank lines?

This comment has been minimized.

@ghaarsma

ghaarsma Jun 3, 2016

Contributor

Yes there is room for improvement.

  1. You don't have to proceed to the end of line if you are short. tooShort = true (you are already there).
  2. You don't have to proceed to the end of line if format is ragged (reading the ragged column will do this).
  3. However if not short and not ragged then you will not be at the EOL if the user does not wish to read the final column. In that case you have to proceed to EOL and you don't know the width, therefore the while loop.

@hadley hadley added the in progress label Jun 2, 2016

@ghaarsma

This comment has been minimized.

Contributor

ghaarsma commented Jun 3, 2016

Hadley,
I made a few modifications based on your comments and did a push to https://github.com/ghaarsma/readr. Not sure about the exact work process, so let me know if you need anything else. Code is a little cleaner now. Thanks for the feedback.

The (only) way to indicate a Ragged fixed width format is to have the last position of the end vector as NA. I have checked it with empty lines inside the file (it pushes in a row of NA's), which I assume is the intended behavior.

Some simple tests

x <- '12345A\n67890BBBBBBBBB\n54321C'

col_names <- c('A','B','C')
start <- c(1,3,6)
end   <- c(2,5,6)

# Read all columns, non Ragged
col_positions <- fwf_positions(start,end,col_names)
df1 <- read_fwf(x,col_positions = col_positions);df1

# Read subset of columns, it works!
col_positions <- fwf_positions(start[1:2],end[1:2],col_names[1:2])
df2 <- read_fwf(x,col_positions = col_positions);df2

# Read Ragged
col_positions <- fwf_positions(start,end=c(2,5,NA),col_names)
df4 <- read_fwf(x,col_positions = col_positions);df4

# Read Ragged alternate way with fwf_widths
col_positions <- fwf_widths(widths = c(2,3,NA),col_names)
df5 <- read_fwf(x,col_positions = col_positions);df5
@@ -76,7 +77,9 @@ fwf_widths <- function(widths, col_names = NULL) {
#' @rdname read_fwf
#' @export
#' @param start,end Starting and ending (inclusive) positions of each field.
#' Use NA or Inf as last end field when reading a ragged fwf file.

This comment has been minimized.

@hadley

hadley Jun 4, 2016

Member

I think you missed the Inf here


// File is assumed to be ragged (last column can have variable width)
// when the last element of endOffset_ is NA
isRagged_ = endOffset_[endOffset_.size()-1L] == NA_INTEGER;

This comment has been minimized.

@hadley

hadley Jun 4, 2016

Member

Can you put spaces around - please?

This comment has been minimized.

@ghaarsma

ghaarsma Jun 6, 2016

Contributor

Fixed

@@ -12,7 +12,7 @@ class TokenizerFwf : public Tokenizer {

SourceIterator begin_, curLine_, end_;
int row_, col_, cols_, max_;
bool moreTokens_;
bool moreTokens_,isRagged_;

This comment has been minimized.

@hadley

hadley Jun 4, 2016

Member

Missing space after ,

This comment has been minimized.

@ghaarsma

ghaarsma Jun 6, 2016

Contributor

Fixed

@hadley

This comment has been minimized.

Member

hadley commented Jun 4, 2016

This is looking good. Can you please add those examples as unit tests, and add a bullet point to NEWS? (it should include the original issue number and your github user name)

@ghaarsma

This comment has been minimized.

Contributor

ghaarsma commented Jun 6, 2016

Added new unit tests and a bullet point to the NEWS.md. Let me know if this is all correct. Some of the work process is a first time for me.

@hadley

This comment has been minimized.

Member

hadley commented Jun 6, 2016

This looks great!

There's one last thing for you to attempt - can you please rebase/merge to bring your branch up to date with the other changes? If you get stuck, don't worry too much, as I can do it by hand, but it's a good learning experience for future PRs if you want to give it a shot.

ghaarsma added some commits Jan 18, 2016

Fixed implementation of reading Ragged fwf (fixed width format) files.
You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA,Inf or simply omitted.

Updated documentation inf read_fwf to reflect changes.
Fixed implementation of reading Ragged fwf (fixed width format) files.
You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA,Inf or simply omitted.
Changes made to the reading of fixed width files are discussion with …
…Hadley

The only way to assume a fwf ragged file is for the last end position to be NA.
Changes made to the reading of fixed width files are discussion with …
…Hadley

The only way to assume a fwf ragged file is for the last end position to be NA.
@ghaarsma

This comment has been minimized.

Contributor

ghaarsma commented Jun 7, 2016

Hadley,

I tried to follow: https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request. Had some minor issues, but I think I got it. Please check when merging the pull request.

@codecov-io

This comment has been minimized.

codecov-io commented Jun 7, 2016

Current coverage is 70.00%

Merging #353 into master will increase coverage by <.01%

@@             master       #353   diff @@
==========================================
  Files            56         56          
  Lines          2803       2807     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           1961       1965     +4   
  Misses          842        842          
  Partials          0          0          

Powered by Codecov. Last updated by e41bc7e...f5b91b9

@hadley hadley merged commit 8b253b2 into tidyverse:master Jun 7, 2016

3 checks passed

codecov/patch 87.50% of diff hit (target 69.96%)
Details
codecov/project 70.00% (+<.01%) compared to e41bc7e
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@hadley hadley removed the in progress label Jun 7, 2016

@hadley

This comment has been minimized.

Member

hadley commented Jun 7, 2016

Perfect - thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment