Add fwf_cols function #616

Merged
merged 24 commits into from Feb 27, 2017

@jrnold
Contributor

jrnold commented Feb 18, 2017

This adds a helper function fwf_cols that is a more intuitive way of specifying fixed width column start and end points. While fwf_positions requires three vectors for start, end, and names, fwf_cols accepts a named list of length-2 vectors of the column start and end positions.

For example,

# 3. Paired vectors of start and end positions
read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
# 4. Named list of start and end positions
read_fwf(fwf_sample, fwf_cols(list(name = c(1, 10), ssn = c(30, 42))))
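To make the difference between the two forms concrete, here is a minimal base-R sketch of how a named list of (start, end) pairs could be unpacked into the three parallel vectors that fwf_positions takes. The helper name `pairs_to_vectors` is hypothetical and is not part of readr or of this PR.

```r
# Hypothetical helper (not part of readr): unpack a named list of
# c(start, end) pairs into three parallel vectors.
pairs_to_vectors <- function(cols) {
  stopifnot(is.list(cols), !is.null(names(cols)))
  stopifnot(all(lengths(cols) == 2))
  list(
    start = vapply(cols, function(p) p[[1]], numeric(1), USE.NAMES = FALSE),
    end = vapply(cols, function(p) p[[2]], numeric(1), USE.NAMES = FALSE),
    col_names = names(cols)
  )
}

spec <- pairs_to_vectors(list(name = c(1, 10), ssn = c(30, 42)))
str(spec)
```

The three elements of `spec` line up with the start, end, and names arguments used in example 3 above.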

jrnold added some commits Feb 18, 2017

Add fwf_cols function
This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
@hadley

Member

hadley commented Feb 19, 2017

I think this is an improvement, but I wonder if a wrapper around tribble would be even nicer.

@jrnold

Contributor

jrnold commented Feb 19, 2017

I was thinking about whether a wrapper around a data frame would be useful and almost included a version of it, but decided against it. My thinking was that there are two main ways you could get the column specifications: (1) they are already in a data frame with two columns (variable, width) or three columns (variable, start, end), or (2) the user is entering them by hand.

If the column specifications are already in a data frame (I'll call it cols), it's clear enough to call the existing functions by referencing the columns of the data frame:

fwf_positions(cols$start, cols$end, cols$varname)

To me, that's still pretty clear, and not too much typing.
Possibly fwf_positions and fwf_widths could be made into generic functions, with methods for the first argument being a vector, matrix, or data frame.

The second case is entering it by hand (when it's not too many columns). In that case, having the columns as argument names and the widths or (start, end) as values seems most natural. fwf_cols could be generalized to allow the values to be either (start, end) tuples or widths.

# with widths
fwf_cols(foo = 1, bar = 5)
# with (start, end) tuples
fwf_cols(foo = c(1, 4), bar = c(5, 10))

This came up when I was helping a student read a fixed-width file. I foolishly didn't RTFM before writing code, and assumed that the format was something like what I just wrote. When we got an error and actually read the documentation, I was too lazy to change the code and used map to convert it into what fwf_positions was looking for.

Fix failed Travis build
It needs the .Rd file.
@hadley

Member

hadley commented Feb 22, 2017

What if we allowed col_positions to take a data frame? If one row, the values are widths; if two rows, the values are start and end. If >= 3 rows, an error.

Then you'd have:

read_fwf(fwf_sample, tibble(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, tibble(name = 10, skip = 20, ssn = 12))
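The rule proposed here (one value per column means widths, two values mean start and end, anything else is an error) can be sketched in base R as follows. This is only an illustration under stated assumptions: `cols_to_positions` is a hypothetical name, it takes named arguments rather than a data frame, and it returns 1-based inclusive start/end positions, which may differ from how readr stores positions internally.

```r
# Hypothetical sketch (not readr's implementation): interpret named
# arguments as widths (length 1) or start/end pairs (length 2).
cols_to_positions <- function(...) {
  x <- list(...)
  stopifnot(!is.null(names(x)))
  n <- unique(lengths(x))
  if (length(n) != 1L || !n %in% c(1L, 2L)) {
    stop("All columns must have length 1 (widths) or length 2 (start, end).")
  }
  if (n == 1L) {
    # Widths: each column ends at the running total of the widths.
    widths <- unlist(x, use.names = FALSE)
    end <- cumsum(widths)
    start <- end - widths + 1
  } else {
    # Start/end pairs: take the first and second element of each pair.
    start <- vapply(x, function(p) p[[1]], numeric(1), USE.NAMES = FALSE)
    end <- vapply(x, function(p) p[[2]], numeric(1), USE.NAMES = FALSE)
  }
  data.frame(
    col_names = names(x),
    start = start,
    end = end,
    stringsAsFactors = FALSE
  )
}

by_width <- cols_to_positions(name = 10, skip = 20, ssn = 12)
by_position <- cols_to_positions(name = c(1, 10), ssn = c(30, 42))
```

Passing three or more values for a column, or mixing lengths across columns, stops with an error instead of guessing what was meant.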
@jrnold

Contributor

jrnold commented Feb 23, 2017

I don't know. It seems more natural and easy to document that col_positions accepts a "long" tibble like fwf_positions produces. And the helper functions, fwf_positions, fwf_widths, fwf_cols all return such a tibble. This would minimize the amount of code that's rewritten, and keep these functions backwards compatible while improving the ease of use.

If a user is able to write the following, it's about as concise as the code above, and I'd say as readable.

read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, fwf_cols(name = 10, skip = 20, ssn = 12))

And the following would still work:

x <- tribble(
  ~col_name, ~start, ~end,
  "name", 1, 10,
  "ssn", 30, 42
)
read_fwf(fwf_sample, x)

jrnold added some commits Feb 18, 2017

Updates to fwf_* column position functions
- read_fwf arg col_positions will check for column names and whether the data frame is widths or a start/end data frame.
- rewrite fwf_cols to accept named args of length 1 or 2. This makes it more concise. Also accept a data frame as the first argument
- More checks for argument validity
- Use tibbles instead of lists where appropriate

Some tests failing. Still need to debug.

@jrnold jrnold force-pushed the jrnold:fwf_cols branch from 1f39aed to 7a35bac Feb 23, 2017

@jrnold

Contributor

jrnold commented Feb 24, 2017

Now I have it so that

  • The col_positions argument of read_fwf accepts a list with either begin and end columns or width, and treats them appropriately.
  • fwf_cols(...) calls tibble(...), or uses the first argument if it is a list. It expects either 1 row, in which case it calls fwf_widths, or 2 rows, in which case it calls fwf_positions.
#' # 1. Guess based on position of empty columns
#' read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
#' # 2. A vector of field widths
#' read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
#' # 3. Paired vectors of start and end positions
#' read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
#' # 4. Named arguments with start and end positions
#' read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))

@hadley

hadley Feb 24, 2017

Member

Can you include the width form here too?

return(tibble::data_frame())
return(tibble::tibble())
}
if (!is.list(col_positions)) {

@hadley

hadley Feb 24, 2017

Member

This feels too complicated to me. If we have fwf_cols() I don't think we need to worry about list/data.frame inputs.

}

tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na, comment = comment)
tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na,

@hadley

hadley Feb 24, 2017

Member

Can you change this back please?

@@ -1,3 +1,10 @@
fwf_col_names <- function(nm, n) {

@hadley

hadley Feb 24, 2017

Member

This should be much lower in the file


#' @rdname read_fwf
#' @export
#' @param ... If the first element is a data frame,

@hadley

hadley Feb 24, 2017

Member

This feels too flexible to me. But if you really think it's a good idea to keep it, the function signature should be x, ...

names(x) <- fwf_col_names(names(x), length(x))
x <- tibble::as_tibble(x)
if (nrow(x) == 2) {
fwf_positions(as.integer(x[1, ]),

@hadley

hadley Feb 24, 2017

Member

Indenting style

if (is.list(x[[1]])) {
x <- x[[1]]
}
x <- try(lapply(x, as.integer), silent = TRUE)

@hadley

hadley Feb 24, 2017

Member

I don't think I like this approach. I'd say just let the error bubble up to the user.

@@ -127,6 +127,43 @@ test_that("error on empty spec (#511, #519)", {
expect_error(read_fwf(txt, pos), "Zero-length.*specifications not supported")
})

# fwf_cols
test_that("fwf_cols produces correct fwf_positions object with elements of length 2", {
expected <- fwf_positions(c(1, 9, 4),

@hadley

hadley Feb 24, 2017

Member

Can you please fix the indenting here too?

If the arguments don't fit on one line it should look like:

function_name(
   arg1,
   argument_name = arg2,
   ...
)

@jrnold

jrnold Feb 25, 2017

Contributor

So this is a different style than the one in http://adv-r.had.co.nz/Style.html, which would be

function_name(arg1,
              argument_name = arg2,
              ...)

jrnold added some commits Feb 25, 2017

respond to hadley's comments
Move fwf_col_names function lower in file.

See #616
respond to hadley's comments
Add widths form of fwf_cols to documentation

See #616
respond to hadley's comments
This is too complicated; since we have fwf_cols, don't worry about list inputs.

See #616
respond the hadley's comments
Revert added newline

See #616
respond to hadley's comments
Remove a try() call since, the preference is for errors to bubble up to users.

See #616
respond to hadley's comments
This seems too flexible, so I'll change it to just use ...

See #616
Fix indenting issues
See comments in #616
fix tests
- removed tests that failed after removing features in fwf_cols in previous commits
- remove test that failed because fwf_positions changes columns to numeric
respond to hadley's comments
Revert this section in read_fwf since it is unnecessary to handle data list objects with the availability of fwf_cols

See #616
convert numeric constants in fwf functions
Convert some numeric constants to integer constants so that addition/subtraction does not coerce columns to numeric if they were integer. This is not a big deal, but since the positions represent integers anyways, it might as well keep them as such if they are already specified as such.
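The coercion this commit message describes is standard R behavior and can be seen with a short example. The vector `begin` below is an illustrative stand-in, not code taken from the PR.

```r
begin <- c(1L, 30L)

# Subtracting a double constant promotes the result to double.
typeof(begin - 1)   # "double"

# Subtracting an integer constant keeps the result an integer.
typeof(begin - 1L)  # "integer"
```

Writing the constant with an `L` suffix is enough to keep integer positions as integers.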
@hadley

hadley approved these changes Feb 25, 2017

Looks good. Now just needs a bullet point in NEWS.md in the appropriate place

NEWS.md Outdated
@@ -1,5 +1,31 @@
# readr 1.1.0

* `fwf_cols()` allows for specifying the `col_positions` argument of

@hadley

hadley Feb 25, 2017

Member

I think something went wrong with your merge 😞

@jrnold

jrnold Feb 25, 2017

Contributor

Oops. wtf did that merge do? Bad git :-( Sorry about that, and fixed now.

fix merge error
weird things happened to NEWS.md. They are fixed now.

@jimhester jimhester merged commit e7a5b62 into tidyverse:master Feb 27, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jimhester

Member

jimhester commented Feb 27, 2017

Thanks!
