Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fwf_cols function #616

Merged
merged 24 commits into from Feb 27, 2017
Merged

Add fwf_cols function #616

merged 24 commits into from Feb 27, 2017

Conversation

@jrnold
Copy link
Contributor

@jrnold jrnold commented Feb 18, 2017

This adds a helper function fwf_cols that is a more intuitive way of specifying fixed width column start and end points. While fwf_positions requires three vectors for start, end, and names, fwf_cols accepts a named list of length-2 vectors of the column start and end positions.

For example,

# 3. Paired vectors of start and end positions
read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
# 4. Named list of start and end positions
read_fwf(fwf_sample, fwf_cols(list(name = c(1, 10), ssn = c(30, 42))))
jrnold added 2 commits Feb 18, 2017
This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
@hadley
Copy link
Member

@hadley hadley commented Feb 19, 2017

I think this is an improvement, but I wonder if a wrapper around tribble would be even nicer.

@jrnold
Copy link
Contributor Author

@jrnold jrnold commented Feb 19, 2017

I was thinking about whether a wrapper around a data frame would be useful and almost included a version in it, but decided against it. My thinking was there's two main way you could get the column specifications (1) if the column specifications are a data frame with two (variable, widths) or three columns (variable, start, end), or (2) they are entering it by hand.

If the column specifications are already in a data frame (i'll call is cols), it's not too unclear to simply call the existing functions by simply referencing the columns of the data frame:

fwf_postions(cols$start, cols$end, cols$varname)

To me, that's still pretty clear, and not too much typing.
Possibly fwf_positions and fwf_widths could be made into generic functions, with methods for the first function being a vector, matrix, or data frame.

The second case is entering it by hand (when it's not too many columns). In that case, having the columns as argument names and the widths or (start, end) as values seems most natural. fwf_cols could be generalized to allow the values to be either (start, end) tuples or widths.

# with widths
fwf_cols(foo = 1, bar = 5)
# with (start, end) tuples
fwf_cols(foo = c(1, 4), bar = c(5, 10)

This came up when I was helping a student read a fixed-width file. I foolishly didn't RTFM before writing code, and assumed that the format was something like what I was just wrote. When we got an error and actually read the documentation, I was too lazy to adjust change the code and used map to convert it into what fwf_positions was looking for.

It needs the .Rd file.
@hadley
Copy link
Member

@hadley hadley commented Feb 22, 2017

What if we allowed col_positions to take a data frame? If one row, the values are widths; if two rows, the values are start and end. If >= 3 rows, an error.

Then you'd have:

read_fwf(fwf_sample, tibble(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, tibble(name = 10, skip = 20, ssn = 12))
@jrnold
Copy link
Contributor Author

@jrnold jrnold commented Feb 23, 2017

I don't know. It seems more natural and easy to document that col_positions accepts a "long" tibble like fwf_positions produces. And the helper functions, fwf_positions, fwf_widths, fwf_cols all return such a tibble. This would minimize the amount of code that's rewritten, and keep these functions backwards compatible while improving the ease of use.

If a user is able to write the following, it's about as concise as the code above, and I'd say as readable.

read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, fwf_cols(name = 10, skip = 20, ssn = 12))

And the following would still work:

x <- tribble(
  ~ col_name, ~start, ~ end
  name, 1, 10,
  ssn, 30, 42
)
read_fwf(fwf_sample, x)
jrnold added 5 commits Feb 18, 2017
This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
It needs the .Rd file.
This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
- read_fwf arg col_positions will check for column names and whether the data frame is widths or a start/end data frame.
- rewrite fwf_cols to accept named args of length 1 or 2. This makes it more concise. Also accept a data frame as the first argument
- More checks for argument validity
- Use tibbles instead of lists where appropriate

Some tests failing. Still need to debug.
@jrnold jrnold force-pushed the jrnold:fwf_cols branch to 7a35bac Feb 23, 2017
@jrnold
Copy link
Contributor Author

@jrnold jrnold commented Feb 24, 2017

Now I have it so that

  • The col_positions argument ofread_fwf accepts a list with either begin and end columns or width and treats them appropriately
  • The fwf_cols(...) calls tibble(...) or uses the first argument if it is a list. It will expect either 1 row, in which case it calls fwf_widths or 2 rows, in which case it calls fwf_positions.
#' # 1. Guess based on position of empty columns
#' read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
#' # 2. A vector of field widths
#' read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
#' # 3. Paired vectors of start and end positions
#' read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
#' # 4. Named arguments with start and end positions
#' read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

Can you include the width form here too?

return(tibble::data_frame())
return(tibble::tibble())
}
if (!is.list(col_positions)) {

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

This feels too complicated to me. If we have fwf_cols() I don't think we need to worry about list/data.frame inputs.

}

tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na, comment = comment)
tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na,

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

Can you change this back please?

@@ -1,3 +1,10 @@
fwf_col_names <- function(nm, n) {

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

This should be much lower in the file


#' @rdname read_fwf
#' @export
#' @param ... If the first element is a data frame,

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

This feels too flexible to me. But if you really think it's a good idea to keep it, the function signature should be x, ...

names(x) <- fwf_col_names(names(x), length(x))
x <- tibble::as_tibble(x)
if (nrow(x) == 2) {
fwf_positions(as.integer(x[1, ]),

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

Indenting style

if (is.list(x[[1]])) {
x <- x[[1]]
}
x <- try(lapply(x, as.integer), silent = TRUE)

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

I don't think like approach. I'd say just let the error bubble up to the user.

@@ -127,6 +127,43 @@ test_that("error on empty spec (#511, #519)", {
expect_error(read_fwf(txt, pos), "Zero-length.*specifications not supported")
})

# fwf_cols
test_that("fwf_cols produces correct fwf_positions object with elements of length 2", {
expected <- fwf_positions(c(1, 9, 4),

This comment has been minimized.

@hadley

hadley Feb 24, 2017
Member

Can you please fix the indenting here too?

If the arguments don't fit on one line it should look like:

function_name(
   arg1,
   argument_name = arg2,
   ...
)

This comment has been minimized.

@jrnold

jrnold Feb 25, 2017
Author Contributor

So this is a different style than the one in http://adv-r.had.co.nz/Style.html, which would be

function_name(arg1,
              argument_name = arg2,
              ...)
jrnold added 12 commits Feb 25, 2017
Move fwf_col_names function lower in file.

See #616
Add widths form of fwf_cols to documentation

See #616
This is too complicated; since we have fwf_cols, don't worry about list inputs.

See #616
Revert added newline

See #616
Remove a try() call since, the preference is for errors to bubble up to users.

See #616
This seems too flexible, so I'll change it to just use ...

See #616
See comments in #616
- removed tests that failed after removing features in fwf_cols in previous commits
- remove test that failed because fwf_positions changes columns to numeric
Revert this section in read_fwf since it is unnecessary to handle data list objects with the availability of fwf_cols

See #616
Convert some numeric constants to integer constants so that addition/subtraction does not coerce columns to numeric if they were integer. This is not a big deal, but since the positions represent integers anyways, it might as well keep them as such if they are already specified as such.
@hadley
hadley approved these changes Feb 25, 2017
Copy link
Member

@hadley hadley left a comment

Looks good. Now just needs a bullet point in NEWS.md in the appropriate place

NEWS.md Outdated
@@ -1,5 +1,31 @@
# readr 1.1.0

* `fwf_cols()` allows for specifying the `col_positions` argument of

This comment has been minimized.

@hadley

hadley Feb 25, 2017
Member

I think something went wrong with your merge 😞

This comment has been minimized.

@jrnold

jrnold Feb 25, 2017
Author Contributor

Oops. wtf did that merge do? Bad git :-( Sorry about that, and fixed now.

weird things happened to NEWS.md. They are fixed now.
@jimhester jimhester merged commit e7a5b62 into tidyverse:master Feb 27, 2017
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jimhester
Copy link
Member

@jimhester jimhester commented Feb 27, 2017

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants