Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fwf_cols function #616

Merged
merged 24 commits into from Feb 27, 2017
Merged

Add fwf_cols function #616

merged 24 commits into from Feb 27, 2017

Conversation

jrnold
Copy link
Contributor

@jrnold jrnold commented Feb 18, 2017

This adds a helper function fwf_cols that is a more intuitive way of specifying fixed width column start and end points. While fwf_positions requires three vectors for start, end, and names, fwf_cols accepts a named list of length-2 vectors of the column start and end positions.

For example,

# 3. Paired vectors of start and end positions
read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
# 4. Named list of start and end positions
read_fwf(fwf_sample, fwf_cols(list(name = c(1, 10), ssn = c(30, 42))))

This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
@hadley
Copy link
Member

hadley commented Feb 19, 2017

I think this is an improvement, but I wonder if a wrapper around tribble would be even nicer.

@jrnold
Copy link
Contributor Author

jrnold commented Feb 19, 2017

I was thinking about whether a wrapper around a data frame would be useful and almost included a version in it, but decided against it. My thinking was there's two main way you could get the column specifications (1) if the column specifications are a data frame with two (variable, widths) or three columns (variable, start, end), or (2) they are entering it by hand.

If the column specifications are already in a data frame (i'll call is cols), it's not too unclear to simply call the existing functions by simply referencing the columns of the data frame:

fwf_postions(cols$start, cols$end, cols$varname)

To me, that's still pretty clear, and not too much typing.
Possibly fwf_positions and fwf_widths could be made into generic functions, with methods for the first function being a vector, matrix, or data frame.

The second case is entering it by hand (when it's not too many columns). In that case, having the columns as argument names and the widths or (start, end) as values seems most natural. fwf_cols could be generalized to allow the values to be either (start, end) tuples or widths.

# with widths
fwf_cols(foo = 1, bar = 5)
# with (start, end) tuples
fwf_cols(foo = c(1, 4), bar = c(5, 10)

This came up when I was helping a student read a fixed-width file. I foolishly didn't RTFM before writing code, and assumed that the format was something like what I was just wrote. When we got an error and actually read the documentation, I was too lazy to adjust change the code and used map to convert it into what fwf_positions was looking for.

It needs the .Rd file.
@hadley
Copy link
Member

hadley commented Feb 22, 2017

What if we allowed col_positions to take a data frame? If one row, the values are widths; if two rows, the values are start and end. If >= 3 rows, an error.

Then you'd have:

read_fwf(fwf_sample, tibble(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, tibble(name = 10, skip = 20, ssn = 12))

@jrnold
Copy link
Contributor Author

jrnold commented Feb 23, 2017

I don't know. It seems more natural and easy to document that col_positions accepts a "long" tibble like fwf_positions produces. And the helper functions, fwf_positions, fwf_widths, fwf_cols all return such a tibble. This would minimize the amount of code that's rewritten, and keep these functions backwards compatible while improving the ease of use.

If a user is able to write the following, it's about as concise as the code above, and I'd say as readable.

read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, fwf_cols(name = 10, skip = 20, ssn = 12))

And the following would still work:

x <- tribble(
  ~ col_name, ~start, ~ end
  name, 1, 10,
  ssn, 30, 42
)
read_fwf(fwf_sample, x)

This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
It needs the .Rd file.
This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
- read_fwf arg col_positions will check for column names and whether the data frame is widths or a start/end data frame.
- rewrite fwf_cols to accept named args of length 1 or 2. This makes it more concise. Also accept a data frame as the first argument
- More checks for argument validity
- Use tibbles instead of lists where appropriate

Some tests failing. Still need to debug.
@jrnold
Copy link
Contributor Author

jrnold commented Feb 24, 2017

Now I have it so that

  • The col_positions argument ofread_fwf accepts a list with either begin and end columns or width and treats them appropriately
  • The fwf_cols(...) calls tibble(...) or uses the first argument if it is a list. It will expect either 1 row, in which case it calls fwf_widths or 2 rows, in which case it calls fwf_positions.

#' # 1. Guess based on position of empty columns
#' read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
#' # 2. A vector of field widths
#' read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
#' # 3. Paired vectors of start and end positions
#' read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
#' # 4. Named arguments with start and end positions
#' read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include the width form here too?

R/read_fwf.R Outdated
return(tibble::data_frame())
return(tibble::tibble())
}
if (!is.list(col_positions)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels too complicated to me. If we have fwf_cols() I don't think we need to worry about list/data.frame inputs.

R/read_fwf.R Outdated
}

tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na, comment = comment)
tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this back please?

@@ -1,3 +1,10 @@
fwf_col_names <- function(nm, n) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be much lower in the file


#' @rdname read_fwf
#' @export
#' @param ... If the first element is a data frame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels too flexible to me. But if you really think it's a good idea to keep it, the function signature should be x, ...

R/read_fwf.R Outdated
names(x) <- fwf_col_names(names(x), length(x))
x <- tibble::as_tibble(x)
if (nrow(x) == 2) {
fwf_positions(as.integer(x[1, ]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indenting style

R/read_fwf.R Outdated
if (is.list(x[[1]])) {
x <- x[[1]]
}
x <- try(lapply(x, as.integer), silent = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think like approach. I'd say just let the error bubble up to the user.

@@ -127,6 +127,43 @@ test_that("error on empty spec (#511, #519)", {
expect_error(read_fwf(txt, pos), "Zero-length.*specifications not supported")
})

# fwf_cols
test_that("fwf_cols produces correct fwf_positions object with elements of length 2", {
expected <- fwf_positions(c(1, 9, 4),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please fix the indenting here too?

If the arguments don't fit on one line it should look like:

function_name(
   arg1,
   argument_name = arg2,
   ...
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a different style than the one in http://adv-r.had.co.nz/Style.html, which would be

function_name(arg1,
              argument_name = arg2,
              ...)

Move fwf_col_names function lower in file.

See tidyverse#616
Add widths form of fwf_cols to documentation

See tidyverse#616
This is too complicated; since we have fwf_cols, don't worry about list inputs.

See tidyverse#616
Remove a try() call since, the preference is for errors to bubble up to users.

See tidyverse#616
This seems too flexible, so I'll change it to just use ...

See tidyverse#616
- removed tests that failed after removing features in fwf_cols in previous commits
- remove test that failed because fwf_positions changes columns to numeric
Revert this section in read_fwf since it is unnecessary to handle data list objects with the availability of fwf_cols

See tidyverse#616
Convert some numeric constants to integer constants so that addition/subtraction does not coerce columns to numeric if they were integer. This is not a big deal, but since the positions represent integers anyways, it might as well keep them as such if they are already specified as such.
Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Now just needs a bullet point in NEWS.md in the appropriate place

NEWS.md Outdated
@@ -1,5 +1,31 @@
# readr 1.1.0

* `fwf_cols()` allows for specifying the `col_positions` argument of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something went wrong with your merge 😞

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. wtf did that merge do? Bad git :-( Sorry about that, and fixed now.

weird things happened to NEWS.md. They are fixed now.
@jimhester jimhester merged commit e7a5b62 into tidyverse:master Feb 27, 2017
@jimhester
Copy link
Collaborator

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants