New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support overlapping fwf columns #534

Closed
gergness opened this Issue Oct 17, 2016 · 12 comments

Comments

Projects
None yet
3 participants
@gergness
Contributor

gergness commented Oct 17, 2016

I sometimes receive fixed with files with specifications that have overlapping columns. readr's current behavior made it difficult to figure out what was going wrong.

I wish that I was able to read in the data with the original specifications and get the 2 variables as normal, but if that's not possible, perhaps fwf_positions could check for overlaps and throw a better error?

library(readr)
read_fwf("2015a\n2016b", fwf_positions(c(1, 3, 5), c(4, 4, 5)))
#> Warning: 4 parsing failures.
#> row col  expected    actual
#>   1  X2 2 chars   1        
#>   1  -- 3 columns 2 columns
#>   2  X2 2 chars   1        
#>   2  -- 3 columns 2 columns
#> 
#> # A tibble: 2 × 3
#>      X1    X2    X3
#>   <int> <chr> <chr>
#> 1  2015     a  <NA>
#> 2  2016     b  <NA>
@lucasmation

This comment has been minimized.

lucasmation commented Oct 23, 2016

You are right. Bump!

I am having an embarrassing problem in a package (microdadosBrasil) because of the error reading overlapping columns in the original fwf data source.

I am posting a new example just to be even more clear

Not only are the overlaping columns wrong, but also all the following columns have the initial and end positions shifted right, thus making all the data wrong.

Let's say I have this import dictionary

# name: initial position to final position
# X1: 1 to 2
# X2: 3 to 4
# X3: 3 to 7 (overlaps first two caracters with X2)
# X4: 8 to 9
# X5: 10 to 14

And this dataset

123456789abcdefg
123456789abcdefg

We can import this using


'123456789abcdefg\n123456789abcdefg' %>% 
  read_fwf(fwf_positions(start=c(1,3,3,8,10 ), 
                         end=  c(2,4,7,9,13 ) )
  ) 

This leads the data being imported with errors. In particular X3 and every column after that are wrong:

# A tibble: 2 × 5
     X1    X2    X3    X4    X5
  <int> <int> <int> <chr> <chr>
1    12    34 56789    ab  cdef
2    12    34 56789    ab  cdef
@gergness

This comment has been minimized.

Contributor

gergness commented Oct 24, 2016

In case it is helpful @lucasmation, here is my work around. It depends on dplyr, stringr and tidyr.

There are 2 limitations that are okay for my use, but might not for all (but you should be able to adapt the code if needed):

  1. It requires that all columns be read in as character.
  2. It requires that the overlapping pattern be such that there are no variables that are subsets of more than one variable.
read_fwf_overlap <- function(file, col_positions, col_types = NULL, ...) {
  if (!is.null(col_types)) {
    warning("read_fwf_overlap does not support col_types, columns will be read in as character", 
         call. = FALSE)
  }
  fixed_specs <- fix_for_fwf_overlap(col_positions)

  out <- readr::read_fwf(file, fixed_specs$vinfo, col_types = cols(.default = col_character()), ...)
  out <- dplyr::mutate_(out, .dots = fixed_specs$mdots)
  out <- out[, col_positions$col_names] # Rearranges to original order
  out
}


# readr doesn't support overlapping fixed with columns. This
# function takes a fwf specificaiton with overlaps, and returns
# a non-overlapping set and another text object that can
# be passed to a mutate_ call on the loaded dataset to
# create the overlapping variables.
# Only works if the overlapping columns fit within exactly
# one other column.
fix_for_fwf_overlap <- function(fwf_positions) {
  vinfo <- dplyr::as_data_frame(fwf_positions)

  # Check to see if there are overlapping columns. The arrange
  # sorts the data so that the widest column with a given start
  # is first, this way the other columns can be created
  # from it.
  vinfo <- dplyr::mutate_(vinfo, width = ~end - begin + 1)
  vinfo <- dplyr::arrange_(vinfo, ~begin, ~desc(width))
  vinfo <- dplyr::mutate_(vinfo, maxend = ~cummax(dplyr::lag(end, default = 0)),
                          is_overlapping = ~begin < maxend,
                          is_overlapper = ~end > dplyr::lead(begin, default = Inf))

  # Get dataframe of non-overlapping columns
  nonoverlaps <- vinfo
  nonoverlaps <- dplyr::filter_(nonoverlaps, ~!is_overlapping)
  nonoverlaps <- dplyr::select_(nonoverlaps, ~col_names, ~begin, ~end)

  # Make mdots to pass into a mutate_ function that will take substrings of
  # The widest columns to make the sub-columns.
  ## Start by filling some of the overlapper information down to the overlapping
  overlaps <- vinfo
  overlaps <- dplyr::filter_(overlaps, ~is_overlapper | is_overlapping)
  if (nrow(overlaps) > 1) {
    overlaps <- dplyr::mutate_(overlaps, overlapper_name = ~ifelse(is_overlapper, col_names, NA),
                               overlapper_begin = ~ifelse(is_overlapper, begin, NA),
                               overlapper_end = ~ifelse(is_overlapper, end, NA))
    overlaps <- tidyr::fill_(overlaps, c("overlapper_name", "overlapper_begin", "overlapper_end"))

    ## Check if there are variables that are partially overlapped, if so, then just throw
    ## an error.
    if (any(overlaps$end > overlaps$overlapper_end)) stop("Columns overlap with more than one othe variable", 
                                                          call. = FALSE)

    ## Make a text representation of the substring calls we'll want to make
    ## on the dataframe after it is loaded.
    overlaps <- dplyr::mutate_(overlaps, rel_begin = ~begin - overlapper_begin + 1,
                               rel_end = ~end - overlapper_begin + 1)
    overlaps <- dplyr::filter_(overlaps, ~is_overlapping)

    mdots <- sprintf("stringr::str_sub(%s, %s, %s)", overlaps$overlapper_name, 
                     overlaps$rel_begin, overlaps$rel_end)

    names(mdots) <- overlaps$col_names
  } else {
    # If no overlapping, don't worry about mdots.
    mdots <- NULL
  }
  # Return list of the nonoverlaps, plus the dots to be passed into mdots
  list(vinfo = nonoverlaps, mdots = mdots)
}
'123456789abcdefg\n123456789abcdefg' %>% 
  read_fwf_overlap(fwf_positions(start=c(1,3,3,8,10 ), 
                                 end=  c(2,4,7,9,13 ) )
  ) 

#> # A tibble: 2 × 5
#>      X1    X2    X3    X4    X5
#>   <chr> <chr> <chr> <chr> <chr>
#> 1    12   345 34567    89  abcd
#> 2    12   345 34567    89  abcd
@lucasmation

This comment has been minimized.

lucasmation commented Oct 25, 2016

that is nice. I would only change it slightly so as to not lose the col_type information. I think the function should:

a) receive the import dictionary parameters (start, end, col_type, names, etc)
b) check for overlaps. If overlaps are not nested throw an error message. else remove the shortest (nested) overlapping collumn(s).
c) save in a different object the information on the removed columns:

  • original information: (start, end, col_type, names, etc)
  • the column (maintained in the dataset) which it overlaps
  • new start and end postions relative to the column overlaps

d) import the data, based on the "cleaned" dictionary (without overlaps)
e) recreate the additional variables based on "c)"

@lucasmation

This comment has been minimized.

lucasmation commented Oct 25, 2016

In any case, I hope the developer fix this ASAP

@hadley

This comment has been minimized.

Member

hadley commented Dec 22, 2016

This seems like an extremely esoteric use of fwf, but this should throw an informative error.

@lucasmation

This comment has been minimized.

lucasmation commented Dec 23, 2016

Tks Hadley. Maybe it is more common than you think. In Brazil every microdata from IBGE, national statistics office, contains overlaping variables. It is always the municipal code (7digits), which contains the state code (2 digits) in its first two digits.

@hadley

This comment has been minimized.

Member

hadley commented Dec 23, 2016

You can always extract that after the fact.

@lucasmation

This comment has been minimized.

lucasmation commented Dec 25, 2016

I had ommited the redundant variables in the import functions of IBGE data in the microdadosBrasil package. The problem is that the import functions in other software (SAS, Stata) do support overlapping columns, so users expect that to find those variables. I can, off course recreate, the variables afterwards, but that bloats the code with variable definitions, which ideally should be separate (I am very inspired by your cupcake recipes generalisation here). And more broadly, as long as import dictionary makes sense and there is a minimum amount of data using the nested definition, I shouldn't be up to the import function to restrict such cases.

@hadley

This comment has been minimized.

Member

hadley commented Dec 28, 2016

I'd consider a PR to implement this, but we're unlikely to do it otherwise.

@hadley hadley closed this Dec 28, 2016

@lucasmation

This comment has been minimized.

lucasmation commented Jan 13, 2017

@nicolassoarespinto, @gutorc92: lets try to do this (see above)

gergness added a commit to gergness/readr that referenced this issue Jan 24, 2017

jimhester added a commit that referenced this issue Feb 27, 2017

Squashed commit of the following:
commit 217cc7751ec8cfdec89e4fe21717c1a8aff421ad
Author: Jim Hester <james.f.hester@gmail.com>
Date:   Mon Feb 27 10:53:49 2017 -0500

    Fix failing tests, reformatting

commit 3545e47df758687a0ad2676bfbd47f7efb146a7d
Author: gfellis <gfellis@umn.edu>
Date:   Thu Jan 26 15:05:24 2017 -0600

    Update news.

commit 7b2e2f20bf92a593b6e79aa895124829ea41cf1b
Author: gfellis <gfellis@umn.edu>
Date:   Tue Jan 24 10:59:23 2017 -0600

    Fix off by one error

commit 41763088413c47b5035163792564427db9ae9d0f
Author: gfellis <gfellis@umn.edu>
Date:   Tue Jan 24 10:27:37 2017 -0600

    Add error for overlapping fwf specification (#534).
@lucasmation

This comment has been minimized.

lucasmation commented Mar 15, 2017

@hadley,
we have been working on PRs to fix the two limitations of read_fwf

  1. Can't deal with decimal places (issue #399). Proposed solution: add a new argument decimal_positions to fwf_positions, modifying the main read_fwf function (pull request #632, implemented in Rcpp, tks to @gutorc92)

  2. Can't deal with overlapping columns (current issue). Proposed solution: parse the import dictionary into a list of dictionaries with no overlapping variables. Then import the data based on each dictionary and cbind the results (pull request #634, implemented in R, tks to @daniellima123). We implemented this as a separate function read_fwf2 (in the same spirit of how read.cvs2 is related to read.cvs), to avoid recursive calls. We could make the changes directly to read_fwf, defining it recursively. What do you think?

Please let us know if these suggestions are acceptable. We have been using them in our forked repos and they seem to work fine.

@hadley

This comment has been minimized.

Member

hadley commented Mar 15, 2017

Please don't discuss on closed issues as it's I'm extremely likely I won't remember when we're next working on readr. It's much better to open a new issue

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.