read_fwf confused by ragged lines #469

Closed
MatthieuStigler opened this issue Jul 13, 2016 · 5 comments
Labels
feature a feature request or enhancement
@MatthieuStigler

The data here are slightly unusual: missing values are indicated simply by ending the line early. With this input, fwf_empty(tmp) seems to get confused and read_fwf() returns wrong output, whereas supplying the widths manually gives the right output but emits several parsing warnings. Is that intended?

library(readr)
library(foreign)
txt <- c("    7.98      6.74          7.75            8.00", "    2.03      2.38", 
     "    5.82      5.09", "    21.6      15.1          14.5            14.5")
tmp <- tempfile()
writeLines(txt, tmp)
cat(read_file(tmp))

    7.98      6.74          7.75            8.00
    2.03      2.38
    5.82      5.09
    21.6      15.1          14.5            14.5

It works with read.fwf():
wid <- c(8, 10, 14, 16)
read.fwf(tmp, wid) # works

     V1    V2    V3   V4
1  7.98  6.74  7.75  8.0
2  2.03  2.38    NA   NA
3  5.82  5.09    NA   NA
4 21.60 15.10 14.50 14.5

But there are several issues with read_fwf():

  • fwf_empty(): wrong output

    read_fwf(tmp, fwf_empty(tmp)) ## wrong output?
     X1    X2    X3                                         X4
      <dbl> <dbl> <dbl>                                      <chr>
    1  7.98  6.74  7.75                                       8.00
    2  2.03  2.38    NA .6      15.1          14.5            14.5
    
  • fwf_widths(): right output but parsing warnings?

    read_table(tmp) ## similar issue
    read_fwf(tmp, fwf_widths(wid)) # right output but parsing warnings?
    Warning: 4 parsing failures.
    row col  expected    actual
      2  X3 14 chars  0        
      2  -- 4 columns 3 columns
      3  X3 14 chars  0        
      3  -- 4 columns 3 columns
    
      <dbl> <dbl> <dbl> <dbl>
    1  7.98  6.74  7.75   8.0
    2  2.03  2.38    NA    NA
    3  5.82  5.09    NA    NA
    4 21.60 15.10 14.50  14.5
    
@hadley
Member

hadley commented Jul 13, 2016

Slightly more minimal reprex:

txt <- paste(
  "    7.98      6.74          7.75            8.00", 
  "    2.03      2.38", 
  "    5.82      5.09", 
  "    21.6      15.1          14.5            14.5", 
  sep = "\n"
)

read_fwf(txt, fwf_empty(txt))
read_fwf(txt, fwf_widths(c(8, 10, 14, 16)))

I think the output from fwf_widths() is correct (except that the parsing information reports the wrong number of columns seen). I don't think this file should parse without warnings, because normally every line in a fixed-width file is the same length.

The output from fwf_empty() is definitely screwy
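
To illustrate the point that every line in a fixed-width file is normally the same length, here is a minimal base R sketch that right-pads the ragged lines to the full record width and then cuts every line at the same positions. It uses only base functions (sprintf, substring, trimws), not readr, and the variable names are made up for the example.

```r
# Ragged input from the report: rows 2 and 3 stop after the second field.
txt <- c(
  "    7.98      6.74          7.75            8.00",
  "    2.03      2.38",
  "    5.82      5.09",
  "    21.6      15.1          14.5            14.5"
)
widths <- c(8, 10, 14, 16)
total <- sum(widths)

# Right-pad every line with spaces so all records have the same length.
padded <- sprintf(paste0("%-", total, "s"), txt)

# Field boundaries (1-based, inclusive) derived from the widths.
ends <- cumsum(widths)
starts <- ends - widths + 1

# One column of the matrix per field, one row per input line.
fields <- sapply(seq_along(widths), function(j) {
  substring(padded, starts[j], ends[j])
})

# Blank fields become NA when converted to numeric.
values <- apply(fields, 2, function(col) as.numeric(trimws(col)))
as.data.frame(values)
```

After padding, the short rows simply yield blank fields, so the missing values come out as NA without any row running into the next one.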

@hadley
Member

hadley commented Jul 13, 2016

@ghaarsma any thoughts on this problem?

@hadley changed the title from "read_fwf confused if last value has NA?" to "read_fwf confused by ragged lines" on Jul 13, 2016
@hadley modified the milestone: 0.3.0 on Jul 13, 2016
@ghaarsma
Contributor

Agree with your assessment. read_fwf is correct and fwf_empty is screwy.
Looking into whitespaceColumns to see if it is fixable.

@hadley added the "feature" (a feature request or enhancement) and "ready" labels on Jul 13, 2016
@hadley
Member

hadley commented Jul 13, 2016

@ghaarsma I think it might be that fwf_empty() generates a non-exhaustive set of columns, and somewhere we must skip over the \n without inspecting it.
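
One way to picture a "non-exhaustive set of columns" is the following base R sketch. It treats a character position as empty only if it is blank, or past the end of the line, in every row, and then reports the runs of non-empty positions as fields. This is an illustration under those assumptions, not readr's whitespaceColumns code, and all names in it are hypothetical.

```r
txt <- c(
  "    7.98      6.74          7.75            8.00",
  "    2.03      2.38",
  "    5.82      5.09",
  "    21.6      15.1          14.5            14.5"
)
max_width <- max(nchar(txt))

# chars[i, p] is the character of line i at position p ("" past the line end).
chars <- sapply(seq_len(max_width), function(p) substring(txt, p, p))

# A position counts as empty if it is a space or missing in every line.
is_blank <- chars == " " | chars == ""
empty_col <- apply(is_blank, 2, all)

# Runs of non-empty positions are the detected fields (1-based, inclusive).
runs <- rle(empty_col)
run_end <- cumsum(runs$lengths)
run_begin <- run_end - runs$lengths + 1
field_begin <- run_begin[!runs$values]
field_end <- run_end[!runs$values]

data.frame(begin = field_begin, end = field_end)
```

The detected spans leave the blank positions between them uncovered, so a reader that walks from one span to the next has to skip characters it never inspects, which is where a newline on a short line could go unnoticed.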

@hadley
Member

hadley commented Jul 13, 2016

I think it's probably this code: curLine_ + beginOffset_[col_] - it doesn't check that we don't hit any new lines along the way
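
The effect of adding an offset to the start of a line without checking for newlines can be simulated in base R. The sketch below assumes the file is held as one flat buffer, that the last field starts at zero-based offset 44, and that it runs to the next newline. The helper read_field() and its guard argument are hypothetical; the real code is C++ and is not reproduced here.

```r
txt <- c(
  "    7.98      6.74          7.75            8.00",
  "    2.03      2.38",
  "    5.82      5.09",
  "    21.6      15.1          14.5            14.5"
)

# Flat buffer, one character per element, newlines included.
buf <- paste(txt, collapse = "\n")
chars <- strsplit(buf, "")[[1]]

newlines <- which(chars == "\n")
line_starts <- c(1, newlines + 1)            # first character of each line
line_ends <- c(newlines - 1, length(chars))  # last character of each line

# Read from (line start + begin) to the end of whichever line that lands in.
# With guard = TRUE, return NA if the start lies beyond the current line.
read_field <- function(line, begin, guard) {
  from <- line_starts[line] + begin  # analogous to curLine_ + beginOffset_[col_]
  if (guard && from > line_ends[line]) {
    return(NA_character_)
  }
  to <- min(line_ends[line_ends >= from])
  paste(chars[from:to], collapse = "")
}

read_field(1, 44, guard = FALSE)  # "8.00"
read_field(2, 44, guard = FALSE)  # lands in line 4 and returns its tail
read_field(2, 44, guard = TRUE)   # NA
```

In this simulation the unguarded read for row 2 jumps past rows 2 and 3 and returns the tail of row 4 starting at ".6", which matches the stray text in X4 in the original report.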

@hadley closed this as completed in 851220f on Jul 13, 2016
@hadley removed the "ready" label on Jul 13, 2016
@lock (bot) locked and limited conversation to collaborators on Sep 25, 2018