read_fwf confused by ragged lines #469

Closed
MatthieuStigler opened this issue Jul 13, 2016 · 5 comments

@MatthieuStigler MatthieuStigler commented Jul 13, 2016

With slightly unusual data, where missing values are indicated simply by ending the line early, read_fwf() seems to get confused. With fwf_empty(tmp) the output is wrong, while with manually specified widths the output is right but comes with several parsing warnings.

library(readr)
library(foreign)
txt <- c("    7.98      6.74          7.75            8.00", "    2.03      2.38", 
     "    5.82      5.09", "    21.6      15.1          14.5            14.5")
tmp <- tempfile()
writeLines(txt, tmp)
cat(read_file(tmp))

    7.98      6.74          7.75            8.00
    2.03      2.38
    5.82      5.09
    21.6      15.1          14.5            14.5

It is ok with read.fwf:
wid <- c(8, 10, 14, 16)
read.fwf(tmp, wid) # works

     V1    V2    V3   V4
1  7.98  6.74  7.75  8.0
2  2.03  2.38    NA   NA
3  5.82  5.09    NA   NA
4 21.60 15.10 14.50 14.5

But several issues with read_fwf:

  • fwf_empty(): wrong output

    read_fwf(tmp, fwf_empty(tmp)) ## wrong output?
         X1    X2    X3                                         X4
      <dbl> <dbl> <dbl>                                      <chr>
    1  7.98  6.74  7.75                                       8.00
    2  2.03  2.38    NA .6      15.1          14.5            14.5
    
  • fwf_widths(): right output but parsing warnings?

    read_table(tmp) ## similar issue
    read_fwf(tmp, fwf_widths(wid)) # right output but parsing warnings?
    Warning: 4 parsing failures.
    row col  expected    actual
      2  X3 14 chars  0        
      2  -- 4 columns 3 columns
      3  X3 14 chars  0        
      3  -- 4 columns 3 columns
    
         X1    X2    X3    X4
      <dbl> <dbl> <dbl> <dbl>
    1  7.98  6.74  7.75   8.0
    2  2.03  2.38    NA    NA
    3  5.82  5.09    NA    NA
    4 21.60 15.10 14.50  14.5
    
@hadley hadley commented Jul 13, 2016

Slightly more minimal reprex:

txt <- paste(
  "    7.98      6.74          7.75            8.00", 
  "    2.03      2.38", 
  "    5.82      5.09", 
  "    21.6      15.1          14.5            14.5", 
  sep = "\n"
)

read_fwf(txt, fwf_empty(txt))
read_fwf(txt, fwf_widths(c(8, 10, 14, 16)))

I think the output from fwf_widths() is correct, except that the parsing information complains about the wrong number of columns seen. I don't think this file should parse without warnings, because normally every line in a fixed-width file is the same length.

The output from fwf_empty() is definitely screwy.
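
Since a fixed-width file normally has lines of equal length, one possible workaround is to pad the short lines with trailing spaces before parsing. The following is a minimal base R sketch of that idea, not something proposed in this thread; `pad_lines` is a hypothetical helper name, and it assumes the lines are already in memory as a character vector.

```r
# Hypothetical helper (an assumption, not part of readr): pad every line with
# trailing spaces so that all lines share the length of the longest one.
pad_lines <- function(lines) {
  width <- max(nchar(lines))
  paste0(lines, strrep(" ", width - nchar(lines)))
}

txt <- c(
  "    7.98      6.74          7.75            8.00",
  "    2.03      2.38",
  "    5.82      5.09",
  "    21.6      15.1          14.5            14.5"
)

padded <- pad_lines(txt)

# Every element now has the same number of characters.
nchar(padded)
```

The padded lines could then be written to a temporary file with writeLines() and passed to a fixed-width reader. Whether that removes the warnings depends on the reader, so treat it as something to try rather than a guaranteed fix.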

@hadley hadley commented Jul 13, 2016

@ghaarsma any thoughts on this problem?

@hadley hadley changed the title from "read_fwf confused if last value has NA?" to "read_fwf confused by ragged lines" Jul 13, 2016
@hadley hadley added this to the 0.3.0 milestone Jul 13, 2016

@ghaarsma ghaarsma commented Jul 13, 2016

Agree with your assessment. read_fwf is correct and fwf_empty is screwy.
Looking into whitespaceColumns to see if it is fixable.
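
To illustrate the general idea of detecting whitespace-only columns in ragged input, here is a rough base R sketch. It is not readr's whitespaceColumns code; `blank_positions` and `field_ranges` are hypothetical helpers, and the sketch assumes that positions past the end of a short line should count as blank.

```r
# Hypothetical helper: for each character position, is it blank on every line?
# Short lines are padded first, so missing positions count as blank.
blank_positions <- function(lines) {
  width <- max(nchar(lines))
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  chars <- do.call(rbind, strsplit(padded, ""))  # one row per line
  colSums(chars != " ") == 0
}

# Hypothetical helper: convert the blank mask into 1-based, inclusive
# begin/end positions of each run of non-blank positions.
field_ranges <- function(is_blank) {
  runs <- rle(!is_blank)
  ends <- cumsum(runs$lengths)
  begins <- ends - runs$lengths + 1
  data.frame(begin = begins[runs$values], end = ends[runs$values])
}

txt <- c(
  "    7.98      6.74          7.75            8.00",
  "    2.03      2.38",
  "    5.82      5.09",
  "    21.6      15.1          14.5            14.5"
)

ranges <- field_ranges(blank_positions(txt))
ranges
```

Note that ranges derived this way cover only the non-blank runs, so they do not account for every character on a line.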

@hadley hadley commented Jul 13, 2016

@ghaarsma I think it might be that fwf_empty() generates a non-exhaustive set of columns, and somewhere we must skip over the \n without inspecting it.

@hadley hadley commented Jul 13, 2016

I think it's probably this code: curLine_ + beginOffset_[col_]. It doesn't check that we don't hit any newlines along the way.
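
The kind of bounds check described here can be sketched in base R. This is only an illustration of the idea, not the C++ change that closed the issue; `slice_fields` is a hypothetical helper, and it assumes 1-based, inclusive begin and end positions.

```r
# Hypothetical helper: extract fields by position from a single line without
# ever reading past the end of that line. Fields that start beyond the end of
# the line are returned as NA instead of spilling into the next line.
slice_fields <- function(line, begin, end) {
  len <- nchar(line)
  out <- substring(line, begin, pmin(end, len))  # clamp each end to the line
  out[begin > len] <- NA_character_
  trimws(out)
}

# Positions corresponding to widths 8, 10, 14, 16.
begin <- c(1, 9, 19, 33)
end   <- c(8, 18, 32, 48)

short_line <- slice_fields("    2.03      2.38", begin, end)
full_line  <- slice_fields(
  "    21.6      15.1          14.5            14.5", begin, end
)

short_line
full_line
```

Clamping per line is what keeps the trailing fields of a short row as missing values rather than letting them absorb text from the following row.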

@hadley hadley closed this in 851220f Jul 13, 2016
@hadley hadley removed the ready label Jul 13, 2016
@lock lock bot locked and limited conversation to collaborators Sep 25, 2018