Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip bug in read_lines and read_fwf if unpaired quotes in the data #991

Closed
juangomezduaso opened this issue Apr 15, 2019 · 7 comments
Closed
Labels
bug an unexpected problem or unintended behavior

Comments

@juangomezduaso
Copy link

juangomezduaso commented Apr 15, 2019

Functions read_lines() and read_fwf() don't behave correctly wrt skip parameter if there are unpaired double quotes (") in the data.

library(readr)
#> Warning: package 'readr' was built under R version 3.5.3
data <-
"a\"b
cde
f\"g
hij"
read_fwf(data, fwf_widths(1:2, LETTERS[1:2]), skip=1)
#> # A tibble: 1 x 2
#>   A     B    
#>   <chr> <chr>
#> 1 h     ij
read_lines(data, skip=1)
#> [1] "hij"

Created on 2019-04-15 by the reprex package (v0.2.1)

Created on 2019-04-15 by the reprex package (v0.2.1)

@juangomezduaso
Copy link
Author

juangomezduaso commented Apr 15, 2019

This might be the cause of issue #986

@dan-reznik
Copy link

dan-reznik commented Apr 15, 2019

ouch, that makes sense. we do agree that for consistency read_lines() should not care about any characters (quotes or otherwise) present in the input stream, except CR-LF, correct? note: those "quote" field separators are relevant for field-oriented readers (read_csv, read_delim), via the "quote" parameter, but shouldn't be for read_lines() or read_file(). the authors probably inadvertently reused some code, hopefully they will be kind enough to review this in time, as read_lines() is a major workhorse for most people trying to investigate structural problems with files.

@juangomezduaso
Copy link
Author

juangomezduaso commented Apr 15, 2019

Yes. And I think the authors wll do as well, because this just happens with skip, and when reading all lines there is no problem and these "enquoted" newlines are considered as line separators

@dan-reznik
Copy link

dan-reznik commented Apr 15, 2019

i guess one workaround is to simply substitute all quotes for some other unusual char in a file prior to pushing it into read_lines():

read_file("file.txt") %>%
   str_replace_all(fixed('"'),"^") %>%
   read_lines(skip=xxx, n_max=yyy) %>%
   ...

however this defeats the main use of read_lines() w/ a skip and an n_max: the entire file (potentially large) will not be brought into memory. if loading the entire file were acceptable, read_lines() could simply become this:

read_file("file.txt") %>%
   str_split("\\r\\n") %>%
   first %>%
   tail(-skip) %>%
   head(n_max) %>%
   ...

cheers

@juangomezduaso
Copy link
Author

juangomezduaso commented Apr 15, 2019

Cumprimentos

@jimhester jimhester added the bug an unexpected problem or unintended behavior label May 3, 2019
@joachim-gassen
Copy link

joachim-gassen commented Apr 20, 2020

Hi there. Came here to file a new issue as this has bit me quite badly when working on a huge data import project using either vroom() or read_*. The issue is also present when using, e.g., read_tsv() with quote = "". See below.

library(readr)

# I am on 1.3.1 - same behavior with current Github version and on Mac as well as Ubuntu 18.04

dta <- c(
  "a\tb\tc",
  "row 1 a\trow 1 b\trow 1 c",
  "row 2 a\trow 2 b\trow 2 c\"",
  "row 3 a\trow 3 b\trow 3 c\"",
  "row 4 a\trow 4 b\trow 4 c"
)

tmp_file <- tempfile("repex", fileext = ".tsv")
writeLines(dta, tmp_file)

# all good
read_tsv(tmp_file, quote = "")

# should start reading in data row 2 and it does
read_tsv(tmp_file, col_names = c("a", "b", "c"), skip = 2, quote = "")

# should start reading in data row 3 but it reads row 4 instead
read_tsv(tmp_file, col_names = c("a", "b", "c"), skip = 3, quote = "")

# Same for read_lines
read_lines(tmp_file, skip = 3)

# ...and for vroom
library(vroom)
vroom(tmp_file, col_names = c("a", "b", "c"), delim = "\t", skip = 3, quote = "")

# Good old base works....
read.delim(tmp_file, quote = "", header = FALSE, col.names = c("a", "b", "c"), skip = 3)

This causes all sorts of issues in user land as in many cases skip will silently skip an inconsistent number of data lines because of this bug. Any pointers on how to fix this? I am back to using base now but I would rather use {vroom} for speed...

@juangomezduaso
Copy link
Author

juangomezduaso commented Jan 4, 2021

@jimhester:
I think that elf86e5 doesn't solve this issue for read_fwf().
It seems to me that it would be easy to do by just appending " skip_quote = FALSE" in the two calls to datasource() that read_fwf() makes, but I am not confident enought to do a pull request, sorry.

jimhester added a commit to tidyverse/vroom that referenced this issue May 25, 2021
We only want to try to find embedded newlines when skipping lines if our
format uses quoting. Otherwise we don't want to check for quoted newlines when
skipping

Fixes tidyverse/readr#991 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

4 participants