Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_lines(fname,skip=xxx) skipping empty lines causes unexpected, backward-incompatible behavior #923

Closed
dan-reznik opened this issue Nov 23, 2018 · 18 comments

Comments

@dan-reznik
Copy link

@dan-reznik dan-reznik commented Nov 23, 2018

Let 'file.txt' contain the following 8 lines, with its 4th line empty:

L1
L2
L3
<empty line>
Hello,Weird,Behavior
1,2,3
3,4,5
6,7,8

Calling read_lines() with no skip parameter produces the expected results, with sk0's 4th element being the empty string:

library(tidyverse)
fname <- 'file.txt'
sk0 <- read_lines(fname)
sk0
#> "L1","L2","L3","","Hello,Weird,Behavior","1,2,3","3,4,5,"6,7,8"    

Skipping the first 3 lines also works, now with the 1st element being an empty string:

sk3 <- read_lines(fname,skip=3)
sk3
#> "","Hello,Weird,Behavior","1,2,3","3,4,5","6,7,8"               

However, and here lies the bug, skipping 4 lines misses the 5th line entirely! (in fact skip=4 will miss any sequence of empty lines).

sk4 <- read_lines(fname,skip=4)
sk4
#> "1,2,3","3,4,5","6,7,8"

Though skipping empty lines is a documented feature of this version of readr, it seems inconsistent and unnecessary, and will result in backward incompatibilities.

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.11.8 snakecase_0.9.2   bindrcpp_0.2.2    assertthat_0.2.0  stringi_1.2.4    
 [6] jsonlite_1.5      lubridate_1.7.4   glue_1.3.0        readxl_1.1.0      forcats_0.3.0    
[11] stringr_1.3.1     dplyr_0.7.8       purrr_0.2.5       readr_1.2.1       tidyr_0.8.2      
[16] tibble_1.4.2      ggplot2_3.1.0     tidyverse_1.2.1  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       cellranger_1.1.0 pillar_1.3.0     compiler_3.5.1   plyr_1.8.4      
 [6] bindr_0.1.1      tools_3.5.1      nlme_3.1-137     gtable_0.2.0     lattice_0.20-38 
[11] pkgconfig_2.0.2  rlang_0.3.0.1    cli_1.0.1        rstudioapi_0.8   yaml_2.2.0      
[16] haven_2.0.0      withr_2.1.2      xml2_1.2.0       httr_1.3.1       hms_0.4.2       
[21] grid_3.5.1       tidyselect_0.2.5 R6_2.3.0         modelr_0.1.2     magrittr_1.5    
[26] backports_1.1.2  scales_1.0.0     rvest_0.3.2      colorspace_1.3-2 lazyeval_0.2.1  
[31] munsell_0.5.0    broom_0.5.0      crayon_1.3.4 
@dan-reznik dan-reznik changed the title read_lines(fname,skip=xxx) wrong if line skipped is empty! read_lines(fname,skip=xxx) inadvertently skipping empty lines! Nov 23, 2018
@jimhester
Copy link
Member

@jimhester jimhester commented Nov 23, 2018

@dan-reznik
Copy link
Author

@dan-reznik dan-reznik commented Nov 23, 2018

@zhangj5
Copy link

@zhangj5 zhangj5 commented Nov 24, 2018

This appears to also cause an issue in "GEOquery". Downgrading to 1.1.1 fixed the problem.
seandavi/GEOquery#74

@rogiersbart
Copy link

@rogiersbart rogiersbart commented Nov 26, 2018

Also, when programmatically determining the number of lines to skip, this requires additional checking of the number of empty lines, instead of just finding the line number you want to start at minus 1. It's also a question of semantics I guess: For me, an empty line is still a line. An additional argument seems great, and I'd love the default to remain like it was before (empty line == line), but that's just how I feel about it, and maybe there is good reason to change the default behaviour ...

@nacnudus
Copy link
Contributor

@nacnudus nacnudus commented Nov 29, 2018

This seems to have appeared in 6517708.

@seandavi
Copy link

@seandavi seandavi commented Nov 30, 2018

Thanks, @zhangj5 for pulling in the GEOquery issue seandavi/GEOquery#74. This is a breaking change for GEOquery.

@jimhester
Copy link
Member

@jimhester jimhester commented Nov 30, 2018

@dan-reznik please refrain from commenting further unless you have more detail to add to the issue.

@nacnudus
Copy link
Contributor

@nacnudus nacnudus commented Nov 30, 2018

I think the problem is in the first phase of line-skipping from the top of the file to the start of the data.

while (n > 0 && cur != end) {

n is the number of lines the user wants to skip. While n > 0 all lines are skipped including blank lines and comments, but once n == 0 any remaining blank lines and comments remain. This is usually disguised by read_delim() and read_table() because they do more phases of skipping to read column names and then data, but you can force some strange behaviour.

txt <- "note\n\n\ncol_name\ndata"
readr::read_lines(txt, skip = 1)
#> [1] ""         ""         "col_name" "data"
readr::read_csv(txt, skip = 1)
#> # A tibble: 1 x 1
#>   col_name
#>   <chr>   
#> 1 data
readr::read_csv(txt, skip = 1, skip_empty_rows = FALSE)
#> Warning: Missing column names filled in: 'X1' [1]
#> # A tibble: 1 x 1
#>   X1   
#>   <chr>
#> 1 data

Created on 2018-11-30 by the reprex package (v0.2.0.9000).

@seandavi
Copy link

@seandavi seandavi commented Nov 30, 2018

@jimhester, when you get a chance, could you clarify the path forward? Should we be making changes to code to address the change, or is there a plan to revert the change? If there has been no decision made, is there a timeframe for doing so? Just looking for an update if you have one.

@jennybc
Copy link
Member

@jennybc jennybc commented Nov 30, 2018

I'll chime in, in case @jimhester has already entered the weekend zone.

Yes there's a bug and a fix is already in progress. But it requires some effort to fix this w/o breaking something else.

I advise you hold off on any sort of workaround on your end.

@seandavi
Copy link

@seandavi seandavi commented Nov 30, 2018

Thx, @jennybc, for chiming in. I totally understand how fixes when dependent packages are involved can be a tightrope walk.

@pabgamu
Copy link

@pabgamu pabgamu commented Dec 4, 2018

@jennybc guess you are already aware but this behaviour also affects read_tsv, using similar sample

txt <- "note\n\n\ncol_name\ndata\n"

txt
[1] "note\n\n\ncol_name\ndata\n"

read_tsv(txt, skip = 0, skip_empty_rows = FALSE, col_names = FALSE)
# A tibble: 5 x 1
  X1      
  <chr>   
1 note    
2 NA      
3 NA      
4 col_name
5 data    

read_tsv(txt, skip = 1, skip_empty_rows = FALSE, col_names = FALSE)
# A tibble: 4 x 1
  X1      
  <chr>   
1 NA      
2 NA      
3 col_name
4 data    

read_tsv(txt, skip = 2, skip_empty_rows = FALSE, col_names = FALSE)
# A tibble: 1 x 1
  X1   
  <chr>
1 data 

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] reprex_0.2.1    bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1  
 [5] dplyr_0.7.8     purrr_0.2.5     readr_1.2.1     tidyr_0.8.2    
 [9] tibble_1.4.2    ggplot2_3.1.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 haven_2.0.0      lattice_0.20-38 
 [4] colorspace_1.3-2 htmltools_0.3.6  base64enc_0.1-3 
 [7] yaml_2.2.0       utf8_1.1.4       rlang_0.3.0.1   
[10] pillar_1.3.0     glue_1.3.0       withr_2.1.2     
[13] modelr_0.1.2     readxl_1.1.0     bindr_0.1.1     
[16] plyr_1.8.4       munsell_0.5.0    gtable_0.2.0    
[19] cellranger_1.1.0 rvest_0.3.2      evaluate_0.12   
[22] knitr_1.20       callr_3.0.0      ps_1.2.1        
[25] fansi_0.4.0      broom_0.5.0      Rcpp_1.0.0      
[28] clipr_0.4.1      scales_1.0.0     backports_1.1.2 
[31] debugme_1.1.0    jsonlite_1.5     fs_1.2.6        
[34] hms_0.4.2        digest_0.6.18    stringi_1.2.4   
[37] processx_3.2.0   grid_3.5.1       rprojroot_1.3-2 
[40] cli_1.0.1        tools_3.5.1      magrittr_1.5    
[43] lazyeval_0.2.1   whisker_0.3-2    crayon_1.3.4    
[46] pkgconfig_2.0.2  xml2_1.2.0       lubridate_1.7.4 
[49] assertthat_0.2.0 rmarkdown_1.10   httr_1.3.1      
[52] rstudioapi_0.8   R6_2.3.0         nlme_3.1-137    
[55] compiler_3.5.1  
#> Error: <text>:4:1: unexpected '['
#> 3: txt
#> 4: [
#>    ^

Created on 2018-12-04 by the reprex package (v0.2.1)

@bknakker
Copy link

@bknakker bknakker commented Dec 4, 2018

The fact that I'm unhappy with this breaking change is my issue, but please don't leave the parameter that can reverse this undocumented.

@dan-reznik dan-reznik closed this Dec 4, 2018
@dan-reznik dan-reznik reopened this Dec 4, 2018
@dan-reznik dan-reznik changed the title read_lines(fname,skip=xxx) inadvertently skipping empty lines! read_lines(fname,skip=xxx) skipping empty lines causes unexpected, backward-incompatible behavior Dec 4, 2018
jimhester added a commit that referenced this issue Dec 4, 2018
We now always count empty lines and comments as part of the number of
skips, but also skip any remaining blank lines are comments after the
skips. This allows you to have consistent skip behavior but also makes
it easy to ignore blanks, which is useful with messy inputs.

Fixes #923

Fix test
@seandavi
Copy link

@seandavi seandavi commented Dec 4, 2018

Thanks, @jimhester, for the really quick change and addition of the additional parameter.

I am still seeing something that looks different than I used to see with readr 1.1. In particular, blank lines before the first non-blank line are still treated differently when skip>0 and when skip is not specified. Here is an example that illustrates the issue.

> txt = ("\n\nheader\n1\n2\n3")
> read_tsv(txt) # skips two blank lines
# A tibble: 3 x 1
  header
   <dbl>
1      1
2      2
3      3
> read_tsv(txt,skip=1) # skips 3 lines
# A tibble: 2 x 1
    `1`
  <dbl>
1     2
2     3
> read_tsv(txt,skip=1,skip_empty_rows=FALSE) # still skips three lines, despite two being empty.
# A tibble: 2 x 1
    `1`
  <dbl>
1     2
2     3
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GEOquery_2.51.3     testthat_2.0.1      Biobase_2.42.0     
[4] BiocGenerics_0.28.0 readr_1.2.1.9000   

If you want the current version to be the expected behavior, that is fine, but I just wanted to point out that the current implementation still shows an inconsistency, even after adding the additional skip_empty_rows = FALSE.

@jimhester
Copy link
Member

@jimhester jimhester commented Dec 4, 2018

The changes are only in a PR / branch for now. It looks like you are still using the current master branch?

Make sure you have installed the code from the PR devtools::install_github("tidyverse/readr#933") and restarted R.

This is the behavior from that PR

library(readr)
txt <- "\n\nheader\n1\n2\n3"
read_tsv(txt)
#> # A tibble: 3 x 1
#>   header
#>    <dbl>
#> 1      1
#> 2      2
#> 3      3
read_tsv(txt,skip=1)
#> # A tibble: 3 x 1
#>   header
#>    <dbl>
#> 1      1
#> 2      2
#> 3      3
read_tsv(txt,skip=1,col_names = FALSE, skip_empty_rows=FALSE)
#> # A tibble: 5 x 1
#>   X1    
#>   <chr> 
#> 1 <NA>  
#> 2 header
#> 3 1     
#> 4 2     
#> 5 3

Created on 2018-12-04 by the reprex package (v0.2.1)

@seandavi
Copy link

@seandavi seandavi commented Dec 4, 2018

My bad. Yes, the PR works as expected. It also appears to deal with #925. Thanks again.

jimhester added a commit that referenced this issue Dec 5, 2018
We now always count empty lines and comments as part of the number of
skips, but also skip any remaining blank lines are comments after the
skips. This allows you to have consistent skip behavior but also makes
it easy to ignore blanks, which is useful with messy inputs.

Fixes #923

Fix test
jimhester added a commit that referenced this issue Dec 5, 2018
We now always count empty lines and comments as part of the number of
skips, but also skip any remaining blank lines are comments after the
skips. This allows you to have consistent skip behavior but also makes
it easy to ignore blanks, which is useful with messy inputs.

Fixes #923

Fix test
@dan-reznik
Copy link
Author

@dan-reznik dan-reznik commented Dec 5, 2018

you guys are awesome!

@lock
Copy link

@lock lock bot commented Jun 3, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants