Unexpected parsing behavior of readr (returns 0 rows) #944

Closed
seandavi opened this issue Dec 13, 2018 · 5 comments

@seandavi seandavi commented Dec 13, 2018

This report is related to seandavi/GEOquery#78. Here, I compare the parsing behavior of read.table and read_tsv on the same file. read.table returns the expected data, but read_tsv returns an empty tibble (0 x 0). Interestingly, if I replace GSE14308 with GSE14309, read_tsv behaves as expected.

I have not tried this with a prior readr version yet, but I can if that is helpful.

destfile = 'GSE14308_series_matrix.txt.gz'
f = curl::curl_download("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE14nnn/GSE14308/matrix/GSE14308_series_matrix.txt.gz",
                        destfile = destfile)

z1 = readr::read_tsv(destfile,
                 skip=65, col_names=TRUE, comment = '!')
#> Parsed with column specification:
#> cols()
dim(z1)
#> [1] 0 0

z2 = read.table(gzfile(destfile),
                     skip=65, header=TRUE, comment = '!')
dim(z2)
#> [1] 45101    13
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0      crayon_1.3.4    digest_0.6.18   rprojroot_1.3-2
#>  [5] R6_2.3.0        backports_1.1.2 magrittr_1.5    evaluate_0.12  
#>  [9] pillar_1.3.0    rlang_0.3.0.1   stringi_1.2.4   curl_3.2       
#> [13] rmarkdown_1.10  tools_3.5.1     stringr_1.3.1   readr_1.3.0    
#> [17] hms_0.4.2       yaml_2.2.0      compiler_3.5.1  pkgconfig_2.0.2
#> [21] htmltools_0.3.6 knitr_1.20      tibble_1.4.2

Created on 2018-12-13 by the reprex package (v0.2.1)

@jimhester jimhester closed this in 0e727c7 Dec 13, 2018
Member

@jimhester jimhester commented Dec 13, 2018

Thanks, this should now be fixed. There was a bug in how quoted strings were handled while skipping lines. The 'GSE14308_series_matrix' file had

!Series_contributor "John,J,O'Shea"

on line 14. Because the single quote in O'Shea was treated the same as a double quote, the quotes were mismatched and the rest of the data was effectively ignored.

It should now be fixed, however. Thank you for opening the issue!
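The mismatch described above can be illustrated with a small base R sketch. This is a toy model, not readr's actual implementation: `skip_lines_toy()` and its `quote_chars` argument are hypothetical names introduced here. It skips lines while tracking a single "inside quotes" flag, and shows how treating `'` like `"` swallows the newline after `O'Shea`.

```r
# Toy model of line skipping with quote tracking.
# NOT readr's code; skip_lines_toy() is a hypothetical helper for illustration.
skip_lines_toy <- function(text, skip, quote_chars) {
  chars <- strsplit(text, "", fixed = TRUE)[[1]]
  in_quote <- FALSE  # TRUE while we believe we are inside a quoted field
  skipped <- 0L      # number of line endings counted so far
  pos <- 0L          # index of the last character consumed by the skip
  for (i in seq_along(chars)) {
    if (skipped >= skip) break
    ch <- chars[i]
    if (ch %in% quote_chars) {
      # Every quote character toggles the same flag, so ' can "close" a "
      in_quote <- !in_quote
    } else if (ch == "\n" && !in_quote) {
      # Newlines only count as line endings outside quotes
      skipped <- skipped + 1L
    }
    pos <- i
  }
  if (pos >= length(chars)) return("")
  paste(chars[(pos + 1L):length(chars)], collapse = "")
}

meta <- "!Series_contributor \"John,J,O'Shea\"\nID_REF\tGSM1\n"

# Single and double quotes treated alike: the flag ends up stuck "on",
# the newline is never counted, and nothing is left after the skip.
buggy <- skip_lines_toy(meta, skip = 1, quote_chars = c("\"", "'"))

# Only double quotes tracked: the apostrophe is ordinary text.
fixed <- skip_lines_toy(meta, skip = 1, quote_chars = "\"")

print(buggy)  # ""
print(fixed)  # "ID_REF\tGSM1\n"
```

In the toy model, tracking only double quotes is enough to make the skip land on the header line, which mirrors the direction of the fix described in this thread.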

Author

@seandavi seandavi commented Dec 14, 2018

Thanks, @jimhester, for the quick diagnosis and fix. Will this be coming out in a bug-fix release anytime soon? I'd rather not require GEOquery users to install readr from GitHub, but that decision will depend on the timing of the release.

jimhester added a commit that referenced this issue Dec 17, 2018
Single quotes used as apostrophes caused problems similar to #944,
even if the line was skipped.

The skipping logic is now rewritten to hopefully be clearer and
easier to maintain. It is also more robust to these sorts of problems:
we no longer try to ignore content within single quotes, only double quotes.

Fixes #945

@assaron assaron commented Dec 19, 2018

@jimhester thanks for the quick fix. I'm also interested in this getting into a release. Are there any estimates for this?


@stufield stufield commented Dec 21, 2018

I also ran into this issue. @jimhester seems to have quickly tracked it down to unterminated single quotes in skipped lines. For future reference, in case someone else runs across this with v1.3.0 installed, here's a simplified reprex of the issue:

library(readr)
packageVersion("readr")            # broken in 1.3.0; fixed in 1.3.1
txt <- "\t\t\tTarget\t5'-thymine\n\t\t\tNew\tUnknown"  # note the single quote (5')
read_lines(txt)                        # works as expected; both lines
read_lines(txt, n_max = 1)             # works as expected; 1st line
read_lines(txt, n_max = 1, skip = 0)   # works as expected; 1st line
read_lines(txt, n_max = 1, skip = 1)   # unexpected: returns character(0); expected the 2nd line
txt <- "\t\t\tTarget\t5'-'thymine\n\t\t\tNew\tUnknown"  # note closing quote
read_lines(txt, n_max = 1, skip = 1)   # works as expected with the closing single quote

Thanks @jimhester for getting a bug fix version 1.3.1 out to CRAN so quickly!
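For downstream code that cannot require a particular readr release, one option is a version guard. The sketch below is an assumption-laden illustration: `readr_skip_bug_affected()` is a hypothetical helper, and it flags only 1.3.0 because that is the only version this thread confirms as affected (earlier versions were not tested here).

```r
# Hypothetical helper, not part of readr or GEOquery.
# Flags only readr 1.3.0, the single version reported as broken in this thread.
readr_skip_bug_affected <- function(version) {
  package_version(version) == package_version("1.3.0")
}

affected_130 <- readr_skip_bug_affected("1.3.0")
affected_131 <- readr_skip_bug_affected("1.3.1")
print(affected_130)  # TRUE
print(affected_131)  # FALSE

# Example use against the installed package, if readr is available.
if (requireNamespace("readr", quietly = TRUE) &&
    readr_skip_bug_affected(as.character(utils::packageVersion("readr")))) {
  warning("readr 1.3.0 can mis-skip lines containing apostrophes; ",
          "consider updating to 1.3.1 or later.")
}
```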


@lock lock bot commented Jun 19, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 19, 2019