skip not working the same? #945

Closed
ldecicco-USGS opened this issue Dec 14, 2018 · 8 comments

Comments

@ldecicco-USGS ldecicco-USGS commented Dec 14, 2018

This has historically worked for me, but it now returns 0 rows:

obs_url <- "https://waterservices.usgs.gov/nwis/dv/?site=02177000&format=rdb,1.0&ParameterCd=00060&StatCd=00003&startDT=2012-09-01&endDT=2012-10-01"
doc <- httr::GET(obs_url, encoding='gzip')
doc_cont <- httr::content(doc)
x <- read_delim(doc_cont, delim = "\t", skip = 29)
nrow(x)
[1] 0

The URL returns a block of header lines followed by the data:

https://waterservices.usgs.gov/nwis/dv/?site=02177000&format=rdb,1.0&ParameterCd=00060&StatCd=00003&startDT=2012-09-01&endDT=2012-10-01

After skipping the header lines at the top, there should still be rows left to parse.

If I get to the read_delim line and don't specify the skip argument, I get 59 rows:

x <- read_delim(doc_cont, delim = "\t")
nrow(x)
[1] 59

So it seems like it should allow me to skip those first 29.
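
For anyone reproducing this offline, here is a minimal sketch of the same kind of read against a small local file. The file contents are made up for illustration (including the apostrophe in a header line). The rule of "leading `#` lines plus two" for the skip count is an assumption about the RDB layout (a column-name row and a field-format row after the comments), and `quote = ""` is a defensive choice for the sketch, not something this thread establishes as a fix.

```r
library(readr)

# Made-up RDB-style content: two comment lines, a column-name row,
# a field-format row, then two data rows.
rdb_lines <- c(
  "# Hypothetical header line",
  "# Data for the site's daily discharge",
  "agency_cd\tsite_no\tdatetime\tflow\tflow_cd",
  "5s\t15s\t20d\t14n\t10s",
  "USGS\t02177000\t2012-09-01\t191\tA",
  "USGS\t02177000\t2012-09-02\t213\tA"
)
path <- tempfile(fileext = ".txt")
writeLines(rdb_lines, path)

# Count only the comment lines at the very top of the file:
# cumprod() drops to 0 at the first line that does not start with "#".
all_lines <- readLines(path)
n_comment <- sum(cumprod(startsWith(all_lines, "#")))

# Assumption: the name row and the format row follow the comments.
n_skip <- n_comment + 2

# quote = "" disables quote handling so the apostrophe is plain text.
x <- read_tsv(
  path,
  skip = n_skip,
  col_names = FALSE,
  quote = "",
  col_types = cols(.default = col_character())
)
nrow(x)
```

Deriving the skip count from the file itself, rather than hard-coding a number like 29, also makes it easier to see whether the parser and the file agree on where the lines end.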

devtools::session_info()
- Session info ---------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/Chicago             
 date     2018-12-14                  

- Packages -------------------------------------------------
 package       * version date       lib source        
 assertthat      0.2.0   2017-04-11 [1] CRAN (R 3.5.1)
 backports       1.1.2   2017-12-13 [1] CRAN (R 3.5.0)
 bindr           0.1.1   2018-03-13 [1] CRAN (R 3.5.1)
 bindrcpp      * 0.2.2   2018-03-29 [1] CRAN (R 3.5.1)
 callr           3.1.0   2018-12-10 [1] CRAN (R 3.5.1)
 cli             1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
 crayon          1.3.4   2017-09-16 [1] CRAN (R 3.5.1)
 curl            3.2     2018-03-28 [1] CRAN (R 3.5.1)
 data.table      1.11.8  2018-09-30 [1] CRAN (R 3.5.1)
 dataRetrieval * 2.7.4   2018-12-14 [1] local         
 desc            1.2.0   2018-05-01 [1] CRAN (R 3.5.1)
 devtools        2.0.1   2018-10-26 [1] CRAN (R 3.5.1)
 digest          0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
 dplyr           0.7.8   2018-11-10 [1] CRAN (R 3.5.1)
 fansi           0.4.0   2018-10-05 [1] CRAN (R 3.5.1)
 fs              1.2.6   2018-08-23 [1] CRAN (R 3.5.1)
 glue            1.3.0   2018-07-17 [1] CRAN (R 3.5.1)
 hms             0.4.2   2018-03-10 [1] CRAN (R 3.5.1)
 httr            1.4.0   2018-12-11 [1] CRAN (R 3.5.1)
 lubridate       1.7.4   2018-04-11 [1] CRAN (R 3.5.1)
 magrittr        1.5     2014-11-22 [1] CRAN (R 3.5.1)
 memoise         1.1.0   2017-04-21 [1] CRAN (R 3.5.1)
 packrat         0.5.0   2018-11-14 [1] CRAN (R 3.5.1)
 pillar          1.3.0   2018-07-14 [1] CRAN (R 3.5.1)
 pkgbuild        1.0.2   2018-10-16 [1] CRAN (R 3.5.1)
 pkgconfig       2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
 pkgload         1.0.2   2018-10-29 [1] CRAN (R 3.5.1)
 prettyunits     1.0.2   2015-07-13 [1] CRAN (R 3.5.1)
 processx        3.2.1   2018-12-05 [1] CRAN (R 3.5.1)
 ps              1.2.1   2018-11-06 [1] CRAN (R 3.5.1)
 purrr           0.2.5   2018-05-29 [1] CRAN (R 3.5.1)
 R6              2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
 Rcpp            1.0.0   2018-11-07 [1] CRAN (R 3.5.1)
 readr           1.3.0   2018-12-11 [1] CRAN (R 3.5.1)
 remotes         2.0.2   2018-10-30 [1] CRAN (R 3.5.1)
 rlang           0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
 rprojroot       1.3-2   2018-01-03 [1] CRAN (R 3.5.1)
 rstudioapi      0.8     2018-10-02 [1] CRAN (R 3.5.1)
 sessioninfo     1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
 stringi         1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
 stringr         1.3.1   2018-05-10 [1] CRAN (R 3.5.1)
 testthat      * 2.0.1   2018-10-13 [1] CRAN (R 3.5.1)
 tibble          1.4.2   2018-01-22 [1] CRAN (R 3.5.1)
 tidyselect      0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
 usethis         1.4.0   2018-08-14 [1] CRAN (R 3.5.1)
 utf8            1.1.4   2018-05-24 [1] CRAN (R 3.5.1)
 withr           2.1.2   2018-03-15 [1] CRAN (R 3.5.1)
 xml2            1.2.0   2018-01-24 [1] CRAN (R 3.5.1)

[1] C:/Users/ldecicco/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library

@seandavi seandavi commented Dec 14, 2018

I think this is the same bug I just reported in #944, which @jimhester just fixed. There is a "'" in your text that may be causing the problem. It might be worth installing from GitHub to see if that fixes it.

@ldecicco-USGS ldecicco-USGS (Author) commented Dec 15, 2018

Hmm... installing from GitHub did not solve my issue, but it does seem related to the skip parsing.

devtools::session_info()
- Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/Chicago             
 date     2018-12-15                  

- Packages -----------------------------------------------------------------------------
 package       * version    date       lib source                          
 assertthat      0.2.0      2017-04-11 [1] CRAN (R 3.5.1)                  
 backports       1.1.2      2017-12-13 [1] CRAN (R 3.5.0)                  
 bindr           0.1.1      2018-03-13 [1] CRAN (R 3.5.1)                  
 bindrcpp        0.2.2      2018-03-29 [1] CRAN (R 3.5.1)                  
 callr           3.1.0      2018-12-10 [1] CRAN (R 3.5.1)                  
 cli             1.0.1      2018-09-25 [1] CRAN (R 3.5.1)                  
 crayon          1.3.4      2017-09-16 [1] CRAN (R 3.5.1)                  
 curl            3.2        2018-03-28 [1] CRAN (R 3.5.1)                  
 dataRetrieval * 2.7.4      2018-12-15 [1] local                           
 desc            1.2.0      2018-05-01 [1] CRAN (R 3.5.1)                  
 devtools        2.0.1      2018-10-26 [1] CRAN (R 3.5.1)                  
 digest          0.6.18     2018-10-10 [1] CRAN (R 3.5.1)                  
 dplyr           0.7.8      2018-11-10 [1] CRAN (R 3.5.1)                  
 fs              1.2.6      2018-08-23 [1] CRAN (R 3.5.1)                  
 glue            1.3.0      2018-07-17 [1] CRAN (R 3.5.1)                  
 hms             0.4.2      2018-03-10 [1] CRAN (R 3.5.1)                  
 httr            1.4.0      2018-12-11 [1] CRAN (R 3.5.1)                  
 magrittr        1.5        2014-11-22 [1] CRAN (R 3.5.1)                  
 memoise         1.1.0      2017-04-21 [1] CRAN (R 3.5.1)                  
 packrat         0.5.0      2018-11-14 [1] CRAN (R 3.5.1)                  
 pillar          1.3.0      2018-07-14 [1] CRAN (R 3.5.1)                  
 pkgbuild        1.0.2      2018-10-16 [1] CRAN (R 3.5.1)                  
 pkgconfig       2.0.2      2018-08-16 [1] CRAN (R 3.5.1)                  
 pkgload         1.0.2      2018-10-29 [1] CRAN (R 3.5.1)                  
 prettyunits     1.0.2      2015-07-13 [1] CRAN (R 3.5.1)                  
 processx        3.2.1      2018-12-05 [1] CRAN (R 3.5.1)                  
 ps              1.2.1      2018-11-06 [1] CRAN (R 3.5.1)                  
 purrr           0.2.5      2018-05-29 [1] CRAN (R 3.5.1)                  
 R6              2.3.0      2018-10-04 [1] CRAN (R 3.5.1)                  
 Rcpp            1.0.0      2018-11-07 [1] CRAN (R 3.5.1)                  
 readr           1.3.0.9000 2018-12-15 [1] Github (tidyverse/readr@1f84c49)
 remotes         2.0.2      2018-10-30 [1] CRAN (R 3.5.1)                  
 rlang           0.3.0.1    2018-10-25 [1] CRAN (R 3.5.1)                  
 rprojroot       1.3-2      2018-01-03 [1] CRAN (R 3.5.1)                  
 rstudioapi      0.8        2018-10-02 [1] CRAN (R 3.5.1)                  
 sessioninfo     1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                  
 testthat      * 2.0.1      2018-10-13 [1] CRAN (R 3.5.1)                  
 tibble          1.4.2      2018-01-22 [1] CRAN (R 3.5.1)                  
 tidyselect      0.2.5      2018-10-11 [1] CRAN (R 3.5.1)                  
 usethis         1.4.0      2018-08-14 [1] CRAN (R 3.5.1)                  
 withr           2.1.2      2018-03-15 [1] CRAN (R 3.5.1)                  
 xml2            1.2.0      2018-01-24 [1] CRAN (R 3.5.1)                  

[1] C:/Users/ldecicco/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library

I do have many tests failing on a package that is on CRAN. They are all "skip_on_cran" because they rely on web services. I remember other tidyverse packages emailing the developers of dependent packages to warn us about upcoming CRAN releases. I either missed the readr email, or there wasn't one sent. I just want to say I hope you guys continue those courtesy emails... I'm a big fan, and I'll do a big thorough test each time I get an email like that!

@jimhester jimhester closed this in 43b7725 Dec 17, 2018

@jimhester jimhester (Member) commented Dec 17, 2018

This should now be fixed.

readr::read_tsv("https://waterservices.usgs.gov/nwis/dv/?site=02177000&format=rdb,1.0&ParameterCd=00060&StatCd=00003&startDT=2012-09-01&endDT=2012-10-01",
  skip = 29, col_names = FALSE)
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   X2 = col_character(),
#>   X3 = col_date(format = ""),
#>   X4 = col_double(),
#>   X5 = col_character()
#> )
#> # A tibble: 31 x 5
#>    X1    X2       X3            X4 X5   
#>    <chr> <chr>    <date>     <dbl> <chr>
#>  1 USGS  02177000 2012-09-01   191 A    
#>  2 USGS  02177000 2012-09-02   213 A    
#>  3 USGS  02177000 2012-09-03   409 A    
#>  4 USGS  02177000 2012-09-04   722 A    
#>  5 USGS  02177000 2012-09-05   634 A    
#>  6 USGS  02177000 2012-09-06   414 A    
#>  7 USGS  02177000 2012-09-07   320 A    
#>  8 USGS  02177000 2012-09-08   276 A    
#>  9 USGS  02177000 2012-09-09   251 A    
#> 10 USGS  02177000 2012-09-10   227 A    
#> # ... with 21 more rows

Created on 2018-12-17 by the reprex package (v0.2.1)

@ldecicco-USGS skipping all tests on CRAN's machines is unfortunate. I would encourage you to include at least a few example datasets in your tests so they can be run on CRAN's machines.

At least this example is not big and could easily be included in the package, which would have caused this error to be found in the reverse dependency checks before readr was released on CRAN.
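
A local test along those lines could be sketched as follows. Everything here is illustrative: `read_rdb_body()` is a hypothetical stand-in for a package's tab-delimited parser, the fixture contents are made up, and a temp file is used only to keep the sketch self-contained (in a real package the fixture would more likely live under `tests/testthat/` or `inst/extdata/`).

```r
library(testthat)
library(readr)

# Hypothetical helper standing in for a package's tab-delimited parser.
read_rdb_body <- function(path, n_skip) {
  read_tsv(
    path,
    skip = n_skip,
    col_names = FALSE,
    col_types = cols(.default = col_character())
  )
}

# Made-up fixture: two comment lines (one with an apostrophe),
# a column-name row, a field-format row, and two data rows.
fixture <- tempfile(fileext = ".txt")
writeLines(
  c(
    "# Hypothetical header line",
    "# Data for the site's daily discharge",
    "agency_cd\tsite_no\tdatetime\tflow\tflow_cd",
    "5s\t15s\t20d\t14n\t10s",
    "USGS\t02177000\t2012-09-01\t191\tA",
    "USGS\t02177000\t2012-09-02\t213\tA"
  ),
  fixture
)

# No skip_on_cran() here: the test needs no network access.
test_that("header lines are skipped in a local RDB-style fixture", {
  x <- read_rdb_body(fixture, n_skip = 4)
  expect_equal(nrow(x), 2L)
  expect_equal(ncol(x), 5L)
  expect_equal(x$X1, c("USGS", "USGS"))
})
```

Because the fixture never changes, a failure in a test like this points at the parser or one of its dependencies rather than at the web service.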

@ldecicco-USGS ldecicco-USGS (Author) commented Dec 17, 2018

Thanks @jimhester !

I'll work on adding a few tests that run on a small local dataset that I can include directly in the package (I'll look for the ugliest data I can find... shouldn't be too hard). I'm planning to do an update to CRAN soon, so that's a great suggestion. The web services can be flaky and the data itself is a moving target, which is why I hesitate to ever include a web service test in a CRAN check. But I already have local tests running on CRAN for our XML parser, so I'm slapping my forehead wondering how I never added a local test for the tab-delimited parser.

I'm unclear on how I can test your fix. I've tried:

devtools::install_github("tidyverse/readr", ref = "43b77253cd21a01125c83380a74ac935a5b5cb2a")

and:

devtools::install_github("tidyverse/readr")

After either install, the example you posted still did not run successfully for me (that is, I'm still seeing 0 rows).

@jimhester jimhester (Member) commented Dec 17, 2018

Make sure you install the package without it already being loaded, or restart R if it is already loaded.
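
A quick way to check this before reinstalling is sketched below. It uses only base R; the package name is the only input, and the messages are just illustrative wording.

```r
pkg <- "readr"

# TRUE if the package's namespace is already loaded in this session.
is_loaded <- pkg %in% loadedNamespaces()

# TRUE if the package can be found in the library paths
# (system.file() returns "" when it cannot); this does not load it.
is_installed <- nzchar(system.file(package = pkg))

if (is_loaded) {
  message(pkg, " is loaded in this session; restart R before reinstalling.")
} else {
  message(pkg, " is not loaded; it should be safe to reinstall now.")
}

if (is_installed) {
  # Reads the installed package metadata without loading the namespace.
  message("Installed version: ", as.character(utils::packageVersion(pkg)))
}
```

After reinstalling and restarting, `devtools::session_info()` should list the GitHub source and commit for the package, as in the session info earlier in this thread.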

@ldecicco-USGS ldecicco-USGS (Author) commented Dec 17, 2018

Great. That fixed some issues.

Here's a similar query that has historically worked and still doesn't with the GitHub version of readr:

readr::read_tsv("https://nwis.waterdata.usgs.gov/nwis/qwdata?multiple_site_no=04024430,04024000&multiple_parameter_cds=34247,30234,32104,34220&param_cd_operator=OR&list_of_search_criteria=multiple_site_no,multiple_parameter_cds&group_key=NONE&sitefile_output_format=html_table&column_name=agency_cd&column_name=site_no&column_name=station_nm&inventory_output=0&rdb_inventory_output=file&TZoutput=0&pm_cd_compare=Greater%20than&radio_parm_cds=previous_parm_cds&qw_attributes=0&format=rdb&rdb_qw_attributes=expanded&date_format=YYYY-MM-DD&rdb_compression=value&qw_sample_wide=0&begin_date=2010-11-03",
skip = 124, col_names = FALSE)
Parsed with column specification:
cols(
  X1 = col_character()
)
Warning: 206 parsing failures.
row col  expected     actual                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   file
 61  -- 1 columns 33 columns 'https://nwis.waterdata.usgs.gov/nwis/qwdata?multiple_site_no=04024430,04024000&multiple_parameter_cds=34247,30234,32104,34220&param_cd_operator=OR&list_of_search_criteria=multiple_site_no,multiple_parameter_cds&group_key=NONE&sitefile_output_format=html_table&column_name=agency_cd&column_name=site_no&column_name=station_nm&inventory_o [... truncated]
# A tibble: 266 x 1
   X1                                                             
   <chr>    

By comparison, fread would get 17 columns:

data.table::fread("https://nwis.waterdata.usgs.gov/nwis/qwdata?multiple_site_no=04024430,04024000&multiple_parameter_cds=34247,30234,32104,34220&param_cd_operator=OR&list_of_search_criteria=multiple_site_no,multiple_parameter_cds&group_key=NONE&sitefile_output_format=html_table&column_name=agency_cd&column_name=site_no&column_name=station_nm&inventory_output=0&rdb_inventory_output=file&TZoutput=0&pm_cd_compare=Greater%20than&radio_parm_cds=previous_parm_cds&qw_attributes=0&format=rdb&rdb_qw_attributes=expanded&date_format=YYYY-MM-DD&rdb_compression=value&qw_sample_wide=0&begin_date=2010-11-03",
 skip = 124, header =  FALSE, data.table = FALSE)

Working through more of my tests, I do have a file in the package itself that doesn't parse as expected:
RDB1Example.txt

readr::read_tsv("https://github.com/tidyverse/readr/files/2687672/RDB1Example.txt",skip = 24, col_names = FALSE)
Parsed with column specification:
cols(
  X1 = col_character()
)
# this works....but why 48?
readr::read_tsv("https://github.com/tidyverse/readr/files/2687672/RDB1Example.txt",skip = 48, col_names = FALSE)

If you open that file in a text editor, it looks to me like you should only need to skip 24 lines. Something in the top section is being considered an end-of-line?
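
One way to check how lines are actually terminated is to look at the raw bytes. This is only a diagnostic sketch on a made-up three-line file written with explicit `\r\n` endings. Needing exactly twice the expected skip would be consistent with each carriage return and line feed being counted separately, but that is a guess, not something the sketch proves.

```r
# Write a made-up file in binary mode so the "\r\n" endings are kept as-is.
path <- tempfile(fileext = ".txt")
writeBin(
  charToRaw("# header one\r\n# header two\r\nUSGS\t02177000\r\n"),
  path
)

# Read the file back as raw bytes and count CR (0x0d) and LF (0x0a).
bytes <- readBin(path, what = "raw", n = file.size(path))
n_cr <- sum(bytes == as.raw(0x0d))
n_lf <- sum(bytes == as.raw(0x0a))

# Equal, non-zero counts suggest Windows-style CRLF line endings.
line_ending <- if (n_cr > 0 && n_cr == n_lf) {
  "CRLF"
} else if (n_cr == 0) {
  "LF"
} else {
  "mixed"
}
line_ending
```

Running the same byte count on the attached file would show whether its header section uses CRLF endings.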

jimhester added a commit that referenced this issue Dec 18, 2018

@jimhester jimhester (Member) commented Dec 18, 2018

@ldecicco-USGS there was an issue with skipping and Windows newlines in the previous version. It should now be fixed.

library(readr)
url <- "https://nwis.waterdata.usgs.gov/nwis/qwdata?multiple_site_no=04024430,04024000&multiple_parameter_cds=34247,30234,32104,34220&param_cd_operator=OR&list_of_search_criteria=multiple_site_no,multiple_parameter_cds&group_key=NONE&sitefile_output_format=html_table&column_name=agency_cd&column_name=site_no&column_name=station_nm&inventory_output=0&rdb_inventory_output=file&TZoutput=0&pm_cd_compare=Greater%20than&radio_parm_cds=previous_parm_cds&qw_attributes=0&format=rdb&rdb_qw_attributes=expanded&date_format=YYYY-MM-DD&rdb_compression=value&qw_sample_wide=0&begin_date=2010-11-03"
read_tsv(url, skip = 124, col_names = FALSE)
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   X3 = col_date(format = ""),
#>   X4 = col_time(format = ""),
#>   X5 = col_logical(),
#>   X6 = col_logical(),
#>   X12 = col_logical(),
#>   X13 = col_logical(),
#>   X14 = col_logical(),
#>   X15 = col_double(),
#>   X16 = col_double(),
#>   X19 = col_double(),
#>   X21 = col_double(),
#>   X25 = col_double(),
#>   X27 = col_logical(),
#>   X28 = col_double(),
#>   X29 = col_double(),
#>   X31 = col_double()
#> )
#> See spec(...) for full column specifications.
#> # A tibble: 204 x 33
#>    X1    X2    X3         X4    X5    X6    X7    X8    X9    X10   X11  
#>    <chr> <chr> <date>     <tim> <lgl> <lgl> <chr> <chr> <chr> <chr> <chr>
#>  1 USGS  0402… 2011-03-15 10:35 NA    NA    CDT   K     USGS… WS    GR11…
#>  2 USGS  0402… 2011-03-15 10:35 NA    NA    CDT   K     USGS… WS    GR11…
#>  3 USGS  0402… 2011-03-15 10:35 NA    NA    CDT   K     USGS… WS    GR11…
#>  4 USGS  0402… 2011-03-15 10:35 NA    NA    CDT   K     USGS… WS    GR11…
#>  5 USGS  0402… 2011-04-20 10:00 NA    NA    CDT   K     USGS… WS    GR11…
#>  6 USGS  0402… 2011-04-20 10:00 NA    NA    CDT   K     USGS… WS    GR11…
#>  7 USGS  0402… 2011-04-20 10:00 NA    NA    CDT   K     USGS… WS    GR11…
#>  8 USGS  0402… 2011-04-20 10:00 NA    NA    CDT   K     USGS… WS    GR11…
#>  9 USGS  0402… 2011-05-01 13:00 NA    NA    CDT   K     USGS… WS    GR11…
#> 10 USGS  0402… 2011-05-01 13:00 NA    NA    CDT   K     USGS… WS    GR11…
#> # ... with 194 more rows, and 22 more variables: X12 <lgl>, X13 <lgl>,
#> #   X14 <lgl>, X15 <dbl>, X16 <dbl>, X17 <chr>, X18 <chr>, X19 <dbl>,
#> #   X20 <chr>, X21 <dbl>, X22 <chr>, X23 <chr>, X24 <chr>, X25 <dbl>,
#> #   X26 <chr>, X27 <lgl>, X28 <dbl>, X29 <dbl>, X30 <chr>, X31 <dbl>,
#> #   X32 <chr>, X33 <chr>

read_tsv("https://github.com/tidyverse/readr/files/2687672/RDB1Example.txt",skip = 24, col_names = FALSE)
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   X2 = col_character(),
#>   X3 = col_date(format = ""),
#>   X4 = col_double(),
#>   X5 = col_character()
#> )
#> # A tibble: 31 x 5
#>    X1    X2       X3            X4 X5   
#>    <chr> <chr>    <date>     <dbl> <chr>
#>  1 USGS  02177000 2012-09-01   191 A    
#>  2 USGS  02177000 2012-09-02   213 A    
#>  3 USGS  02177000 2012-09-03   409 A    
#>  4 USGS  02177000 2012-09-04   722 A    
#>  5 USGS  02177000 2012-09-05   634 A    
#>  6 USGS  02177000 2012-09-06   414 A    
#>  7 USGS  02177000 2012-09-07   320 A    
#>  8 USGS  02177000 2012-09-08   276 A    
#>  9 USGS  02177000 2012-09-09   251 A    
#> 10 USGS  02177000 2012-09-10   227 A    
#> # ... with 21 more rows

Created on 2018-12-18 by the reprex package (v0.2.1)

@lock lock bot commented Jun 16, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 16, 2019