readr no longer reproduces the problems from challenge.csv #1398

Open
ganong123 opened this issue Apr 18, 2022 · 3 comments · May be fixed by #1431

@ganong123

challenge.csv is designed to teach some key challenges of parsing and features of readr. However, the example is broken.

challenge <- read_csv(readr_example("challenge.csv"))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
#> Warning: 1000 parsing failures.
#>  row col           expected     actual                                                           file
#> 1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> .... ... .................. .......... ..............................................................
#> See problems(...) for more details.

source

Here's what I actually get when I run this code (the screen cap is from the vignette, but I see the same result on my computer):
[Screenshot: Screen Shot 2022-04-17 at 9 46 39 PM]

@sbearrows sbearrows self-assigned this Aug 25, 2022
@sbearrows
Contributor

@jennybc It seems like this is due to vroom since I can replicate the parsing error with edition 1

library(readr)
with_edition(
  1,
  read_csv(readr_example("challenge.csv"), show_col_types = FALSE)
)
#> Warning: 1000 parsing failures.
#>  row col           expected     actual                                                                                               file
#> 1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> .... ... .................. .......... ..................................................................................................
#> See problems(...) for more details.
#> # A tibble: 2,000 × 2
#>        x y    
#>    <dbl> <lgl>
#>  1   404 NA   
#>  2  4172 NA   
#>  3  3004 NA   
#>  4   787 NA   
#>  5    37 NA   
#>  6  2332 NA   
#>  7  2489 NA   
#>  8  1449 NA   
#>  9  3665 NA   
#> 10  3863 NA   
#> # … with 1,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-08-30 by the reprex package (v2.0.1.9000)
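To make the edition difference easier to see, here is a minimal sketch that compares the column specification readr reports for the same file under edition 1 and edition 2. It only uses `with_edition()`, `spec()`, and `problems()`; the variable names (`path`, `spec_v1`, `dat_v2`, `spec_v2`) are made up for illustration, and the exact guesses may differ between readr/vroom versions.

```r
library(readr)

path <- readr_example("challenge.csv")

# Edition 1: read with the legacy parser and keep only the guessed spec.
# The parsing-failure warning shown above is suppressed here.
spec_v1 <- suppressWarnings(
  with_edition(1, spec(read_csv(path, show_col_types = FALSE)))
)

# Edition 2 (vroom): read the same file with default settings.
dat_v2 <- read_csv(path, show_col_types = FALSE)
spec_v2 <- spec(dat_v2)

# Print both specifications side by side to compare the guess for `y`.
spec_v1
spec_v2

# Number of recorded parsing problems under edition 2.
nrow(problems(dat_v2))
```

If the two specifications disagree on `y`, that would be consistent with the difference coming from how each edition guesses column types rather than from the file itself.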

This is happening in a vignette that we already want/need to update so maybe a minimal solution is best for now. We could specify the column types like we do for problems() tests:

library(readr)
read_csv(
  readr_example("challenge.csv"),
  show_col_types = FALSE,
  col_types = "dl"
)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 2,000 × 2
#>        x y    
#>    <dbl> <lgl>
#>  1   404 NA   
#>  2  4172 NA   
#>  3  3004 NA   
#>  4   787 NA   
#>  5    37 NA   
#>  6  2332 NA   
#>  7  2489 NA   
#>  8  1449 NA   
#>  9  3665 NA   
#> 10  3863 NA   
#> # … with 1,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-08-30 by the reprex package (v2.0.1.9000)
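As a possible follow-up for the vignette, the forced mismatch can be paired with a call to `problems()` and with a corrected specification. This is only a sketch: `bad`, `probs`, and `good` are hypothetical names, and it assumes that declaring `y` as a date is the intended fix, as the dates in the `actual` column above suggest.

```r
library(readr)

path <- readr_example("challenge.csv")

# Force the mismatch: "dl" means double for x and logical for y.
bad <- suppressWarnings(
  read_csv(path, col_types = "dl", show_col_types = FALSE)
)

# One row per value that could not be parsed as the requested type.
probs <- problems(bad)
head(probs)

# Declaring y as a date instead should read the same file cleanly.
good <- read_csv(
  path,
  col_types = cols(x = col_double(), y = col_date()),
  show_col_types = FALSE
)
nrow(problems(good))
```

The warning is suppressed only to keep the output short; the problems are still recorded on the returned data frame and can be retrieved afterwards.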

Otherwise we'd need to modify/replace challenge.csv with something that actually trips vroom. An easy candidate would be something with a varying number of columns per row, because vroom will always warn:

library(readr)
# create a file like this
df <- glue::glue('x,y
                 1,2
                 3,4
                 5,6
                 7
                 8,9
                 10')

tf <- withr::local_tempfile(lines = df)

read_csv(tf, show_col_types = FALSE)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 6 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
#> 2     3     4
#> 3     5     6
#> 4     7    NA
#> 5     8     9
#> 6    10    NA

Created on 2022-08-30 by the reprex package (v2.0.1.9000)
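The same ragged file can be built with base R only, which may be easier to drop into a vignette chunk than `glue` plus `withr`. The sketch below writes the file with `writeLines()` and then inspects `problems()`; the names `tf`, `dat`, and `probs` are illustrative, and the exact wording of the recorded problems may vary by vroom version.

```r
library(readr)

# Write the ragged example to a temporary file (base R only).
tf <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2", "3,4", "5,6", "7", "8,9", "10"), tf)

dat <- suppressWarnings(read_csv(tf, show_col_types = FALSE))

# Rows with a missing field are kept, with NA filling the gap.
dat

# Each short row should be listed here.
probs <- problems(dat)
probs

# Clean up the temporary file.
unlink(tf)
```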

@jennybc
Member

jennybc commented Aug 30, 2022

I think you should update that section of the vignette for the modern readr 2e / vroom era.

Each of these functions first calls spec_xxx() (as described above), and then parses the file according to that column specification:

^ this is no longer true, needs rewording

The rectangular parsing functions almost always succeed; they’ll only fail if the format is severely messed up. Instead, readr will generate a data frame of problems. The first few will be printed out, and you can access them all with problems():

One option is to force parsing problems to happen with challenge.csv, which at least allows some discussion of problems(). As you say, you can force it by providing (bad) column types. But this is pretty artificial; it's a stopgap.

It would be better (but harder) to think about what sort of realistic problems we want to demonstrate and create a new small dataset that has such a problem. Issues relating to parsing problems in readr and vroom would be a good source of inspiration.

@sbearrows
Contributor

The most common uses of problems() that I can see are either (a) a user specifies col_types but some data later in the file is unexpected and doesn't match col_types (#1376), or (b) the number of columns per row varies (#1328, tidyverse/vroom#439). So I don't think either of the two situations is necessarily artificial/forced.

@sbearrows sbearrows linked a pull request Sep 2, 2022 that will close this issue