overzealous guessing/parsing for the "number" format based on grouping marks? #1520

Open
jmobrien opened this issue Oct 24, 2023 · 0 comments

jmobrien commented Oct 24, 2023

I'm running into cases where the type-guessing rules seem too eager to pick "number" whenever grouping marks are present (commas, in my case), which in turn causes data quality problems on import. Would it be worth adding a few more checks before a "number" guess is made?

An example, similar to what I actually encountered:

### Can work well:
readr::guess_parser("1,234,567")     # Fine - 1234567
#> [1] "number"
readr::guess_parser("0,234,567")    # Thoughtful--leading zero inconsistent w/idea of "number"
#> [1] "character"

### But:
readr::guess_parser("1,2,4")            # Not a standard number (in my locale)
#> [1] "number"
readr::guess_parser("1,2,")              # Farther afield
#> [1] "number"
readr::guess_parser("1,2,,,,,4,,,")     # Even farther
#> [1] "number"


### Real-world example: data that uses quotes to protect embedded commas,
### including groups of numeric reference codes:
csv_dat <-
  c(
    'char,     num,   numeric_codes,  mixed',
    'a,        1,     1,              a',
    '"oh,my",  2,     2,              2',
    'c,        3,     "1,23,4",       c',
    'd,        4,     "1,2,3,4",      d'
  )

### Write it out:
tmp <- tempfile(fileext = ".csv")

# Pass the path directly so no unclosed file() connection is left behind
csv_dat |>
  stringr::str_remove_all(" ") |>
  writeLines(tmp)


###  numeric_codes read back in as "number" thanks to the commas:
dat <- readr::read_csv(tmp)
#> Rows: 4 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): char, mixed
#> dbl (1): num
#> num (1): numeric_codes

### And info can be lost: in numeric_codes, rows 3 and 4 are now indistinguishable:
dat
#> # A tibble: 4 × 4
#>   char    num numeric_codes mixed
#>   <chr> <dbl>         <dbl> <chr>
#> 1 a         1             1 a    
#> 2 oh,my     2             2 2    
#> 3 c         3          1234 c    
#> 4 d         4          1234 d

Created on 2023-10-24 with reprex v2.0.2
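
To make the "few more checks" idea concrete, here is a rough sketch of one possible check. It is not how readr decides things internally, and `looks_like_grouped_number()` is a hypothetical helper name. It assumes a comma grouping mark, groups of exactly three digits, and no leading zero, which will not hold for every locale.

```r
# Hypothetical helper, not part of readr: TRUE only when commas are used
# strictly as thousands separators (1-3 leading digits with no leading zero,
# then one or more groups of exactly 3 digits, then an optional decimal part).
looks_like_grouped_number <- function(x) {
  grepl("^[+-]?[1-9][0-9]{0,2}(,[0-9]{3})+(\\.[0-9]+)?$", x)
}

res <- looks_like_grouped_number(c("1,234,567", "1,2,4", "1,2,", "1,2,,,,,4,,,"))
res
#> [1]  TRUE FALSE FALSE FALSE
```

With a check along these lines, only the first string would still be a candidate for "number"; the other three would fall through to "character".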

Of course, one could just specify the columns explicitly, or read everything as character. But I don't always know the complete structure of my data in advance, so while those are options, they aren't ideal. It would be great to be able to lean at least partly on the default guessing to streamline things.
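
For reference, the explicit-column workaround looks roughly like this when only the problem column is known ahead of time. The file is recreated here so the snippet stands alone; the column name `numeric_codes` comes from the reprex above, and the remaining columns are still guessed.

```r
library(readr)

# Recreate a small file like the one in the reprex above.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "char,num,numeric_codes,mixed",
    "a,1,1,a",
    "\"oh,my\",2,2,2",
    "c,3,\"1,23,4\",c",
    "d,4,\"1,2,3,4\",d"
  ),
  tmp
)

# Pin only the problematic column; every other column keeps the default guess.
dat <- read_csv(
  tmp,
  col_types = cols(numeric_codes = col_character())
)

# The codes survive as text, so rows 3 and 4 stay distinguishable.
dat$numeric_codes
```

This only helps once the offending column has been identified, which is exactly the limitation described above.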

(PS: riffing a bit, but as an alternative, what if it were possible to restrict guessing to a subset of types? For instance, I may be starting with an unknown mix of character, numeric, and logical columns, but I know there won't be any pretty-formatted numbers, factors, or times. That's probably a pretty common scenario, I'd think?)
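
Something close to this can be approximated today with a small wrapper. This is only a sketch: `guess_restricted()` is a hypothetical helper, not a readr function. It reads everything as character, asks `guess_parser()` for each column's type, and accepts the guess only if it is in an allowed set; anything else stays as text.

```r
library(readr)

# Hypothetical helper, not part of readr: re-guess character columns, but only
# convert a column when its guessed type is in `allowed`.
guess_restricted <- function(df, allowed = c("logical", "double")) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (!is.character(col)) next
    if (guess_parser(col) %in% allowed) {
      df[[nm]] <- parse_guess(col)
    }
  }
  df
}

# Recreate a small file like the one in the reprex above.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "char,num,numeric_codes,mixed",
    "a,1,1,a",
    "\"oh,my\",2,2,2",
    "c,3,\"1,23,4\",c",
    "d,4,\"1,2,3,4\",d"
  ),
  tmp
)

# Read everything as text first, then apply the restricted guess.
raw <- read_csv(tmp, col_types = cols(.default = col_character()))
out <- guess_restricted(raw)
```

Here `num` is converted to double, while `numeric_codes` stays character because its guess ("number") is not in the allowed set. The cost is a second pass over the data, which a built-in option would avoid.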
