Too many delimiters in a row causes final column to be "overstuffed" with values #439

Closed
TMRHarrison opened this issue May 12, 2022 · 2 comments
@TMRHarrison

vroom doesn't fail, stop, or raise an error when a file has a row with more fields than expected. Instead, the surplus values (delimiters included) are forced into the final column of the output. A warning is given, but it's cryptic.

Take this tsv file:

num     chr     num2    num3
1       charab  123     434
2       charact         345     2345
3               chaaa   3123    1231

The call used and the result:

> vroom::vroom("test.tsv")
Rows: 3 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): chr, num2, num3
dbl (1): num

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
    num chr     num2  num3
  <dbl> <chr>   <chr> <chr>
1     1 charab  123   "434"
2     2 charact NA    "345\t2345"
3     3 NA      chaaa "3123\t1231"
Warning message:
One or more parsing issues, see `problems()` for details
> problems()
Error in vroom_materialize(x, replace = FALSE) :
  argument "x" is missing, with no default

The output I would expect is a more descriptive error, like data.table::fread() gives:

   num    chr num2 num3
1:   1 charab  123  434
Warning message:
In data.table::fread("test.tsv") :
  Stopped early on line 3. Expected 4 fields but found 5. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2	charact		345	2345>>

I would also expect vroom to either raise an error, discard the offending rows, or stop reading at the first offending row.
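In the meantime, one possible workaround is to check the field count of every line before reading. The following is only a sketch using base R's `utils::count.fields()`, not something vroom does itself; the temporary file stands in for `test.tsv`, and the names `path`, `n_fields`, `expected`, and `bad_lines` are illustrative.

```r
# Assumption: a temporary file stands in for test.tsv from this report.
path <- tempfile(fileext = ".tsv")
writeLines(c(
  "num\tchr\tnum2\tnum3",
  "1\tcharab\t123\t434",
  "2\tcharact\t\t345\t2345",
  "3\t\tchaaa\t3123\t1231"
), path)

# Count tab-separated fields on every line; quoting and comment handling
# are switched off so each tab is treated as a delimiter.
n_fields <- utils::count.fields(
  path, sep = "\t", quote = "", comment.char = "",
  blank.lines.skip = FALSE
)

expected  <- n_fields[1]                  # the header defines the expected width
bad_lines <- which(n_fields != expected)  # 1-based line numbers in the file

if (length(bad_lines) > 0) {
  message(
    "Expected ", expected, " fields; mismatch on line(s): ",
    paste(bad_lines, collapse = ", ")
  )
}
```

From here the caller can decide whether to stop, drop the lines in `bad_lines`, or fix the file before handing it to `vroom::vroom()`.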

@TMRHarrison
Author

After reading up on problems() and passing it the object returned by vroom() as an argument, I find the error descriptions sufficient, though a bit obfuscated for my tastes. It would still be nice to have the offending rows dealt with in some way other than forcing all the surplus values into the last column.
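For reference, a minimal sketch of that call, assuming the vroom package is installed and the `test.tsv` contents above are written to a temporary file (the names `tf`, `dat`, and `probs` are illustrative):

```r
library(vroom)

# Assumption: a temporary file stands in for test.tsv from this report.
tf <- tempfile(fileext = ".tsv")
writeLines(c(
  "num\tchr\tnum2\tnum3",
  "1\tcharab\t123\t434",
  "2\tcharact\t\t345\t2345",
  "3\t\tchaaa\t3123\t1231"
), tf)

# Keep the result so it can be passed to problems() explicitly,
# rather than calling problems() with no argument.
dat <- vroom(tf, delim = "\t", show_col_types = FALSE)

# problems() returns a data frame describing the parsing issues.
probs <- problems(dat)
print(probs)
```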

@sbearrows self-assigned this Aug 24, 2022
@sbearrows
Contributor

Your data would actually be read correctly with readr::read_table(), which handles whitespace-delimited files with any number of whitespace characters between columns. Unfortunately, we are not currently pursuing replicating this feature in vroom (see #186).

text <- glue::glue(
'x\ty\tz\n
1\t2\t\t3\n
4\t\t5\t6\n')

tf <- withr::local_tempfile(lines = text)

# read_table() handles this messy data
readr::read_table(tf, show_col_types = FALSE)
#> # A tibble: 2 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

Created on 2022-08-26 by the reprex package (v2.0.1.9000)
