read_delim delimiters at end of line not treated as delimiters #1328

Closed
hlynurhallgrims opened this issue Nov 18, 2021 · 2 comments

@hlynurhallgrims

I wanted to point out the following change between versions 1.4.0 and 2.0.0 and ask if this is a feature or a bug.

A delimiter at the end of a line (";" in this case) used to be treated as a delimiter even when the header line had no matching trailing delimiter. That has changed with the new parsing engine, and I didn't see it mentioned in the release notes. This is definitely not a big issue, especially thanks to the brilliant with_edition() function, but I wanted to point it out because some services that hand off ";"-delimited data (erroneously?) put a delimiter at the end of each data line.

Anyway, thanks for all your work on this incredible package!

# read_delim delimiters at eol not treated as delimiters

library(readr)

text <- "Aldur;Fjöldi
0-5 ára;1.287;
6-12 ára;1.438;
13-16 ára;730;
17-24 ára;1.409;
25-34 ára;3.891;
35-66 ára;6.561;
67 ára og eldri;1.683;
Samtals;16.999;"

# Version 1.4 and earlier
with_edition(1, read_delim(text, delim = ";", locale = locale(decimal_mark = ",", grouping_mark = ".")))
#> Warning: 8 parsing failures.
#> row col  expected    actual         file
#>   1  -- 2 columns 3 columns literal data
#>   2  -- 2 columns 3 columns literal data
#>   3  -- 2 columns 3 columns literal data
#>   4  -- 2 columns 3 columns literal data
#>   5  -- 2 columns 3 columns literal data
#> ... ... ......... ......... ............
#> See problems(...) for more details.
#> # A tibble: 8 × 2
#>   Aldur           Fjöldi
#>   <chr>            <dbl>
#> 1 0-5 ára           1287
#> 2 6-12 ára          1438
#> 3 13-16 ára          730
#> 4 17-24 ára         1409
#> 5 25-34 ára         3891
#> 6 35-66 ára         6561
#> 7 67 ára og eldri   1683
#> 8 Samtals          16999

# Version 2+
read_delim(text, delim = ";", locale = locale(decimal_mark = ",", grouping_mark = "."))
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 8 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): Aldur, Fjöldi
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 8 × 2
#>   Aldur           Fjöldi 
#>   <chr>           <chr>  
#> 1 0-5 ára         1.287; 
#> 2 6-12 ára        1.438; 
#> 3 13-16 ára       730;   
#> 4 17-24 ára       1.409; 
#> 5 25-34 ára       3.891; 
#> 6 35-66 ára       6.561; 
#> 7 67 ára og eldri 1.683; 
#> 8 Samtals         16.999;

Created on 2021-11-18 by the reprex package (v2.0.0)

@jimhester
Collaborator

The issue is that the header line does not match the rest of the file: the first line has only one delimiter, but every other line has two. If the first line were Aldur;Fjöldi; then I think it would do what you expect.

library(readr)

text <- "Aldur;Fjöldi;
0-5 ára;1.287;
6-12 ára;1.438;
13-16 ára;730;
17-24 ára;1.409;
25-34 ára;3.891;
35-66 ára;6.561;
67 ára og eldri;1.683;
Samtals;16.999;"

is_locale <- locale("is", decimal_mark = ",", grouping_mark = ".")
read_delim(text, delim = ";", locale = is_locale)
#> New names:
#> * `` -> ...3
#> Rows: 8 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (1): Aldur
#> lgl (1): ...3
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 8 × 3
#>   Aldur           Fjöldi ...3 
#>   <chr>            <dbl> <lgl>
#> 1 0-5 ára           1287 NA   
#> 2 6-12 ára          1438 NA   
#> 3 13-16 ára          730 NA   
#> 4 17-24 ára         1409 NA   
#> 5 25-34 ára         3891 NA   
#> 6 35-66 ára         6561 NA   
#> 7 67 ára og eldri   1683 NA   
#> 8 Samtals          16999 NA
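
The header/body mismatch described above can be checked before parsing by counting delimiters per line. This is a minimal base R sketch, not part of readr: `count_delims` is a hypothetical helper introduced only for illustration, and the data is a shortened copy of the example from this thread.

```r
# Sketch only: `count_delims` is a hypothetical helper, not a readr function.
# `text` is a shortened copy of the example data from this thread.
text <- "Aldur;Fjöldi
0-5 ára;1.287;
6-12 ára;1.438;
Samtals;16.999;"

count_delims <- function(x, delim = ";") {
  # Split the literal data into individual lines.
  lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
  # gregexpr() finds every delimiter in each line; regmatches() extracts
  # them as one vector per line, and lengths() counts the matches.
  lengths(regmatches(lines, gregexpr(delim, lines, fixed = TRUE)))
}

n_delims <- count_delims(text)
n_delims
#> [1] 1 2 2 2

# TRUE only if every data line has as many delimiters as the header.
header_matches_body <- all(n_delims[-1] == n_delims[1])
header_matches_body
#> [1] FALSE
```

A header count that differs from the data lines, as here, suggests the file will need either a corrected header or explicit column names.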

A workaround would be to skip the first line and give the column names explicitly:

text <- "Aldur;Fjöldi
0-5 ára;1.287;
6-12 ára;1.438;
13-16 ára;730;
17-24 ára;1.409;
25-34 ára;3.891;
35-66 ára;6.561;
67 ára og eldri;1.683;
Samtals;16.999;"

# Take the column names from the header line, then skip that line when parsing.
header <- strsplit(readr::read_lines(text, n_max = 1), ";")[[1]]
read_delim(
  text,
  delim = ";",
  skip = 1,
  col_names = header,
  locale = is_locale
)
#> Rows: 8 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (1): Aldur
#> lgl (1): X3
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 8 × 3
#>   Aldur           Fjöldi X3   
#>   <chr>            <dbl> <lgl>
#> 1 0-5 ára           1287 NA   
#> 2 6-12 ára          1438 NA   
#> 3 13-16 ára          730 NA   
#> 4 17-24 ára         1409 NA   
#> 5 25-34 ára         3891 NA   
#> 6 35-66 ára         6561 NA   
#> 7 67 ára og eldri   1683 NA   
#> 8 Samtals          16999 NA

Created on 2021-11-18 by the reprex package (v2.0.1)
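
A possible variation on the workaround above, if the empty trailing column is unwanted: name all three columns and skip the last one with a compact `col_types` string. This is a sketch under assumptions rather than something proposed in the thread. It assumes readr 2.0.0 or later (for `I()` to mark literal data), and the column name `trailing` is a placeholder chosen here.

```r
library(readr)

text <- "Aldur;Fjöldi
0-5 ára;1.287;
6-12 ára;1.438;
13-16 ára;730;
17-24 ára;1.409;
25-34 ára;3.891;
35-66 ára;6.561;
67 ára og eldri;1.683;
Samtals;16.999;"

is_locale <- locale("is", decimal_mark = ",", grouping_mark = ".")

result <- read_delim(
  I(text),                 # I() marks the string as literal data (readr >= 2.0.0)
  delim = ";",
  skip = 1,                # skip the header line that lacks the trailing delimiter
  # "trailing" is a placeholder name for the empty third column (assumption).
  col_names = c("Aldur", "Fjöldi", "trailing"),
  # Compact spec: c = character, n = number, - = skip the column.
  col_types = "cn-",
  locale = is_locale
)
```

Giving `col_types` explicitly also avoids relying on type guessing for the Fjöldi column, since `col_number()` applies the grouping mark from the locale.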

@hlynurhallgrims
Author

Thank you for taking the time to clear this up, Jim (and again, thanks for a fantastic package).
