Factor parsing is incorrect when escaping backslashes #184

Closed
mllg opened this issue Oct 14, 2019 · 2 comments
Labels
bug an unexpected problem or unintended behavior

Comments

mllg commented Oct 14, 2019

In the following example, the first escaped quote disappears in the first column. I expected to maybe get an error here because I've not set escape_backslash:

Even worse, the "A" in the first row now is a "B". 😟

path = tempfile()

writeLines(c(
  "'\\'A\\'',T",
  "'\\'B\\'',F"
), con = path)
vroom::vroom(path, col_names = c("x", "y"), col_types = list(vroom::col_factor(), vroom::col_factor()), quote = "'")

# A tibble: 2 x 2
  x        y
  <fct>    <fct>
1 "\\B\\'" T
2 "\\B\\'" F

The latter also happens with escape_backslash:

vroom::vroom(path, col_names = c("x", "y"), col_types = list(vroom::col_factor(), vroom::col_factor()), quote = "'", escape_backslash = TRUE)

# A tibble: 2 x 2
  x     y
  <fct> <fct>
1 'B'   F
2 'B'   F

Another example without escapes:

writeLines(c(
  "''A'',T",
  "''B'',F"
), con = path)
vroom::vroom(path, col_names = c("x", "y"), col_types = list(vroom::col_factor(), vroom::col_factor()), quote = "'", escape_backslash = TRUE)

# A tibble: 2 x 2
  x     y
  <fct> <fct>
1 B'    F
2 B'    F
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.6.so

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] nvimcom_0.9-83    gtfo_0.0.0.9000   data.table_1.12.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       fansi_0.4.0      assertthat_0.2.1 utf8_1.1.4       crayon_1.3.4     withr_2.1.2      magrittr_1.5     pillar_1.4.2     cli_1.1.0
[10] rlang_0.4.0      vroom_1.0.2      tools_3.6.1      glue_1.3.1       purrr_0.3.2      parallel_3.6.1   compiler_3.6.1   pkgconfig_2.0.3  tidyselect_0.2.5
[19] tibble_2.1.3

mllg commented Oct 14, 2019

It seems this is not related to quotes; the following data is also read incorrectly:

path = tempfile()
writeLines(c(
  "A,T",
  "B,F"
), con = path)
vroom::vroom(path, col_names = c("x", "y"), col_types = list(vroom::col_factor(), vroom::col_factor()), escape_backslash = TRUE)

# A tibble: 2 x 2
  x     y
  <fct> <fct>
1 B     F
2 B     F

There are duplicated factor levels:

Classes 'tbl_df', 'tbl' and 'data.frame':       2 obs. of  2 variables:
 $ x: Factor w/ 2 levels "B","B": 1 2
 $ y: Factor w/ 2 levels "F","F": 1 2
 - attr(*, "spec")=
  .. cols(
  ..   x = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  ..   y = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
  .. )
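
For anyone wanting to check their own data for the same symptom, here is a minimal base R sketch that flags factor columns with repeated level labels, as in the str() output above. The helper name has_duplicated_levels is hypothetical and not part of vroom, and the malformed factor is built by hand as a stand-in for the parsed column rather than produced by vroom itself.

```r
# Hypothetical helper (not part of vroom): flag factor columns whose
# levels attribute contains repeated labels.
has_duplicated_levels <- function(cols) {
  vapply(
    cols,
    function(col) is.factor(col) && anyDuplicated(levels(col)) > 0L,
    logical(1)
  )
}

# Hand-built stand-in for the malformed column shown above. factor() itself
# rejects duplicated levels, so the attributes are set directly here.
bad <- structure(1:2, levels = c("B", "B"), class = "factor")
good <- factor(c("A", "B"))

# Returns a named logical vector, one entry per column.
has_duplicated_levels(list(x = bad, y = good))
```

Since a data frame is a list of columns, the same call should also work on the tibble returned by the reader.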

@jimhester
Collaborator

A minimal reprex is

vroom::vroom("A,T\nB,F\n", col_names = FALSE, col_types = list("f", "f"), escape_backslash = TRUE)
#> # A tibble: 2 x 2
#>   X1    X2   
#>   <fct> <fct>
#> 1 B     F    
#> 2 B     F
vroom::vroom("A,T\nB,F\n", col_names = FALSE, col_types = list("f", "f"), escape_backslash = FALSE)
#> # A tibble: 2 x 2
#>   X1    X2   
#>   <fct> <fct>
#> 1 A     T    
#> 2 B     F

Created on 2019-10-14 by the reprex package (v0.3.0)

jimhester added the "bug" label (an unexpected problem or unintended behavior) on Dec 3, 2019
jimhester changed the title from "vroom parses data with quotes incorrectly" to "Factor parsing is incorrect" on Dec 17, 2019
jimhester changed the title from "Factor parsing is incorrect" to "Factor parsing is incorrect when escaping backslashes" on Dec 17, 2019