Character encoding problem leading to errors in distinct(), write_tsv(), unique() ... ? #2971

Closed
breichholf opened this issue Jul 14, 2017 · 8 comments

@breichholf breichholf commented Jul 14, 2017

I've been trying to reproduce this error, but I'm having difficulties. Please bear with me. Code to reproduce appears below!

I have a file with a few columns, which is read in via read_tsv. I can then go on to group_by and mutate, but if I pipe into distinct() it throws an error if, and only if, I add .keep_all = TRUE (or if this is the case implicitly, as in dplyr >= 0.7.0).

The error I get is:

Error in distinct_impl(dist$data, dist$vars, dist$keep) :
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'integer'

For reproducibility, I created a gist from the file. However, the error sometimes 'magically' disappears, and sometimes I can reproduce it.

Here's the code that should reproduce it:

library(tidyverse)

gist <- 'https://gist.githubusercontent.com/breichholf/3b2e5eb253a932b8b0e540812811ecb6/raw/2798b2a58e281fcd3867e4dbf4adbe11f8a7b4f3/test.bed'

bed <- read_tsv(gist, col_names = c('chromosome', 'start', 'end', 'gene', 'score', 'strand', 'anno.id', 'interval.id', 'window.id'))

geneBed <- 
  bed %>%
  group_by(interval.id) %>%
  mutate(min.start = min(start),
         max.end = max(end),
         dist.to.start = start - min.start,
         exon.len = end - start,
         cds.start = min.start,
         cds.end = max.end,
         all.starts = paste(dist.to.start, collapse=","),
         all.lens = paste(exon.len, collapse=","))

> geneBed %>% distinct(interval.id, .keep_all = TRUE)
Error in distinct_impl(dist$data, dist$vars, dist$keep) :
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'integer'

The reason I figured it might have something to do with the encoding is that write_tsv also throws an error:

> geneBed %>% write_tsv('test.txt')
Error in stream_delim_(df, path, ...) :
  'translateCharUTF8' must be called on a CHARSXP

However, as mentioned above, geneBed %>% distinct(interval.id) without .keep_all = TRUE performs as expected. Perhaps also of note: unique() throws an error as well:

> geneBed %>% unique()
Error in paste(chromosome = c("chr1", "chr10", "chr11", "chr12", "chr11",  :
  'translateChar' must be called on a CHARSXP
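One way the encoding hypothesis could be checked is to inspect the declared encoding and UTF-8 validity of every character column. The sketch below uses only base R; `check_char_cols` and the `toy` data frame are hypothetical names made up for illustration, not part of dplyr or readr. If the underlying vectors are actually corrupted (as the CHARSXP messages hint), this check may itself fail, which would point away from a plain encoding issue.

```r
# Hypothetical helper (not part of dplyr or readr): for every character
# column, report the declared encodings and whether all values are valid UTF-8.
check_char_cols <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  lapply(df[is_chr], function(col) {
    list(
      encodings  = unique(Encoding(col)),
      valid_utf8 = all(validUTF8(col))
    )
  })
}

# Toy stand-in for the BED data (made-up values, not the gist contents).
toy <- data.frame(
  chromosome = c("chr1", "chr10"),
  start = c(1L, 5L),
  stringsAsFactors = FALSE
)

res <- check_char_cols(toy)
str(res)
```

Plain ASCII columns such as chromosome names would be expected to pass this check, so a clean result here suggests the problem lies elsewhere.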

I've tried the same code on another machine (OS X instead of Linux) and can reproduce the error if it's run from a fresh R session. I've managed to resolve the error (strangely, only sometimes) by splitting the mutate into several statements, or by piping directly into distinct after mutate, but unfortunately I haven't been able to work out how to reproduce the fix so far.
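As a cross-check that does not go through dplyr at all, the same per-interval columns and the "first row per interval" step can be sketched in base R. This is only an illustration on a made-up toy data frame (the values and the `collapse` helper are assumptions, not the gist data), using `ave()` for the grouped computations and `duplicated()` in place of distinct(interval.id, .keep_all = TRUE).

```r
# Toy stand-in for the BED file (made-up values, not the gist data).
bed <- data.frame(
  interval.id = c("a", "a", "b"),
  start = c(10L, 30L, 5L),
  end = c(20L, 45L, 9L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join a group's values into one comma-separated string.
collapse <- function(v) paste(v, collapse = ",")

# Grouped min/max, recycled to every row of the group (like a grouped mutate).
bed$min.start <- ave(bed$start, bed$interval.id, FUN = min)
bed$max.end <- ave(bed$end, bed$interval.id, FUN = max)
bed$dist.to.start <- bed$start - bed$min.start
bed$exon.len <- bed$end - bed$start

# Grouped string columns, the step the error seems to be tied to in dplyr.
bed$all.starts <- ave(as.character(bed$dist.to.start), bed$interval.id, FUN = collapse)
bed$all.lens <- ave(as.character(bed$exon.len), bed$interval.id, FUN = collapse)

# Keep the first row per interval.id.
gene_bed <- bed[!duplicated(bed$interval.id), ]
print(gene_bed)
```

If the base R result looks right on the real file, that would suggest the input data is fine and the problem sits in the dplyr code path.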

If there's anything I can do or try on my end please let me know.

Relevant session info:

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bindrcpp_0.2    dplyr_0.7.1     purrr_0.2.2.2   readr_1.1.1
[5] tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1   tidyverse_1.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11     cellranger_1.1.0 compiler_3.4.0   plyr_1.8.4
 [5] bindr_0.1        forcats_0.2.0    tools_3.4.0      jsonlite_1.5
 [9] lubridate_1.6.0  nlme_3.1-131     gtable_0.2.0     lattice_0.20-35
[13] pkgconfig_2.0.1  rlang_0.1.1      psych_1.7.5      curl_2.7
[17] parallel_3.4.0   haven_1.1.0      xml2_1.1.1       stringr_1.2.0
[21] httr_1.2.1       hms_0.3          grid_3.4.0       glue_1.1.1
[25] R6_2.2.2         readxl_1.0.0     foreign_0.8-69   reshape2_1.4.2
[29] modelr_0.1.0     magrittr_1.5     scales_0.4.1     rvest_0.3.2
[33] assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.5
[37] lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2

Edit

FWIW, downgrading to dplyr == 0.5.0 makes the above code work.

Member

@krlmlr krlmlr commented Jul 15, 2017

Thanks. This is worrisome. Could you please try running with gctorture() or, e.g., gctorture2(99), and see if you can replicate the error more reliably? I'll look into it, too.
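A minimal sketch of how that could be done without leaving the session in torture mode afterwards: the wrapper below (`with_gctorture` is a hypothetical name, not an existing function) switches `gctorture2()` on, evaluates one expression, and restores the previous setting on exit. Code runs much more slowly while it is active.

```r
# Hypothetical wrapper: evaluate `expr` with the garbage collector forced to
# run every `step` allocations, then restore the previous setting.
with_gctorture <- function(expr, step = 99L) {
  old <- gctorture2(step)               # returns the previous step value
  on.exit(gctorture2(old), add = TRUE)  # restore even if `expr` errors
  force(expr)                           # the lazy argument is evaluated here
}

# Harmless demonstration call; for this issue one would instead wrap the
# failing pipeline, e.g.
#   with_gctorture(geneBed %>% distinct(interval.id, .keep_all = TRUE))
out <- with_gctorture(paste(letters[1:3], collapse = ","))
print(out)
```

Because R evaluates arguments lazily, the wrapped expression only runs at `force(expr)`, that is, after torture mode has been enabled.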

Member

@krlmlr krlmlr commented Jul 15, 2017

Actually, I can replicate the error without gctorture(), so there seems to be no need to use it. Will investigate further.

Member

@krlmlr krlmlr commented Jul 15, 2017

Have you looked into nested tibbles? Try this:

geneBed <- 
  bed %>%
  group_by(interval.id) %>%
  mutate(min.start = min(start),
         max.end = max(end),
         dist.to.start = start - min.start,
         exon.len = end - start,
         cds.start = min.start,
         cds.end = max.end) %>%
  nest(dist.to.start, exon.len)

Your original problem seems to be caused by the grouped mutate that assigns a string. Looks like a protection error to me. Simpler reprex:

library(dplyr)

set.seed(20170715L)

df <-
  data_frame(x = 1:10000) %>%
  group_by(x) %>%
  mutate(y = as.character(runif(1L)),
         z = as.character(runif(1L)))

df %>% distinct(x, .keep_all = TRUE)
#> Error in distinct_impl(dist$data, dist$vars, dist$keep): Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'list'

Member

@krlmlr krlmlr commented Jul 15, 2017

Can you please try with:

# install.packages("remotes")
remotes::install_github("tidyverse/dplyr#2976")

Author

@breichholf breichholf commented Jul 17, 2017

Sorry for the late reply. #2976 fixed it! 👍

Tried both with my code above, and your reprex.


@Fablepongiste Fablepongiste commented Jul 25, 2017

Do we know when this fix will be in a released version of dplyr?

@krlmlr krlmlr closed this in #2976 Jul 25, 2017
Member

@krlmlr krlmlr commented Jul 25, 2017

It is now in the dev version, but a CRAN release is likely to take a while.


@Fablepongiste Fablepongiste commented Jul 26, 2017

So will it be in the next dplyr release? 0.7.3?

3 participants