
Character encoding problem leading to errors in distinct(), write_tsv(), unique() ... ? #2971

@breichholf

I've been trying to reproduce this error reliably, but I'm having difficulties, so please bear with me. Code to reproduce appears below!

I have a file with a few columns, which is read in via read_tsv. I can then group_by and mutate, but if I pipe the result into distinct() it throws an error if and only if I add .keep_all = TRUE (or if this is the case implicitly, as in dplyr >= 0.7.0).

The error I get is:

Error in distinct_impl(dist$data, dist$vars, dist$keep) :
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'integer'

To make this reproducible, I created a gist from the file. However, the error is intermittent: sometimes it 'magically' disappears, and sometimes I can reproduce it.

Here's the code that should reproduce it:

library(tidyverse)

gist <- 'https://gist.githubusercontent.com/breichholf/3b2e5eb253a932b8b0e540812811ecb6/raw/2798b2a58e281fcd3867e4dbf4adbe11f8a7b4f3/test.bed'

bed <- read_tsv(gist, col_names = c('chromosome', 'start', 'end', 'gene', 'score', 'strand', 'anno.id', 'interval.id', 'window.id'))

geneBed <- 
  bed %>%
  group_by(interval.id) %>%
  mutate(min.start = min(start),
         max.end = max(end),
         dist.to.start = start - min.start,
         exon.len = end - start,
         cds.start = min.start,
         cds.end = max.end,
         all.starts = paste(dist.to.start, collapse=","),
         all.lens = paste(exon.len, collapse=","))

> geneBed %>% distinct(interval.id, .keep_all = TRUE)
Error in distinct_impl(dist$data, dist$vars, dist$keep) :
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'integer'

The reason I figured it might have something to do with the encoding is that write_tsv also throws an error:

> geneBed %>% write_tsv('test.txt')
Error in stream_delim_(df, path, ...) :
  'translateCharUTF8' must be called on a CHARSXP
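Since both errors complain about a non-CHARSXP element inside a character vector, one way to probe the encoding hypothesis is to inspect the storage type of each column and check the character columns for valid UTF-8. The following is only a minimal sketch using base R: `toy` is a hypothetical stand-in for geneBed, not the data from the gist, and on a vector that is genuinely corrupted at the C level these checks may themselves raise an error rather than report anything useful.

```r
# Hypothetical stand-in for geneBed; the real data comes from the gist.
toy <- data.frame(
  chromosome = c("chr1", "chr10"),
  start = c(100L, 200L),
  all.starts = c("0,50", "0"),
  stringsAsFactors = FALSE
)

# Storage type of every column; string columns should report "character".
col_types <- vapply(toy, typeof, character(1))

# For character columns only, check that every element is valid UTF-8.
chr_cols <- names(col_types)[col_types == "character"]
utf8_ok <- vapply(toy[chr_cols], function(x) all(validUTF8(x)), logical(1))

print(col_types)
print(utf8_ok)
```

If every column reports the expected type and passes the UTF-8 check, the problem is more likely in how the vectors are built internally than in the encoding of the input file.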

However, as mentioned above geneBed %>% distinct(interval.id) without .keep_all = TRUE performs as expected. Additionally, perhaps of note: unique() also throws an error:

> geneBed %>% unique()
Error in paste(chromosome = c("chr1", "chr10", "chr11", "chr12", "chr11",  :
  'translateChar' must be called on a CHARSXP

I've tried the same code on another machine (OSX instead of Linux) and can reproduce the error there when starting from a fresh R session. I've (strangely, only sometimes) managed to avoid the error by splitting the mutate into several statements, or by piping directly into distinct after mutate, but unfortunately I haven't been able to work out how to make that fix reproducible.
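For illustration, the "split the mutate" workaround mentioned above could look roughly like the sketch below. It is not a confirmed fix (as noted, it only sometimes avoided the error), and `toy`, `step1`, `step2` and `result` are hypothetical names on made-up data rather than the gist file.

```r
library(dplyr)

# Hypothetical toy data with the same key columns as the BED file.
toy <- tibble(
  interval.id = c("a", "a", "b"),
  start = c(10L, 30L, 5L),
  end = c(20L, 45L, 9L)
)

# First statement: numeric per-group summaries only.
step1 <- toy %>%
  group_by(interval.id) %>%
  mutate(min.start = min(start),
         dist.to.start = start - min.start,
         exon.len = end - start)

# Second statement: the paste() calls that build the character columns.
step2 <- step1 %>%
  mutate(all.starts = paste(dist.to.start, collapse = ","),
         all.lens = paste(exon.len, collapse = ","))

# Keep one row per interval together with all other columns.
result <- step2 %>%
  distinct(interval.id, .keep_all = TRUE)

print(result)
```

The split separates the numeric columns from the paste() columns, which are the ones the error messages point at; whether that separation matters for the underlying bug is an assumption.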

If there's anything I can do or try on my end please let me know.

Relevant session info:

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bindrcpp_0.2    dplyr_0.7.1     purrr_0.2.2.2   readr_1.1.1
[5] tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1   tidyverse_1.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11     cellranger_1.1.0 compiler_3.4.0   plyr_1.8.4
 [5] bindr_0.1        forcats_0.2.0    tools_3.4.0      jsonlite_1.5
 [9] lubridate_1.6.0  nlme_3.1-131     gtable_0.2.0     lattice_0.20-35
[13] pkgconfig_2.0.1  rlang_0.1.1      psych_1.7.5      curl_2.7
[17] parallel_3.4.0   haven_1.1.0      xml2_1.1.1       stringr_1.2.0
[21] httr_1.2.1       hms_0.3          grid_3.4.0       glue_1.1.1
[25] R6_2.2.2         readxl_1.0.0     foreign_0.8-69   reshape2_1.4.2
[29] modelr_0.1.0     magrittr_1.5     scales_0.4.1     rvest_0.3.2
[33] assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.5
[37] lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2

Edit

FWIW, after downgrading to dplyr == 0.5.0, the above code runs fine.

Labels: bug (an unexpected problem or unintended behavior)