inner_join/left_join crashes rsession #1559

Closed
inscaven opened this issue Nov 30, 2015 · 6 comments

@inscaven inscaven commented Nov 30, 2015

I see strange behaviour from inner_join (and left_join as well) with the following simple example:

df1 <- structure(list(id = c(102, 102, 102, 121), name = c("qwer", "qwer", 
"qwer", "asdf"), k = structure(c(1L, 2L, 3L, 1L), .Label = c("one", 
"two", "total"), class = "factor"), type = structure(c(3L, 3L, 
3L, 3L), .Label = c("tot", "plan", "fact"), class = "factor"), 
    v = c(NA_real_, NA_real_, NA_real_, NA_real_), btm = c(25654.957609, 
    29375.7547216667, 55030.7123306667, 10469.3523273333), top = c(22238.368946, 
    30341.516924, 52579.88587, 9541.893144)), .Names = c("id", 
"name", "k", "type", "v", "btm", "top"), row.names = c(NA, -4L
), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(id = c(102, 102, 102, 121), name = c("qwer", "qwer", 
"qwer", "asdf"), k = structure(c(1L, 2L, 3L, 1L), .Label = c("one", 
"two", "total"), class = "factor"), type = structure(c(1L, 1L, 
1L, 1L), .Label = c("fact", "plan", "tot"), class = "factor"), 
    perc = c(0.15363485835208, -0.0318297270618471, 0.0466114830816894, 
    0.0971986553754823)), .Names = c("id", "name", "k", "type", 
"perc"), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

Joining them with inner_join, with or without an explicit by parameter, gives different results from run to run and often crashes rsession. Here are the results of 3 consecutive calls of the same command, with outputs:

> df1 %>% inner_join(df2)
Joining by: c("id", "name", "k", "type")
Source: local data frame [2 x 8]

     id  name      k  type     v      btm       top        perc
  (dbl) (chr) (fctr) (chr) (dbl)    (dbl)     (dbl)       (dbl)
1   102  qwer    two  plan    NA 29375.75 30341.517 -0.03182973
2   121  asdf    one   tot    NA 10469.35  9541.893  0.09719866
Warning message:
In inner_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector
> df1 %>% inner_join(df2)
Joining by: c("id", "name", "k", "type")
Source: local data frame [1 x 8]

     id  name      k  type     v      btm      top        perc
  (dbl) (chr) (fctr) (chr) (dbl)    (dbl)    (dbl)       (dbl)
1   102  qwer    two  plan    NA 29375.75 30341.52 -0.03182973
Warning message:
In inner_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector
> df1 %>% inner_join(df2)
Joining by: c("id", "name", "k", "type")

The third call crashed R, so there is no output.
I installed the development version of dplyr from this repository last week (Nov 25 to be exact); my R is 3.1.3.
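
For comparison, the result this join should give can be worked out in base R. The sketch below is an illustration, not part of the original report: it rebuilds only the key columns plus one value column from each data frame (`x` and `y` are stand-in names), converts the mismatched `type` factor to character, and uses `merge()` instead of dplyr. Every `type` value is the label "fact" in both inputs, stored as code 3 in one and code 1 in the other, so all four rows should match.

```r
# Stand-ins for df1 and df2: key columns plus one value column each.
x <- data.frame(
  id   = c(102, 102, 102, 121),
  name = c("qwer", "qwer", "qwer", "asdf"),
  k    = factor(c("one", "two", "total", "one"),
                levels = c("one", "two", "total")),
  type = factor(rep("fact", 4), levels = c("tot", "plan", "fact")),
  btm  = c(25654.957609, 29375.7547216667, 55030.7123306667, 10469.3523273333),
  stringsAsFactors = FALSE
)
y <- data.frame(
  id   = c(102, 102, 102, 121),
  name = c("qwer", "qwer", "qwer", "asdf"),
  k    = factor(c("one", "two", "total", "one"),
                levels = c("one", "two", "total")),
  type = factor(rep("fact", 4), levels = c("fact", "plan", "tot")),
  perc = c(0.15363485835208, -0.0318297270618471,
           0.0466114830816894, 0.0971986553754823),
  stringsAsFactors = FALSE
)

# Same labels, different underlying integer codes.
codes_x <- unique(as.integer(x$type))  # 3
codes_y <- unique(as.integer(y$type))  # 1

# Compare on labels rather than codes, then join with base R.
x$type <- as.character(x$type)
y$type <- as.character(y$type)
expected <- merge(x, y, by = c("id", "name", "k", "type"))
nrow(expected)  # 4
```

Converting factor keys to character before calling inner_join might also sidestep the crash, since it avoids joining two factors with different levels, but that is an untested assumption.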


@iangow iangow commented Dec 1, 2015

I get weird behaviour too. I'm on RStudio 0.99.688 and R 3.2.2. The first time it crashed after 3-4 inner_join runs. The next time it got further before this:

Error in inner_join_impl(x, y, by$x, by$y) : 
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'logical'
In addition: Warning message:
In inner_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector

Contributor

@kevinushey kevinushey commented Dec 1, 2015

The crash is somewhat more reproducible when using gctorture().

library(dplyr)

df1 <- structure(list(id = c(102, 102, 102, 121), name = c("qwer", "qwer", 
"qwer", "asdf"), k = structure(c(1L, 2L, 3L, 1L), .Label = c("one", 
"two", "total"), class = "factor"), type = structure(c(3L, 3L, 
3L, 3L), .Label = c("tot", "plan", "fact"), class = "factor"), 
    v = c(NA_real_, NA_real_, NA_real_, NA_real_), btm = c(25654.957609, 
    29375.7547216667, 55030.7123306667, 10469.3523273333), top = c(22238.368946, 
    30341.516924, 52579.88587, 9541.893144)), .Names = c("id", 
"name", "k", "type", "v", "btm", "top"), row.names = c(NA, -4L
), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(id = c(102, 102, 102, 121), name = c("qwer", "qwer", 
"qwer", "asdf"), k = structure(c(1L, 2L, 3L, 1L), .Label = c("one", 
"two", "total"), class = "factor"), type = structure(c(1L, 1L, 
1L, 1L), .Label = c("fact", "plan", "tot"), class = "factor"), 
    perc = c(0.15363485835208, -0.0318297270618471, 0.0466114830816894, 
    0.0971986553754823)), .Names = c("id", "name", "k", "type", 
"perc"), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

gctorture(TRUE)
df1 %>% inner_join(df2)
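
gctorture(TRUE) stays on for the rest of the session and makes everything very slow, so it can help to confine it to the one call under test. A minimal sketch, where `with_gctorture` is a hypothetical helper name and not part of R or dplyr:

```r
# Hypothetical helper: evaluate `expr` with GC torture enabled, then
# restore the previous setting even if `expr` throws an error.
with_gctorture <- function(expr) {
  old <- gctorture(TRUE)             # returns the previous setting
  on.exit(gctorture(old), add = TRUE)
  expr                               # the promise is forced here, under torture
}

result <- with_gctorture(sum(1:10))
```

With dplyr loaded, the call above would be `with_gctorture(df1 %>% inner_join(df2))`.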

Some output when running with a debugger. The context:

* thread #1: tid = 0xa243, 0x0000000105e7c19d libR.dylib`SET_STRING_ELT(x=0x00007fb637922c78, i=1, v=0x0000000000000000) + 29 at memory.c:3458, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000105e7c19d libR.dylib`SET_STRING_ELT(x=0x00007fb637922c78, i=1, v=0x0000000000000000) + 29 at memory.c:3458
   3455     if(TYPEOF(x) != STRSXP)
   3456         error("%s() can only be applied to a '%s', not a '%s'",
   3457               "SET_STRING_ELT", "character vector", type2char(TYPEOF(x)));
-> 3458     if(TYPEOF(v) != CHARSXP)
   3459        error("Value of SET_STRING_ELT() must be a 'CHARSXP' not a '%s'",
   3460              type2char(TYPEOF(v)));
   3461     if (i < 0 || i >= XLENGTH(x))

Backtrace with potentially relevant bits:

(lldb) bt
* thread #1: tid = 0xa243, 0x0000000105e7c19d libR.dylib`SET_STRING_ELT(x=0x00007fb637922c78, i=1, v=0x0000000000000000) + 29 at memory.c:3458, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000105e7c19d libR.dylib`SET_STRING_ELT(x=0x00007fb637922c78, i=1, v=0x0000000000000000) + 29 at memory.c:3458
    frame #1: 0x0000000115c2940f dplyr.so`dplyr::JoinFactorFactorVisitor::subset(std::__1::vector<int, std::__1::allocator<int> > const&) + 159
    frame #2: 0x0000000115ba147e dplyr.so`Rcpp::DataFrame_Impl<Rcpp::PreserveStorage> subset<std::__1::vector<int, std::__1::allocator<int> > >(Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, std::__1::vector<int, std::__1::allocator<int> > const&, std::__1::vector<int, std::__1::allocator<int> > const&, Rcpp::Vector<16, Rcpp::PreserveStorage>, Rcpp::Vector<16, Rcpp::PreserveStorage>, Rcpp::Vector<16, Rcpp::PreserveStorage>) + 2814
    frame #3: 0x0000000115b87a47 dplyr.so`inner_join_impl(Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, Rcpp::Vector<16, Rcpp::PreserveStorage>, Rcpp::Vector<16, Rcpp::PreserveStorage>) + 1703
    frame #4: 0x0000000115b39f9d dplyr.so`dplyr_inner_join_impl + 205
... other Rf_eval stuff ...

Contributor

@kevinushey kevinushey commented Dec 1, 2015

Some printf debugging showed that the get_pos function was returning negative values in some instances, which I guess implies a bug, since it should be producing an offset >= 0 as used here.

cc: @romainfrancois

Member

@jennybc jennybc commented Dec 29, 2015

I just experienced this with inner_join(). The first time, the R session aborted and then RStudio crashed. In the new instance of RStudio, there was no crash but the join errored:

Error in inner_join_impl(x, y, by$x, by$y) : 
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'logical'
In addition: Warning message:
In inner_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector

(The warning about factors is expected and proper.)

Version:

 dplyr      * 0.4.3.9000 2015-11-24 Github (hadley/dplyr@4f2d7f8)

@hadley hadley added the bug label Mar 1, 2016
@hadley hadley added this to the 0.5 milestone Mar 1, 2016
Member

@hadley hadley commented Mar 1, 2016

This is for you @romainfrancois 😄

Member

@hadley hadley commented Mar 8, 2016

I found replicate(100, df1 %>% inner_join(df2)) somewhat more reliable at triggering the crash, but strangely I couldn't replicate it with this version of the data frames, which as far as I can tell are identical:

df3 <- data_frame(
  id = c(102, 102, 102, 121), 
  name = c("qwer", "qwer", "qwer", "asdf"), 
  k = factor(c("one", "two", "total", "one"), levels = c("one", "two", "total")),
  total = factor(c("tot", "tot", "tot", "tot"), levels = c("tot", "plan", "fact")),
  v = c(NA_real_, NA_real_, NA_real_, NA_real_), 
  btm = c(25654.957609, 29375.7547216667, 55030.7123306667, 10469.3523273333), 
  top = c(22238.368946, 30341.516924, 52579.88587, 9541.893144)
)
df4 <- data_frame(
  id = c(102, 102, 102, 121), 
  name = c("qwer", "qwer", "qwer", "asdf"), 
  k = factor(c("one", "two", "total", "one"), levels = c("one", "two", "total")),
  type = factor(c("fact", "fact", "fact", "fact"), levels = c("tot", "plan", "fact")), 
  perc = c(0.15363485835208, -0.0318297270618471, 0.0466114830816894, 0.0971986553754823)
)

replicate(100, df3 %>% inner_join(df4))
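
A closer look at the two constructions shows they are not quite identical, which may be why the second one does not crash. In `df3` the fourth column is named `total` (so it is not a join key) and holds "tot", and in `df4` the `type` levels are in the same order as in `df1` rather than the order used by `df2`. The base R sketch below, with stand-in vector names, checks only the level orders; that the mismatched-levels path is what triggers the crash is an assumption based on the warning and backtrace above.

```r
# `type` as stored in the original df1 and df2 (from the dput() output).
type_df1 <- factor(rep("fact", 4), levels = c("tot", "plan", "fact"))
type_df2 <- factor(rep("fact", 4), levels = c("fact", "plan", "tot"))

# `type` as rebuilt in df4.
type_df4 <- factor(rep("fact", 4), levels = c("tot", "plan", "fact"))

# Original pair: level order differs, so the join has to reconcile the factors.
same_as_df2 <- identical(levels(type_df1), levels(type_df2))  # FALSE

# Rebuilt df4: level order matches df1, so no reconciliation is needed.
same_as_df4 <- identical(levels(type_df1), levels(type_df4))  # TRUE
```

Giving `df4$type` the levels c("fact", "plan", "tot") and renaming `df3$total` to `type` with the value "fact" would bring the rebuilt frames in line with the originals.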

sicarul added a commit to sicarul/dplyr that referenced this issue May 4, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018