segfault when using melt in a dplyr operation #357

Closed
eipi10 opened this issue Mar 26, 2014 · 7 comments

@eipi10 eipi10 commented Mar 26, 2014

I'm getting a segfault when I use melt at the end of a series of chained operations in dplyr.

Here's the code I'm running:

df %.%
    filter(!is.na(ftf.flg), !is.na(m.rmd), !is.na(C1Apass), !is.na(C1Apal),
           ftf.flg=="FTF") %.%
    group_by(ftf.flg, C1Apass, C1Apal, m.rmd) %.%
    summarise(num=length(!is.na(C1Agrd.pts)),
              avgGrd=round(mean(C1Agrd.pts, na.rm=TRUE),1),
              p25Grd=round(quantile(C1Agrd.pts, probs=0.25, na.rm=TRUE),1)) %.%
    mutate(pct=round(num/sum(num)*100,2)) %.%
    melt(id.var=1:4)   

It runs fine if I stop at the end of the mutate operation. But if I include melt, I get a segfault. It also runs fine, including the melt operation, if I exclude the line in the summarise section that begins p25Grd=round(quantile....

The segfault is repeatable regardless of how many rows of the data I subset down to. See below for a small subset of the data for reproducing the error (the actual data set has tens of thousands of rows and dozens of variables), the segfault message, and the output of sessionInfo().

df = structure(list(term.desc = structure(c(18L, 15L, 17L, 16L, 15L, 
18L, 16L, 17L, 17L, 16L, 18L, 15L, 16L, 17L, 18L, 15L, 18L, 17L, 
15L, 16L, 18L, 16L, 17L, 15L, 18L, 15L, 16L, 17L, 16L, 15L, 18L, 
17L, 16L, 15L, 17L, 18L, 17L, 18L, 16L, 15L), .Label = c("Spring 2005", 
"Fall 2005", "Spring 2006", "Fall 2006", "Spring 2007", "Fall 2007", 
"Spring 2008", "Fall 2008", "Spring 2009", "Fall 2009", "Spring 2010", 
"Fall 2010", "Spring 2011", "Fall 2011", "Spring 2012", "Fall 2012", 
"Spring 2013", "Fall 2013", "Spring 2014"), class = c("ordered", 
"factor")), ftf.flg = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = c("FTF", "TRF"), class = "factor"), m.rmd = c(NA, 
NA, NA, NA, "Not Remedial", "Not Remedial", "Not Remedial", "Not Remedial", 
"Not Remedial", "Not Remedial", "Not Remedial", "Not Remedial", 
"Remedial", "Remedial", "Remedial", "Remedial", "Not Remedial", 
"Not Remedial", "Not Remedial", "Not Remedial", "Not Remedial", 
"Not Remedial", "Not Remedial", "Not Remedial", "Not Remedial", 
"Not Remedial", "Not Remedial", "Not Remedial", "Not Remedial", 
"Not Remedial", "Not Remedial", "Not Remedial", "Not Remedial", 
"Not Remedial", "Not Remedial", "Not Remedial", "Not Remedial", 
"Not Remedial", "Not Remedial", "Not Remedial"), C1Apass = c(0, 
0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 
2, 2, 2, 2, NA, NA, NA, NA, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 
0, 0), C1Apal = c(2, 2, 2, 2, 1, 1, 1, 1, NA, NA, NA, NA, 2, 
2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 2, 2, 2, 2, 
NA, NA, NA, NA, 0, 0, 0, 0), C1Agrd.pts = c(3.3, 3.3, 3.3, 3.3, 
2, 2, 2, 2, NA, NA, NA, NA, 1.7, 1.7, 1.7, 1.7, 1.7, 1.7, 1.7, 
1.7, 0, 0, 0, 0, NA, NA, NA, NA, 4, 4, 4, 4, NA, NA, NA, NA, 
4, 4, 4, 4)), .Names = c("term.desc", "ftf.flg", "m.rmd", "C1Apass", 
"C1Apal", "C1Agrd.pts"), row.names = c(4866L, 4868L, 4870L, 4876L, 
7496L, 7500L, 7501L, 7503L, 12606L, 12609L, 12610L, 12612L, 15335L, 
15337L, 15342L, 15351L, 22897L, 22899L, 22900L, 22907L, 25027L, 
25032L, 25035L, 25038L, 28737L, 28738L, 28740L, 28744L, 29280L, 
29284L, 29290L, 29296L, 41366L, 41368L, 41371L, 41378L, 42468L, 
42472L, 42473L, 42475L), class = "data.frame")

Here's the information R displays when the segfault occurs:

 *** caught segfault ***
address 0x1200000b5, cause 'memory not mapped'

Traceback:
 1: unlist(unname(data[var$measure]))
 2: melt.data.frame(`__prev`, id.var = 1:4)
 3: melt(`__prev`, id.var = 1:4)
 4: eval(expr, envir, enclos)
 5: eval(new_call, e)
 6: chain_q(list(substitute(x), substitute(y)), env = parent.frame())
 7: df %.% filter(!is.na(ftf.flg), !is.na(m.rmd), !is.na(C1Apass),     !is.na(C1Apal), ftf.flg == "FTF") %.% group_by(ftf.flg, C1Apass,     C1Apal, m.rmd) %.% summarise(num = length(!is.na(C1Agrd.pts)),     avgGrd = round(mean(C1Agrd.pts, na.rm = TRUE), 1), p25Grd = round(quantile(C1Agrd.pts,         probs = 0.25, na.rm = TRUE), 1)) %.% mutate(pct = round(num/sum(num) *     100, 2)) %.% melt(id.var = 1:4)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Here's the Session Info:

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.2.2 dplyr_0.1.3   

loaded via a namespace (and not attached):
[1] assertthat_0.1 plyr_1.8.1     Rcpp_0.11.1    stringr_0.6.2  tools_3.0.2  

@hadley hadley commented Mar 26, 2014

Can you make a reproducible example by giving us the data that melt gets? This is probably a reshape2 (or R) bug, rather than a dplyr bug.


@kevinushey kevinushey commented Mar 26, 2014

FWIW, I can replicate the segfault with reshape2_1.2.2, but not with the development version, reshape2_1.3.0.99.


@kevinushey kevinushey commented Mar 26, 2014

The structure of output with everything run up to the melt step is pretty odd:

Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of  8 variables:
 $ ftf.flg: Factor w/ 2 levels "FTF","TRF": 1 1 1 1
 $ C1Apass: num  0 0 0 2
 $ C1Apal : num  0 1 2 0
 $ m.rmd  : chr  "Not Remedial" "Not Remedial" "Remedial" "Not Remedial"
 $ num    : int  4 4 4 4
 $ avgGrd : num  4 2 1.7 0
 $ p25Grd : Named num  4 2 1.7 0
  ..- attr(*, "names")= chr "25%" ## names attribute doesn't match p25Grd
 $ pct    : num  100 100 100 100
 - attr(*, "vars")=List of 3
  ..$ : symbol ftf.flg
  ..$ : symbol C1Apass
  ..$ : symbol C1Apal
 - attr(*, "labels")='data.frame':  4 obs. of  3 variables:
  ..$ ftf.flg: Factor w/ 2 levels "FTF","TRF": 1 1 1 1
  ..$ C1Apass: num  0 0 0 2
  ..$ C1Apal : num  0 1 2 0
  ..- attr(*, "vars")=List of 3
  .. ..$ : symbol ftf.flg
  .. ..$ : symbol C1Apass
  .. ..$ : symbol C1Apal
 - attr(*, "indices")=List of 4
  ..$ : int 0
  ..$ : int 1
  ..$ : int 2
  ..$ : int 3

Here's the dput output, although I cannot load it back into an R session:

> dput(m)
structure(list(ftf.flg = structure(c(1L, 1L, 1L, 1L), .Label = c("FTF", 
"TRF"), class = "factor"), C1Apass = c(0, 0, 0, 2), C1Apal = c(0, 
1, 2, 0), m.rmd = c("Not Remedial", "Not Remedial", "Remedial", 
"Not Remedial"), num = c(4L, 4L, 4L, 4L), avgGrd = c(4, 2, 1.7, 
0), p25Grd = structure(c(4, 2, 1.7, 0), .Names = "25%"), pct = c(100, 
100, 100, 100)), .Names = c("ftf.flg", "C1Apass", "C1Apal", "m.rmd", 
"num", "avgGrd", "p25Grd", "pct"), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -4L), vars = list(ftf.flg, 
    C1Apass, C1Apal), labels = structure(list(ftf.flg = structure(c(1L, 
1L, 1L, 1L), .Label = c("FTF", "TRF"), class = "factor"), C1Apass = c(0, 
0, 0, 2), C1Apal = c(0, 1, 2, 0)), class = "data.frame", row.names = c(NA, 
-4L), .Names = c("ftf.flg", "C1Apass", "C1Apal"), vars = list(
    ftf.flg, C1Apass, C1Apal)), indices = list(0L, 1L, 2L, 3L))


@kevinushey kevinushey commented Mar 26, 2014

In fact, it's the printing of the vector element p25Grd that crashes R, because an R object whose names attribute doesn't match the length of the object is invalid. So the question is, "what happened to that vector's names?"
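For context, here is a minimal base R sketch of the invariant being violated. When names are assigned at the R level, a too-short names vector is padded with NA so that it always has the same length as the object. The variable `x` is purely illustrative; the values mirror the p25Grd column above.

```r
# Illustrative vector with the same values as the p25Grd column
x <- c(4, 2, 1.7, 0)

# Assigning a single name to a length-4 vector: R pads the names with NA
names(x) <- "25%"

# The names attribute now has the same length as the vector
print(names(x))
print(length(names(x)) == length(x))
```

A column with four values and a names attribute of length one, as in the str output above, therefore cannot come from ordinary R-level assignment, which suggests the attribute was set from compiled code.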


@kevinushey kevinushey commented Mar 26, 2014

The problem is the summarise call: quantile produces a named vector, and this interferes with the grouping. A small reproducible example:

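library(dplyr)  # dplyr 0.1.3 as in the report; %.% is its chain operator

df <- data.frame(x=c(1:3), y=letters[1:3])
df <- group_by(df, y)
m <- df %.% summarise(
  a=length(x),
  b=quantile(x, 0.5)  # quantile() returns a named vector ("50%")
)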

gives

> str(m)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  3 variables:
 $ y: Factor w/ 3 levels "a","b","c": 1 2 3
 $ a: int  1 1 1
 $ b: Named num  1 2 3
  ..- attr(*, "names")= chr "50%"
 - attr(*, "drop")= logi TRUE

If you try to access m$b, R dies as the names have been incorrectly propagated. quantile returns a named vector; I guess summarise needs to strip those names if the data frame is grouped, or else figure out how to generate a new, proper set of names.
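To illustrate the point about names, here is a small base R sketch (no dplyr involved) showing that quantile returns a named result, and one way a per-group summary can be collected into a plain, unnamed vector. The names `groups` and `b` are illustrative, and this is not a description of how dplyr implements summarise.

```r
df <- data.frame(x = 1:3, y = letters[1:3])

# quantile() labels its result with the requested probability
q <- quantile(df$x, 0.5)
print(names(q))

# Compute the per-group median by hand; names = FALSE asks quantile()
# not to attach the "50%" label, and USE.NAMES = FALSE keeps vapply()
# from adding the group labels as names
groups <- split(df$x, df$y)
b <- vapply(groups,
            function(v) quantile(v, 0.5, names = FALSE),
            numeric(1),
            USE.NAMES = FALSE)

# One value per group, with no names attribute
print(b)
print(is.null(names(b)))
```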


@eipi10 eipi10 commented Mar 26, 2014

Thanks for the detailed analysis of the cause.

As a workaround, I ran the code up through the mutate line, saving the result in a new object called df1.
I then did the following:

names(df1$p25Grd) = NULL
df1 =  melt(df1, id.var=1:4)
df1 = dcast(df1, ftf.flg + C1Apass + C1Apal ~ variable + m.rmd, value.var='value')

That worked, but is there a better approach? Are there any undesirable side effects of wiping out the name of the vector like that? Is there anything I can do within the dplyr chain to reset the name to NULL?
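On the side-effects question, a short base R check suggests that dropping the names leaves the values untouched. The sample values and the variable names `q` and `stripped` are illustrative only.

```r
# A named quantile result, as produced inside the summarise call
q <- quantile(c(3.3, 2, 1.7, 4), probs = 0.25, na.rm = TRUE)

# Option 1: remove the names after the fact, as in the workaround
stripped <- q
names(stripped) <- NULL

# Option 2: unname() returns a copy without names
print(identical(stripped, unname(q)))

# Only the label is gone; the numeric value is unchanged
print(is.null(names(stripped)))
print(round(stripped, 1))
```

Within the chain, wrapping the call as `round(unname(quantile(C1Agrd.pts, probs=0.25, na.rm=TRUE)), 1)` might avoid the problem by giving summarise nothing to propagate, but that is an assumption and has not been tested against dplyr 0.1.3.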

@romainfrancois romainfrancois self-assigned this Apr 2, 2014
@romainfrancois romainfrancois added this to the v0.2 milestone Apr 2, 2014

@romainfrancois romainfrancois commented Apr 2, 2014

Thanks. I've discarded the names attribute now.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018