I have a workflow that reads nested JSON files and binds them into one large data frame. I didn't realize that some new files (from a new workflow) produce data.frame columns, so I read them in with my standard code. Binding the rows with either purrr::map_df() or dplyr::bind_rows() crashes RStudio, whereas data.table::rbindlist() just errors out. I don't need the tidyverse binders to actually do the binding (I can filter out or munge the columns just fine), but it'd be super handy if this didn't crash RStudio / R.
RStudio crashes immediately because it tries to show the new variable in the Environment tab. R run from a terminal prompt doesn't crash until I try to look at the data frames that were created.
Some code (json gz file is attached):
library(purrr)
library(dplyr)
library(data.table)
library(jsonlite)
fils <- rep("sample.json.gz", 2)
# aborts
map_df(fils, function(f) {
stream_in(gzfile(f))
}) -> df
# aborts
map(fils, function(f) {
stream_in(gzfile(f))
}) %>% bind_rows() -> df
# errors
map(fils, function(f) {
stream_in(gzfile(f))
}) %>% rbindlist(fill=TRUE) -> df
sample.json.gz
The error upon attempting to view is:
*** caught segfault ***
address 0x91000013, cause 'memory not mapped'
Traceback:
1: lapply(X = x, FUN = function(xx, ...) format.default(unlist(xx), ...), trim = trim, digits = digits, nsmall = nsmall, justify = justify, width = width, na.encode = na.encode, scientific = scientific, big.mark = big.mark, big.interval = big.interval, small.mark = small.mark, small.interval = small.interval, decimal.mark = decimal.mark, zero.print = zero.print, drop0trailing = drop0trailing, ...)
2: format.default(x[[i]], ..., justify = justify)
3: format(x[[i]], ..., justify = justify)
4: format.data.frame(x, digits = digits, na.encode = FALSE)
5: as.matrix(format.data.frame(x, digits = digits, na.encode = FALSE))
6: print.data.frame(x)
7: function (x, ...) UseMethod("print")(x)
The data.table::rbindlist() error is:
Error in rbindlist(., fill = TRUE) :
Column 16 of item 1 is length 6, inconsistent with first column of that item which is length 10. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table
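A possible reading of that message, offered as an assumption and not confirmed against the data.table source: length() of a data frame is its number of columns, so a nested data.frame column with 6 sub-columns would report length 6 next to ordinary columns of length 10 (the row count). A toy illustration with made-up data (the names `inner`, `outer`, and `lens` are only for this sketch):

```r
# Toy data: one ordinary column plus one nested data.frame column.
inner <- data.frame(a = 1:10, b = 1:10, c = 1:10)
outer <- data.frame(id = 1:10)
outer$nested <- inner   # `nested` is now a single data.frame column

# Per-column length as seen by list-style code:
# `id` reports its 10 rows, `nested` reports its 3 sub-columns.
lens <- vapply(outer, length, integer(1))
print(lens)
```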
OS X 11.3.5, R 3.3.1, RStudio 0.99.1251, dplyr 0.5.0, tibble 1.1, purrr 0.2.2, data.table 1.9.6
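For reference, a minimal sketch of the kind of column munging meant above, assuming the nested columns can simply be flattened with jsonlite::flatten() before binding. It uses a small inline JSON string as a stand-in for the attached file, `df_cols` is a hypothetical helper and not part of any package, and I have not confirmed that this avoids the crash with sample.json.gz on the versions listed.

```r
library(jsonlite)

# Hypothetical helper: names of the columns that are themselves data frames.
df_cols <- function(x) names(x)[vapply(x, is.data.frame, logical(1))]

# Toy stand-in for one file's worth of nested JSON records.
json <- '[{"id":1,"geo":{"lat":1.5,"lon":2.5}},
          {"id":2,"geo":{"lat":3.5,"lon":4.5}}]'
nested <- fromJSON(json)

# Which columns would need munging before binding?
print(df_cols(nested))

# Flatten nested data.frame columns into ordinary columns
# (geo becomes geo.lat and geo.lon), then bind as usual.
flat <- jsonlite::flatten(nested)
combined <- dplyr::bind_rows(list(flat, flat))
print(combined)
```

Calling jsonlite::flatten() with the namespace prefix matters here because purrr also exports a flatten() that does something different.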