RStudio (& R) crash when attempting to bind "complex" JSON files #2015

Closed
brudis-r7 opened this issue Jul 13, 2016 · 15 comments


@brudis-r7 brudis-r7 commented Jul 13, 2016

I have a workflow that requires reading in nested JSON files and binding them into a large data frame. I didn't realize that some new files (from a new workflow) were producing data.frame columns, so I used my standard code to read them in. Attempting to bind the rows with either purrr::map_df() or dplyr::bind_rows() crashes RStudio, whereas data.table::rbindlist() just errors out. I don't need the tidyverse binders to actually do the binding (I can filter out or munge the columns just fine), but it'd be super handy if this didn't crash RStudio / R.

Now, it crashes RStudio immediately since RStudio tries to incorporate the new variable into the Environment tab. R (i.e. from a terminal prompt) doesn't crash until I try to look at the data frames that were created.

Some code (json gz file is attached):

library(purrr)
library(dplyr)
library(data.table)
library(jsonlite)

fils <- rep("sample.json.gz", 2)

# aborts
map_df(fils, function(f) {
  stream_in(gzfile(f))
}) -> df

# aborts
map(fils, function(f) {
  stream_in(gzfile(f))
}) %>% bind_rows() -> df

# errors
map(fils, function(f) {
  stream_in(gzfile(f))
}) %>% rbindlist(fill=TRUE) -> df

sample.json.gz

The error upon attempting to view is:

*** caught segfault ***
address 0x91000013, cause 'memory not mapped'

Traceback:
1: lapply(X = x, FUN = function(xx, ...) format.default(unlist(xx), ...), trim = trim, digits = digits, nsmall = nsmall, justify = justify, width = width, na.encode = na.encode, scientific = scientific, big.mark = big.mark, big.interval = big.interval, small.mark = small.mark, small.interval = small.interval, decimal.mark = decimal.mark, zero.print = zero.print, drop0trailing = drop0trailing, ...)
2: format.default(x[[i]], ..., justify = justify)
3: format(x[[i]], ..., justify = justify)
4: format.data.frame(x, digits = digits, na.encode = FALSE)
5: as.matrix(format.data.frame(x, digits = digits, na.encode = FALSE))
6: print.data.frame(x)
7: function (x, ...) UseMethod("print")(x)

The data.table::rbindlist() error is:

Error in rbindlist(., fill = TRUE) :
Column 16 of item 1 is length 6, inconsistent with first column of that item which is length 10. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table

OS X 11.3.5, R 3.3.1, RStudio 0.99.1251, dplyr 0.5.0, tibble 1.1, purrr 0.2.2, data.table 1.9.6

Member

@hadley hadley commented Jul 19, 2016

Any chance you're using 32 bit? If so, try the preview edition of RStudio


@mdavy86 mdavy86 commented Jul 22, 2016

@hadley, here is another RStudio IDE crash using dplyr with jsonlite. It occurs on Windows only; it works fine on OS X and Linux.

A reproducible example:

library(dplyr)
library(jsonlite)

## Create dataset
ex <- data.frame(num=1:2, metadata = c('{"Time":1,"demand":8.3}', '{"Time":2,"demand":10.3}'), stringsAsFactors = FALSE)

## Test fromJSON parsing
metadata <-ex$metadata
jsonlite::fromJSON(paste0("[",paste0(metadata,collapse=","),"]"))

## Use dplyr to fetch column from JSON
ex %>% 
  mutate(demand=fromJSON(paste0("[",paste0(metadata,collapse=","),"]"))$demand)

The error looks like this:

rstudio_crash

This fails for me with the following Windows RStudio setup:

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252    LC_MONETARY=English_New Zealand.1252
[4] LC_NUMERIC=C                         LC_TIME=English_New Zealand.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] jsonlite_0.9.22 dplyr_0.5.0    

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.1.2       assertthat_0.1 DBI_0.4-1      tools_3.3.0    tibble_1.0     Rcpp_0.12.5 

If 64-bit R is selected in RStudio, the example above works fine.

> ## Create dataset
> ex <- data.frame(num=1:2, metadata = c('{"Time":1,"demand":8.3}', '{"Time":2,"demand":10.3}'), stringsAsFactors = FALSE)
> 
> ## Test fromJSON parsing
> metadata <-ex$metadata
> jsonlite::fromJSON(paste0("[",paste0(metadata,collapse=","),"]"))
  Time demand
1    1    8.3
2    2   10.3
> 
> ## Use dplyr to fetch column from JSON
> ex %>% 
+   mutate(demand=fromJSON(paste0("[",paste0(metadata,collapse=","),"]"))$demand)
  num                 metadata demand
1   1  {"Time":1,"demand":8.3}    8.3
2   2 {"Time":2,"demand":10.3}   10.3
> version
               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          3.0                         
year           2016                        
month          05                          
day            03                          
svn rev        70573                       
language       R                           
version.string R version 3.3.0 (2016-05-03)
nickname       Supposedly Educational      
> 

Contributor

@kevinushey kevinushey commented Jul 22, 2016

I can't reproduce your crash using the preview release of RStudio (v0.99.1265).

> sessionInfo()
R version 3.3.1 Patched (2016-06-27 r70840)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] jsonlite_0.9.21 dplyr_0.5.0    

loaded via a namespace (and not attached):
[1] lazyeval_0.2.0 magrittr_1.5   R6_2.1.2       assertthat_0.1 DBI_0.4-1     
[6] tools_3.3.1    tibble_1.0     Rcpp_0.12.5 

Author

@brudis-r7 brudis-r7 commented Jul 22, 2016

I'm going to run it again now on the latest preview

Author

@brudis-r7 brudis-r7 commented Jul 22, 2016

No go:

image

$mode
[1] "desktop"

$version
[1] ‘0.99.1266’
Session info -------------------------------------------------------------------------------------------
 setting  value                                      
 version  R version 3.3.1 Patched (2016-06-22 r70818)
 system   x86_64, darwin13.4.0                       
 ui       RStudio (0.99.1266)                        
 language (EN)                                       
 collate  en_US.UTF-8                                
 tz       America/New_York                           
 date     2016-07-22                                 

Packages -----------------------------------------------------------------------------------------------
 package    * version     date       source                          
 assertthat   0.1         2013-12-06 CRAN (R 3.3.0)                  
 chron        2.3-47      2015-06-24 CRAN (R 3.3.0)                  
 data.table * 1.9.6       2015-09-19 CRAN (R 3.3.0)                  
 DBI          0.4-1       2016-05-08 CRAN (R 3.3.0)                  
 devtools   * 1.12.0.9000 2016-07-20 Github (hadley/devtools@5cd2d80)
 digest       0.6.9       2016-01-08 CRAN (R 3.3.0)                  
 dplyr      * 0.5.0.9000  2016-07-22 Github (hadley/dplyr@8b28b0b)   
 jsonlite   * 1.0         2016-07-01 CRAN (R 3.3.0)                  
 magrittr     1.5         2014-11-22 CRAN (R 3.3.0)                  
 memoise      1.0.0       2016-01-29 CRAN (R 3.3.0)                  
 purrr      * 0.2.2       2016-06-18 CRAN (R 3.3.0)                  
 R6           2.1.2       2016-01-26 CRAN (R 3.3.0)                  
 Rcpp         0.12.6      2016-07-19 cran (@0.12.6)                  
 rstudioapi   0.6         2016-06-27 CRAN (R 3.3.0)                  
 tibble       1.1         2016-07-04 CRAN (R 3.3.0)                  
 withr        1.0.2       2016-06-20 CRAN (R 3.3.0)                  

Contributor

@kevinushey kevinushey commented Jul 22, 2016

Sorry @brudis-r7, my response was to @mdavy86's post (unfortunately his issue is related to a bug with RStudio + R 32bit 3.3.x, which appears separate from yours, and that one is fixed in the preview release).

I'm guessing that yours is indeed a bug in dplyr. I'll see if I can reproduce / figure out where the crash is occurring.

Contributor

@kevinushey kevinushey commented Jul 23, 2016

The over-arching issue -- jsonlite is generating an invalid data.frame, with columns not having equal length.

> file <- "sample.json.gz"
> contents <- stream_in(gzfile(file))
opening gzfile input connection.
 Imported 10 records. Simplifying into dataframe...
closing gzfile input connection.
> class(contents)
[1] "data.frame"
> unique(unlist(lapply(contents, length)))
[1] 10  6  3

cc: @jeroenooms; it looks like jsonlite is over-aggressively simplifying the read-in data? (assuming it can be represented as a data.frame even though columns are of different lengths)


@jeroen jeroen commented Jul 23, 2016

@kevinushey I don't think jsonlite is to blame here. Each column in the data frame has 10 rows. However, for the columns that are nested data frames, length() returns the number of columns rather than the number of rows:

> length(contents$http.headers)
[1] 6
> nrow(contents$http.headers) 
[1] 10
> length(contents$hb.tls_cipher)
[1] 3
> nrow(contents$hb.tls_cipher)
[1] 10

You can flatten the data frame to get rid of the nested structure:

> contents2 <- jsonlite::flatten(contents)
> unique(unlist(lapply(contents2, length)))
[1] 10
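
A minimal sketch of that workaround, assuming a small inline JSON string (hypothetical records standing in for sample.json.gz, which is not reproduced here). It flattens each nested data frame before binding, so bind_rows() only ever sees ordinary columns:

```r
library(jsonlite)
library(dplyr)

# Hypothetical records; each has a nested "http" object, which
# fromJSON() simplifies into a data.frame column.
json <- '[{"id": 1, "http": {"status": 200, "server": "nginx"}},
          {"id": 2, "http": {"status": 404, "server": "apache"}}]'

nested <- fromJSON(json)
is.data.frame(nested$http)  # the "http" column is itself a data frame

# Expand nested data frame columns into top-level columns
# named <parent>.<child>.
flat <- jsonlite::flatten(nested)
names(flat)

# With only atomic columns left, binding rows is unproblematic.
combined <- bind_rows(flat, flat)
nrow(combined)
```

The explicit jsonlite:: prefix matters when purrr is also attached, since purrr exports its own flatten() with a different meaning.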

Contributor

@kevinushey kevinushey commented Jul 23, 2016

An MRE:

library(dplyr)
df <- list(
  x = 1:10,
  y = data.frame(a = 1:10, y = 1:10)
)

class(df) <- "data.frame"
attr(df, "row.names") <- 1:10
df2 <- dplyr::bind_rows(df, df)
str(df2)

Although, I still think that this is an invalid data.frame -- isn't the restriction that the length of each column must be the same across columns? Wouldn't this be true for nested data.frames as well?


@jeroen jeroen commented Jul 23, 2016

"Isn't the restriction that the length of each column must be the same across columns?"

The R language specification states:

A data frame is a list of vectors, factors, and/or matrices all having the same length (number
of rows in the case of matrices).

Unfortunately the spec is ambiguous for data frames, but all data frame methods are implemented assuming that, for higher-dimensional objects, the _length of the first dimension_ must be equal. For an atomic vector this is its length; for a matrix or data frame it is the number of rows.

For example (to take your MRE) R will only let you assign a data frame to a column if the number of rows match, irrespective of the length (number of columns) of the data frames:

library(dplyr)
df <- data.frame(x = 1:10)
df$y <- data.frame(a = 1:10, y = 1:10)

This is actually a really powerful feature once you embrace it. Unfortunately @hadley disagrees :)
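
To make the "first dimension" rule concrete, here is a small base R sketch. The variable names (`lens`, `rows`, `bad`) are illustrative only. It compares length() with NROW() per column, and then shows that assigning a data frame with a mismatched number of rows is rejected:

```r
df <- data.frame(x = 1:10)
df$y <- data.frame(a = 1:10, b = 1:10)

# length() counts list elements, so the nested data frame reports
# its number of columns.
lens <- vapply(df, length, integer(1))

# NROW() uses the first dimension when there is one, and falls back
# to length() for plain vectors, so every column agrees.
rows <- vapply(df, NROW, integer(1))

lens
rows

# A nested data frame with a different number of rows is refused.
bad <- tryCatch(
  {
    df$z <- data.frame(a = 1:3)
    "assigned"
  },
  error = function(e) e
)
inherits(bad, "error")
```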

Contributor

@kevinushey kevinushey commented Jul 23, 2016

Interesting, thanks @jeroenooms -- I had always assumed that most methods would treat data.frames as lists (since that's what they are under the hood).

It looks like this is ultimately something that should be handled in dplyr, then -- bind_rows should either learn to accept data.frames that contain data.frames, or bail early, rather than generating an invalid R object?


@jeroen jeroen commented Jul 23, 2016

@kevinushey dplyr already checks for that in most places:

df <- data.frame(x = 1:10)
df$y <- data.frame(a = 1:10, y = 1:10)
dplyr::transmute(df, z = y)
# Error: Each variable must be a 1d atomic vector or list.
# Problem variables: 'y'

I guess somehow bind_rows forgot to check.
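
Until bind_rows performs that check itself, a user-side guard along these lines could bail early. This is only a sketch: `nested_cols()` and `stop_if_nested()` are hypothetical helpers written for illustration, not part of dplyr or jsonlite.

```r
library(dplyr)

# Hypothetical helper: names of columns that carry a dim attribute,
# i.e. nested data frames or matrices.
nested_cols <- function(df) {
  is_nested <- vapply(df, function(col) length(dim(col)) > 0L, logical(1))
  names(df)[is_nested]
}

# Hypothetical guard: raise an ordinary R error instead of passing a
# nested data frame on to the binding code.
stop_if_nested <- function(df) {
  bad <- nested_cols(df)
  if (length(bad) > 0L) {
    stop("Nested columns: ", paste(bad, collapse = ", "), call. = FALSE)
  }
  invisible(df)
}

df <- data.frame(x = 1:10)
df$y <- data.frame(a = 1:10, b = 1:10)
nested_cols(df)

# Flat data frames pass through the guard and bind as usual.
flat <- data.frame(x = 1:10, z = letters[1:10])
combined <- bind_rows(stop_if_nested(flat), stop_if_nested(flat))
nrow(combined)
```

Calling stop_if_nested(df) on the nested example raises a regular error naming column y, which is recoverable, unlike the segfault reported above.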


@mdavy86 mdavy86 commented Jul 23, 2016

@kevinushey, I probably should have made a separate issue for my example even though it is related to using dplyr/jsonlite with RStudio. I confirmed the example works fine with R-3.3.0 and RStudio 0.99.1266.

Contributor

@kevinushey kevinushey commented Jul 24, 2016

Thanks for reporting @mdavy86 -- can definitely understand why you might've expected both of these issues to have the same root cause.

Member

@krlmlr krlmlr commented Nov 7, 2016

Confirmed SIGSEGV with @kevinushey's example in #2015 (comment).


@krlmlr krlmlr self-assigned this Feb 10, 2017
@krlmlr krlmlr added this to the data frame 1 milestone Feb 10, 2017
@krlmlr krlmlr closed this in #2446 Feb 20, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018