Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C Stack Overlfow in from_json when parsing large file #61

Closed
SymbolixAU opened this issue Dec 10, 2019 · 17 comments
Closed

C Stack Overlfow in from_json when parsing large file #61

SymbolixAU opened this issue Dec 10, 2019 · 17 comments

Comments

@SymbolixAU
Copy link
Owner

@SymbolixAU SymbolixAU commented Dec 10, 2019

For example, my google timeline history.


This is not a FileReadStream or d.Parse() errror as this works

ls <- readLines(json)
js <- paste0(ls, collapse = "")
jsonify::validate_json( js )

where json is my google timeline history

So the error is in the parsing-to-R phase...

... into the rabbit hole we go...

@dcooley dcooley changed the title from_json crash when reading large file from disk from_json crash when parsing large file Jan 31, 2020
@dcooley dcooley changed the title from_json crash when parsing large file C Stack Overlfow in from_json when parsing large file Feb 2, 2020
dcooley added a commit that referenced this issue Feb 4, 2020
@knapply
Copy link
Contributor

@knapply knapply commented May 24, 2020

I cut a large-ish file up into smaller pieces to see if I could narrow it down to a certain recursion level, but no luck yet.

@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 26, 2020

For what it's worth, geojsonsf (which uses rapidjson via jsonify) has no issue reading & parsing it. So the issue is definitely in my from_json() or simplify logic

library(sf)
sf <- geojsonsf::geojson_sf( "~/Documents/github/sf-city-lots-json/citylots.json")
sf

Screen Shot 2020-05-26 at 8 27 23 pm

@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 26, 2020

Note for me:

I've left branch issue68 in a 'crash state'. It's during the multiple recursions into json_to_sexp() that cause the crash.

Am I making too many nested lists? Am I going round more than R_xlen_t times??

@knapply
Copy link
Contributor

@knapply knapply commented May 26, 2020

My guess is that it's not exiting a level of recursion somewhere in json_to_sexp(). It doesn't look like it's in the simplify logic though.

json <- paste0(
   readLines("https://github.com/zemirco/sf-city-lots-json/raw/master/citylots.json"),
   collapse = ""
)

test <- jsonify::from_json(json, simplify = FALSE)
Error: segfault from C stack overflow

test <- jsonify::from_json(json, simplify = TRUE)
Error: segfault from C stack overflow
@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 26, 2020

testing recursion depth

library(Rcpp)

cppFunction(
  code = '
  void recurse(unsigned int i) {
      Rcpp::Rcout << i << std::endl;;
      recurse(i+1);
    }
')

recurse(0)
...
165736
# Error: C stack usage  7969204 is too close to the limit
dcooley added a commit that referenced this issue May 27, 2020
dcooley added a commit that referenced this issue May 27, 2020
dcooley added a commit that referenced this issue May 27, 2020
@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 27, 2020

@knapply following your pars_json example, I now have a non-crashing version (and it simplifies too)

rl <- readLines("~/Documents/github/sf-city-lots-json/citylots.json")
js <- paste(rl, collapse = "")

system.time({
  res <- jsonify:::rcpp_from_json2( js, TRUE, TRUE )
})
  user  system elapsed 
 13.171   0.469  13.675 

head( res$features[[1]][[2]] )
  MAPBLKLOT  BLKLOT BLOCK_NUM LOT_NUM FROM_ST TO_ST    STREET ST_TYPE ODD_EVEN
1   0001001 0001001      0001     001       0     0   UNKNOWN    <NA>        E
2   0002001 0002001      0002     001       0     0   UNKNOWN    <NA>        E
3   0004002 0004002      0004     002       0     0   UNKNOWN    <NA>        E
4   0005001 0005001      0005     001     206   286 JEFFERSON      ST        E
5   0006001 0006001      0006     001     350   366 JEFFERSON      ST        E
6   0007001 0007001      0007     001    2936  2936      HYDE      ST        E
@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 27, 2020

on branch issue61, all tests pass, and I believe the segfault is fixed

rl <- readLines("~/Documents/github/sf-city-lots-json/citylots.json")
js <- paste(rl, collapse = "")

system.time({
  res <- from_json( js )
})
# user  system elapsed 
# 11.628   0.340  11.981

system.time({
  res <- from_json( js, simplify = FALSE )
})
# user  system elapsed 
# 5.563   0.277   5.846

@knapply I'm going to send to CRAN in a day or so to get this segfault issue resolved. But I sitll think some sort of standardisation between the two libraries is worth discussing.

Also, how should I add you to the DESCRIPTION field?

@knapply
Copy link
Contributor

@knapply knapply commented May 28, 2020

I didn't realize you were using this here, but...

person("Brendan", "Knapp", 
       email = "brendan.g.knapp@gmail.com", 
       role = c("ctb"),
       comment = c(ORCID = "0000-0003-3284-4972"))

... is what I use elsewhere. Thanks.

dcooley added a commit that referenced this issue May 28, 2020
@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 28, 2020

merged into master

@knapply knapply mentioned this issue May 28, 2020
@dcooley dcooley closed this May 30, 2020
@knapply
Copy link
Contributor

@knapply knapply commented May 31, 2020

I hate to bring bad news 😬😬😬 ... but it still comes up in some (probably extreme) cases:

fail <- jsonify::from_json( # simplifying fails
   jsonify::to_json(matrix("yo", nrow = 1e3, ncol = 1e3))
)
#> Error: segfault from C stack overflow

fail <- jsonify::from_json(
   jsonify::to_json(replicate(1e5, letters, simplify = FALSE))
 )
#> Error: segfault from C stack overflow

success <- jsonify::from_json( # non-simplify
  jsonify::to_json(matrix("yo", nrow = 1e3, ncol = 1e3)),
  simplify = FALSE
)

success <- jsonify::from_json( # non-simplify
  jsonify::to_json(replicate(1e5, letters)),
  simplify = FALSE
)

success <- jsonify::from_json( # less rows
  jsonify::to_json(matrix("yo", nrow = 1e2, ncol = 1e3))
)

success <- jsonify::from_json( # less rows
  jsonify::to_json(replicate(1e4, letters, simplify = FALSE))
)

I don't think I would've come across this if I weren't benchmarking things and I haven't been able to recreate it with "real" data.

@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 31, 2020

ha - no worries. But it seems to now only be in the "simplify" logic, rather than the "parse" logic now, which is a step in the right direction.

Looking at it it's probably in my recursive simplify_dataframe() logic.

@dcooley dcooley reopened this May 31, 2020
@knapply
Copy link
Contributor

@knapply knapply commented May 31, 2020

It is working on files it wasn't before, so there's definitely progress!

These "less rows" versions from the example above...

success <- jsonify::from_json( # less rows
  jsonify::to_json(matrix("yo", nrow = 1e2, ncol = 1e3))
)

success <- jsonify::from_json( # less rows
  jsonify::to_json(replicate(1e4, letters, simplify = FALSE))
)

... both return matrices...

class(
  jsonify::from_json(
    jsonify::to_json(matrix("yo", nrow = 1e2, ncol = 1e3))
  )
)
#> [1] "matrix" "array"

class(
  jsonify::from_json(
    jsonify::to_json(replicate(1e4, letters, simplify = FALSE))
  )
)
#> [1] "matrix" "array"

I'm not sure if things run through simplify_dataframe() first though.

@dcooley
Copy link
Collaborator

@dcooley dcooley commented May 31, 2020

I'll dive in and take a closer look at some point today. Thanks for finding this

@dcooley
Copy link
Collaborator

@dcooley dcooley commented Jun 1, 2020

so, some investigation has led me to

array_to_vector() {

        case rapidjson::kStringType: {
          out[i] = "test";             // <--- this is OK with large i
          //out[i] = Rcpp::String( child.GetString() );
          update_rtype< STRSXP >( r_type );
          break;
        }

array_to_vector() {

        case rapidjson::kStringType: {
          //out[i] = "test";             
          out[i] = Rcpp::String( child.GetString() );    // <--- this crashes with large i
          update_rtype< STRSXP >( r_type );
          break;
        }

@knapply does anything spring to mind to indicate why Rcpp::String( child.GetString() ) would fail after many iterations?

@dcooley
Copy link
Collaborator

@dcooley dcooley commented Jun 1, 2020

Pre-allocating does it

case rapidjson::kStringType: {
          //out[i] = "test";
          size_t sl = child.GetStringLength();
          std::string s = std::string( child.GetString(), sl);
          out[i] = s;
          update_rtype< STRSXP >( r_type );
          break;
        }
m <- matrix("yo", nrow = 1e3, ncol = 1e3)
js <- jsonify::to_json(m)
fail <- jsonify::from_json( js )
str( fail )
 chr [1:1000, 1:1000] "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" "yo" ..
SymbolixAU added a commit that referenced this issue Jun 1, 2020
@knapply
Copy link
Contributor

@knapply knapply commented Jun 5, 2020

Nice! Sorry for the delay, but finger-crossed!

@dcooley
Copy link
Collaborator

@dcooley dcooley commented Jun 5, 2020

no worries. It's on CRAN now too.

So I'll close this until you find another issue :)

@dcooley dcooley closed this Jun 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.