Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libgumbo should allow overriding the default memory functions #2490

Closed
flavorjones opened this issue Apr 2, 2022 · 8 comments
Closed

libgumbo should allow overriding the default memory functions #2490

flavorjones opened this issue Apr 2, 2022 · 8 comments
Labels
state/will-close topic/gumbo Gumbo HTML5 parser topic/memory Segfaults, memory leaks, valgrind testing, etc.

Comments

@flavorjones
Copy link
Member

Today, gumbo-parser/src/util.c uses malloc, realloc, and free directly.

It would be a nice feature to be able to override those memory functions the way libxml2's xmlMemSetup allows, so that we can make it use ruby_xmalloc and friends if we choose to.

See #2480, #2481, #2489 for context.

@flavorjones flavorjones added topic/memory Segfaults, memory leaks, valgrind testing, etc. topic/gumbo Gumbo HTML5 parser labels Apr 2, 2022
@flavorjones
Copy link
Member Author

@stevecheckoway commented at #2489 (comment) saying he's hesitant to do this.

I may try to set up memory function configuration anyway (in a draft PR), just to see what it looks like.

@stevecheckoway
Copy link
Contributor

I honestly cannot recall if @craigbarnes or I removed that support from Gumbo. As I recall, it lead to passing a pointer to the parser struct to other parts of the code that didn't make sense like the tokenizer.

@stevecheckoway
Copy link
Contributor

Looks like ruby_xmalloc() just calls malloc() but when malloc() returns NULL, it raises a ruby exception. That's not at all what I'd expect from a xmalloc() function. I'm not sure how to use that safely. If we've allocated a big parse tree and then allocation failure raises a ruby exception which may be caught, then we've just leaked all of that memory.

The current behavior of allocation failure inside gumbo is to just abort. A better approach would be to return failure all the way back to the call to parse, cleaning up any allocated memory on the way. But that's a moderate sized engineering effort and I'm not sure the benefit outweighs the effort unless allocation failure becomes an actual problem.

Given the raising an exception behavior of ruby_xmalloc(), we should probably audit all uses of it in Nokogiri to make sure we can't be in a situation where allocation failure + caught exception leads to memory leak.

@flavorjones
Copy link
Member Author

ruby_xmalloc has some benefits:

  • there's evidence that in some situations, using Ruby's heap helps with memory locality and so there may be performance benefits
  • ruby_xmalloc will also attempt a major GC if no memory is available which might be useful in some degenerate cases

Given the raising an exception behavior of ruby_xmalloc(), we should probably audit all uses of it in Nokogiri to make sure we can't be in a situation where allocation failure + caught exception leads to memory leak

We for sure have problems with this already, see #1610 and #2096.

In practice, an OOM condition for Ruby pretty much means the end of the process, anyway -- and so while we may be leaking memory in this situation, it's not clear whether cleaning it up would have any practical benefits in terms of recoverability. And so we end up with some long-running open issues that are lower priority. 🤷

I had a chat with @tenderlove about this a few months ago, so I'm tagging him in case he has any other comments.

@craigbarnes
Copy link
Contributor

I honestly cannot recall if @craigbarnes or I removed that support from Gumbo.

@stevecheckoway I removed it in craigbarnes/lua-gumbo@8d3d4e4. I didn't really give it much thought at the time—in the context of other projects pulling in my changes—I just removed it because lua-gumbo didn't need it and doing so seemed to improve code generation (perhaps only slightly; I don't recall the exact details).

@stevecheckoway
Copy link
Contributor

Thanks @craigbarnes! I thought it was part of the set of commits I grabbed from you, but I couldn't recall for sure.

@flavorjones, I'm not totally opposed to doing this, but I think threading an allocator through the gumbo parsing code is probably not a great approach unless we really need that flexibility. One approach that may be reasonable would be to do something like

#ifndef gumbo_alloc
# define gumbo_alloc malloc
#endif

rather than defining gumbo_alloc as a function and then add the command line flag -Dgumbo_alloc=ruby_xmalloc when compiling the extension.

Here's an example of the situation I was concerned about. That repo has a simple C extension that tries to allocate 200 MB of memory, 1 MB at a time. When run with a low ulimit -v, the NoMemoryError exception is raised and then caught, leaking memory.

I think there are a few alternative approaches we could consider, but if you're okay with just leaking memory in this unusual situation (at least for now), then that's okay with me.

@flavorjones
Copy link
Member Author

After chatting with @tenderlove about this, the primary advantage of using ruby_xmalloc would be to include the amount of memory allocated in Ruby's GC bookkeeping.

However, my understanding is that the memory allocated in libgumbo is freed once the libxml2 document is constructed (and the act of construction via methods like xmlNewNode and xmlNewDocText uses xmlStrdup to allocate new strings). If that's correct, then we should close this and avoid adding this complexity.

@flavorjones
Copy link
Member Author

Also - tenderlove confirmed for me that ruby_xmalloc doesn't raise an exception if it fails to allocate memory. It's just a wrapper around malloc with bookkeeping, and on failure will run GC and retry once.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state/will-close topic/gumbo Gumbo HTML5 parser topic/memory Segfaults, memory leaks, valgrind testing, etc.
Projects
None yet
Development

No branches or pull requests

3 participants