-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
YJIT is indirectly retaining ActiveRecord Object instances #552
Comments
Hi @intrip, Thanks for the bug report. I assume this is running on Ruby 3.3.0 ? IIRC we may have had another report of a memory leak. @XrXr @k0kubun I'm wondering if this might be happening because we root some objects in YJIT when we bake addresses in the machine code, and we now have code GC disabled, so even dead ISEQs could still have code that never gets collected? |
Might be useful if we could come up with a script that reproduces this memory leak to make it easier to debug. Maybe something that creates ActiveRecord models in a loop? |
Hi @maximecb,
Yes, it's Ruby 3.3.0. To workaround it for now I've reduced the YJIT memory to 32Mib and that mitigated greatly the issue, but it dropped a bit the ratio_in_yjit (from 98 to 95). I've also noticed that enabling GCs in YJIT mitigated it but led to worse overall performance so for now I've just reduced the overall memory. We noticed it now because it became worse recently (not sure if it's related to the Ruby upgrade though) but TBH it's been a while that we were affected by this issue, at least 90+ days.
I'll give it a try but I'm about to leave now and I'll be OFF the next week so it might take some time sorry 😅. |
No problem. I wish we could fix the issue faster :) |
@intrip do you make any use of |
Also, I think if you were to strip all the |
I suspect it's the same bug than https://bugs.ruby-lang.org/issues/19436, but with YJIT (and worse because the references accumulate instead of only retaining the last one). |
I just remembered that Aaron @tenderworks had mentioned a memory leak with YJIT 3.3 and Mastodon. This may be related. Would be good if we could get more info about the Mastodon issue. |
They've been running 3.3.0-preview3 since November #548 (comment), which had code GC disabled unlike preview2. So I'm not sure if it explains the situation. Also, Anyway, a local repro would be really helpful for identifying the cause. |
Alright, that's what I suspected: module Foo
def bar
end
end
def call_bar(obj)
# Here the call cache we'll keep a ref on the method_entry
# which then keep a ref on the singleton_class, making that
# instance immortal until the method is called again with
# another instance.
# The reference chain is IMEMO(callcache) -> IMEMO(ment) -> ICLASS -> CLASS(singleton) -> OBJECT
obj.bar
end
obj = Object.new
obj.extend(Foo)
5.times do
call_bar(obj)
end
id = obj.object_id
obj = nil
4.times { GC.start }
begin
ObjectSpace._id2ref(id)
puts "MEMORY LEAK!!!"
exit!(1)
rescue RangeError
puts "no leak"
exit!(0)
end
So @intrip is very likely using |
Following up on the discussions we had in meetings. Jean said that call caches must be weak references. @XrXr do we root objects / CMEs when compiling Ruby calls in YJIT? If so, could we make them weak references? |
We do root CMEs because we root everything we refer to from generated code. Making them weak references will take some effort, though. Even with a constant pool we can't directly make use of the current mechanism that sets the VALUE to Qundef when the referent dies, because we have a GC yield point before writing out CMEs. If the CME gets collected during the GC we end up pushing a corrupted frame. |
Would not compiling singleton instance methods be an easy workaround? Their perf already suck on the interpreter, so doubt it would impact YJIT performance noticeably. And these methods are very likely not to survive long anyway, so compiling them is a bit of a waste. |
Should work if perf impact is indeed negligible. Hard to tell without experimenting, though. I guess we need to stop compiling calls on all objects with a singleton class, because we root the singleton class itself to do the known class guard, so it doesn't really matter where the method is. Here's a modified version of your script: def call_bar(obj)
obj.itself
end
obj = Object.new
p obj.singleton_class
5.times do
call_bar(obj)
end
id = obj.object_id
obj = nil
4.times { GC.start }
begin
ObjectSpace._id2ref(id)
puts "MEMORY LEAK!!!"
exit!(1)
rescue RangeError
puts "no leak"
exit!(0)
end
So it ends up being quite a drastic change. It'd also make a lot of existing tests useless since |
Also I'm merely suggesting this has a quick fix that would be backportable to 3.3, not saying it's an ideal solution. |
Oh. So that's worse than I thought :/ |
@XrXr can you explain in more detail why it's problematic to make the singleton class refs weak references? In theory, if a singleton class is collected, that means it's dead, and it can't be called anymore, no? If the machine code gets patched and the singleton class ref is replaced by Qnil using an atomic write, that will just cause chain guards to fail, which is to be expected since the class should already be dead by that point? |
The GC can't patch the generated code because there is no write permission, so we need a constant pool. |
Maybe for YJIT, like for TruffleRuby, the easier way to deal with this would be to make the |
I think it's something that would be positive, regardless of this YJIT problem. But the compiled code would still leak |
Ah, right. Maybe Code GC or so could remove such compiled code once it notices it refers to a singleton Code GC would also be useful for cases like test suites which might create many anonymous modules/classes which don't live long but yet might be used a few times and some methods on them might end up being compiled. IIRC the JVM considers references (object constants) embedded in JITed code to all be weak, i.e., if one of these referenced objects would GC then that code would be removed. (and if a thread is currently executing that JITed code it'd deopt to the interpreter I guess, which might work because the thread might already be in safepoint for this part of GC). |
For now YJIT's code GC just drop all the compiled code, so the cost/benefit isn't great for most apps, hence why it's disabled by default in 3.3. If the code GC becomes more advanced in the future (e.g. capable of freeing only the dead code), then yeah maybe. As Alan mentioned, short/medium term a constant pool is more likely to happen. That's why I suggested just not compiling anything that deals with a singleton class (unless the attached object is Of course that's an heuristic, and there will be counter examples of singleton objects that are stable, but it should hold reasonably well. |
We do already change permissions on the generated code for movable GC objects though, no? Otherwise, if we did have a constant pool, would you foresee any challenges with respect to making the reference weak references for chain guards? |
We do, but the weak reference part of the GC is different. We get a callback for moving objects, and that allows us to flip permission for the places we write. For weak references there is no callback, and the GC can't reasonably distinguish between a weak reference in generated code versus one from somewhere else that it can write to without issues. Unless we want to flip permission on everything every sweep step, which doesn't sound great.
The problem I mentioned in #552 (comment) could be worked around. We'd also need to rework how we track CME validity to make things fully weak. |
It might be possible to implement such a callback? @XrXr You understand the situation better than I do. What is your intuition as to what is the best solution to solve this memory leak problem? |
We want to have something we could backport for 3.3, so I'll try and benchmark rejecting calls through singleton classes. In the long term we'll of course go with some sort of weak reference based solution. |
Benchmark results:
Not horrible, but we definitely take a hit. Is it acceptable? |
Yikes, that is actually a pretty significant performance impact. I would prefer to avoid it. |
Ah, there seems to be no perf hit if I amend the logic to accept class methods. Classes and modules tend to live forever anyways so this should still address the leak:
|
Yes class methods should definitely continue to be compiled. Does your patch also compile methods on main? I'm surprised to see it have an impact on benchmarks at all |
|
For receiver with a singleton class, there are multiple vectors YJIT can end up retaining the object. There is a path in jit_guard_known_klass() that bakes the receiver into the code, and the object could also be kept alive indirectly through a path starting at the CME object baked into the code. To avoid these leaks, avoid compiling calls on objects with a singleton class. See: Shopify#552 [Bug #20209]
The upstream fix has been merged and is marked for backporting. I think we can close. |
Thank you all for the fix! Limiting YJIT memory worked around it for now, and our ruby build tool doesn't support -dev yet (only rc/preview so far) so I'll wait for a ruby bump to test the fix 😅.
We use it but rarely: I only found one usage in the app codebase, but it doesn't match with the relation referenced in the heap dump. |
Interesting, there might be something else causing relations to have a singleton class. No Also did you look in your gems? In mastodon/mastodon#28013 I found that |
Oh good point. I found one more relation so now the total is 2, this one uses
I haven't found any outlier in the installed gems via |
kk. I wish there was a way to check if an object has a singleton class to debug this further. Unless I'm missing something I think a C ext is needed for that today. Would be interesting to monkey patch |
merge revision(s) 2cc7a56,b0711b1,db5d9429: [Backport #20209] YJIT: Avoid leaks by skipping objects with a singleton class For receiver with a singleton class, there are multiple vectors YJIT can end up retaining the object. There is a path in jit_guard_known_klass() that bakes the receiver into the code, and the object could also be kept alive indirectly through a path starting at the CME object baked into the code. To avoid these leaks, avoid compiling calls on objects with a singleton class. See: Shopify#552 [Bug #20209] --- yjit/bindgen/src/main.rs | 1 + yjit/src/codegen.rs | 17 +++++++++++++++++ yjit/src/cruby_bindings.inc.rs | 1 + yjit/src/stats.rs | 2 ++ 4 files changed, 21 insertions(+) YJIT: Fix tailcall and JIT entry eating up FINISH frames (#9729) Suppose YJIT runs a rb_vm_opt_send_without_block() fallback and the control frame stack looks like: ``` will_tailcall_bar [FINISH] caller_that_used_fallback ``` will_tailcall_bar() runs in the interpreter and sets up a tailcall. Right before JIT_EXEC() in the `send` instruction, the stack will look like: ``` bar [FINISH] caller_that_used_fallback ``` Previously, JIT_EXEC() ran bar() in JIT code, which caused the `FINISH` flag to return to the interpreter instead of to the JIT code running caller_that_used_fallback(), causing code to run twice and probably crash. Recent flaky failures on CI about "each stub expects a particular iseq" are probably due to leaving methods twice in `test_optimizations.rb`. Only run JIT code from the interpreter if a new frame is pushed. --- test/ruby/test_optimization.rb | 11 +++++++++++ vm_exec.h | 3 ++- 2 files changed, 13 insertions(+), 1 deletion(-) YJIT: No need to RESTORE_REG now that we reject tailcalls Thanks to Kokubun for noticing. Follow-up: b0711b1 --- vm_exec.h | 1 - 1 file changed, 1 deletion(-)
We noticed a slow and steady memory leak in our web app, to investigate it I enabled
ObjectSpace.trace_object_allocations_start
and extracted a few heap dumps, then started analysing them with sheap.I found a common pattern: A
yjit_root
node holding references to ActiveRecord methods holding references to instances of ActiveRecord models then holding references to their instance attributes. These objects attributes sums up and creates a slow and steady memory leak:This is the node data until you reach the ActiveRecord model instance:
The fact that ActiveRecord instances hold references to their data looks good to me, to me looks like the issue is that
yjit_root
keeps indirectly referencing them, which prevents the objects from being GCed hence the leak.I've tried disabling YJIT on 1 host (hey-default-app-109) and the leak was gone, you can see it from this overnight graph of memory usage in the hosts:
In these hosts we are running YJIT with the default options so just
RUBYOPT="--yjit"
Can you help me figure out what's going on here? This issue applies only to a subset of web hosts, which receives specific traffic that creates the "Entry" > "Message" > "Okra" objects.
Happy to tell you any additional info that could help, the next week I'll be OFF so if I won't answer quickly that's the reason 😅. Thanks!
The text was updated successfully, but these errors were encountered: