Investigate/profile RAM usage #2449
Comments
@bensheldon on reddit generously offered to pair with me to figure it out, and also recommended derailed_benchmarks. Ben, if you see this, I can do some initial derailed_benchmarks runs on my own, so we are a step ahead when working together... but I'm not sure how to run …
👋🏻 oops, sorry I didn't describe the entire process:
(arguably there are better technical ways to accomplish both at the same time, but I've found this has been an ok process in my consulting engagements)
OK, haven't yet figured out the best way to run …

Rails 7.1 (3afd9c5)
bundle exec derailed bundle:mem

Rails 7.0 (cc8e840)
bundle exec derailed bundle:mem
Comparison

This particular report only shows a 3 MB difference, which is no big deal. The extra 3 MB does seem to be between Rails 7.1 and 7.0 specifically... but this isn't that much, and doesn't explain our observed results yet. There are some clues in there as to which gems are taking up a lot of RAM... but mostly just that we have a lot of dependencies, some of which are bigger than one would like, but no single star of badness I think.
OK... have env set up, hackily.... When trying to run …

This app doesn't actually use activestorage, although I guess it does include the dependency (as default rails)... indeed it does not have a … I'll think on it some more, or perhaps it's time to take Ben up on the pairing offer. This sort of thing is why I haven't previously gotten further with this sort of tool; it often turns into some confusing yak shaving.
OK, our app actually doesn't intentionally load activestorage at all.... which is making me wonder why derailed is loading it, and how/if derailed perf:mem differs from an actually running app. Here's our config/application.rb:

Note how we pick and choose parts of rails, and do not include activestorage. However, perhaps some dependency we have requires activestorage, not sure. Nevertheless, simply adding a default template …
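(For readers following along: the "pick and choose" pattern mentioned above generally looks something like the sketch below. This is a generic illustration, not this app's actual config/application.rb -- the exact list of railties an app requires will differ.)

```ruby
# Generic sketch: require individual railties instead of "rails/all", so that
# frameworks like Active Storage are simply never loaded.
require_relative "boot"

require "rails"
require "active_model/railtie"
require "active_record/railtie"
require "active_job/railtie"
require "action_controller/railtie"
require "action_view/railtie"
require "action_mailer/railtie"
# require "active_storage/engine"  # intentionally omitted
# require "action_cable/engine"    # intentionally omitted
```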
OK, with config/storage.yml added.

Rails 7.1 (3afd9c5)
CUT_OFF=1 PATH_TO_HIT=/catalog bundle exec derailed exec perf:mem

Rails 7.0 (cc8e840)
CUT_OFF=1 PATH_TO_HIT=/catalog bundle exec derailed exec perf:mem
Conclusion

Still nothing obvious to me. Rails 7.1 is about 10 MB bigger... haven't totally analyzed where that is coming from, but it does not seem enough to explain the observed discrepancy (or is it?). In general, the weight seems like just a whole bunch of dependencies, a handful of which are bigger than they probably should be to some extent, but no specific egregious ones, just an accumulation?

Still doesn't totally explain why 3 puma workers with 3 threads each are touching our memory limits in a 2.5 GB heroku dyno.... that'd be like 800 megs per worker process, when this analysis is only showing a couple hundred -- maybe it really isn't memory just at "boot time" after all, but memory that gets taken under actual use very soon after boot? We tested with the search results page.

@bensheldon I made an appointment with you tomorrow at 2pm; if that time remains good, I figure you'll reach out to me somehow to tell me how we will connect? I really appreciate it, I'm looking forward to seeing what else we can figure out, or if you have any advice! So far I'm still fairly stumped about whether I've yet found anything significant or actionable, or what next steps might be.
@jrochkind looking forward to chatting today 👍🏻 I'll add a Zoom link to our calendar invite. That's a bummer we haven't found any smoke yet! When we meet we can go over it together and see if there's anything being missed. We might want to also experiment with … And if none of that bears fruit, we might want to move on to trying to get a memory dump of the running process. I'll write down some options for that below:
@bensheldon you do, I think, have … I am pretty sure: the way heroku works, there actually isn't a way to get into a running web dyno, only into new one-off created-just-for-you-just-now dynos (that do not handle web traffic).
@bensheldon oh wait, I didn't know about ps:exec, what, is this new? Anyway, clearly things I don't know about heroku.
It's new as of 2017: https://devcenter.heroku.com/changelog-items/1264
So unfortunately a pairing session with @bensheldon revealed no smoking guns, and only enduring mystery: the derailed tools suggest that an individual process should be 200-300 MB (which is fairly reasonable -- especially with anticipated "copy on write" memory savings), but we're seeing more like 400-500 MB, and are not sure if we're seeing the "copy on write" memory savings we expect (this can be confusing to understand from OS RAM reports, and I am not confident I know what I'm looking at).

Ben suggests using the rbtrace gem in production to see if we can get more insight into where the RAM is going. (rbtrace alas does not work on Mac at present, so I wasn't able to play with it in dev to get a sense of what it might tell us.)

Ben also suggests using heroku's puma stats, which we already had as a recent issue at #2446. We should get on that, and decide at our meeting today who will do it. (It does require a new gem and deploy.)

Meanwhile however, the perf:objects report reveals lots of memory allocations in: kithe, attr-json, faster-s3-url and local code -- all code I'm maintainer on, and which includes a lot of low-hanging-fruit strings we can optimize. It also reveals a lot of memory allocations from blacklight, which we aren't going to try to touch right now -- especially since we are currently stuck on blacklight 7 due to some blockers in BL 8 (see #2164). BL is also quite difficult to PR to, so if BL is a culprit, we may decide the cost/benefit does not look good for spending time on it.

I am also a bit worried that our own …
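(For reference, a hedged sketch of the usual rbtrace-style heap-dump recipe being suggested here, with illustrative file paths; it assumes the rbtrace gem is already required in the running app, and is not necessarily the exact procedure Ben had in mind.)

```ruby
# Snippet to hand to rbtrace's eval (e.g. `rbtrace -p <pid> -e '<snippet>'`):
# writes one JSON line per live heap object; tools like heapy or sheap can
# then analyze the resulting dump. The /tmp path is illustrative.
Thread.new do
  GC.start
  require "objspace"
  File.open("/tmp/heap.ndjson", "w") do |io|
    ObjectSpace.dump_all(output: io)
  end
end
```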
Comparing derailed …

Nevertheless, we'll work to improve them while we're looking at them.
OK, after doing more work... I think my efforts to reduce object allocations are actually an irrelevant red herring. But I noticed a pretty important thing in the derailed … @bensheldon I'm curious what you make of this, or how to investigate it further... I'm wondering if Rails 7.1 could actually have some kind of memory leak in AR... which we are maybe exacerbating by our use of shrine and attr_json, both of which hook into AR? And by sometimes fetching hundreds of AR objects (using both) per page? But also I'm not sure how derailed counts "retained"; it can vary from run to run... but not nearly as much variance as this 7.0 run vs 7.1 run shows....
my app under Rails 7.0
Hardly any, trivial.

My app under Rails 7.1
Pretty huge!
OK, that huge amount of retained memory from … Why would this be "retained"... or is this an instrumentation error/inexactness? I think JSON is in there because we make heavy use of postgres jsonb (and things like shrine and attr_json that de/serialize json from AR).

what?

What's going on with this? The version of ruby did not change btw.
I think that retained memory report is exactly what we're looking for 🎯 That looks pretty damning to me: I don't think anything should be retaining 100MB+. And it's nice that you can generate it locally 👍🏻 Can you try running …?

One thing about those "retained memory by location": I believe that is where the retained object originates, but it doesn't say exactly who/where is continuing to hold a reference to the object. So that would require tracing through the application. I am dubious that the problem is in Rails; more likely something outside of Rails is holding onto them (e.g. putting them into some global array somewhere, or something of that flavor).
Totally aside, please don't be distracted but... Is this line loading up everything, multiple times, to load …?

scihist_digicoll/app/components/work_file_list_show_component.rb (lines 81 to 85 in ca8a86e)
@bensheldon Hmm, good find -- I think it clearly does require memoization! It's loading one AR object for every "page" of our digitized works -- with maybe 1-3 associated records for each page. We might have anywhere from 10-700 pages -- median probably around 100. So that is kind of a lot, but the current design of our app (internal and intended UX) kind of requires it. But unless there's something I'm missing, it clearly should be memoized; not sure how we missed that, thank you! Assuming tests pass and it works, that could be a huge performance gain as well as a RAM win, and explain some mysteries! Still doesn't explain why it would differ between 7.0 and 7.1 though.
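(For illustration, the fix is the standard memoization pattern; the method and association names below are hypothetical, not the component's actual code.)

```ruby
# Hypothetical sketch: memoize the component's member fetch so repeated calls
# during a single render only hit ActiveRecord once.
def members
  @members ||= work.members
                   .includes(:leaf_representative)
                   .order(:position)
                   .to_a
end
```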
This really should have been memoized, as @bensheldon noticed in investigating #2449 -- or actually, should it be passed in from a caller that may already have this list, instead of being fetched again in the component anyway? (The biggest challenge we have with components!) Not sure how to do that, but we can memoize it, and do so here.
Thanks @bensheldon! ... the thing is, almost nothing changes between these two branches except the Rails version (including updating to new Rails config defaults), so I can't figure out what my app could possibly be doing differently to retain 100MB more in the diff.... but I guess that's the task before us....

Note also my "retained" memory counts are not exactly the same from run to run, although they are in the same ballpark. Whereas the "allocated" memory counts actually are exactly the same from run to run, I think? Is this all expected? It makes me wonder how accurate this "retained" instrumentation really is and what it's measuring.

ok, heap_diff. Try with …

I'm not really sure what to do with this.... I am glad I ran it against the Rails 7.0 version too, because it didn't actually differ in quite the ways I expected... Curious what you think of these... if the lines mentioned aren't what is actually retaining, I'm not really sure how to go about figuring out what is retaining, especially when these seem kind of all over the place...

my app on Rails 7.1
Diff
My app on Rails 7.0
diff
@bensheldon ah, and actually that particular WorkFileListShowComponent is fairly rarely used, and is not rendered in the path I have been testing. That may be partially how we didn't notice it; it's used mostly in some edge cases. AND in those edge cases it does NOT have 100-800 constituent members, but only 10-30. So not as big a deal and probably not relevant, but we'll still fix it -- it should be memoized!
New versions have optimizations to make fewer string allocations. Found in investigating #2449; in the end I don't think this is actually relevant, but it still doesn't hurt to tidy up.
Agreed! I didn't want to distract, but saw it. Thank you for generating that heap dump! That's really helpful. It looks to me like there are maybe some really-big JSON files in the database that are being parsed? Does that ring a bell? The lines seem to correspond to:
Thank you, this is super helpful, I really appreciate it. OK, there is definitely some big json in the database, yeah. These come from two places:
So yeah, there is some big json -- which will be in …

So... why would upgrading to Rails 7.1 suddenly start making this a bigger problem? I know the problem is almost never in Rails, it's in your code -- and possibly attr_json is to blame, or more likely is doing something that wasn't a perf problem before Rails 7.1 but is now, which would be really hard to solve if it's not considered a bug... But part of me is still wondering if something in Rails 7.1 is holding on to things in a statement cache that it shouldn't be, or longer than it should be, or at any rate differently than Rails 7.0 regardless of "should". I definitely agree it looks like something related to our json in ActiveRecord.

If I put …

Not sure how to verify …

update: no, prob need to merge it into my DATABASE_URL myself, so ignore below

Not sure how to understand this compared to above, or if it's different.... looks pretty similar?

Rails 7.1, prepared_statements:false (I think)
OK, I'm also realizing that before the Rails 7.1 update, our …

update: okay, ran …
Better, if the hypothesis is that there are really big chunks of SQL or something else in the statement cache
I think you want to add it as a query param on the url, e.g. …
OK, modified DATABASE_URL is …
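(One hedged way to sanity-check whether the setting took effect, from a Rails console, using the standard Rails 6.1+ configuration API; note a value supplied via a URL query param may show up as a string, which the adapter then type-casts.)

```ruby
# In a Rails console: inspect the resolved database configuration.
ActiveRecord::Base.connection_db_config.configuration_hash[:prepared_statements]
# => false (or the string "false"), if the DATABASE_URL query param was picked up;
#    nil if it was never set, in which case the adapter default (enabled) applies
```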
OK, I dunno; even though …

in console, includes:

Even though … Running heap_diff again.... I am getting confused about whether it looks better or not, or as good as the Rails 7.0 version -- I think not? I agree something in AR must be hanging on to these values... note the discussion about inserts/updates and expected large interpolation above... of course no inserts/updates are actually happening in this profile run!

Rails 7.1 heap_diff
I will leave for the day shortly -- I keep meaning to, but then you leave me a tantalizing next step. :)
I wonder if it could be the query cache? Apparently Rails 7.1 does make changes to the query cache, making it LRU... but the query cache should be cleared at the end of every request and not result in retained memory.... Looking for how to disable the query cache to test that, although apparently there is no simple way.... What other places could AR be caching returned json values?
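(As a quick experiment that needs no configuration changes, one could also rule the query cache in or out by wrapping the code path being profiled in ActiveRecord's standard uncached block -- a sketch, with a hypothetical query:)

```ruby
# Quick experiment: bypass ActiveRecord's query cache around the suspect code.
# `Work` and the query are stand-ins for whatever is actually being profiled.
ActiveRecord::Base.uncached do
  Work.limit(100).to_a
end
```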
OK, Rails 7.1 lets you limit or disable the query cache from connection properties too. Trying to disable both the query cache and the prepared statement cache with: …

That does not seem to me to have much effect on retained objects. I am not sure where to go next, or how to figure out what is hanging on to these objects, or why it differs from Rails 7.0 (the hypothesis remains: they are ActiveRecord objects with a lot of parsed JSON).

From …
And perf:heap_diff
I asked my colleague @jhawthorn for thoughts on this. Together we came up with a rough test plan using his vernier profiler:
# config/initializers/vernier.rb
# VERNIER_MODE=wall VERNIER_FILE=tmp/vernier.json rails s
if ENV['VERNIER_MODE']
require 'vernier'
Rails.application.executor.to_run do
3.times { GC.start } if ENV['VERNIER_MODE'] == 'retained'
@vernier = Vernier::Collector.new(ENV['VERNIER_MODE'].to_sym)
@vernier.start
end
Rails.application.executor.to_complete do
next unless @vernier
result = @vernier.stop
filepath = ENV.fetch('VERNIER_FILE', "tmp/vernier_#{ENV['VERNIER_MODE']}.json")
if ENV['VERNIER_MODE'] == 'retained'
File.write(filepath, result.to_gecko)
else
File.write(filepath, Vernier::Output::Firefox.new(result).output)
end
ensure
@vernier = nil
end
end
VERNIER_MODE=retained TEST_COUNT=100 PATH_TO_HIT=/good_job/jobs bin/profile derailed exec perf:mem_over_time

^^ This was the result, albeit my application isn't leaking memory (😊). For Vernier's memory profiling, the left-to-right width is the percentage of retained memory (not time). In my results, it was basically the CookieJar and that really wide string, which we think is the HTML response. This is helpful because the flamegraph shows the stacktrace (unlike that heapy dump). Unfortunately it doesn't tell us who is retaining a reference to those objects, but it hopefully should help narrow down their source.
OK, I did some more data collection before getting that recent advice for another metric from @bensheldon (thank you very much for that @bensheldon, I will try it next). It turns out it is very time-consuming to take all these measurements while keeping straight what I'm measuring and making sure I have the numbers right. But here is what I think I have discovered:
Summary of statistics in our three actions, between Rails 7.0 and 7.1
"small work" gets substantially bigger in 7.1, "large work" gets unreasonably bigger in 7.1. I can not explain why "search results" gets SMALLER retained memory in Rails 7.1, but it is same order of magnitude. These numbers are not identical from run to run, but are close, same order of magnitude.
@bensheldon OK, here are the vernier output json files. I used Firefox Profiler online to look at them. I think it's consistent with our model so far, without unfortunately telling us anything new. The memory allocated comes from the AR fetch at:

scihist_digicoll/app/components/work_image_show_component.rb (lines 46 to 48 in 70f8385)
Which is expected to be the main AR fetch for this page (the one that fetches the ~422 AR objects, one for each digitized page). So that confirms the model we had already developed of what's going on, I think... the question is why those end up getting observed as "retained" -- and especially why it differs between the same app on Rails 7.0 and 7.1!
Curious if you see anything interesting I'm not seeing in either of these, @bensheldon! I think I'm leaning towards declaring defeat on really solving the mystery.... I think the answer is a) to improve RAM performance we probably have to stop fetching 422 AR objects (with large embedded JSON) in a single Rails request, and it will remain a mystery why this became more of a problem in Rails 7.1; but also b) after looking at it and thinking about it (including looking at our puma metrics now being captured on the heroku dashboard), I think I can actually tweak our puma worker/thread figures to probably use a lot less memory without actually losing much capacity, and thus fit comfortably within the same dyno again.
I guess there's not yet the technology to give us an easy report of what is actually referencing the retained objects!
OK, here are the actual vernier graphs produced by the above script, for both the Rails 7.0 and 7.1 versions to compare. I'm still not sure the comparison tells us anything useful -- they are still both pretty similar, both showing a lot of allocations from the AR fetch in …

Confirming that, yeah, fetching 422 AR objects does allocate a lot of RAM -- still unclear why it would be retained instead of GC'd, or why the effect on app RAM consumption or retention would differ from Rails 7.0 to 7.1.

vernier_retained_rails70_cc8e84042.json

In somewhat better news... yesterday we deployed at 70f8385 a version that reduces object allocations, largely by making sure a lot of string literals are frozen, plus a couple other allocation reductions in ActiveRecord-related hotspots. One day is really too little to see a trend when, as we can see, it's so variable... but maybe this has reduced overall RAM consumption on heroku after all, maybe back to the levels we had before the Rails 7.1 deploy, where RAM mostly stays under 2GB on our single 2.5GB performance-m dyno? (But you can see in the last week there was a 12-20 hour period before where it stayed this low too, so.)

I also noticed that in my Rails 7.1 branch I had updated view_component from 3.6 to 3.7, so I did some derailed profiling with Rails 7.1 and view_component back at 3.6 just to make sure that didn't make a difference -- it didn't.

@bensheldon I really appreciate your help, I understand the tools available and how to use them much better now. I think I may be close to declaring defeat, for now, on understanding why …

Plan for adjusting puma worker/thread counts in same dyno formation

Right now we run a heroku performance-m dyno, with 2.5 GB of RAM. It also appears to have two "virtual CPUs", although this is actually maybe only one real cpu and one "hyperthread", per https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#process-count-value , which also has other heroku puma sizing advice we are looking at.

We currently run 3 puma workers with 3 threads each, so 3x3 is 9 total threads. I'm not sure how we decided on this -- the heroku docs above actually recommend 2 workers with 5 threads each (2x5 is 10 total threads). One reason to run multiple worker processes is that ruby can't do true multi-core parallelism within a single process, so you want at least as many ruby processes as CPUs. But we only have 1-2, so three worker processes are probably not necessary, and possibly even counter-productive? Changing to 2 workers with 4 threads each should be more RAM efficient, while providing about the same number of total threads (2x4 == 8), which should mean about the same actual capacity (see the puma config sketch just below this comment). We could also try heroku's recommendation of 2x5 == 10. I will try that next week; I don't want to do it at the end of the week.

Our current 7-day metrics show that dyno load mostly stays under 2 (where we want it), but occasionally has spikes up to 3 or 4, showing that in those spikes we don't actually have enough CPU resources to handle capacity. We've only been keeping heroku puma stats for 2 days, but in those two days they also show mostly only two-thread utilization (20% of capacity), up to 5 threads (55% capacity), with occasional spikes to 8-9 threads (90-100% capacity) -- suggesting similarly that we mostly do fine, but do have spikes that top off our capacity.
In the last 72 hours, we didn't actually have any timeout alerts, which we do get occasionally (a minute or two every few weeks), so these spikes probably only slow down response times to somewhere under the timeout values when queues fill up. The 100% puma capacity does line up with our dyno load max -- with no apparent RAM spike there though. There is a dyno restart there too; I think maybe the dyno actually got totally locked up and heroku restarted it? (If so, not sure why we didn't get any error alerts.) Other spikes smaller than 100% in the past two days don't seem to have corresponding load bumps or other unusual activity.

So I think we are basically handling our usual traffic, with occasional brief spikes our current resources have trouble handling (we think from bots, mostly) -- my guess is we'd need more dyno power to handle those; more workers wouldn't help even if we could fit them in RAM? I'd hope the changed worker/thread config would leave that about the same, with less RAM!
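(A sketch of the proposed puma change, assuming the conventional Heroku env vars WEB_CONCURRENCY and RAILS_MAX_THREADS; this is not the app's actual config/puma.rb.)

```ruby
# config/puma.rb (sketch): 2 workers x 4 threads instead of the current 3 x 3.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 4))
threads max_threads, max_threads

# Load the app before forking workers so copy-on-write can share memory.
preload_app!
```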
After a substantial pairing session with @jhawthorn, we're pretty sure we found a problem introduced in Rails 7.1. In brief, this change to RoutesProxy causes a new anonymous class to be created for every RoutesProxy instance, which is every request; this anonymous class references the controller, which references the models, which is pretty big. This anonymous class is fairly durable (maybe ~2k of them will be retained by the VM), so we believe it's the main source of the memory growth you saw when upgrading 🎉 (It might be particular to your app because you use Rails ….)

Rails PR: rails/rails#50298

Before patching (looks like what you're familiar with)
After patching 🎉
We were able to debug this thanks to @jhawthorn's sheap:

# $ irb
require './lib/sheap.rb'
diff = Sheap::Diff.new("/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-1.ndjson"
, "/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-2.ndjson")
# Find one of the Work records that's been retained
model = diff.after.class_named("Work").first.instances[200]
=> <OBJECT 0x117cf5c98 Work (4 refs)>
# Find the path to the (default) root
diff.after.find_path(model)
=>
[<ROOT vm (2984 refs)>,
<IMEMO 0x126c9ab68 callcache (1 refs)>,
<IMEMO 0x126c9acf8 ment (4 refs)>,
<CLASS 0x12197c080 (anonymous) (15 refs)>,
<OBJECT 0x122ddba08 (0x12197c080) (3 refs)>,
<OBJECT 0x117cfc458 WorksController (13 refs)>,
<OBJECT 0x117cf7318 WorkImageShowComponent (15 refs)>,
<OBJECT 0x117cf5c98 Work (4 refs)>]
diff.after.at("0x126c9ab68").data
=>
{"address"=>"0x126c9ab68",
"type"=>"IMEMO",
"shape_id"=>0,
"slot_size"=>40,
"imemo_type"=>"callcache",
"references"=>["0x126c9acf8"],
"file"=>"/Users/bensheldon/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/actionpack-7.1.2/lib/action_dispatch/routing/routes_proxy.rb",
"line"=>48,
"method"=>"public_send",
"generation"=>288,
"memsize"=>40,
"flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}} |
First, in a possibly deleted comment you asked:
You got the right idea of what it does basically -- it shows up in stack traces because it wraps every request in an around_filter, which is really annoying for stack traces -- but for most requests it does nothing but "yield"; it's just sort of like "middleware". I am confident that in the requests we are profiling it is doing nothing (as the requests we are profiling don't do any writes). But good call, as it does collect things to batch (though it should only do so for the duration of a request, and then release them) -- if it had a bug, perhaps it could have a leak or extra retention (although again, it didn't differ between Rails 7.0 and 7.1).

But then the meat!

OK! Wow! Thank you for finding that! I know the usual wisdom is "it's not Rails, it's you", but I had my suspicions anyway, since so little had changed in my app between 7.0 and 7.1. MOST apps don't fetch hundreds of AR objects or otherwise use as much RAM to process a single request as mine does, so the bug as you describe it wouldn't be as significant to most apps. So it is true that to some extent this app's "bad behavior" does make the bug worse. (And in general does lead to more memory consumption for this app, since it does need enough RAM to cover the RAM-hungry requests, as many as may be running concurrently.) So that probably explains why nobody else noticed the bug; it took a somewhat poorly behaved app like this one to notice it. BUT it is a bug! Wow, thank you! Or... at least a RAM regression, if not a bug. The patch is pretty confusing to me! And this probably causes a non-trivial, even if not as massive, total memory increase for most apps, I'm guessing.
I don't recall using … Since you're giving me profiles of my own app, apparently you got my app to run, and in production mode? I am gratified you found it not too hard to set it up to do that, and apologize for how hard it was (I could have given you some ENV vars instead of making you figure it out yourself!), and grateful you spent the time!

I may try patching the fix in locally instead of waiting for the Rails version... but I probably don't feel safe doing that until it's merged into Rails.

Looking at it... wow, how funky; who would have predicted that the use of …
@bensheldon in understanding the tooling and metrics, one mystery is why the vernier profiles looked so similar between the two cases, even though the derailed heap_diff and objects reports differed, and indeed there was retention happening in 7.1 that was not in 7.0. I guess the vernier profile was in fact more measuring allocations (which were the same in both cases), rather than retentions (which did differ, as the derailed metrics, it turns out, correctly indicated).
Oh, this ALSO explains why the deploy we did a couple days ago that reduced allocations (but wasn't logically expected to reduce retention) may indeed have had a measurable impact on max RAM use -- the Rails regression means that anything referenced in a controller instance variable was retained past action completion, so turning a non-trivial number of things in the models (de-serialized json via …

It all kind of fits together into a theory, which is a good sign.
(And wait, I assumed you had run our app since you showed us an "after patch" profile with ….)
Oops, I had too many tabs open and had started writing a response, then realized I was wrong and abandoned it, and then accidentally submitted it later. You can ignore that 😊
I wouldn't have caught it without @jhawthorn's help, but we were able to isolate the heap dump down to the RoutesProxy, and then looked up the blame on that and saw it changed between Rails versions. Then @jhawthorn explained: It's …
I did! It really wasn't too bad to get it running, set up an admin account, and create some "asdfadsfadsf" records. I dunno why that URL is the same, unless maybe …

This was my eventual …
I'm honestly kind of proud that you were able to get the app running without too much trouble, thank you! And I'm interested in your unintentional feedback on pain points in doing so -- you can ignore the rest of this if you want, but I'm kind of talking to myself.

That particular action we were profiling doesn't use solr at all -- although I'm not sure about the error that happened anyway (why!), or why you were able to get past it if an error did happen! Also solr ought to be easy to get going, with an automated thing that even downloads solr for you, but oh well.

AWS is only needed for storing images (we have a LOT of images of digitized historical document pages), and presumably your test data didn't include actual images, which is fine for this case, although I wonder how representative your test data was of what we were testing, or how you managed to create it! -- it maybe didn't include hundreds of page objects. I actually COULD have sent you a JSON file that you could have used to load in our actual (non-sensitive in this case) production data; we have a function for that we use for copying data from staging to dev to reproduce problems -- but that would have required you to set up some of the AWS stuff.

All in all I'm pleased you were able to get it to work; it's a good test of our bootstrap instructions and methods, which I know aren't as perfect as they could be or as easy as a docker container, but apparently are not ridiculous! Thank you!
You should be proud! It's a good app; I've seen way weirder.
It didn't, but my manually creating the records through the admin panel did require solr. I just had to sort of limp through that activity, but it took 5 minutes (and another 5 minutes to find the "publish" button in the UI). My recommendation there would be:
(maybe the latter 2 already exist and I didn't figure out how to configure that; or just rescue errors in seeds.rb so that they mostly work ok regardless of whether they actually are set up)
Get offline and enjoy your day! You deserve it ❤️
I kind of sort of understand how the instance_eval ended up preserving memory (basically in a closure)... although I'm not sure why the anonymous (singleton?) classes it created were themselves retained... But also, what this makes me think is -- if this hadn't been a regression between 7.0 and 7.1, and if I hadn't noticed (because my poorly behaved app made it more noticeable), and if you hadn't spent the time with me and then brought in @jhawthorn.... I wonder if there could be other retaining-too-much bugs/optimizations-in-waiting in Rails, maybe ones that have been there for a long time... and how feasible it would be for them ever to be found, or what tooling (sheap!) might make it feasible to find them.
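(A generic illustration of that closure mechanism, not the actual Rails RoutesProxy code: a method defined with a block on an object's singleton class closes over its surrounding scope, so anything reachable from that scope stays alive for as long as the method definition -- and any VM call caches pointing at it -- survives. All names below are made up.)

```ruby
class FakeController
  def initialize
    # Stand-in for hundreds of loaded ActiveRecord objects with big JSON blobs.
    @records = Array.new(10_000) { "x" * 500 }
  end
end

def install_helper(proxy, controller)
  # The block closes over `controller`; as long as this singleton method is
  # alive, the controller and everything it references (~5MB of strings here)
  # stays reachable, even though nothing else points at it.
  proxy.singleton_class.define_method(:fake_url_helper) do |*args|
    controller.class.name
  end
end

proxy = Object.new
install_helper(proxy, FakeController.new)

GC.start
ObjectSpace.each_object(FakeController).count # => 1, still retained via the closure
```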
There is one, you just didn't discover it! And perhaps it could have been documented better. Also, right, trying to write to solr in admin when adding records... you could have avoided that with … I guess ideally we'd doc all the env's better; although they are all centralized in the env.rb, it's messy there.
OK, some updates and plans.

R14 out of memory is back

I had hoped that the deploy on Wednesday Dec 6th (70f8385), which reduced allocations, might return our app to its historical RAM patterns of staying under 2GB. While it initially looked like it did as a trend, last night we did get some R14's again.

While it's not an emergency to get an R14 now and then, we like to eliminate them because: app perf is degraded while they happen; they are errors in our logs/monitoring services distracting us from "real" errors; and we'd rather solve the problem now, before it does become an emergency -- if we keep our app from having R14's, we'll never get to R15. So we do want to consider doing something.

Let's patch in the fix

@bensheldon's PR to Rails at rails/rails#50298 is now merged into Rails main. This makes me feel secure about "monkey-patching" it into our app until a 7.1.x Rails release happens that includes it. The change is to one method, so it is pretty monkey-patch-able. This should hopefully return our app more fully to its previous memory profile... we will see?

Change worker/thread counts: in the future

I am still optimistic about changing our worker/thread counts on our single performance-m dyno, from the current 3 workers x 3 threads … I think this should give us lower RAM usage with a pretty similar performance profile, not actually giving anything up. But in order to only change one thing at a time, to have a better idea of what had an effect, I'm going to hold off on this and start with patching the memory fix in. Filed as a separate ticket at #2465.

Extra ruby language heroku stats may be busting our logs

Our logs go into papertrail, and our plan has a certain capacity. Lately we have been exceeding our papertrail capacity on many/most days. This approximately coincides with when we turned on ruby language metrics at #2446. I'm not sure why -- haven't figured it out yet. But we probably need to reverse it -- or identify what is causing the logging increase, and have papertrail exclude those lines? Changing from our current Papertrail 250 MB/month plan to the next larger one at 550 MB/month would be an extra $35/month, which we don't really want to pay.

I'm actually having a lot of trouble figuring out what extra log lines (that would make it to papertrail) were created by the settings to turn on ruby language and puma metrics... the docs aren't clear, and I don't see anything obvious in visual inspection of logs. I may start just by turning this stuff off again, to see if our log volume goes down again.
…o its release in presumably Rails 7.1.3. See rails/rails#50298. And ref #2449
RAM usage trends already looked like they had gotten closer to Rails 7.0 levels after we significantly reduced allocations. After "monkey-patching" in rails/rails#50298 (the only deploy showing on the graph below), RAM usage looks to have returned to Rails 7.0 trends, staying reliably below 2G on our 2.5G-limit dyno.
Our app has for a while used relatively large amounts of RAM for a Rails app.
Some months ago, with 3 puma workers with 3 threads each, we were exhausting the 2.5G of RAM on a Heroku performance-m dyno -- which is pretty large usage.
A couple months ago we turned on "jemalloc" use on heroku, tracked in #2377, which looked to save us around ~20% of RAM use, and we stopped getting the heroku R14 errors.
However, with the recent migration to Rails 7.1, our RAM usage has jumped, and we've started getting R14 errors again. There is no generally-known community reason why a Rails 7.1 upgrade should increase RAM usage.
That is the heroku RAM metric dashboard. You can see there the bump in RAM usage that corresponded to the Rails 7.1 deployment.
I think before and after commit SHA's are cc8e840 (most recent Rails 7.0 deploy) and 3afd9c5 (first Rails 7.1 deploy)
We want to find out: