Skip to content

Slow performance when rendering collections of nested partials #41452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
willcosgrove opened this issue Feb 15, 2021 · 16 comments
Open

Slow performance when rendering collections of nested partials #41452

willcosgrove opened this issue Feb 15, 2021 · 16 comments

Comments

@willcosgrove
Copy link
Contributor

Steps to reproduce

  • To start, you need two partials (_a.html.erb and _b.html.erb)
  • Partial _a.html.erb should contain a render call for partial _b.html.erb
  • Render partial _a.html.erb in a collection

Expected behavior

Ideally, the time to render should be on par with the aggregate time to render both _a and _b if _b was not nested inside of _a.

Actual behavior

I see about a 10x difference in render time between a nested example and a flat example. I've done some preliminary digging with stackprof. Keep in mind that I am not well versed in the internals of ActionView rendering so my understanding of what is not working may not line up with reality. It seems like whatever optimization is done that affords such a significant speed difference between these two cases:

Fast

<%= render partial: "a", collection: 100.times.to_a %>

Slow

<% 100.times do %>
  <%= render partial: "a" %>
<% end %>

...is not able to be applied to the partial that is nested inside of the outer partial. Based on my stackprof digging it looks like some of that optimization is happening in the conversion of the partial string ("a") into an actual path to a view file. In the fast case above, that conversion of "a" into "#{Rails.root}/app/views/application/_a.html.erb" happens one time, and in the slow case it happens 100 times. The same is true of the nested partials situation: the outer partial gets its view path calculated/looked-up one time, but the inner render call for the nested partial _b gets looked-up on each iteration.

I had some back and forth on Twitter with @tenderlove about this, and his intuition from the best description I could muster in 270 characters was that this was a bug and shouldn't require a performance hit. But it is entirely possible I wasn't explaining it well enough to him over the limited bandwidth communication channel of Twitter 😄

Tweet

https://twitter.com/tenderlove/status/1351223079805083648

Reproduction Repo

https://github.com/willcosgrove/partial-perf-repro

Server log images

Flat case (fast):
Flat

Nested case (slow):
slow_1
slow_2

I should also clarify about the logs: The default view logging in development definitely adds some additional overhead to the slow case, because it logs a line of output for each iteration. But even with view logging turned off completely, there is still a significant performance difference between the two cases, so I don't believe the difference falls entirely at the feet of ActionView logging instrumentation.

Flamegraph:

Fast:
fast_flame

Slow:
slow_flame

System configuration

Rails version:
Seems to be any version, but I tested specifically on 6.1.2.1 and main.

Ruby version:
Also seems to be any version but I tested on both Ruby 2.7 and Ruby 3.0.

CC: @tenderlove

@mdworken
Copy link

@willcosgrove this is super interesting - I thought it was well-known that this behavior existed and that's why we render "with collections" instead of "in loops" but I can't seem to find an official citation anywhere! Here's a scout article though:

https://scoutapm.com/blog/performance-impact-of-using-ruby-on-rails-view-partials

@willcosgrove
Copy link
Contributor Author

@mdworken I agree. The surprising part was not that loops are slow, it's that you can apparently negate the optimization of rendering with collections if the partial you're rendering renders another partial.

It may be that it's technically intractable to solve this problem. I was compelled to write up the issue though because tenderlove indicated over Twitter that it seemed like a bug. Again, it could have just been because I was explaining the issue to him poorly. But if it is truly a bug, I wanted to document it.

@rafaelfranca
Copy link
Member

I see that the app is using slim. Does this also happen if the app was only using ERBs?

@thewatts
Copy link
Contributor

@rafaelfranca Thanks for chiming in!

I added a PR swapping slim for erb - the same issue persists:

willcosgrove/partial-perf-repro#1

@alipman88
Copy link
Contributor

The surprising part was not that loops are slow, it's that you can apparently negate the optimization of rendering with collections if the partial you're rendering renders another partial.

It may be that it's technically intractable to solve this problem.

I believe this may be the case (intractable to optimize). Consider the first example:

<%= render partial: "x", collection: 100.times.to_a %>

In this case, we know we'll be rendering _x.html.erb 100 times, and can take steps to make that rendering more efficient.

Now, consider another example:

<% 100.times do %>
  <%= render partial: rand() > 0.5 ? "x" : "y" %>
<% end %>

The above can't easily be optimized, because the argument passed to render changes from iteration to iteration.

Even if we remove the randomness, the same challenge remains - even though we (humans) can easily see the first and second cases will produce the same output, a computer won't automatically assume the same argument will be passed to render in each iteration of the loop - so render will need to be called separately, 100 times.

The same holds true for a nested partial - even when the outer partial is rendered using collection, the inner render call will still need to be performed once per iteration, same as if we used a for loop.

This isn't to say there's no room for improvement (in terms of caching, etc.), but rather an effort to understand why we might expect nested render calls to negate easy optimization gains from rendering as a collection.

@willcosgrove
Copy link
Contributor Author

@alipman88 That makes sense. Thanks for walking through that! It sounds like there's probably nothing that can be done to improve this specific problem.

Our "fix" to this problem in our app has been to move away from partials towards helper methods that build markup through the use of the tag helper. We make extensive use of these little bits of markup to encapsulate our visual theme into reusable chunks. So where before we may have had something like this:

_fuzzy_option.html.slim

- html_class ||= ""

a.fuzzy-select__item(
  class=html_class
  data-action="touchdown->fuzzy-select#selectItem mousedown->fuzzy-select#selectItem"
  data-fuzzy-select-target="item"
  data-fuzzy-search-target="item"
  data-value=value
  role="button")

  - if block
    = capture(&block)

theme_helper.rb

def fuzzy_option(**opts, &block)
  render "components/fuzzy_select/fuzzy_option", **opts.merge(block: block)
end

We now put all of that markup directly in the helper method

theme_helper.rb

def fuzzy_option(**opts, &block)
  tag.a(class: "fuzzy-select__item #{opts[:html_class]}",
    data: {
      action: "touchdown->fuzzy-select#selectItem mousedown->fuzzy-select#selectItem",
      fuzzy_select_target: "item",
      fuzzy_search_target: "item",
      value: opts[:value],
    },
    role: "button",
    &block
  )
end

We would prefer to keep our markup out of our ruby files, but this seems to be the only way to sidestep the performance penalty of calling render.

I don't suppose anyone knows if there is a way to bypass a majority of the overhead cost of render? Possibly by passing a full file path to it instead of using the "components/fuzzy_select/fuzzy_option" shorthand that requires it to traverse the view path to find the file?

And thank you to everyone who has spent time looking at this and thinking about it 🙏

@thewatts
Copy link
Contributor

a computer won't automatically assume the same argument will be passed to render in each iteration of the loop - so render will need to be called separately, 100 times.

@alipman88 totally agree!

The same is true for calling any method with differing inputs.

I think where we're wondering about optimization is: if we were to call a method directly, instead of render, the speed increase is significant. From what we understand about render calls - is that the lookup is cached - so it should, in theory, be the equivalent of calling a method, right? (I could be completely wrong in this assumption).

I've updated my PR to include a scenario with calling a method directly instead of calling render.

Here's the render speed difference:

Flat Rendering:

CleanShot 2021-02-20 at 07 59 16

Nested Partial Rendering:

CleanShot 2021-02-20 at 07 59 31

Method Rendering:

CleanShot 2021-02-20 at 07 59 42

And yes, thank you to everyone who's spent time on this! 🙏

@thewatts
Copy link
Contributor

As a side - in looking at the screenshots - calling render allocates significantly more objects, perhaps that is where the majority of the speed difference is.

@rails-bot
Copy link

rails-bot bot commented May 21, 2021

This issue has been automatically marked as stale because it has not been commented on for at least three months.
The resources of the Rails team are limited, and so we are asking for your help.
If you can still reproduce this error on the 6-1-stable branch or on main, please reply with all of the information you have about it in order to keep the issue open.
Thank you for all your contributions.

@zzak
Copy link
Member

zzak commented May 22, 2021

🦄

@lfalcao
Copy link
Contributor

lfalcao commented May 22, 2021

Not exactly related, but one interesting point here is that instrumentation increases the overhead of the render method and in many cases its not used at all. Maybe these non required notifications should all be disabled by default.

image

Benchmark the slow method with ActiveSupport::Notifications.unsubscribe 'render_partial.action_view' and without :

Comparison:
         unsubscribe:      773.2 i/s
          subscribed:      619.4 i/s - 1.25x  (± 0.00) slower

Doing just a simple render, the difference between both cases is similar. Benchmark script here: https://github.com/lfalcao/benchmarks/blob/main/views/notifications.rb

@thewatts
Copy link
Contributor

thewatts commented May 22, 2021

I added an example using View Component, just to see how it compares in general.

Here's my PR with the additions: willcosgrove/partial-perf-repro#2

Something to note: My PR increases the nesting, to where there are three layers of partials/components being used.

A page renders 10k tweets, each tweet rendering a button, each button rendering button text.
(button text is a stretch of an example - as it's literally only passing content into a partial/component and returning it, but I wanted to have another layer to emphasize speed differences).

Flat:

60ms | ~57k allocations

image


Nested Partials:

3015ms | ~1.74M allocations

image


Methods, using Tag Helpers:

1201ms | ~2.2M allocations

image


View Components:

558ms | ~487k allocations

image


Honestly, I'm quite surprised at the performance of ViewComponent here, as they are effectively just using erb partials under the hood.

Again, this example is extreme to emphasize what's going on (10k items, 3 layers of nesting), but I think it really highlights some inherent performance constraints of using partials (or even tag helpers).

albertchae added a commit to hackclub/hcb that referenced this issue Apr 18, 2022
Previously, we see many lines repeated of
```
Rendered canonical_transactions/_canonical_transaction.html.erb (Duration: 4.9ms | Allocations: 1405)
Rendered canonical_transactions/_canonical_transaction.html.erb (Duration: 4.3ms | Allocations: 1416)
Rendered canonical_transactions/_canonical_transaction.html.erb (Duration: 4.4ms | Allocations: 1416)
```

This is because the template gets reread from disk every loop iteration.
Fortunately the partials API has a way to read it once and reuse it for
the whole loop.

After:
```
Rendered collection of canonical_pending_transactions/_canonical_pending_transaction.html.erb [0 times] (Duration: 0.0ms | Allocations: 34)
Rendered collection of canonical_transactions/_canonical_transaction.html.erb [100 times] (Duration: 398.0ms | Allocations: 130029)
```

Local timing (absolute values aren't informative but relative are):
Before: 827.9 ms in views `Completed 200 OK in 1409ms (Views: 827.9ms | ActiveRecord: 135.4ms | SQL Queries: 25 (1 cached) | Allocations: 605016)`
After: 714.2 ms in views `Completed 200 OK in 1325ms (Views: 714.2ms | ActiveRecord: 138.8ms | SQL Queries: 25 (1 cached) | Allocations: 603067)`

More reading:

https://scoutapm.com/blog/performance-impact-of-using-ruby-on-rails-view-partials
rails/rails#41452
https://www.justinweiss.com/articles/how-much-time-does-rendering-a-partial-really-take/
@casperisfine
Copy link
Contributor

cc @jhawthorn. You might be interested into this.

@jhawthorn
Copy link
Member

jhawthorn commented Jun 3, 2022

I don't have a ton of time to dig in right now, but a couple quick notes (I see a bunch of this is also discussed above). I tested against @thewatts branch.

First is that the benchmarks from above are no longer from the latest Rails (they probably were when this issue was opened 😅) but a quick bundle update improves performance of the "nested" case about 40%, so progress has been made here.

The next reason the "nested" case is slower is logging/instrumentation. In my benchmark script I added ActionView::Base.logger = nil and that brought it to almost 2x the original speed. This is printing 20,000 lines of logs, so we should expect it to be slow.

I think the next significant cost is the template lookup, which has had significant improvements in the past year (which is why the bundle update helped so much 😁). A significant difference from the "flat" and "nested" case is that the nested case requires each of the lookups to be separate (and makes 20k of them). These are cached, but what we're seeing is the cost to lookup the cached result.

Benchmark script
require "./config/environment"
require 'stackprof'
require 'benchmark/ips'

ActionView::Base.logger = nil

app = Rack::Builder.load_file("./config.ru")[0]

make_request = -> (path) {
  env = Rack::MockRequest.env_for("http://localhost/#{path}")
  env["HTTP_HOST"] = "localhost"
  env["REMOTE_ADDR"] = "127.0.0.1"

  status, headers, body = app.call(env.dup)
  s = []
  body.each { |x| s << x }
  body.close if body.respond_to?(:close)
  raise unless status == 200
  [status, headers, s.join("")]
}

p make_request.call("/flat_test").last.size
p make_request.call("/nested_test").last.size

result = Benchmark.ips do |x|
  x.report "/flat_test" do
    make_request.call("/flat_test")
  end

  x.report "/nested_test" do
    make_request.call("/nested_test")
  end
Calculating -------------------------------------
          /flat_test     14.629  (±13.7%) i/s -     72.000  in   5.133654s
        /nested_test      1.899  (± 0.0%) i/s -     10.000  in   5.285865s

(I think we shouldn't consider the "tag method" part of this benchmark, I believe the performance there is completely unrelated to partial rendering)

@jhawthorn
Copy link
Member

Aha! Another issue (mentioned briefly above) is that the benchmark repo uses "implicit" prefixes in the templates: render "button" instead of render "application/button". Switching to the latter should provide another performance boost, since we don't have to scan the cache of all view directories for all prefixes before finding that application/ matches.

@bensheldon
Copy link
Contributor

bensheldon commented Jun 4, 2022

Based on @jhawthorn's comments, I wanted to pair down the benchmark. Observations:

  • Looping over render ViewComponent is about the same as render :collection
  • Logging is the major performance slowdown for nested partials. Causes ~1.7-1.9x slower than ViewComponents, render :collection or inlining. Using ActionView::Base.logger = nil is a big speedup.
  • (Not-replicated) I didn't see a speedup by using a full partial path.
  • (Subjective) Without logging, nested partials are 1.3x slower than inlining, but ViewComponents and render :collection are still 1.1x slower than inlining. Given that I used simplistic examples, and I assume that it's a fixed-cost difference regardless of the complexity within the partial... it's maybe a small difference in the big context of things, without logging.

Here's my reproduction repo: https://github.com/bensheldon/benchmark_render

Benchmark Output (with logging)
❯   RAILS_VERSION=6.0.5 ./benchmark_render.rb \
  && RAILS_VERSION=7.0.3 ./benchmark_render.rb \
  && ./benchmark_render.rb

Rails version: 6.0.5
Warming up --------------------------------------
             /inline   313.000  i/100ms
        /nested_loop   172.000  i/100ms
   /nested_loop_path   174.000  i/100ms
  /nested_collection   272.000  i/100ms
            /vc_loop   298.000  i/100ms
Calculating -------------------------------------
             /inline      3.254k (± 1.3%) i/s -     16.276k in   5.003258s
        /nested_loop      1.737k (± 1.4%) i/s -      8.772k in   5.051732s
   /nested_loop_path      1.727k (± 1.2%) i/s -      8.700k in   5.039853s
  /nested_collection      2.754k (± 3.6%) i/s -     13.872k in   5.045295s
            /vc_loop      3.029k (± 3.4%) i/s -     15.198k in   5.023531s

Comparison:
             /inline:     3253.7 i/s
            /vc_loop:     3029.4 i/s - 1.07x  (± 0.00) slower
  /nested_collection:     2753.9 i/s - 1.18x  (± 0.00) slower
        /nested_loop:     1736.8 i/s - 1.87x  (± 0.00) slower
   /nested_loop_path:     1726.5 i/s - 1.88x  (± 0.00) slower

Rails version: 7.0.3
Warming up --------------------------------------
             /inline   308.000  i/100ms
        /nested_loop   173.000  i/100ms
   /nested_loop_path   174.000  i/100ms
  /nested_collection   268.000  i/100ms
            /vc_loop   281.000  i/100ms
Calculating -------------------------------------
             /inline      3.153k (± 2.5%) i/s -     16.016k in   5.083613s
        /nested_loop      1.732k (± 1.4%) i/s -      8.823k in   5.094727s
   /nested_loop_path      1.748k (± 2.5%) i/s -      8.874k in   5.081193s
  /nested_collection      2.686k (± 2.7%) i/s -     13.668k in   5.091552s
            /vc_loop      2.858k (± 1.7%) i/s -     14.331k in   5.016606s

Comparison:
             /inline:     3152.5 i/s
            /vc_loop:     2857.5 i/s - 1.10x  (± 0.00) slower
  /nested_collection:     2686.5 i/s - 1.17x  (± 0.00) slower
   /nested_loop_path:     1747.5 i/s - 1.80x  (± 0.00) slower
        /nested_loop:     1732.2 i/s - 1.82x  (± 0.00) slower

Rails version: 7.1.0.alpha
Warming up --------------------------------------
             /inline   307.000  i/100ms
        /nested_loop   181.000  i/100ms
   /nested_loop_path   180.000  i/100ms
  /nested_collection   258.000  i/100ms
            /vc_loop   280.000  i/100ms
Calculating -------------------------------------
             /inline      3.062k (± 3.4%) i/s -     15.350k in   5.020859s
        /nested_loop      1.784k (± 4.1%) i/s -      9.050k in   5.082636s
   /nested_loop_path      1.736k (± 2.0%) i/s -      8.820k in   5.083707s
  /nested_collection      2.647k (± 1.9%) i/s -     13.416k in   5.069750s
            /vc_loop      2.829k (± 2.5%) i/s -     14.280k in   5.051345s

Comparison:
             /inline:     3061.6 i/s
            /vc_loop:     2828.8 i/s - 1.08x  (± 0.00) slower
  /nested_collection:     2647.2 i/s - 1.16x  (± 0.00) slower
        /nested_loop:     1784.0 i/s - 1.72x  (± 0.00) slower
   /nested_loop_path:     1735.7 i/s - 1.76x  (± 0.00) slower
Benchmark Output (without logging)
❯ WITHOUT_LOGGER=1 RAILS_VERSION=6.0.5 ./benchmark_render.rb \
  && WITHOUT_LOGGER=1 RAILS_VERSION=7.0.3 ./benchmark_render.rb \
  && WITHOUT_LOGGER=1 ./benchmark_render.rb
Rails version: 6.0.5
Warming up --------------------------------------
             /inline   353.000  i/100ms
        /nested_loop   182.000  i/100ms
   /nested_loop_path   275.000  i/100ms
  /nested_collection   325.000  i/100ms
            /vc_loop   326.000  i/100ms
Calculating -------------------------------------
             /inline      3.518k (± 6.0%) i/s -     17.650k in   5.038094s
        /nested_loop      2.770k (± 1.8%) i/s -     14.014k in   5.060657s
   /nested_loop_path      2.715k (± 4.3%) i/s -     13.750k in   5.074845s
  /nested_collection      3.218k (± 4.3%) i/s -     16.250k in   5.061147s
            /vc_loop      3.310k (± 2.0%) i/s -     16.626k in   5.024399s

Comparison:
             /inline:     3517.7 i/s
            /vc_loop:     3310.4 i/s - same-ish: difference falls within error
  /nested_collection:     3217.8 i/s - same-ish: difference falls within error
        /nested_loop:     2770.1 i/s - 1.27x  (± 0.00) slower
   /nested_loop_path:     2715.4 i/s - 1.30x  (± 0.00) slower

Rails version: 7.0.3
Warming up --------------------------------------
             /inline   345.000  i/100ms
        /nested_loop   279.000  i/100ms
   /nested_loop_path   272.000  i/100ms
  /nested_collection   304.000  i/100ms
            /vc_loop   310.000  i/100ms
Calculating -------------------------------------
             /inline      3.452k (± 1.5%) i/s -     17.595k in   5.098560s
        /nested_loop      2.708k (± 3.5%) i/s -     13.671k in   5.054214s
   /nested_loop_path      2.652k (± 5.2%) i/s -     13.328k in   5.042542s
  /nested_collection      2.992k (± 3.9%) i/s -     15.200k in   5.089442s
            /vc_loop      3.082k (± 7.6%) i/s -     15.500k in   5.076108s

Comparison:
             /inline:     3451.8 i/s
            /vc_loop:     3082.2 i/s - 1.12x  (± 0.00) slower
  /nested_collection:     2991.5 i/s - 1.15x  (± 0.00) slower
        /nested_loop:     2708.5 i/s - 1.27x  (± 0.00) slower
   /nested_loop_path:     2651.9 i/s - 1.30x  (± 0.00) slower

Rails version: 7.1.0.alpha
Warming up --------------------------------------
             /inline   296.000  i/100ms
        /nested_loop   247.000  i/100ms
   /nested_loop_path   248.000  i/100ms
  /nested_collection   296.000  i/100ms
            /vc_loop   304.000  i/100ms
Calculating -------------------------------------
             /inline      3.281k (± 3.9%) i/s -     16.576k in   5.060233s
        /nested_loop      2.476k (± 5.1%) i/s -     12.350k in   5.005304s
   /nested_loop_path      2.500k (± 1.9%) i/s -     12.648k in   5.061520s
  /nested_collection      2.970k (± 2.1%) i/s -     15.096k in   5.085463s
            /vc_loop      3.051k (± 2.1%) i/s -     15.504k in   5.083644s

Comparison:
             /inline:     3281.4 i/s
            /vc_loop:     3051.1 i/s - 1.08x  (± 0.00) slower
  /nested_collection:     2969.8 i/s - 1.10x  (± 0.00) slower
   /nested_loop_path:     2499.8 i/s - 1.31x  (± 0.00) slower
        /nested_loop:     2475.6 i/s - 1.33x  (± 0.00) slower

stillhart added a commit to stillhart/openstreetmap-website that referenced this issue Aug 7, 2023
…ssue. this is because the whole render context is given to the partial. this view is using this extensively and thousands of times. just inlining the object should improve the performance. At the same time a lot of code can be de-duplicated. I kept a fallback in case of new objects. rails/rails#41452
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants