Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate performance improvements for article list blocks #7683

Closed
briand44 opened this issue May 15, 2024 · 33 comments
Closed

Investigate performance improvements for article list blocks #7683

briand44 opened this issue May 15, 2024 · 33 comments
Assignees
Labels
medium story pointing

Comments

@briand44
Copy link
Contributor

256 MB memory limit on uiowa03 didn't solve now.uiowa.edu/campus listblock

https://iowaweb.slack.com/archives/CSQA18BK4/p1715450560643829

@briand44
Copy link
Contributor Author

briand44 commented Jun 6, 2024

Grooming conversation.... The Iowa Now example is paging and still running into memory limits. We should investigate and see if we can understand what is going on here. Not committing to solving as part of this.

@briand44 briand44 added medium story pointing and removed needs grooming labels Jun 6, 2024
@GaryRidgway GaryRidgway self-assigned this Jun 7, 2024
@GaryRidgway
Copy link
Contributor

Re-posting from slack for posterity:

So taking a look at the cache tags for the view for listblock (Which is what is overflowing), it seems like the media tags are what is growing exponentially.

This is being called at /mnt/www/html/uiowa03/docroot/core/lib/Drupal/Core/Cache/Cache.php
Take a look here for what I found.

@GaryRidgway
Copy link
Contributor

That the error is bubbling up through a few different functions, so the top level functions that we are seeing in new relic are probably not the actual issue, just where the app is running out of memory.

From some more digging it seems as though the error is coming up through the fieldBlock() call in docroot/modules/contrib/ctools/modules/ctools_views/src/Plugin/Display/Block.php. Which could mean that there is something going on with the exposed filters?

@GaryRidgway
Copy link
Contributor

Removing the filters from the back end views UI did not let the views UI load the preview, so perhaps it is something else.

@cory-skeers
Copy link
Contributor

Adding the little bit of exploration I've been able to do using xhprof memory profiling, as best I have been able to piece together.

The primary memory usage seems to build up from the database get and getMulti when pulling in results check for cached versions prior to going to the database. Fetching this cache version does a lot of unserialize calls, and these results get held in memory as the next items get called. Changing what gets displayed from the view doesn't appear to make a difference (eg changing the view to display only a single field rather than rendering the content teaser display results in essentially the same memory usage).

There also doesn't appear to be any difference between whether AJAX is enabled or not. The precise functions that are called in the stack vary slightly, but both end up in similar 'check the cache and unserialize or check the database' buildups of memory.

There also doesn't seem to be much difference in displaying many results or one result, and from the memory usage it seems like if the pager is enabled, Views is fetching ALL results on load, and then simply reducing down to how many should be displayed at a time (as opposed to if there is a limit with no pager, only the limited number of results are fetched and processed). I haven't confirmed this by stepping through the code, but the memory usage remaining consistent across different view and block settings seems to lean toward this conclusion.

If this is the case, I'd expect there to be some existing Issue that could be tied to performance degradation when the pager is enabled, but I haven't found anything as of yet.

@cory-skeers
Copy link
Contributor

it seems like if the pager is enabled, Views is fetching ALL results on load, and then simply reducing down to how many should be displayed at a time (as opposed to if there is a limit with no pager, only the limited number of results are fetched and processed). I haven't confirmed this by stepping through the code, but the memory usage remaining consistent across different view and block settings seems to lean toward this conclusion.

Stepping through from Drupal\views\Plugin\views\pager\SqlBase::query I was able to watch the query get properly limited based on the pager settings and get executed as that, so if there is something querying all results, it is happening further down the line than the initial Views results querying.

@joewhitsitt
Copy link
Contributor

I'd expect there to be some existing Issue that could be tied to performance degradation

I don't want to throw you off any scent, but your comment made me think about how Alan was getting the error consistently on the backend. Have either of you tried turning on the views settings that show the queries and the performance? Maybe it shows something a bit closer to the view than xhprof does. https://now.uiowa.ddev.site/admin/structure/views/settings

Turning on those things and reducing the Now's version of the list block to a sensible pager amount, I am seeing a lot of calls to things like crop_field_data and path_alias. which seems a bit overkill. At the same time I want to point out that the tags filter on our listblocks has the reduce duplicates option checked which I imagine might be a bit of a memory hog in itself. Maybe we can no longer just have that checked for all and that only filters will multiple selections should get it?

@GaryRidgway
Copy link
Contributor

Have either of you tried turning on the views settings that show the queries and the performance?

I know that I haven't so I will do so and take a look! Thanks for the tip joe!

@joewhitsitt
Copy link
Contributor

I am curious though if we have clear steps to reproduce on the frontend yet?

@GaryRidgway
Copy link
Contributor

No, not yet

@GaryRidgway
Copy link
Contributor

I reduced my memory down to 64M and the /campus page errored for me, and then removing the Reduce duplicates on the tags filter allowed it to render again.

Additionally it seems as though @bspeare removed featured image and body from the teaser and it allowed the page to load as well.

I am not sure how MUCH memory we are saving, but at least it is a start!

@GaryRidgway
Copy link
Contributor

After a fair amount of digging in this thread, we were able to discover that disabling the "Reduce duplicates" option and removing the body field from the teaser display were actually not helpful.

In fact, based on the data analysis that I was able to whip up, it actually looks like it costs MORE memory with them disabled.

In this vein, we have decided to try and explore other avenues.

@bspeare
Copy link
Contributor

bspeare commented Jun 13, 2024

Would it make a difference if we set a limit on the number of pages allowed in these list blocks? What if we set the max limit to 5 instead of 255 like on https://now.uiowa.ddev.site/campus?

@cory-skeers
Copy link
Contributor

Would it make a difference if we set a limit on the number of pages allowed in these list blocks? What if we set the max limit to 5 instead of 255 like on https://now.uiowa.ddev.site/campus?

I think that would be one way to get it to work. I'm not sure what the limit would need to be, and if there are factors that play into it where the required limit for one results set might be different for another, or if they're all roughly the same.

It could be worth it to punt the further investigation if we have a page or results limit that works consistently for the oom examples we have, that we can lift if/when any changes get made?

@joewhitsitt
Copy link
Contributor

are these erroring on the taxonomy-term page for the tag campus? because those aren't limited but also aren't listblocks in the traditional sense

ultimately, folks that keep an archive that big want folks to be able to get to it (view ALL articles). is a paged view the best way? maybe not, but is the only thing we offer.

@bspeare
Copy link
Contributor

bspeare commented Jun 13, 2024

I was thinking that we could limit the list blocks and direct people to the /admin/config/sitenow/uiowa-articles for a full archive since view pages don't have the same memory errors.

@cory-skeers
Copy link
Contributor

are these erroring on the taxonomy-term page for the tag campus? because those aren't limited but also aren't listblocks in the traditional sense

I was able to pull that up on the 64M local limitation, so looks like that works fine and the issue is specifically list blocks (or not exactly list blocks, but that view and/or list blocks. Extra filters, exposed filters, view in a block, etc)

@GaryRidgway GaryRidgway removed their assignment Jun 13, 2024
@joewhitsitt joewhitsitt self-assigned this Jun 14, 2024
@joewhitsitt
Copy link
Contributor

welp. nothing yet, but I got xdebug profiling to work:

Image

@joewhitsitt
Copy link
Contributor

joewhitsitt commented Jun 14, 2024

I remember running into this when working on card_4x but I find it interesting that layout_builder_custom_preprocess_views_view and everything downstream (e.g. buildCard) of it seems to get called twice per view.

...and taxonomy term views don't

@joewhitsitt
Copy link
Contributor

joewhitsitt commented Jun 17, 2024

Still trying to figure out the catch, but adding the following to the beginning of docroot/modules/custom/layout_builder_custom/src/Plugin/Display/ListBlock.php:538 cut down the duplicate calls and still returns the settings I set.

diff --git a/docroot/modules/custom/layout_builder_custom/src/Plugin/Display/ListBlock.php b/docroot/modules/custom/layout_builder_custom/src/Plugin/Display/ListBlock.php
index 9377fb52e..f5929dedd 100644
--- a/docroot/modules/custom/layout_builder_custom/src/Plugin/Display/ListBlock.php
+++ b/docroot/modules/custom/layout_builder_custom/src/Plugin/Display/ListBlock.php
@@ -537,6 +537,11 @@ public function blockSubmit(ViewsBlock $block, $form, FormStateInterface $form_s
    */
   public function preBlockBuild(ViewsBlock $block) {
     parent::preBlockBuild($block);
+    static $been_there_done_that = FALSE;
+    if ($been_there_done_that) {
+      return;
+    }
+    $been_there_done_that = TRUE;
     $this->setOption('exposed_block', FALSE);
 
     $allow_settings = array_filter($this->getOption('allow'));

Update: This doesn't seem to be the right spot to do this. for example preprocess_media loops through multiple times still

@joewhitsitt joewhitsitt removed their assignment Jun 17, 2024
@joewhitsitt
Copy link
Contributor

joewhitsitt commented Jun 18, 2024

Forgot to summarize where I left things yesterday:

I was noticing that to load one list block on a page that several processes ran multiple times. preprocess_media I think ran the most (per each row time 3 different times). Card build is also in this mix, but I accidentally removed my breakpoint from it.

\Drupal\layout_builder_custom\Plugin\Display\ListBlock::preBlockBuild
\Drupal\views\Plugin\Block\ViewsBlock::build
\layout_builder_custom_views_pre_build
\layout_builder_custom_preprocess_media
\layout_builder_custom_preprocess_block
\uiowa_core_preprocess_views_view
\layout_builder_custom_preprocess_views_view
\layout_builder_custom_preprocess_views_view_unformatted
\layout_builder_custom_preprocess_media
\uiowa_core_preprocess_views_view
\layout_builder_custom_preprocess_views_view
\layout_builder_custom_preprocess_views_view_unformatted
\layout_builder_custom_preprocess_media
\Drupal\layout_builder_custom\Plugin\Display\ListBlock::preBlockBuild
\layout_builder_custom_views_pre_build

My "been there done that" attempt at stopping processes from running multiple times didn't really work. It would jump to the processes it was allowed to do and I didn't inspect to see if the first-run output was retained, only that there didn't seem to be a visual difference in what I selected. paging styles seemed to be retained.

After a while 64 MB local was almost too restricting so until I needed to test something, I was running 128 so I was able to sync a site, etc.

Nobody responded to my rather cryptic cry for help: https://drupal.slack.com/archives/C6SQM2J94/p1718644245476239

Also, in addition to the straight up memory exhaustion errors, I feel like ob_get_clean errors can also be attributed to memory issues with our list blocks: https://iowaweb.slack.com/archives/CSQA18BK4/p1718693570070859

@pyrello pyrello self-assigned this Jun 18, 2024
@pyrello
Copy link
Contributor

pyrello commented Jun 18, 2024

A few initial notes:

  • When accessing the article list block view in the Views UI, there appears to be no limit set so the preview below is attempting to load all 10877 items, which works until the entities get loaded and then runs out of memory.
  • When accessing the /campus page on Iowa Now locally, I can validate that the queries for the list blocks on that page get appropriately limited so that they don't pull back all results (3 for the first one and 20 for the second one).
  • Based on the above, I think there is a limited amount we can learn by using the preview in the Views UI about the issues we are having.

@pyrello
Copy link
Contributor

pyrello commented Jun 18, 2024

I'm curious how we concluded that it was the list block views on the /campus page that were the cause of the OOM issues? There is very little original source information on this issue, which tracks back to a Slack thread that, as best as I can tell, is just @joewhitsitt commenting that we might have to look into list blocks more.

I've been wondering this the whole time we've had this issue queued up and in progress, but especially now because I do not see anything that appears to be amiss with the views aspect of the page. As best I can tell there is a total of 23 complete nodes being loaded on this page, which doesn't seem like enough to cause an OOM error at 256 M.

I'd like to understand how we got to this point better because the assumption that the list block is the cause of the issue seems to have been driving a decent amount of our investigation of the issue. There have been suggestions about how we might change the list block to maybe, possibly, hopefully fix the issue. If I'm missing the context that makes this make sense, I'd really like to see what it is.

@joewhitsitt
Copy link
Contributor

@pyrello My suspicion started with listblock based on the amount of out of memory errors occurring on paths that contain one or more listblocks. When an out of memory error occurs in #drupal-status, I usually take a moment to click in to see where. Here is a list of uiowa03 now.uiowa.edu examples (not just /campus) going back a couple months: https://one.newrelic.com/errors-inbox?duration=7776000000&state=86f7bfd6-e6c6-442b-5efb-9e10fcce94ca

With the exception of the homepage, most of these other paths are placed article list blocks that filter on a particular tag.

We can also tell that compared to taxonomy term view pages /tags/campus which display the same content that they don't seem to be problematic.

@pyrello
Copy link
Contributor

pyrello commented Jun 18, 2024

Noting a couple major differences between an instance of an article list block (e.g. /campus) and a taxonomy term page (e.g. /tags/campus):

  • Term pages don't force fields and list blocks do.
  • Term pages don't engage Layout Builder as part of the rendering process.

@pyrello
Copy link
Contributor

pyrello commented Jun 18, 2024

preprocess_media I think ran the most (per each row time 3 different times).

The first time is when the view is rendering the field, which happens because the list blocks all currently utilize the "force fields" setting despite using the teaser. This is because we utilize the list of fields from the view to generate our list of fields to hide in the block form. We could potentially generate our own list and then remove the "force fields" setting from list blocks to make them more performant.

The second time preprocess_media gets triggered is when the field is being rendered as part of the node teaser. I'm not sure about the third time. It appears that there is duplication of multiple parts of the node teaser rendering process and I also remember encountering this when we were in the middle of the card 4.x work.

--

I'm pretty skeptical that there is an actual issue with the list block or views generally in this case. I can't deny that its related methods and functions are in the stack often when we encounter these OOM issues. But I'm not seeing anything that seems to be working incorrectly or inefficiently with the views part of the equation. I've been poking at it all day and can't find anything amiss. Possibly there is something amiss with our card rendering logic, but I'm also skeptical of that because we can't reliably trigger OOM issues at 128 or 256 M locally. Also, why are we regularly getting these issues on a page with 23 node teasers, but https://bam.sites.uiowa.edu/news rarely seems to trigger these errors? Its just a hunch, but I think we're pulling the wrong thread to try to resolve this.

@joewhitsitt
Copy link
Contributor

Thanks for looking into it and the breakdown of preprocess_media.

Maybe it isn't an issue of a process running incorrectly but the combination of processes we have brought together to fire at once (force fields, reduce duplicates, query of 1000+ entities to construct pager, on-the-fly media view mode switching). Maybe we are just pushing the limits and unable to scale beyond a certain point.

The taxonomy term view is a different display handler (page, not listblock), rendering defined teasers (no forced fields, on the fly media processing) with a single contextual argument (no reduced duplicates).

BAM used to have OOM errors at 128 M and a reminder that this page is not using listblock or entity reference card teasers, but card blocks. Though there is more rendered here (images, blocks), I imagine there would be less looping, processes involved.

I'd be curious, in addition to what you have already pushed for setting an initial limit, to make the reduce duplicates only take effect if multiple arguments are set.

Future slice: I also wonder if we wanted to leverage view modes a bit more and have teaser widescreen and teaser square/circle to reduce the media processing. I might need help remembering any complicating factor/options that I am not accounting for here.

@pyrello
Copy link
Contributor

pyrello commented Jun 19, 2024

query of 1000+ entities to construct pager

Just want to note that this is a SELECT COUNT(*) query, which is very fast and takes almost no resources. It doesn't return anything except the number of rows.

@pyrello
Copy link
Contributor

pyrello commented Jun 19, 2024

Some more digging:
I was curious about what @cory-skeers mentioned yesterday in EoD, about how he had observed higher memory usage in cases with list blocks that had pagers (and obviously a lot of potential matching node results) vs. list blocks without pagers. The implication seemed to be that including the pager might be triggering the loading of all matching nodes somehow even though we were able to observe by stepping through the Views execution process that only the results of the current page were queried and loaded into the View results storage.

To test this, I used this custom DDEV command to allow enabling query logging: https://gist.github.com/jonaseberle/28a675d034dfa1dbaf8e586184fd1d77. When this command is run, it will show you all the queries that are run while it is logging. I cleared the cache on the local Now site, turned on query logging, and then loaded the /campus page. I had previously identified the nid of this page: https://now.uiowa.edu/news/2012/03/sleigh-bells-headlining-riverfest-2012-april-24-performance. It shows up in the last page of results for the list block on the /campus page that uses a pager. I searched all the output of the query logging for that nid, 37911 and found no matches. Presumably, if the nodes that match the view but are not part of the current page of results are being loaded somehow, I'd be able to see a query with that nid included. Based on this, I am concluding that only the nodes that are part of the list block result set are being loaded after cache clear when the /campus page is loaded.

Maybe it isn't an issue of a process running incorrectly but the combination of processes we have brought together to fire at once (force fields, reduce duplicates, query of 1000+ entities to construct pager, on-the-fly media view mode switching). Maybe we are just pushing the limits and unable to scale beyond a certain point.

To test this premise, I disabled the maximum limit in the list block form settings that caps results at 50 per page. Then I set the limit at successively higher numbers and cleared the cache 10 times and reloaded the page to see if I could get recreate an OOM issue. I was able to get up to 400 results without any errors. At 500, I immediately encountered a WSOD that persisted across multiple drush cr and page reloads. My conclusion is that there is a point at which the solution we have in place no longer scales, but up to 400 results was not enough to recreate an OOM error across many trials.

The reason I am trying to poke holes in the premise that this is a views/list block/media preprocess issue is because I think that pursuing optimizations based on the assumption that we have maxed out what we can do with the list block is not going to solve this issue for us. It is not to say that the optimizations that we would undertake would not have value. Rather, I don't think they are going to help us avoid this issue in the future, so I'm skeptical that they are worth doing at this point.

@pyrello
Copy link
Contributor

pyrello commented Jun 19, 2024

Potential future slices:

  • Currently, each of the standard list blocks use the "force fields" setting because we use the list of fields to generate a list of checkboxes in the block form to allow users to select fields to be hidden. It is possible we are using the fields from the view in further processing, but we don't need to because we also load the complete node and can pull the field information from that. We could create a list of fields per content type in either configuration or code to use to generate the list of checkboxes and then we can avoid the penalty of loading the views fields an extra time.
  • Investigate why preprocess functions related to rendering the node/card teaser are triggered twice.

Also noting that I checked the time of errors and matched that up to when cron runs on uiowa03, which happens at 21 minutes past the hour. The error message times are all over the board, so I don't think there there is a strict correlation between the error and cron triggering a cache flush.

@pyrello
Copy link
Contributor

pyrello commented Jun 20, 2024

Summary

Problem

We are seeing instances of out of memory (OOM) errors on multiple pages on multiple websites, with a particularly high incidence on now.uiowa.edu. At the time that this issue was created, there was an error message in Slack (from New Relic) for OOM related to https://now.uiowa.edu/campus. The page in question utilizes two article list blocks: both utilizing the campus tag as a filter, the first limited to just 3 results, and the second showing all results paged in sets of 20. It was observed that many of the pages where OOM errors occur have list blocks on them, so there seemed to be some type of correlation.

Methodology

A variety of methods were used to try to understand the root of the problem:

  • Reviewing the call stack attached to OOM errors in New Relic to attempt to identify what was happening when the OOM error occurred
  • xhprof memory profiling via DDEV
  • Step debugging with xdebug. This was combined with a variety of code/configuration changes to understand what was happening/changing during code execution.
  • Changing code/configuration to observe effects during code execution
  • Reducing local PHP memory_limit to 256M (production setting) and 64M
  • Comparison of instance of page with article list block to "campus" taxonomy term page on Iowa Now
  • A DDEV query logging command was added to allow analysis of queries run during the process of cache rebuild + page reload of /campus.

Observations

  • Review of the OOM errors in New Relic revealed that "the error is bubbling up through a few different functions, so the top level functions that we are seeing in new relic are probably not the actual issue, just where the app is running out of memory."
  • The preview view was not able to load for the article list block in Views UI, which was initially suspected of being connected, but turned out to be because the configuration being loaded for the preview view was trying to load all 10k+ articles on the Iowa Now site.
  • "The primary memory usage seems to build up from the database get and getMulti when pulling in results check for cached versions prior to going to the database." This results in a lot of unserialize calls and these results get held in memory.
  • Changing what gets displayed from the view doesn't appear to make a difference in the memory usage.
  • It seemed from memory usage that if the pager is enabled on a list block, views was fetching all results on load. This was ruled out as a possibility by multiple people by step debugging through views execution.
  • In fact, views only queries and returns the results for a particular paged set (20 results in the case of the second article list block in the /campus page example).
  • It was proposed that maybe the "Reduce duplicates" option being enabled was negatively affecting the performance of the query, but @GaryRidgway's results seemed to indicate otherwise: https://docs.google.com/spreadsheets/d/1XxCge1NO6SuGag_Juy8pN4EvJm1SPc2dmJ3uyUDV8qw/edit?gid=0#gid=0 (source)
  • Loading one list block on a page causes several processes to load multiple times: Investigate performance improvements for article list blocks #7683 (comment)
  • Because we have the "force fields" setting turned on for list blocks, we incur an extra loading of all the fields specified in the view. This is responsible for at least one of the "extra" layout_builder_custom_preprocess_media calls.
  • Throughout all of this testing, no developer was able to cause an OOM error locally on the /campus page without artificially modifying some platform, block, or environment configuration.
  • To stress test the article list block, the limit of how many items could be showed in a single paged set was artificially increased followed by 10 cache rebuilds + page reloads. It was only after increasing the number above 400 that an OOM error was achieved at 256M.
  • A cache rebuild was triggered, followed by enabling mysql query logging, and finally loading the /campus page. This was done to test the premise that nodes other than the paged set of 20 queried and returned by the view were being loaded from the database through some as yet unidentified process. An nid for an article node that appears in the last page of results was selected. A full search of the query log output was performed to find the nid in question and it was not present. This suggests that the node was not queried or loaded through any process necessary to build and load the page with no cache available.

@pyrello
Copy link
Contributor

pyrello commented Jun 20, 2024

I submitted a support request with Acquia.

@briand44
Copy link
Contributor Author

Closing this. We can open a new issue if Acquia gives us any leads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
medium story pointing
Projects
None yet
Development

No branches or pull requests

6 participants