
[R&D] Adaptative batch size on Preload #6396

Open
MathieuLamiot opened this issue Jan 18, 2024 · 6 comments · May be fixed by #6427
Labels
needs: r&d Needs research and development (R&D) before a solution can be proposed and scoped.


@MathieuLamiot
Contributor

Context
It seems Preload can put a lot of pressure on a server when the website's pages are slow to load. One way to adapt would be to measure how long a request takes and adjust the batch size accordingly.

What to do
This branch is a quick&dirty example of how this could be implemented: https://github.com/wp-media/wp-rocket/tree/prototype/preload-adaptative-batch
The idea is partially described here, but it has evolved a bit: the batch size is now based on the measured duration of a preload request, obtained by making one request blocking from time to time.

A developer from the plugin team needs to spend some time on this branch to make it work (it may not work as-is; I just wrote the code to lay the idea down), make it production-ready, and play with it to see how it behaves, possibly with logs.
We have the gamma.rocketlabs.ovh website, which suffers from CPU issues when a full cache clear triggers the preload. It would be a good place to test this. See here.

Warning
This branch would need #6394.
Otherwise, we have no control to prevent flooding the AS queue, and the number of in-progress jobs could increase too quickly.

@MathieuLamiot added the needs: r&d Needs research and development (R&D) before a solution can be proposed and scoped. label Jan 18, 2024
@Khadreal self-assigned this Jan 23, 2024
@Khadreal
Contributor

Improved the batch size work a bit from what @MathieuLamiot did: added a transient for all requests, then used its value to determine the max and min size of the next batch of requests.

@MathieuLamiot
Contributor Author

MathieuLamiot commented Jan 26, 2024

Thanks @Khadreal 🙏
I am not sure what the expected behavior would be for the rocket_preload_previous_request_durations values 🤔 If I understand correctly:

  • every ~5 minutes, we'll check the duration of one request (behavior I introduced)
  • we will add this duration to rocket_preload_previous_request_durations (behavior you introduced)
  • we will use rocket_preload_previous_request_durations to define the preload batch size.

It seems to me that, with the current code, rocket_preload_previous_request_durations starts at 0 and will increase without upper limit every 5 minutes, so every 5 minutes, we will reduce the speed of preload. You might be missing a "rolling average" mechanism? Or am I missing something?

A rolling average could be implemented as follows (it's not the best way to do it, but it's the quickest one):

Replace:
$previous_request_durations = $previous_request_durations + $duration;
With:

if ( $previous_request_durations <= 0 ) {
    $previous_request_durations = $duration;
} else {
    $previous_request_durations = $previous_request_durations * 0.7 + $duration * 0.3;
}
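For intuition, the same exponentially weighted moving average can be sketched in Python (weights 0.7/0.3 as in the snippet above; `update_avg` is an illustrative name, not from the plugin):

```python
def update_avg(avg, duration, weight=0.3):
    """Exponentially weighted moving average of request durations.

    Mirrors the PHP sketch above: the first measurement seeds the
    average; later measurements are blended in with 0.7/0.3 weights."""
    if avg <= 0:
        return duration
    return avg * (1 - weight) + duration * weight

avg = 0.0
for d in [1.0, 4.0, 4.0, 4.0]:  # site suddenly slows down
    avg = update_avg(avg, d)
# avg climbs gradually from 1.0 toward 4.0 instead of jumping,
# so a single slow request cannot collapse the batch size on its own
```

This is why the average neither grows without bound (the original concern) nor overreacts to one outlier measurement.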

@MathieuLamiot
Contributor Author

MathieuLamiot commented Jan 29, 2024

I cleaned up a few things and added dedicated logic, based on a transient, to limit the number of blocking requests to 1 per minute. It gives good results on my local setup. I am trying to test on gamma, where we should be able to see a preload going much slower thanks to this; currently blocked because I can't write with the FTP access 🤷

To monitor easily, I added the following:

  • error_log( sprintf( 'preload_url: duration %s averaged %s', $duration, $previous_request_durations ) ); after $check_duration = false;
  • error_log( sprintf( 'process_pending_jobs: batch size %s averaged %s', $next_batch_size, $preload_request_duration ) ); before $next_batch_size = min( $next_batch_size, $max_batch_size );

@piotrbak
In the current branch:

  • The first time we prepare a preload batch, we start with the minimum batch size (currently 5 URLs; there is a filter). The next time, the batch size is based on the average request time (see below). The formula currently is $next_batch_size = round( ( 1 / $preload_request_duration ) / 2 * 60 );, which gives 30 jobs per batch on my local setup with a 1 s request duration. The formula might be too conservative, but we can easily change it. There is also a min/max to avoid going too fast/too slow (5 and 45 for now).
  • Some preload requests are now blocking, to allow us to measure the time. We limit the number of blocking requests to 1 per minute, so that the non-blocking approach remains the majority of cases.
  • Each blocking request allows us to update $preload_request_duration.
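As a sanity check, the formula from the first bullet can be sketched with its 5/45 clamp applied (Python for illustration; `next_batch_size` mirrors the PHP variable name):

```python
def next_batch_size(avg_duration_s, min_size=5, max_size=45):
    """Prototype formula: round((1 / duration) / 2 * 60),
    clamped to [min_size, max_size] (the current filter defaults)."""
    size = round((1 / avg_duration_s) / 2 * 60)
    return max(min_size, min(size, max_size))

print(next_batch_size(1.0))   # 30 jobs per batch for a 1 s average request
print(next_batch_size(10.0))  # very slow site: clamped to the minimum of 5
print(next_batch_size(0.5))   # fast site: clamped to the maximum of 45
```

Note the formula was later adjusted after testing on gamma, so treat this as the state of the branch at the time of this comment.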

While we finalize testing, we would need your input on:

  • How conservative do you want to be? For instance, a 1 s duration currently already lowers the batch size. Maybe it shouldn't, and we should have a more aggressive formula?
  • How do you want to release this adaptive feature? Enabled by default? Behind a filter? Something else? Note that with the min/max filters, one can already "force" a batch size by setting min = max. The question is mostly whether we should set min = max = 45 by default or not.

@MathieuLamiot linked a pull request Feb 1, 2024 that will close this issue
@MathieuLamiot
Contributor Author

MathieuLamiot commented Feb 1, 2024

After running tests on the gamma website and locally, I adjusted the formula so that we don't impact "normal" websites much, but do reduce the batch size when the website is slow (typically, an average of more than 3 or 4 seconds per request starts to reduce the preload significantly).
I also reduced the timeframe over which the average-duration transient is kept, so we can adapt quickly when the website's performance changes quickly (which is the case with gamma, for instance).

I opened a PR to keep track, but we'll need AC or at least NRT plans here, and some rework of the unit/integration tests. I manually tested as much as possible and the preloads seem to be going well.

Just one question, as I am not sure how Preload and RUCSS work together: if preload is slowed down (let's say the batch size is 5 instead of 45), does it have any impact on the rate at which we'll add RUCSS jobs to the table and send them? I don't think so, but I wanted a confirmation @wp-media/engineering-plugin-team

@MathieuLamiot
Contributor Author

@Khadreal Can you take over this issue and complete it?

  • Need to get an answer about this:

Just one question, as I am not sure about how Preload and RUCSS work together: if preload is slowed down (let's say batch size is 5 instead of 45), does it have any impact on the rate at which we'll add RUCSS jobs to the table and send them? I don't think so, but wanted a confirmation @wp-media/engineering-plugin-team

  • Adapt the built-in tests
  • Prepare the PR

@MathieuLamiot
Contributor Author

Summary of the functional behavior of the implemented solution, as of now

Functional behavior

The number of URLs to preload per batch becomes variable. It is now adjusted based on the time it takes to load a page. This time is estimated by periodically measuring how long a preload request takes and averaging over time.
The impact is that, on websites where loading an uncached page takes more than 2 seconds, the batch size will be reduced and hence preload will take longer.

Preparing a batch

When preparing a preload batch, the plugin computes the batch size based on the rocket_preload_previous_request_durations transient (an estimation of how long it takes to load a page).
There are safeguards so that the count of pending actions in AS is never above the value of the rocket_preload_cache_pending_jobs_cron_rows_count filter and, if possible, the batch size is at least the value of the rocket_preload_cache_min_in_progress_jobs_count filter.
If no estimation is available (first time using this feature, or first preload in at least 5 minutes), then the batch size defaults to the minimum.
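A minimal Python sketch of the batch-preparation safeguards described above (function and parameter names are illustrative; the real logic is PHP inside the plugin, and the duration-to-size formula shown is the one from the earlier prototype comment):

```python
def prepare_batch_size(estimated_duration_s, pending_in_queue,
                       min_batch=5, queue_cap=45):
    """Compute the next preload batch size.

    estimated_duration_s: value of the duration transient, or None if expired.
    pending_in_queue: preload actions already pending in Action Scheduler.
    min_batch / queue_cap mirror the two filters' defaults (5 and 45)."""
    headroom = queue_cap - pending_in_queue  # never overflow the AS queue
    if headroom <= 0:
        return 0
    if estimated_duration_s is None:
        return min(min_batch, headroom)  # no estimation yet: start small
    size = round((1 / estimated_duration_s) / 2 * 60)  # prototype formula
    # at least min_batch if headroom allows, never more than headroom
    return max(min(size, headroom), min(min_batch, headroom))
```

For example, with a 1 s estimated duration and an empty queue this yields 30; with a nearly full queue the headroom cap wins regardless of the estimation.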

Sending preload requests

When sending a preload request, if more than 1 minute has passed since the last estimation, we make the request blocking and measure how long it takes to return. The measured time is used to update the rocket_preload_previous_request_durations transient, with an expiration of 5 minutes.
When a new estimation is done, we set the rocket_preload_check_duration transient, with an expiration of 60 seconds. As long as this transient is set, no new estimation will occur.
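The measurement gating can be sketched as follows. The transient names are the real ones from the text; the helper functions and the in-memory store are illustrative stand-ins for WordPress's get_transient()/set_transient():

```python
import time

transients = {}  # name -> (value, expires_at); stand-in for the WP transient store

def set_transient(name, value, ttl):
    transients[name] = (value, time.time() + ttl)

def get_transient(name):
    entry = transients.get(name)
    if entry is None or entry[1] < time.time():
        return None  # missing or expired, like WP's false return
    return entry[0]

def should_measure():
    """A blocking measurement is allowed only once per minute."""
    return get_transient('rocket_preload_check_duration') is None

def record_measurement(duration_s):
    # estimation kept for 5 minutes; gate transient blocks re-measuring for 60 s
    set_transient('rocket_preload_previous_request_durations', duration_s, 5 * 60)
    set_transient('rocket_preload_check_duration', True, 60)
```

Once both transients expire with no preload activity, the next batch falls back to the minimum size, matching the "no estimation available" case above.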

Controlling the feature

Currently, this feature is applied by default.

Bypassing the feature

Bypassing the feature means having a constant preload batch size. In the current implementation, one must set both of these filters to the same value, that value being the desired preload batch size: rocket_preload_cache_min_in_progress_jobs_count and rocket_preload_cache_pending_jobs_cron_rows_count.
Note that the estimation of the loading time will still be performed.

List of filters

  • rocket_preload_cache_min_in_progress_jobs_count: New. Integer. Default: 5. Minimum number of URLs per batch. A batch can be smaller only if the AS queue is almost full (see rocket_preload_cache_pending_jobs_cron_rows_count).
  • rocket_preload_cache_pending_jobs_cron_rows_count: Already introduced. Integer. Default: 45. Target size of the AS preload queue. The batch size will never exceed (this value) - (number of tasks currently in the AS queue for preload).

List of transients

  • rocket_preload_previous_request_durations: Current estimation of the time to load a page. Expiry: 5 minutes.
  • rocket_preload_check_duration: Set if a load-time estimation has been done less than a minute ago. Expiry: 1 minute.
