Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit numbers of pending RUCSS jobs to avoid being late on fetching the result and hence getting a 400 Bad Json #6213

Closed
MathieuLamiot opened this issue Oct 10, 2023 · 11 comments · Fixed by #6230
Assignees
Labels
effort: [M] 3-5 days of estimated development time module: remove unused css needs: documentation Issues which need to create or update a documentation priority: high Issues which should be resolved as quickly as possible type: enhancement Improvements that slightly enhance existing functionality and are fast to implement
Milestone

Comments

@MathieuLamiot
Copy link
Contributor

Context
Several users with an high number of pages on their websites reported issues with RUCSS feature, resulting in "400: Bad JSON".
Some investigations showed that, when starting the process of RUCSS optimization on the plugin for the whole website, the first URLs were correctly optimized but at some point, all jobs started to return 400: Bad JSON. This seem linked to the fact that the website does not manage to fetch all jobs in the expected schedule: too many actions to fecth results are planned and a good part of them take place to late, when the job is stale on the server side.

Expected behavior
The plugin should not send new jobs if it is not capable of fetching them back in the appropriate time window. Therefore, we should keep the number of pending jobs under control:

  • If the website is fast enough, jobs start in pending and are quickly completed, allowing for more jobs to be created immediately.
  • If the website is slow, jobs start in pending and we wait for them to be completed. There is no use putting more jobs in the queue as the website won't be able to fetch them in time anyway.
    The maximum number of pending + in progress job must be configurable with a filter for instance. With a special value (for instance 0?) the configuration means that the limit must be disable (come back to the current state).

Acceptance Criteria

  • No regression on the RUCSS feature. Especially in case of 408: the retry job must be created no matter what.
  • Emulate a website that never tries to fetch results. Start RUCSS on all website. The number of pending jobs must stop at the given threshold.
  • Then, complete and/or fail one job. A new job must be created for a new page.
  • The filter allows to change the number of pending jobs authorized, and also to remove the limitation.
@MathieuLamiot
Copy link
Contributor Author

From @wordpressfan: (Slack)

Let's think about this, we need a way to stop sending the request of registering a new job when we have enough pending jobs (we can define the word enough later)
to do that we can easily create a new filter here:

public function add_url_to_the_queue( string $url, bool $is_mobile ) {

As a switch to enable/disable registering new jobs from the plugin, then we can add a callback to change this switch based on the number of pending+inprogress jobs so for example we get 100 pending rows to process every minute, then we need to stop registering new jobs once we reach 500 rows (pending+inprogress)

@MathieuLamiot
Copy link
Contributor Author

MathieuLamiot commented Oct 10, 2023

@wordpressfan I agree with the approach. However, it should not impact the 408 retry mechanism here
What do you think of adding an optional argument force to add_url_to_the_queue, set to False by default.
And in the method add_url_to_the_queue:

if ( force == False ) {
   $max_pending_jobs = apply_filters( 'rocket_rucss_max_pending_jobs', MAGIC_CONSTANT )
   if ( 0 < $max_pending_jobs) {
      $nb_pending_jobs = count_pending_jobs()
      if ( nb_pending_jobs >= max_pending_jobs) {
         return false;
      }
   }
}

Where MAGIC_CONSTANT is to be defined (> 100 not to block the default batches, and not too big to avoid the problem we currently have. Maybe 300?)
And where count_pending_jobs is just a query on the RUCSS DB.

@piotrbak, what do you think in terms of product impact?

@piotrbak piotrbak added type: enhancement Improvements that slightly enhance existing functionality and are fast to implement priority: high Issues which should be resolved as quickly as possible needs: grooming module: remove unused css labels Oct 10, 2023
@MathieuLamiot
Copy link
Contributor Author

MathieuLamiot commented Oct 10, 2023

@wordpressfan @vmanthos what was discussed during the daily is that we could replace MAGIC_CONSTANT in the above suggestion with a calculated value?
The one that could be relevant would be this one: n * apply_filters( 'rocket_rucss_pending_jobs_cron_rows_count', 100 )
What do you think?
This way, we would scale how many jobs we send based on how many jobs we are trying to collect.

I would suggest n to be ~3 (it has to be more than 2 and not to big...)
So something like:

$max_pending_jobs = apply_filters('rocket_rucss_max_pending_jobs', 3 * apply_filters( 'rocket_rucss_pending_jobs_cron_rows_count', 100 ) )

(And please, let's not hard code 3 and 100 but use (shared) constants instead? ; maybe put the inner apply_filter in a method as we use it at different places in the codebase)

@MathieuLamiot
Copy link
Contributor Author

MathieuLamiot commented Oct 10, 2023

I'm just worried about hammering the DB with the COUNT query every time a job is added 🤔
It's not a costly query and the DB is small... I don't really know

@wordpressfan
Copy link
Contributor

Yes it's a small query and the table is indexed well so I believe we are fine but this is a good concern, thanks for discussing it.

The big picture that I was thinking of last night is:
What if we made the process of registering the new job is happening through action scheduler and CRON too?
So the process will be as follows:

  1. The not cached page is visited for the first time, we will create a row in RUCSS table with the status pending_create for example and in this step we will not send the RUCSS SaaS job creation request.
  2. Will have a CRON to get some of those pending_create rows and create async job in action scheduler to create them.
  3. Once the job is created, the status will be pending and get into the normal process.

This will give us the control of how many jobs are created at once and not to miss any job but also will add a layer of complexity.

@wordpressfan
Copy link
Contributor

Reproduce the issue

Yes I can see the issue in one of our customer's site.

Root cause

Jobs are kept in plugin's database even when they are removed from SaaS side so every check status after that time will fail because SaaS at this point returns empty response.

Scope a solution

As We mentioned above we need to add a new status pending_create and stop registering jobs into SaaS directly.

  1. Here

    public function create_new_job( string $url, string $job_id, string $queue_name, bool $is_mobile = false ) {
    if ( ! self::$table_exists && ! $this->table_exists() ) {
    return false;
    }
    $item = [
    'url' => untrailingslashit( $url ),
    'is_mobile' => $is_mobile,
    'job_id' => $job_id,
    'queue_name' => $queue_name,
    'status' => 'pending',
    'retries' => 0,
    'last_accessed' => current_time( 'mysql', true ),
    ];
    return $this->add_item( $item );
    }

    we need to make $job_id and $queue_name optional arguments, also change the status here from pending to pending_create

  2. Make $job_id argument optional here with default value as empty string:

    public function reset_job( int $id, string $job_id ) {
    if ( ! self::$table_exists && ! $this->table_exists() ) {
    return false;
    }
    return $this->update_item(
    $id,
    [
    'job_id' => $job_id,
    'status' => 'pending',
    'error_code' => '',
    'error_message' => '',
    'retries' => 0,
    'modified' => current_time( 'mysql', true ),
    ]
    );
    }

And change the status to pending_create.

  1. Remove this part of code from here

    if ( false === $add_to_queue_response || ! isset( $add_to_queue_response['contents'], $add_to_queue_response['contents']['jobId'], $add_to_queue_response['contents']['queueName'] ) ) {
    return $html;
    }
    /**
    * Lock preload URL.
    *
    * @param string $url URL to lock
    */
    do_action( 'rocket_preload_lock_url', $url );
    // We got jobid and queue name so save them into the DB and change status to be pending.
    $this->used_css_query->create_new_job(
    $url,
    $add_to_queue_response['contents']['jobId'],
    $add_to_queue_response['contents']['queueName'],
    $is_mobile
    );

  2. Change the functionality of this method add_url_to_the_queue to only:

$used_css_row = $this->used_css_query->get_row( $url, $is_mobile );
if ( empty ( $used_css_row ) ) {
    $this->used_css_query->create_new_job( $url, '', '', $is_mobile );
} else {
    $this->used_css_query->reset_job( $used_css_row->id );
}
  1. Remove this part:

    if ( false !== $add_to_queue_response ) {
    $new_job_id = $add_to_queue_response['contents']['jobId'];
    $this->used_css_query->update_job_id( $id, $new_job_id );
    }

  2. Remove this part:

    if ( false !== $add_to_queue_response ) {
    $new_job_id = $add_to_queue_response['contents']['jobId'];
    $this->used_css_query->reset_job( $id, $new_job_id );
    }

  3. For the CRON we can use the current CRON rocket_rucss_pending_jobs to process also pending_create jobs by adding a new callback here:

    'rocket_rucss_pending_jobs' => 'process_pending_jobs',

  4. Here: https://github.com/wp-media/wp-rocket/blob/06381b8860eb46e570911b2812652f54a676dd6d/inc/Engine/Optimization/RUCSS/Controller/UsedCSS.php#L871-L870
    Create a new method process_pending_create_jobs that will be called exactly like process_pending_jobs

  5. In this new method process_pending_create_jobs we will get rows with pending_create status and do the same functionality we already do inside process_pending_jobs with a filter to change the number of rows to handle, with a default value (we need to agree on that default value because we only need to keep the number of pending rows below a specific number and use this max_pending_count to decide the number of pending_create to process)

  6. Here we are between two approaches of creating async action scheduler's task for each row to call SaaS to create the job.

Effort

[M] or [L]

@CrochetFeve0251 @wp-media/php to review plz.

@MathieuLamiot
Copy link
Contributor Author

Looks good overall. Some comments:
On point 1:

  • we need to make $job_id and $queue_name optional arguments -> Let's just remove those arguments in the method, and pass an empty string to the DB, no?
  • To ease discussion, let's not re-use pending in the status name. Maybe 'to_submit'?

Point 2: Same here, let's just remove the job_id argument.

Point 3 & 4:
Here again, for naming clarity we should properly name methods to avoid confusion.
If I recap your suggestion, my understanding would be:

public function prepare_job( $url, $is_mobile ) {
    $used_css_row = $this->used_css_query->get_row( $url, $is_mobile );
    if ( empty ( $used_css_row ) ) {
        $this->used_css_query->create_new_job( $url, '', '', $is_mobile );
    } else {
        $this->used_css_query->reset_job( $used_css_row->id );
    }
}
public function send_job( $used_css_row ) {
$url = $used_css_row->url;
$is_mobile = $used_css_row->is_mobile; 
// CURRENT CONTENT OF add_url_to_the_queue
 /**
		 * Filters the RUCSS safelist
		 *
		 * @since 3.11
		 *
		 * @param array $safelist Array of safelist values.
		 */
		$safelist = apply_filters( 'rocket_rucss_safelist', $this->options->get( 'remove_unused_css_safelist', [] ) );

		/**
		 * Filters the styles attributes to be skipped (blocked) by RUCSS.
		 *
		 * @since 3.14
		 *
		 * @param array $skipped_attr Array of safelist values.
		 */
		$skipped_attr = apply_filters( 'rocket_rucss_skip_styles_with_attr', [] );
		$skipped_attr = ( is_array( $skipped_attr ) ) ? $skipped_attr : [];

		$config = [
			'treeshake'      => 1,
			'rucss_safelist' => $safelist,
			'skip_attr'      => $skipped_attr,
			'is_mobile'      => $is_mobile,
			'is_home'        => $this->is_home( $url ),
		];

		$add_to_queue_response = $this->api->add_to_queue( $url, $config );
		if ( 200 !== $add_to_queue_response['code'] ) {
			Logger::error(
				'Error when contacting the RUCSS API.',
				[
					'rucss error',
					'url'     => $url,
					'code'    => $add_to_queue_response['code'],
					'message' => $add_to_queue_response['message'],
				]
			);

			return false;
		}
// END OF CURRENT CONTENT OF add_url_to_the_queue

			/**
			 * Lock preload URL.
			 *
			 * @param string $url URL to lock
			 */
			do_action( 'rocket_preload_lock_url', $url );

			// We got jobid and queue name so save them into the DB and change status to be pending.
			$this->used_css_query->add_job_reference(
				$add_to_queue_response['contents']['jobId'],
				$add_to_queue_response['contents']['queueName'],
			);

add_job_reference has to be created.
We can delete add_url_to_the_queue and replace occurrences with prepare_job

in the if ( empty( $used_css ) ) { we would just call prepare_job.

Point 5: Make sure to add do_action( 'rocket_preload_unlock_url', $row_details->url ); after calling prepare_job: the job will not be immediately sent to the server, so we should unlock preload in the meantime.

Point 9: We can use this limit for the number of "to_submit" jobs:
$max_pending_jobs = apply_filters('rocket_rucss_max_pending_jobs', 3 * apply_filters( 'rocket_rucss_pending_jobs_cron_rows_count', 100 ) )

Point 10: Do you foresee an issue in using a for loop to go through the (up to) max_pending_jobs RUL and making the API calls? for fetching the results, we use async tasks for each API call. Is there a reason for this approach?

@CrochetFeve0251
Copy link
Contributor

@wordpressfan The grooming looks good but there is one point I am not sure about is the rocket_preload_lock_url filter disapearing. Why are we sure we do not need to lock the URLs for the preload anymore?

Another thing this issue is working on the exact same methods with different AC. I guess we need some communication to not end up with monster conflicts.

@MathieuLamiot
Copy link
Contributor Author

rocket_preload_lock_url I think it should be kept and used in the CRON job part, when actually sending the job to the SaaS. that could be in the send_job method as suggested above

@CrochetFeve0251 CrochetFeve0251 added effort: [M] 3-5 days of estimated development time and removed needs: grooming labels Oct 11, 2023
@wordpressfan
Copy link
Contributor

After a discussion with @MathieuLamiot and @CrochetFeve0251

  • Yes we will keep the rocket_preload_lock_url action but will move it from its current place to be in the method that will run for each url after sending the job to SaaS.
  • For the action scheduler part we need to do the same functionality we do now for pending rows, so we will create Async job for every database row to send the request to SaaS.

@CrochetFeve0251 CrochetFeve0251 self-assigned this Oct 12, 2023
@MathieuLamiot MathieuLamiot added this to the 3.15.3 milestone Oct 18, 2023
@MathieuLamiot
Copy link
Contributor Author

@CrochetFeve0251 Can you provide a documentation for the new submission system and the related filters, so that support can learn how to use them? Thanks!

@piotrbak piotrbak added the needs: documentation Issues which need to create or update a documentation label Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort: [M] 3-5 days of estimated development time module: remove unused css needs: documentation Issues which need to create or update a documentation priority: high Issues which should be resolved as quickly as possible type: enhancement Improvements that slightly enhance existing functionality and are fast to implement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants