
feat: Adds resendfailedbatches setting to allow failed statements to be retried. #367

Merged: 16 commits merged into xAPI-vle:master on Jan 8, 2019

Conversation

@garemoko (Contributor) commented Dec 20, 2018

Description
This PR does a few different things in order to improve the handling of failed batches.

  1. (No related issue)

Previously, emit_task.php would record in the trace a list of the event ids that the plugin had attempted to send to the LRS (some or all of which might have been in failed batches), with the message "Events [list of event ids] have been successfully sent to the LRS." Those events had not necessarily been sent successfully, and the event ids served no purpose because the events were no longer in the database to look up by id.

Now emit_task.php instead records the number of successful and unsuccessful events separately, so anyone looking at the logs can see how many events succeeded and how many failed.
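
A minimal sketch of what that trace might look like, assuming each entry in $loadedevents carries a boolean 'loaded' flag (the variable names and structure here are assumptions for illustration, not the plugin's actual code):

$successcount = count(array_filter($loadedevents, function ($loadedevent) {
    return $loadedevent['loaded'] === true;
}));
$failedcount = count($loadedevents) - $successcount;
mtrace("Successfully sent $successcount events to the LRS; $failedcount events failed to send.");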

  2. (Fixes "Undefined variable: eventobj" #356)

Previously load_batch.php, in the event of a failed xAPI request, would try to trace the event id using the $eventobj->id variable. However, this only resulted in an error because:
A. The $eventobj variable did not exist in that scope.
B. It couldn't exist, because the xAPI request would normally relate to multiple events, not a single id.
C. Having the event id(s) would not have helped anyway, because the database row id of a deleted row can't be looked up (the event gets a new id in the failed log table).

Now the error instead includes the size of the batch that was rejected.

  3. (Fixes "re-send statements one by one if batch fails" #359)

Previously, if a batch failed, all events in that batch were marked as failed, even if only one event in the batch had an issue.

Now the plugin will recursively retry the events in smaller and smaller batches (half the size each time) until sending succeeds or the batch size gets down to one. This means that events will not be rejected just for being in the same batch as a bad statement. It also makes debugging bad events easier, because you'll end up with a bad request containing just one event's statement(s).

Note: I have not tested this with a large data set, only with a pair of events, one of which is bad.
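
A rough sketch of the retry flow (for illustration only, not the exact code in this PR; the $loader callable and the construct_loaded_events() helper follow the snippets quoted later in this thread, everything else is assumed):

function load_batch(array $config, array $transformedevents, callable $loader) {
    try {
        // Attempt to send the whole batch to the LRS in one request.
        return $loader($config, $transformedevents);
    } catch (\Exception $e) {
        $batchsize = count($transformedevents);
        // Give up on this batch unless it can still be split, the failure was a
        // 400 Bad Request, and the resend setting is enabled.
        if ($batchsize === 1 || $e->getCode() !== 400 || $config['lrs_resend_failed_batches'] !== '1') {
            return construct_loaded_events($transformedevents, false);
        }
        // Otherwise halve the maximum batch size and retry each half separately,
        // so only the half containing the bad statement keeps shrinking.
        $newconfig = $config;
        $newconfig['lrs_max_batch_size'] = (int) round($batchsize / 2);
        $loadedevents = [];
        foreach (array_chunk($transformedevents, $newconfig['lrs_max_batch_size']) as $smallerbatch) {
            $loadedevents = array_merge($loadedevents, load_batch($newconfig, $smallerbatch, $loader));
        }
        return $loadedevents;
    }
}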

Related Issues

  • #356, #359
PR Type

  • Fixes and Enhancement

@garemoko (Contributor Author):

@ryansmith94 This and the other two PRs I've opened are ready for review.

$logerror($e->getTraceAsString());
$loadedevents = construct_loaded_events($transformedevents, false);

// Recursively retry sending statements in increasingly smaller batches so that only the actual bad data fails.
Member:

I think this will cause some issues: it will increase the time the cron task takes to execute, so the next cron run will try to process the same events and we'll end up with duplicate statements. Hope that makes sense.

Contributor Author:

I believe Moodle cron has built-in locking that prevents that from happening. See "Running cron on multiple servers" here: https://docs.moodle.org/36/en/Cron

Member:

Ah didn't know that, good spot.

Tasks can run in parallel and processes use locking to prevent tasks from running at the same time which allows cron to be triggered from multiple web servers that serve the same Moodle instance.

Do you think it would be better to just validate the statements before we send them? Otherwise if the first statement in the batch is invalid we might send X requests where X is the max statements in a batch.

@garemoko (Contributor Author) commented Dec 21, 2018:

Validating before we send them would be good, but it would take a lot of effort to make our validation as thorough as, and consistent with, the validation done by the LRS.

X requests would not be sent in that scenario. Let's say it's a batch of 10 statements and the first statement is bad.

Round 1: 1 batch of 10 statements fails
Round 2: 2 batches of 5 statements (1 fails, 1 succeeds)
Round 3: 2 batches of up to 3 statements (1 fails, 1 succeeds)
Round 4: 2 batches of up to 2 statements (1 fails, 1 succeeds)
Round 5: 2 batches of 1 statement (1 fails, 1 succeeds)

Total requests: 9. Ok, that's actually pretty close to 10, but if you have up to 12 events, it's still only 9 requests!
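
For reference, here's a tiny recursive simulation of the worst-case request count (illustrative only, not code from this PR; count_requests is a hypothetical helper):

function count_requests(int $batchsize): int {
    if ($batchsize === 1) {
        return 1; // the lone bad statement fails on its own
    }
    $half = (int) round($batchsize / 2);
    // One request for the failed batch, one for the half that succeeds, plus
    // whatever the half containing the bad statement still needs; worst case
    // assumes the bad statement always lands in the larger half.
    return 2 + count_requests(max($half, $batchsize - $half));
}

echo count_requests(10) . "\n"; // 9
echo count_requests(12) . "\n"; // 9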

I'm open to suggestions of a better approach. The current approach of failing the whole batch is not great.

Member:

Yeah, matching the validation of the LRS is pretty much impossible given the slight differences between LRSs and the little holes in the conformance tests. I'd be interested to know what other people think about this issue (@davidpesce and @caperneoignis). I don't think the current approach is too bad; I agree it's not ideal, but I'm not sure what the ideal solution is either.

Contributor Author:

I made changes so the recursive retry functionality only triggers in the event of a 400 Bad Request response from the LRS. This ensures that, in the event of slow responses, we're not retrying over and over and blocking cron.

I do still think we should retry in the event of 404, 429, 503 and 504, but that can be handled as a separate issue, and probably means retrying next time cron runs rather than in the same cron task.

I like the approach of tackling specific error codes or groups of error codes one by one, rather than having a catch-all approach, so I'm happy for this to start with just 400 errors.

Member:

Sorry, I avoided checking in here to keep out of trouble with the better half. I'm quite happy to retry on any error code and perhaps only put events in the failed log if we get a 400-499 error code.

Member:

Also putting this behind a flag would be good I think.

Contributor Author:

I added a setting but haven't tested it just yet. I'll comment again once I have tested. (It's passing automated tests)

Member:

Cool thanks Andrew 👍

@ryasmi ryasmi added the feat label Dec 21, 2018
@ryasmi ryasmi self-assigned this Dec 21, 2018
@garemoko (Contributor Author) commented Jan 7, 2019:

@ryansmith94 this has been tested with the setting both on and off, using batches of 2 statements where one is good and one is bad. In both cases it worked as expected.

I'm expecting it to pass automated testing and then it should be good for you to review again.

@ryasmi (Member) left a comment:

Thanks @garemoko. Made some comments that you might find interesting/useful but no need to make any changes.

// In the event of a 400 error, recursively retry sending statements in increasingly
// smaller batches so that only the actual bad data fails.
if ($batchsize === 1 || $e->getCode() !== 400 || $config['lrs_resend_failed_batches'] !== '1') {
$loadedevents = construct_loaded_events($transformedevents, false);
Member:

Personally, I would return early here instead of assigning to loadedevents to be returned later, because it's quicker to read: you don't need to scroll down looking for uses or modifications of loadedevents. Happy to merge with this as-is; just sharing in case you agree and find it useful in future.

Contributor Author:

Yep, good point.


// In the event of a 400 error, recursively retry sending statements in increasingly
// smaller batches so that only the actual bad data fails.
if ($batchsize === 1 || $e->getCode() !== 400 || $config['lrs_resend_failed_batches'] !== '1') {
@ryasmi (Member) commented Jan 8, 2019:

I might have flipped this if around and just returned construct_loaded_events($transformedevents, false); outside, without an else, but that is very much a personal preference. I think I prefer it that way because it means you don't have to negate the condition (lacking a better phrase, hope that makes sense). You could instead write the following.

$hasInvalidStatement = $e->getCode() === 400;
$isResendEnabled = $config['lrs_resend_failed_batches'] === '1';

if ($batchsize > 1 && $hasInvalidStatement && $isResendEnabled) {
  // ...
}

You could even extract the resend logic into a function to remove the need for the comment that currently explains the if block.

$hasInvalidStatement = $e->getCode() === 400;
$isResendEnabled = $config['lrs_resend_failed_batches'] === '1';

if ($batchsize > 1 && $hasInvalidStatement && $isResendEnabled) {
  retry_in_smaller_batches($config, $transformedevents, $loader);
}
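
For illustration, the extracted helper could look something like this (a sketch under the same assumptions as above; the helper name comes from the snippet, and load_batch is assumed to be the surrounding load function):

function retry_in_smaller_batches(array $config, array $transformedevents, callable $loader) {
    $newconfig = $config;
    $newconfig['lrs_max_batch_size'] = (int) round(count($transformedevents) / 2);
    $loadedevents = [];
    // Split the failed batch in half and push each half back through the load function.
    foreach (array_chunk($transformedevents, $newconfig['lrs_max_batch_size']) as $smallerbatch) {
        $loadedevents = array_merge($loadedevents, load_batch($newconfig, $smallerbatch, $loader));
    }
    return $loadedevents;
}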

Contributor Author:

I think in the long run we'll end up with an if on batchsize > 1 and the setting, containing a switch over the various types of error.

Extracting the logic does look clean.

@ryasmi (Member) commented Jan 10, 2019:

Yeah sounds good 👍

$loadedevents = construct_loaded_events($transformedevents, false);
} else {
$newconfig = $config;
$newconfig['lrs_max_batch_size'] = round($batchsize / 2);
Member:

Really like the way you solved this recursively with a modified copy of $config.

@ryasmi ryasmi changed the title from "Improved handling of failed batches" to "feat: Adds resendfailedbatches setting to allows failed statements to be retried." Jan 8, 2019
@ryasmi ryasmi changed the title to "feat: Adds resendfailedbatches setting to allow failed statements to be retried." Jan 8, 2019
@ryasmi ryasmi merged commit 803bb4c into xAPI-vle:master Jan 8, 2019
@HT2Bot (Member) commented Jan 8, 2019:

🎉 This PR is included in version 4.2.0 🎉

The release is available on GitHub releases.

Your semantic-release bot 📦🚀

@HT2Bot HT2Bot added the released label Jan 8, 2019
@garemoko garemoko deleted the issue-356-eventobj branch January 8, 2019 10:28
@ryasmi ryasmi linked an issue (re-send statements one by one if batch fails #359) Feb 14, 2020 that may be closed by this pull request