New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mark long running jobs as timed out #280
Conversation
...instead of nuking them outright. Bug: T220423
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good for now, but I think the current system is flawed. Event::getStaleJobs()
returns any job that was created over an hour ago, regardless of its state. This means it could get marked as timed out just after it finally started. Better would be to change the job_submitted_at
to job_status_changed_at
(or even job_status_as_of
) that indicates how long it's been in the current state, and Event::getStaleJobs()
will off of that. So if a Job that has been "queued" for 59 minutes finally enters the "started" state, the clock will reset and it has another hour before it gets passed to handleStaleJobs()
. Does that makes sense? Would you be willing to take that on? If not no worries, it's sort of out of scope anyway.
There is one other concern to comes to mind. While a job is in the queued or started states, what happens when I try to manually browse to the URLs for the revision browser and reports? I think they should redirect back to the Event Summary page, since the data produced by those reports needs the Event update to finish (since it gives the page IDs). This can be done in a separate PR, though.
Interested to hear your input on the above, but otherwise code looks good except for the one comment at https://github.com/wikimedia/eventmetrics/pull/280/files#diff-0540b9242f475058ca0f274f47c2e075R143
$job->setStatus(Job::STATUS_FAILED_TIMEOUT); | ||
} | ||
if ($job->getSubmitted() >= $dayAgo) { | ||
$event->removeJob($job); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Job::isBusy()
gets Jobs that are either queued or started. The only other states are failed states, those are left intentionally so the user can see them. So I guess we don't need to remove the Job at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just makes sure that very old jobs don't get stored forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stored forever is fine, I think... If I leave a job running, and come back the next day, I might wonder why nothing happened even though I recall having initiated an update. I don't have any strong feelings, just a thought.
I don't think extra efforts would help here. The purpose of this ticket is to make jobs that likely have problems to become visibly timed out. From a product perspective, the decision is to consider jobs timed out after one hour, so I think that extra semantic correctness wouldn't help here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think extra efforts would help here. The purpose of this ticket is to make jobs that likely have problems to become visibly timed out. From a product perspective, the decision is to consider jobs timed out after one hour, so I think that extra semantic correctness wouldn't help here.
It's just kind of unfair to have your job queued up for so long, and then lose your place in line. I don't think we're anywhere near having this problem though, but it's something to keep in mind as our user base grows.
Approving but leaving the merge to you, in case you want to change something.
I created a PR for this at #283 |
...instead of nuking them outright.
Bug: T220423