Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark long running jobs as timed out #280

Merged
merged 1 commit into from Apr 9, 2019
Merged

Mark long running jobs as timed out #280

merged 1 commit into from Apr 9, 2019

Conversation

MaxSem
Copy link
Contributor

@MaxSem MaxSem commented Apr 8, 2019

...instead of nuking them outright.

Bug: T220423

...instead of nuking them outright.

Bug: T220423
Copy link
Member

@MusikAnimal MusikAnimal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good for now, but I think the current system is flawed. Event::getStaleJobs() returns any job that was created over an hour ago, regardless of its state. This means it could get marked as timed out just after it finally started. Better would be to change the job_submitted_at to job_status_changed_at (or even job_status_as_of) that indicates how long it's been in the current state, and Event::getStaleJobs() will off of that. So if a Job that has been "queued" for 59 minutes finally enters the "started" state, the clock will reset and it has another hour before it gets passed to handleStaleJobs(). Does that makes sense? Would you be willing to take that on? If not no worries, it's sort of out of scope anyway.

There is one other concern to comes to mind. While a job is in the queued or started states, what happens when I try to manually browse to the URLs for the revision browser and reports? I think they should redirect back to the Event Summary page, since the data produced by those reports needs the Event update to finish (since it gives the page IDs). This can be done in a separate PR, though.

Interested to hear your input on the above, but otherwise code looks good except for the one comment at https://github.com/wikimedia/eventmetrics/pull/280/files#diff-0540b9242f475058ca0f274f47c2e075R143

$job->setStatus(Job::STATUS_FAILED_TIMEOUT);
}
if ($job->getSubmitted() >= $dayAgo) {
$event->removeJob($job);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Job::isBusy() gets Jobs that are either queued or started. The only other states are failed states, those are left intentionally so the user can see them. So I guess we don't need to remove the Job at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just makes sure that very old jobs don't get stored forever.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stored forever is fine, I think... If I leave a job running, and come back the next day, I might wonder why nothing happened even though I recall having initiated an update. I don't have any strong feelings, just a thought.

@MaxSem
Copy link
Contributor Author

MaxSem commented Apr 9, 2019

I don't think extra efforts would help here. The purpose of this ticket is to make jobs that likely have problems to become visibly timed out. From a product perspective, the decision is to consider jobs timed out after one hour, so I think that extra semantic correctness wouldn't help here.

Copy link
Member

@MusikAnimal MusikAnimal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think extra efforts would help here. The purpose of this ticket is to make jobs that likely have problems to become visibly timed out. From a product perspective, the decision is to consider jobs timed out after one hour, so I think that extra semantic correctness wouldn't help here.

It's just kind of unfair to have your job queued up for so long, and then lose your place in line. I don't think we're anywhere near having this problem though, but it's something to keep in mind as our user base grows.

Approving but leaving the merge to you, in case you want to change something.

@MusikAnimal
Copy link
Member

There is one other concern to comes to mind. While a job is in the queued or started states, what happens when I try to manually browse to the URLs for the revision browser and reports? I think they should redirect back to the Event Summary page, since the data produced by those reports needs the Event update to finish (since it gives the page IDs). This can be done in a separate PR, though.

I created a PR for this at #283

@MaxSem MaxSem merged commit e352300 into master Apr 9, 2019
@MaxSem MaxSem deleted the job-timeout branch April 9, 2019 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants