Mark long running jobs as timed out #280

Merged
MaxSem merged 1 commit into master from job-timeout on Apr 9, 2019

@MaxSem (Contributor) commented Apr 8, 2019

...instead of nuking them outright.

Bug: T220423

@MusikAnimal (Contributor) left a comment

This looks good for now, but I think the current system is flawed. Event::getStaleJobs() returns any job that was created over an hour ago, regardless of its state. This means a job could get marked as timed out just after it finally started. Better would be to change job_submitted_at to job_status_changed_at (or even job_status_as_of), which indicates how long the job has been in its current state, and have Event::getStaleJobs() work off of that. So if a Job that has been "queued" for 59 minutes finally enters the "started" state, the clock resets and it has another hour before it gets passed to handleStaleJobs(). Does that make sense? Would you be willing to take that on? If not, no worries; it's sort of out of scope anyway.
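A minimal sketch of the proposed alternative, assuming a simplified in-memory `Job` with a hypothetical `$statusChangedAt` timestamp (standing in for the proposed job_status_changed_at column) and hypothetical integer status constants. It is not the eventmetrics entity or the real Event::getStaleJobs() query. Each status change resets the clock, so staleness is measured from the last transition rather than from submission:

```php
<?php
// Sketch only: a simplified, hypothetical Job model (not the eventmetrics
// entity). $statusChangedAt stands in for the proposed job_status_changed_at
// column; the status values are assumptions.
final class Job
{
    public const STATUS_QUEUED = 0;
    public const STATUS_STARTED = 1;
    public const STATUS_FAILED_TIMEOUT = 2;

    /** @var int */
    private $status = self::STATUS_QUEUED;

    /** @var DateTimeImmutable */
    private $statusChangedAt;

    public function __construct(DateTimeImmutable $submittedAt)
    {
        // A new job starts out queued; its clock starts at submission.
        $this->statusChangedAt = $submittedAt;
    }

    public function getStatus(): int
    {
        return $this->status;
    }

    // Every transition resets the clock, so a job that waited in the
    // queue gets a fresh window once it actually starts.
    public function setStatus(int $status, DateTimeImmutable $now): void
    {
        $this->status = $status;
        $this->statusChangedAt = $now;
    }

    public function isBusy(): bool
    {
        return in_array($this->status, [self::STATUS_QUEUED, self::STATUS_STARTED], true);
    }

    // Stale means: still busy, and unchanged for longer than $maxAge.
    public function isStale(DateTimeImmutable $now, DateInterval $maxAge): bool
    {
        return $this->isBusy() && $this->statusChangedAt < $now->sub($maxAge);
    }
}

// Usage: queued for 59 minutes, then started.
$submitted = new DateTimeImmutable('2019-04-09 10:00:00');
$hour = new DateInterval('PT1H');
$job = new Job($submitted);
$job->setStatus(Job::STATUS_STARTED, $submitted->modify('+59 minutes'));

var_dump($job->isStale($submitted->modify('+90 minutes'), $hour)); // bool(false)
var_dump($job->isStale($submitted->modify('+2 hours'), $hour));    // bool(true)
```

With this model, a job that sat in the queue for 59 minutes is not considered stale until an hour after it started, while a job that never leaves the queue still times out an hour after submission.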

There is one other concern that comes to mind. While a job is in the queued or started state, what happens when I manually browse to the URLs for the revision browser and reports? I think they should redirect back to the Event Summary page, since the data produced by those reports needs the Event update to finish (it provides the page IDs). This can be done in a separate PR, though.
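A minimal sketch of that redirect rule, assuming simplified stand-ins for `Event` and `Job` and hypothetical route names (`event_summary`, `revisions`, `reports`). In a controller, the returned route name would then feed the actual redirect; none of these names are taken from the eventmetrics code:

```php
<?php
// Sketch only: minimal stand-ins for Event and Job. Route names and
// status values are hypothetical, not the eventmetrics API.
final class Job
{
    public const STATUS_QUEUED = 0;
    public const STATUS_STARTED = 1;
    public const STATUS_FAILED_TIMEOUT = 2;

    /** @var int */
    private $status;

    public function __construct(int $status)
    {
        $this->status = $status;
    }

    public function isBusy(): bool
    {
        return in_array($this->status, [self::STATUS_QUEUED, self::STATUS_STARTED], true);
    }
}

final class Event
{
    /** @var Job[] */
    private $jobs;

    public function __construct(array $jobs)
    {
        $this->jobs = $jobs;
    }

    /** @return Job[] */
    public function getJobs(): array
    {
        return $this->jobs;
    }
}

// Returns the route to render: the requested report, or the Event Summary
// while an update is still queued or running (page IDs aren't ready yet).
function resolveReportRoute(Event $event, string $requestedRoute): string
{
    foreach ($event->getJobs() as $job) {
        if ($job->isBusy()) {
            return 'event_summary';
        }
    }
    return $requestedRoute;
}

echo resolveReportRoute(new Event([new Job(Job::STATUS_STARTED)]), 'revisions'), "\n"; // event_summary
echo resolveReportRoute(new Event([]), 'revisions'), "\n";                             // revisions
```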

Interested to hear your input on the above, but otherwise code looks good except for the one comment at https://github.com/wikimedia/eventmetrics/pull/280/files#diff-0540b9242f475058ca0f274f47c2e075R143

$job->setStatus(Job::STATUS_FAILED_TIMEOUT);
}
if ($job->getSubmitted() >= $dayAgo) {
$event->removeJob($job);

@MusikAnimal (Contributor) commented Apr 9, 2019

Job::isBusy() matches Jobs that are either queued or started. The only other states are failed states, and those are left intentionally so the user can see them. So I guess we don't need to remove the Job at all.

@MaxSem (Author, Contributor) commented Apr 9, 2019

This just makes sure that very old jobs don't get stored forever.

@MusikAnimal (Contributor) commented Apr 9, 2019

Stored forever is fine, I think... If I leave a job running, and come back the next day, I might wonder why nothing happened even though I recall having initiated an update. I don't have any strong feelings, just a thought.
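The inline thread above concerns the removal branch of the stale-job handling. As a self-contained sketch of the behavior the PR describes (busy jobs submitted over an hour ago are marked as timed out rather than deleted, and jobs submitted more than a day ago are dropped so they are not stored forever), assuming simplified stand-ins for `Event`, `Job`, and `handleStaleJobs()` that only mirror the method names in the diff excerpt; status values and class shapes are hypothetical:

```php
<?php
// Sketch only: simplified, hypothetical stand-ins that mirror the method
// names in the diff excerpt (setStatus, getSubmitted, removeJob).
final class Job
{
    public const STATUS_QUEUED = 0;
    public const STATUS_STARTED = 1;
    public const STATUS_FAILED_TIMEOUT = 2;

    private $status;
    private $submitted;

    public function __construct(int $status, DateTimeImmutable $submitted)
    {
        $this->status = $status;
        $this->submitted = $submitted;
    }

    public function getStatus(): int { return $this->status; }
    public function setStatus(int $status): void { $this->status = $status; }
    public function getSubmitted(): DateTimeImmutable { return $this->submitted; }

    public function isBusy(): bool
    {
        return in_array($this->status, [self::STATUS_QUEUED, self::STATUS_STARTED], true);
    }
}

final class Event
{
    /** @var Job[] */
    private $jobs = [];

    public function addJob(Job $job): void { $this->jobs[] = $job; }

    public function removeJob(Job $job): void
    {
        $this->jobs = array_values(array_filter(
            $this->jobs,
            function (Job $j) use ($job): bool { return $j !== $job; }
        ));
    }

    /** @return Job[] */
    public function getJobs(): array { return $this->jobs; }
}

function handleStaleJobs(Event $event, DateTimeImmutable $now): void
{
    $hourAgo = $now->modify('-1 hour');
    $dayAgo = $now->modify('-1 day');

    // getJobs() returns a copy of the array, so removing while iterating is safe.
    foreach ($event->getJobs() as $job) {
        // Busy for over an hour: keep the job but make the timeout visible.
        if ($job->isBusy() && $job->getSubmitted() < $hourAgo) {
            $job->setStatus(Job::STATUS_FAILED_TIMEOUT);
        }
        // Submitted more than a day ago: drop it so old jobs aren't kept forever.
        if ($job->getSubmitted() < $dayAgo) {
            $event->removeJob($job);
        }
    }
}

// Usage: a job stuck in the queue for two hours.
$now = new DateTimeImmutable('2019-04-09 12:00:00');
$event = new Event();
$stuck = new Job(Job::STATUS_QUEUED, $now->modify('-2 hours'));
$event->addJob($stuck);
handleStaleJobs($event, $now);
var_dump($stuck->getStatus() === Job::STATUS_FAILED_TIMEOUT); // bool(true)
```

Keeping the timed-out job for up to a day matches the concern that a user returning later should still see why nothing happened, while the one-day cutoff reflects the "don't store forever" rationale.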

@MaxSem (Author, Contributor) commented Apr 9, 2019

I don't think extra effort would help here. The purpose of this ticket is to make jobs that likely have problems visibly show up as timed out. From a product perspective, the decision is to consider jobs timed out after one hour, so I don't think the extra semantic correctness would help here.

@MusikAnimal (Contributor) left a comment

> I don't think extra effort would help here. The purpose of this ticket is to make jobs that likely have problems visibly show up as timed out. From a product perspective, the decision is to consider jobs timed out after one hour, so I don't think the extra semantic correctness would help here.

It's just kind of unfair to have your job queued up for so long and then lose your place in line. I don't think we're anywhere near having this problem, but it's something to keep in mind as our user base grows.

Approving but leaving the merge to you, in case you want to change something.

@MusikAnimal (Contributor) commented Apr 9, 2019

> There is one other concern that comes to mind. While a job is in the queued or started state, what happens when I manually browse to the URLs for the revision browser and reports? I think they should redirect back to the Event Summary page, since the data produced by those reports needs the Event update to finish (it provides the page IDs). This can be done in a separate PR, though.

I created a PR for this at #283

@MaxSem merged commit e352300 into master on Apr 9, 2019

4 of 5 checks passed:

codeclimate/diff-coverage: 0% (50% threshold)
Scrutinizer Analysis: No new issues; Tests: passed
codeclimate: All good!
codeclimate/total-coverage: 95%
continuous-integration/travis-ci/pr: The Travis CI build passed

@MaxSem deleted the job-timeout branch on Apr 9, 2019
