Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register logfile before starting assignment #263

Merged
merged 2 commits into from
Apr 9, 2015

Conversation

guidow
Copy link
Contributor

@guidow guidow commented Apr 7, 2015

If we set the state of tasks not just on the task itself, but also on the current tasklog, we must make sure the tasklog is registered before we do that, otherwise that will fail with a 404.

In extreme cases, where assignments finish very fast, even uploading the final tasklog file at the end may fail because of this. (This can happen in practice if a jobtype class decides that no process needs to be started.)

Unfortunately, this PR still has a problem: When setting our agent's state, the request fails with this message:

2015-04-07 17:36:57 INFO     - pf.jobtypes.core - Spawning Command(command='/usr/bin/blender', arguments=('-b', '/home/guido/stift_aniex43_finish.blend', '-s', '196.0', '-e', '200.0', '-o', '/tmp/stift_####.exr', '-a'), cwd='/home/guido/git/pyfarm', user=None, group=None, env={})
2015-04-07 17:36:57 ERROR    - pf.agent.http.client - POST http://127.0.0.1:5000/api/v1/agents/a113b5a8-9822-413b-b1fd-fce3014956cf has failed (uid: 2d57ff05bd4f):
Traceback (most recent call last):
Failure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]

2015-04-07 17:36:57 ERROR    - pf.agent.service - Failed to announce self to the master: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>].  Will retry in 1.96419640754 seconds.

It works on retry, but I have no idea where this problem comes from.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.63%) to 75.54% when pulling 29b2faa on guidow:register_log_before_start into 606c366 on pyfarm:master.

self._started_deferred = None

@property
def stopped_deferred(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you go over the specifics as for why we're getting rid of these properties instead of updating them somehow? These are setup to try and ensure it's harder to missorder the setup of start_deferred/stop_deferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Process._start() will now register the tasklog before actually starting the process, it will have to return these two properties before the process has actually started. So, yeah, in this case we will have to access both of these before self.start_called is set to True.

@opalmer opalmer added this to the 0.8.4 milestone Apr 8, 2015
@guidow
Copy link
Contributor Author

guidow commented Apr 8, 2015

As for the problems described in the opening comments, I think I've gotten to the bottom of those, as well as, probably, the issues in #249. This seems to be a bug in Twisted-web (which treq uses as its backend), where if you do two consecutive, non-overlapping http requests with data against a server that doesn't support keep-alive, but doesn't make that explicit by setting a Connection: close header, the second request will fail because twisted tries to reuse the connection not realizing the server has closed it already...

See: https://twistedmatrix.com/trac/ticket/7843
and: http://twistedmatrix.com/pipermail/twisted-web/2015-April/005139.html

Since, in production, farms will probably use a real webserver instead of flask's built-in standalone mode, this doesn't seem like a major issue. Besides, as long as we're restarting the request when this happens, we're fine anyway.

@opalmer
Copy link
Member

opalmer commented Apr 9, 2015

Interestingly the problem you're describing here is something I was able to replicate at work. I basically came to the same conclusions you did by accident when I did

data = treq.json_content(response)

when I meant to do

data = yield treq.json_content(response)

which in my case caused overlapping requests and it seemed to trigger the ConnectionDone error. I wanted to reply to your thread on twisted-web but was not subscribed at the time unfortunately. Regardless, nice work and thanks for taking a deep look at that.

Anyway I'm going to go ahed and merge this, I think I have all my questions answered and I don't see anything code wise to fix here right now.

opalmer added a commit that referenced this pull request Apr 9, 2015
Register logfile before starting assignment
@opalmer opalmer merged commit 9de8967 into pyfarm:master Apr 9, 2015
guidow added a commit to guidow/pyfarm-agent that referenced this pull request Apr 13, 2015
Merging pyfarm#263 ended up breaking handling exceptions the jobtype's start()
method.  This should fix it again.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants