slbackup script hangs up sometimes in defunct state #7

Closed
nareshov opened this Issue Oct 9, 2012 · 10 comments

nareshov commented Oct 9, 2012

Here's a traceback captured by cron: https://gist.github.com/08e637f8550fa14e2a12

This happens sometimes -- any more information I could furnish you with?

Thanks

Contributor

CrackerJackMack commented Oct 9, 2012

Confirming the bug. The deadlock is caused by an unhandled exception from softlayer_objectstorage in the upload threads, which makes them exit while the main loop is still waiting for the work to finish. Since the workers have died, the main loop waits forever.
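For illustration, here is a minimal sketch of a worker pool that avoids this kind of hang. It is not the slbackup code: `worker`, `run_uploads`, and the `upload` callable are hypothetical names, and the sketch assumes the main loop waits on a standard `queue.Queue` via `join()`. The key point is that every `get()` is matched by a `task_done()` in a `finally` clause, so an exception in one upload cannot leave `join()` waiting forever.

```python
import queue
import threading


def worker(jobs, upload, errors):
    """Process jobs until a None sentinel arrives; never die on an upload error."""
    while True:
        item = jobs.get()
        try:
            if item is None:
                return  # sentinel: shut this worker down
            upload(item)  # hypothetical upload callable; may raise
        except Exception as exc:
            # Record the failure instead of letting the thread exit.
            errors.append((item, exc))
        finally:
            # Always balance get() so jobs.join() can return.
            jobs.task_done()


def run_uploads(items, upload, num_threads=4):
    """Upload all items with a thread pool and return a list of (item, exception)."""
    jobs = queue.Queue()
    errors = []
    threads = [
        threading.Thread(target=worker, args=(jobs, upload, errors), daemon=True)
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    jobs.join()  # returns even if some uploads raised
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return errors
```

The caller can then inspect the returned list and decide whether to retry or report the failed items.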

Do you get disconnected a lot?

@ghost ghost assigned CrackerJackMack Oct 9, 2012

Do you get disconnected a lot?

I haven't had many runs of this script yet, but so far, it looks like half my runs are getting stuck.

Contributor

CrackerJackMack commented Oct 14, 2012

Would you mind trying the issue-7 branch?

Sure: I tried it concurrently from three different hosts.
During the first run, 2 out of 3 failed: https://gist.github.com/20a2a5889e45a5becf14 (I noticed that `ps -ef | grep slbackup` showed all of them in a defunct state, so I Ctrl-C'd them myself.)

I retried on the two hosts; one of them succeeded while the other failed: https://gist.github.com/5d40dc2c28e0123e4bdc (I didn't Ctrl-C this time, those messages showed up by themselves.)

On a related note, are we hitting any limits, such as on the number of object-storage calls that can be made from a single server or from all servers belonging to an account?

Contributor

CrackerJackMack commented Oct 15, 2012

I was finally able to reproduce a disconnect by using 200 threads to upload the Linux kernel tarball. I don't see the "requeuing" message in your output like I do in this one, though. The exception will still show up for now; I will squash it as soon as I know this issue is resolved.
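The "requeuing" behaviour mentioned above can be sketched roughly as follows. This is an assumption about the general technique, not the code on the issue-7 branch: `MAX_ATTEMPTS`, `retry_worker`, `upload_with_retries`, and the `upload` callable are all hypothetical. A failed item is put back on the queue with an incremented attempt counter, and is given up on after a fixed number of tries.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # hypothetical limit; tune to taste


def retry_worker(jobs, upload, failed):
    """Run uploads forever, requeuing items whose upload raised."""
    while True:
        item, attempt = jobs.get()
        try:
            upload(item)  # hypothetical upload callable; may raise on disconnect
        except Exception as exc:
            if attempt < MAX_ATTEMPTS:
                print("requeuing %r (attempt %d failed)" % (item, attempt))
                # Requeue before task_done() runs, so join() cannot return early.
                jobs.put((item, attempt + 1))
            else:
                failed.append((item, exc))
        finally:
            jobs.task_done()


def upload_with_retries(items, upload, num_threads=4):
    """Upload items with retries and return a list of (item, exception) that never succeeded."""
    jobs = queue.Queue()
    failed = []
    for _ in range(num_threads):
        threading.Thread(
            target=retry_worker, args=(jobs, upload, failed), daemon=True
        ).start()
    for item in items:
        jobs.put((item, 1))
    jobs.join()  # daemon workers are simply abandoned once all work is done
    return failed
```

Bounding the number of attempts matters: without it, an item that can never be uploaded would be requeued indefinitely and the main loop would hang again.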

https://gist.github.com/3892593

Contributor

CrackerJackMack commented Oct 16, 2012

I went ahead and merged it into master, if you want to grab the latest version from there.

Contributor

CrackerJackMack commented Oct 18, 2012

Did switching to the master branch help at all?

Hey,

I've just deployed the master branch. I'll keep an eye on it and let you know if I see any issues.
On the plus side, with the issue-7 branch's slbackup.py, over the past five days the process hasn't remained in a defunct/hung state for more than a day (it is set up as a daily cron job).

Contributor

CrackerJackMack commented Oct 19, 2012

I'll leave this open for a week then close it. Good to hear!
