Runs with large file uploads (>100MB) seem to hang #26

Closed
dbishop opened this Issue Feb 18, 2013 · 3 comments

@dbishop
SwiftStack member
dbishop commented Feb 18, 2013

Still trying to get a handle on what's going on here; details are sketchy so far.

Last night I ran one somewhat large scenario (1M smallish objects), which mostly worked: it had some failures, but it completed. Then I started a larger run, with 80M objects, which was supposed to take more or less all weekend. I started it around 01:00, and by 02:00 it had hung. So far, the main difference between runs that complete and runs that stall is file size: every run with larger files has stalled. I don't think a single run with files of 100M or larger has completed.

This run, the same scenario I started last night, died after almost exactly the same amount of time and data. I can't prove it, but I have a sense that it stops once it has reached a certain amount of data.

dbishop commented Mar 14, 2013

This is still happening, though it seemed to usually affect 10 or 11 out of something like 20 workers. And it seemed to only happen when the operation count was over some threshold.

The proxy claims it timed out waiting for PUT data from the client. An affected ssbench-worker did receive the bench job in question and under strace, the ssbench-worker was sitting around epolling the connection to the proxy.

The bench job which hung was not the last bench job the ssbench-worker received (but the one we looked at was toward the end).

dbishop commented Mar 14, 2013

Seems like we ought to be able to put a timeout around the conn.get_response() in put_object(). That seems like a reasonable guess as to where ssbench-worker was sitting in epoll... though maybe it was still in the contents.read(size)/conn.send(chunk) loop.
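
A minimal sketch of that idea, not ssbench's actual code: it uses Python 3's `http.client` and passes a timeout to the connection so that the connect, each `conn.send(chunk)`, and the final `conn.getresponse()` all raise `socket.timeout` instead of sitting in epoll forever. The `put_object` signature, the `CHUNK_SIZE` value, and the `network_timeout` parameter name are assumptions made for illustration.

```python
import http.client
import socket

CHUNK_SIZE = 65536  # assumed chunk size, for illustration only


def put_object(host, port, path, contents, length, network_timeout=20.0):
    """PUT a file-like object; raise socket.timeout rather than hang.

    The timeout given to HTTPConnection is set on the underlying socket,
    so it bounds the connect, every send, and the wait for the response.
    """
    conn = http.client.HTTPConnection(host, port, timeout=network_timeout)
    try:
        conn.putrequest("PUT", path)
        conn.putheader("Content-Length", str(length))
        conn.endheaders()
        # The read/send loop: each send is bounded by the socket timeout.
        while True:
            chunk = contents.read(CHUNK_SIZE)
            if not chunk:
                break
            conn.send(chunk)
        # Without a timeout, this is where a worker could block forever
        # if the proxy never answers.
        resp = conn.getresponse()
        resp.read()
        return resp.status
    finally:
        conn.close()
```

A caller would catch `socket.timeout` and record the operation as failed, so that one stuck request cannot stall the whole worker.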

dbishop commented Mar 19, 2013

More info from Hugo:

ssbench hangs in several scenarios. In my test, it happens with large objects (5GB) for GET. It also happened with a mixed scenario. (http)

More information from the customer: they said the situation happens frequently with SSL enabled (https), even for 1KB objects.

Due to this issue, they cannot run a long-term benchmarking test (assume 24x3 hrs). The 408 error appeared in the proxy's log in their recent tests, so I think it is the same issue as in my previous test.

dbishop was assigned Mar 19, 2013
dbishop added a commit that closed this issue Mar 21, 2013
Add socket timeouts. fixes #26.
Add timeouts for socket operations.  ssbench-master now takes two new
arguments, --connect-timeout <float_seconds> and --network-timeout
<float_seconds>.  The first is a timeout for connecting to the Swift
proxy (or load-balancer or whatever), and the second timeout applies to
all socket operations after the connect.  They default to 10.0 and 20.0
seconds, respectively.

Also fix bug where the storage_url would get overridden with None.
28ade3c
@dbishop dbishop closed this in 28ade3c Mar 21, 2013
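
For illustration only, here is a rough sketch of how two separate timeouts like the ones in the commit message could be wired up. The flag names and the 10.0 and 20.0 second defaults come from the commit message; everything else (`build_parser`, `open_connection`, the `prog` name) is hypothetical and is not taken from the ssbench source.

```python
import argparse
import socket


def build_parser():
    """Hypothetical parser exposing the two timeout flags."""
    parser = argparse.ArgumentParser(prog="timeout-sketch")
    parser.add_argument("--connect-timeout", type=float, default=10.0,
                        metavar="<float_seconds>",
                        help="timeout for connecting to the proxy")
    parser.add_argument("--network-timeout", type=float, default=20.0,
                        metavar="<float_seconds>",
                        help="timeout for socket operations after connect")
    return parser


def open_connection(host, port, connect_timeout, network_timeout):
    """Connect under one timeout, then switch to the other."""
    # create_connection applies connect_timeout to the connect attempt.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    # Every later send/recv on this socket is bounded by network_timeout.
    sock.settimeout(network_timeout)
    return sock
```

Keeping the two values separate lets a slow-to-accept load balancer be tolerated (or rejected quickly) independently of how long an in-flight transfer is allowed to stall.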