
Issue#160 Straggler due to list-after-write consistency #170

Merged: 5 commits merged into master from issue#160 on Sep 4, 2017

Conversation

@ooq (Collaborator) commented Aug 23, 2017

In wait(), we signal the completion of tasks by probing status files. Because S3 does not provide list-after-write consistency (which might also apply to other storage backends), list() can only be used as an optimization, not as a timely way to signal completion. Thus, our strategy (sketched below) is to:
1) do list()
2) use get() to probe N tasks that do not show up in 1)
3) repeat 2) if all N tasks completed, otherwise stop
Note: a small N is probably preferred here. N is set to 4.
#160
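
For reference, a minimal sketch of this strategy (not the code in this PR); list_done_call_ids() and get_call_status() are hypothetical stand-ins for the storage-backend calls:

N = 4  # number of not-yet-listed tasks to probe with get() per pass

def probe_done(all_call_ids, callids_done, list_done_call_ids, get_call_status):
    """One pass of the strategy; returns True if another pass is worthwhile."""
    # 1) do list(): cheap, but it may lag behind writes on S3, so it is
    #    only an optimization, not a timely completion signal.
    callids_done.update(list_done_call_ids())

    # 2) use get() on up to N call ids that did not show up in the listing.
    missing = [c for c in all_call_ids if c not in callids_done]
    sampled = missing[:N]
    found = {c for c in sampled if get_call_status(c) is not None}
    callids_done.update(found)

    # 3) repeat 2) only if all N sampled tasks turned out to be done;
    #    otherwise stop, since the remaining tasks are likely still running.
    return len(sampled) > 0 and len(found) == len(sampled)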

callids_done.update(callids_found)

# break if not all N tasks completed
if (len(fs_found) < len(fs_samples)):

@shivaram (Collaborator) commented Aug 23, 2017

Why should we break if not all N tasks completed? (Also, where is fs_found defined?)

@ooq (Author, Collaborator) commented Aug 23, 2017

There is a trade-off here. Suppose this block runs when the tasks are far from finished: then we are essentially scanning all tasks, and doing so many times before they actually complete. (Yeah, that's a bug. I fixed it.)

@shivaram (Collaborator) commented Aug 23, 2017

Sorry, I am still missing something. If we break out of it during, say, the first few blocks we sample, then how do we get back into this loop for the last few tasks?

ooq added 2 commits Aug 23, 2017
fix
fix
@ericmjonas (Collaborator) commented Aug 23, 2017

@ooq Do we have any plots or example tests to show how this mitigates stragglers?


pool = ThreadPool(num_samples)
# repeat until all futures are done
while still_not_done_futures:

@ericmjonas (Collaborator) commented Aug 23, 2017

Is this actually the design contract of _wait? I thought it would perform one pass of checking if things are done and downloading those; it would not block waiting for all to finish (note that the original _wait has no looping; that's handled in wait()).

@ooq (Author, Collaborator) commented Aug 23, 2017

It does not block waiting for all to finish, because of if (len(callids_found) < len(fs_samples)): break

@ericmjonas (Collaborator) commented Aug 23, 2017

I see, so it's still blocking, just not for all of them. Would it make sense to have num_samples be passed in (and the policy managed at the layer above), and just have it make a single best-effort attempt to get this number of samples? That is, is there a reason to have this control logic at this layer but the all_completed logic a layer higher?
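
For illustration, a rough sketch of that split, where _wait makes a single best-effort pass and wait() owns the looping policy; the names _wait_one_pass, get_call_status, and WAIT_DUR_SEC are placeholders, not the PR's actual API:

import time

WAIT_DUR_SEC = 1  # placeholder polling interval

def _wait_one_pass(pending_futures, num_samples, get_call_status):
    """Single best-effort pass: probe up to num_samples futures, no looping."""
    return [f for f in pending_futures[:num_samples]
            if get_call_status(f) is not None]

def wait(futures, get_call_status, num_samples=4):
    """Policy layer: decides when everything is done and how often to re-poll."""
    pending = list(futures)
    while pending:
        done = _wait_one_pass(pending, num_samples, get_call_status)
        pending = [f for f in pending if f not in done]
        if pending:
            time.sleep(WAIT_DUR_SEC)
    return futures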

@shivaram (Collaborator) commented Aug 23, 2017

Yeah it'll be good to have a clean contract between the two functions.

@ooq (Author, Collaborator) commented Aug 23, 2017

So say all 100 tasks finish but they only show up in list() 30 seconds later. With this alternative solution, it would take min(100/N * wait_interval, 30) seconds to find them.
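
To make the bound concrete, a small worked example; the numbers (N = 4, wait_interval = 1 s, a 30 s list() lag) are illustrative only:

import math

num_tasks = 100     # tasks that have finished but are not yet visible to list()
N = 4               # tasks probed with get() per wait iteration
wait_interval = 1   # seconds between wait iterations (illustrative)
list_lag = 30       # seconds until list() finally shows the finished tasks

time_via_get = math.ceil(num_tasks / N) * wait_interval  # 25 seconds
detection_time = min(time_via_get, list_lag)             # min(25, 30) = 25
print(detection_time)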

@ooq (Author, Collaborator) commented Aug 23, 2017

I actually think that if we assume get_callset_status and get_call_status are instantaneous, then the contract is exactly the same here, i.e., use whatever means to do one pass and return.

@ooq (Collaborator, Author) commented Aug 23, 2017

@ericmjonas btw, I'll make some plot examples today.

@ericmjonas (Collaborator) commented Aug 23, 2017

@ooq That's awesome. Is there a way we could pass "num_samples" into the outer wait in a clean way, so that we have code going forward to show the differences? @shivaram does this seem sane, or a proliferation of rarely-used options?

@ericmjonas ericmjonas modified the milestone: v0.3 Sep 3, 2017
@ericmjonas (Collaborator) commented Sep 4, 2017

OK, I spent a bit of time tightening up the code, making the invocation semantics clearer, and making parts of it a little more Pythonic. I also renamed some variables so it is hopefully now a bit easier to follow. @ooq @shivaram I would love your opinions.


while query_count < max_queries:

    if len(done_call_ids) >= return_early_n:

@ooq (Author, Collaborator) commented Sep 4, 2017

I think with this, we assume that there are few stragglers? Let's say we have 100 stragglers (not showing up with list()); then it will take 100/return_early_n * WAIT_DUR_SEC to finish. I personally haven't seen a case with a large number of stragglers yet, but I just want to point out the possibility.

@ericmjonas (Collaborator) commented Sep 4, 2017

Yeah, I think in the future, if this becomes a performance pain point, we can make return_early_n a user-facing parameter.

if len(done_call_ids) >= return_early_n:
    break
num_to_query_at_once = THREADPOOL_SIZE
fs_to_query = still_not_done_futures[query_count:query_count + num_to_query_at_once]

@ooq (Author, Collaborator) commented Sep 4, 2017

I think query_count + num_to_query_at_once could go out of bounds?

@ericmjonas (Collaborator) commented Sep 4, 2017

Python list slices always return as much as they can:

In [3]: x = list(range(10))

In [4]: x
Out[4]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]: x[5:15]
Out[5]: [5, 6, 7, 8, 9]

so I think this should be ok?

@ooq (Collaborator, Author) commented Sep 4, 2017

Thanks for taking it over, @ericmjonas. The code is clean and easy to follow!
It looks good, and I think the corner case pointed out in my comment is not an immediate concern (and you can tweak the early-return parameter to deal with it).
I'm fine with merging it once the out-of-bounds concern is resolved.

@ericmjonas (Collaborator) left a comment

Review from @ooq in PR.

ericmjonas merged commit 2a1e1b6 into master on Sep 4, 2017
2 checks passed:
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
ericmjonas deleted the issue#160 branch on Sep 4, 2017