
Issue #160: Straggler due to list-after-write consistency #170


Merged: 5 commits from the issue#160 branch into master on Sep 4, 2017

Conversation

@ooq (Collaborator) commented Aug 23, 2017:

In wait(), we detect the completion of tasks by probing status files. Because S3 does not provide list-after-write consistency (and the same may apply to other storage backends), list() can be used only as an optimization, not as a timely signal of completion. Our strategy is therefore (sketched below):
1) do list()
2) use get() to check the N tasks that do not show up in 1)
3) repeat 2) if all N tasks completed, otherwise stop
Note: a small N is probably preferred here; N is set to 4.
#160
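
A minimal sketch of the idea (list_done_call_ids and get_call_status are hypothetical stand-ins for the real storage probes, not pywren's actual API; n follows the default of 4 above):

def check_completion(pending_call_ids, list_done_call_ids, get_call_status, n=4):
    # 1) list() is cheap but may lag behind writes (no list-after-write consistency)
    done = set(list_done_call_ids())
    # 2) get() the status of up to n tasks that list() did not report
    sampled = [c for c in pending_call_ids if c not in done][:n]
    for call_id in sampled:
        if get_call_status(call_id) is not None:
            done.add(call_id)
    # 3) report whether every sampled task turned out to be finished, so the
    #    caller can decide to probe again right away or wait before the next round
    return done, all(c in done for c in sampled)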

pywren/wait.py Outdated
callids_done.update(callids_found)

# break if not all N tasks completed
if (len(fs_found) < len(fs_samples)):
Collaborator:

Why should we break if not all N tasks completed? (Also, where is fs_found defined?)

Collaborator (Author):

There is a trade-off here. Consider the case where this block runs while the tasks are far from finished: we would essentially be scanning all tasks, and doing so many times before they actually complete. (Yeah, that's a bug. I fixed it.)

Collaborator:

Sorry, I am still missing something. If we break out of it during, say, the first few blocks we sample, then how do we get back into this loop for the last few tasks?

@ericmjonas (Collaborator):

@ooq Do we have any plots or example tests to show how this mitigates stragglers?

pywren/wait.py Outdated

pool = ThreadPool(num_samples)
# repeat until all futures are done
while still_not_done_futures:
Collaborator:

Is this actually the design contract of _wait? I thought it would perform one pass of checking whether things are done and downloading those; it would not block waiting for all to finish (note that the original _wait has no looping; that's handled in wait()).

Collaborator (Author):

It's not blocking waiting for all to finish, because of if (len(callids_found) < len(fs_samples)): break

Collaborator:

I see, so it's still blocking, just not on all of them. Would it make sense to have num_samples passed in (and the policy managed at the layer above) and just have it make a single best-effort attempt to get this number of samples? That is, is there a reason to have this control logic at this layer but the all_completed logic a layer higher?

Collaborator:

Yeah it'll be good to have a clean contract between the two functions.

Collaborator (Author):

So say all 100 tasks finish, but they only show up with list() 30 seconds later.
With this alternative solution, it would take min(100/N * wait_interval, 30) seconds to find them.
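(For concreteness: with the default N = 4 and, say, a 1-second wait interval between rounds, that would be min(100/4 * 1, 30) = min(25, 30) = 25 seconds, instead of the full 30 seconds it would take for list() alone to catch up; the 1-second interval is only an illustrative assumption.)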

@ooq (Collaborator, Author) Aug 23, 2017:

I actually think that if we assume get_callset_status and get_call_status are instantaneous, then the contract is exactly the same here, i.e., use whatever means to do one pass and return.

@ooq (Collaborator, Author) commented Aug 23, 2017:

@ericmjonas btw, I'll make some plot examples today.

@ericmjonas (Collaborator):

@ooq That's awesome. Is there a way we could pass "num_samples" into the outer wait in a clean way, so that we have code going forward to show the differences? @shivaram does this seem sane, or is it a proliferation of rarely-used options?

@ericmjonas modified the milestone: v0.3 (Sep 3, 2017)
@ericmjonas (Collaborator):

OK, I spent a bit of time tightening up the code, making the invocation semantics clearer, and making parts of it a little more Pythonic. I also renamed some variables so it is hopefully now a bit easier to follow. @ooq @shivaram would love your opinions.


while query_count < max_queries:

if len(done_call_ids) >= return_early_n:
Collaborator (Author):

I think with this we assume that there are few stragglers? Let's say we have 100 stragglers (not showing up with list()); then it will take 100/return_early_n * WAIT_DUR_SEC to finish. I personally haven't seen a case with a large number of stragglers yet, but I just want to point out the possibility.

Collaborator:

Yeah, I think if this becomes a performance pain point in the future we can make return_early_n a user-facing parameter.

if len(done_call_ids) >= return_early_n:
break
num_to_query_at_once = THREADPOOL_SIZE
fs_to_query = still_not_done_futures[query_count:query_count + num_to_query_at_once]
Collaborator (Author):

I think query_count + num_to_query_at_once could go out of bounds?

Collaborator:

Python list slices always return as much as they can:

In [3]: x = list(range(10))

In [4]: x
Out[4]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]: x[5:15]
Out[5]: [5, 6, 7, 8, 9]

so I think this should be ok?
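
For reference, a rough sketch of how the quoted fragments fit together (fetch_status is a stand-in for the real per-future status probe, and the constant value is an assumption, not the actual pywren default):

from multiprocessing.pool import ThreadPool

THREADPOOL_SIZE = 16   # assumed value for the sketch

def wait_one_round(still_not_done_futures, fetch_status, return_early_n=4):
    # One polling round: query futures in chunks of THREADPOOL_SIZE and
    # return as soon as at least return_early_n of them are observed done.
    done_call_ids = set()
    pool = ThreadPool(THREADPOOL_SIZE)
    query_count = 0
    max_queries = len(still_not_done_futures)
    while query_count < max_queries:
        if len(done_call_ids) >= return_early_n:
            break
        # slicing past the end is safe: Python returns whatever remains
        fs_to_query = still_not_done_futures[query_count:query_count + THREADPOOL_SIZE]
        statuses = pool.map(fetch_status, fs_to_query)
        # call_id is assumed to identify each future's task
        done_call_ids.update(f.call_id for f, s in zip(fs_to_query, statuses) if s)
        query_count += len(fs_to_query)
    pool.close()
    return done_call_ids

The outer wait() would then call this repeatedly, sleeping WAIT_DUR_SEC between rounds, until every future is done.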

@ooq (Collaborator, Author) commented Sep 4, 2017:

Thanks for taking it over, @ericmjonas. The code is clean and easy to follow!
It looks good, and I think the corner case pointed out in my comment is not an immediate concern (and we can tweak the early-return parameter to deal with it).
I'm fine merging it once the out-of-bounds concern is resolved.

@ericmjonas (Collaborator) left a comment:

Review from @ooq in PR

@ericmjonas merged commit 2a1e1b6 into master on Sep 4, 2017
@ericmjonas deleted the issue#160 branch on September 4, 2017 at 14:56