Issue#160 Straggler due to list-after-write consistency #170

Merged
merged 5 commits into from Sep 4, 2017

3 participants
@ooq
Collaborator

ooq commented Aug 23, 2017

In wait(), we signal task completion by probing status files. S3 does not provide list-after-write consistency (and other storage backends may behave the same way), so list() can only be used as an optimization, not as a timely way to signal completion. Thus, our strategy is to:
1) do list()
2) use get() to probe N tasks that do not show up in 1)
3) repeat 2) if all N tasks completed, otherwise stop
Note: a small N is probably preferred here. N is set to 4.
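For readers following along, the three steps above can be sketched roughly as below. This is a minimal illustration, not the code in pywren/wait.py: `FakeStore`, `status_key`, `find_done`, and `n_probe` are hypothetical names, and the store is an in-memory stand-in whose list() may lag behind writes.

```python
class FakeStore:
    """Hypothetical in-memory object store whose list() can lag behind writes."""

    def __init__(self):
        self._objects = {}
        self._hidden_from_list = set()

    def put(self, key, value, visible_in_list=True):
        self._objects[key] = value
        if not visible_in_list:
            # Simulate a written object that list() does not show yet.
            self._hidden_from_list.add(key)

    def list(self):
        return {k for k in self._objects if k not in self._hidden_from_list}

    def get(self, key):
        # Returns None when the object does not exist.
        return self._objects.get(key)


def status_key(call_id):
    # Hypothetical key layout for a task's status file.
    return "status/{}".format(call_id)


def find_done(store, call_ids, n_probe=4):
    # Step 1: a single list() call finds most completed tasks cheaply.
    listed = store.list()
    done = {c for c in call_ids if status_key(c) in listed}
    not_listed = [c for c in call_ids if c not in done]

    # Steps 2 and 3: probe N unlisted tasks at a time with get(), and keep
    # going only while every task in the batch turned out to be complete.
    for start in range(0, len(not_listed), n_probe):
        batch = not_listed[start:start + n_probe]
        finished = [c for c in batch if store.get(status_key(c)) is not None]
        done.update(finished)
        if len(finished) < len(batch):
            break
    return done


store = FakeStore()
for call_id in range(5):
    store.put(status_key(call_id), "done")
for call_id in range(5, 8):
    # Finished, but not yet visible to list().
    store.put(status_key(call_id), "done", visible_in_list=False)
# Calls 8 and 9 are still running, so they have no status file.

done = find_done(store, list(range(10)), n_probe=4)
```

Here list() alone would report only calls 0 to 4; the get() probes also pick up calls 5 to 7, and probing stops after the first batch because call 8 has not finished.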
#160

ooq added some commits Aug 23, 2017

fix
fix
@ericmjonas

Collaborator

ericmjonas commented Aug 23, 2017

@ooq Do we have any plots or example tests to show how this mitigates stragglers?

@ooq

Collaborator

ooq commented Aug 23, 2017

@ericmjonas btw, I'll make some plot examples today.

@ericmjonas

Collaborator

ericmjonas commented Aug 23, 2017

@ooq That's awesome, is there a way we could pass the "num_samples" into the outer wait in a clean way so that we can have code going forward to show the differences? @shivaram does this seem sane or a proliferation of rarely-used options?

@ericmjonas ericmjonas modified the milestone: v0.3 Sep 3, 2017

@ericmjonas

Collaborator

ericmjonas commented Sep 4, 2017

Ok, I spent a bit of time tightening the code up, making the invocation semantics clearer, and making parts of it a bit more Pythonic. I also renamed some variables so it is hopefully a bit easier to follow now. @ooq @shivaram, I would love your opinions.

while query_count < max_queries:
    if len(done_call_ids) >= return_early_n:

@ooq

ooq Sep 4, 2017

Collaborator

I think with this, we assume that there are few stragglers? Say we have 100 stragglers (not showing up in list()); then it will take 100/return_early_n * WAIT_DUR_SEC to finish. I personally haven't seen a case with a large number of stragglers yet, but I just want to point out the possibility.
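To put rough numbers on that estimate, here is a small sketch. The helper name and the sample values are assumptions for illustration only; the thread does not state the real value of WAIT_DUR_SEC or return_early_n, and rounding the number of rounds up is my own reading of the estimate.

```python
import math


def worst_case_wait_sec(num_stragglers, return_early_n, wait_dur_sec):
    # Each round of wait() picks up at most return_early_n stragglers,
    # and each round costs roughly one wait_dur_sec sleep.
    rounds = math.ceil(num_stragglers / return_early_n)
    return rounds * wait_dur_sec


# Illustrative values only, not taken from pywren.
estimate_small_n = worst_case_wait_sec(100, return_early_n=4, wait_dur_sec=1.0)
estimate_large_n = worst_case_wait_sec(100, return_early_n=25, wait_dur_sec=1.0)
```

With these made-up values, 100 stragglers would take about 25 seconds at return_early_n=4 and about 4 seconds at return_early_n=25, which is why a larger return_early_n helps in the many-stragglers case.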

@ericmjonas

ericmjonas Sep 4, 2017

Collaborator

Yeah, if this becomes a performance pain point in the future, we can make return_early_n a user-facing parameter.

    if len(done_call_ids) >= return_early_n:
        break
    num_to_query_at_once = THREADPOOL_SIZE
    fs_to_query = still_not_done_futures[query_count:query_count + num_to_query_at_once]

@ooq

ooq Sep 4, 2017

Collaborator

I think query_count + num_to_query_at_once could go out of bounds?

@ericmjonas

ericmjonas Sep 4, 2017

Collaborator

Python list slices always return as much as they can:

In [3]: x = list(range(10))

In [4]: x
Out[4]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]: x[5:15]
Out[5]: [5, 6, 7, 8, 9]

so I think this should be ok?
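The same clamping behaviour is what makes fixed-size batching over a list safe. A minimal sketch, assuming the start index advances by the batch size each round (the thread does not show how query_count is updated, and `batches_of` is a hypothetical helper, not pywren code):

```python
def batches_of(items, size):
    # Slices are clamped to the end of the list, so the final batch is
    # simply shorter and no IndexError is raised.
    out = []
    start = 0
    while start < len(items):
        out.append(items[start:start + size])
        start += size
    return out


futures = list(range(10))  # stand-in for still_not_done_futures
batches = batches_of(futures, 4)
past_the_end = futures[12:16]  # a slice entirely past the end is just empty
```

Only indexing a single element past the end (for example `futures[12]`) raises IndexError; slicing never does.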

@ooq

Collaborator

ooq commented Sep 4, 2017

Thanks for taking it over, @ericmjonas. The code is clean and easy to follow!
It looks good, and I think the corner case pointed out in my comment is not an immediate concern (and you can tweak the early-return parameter to deal with it).
I'm fine with merging it after the out-of-bounds concern is resolved.

@ericmjonas

REview from @ooq in PR

@ericmjonas ericmjonas merged commit 2a1e1b6 into master Sep 4, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@ericmjonas ericmjonas deleted the issue#160 branch Sep 4, 2017
