Issue#160 Straggler due to list-after-write consistency#170
Conversation
| callids_done.update(callids_found) | ||
|
|
||
| # break if not all N tasks completed | ||
| if (len(fs_found) < len(fs_samples)): |
There was a problem hiding this comment.
why should we break if not all N tasks completed ? (also where is fs_found defined)
There was a problem hiding this comment.
Here is a trade-off. Think this block runs when the tasks are far from finish. Then we are essentially scanning all tasks, and doing this many times before they actually complete. (yeah, that's a bug. I fixed it.)
There was a problem hiding this comment.
Sorry I am still missing something. So if we break out of it during say the first few blocks we sample then how do we get back into this loop for the last few tasks ?
|
@ooq Do we have any plots or example tests to show how this mitigates stragglers? |
|
|
||
| pool = ThreadPool(num_samples) | ||
| # repeat util all futures are done | ||
| while still_not_done_futures: |
There was a problem hiding this comment.
Is this actually the design contract of _wait? I thought it would perform one pass of checking if things are done and downloading those; it would not block waiting for all to finish (note the original _wait has no looping, that's handeled in wait())
There was a problem hiding this comment.
It's not block waiting for all to finish, with if (len(callids_found) < len(fs_samples)): break
There was a problem hiding this comment.
I see so it's still blocking, just not for all of them. Would it make sense to have num_samples be passed in (and the policy managed at the layer above) and just have it make a single best-effort to get this number of samples? That is, is there a reason to have this control logic at this layer but the all_completed logic at a layer higher?
There was a problem hiding this comment.
Yeah it'll be good to have a clean contract between the two functions.
There was a problem hiding this comment.
So say all 100 tasks finish but they only show up with list() 30 seconds after.
With this alternative solution, it would take min(100/N * wait_interval, 30) to find them.
There was a problem hiding this comment.
I actually think if we assume get_callset_status and get_call_status are instantaneous, then the contract is exactly same here, i.e., use whatever way to give one pass and returns.
|
@ericmjonas btw, I'll make some plot examples today. |
|
|
||
| while query_count < max_queries: | ||
|
|
||
| if len(done_call_ids) >= return_early_n: |
There was a problem hiding this comment.
I think with this, we assume that there are few stragglers? Let's say we have 100 stragglers (not showing up with list), then it will take 100/return_early_n * WAIT_DUR_SEC to finish. I personally have seen the case with a large number of stragglers yet, but just want to point out the possibility.
There was a problem hiding this comment.
Yeah I think in the future if we want and this becomes a performance pain point we can make return_early_n to be a user-facing parameter.
| if len(done_call_ids) >= return_early_n: | ||
| break | ||
| num_to_query_at_once = THREADPOOL_SIZE | ||
| fs_to_query = still_not_done_futures[query_count:query_count + num_to_query_at_once] |
There was a problem hiding this comment.
I think query_count + num_to_query_at_once could go out of bound?
There was a problem hiding this comment.
Python list slices always return as much as they can:
In [3]: x = list(range(10))
In [4]: x
Out[4]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [5]: x[5:15]
Out[5]: [5, 6, 7, 8, 9]
so I think this should be ok?
|
Thanks for taking it over, @ericmjonas . The code is clean to follow! |
ericmjonas
left a comment
There was a problem hiding this comment.
REview from @ooq in PR
In wait(), we signal the completion of tasks by probing status files. Because S3 does not provide list-after-write consistency (which might also apply to other storage backend). List() can be only used as an optimization but not a timely way to signal completion. Thus, our strategy is to:
1) do list()
2) use get() to signal N tasks that do not show up in 1)
3) repeat 2) if all N tasks completed, otherwise stop
Note: a small N is probably preferred here. N is set to 4.
#160