Backfilling progress times out most of the time #1999
Comments
Can you describe what server configuration you saw this with? I used |
I started the two instances on my local machine (i7 3rd gen, 12gb ram, ssd). They are both in debug mode, so maybe that's why? I'll try in release mode later. |
I'd like to move this to subsequent. I think this functionality is really important. It's been somewhat flaky before because both the numerator and the denominator change, and we'll have to make that work better for ops people, but I don't think we should just accept progress timing out. |
Also, the numerator and the denominator both change, which makes the progress feature unusable in real life scenarios. |
I don't fully follow on this one. Is the problem that the progress value is not monotonic? Or that the denominator changes? I know this is based on an actual user issue, but maybe you have more insights better knowing the context. The latter doesn't seem like an issue. If you are not interested in the exact blocks estimated, you can simply convert the value to a decimal number. The solution to the former would be to artificially force the fraction to never decrease, though that would be arguably worse. It's in the nature of predictions based on incomplete information that they change as more data becomes available. |
From what I've heard (I never read the code), we somehow recurse in the B-tree and estimate the number of blocks left to copy by keeping an average branching factor. |
Yes. We don't always increase the denominator though, but start with an estimate for the total number of blocks that gets refined during backfilling. I just wonder which part of this is the actual problem. Making the block count precise isn't that easy (in the presence of resharding). It is essentially the same problem as keeping track of the number of documents in a certain range of the tree #152 . |
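To make the "estimate that gets refined" idea concrete, here is a minimal sketch of how a block count derived from an average branching factor shifts as more nodes are visited. This is not the actual backfilling code; the class `BranchingEstimator`, its methods, and the assumption of a uniform tree with a known number of levels are all hypothetical, chosen only for illustration.

```python
class BranchingEstimator:
    """Hypothetical sketch: estimate a tree's block count from the
    average branching factor of the nodes visited so far."""

    def __init__(self) -> None:
        self.nodes_seen = 0
        self.children_seen = 0

    def observe(self, child_count: int) -> None:
        # Record one visited internal node and how many children it has.
        self.nodes_seen += 1
        self.children_seen += child_count

    def average(self) -> float:
        # With no observations yet, fall back to a branching factor of 1.
        if self.nodes_seen == 0:
            return 1.0
        return self.children_seen / self.nodes_seen

    def estimate_blocks(self, levels: int) -> float:
        # Assumes every node has the average number of children:
        # 1 + b + b^2 + ... + b^(levels - 1).
        b = self.average()
        return sum(b ** i for i in range(levels))


est = BranchingEstimator()
est.observe(4)
est.observe(2)
first = est.estimate_blocks(3)   # average branching factor is 3.0
est.observe(6)
second = est.estimate_blocks(3)  # average branching factor is now 4.0
```

Under these assumptions the estimated total moves from 13 to 21 blocks after a single extra observation, which is why the denominator of the progress value is not stable.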
One problem is that if you compute a percentage, it may decrease. I think sending a percentage that never decreases is enough for most use cases. It's basically what the web interface was doing until I replaced the progress bar with the number of replicas ready/total number of replicas. |
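A percentage that never decreases can be had on the client side by remembering the highest fraction seen so far. The sketch below assumes the server reports a pair (blocks done, estimated total) and that a non-positive total means "no estimate available"; the class name `MonotonicProgress` is made up for this example.

```python
class MonotonicProgress:
    """Hypothetical sketch: report a fraction in [0, 1] that never
    decreases, even if the estimated total grows later."""

    def __init__(self) -> None:
        self._best = 0.0

    def update(self, done: int, total_estimate: int) -> float:
        # Ignore reports without a usable estimate (e.g. 0/0 or -1/-1).
        if total_estimate > 0:
            fraction = min(max(done / total_estimate, 0.0), 1.0)
            # Keep the highest value seen so the display never goes backwards.
            self._best = max(self._best, fraction)
        return self._best


progress = MonotonicProgress()
a = progress.update(50, 100)   # 0.5
b = progress.update(60, 200)   # raw value is 0.3, reported value stays 0.5
c = progress.update(150, 200)  # 0.75
```

The trade-off is the one raised earlier in the thread: the displayed value can sit still for a while after the estimate is revised upwards, so it is less accurate but easier to read.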
Actually another problem is that backfills that are being held back because there already are too many concurrent backfills going on don't report any estimates. I think their progress is just 0/0 or -1/-1 or something like that. That's technically fine, but makes it more difficult to get a good overall progress estimate (you could for example assume that those shards are approximately as big as the other shards, and use the average number of estimated blocks from the other shards as the number of blocks for the throttled ones). |
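The averaging idea for throttled shards could look roughly like the following. It is only a sketch: the input format (a list of `(done, total)` pairs, with `None` or a non-positive total standing for a throttled backfill) and the function name `overall_progress` are assumptions, not what the server actually returns.

```python
from typing import Optional, Sequence, Tuple

# (blocks_done, estimated_total); None marks a shard with no report at all.
ShardProgress = Optional[Tuple[int, int]]


def overall_progress(shards: Sequence[ShardProgress]) -> Optional[float]:
    """Hypothetical sketch: combine per-shard progress, assuming throttled
    shards are about as big as the average of the shards with estimates."""
    active = [s for s in shards if s is not None and s[1] > 0]
    if not active:
        return None  # nothing to extrapolate from

    avg_total = sum(total for _, total in active) / len(active)
    throttled = len(shards) - len(active)

    done = sum(d for d, _ in active)
    total = sum(t for _, t in active) + throttled * avg_total
    return min(done / total, 1.0)


# Two active shards plus two throttled ones (reported as None and -1/-1).
value = overall_progress([(50, 100), (25, 100), None, (-1, -1)])
```

Here the two throttled shards are each counted as 100 blocks with nothing transferred, so 75 of an estimated 400 blocks are done.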
I think the current implementation has a general problem of usability + perception. There are a couple of solutions I can think of (though I don't know if some of these are possible or how much work they are):
|
How about we expose the actual number of blocks transferred, but not the estimate for the total number of blocks? We would have:
Optionally we can make sure that the percentage never decreases. |
That makes sense, but then couldn't I divide one by the other to get an estimate of the total number of blocks? :) (Also, I don't think making sure percentage doesn't decrease is optional, otherwise the feature isn't very useful) |
Hmm you are right. Though this endeavour would be partially hindered if we forced the percentage to be monotonic. |
😀 |
This is outdated. We now have |
The progress data I retrieve from the servers (/ajax/progress) seems to time out most of the time when servers are backfilling. It used to work better (at least from what I can remember).
Putting this in backlog because the web UI displays the number of blocks only when it is available, and relies on the number of replicas ready to display progress.