
Bug: A branch with deleted objects is very inefficient with a small 'amount' param #7864

Closed
itaiad200 opened this issue Jun 14, 2024 · 4 comments · Fixed by #7875 or #7892
Labels
area/client/hadoopfs, area/KV (Improvements to the KV store implementation), bug (Something isn't working), P0

Comments

@itaiad200
Contributor

Consider the following branch staging area:

foo/1 : deleted
foo/2 : deleted
foo/3 : deleted
...
foo/100 : deleted
foo/101 : some value

A client, most likely our HadoopFS implementation, wants to check if the foo/ prefix is empty.
It will listObjects with amount=1, just to find anything under that prefix.

The amount propagates all the way to the kvstore iterator, which uses batchSize=amount+1, in this case 2. The iterator keeps using this batch size during subsequent fetches, as the wrapping StagingIterator calls it to bring more values. The reason the wrapping iterator keeps fetching, even though it's only looking for a single object, is that tombstones (Graveler's delete markers) are filtered out (i.e., they mark that an object does not exist).

In the example above, the iterator will perform 51 list calls to the kvstore. This is very inefficient, and may not finish in time for an HTTP request.
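The fetch pattern can be sketched as a small simulation (this is not lakeFS code; the function and variable names are illustrative):

```python
# Simulate a KV iterator that fetches fixed-size batches, wrapped by an
# iterator that filters out tombstones, as described above.

def count_kv_scans(entries, amount):
    """entries: list of (key, is_tombstone). Returns (#list calls, results)."""
    batch_size = amount + 1          # kvstore iterator uses amount+1
    scans = 0
    results = []
    pos = 0
    while len(results) < amount and pos < len(entries):
        scans += 1                   # one list call to the kv store
        batch = entries[pos:pos + batch_size]
        pos += len(batch)
        for key, is_tombstone in batch:
            if not is_tombstone and len(results) < amount:
                results.append(key)
    return scans, results

# The staging area from the issue: 100 tombstones, then one real value.
staging = [(f"foo/{i}", True) for i in range(1, 101)] + [("foo/101", False)]
scans, found = count_kv_scans(staging, amount=1)
# 100 tombstones consumed in batches of 2 -> 50 scans, plus one more
# scan to finally reach foo/101: 51 list calls for a single result.
```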

@itaiad200 itaiad200 added bug Something isn't working area/KV Improvements to the KV store implementation area/client/hadoopfs P0 labels Jun 14, 2024
@arielshaqed
Contributor

Yikes! This is obviously bad. But please do NOT fix without a careful analysis.

Obviously fetching larger batches will give better latency. But it may still consume significantly more RCUs. Ref:

Scan consumes a minimum read capacity unit to perform one strongly consistent read per second, or two eventually consistent reads per second for an item up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB needs additional read request units. Empty tables and very large tables which have a sparse amount of partition keys might see some additional RCUs charged beyond the amount of data scanned. This covers the cost of serving the Scan request, even if no data exists.

What makes it tricky is when we don't have many tombstones (uncommitted deletes). In a Spark workload there are a huge number of listObjects operations with amount=1. Now if we Scan to fetch, say, 50 elements, we will experience 50x amplification -- and run out of RCUs.

An interim solution might be to increase the batch size when the iterator encounters tombstones.
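That adaptive idea can be sketched like this (again a simulation under stated assumptions, not lakeFS code; the doubling policy and cap are illustrative choices):

```python
# Start with batch_size = amount + 1, and grow the batch size whenever a
# fetch comes back containing only tombstones.

def count_kv_scans_adaptive(entries, amount, max_batch=1024):
    """entries: list of (key, is_tombstone). Returns the number of list calls."""
    batch_size = amount + 1
    scans, results, pos = 0, [], 0
    while len(results) < amount and pos < len(entries):
        scans += 1
        batch = entries[pos:pos + batch_size]
        pos += len(batch)
        hits = [k for k, tomb in batch if not tomb]
        results.extend(hits[:amount - len(results)])
        if not hits:                                   # all tombstones:
            batch_size = min(batch_size * 2, max_batch)  # grow next fetch
    return scans

staging = [(f"foo/{i}", True) for i in range(1, 101)] + [("foo/101", False)]
adaptive_scans = count_kv_scans_adaptive(staging, amount=1)
# Batch sizes 2, 4, 8, 16, 32, 64 cover all 100 tombstones and reach
# foo/101 in 6 scans instead of the fixed-size iterator's 51.
```

Because the batch only grows after a fetch full of tombstones, the common Spark case (amount=1, no tombstones) keeps its small first fetch and avoids the RCU amplification described above.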

Another possible patch

Another patch might be to add a flag to listObjects that lets the caller ask not to receive hasMore. That reduces the number of items we need to find by 2x. But for Spark / DBIO / lakeFSFS, this can be much better than 2x!

More importantly, '_' < 'a'. So in the Spark / DBIO case, when it looks for an object _started_* or _committed_*, there can be many deletions immediately after this directory marker, so we might have another gain here.
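The ordering point is easy to check (the file names below are illustrative of the DBIO pattern, not taken from a real bucket):

```python
# '_' (0x5F) sorts before every lowercase letter (0x61+), so DBIO's
# _started_* / _committed_* markers come first under a prefix, and any
# tombstoned data files sort immediately after them in the scan order.
keys = sorted(["part-00000", "_committed_123", "_started_123", "part-00001"])
# -> ['_committed_123', '_started_123', 'part-00000', 'part-00001']
underscore_first = ord('_') < ord('a')
```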

@arielshaqed
Contributor

Relevant: #2092, of course (but this is the quicker win).

@itaiad200
Contributor Author

I think the underlying issue here is pushing application-level logic into the kv store implementation. The application-level logic (Graveler) is "I need 2 non-tombstone entries"; what the kv store gets is "give me 2 entries". Graveler doesn't know much about the specific kv implementation, and it shouldn't control how many entries the store fetches on every call to the iterator. The batch_size is really a hint for the number of entries the kv store should fetch in its first request to the DB. If the caller requests more entries for the same scan, the store shouldn't trust the hint any longer, and the next listing from the DB should just use the default amount for the specific kv store implementation. I can't think of a flow where the caller of the store wants to control the DB requests to fetch just N entries each time. In some flows the hint is useful for the first DB request, but not for the rest, as in this specific bug.
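A sketch of that "hint applies only to the first fetch" behavior (a simulation, not lakeFS code; the store default of 1000 is an assumed value, each store would use its own):

```python
# The caller's amount is honored for the first DB request only; every
# subsequent request for the same scan falls back to the store default.

def count_kv_scans_hinted(entries, amount, store_default=1000):
    """entries: list of (key, is_tombstone). Returns the number of list calls."""
    batch_size = amount + 1       # caller hint, first fetch only
    scans, results, pos = 0, [], 0
    while len(results) < amount and pos < len(entries):
        scans += 1
        batch = entries[pos:pos + batch_size]
        pos += len(batch)
        results.extend(k for k, tomb in batch if not tomb)
        batch_size = store_default  # later fetches use the store default
    return scans

staging = [(f"foo/{i}", True) for i in range(1, 101)] + [("foo/101", False)]
hinted_scans = count_kv_scans_hinted(staging, amount=1)
# First scan fetches 2 tombstones; the second, at the store default size,
# reaches foo/101 -> 2 scans total.
```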

@itaiad200
Contributor Author

Reopening since it wasn't fixed for CosmosDB.
