
Bug: A branch with deleted objects is very inefficient with a small 'amount' param #7864

Closed
itaiad200 opened this issue Jun 14, 2024 · 4 comments · Fixed by #7875 or #7892
Labels
area/client/hadoopfs, area/KV (Improvements to the KV store implementation), bug (Something isn't working), P0

Comments

@itaiad200
Contributor

Consider the following branch staging area:

foo/1 : deleted
foo/2 : deleted
foo/3 : deleted
...
foo/100 : deleted
foo/101 : some value

A client, most likely our HadoopFS implementation, wants to check if the foo/ prefix is empty.
It will listObjects with amount=1, just to find anything under that prefix.

The amount propagates all the way to the kvstore iterator, which uses batchSize=amount+1, in this case 2. The iterator keeps using this batch size during subsequent fetches, as the wrapping StagingIterator calls it to bring more values. The reason the wrapping iterator keeps fetching, even though it's only looking for a single object, is that tombstones (Graveler's delete markers) are filtered out (i.e., they mark that an object does not exist).

In the example above, the iterator will perform 51 list calls to the kvstore. This is very inefficient, and may not finish in time for an HTTP request.
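The fetch pattern can be sketched as a small simulation (this is not lakeFS code; the function and variable names are illustrative):

```python
# Simulate a KV iterator that fetches fixed-size batches, wrapped by an
# iterator that filters out tombstones, as described above.

def count_kv_scans(entries, amount):
    """entries: list of (key, is_tombstone). Returns (#list calls, results)."""
    batch_size = amount + 1          # kvstore iterator uses amount+1
    scans = 0
    results = []
    pos = 0
    while len(results) < amount and pos < len(entries):
        scans += 1                   # one list call to the kv store
        batch = entries[pos:pos + batch_size]
        pos += len(batch)
        for key, is_tombstone in batch:
            if not is_tombstone and len(results) < amount:
                results.append(key)
    return scans, results

# The staging area from the issue: 100 tombstones, then one real value.
staging = [(f"foo/{i}", True) for i in range(1, 101)] + [("foo/101", False)]
scans, found = count_kv_scans(staging, amount=1)
# 100 tombstones consumed in batches of 2 -> 50 scans, plus one more
# scan to finally reach foo/101: 51 list calls for a single result.
```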

@itaiad200 itaiad200 added bug Something isn't working area/KV Improvements to the KV store implementation area/client/hadoopfs P0 labels Jun 14, 2024
@arielshaqed
Contributor

Yikes! This is obviously bad. But please do NOT fix without a careful analysis.

Obviously fetching larger batches will give better latency. But it may still consume significantly more RCUs. Ref:

Scan consumes a minimum read capacity unit to perform one strongly consistent read per second, or two eventually consistent reads per second for an item up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB needs additional read request units. Empty tables and very large tables which have a sparse amount of partition keys might see some additional RCUs charged beyond the amount of data scanned. This covers the cost of serving the Scan request, even if no data exists.

What makes it tricky is when we don't have many tombstones (uncommitted deletes). In a Spark workload there are a huge number of listObjects operations with amount=1. Now if we Scan to fetch, say, 50 elements, we will experience 50x amplification -- and run out of RCUs.

An interim solution might be to increase the batch size when the iterator encounters tombstones.
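That adaptive idea can be sketched like this (again a simulation under stated assumptions, not lakeFS code; the doubling policy and cap are illustrative choices):

```python
# Start with batch_size = amount + 1, and grow the batch size whenever a
# fetch comes back containing only tombstones.

def count_kv_scans_adaptive(entries, amount, max_batch=1024):
    """entries: list of (key, is_tombstone). Returns the number of list calls."""
    batch_size = amount + 1
    scans, results, pos = 0, [], 0
    while len(results) < amount and pos < len(entries):
        scans += 1
        batch = entries[pos:pos + batch_size]
        pos += len(batch)
        hits = [k for k, tomb in batch if not tomb]
        results.extend(hits[:amount - len(results)])
        if not hits:                                   # all tombstones:
            batch_size = min(batch_size * 2, max_batch)  # grow next fetch
    return scans

staging = [(f"foo/{i}", True) for i in range(1, 101)] + [("foo/101", False)]
adaptive_scans = count_kv_scans_adaptive(staging, amount=1)
# Batch sizes 2, 4, 8, 16, 32, 64 cover all 100 tombstones and reach
# foo/101 in 6 scans instead of the fixed-size iterator's 51.
```

Because the batch only grows after a fetch full of tombstones, the common Spark case (amount=1, no tombstones) keeps its small first fetch and avoids the RCU amplification described above.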

Another possible patch

Another patch might be to add a flag to listObjects that lets the caller ask not to receive hasMore. That reduces the number of items we need to find by 2x. But for Spark / DBIO / lakeFSFS, this can be much better than 2x!

More importantly, '_' < 'a'. So in the Spark / DBIO case, when it looks for an object _started_* or _committed_*, there can be many deletions immediately after this directory marker, so we might have another gain here.
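The ordering point is easy to check (the file names below are illustrative of the DBIO pattern, not taken from a real bucket):

```python
# '_' (0x5F) sorts before every lowercase letter (0x61+), so DBIO's
# _started_* / _committed_* markers come first under a prefix, and any
# tombstoned data files sort immediately after them in the scan order.
keys = sorted(["part-00000", "_committed_123", "_started_123", "part-00001"])
# -> ['_committed_123', '_started_123', 'part-00000', 'part-00001']
underscore_first = ord('_') < ord('a')
```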

@arielshaqed
Contributor

Relevant: #2092, of course (but this is the quicker win).

@itaiad200
Contributor Author

I think the underlying issue here is pushing application-level logic into the kv store implementation. The application-level logic (Graveler) is "I need 2 non-tombstone entries"; what the kv store gets is "give me 2 entries". Graveler doesn't know much about the specific kv implementation, and it shouldn't control how many entries the store fetches on every call to the iterator. The batch_size is really a hint for the number of entries the kv store should fetch in its first request to the DB. If the caller requests more entries for the same scan, the store shouldn't trust the hint any longer, and the next listing from the DB should just use the default amount for the specific kv store implementation. I can't think of a flow where the caller of the store wants to control the DB requests to fetch just N entries each time. In some flows the hint is useful for the first DB request, but not for the rest, as in this specific bug.
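A sketch of that "hint applies only to the first fetch" behavior (a simulation, not lakeFS code; the store default of 1000 is an assumed value, each store would use its own):

```python
# The caller's amount is honored for the first DB request only; every
# subsequent request for the same scan falls back to the store default.

def count_kv_scans_hinted(entries, amount, store_default=1000):
    """entries: list of (key, is_tombstone). Returns the number of list calls."""
    batch_size = amount + 1       # caller hint, first fetch only
    scans, results, pos = 0, [], 0
    while len(results) < amount and pos < len(entries):
        scans += 1
        batch = entries[pos:pos + batch_size]
        pos += len(batch)
        results.extend(k for k, tomb in batch if not tomb)
        batch_size = store_default  # later fetches use the store default
    return scans

staging = [(f"foo/{i}", True) for i in range(1, 101)] + [("foo/101", False)]
hinted_scans = count_kv_scans_hinted(staging, amount=1)
# First scan fetches 2 tombstones; the second, at the store default size,
# reaches foo/101 -> 2 scans total.
```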

@itaiad200
Contributor Author

Reopening since it wasn't fixed for CosmosDB.
