readers: evictable_reader: skip progress guarantee when next pos is partition start #13563

denesb · 2023-04-18T11:58:19Z

The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress has a bug: when the next expected position is a partition-start (another partition), the code would loop forever, effectively reading all there is from the underlying reader.
To avoid this, add a special case to skip the progress guarantee loop altogether when the next expected position is a partition start. In this case, progress is guaranteed anyway, because there is exactly one partition-start fragment in each partition.

Fixes: #13491
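
As a rough illustration of the failure mode described above, here is a toy model in C++. It is not the code in readers/multishard.cc: positions are plain integers, and `partition_start_pos`, `toy_reader`, `make_partition` and `fill` are hypothetical names invented for this sketch. The only thing carried over from the patch is the shape of the loop condition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a position is an int, and partition start sorts before every
// other position in a partition, so it is modelled as the smallest value.
constexpr int partition_start_pos = -1;

struct toy_reader {
    std::vector<int> underlying;   // positions of all fragments in the stream
    std::size_t next = 0;          // index of the next unread fragment
    std::vector<int> buffer;       // fragments read during the current fill

    bool has_next() const { return next < underlying.size(); }
    void push_next() { buffer.push_back(underlying[next++]); }
};

// Builds a stream with one partition-start fragment followed by `rows` rows.
toy_reader make_partition(int rows) {
    toy_reader rd;
    rd.underlying.push_back(partition_start_pos);
    for (int i = 0; i < rows; ++i) {
        rd.underlying.push_back(i);
    }
    return rd;
}

// Reads `initial` fragments, then runs the progress loop either as it was
// before the patch (skip_on_partition_start == false) or with the special
// case added by the patch (skip_on_partition_start == true).
// Returns the number of fragments that ended up in the buffer.
std::size_t fill(toy_reader& rd, int next_expected_pos, std::size_t initial,
                 bool skip_on_partition_start) {
    for (std::size_t i = 0; i < initial && rd.has_next(); ++i) {
        rd.push_next();
    }
    const bool skip = skip_on_partition_start
            && next_expected_pos == partition_start_pos;
    // Nothing pushed into the buffer can sort before partition start, so
    // without `skip` this condition only becomes false at end of stream.
    while (!skip && rd.has_next() && !rd.buffer.empty()
            && next_expected_pos <= rd.buffer.back()) {
        rd.push_next();
    }
    return rd.buffer.size();
}
```

In this model, a partition of 1000 rows filled with `fill(rd, partition_start_pos, 3, false)` drains all 1001 fragments, while passing `true` for the last argument leaves the buffer at 3 fragments.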

scylladb-promoter · 2023-04-18T15:00:23Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/695/

denesb · 2023-04-18T15:03:31Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/695/

#13553

scylladb-promoter · 2023-04-18T17:24:39Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/705/

denesb · 2023-04-19T05:27:25Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/705/

20:30:21  hudson.remoting.ProxyException: groovy.lang.MissingPropertyException: No such property: results for class: WorkflowScript
20:30:21  	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.unwrap(ScriptBytecodeAdapter.java:66)
20:30:21  	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.getProperty(ScriptBytecodeAdapter.java:471)
20:30:21  	at org.kohsuke.groovy.sandbox.impl.Checker$7.call(Checker.java:377)
20:30:21  	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onGetProperty(GroovyInterceptor.java:68)
20:30:21  	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:347)
20:30:21  	at org.kohsuke.groovy.sandbox.impl.Checker$7.call(Checker.java:375)
20:30:21  	at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:379)
20:30:21  	at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:355)
20:30:21  	at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:355)
20:30:21  	at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:355)
20:30:21  	at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:355)
20:30:21  	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.getProperty(SandboxInvoker.java:29)
20:30:21  	at com.cloudbees.groovy.cps.impl.PropertyAccessBlock.rawGet(PropertyAccessBlock.java:20)
20:30:21  	at WorkflowScript.run(WorkflowScript:198)
20:30:21  	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2125)
20:30:21  	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2110)
20:30:21  	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2151)
20:30:21  	at WorkflowScript.run(WorkflowScript:194)
20:30:21  	at ___cps.transform___(Native Method)
20:30:21  	at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:73)
20:30:21  	at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
20:30:21  	at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:65)
20:30:21  	at jdk.internal.reflect.GeneratedMethodAccessor571.invoke(Unknown Source)
20:30:21  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
20:30:21  	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
20:30:21  	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
20:30:21  	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
20:30:21  	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
20:30:21  	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:152)
20:30:21  	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146)
20:30:21  	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
20:30:21  	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
20:30:21  	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146)
20:30:21  	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
20:30:21  	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
20:30:21  	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187)
20:30:21  	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:420)
20:30:21  	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:330)
20:30:21  	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:294)
20:30:21  	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
20:30:21  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
20:30:21  	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
20:30:21  	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
20:30:21  	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
20:30:21  	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
20:30:21  	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
20:30:21  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
20:30:21  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
20:30:21  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
20:30:21  	at java.base/java.lang.Thread.run(Thread.java:829)
20:30:21  Finished: FAILURE

@benipeled

scylladb-promoter · 2023-04-19T08:44:37Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/714/

denesb · 2023-04-19T11:02:13Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/714/

#13553

scylladb-promoter · 2023-04-20T15:26:18Z

CI state SUCCESS - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/762/

bhalevy · 2023-04-25T04:46:50Z

readers/multishard.cc

-        while (next_mf && _tri_cmp(_next_position_in_partition, buffer().back().position()) <= 0) {
+        // This loop becomes infinite when next pos is a partition start.
+        // In that case progress is guaranteed anyway, so skip this loop entirely.
+        while (!_next_position_in_partition.is_partition_start() && next_mf && _tri_cmp(_next_position_in_partition, buffer().back().position()) <= 0) {


By the way, we access buffer().back().position() below, right after this loop.
How do we know that the buffer isn't empty at that point?

shouldn't that be done under while (next_mf)?

By the way, we access buffer().back().position() below, right after this loop. How do we know that the buffer isn't empty at that point?

See !is_buffer_empty() in the enclosing if.

sorry, I got confused and thought we're consuming from the buffer, but we're actually filling it.

avikivity · 2023-04-27T20:12:58Z

readers/multishard.cc

-        while (next_mf && _tri_cmp(_next_position_in_partition, buffer().back().position()) <= 0) {
+        // This loop becomes infinite when next pos is a partition start.
+        // In that case progress is guaranteed anyway, so skip this loop entirely.
+        while (!_next_position_in_partition.is_partition_start() && next_mf && _tri_cmp(_next_position_in_partition, buffer().back().position()) <= 0) {


I don't understand the comment. Why is it guaranteed we make progress on partition_start?

Making progress is only a concern due to range tombstone changes, which can have non-monotonically increasing positions. The evictable reader needs to ensure that each buffer fill ends with a position strictly larger than that of the previous buffer fill. When the next expected position is a partition start, this is guaranteed and need not be checked (partition start means a new partition is started, so we make partition-level progress).

I see. Though I don't see why the current code is broken. You read the partition start and emit it. Then you read the partition end, which should have its position_in_partition greater than the partition start, which should stop the loop.

It's absolutely broken to compare position_in_partition without noticing that we changed partitions, as position_in_partition can't be compared across partitions. So I agree with the fix, just wondering why partition_end didn't save us.

Reading an entire partition into memory is an OOM sentence if the partition is large.

But I don't see it. push_mutation_fragment() updates buffer().back().

But doesn't buffer.back() change during the loop? Ah, I guess it doesn't.

It does. And we keep changing it (by pushing new fragments) until the condition becomes false and we exit the loop. The problem is that if _next_position_in_partition is partition_start, no matter what we push to the buffer, _tri_cmp(_next_position_in_partition, buffer().back().position()) <= 0 will always hold.

Ok - I get it now. I thought that we check that the next unconsumed position is after the last consumed position, but we check against some position that isn't advanced. Thanks for bearing with me.

I don't get it. The code is while (_next_position_in_partition <= back) advance_back...

But if this condition is supposed to check for progress against _next_position_in_partition, shouldn't it be while (back <= (<?) _next_position_in_partition) advance_back...?

Yup. If you modify the test below to call fill_buffer() a second time, this time it will read till the end of partition instead of reading another small batch of fragments - because the comparison is done in the wrong direction.
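
The direction question raised in this thread can be checked on a toy stream. This is a sketch under assumptions, not the reader's code: positions are integers, `fragments_read` and `direction` are hypothetical names, and the `reversed` branch uses `<=` even though the thread leaves open whether `<` or `<=` is the right operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class direction { as_merged, reversed };

// Reads `initial` fragments from `stream`, then keeps reading while the
// progress condition holds. Returns how many fragments were read in total.
std::size_t fragments_read(const std::vector<int>& stream, int next_pos,
                           std::size_t initial, direction dir) {
    std::size_t n = std::min(initial, stream.size());
    while (n > 0 && n < stream.size()) {
        const int back = stream[n - 1];   // position of the last fragment read
        const bool keep_reading = dir == direction::as_merged
                ? next_pos <= back        // shape of the condition in this PR
                : back <= next_pos;       // direction suggested in the review
        if (!keep_reading) {
            break;
        }
        ++n;
    }
    return n;
}
```

For the stream `{5, 5, 6, 7, 8, 9, 10, 11}` with `next_pos == 5` and `initial == 1`, the `as_merged` direction reads all 8 fragments, while `reversed` stops after 3, at the first fragment whose position is past `next_pos`.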

avikivity · 2023-05-02T11:01:17Z

test/boost/mutation_reader_test.cc

+
+    rd.fill_buffer().get();
+    auto buf1 = rd.detach_buffer();
+    BOOST_REQUIRE_EQUAL(buf1.size(), 3);


Does this test fail before the patch with an infinite loop (or by consuming all the reader)? If not, it's just testing some detail of the implementation, not the bug.

Yes, the test fails before the patch, by finding that the reader read more than expected into the buffer. Unfortunately, there is no way to write a test for this without involving specific details about how readers work.

More than expected != the failure condition.

If the reader would stop at 4 or 2 fragments before the fix, then the test doesn't reproduce the infinite loop.

If it's really unbounded, then you can have a reader with 10 fragments and assert that not all fragments were consumed, rather than that some specific number was consumed (another number would have done just as well, as long as it's not "consume the entire stream while some data-dependent condition holds").

Before the fix, the reader would read all 1003 fragments of test data. After the fix, it stops after 3 fragments, which is where it should stop, according to the precalculated max buffer size.

Ok. I suggest changing the condition to < 10. Instead of enshrining some detail, let's check the actual failure (we can't check that the number of fragments is infinite, but 10 is a close approximation).

That the reader does not read more than its max-buffer-size is actually part of the reader contract. But I can increase this number so that it allows the reader to read a few more fragments than expected.

I'll just queue it.

…artition start
Fixes: #13491
Closes #13563
(cherry picked from commit 72003dc)

…ition
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction. So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end, which may lead to OOMs such as in scylladb#13491.
There was already a fix in this area to handle `partition_start` fragments correctly - scylladb#13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in scylladb#13563) to detect this case.
Fixes scylladb#13491

…re partition' from Kamil Braun
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction. So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end, which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes #13491
Closes #14375
* github.com:scylladb/scylladb:
  readers: evictable_reader: don't accidentally consume the entire partition
  test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position

…re partition' from Kamil Braun
(cherry picked from commit 586102b)

denesb mentioned this pull request Apr 18, 2023

sstableloader/nodetool refresh: bad_alloc (seastar - Failed to allocate 536870912 bytes) #13491

Closed

bhalevy reviewed Apr 25, 2023

bhalevy approved these changes Apr 25, 2023

avikivity reviewed Apr 27, 2023

avikivity reviewed May 2, 2023

scylladb-promoter closed this in 72003dc May 2, 2023

kbr-scylla mentioned this pull request Jun 23, 2023

readers: evictable_reader: don't accidentally consume the entire partition #14375

Merged
