Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v23.2.x] Backport of #12544 #12596 #12644 #12586 #12737 #12368 #13480 #13784

Merged
merged 36 commits into from
Oct 3, 2023

Conversation

ztlpn
Copy link
Contributor

@ztlpn ztlpn commented Sep 28, 2023

Backport of PR #12544
Backport of PR #12596
Backport of PR #12644
Backport of PR #12586
Backport of PR #12737
Backport of PR #12368
Backport of PR #13480

Switches the tree to operate on ss::shared_ptr<log>. This is a purely
mechnical preparation for the next commit which will remove the pimpl
wrapper around. Making this change first means that the two
modifications do not need to be in the same commit. The oddities that
arise (e.g. optional<shared_ptr>) will be cleaned-up in a subsequent
commit.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 5c9e845)

Conflicts:
	src/v/raft/recovery_stm.cc
	src/v/storage/log_manager.cc
	src/v/storage/tests/storage_e2e_test.cc
The get_impl wrapper is removed in the next commit as it is now
unnecessary.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 98da038)
@ztlpn
Copy link
Contributor Author

ztlpn commented Sep 28, 2023

@dotnwat added a couple of your PRs to make backports of #12737 and possible future storage PRs easier, please review.

Now that holders of an ss::shared_ptr alreayd have a polymorphic pointer
to the underlying implementation there is no need for an extra accessor.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 4674479)

Conflicts:
	src/v/storage/log_manager.cc
Now a nullptr can serve the same purpose.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit a6df0a1)
Not strictly necessary, but a nice safety net since the only users of
the log should be holding a shared pointer anyway.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit f722387)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 89f3ba3)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 9b8a051)

Conflicts:
	src/v/storage/log_manager.cc
Exposes the segment set through the log interface, and removes the
dynamic_cast to disk_log_impl. The segment is forward declared as its
header includes some bits that wreck compilation due to naming conflicts
in other parts of the tree. This will be addressed later.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 17f5354)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 5a18e56)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit a750cc9)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit a85ece4)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit f5fbe1c)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 8023f90)
Exposes the probe. The probe is forward declared because the probe.h
header brings a long some bits that wreck compilation in a few places
with naming clases, which will be fixed up later.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit d57a518)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 1dcbc8b)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 98c4d4d)
@ztlpn ztlpn force-pushed the v23.2.x-bp branch 2 times, most recently from 8f790f3 to 3d24ead Compare September 29, 2023 01:07
@ztlpn ztlpn changed the title [v23.2.x] Backport of #12544 #12596 #12737 #12368 #13480 [v23.2.x] Backport of #12544 #12596 #12644 #12737 #12368 #13480 Sep 29, 2023
@ztlpn ztlpn changed the title [v23.2.x] Backport of #12544 #12596 #12644 #12737 #12368 #13480 [v23.2.x] Backport of #12544 #12596 #12737 #12368 #13480 Sep 29, 2023
This can be useful in tests that want to truncate up to the max
collectible offset.

(cherry picked from commit 3f31502)
Some tests may want to control housekeeping manually. This makes this
easier by optionally ommitting the start() of the housekeeping loop.

(cherry picked from commit a4f4bc2)
Current tests use acks=1 because there's usually just one node. I have
an upcoming test that will use multiple nodes, and I'd like to wait on
acks.

(cherry picked from commit 827c511)
When instantiating multiple fixtures, new fixtures overwrite node
properties like the Kafka port. This makes it annoying to use Kafka
transports against multiple fixtures.

This patch makes this easier by caching the port and using it in the
make_kafka_client() call.

read_replica_test is updated to accommodate this change.

(cherry picked from commit 4d71121)
I have an upcoming test that will use multiple replicas.

(cherry picked from commit 3c03b82)
Similar to cluster_test_fixture, but brings in tiered storage.

(cherry picked from commit 6443a8b)
Currently we have cooperative_spin_wait_with_timeout that throws if a condition
isn't met after a given timeout. The way this typically shows up in tests is
the exception is caught by the test fiber, the test's destructor is called, and
the generic timeout exception is logged after everything is cleaned up. This
makes it difficult to track down issues because it's rarely clear what timed
out.

This patch adds a new boost_require_eventually macro to BOOST_FAIL on timeout,
indicating the log line that failed.

Tests that use this will timeout like:
```
INFO  2023-08-03 12:53:16,023 [shard 0] cluster - members_manager.cc:373 - applying finish_reallocations_cmd, offset: 10, node id: 1
/home/awong/Repos/redpanda/src/v/cloud_storage/tests/delete_records_e2e_test.cc(272): fatal error: in "test_delete_from_stm_consume": Timed out at /home/awong/Repos/redpanda/src/v/cloud_storage/tests/delete_records_e2e_test.cc:272
INFO  2023-08-03 12:53:16,638 [shard 0] kafka - server.cc:278 - kafka rpc protocol - Stopping 1 listeners
INFO  2023-08-03 12:53:16,638 [shard 0] kafka - server.cc:287 - kafka rpc protocol - Shutting down 2 connections
```

Instead of logs like below that don't indicate what timed out:
```
INFO  2023-08-03 12:56:48,354 [shard 0] cluster - members_manager.cc:373 - applying finish_reallocations_cmd, offset: 10, node id: 1
INFO  2023-08-03 12:56:48,950 [shard 0] kafka - server.cc:278 - kafka rpc protocol - Stopping 1 listeners
INFO  2023-08-03 12:56:48,950 [shard 0] kafka - server.cc:287 - kafka rpc protocol - Shutting down 2 connections
...
INFO  2023-08-03 12:56:49,003 [shard 0] kafka - server.cc:287 - kafka rpc protocol - Shutting down 0 connections
unknown location(0): fatal error: in "test_delete_from_stm_consume": seastar::timed_out_error: timedout
/home/awong/Repos/redpanda/src/v/cloud_storage/tests/delete_records_e2e_test.cc(270): last checkpoint
/home/awong/Repos/redpanda/src/v/cloud_storage/tests/delete_records_e2e_test.cc(256): Leaving test case "test_delete_from_stm_consume"; testing time: 4194410us
```

I don't love that boost permeates the interface, but until we move to
another test framework, it seems reasonable to tie the implementation
and the macro name to boost.

(cherry picked from commit 372c159)
@ztlpn ztlpn changed the title [v23.2.x] Backport of #12544 #12596 #12737 #12368 #13480 [v23.2.x] Backport of #12544 #12596 #12644 #12737 #12368 #13480 Sep 29, 2023
andrwng and others added 13 commits September 29, 2023 03:49
This will make it easier to figure out what is timing out, when it
happens.

(cherry picked from commit de08acd)
Seems to have been partially cleaned up in
redpanda-data#9121.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit cfd6325)
The additional point where data is updated seems to have been overlooked
in redpanda-data@26fb548#diff-708b13fb33ad235d5c420e38131e7188daeee60ff6a4e1054ed480a36142ccb2

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 3981467)
When reclaimable space is calculated the size above local retention is
cached so that it is available to the health report subsystem.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit dd8fbc8)
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit fb8657b)
Tests that the reclaimable local size from cloud storage topic is
reflected back through health report.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit ac6a307)
This happens in several places so factoring it out into a reusable
method.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 9f6ada4)

Conflicts:
	src/v/storage/log_manager.cc
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit fa9fb6d)
The reason is twofold:
1. We have another option for controlling the concurrency of the
   movement batch - partition_autobalancing_concurrent_moves which is
   easier to understand
2. Part of initial motivation is gone - in the past we waited for the
   whole batch to finish before scheduling the next movements so it was
   important for the batch to finish in a reasonable amount of time, but
   now we can schedule new movement while the old ones are still in
   progress.

(cherry picked from commit e9721e2)
With the introduction of space management data eviction becomes lazy:
partitions can grow beyond their eviction limit if there is sufficient
space to do it and only if disk space pressure becomes high, this
"extra" data is evicted.

For partition balancing this means two things: 1) partitions sizes can
vary widely between replicas and 2) when predicting how much space the
partition will take on a node if it is moved there, we must take into
account the possibility of reclaim (otherwise we won't allow moves that
are perfectly possible if reclaim is applied).

So this commit we 1) store current sizes in a per-replica map instead of
a single value and 2) use non-reclaimable size when predicting the
partition size on the target replica.

(cherry picked from commit 3480ba8)
@ztlpn ztlpn changed the title [v23.2.x] Backport of #12544 #12596 #12644 #12737 #12368 #13480 [v23.2.x] Backport of #12544 #12596 #12644 #12586 #12737 #12368 #13480 Sep 29, 2023
@ztlpn
Copy link
Contributor Author

ztlpn commented Sep 29, 2023

k8s issues are #13669

@ztlpn ztlpn merged commit 080ed02 into redpanda-data:v23.2.x Oct 3, 2023
@RafalKorepta RafalKorepta added this to the v23.2.12 milestone Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants