Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v23.1.x] Add cloud_storage statistics to usage reporting #9670

Merged

Conversation

vbotbuildovich
Copy link
Collaborator

@vbotbuildovich vbotbuildovich commented Mar 28, 2023

Cover Letter

Backport of PR: #9574

Release Notes

Improvements

  • Adds cloud storage metrics to usage reporting

@vbotbuildovich vbotbuildovich added this to the v23.1.x-next milestone Mar 28, 2023
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Mar 28, 2023
@graphcareful graphcareful marked this pull request as ready for review March 28, 2023 01:09
@graphcareful graphcareful self-requested a review March 28, 2023 15:14
@graphcareful graphcareful self-assigned this Mar 28, 2023
@graphcareful graphcareful removed their request for review March 28, 2023 15:17
@graphcareful
Copy link
Contributor

Looks like there is a dependency on this PR #9121

- New method for manually invoking a refresh of the cloud stats. Will
only work if invoked on controller leader.

(cherry picked from commit 1e4ad97)
- close window currently contains an async job that in all cases will
run in a very short amount of time.

- the concern is adding more asynchronous work before in the
close_window method could hold up the opening of the subsequent window.

- Solution is to make close_window synchronous, but start a background
fiber that runs the asynchronous work and later places the results at
the correct historical bucket.

(cherry picked from commit cebeeb9)
- All requests made to the usage endpoint in the usage_test will check
response entries are correctly ordered

(cherry picked from commit 892e1b2)
- Extend the duration of the test to cover scenarios where the max
number of windows is reached and the oldest historical windows are
overwritten.

- Fixes: redpanda-data#9524

(cherry picked from commit a47866c)
@graphcareful
Copy link
Contributor

For reviewers, the PR initially failed to compile. This is because in the first commit of this PR i removed a static member var called current_version. In dev there are no references to this var in cluster_health_types.cc because of PR #9121. However those changes do not exist in v23.1.x.

To prevent having to depend on #9121 , i've manually edited the first commit in this PR to re-include the current_version static member variable in the cluster_health_report struct

@graphcareful
Copy link
Contributor

Apparently this also depends on #9204

- Its impossible to know how much data should be expected over 9092 at
test start.

- Further assertions in the test are sufficent to guarantee that the
feature is working as expected anyway.

(cherry picked from commit 6d27e3d)
- In scenarios where options are triggered on and off quickly it is
possible that inconsistent data may appear on disk.

- There is a race between shutting down of a
usage_manager::accounting_fiber and the start of a new one.

- Fixes: redpanda-data#9652
@graphcareful
Copy link
Contributor

/ci-repeat 2
skip-units
dt-repeat=50
tests/rptest/tests/usage_test.py

@graphcareful
Copy link
Contributor

/ci-repeat 2
skip-units
dt-repeat=50
tests/rptest/tests/usage_test.py::UsageTest.test_usage_settings_changed

@graphcareful
Copy link
Contributor

/ci-repeat 2
skip-units
dt-repeat=50
tests/rptest/tests/usage_test.py

@graphcareful
Copy link
Contributor

/ci-repeat 2
skip-units
dt-repeat=50
tests/rptest/tests/usage_test.py::UsageTest

@graphcareful
Copy link
Contributor

/ci-repeat 2
skip-units
dt-repeat=50
tests/rptest/tests/usage_test

@graphcareful
Copy link
Contributor

/ci-repeat 2
skip-units
dt-repeat=50
tests/rptest/tests/usage_test.py::UsageTest

- Test asserts that 0 bytes are accounted for within metering after
starting up a new node, but that is not necessiarily true. A small
amount of non-client traffic may be observed, and the disk persistance
interval has been set to 1s so post-restart it is expected that some
data may have been accounted for.

- Fixes: redpanda-data#9568
@graphcareful
Copy link
Contributor

/ci-repeat 2
skip-units
dt-repeat=50
tests/rptest/tests/usage_test.py::UsageTest

@graphcareful
Copy link
Contributor

Leaving a helpful note to explain what occured in the manual rebase here:

  1. Initial cherry-pick failed due to a conflict with rpc/raft/cluster: start removing ADL encoding for RPCs #9121
  • 9121 is a deprecation of ADL, the conflict was with removing an unused member variable, the conflict was manually updated
  1. Next conflict was in test code with tests: improvements to SI utils #9204 , when this PR was written there was a BucketView class in the DT test framework.
  • Bringing in 9204 would necessitate other dependencies as well.
  • Solution is to just manually call the admin API to find out the information that the BucketView had within it, namely the amount of bytes in cloud storage

@graphcareful graphcareful merged commit 2cda706 into redpanda-data:v23.1.x Mar 30, 2023
@vshtokman vshtokman modified the milestones: v23.1.x-next, v23.1.4 Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda kind/backport PRs targeting a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants