Skip to content

Comments

storage: add more log context to continuous_batch_parser::consume_records#25395

Merged
WillemKauf merged 2 commits intoredpanda-data:devfrom
WillemKauf:parser_log_line
Mar 24, 2025
Merged

storage: add more log context to continuous_batch_parser::consume_records#25395
WillemKauf merged 2 commits intoredpanda-data:devfrom
WillemKauf:parser_log_line

Conversation

@WillemKauf
Copy link
Contributor

@WillemKauf WillemKauf commented Mar 15, 2025

The ERROR log line we see appearing in the continuous_batch_parser would be more helpful with added information about the batch consumer type and record batch header.

Add a new ERROR log line to output this additional context in the cold path.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Mar 15, 2025

CI test results

test results on build#63192
test_id test_kind job_url test_status passed
cloud_storage_rpfixture.cloud_storage_rpfixture unit https://buildkite.com/redpanda/redpanda/builds/63192#0195988e-9613-41a8-8403-65043aac9ef1 FAIL 0/2
partition_balancer_simulator_test_rpunit.partition_balancer_simulator_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/63192#0195988e-9613-41a8-8403-65043aac9ef1 FLAKY 1/2
rptest.tests.consumer_group_test.ConsumerGroupTest.test_group_lag_metrics ducktape https://buildkite.com/redpanda/redpanda/builds/63192#019598d8-f30f-4e6a-8cd1-cce545aac06a FLAKY 1/2
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_migrated_topic_data_integrity.transfer_leadership=True.params=.cancellation.dir.in.stage.preparing.use_alias.False ducktape https://buildkite.com/redpanda/redpanda/builds/63192#019598e6-84fa-4a1f-83b4-d37ad2020dcc FLAKY 1/2
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=False.with_tiered_storage=True.with_iceberg=False.with_chunked_compaction=True.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/63192#019598d8-f30f-4e6a-8cd1-cce545aac06a FLAKY 1/2
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=False.with_tiered_storage=True.with_iceberg=True.with_chunked_compaction=False.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/63192#019598d8-f30e-4e20-92f2-f85c21a25b58 FLAKY 1/2
test results on build#63196
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/63196#01959a61-e39c-404f-ae52-9d0a90af1e60 FLAKY 1/2
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_migrated_topic_data_integrity.transfer_leadership=True.params=.cancellation.dir.in.stage.preparing.use_alias.False ducktape https://buildkite.com/redpanda/redpanda/builds/63196#01959a47-cc7b-4856-856a-2a4767f588d1 FLAKY 1/2
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.NESSIE ducktape https://buildkite.com/redpanda/redpanda/builds/63196#01959a47-cc7a-4990-9acc-ab6c805b2d07 FLAKY 1/2
rptest.tests.datalake.datalake_upgrade_test.DatalakeUpgradeTest.test_upload_through_upgrade.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK ducktape https://buildkite.com/redpanda/redpanda/builds/63196#01959a61-e39d-4c78-ac0b-b6626b887978 FAIL 0/20
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/63196#01959a61-e39b-411d-9726-dd7954415a9d FLAKY 1/2
test results on build#63357
test_id test_kind job_url test_status passed
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_higher_level_migration_api ducktape https://buildkite.com/redpanda/redpanda/builds/63357#0195af54-9ed4-4e54-a2f7-7220e16cf01f FLAKY 1/2
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_migrated_topic_data_integrity.transfer_leadership=False.params=.cancellation.dir.in.stage.preparing.use_alias.False ducktape https://buildkite.com/redpanda/redpanda/builds/63357#0195af54-9ed4-4e54-a2f7-7220e16cf01f FLAKY 1/2
rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON ducktape https://buildkite.com/redpanda/redpanda/builds/63357#0195af54-9ed4-4e54-a2f7-7220e16cf01f FLAKY 1/2

@vbotbuildovich
Copy link
Collaborator

Retry command for Build#63196

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/datalake_upgrade_test.py::DatalakeUpgradeTest.test_upload_through_upgrade@{"cloud_storage_type":1,"query_engine":"spark"}

Comment on lines 205 to +208
return verify_read_iobuf(
get_stream(), sz, "parser::consume_records", _recovery)
.then([this](result<iobuf> record) -> ss::future<result<stop_parser>> {
if (!record) {
return ss::make_ready_future<result<stop_parser>>(record.error());
}
_consumer->consume_records(std::move(record.value()));
return _consumer->consume_batch_end().then([](stop_parser sp) {
return ss::make_ready_future<result<stop_parser>>(sp);
});
});
.then(
[this](result<iobuf, parser_errc> record)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was being implicitly converted to std::error_code.

why does only this call site need to change? seems like there are other callers of verify_read_iobuf that didn't change, and then it's not clear why the conversion was a problem in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not clear why the conversion was a problem in the first place

I'm outputting the error type in the new log line added here, and to_string(parser_errc) is defined here.

why does only this call site need to change?

This is the only call site which has the return type written out explicitly (others use auto).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only call site which has the return type written out explicitly (others use auto).

right, so in consume_header when auto b = verify... contains an error then we have

if (!b) {
  co_return b.error();
}

in a function that returns result<stop_parser> instead of result<stop_parser, parser_errc> but now it's getting a parser_errc instead std::error_code so presumably some new conversion is happening somewhere? to just say it was being implicitly converted doesn't seem complete enough to understand the impact or non impact of the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I am not correcting all of the sites of implicit conversion with this change- only at this level/call site where I want to log the error, other users still have the implicit conversion to std::error_code.

There's even this awkward conversion back to parser_errc in parser.cc...

I can go back through and correct all the call sites if you'd like to see that in this PR for completion's sake.

Copy link
Member

@dotnwat dotnwat Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not so much about "correcting" as it is about identifying the concern, and addressing it fully in the commit message and/or changing code. i only did the first part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended the commit message to indict why only this call site was corrected.

Copy link
Member

@dotnwat dotnwat Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillemKauf i still don't feel like i understand the implications of this. the conversion escapes through public methods like consume_header() via b.error(). std::error_code and other error codes have different boolean polarities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the concern. This was merged (accidentally) because I had auto-merge enabled and another reviewer approved.

If we don't fully grasp the implications of this change we can revert the commit, and add the log line without the error code being printed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to revert it, but we need explain how things are fine. because just saying that we were casting to std::error_code incorrectly feels like there is potential hidden semantic changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, we have simply moved the location of the implicit cast of parser_errc to std::error_code from verify_read_iobuf() to the return statement in its three callers, namely

All of these functions return a result<> type, which is default typedef'ed to an std::error_code for the error_type (template argument S).

template<
class R,
class S = std::error_code,
class NoValuePolicy = outcome::policy::default_policy<R, S, void>>
using result = outcome::basic_result<R, S, NoValuePolicy>;

The boolean polarity of the error code return type here doesn't matter- we are using operator bool() from outcome::basic_result to check for errors.

There are no outward functional changes to public users of these functions after the return type change/implicit cast location shift.

This was being implicitly converted to `std::error_code`, and a future
commit will attempt to log the error using `to_string(parser_errc)`.

For that reason, the return type and a single call site where the log
line is to be added is amended with the proper type,
`result<iobuf, parser_errc>`, in order to properly log the error.
…records`

This `ERROR` log line would be more helpful with added information about the
batch consumer type and record batch header.

Add a new `ERROR` log line to output these in the cold path.
@WillemKauf
Copy link
Contributor Author

@dotnwat do you still feel there are any blockers for this PR?

Copy link
Contributor

@nvartolomei nvartolomei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@WillemKauf WillemKauf merged commit 3f8a360 into redpanda-data:dev Mar 24, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants