
rptest: Refactor consumer validation and tune table idle settings #16250

Merged
merged 5 commits into dev on Jan 29, 2024

Conversation

savex
Contributor

@savex savex commented Jan 23, 2024

Since the Table API consumer waits for the index indefinitely, update the validation to

  • first wait for the data file to be created
  • then wait for the data files to reach the proper index value (sketched below)

This PR also fixes log copying for docker envs by properly getting the hostname.

Fixes: redpanda-data/devprod#1031
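
A minimal sketch of the two-stage wait described above, using ducktape's wait_until helper. The consumer accessors (data_files, current_index) are hypothetical stand-ins for whatever the rptest consumer wrapper actually exposes, not the PR's real code:

from ducktape.utils.util import wait_until


def wait_for_consumer_index(consumer, min_index, timeout_sec=120):
    # Stage 1: the Table API consumer creates its data file lazily, so first
    # wait for the file to appear at all.
    wait_until(lambda: len(consumer.data_files()) > 0,
               timeout_sec=timeout_sec,
               backoff_sec=5,
               err_msg="consumer never created a data file")
    # Stage 2: only then wait for the data file(s) to reach the expected index.
    wait_until(lambda: consumer.current_index() >= min_index,
               timeout_sec=timeout_sec,
               backoff_sec=5,
               err_msg="consumer data never reached the expected index")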

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@savex savex requested a review from bharathv January 23, 2024 19:07
@savex savex marked this pull request as ready for review January 23, 2024 19:08
@savex savex marked this pull request as draft January 24, 2024 00:12
@savex savex force-pushed the dp-1031-fix-transaction-flackyness branch 5 times, most recently from 93cbf3d to 36adda4 on January 24, 2024 18:19
@savex savex marked this pull request as ready for review January 24, 2024 18:19
@savex savex force-pushed the dp-1031-fix-transaction-flackyness branch from 36adda4 to 7c37a4f on January 24, 2024 21:56
@@ -128,6 +128,12 @@ def setup(self):
        table_env = StreamTableEnvironment.create(
            stream_execution_environment=env, environment_settings=settings)

        # Tune table idle state handling
        # Clear the state if it has not changed
Contributor

Just curious: are these needed for this fix, or are they just nice to have?

Contributor Author

This is a safeguard to trigger cleanups/aborts for ongoing actions/transformations. We do not want Flink to wait indefinitely (the default) for the other party to answer if the Table API is querying something in RP.
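
For context, a minimal PyFlink sketch of what this kind of idle-state tuning can look like. The option name, value, and setter are illustrative (and assume a Flink version whose TableConfig exposes set(); older releases go through get_configuration().set_string()); they are not necessarily the settings the PR actually applies:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.in_streaming_mode()
table_env = StreamTableEnvironment.create(
    stream_execution_environment=env, environment_settings=settings)

# Tune table idle state handling: clear per-key state that has not changed
# for a while instead of retaining it forever (a TTL of 0, the default,
# means "never expire"). The 30 s value is illustrative.
table_env.get_config().set("table.exec.state.ttl", "30 s")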

tests/rptest/tests/flink_basic_test.py (review thread resolved)
@bharathv
Contributor

/ci-repeat 1
skip-units
dt-repeat=20
tests/rptest/tests/flink_basic_test.py

@savex
Contributor Author

savex commented Jan 25, 2024

/ci-repeat 1
skip-units
dt-repeat=10
tests/rptest/tests/flink_basic_test.py

   Use internal jobmanager metrics to detect if a job's
   subtasks (vertices) have been idle for 30 sec. Key metrics
   used are accumulated-idle-time and accumulated-busy-time
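
A hedged sketch of how such an idle check could be implemented against the jobmanager REST API. It assumes a Flink version whose /jobs/<job-id> response exposes the per-vertex accumulated-busy-time metric named in the commit message; the function and parameter names are illustrative, not the PR's code:

import time

import requests


def job_is_idle(rest_url, job_id, window_sec=30.0):
    """Sample each vertex's accumulated-busy-time twice, window_sec apart;
    if no vertex accumulated any busy time in between, treat the job as idle."""

    def busy_times():
        job = requests.get(f"{rest_url}/jobs/{job_id}").json()
        return {
            v["id"]: v.get("metrics", {}).get("accumulated-busy-time", 0)
            for v in job.get("vertices", [])
        }

    before = busy_times()
    time.sleep(window_sec)
    after = busy_times()
    # Idle means no subtask (vertex) got any busier during the window.
    return all(after.get(vid, 0) == busy for vid, busy in before.items())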
@savex savex force-pushed the dp-1031-fix-transaction-flackyness branch from 10a1656 to bd25808 on January 25, 2024 21:21
@redpanda-data redpanda-data deleted a comment from CLAassistant Jan 25, 2024
@redpanda-data redpanda-data deleted a comment from CLAassistant Jan 25, 2024
@savex
Contributor Author

savex commented Jan 25, 2024

EC2 check

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_id:    rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload
status:     PASS
run time:   1 minute 46.423 seconds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
==============================================================================================================================================================================================================================================================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.18
session_id:       2024-01-25--025
run time:         48 minutes 55.396 seconds
tests run:        20
passed:           20
flaky:            0
failed:           0
ignored:          0
opassed:          0
ofailed:          0
==============================================================================================================================================================================================================================================================================================================================

@savex
Contributor Author

savex commented Jan 25, 2024

/ci-repeat 1
skip-units
dt-repeat=20
tests/rptest/tests/flink_basic_test.py

@savex savex requested a review from bharathv January 25, 2024 22:11
@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 25, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/44317#018d42dc-79df-4324-8242-621b6f201002:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"
"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44312#018d4672-d00c-4fc6-84f7-dd4cd71c7e0c:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

@savex
Contributor Author

savex commented Jan 26, 2024

Caught the reason for the above errors:

        Caused by: org.apache.flink.kafka.shaded.org.apache.kafka.common.errors.ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one.
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to send data to Kafka flink_workload_topic-0@-1 with FlinkKafkaInternalProducer{transactionalId='flink_transaction_test_1-0-1', inTransaction=true, closed=false}
        at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.decorateException(KafkaWriter.java:477) ~[?:?]
        at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.onCompletion(KafkaWriter.java:451) ~[?:?]
        at org.apache.flink.kafka.shaded.org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1418) ~[?:?]
        at org.apache.flink.kafka.shaded.org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:273) ~[?:?]
        at org.apache.flink.kafka.shaded.org.apache.kafka.clients.producer.internals.ProducerBatch.abort(ProducerBatch.java:161) ~[?:?]
        at org.apache.flink.kafka.shaded.org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:794) ~[?:?]
        at org.apache.flink.kafka.shaded.org.apache.kafka.clients.producer.internals.Sender.maybeAbortBatches(Sender.java:498) ~[?:?]
        at org.apache.flink.kafka.shaded.org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:307) ~[?:?]
        at org.apache.flink.kafka.shaded.org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243) ~[?:?]

When Flink parallelizes INSERT operators into different jobs, it uses the same transactionalId.

@savex
Contributor Author

savex commented Jan 26, 2024

This is the result of that issue:

    When Flink parallelizes jobs based on different INSERT operators, it uses the same sink.transactional-id-prefix that was set for the single temporary table. This causes fencing on the RP side and, as a result, flakiness when run in slower environments (docker). This is fixed by creating/deleting a temporary table for each batch.
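
A minimal PyFlink sketch of the per-batch fix described above: create a temporary Kafka sink table with a batch-unique sink.transactional-id-prefix, run the INSERT, then drop the table. Table, topic, and column names are illustrative, not the PR's actual code:

def run_batch(table_env, batch_idx, topic, brokers):
    # A unique prefix per batch keeps two producers from sharing a
    # transactional id, which is what triggers ProducerFencedException.
    prefix = f"flink_transaction_test_{batch_idx}"
    table_env.execute_sql(f"""
        CREATE TEMPORARY TABLE sink_{batch_idx} (
            k STRING,
            v STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = '{topic}',
            'properties.bootstrap.servers' = '{brokers}',
            'format' = 'json',
            'sink.delivery-guarantee' = 'exactly-once',
            'sink.transactional-id-prefix' = '{prefix}'
        )
    """)
    # Blocking insert for the sketch; the source table is a placeholder.
    table_env.execute_sql(
        f"INSERT INTO sink_{batch_idx} SELECT k, v FROM source_{batch_idx}"
    ).wait()
    table_env.execute_sql(f"DROP TEMPORARY TABLE sink_{batch_idx}")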
@savex savex self-assigned this Jan 26, 2024
@savex
Contributor Author

savex commented Jan 29, 2024

/ci-repeat 1
skip-units
dt-repeat=20
tests/rptest/tests/flink_basic_test.py

   The copy operation sometimes does not finish copying and/or
   does not copy the file at all. Rely on ssh_output instead,
   in the scope of the basic test
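
A sketch of the kind of change that commit describes, assuming ducktape's RemoteAccount.ssh_output; the helper name and path handling are illustrative:

def read_remote_file(node, path):
    # Instead of copying the file off the node (which sometimes produced a
    # partial or missing copy), read its contents over SSH in one shot.
    return node.account.ssh_output(f"cat {path}")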
@savex savex force-pushed the dp-1031-fix-transaction-flackyness branch from abbd201 to 7506cef on January 29, 2024 19:19
@savex
Contributor Author

savex commented Jan 29, 2024

/ci-repeat 1
skip-units
dt-repeat=20
tests/rptest/tests/flink_basic_test.py

        active = [
            job for job in jobs['jobs']
            if job['status'] in self.job_active_statuses
        ]
        return active

    def _has_active_jobs(self):
    def is_job_idle(self, jobid) -> bool:
Contributor

This is way too complicated lol, I can't imagine the Flink API is this horrible; we are probably missing something.

@savex savex merged commit bc76650 into dev Jan 29, 2024
17 checks passed
@savex savex deleted the dp-1031-fix-transaction-flackyness branch January 29, 2024 23:13