Option to use message index instead of record sequence for partitioning #470

Merged
6 commits merged into streamnative:master on Nov 1, 2022

Conversation

nicoloboschi (Contributor) commented Oct 17, 2022

Motivation

Currently all the partitioners use the recordSequence field to partition messages. That field doesn't account for batch messages, so objects can be lost on the target storage: the record sequence has the same value for all the entries in a batch message, and only the last entry of the batch ends up persisted on the storage.

Modifications

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs)

github-actions (bot) commented:

@nicoloboschi: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about docs, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

The github-actions bot added the doc-info-missing label ("This PR needs to mark a document option in the description") on Oct 17, 2022.
alpreu (Contributor) commented Oct 17, 2022

> The record sequence has the same value for all the entries in a batch message. Only the last entry of the batch will be persisted on the storage.

Can you explain this scenario a bit more: why would only the last message be persisted? If all messages in a batch share the same recordSequence, I would expect them to end up in the same output file.

nicoloboschi (Contributor, Author) commented Oct 17, 2022

> The record sequence has the same value for all the entries in a batch message. Only the last entry of the batch will be persisted on the storage.
>
> Can you explain this scenario a bit more: why would only the last message be persisted? If all messages in a batch share the same recordSequence, I would expect them to end up in the same output file.

Sure. Let's take this scenario. We set batchSize=2.
We receive 3 records that are in the same BK entry, so they have the same recordSequence.
Records 1 and 2 are in the same flush cycle. Record 1 is taken to build the output filename (from its recordSequence), and the S3 object is created.

When the third record is flushed, the resulting output filename is the same, because its recordSequence is the same as the first record's. So the previous object is overwritten in S3.
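To make the overwrite concrete, here is a minimal, self-contained sketch (an illustration, not the connector's code; all class and method names are hypothetical) showing how a sequence-based name collides across the two flushes while an index-based name does not:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: three records from the same BookKeeper entry share a
// recordSequence but carry distinct per-message indexes.
public class PartitionCollisionDemo {

    record Rec(long recordSequence, long index) {}

    // A sequence-based partitioner derives the object name from the first record's sequence.
    static String nameFromSequence(List<Rec> flush) {
        return "public/default/topic-partition-0/" + flush.get(0).recordSequence() + ".json";
    }

    // An index-based partitioner uses the per-message index instead.
    static String nameFromIndex(List<Rec> flush) {
        return "public/default/topic-partition-0/" + flush.get(0).index() + ".json";
    }

    public static void main(String[] args) {
        // batchSize = 2: records 1 and 2 go out in the first flush, record 3 in the second.
        List<Rec> firstFlush = List.of(new Rec(7, 100), new Rec(7, 101));
        List<Rec> secondFlush = List.of(new Rec(7, 102));

        Map<String, List<Rec>> bucket = new HashMap<>(); // stands in for the S3 bucket

        // Sequence-based: both flushes produce the same key, so the second PUT
        // overwrites the first object and records 1 and 2 are lost.
        bucket.put(nameFromSequence(firstFlush), firstFlush);
        bucket.put(nameFromSequence(secondFlush), secondFlush);
        System.out.println("sequence-based objects: " + bucket.keySet()); // one object

        bucket.clear();

        // Index-based: the keys differ, so both objects survive.
        bucket.put(nameFromIndex(firstFlush), firstFlush);
        bucket.put(nameFromIndex(secondFlush), secondFlush);
        System.out.println("index-based objects: " + bucket.keySet()); // two objects
    }
}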

freeznet (Member) left a comment

Thanks @nicoloboschi
The use of a message index for the partitioner is okay; it gives the user another option. So overall the feature LGTM. But this PR is not solving the root cause here, which is that the cloud storage connector does not handle the case where the target file is overwritten. When a target file already exists, the cloud storage connector should either load the existing file content, append the new content, and overwrite it, or use another file name to save the file.

nicoloboschi (Contributor, Author) replied

> But this PR is not solving the root cause here, which is that the cloud storage connector does not handle the case where the target file is overwritten.

That's true. Moreover, the index approach is not enabled by default. But the solution is not that easy.

> When a target file exists, the cloud storage connector should either load the file content, append the new content and overwrite it

This would violate the batchSize parameter.

> use another file name to save the file.

That would make sense. A time-based solution could be implemented as a new partitioner type.
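For illustration only (not part of this PR; the class and method names are hypothetical), a time-based name could append the flush timestamp so a repeated recordSequence no longer maps to the same object:

import java.time.Instant;

// Sketch only: appending a flush timestamp to the object name avoids the overwrite
// even when the recordSequence repeats, at the cost of non-deterministic names
// (and it still would not deduplicate redelivered records).
public class TimeBasedNameDemo {

    static String objectName(String topicPartition, long recordSequence) {
        return topicPartition + "/" + recordSequence + "-" + Instant.now().toEpochMilli() + ".json";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(objectName("topic-partition-0", 7));
        Thread.sleep(2); // ensure the second flush falls in a later millisecond
        System.out.println(objectName("topic-partition-0", 7)); // different name, no overwrite
    }
}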

However, neither the time-based approach nor the index-based one guarantees the absence of duplicates when ProcessingGuarantees = ATLEAST_ONCE.
The only way to avoid duplicates is to set batchSize = 1 (but that's not a good idea in terms of efficiency).

I think it's fine to merge this pull request anyway.

Review comment on lines 85 to 93 of the changed file:
    } else {
        LOGGER.debug("Found message {} with hasIndex=true but index is empty, using recordSequence",
                message.getMessageId());
    }
} else {
    LOGGER.debug("partitionerUseIndexAsOffset configured to true but no index found on the message {}, "
            + "perhaps the broker didn't exposed the metadata, using recordSequence",
            message.getMessageId());
}
Contributor commented:

I don't think it's a good idea to implicitly fall back to the recordSequence if the user has configured the partitionerUseIndexAsOffset property. At the very least this should be a WARN log, or we should throw an exception here imo.

nicoloboschi (Contributor, Author) replied:

I understand your concern.
The problem is that whoever installs the sink may not know whether the broker will put the index in the record metadata.

In my first implementation this was logged at WARN level: 15659aa#diff-ed19908cc0ff954ebf1f795a0e3fd708a3b1be501cda66c22ea1906d771da90cR91

This is the same behavior (except for the special handling of batch messages) as in the Kafka Connect adaptor in Apache Pulsar: https://github.com/apache/pulsar/blob/4c22159f5a972e7a92382b9a90c6c70c43c5d166/pulsar-io/kafka-connect-adaptor/src/main/java/org/apache/pulsar/io/kafka/connect/KafkaConnectSink.java#L312-L359

To improve the usability we could:

  • Explicitly state the behavior in the parameter doc
  • Set those logs to WARN

Another (more complex) option would be to add a "noIndexAction" flag that controls whether to throw an error or accept records without an index when partitionerUseIndexAsOffset is true.
However, I would rather avoid adding another boilerplate option.
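A minimal sketch of what the WARN-level fall-back discussed above could look like (an illustration with hypothetical names and signature, not the connector's actual method):

import java.util.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: prefer the message index when partitionerUseIndexAsOffset is enabled,
// otherwise fall back to recordSequence and say so at WARN level.
public final class OffsetResolver {

    private static final Logger LOGGER = LoggerFactory.getLogger(OffsetResolver.class);

    static long resolveOffset(boolean useIndexAsOffset,
                              Optional<Long> messageIndex,
                              long recordSequence,
                              String messageId) {
        if (useIndexAsOffset) {
            if (messageIndex.isPresent()) {
                return messageIndex.get();
            }
            // Raised from DEBUG to WARN, as agreed in this thread.
            LOGGER.warn("partitionerUseIndexAsOffset is true but message {} carries no index "
                    + "(the broker may not expose the entry metadata); falling back to recordSequence",
                    messageId);
        }
        return recordSequence;
    }
}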

Contributor replied:

Yeah, let's not add another flag :) I think moving the log to WARN and extending the parameter description is a good tradeoff.

alpreu (Contributor) commented Oct 26, 2022

Could you resolve the doc conflicts and add the new parameter to the Azure provider config properties too? Then it's good to merge IMO :)

alpreu (Contributor) left a comment

Thank you! LGTM

alpreu merged commit a7672fb into streamnative:master on Nov 1, 2022