
feat(sink): implement pulsar sink #12286

Merged · 14 commits merged into risingwavelabs:main from pulsar-sink-dev on Sep 19, 2023
Conversation

@Rossil2012 (Contributor) commented on Sep 13, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

WIP: implement pulsar sink

TODO:

  • Extract common properties of pulsar sink and source and the client builder into common.rs.
  • Append-only sink.
  • Upsert sink.
  • Careful error handling.
  • Support more config fields (e.g. retry interval/times).
  • Add necessary unit/integration tests.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features; see Sqlsmith: SQL feature generation #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@tabVersion marked this pull request as ready for review on September 15, 2023 at 07:13
@tabVersion (Contributor) left a comment:

Took a rough look; will refactor after #12321.
Merge after @xzhseh's review.

Comment on lines 61 to 65
#[serde(
rename = "properties.retry.max",
default = "_default_max_retries",
deserialize_with = "deserialize_u32_from_string"
)]
Contributor:

You may use serde_as instead, e.g.:

#[serde_as(as = "Option<DisplayFromStr>")]
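
For illustration, a minimal sketch of the suggested serde_as style, assuming the serde_with crate; the struct and field names here are placeholders, not the PR's actual config:

use serde::Deserialize;
use serde_with::{serde_as, DisplayFromStr};

#[serde_as]
#[derive(Debug, Deserialize)]
pub struct PulsarConfigSketch {
    // Parses the string value of `properties.retry.max` (e.g. "3") into
    // Some(3u32); stays None when the option is not provided.
    #[serde_as(as = "Option<DisplayFromStr>")]
    #[serde(rename = "properties.retry.max", default)]
    pub max_retry_num: Option<u32>,
}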

@Rossil2012 (Contributor, author):

ACK

src/connector/src/sink/pulsar.rs (comment resolved)
)));
}

// TODO: validate pulsar connection
Contributor:

You can do what the source enumerator does here, but you only need to know whether the call succeeds:

async fn list_splits(&mut self) -> anyhow::Result<Vec<PulsarSplit>> {
    let offset = self.start_offset.clone();
    // MessageId is only used when recovering from a State
    assert!(!matches!(offset, PulsarEnumeratorOffset::MessageId(_)));
    let topic_partitions = self
        .client
        .lookup_partitioned_topic_number(&self.topic.to_string())
        .await
        .map_err(|e| anyhow!(e))?;
    let splits = if topic_partitions > 0 {
        // partitioned topic
        (0..topic_partitions as i32)
            .map(|p| PulsarSplit {
                topic: self.topic.sub_topic(p).unwrap(),
                start_offset: offset.clone(),
            })
            .collect_vec()
    } else {
        // non partitioned topic
        vec![PulsarSplit {
            topic: self.topic.clone(),
            start_offset: offset.clone(),
        }]
    };
    Ok(splits)
}

@Rossil2012 (Contributor, author) replied on Sep 18, 2023:

Producer in pulsar-rs has a check_connection method. However, calling it on a valid producer client that can send messages successfully causes a connection error for no apparent reason. Therefore, I just build a producer client to validate the connection; the topic/URL/token are validated when building the producer.
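
For reference, a rough sketch of that validation approach (building a throwaway producer with pulsar-rs); the function name and parameters are illustrative, not the PR's actual code:

use anyhow::anyhow;
use pulsar::{Pulsar, TokioExecutor};

async fn validate_pulsar_connection(service_url: &str, topic: &str) -> anyhow::Result<()> {
    // Building the client validates the broker URL (and auth, if configured).
    let pulsar: Pulsar<TokioExecutor> = Pulsar::builder(service_url, TokioExecutor)
        .build()
        .await
        .map_err(|e| anyhow!(e))?;
    // Building a producer validates the topic; we only need to know that this
    // succeeds, so the producer is dropped right away.
    let _producer = pulsar
        .producer()
        .with_topic(topic)
        .build()
        .await
        .map_err(|e| anyhow!(e))?;
    Ok(())
}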

src/connector/src/sink/pulsar.rs (comment resolved)
@xzhseh (Contributor) left a comment:

Rest LGTM

break;
}
// error upon sending
Err(e) => {
Contributor:

The logic in KafkaSinkWriter here is to retry only when the queue is full or the message times out.
But according to the error types in pulsar, we have the following:

pub enum Error {
    Connection(ConnectionError),
    Consumer(ConsumerError),
    Producer(ProducerError),
    ServiceDiscovery(ServiceDiscoveryError),
    Authentication(AuthenticationError),
    Custom(String),
    Executor,
}

In which cases (errors) we should retry may need further consideration.
Personally, I'd prefer adding Producer & Consumer error checking.
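
A hedged sketch of how such a classification might look, based on the pulsar::Error variants listed above; the split between retryable and fatal variants is an assumption to be decided in review, not settled behavior:

// Sketch: classify pulsar errors for the sink's retry loop. Which variants
// are genuinely transient is still an open question in this thread.
fn is_retryable(err: &pulsar::Error) -> bool {
    use pulsar::Error;
    match err {
        // Connection / broker-side hiccups are the closest analog to
        // rdkafka's "queue full" and "message timed out" retry cases.
        Error::Connection(_) | Error::Producer(_) | Error::ServiceDiscovery(_) => true,
        // Authentication failures, consumer errors, custom errors, and
        // executor failures are unlikely to succeed on a simple retry.
        Error::Authentication(_) | Error::Consumer(_) | Error::Custom(_) | Error::Executor => false,
        // Guard against variants added in other pulsar-rs versions.
        _ => false,
    }
}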

@Rossil2012 (Contributor, author):

ACK

// a SendFuture holding the message receipt
// or error after sending is returned
Ok(send_future) => {
// Check if send_future_buffer is greater than the preset limit
Contributor:

The limit is set to 65536 for Kafka because rdkafka has the corresponding property (the number of messages that will be batched before sending).
Do we have the same property in the pulsar crate? (I have not found one.)
See the screenshot below or refer to the original PR for details.
cc @hzxa21.

[Screenshot attached in the original PR]

@Rossil2012 (Contributor, author) replied on Sep 18, 2023:

batch_size and batch_byte_size can be specified when building the producer.
See the doc and example.
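
For example, a sketch of setting those limits when building the producer, assuming a pulsar-rs version that exposes batch_size / batch_byte_size on ProducerOptions; the 10k / 1 MiB values simply mirror rdkafka's defaults and are not the PR's final choice:

use pulsar::{producer::{Producer, ProducerOptions}, Pulsar, TokioExecutor};

async fn build_batching_producer(
    pulsar: &Pulsar<TokioExecutor>,
    topic: &str,
) -> Result<Producer<TokioExecutor>, pulsar::Error> {
    pulsar
        .producer()
        .with_topic(topic)
        .with_options(ProducerOptions {
            // Maximum number of messages per client-side batch.
            batch_size: Some(10_000),
            // Maximum number of bytes per client-side batch (1 MiB).
            batch_byte_size: Some(1 << 20),
            ..Default::default()
        })
        .build()
        .await
}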

@Rossil2012 (Contributor, author) added on Sep 18, 2023:

I suggest creating batch.num and batch.size in the config fields, with the same default values (10k and 1 MB) as rdkafka. linger cannot be specified in pulsar-rs.

@Rossil2012 (Contributor, author) asked on Sep 18, 2023:

One question: since the maximum number and size of messages in a batch are handled by the kafka/pulsar client, why should we check the 65536 threshold of buffered DeliveryFutures/SendFutures?

Contributor:

> One question: since the maximum number and size of messages in a batch are handled by the kafka/pulsar client, why should we check the 65536 threshold of buffered DeliveryFutures/SendFutures?

The messages batched by rdkafka or pulsar are not essentially the same as the futures buffered in RW's connector.
In general, after a message is successfully handed to the client, we buffer a corresponding future. The purpose is to ensure the message has eventually been delivered downstream (i.e., to the Kafka broker) by rdkafka or pulsar; otherwise the status of the messages cannot be known.
If the messages are eventually delivered, there is nothing to do. If not, we roll back to the latest checkpoint as soon as any error is returned during the process.
Also, when a barrier is triggered by the sink coordinator, we group-commit all the currently buffered futures to ensure every message is delivered; if any is not, we also roll back to the latest checkpoint.
This is why we need an extra threshold for the buffered futures: it provides an efficient interval for checking the status of all sent-but-not-yet-delivered messages.
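
To illustrate the pattern (an illustrative sketch only; the 65536 threshold comes from the Kafka sink discussed above, and the type/field names are assumptions rather than the PR's actual code):

use anyhow::anyhow;
use futures::future::try_join_all;
use pulsar::producer::SendFuture;

const SEND_FUTURE_BUFFER_MAX_SIZE: usize = 65536;

#[derive(Default)]
struct SendFutureBuffer {
    futures: Vec<SendFuture>,
}

impl SendFutureBuffer {
    // Buffer the future returned by `producer.send(...)`.
    async fn push(&mut self, fut: SendFuture) -> anyhow::Result<()> {
        self.futures.push(fut);
        // Once the buffer grows past the threshold, await all pending
        // receipts so undelivered messages are detected early.
        if self.futures.len() >= SEND_FUTURE_BUFFER_MAX_SIZE {
            self.flush().await?;
        }
        Ok(())
    }

    // Await every buffered receipt; any error bubbles up and leads to a
    // rollback to the latest checkpoint at a higher level.
    async fn flush(&mut self) -> anyhow::Result<()> {
        try_join_all(std::mem::take(&mut self.futures))
            .await
            .map_err(|e| anyhow!(e))?;
        Ok(())
    }
}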

Ok(())
}

async fn barrier(&mut self, is_checkpoint: bool) -> Result<Self::CommitMetadata> {
Contributor:

In the pulsar crate we have a send_batch function that ensures all batched messages are sent.
Should we add a send_batch before commit_inner? In that case we would not need to await unnecessary futures, cc @tabVersion.

/// sends the current batch of messages
#[cfg_attr(feature = "telemetry", tracing::instrument(skip_all))]
pub async fn send_batch(&mut self) -> Result<(), Error> {
    match &mut self.inner {
        ProducerInner::Single(p) => p.send_batch().await,
        ProducerInner::Partitioned(p) => {
            try_join_all(p.producers.iter_mut().map(|p| p.send_batch()))
                .await
                .map(drop)
        }
    }
}

@Rossil2012 (Contributor, author) replied on Sep 18, 2023:

In my understanding, send_batch ensures that the messages in the current batch are sent, but it does not ensure the receipt of the message held by the SendFuture (returned by producer.send). If messages are sent in batches, we indeed need to call send_batch first during commit, but we still need to await all SendFutures to confirm the receipts.

Contributor:

This is to speed up the later awaits, since this send_batch already ensures receipt of the messages. So I think it's good to add one before the actual group await.
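
A sketch of that commit flow; the function and parameter names are illustrative, not the PR's actual commit_inner:

use anyhow::anyhow;
use futures::future::try_join_all;
use pulsar::{producer::{Producer, SendFuture}, TokioExecutor};

async fn commit_pending(
    producer: &mut Producer<TokioExecutor>,
    send_future_buffer: &mut Vec<SendFuture>,
) -> anyhow::Result<()> {
    // Flush whatever is still sitting in the client-side batch so every
    // buffered message is actually handed to the broker...
    producer.send_batch().await.map_err(|e| anyhow!(e))?;
    // ...then await all pending SendFutures to confirm the receipts.
    try_join_all(std::mem::take(send_future_buffer))
        .await
        .map_err(|e| anyhow!(e))?;
    Ok(())
}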

src/connector/src/sink/pulsar.rs (comment resolved)
@xzhseh (Contributor) left a comment:

LGTM, thanks for the effort!

@Rossil2012 added this pull request to the merge queue on Sep 19, 2023
The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Sep 19, 2023
@Rossil2012 added this pull request to the merge queue on Sep 19, 2023
The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Sep 19, 2023
@Rossil2012 added this pull request to the merge queue on Sep 19, 2023
Merged via the queue into risingwavelabs:main with commit aa5e798 on Sep 19, 2023
25 of 26 checks passed
@Rossil2012 deleted the pulsar-sink-dev branch on September 19, 2023 at 07:03
4 participants