why pulsar sink so much slower than sink kafka? #14886

Open
littlejoyo opened this issue Oct 19, 2022 · 5 comments

littlejoyo commented Oct 19, 2022

Hi, my Vector configuration is as follows:

[sources.syslog]
type = "syslog"
address = "0.0.0.0:524"
mode = "tcp"
path = "/path/to/socket"

# Parse Syslog logs
[transforms.parse_syslogs]
type = "remap"
inputs = ["syslog"]
source = '''
. |= parse_syslog!(.message)
.timestamp = to_timestamp!(to_unix_timestamp!(.timestamp)+3600*8)
'''
# Send parsed logs to Pulsar
[sinks.printsys]
type = "pulsar"
inputs = ["parse_syslogs"]
endpoint = "pulsar://10.65.172.58:6650"
topic = "public/default/vector"
encoding.codec = "json"
buffer.type = "memory"
buffer.when_full = "block"
buffer.max_events = 5000000

The pipeline is:
A (syslog) -> B (Vector) -> C (Pulsar)

According to vector top, the throughput is:
A to B: 60k/s (source/transform)
B to C: 2k/s (pulsar sink)

The Vector server is a 4-core, 8 GB CentOS 8 machine.
If I change to the kafka sink, I get:
A to B: 60k/s (source/transform)
B to C: 60k/s (kafka sink)

The Kafka configuration is as follows:

[sources.syslog]
type = "syslog"
address = "0.0.0.0:524"
mode = "tcp"
path = "/path/to/socket"

# Parse Syslog logs
[transforms.parse_syslogs]
type = "remap"
inputs = ["syslog"]
source = '''
. |= parse_syslog!(.message)
.timestamp = to_timestamp!(to_unix_timestamp!(.timestamp)+3600*8)
'''
# Send parsed logs to Kafka
[sinks.printsys]
type = "kafka"
inputs = ["parse_syslogs"]
bootstrap_servers = "10.65.172.58:9092/kafka"
topic = "mykafka"
key_field = "user_id"
compression = "none"
encoding.codec = "json"
buffer.type = "memory"
buffer.when_full = "block"
buffer.max_events = 5000000

P.S. The buffer size of 5000000 is larger than the number of syslog events the client sends, so the buffer never blocks.

I would like to know whether there is a configuration problem with my pulsar sink.
I don't think 2k/s is normal for the pulsar sink. Also, why doesn't the pulsar sink have batch configuration like the kafka sink does?
Thank you very much! I hope to receive your help soon.

littlejoyo commented Oct 20, 2022

My Vector version is 0.24.1.

@littlejoyo
Author

Can someone help me determine whether this is a problem with Vector or with my Pulsar configuration?

@jszwedko
Member

Hi @littlejoyo!

It is somewhat expected that different Vector sinks will have different performance characteristics, as their implementations and the capabilities of the downstream systems can differ. For example, the pulsar sink only sends one event at a time, whereas the kafka sink takes advantage of the internal batching provided by librdkafka. I believe rdkafka also uses threads internally. I imagine that Pulsar and Kafka also differ in their semantics, though I'm not terribly familiar with either.

If your request is that we improve the performance of the pulsar sink we can use this issue to track that.
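
For reference, the batching mentioned above happens inside librdkafka rather than in Vector's own sink options. A minimal sketch of how it could be tuned through the kafka sink's librdkafka_options table is shown below. The sink settings are copied from the configuration earlier in this thread; the two property values are illustrative assumptions, not recommendations or measured settings.

```toml
# Sketch only: the kafka sink from this thread, plus pass-through
# librdkafka properties that influence batching.
[sinks.printsys]
type = "kafka"
inputs = ["parse_syslogs"]
bootstrap_servers = "10.65.172.58:9092/kafka"
topic = "mykafka"
encoding.codec = "json"

# These keys are handed to librdkafka as-is. Values are strings.
# The numbers below are assumed for illustration only.
[sinks.printsys.librdkafka_options]
"queue.buffering.max.ms" = "5"   # how long to wait for a batch to fill
"batch.num.messages" = "10000"   # upper bound on messages per batch
```

The pulsar sink in 0.24.1 has no equivalent table, which matches the one-event-at-a-time behaviour described above.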

@littlejoyo
Author

Hi @jszwedko! Thank you very much for your reply!
Yes, I would really like the pulsar sink to perform similarly to the kafka sink.
In addition, I would like to know whether there are other ways to replace the built-in pulsar sink. For example, can I use the http sink together with the Pulsar client SDK to push data directly to the Pulsar server?
In fact, I have already used the http sink to store data in Pulsar, but I want to know whether this approach has any problems, such as reliability.

@jszwedko
Member

No problem! I'm not super familiar with Pulsar but if it has an HTTP intake then it could be possible to use the http sink. Another workaround would be to use the http sink to forward to a custom program that you write, using the pulsar client SDK, that receives the data and forwards it to Pulsar.
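
On the Vector side, the second workaround could look roughly like the sketch below. The sink name to_forwarder, the URI, and the batch values are assumptions made for illustration. The forwarding program itself (the part that uses the Pulsar client SDK) is not shown.

```toml
# Sketch only: send parsed events to a custom forwarding program over HTTP.
[sinks.to_forwarder]
type = "http"
inputs = ["parse_syslogs"]
# Hypothetical address of the forwarding program.
uri = "http://127.0.0.1:8080/ingest"
encoding.codec = "json"
# Group events into batches instead of one request per event.
# These limits are assumed values, not tuned ones.
batch.max_events = 1000
batch.timeout_secs = 1
buffer.type = "memory"
buffer.when_full = "block"
```

On the reliability question: as far as I know, the http sink retries requests that fail with a server error or a network error. If so, the forwarder should probably return a 2xx response only after Pulsar has acknowledged the messages, so that a failed hand-off is retried by Vector instead of being dropped silently.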
