
Add batch processing to redis source to improve performance #12363

Open
roland-troeger opened this issue Apr 22, 2022 · 1 comment
Labels
domain: performance Anything related to Vector's performance source: redis Anything `redis` source related type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@roland-troeger

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Currently I am trying to deploy Vector as a "man in the middle" between our devices that send syslog data and our central syslog server. The goal is to protect the syslog server from large spikes in log volume while still accepting logs from the sources as fast as possible.
That's why we chose Redis as a buffer for incoming logs. After the logs are buffered, the output to the syslog server is rate-limited using the throttle transform.

This is basically our pipeline for log processing in Vector:
[ syslog sending device ] -> source:syslog -> sink:redis -> [ redis ] -> source:redis -> transform:throttle -> sink:syslog -> [ central syslog server ]
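For context, a minimal sketch of that pipeline as a Vector TOML config might look like the following. Option names are assumptions based on Vector's documented components (addresses, thresholds, and component names are placeholders, and the final hop is shown as a generic socket sink since that is one way to reach a syslog server):

```toml
[sources.syslog_in]
type = "syslog"
address = "0.0.0.0:514"
mode = "tcp"

# Buffer incoming logs in a Redis list.
[sinks.redis_buffer]
type = "redis"
inputs = ["syslog_in"]
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
encoding.codec = "json"

# Read them back out of the same list.
[sources.redis_out]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"

# Rate-limit the flow toward the central server.
[transforms.rate_limit]
type = "throttle"
inputs = ["redis_out"]
threshold = 1000
window_secs = 1

[sinks.syslog_out]
type = "socket"
inputs = ["rate_limit"]
mode = "tcp"
address = "syslog.example.com:514"
encoding.codec = "text"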

This setup has been possible since 0.21.0, which introduced the redis source we need. However, the performance of the redis source isn't as good as we need: we achieved about 12k logs per second through the redis source on our test machine, which could easily handle 10x that volume through the redis sink (using its batch configuration options). Our goal is to reach comparable performance on source and sink.

Attempted Solutions

I was able to improve performance by defining multiple identical redis sources. This scales close to linearly, but it takes ~10 sources to reach the performance we need, and the configuration gets pretty messy with this workaround.

```toml
[sources.redis_source1]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"

[sources.redis_source2]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"

[sources.redis_source3]
[...]

[sinks.syslog]
inputs = ["redis_source1", "redis_source2", "redis_source3"]
[...]
```

Proposal

The problem appears to be the way the redis source takes events out of the database: one event at a time, using BLPOP/BRPOP.

Since Redis 6.2.0, the LPOP command can pop multiple elements via its optional count argument. This can be used to improve performance by reducing the number of database round trips.
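To illustrate the semantics of `LPOP key count`, here is a small Python sketch against an in-memory stand-in for the Redis list (no real Redis involved; the `lpop` helper here is hypothetical and only mirrors the command's documented behavior):

```python
# In-memory stand-in for Redis' LPOP key count (Redis >= 6.2):
# pops up to `count` elements from the head of the list in a single call,
# versus one element per round trip with plain LPOP/BLPOP.
def lpop(store, key, count=1):
    items = store.get(key, [])
    popped, store[key] = items[:count], items[count:]
    return popped

store = {"my_key": [f"log-{i}" for i in range(10)]}

batch = lpop(store, "my_key", 4)   # one "request" fetches four events
print(batch)                       # ['log-0', 'log-1', 'log-2', 'log-3']
print(len(store["my_key"]))        # 6 events remain buffered
```

Popping N events per round trip cuts the number of Redis requests by roughly a factor of N, which is presumably where the redis sink's batch options get their throughput advantage.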

I already implemented a proof of concept using this method, but I guess it still needs a lot of work.

Since parallelization worked when configuring multiple sources, I think parallelizing the workload inside the redis source would also be a viable solution. I just don't know where to start to implement a PoC for this.
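To illustrate why in-source parallelism helps, here is a Python sketch (an assumption about how such a PoC could be structured, not Vector's actual implementation) with several worker threads draining one shared list, mimicking multiple BLPOP consumers on the same key:

```python
import queue
import threading

# Shared "Redis list" stand-in; queue.Queue hands each item to exactly
# one consumer, just as BLPOP does for competing Redis clients.
source = queue.Queue()
for i in range(1000):
    source.put(f"log-{i}")

consumed = []
lock = threading.Lock()

def worker():
    # Each worker pops events until the list is drained,
    # like one BLPOP loop inside the source.
    while True:
        try:
            event = source.get_nowait()
        except queue.Empty:
            return
        with lock:
            consumed.append(event)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(consumed))  # 1000: every event consumed exactly once
```

As with the multiple-sources workaround, this trades event ordering for throughput, so it would likely need to be opt-in.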

References

No response

Version

vector 0.21.0 (x86_64-unknown-linux-gnu c1edb89 2022-04-14)

@roland-troeger roland-troeger added the type: feature A value-adding code addition that introduce new functionality. label Apr 22, 2022
@jszwedko jszwedko added the source: redis Anything `redis` source related label Apr 22, 2022
@bruceg bruceg added type: enhancement A value-adding code change that enhances its existing functionality. domain: performance Anything related to Vector's performance and removed type: feature A value-adding code addition that introduce new functionality. labels Apr 25, 2022
@jszwedko
Member

Thanks for this feature request @roland-troeger !

I agree, it does seem like we could have concurrency here for users that don't need event ordering. The aws_sqs source has a concurrency option that you could model this after if you are interested in contributing:

```rust
// number of concurrent tasks spawned for receiving/processing SQS messages
#[serde(default = "default_client_concurrency")]
#[derivative(Default(value = "default_client_concurrency()"))]
pub client_concurrency: u32,
```
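Applied to the redis source, such an option might look like the following in user config. The option names here are hypothetical, invented for illustration and modeled on aws_sqs's client_concurrency; they are not existing Vector settings:

```toml
[sources.redis_source]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"
# hypothetical knobs:
client_concurrency = 4   # number of parallel pop loops
# batch_size = 100       # events per LPOP when Redis >= 6.2
```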
