
Add batch processing to redis source to improve performance #12363

Open
roland-troeger opened this issue Apr 22, 2022 · 1 comment
Labels
domain: performance Anything related to Vector's performance source: redis Anything `redis` source related type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@roland-troeger

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Currently I am trying to deploy Vector as a "man in the middle" between our devices that send syslog data and our central syslog server. The goal is to protect the syslog server from large spikes in log volume while still accepting logs from the sources as fast as possible.
That's why we chose Redis as a buffer for incoming logs. After the logs are buffered, the output to the syslog server is rate-limited using the throttle transform.

This is basically our pipeline for log processing in Vector:
[ syslog sending device ] -> source:syslog -> sink:redis -> [ redis ] -> source:redis -> transform:throttle -> sink:syslog -> [ central syslog server ]
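For context, a minimal sketch of that pipeline as a Vector TOML config might look like the following. Option names are assumptions based on Vector's documented components (addresses, thresholds, and component names are placeholders, and the final hop is shown as a generic socket sink since that is one way to reach a syslog server):

```toml
[sources.syslog_in]
type = "syslog"
address = "0.0.0.0:514"
mode = "tcp"

# Buffer incoming logs in a Redis list.
[sinks.redis_buffer]
type = "redis"
inputs = ["syslog_in"]
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
encoding.codec = "json"

# Read them back out of the same list.
[sources.redis_out]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"

# Rate-limit the flow toward the central server.
[transforms.rate_limit]
type = "throttle"
inputs = ["redis_out"]
threshold = 1000
window_secs = 1

[sinks.syslog_out]
type = "socket"
inputs = ["rate_limit"]
mode = "tcp"
address = "syslog.example.com:514"
encoding.codec = "text"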

This setup has been possible since 0.21.0, which introduced the redis source we need. However, the performance of the redis source isn't as good as we need: we achieved about 12k logs per second through the redis source on our test machine, which could easily handle 10x that volume through the redis sink (using its batch configuration options). Our goal is to reach comparable performance on source and sink.

Attempted Solutions

I was able to improve performance by defining multiple identical redis sources. This scales close to linearly, but it takes ~10 sources to reach the performance we need, and the configuration gets pretty messy with this workaround.

```toml
[sources.redis_source1]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"

[sources.redis_source2]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"

[sources.redis_source3]
[...]

[sinks.syslog]
inputs = ["redis_source1", "redis_source2", "redis_source3"]
[...]
```

Proposal

The problem appears to be the way the redis source takes events out of the database: one event at a time, using BLPOP/BRPOP.

Since Redis 6.2.0, the LPOP command can pop multiple elements via its optional count argument. This can be used to improve performance by reducing the number of database round trips.
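To illustrate the semantics of `LPOP key count`, here is a small Python sketch against an in-memory stand-in for the Redis list (no real Redis involved; the `lpop` helper here is hypothetical and only mirrors the command's documented behavior):

```python
# In-memory stand-in for Redis' LPOP key count (Redis >= 6.2):
# pops up to `count` elements from the head of the list in a single call,
# versus one element per round trip with plain LPOP/BLPOP.
def lpop(store, key, count=1):
    items = store.get(key, [])
    popped, store[key] = items[:count], items[count:]
    return popped

store = {"my_key": [f"log-{i}" for i in range(10)]}

batch = lpop(store, "my_key", 4)   # one "request" fetches four events
print(batch)                       # ['log-0', 'log-1', 'log-2', 'log-3']
print(len(store["my_key"]))        # 6 events remain buffered
```

Popping N events per round trip cuts the number of Redis requests by roughly a factor of N, which is presumably where the redis sink's batch options get their throughput advantage.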

I already implemented a proof of concept using this method, but I guess it still needs a lot of work.

Since parallelization worked when configuring multiple sources, I think parallelizing the workload inside the redis source would also be a viable solution. I just don't know where to start to implement a PoC for this.
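To illustrate why in-source parallelism helps, here is a Python sketch (an assumption about how such a PoC could be structured, not Vector's actual implementation) with several worker threads draining one shared list, mimicking multiple BLPOP consumers on the same key:

```python
import queue
import threading

# Shared "Redis list" stand-in; queue.Queue hands each item to exactly
# one consumer, just as BLPOP does for competing Redis clients.
source = queue.Queue()
for i in range(1000):
    source.put(f"log-{i}")

consumed = []
lock = threading.Lock()

def worker():
    # Each worker pops events until the list is drained,
    # like one BLPOP loop inside the source.
    while True:
        try:
            event = source.get_nowait()
        except queue.Empty:
            return
        with lock:
            consumed.append(event)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(consumed))  # 1000: every event consumed exactly once
```

As with the multiple-sources workaround, this trades event ordering for throughput, so it would likely need to be opt-in.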

References

No response

Version

vector 0.21.0 (x86_64-unknown-linux-gnu c1edb89 2022-04-14)

@roland-troeger roland-troeger added the type: feature A value-adding code addition that introduce new functionality. label Apr 22, 2022
@jszwedko jszwedko added the source: redis Anything `redis` source related label Apr 22, 2022
@bruceg bruceg added type: enhancement A value-adding code change that enhances its existing functionality. domain: performance Anything related to Vector's performance and removed type: feature A value-adding code addition that introduce new functionality. labels Apr 25, 2022
@jszwedko
Member

Thanks for this feature request @roland-troeger !

I agree, it does seem like we could have concurrency here for users that don't need event ordering. The aws_sqs source has a concurrency option that you could model this after if you are interested in contributing:

```rust
// number of concurrent tasks spawned for receiving/processing SQS messages
#[serde(default = "default_client_concurrency")]
#[derivative(Default(value = "default_client_concurrency()"))]
pub client_concurrency: u32,
```
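Applied to the redis source, such an option might look like the following in user config. The option names here are hypothetical, invented for illustration and modeled on aws_sqs's client_concurrency; they are not existing Vector settings:

```toml
[sources.redis_source]
type = "redis"
url = "redis://localhost:6379/0"
data_type = "list"
key = "my_key"
decoding.codec = "json"
# hypothetical knobs:
client_concurrency = 4   # number of parallel pop loops
# batch_size = 100       # events per LPOP when Redis >= 6.2
```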
