
Add cache to optimize header match performance. #3879


Description

@chickenchickenlove
Contributor

Context

Currently, I'm working on issue #3067.
To implement it, I used PatternMatchUtils.
If #3874 is merged, it would be worth introducing a cache to optimize pattern matching.

A consumer will subscribe to only a few topics, and it’s very likely that all ConsumerRecords for those topics will carry the same headers.
Therefore, the following assumptions are reasonable:

  • The number of distinct header types won’t grow without bound (i.e., low cardinality).
  • Once a header has been resolved via a pattern match, the result can be reused thereafter without any issues.
  • The number of records that need to be processed per second may be in the thousands, tens of thousands, or even hundreds of thousands.

So, if we cache the result of each pattern match and reuse it on subsequent iterations,
the performance of multi-value header handling will improve drastically.
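The idea can be sketched as a memoized matcher. This is a hypothetical illustration, not spring-kafka code; a simple trailing-`*` matcher stands in for Spring's PatternMatchUtils:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache the pattern-match decision per header name so the
// patterns are evaluated only once per distinct header (low cardinality assumed).
class HeaderMatchCache {

    private final String[] patterns;

    // Header name -> match decision; note this map is unbounded, which is the
    // memory-leak concern raised in the discussion.
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

    HeaderMatchCache(String... patterns) {
        this.patterns = patterns;
    }

    boolean matches(String headerName) {
        return this.cache.computeIfAbsent(headerName, this::doMatch);
    }

    // Stand-in for the real matcher: exact match, or prefix match on a trailing '*'.
    private boolean doMatch(String headerName) {
        for (String pattern : this.patterns) {
            if (pattern.endsWith("*")
                    ? headerName.startsWith(pattern.substring(0, pattern.length() - 1))
                    : pattern.equals(headerName)) {
                return true;
            }
        }
        return false;
    }
}
```

After the first record carrying a given header, every subsequent record with the same header name resolves via a single hash lookup instead of iterating the patterns.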

As @artembilan mentioned before, an unbounded cache could cause a memory leak.
I think it would be better to introduce Caffeine for an LFU cache.

If this issue makes sense to you, I will take a crack at it. 🙇‍♂

Simple Performance Results

https://gist.github.com/chickenchickenlove/8c0935883bab446a1ff2efe728b6f99b

| Case | Without Cache (ms) | With Cache (ms) |
| --- | --- | --- |
| First Match – 10,000,000 | 91 | 0 |
| First Match – 50,000,000 | 479 | 0 |
| First Match – 100,000,000 | 480 | 0 |
| Last Match – 10,000,000 | 2500 | 0 |
| Last Match – 50,000,000 | 627 | 0 |
| Last Match – 100,000,000 | 1112 | 0 |
| Last Match and more patterns – 10,000,000 | 202 | 0 |
| Last Match and more patterns – 50,000,000 | 1023 | 0 |
| Last Match and more patterns – 100,000,000 | 2148 | 0 |

Activity

artembilan

artembilan commented on May 6, 2025

@artembilan
Member

I think it would be better to introduce Caffeine for an LFU cache.

Well, no. We develop a library here for target projects, so any extra dependency has to be really well justified.
It doesn't look like the whole of Caffeine should be pulled in just for this simple optimization.
Consider using LinkedHashMap and its removeEldestEntry() implementation.

sobychacko

sobychacko commented on May 6, 2025

@sobychacko
Contributor

While we can't introduce dependencies like Caffeine in Spring Kafka proper, I am thinking aloud here to see if we can provide a hook from the framework so that, if users want to bring a third-party caching solution, that is possible (of course, without any direct dependency on it).

artembilan

artembilan commented on May 6, 2025

@artembilan
Member

if the users want to bring a 3rd party caching solution

I don't see how that is useful just for header-name caching.
Everything is done in memory, so there is no need for the overhead of a fully-blown cache implementation.

sobychacko

sobychacko commented on May 6, 2025

@sobychacko
Contributor

I don't remember the details of Caffeine, but it is also an in-memory cache with advanced algorithms for caching (some applications prefer it). Spring Pulsar has some use cases for it: https://github.com/spring-projects/spring-pulsar/tree/bd9aea58589a290f0fcb434383d26b7914db6196/spring-pulsar-cache-provider-caffeine.

I am not saying that we have to do it, but if some applications prefer to use it, we may want to think about having a way to support those use cases without explicitly bringing that as a dependency in the framework.

chickenchickenlove

chickenchickenlove commented on May 9, 2025

@chickenchickenlove
Contributor (Author)

Thanks for your comments, everyone!
Let me suggest something.

A header-mapping cache isn't efficient for certain use cases.

For example:

  1. When there are too many distinct headers
  2. When the cache is smaller than the header set, causing frequent evictions

In these scenarios, caching adds overhead rather than helping.
To let users tune the behaviour, I suggest we provide:

  1. Opt-in flag: cache enable/disable
  2. Cache interface: users can supply their own implementation and inject it into KafkaHeaderMapper.
  3. Default implementation of the header cache interface: a simple LinkedHashMap using removeEldestEntry().
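A minimal sketch of points 2 and 3, with illustrative names that are not actual spring-kafka API: a cache contract users could implement themselves, plus a default bounded LRU built on an access-ordered LinkedHashMap with removeEldestEntry():

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical contract: resolve a header name to its match decision,
// computing (and caching) it on first sight.
interface HeaderMatchResultCache {

    Boolean computeIfAbsent(String headerName, Function<String, Boolean> matcher);
}

// Default implementation: a synchronized, access-ordered LinkedHashMap whose
// removeEldestEntry() evicts the least-recently-used entry past maxSize.
class DefaultHeaderMatchResultCache implements HeaderMatchResultCache {

    private final Map<String, Boolean> cache;

    DefaultHeaderMatchResultCache(int maxSize) {
        this.cache = Collections.synchronizedMap(
                new LinkedHashMap<String, Boolean>(16, 0.75f, true) {

                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > maxSize;
                    }

                });
    }

    @Override
    public Boolean computeIfAbsent(String headerName, Function<String, Boolean> matcher) {
        return this.cache.computeIfAbsent(headerName, matcher);
    }
}
```

A KafkaHeaderMapper could then accept any HeaderMatchResultCache implementation, keeping the framework free of third-party cache dependencies.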

As @artembilan mentioned earlier, the header-pattern matching logic will be invoked for every ConsumerRecord.
Given the properties of Kafka, there will be a large number of ConsumerRecords, so I believe the performance of the header mapper is critical.

To optimize it for each use case, I think it would be better to provide a pluggable cache interface.

I would love to hear your thoughts!
Thanks for reading this. 🙇‍♂

artembilan

artembilan commented on May 9, 2025

@artembilan
Member

Users can supply their own implementation and inject it into KafkaHeaderMapper.

If I were developing a target application, I'm not sure I would bother with an extra cache abstraction to inject into a framework component for some tiny CPU optimization.
I would say that if there is a problem in a framework component, it has to be fixed at the framework level; otherwise I would go with my own KafkaHeaderMapper implementation.

We use header pattern matching in many other projects, and no one has ever raised a question about caching what is just the result of a pattern match.

Just my opinion, because there is too much else to do rather than trying to come up with some externalized pattern-matching cache.

chickenchickenlove

chickenchickenlove commented on May 13, 2025

@chickenchickenlove
Contributor (Author)

I agree 👍
Your points about avoiding heavy dependencies make sense to me! 🙇‍♂

What do you think about an opt-in, lightweight LRU cache based on LinkedHashMap (using removeEldestEntry(...))?

  • It keeps spring-kafka self-contained, with no external libraries.
  • Applications that don't need it can leave it disabled (the default).
  • For high-throughput cases, even a small cache (~1K entries) shaved noticeable CPU in my micro-benchmark.

Given that spring-kafka supports enterprise environments, IMHO it might be helpful to provide an LRU cache in the framework component to optimize pattern matching.

In my earlier experience, there were cases where an input topic's rate was approximately 2M records/sec.
In such cases, I expect an LRU cache for header pattern matching would make a noticeable difference in CPU usage.

What do you think?

sobychacko

sobychacko commented on May 13, 2025

@sobychacko
Contributor

Yeah, that was my original thought when you raised it in the first place. It would be nice to support things like that as long as there is minimal impact on Spring Kafka. The problem arises when someone tries to opt in for some enterprise caching, runs into issues, and then thinks it is Spring Kafka-related. It could potentially become a maintenance burden on us in the long term. Although it could be a minor change, we need to consider the pros/cons more prudently. I like the overall idea, but is it worth it? You made the case for CPU usage benefits in heavy data-intensive apps, but let's continue to explore any downsides since @artembilan has other thoughts about this.

artembilan

artembilan commented on May 13, 2025

@artembilan
Member

Why opt-in?
A LinkedHashMap with removeEldestEntry(...) should be enough for this internal use case, always.
There is no external interaction, so no need for any enterprise cache.
The CPU burden is noted, so LinkedHashMap should do the trick.

For example, in Spring Integration, we have this feature: https://docs.spring.io/spring-integration/reference/router/dynamic-routers.html.
This one relies internally on the:

```java
private static final int DEFAULT_DYNAMIC_CHANNEL_LIMIT = 100;

private int dynamicChannelLimit = DEFAULT_DYNAMIC_CHANNEL_LIMIT;

@SuppressWarnings("serial")
private final Map<String, MessageChannel> dynamicChannels =
		Collections.synchronizedMap(
				new LinkedHashMap<>(DEFAULT_DYNAMIC_CHANNEL_LIMIT, 0.75f, true) {

					@Override
					protected boolean removeEldestEntry(Entry<String, MessageChannel> eldest) {
						return size() > AbstractMappingMessageRouter.this.dynamicChannelLimit;
					}

				});
```

And that has been enough for years. No one has complained that there is no enterprise cache support,
just because we are only optimizing some CPU ticks for the same bean resolution.
We have a similar solution for caching local Lock instances in the JdbcLockRegistry:

Starting with version 5.5.6, the JdbcLockRegistry supports automatically cleaning up the cache for JdbcLock instances in JdbcLockRegistry.locks via JdbcLockRegistry.setCacheCapacity(). See its JavaDocs for more information.

We also have a cache for SimpleJdbcCall instances in the StoredProcExecutor.

I don't mind LinkedHashMap for this header-names use case of ours, but I don't see a reason for an external cache abstraction here.

chickenchickenlove

chickenchickenlove commented on May 13, 2025

@chickenchickenlove
Contributor (Author)

Apologies for the lack of clarity in my previous explanation. 🙇‍♂️

What I meant by “opt-in” was simply the ability to turn the header cache functionality on or off.

Also, I wasn’t suggesting introducing any enterprise-level caching solution.
Rather, my point was that in enterprise environments where a large number of messages are consumed, even a simple LinkedHashMap-based cache can offer meaningful CPU optimizations.

To summarize:

  • “Opt-in” refers to enabling or disabling the header cache feature.
  • The header cache itself would be implemented with a simple LinkedHashMap, without introducing any external dependencies.

What do you think?
Please let me know your opinion!

artembilan

artembilan commented on May 13, 2025

@artembilan
Member

My suggestion: always cache via a LinkedHashMap with some meaningful size, like 1000.
No need for opt-in.
Thanks

chickenchickenlove

chickenchickenlove commented on May 14, 2025

@chickenchickenlove
Contributor (Author)

Thanks for your comments!
That makes sense!
Would it be okay if I create a PR for the header pattern cache you suggested?

artembilan

artembilan commented on May 14, 2025

@artembilan
Member

Would it be okay if I create a PR for the header pattern cache you suggested?

Happy to review and merge!
Thanks

7 remaining items

        Add cache to optimize header match performance. · Issue #3879 · spring-projects/spring-kafka