
KAFKA-19390: Call AbstractIndex.safeForceUnmap() in AbstractIndex.resize() on Linux to prevent stale index file memory mappings #19961


Open · wants to merge 8 commits into base: trunk

Conversation

Forest0923

https://issues.apache.org/jira/browse/KAFKA-19390

The AbstractIndex.resize() method does not release the old memory mapping for the offset index and time index files.
If a Mixed GC does not run for a long time, stale mappings accumulate and the broker can crash once the vm.max_map_count limit is reached.

The root cause is that safeForceUnmap() is not called from resize() on Linux, so the code has been changed to unmap the old mmap on all operating systems.

The same problem was reported in KAFKA-7442, but the PR submitted at that time did not acquire all the necessary locks around the mmap accesses and was closed without fixing the issue.
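
For illustration, a minimal standalone sketch of the pattern the fix applies: explicitly unmap the old MappedByteBuffer before remapping, instead of waiting for GC to reclaim it. The class and method here are invented for the example and call sun.misc.Unsafe.invokeCleaner directly; the actual change goes through AbstractIndex.safeForceUnmap() and the index's existing locking.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ResizeSketch {
    private static final sun.misc.Unsafe UNSAFE;
    static {
        try {
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (sun.misc.Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private MappedByteBuffer mmap;

    public void resize(RandomAccessFile raf, int newSize) throws IOException {
        int position = (mmap == null) ? 0 : mmap.position();
        if (mmap != null) {
            // Without this explicit unmap the old mapping lingers until the buffer is
            // garbage collected, and each leaked mapping counts against vm.max_map_count.
            UNSAFE.invokeCleaner(mmap);
        }
        raf.setLength(newSize);
        mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, newSize);
        mmap.position(Math.min(position, newSize));
    }
}
```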

@github-actions bot added the triage (PRs from the community), core (Kafka Broker), and storage (Pull requests that target the storage module) labels on Jun 13, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

Contributor

@junrao left a comment

@Forest0923 : Thanks for the PR. Great fix! Left a couple of comments.

@@ -67,19 +70,23 @@ public TimeIndex(File file, long baseOffset, int maxIndexSize, boolean writable)

this.lastEntry = lastEntryFromIndexFile();

log.debug("Loaded index file {} with maxEntries = {}, maxIndexSize = {}, entries = {}, lastOffset = {}, file position = {}",
file.getAbsolutePath(), maxEntries(), maxIndexSize, entries(), lastEntry.offset, mmap().position());
inLock(lock, () -> {
Contributor

Is lock needed here in the constructor? Ditto for TimeIndex.

Author

I don’t think it’s necessary in the constructor, so I’ll remove it.

@@ -95,7 +99,7 @@ public void sanityCheck() {
* the pair (baseOffset, 0) is returned.
*/
public OffsetPosition lookup(long targetOffset) {
return maybeLock(lock, () -> {
return inLock(lock, () -> {
Contributor
Currently, on Linux, we don't acquire the lock for lookup(), which is used for fetch requests. Both fetch and produce requests are common, and acquiring the lock in the fetch path reduces read/write concurrency. The reader only truly needs the lock when the underlying mmap is changed by resize(). Since resize() is an infrequent event, we could introduce a separate resize lock, held by resize() and by all readers (the callers that currently use maybeLock()). This would help maintain the current level of concurrency.

Author

I’d like to ask the following questions:

  1. The JavaDoc for java.nio.Buffer says it is not thread-safe. In the case of reads, is it still safe for multiple threads to access the same buffer concurrently?
    (See: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/Buffer.html)

  2. To improve read-operation concurrency, are you suggesting the use of ReentrantReadWriteLock instead of ReentrantLock?

Contributor

@Forest0923 : Good question. It's true that the Javadoc does not provide a happens-before guarantee. However, according to this, MappedByteBuffer is a thin wrapper that provides direct access to the OS buffer cache. So, once you have written to it, every thread, and every process, will see the update. In addition to the index file, we also read from the buffer cache for the log file without synchronization. The implementation may change in the future, but that is probably unlikely.

So for now, it's probably better to use a separate resize lock to maintain the current read/write concurrency?
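
For illustration, a hedged sketch of why concurrent readers can share one mapping despite the Buffer Javadoc's thread-safety caveat: the unsafe part is the mutable position/limit cursor, which each reader can avoid by working on its own duplicate() view. Kafka's index lookups follow this pattern; the class and method names below are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

public final class IndexReadSketch {
    // Each reader takes a duplicate() view: the underlying bytes are shared (and, as
    // noted above, visible through the OS page cache), but position and limit are
    // per-view, so concurrent lookups never trample each other's cursor.
    public static int readEntryValue(MappedByteBuffer mmap, int entrySize, int slot) {
        ByteBuffer view = mmap.duplicate(); // independent position/limit, same bytes
        view.position(slot * entrySize);    // moving the cursor affects only this view
        return view.getInt();               // relative read, safe to run concurrently
    }
}
```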

Author

Forest0923 commented Jun 25, 2025

Thank you for the advice!
I have added resizeLock; could you please check?

Please let me know if I have misunderstood anything.

throw new CorruptIndexException("Corrupt time index found, time index file (" + file().getAbsolutePath() + ") has "
+ "non-zero size but the last timestamp is " + lastTimestamp + " which is less than the first timestamp "
+ timestamp(mmap(), 0));
inLock(lock, () -> {
Contributor

Good catch here!

@github-actions bot removed the triage (PRs from the community) label on Jun 25, 2025
Contributor

@junrao left a comment

@Forest0923 : Thanks for the updated PR. A couple more comments. Also, could you run a perf test (producer perf + consumer perf together) to make sure there is no degradation?

@@ -48,6 +51,7 @@ private enum SearchResultType {
private static final Logger log = LoggerFactory.getLogger(AbstractIndex.class);

protected final ReentrantLock lock = new ReentrantLock();
protected final ReentrantReadWriteLock resizeLock = new ReentrantReadWriteLock();
Contributor

Now that we have two locks, could we add comments on how each one is being used? Also, remapLock is probably a more accurate name?

protected final <T, E extends Exception> T maybeLock(Lock lock, StorageAction<T, E> action) throws E {
if (OperatingSystem.IS_WINDOWS || OperatingSystem.IS_ZOS)
lock.lock();
protected final <T, E extends Exception> T inResizeLock(Lock lock, StorageAction<T, E> action) throws E {
Contributor

Perhaps it's better to make this inResizeReadLock and avoid passing the lock in as a parameter. For consistency, we could also introduce a method like inLock for the other lock. That way, both locks could be made private.
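
For illustration, a hedged sketch of what that helper pair could look like, assuming the field names used in this PR (lock, resizeLock) and a StorageAction shape like Kafka's; this is a sketch of the suggestion, not the merged code.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-in for Kafka's StorageAction functional interface (assumed shape: T execute() throws E).
interface StorageAction<T, E extends Exception> {
    T execute() throws E;
}

abstract class AbstractIndexSketch {
    // Serializes mutations of the index contents (append, truncate, resize, flush).
    private final ReentrantLock lock = new ReentrantLock();
    // Protects the mmap reference itself: resize() remaps under the write lock, while
    // readers hold the read lock so the mapping cannot be swapped out underneath them.
    // ("remapLock" was suggested above as a more accurate name.)
    private final ReentrantReadWriteLock resizeLock = new ReentrantReadWriteLock();

    protected final <T, E extends Exception> T inLock(StorageAction<T, E> action) throws E {
        lock.lock();
        try {
            return action.execute();
        } finally {
            lock.unlock();
        }
    }

    protected final <T, E extends Exception> T inResizeReadLock(StorageAction<T, E> action) throws E {
        resizeLock.readLock().lock();
        try {
            return action.execute();
        } finally {
            resizeLock.readLock().unlock();
        }
    }

    protected final <T, E extends Exception> T inResizeWriteLock(StorageAction<T, E> action) throws E {
        resizeLock.writeLock().lock();
        try {
            return action.execute();
        } finally {
            resizeLock.writeLock().unlock();
        }
    }
}
```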

Author

Forest0923 commented Jun 27, 2025

Here are the results of the perf test:

  • OS: Rocky Linux release 9.5 (Linux 5.14.0)
  • OpenJDK version: 17.0.15
| Metric | PR (b035e4c) | Trunk (7cd99ea) |
| --- | --- | --- |
| Producer throughput (records/sec) | 142,388.5 | 143,548.6 |
| Producer throughput (MB/sec) | 139.05 | 140.18 |
| Producer average latency (ms) | 171.25 | 175.20 |
| Producer max latency (ms) | 1,340.00 | 1,505.00 |
| Producer 50th percentile latency (ms) | 185 | 192 |
| Producer 95th percentile latency (ms) | 233 | 230 |
| Producer 99th percentile latency (ms) | 427 | 380 |
| Producer 99.9th percentile latency (ms) | 1,214 | 1,190 |
| Consumer throughput (messages/sec) | 141,071.6 | 140,420.2 |
| Consumer throughput (MB/sec) | 137.77 | 137.13 |
PR (b035e4c)
$ bin/kafka-producer-perf-test.sh --topic sandbox --num-records 25000000 --record-size 1024 --throughput -1 --producer-props bootstrap.servers=localhost:9092
123886 records sent, 24777.2 records/sec (24.20 MB/sec), 1026.2 ms avg latency, 1340.0 ms max latency.
463395 records sent, 92679.0 records/sec (90.51 MB/sec), 342.5 ms avg latency, 784.0 ms max latency.
670455 records sent, 134091.0 records/sec (130.95 MB/sec), 223.0 ms avg latency, 304.0 ms max latency.
735030 records sent, 147006.0 records/sec (143.56 MB/sec), 197.5 ms avg latency, 251.0 ms max latency.
735559 records sent, 147111.8 records/sec (143.66 MB/sec), 170.4 ms avg latency, 237.0 ms max latency.
725546 records sent, 145109.2 records/sec (141.71 MB/sec), 200.9 ms avg latency, 290.0 ms max latency.
755025 records sent, 151005.0 records/sec (147.47 MB/sec), 155.4 ms avg latency, 242.0 ms max latency.
764055 records sent, 152811.0 records/sec (149.23 MB/sec), 68.0 ms avg latency, 152.0 ms max latency.
740955 records sent, 148191.0 records/sec (144.72 MB/sec), 136.4 ms avg latency, 216.0 ms max latency.
756360 records sent, 151272.0 records/sec (147.73 MB/sec), 156.4 ms avg latency, 215.0 ms max latency.
752130 records sent, 150426.0 records/sec (146.90 MB/sec), 185.4 ms avg latency, 235.0 ms max latency.
744810 records sent, 148962.0 records/sec (145.47 MB/sec), 178.6 ms avg latency, 266.0 ms max latency.
712755 records sent, 142551.0 records/sec (139.21 MB/sec), 89.4 ms avg latency, 222.0 ms max latency.
716835 records sent, 143367.0 records/sec (140.01 MB/sec), 206.6 ms avg latency, 240.0 ms max latency.
752910 records sent, 150582.0 records/sec (147.05 MB/sec), 167.7 ms avg latency, 243.0 ms max latency.
780090 records sent, 156018.0 records/sec (152.36 MB/sec), 158.5 ms avg latency, 259.0 ms max latency.
739125 records sent, 147825.0 records/sec (144.36 MB/sec), 96.2 ms avg latency, 220.0 ms max latency.
715530 records sent, 143106.0 records/sec (139.75 MB/sec), 141.1 ms avg latency, 224.0 ms max latency.
727080 records sent, 145416.0 records/sec (142.01 MB/sec), 196.2 ms avg latency, 246.0 ms max latency.
781545 records sent, 156309.0 records/sec (152.65 MB/sec), 82.0 ms avg latency, 230.0 ms max latency.
708045 records sent, 141609.0 records/sec (138.29 MB/sec), 192.7 ms avg latency, 242.0 ms max latency.
737985 records sent, 147597.0 records/sec (144.14 MB/sec), 195.4 ms avg latency, 231.0 ms max latency.
746805 records sent, 149361.0 records/sec (145.86 MB/sec), 177.8 ms avg latency, 242.0 ms max latency.
785925 records sent, 157185.0 records/sec (153.50 MB/sec), 73.2 ms avg latency, 185.0 ms max latency.
754560 records sent, 150912.0 records/sec (147.38 MB/sec), 55.3 ms avg latency, 215.0 ms max latency.
767535 records sent, 153507.0 records/sec (149.91 MB/sec), 169.1 ms avg latency, 259.0 ms max latency.
721935 records sent, 144387.0 records/sec (141.00 MB/sec), 199.8 ms avg latency, 243.0 ms max latency.
753615 records sent, 150723.0 records/sec (147.19 MB/sec), 161.9 ms avg latency, 223.0 ms max latency.
725565 records sent, 145113.0 records/sec (141.71 MB/sec), 188.2 ms avg latency, 233.0 ms max latency.
706680 records sent, 141336.0 records/sec (138.02 MB/sec), 206.6 ms avg latency, 268.0 ms max latency.
701055 records sent, 140211.0 records/sec (136.92 MB/sec), 213.3 ms avg latency, 265.0 ms max latency.
759360 records sent, 151872.0 records/sec (148.31 MB/sec), 195.0 ms avg latency, 233.0 ms max latency.
744345 records sent, 148869.0 records/sec (145.38 MB/sec), 177.8 ms avg latency, 259.0 ms max latency.
707447 records sent, 141489.4 records/sec (138.17 MB/sec), 199.7 ms avg latency, 232.0 ms max latency.
710563 records sent, 142112.6 records/sec (138.78 MB/sec), 209.5 ms avg latency, 255.0 ms max latency.
25000000 records sent, 142388.5 records/sec (139.05 MB/sec), 171.25 ms avg latency, 1340.00 ms max latency, 185 ms 50th, 233 ms 95th, 427 ms 99th, 1214 ms 99.9th.

$ bin/kafka-consumer-perf-test.sh --topic sandbox --messages 25000000 --bootstrap-server localhost:9092 --timeout 50000000
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2025-06-27 11:25:00:198, 2025-06-27 11:27:57:413, 24414.0625, 137.7652, 25000000, 141071.5797, 3432, 173783, 140.4859, 143857.5695
Trunk (7cd99ea)
$ bin/kafka-producer-perf-test.sh --topic sandbox --num-records 25000000 --record-size 1024 --throughput -1 --producer-props bootstrap.servers=localhost:9092
123001 records sent, 24600.2 records/sec (24.02 MB/sec), 1021.5 ms avg latency, 1505.0 ms max latency.
520665 records sent, 104133.0 records/sec (101.69 MB/sec), 308.8 ms avg latency, 797.0 ms max latency.
677145 records sent, 135429.0 records/sec (132.25 MB/sec), 219.0 ms avg latency, 407.0 ms max latency.
737490 records sent, 147498.0 records/sec (144.04 MB/sec), 196.1 ms avg latency, 231.0 ms max latency.
748560 records sent, 149712.0 records/sec (146.20 MB/sec), 177.2 ms avg latency, 233.0 ms max latency.
792915 records sent, 158583.0 records/sec (154.87 MB/sec), 103.6 ms avg latency, 212.0 ms max latency.
723735 records sent, 144747.0 records/sec (141.35 MB/sec), 101.5 ms avg latency, 183.0 ms max latency.
773235 records sent, 154647.0 records/sec (151.02 MB/sec), 76.4 ms avg latency, 150.0 ms max latency.
765450 records sent, 153090.0 records/sec (149.50 MB/sec), 123.9 ms avg latency, 212.0 ms max latency.
756675 records sent, 151335.0 records/sec (147.79 MB/sec), 188.3 ms avg latency, 225.0 ms max latency.
760305 records sent, 152061.0 records/sec (148.50 MB/sec), 195.7 ms avg latency, 223.0 ms max latency.
751275 records sent, 150255.0 records/sec (146.73 MB/sec), 165.1 ms avg latency, 210.0 ms max latency.
780525 records sent, 156105.0 records/sec (152.45 MB/sec), 113.8 ms avg latency, 237.0 ms max latency.
755385 records sent, 151077.0 records/sec (147.54 MB/sec), 60.0 ms avg latency, 165.0 ms max latency.
765375 records sent, 153075.0 records/sec (149.49 MB/sec), 103.4 ms avg latency, 217.0 ms max latency.
724890 records sent, 144949.0 records/sec (141.55 MB/sec), 205.4 ms avg latency, 248.0 ms max latency.
694965 records sent, 138993.0 records/sec (135.74 MB/sec), 209.7 ms avg latency, 251.0 ms max latency.
750165 records sent, 150033.0 records/sec (146.52 MB/sec), 184.6 ms avg latency, 225.0 ms max latency.
712050 records sent, 142410.0 records/sec (139.07 MB/sec), 194.2 ms avg latency, 296.0 ms max latency.
717405 records sent, 143481.0 records/sec (140.12 MB/sec), 211.3 ms avg latency, 253.0 ms max latency.
721352 records sent, 144270.4 records/sec (140.89 MB/sec), 210.4 ms avg latency, 258.0 ms max latency.
706903 records sent, 141380.6 records/sec (138.07 MB/sec), 211.1 ms avg latency, 242.0 ms max latency.
735135 records sent, 147027.0 records/sec (143.58 MB/sec), 187.2 ms avg latency, 244.0 ms max latency.
741525 records sent, 148305.0 records/sec (144.83 MB/sec), 198.4 ms avg latency, 259.0 ms max latency.
766440 records sent, 153288.0 records/sec (149.70 MB/sec), 122.5 ms avg latency, 189.0 ms max latency.
760755 records sent, 152151.0 records/sec (148.58 MB/sec), 156.4 ms avg latency, 233.0 ms max latency.
734430 records sent, 146886.0 records/sec (143.44 MB/sec), 188.5 ms avg latency, 244.0 ms max latency.
743400 records sent, 148680.0 records/sec (145.20 MB/sec), 195.1 ms avg latency, 254.0 ms max latency.
745545 records sent, 149109.0 records/sec (145.61 MB/sec), 175.8 ms avg latency, 218.0 ms max latency.
735855 records sent, 147171.0 records/sec (143.72 MB/sec), 179.6 ms avg latency, 245.0 ms max latency.
727380 records sent, 145476.0 records/sec (142.07 MB/sec), 201.9 ms avg latency, 254.0 ms max latency.
724200 records sent, 144840.0 records/sec (141.45 MB/sec), 203.5 ms avg latency, 260.0 ms max latency.
749055 records sent, 149811.0 records/sec (146.30 MB/sec), 199.1 ms avg latency, 222.0 ms max latency.
771150 records sent, 154230.0 records/sec (150.62 MB/sec), 178.8 ms avg latency, 232.0 ms max latency.
25000000 records sent, 143548.6 records/sec (140.18 MB/sec), 175.20 ms avg latency, 1505.00 ms max latency, 192 ms 50th, 230 ms 95th, 380 ms 99th, 1190 ms 99.9th.

$ bin/kafka-consumer-perf-test.sh --topic sandbox --messages 25000000 --bootstrap-server localhost:9092 --timeout 500000000
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2025-06-27 12:43:59:905, 2025-06-27 12:46:57:942, 24414.0625, 137.1292, 25000000, 140420.2497, 3424, 174613, 139.8181, 143173.7614
