
KAFKA-19390: Call AbstractIndex.safeForceUnmap() in AbstractIndex.resize() on Linux to prevent stale index file memory mappings #19961


Open · wants to merge 8 commits into base: trunk

Conversation

Forest0923

https://issues.apache.org/jira/browse/KAFKA-19390

The AbstractIndex.resize() method does not release the old memory mapping for the offset index and time index files.
If a Mixed GC does not run for a long time, stale mappings accumulate and the broker can crash once the vm.max_map_count limit is reached.

The root cause is that safeForceUnmap() is not called from resize() on Linux, so the code has been changed to unmap the old mmap on all operating systems.

The same problem was reported in KAFKA-7442, but the PR submitted at that time did not acquire all the necessary locks around the mmap accesses and was closed without fixing the issue.
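
For illustration, a minimal standalone sketch of the pattern the fix applies: explicitly unmap the old MappedByteBuffer before remapping, instead of waiting for GC to reclaim it. The class and method here are invented for the example and call sun.misc.Unsafe.invokeCleaner directly; the actual change goes through AbstractIndex.safeForceUnmap() and the index's existing locking.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ResizeSketch {
    private static final sun.misc.Unsafe UNSAFE;
    static {
        try {
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (sun.misc.Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private MappedByteBuffer mmap;

    public void resize(RandomAccessFile raf, int newSize) throws IOException {
        int position = (mmap == null) ? 0 : mmap.position();
        if (mmap != null) {
            // Without this explicit unmap the old mapping lingers until the buffer is
            // garbage collected, and each leaked mapping counts against vm.max_map_count.
            UNSAFE.invokeCleaner(mmap);
        }
        raf.setLength(newSize);
        mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, newSize);
        mmap.position(Math.min(position, newSize));
    }
}
```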

@github-actions bot added the triage (PRs from the community), core (Kafka Broker), and storage (Pull requests that target the storage module) labels on Jun 13, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

Contributor

@junrao left a comment

@Forest0923 : Thanks for the PR. Great fix! Left a couple of comments.

@@ -67,19 +70,23 @@ public TimeIndex(File file, long baseOffset, int maxIndexSize, boolean writable)

this.lastEntry = lastEntryFromIndexFile();

log.debug("Loaded index file {} with maxEntries = {}, maxIndexSize = {}, entries = {}, lastOffset = {}, file position = {}",
file.getAbsolutePath(), maxEntries(), maxIndexSize, entries(), lastEntry.offset, mmap().position());
inLock(lock, () -> {
Contributor

Is lock needed here in the constructor? Ditto for TimeIndex.

Author

I don’t think it’s necessary in the constructor, so I’ll remove it.

@@ -95,7 +99,7 @@ public void sanityCheck() {
* the pair (baseOffset, 0) is returned.
*/
public OffsetPosition lookup(long targetOffset) {
return maybeLock(lock, () -> {
return inLock(lock, () -> {
Contributor
Currently, on Linux, we don't acquire the lock for lookup(), which is used for fetch requests. Both fetch and produce requests are common, and acquiring the lock in the fetch path reduces read/write concurrency. The reader only truly needs the lock when the underlying mmap is changed by resize(). Since resize() is an infrequent event, we could introduce a separate resize lock, held by resize() and by all readers (the callers that currently use maybeLock()). This would help maintain the current level of concurrency.

Author

I’d like to ask the following questions:

  1. The JavaDoc for java.nio.Buffer says it is not thread-safe. In the case of reads, is it still safe for multiple threads to access the same buffer concurrently?
    (See: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/Buffer.html)

  2. To improve read-operation concurrency, are you suggesting the use of ReentrantReadWriteLock instead of ReentrantLock?

Contributor

@Forest0923 : Good question. It's true that the Javadoc does not provide a happens-before guarantee. However, according to this, MappedByteBuffer is a thin wrapper that provides direct access to the OS buffer cache. So, once you have written to it, every thread, and every process, will see the update. In addition to the index file, we also read from the buffer cache for the log file without synchronization. The implementation may change in the future, but that is probably unlikely.

So for now, it's probably better to use a separate resize lock to maintain the current read/write concurrency?
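
For illustration, a hedged sketch of why concurrent readers can share one mapping despite the Buffer Javadoc's thread-safety caveat: the unsafe part is the mutable position/limit cursor, which each reader can avoid by working on its own duplicate() view. Kafka's index lookups follow this pattern; the class and method names below are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

public final class IndexReadSketch {
    // Each reader takes a duplicate() view: the underlying bytes are shared (and, as
    // noted above, visible through the OS page cache), but position and limit are
    // per-view, so concurrent lookups never trample each other's cursor.
    public static int readEntryValue(MappedByteBuffer mmap, int entrySize, int slot) {
        ByteBuffer view = mmap.duplicate(); // independent position/limit, same bytes
        view.position(slot * entrySize);    // moving the cursor affects only this view
        return view.getInt();               // relative read, safe to run concurrently
    }
}
```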

Author

Forest0923 commented Jun 25, 2025

Thank you for the advice!
I have added resizeLock; could you please check?

Please let me know if I have misunderstood anything.

throw new CorruptIndexException("Corrupt time index found, time index file (" + file().getAbsolutePath() + ") has "
+ "non-zero size but the last timestamp is " + lastTimestamp + " which is less than the first timestamp "
+ timestamp(mmap(), 0));
inLock(lock, () -> {
Contributor

Good catch here!

@github-actions bot removed the triage (PRs from the community) label on Jun 25, 2025
Contributor

@junrao left a comment

@Forest0923 : Thanks for the updated PR. A couple more comments. Also, could you run a perf test (producer perf + consumer perf together) to make sure there is no degradation?

@@ -48,6 +51,7 @@ private enum SearchResultType {
private static final Logger log = LoggerFactory.getLogger(AbstractIndex.class);

protected final ReentrantLock lock = new ReentrantLock();
protected final ReentrantReadWriteLock resizeLock = new ReentrantReadWriteLock();
Contributor

Now that we have two locks, could we add comments on how each one is being used? Also, remapLock is probably a more accurate name?

protected final <T, E extends Exception> T maybeLock(Lock lock, StorageAction<T, E> action) throws E {
if (OperatingSystem.IS_WINDOWS || OperatingSystem.IS_ZOS)
lock.lock();
protected final <T, E extends Exception> T inResizeLock(Lock lock, StorageAction<T, E> action) throws E {
Contributor

Perhaps it's better to make this inResizeReadLock and avoid passing the lock in as a parameter. For consistency, we could also introduce a method like inLock for the other lock. That way, both locks could be made private.
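
For illustration, a hedged sketch of what that helper pair could look like, assuming the field names used in this PR (lock, resizeLock) and a StorageAction shape like Kafka's; this is a sketch of the suggestion, not the merged code.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-in for Kafka's StorageAction functional interface (assumed shape: T execute() throws E).
interface StorageAction<T, E extends Exception> {
    T execute() throws E;
}

abstract class AbstractIndexSketch {
    // Serializes mutations of the index contents (append, truncate, resize, flush).
    private final ReentrantLock lock = new ReentrantLock();
    // Protects the mmap reference itself: resize() remaps under the write lock, while
    // readers hold the read lock so the mapping cannot be swapped out underneath them.
    // ("remapLock" was suggested above as a more accurate name.)
    private final ReentrantReadWriteLock resizeLock = new ReentrantReadWriteLock();

    protected final <T, E extends Exception> T inLock(StorageAction<T, E> action) throws E {
        lock.lock();
        try {
            return action.execute();
        } finally {
            lock.unlock();
        }
    }

    protected final <T, E extends Exception> T inResizeReadLock(StorageAction<T, E> action) throws E {
        resizeLock.readLock().lock();
        try {
            return action.execute();
        } finally {
            resizeLock.readLock().unlock();
        }
    }

    protected final <T, E extends Exception> T inResizeWriteLock(StorageAction<T, E> action) throws E {
        resizeLock.writeLock().lock();
        try {
            return action.execute();
        } finally {
            resizeLock.writeLock().unlock();
        }
    }
}
```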

Author

Forest0923 commented Jun 27, 2025

Here are the results of the perf test:

  • OS: Rocky Linux release 9.5 (Linux 5.14.0)
  • OpenJDK version: 17.0.15
| Metric | PR (b035e4c) | Trunk (7cd99ea) |
| --- | --- | --- |
| Producer throughput (records/sec) | 142,388.5 | 143,548.6 |
| Producer throughput (MB/sec) | 139.05 | 140.18 |
| Producer average latency (ms) | 171.25 | 175.20 |
| Producer max latency (ms) | 1,340.00 | 1,505.00 |
| Producer 50th percentile latency (ms) | 185 | 192 |
| Producer 95th percentile latency (ms) | 233 | 230 |
| Producer 99th percentile latency (ms) | 427 | 380 |
| Producer 99.9th percentile latency (ms) | 1,214 | 1,190 |
| Consumer throughput (messages/sec) | 141,071.6 | 140,420.2 |
| Consumer throughput (MB/sec) | 137.77 | 137.13 |
PR (b035e4c)
$ bin/kafka-producer-perf-test.sh --topic sandbox --num-records 25000000 --record-size 1024 --throughput -1 --producer-props bootstrap.servers=localhost:9092
123886 records sent, 24777.2 records/sec (24.20 MB/sec), 1026.2 ms avg latency, 1340.0 ms max latency.
463395 records sent, 92679.0 records/sec (90.51 MB/sec), 342.5 ms avg latency, 784.0 ms max latency.
670455 records sent, 134091.0 records/sec (130.95 MB/sec), 223.0 ms avg latency, 304.0 ms max latency.
735030 records sent, 147006.0 records/sec (143.56 MB/sec), 197.5 ms avg latency, 251.0 ms max latency.
735559 records sent, 147111.8 records/sec (143.66 MB/sec), 170.4 ms avg latency, 237.0 ms max latency.
725546 records sent, 145109.2 records/sec (141.71 MB/sec), 200.9 ms avg latency, 290.0 ms max latency.
755025 records sent, 151005.0 records/sec (147.47 MB/sec), 155.4 ms avg latency, 242.0 ms max latency.
764055 records sent, 152811.0 records/sec (149.23 MB/sec), 68.0 ms avg latency, 152.0 ms max latency.
740955 records sent, 148191.0 records/sec (144.72 MB/sec), 136.4 ms avg latency, 216.0 ms max latency.
756360 records sent, 151272.0 records/sec (147.73 MB/sec), 156.4 ms avg latency, 215.0 ms max latency.
752130 records sent, 150426.0 records/sec (146.90 MB/sec), 185.4 ms avg latency, 235.0 ms max latency.
744810 records sent, 148962.0 records/sec (145.47 MB/sec), 178.6 ms avg latency, 266.0 ms max latency.
712755 records sent, 142551.0 records/sec (139.21 MB/sec), 89.4 ms avg latency, 222.0 ms max latency.
716835 records sent, 143367.0 records/sec (140.01 MB/sec), 206.6 ms avg latency, 240.0 ms max latency.
752910 records sent, 150582.0 records/sec (147.05 MB/sec), 167.7 ms avg latency, 243.0 ms max latency.
780090 records sent, 156018.0 records/sec (152.36 MB/sec), 158.5 ms avg latency, 259.0 ms max latency.
739125 records sent, 147825.0 records/sec (144.36 MB/sec), 96.2 ms avg latency, 220.0 ms max latency.
715530 records sent, 143106.0 records/sec (139.75 MB/sec), 141.1 ms avg latency, 224.0 ms max latency.
727080 records sent, 145416.0 records/sec (142.01 MB/sec), 196.2 ms avg latency, 246.0 ms max latency.
781545 records sent, 156309.0 records/sec (152.65 MB/sec), 82.0 ms avg latency, 230.0 ms max latency.
708045 records sent, 141609.0 records/sec (138.29 MB/sec), 192.7 ms avg latency, 242.0 ms max latency.
737985 records sent, 147597.0 records/sec (144.14 MB/sec), 195.4 ms avg latency, 231.0 ms max latency.
746805 records sent, 149361.0 records/sec (145.86 MB/sec), 177.8 ms avg latency, 242.0 ms max latency.
785925 records sent, 157185.0 records/sec (153.50 MB/sec), 73.2 ms avg latency, 185.0 ms max latency.
754560 records sent, 150912.0 records/sec (147.38 MB/sec), 55.3 ms avg latency, 215.0 ms max latency.
767535 records sent, 153507.0 records/sec (149.91 MB/sec), 169.1 ms avg latency, 259.0 ms max latency.
721935 records sent, 144387.0 records/sec (141.00 MB/sec), 199.8 ms avg latency, 243.0 ms max latency.
753615 records sent, 150723.0 records/sec (147.19 MB/sec), 161.9 ms avg latency, 223.0 ms max latency.
725565 records sent, 145113.0 records/sec (141.71 MB/sec), 188.2 ms avg latency, 233.0 ms max latency.
706680 records sent, 141336.0 records/sec (138.02 MB/sec), 206.6 ms avg latency, 268.0 ms max latency.
701055 records sent, 140211.0 records/sec (136.92 MB/sec), 213.3 ms avg latency, 265.0 ms max latency.
759360 records sent, 151872.0 records/sec (148.31 MB/sec), 195.0 ms avg latency, 233.0 ms max latency.
744345 records sent, 148869.0 records/sec (145.38 MB/sec), 177.8 ms avg latency, 259.0 ms max latency.
707447 records sent, 141489.4 records/sec (138.17 MB/sec), 199.7 ms avg latency, 232.0 ms max latency.
710563 records sent, 142112.6 records/sec (138.78 MB/sec), 209.5 ms avg latency, 255.0 ms max latency.
25000000 records sent, 142388.5 records/sec (139.05 MB/sec), 171.25 ms avg latency, 1340.00 ms max latency, 185 ms 50th, 233 ms 95th, 427 ms 99th, 1214 ms 99.9th.

$ bin/kafka-consumer-perf-test.sh --topic sandbox --messages 25000000 --bootstrap-server localhost:9092 --timeout 50000000
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2025-06-27 11:25:00:198, 2025-06-27 11:27:57:413, 24414.0625, 137.7652, 25000000, 141071.5797, 3432, 173783, 140.4859, 143857.5695
Trunk (7cd99ea)
$ bin/kafka-producer-perf-test.sh --topic sandbox --num-records 25000000 --record-size 1024 --throughput -1 --producer-props bootstrap.servers=localhost:9092
123001 records sent, 24600.2 records/sec (24.02 MB/sec), 1021.5 ms avg latency, 1505.0 ms max latency.
520665 records sent, 104133.0 records/sec (101.69 MB/sec), 308.8 ms avg latency, 797.0 ms max latency.
677145 records sent, 135429.0 records/sec (132.25 MB/sec), 219.0 ms avg latency, 407.0 ms max latency.
737490 records sent, 147498.0 records/sec (144.04 MB/sec), 196.1 ms avg latency, 231.0 ms max latency.
748560 records sent, 149712.0 records/sec (146.20 MB/sec), 177.2 ms avg latency, 233.0 ms max latency.
792915 records sent, 158583.0 records/sec (154.87 MB/sec), 103.6 ms avg latency, 212.0 ms max latency.
723735 records sent, 144747.0 records/sec (141.35 MB/sec), 101.5 ms avg latency, 183.0 ms max latency.
773235 records sent, 154647.0 records/sec (151.02 MB/sec), 76.4 ms avg latency, 150.0 ms max latency.
765450 records sent, 153090.0 records/sec (149.50 MB/sec), 123.9 ms avg latency, 212.0 ms max latency.
756675 records sent, 151335.0 records/sec (147.79 MB/sec), 188.3 ms avg latency, 225.0 ms max latency.
760305 records sent, 152061.0 records/sec (148.50 MB/sec), 195.7 ms avg latency, 223.0 ms max latency.
751275 records sent, 150255.0 records/sec (146.73 MB/sec), 165.1 ms avg latency, 210.0 ms max latency.
780525 records sent, 156105.0 records/sec (152.45 MB/sec), 113.8 ms avg latency, 237.0 ms max latency.
755385 records sent, 151077.0 records/sec (147.54 MB/sec), 60.0 ms avg latency, 165.0 ms max latency.
765375 records sent, 153075.0 records/sec (149.49 MB/sec), 103.4 ms avg latency, 217.0 ms max latency.
724890 records sent, 144949.0 records/sec (141.55 MB/sec), 205.4 ms avg latency, 248.0 ms max latency.
694965 records sent, 138993.0 records/sec (135.74 MB/sec), 209.7 ms avg latency, 251.0 ms max latency.
750165 records sent, 150033.0 records/sec (146.52 MB/sec), 184.6 ms avg latency, 225.0 ms max latency.
712050 records sent, 142410.0 records/sec (139.07 MB/sec), 194.2 ms avg latency, 296.0 ms max latency.
717405 records sent, 143481.0 records/sec (140.12 MB/sec), 211.3 ms avg latency, 253.0 ms max latency.
721352 records sent, 144270.4 records/sec (140.89 MB/sec), 210.4 ms avg latency, 258.0 ms max latency.
706903 records sent, 141380.6 records/sec (138.07 MB/sec), 211.1 ms avg latency, 242.0 ms max latency.
735135 records sent, 147027.0 records/sec (143.58 MB/sec), 187.2 ms avg latency, 244.0 ms max latency.
741525 records sent, 148305.0 records/sec (144.83 MB/sec), 198.4 ms avg latency, 259.0 ms max latency.
766440 records sent, 153288.0 records/sec (149.70 MB/sec), 122.5 ms avg latency, 189.0 ms max latency.
760755 records sent, 152151.0 records/sec (148.58 MB/sec), 156.4 ms avg latency, 233.0 ms max latency.
734430 records sent, 146886.0 records/sec (143.44 MB/sec), 188.5 ms avg latency, 244.0 ms max latency.
743400 records sent, 148680.0 records/sec (145.20 MB/sec), 195.1 ms avg latency, 254.0 ms max latency.
745545 records sent, 149109.0 records/sec (145.61 MB/sec), 175.8 ms avg latency, 218.0 ms max latency.
735855 records sent, 147171.0 records/sec (143.72 MB/sec), 179.6 ms avg latency, 245.0 ms max latency.
727380 records sent, 145476.0 records/sec (142.07 MB/sec), 201.9 ms avg latency, 254.0 ms max latency.
724200 records sent, 144840.0 records/sec (141.45 MB/sec), 203.5 ms avg latency, 260.0 ms max latency.
749055 records sent, 149811.0 records/sec (146.30 MB/sec), 199.1 ms avg latency, 222.0 ms max latency.
771150 records sent, 154230.0 records/sec (150.62 MB/sec), 178.8 ms avg latency, 232.0 ms max latency.
25000000 records sent, 143548.6 records/sec (140.18 MB/sec), 175.20 ms avg latency, 1505.00 ms max latency, 192 ms 50th, 230 ms 95th, 380 ms 99th, 1190 ms 99.9th.

$ bin/kafka-consumer-perf-test.sh --topic sandbox --messages 25000000 --bootstrap-server localhost:9092 --timeout 500000000
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2025-06-27 12:43:59:905, 2025-06-27 12:46:57:942, 24414.0625, 137.1292, 25000000, 140420.2497, 3424, 174613, 139.8181, 143173.7614
