Java Performance Engineering - A Log Scanner Case Study

This repository documents a performance engineering journey in Java: building a log scanner, starting from the most natural, readable solution and improving it step by step.
Each version isolates one bottleneck, fixes it, measures the result, and explains why it worked.

The goal is not just to arrive at a fast solution.
It is to understand why the naive solution is slow, and what exactly each change does to the hardware and the JVM.

Test Environment

All benchmarks were run on a single thread - no concurrency, no parallelism.
Every number reflects the performance of a single-threaded scan on one core.


CPU	Intel Core i5-1035G1 @ 1.00GHz (1.19 GHz boost)
RAM	8.00 GB
OS	Windows x64 (64-bit)
JDK	Java 25

The Task

We have a log file containing one million lines.
Each line is a single log event produced by an application: errors, warnings, debug messages, and others.

2026-03-05 01:28:28 [ERROR] User login failed - invalid credentials
2026-03-05 01:28:31 [DEBUG] Session cache hit for token abc123
2026-03-05 01:28:45 [WARN]  Response time exceeded threshold: 1842ms
2026-03-05 02:14:03 [ERROR] Database connection timeout after 30s

The log level (ERROR, WARN, DEBUG, etc.) and the timestamp are always at fixed offsets from the start of the line.
The message after the log level varies in length, so lines are not guaranteed to be the same length.
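
Because the offsets are fixed, the hour and the level can be read straight from known positions without splitting the line. The sketch below assumes the layout of the sample lines above (hour digits at index 11, level text starting at index 21). The class and method names are illustrative and are not taken from the repository.

```java
class LineFields {
    // Assumed layout: "yyyy-MM-dd HH:mm:ss [LEVEL] message"
    static final int HOUR_OFFSET = 11;   // first digit of HH
    static final int LEVEL_OFFSET = 21;  // first character after '['

    // Reads the two hour digits directly instead of parsing the timestamp.
    static int hourOf(String line) {
        return (line.charAt(HOUR_OFFSET) - '0') * 10
             + (line.charAt(HOUR_OFFSET + 1) - '0');
    }

    // True when the level field is exactly ERROR.
    static boolean isError(String line) {
        return line.startsWith("ERROR]", LEVEL_OFFSET);
    }

    public static void main(String[] args) {
        String line = "2026-03-05 02:14:03 [ERROR] Database connection timeout after 30s";
        System.out.println(hourOf(line) + " " + isError(line)); // prints "2 true"
    }
}
```

The variable message length only matters when locating the start of the next line; within a line, these two offsets stay valid.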

The problem: scan the entire file, find every ERROR line, and produce a count of how many errors occurred in each hour of the day.

Expected output - errors grouped by hour (0–23), stored in an array:

Hour 00 →  143 errors
Hour 01 →  311 errors
Hour 02 →   87 errors
...
Hour 23 →  204 errors

As an array: [143, 311, 87, ...]
Each index represents the hour.
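
To make the expected output concrete, here is a minimal sketch that counts ERROR lines per hour from an in-memory list and prints them in the format above. It is not the repository's V01 code: the class name, the method names, and the hard-coded list standing in for input.txt are assumptions made to keep the example self-contained.

```java
import java.util.List;

class HourlyErrorCount {
    // Index = hour of day (0-23), value = number of ERROR lines in that hour.
    static int[] countErrorsByHour(List<String> lines) {
        int[] perHour = new int[24];
        for (String line : lines) {
            // "[ERROR]" is assumed to start at index 20, the hour at index 11.
            if (line.startsWith("[ERROR]", 20)) {
                int hour = Integer.parseInt(line, 11, 13, 10);
                perHour[hour]++;
            }
        }
        return perHour;
    }

    // \u2192 is the arrow used in the expected output above.
    static String formatLine(int hour, int count) {
        return String.format("Hour %02d \u2192 %4d errors", hour, count);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "2026-03-05 01:28:28 [ERROR] User login failed - invalid credentials",
            "2026-03-05 01:28:31 [DEBUG] Session cache hit for token abc123",
            "2026-03-05 01:28:45 [WARN]  Response time exceeded threshold: 1842ms",
            "2026-03-05 02:14:03 [ERROR] Database connection timeout after 30s");
        int[] perHour = countErrorsByHour(lines);
        for (int h = 0; h < perHour.length; h++) {
            System.out.println(formatLine(h, perHour[h]));
        }
    }
}
```

For the four sample lines, the array holds 1 at index 1, 1 at index 2, and 0 everywhere else.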

Simple problem. The interesting part is how you solve it.

Why This Problem?

Log analysis is one of the most common real-world workloads in backend systems.
It is also a perfect case study for performance engineering because:

The input is large enough that naive solutions visibly struggle
The algorithm itself is trivial: the bottleneck is never the logic, always the I/O and memory model
Every optimization targets something concrete and measurable: heap allocation, GC pressure, CPU instruction count

This makes it easy to isolate what each change actually does rather than guessing why something got faster.

Repository Structure

Each version lives in its own package and has its own documentation:

src/
└── main/
    ├── java/
    │   ├── generator/
    │   │   └── LogGenerator.java          ← generates the input.txt benchmark file
    │   └── logscanner/
    │       └── version0X/
    │           ├── LogScannerV0X.java     ← JMH benchmark class
    │           ├── Tester.java            ← algorithm test before going to JMH
    │           └── overview.md            ← documentation for this version
    └── resources/
        ├── example.txt                    ← small sample log file used by Tester
        ├── input.txt                      ← real benchmark input (~1M rows, ~120MB)
        └── runOptions.txt                 ← JVM flags and run commands

Tester exists because JMH is not a good environment for debugging: it runs the benchmark in a forked JVM with no easy way to inspect output.
Every new algorithm is first validated in Tester against example.txt, using System.out.println() to confirm the hour counts are correct,
then ported into the LogScanner benchmark class to run against input.txt for real measurement.

runOptions.txt contains the exact commands used to run each benchmark and the JVM flags used when inspecting assembly output:

# Running benchmarks with GC profiling
java -jar target/benchmarks.jar logscanner.version01.LogScannerV01 -prof gc
java -jar target/benchmarks.jar logscanner.version02.LogScannerV02 -prof gc
java -jar target/benchmarks.jar logscanner.version03.LogScannerV03 -prof gc
java -jar target/benchmarks.jar logscanner.version04.LogScannerV04 -prof gc

# Printing C2-compiled assembly for a specific method (used in V04 BCE investigation)
-XX:+UnlockDiagnosticVMOptions
-Xbatch -XX:CompileCommand=quiet
-XX:CompileCommand=compileonly,logscanner/version04/Tester.<methodName>
-XX:CompileCommand=print,logscanner/version04/Tester.<methodName>

Version	Approach	Key Idea
V01	`Files.readAllLines()` + Stream API	Baseline, readable, straightforward, GC-bound
V02	`MappedByteBuffer` + byte scanning	Eliminate `String` objects, move I/O off the heap
V03	`getInt()` + `int[24]`	Remove the last allocations, reach zero GC
V04	BCE bitmask + `MemorySegment`	Eliminate JIT bounds checks, modernize the memory API
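
The V02 and V03 rows can be illustrated with a small sketch: scan raw bytes instead of creating `String` objects, test the level with a single `getInt()` comparison, and count into an `int[24]`. This is one reading of the table, not the repository's code. It uses `ByteBuffer.wrap` on a byte array instead of a `MappedByteBuffer` so it runs without a file, it assumes well-formed lines with the fixed offsets described earlier, and every name in it is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class ByteScanSketch {
    // The ASCII bytes "ERRO" packed as one big-endian int.
    static final int ERRO = ('E' << 24) | ('R' << 16) | ('R' << 8) | 'O';
    static final int HOUR_OFFSET = 11;   // assumed offset of HH
    static final int LEVEL_OFFSET = 21;  // assumed offset of the level text

    // Uses absolute get/getInt only, so the buffer position is never moved.
    static int[] scan(ByteBuffer buf) {
        int[] perHour = new int[24];
        int limit = buf.limit();
        int lineStart = 0;
        while (lineStart < limit) {
            // One 4-byte read replaces a character-by-character comparison.
            if (lineStart + LEVEL_OFFSET + 4 <= limit
                    && buf.getInt(lineStart + LEVEL_OFFSET) == ERRO) {
                int hour = (buf.get(lineStart + HOUR_OFFSET) - '0') * 10
                         + (buf.get(lineStart + HOUR_OFFSET + 1) - '0');
                perHour[hour]++;
            }
            // Advance to the byte after the next '\n'.
            int i = lineStart;
            while (i < limit && buf.get(i) != '\n') {
                i++;
            }
            lineStart = i + 1;
        }
        return perHour;
    }

    public static void main(String[] args) {
        String sample =
            "2026-03-05 01:28:28 [ERROR] User login failed - invalid credentials\n"
          + "2026-03-05 01:28:31 [DEBUG] Session cache hit for token abc123\n"
          + "2026-03-05 02:14:03 [ERROR] Database connection timeout after 30s\n";
        byte[] bytes = sample.getBytes(StandardCharsets.US_ASCII);
        // Big-endian is the ByteBuffer default; set explicitly because ERRO depends on it.
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
        int[] perHour = scan(buf);
        System.out.println(perHour[1] + " " + perHour[2]); // prints "1 1"
    }
}
```

The loop allocates nothing per line; the only allocation is the result array. That is the property the GC columns in the results table track.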

How to Read This Repository

Each version's document follows the same structure:

What changed - a clear table of what this version does differently from the last
How it works - a walkthrough of the actual code with explanation
Why it matters - the reasoning behind the change, not just the result
Benchmark results - real JMH numbers with analysis of what they mean
What comes next - what the numbers reveal about the remaining bottleneck

The benchmarks are run with JMH in AverageTime mode over a ~120MB file with one million log lines.
GC allocation rate, GC count, and GC time are tracked alongside execution time so that every source of overhead is visible.

The Bottom Line

Version	Execution Time	GC Alloc Rate	GC Pauses
V01	872ms ± 346ms	156 MB/sec	19/op
V02	394ms ± 402ms	13.4 MB/sec	2/op
V03	194ms ± 24ms	0.011 MB/sec	0/op
V04	78ms ± 9ms	0.019 MB/sec	0/op

From ~870ms and 19 GC pauses per run, to ~78ms and zero GC on the same file,
with the same correct output,
by changing only how the data is represented and accessed.
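
The table can also be read as speedup and throughput. The sketch below derives both from the mean times above and the approximate 120MB file size. These are rough figures, since the file size is approximate and the error margins for V01 and V02 are wide; the class and method names are illustrative.

```java
class BottomLineMath {
    // How many times faster the second measurement is than the first.
    static double speedup(double baselineMs, double fasterMs) {
        return baselineMs / fasterMs;
    }

    // Megabytes scanned per second for a file of fileMB scanned in ms milliseconds.
    static double throughputMBps(double fileMB, double ms) {
        return fileMB / (ms / 1000.0);
    }

    public static void main(String[] args) {
        double fileMB = 120.0;                     // approximate input size
        double[] meanMs = {872, 394, 194, 78};     // V01..V04 mean times from the table
        for (int v = 0; v < meanMs.length; v++) {
            System.out.printf("V0%d: %.1fx vs V01, about %.0f MB/s%n",
                v + 1, speedup(meanMs[0], meanMs[v]), throughputMBps(fileMB, meanMs[v]));
        }
    }
}
```

On these means, V04 is roughly 11 times faster than V01, going from about 140 MB/s to about 1.5 GB/s on a single thread.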

Requirements

Java 22+ (for FFM API in V04)
Maven
JMH (included via pom.xml)

Each version is a step, not a rewrite. Read them in order.
