Pre-size trigram posting list allocations to reduce GC pressure during indexing #1017

@clemlesne

Context

seek wraps zoekt as a local CLI for AI coding agents. Every search is a blocking tool call — the agent waits for results. On the cold-index path, shard building is the dominant cost, and over half of that cost is Go runtime memory management rather than useful work.

Problem

postingsBuilder.newSearchableString (index/shard_builder.go:90) iterates rune-by-rune through each file, extracting trigrams and appending to posting lists:

s.postings[ng] = append(s.postings[ng], buf[:m]...)

Each trigram's posting list ([]byte) grows via repeated append. Profiled on kubernetes (29k files, 23 shards, Apple M1 Max):

| Flat CPU | Function | What |
|---|---|---|
| 9.7s (38%) | runtime.memclrNoHeapPointers | Clearing newly allocated pages |
| 1.5s (6%) | runtime.memmove | Copying on slice growth |
| 1.8s (7%) | runtime.madvise | Kernel memory management |
| 0.8s (3%) | runtime.mapassign_fast64 | Map insertions |

54% of CPU is runtime memory management. newSearchableString cumulative: 11.1s (44%), flat: 0.54s (2%) — nearly all time is in runtime calls it triggers.
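
To make the growth cost concrete, here is a minimal standalone sketch (not zoekt code) that counts how often append has to reallocate a []byte when it starts empty versus when it starts with enough capacity. The function name countGrowths and the 3-byte chunk size are assumptions chosen for illustration; the exact growth count depends on the Go runtime version.

```go
package main

import "fmt"

// countGrowths appends chunk-sized pieces until the list holds at least n
// bytes and reports how many times the backing array was reallocated,
// observed as a change in capacity. Hypothetical helper, not part of zoekt.
func countGrowths(initialCap, n, chunk int) int {
	list := make([]byte, 0, initialCap)
	buf := make([]byte, chunk)
	growths := 0
	for len(list) < n {
		before := cap(list)
		list = append(list, buf...)
		if cap(list) != before {
			growths++ // a new array was allocated and the old bytes copied over
		}
	}
	return growths
}

func main() {
	const n = 1 << 20
	fmt.Println("growths from empty:    ", countGrowths(0, n, 3))
	fmt.Println("growths when pre-sized:", countGrowths(n+3, n, 3))
}
```

Each growth event allocates a larger array and copies the existing bytes into it, which is the pattern behind the memclrNoHeapPointers and memmove rows in the profile.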

Related PRs: #430 (skip trigram check for small files, 10% speedup), #522 (B+-tree for posting lists on search side), #680 (faster newLinesIndices), #838/#839 (reduce prepareNormalBuild allocations; #839 body states "we need a true rewrite"). None address newSearchableString.

Possible approaches

1. Pre-size maps (lowest effort): postings and lastOffsets are initialized as empty maps (map[ngram][]byte{} at line 82). Pre-sizing with make(map[ngram][]byte, 100_000) avoids incremental map growth and rehashing until the number of distinct trigrams exceeds the hint. The DocChecker already does this at line 596.

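2. Single backing buffer: Instead of an independent []byte per trigram, allocate one large []byte and sub-slice it. This avoids a separate runtime allocation, and the memclr that comes with it, every time a posting list grows. Go's experimental arena package (GOEXPERIMENT=arenas, added in Go 1.20) is on hold and not something to depend on, but a manual arena (pre-allocated slice + offset tracking) works.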

3. Pre-size posting list slices: Estimate per-trigram sizes based on corpus size, pre-allocate to reduce growth events.

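4. GC tuning (immediate, no changes to indexing code): GOGC=off + GOMEMLIMIT=6GiB disables the proportional GC trigger, so collections run only as the heap approaches the limit. For batch workloads this is estimated to reduce GC overhead by 80-95%. GOMEMLIMIT has been available since Go 1.19, and the Go GC guide discusses this combination for programs with a known memory budget. Could be set via environment variables or in the indexer entry point.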
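
Approaches 1 and 2 can be sketched together as follows. This is a minimal illustration under stated assumptions, not a patch against zoekt: the ngram type here is a stand-in, and arena, newArena, alloc, appendBytes and minChunk are hypothetical names. The sketch removes per-growth runtime allocations by carving chunks from one pre-allocated slab. It still copies bytes when a list outgrows its chunk, and the outgrown chunk is left unused, so it trades some memory for fewer allocations.

```go
package main

import "fmt"

// ngram stands in for zoekt's trigram key type; the real definition may differ.
type ngram uint64

// minChunk is the smallest chunk handed out for a new posting list (assumed value).
const minChunk = 16

// arena hands out chunks of one large, pre-allocated slab.
type arena struct {
	slab []byte
	off  int // next free offset in slab
}

func newArena(size int) *arena { return &arena{slab: make([]byte, size)} }

// alloc returns a zero-length slice with capacity n carved from the slab.
// If the slab is exhausted it falls back to a normal heap allocation.
func (a *arena) alloc(n int) []byte {
	if a.off+n > len(a.slab) {
		return make([]byte, 0, n)
	}
	// The three-index slice caps capacity so neighbours cannot be overwritten.
	b := a.slab[a.off : a.off : a.off+n]
	a.off += n
	return b
}

// appendBytes appends data to list, moving list to a larger arena chunk
// when it would overflow. The old chunk is abandoned, not reused.
func (a *arena) appendBytes(list, data []byte) []byte {
	if len(list)+len(data) > cap(list) {
		newCap := 2*cap(list) + len(data)
		if newCap < minChunk {
			newCap = minChunk
		}
		list = append(a.alloc(newCap), list...)
	}
	return append(list, data...) // always fits, so no runtime growth here
}

func main() {
	a := newArena(1 << 20)
	// Approach 1: size hint for the map, value taken from the issue text.
	postings := make(map[ngram][]byte, 100_000)
	for i := 0; i < 1000; i++ {
		ng := ngram(i % 7)
		postings[ng] = a.appendBytes(postings[ng], []byte{byte(i), byte(i >> 8)})
	}
	fmt.Println("distinct trigrams:", len(postings), "arena bytes used:", a.off)
}
```

A real implementation would need to size the slab from the shard's input size and decide what to do with abandoned chunks; both are left open here.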

Impact

Reducing runtime memory management from 54% of shard-building CPU time to ~20% of the current total leaves 46% + 20% = 66% of today's cost, a ~1.5x speedup (1/0.66). On kubernetes, that takes the cold index from ~8s to ~5-6s.