Problem
MemForge uses PostgreSQL's built-in ts_rank_cd for keyword search ranking. This is a cover-density ranking based on term frequency and proximity within a single document. It has no corpus-level IDF statistics, and its document-length normalization is only an optional, coarse flag. As a result, short, focused memories don't rank higher than long ones with the same match count.
Proposed Solution
Adopt Timescale's pg_textsearch extension for true BM25 ranking.
What changes
- Replace the content_tsv tsvector column and GIN index with a BM25 index on content
- Replace ts_rank_cd(content_tsv, plainto_tsquery(...)) with the <@> BM25 operator
- Feed BM25 scores into the existing RRF fusion alongside pgvector semantic scores
- Remove manual tsvector/tsquery construction: BM25 indexes raw text directly
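To illustrate how BM25 results could feed the existing fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked ID lists. The function name rrfFuse, the constant k = 60, and the memory IDs are illustrative assumptions, not MemForge's actual implementation.

```typescript
// Reciprocal Rank Fusion over ranked ID lists (best match first).
// k = 60 is a commonly used default, not a value taken from MemForge.
function rrfFuse(
  rankings: string[][],
  k: number = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // Ranks are 1-based; each list contributes 1 / (k + rank).
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) || 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical result lists: keyword (BM25) order and semantic (pgvector) order.
const bm25Ranked = ["m3", "m1", "m7"];
const vectorRanked = ["m1", "m9", "m3"];
const fused = rrfFuse([bm25Ranked, vectorRanked]);
```

Because RRF consumes only rank positions, the fusion step should not need to change when the keyword score's scale or sign changes. Only the order of the keyword results matters.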
What stays the same
- pgvector for semantic search (unchanged)
- RRF for hybrid fusion (unchanged)
- pg_trgm for entity deduplication (different use case)
- Trigram fallback for typo-tolerant search
Benefits for memory retrieval
- BM25 length normalization: short, focused memories rank higher than long rambling ones
- Tunable k1/b parameters per index for memory-specific relevance profiles
- Block-Max WAND optimization: fast top-k without scoring every match
- Simpler code: no tsvector column maintenance
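To make the length-normalization and k1/b points concrete, here is a sketch of the textbook BM25 score for a single query term. It uses the Lucene-style IDF variant; the extension's exact formula and defaults may differ, and the corpus numbers are made up for illustration.

```typescript
// BM25 contribution of one query term to one document.
// IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)). This is an assumption;
// pg_textsearch may differ in detail.
interface Bm25Params {
  k1: number; // term-frequency saturation
  b: number; // strength of length normalization, 0 (off) to 1 (full)
}

function bm25TermScore(
  termFreq: number,
  docLength: number,
  avgDocLength: number,
  totalDocs: number,
  docsWithTerm: number,
  params: Bm25Params = { k1: 1.2, b: 0.75 },
): number {
  const { k1, b } = params;
  const idf = Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
  // Documents longer than average get a norm above 1, which lowers the score.
  const lengthNorm = 1 - b + b * (docLength / avgDocLength);
  return (idf * (termFreq * (k1 + 1))) / (termFreq + k1 * lengthNorm);
}

// Hypothetical corpus: 1000 memories, average length 100 tokens, term in 50 of them.
const avgLen = 100;
const total = 1000;
const withTerm = 50;

// Same match count (2) in a 20-token memory and a 400-token memory.
const shortMemoryScore = bm25TermScore(2, 20, avgLen, total, withTerm);
const longMemoryScore = bm25TermScore(2, 400, avgLen, total, withTerm);

// With b = 0, length normalization is disabled and the two scores coincide.
const noNorm: Bm25Params = { k1: 1.2, b: 0 };
const shortNoNormScore = bm25TermScore(2, 20, avgLen, total, withTerm, noNorm);
const longNoNormScore = bm25TermScore(2, 400, avgLen, total, withTerm, noNorm);
```

With the default parameters the short memory scores higher than the long one for the same match count. Setting b to 0 removes that preference, which is the kind of per-index tuning the k1/b bullet refers to.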
Prerequisites
- PostgreSQL 17+ (currently targeting 16)
- shared_preload_libraries configuration (Dockerfile change)
- Linux/macOS only for pre-built binaries (no Windows builds yet)
Migration path
- Upgrade Docker image from postgres:16-alpine to postgres:17-alpine
- Add pg_textsearch to shared_preload_libraries
- CREATE EXTENSION pg_textsearch
- CREATE INDEX warm_tier_bm25_idx ON warm_tier USING bm25 (content)
- Update queryKeyword() in memory-manager.ts to use <@> operator
- Drop content_tsv column and GIN index (migration)
- Benchmark against ts_rank_cd on realistic dataset
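As a rough sketch of the queryKeyword() change, the helper below builds a parameterized query that orders by the <@> operator. Several details are assumptions to verify against the installed pg_textsearch version: that a to_bm25query(query, index_name) helper exists, that ascending order returns the best matches first, and that warm_tier has an id column. buildKeywordQuery and SqlQuery are hypothetical names.

```typescript
// Shape accepted by common PostgreSQL clients: SQL text plus bound values.
interface SqlQuery {
  text: string;
  values: unknown[];
}

// Hypothetical replacement for the ts_rank_cd query in queryKeyword().
// ASSUMPTION: to_bm25query(text, index_name) exists and ascending order is
// best-first; check the pg_textsearch README for the version in use.
function buildKeywordQuery(searchText: string, limit: number): SqlQuery {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError("limit must be a positive integer");
  }
  return {
    text:
      "SELECT id, content " +
      "FROM warm_tier " +
      "ORDER BY content <@> to_bm25query($1, 'warm_tier_bm25_idx') " +
      "LIMIT $2",
    // The search text is bound as a parameter, never concatenated into the SQL.
    values: [searchText, limit],
  };
}

const keywordQuery = buildKeywordQuery("postgres vacuum", 10);
```

Returning rows in ranked order is enough for RRF, so this sketch does not select the raw score. Naming the index explicitly is a conservative choice in case the shorter form with a bare text query does not resolve the index for parameterized statements.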