v0.9.0

@retospect tagged this 24 Apr 15:06
This release bundles two independent themes:

1. Cross-corpus filtered semantic search (vector.py, store.py,
   tests/test_cross_corpus_search.py). search_text() now accepts a
   corpus_id filter (scalar or list), with pgvector doing a JOIN to
   refs.corpus_id at query time and Chroma reading a metadata field
   stamped at add_blocks time. corpus_id is threaded through every
   ingest path and surfaces in every hit's metadata alongside slug
   and ref_title.

2. Refresh-on-reingest (store.py, tests/test_store.py). ingest()
   previously early-returned on pdf_hash dedup without refreshing
   the Ref row, so garbage metadata from an earlier unverified
   ingest survived forever. New _should_upgrade_ref /
   _refresh_ref_metadata policy: fill blanks always; on upgrade
   (no Paper yet, or unverified→verified, or garbage→clean title)
   overwrite stale non-null fields. Safety guards prevent
   clean→garbage title downgrades, verified→unverified clobbers,
   and slug collisions. Fixes the 'simpson2007nmat' case I hit
   while re-ingesting nmat1849.pdf after the acatome-meta 0.3.6
   filename-DOI fix.
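
The refresh policy in item 2 can be illustrated roughly as follows. This is a minimal sketch, not the code in store.py: the `Ref` dataclass, its field names, `should_upgrade`, `refresh`, and the `taken_slugs` argument are all hypothetical, and `looks_garbage` is only a stand-in for `is_garbage_title`, whose real heuristics are not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import Optional, Set


def looks_garbage(title: Optional[str]) -> bool:
    """Hypothetical stand-in for acatome-meta's is_garbage_title."""
    if not title:
        return True
    t = title.strip()
    return len(t) < 5 or t.lower().endswith(".pdf")


@dataclass
class Ref:  # hypothetical shape, not the real Ref row
    slug: str
    title: Optional[str] = None
    doi: Optional[str] = None
    year: Optional[int] = None
    verified: bool = False
    has_paper: bool = False


def should_upgrade(old: Ref, new: Ref) -> bool:
    """Upgrade when: no Paper yet, unverified->verified, or garbage->clean title."""
    if not old.has_paper:
        return True
    if not old.verified and new.verified:
        return True
    return looks_garbage(old.title) and not looks_garbage(new.title)


def refresh(old: Ref, new: Ref, taken_slugs: Set[str]) -> Ref:
    upgrade = should_upgrade(old, new)
    out = replace(old)  # shallow copy; the existing row is left untouched
    for name in ("title", "doi", "year"):
        cur, inc = getattr(old, name), getattr(new, name)
        if inc is None:
            continue  # nothing to offer for this field
        if cur is None:
            setattr(out, name, inc)  # fill blanks always
        elif upgrade:
            # Guard: never replace a clean title with a garbage one.
            if name == "title" and not looks_garbage(cur) and looks_garbage(inc):
                continue
            setattr(out, name, inc)  # overwrite stale non-null field
    # Guard: a verified row never reverts to unverified.
    out.verified = old.verified or new.verified
    out.has_paper = old.has_paper or new.has_paper
    # Guard: only adopt the new slug if no other row already uses it.
    if upgrade and new.slug != old.slug and new.slug not in taken_slugs:
        out.slug = new.slug
    return out


# Usage with made-up values: an unverified garbage row is re-ingested with clean metadata.
stale = Ref(slug="old2007slug", title="nmat1849.pdf", has_paper=True)
fresh = Ref(slug="new2007slug", title="A clean example title",
            doi="10.0000/example", verified=True, has_paper=True)
refreshed = refresh(stale, fresh, taken_slugs=set())
```

Keeping the "should we upgrade" decision separate from the per-field merge means each safety guard can be tested on its own, which is presumably why the release splits the logic into two helpers.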

Tests: 10 new parametrized cases for refresh-on-reingest + full
cross-corpus search suite. 251 passed, 0 failed across sqlite,
postgres, and chroma backends locally.
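
For readers unfamiliar with the scalar-or-list corpus_id filter from item 1, the behavior the cross-corpus suite exercises can be pictured like this. It is a sketch under assumptions: `normalize_corpus_ids`, `chroma_style_where`, and `filter_hits` are hypothetical helpers, corpus ids are assumed to be strings, and the hit layout (a dict with a `metadata` dict) is inferred from the description above rather than taken from vector.py. The `$in` form follows Chroma's documented metadata-filter syntax, but whether the real code builds its filter this way is not stated in these notes.

```python
from typing import Any, Dict, Iterable, List, Optional, Union

CorpusFilter = Optional[Union[str, Iterable[str]]]


def normalize_corpus_ids(corpus_id: CorpusFilter) -> Optional[List[str]]:
    """Accept None, a single id, or an iterable of ids; return None or a list."""
    if corpus_id is None:
        return None
    if isinstance(corpus_id, str):
        return [corpus_id]
    return list(corpus_id)


def chroma_style_where(corpus_id: CorpusFilter) -> Optional[Dict[str, Any]]:
    """Build a metadata filter dict: equality for one id, $in for several."""
    ids = normalize_corpus_ids(corpus_id)
    if ids is None:
        return None  # no filter: search across every corpus
    if len(ids) == 1:
        return {"corpus_id": ids[0]}
    return {"corpus_id": {"$in": ids}}


def filter_hits(hits: Iterable[Dict[str, Any]],
                corpus_id: CorpusFilter = None) -> List[Dict[str, Any]]:
    """In-memory equivalent of the filter, handy for checking backend results."""
    ids = normalize_corpus_ids(corpus_id)
    if ids is None:
        return list(hits)
    allowed = set(ids)
    return [h for h in hits if h["metadata"].get("corpus_id") in allowed]


# Made-up hits carrying the metadata fields named in item 1.
hits = [
    {"text": "a", "metadata": {"corpus_id": "papers", "slug": "s1", "ref_title": "T1"}},
    {"text": "b", "metadata": {"corpus_id": "notes", "slug": "s2", "ref_title": "T2"}},
    {"text": "c", "metadata": {"corpus_id": "drafts", "slug": "s3", "ref_title": "T3"}},
]
only_papers = filter_hits(hits, "papers")
papers_and_notes = filter_hits(hits, ["papers", "notes"])
```

Normalizing the argument once at the top keeps the scalar and list cases on a single code path, so each backend only has to translate a list of ids into its own filter form.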

Also: bump acatome-meta floor to 0.3.6 (imports is_garbage_title).