Rebuild and validate article search indexes #233

Merged
user1303836 merged 1 commit into main from codex/search-index-recovery
Mar 8, 2026

@user1303836
Owner

Summary

  • add startup health checks and rebuild-from-SQLite recovery for the article search index
  • make /index perform a full rebuild so it can repair drift instead of only upserting
  • validate existing zvec collections against the configured embedding dimensions/model and recreate incompatible derived indexes early
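
The compatibility check in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name, the "model" metadata key, and the decision to treat missing metadata as incompatible are all assumptions made for the example (only the "dimensions" key appears in the review below).

```python
from __future__ import annotations

from typing import Any


def collection_needs_recreate(
    metadata: dict[str, Any] | None,
    expected_dimensions: int,
    expected_model: str,
) -> bool:
    """Return True when stored metadata does not match the configured embedder.

    Hypothetical helper: the "model" key and the None handling are assumptions.
    """
    if metadata is None:
        # Nothing recorded, so compatibility cannot be proven: rebuild.
        return True
    if metadata.get("dimensions") != expected_dimensions:
        # Vectors of a different width cannot be searched with the new embedder.
        return True
    # Same width but a different model still yields incomparable vectors.
    return metadata.get("model") != expected_model
```

Because the index is derived from SQLite, erring on the side of recreating it costs embedding calls but never loses source data.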

Testing

  • PYTHONPATH=/Users/user1303836/Development/intelstream-codex-search-index-recovery/src /Users/user1303836/Development/intelstream/.venv/bin/pytest tests/test_vector_store.py tests/test_discord/test_search.py tests/test_database.py -q
  • /Users/user1303836/Development/intelstream/.venv/bin/ruff check src/intelstream/database/vector_store.py src/intelstream/database/repository.py src/intelstream/discord/cogs/search.py src/intelstream/bot.py tests/test_vector_store.py tests/test_discord/test_search.py
  • /Users/user1303836/Development/intelstream/.venv/bin/ruff format --check src/intelstream/database/vector_store.py src/intelstream/database/repository.py src/intelstream/discord/cogs/search.py src/intelstream/bot.py tests/test_vector_store.py tests/test_discord/test_search.py

Closes #228
Closes #229

Owner Author

@user1303836 user1303836 left a comment

Review

Overall

Good defensive feature: startup health checks, automatic rebuild from SQLite, and dimension/model validation for zvec collections. Well structured, with solid test coverage.

Blocking

  1. mypy type error. vector_store.py:94 -- _read_collection_metadata returns json.loads(...), which is typed Any, but the declared return type is dict[str, Any] | None. CI confirms this fails type checking. Fix with an explicit cast, or narrow the value through an intermediate variable, e.g.:
    data = json.loads(path.read_text())
    assert isinstance(data, dict)
    return data
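
A slightly fuller version of the same idea, as a standalone sketch. The function below is a hypothetical free-function stand-in for _read_collection_metadata, and returning None for a missing, malformed, or non-object file is an assumption about the desired behavior. It narrows with an if rather than an assert, since asserts are stripped when Python runs with -O.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def read_collection_metadata(path: Path) -> dict[str, Any] | None:
    """Load collection metadata, or return None if it is absent or unusable."""
    if not path.exists():
        return None
    try:
        data: Any = json.loads(path.read_text())
    except json.JSONDecodeError:
        # Assumption: a corrupt metadata file is treated like a missing one.
        return None
    if not isinstance(data, dict):
        # Valid JSON that is not an object (e.g. a list) is not usable metadata.
        return None
    # isinstance narrows Any to dict, so the function no longer returns Any.
    return data
```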

Non-blocking

  1. Redundant dimension checking in _collection_needs_recreate. It checks actual_dimension != self._dimensions from the schema, then also checks metadata.get("dimensions") != self._dimensions from the JSON file. These should always agree after the first write. The second check is only useful if someone manually edits the metadata file. Not harmful, just redundant.

  2. /index always recreates the collection now. The old behavior was additive (upsert). The new behavior calls recreate_articles_collection() which destroys and rebuilds everything. This means running /index always re-embeds all content, even if 99% is already correct. For large datasets this could be expensive in embedding API calls. Was additive-with-cleanup insufficient?

  3. Health check false positive risk. The probe embeds a sample item and searches for it in the top 10 results. If the item exists but isn't in the top 10 (e.g., many similar items), the check reports unhealthy and triggers a full rebuild. HEALTH_CHECK_TOPK = 10 is reasonable but not bulletproof.

  4. rendered_results tracking is a good catch. Previously, if vector search returned IDs that no longer exist in SQLite (orphaned references), the embed would show 0 fields with a misleading "N results" footer. The new code handles this correctly.

  5. Merge conflict with PR #234. Both PRs modify vector_store.py and repository.py from the same base commit. This PR should merge first (it's more foundational), then #234 should rebase.
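
To make the "additive-with-cleanup" alternative from point 2 concrete, here is a rough sketch. It assumes the index can report a content hash per stored document ID, which may not be true of the current zvec setup; SyncPlan, plan_index_sync, and the hash maps are hypothetical names for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    to_upsert: set[str]  # new or changed in SQLite: must be re-embedded
    to_delete: set[str]  # present in the index but gone from SQLite


def plan_index_sync(
    source_hashes: dict[str, str],
    indexed_hashes: dict[str, str],
) -> SyncPlan:
    """Diff SQLite content hashes against indexed ones (both keyed by ID)."""
    to_upsert = {
        doc_id
        for doc_id, content_hash in source_hashes.items()
        if indexed_hashes.get(doc_id) != content_hash
    }
    to_delete = set(indexed_hashes) - set(source_hashes)
    return SyncPlan(to_upsert=to_upsert, to_delete=to_delete)


plan = plan_index_sync(
    {"a": "h1", "b": "h2", "c": "h3"},
    {"a": "h1", "b": "old", "z": "h9"},
)
# "a" is unchanged and skipped; "b" changed, "c" is new, "z" is orphaned.
```

Only the IDs in to_upsert would need embedding calls, so an index that is 99% correct costs roughly 1% of a full rebuild. A full recreate remains the simpler fallback when the stored hashes themselves cannot be trusted.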

CI

Lint failure is pre-existing (7 unrelated files). Type check failure is real and caused by this PR -- must fix before merging.

Verdict

Fix the mypy error, then good to merge. Should go in before PR #234.

@user1303836 user1303836 force-pushed the codex/search-index-recovery branch from a310b73 to 0315d84 March 8, 2026 00:59
@user1303836 user1303836 merged commit b7e2d2f into main Mar 8, 2026
4 checks passed