
Feature/add embedding storaging #23

Merged
xueyulinn merged 15 commits into review from feature/add-embedding-storaging on May 14, 2026

@xueyulinn (Owner) commented May 14, 2026

Summary

This PR adds code chunk embedding storage and convention-aware evidence gathering to the PR review pipeline. It indexes code chunks with OpenAI embeddings, stores vectors in PostgreSQL using pgvector, and uses semantic search
to retrieve similar repository code for convention evidence during automated reviews.
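
To make the retrieval step concrete, here is a minimal sketch of a nearest-neighbour query against the code_chunk_embeddings table. It is not the code from this PR: the column names (`chunk_id`, `embedding`), the helper names, and the psycopg-style `%s` placeholders are assumptions for illustration. The `<=>` operator is pgvector's cosine distance operator.

```python
from typing import Any, Sequence

# Assumed column names; the PR description only names the table itself.
SEARCH_SQL = """
SELECT chunk_id, embedding <=> %s::vector AS distance
FROM code_chunk_embeddings
ORDER BY distance
LIMIT %s
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def search_similar_chunks(
    conn: Any, query_embedding: Sequence[float], limit: int = 5
) -> list[tuple[int, float]]:
    """Return (chunk_id, cosine_distance) pairs, nearest first.

    `conn` is any DB-API connection whose driver uses %s placeholders.
    """
    cur = conn.cursor()
    try:
        cur.execute(SEARCH_SQL, (to_vector_literal(query_embedding), limit))
        return [(int(row[0]), float(row[1])) for row in cur.fetchall()]
    finally:
        cur.close()
```

The query embedding would come from the same embedding model used at indexing time, since distances between vectors from different models are not meaningful.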

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Documentation
  • Test
  • CI/CD
  • Chore

Changes Made

  • Added code chunk embedding generation with batched OpenAI embedding requests.
  • Added PostgreSQL persistence and semantic search support for code chunk embeddings.
  • Enabled pgvector in database initialization and added the code_chunk_embeddings table plus vector indexes.
  • Updated code chunk insertion to return inserted chunk IDs for downstream embedding generation.
  • Added retrieval of persisted code chunks by ID.
  • Added a convention evidence agent that searches semantically similar code chunks for repository convention context.
  • Added a dedicated code review writer agent for producing final PR review output and inline comments.
  • Reworked the PR review flow to gather structural and convention evidence, then submit review comments through GitHub.
  • Removed the older combined code_review_agent.py implementation.
  • Updated prompts and static analysis evidence flow to support the new review pipeline.
  • Added use of the openai dependency for embedding generation.
  • Added module/function documentation for the code embedding module.
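
The batched embedding step listed above could look roughly like the following sketch. The helper names, the batch size of 64, and the model name `text-embedding-3-small` are assumptions, not values taken from this PR. The batching logic takes the embedding call as a parameter so it can be exercised without network access.

```python
from typing import Callable, Iterator, Sequence

EmbedFn = Callable[[list[str]], list[list[float]]]


def batched(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 64
) -> list[list[float]]:
    """Embed `texts` batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        result = embed_fn(batch)
        if len(result) != len(batch):
            raise ValueError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors


def openai_embed_fn(model: str = "text-embedding-3-small") -> EmbedFn:
    """Build an EmbedFn backed by the OpenAI embeddings endpoint.

    The model name is an assumption; the PR does not state which one it uses.
    """
    from openai import OpenAI  # lazy import: the helpers above do not need it

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(batch: list[str]) -> list[list[float]]:
        response = client.embeddings.create(model=model, input=batch)
        return [item.embedding for item in response.data]

    return embed
```

A caller would pair the returned vectors with the chunk IDs from insert_code_chunks, for example `zip(chunk_ids, embed_in_batches(texts, openai_embed_fn()))`.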

How to Test

  • Run the existing test suite:

    uv run pytest
    
  • Run static type checks:

    uv run mypy codehawk

  • Verify database initialization creates the pgvector extension and code_chunk_embeddings table.

  • Run the code indexing flow against a repository archive and confirm:

    • code chunks are inserted,
    • chunk IDs are returned,
    • embeddings are generated,
    • embedding rows are persisted.
  • Run the PR review flow and confirm:

    • structural evidence is generated,
    • convention evidence can query similar code chunks,
    • the review writer produces valid PR review output,
    • GitHub review comments are submitted against changed lines.
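
The database initialization check above can be scripted. The sketch below assumes a DB-API connection to the target PostgreSQL database and that code_chunk_embeddings is reachable through the current search_path; the function name is hypothetical. It relies only on the `pg_extension` catalog and `to_regclass`, both standard PostgreSQL.

```python
from typing import Any


def pgvector_schema_ready(conn: Any) -> dict[str, bool]:
    """Report whether the vector extension and the embeddings table exist."""
    cur = conn.cursor()
    try:
        # pgvector registers itself under the extension name 'vector'.
        cur.execute("SELECT 1 FROM pg_extension WHERE extname = 'vector'")
        has_extension = cur.fetchone() is not None

        # to_regclass returns NULL when the relation does not exist.
        cur.execute("SELECT to_regclass('code_chunk_embeddings')")
        row = cur.fetchone()
        has_table = row is not None and row[0] is not None
    finally:
        cur.close()
    return {"extension": has_extension, "table": has_table}
```

Both values should be `True` after a successful initialization run.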

Checklist

  • Code builds successfully
  • Tests added or updated
  • Existing tests pass
  • Documentation updated if needed
  • No secrets or sensitive data included
  • Breaking changes documented

Breaking Changes

  • insert_code_chunks now returns list[int] instead of an inserted row count.
  • Database initialization now requires pgvector support through CREATE EXTENSION IF NOT EXISTS vector.
  • The previous code_review_agent.py has been removed in favor of separate evidence and review writer agents.
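
For callers affected by the insert_code_chunks change, the adjustment is small. The sketch below uses a stand-in function with a hypothetical signature, since the real one is not shown in this description; it only illustrates how code that relied on the old row count can derive it from the returned IDs.

```python
def insert_code_chunks(chunks: list[str]) -> list[int]:
    """Stand-in for the real function: returns one ID per inserted chunk."""
    return list(range(1, len(chunks) + 1))


def index_chunks(chunks: list[str]) -> tuple[int, dict[int, str]]:
    chunk_ids = insert_code_chunks(chunks)
    # Callers that previously used the integer return value can take len().
    inserted_count = len(chunk_ids)
    # The IDs themselves are what downstream embedding generation needs.
    text_by_id = dict(zip(chunk_ids, chunks))
    return inserted_count, text_by_id
```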

Additional Notes

This PR targets the review branch from feature/add-embedding-storaging.

@xueyulinn xueyulinn merged commit b71e43d into review May 14, 2026
4 checks passed