Key Features:
1. Semantic Document Preparation
- Rich text creation from classification results for better embedding
- Metadata preservation for filtering and context
- Business-focused text combining purpose, rules, workflows, and integration points

2. Chroma Integration
- Persistent storage for the vector database
- Sentence-transformers for embeddings (all-MiniLM-L6-v2 default)
- Batch processing for large codebases
- Collection management with reset capabilities

3. Semantic Search Testing
- Query testing with loyalty-specific test cases
- Result ranking by semantic similarity
- Distance metrics for search quality assessment

4. Database Analytics
- Collection statistics (projects, file types, confidence scores)
- Provider comparison across OpenAI/Anthropic/CodeLlama results
- Embedding model comparison functionality
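The collection statistics listed under Database Analytics amount to counting metadata values per key. The block below is a minimal sketch of that idea, not the code inside `SemanticVectorDatabase`; the function name `summarize_metadata`, the metadata keys, and the sample records are all assumptions made for illustration.

```python
from collections import Counter


def summarize_metadata(metadatas, keys=("project", "file_type", "llm_provider")):
    """Count how often each value occurs for the selected metadata keys."""
    stats = {"total_documents": len(metadatas)}
    for key in keys:
        # Records missing a key are grouped under "unknown" instead of being dropped.
        stats[key] = dict(Counter(m.get(key, "unknown") for m in metadatas))
    return stats


# Made-up records; the key names are hypothetical, not the real schema.
sample = [
    {"project": "Api", "file_type": "cs", "llm_provider": "provider-a"},
    {"project": "Api", "file_type": "appsettings", "llm_provider": "provider-a"},
    {"project": "Worker", "file_type": "cs", "llm_provider": "provider-a"},
]
stats = summarize_metadata(sample)
print(stats)
```

Running the same summary over collections built from different providers is one simple way to compare their coverage, for example how many files each provider classified per project.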

Usage Flow:
- Input: classification/results/{llm}/output.csv, produced by the classification/loyalty_classifier_nb notebook
- Processing: Convert semantic classifications to embeddings
- Output: Searchable vector database for RAG queries
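
The processing step depends on turning each classification row into one block of business-focused text plus metadata. The sketch below shows one way this could look. It is not the implementation of `prepare_documents_for_embedding`; the function `build_document`, the column names, and the sample row are hypothetical.

```python
def build_document(row: dict) -> dict:
    """Combine the business-facing fields of one classification row into embeddable text."""
    sections = [
        ("Purpose", row.get("business_purpose")),
        ("Rules", row.get("business_rules")),
        ("Workflows", row.get("workflows")),
        ("Integration points", row.get("integration_points")),
    ]
    # Skip empty fields so the text is not padded with blank labels.
    text = "\n".join(f"{label}: {value}" for label, value in sections if value)
    # Keep non-text attributes as metadata so they can be used for filtering.
    metadata = {key: row[key] for key in ("file_path", "project", "confidence") if key in row}
    return {"id": row["file_path"], "text": text, "metadata": metadata}


# Made-up row; every column name here is an assumption.
row = {
    "file_path": "src/Example.cs",
    "project": "Example.csproj",
    "confidence": 0.9,
    "business_purpose": "Award points when an order completes",
    "business_rules": "One point per unit of currency spent",
    "workflows": "",
    "integration_points": "Order completed event",
}
doc = build_document(row)
print(doc["text"])
```

Labelling each section keeps the purpose, rules, workflows, and integration points distinguishable in the embedded text, while the metadata stays out of the embedding and remains available for filtering.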

Test Queries Included:
- "loyalty points calculation rules"
- "order processing workflow"
- "customer data integration"
- "payment service integration"
- "business rule patterns"

In [1]:
CLASSIFICATION_CSV = "D:/src/ai-agents/dev-navigator/classification/results/{llm}/output.csv"
DB_PATH = "D:/src/ai-agents/dev-navigator/vectorization/results/{embedder}/croma_db"
COLLECTION_NAME = "loyalty_code_semantics_{llm}"
# Embedding model options for the {embedder} placeholder:
#   "all-MiniLM-L6-v2"           fast, good general purpose
#   "all-mpnet-base-v2"          better quality, slower
#   "multi-qa-MiniLM-L6-cos-v1"  optimized for Q&A

def setup(llm: str, embedder: str):
    """Resolve the CSV path, database path, and collection name for one LLM/embedder pair."""
    config = {"llm": llm, "embedder": embedder}

    classification_csv = CLASSIFICATION_CSV.format(**config)
    db_path = DB_PATH.format(**config)
    collection_name = COLLECTION_NAME.format(**config)

    return classification_csv, db_path, collection_name

In [2]:
import json

from vectorization.document_utils import load_classification_data, prepare_documents_for_embedding
from vectorization.semantic_vector_database import SemanticVectorDatabase

def run_vectorization_pipeline(db_path: str,
                               classification_csv: str,
                               collection_name: str,
                               embedding_model: str,
                               reset_db: bool = True):
    """Main pipeline: build a vector database from classification results."""

    print("=== CodeSense Vector Database Creation ===")

    # Initialize database
    vector_db = SemanticVectorDatabase(db_path, embedding_model)

    # Create collection
    collection = vector_db.create_collection(collection_name, reset_db)

    # Prepare documents from classification data
    df = load_classification_data(classification_csv)
    documents = prepare_documents_for_embedding(df)

    # Add to collection
    collection.add_documents_to_collection(documents)

    # Get collection statistics
    stats = collection.get_collection_stats()
    print("\n=== Collection Statistics ===")
    print(json.dumps(stats, indent=2, default=str))

    # Test semantic search
    test_queries = [
        "loyalty points calculation rules",
        "order processing workflow",
        "customer data integration",
        "payment service integration",
        "business rule patterns"
    ]

    print("\n=== Testing Semantic Search ===")
    for query in test_queries:
        collection.semantic_search(query, n_results=3)
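
The search loop above returns the closest documents for each query. As a rough illustration of ranking by a distance metric, the sketch below orders a few toy vectors by cosine distance to a query vector. The helper names and vectors are made up, and this notebook does not show which metric the collection is actually configured with.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rank_by_distance(query_vec, doc_vecs, n_results=3):
    """Return (doc_id, distance) pairs sorted from closest to farthest."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Toy 2-D "embeddings"; real sentence-transformer vectors have hundreds of dimensions.
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = rank_by_distance([1.0, 0.0], doc_vecs)
for doc_id, distance in ranked:
    print(f"{doc_id}: {distance:.3f}")
```

A smaller distance means a closer semantic match, so comparing the top distances across queries, or across embedding models, gives a quick signal of search quality.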

In [3]:
print("\n=== Running Vectorization with Anthropic set and  all-MiniLM-L6-v2 ===")
classification_csv, db_path, collection_name = setup("claude3.5", "all-MiniLM-L6-v2")
run_vectorization_pipeline(db_path=db_path, classification_csv=classification_csv, collection_name=collection_name, embedding_model="all-MiniLM-L6-v2")


=== Running Vectorization with Anthropic set and  all-MiniLM-L6-v2 ===
=== CodeSense Vector Database Creation ===
Initialized Chroma database at: D:\src\ai-agents\dev-navigator\vectorization\results\all-MiniLM-L6-v2\croma_db
Using embedding model: all-MiniLM-L6-v2
Deleted existing collection: loyalty_code_semantics_claude3.5
Collection 'loyalty_code_semantics_claude3.5' ready with 0 documents
Loaded 33 classification records from D:/src/ai-agents/dev-navigator/classification/results/claude3.5/output.csv
Prepared 33 documents for embedding
Added batch 1: 33/33 documents
Successfully added 33 documents to collection

=== Collection Statistics ===
{
  "total_documents": 33,
  "projects": {
    "PlantBasedPizza.LoyaltyPoints.Api.csproj": 6,
    "PlantBasedPizza.LoyaltyPoints.Internal.csproj": 7,
    "PlantBasedPizza.LoyaltyPoints.Shared.csproj": 13,
    "PlantBasedPizza.LoyaltyPoints.Worker.csproj": 7
  },
  "file_types": {
    "cs": 21,
    "appsettings": 12
  },
  "llm_providers": {
    "Anthropic-claude-3-5-sonnet-20241022": 33
  },
  "technical_patterns": {
    "Cross-cutting concern using Extensi

In [4]:
print("\n=== Running Vectorization with Anthropic set and  all-mpnet-base-v2 ===")
classification_csv, db_path, collection_name = setup("claude3.5", "all-mpnet-base-v2")
run_vectorization_pipeline(db_path=db_path, classification_csv=classification_csv, collection_name=collection_name, embedding_model="all-mpnet-base-v2")


=== Running Vectorization with Anthropic set and  all-mpnet-base-v2 ===
=== CodeSense Vector Database Creation ===
Initialized Chroma database at: D:\src\ai-agents\dev-navigator\vectorization\results\all-mpnet-base-v2\croma_db
Using embedding model: all-mpnet-base-v2
Deleted existing collection: loyalty_code_semantics_claude3.5
Collection 'loyalty_code_semantics_claude3.5' ready with 0 documents
Loaded 33 classification records from D:/src/ai-agents/dev-navigator/classification/results/claude3.5/output.csv
Prepared 33 documents for embedding
Added batch 1: 33/33 documents
Successfully added 33 documents to collection

=== Collection Statistics ===
{
  "total_documents": 33,
  "projects": {
    "PlantBasedPizza.LoyaltyPoints.Api.csproj": 6,
    "PlantBasedPizza.LoyaltyPoints.Internal.csproj": 7,
    "PlantBasedPizza.LoyaltyPoints.Shared.csproj": 13,
    "PlantBasedPizza.LoyaltyPoints.Worker.csproj": 7
  },
  "file_types": {
    "cs": 21,
    "appsettings": 12
  },
  "llm_providers": {


In [5]:
print("\n=== Running Vectorization with Ollama set and  all-MiniLM-L6-v2 ===")
classification_csv, db_path, collection_name = setup("codellama", "all-MiniLM-L6-v2")
run_vectorization_pipeline(db_path=db_path, classification_csv=classification_csv, collection_name=collection_name, embedding_model="all-MiniLM-L6-v2")


=== Running Vectorization with Ollama set and  all-MiniLM-L6-v2 ===
=== CodeSense Vector Database Creation ===
Initialized Chroma database at: D:\src\ai-agents\dev-navigator\vectorization\results\all-MiniLM-L6-v2\croma_db
Using embedding model: all-MiniLM-L6-v2
Deleted existing collection: loyalty_code_semantics_codellama
Collection 'loyalty_code_semantics_codellama' ready with 0 documents
Loaded 28 classification records from D:/src/ai-agents/dev-navigator/classification/results/codellama/output.csv
Prepared 28 documents for embedding
Added batch 1: 28/28 documents
Successfully added 28 documents to collection

=== Collection Statistics ===
{
  "total_documents": 28,
  "projects": {
    "PlantBasedPizza.LoyaltyPoints.Api.csproj": 4,
    "PlantBasedPizza.LoyaltyPoints.Internal.csproj": 6,
    "PlantBasedPizza.LoyaltyPoints.Shared.csproj": 11,
    "PlantBasedPizza.LoyaltyPoints.Worker.csproj": 7
  },
  "file_types": {
    "cs": 17,
    "appsettings": 11
  },
  "llm_providers": {
    "C

In [3]:
print("\n=== Running Vectorization with OpenAI set and  all-MiniLM-L6-v2 ===")
classification_csv, db_path, collection_name = setup("gpt4.1", "all-MiniLM-L6-v2")
run_vectorization_pipeline(db_path=db_path, classification_csv=classification_csv, collection_name=collection_name, embedding_model="all-MiniLM-L6-v2")


=== Running Vectorization with OpenAI set and  all-MiniLM-L6-v2 ===
=== CodeSense Vector Database Creation ===
Initialized Chroma database at: D:\src\ai-agents\dev-navigator\vectorization\results\all-MiniLM-L6-v2\croma_db
Using embedding model: all-MiniLM-L6-v2
Collection 'loyalty_code_semantics_gpt4.1' ready with 0 documents
Loaded 33 classification records from D:/src/ai-agents/dev-navigator/classification/results/gpt4.1/output.csv
Prepared 33 documents for embedding
Added batch 1: 33/33 documents
Successfully added 33 documents to collection

=== Collection Statistics ===
{
  "total_documents": 33,
  "projects": {
    "PlantBasedPizza.LoyaltyPoints.Api.csproj": 6,
    "PlantBasedPizza.LoyaltyPoints.Internal.csproj": 7,
    "PlantBasedPizza.LoyaltyPoints.Shared.csproj": 13,
    "PlantBasedPizza.LoyaltyPoints.Worker.csproj": 7
  },
  "file_types": {
    "cs": 21,
    "appsettings": 12
  },
  "llm_providers": {
    "OpenAI-gpt-4.1-2025-04-14": 33
  },
  "technical_patterns": {
    "Cross-cutting concern (observability/telemetry enrichment) via extension methods": 1,
    "Microservice with RESTf