#### What we are testing here:
- Basic Semantic Retrieval Testing: Query testing with loyalty-specific queries
- Distance analysis for semantic similarity quality
- Result ranking and relevance assessment
- File and project coverage analysis
- Filtered Retrieval Testing: semantic search combined with filtering by project type, file type, or project name
- Edge cases: malformed queries, empty queries, very long queries, etc.

Examples:
- Basic queries (specific to the loyalty points microservice):
    - "loyalty points calculation rules"
    - "order processing workflow"
    - "customer data integration"
    - "payment service integration"
    - "business rule patterns"
- Filtered queries:
    - C# files only
    - Configuration files only
    - Specific project filtering
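
The three groups of test queries above could be declared as data, so that every collection is tested against the same cases. The following is only a sketch: `QueryCase`, `build_query_cases`, and the metadata keys `file_type` and `project_name` are hypothetical names, not taken from `RAGTester`, and the actual filter format depends on how the vector database stores its metadata.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class QueryCase:
    """One retrieval test case: a query plus an optional metadata filter."""
    name: str
    query: str
    # Metadata filter applied alongside the semantic query; None means unfiltered.
    where: Optional[Dict[str, Any]] = None


def build_query_cases() -> List[QueryCase]:
    """Build the basic, filtered, and edge-case queries described above."""
    basic = [
        "loyalty points calculation rules",
        "order processing workflow",
        "customer data integration",
        "payment service integration",
        "business rule patterns",
    ]
    cases = [QueryCase(f"basic_{i}", q) for i, q in enumerate(basic, start=1)]

    # Filtered queries. The metadata keys and values are assumptions.
    cases += [
        QueryCase("csharp_only", "loyalty points calculation rules", {"file_type": "cs"}),
        QueryCase("config_only", "payment service integration", {"file_type": "config"}),
        QueryCase("single_project", "order processing workflow",
                  {"project_name": "PlantBasedPizza.LoyaltyPoints"}),
    ]

    # Edge cases: empty, whitespace-only, very long, and malformed input.
    cases += [
        QueryCase("empty", ""),
        QueryCase("whitespace", "   "),
        QueryCase("very_long", "loyalty points " * 500),
        QueryCase("malformed", "}{ ;; ??? ]]"),
    ]
    return cases
```

Keeping the cases in one list makes it easy to confirm that each embedding model and LLM combination saw exactly the same queries.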
#### Reporting:
- Human-readable reports for comparing embedding models and classifier performance

For example, the reports support this type of analysis:
##### Updated Performance Analysis

| Embedding Model | LLM | Avg Distance | Query Coverage | Best Query Distance |
|----------------|-----|-------------|----------------|-------------------|
| **all-MiniLM-L6-v2** | **GPT-4.1** | **1.1287** | 54.55% | **0.6485** |
| all-MiniLM-L6-v2 | CodeLlama | 1.1854 | **63.64%** | 0.7035 |
| all-MiniLM-L6-v2 | Claude 3.5 | 1.1859 | 42.42% | 0.8229 |
| all-mpnet-base-v2 | GPT-4.1 | 1.2035 | 54.55% | 0.7404 |
| all-mpnet-base-v2 | CodeLlama | 1.2334 | 54.55% | 0.8506 |
| all-mpnet-base-v2 | Claude 3.5 | 1.2502 | 51.52% | 0.9409 |
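
The notebook does not show how the three metric columns are computed. One plausible reading is sketched below, assuming each query yields a list of result distances. `summarize_distances` and the `covered_below` threshold are hypothetical; in particular, the definition of coverage used here (share of queries whose best result falls under a threshold) is a guess, not the definition used by `generate_test_report`.

```python
from statistics import mean
from typing import Dict, List


def summarize_distances(per_query: Dict[str, List[float]],
                        covered_below: float = 1.0) -> Dict[str, float]:
    """Summarize retrieval distances for one embedding-model/LLM collection.

    per_query maps each query to the distances of its retrieved results.
    covered_below is an assumed threshold, not a value from the notebook.
    """
    # Average distance per query; queries without results have no average.
    query_avgs = [mean(d) for d in per_query.values() if d]
    # A query counts as covered when its closest result is under the threshold.
    covered = sum(1 for d in per_query.values() if d and min(d) < covered_below)
    return {
        "avg_distance": round(mean(query_avgs), 4) if query_avgs else float("nan"),
        "query_coverage_pct": round(100.0 * covered / len(per_query), 2) if per_query else 0.0,
        "best_query_distance": round(min(query_avgs), 4) if query_avgs else float("nan"),
    }


# Made-up sample distances, for illustration only.
sample = {
    "loyalty points calculation rules": [0.70, 0.90, 1.10],
    "order processing workflow": [1.20, 1.30, 1.40],
    "": [],
}
print(summarize_distances(sample))
```

Lower distance means a closer semantic match, so a lower average and a lower best-query distance are both better.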

##### Key Findings:

**1. Embedding Model Performance:**
- **all-MiniLM-L6-v2 consistently outperforms all-mpnet-base-v2** across all LLM combinations
- Average distance is ~0.05-0.07 lower with MiniLM (0.0748 for GPT-4.1, 0.0480 for CodeLlama, 0.0643 for Claude 3.5)
- This pattern holds regardless of which LLM generates the queries
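
The gap between the two embedding models can be recomputed for each LLM directly from the table rows. A small sketch follows; the values are copied from the Avg Distance column, and `distance_gaps` is a hypothetical helper name.

```python
from typing import Dict

# Average distances copied from the performance table.
avg_distance: Dict[str, Dict[str, float]] = {
    "all-MiniLM-L6-v2": {"GPT-4.1": 1.1287, "CodeLlama": 1.1854, "Claude 3.5": 1.1859},
    "all-mpnet-base-v2": {"GPT-4.1": 1.2035, "CodeLlama": 1.2334, "Claude 3.5": 1.2502},
}


def distance_gaps(table: Dict[str, Dict[str, float]],
                  better: str = "all-MiniLM-L6-v2",
                  worse: str = "all-mpnet-base-v2") -> Dict[str, float]:
    """Return, per LLM, how much lower the average distance is for `better`."""
    return {llm: round(table[worse][llm] - table[better][llm], 4)
            for llm in table[better]}


gaps = distance_gaps(avg_distance)
for llm, gap in gaps.items():
    print(f"{llm}: MiniLM average distance is lower by {gap:.4f}")
```

A positive gap for every LLM is what supports the claim that the pattern does not depend on the LLM.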

**2. LLM Query Generation Quality:**
- **GPT-4.1 generates the highest quality queries** (lowest distances)
- **CodeLlama has the best query coverage** (63.64% with all-MiniLM-L6-v2) but with slightly higher distances
- **Claude 3.5 shows the most variation** and the highest average distance for each embedding model

**3. Best Combinations:**
1. **all-MiniLM-L6-v2 + GPT-4.1** - Best overall performance
2. **all-MiniLM-L6-v2 + CodeLlama** - Best coverage with good performance
3. **all-mpnet-base-v2 + GPT-4.1** - Best MPNet combination

##### Conclusion:

The results show that **all-MiniLM-L6-v2 performs better than all-mpnet-base-v2** on this loyalty points codebase for every LLM tested. This is a domain-specific finding: it runs against MPNet's stronger results on general benchmarks.

**Winner: all-MiniLM-L6-v2 with GPT-4.1**
- Lowest average retrieval distance (1.1287)
- Best individual query performance (0.6485)
- Most reliable semantic matching for this codebase

In [1]:
from pathlib import Path
from typing import Any, Dict
import json
from rag.report_utils import generate_test_report

def report_and_save(test_results: Dict[str, Any], output_file: str = "results/results.json"):
    """Save test results as JSON and write a readable text report next to it."""
    output_path = Path(output_file)
    # Create the output directory if it does not exist yet
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(test_results, f, indent=2, ensure_ascii=False, default=str)

    print(f"Test results saved to: {output_path}")

    # Also save a readable report, e.g. results.json -> results_report.txt
    report = generate_test_report(test_results)
    report_path = output_path.with_name(output_path.stem + '_report.txt')
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write(report)

    print(f"Test report saved to: {report_path}")

In [None]:
from vectorization.semantic_vector_database import SemanticVectorDatabase
from rag.rag_tester import RAGTester

base_vector_db = "../vectorization/results/{model}/chroma_db"
base_collection_name = "loyalty_code_semantics_{llm}"
models = [
    {"model": "all-MiniLM-L6-v2", "llms": ["claude3.5", "codellama", "gpt4.1"]},
    {"model": "all-mpnet-base-v2", "llms": ["claude3.5", "codellama", "gpt4.1"]},
]

# Run the full test suite once per (embedding model, LLM) collection
for model in models:
    embedding_model = model["model"]
    print("embedding_model: ", embedding_model)
    db_path = base_vector_db.format(model=embedding_model)
    print("db_path: ", db_path)

    for llm in model["llms"]:
        print("llm: ", llm)
        collection_name = base_collection_name.format(llm=llm)

        vector_db = SemanticVectorDatabase(db_path, embedding_model)
        rag_tester = RAGTester(vector_db)

        test_results = rag_tester.run_test_suite(collection_name)
        # Save the raw JSON results and the readable report
        report_and_save(test_results, f"results/{embedding_model}.{collection_name}.json")
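
Once the loop has written one JSON file per combination, the files can be read back into a single comparison list. The sketch below relies on the `<embedding_model>.<collection_name>.json` naming used above. The `summary` and `avg_distance` keys are assumptions about what `run_test_suite` returns, and `load_run_summaries` is a hypothetical helper; adjust the keys to the actual result structure.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

COLLECTION_MARKER = ".loyalty_code_semantics_"


def load_run_summaries(results_dir: str = "results") -> List[Dict[str, Any]]:
    """Collect one row per saved results file, sorted by average distance."""
    rows: List[Dict[str, Any]] = []
    for path in sorted(Path(results_dir).glob("*.json")):
        # File stems look like "<embedding_model>.loyalty_code_semantics_<llm>".
        model, sep, llm = path.stem.partition(COLLECTION_MARKER)
        if not sep:
            continue  # not a file written by the test loop
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        summary = data.get("summary", {})  # assumed top-level key
        rows.append({
            "embedding_model": model,
            "llm": llm,
            "avg_distance": summary.get("avg_distance"),  # assumed key
        })
    # Lowest distance first; rows without a distance go last.
    rows.sort(key=lambda r: (r["avg_distance"] is None, r["avg_distance"] or 0.0))
    return rows
```

Each returned row carries the embedding model, the LLM, and the average distance, which is enough to rebuild a table like the one in the Reporting section.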

#### Interactive Testing:
- Real-time query testing
- Custom query exploration
- Distance feedback per vector collection
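
Distance feedback per collection is easier to read when the collections are ranked for each query. A minimal sketch, independent of the vector database: `rank_collections` and the `Searcher` callable are hypothetical, and the stub searchers return placeholder distances rather than real measurements.

```python
from typing import Callable, Dict, List, Tuple

# A searcher takes a query and returns the distances of its top results.
Searcher = Callable[[str], List[float]]


def rank_collections(query: str, searchers: Dict[str, Searcher]) -> List[Tuple[str, float]]:
    """Run one query against every collection and rank labels by average distance."""
    ranked: List[Tuple[str, float]] = []
    for label, search in searchers.items():
        distances = search(query)
        if distances:  # skip collections that returned nothing
            ranked.append((label, sum(distances) / len(distances)))
    # Lower distance means a closer match, so sort ascending.
    return sorted(ranked, key=lambda item: item[1])


# Stub searchers with placeholder distances, for illustration only.
stub_searchers: Dict[str, Searcher] = {
    "all-mpnet-base-v2-gpt4.1": lambda q: [1.0, 1.1, 1.2],
    "all-MiniLM-L6-v2-gpt4.1": lambda q: [0.8, 0.9, 1.0],
    "empty-collection": lambda q: [],
}

for label, avg in rank_collections("where are the loyalty points rules?", stub_searchers):
    print(f"{label}: average distance {avg:.4f}")
```

In the notebook, each stub would be replaced by a call that wraps `semantic_search` on the corresponding collection.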

In [4]:
# Quick interactive testing

def run_test(query):
    """Run one query against every embedding-model/LLM collection and print a summary."""
    for model in models:
        embedding_model = model["model"]
        db_path = base_vector_db.format(model=embedding_model)

        for llm in model["llms"]:
            vector_db = SemanticVectorDatabase(db_path, embedding_model)

            collection_name = base_collection_name.format(llm=llm)
            collection = vector_db.get_collection(collection_name)

            # Retrieve the top 3 matches and report how close they are
            result = collection.semantic_search(query, n_results=3)
            print(f"{embedding_model}-{llm}: Retrieved {result['summary']['total_results']} results")
            print(f"{embedding_model}-{llm}: Average distance: {result['summary'].get('avg_distance', 0):.4f}")

print("\n" + "="*60)
print("INTERACTIVE TESTING")
print("="*60)
print("Try some custom queries (type 'quit' to exit):")

while True:
    try:
        query = input("\nEnter your query: ").strip()
        if query.lower() in ['quit', 'exit', 'q']:
            break

        if query:
            run_test(query)

    except (KeyboardInterrupt, EOFError):
        print("\nExiting...")
        break
    except Exception as e:
        print(f"Error: {e}")


INTERACTIVE TESTING
Try some custom queries (type 'quit' to exit):
Initialized Chroma database at: ..\vectorization\results\all-MiniLM-L6-v2\chroma_db
Using embedding model: all-MiniLM-L6-v2

=== Performing Semantic Search ===
Query: 'where loyalty points rules are?'
Retrieving top 3 results...

Found 3 results:

--- Result 1 (Distance: 0.8852) ---
File: D:\src\learning\dotnet\event-driven-course\module5\src\PlantBasedPizza.LoyaltyPoints\application\PlantBasedPizza.LoyaltyPoints.Shared\AssemblyInfo.cs
Business Purpose: Enable unit testing of internal components of the loyalty points system while maintaining encapsulation in production
Technical Pattern: Assembly-level test visibility pattern
Business Workflow: Quality assurance and testing workflow for loyalty points business logic...

--- Result 2 (Distance: 0.8871) ---
File: D:\src\learning\dotnet\event-driven-course\module5\src\PlantBasedPizza.LoyaltyPoints\application\PlantBasedPizza.LoyaltyPoints.Internal\Services\LoyaltyService.