#### What we are testing here:
- Basic semantic retrieval testing with loyalty-specific queries:
    - Distance analysis for semantic similarity quality
    - Result ranking and relevance assessment
    - File and project coverage analysis
- Filtered retrieval testing: semantic search combined with filtering by project type, file type, or project name
- Edge cases: malformed queries, empty queries, very long queries, etc.

Examples:
- Basic queries (specific to the loyalty points microservice):
    - "loyalty points calculation rules"
    - "order processing workflow"
    - "customer data integration"
    - "payment service integration"
    - "business rule patterns"
- Filtered queries:
    - C# files only
    - Configuration files only
    - Specific project filtering
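
As a rough illustration of the distance analysis listed above, the sketch below summarizes a batch of search results. It assumes each result is a dict with a `summary` entry holding `total_results` and `avg_distance`, which is the shape the interactive cell at the end of this notebook reads. `summarize_distances` is a hypothetical helper, not part of the `rag` package, and the sample numbers are made up for the demonstration. Lower distance means a closer semantic match.

```python
from statistics import mean
from typing import Any, Dict, List, Optional


def summarize_distances(results: List[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Average and best (lowest) per-query distance over results that returned hits."""
    distances = [
        r["summary"]["avg_distance"]
        for r in results
        if r.get("summary", {}).get("total_results", 0) > 0
        and "avg_distance" in r["summary"]
    ]
    if not distances:
        # Nothing was retrieved, so there is no distance to report.
        return {"avg_distance": None, "best_query_distance": None}
    return {
        "avg_distance": round(mean(distances), 4),
        "best_query_distance": min(distances),
    }


# Illustrative values only, not measurements from this notebook.
sample_results = [
    {"summary": {"total_results": 3, "avg_distance": 0.5}},
    {"summary": {"total_results": 3, "avg_distance": 1.0}},
    {"summary": {"total_results": 0}},  # e.g. an empty query with no hits
]
print(summarize_distances(sample_results))
```

Queries that return no hits are skipped rather than counted as distance 0, so that an empty result cannot look like a perfect match.
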
#### Reporting:
- Human-readable reports for comparing embedding model and classifier performance

For example, the reports support this type of analysis:
##### Updated Performance Analysis

| Embedding Model | LLM | Avg Distance | Query Coverage | Best Query Distance |
|----------------|-----|-------------|----------------|-------------------|
| **all-MiniLM-L6-v2** | **GPT-4.1** | **1.1287** | 54.55% | **0.6485** |
| all-MiniLM-L6-v2 | CodeLlama | 1.1854 | **63.64%** | 0.7035 |
| all-MiniLM-L6-v2 | Claude 3.5 | 1.1859 | 42.42% | 0.8229 |
| all-mpnet-base-v2 | GPT-4.1 | 1.2035 | 54.55% | 0.7404 |
| all-mpnet-base-v2 | CodeLlama | 1.2334 | 54.55% | 0.8506 |
| all-mpnet-base-v2 | Claude 3.5 | 1.2502 | 51.52% | 0.9409 |

##### Key Findings:

**1. Embedding Model Performance:**
- **all-MiniLM-L6-v2 consistently outperforms all-mpnet-base-v2** across all LLM combinations
- Average distance is ~0.05-0.07 lower (better) with MiniLM for each LLM in the table above
- This pattern holds regardless of which LLM generates the queries
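
The per-LLM gap between the two embedding models can be recomputed directly from the performance table above. The snippet below is only a worked check of that arithmetic, with the average distances copied by hand from the table. A positive gap means MiniLM has the lower (better) average distance for that LLM.

```python
# Average distances copied from the performance table above.
avg_distance = {
    "all-MiniLM-L6-v2": {"GPT-4.1": 1.1287, "CodeLlama": 1.1854, "Claude 3.5": 1.1859},
    "all-mpnet-base-v2": {"GPT-4.1": 1.2035, "CodeLlama": 1.2334, "Claude 3.5": 1.2502},
}

# Gap = MPNet average distance minus MiniLM average distance, per LLM.
gaps = {
    llm: round(avg_distance["all-mpnet-base-v2"][llm] - avg_distance["all-MiniLM-L6-v2"][llm], 4)
    for llm in avg_distance["all-MiniLM-L6-v2"]
}

for llm, gap in gaps.items():
    print(f"{llm}: MiniLM is lower by {gap:.4f}")
```
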

**2. LLM Query Generation Quality:**
- **GPT-4.1 generates the highest quality queries** (lowest distances)
- **CodeLlama has the best query coverage** but with slightly higher distances
- **Claude 3.5 shows the most variation** and generally higher distances

**3. Best Combinations:**
1. **all-MiniLM-L6-v2 + GPT-4.1** - Best overall performance
2. **all-MiniLM-L6-v2 + CodeLlama** - Best coverage with good performance
3. **all-mpnet-base-v2 + GPT-4.1** - Best MPNet combination

##### Conclusion:

The results show that **all-MiniLM-L6-v2 performs better than all-mpnet-base-v2** on this loyalty points codebase: its average distance is lower for every LLM tested. This is a domain-specific finding that runs against MPNet's stronger results on general benchmarks.

**Winner: all-MiniLM-L6-v2 with GPT-4.1**
- Lowest average retrieval distance (1.1287)
- Best individual query performance (0.6485)
- Most reliable semantic matching for this codebase

In [1]:
from typing import Dict, List
from vectorization.semantic_match import SemanticMatch
from datetime import datetime

from rag.report_utils import calculate_performance_metrics
from vectorization.semantic_vector_database import SemanticVectorDatabase

def run_test_suite(vector_db: SemanticVectorDatabase, collection_name: str):
    """Run the comprehensive RAG test suite and return its results and metrics."""

    print("\n" + "=" * 60)
    print("COMPREHENSIVE RAG TEST SUITE")
    print("=" * 60)

    collection = vector_db.get_collection(collection_name)

    test_results: Dict[str, List[SemanticMatch]] = {
        'basic_tests': [],
        'filtered_tests': [],
        'edge_case_tests': [],
    }

    # Test 1: Basic semantic queries
    print("\n1. BASIC SEMANTIC RETRIEVAL TESTS")
    # Query testing
    # Distance analysis for semantic similarity quality
    # Result ranking and relevance assessment
    # File and project coverage analysis

    basic_queries = [
        "loyalty points calculation rules",
        "order processing workflow",
        "customer data integration",
        "payment service integration",
        "business rule patterns",
        "event handlers",
        "database operations",
        "service dependencies",
        "configuration settings",
        "loyalty point rewards"
    ]

    for query in basic_queries:
        result = collection.semantic_search(query, n_results=3)
        test_results['basic_tests'].append(result)

    # Test 2: Filtered retrieval
    print("\n2. FILTERED RETRIEVAL TESTS")
    # Metadata filtering by file type, project, etc.
    # Filter effectiveness measurement

    filtered_tests = [
        {
            'query': "loyalty points calculation",
            'filters': {'file_type': 'cs'},
            'description': 'C# files only'
        },
        {
            'query': "service integration",
            'filters': {'file_type': 'appsettings'},
            'description': 'Configuration files only'
        },
        {
            'query': "business rules",
            'filters': {'project_name': 'LoyaltyPoints'},
            'description': 'Main project only'
        }
    ]

    for test in filtered_tests:
        print(f"\nTesting: {test['description']}")
        result = collection.filtered_semantic_search(
            test['query'],
            test['filters'],
            n_results=3
        )
        test_results['filtered_tests'].append(result)

    # Test 3: Edge cases
    print("\n3. EDGE CASE TESTS")
    # Empty queries and malformed input
    # Non-existent terms for robustness
    # Single character and stop words handling
    # Very long queries for boundary testing

    edge_cases = [
        "",  # Empty query
        "xyzabc123nonexistent",  # Non-existent terms
        "a",  # Single character
        "the and or but",  # Stop words only
        " ".join(["loyalty"] * 50),  # Very long query (50 space-separated words)
    ]

    for query in edge_cases:
        result = collection.semantic_search(query, n_results=1)
        test_results['edge_case_tests'].append(result)

    # Test 4: Performance metrics
    print("\n4. PERFORMANCE METRICS")
    timestamp = datetime.now().isoformat()
    collection_stats = collection.get_collection_stats_v2()
    performance_metrics = calculate_performance_metrics(test_results, collection_stats)

    return timestamp, performance_metrics, test_results, collection_stats
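
Whether `semantic_search` raises on an empty or malformed query is not shown in this notebook. If it can, a single failing edge case would abort the whole suite before any metrics are computed. A minimal sketch of one way to guard against that follows. `safe_search` and `fake_search` are hypothetical names introduced here for illustration, and the shape of the error record is an assumption.

```python
from typing import Any, Callable, Dict


def safe_search(search_fn: Callable[..., Dict[str, Any]], query: str, **kwargs: Any) -> Dict[str, Any]:
    """Call search_fn and turn any exception into an error record instead of aborting."""
    try:
        return search_fn(query, **kwargs)
    except Exception as exc:  # deliberately broad: edge cases may fail in unknown ways
        return {
            "query": query,
            "error": f"{type(exc).__name__}: {exc}",
            "summary": {"total_results": 0},
        }


# Stand-in for collection.semantic_search, used only for this demonstration.
def fake_search(query: str, n_results: int = 1) -> Dict[str, Any]:
    if not query:
        raise ValueError("empty query")
    return {"query": query, "summary": {"total_results": n_results}}


for q in ["", "a"]:
    print(repr(q), safe_search(fake_search, q, n_results=1))
```

In the edge-case loop this would be called as `safe_search(collection.semantic_search, query, n_results=1)`, so a failure is recorded in `test_results` alongside the successful queries.
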

In [2]:
from pathlib import Path
from typing import Dict, Any
import json
from rag.report_utils import generate_rag_report

def report_and_save(test_results: Dict[str, Any], output_file: str = "results/results.json"):
    """Save the test results as JSON and write a human-readable report next to them."""

    Path(output_file).parent.mkdir(parents=True, exist_ok=True)

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(test_results, f, indent=2, ensure_ascii=False, default=str)

    print(f"Test results saved to: {output_file}")

    # Also save readable report
    report = generate_rag_report(test_results)
    report_path = output_file.replace('.json', '_report.txt')
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write(report)

    print(f"Test report saved to: {report_path}")
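
One detail of the report path: `str.replace('.json', ...)` rewrites every occurrence of `.json` in the string, including one inside a directory name. The result file names used in this notebook only contain `.json` at the end, so the cell above behaves as intended. As a hedged alternative, the sketch below derives the report path with `pathlib`, which touches only the final suffix. `report_path_for` is a hypothetical helper name.

```python
from pathlib import Path


def report_path_for(output_file: str) -> Path:
    """Return '<stem>_report.txt' next to output_file, changing only the final suffix."""
    path = Path(output_file)
    return path.with_name(f"{path.stem}_report.txt")


# A result file name of the kind produced by the batch run in this notebook.
example = "results/rag/all-MiniLM-L6-v2.loyalty_code_semantics_gpt4.1.json"
print(report_path_for(example).as_posix())

# A contrived path where str.replace would also rewrite the directory name.
tricky = "runs.json.d/out.json"
print(tricky.replace(".json", "_report.txt"))
print(report_path_for(tricky).as_posix())
```
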

In [3]:
vector_results = "results/vectorization"
rag_results = "results/rag"

chroma_base = "{model}/loyalty_points_kb"

loyalty_collection = "loyalty_code_semantics_{llm}"
models = [
            { "model": "all-MiniLM-L6-v2", "llms": [ "claude3.5", "claude3.7", "claude4.0", "codellama", "gpt4.1"] },
            { "model": "all-mpnet-base-v2", "llms": [ "claude3.5", "claude3.7", "claude4.0", "codellama", "gpt4.1" ] }
         ]
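
To make the configuration above concrete, the sketch below expands the two templates into the database path and collection name for every embedding model and LLM pair. The values are copied from the cell above so the snippet runs on its own. It only builds strings and does not open any database.

```python
vector_results = "results/vectorization"
chroma_base = "{model}/loyalty_points_kb"
loyalty_collection = "loyalty_code_semantics_{llm}"
llms = ["claude3.5", "claude3.7", "claude4.0", "codellama", "gpt4.1"]
models = [
    {"model": "all-MiniLM-L6-v2", "llms": llms},
    {"model": "all-mpnet-base-v2", "llms": llms},
]

# One (db_path, collection_name) pair per embedding model and LLM.
combinations = [
    (f"{vector_results}/{chroma_base.format(model=m['model'])}", loyalty_collection.format(llm=llm))
    for m in models
    for llm in m["llms"]
]

for db_path, collection_name in combinations:
    print(db_path, collection_name)
print(len(combinations), "combinations")
```

Two embedding models with five LLMs each give ten combinations, which matches the ten reports counted in the analysis output further down.
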

In [None]:
from vectorization.semantic_vector_database import SemanticVectorDatabase

rag_testers = []
reports = []
for model in models:
    embedding_model = model["model"]
    print("embedding_model: ", embedding_model)
    db_path = f"{vector_results}/{chroma_base.format(model=embedding_model)}"
    print("db_path: ", db_path)

    for llm in model["llms"]:
        print(llm)
        collection_name = loyalty_collection.format(llm=llm)

        vector_db = SemanticVectorDatabase(db_path, embedding_model)
        timestamp, performance_metrics, test_results, collection_stats = run_test_suite(vector_db, collection_name)
        output: Dict[str, Any] = {
            "timestamp": timestamp,
            "performance_metrics": performance_metrics,
            "collection_stats": collection_stats,
        }
        output.update(test_results)
        report_and_save(output, f"{rag_results}/{embedding_model}.{collection_name}.json")

In [5]:
from rag.analyzer import RAGReportAnalyzer

def show_performance_analysis(reports_dir: str):
    """Example usage of the RAG Report Analyzer."""
    # Initialize analyzer
    analyzer = RAGReportAnalyzer(reports_dir)

    # Load reports
    print("Loading RAG reports...")
    reports = analyzer.load_reports("*report*.txt")

    if not reports:
        print(f"No reports found in '{reports_dir}'. Please ensure the report files are in that directory.")
        return

    # Generate analysis
    analyzer.print_analysis_summary()

    # Save results to CSV
    df = analyzer.generate_performance_table()
    if not df.empty:
        csv_path = f"{reports_dir}/rag_performance_analysis.csv"
        df.to_csv(csv_path, index=False)
        print(f"\nResults saved to '{csv_path}'")

show_performance_analysis(rag_results)

Loading RAG reports...
RAG PERFORMANCE ANALYSIS SUMMARY
Total Reports Analyzed: 10
Embedding Models: 2
LLM Models: 5

PERFORMANCE COMPARISON TABLE:
| Embedding Model   | LLM       |   Avg Distance |   Query Coverage (%) |   Best Query Distance |
|:------------------|:----------|---------------:|---------------------:|----------------------:|
| **all-MiniLM-L6-v2** | **claude4.0** | **1.1268** | **54.5500** | **0.7034** |
| all-MiniLM-L6-v2  | gpt4.1    |         1.1296 |              51.5200 |                0.6464 |
| all-MiniLM-L6-v2  | claude3.7 |         1.1399 |              54.5500 |                0.7430 |
| all-MiniLM-L6-v2  | claude3.5 |         1.1699 |              51.5200 |                0.7950 |
| all-MiniLM-L6-v2  | codellama |         1.1797 |              60.6100 |                0.7286 |
| all-mpnet-base-v2 | gpt4.1    |         1.1933 |              63.6400 |                0.7520 |
| all-mpnet-base-v2 | claude3.7 |     

#### Interactive Testing:
- Real-time query testing
- Custom query exploration
- Distance feedback per vector collection

In [None]:
# Quick interactive testing

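def run_test(query):
    """Run one query against every embedding model / LLM collection and print a summary."""
    for model in models:
        embedding_model = model["model"]
        # Same path layout as the batch run above: <vector_results>/<model>/loyalty_points_kb
        db_path = f"{vector_results}/{chroma_base.format(model=embedding_model)}"

        for llm in model["llms"]:
            vector_db = SemanticVectorDatabase(db_path, embedding_model)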

            collection_name = loyalty_collection.format(llm=llm)
            collection = vector_db.get_collection(collection_name)

            result = collection.semantic_search(query, n_results=3)
            print(f"{embedding_model}-{llm}: Retrieved {result['summary']['total_results']} results")
            print(f"{embedding_model}-{llm}: Average distance: {result['summary'].get('avg_distance', 0):.4f}")

print("\n" + "="*60)
print("INTERACTIVE TESTING")
print("="*60)
print("Try some custom queries (type 'quit' to exit):")

while True:
    try:
        query = input("\nEnter your query: ").strip()
        if query.lower() in ['quit', 'exit', 'q']:
            break

        if query:
            run_test(query)

    except KeyboardInterrupt:
        print("\nExiting...")
        break
    except Exception as e:
        print(f"Error: {e}")