#### What we are testing here:
- Basic Semantic Retrieval Testing: Query testing with loyalty-specific queries
- Distance analysis for semantic similarity quality
- Result ranking and relevance assessment
- File and project coverage analysis
- Filtered Retrieval Testing: semantic search combined with filtering by project type, file type, or project name
- Edge cases: malformed queries, empty queries, very long queries, etc.

Examples:
- Basic queries (specific to the loyalty points microservice):
    - "loyalty points calculation rules"
    - "order processing workflow"
    - "customer data integration"
    - "payment service integration"
    - "business rule patterns"
- Filtered queries:
    - C# files only
    - Configuration files only
    - Specific project filtering
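
The edge cases listed above can be collected into a small fixture. The following is a minimal sketch, not the query set used by the test suite: the function name `build_edge_case_queries`, the labels, and the default length are hypothetical and chosen only for illustration.

```python
from typing import Dict


def build_edge_case_queries(long_query_words: int = 2000) -> Dict[str, str]:
    """Return labelled edge-case queries (hypothetical fixture, not part of rag.rag_tester)."""
    base_query = "loyalty points calculation rules"  # four words, taken from the basic queries
    return {
        "empty": "",
        "whitespace_only": "   \t\n",
        "single_character": "a",
        "punctuation_only": "?!#$%",
        "unbalanced_syntax": 'loyalty "points (calculation',
        "non_ascii": "règles de calcul des points de fidélité",
        # Repeat the base query until the requested word count is reached
        "very_long": " ".join([base_query] * (long_query_words // 4)),
    }


edge_cases = build_edge_case_queries()
for label, query in edge_cases.items():
    print(f"{label}: {len(query)} characters")
```

Each entry can then be passed to whatever search call is under test, checking that it either returns an empty result or raises a clear error instead of failing silently.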
#### Reporting:
- Human-readable reports for comparing embedding models and classifier performance

For example, the reports support the following type of analysis:
##### Updated Performance Analysis

| Embedding Model | LLM | Avg Distance | Query Coverage | Best Query Distance |
|----------------|-----|-------------|----------------|-------------------|
| **all-MiniLM-L6-v2** | **GPT-4.1** | **1.1287** | 54.55% | **0.6485** |
| all-MiniLM-L6-v2 | CodeLlama | 1.1854 | **63.64%** | 0.7035 |
| all-MiniLM-L6-v2 | Claude 3.5 | 1.1859 | 42.42% | 0.8229 |
| all-mpnet-base-v2 | GPT-4.1 | 1.2035 | 54.55% | 0.7404 |
| all-mpnet-base-v2 | CodeLlama | 1.2334 | 54.55% | 0.8506 |
| all-mpnet-base-v2 | Claude 3.5 | 1.2502 | 51.52% | 0.9409 |

##### Key Findings:

**1. Embedding Model Performance:**
- **all-MiniLM-L6-v2 consistently outperforms all-mpnet-base-v2** across all LLM combinations
- Average distance is ~0.05-0.07 lower (better) with MiniLM when paired with the same LLM
- This pattern holds regardless of which LLM generates the queries
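
The per-LLM gap can be checked directly from the table above. This is a small worked calculation over the reported average distances; the dictionary layout and variable names are illustrative only.

```python
# Average retrieval distances copied from the performance table (lower is better).
avg_distance = {
    "all-MiniLM-L6-v2": {"GPT-4.1": 1.1287, "CodeLlama": 1.1854, "Claude 3.5": 1.1859},
    "all-mpnet-base-v2": {"GPT-4.1": 1.2035, "CodeLlama": 1.2334, "Claude 3.5": 1.2502},
}

# Gap (MPNet minus MiniLM) for each LLM; a positive value means MiniLM retrieves closer matches.
gaps = {
    llm: round(avg_distance["all-mpnet-base-v2"][llm] - avg_distance["all-MiniLM-L6-v2"][llm], 4)
    for llm in avg_distance["all-MiniLM-L6-v2"]
}

for llm, gap in gaps.items():
    print(f"{llm}: MiniLM is lower by {gap:.4f}")
```

All three gaps are positive, which is what the claim that the pattern holds for every LLM rests on.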

**2. LLM Query Generation Quality:**
- **GPT-4.1 generates the highest quality queries** (lowest distances)
- **CodeLlama has the best query coverage** but with slightly higher distances
- **Claude 3.5 shows the most variation** and generally higher distances

**3. Best Combinations:**
1. **all-MiniLM-L6-v2 + GPT-4.1** - Best overall performance
2. **all-MiniLM-L6-v2 + CodeLlama** - Best coverage with good performance
3. **all-mpnet-base-v2 + GPT-4.1** - Best MPNet combination

##### Conclusion:

The results consistently show that **all-MiniLM-L6-v2 performs better than all-mpnet-base-v2** on this specific loyalty points codebase. This is a domain-specific finding that runs counter to MPNet's stronger results on general benchmarks.

**Winner: all-MiniLM-L6-v2 with GPT-4.1**
- Lowest average retrieval distance (1.1287)
- Best individual query performance (0.6485)
- Most reliable semantic matching for this codebase

In [4]:
from pathlib import Path
from typing import Dict, Any
import json
from rag.report_utils import generate_test_report
from rag.rag_report_analysis import RAGReportAnalyzer

def report_and_save(test_results: Dict[str, Any], output_file: str = "results/results.json"):
    """Save raw test results as JSON and write a human-readable report next to them."""
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # default=str serializes values that json cannot handle natively (e.g. datetime, Path)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(test_results, f, indent=2, ensure_ascii=False, default=str)

    print(f"Test results saved to: {output_path}")

    # Also save a readable report, e.g. results.json -> results_report.txt
    report = generate_test_report(test_results)
    report_path = output_path.with_name(f"{output_path.stem}_report.txt")
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write(report)

    print(f"Test report saved to: {report_path}")


def show_performance_analysis(reports_dir: str):
    """Example usage of the RAG Report Analyzer."""
    # Initialize analyzer
    analyzer = RAGReportAnalyzer(reports_dir)

    # Load reports
    print("Loading RAG reports...")
    reports = analyzer.load_reports("*report*.txt")

    if not reports:
        print(f"No reports found. Please ensure report files are in '{reports_dir}'.")
        return

    # Generate analysis
    analyzer.print_analysis_summary()

    # Save results to CSV alongside the reports
    df = analyzer.generate_performance_table()
    if not df.empty:
        csv_path = Path(reports_dir) / 'rag_performance_analysis.csv'
        df.to_csv(csv_path, index=False)
        print(f"\nResults saved to '{csv_path}'")

In [2]:
# Output locations for the vector databases and the RAG test reports
vector_results = "results/vectorization"
rag_results = "results/rag"

# Database path per embedding model, relative to vector_results
chroma_base = "{model}/loyalty_points_kb"

# Collection name per LLM
loyalty_collection = "loyalty_code_semantics_{llm}"
models = [
    {"model": "all-MiniLM-L6-v2", "llms": ["claude3.5", "codellama", "gpt4.1"]},
    {"model": "all-mpnet-base-v2", "llms": ["claude3.5", "codellama", "gpt4.1"]},
]
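
To see how the templates above expand before touching any database, the combinations can be enumerated on their own. This is a dry-run sketch: the helper name `expand_targets` is hypothetical, and the settings are repeated only to keep the snippet self-contained.

```python
from typing import Dict, Iterator, List, Tuple

# Settings repeated from the configuration cell so the snippet runs on its own
vector_results = "results/vectorization"
chroma_base = "{model}/loyalty_points_kb"
loyalty_collection = "loyalty_code_semantics_{llm}"
models = [
    {"model": "all-MiniLM-L6-v2", "llms": ["claude3.5", "codellama", "gpt4.1"]},
    {"model": "all-mpnet-base-v2", "llms": ["claude3.5", "codellama", "gpt4.1"]},
]


def expand_targets(models: List[Dict]) -> Iterator[Tuple[str, str, str]]:
    """Yield (embedding_model, db_path, collection_name) for every model/LLM pair."""
    for entry in models:
        embedding_model = entry["model"]
        db_path = f"{vector_results}/{chroma_base.format(model=embedding_model)}"
        for llm in entry["llms"]:
            yield embedding_model, db_path, loyalty_collection.format(llm=llm)


targets = list(expand_targets(models))
for embedding_model, db_path, collection_name in targets:
    print(f"{embedding_model}: {db_path} -> {collection_name}")
```

Two embedding models with three LLMs each give six targets, one per report that the test run is expected to produce.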

In [None]:
from vectorization.semantic_vector_database import SemanticVectorDatabase
from rag.rag_tester import RAGTester

for model in models:
    embedding_model = model["model"]
    print(f"embedding_model: {embedding_model}")
    db_path = f"{vector_results}/{chroma_base.format(model=embedding_model)}"
    print(f"db_path: {db_path}")

    for llm in model["llms"]:
        print(f"llm: {llm}")
        collection_name = loyalty_collection.format(llm=llm)

        # One database handle and tester per (embedding model, LLM) collection
        vector_db = SemanticVectorDatabase(db_path, embedding_model)
        rag_tester = RAGTester(vector_db)

        # Run the suite and persist both the JSON results and the text report
        test_results = rag_tester.run_test_suite(collection_name)
        report_and_save(test_results, f"{rag_results}/{embedding_model}.{collection_name}.json")

In [5]:
show_performance_analysis(rag_results)

Loading RAG reports...
RAG PERFORMANCE ANALYSIS SUMMARY
Total Reports Analyzed: 6
Embedding Models: 3
LLM Models: 3

PERFORMANCE COMPARISON TABLE:
| Embedding Model   | LLM       |   Avg Distance |   Query Coverage (%) |   Best Query Distance |
|:------------------|:----------|---------------:|---------------------:|----------------------:|
| ** all-MiniLM-L6-v2  ** | ** gpt4.1    ** | **         1.1296 ** | **              51.5200 ** | **                0.6464 ** |
| all-MiniLM-L6-v2  | claude3.5 |         1.1699 |              51.5200 |                0.7950 |
| unknown           | codellama |         1.1797 |              60.6100 |                0.7286 |
| all-mpnet-base-v2 | gpt4.1    |         1.1933 |              63.6400 |                0.7520 |
| unknown           | codellama |         1.2374 |              48.4800 |                0.8422 |
| all-mpnet-base-v2 | claude3.5 |         1.2389 |              57.5800 |                0.9144 |

KEY FINDINGS:
• Best Overall Performan

#### Interactive Testing:
- Real-time query testing
- Custom query exploration
- Distance feedback per vector collection

In [None]:
# Quick interactive testing

def run_test(query):
    """Run one query against every embedding model / LLM collection and print a summary."""
    for model in models:
        embedding_model = model["model"]
        # Use the same database location as the batch test run above
        db_path = f"{vector_results}/{chroma_base.format(model=embedding_model)}"

        for llm in model["llms"]:
            vector_db = SemanticVectorDatabase(db_path, embedding_model)

            collection_name = loyalty_collection.format(llm=llm)
            collection = vector_db.get_collection(collection_name)

            result = collection.semantic_search(query, n_results=3)
            summary = result['summary']
            print(f"{embedding_model}-{llm}: Retrieved {summary['total_results']} results")
            print(f"{embedding_model}-{llm}: Average distance: {summary.get('avg_distance', 0):.4f}")

print("\n" + "="*60)
print("INTERACTIVE TESTING")
print("="*60)
print("Try some custom queries (type 'quit' to exit):")

while True:
    try:
        query = input("\nEnter your query: ").strip()
        if query.lower() in ['quit', 'exit', 'q']:
            break

        if query:
            run_test(query)

    except (KeyboardInterrupt, EOFError):
        print("\nExiting...")
        break
    except Exception as e:
        print(f"Error: {e}")