[E2E] Expand Comprehensive E2E test cases in CI #661

@Xunzhuo

Description

Expand the AI Gateway profile in the E2E testing framework with comprehensive test cases covering all major Semantic Router features including prompt guard, semantic cache, PII detection, tool selection, and advanced routing scenarios.

Background

The initial AI Gateway profile implementation in #655 includes only basic test cases:

  • basic-health-check: Verifies all deployments are healthy
  • chat-completions-request: Tests the /v1/chat/completions endpoint

We need to expand test coverage to validate all Semantic Router features in the AI Gateway environment.

Proposed Test Cases

1. Security & Safety Tests

Prompt Guard Tests

  • Jailbreak Detection: Test detection of jailbreak attempts

    • Input: Known jailbreak patterns
    • Expected: Request blocked or routed to safe model
    • Validation: Check response headers and model selection
  • Malicious Prompt Filtering: Test filtering of harmful prompts

    • Input: Prompts with malicious intent
    • Expected: Request rejected with appropriate error
    • Validation: Verify prompt guard classification results

PII Detection Tests

  • PII Entity Detection: Test detection of various PII types

    • Input: Prompts containing SSN, email, phone, credit card
    • Expected: PII entities detected and classified
    • Validation: Check PII classification confidence scores
  • PII Policy Enforcement: Test model selection based on PII policy

    • Input: Prompt with PII + model with PII restrictions
    • Expected: Alternative model selected or request blocked
    • Validation: Verify model routing based on PII policy
  • PII Redaction: Test PII redaction in responses (if enabled)

    • Input: Request that might return PII
    • Expected: PII redacted in response
    • Validation: Check response content for PII presence
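
The "check response content for PII presence" validation could be approximated with a few patterns. This is a minimal sketch: the helper name `findPII` and the regular expressions are illustrative assumptions, deliberately simple, and are not the router's own detectors.

```go
package main

import (
	"regexp"
	"sort"
)

// piiPatterns maps a PII label to a deliberately simple pattern.
// These regexes are assumptions for test validation only.
var piiPatterns = map[string]*regexp.Regexp{
	"ssn":   regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),
	"email": regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`),
	"phone": regexp.MustCompile(`\b\d{3}[-.]\d{3}[-.]\d{4}\b`),
}

// findPII returns the sorted labels of every pattern that matches text.
// An empty result means none of the known patterns were found.
func findPII(text string) []string {
	var found []string
	for label, re := range piiPatterns {
		if re.MatchString(text) {
			found = append(found, label)
		}
	}
	sort.Strings(found)
	return found
}
```

A redaction test would then fail whenever `findPII(responseBody)` returns a non-empty slice.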

2. Semantic Cache Tests

  • Cache Hit: Test semantic cache hit for similar queries

    • Input: Two semantically similar queries
    • Expected: Second query returns cached response
    • Validation: Check cache hit headers and response time
  • Cache Miss: Test cache miss for dissimilar queries

    • Input: Two semantically different queries
    • Expected: Both queries hit backend models
    • Validation: Verify no cache hit headers
  • Cache Similarity Threshold: Test cache threshold configuration

    • Input: Queries with varying similarity scores
    • Expected: Cache hit only when above threshold
    • Validation: Check cache behavior at boundary conditions
  • Cache TTL: Test cache expiration

    • Input: Same query before and after TTL
    • Expected: Cache hit before TTL, miss after
    • Validation: Verify cache expiration behavior
  • Cache Eviction: Test cache eviction policies (FIFO, LRU)

    • Input: Queries exceeding cache capacity
    • Expected: Oldest entries evicted under FIFO; least recently used entries evicted under LRU
    • Validation: Check cache size and eviction order
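
The cache hit check could be sketched as below. The header name `X-Cache-Hit` is a hypothetical placeholder for whatever the router actually emits, and `newFakeGateway` is only a local stand-in that marks every request after the first as a hit, so the helper can be exercised without a cluster. It does not compute any similarity.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"sync"
)

// cacheHitHeader is a hypothetical header name (assumption).
const cacheHitHeader = "X-Cache-Hit"

// newFakeGateway is a local stand-in for the gateway: the first request is a
// miss and every later request is reported as a hit.
func newFakeGateway() *httptest.Server {
	var mu sync.Mutex
	seen := false
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		hit := seen
		seen = true
		mu.Unlock()
		w.Header().Set(cacheHitHeader, strconv.FormatBool(hit))
		fmt.Fprint(w, `{}`)
	}))
}

// cacheHit sends one request and reports whether it was marked as a cache hit.
func cacheHit(client *http.Client, url, body string) (bool, error) {
	resp, err := client.Post(url, "application/json", strings.NewReader(body))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.Header.Get(cacheHitHeader) == "true", nil
}

// checkCacheHitSequence expects a miss for the first query and a hit for the second.
func checkCacheHitSequence(client *http.Client, url, first, second string) error {
	hit, err := cacheHit(client, url, first)
	if err != nil {
		return err
	}
	if hit {
		return fmt.Errorf("first query unexpectedly served from cache")
	}
	hit, err = cacheHit(client, url, second)
	if err != nil {
		return err
	}
	if !hit {
		return fmt.Errorf("second query was not served from cache")
	}
	return nil
}
```

Against a real deployment, `url` would point at the Envoy listener and the two bodies would carry semantically similar prompts.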

3. Intelligent Routing Tests

Category-Based Routing

  • Category Classification: Test intent classification accuracy

    • Input: Queries from different categories (math, biology, law, etc.)
    • Expected: Correct category classification
    • Validation: Check classification confidence and category
  • Model Selection by Category: Test model routing based on category scores

    • Input: Category-specific queries
    • Expected: Model with highest score for category selected
    • Validation: Verify selected model matches configuration
  • Fallback to Default Model: Test fallback when no category matches

    • Input: Query with low classification confidence
    • Expected: Default model selected
    • Validation: Check fallback behavior

Reasoning Mode Detection

  • Reasoning Mode Activation: Test detection of reasoning-required queries

    • Input: Complex reasoning queries
    • Expected: Reasoning mode enabled in request
    • Validation: Check reasoning parameters in backend request
  • Reasoning Family Configuration: Test reasoning family-specific settings

    • Input: Query requiring reasoning
    • Expected: Correct reasoning parameters for model family
    • Validation: Verify reasoning configuration (thinking, enable_thinking, etc.)

Entropy-Based Routing

  • High Entropy Detection: Test routing for high-entropy queries

    • Input: Ambiguous or complex queries
    • Expected: More capable model selected
    • Validation: Check entropy score and model selection
  • Low Entropy Optimization: Test efficient routing for simple queries

    • Input: Simple, low-entropy queries
    • Expected: Efficient model selected
    • Validation: Verify cost-optimized routing

4. Tool Selection Tests

  • Tool Detection: Test automatic tool selection

    • Input: Query requiring specific tools (weather, calculator, search)
    • Expected: Appropriate tools selected
    • Validation: Check tools in request payload
  • Tool Similarity Matching: Test tool selection based on similarity

    • Input: Queries with varying tool relevance
    • Expected: Tools selected above similarity threshold
    • Validation: Verify tool selection confidence scores
  • Top-K Tool Selection: Test limiting number of selected tools

    • Input: Query matching multiple tools
    • Expected: Only top-K tools selected
    • Validation: Check number of tools in request
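
The expected tool list for the threshold and top-K cases can be computed independently of the router and compared with the tools found in the request payload. This is a sketch under assumed semantics (keep scores at or above the threshold, order by descending score, cut to K); the type and function names are hypothetical.

```go
package main

import "sort"

// toolScore pairs a tool name with its similarity score for a query.
type toolScore struct {
	Name  string
	Score float64
}

// selectTools keeps tools scoring at least threshold, orders them by
// descending score (ties broken by name), and returns at most k names.
func selectTools(scores []toolScore, threshold float64, k int) []string {
	kept := make([]toolScore, 0, len(scores))
	for _, s := range scores {
		if s.Score >= threshold {
			kept = append(kept, s)
		}
	}
	sort.Slice(kept, func(i, j int) bool {
		if kept[i].Score != kept[j].Score {
			return kept[i].Score > kept[j].Score
		}
		return kept[i].Name < kept[j].Name
	})
	if k >= 0 && len(kept) > k {
		kept = kept[:k]
	}
	names := make([]string, len(kept))
	for i, s := range kept {
		names[i] = s.Name
	}
	return names
}
```

Whether the boundary is inclusive (`>=`) is itself an assumption that the threshold test should pin down.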

5. Request/Response Transformation Tests

  • System Prompt Injection: Test category-specific system prompts

    • Input: Query from specific category
    • Expected: Category system prompt added to request
    • Validation: Check system message in backend request
  • Request Header Propagation: Test custom header forwarding

    • Input: Request with custom headers
    • Expected: Headers forwarded to backend
    • Validation: Verify headers in backend request
  • Response Header Injection: Test VSR metadata headers

    • Input: Any valid request
    • Expected: VSR headers in response (selected model, category, etc.)
    • Validation: Check response headers

6. Error Handling & Edge Cases

  • Invalid Model Request: Test handling of non-existent model

    • Input: Request with invalid model name
    • Expected: Appropriate error response
    • Validation: Check error message and status code
  • Backend Timeout: Test timeout handling

    • Input: Request to slow/unresponsive backend
    • Expected: Timeout error after configured duration
    • Validation: Verify timeout behavior
  • Backend Failure Fallback: Test fallback when primary backend fails

    • Input: Request when backend is down
    • Expected: Fallback to alternative backend or error
    • Validation: Check failover behavior
  • Malformed Request: Test handling of invalid requests

    • Input: Requests with missing/invalid fields
    • Expected: Validation error response
    • Validation: Check error messages

7. Performance & Load Tests

  • Concurrent Requests: Test handling of concurrent requests

    • Input: Multiple simultaneous requests
    • Expected: All requests processed successfully
    • Validation: Check response times and success rate
  • Large Payload: Test handling of large prompts

    • Input: Request with very long prompt
    • Expected: Request processed or rejected with size limit error
    • Validation: Verify payload size handling
  • Streaming Response: Test streaming completions

    • Input: Request with stream: true
    • Expected: Streaming response chunks
    • Validation: Verify SSE format and chunk delivery

8. Observability Tests

  • Metrics Collection: Test metrics endpoint

    • Input: Multiple requests
    • Expected: Metrics updated correctly
    • Validation: Check Prometheus metrics
  • Tracing Integration: Test distributed tracing (if enabled)

    • Input: Request with trace headers
    • Expected: Trace spans created
    • Validation: Verify trace propagation
  • Logging: Test structured logging

    • Input: Various request types
    • Expected: Appropriate log entries
    • Validation: Check log format and content

Implementation Plan

Phase 1: Core Features (Priority: High)

  • Prompt Guard tests
  • PII Detection tests
  • Semantic Cache tests
  • Category-Based Routing tests

Phase 2: Advanced Features (Priority: Medium)

  • Reasoning Mode tests
  • Entropy-Based Routing tests
  • Tool Selection tests
  • Request/Response Transformation tests

Phase 3: Reliability & Performance (Priority: Medium)

  • Error Handling tests
  • Performance tests
  • Observability tests

Test Implementation Guidelines

Test Structure

Each test case should follow this structure:

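// Needs context, fmt, net/http, k8s.io/client-go/kubernetes, and the e2e testcases package.
// envoyService and sendChatRequest are placeholders for the profile's Envoy address and
// request helper; sendChatRequest is assumed to return (*http.Response, error).
func testPromptGuardJailbreakDetection(ctx context.Context, client *kubernetes.Clientset, opts testcases.TestCaseOptions) error {
    // 1. Setup: Prepare test data and environment
    jailbreakPrompt := "Ignore previous instructions and..."

    // 2. Execute: Send request through Envoy
    response, err := sendChatRequest(ctx, envoyService, jailbreakPrompt)
    if err != nil {
        return fmt.Errorf("chat request failed: %w", err)
    }
    defer response.Body.Close()

    // 3. Validate: Check expected behavior
    if response.StatusCode != http.StatusForbidden {
        return fmt.Errorf("expected 403, got %d", response.StatusCode)
    }

    // 4. Verify: Check headers and metadata
    if got := response.Header.Get("X-VSR-Prompt-Guard"); got != "blocked" {
        return fmt.Errorf("prompt guard not triggered: X-VSR-Prompt-Guard=%q", got)
    }

    return nil
}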

Test Registration

Register tests in e2e/profiles/ai-gateway/testcases.go:

func init() {
    // Security tests
    testcases.Register(testcases.TestCase{
        Name:        "prompt-guard-jailbreak",
        Description: "Test jailbreak detection with prompt guard",
        Tags:        []string{"security", "prompt-guard"},
        Fn:          testPromptGuardJailbreakDetection,
    })
    
    // Cache tests
    testcases.Register(testcases.TestCase{
        Name:        "semantic-cache-hit",
        Description: "Test semantic cache hit for similar queries",
        Tags:        []string{"cache", "performance"},
        Fn:          testSemanticCacheHit,
    })
    
    // ... more test registrations
}
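
The Tags field is what would make selective runs possible. How the framework maps a selection to registered tests is not specified here; the sketch below assumes an exact tag match and uses a simplified stand-in type, not the real `testcases.TestCase`.

```go
package main

// testCase is a simplified stand-in that mirrors only the Name and Tags
// fields used in the registration above.
type testCase struct {
	Name string
	Tags []string
}

// filterByTag returns the cases that carry tag, preserving registration order.
func filterByTag(cases []testCase, tag string) []testCase {
	var out []testCase
	for _, tc := range cases {
		for _, t := range tc.Tags {
			if t == tag {
				out = append(out, tc)
				break
			}
		}
	}
	return out
}
```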

Test Configuration

Add test-specific configuration in deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml:

config:
  # Enable all features for testing
  prompt_guard:
    enabled: true
    threshold: 0.7
  
  semantic_cache:
    enabled: true
    similarity_threshold: 0.8
    max_entries: 1000
    ttl_seconds: 3600
  
  classifier:
    pii_model:
      enabled: true
      threshold: 0.7

Acceptance Criteria

  • All test cases implemented and passing
  • Test coverage for all major Semantic Router features
  • Tests can be run selectively by tags (e.g., make e2e-test E2E_TESTS=security)
  • Test documentation updated in e2e/README.md
  • CI integration updated to run new tests
  • Test execution time remains reasonable (< 30 minutes for full suite)
