Description
Expand the AI Gateway profile in the E2E testing framework with comprehensive test cases covering all major Semantic Router features including prompt guard, semantic cache, PII detection, tool selection, and advanced routing scenarios.
Background
The initial AI Gateway profile implementation in #655 includes only basic test cases:
- basic-health-check: Verifies all deployments are healthy
- chat-completions-request: Tests the /v1/chat/completions endpoint
We need to expand test coverage to validate all Semantic Router features in the AI Gateway environment.
Proposed Test Cases
1. Security & Safety Tests
Prompt Guard Tests
- Jailbreak Detection: Test detection of jailbreak attempts
  - Input: Known jailbreak patterns
  - Expected: Request blocked or routed to safe model
  - Validation: Check response headers and model selection
- Malicious Prompt Filtering: Test filtering of harmful prompts
  - Input: Prompts with malicious intent
  - Expected: Request rejected with appropriate error
  - Validation: Verify prompt guard classification results
PII Detection Tests
- PII Entity Detection: Test detection of various PII types
  - Input: Prompts containing SSN, email, phone, credit card
  - Expected: PII entities detected and classified
  - Validation: Check PII classification confidence scores
- PII Policy Enforcement: Test model selection based on PII policy
  - Input: Prompt with PII + model with PII restrictions
  - Expected: Alternative model selected or request blocked
  - Validation: Verify model routing based on PII policy
- PII Redaction: Test PII redaction in responses (if enabled)
  - Input: Request that might return PII
  - Expected: PII redacted in response
  - Validation: Check response content for PII presence
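The "check response content for PII presence" validation could be approximated with a small scanner like the one below. This is only a sketch: the `piiPatterns` table and the `findPII` helper are hypothetical names, and the regular expressions are deliberately simple stand-ins, not the PII classifier used by the router.

```go
package main

import (
	"regexp"
	"sort"
)

// piiPatterns maps a PII label to a deliberately simple pattern.
// Assumption: these regexes are illustrative only and much weaker
// than a real PII classifier.
var piiPatterns = map[string]*regexp.Regexp{
	"email": regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`),
	"ssn":   regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),
	"phone": regexp.MustCompile(`\b\d{3}[-.]\d{3}[-.]\d{4}\b`),
}

// findPII returns the sorted labels of every pattern that matches text.
// An empty result means none of the known patterns were found.
func findPII(text string) []string {
	var found []string
	for label, re := range piiPatterns {
		if re.MatchString(text) {
			found = append(found, label)
		}
	}
	sort.Strings(found)
	return found
}
```

A redaction test could then fail whenever `findPII(responseBody)` returns a non-empty slice.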
2. Semantic Cache Tests
- Cache Hit: Test semantic cache hit for similar queries
  - Input: Two semantically similar queries
  - Expected: Second query returns cached response
  - Validation: Check cache hit headers and response time
- Cache Miss: Test cache miss for dissimilar queries
  - Input: Two semantically different queries
  - Expected: Both queries hit backend models
  - Validation: Verify no cache hit headers
- Cache Similarity Threshold: Test cache threshold configuration
  - Input: Queries with varying similarity scores
  - Expected: Cache hit only when above threshold
  - Validation: Check cache behavior at boundary conditions
- Cache TTL: Test cache expiration
  - Input: Same query before and after TTL
  - Expected: Cache hit before TTL, miss after
  - Validation: Verify cache expiration behavior
- Cache Eviction: Test cache eviction policies (FIFO, LRU)
  - Input: Queries exceeding cache capacity
  - Expected: Oldest entries evicted
  - Validation: Check cache size and eviction order
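For the similarity-threshold case, a test needs to know in advance whether a pair of queries should hit the cache. The sketch below computes that expectation from two embedding vectors. It assumes the cache compares embeddings by cosine similarity and treats a score equal to the threshold as a hit; both are assumptions to confirm against the router, and `cosineSimilarity` and `expectCacheHit` are hypothetical helper names.

```go
package main

import "math"

// cosineSimilarity returns the cosine of the angle between a and b.
// It returns 0 for empty, mismatched, or zero-length vectors.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// expectCacheHit reports whether two embeddings should produce a cache hit.
// Assumption: the boundary is inclusive (similarity >= threshold).
func expectCacheHit(a, b []float64, threshold float64) bool {
	return cosineSimilarity(a, b) >= threshold
}
```

Boundary tests can then pick query pairs whose similarity sits just above and just below the configured threshold and compare the observed cache behavior with `expectCacheHit`.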
3. Intelligent Routing Tests
Category-Based Routing
- Category Classification: Test intent classification accuracy
  - Input: Queries from different categories (math, biology, law, etc.)
  - Expected: Correct category classification
  - Validation: Check classification confidence and category
- Model Selection by Category: Test model routing based on category scores
  - Input: Category-specific queries
  - Expected: Model with highest score for category selected
  - Validation: Verify selected model matches configuration
- Fallback to Default Model: Test fallback when no category matches
  - Input: Query with low classification confidence
  - Expected: Default model selected
  - Validation: Check fallback behavior
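The expected model for a routing test can be derived from the configuration rather than hard-coded. The following sketch picks the highest-scoring model for a category and falls back to a default when the category is unknown or the confidence is too low. The `modelScore` type, the `selectModel` helper, and the `minConfidence` parameter are hypothetical and only mirror the behavior described in the list above.

```go
package main

// modelScore pairs a model name with its configured score for a category.
type modelScore struct {
	Model string
	Score float64
}

// selectModel returns the model a test should expect for a classified query.
// It falls back to defaultModel when the category has no candidates or the
// classification confidence is below minConfidence (an assumed threshold).
func selectModel(category string, confidence, minConfidence float64,
	scores map[string][]modelScore, defaultModel string) string {
	candidates := scores[category]
	if len(candidates) == 0 || confidence < minConfidence {
		return defaultModel
	}
	best := candidates[0]
	for _, c := range candidates[1:] {
		if c.Score > best.Score {
			best = c
		}
	}
	return best.Model
}
```

A test would compare the result with the model reported by the router for the same query.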
Reasoning Mode Detection
- Reasoning Mode Activation: Test detection of reasoning-required queries
  - Input: Complex reasoning queries
  - Expected: Reasoning mode enabled in request
  - Validation: Check reasoning parameters in backend request
- Reasoning Family Configuration: Test reasoning family-specific settings
  - Input: Query requiring reasoning
  - Expected: Correct reasoning parameters for model family
  - Validation: Verify reasoning configuration (thinking, enable_thinking, etc.)
Entropy-Based Routing
- High Entropy Detection: Test routing for high-entropy queries
  - Input: Ambiguous or complex queries
  - Expected: More capable model selected
  - Validation: Check entropy score and model selection
- Low Entropy Optimization: Test efficient routing for simple queries
  - Input: Simple, low-entropy queries
  - Expected: Efficient model selected
  - Validation: Verify cost-optimized routing
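To check an entropy score in a test, it helps to compute a reference value independently. The sketch below assumes "entropy" means the Shannon entropy of the classifier's category probability distribution, which is an assumption about the router rather than something stated here. `shannonEntropy` and `normalizedEntropy` are hypothetical helper names.

```go
package main

import "math"

// shannonEntropy returns the entropy, in bits, of a probability distribution.
// Zero-probability entries contribute nothing to the sum.
func shannonEntropy(probs []float64) float64 {
	var h float64
	for _, p := range probs {
		if p > 0 {
			h -= p * math.Log2(p)
		}
	}
	return h
}

// normalizedEntropy scales the entropy to [0, 1] by dividing by log2(n),
// the maximum entropy of a distribution over n outcomes.
func normalizedEntropy(probs []float64) float64 {
	if len(probs) < 2 {
		return 0
	}
	return shannonEntropy(probs) / math.Log2(float64(len(probs)))
}
```

A uniform distribution over the categories gives a normalized entropy of 1 (maximally ambiguous), while a distribution concentrated on one category gives 0.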
4. Tool Selection Tests
- Tool Detection: Test automatic tool selection
  - Input: Query requiring specific tools (weather, calculator, search)
  - Expected: Appropriate tools selected
  - Validation: Check tools in request payload
- Tool Similarity Matching: Test tool selection based on similarity
  - Input: Queries with varying tool relevance
  - Expected: Tools selected above similarity threshold
  - Validation: Verify tool selection confidence scores
- Top-K Tool Selection: Test limiting number of selected tools
  - Input: Query matching multiple tools
  - Expected: Only top-K tools selected
  - Validation: Check number of tools in request
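The threshold and top-K expectations can be combined into one reference function that a test compares against the tools found in the backend request. This is a sketch under assumptions: the `toolMatch` type and `selectTopK` helper are hypothetical, and the inclusive threshold comparison should be confirmed against the router's behavior.

```go
package main

import "sort"

// toolMatch pairs a tool name with its similarity to the query.
type toolMatch struct {
	Name       string
	Similarity float64
}

// selectTopK keeps the matches at or above threshold, orders them by
// descending similarity, and returns at most k of them.
func selectTopK(matches []toolMatch, threshold float64, k int) []toolMatch {
	if k < 0 {
		k = 0
	}
	kept := make([]toolMatch, 0, len(matches))
	for _, m := range matches {
		if m.Similarity >= threshold {
			kept = append(kept, m)
		}
	}
	// A stable sort keeps the input order for equal similarities.
	sort.SliceStable(kept, func(i, j int) bool {
		return kept[i].Similarity > kept[j].Similarity
	})
	if len(kept) > k {
		kept = kept[:k]
	}
	return kept
}
```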
5. Request/Response Transformation Tests
- System Prompt Injection: Test category-specific system prompts
  - Input: Query from specific category
  - Expected: Category system prompt added to request
  - Validation: Check system message in backend request
- Request Header Propagation: Test custom header forwarding
  - Input: Request with custom headers
  - Expected: Headers forwarded to backend
  - Validation: Verify headers in backend request
- Response Header Injection: Test VSR metadata headers
  - Input: Any valid request
  - Expected: VSR headers in response (selected model, category, etc.)
  - Validation: Check response headers
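Header validation can be reduced to one helper that reports which expected headers are absent, which gives clearer failure messages than checking them one at a time. In this sketch `missingHeaders` is a hypothetical helper, and the header names in the usage note are placeholders rather than confirmed VSR header names.

```go
package main

import "net/http"

// missingHeaders returns the names from required that are absent or empty
// in h. Lookups are case-insensitive because http.Header.Get canonicalizes
// the key.
func missingHeaders(h http.Header, required []string) []string {
	var missing []string
	for _, name := range required {
		if h.Get(name) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}
```

A test might call `missingHeaders(resp.Header, []string{"x-vsr-selected-model", "x-vsr-selected-category"})` (placeholder names) and fail if the result is non-empty.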
6. Error Handling & Edge Cases
- Invalid Model Request: Test handling of non-existent model
  - Input: Request with invalid model name
  - Expected: Appropriate error response
  - Validation: Check error message and status code
- Backend Timeout: Test timeout handling
  - Input: Request to slow/unresponsive backend
  - Expected: Timeout error after configured duration
  - Validation: Verify timeout behavior
- Backend Failure Fallback: Test fallback when primary backend fails
  - Input: Request when backend is down
  - Expected: Fallback to alternative backend or error
  - Validation: Check failover behavior
- Malformed Request: Test handling of invalid requests
  - Input: Requests with missing/invalid fields
  - Expected: Validation error response
  - Validation: Check error messages
7. Performance & Load Tests
- Concurrent Requests: Test handling of concurrent requests
  - Input: Multiple simultaneous requests
  - Expected: All requests processed successfully
  - Validation: Check response times and success rate
- Large Payload: Test handling of large prompts
  - Input: Request with very long prompt
  - Expected: Request processed or rejected with size limit error
  - Validation: Verify payload size handling
- Streaming Response: Test streaming completions
  - Input: Request with stream: true
  - Expected: Streaming response chunks
  - Validation: Verify SSE format and chunk delivery
8. Observability Tests
- Metrics Collection: Test metrics endpoint
  - Input: Multiple requests
  - Expected: Metrics updated correctly
  - Validation: Check Prometheus metrics
- Tracing Integration: Test distributed tracing (if enabled)
  - Input: Request with trace headers
  - Expected: Trace spans created
  - Validation: Verify trace propagation
- Logging: Test structured logging
  - Input: Various request types
  - Expected: Appropriate log entries
  - Validation: Check log format and content
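One way to check that metrics were updated is to scrape the metrics endpoint before and after sending requests and compare a counter's total. The sketch below sums all samples of one metric from Prometheus text exposition output. `sumMetric` is a hypothetical helper, it is not a full parser of the exposition format, and any metric name passed to it is an assumption about what the router exports.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric across all label sets.
// The boolean result is false when no sample of that metric was found.
func sumMetric(exposition, name string) (float64, bool) {
	var total float64
	found := false
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the label block or the first space.
		end := strings.IndexAny(line, "{ ")
		if end < 0 || line[:end] != name {
			continue
		}
		rest := line[end:]
		if strings.HasPrefix(rest, "{") {
			rest = rest[strings.LastIndex(rest, "}")+1:]
		}
		// The first field after the labels is the value; a timestamp may follow.
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += value
		found = true
	}
	return total, found
}
```

A metrics test would record the total, send N requests, scrape again, and expect the total to have increased by N.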
Implementation Plan
Phase 1: Core Features (Priority: High)
- Prompt Guard tests
- PII Detection tests
- Semantic Cache tests
- Category-Based Routing tests
Phase 2: Advanced Features (Priority: Medium)
- Reasoning Mode tests
- Entropy-Based Routing tests
- Tool Selection tests
- Request/Response Transformation tests
Phase 3: Reliability & Performance (Priority: Medium)
- Error Handling tests
- Performance tests
- Observability tests
Test Implementation Guidelines
Test Structure
Each test case should follow this structure:
    func testPromptGuardJailbreakDetection(ctx context.Context, client *kubernetes.Clientset, opts testcases.TestCaseOptions) error {
        // 1. Setup: Prepare test data and environment
        jailbreakPrompt := "Ignore previous instructions and..."

        // 2. Execute: Send request through Envoy
        response := sendChatRequest(envoyService, jailbreakPrompt)

        // 3. Validate: Check expected behavior
        if response.StatusCode != 403 {
            return fmt.Errorf("expected 403, got %d", response.StatusCode)
        }

        // 4. Verify: Check headers and metadata
        if response.Headers["X-VSR-Prompt-Guard"] != "blocked" {
            return fmt.Errorf("prompt guard not triggered")
        }

        return nil
    }

Test Registration
Register tests in e2e/profiles/ai-gateway/testcases.go:
    func init() {
        // Security tests
        testcases.Register(testcases.TestCase{
            Name:        "prompt-guard-jailbreak",
            Description: "Test jailbreak detection with prompt guard",
            Tags:        []string{"security", "prompt-guard"},
            Fn:          testPromptGuardJailbreakDetection,
        })

        // Cache tests
        testcases.Register(testcases.TestCase{
            Name:        "semantic-cache-hit",
            Description: "Test semantic cache hit for similar queries",
            Tags:        []string{"cache", "performance"},
            Fn:          testSemanticCacheHit,
        })

        // ... more test registrations
    }

Test Configuration
Add test-specific configuration in deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml:
    config:
      # Enable all features for testing
      prompt_guard:
        enabled: true
        threshold: 0.7
      semantic_cache:
        enabled: true
        similarity_threshold: 0.8
        max_entries: 1000
        ttl_seconds: 3600
      classifier:
        pii_model:
          enabled: true
          threshold: 0.7

Acceptance Criteria
- All test cases implemented and passing
- Test coverage for all major Semantic Router features
- Tests can be run selectively by tags (e.g., make e2e-test E2E_TESTS=security)
- Test documentation updated in e2e/README.md
- CI integration updated to run new tests
- Test execution time remains reasonable (< 30 minutes for full suite)
References
- E2E Framework PR: [Feat] Add automate e2e test framework for extensible integration tests #655
- AI Gateway Profile: e2e/profiles/ai-gateway/
- Semantic Router Configuration: config/config.yaml
- E2E Framework README: e2e/README.md
Related Issues
- [Feat] Add automate e2e test framework for extensible integration tests #655 - E2E Framework PR
- [E2E] Add Istio profile for E2E testing framework #656 - Istio profile
- [E2E] Add production-stack profile for E2E testing framework #657 - Production-stack profile
- [E2E] Add llm-d profile for E2E testing framework #658 - LLM-D profile
- [E2E] Add dynamo profile for E2E testing framework #659 - Dynamo profile
- [E2E] Add aibrix profile for E2E testing framework #660 - AIBrix profile