[E2E] Expand Comprehensive E2E test cases in CI

## Description

Expand the AI Gateway profile in the E2E testing framework with comprehensive test cases covering all major Semantic Router features including prompt guard, semantic cache, PII detection, tool selection, and advanced routing scenarios.

## Background

The initial AI Gateway profile implementation in #655 includes only basic test cases:
- `basic-health-check`: Verifies all deployments are healthy
- `chat-completions-request`: Tests the `/v1/chat/completions` endpoint

We need to expand test coverage to validate all Semantic Router features in the AI Gateway environment.

## Proposed Test Cases

### 1. Security & Safety Tests

#### Prompt Guard Tests
- [ ] **Jailbreak Detection**: Test detection of jailbreak attempts
  - Input: Known jailbreak patterns
  - Expected: Request blocked or routed to safe model
  - Validation: Check response headers and model selection

- [ ] **Malicious Prompt Filtering**: Test filtering of harmful prompts
  - Input: Prompts with malicious intent
  - Expected: Request rejected with appropriate error
  - Validation: Verify prompt guard classification results

#### PII Detection Tests
- [ ] **PII Entity Detection**: Test detection of various PII types
  - Input: Prompts containing SSN, email, phone, credit card
  - Expected: PII entities detected and classified
  - Validation: Check PII classification confidence scores

- [ ] **PII Policy Enforcement**: Test model selection based on PII policy
  - Input: Prompt with PII + model with PII restrictions
  - Expected: Alternative model selected or request blocked
  - Validation: Verify model routing based on PII policy

- [ ] **PII Redaction**: Test PII redaction in responses (if enabled)
  - Input: Request that might return PII
  - Expected: PII redacted in response
  - Validation: Check response content for PII presence
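
For the redaction check, a test could scan the response body with a few simple patterns. The sketch below is only an assertion helper: the `findPII` name, the labels, and the regexes are assumptions made for illustration and are not the classifier Semantic Router uses.

```go
package main

import (
	"regexp"
	"sort"
)

// piiPatterns maps an illustrative PII label to a deliberately simple regex.
// These patterns are assumptions for test assertions only.
var piiPatterns = map[string]*regexp.Regexp{
	"email":       regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`),
	"ssn":         regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),
	"credit_card": regexp.MustCompile(`\b(?:\d{4}[ -]?){3}\d{4}\b`),
}

// findPII returns the sorted labels of every pattern that matches text.
// An empty result means none of the illustrative patterns matched.
func findPII(text string) []string {
	var found []string
	for label, re := range piiPatterns {
		if re.MatchString(text) {
			found = append(found, label)
		}
	}
	sort.Strings(found)
	return found
}
```

A redaction test would then fail when `findPII(responseBody)` is non-empty. Pattern matching like this only catches well-formed values, so it complements rather than replaces checking the classifier's own output.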

### 2. Semantic Cache Tests

- [ ] **Cache Hit**: Test semantic cache hit for similar queries
  - Input: Two semantically similar queries
  - Expected: Second query returns cached response
  - Validation: Check cache hit headers and response time

- [ ] **Cache Miss**: Test cache miss for dissimilar queries
  - Input: Two semantically different queries
  - Expected: Both queries hit backend models
  - Validation: Verify no cache hit headers

- [ ] **Cache Similarity Threshold**: Test cache threshold configuration
  - Input: Queries with varying similarity scores
  - Expected: Cache hit only when above threshold
  - Validation: Check cache behavior at boundary conditions

- [ ] **Cache TTL**: Test cache expiration
  - Input: Same query before and after TTL
  - Expected: Cache hit before TTL, miss after
  - Validation: Verify cache expiration behavior

- [ ] **Cache Eviction**: Test cache eviction policies (FIFO, LRU)
  - Input: Queries exceeding cache capacity
  - Expected: Entries evicted according to the configured policy (oldest inserted for FIFO, least recently used for LRU)
  - Validation: Check cache size and eviction order
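
To validate eviction order, a test needs to know which entries should survive a given access sequence. The sketch below is a small reference model for that; `expectedSurvivors` is a hypothetical helper, and it assumes each query maps to exactly one cache key, which a real semantic cache only approximates.

```go
package main

// expectedSurvivors replays a sequence of cache key accesses against a cache
// of the given capacity and returns the keys that should remain, ordered from
// next-to-evict to last-to-evict. policy is "lru" or "fifo"; any other value
// is treated as FIFO.
func expectedSurvivors(policy string, capacity int, accesses []string) []string {
	if capacity <= 0 {
		return nil
	}
	var order []string // front of the slice is the next eviction candidate
	for _, key := range accesses {
		idx := -1
		for i, k := range order {
			if k == key {
				idx = i
				break
			}
		}
		if idx >= 0 {
			if policy == "lru" {
				// A hit refreshes recency under LRU: move the key to the back.
				order = append(order[:idx], order[idx+1:]...)
				order = append(order, key)
			}
			continue // FIFO ignores hits; insertion order is unchanged
		}
		if len(order) >= capacity {
			order = order[1:] // evict the front entry
		}
		order = append(order, key)
	}
	return order
}
```

With capacity 2 and the accesses `a, b, a, c`, FIFO leaves `b, c` while LRU leaves `a, c`, which is the difference the eviction test needs to distinguish.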

### 3. Intelligent Routing Tests

#### Category-Based Routing
- [ ] **Category Classification**: Test intent classification accuracy
  - Input: Queries from different categories (math, biology, law, etc.)
  - Expected: Correct category classification
  - Validation: Check classification confidence and category

- [ ] **Model Selection by Category**: Test model routing based on category scores
  - Input: Category-specific queries
  - Expected: Model with highest score for category selected
  - Validation: Verify selected model matches configuration

- [ ] **Fallback to Default Model**: Test fallback when no category matches
  - Input: Query with low classification confidence
  - Expected: Default model selected
  - Validation: Check fallback behavior
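
The routing checks above can share one helper that computes the model a test should expect. This is a sketch under assumptions: `expectedModel`, the `minConfidence` cutoff, and the alphabetical tie-break are illustrative and may not match how the router resolves ties or low confidence.

```go
package main

import "sort"

// expectedModel returns the model a routing test should expect for a category.
// modelScores holds the configured per-model scores for the classified
// category. When classification confidence is below minConfidence, or no
// scores are configured, the default model is expected instead.
func expectedModel(modelScores map[string]float64, confidence, minConfidence float64, defaultModel string) string {
	if confidence < minConfidence || len(modelScores) == 0 {
		return defaultModel
	}
	names := make([]string, 0, len(modelScores))
	for name := range modelScores {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic tie-break: alphabetical order
	best := names[0]
	for _, name := range names[1:] {
		if modelScores[name] > modelScores[best] {
			best = name
		}
	}
	return best
}
```

A test would compare this result with the model reported in the response, so the expected value is derived from the same configuration the router loads instead of being hard-coded per test.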

#### Reasoning Mode Detection
- [ ] **Reasoning Mode Activation**: Test detection of reasoning-required queries
  - Input: Complex reasoning queries
  - Expected: Reasoning mode enabled in request
  - Validation: Check reasoning parameters in backend request

- [ ] **Reasoning Family Configuration**: Test reasoning family-specific settings
  - Input: Query requiring reasoning
  - Expected: Correct reasoning parameters for model family
  - Validation: Verify reasoning configuration (thinking, enable_thinking, etc.)

#### Entropy-Based Routing
- [ ] **High Entropy Detection**: Test routing for high-entropy queries
  - Input: Ambiguous or complex queries
  - Expected: More capable model selected
  - Validation: Check entropy score and model selection

- [ ] **Low Entropy Optimization**: Test efficient routing for simple queries
  - Input: Simple, low-entropy queries
  - Expected: Efficient model selected
  - Validation: Verify cost-optimized routing

### 4. Tool Selection Tests

- [ ] **Tool Detection**: Test automatic tool selection
  - Input: Query requiring specific tools (weather, calculator, search)
  - Expected: Appropriate tools selected
  - Validation: Check tools in request payload

- [ ] **Tool Similarity Matching**: Test tool selection based on similarity
  - Input: Queries with varying tool relevance
  - Expected: Tools selected above similarity threshold
  - Validation: Verify tool selection confidence scores

- [ ] **Top-K Tool Selection**: Test limiting number of selected tools
  - Input: Query matching multiple tools
  - Expected: Only top-K tools selected
  - Validation: Check number of tools in request
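
The threshold and top-K expectations can be computed by a small reference helper and compared with the tools found in the request payload. The sketch assumes the threshold is inclusive and that ties are broken by name; `expectedTools` and the example tool names are hypothetical.

```go
package main

import "sort"

// expectedTools returns the tool names a test should expect to be selected:
// every tool whose similarity is at or above threshold, ordered by descending
// similarity (ties broken by name), truncated to at most k entries.
// A negative k means no limit.
func expectedTools(similarity map[string]float64, threshold float64, k int) []string {
	var names []string
	for name, score := range similarity {
		if score >= threshold {
			names = append(names, name)
		}
	}
	sort.Slice(names, func(i, j int) bool {
		si, sj := similarity[names[i]], similarity[names[j]]
		if si != sj {
			return si > sj
		}
		return names[i] < names[j]
	})
	if k >= 0 && len(names) > k {
		names = names[:k]
	}
	return names
}
```

For similarities `weather=0.92`, `search=0.85`, `calendar=0.81`, `calculator=0.40` with threshold 0.8 and K=2, the helper yields `weather, search`.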

### 5. Request/Response Transformation Tests

- [ ] **System Prompt Injection**: Test category-specific system prompts
  - Input: Query from specific category
  - Expected: Category system prompt added to request
  - Validation: Check system message in backend request

- [ ] **Request Header Propagation**: Test custom header forwarding
  - Input: Request with custom headers
  - Expected: Headers forwarded to backend
  - Validation: Verify headers in backend request

- [ ] **Response Header Injection**: Test VSR metadata headers
  - Input: Any valid request
  - Expected: VSR headers in response (selected model, category, etc.)
  - Validation: Check response headers

### 6. Error Handling & Edge Cases

- [ ] **Invalid Model Request**: Test handling of non-existent model
  - Input: Request with invalid model name
  - Expected: Appropriate error response
  - Validation: Check error message and status code

- [ ] **Backend Timeout**: Test timeout handling
  - Input: Request to slow/unresponsive backend
  - Expected: Timeout error after configured duration
  - Validation: Verify timeout behavior

- [ ] **Backend Failure Fallback**: Test fallback when primary backend fails
  - Input: Request when backend is down
  - Expected: Fallback to alternative backend or error
  - Validation: Check failover behavior

- [ ] **Malformed Request**: Test handling of invalid requests
  - Input: Requests with missing/invalid fields
  - Expected: Validation error response
  - Validation: Check error messages

### 7. Performance & Load Tests

- [ ] **Concurrent Requests**: Test handling of concurrent requests
  - Input: Multiple simultaneous requests
  - Expected: All requests processed successfully
  - Validation: Check response times and success rate

- [ ] **Large Payload**: Test handling of large prompts
  - Input: Request with very long prompt
  - Expected: Request processed or rejected with size limit error
  - Validation: Verify payload size handling

- [ ] **Streaming Response**: Test streaming completions
  - Input: Request with `stream: true`
  - Expected: Streaming response chunks
  - Validation: Verify SSE format and chunk delivery

### 8. Observability Tests

- [ ] **Metrics Collection**: Test metrics endpoint
  - Input: Multiple requests
  - Expected: Metrics updated correctly
  - Validation: Check Prometheus metrics

- [ ] **Tracing Integration**: Test distributed tracing (if enabled)
  - Input: Request with trace headers
  - Expected: Trace spans created
  - Validation: Verify trace propagation

- [ ] **Logging**: Test structured logging
  - Input: Various request types
  - Expected: Appropriate log entries
  - Validation: Check log format and content
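
For the metrics check, a test can scrape the metrics endpoint before and after sending requests and compare counter values. The sketch below sums all samples of one metric from the Prometheus text exposition format; `sumMetric` is a hypothetical helper, and the metric names to check are left to the implementation.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric, across all label sets,
// in a Prometheus text exposition body. The boolean is false when no sample
// with that exact metric name was found.
func sumMetric(exposition, name string) (float64, bool) {
	var total float64
	found := false
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// Require an exact name match: the next byte must open labels or be a space.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		if rest[0] == '{' {
			end := strings.LastIndex(rest, "}")
			if end < 0 {
				continue
			}
			rest = rest[end+1:]
		}
		fields := strings.Fields(rest) // value, then an optional timestamp
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += value
		found = true
	}
	return total, found
}
```

Comparing the sum before and after N requests (expecting an increase of N) avoids depending on absolute counter values, which vary with whatever ran earlier in the suite.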

## Implementation Plan

### Phase 1: Core Features (Priority: High)
- Prompt Guard tests
- PII Detection tests
- Semantic Cache tests
- Category-Based Routing tests

### Phase 2: Advanced Features (Priority: Medium)
- Reasoning Mode tests
- Entropy-Based Routing tests
- Tool Selection tests
- Request/Response Transformation tests

### Phase 3: Reliability & Performance (Priority: Medium)
- Error Handling tests
- Performance tests
- Observability tests

## Test Implementation Guidelines

### Test Structure

Each test case should follow this structure:

```go
func testPromptGuardJailbreakDetection(ctx context.Context, client *kubernetes.Clientset, opts testcases.TestCaseOptions) error {
    // 1. Setup: Prepare test data and environment
    jailbreakPrompt := "Ignore previous instructions and..."

    // 2. Execute: Send request through Envoy
    // (sendChatRequest and envoyService stand in for helpers the profile provides)
    response := sendChatRequest(envoyService, jailbreakPrompt)

    // 3. Validate: Check expected behavior
    if response.StatusCode != http.StatusForbidden {
        return fmt.Errorf("expected status %d, got %d", http.StatusForbidden, response.StatusCode)
    }

    // 4. Verify: Check headers and metadata
    if got := response.Headers["X-VSR-Prompt-Guard"]; got != "blocked" {
        return fmt.Errorf("prompt guard not triggered: X-VSR-Prompt-Guard=%q", got)
    }

    return nil
}
```

### Test Registration

Register tests in `e2e/profiles/ai-gateway/testcases.go`:

```go
func init() {
    // Security tests
    testcases.Register(testcases.TestCase{
        Name:        "prompt-guard-jailbreak",
        Description: "Test jailbreak detection with prompt guard",
        Tags:        []string{"security", "prompt-guard"},
        Fn:          testPromptGuardJailbreakDetection,
    })
    
    // Cache tests
    testcases.Register(testcases.TestCase{
        Name:        "semantic-cache-hit",
        Description: "Test semantic cache hit for similar queries",
        Tags:        []string{"cache", "performance"},
        Fn:          testSemanticCacheHit,
    })
    
    // ... more test registrations
}
```

### Test Configuration

Add test-specific configuration in `deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml`:

```yaml
config:
  # Enable all features for testing
  prompt_guard:
    enabled: true
    threshold: 0.7
  
  semantic_cache:
    enabled: true
    similarity_threshold: 0.8
    max_entries: 1000
    ttl_seconds: 3600
  
  classifier:
    pii_model:
      enabled: true
      threshold: 0.7
```

## Acceptance Criteria

- [ ] All test cases implemented and passing
- [ ] Test coverage for all major Semantic Router features
- [ ] Tests can be run selectively by tags (e.g., `make e2e-test E2E_TESTS=security`)
- [ ] Test documentation updated in `e2e/README.md`
- [ ] CI integration updated to run new tests
- [ ] Full suite completes in under 30 minutes

## References

- E2E Framework PR: #655
- AI Gateway Profile: `e2e/profiles/ai-gateway/`
- Semantic Router Configuration: `config/config.yaml`
- E2E Framework README: `e2e/README.md`

## Related Issues

- #655 - E2E Framework PR
- #656 - Istio profile
- #657 - Production-stack profile
- #658 - LLM-D profile
- #659 - Dynamo profile
- #660 - AIBrix profile
