Skip to content

Conversation

yossiovadia
Copy link
Collaborator

The commit includes the comprehensive jailbreak detection test suite that:

  • Tests 4 different classifier paths (Batch API, Direct Security, ExtProc, Pattern Analysis)
  • Includes cache-busting mechanisms with unique test cases
  • Documents expected behaviors and security gaps
  • Exposes the critical jailbreak detection failures we discovered

at this stage there are multiple misclassifications where jailbreak prompt isnt being detected properly.

Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:

1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
   - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection

2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - Direct jailbreak classification endpoint not implemented
   - Forces users to rely on batch endpoint with broken routing

3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through
   - Validates end-to-end security filtering in LLM completion pipeline
   - Documents security bypass where harmful instructions can be generated

Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency

This test serves as both documentation of current security issues and
validation framework for future jailbreak detection improvements.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:

CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior

VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"

ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)

The test now accurately validates jailbreak detection capabilities using
the correct API interface, rather than testing against wrong parameters.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak
detection across multiple classifier paths:

- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases

Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis

Tests expose critical security vulnerabilities where jailbreak content
bypasses detection and reaches LLM backends, generating harmful responses.

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
@yossiovadia yossiovadia requested a review from rootfs as a code owner October 3, 2025 18:27
Copy link

netlify bot commented Oct 3, 2025

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit d28eb56
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/68e027f226a04a000885db7e
😎 Deploy Preview https://deploy-preview-331--vllm-semantic-router.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link

github-actions bot commented Oct 3, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 e2e-tests

Owners: @yossiovadia
Files changed:

  • e2e-tests/05-jailbreak-test.py

vLLM

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@rootfs rootfs merged commit 0270850 into vllm-project:main Oct 3, 2025
9 checks passed
@yossiovadia yossiovadia deleted the feature/add-jailbreak-detection-test branch October 3, 2025 19:55
Aias00 pushed a commit to Aias00/semantic-router that referenced this pull request Oct 4, 2025
* feat: add comprehensive jailbreak detection test

Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:

1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
   - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection

2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - Direct jailbreak classification endpoint not implemented
   - Forces users to rely on batch endpoint with broken routing

3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through
   - Validates end-to-end security filtering in LLM completion pipeline
   - Documents security bypass where harmful instructions can be generated

Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency

This test serves as both documentation of current security issues and
validation framework for future jailbreak detection improvements.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* fix: correct jailbreak test to use proper API parameters

Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:

CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior

VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"

ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)

The test now accurately validates jailbreak detection capabilities using
the correct API interface, rather than testing against wrong parameters.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* feat: add comprehensive jailbreak detection tests

Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak
detection across multiple classifier paths:

- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases

Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis

Tests expose critical security vulnerabilities where jailbreak content
bypasses detection and reaches LLM backends, generating harmful responses.

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

---------

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Co-authored-by: Huamin Chen <rootfs@users.noreply.github.com>
Signed-off-by: liuhy <liuhongyu@apache.org>
Aias00 pushed a commit to Aias00/semantic-router that referenced this pull request Oct 4, 2025
* feat: add comprehensive jailbreak detection test

Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:

1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
   - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection

2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - Direct jailbreak classification endpoint not implemented
   - Forces users to rely on batch endpoint with broken routing

3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through
   - Validates end-to-end security filtering in LLM completion pipeline
   - Documents security bypass where harmful instructions can be generated

Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency

This test serves as both documentation of current security issues and
validation framework for future jailbreak detection improvements.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* fix: correct jailbreak test to use proper API parameters

Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:

CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior

VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"

ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)

The test now accurately validates jailbreak detection capabilities using
the correct API interface, rather than testing against wrong parameters.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* feat: add comprehensive jailbreak detection tests

Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak
detection across multiple classifier paths:

- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases

Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis

Tests expose critical security vulnerabilities where jailbreak content
bypasses detection and reaches LLM backends, generating harmful responses.

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

---------

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Co-authored-by: Huamin Chen <rootfs@users.noreply.github.com>
Signed-off-by: liuhy <liuhongyu@apache.org>
Aias00 pushed a commit to Aias00/semantic-router that referenced this pull request Oct 4, 2025
* feat: add comprehensive jailbreak detection test

Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:

1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
   - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection

2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - Direct jailbreak classification endpoint not implemented
   - Forces users to rely on batch endpoint with broken routing

3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through
   - Validates end-to-end security filtering in LLM completion pipeline
   - Documents security bypass where harmful instructions can be generated

Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency

This test serves as both documentation of current security issues and
validation framework for future jailbreak detection improvements.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* fix: correct jailbreak test to use proper API parameters

Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:

CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior

VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"

ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)

The test now accurately validates jailbreak detection capabilities using
the correct API interface, rather than testing against wrong parameters.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* feat: add comprehensive jailbreak detection tests

Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak
detection across multiple classifier paths:

- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases

Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis

Tests expose critical security vulnerabilities where jailbreak content
bypasses detection and reaches LLM backends, generating harmful responses.

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

---------

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Co-authored-by: Huamin Chen <rootfs@users.noreply.github.com>
Signed-off-by: liuhy <liuhongyu@apache.org>
Aias00 pushed a commit to Aias00/semantic-router that referenced this pull request Oct 4, 2025
* feat: add comprehensive jailbreak detection test

Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:

1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
   - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection

2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - Direct jailbreak classification endpoint not implemented
   - Forces users to rely on batch endpoint with broken routing

3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through
   - Validates end-to-end security filtering in LLM completion pipeline
   - Documents security bypass where harmful instructions can be generated

Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency

This test serves as both documentation of current security issues and
validation framework for future jailbreak detection improvements.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* fix: correct jailbreak test to use proper API parameters

Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:

CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior

VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"

ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)

The test now accurately validates jailbreak detection capabilities using
the correct API interface, rather than testing against wrong parameters.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

* feat: add comprehensive jailbreak detection tests

Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak
detection across multiple classifier paths:

- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases

Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis

Tests expose critical security vulnerabilities where jailbreak content
bypasses detection and reaches LLM backends, generating harmful responses.

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>

---------

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Co-authored-by: Huamin Chen <rootfs@users.noreply.github.com>
Signed-off-by: liuhy <liuhongyu@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants