Feature/add jailbreak detection test #331
Merged: rootfs merged 4 commits into vllm-project:main from yossiovadia:feature/add-jailbreak-detection-test on Oct 3, 2025
Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:

1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
   - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection
2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - Direct jailbreak classification endpoint not implemented
   - Forces users to rely on batch endpoint with broken routing
3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through
   - Validates end-to-end security filtering in LLM completion pipeline
   - Documents security bypass where harmful instructions can be generated

Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency

This test serves as both documentation of current security issues and validation framework for future jailbreak detection improvements.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
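The missing-endpoint check described above can be sketched without a running router. This is a minimal illustration, not the code in 05-jailbreak-test.py: `StubHandler`, `probe_status`, and `run_stub_probe` are hypothetical names, and the stub server simply answers every POST with 404 to stand in for the unimplemented /api/v1/classify/jailbreak route.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the classification API: every POST gets a 404."""

    def do_POST(self):
        # Drain the request body so the connection closes cleanly.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        # Keep test output quiet.
        pass


def probe_status(url: str, body: bytes = b"{}") -> int:
    """POST `body` to `url` and return the HTTP status code, including error codes."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def run_stub_probe() -> int:
    """Start the stub on a free local port, probe the jailbreak route, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return probe_status(f"http://127.0.0.1:{port}/api/v1/classify/jailbreak")
    finally:
        server.shutdown()
        server.server_close()
```

Against a real deployment, `probe_status` would be pointed at the router's own host and port, which are not given in this PR.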
Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:

CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior

VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"

ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)

The test now accurately validates jailbreak detection capabilities using the correct API interface, rather than testing against wrong parameters.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
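A rough sketch of the corrected call and the "not safe" check might look like the following. Only the path /api/v1/classify/batch, the task_type value "security", and the "safe" label come from the commit message; the `texts` request field, the `category` response field, the base URL, and the helper names are assumptions made for illustration.

```python
import json
import urllib.request

SAFE_LABEL = "safe"  # label named in the commit message


def build_security_request(base_url: str, texts) -> urllib.request.Request:
    """Build (but do not send) a batch classification request with task_type="security".

    The "texts" field name is an assumption about the request schema.
    """
    payload = {"texts": list(texts), "task_type": "security"}
    return urllib.request.Request(
        f"{base_url}/api/v1/classify/batch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def flagged_as_threat(result: dict) -> bool:
    """Return True when a result carries a category other than "safe".

    The "category" field name is an assumption about the response schema.
    A result with no category is treated as not flagged.
    """
    category = str(result.get("category", "")).lower()
    return category != "" and category != SAFE_LABEL
```

Checking for "anything other than safe" rather than an exact "jailbreak" label follows the commit's stated expectation that dangerous content is NOT classified as "safe".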
Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak detection across multiple classifier paths:
- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases

Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis

Tests expose critical security vulnerabilities where jailbreak content bypasses detection and reaches LLM backends, generating harmful responses.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
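One way to get "unique test cases per run" is to tag each prompt with a per-run identifier so that a cached response for an earlier, identical prompt cannot be served. The sketch below is an assumption about how that could be done, not the approach taken in 05-jailbreak-test.py; `make_unique_cases` and the suffix format are hypothetical.

```python
import time
import uuid


def make_unique_cases(prompts):
    """Return (tagged_prompts, run_id) where each prompt carries a per-run suffix.

    The run id combines a timestamp with random hex so two runs in the same
    second still differ.
    """
    run_id = f"{int(time.time())}-{uuid.uuid4().hex[:8]}"
    tagged = [f"{prompt} [test-run {run_id}]" for prompt in prompts]
    return tagged, run_id
```

A suffix changes the classifier input slightly, so any comparison across runs should keep the suffix format fixed.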
Aias00 pushed a commit to Aias00/semantic-router that referenced this pull request on Oct 4, 2025:

* feat: add comprehensive jailbreak detection test
* fix: correct jailbreak test to use proper API parameters
* feat: add comprehensive jailbreak detection tests

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Co-authored-by: Huamin Chen <rootfs@users.noreply.github.com>
Signed-off-by: liuhy <liuhongyu@apache.org>
The commit includes the comprehensive jailbreak detection test suite that:
At this stage, there are multiple misclassifications where jailbreak prompts aren't being detected properly.