Feature/add jailbreak detection test #331

yossiovadia · 2025-10-03T18:27:52Z

This commit adds a comprehensive jailbreak detection test suite that:

- Tests 4 different classifier paths (Batch API, Direct Security, ExtProc, Pattern Analysis)
- Includes cache-busting mechanisms with unique test cases
- Documents expected behaviors and security gaps
- Exposes the critical jailbreak detection failures we discovered
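The cache-busting idea mentioned above could be sketched roughly as follows. This is an illustration only, not the code from 05-jailbreak-test.py: the helper name `unique_prompt` and the suffix format are assumptions made here to show how a per-run unique test case might be produced.

```python
import time
import uuid


def unique_prompt(base: str) -> str:
    """Return `base` with a per-call suffix so repeated runs do not hit a cache.

    Hypothetical helper: the suffix format is an assumption, not the PR's.
    """
    # Combine a coarse timestamp with a random token; the random token alone
    # already makes collisions between calls extremely unlikely.
    suffix = f"{int(time.time())}-{uuid.uuid4().hex[:8]}"
    return f"{base} [test-id: {suffix}]"


first = unique_prompt("What is the capital of France?")
second = unique_prompt("What is the capital of France?")
```

Two calls with the same base text yield different strings, so a semantic or exact-match cache keyed on the prompt is less likely to return a stale classification.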

At this stage, there are multiple misclassifications where jailbreak prompts are not detected properly.

Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:

1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
   - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection
2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - Direct jailbreak classification endpoint not implemented
   - Forces users to rely on batch endpoint with broken routing
3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through
   - Validates end-to-end security filtering in LLM completion pipeline
   - Documents security bypass where harmful instructions can be generated

Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency

This test serves as both documentation of current security issues and validation framework for future jailbreak detection improvements.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
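The first two failure modes described in this commit message (a 404 from the direct endpoint, and a non-security label such as "psychology" coming back for jailbreak content) could be told apart with a small check like the one below. It is a sketch under assumptions: the function name `diagnose`, the returned strings, and the label set `SECURITY_LABELS` are hypothetical and are not taken from the test file.

```python
from typing import Optional

# Assumed set of labels a security classifier would return; the PR text
# only mentions "jailbreak" and "safe".
SECURITY_LABELS = {"jailbreak", "safe"}


def diagnose(status_code: int, category: Optional[str]) -> str:
    """Classify one observed response into a coarse failure mode."""
    if status_code == 404:
        # Matches the behavior reported for POST /api/v1/classify/jailbreak.
        return "missing-endpoint"
    if category not in SECURITY_LABELS:
        # A topic label such as "psychology" suggests the request reached
        # the category classifier instead of the security classifier.
        return "routing-bug"
    return "ok"
```

Keeping the diagnosis separate from the HTTP call makes the failure messages easy to unit test without a running router.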

Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:

CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior

VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"

ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)

The test now accurately validates jailbreak detection capabilities using the correct API interface, rather than testing against wrong parameters.

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
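As a rough illustration of the corrected usage, the request body and the "not classified as safe" check might look like this. Only the endpoint path and `task_type="security"` come from the commit message; the `texts` field and the `category` key in the result are assumptions about the payload shape, and both helper names are hypothetical.

```python
from typing import Any, Dict, Iterable

BATCH_PATH = "/api/v1/classify/batch"  # endpoint named in the commit message


def build_security_request(texts: Iterable[str]) -> Dict[str, Any]:
    """Build a batch request body that targets the security classifier.

    The "texts" field name is an assumption about the request schema.
    """
    return {"texts": list(texts), "task_type": "security"}


def is_flagged(result: Dict[str, Any]) -> bool:
    """Return True when a result carries a label other than "safe".

    The "category" key is an assumed response field. A missing label is
    treated as not flagged, so a malformed response cannot pass the check.
    """
    category = result.get("category")
    return category is not None and category != "safe"
```

Checking for "anything but safe" rather than for the exact string "jailbreak" mirrors the updated expectation in the commit: the test cares that dangerous content is not waved through, not which threat label is used.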

Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak detection across multiple classifier paths:
- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases

Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis

Tests expose critical security vulnerabilities where jailbreak content bypasses detection and reaches LLM backends, generating harmful responses.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
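The pattern analysis across multiple test cases could be summarized along the following lines. This is a minimal sketch, not the analysis in the PR: the `Case` dataclass, the `summarize` function, and the metric names are hypothetical, and the sample labels are made up for illustration rather than taken from any test run.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Case:
    text: str
    expected_threat: bool   # True for a jailbreak attempt, False for safe content
    predicted_label: str    # label returned by the classifier under test


def summarize(cases: List[Case]) -> Dict[str, float]:
    """Compute detection and false-positive rates over a set of cases."""
    threats = [c for c in cases if c.expected_threat]
    benign = [c for c in cases if not c.expected_threat]
    detected = sum(1 for c in threats if c.predicted_label != "safe")
    false_pos = sum(1 for c in benign if c.predicted_label != "safe")
    return {
        "detection_rate": detected / len(threats) if threats else 0.0,
        "false_positive_rate": false_pos / len(benign) if benign else 0.0,
    }


# Made-up labels for illustration only; these are not results from the PR.
sample = [
    Case("jailbreak attempt A", True, "jailbreak"),
    Case("jailbreak attempt B", True, "safe"),
    Case("benign question A", False, "safe"),
    Case("benign question B", False, "safe"),
]
summary = summarize(sample)
```

Reporting both rates side by side shows whether a classifier path misses threats, over-blocks safe content, or both.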

netlify · 2025-10-03T18:27:57Z

✅ Deploy Preview for vllm-semantic-router ready!

Name	Link
🔨 Latest commit	`d28eb56`
🔍 Latest deploy log	https://app.netlify.com/projects/vllm-semantic-router/deploys/68e027f226a04a000885db7e
😎 Deploy Preview	https://deploy-preview-331--vllm-semantic-router.netlify.app


github-actions · 2025-10-03T19:46:05Z

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 `e2e-tests`

Owners: @yossiovadia
Files changed:

e2e-tests/05-jailbreak-test.py

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

* feat: add comprehensive jailbreak detection test
* fix: correct jailbreak test to use proper API parameters
* feat: add comprehensive jailbreak detection tests

---------

Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Co-authored-by: Huamin Chen <rootfs@users.noreply.github.com>
Signed-off-by: liuhy <liuhongyu@apache.org>

yossiovadia added 3 commits October 2, 2025 12:34

yossiovadia requested a review from rootfs as a code owner October 3, 2025 18:27

Merge branch 'main' into feature/add-jailbreak-detection-test

d28eb56

github-actions bot assigned yossiovadia Oct 3, 2025

rootfs merged commit 0270850 into vllm-project:main Oct 3, 2025
9 checks passed

yossiovadia deleted the feature/add-jailbreak-detection-test branch October 3, 2025 19:55
