Summary
Following the request from @Xunzhuo in PR #718, this issue tracks the migration of Jailbreak Detection to support LoRA auto-detection.
Background
Currently, PII detection has LoRA auto-detection (PR #709), and Intent Classification will have it (#724), but Jailbreak Detection does not. This creates an inconsistency where some classification features can leverage LoRA models while others cannot.
Note: This work depends on #724 (Intent Classification LoRA) being merged first, as it will establish the pattern to follow.
Current Behavior (BEFORE)
Problem: Jailbreak detection cannot automatically use LoRA models.
How it works:
- Configuration has a use_modernbert flag that determines which initializer to use
- System makes a hardcoded choice between two paths:
  - use_modernbert: false → Uses LinearJailbreakInitializer (Traditional BERT only)
  - use_modernbert: true → Uses ModernBertJailbreakInitializer (ModernBERT only)
- Neither path can detect or use LoRA models automatically
- Even if you point model_id to a LoRA jailbreak model, it will fail
Current config:
prompt_guard:
  use_modernbert: true
  model_id: "models/jailbreak_classifier_modernbert-base_model"
  # Cannot use: models/lora_jailbreak_classifier_bert-base-uncased_model
Expected Behavior (AFTER)
Solution: Jailbreak detection should auto-detect LoRA models (just like PII and Intent).
How it should work:
- Single auto-detecting initializer that intelligently routes based on model type
- Detection happens automatically by checking:
  - LoRA weights in model.safetensors file
  - Presence of lora_config.json
- Smart fallback chain: LoRA → Traditional BERT → ModernBERT
- The use_modernbert config flag becomes optional/ignored (backward compatible)
- Zero configuration needed - just point to model path and system figures it out
Example (can now use LoRA models):
prompt_guard:
  model_id: "models/lora_jailbreak_classifier_bert-base-uncased_model"
  use_modernbert: false # Ignored - auto-detection finds LoRA and uses it
Implementation Notes
- Depends on: Issue #724 (Enable LoRA auto-detection for Intent/Category Classification) being merged first (provides the pattern)
- Follow the same pattern as Intent Classification and PII detection
- Update both Go layer (classifier.go) and Rust layer (init.rs, classify.rs)
- LoRA jailbreak models already exist in models/ directory:
  - lora_jailbreak_classifier_bert-base-uncased_model
  - lora_jailbreak_classifier_modernbert-base_model
  - lora_jailbreak_classifier_roberta-base_model
- Add auto-detection test similar to Intent Classification
- Check if Rust FFI functions need to be added (like InitCandleBertJailbreakClassifier)
Implementation Approach
Go Layer Changes (classifier.go)
- Replace LinearJailbreakInitializer and ModernBertJailbreakInitializer with JailbreakInitializerImpl
- Replace LinearJailbreakInference and ModernBertJailbreakInference with JailbreakInferenceImpl
- Remove useModernBERT parameter from factory functions
- Update call sites (similar to Intent Classification changes)
Rust Layer Changes
- Add LORA_JAILBREAK_CLASSIFIER static variable to init.rs
- Update init_candle_bert_jailbreak_classifier (or create it) with intelligent routing
- Update classify_candle_bert_jailbreak_text to try LoRA first, then fallback
- May need to add helper method to existing LoRA jailbreak classifier
Related Work
- PR #709 (fix(647): enable LoRA PII auto-detection with minimal changes): PII LoRA auto-detection (original pattern)
- Issue #724 (Enable LoRA auto-detection for Intent/Category Classification): Intent Classification LoRA auto-detection (must merge first)
- PR #718 (fix(api): expose actual PII confidence scores instead of hardcoded 0.9): Fixed PII API confidence scores, comment requesting this migration
Success Criteria
- LoRA jailbreak models are automatically detected and used
- Traditional BERT and ModernBERT fallback paths still work
- No configuration changes required for users
- All existing tests continue to pass
- New tests demonstrate LoRA auto-detection working
- Consistent pattern across all three classification types (PII, Intent, Jailbreak)